Health Care Science. 2024 Feb 1;3(1):3–18. doi: 10.1002/hcs2.79

A systematic evaluation of the performance of GPT‐4 and PaLM2 to diagnose comorbidities in MIMIC‐IV patients

Peter Sarvari 1, Zaid Al‐fagih 1, Abdullatif Ghuwel 2, Othman Al‐fagih 2
PMCID: PMC11080831  PMID: 38939167

Abstract

Background

Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT‐4 and PaLM2. Small‐scale studies to evaluate the diagnostic ability of LLMs have shown promising results, with GPT‐4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.

Methods

To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology and vital sign information as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we will refer to as ground truth diagnoses. We then designed carefully‐written prompts to get patient diagnostic predictions from the LLMs and compared this to the ground truth diagnoses in a random sample of 1000 patients.

Results

Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT‐4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1000 randomly selected EHRs, GPT‐4 correctly identified 1116 unique diagnoses.

Conclusion

The results suggest that artificial intelligence (AI), working alongside clinicians, has the potential to reduce the cognitive errors that lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, incorporating AI into health care still faces significant challenges, including ethical, liability and regulatory barriers.

Keywords: GPT‐4, diagnostic errors, MIMIC‐IV, clinical decision‐making, language model, patient care, health care efficiency


Our objective was to evaluate the diagnostic accuracy of GPT‐4 and PaLM2 by assessing their ability to correctly identify diagnoses from real‐life Electronic Health Record data. The analysis focused on 1000 randomly selected patient reports generated from the multi‐modal time series patient admission data in Medical Information Mart for Intensive Care IV.

GPT‐4 and PaLM2 were prompted to suggest all possible diagnoses based on the generated patient report. The ground truth diagnoses were established using the diagnostic billing codes of the same patient. GPT‐4 then evaluated the GPT‐4/PaLM2 predictions against the ground truth using a defined prompt. Predictions were marked as hits if deemed directly related to ground truth diagnoses. Crucially, doctors then manually reviewed this evaluation to ensure accuracy. GPT‐4 achieved 93.9% and PaLM2 achieved 84.7% diagnostic sensitivity.



Abbreviations

EHR

Electronic Health Record

EPR

electronic patient record

ICD

International Classification of Diseases

LLM

Large Language Model

MIMIC

Medical Information Mart for Intensive Care

1. INTRODUCTION

Technological advancements have brought major improvements in the delivery of clinical care. Despite these strides, significant challenges persist, one of the most pressing being diagnostic errors. Studies estimate that diagnostic errors are responsible for roughly 10% of patient deaths and a similar proportion of patient complications and hospital readmissions [1, 2, 3]. These errors lead to adverse patient outcomes and generate considerable economic costs. A 2010 estimate suggests that diagnostic errors account for over $19 billion in malpractice claims each year in the United States [4, 5], a figure likely to have increased substantially over the past 13 years. Additionally, around $345 billion is wasted annually on uncoordinated, low‐value, or entirely unnecessary care [6], much of which originates from diagnostic errors. These numbers underscore the need for more effective strategies to mitigate the effect of diagnostic errors on patient outcomes and economies.

The rapid progress in artificial intelligence (AI) and Large Language Models (LLMs) provides an exciting avenue for progress in this field. AI has the potential to transform medical diagnosis by enhancing accuracy, speeding up the diagnostic process, and reducing human and system errors arising from poor communication, information overload, and, more broadly, cognitive errors, which alone account for 75% of diagnostic error cases [7]. Graber et al. [8] find the most common cognitive error to be premature closure, i.e., the failure to consider reasonable alternatives after an initial diagnosis. LLMs are especially promising in this context because they can leverage vast amounts of data to make accurate predictions or generate plausible hypotheses.

The ongoing digitalization of clinical notes and their storage in electronic patient records (EPRs), occurring alongside the development of advanced LLMs, places health care systems at a critical juncture. Clinical information is now not only stored electronically but is largely interpretable by LLMs, opening a world of possibilities for LLM‐based clinical applications.

Among the various LLM‐based tools, GPT‐4, an advanced language model developed by OpenAI [9], has shown considerable promise for medical applications. GPT‐4 has been shown to be capable of solving medical challenge problems [10] and making highly accurate differential diagnoses: previous research diagnosed 48 patients with GPT‐3 [11], 50 patients with GPT‐4 [12] and 70 challenging cases with GPT‐4 [13].

However, these studies have only evaluated GPT‐4's ability to diagnose on small numbers of real patients or on synthetic cases. While these initial results are promising, a more comprehensive evaluation with a larger sample size is needed to substantiate these findings. The limited scope of previous research leaves a gap in our understanding of GPT‐4's performance and applicability in real‐world health care settings, particularly in high‐stress environments such as intensive care units.

To address this gap, the present study aims to conduct a large‐scale evaluation of GPT‐4's diagnostic performance using the Medical Information Mart for Intensive Care IV (MIMIC‐IV) data set. MIMIC‐IV is a publicly available deidentified data set containing comprehensive medical information from critical care units [14, 15, 16]. This data set offers a unique opportunity to test GPT‐4's ability to diagnose various conditions across a broad spectrum of patients in a critical care setting. Importantly, our study includes 1000 patients, far exceeding the scope of previous evaluations.

In conducting this evaluation, it is essential to note that we do not employ a fine‐tuned version of GPT‐4. Instead, we focus on evaluating its inherent capabilities for diagnosing medical conditions. This approach enables us to assess the base model's strengths and limitations, providing a foundation for future research and development efforts.

Through this comprehensive evaluation, we aim to contribute significantly to the ongoing discourse on AI's role in health care, especially the potential of LLMs to enhance diagnostic accuracy. Additionally, we hope to shed light on the prospects of such models in improving patient care and boosting health care efficiency, which are key concerns in modern health care systems worldwide. As we move forward in the era of digital health care, these insights will be crucial in guiding the development and implementation of AI tools in clinical practice.

The subsequent sections of this paper will detail our methodology, present our findings, and discuss their implications for the future of AI in health care. We will delve into the potentials, limitations, and considerations that surround the deployment of advanced AI tools like GPT‐4 in a health care setting. Ultimately, this study aims to offer a comprehensive assessment of GPT‐4's potential as a diagnostic tool, which we hope will pave the way for further research in this area and ultimately contribute to enhancing health care delivery and patient outcomes.

2. METHODOLOGY

In this section, we present how we transformed the MIMIC‐IV data and fed it to LLMs to obtain alternative patient diagnoses and subsequently, how we evaluated the accuracy of such diagnoses.

2.1. Data

All the data used in this study is included in MIMIC‐IV, a database disseminated on PhysioNet, sourced from the Electronic Health Records (EHRs) of the Beth Israel Deaconess Medical Center. SQL code on Google's BigQuery compute engine was used to create the text input to GPT‐4 for each MIMIC‐IV patient by merging the relevant tables from the mimiciv_hosp, mimiciv_icu and mimiciv_derived datasets into one large dataframe. The resulting dataframe describes in text the patient's imaging reports, microbiology lab test results, intracranial pressure measurements, raw blood gas data and aggregate statistics (minimum, average and maximum) on the patient's vital signs, blood differentials, blood counts, blood cardiac markers, blood chemistry, blood coagulation markers, C‐reactive protein and blood enzyme levels. Whenever raw data points were utilized (rather than the aggregate statistics), the creation time of the data point was explicitly stated in hours relative to hospital admission. This ensured that the disease progression of the patient was clear to the LLM analyzing the data.
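To illustrate the report format, the following is a minimal sketch, written by us and not taken from the authors' SQL pipeline, of how per-patient aggregates and time-stamped raw data points might be flattened into the textual report fed to the LLM. All field names and values here are toy stand-ins.

```python
# Hypothetical sketch of the report assembly described above; field names
# and values are illustrative, not the study's actual BigQuery output.

def format_blood_section(stats: dict) -> str:
    """Render (min, avg, max) lab aggregates as a comma-separated text line."""
    parts = []
    for lab, (lo, avg, hi) in stats.items():
        parts += [f"min {lab}: {lo}", f"avg {lab}: {avg}", f"max {lab}: {hi}"]
    return "Blood report:\n" + ", ".join(parts)

def format_event(hours_after_admission: float, text: str) -> str:
    """Raw (non-aggregated) data points carry their time relative to admission."""
    return f"{hours_after_admission:g} h after admission {text}"

report = "\n\n".join([
    format_blood_section({"glucose": (103, 156.5, 278), "wbc": (13.1, 13.8, 14.1)}),
    format_event(3, "the blood gas results from venous blood are: ph: 7.2"),
])
```

Stating each raw data point's time relative to admission, as in `format_event`, is what lets the LLM reconstruct the temporal progression from plain text.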

2.2. Diagnostic predictions

The following prompt was applied to generate GPT‐4's suggested diagnoses based on the patient input data.

Suggest as many potential diagnoses as possible from the following patient data.

In addition, include previously diagnosed conditions and information about patient's medical history (if any).

Give exact numbers and/or text quotes from the data that made you think of each of the diagnoses and, if necessary, give further tests that could confirm the diagnosis.

Once you're done, suggest further, more complex diseases that may be ongoing based on the existing diagnoses you already made.

Use the International Classification of Disease (ICD) standard for reporting the diagnoses.

Before finalizing your answer check if you haven't missed any abnormal data points and hence any diagnoses that could be made based on them. If you did, add them to your list of diagnoses

For example, if the patient data mentions:

“Blood report:

min glucose: 103, max glucose: 278, avg glucose: 156.5, max inr: 2.1, max pt: 22.4, max ptt: 150, avg wbc: 13.8, max wbc: 14.1, max lactate: 5.9, max bun: 101, max creatinine: 5.8, avg bun: 38.15, avg creatinine: 2.78

Blood gas report:

3 h after admission the blood gas results from venous blood are: ph: 7.2

Imaging report:

Status post left total shoulder replacement

Chest X‐Ray Possible small right pleural effusion and Mild, bibasilar atelectasis… Lung volumes have slightly increased but areas of atelectasis are seen at both the left and the right lung bases

Microbiology tests:

24 h after admission the microbiology culture test MRSA SCREEN obtained via MRSA SCREEN identified POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS

Vitalsigns data from ICU:

max temperature: 38, min peripheral oxygen saturation: 70, max respiration rate: 29”

then your answer may be:

1: Methicillin resistant Staphylococcus aureus infection, site unspecified

Foundational data: Microbiology culture test identifying “POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS”

2: Atelectasis

Foundational data from Chest X‐Ray: “Mild, bibasilar atelectasis… Lung volumes have slightly increased but areas of atelectasis are seen at both the left and the right lung bases”

3: Pleural effusion, not elsewhere classified

Foundational data from Chest X‐Ray: “Possible small right pleural effusion.”

Further tests: Thoracentesis, CT chest

4: Acidosis

Foundational data: “ph: 7.2”

Further tests: Urine pH, Anion Gap

5: Lactic acidosis

Foundational data: “max lactate: 5.9”

6: Acquired coagulation factor deficiency

Foundational data: “max inr: 2.1, max pt: 22.4, max ptt: 150”

Further tests: Antiphospholipid Antibodies (APL), Protein C, Protein S, Antithrombin III, Factor V Leiden, Fibrinogen test

7: Hyperglycemia, unspecified

Foundational data: “max glucose: 278, avg glucose: 156.5.”

Further tests: Hemoglobin A1c (HbA1c) test

8: Hypoxemia

Foundational data: “min peripheral oxygen saturation: 70”

Further tests: Measure PaO2 in blood

9: Leukocytosis

Foundational data: “max wbc: 14.1, avg wbc: 13.8.” The patient's white blood cell count is consistently elevated which may suggest an ongoing inflammatory response or infection.

Further tests: Infection markers such as CRP or PCT, Assessment of symptoms like fever, inflammation or fatigue.

10. Unspecified acute kidney failure:

Foundational data: “max bun: 101, max creatinine: 5.8, avg bun: 38.15, avg creatinine: 2.78”

Further tests: Urine output measurements for oliguria, ultrasound to rule out obstruction

11. Presence of left artificial shoulder joint

Foundational data: The imaging report mentions: “Status post left total shoulder replacement”

Further diseases based on these diagnoses (continued the indexing from the previous number in the list):

12: Unspecified septicemia

Foundational data: positive MRSA screen, systemic inflammatory response: “max respiration rate: 29,” “max temperature: 38,” leukocytosis

Further tests: HR, BP, wound culture, respiratory excretion tests

13: Septic shock

Foundational data: Septicemia with acidosis and lactic acidosis may suggest septic shock

Further tests: patient examination (low BP, mental disorientation, nausea, pale skin may confirm the finding)

14: Acute respiratory failure, with hypoxia or hypercapnia

Foundational data: hypoxemia and the presence of atelectasis

Further tests: Clinical symptoms (severe shortness of breath, rapid breathing, and confusion), arterial blood gas measurements showing hypoxia or hypercapnia

15: Type 2 diabetes mellitus with diabetic chronic kidney disease

Foundational data: Hyperglycemia and kidney failure

Further tests: urine test, hemoglobin (A1C) test, GFR, BP, physical examination (swelling, nausea, weakness, and eye disease)

Patient data:

<Patient data>
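For concreteness, here is a minimal sketch, our assumption rather than the authors' code, of how the diagnostic prompt above could be packaged into a chat-completion request. `DIAGNOSTIC_PROMPT` abbreviates the full prompt text quoted in this section, and the helper name is our own; the actual API call is omitted.

```python
# Hypothetical request assembly; DIAGNOSTIC_PROMPT is an abbreviation of the
# full Section 2.2 prompt, with "..." standing in for the omitted middle.

DIAGNOSTIC_PROMPT = (
    "Suggest as many potential diagnoses as possible from the following "
    "patient data.\n...\nPatient data:\n{patient_data}"
)

def build_diagnosis_request(patient_report: str, model: str = "gpt-4") -> dict:
    """Assemble a chat request body for one patient's textual report."""
    return {
        "model": model,
        # Temperature zero is what the paper reports for the evaluation stage;
        # applying it here too is our assumption.
        "temperature": 0,
        "messages": [
            {"role": "user",
             "content": DIAGNOSTIC_PROMPT.format(patient_data=patient_report)},
        ],
    }
```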

2.3. Ground truth diagnoses

Clinicians together with medical coders reported the diagnoses for each patient in MIMIC‐IV in the ICD‐9/ICD‐10 (International Classification of Diseases) format, which were used as the true (reference) diagnoses. Note that the clinicians who prepare the diagnostic reports from which the ICD codes are extracted have access to a broader range of data, such as ECG results as well as the physical examination and the full medical history of the patient, none of which are currently available in MIMIC‐IV. This creates an information disparity between the clinician diagnosing patients based on a complete EHR and the LLM diagnosing patients based on an incomplete EHR.

We further ask the reader to bear in mind that our study is inherently different from the simultaneous ongoing effort of ICD code extraction from clinical notes, such as the one described in [17]. Those clinical notes provide a complete picture of the patient after the doctor has noted the diagnosis and prepared the subjective objective assessment plan. The MIMIC‐IV data set doesn't contain such notes, but even if it did, we wouldn't use them for our study, as we aim to predict patient diagnoses rather than extract ICD codes for diagnoses already made by the doctor. As outlined in Section 2.1, the only notes we use are the imaging notes prepared by the radiologist for the doctor to review. We acknowledge that some diagnoses may be explicitly mentioned in these notes; however, since radiologists prepare them specifically to inform the doctor, we must include them to help suggest a comprehensive patient diagnostic picture.

2.4. Automated evaluation and manual review

GPT‐4 is employed once again to determine the proximity of the predicted diagnosis (made by GPT‐4) to the ground truth diagnosis, taking the information disparity mentioned in Section 2.3 into account: if there is a true diagnosis that cannot be reasonably inferred from the data GPT‐4 has access to, the model is not penalized for missing it; we only expect it to predict whatever is reasonable to predict given the limited data.

The following prompt is utilized to evaluate the diagnostic predictions (we will refer to this as automated evaluation):

Below are the actual diagnoses of the same patient reported by clinicians.

Go through the actual diagnoses and cross‐check each actual diagnosis with the initial list of diagnoses you provided answer the following two questions:

Question 1: Is this actual diagnosis a new disease, not directly related to any of the diagnoses you suggested in your initial list? If an actual diagnosis is a complication of, a more specific version of, or falls under a broader category of a diagnosis you initially listed, it should not be considered a new disease. If an actual diagnosis affects the same organ as a diagnosis you initially listed, but it has a different onset and progression (e.g., the actual diagnosis is chronic but you initially listed the acute disease), then your answer should be “No.” If an actual diagnosis is caused by the same pathogen as a diagnosis in your initial list, the answer should also be “No.” If an actual diagnosis is not a medical diagnosis, your answer should be “No.”

If your answer to Question 1 was “No,” put N/A as answer for Question 2 and skip to the Example below.

Question 2: Would it be possible to directly infer this actual diagnosis from the patient data provided in the initial query?

If yes, support with facts: quote exact numbers or text from the initial query.

If no, in case the data contradicts the diagnosis, quote the data and say why it does not support the diagnosis. Otherwise, please specify what additional data would have been helpful to establish this diagnosis.

Example:

If the patient data is:

“Blood report: min potassium: 3.1, avg hemoglobin: 14.5, max sodium: 139, avg wbc: 13.9

Blood gas report: ph: 7.2

Imaging report: patient with polysubstance abuse, lungs look normal”

and your initial list in your previous response contained the following suggested diagnoses:

1: Acidosis

Foundational data: “ph: 7.2”

Further tests: Urine pH, Anion Gap

2: Polysubstance abuse, not elsewhere classified

Foundational data: The imaging report mentions “patient with polysubstance abuse”

3: Leukocytosis

Foundational data: “avg wbc: 13.9.”

Further tests: Infection markers such as CRP or PCT, Assessment of symptoms like fever, inflammation or fatigue.

and actual diagnoses are:

D1: Poisoning by cocaine

D2: Hypokalemia

D3: Hypernatremia

D4: Severe sepsis

Then your answer should be:

D1: Poisoning by cocaine

Question 1: No, this is similar to diagnosis 2: Polysubstance abuse, not elsewhere classified.

Question 2: N/A

D2: Hypokalemia

Question 1: Yes

Question 2: Yes, the blood report mentions “min potassium: 3.1”

D3: Hypernatremia

Question 1: Yes

Question 2: No, the blood report mentions “max sodium: 139,” but only sodium levels above 145 mmol/L indicate hypernatremia, hence the data does not support hypernatremia.

D4: Severe sepsis

Question 1: Yes

Question 2: No, additional data such as fever, increased heart rate, increased respiratory rate, positive blood cultures, or evidence of organ dysfunction would have been helpful to establish this diagnosis.

Before finalizing your answer check if you haven't missed noticing any diagnoses from your initial list that are related to any of the actual diagnoses you answered the two questions for! If you did, modify the answers to the questions accordingly!

Actual diagnoses:

<Actual diagnoses>

Our primary purpose for using GPT‐4 not only to predict but also to evaluate the prediction is to enable large‐scale assessment. If GPT‐4 determines that a ground truth condition is “directly related” to a predicted diagnosis, we mark the ground truth condition as a “hit” (meaning that the corresponding prediction is a true positive). To be precise, we define “direct relatedness” as a negative (“No”) answer to Question 1 in the evaluation prompt above (with the caveat that the condition under assessment is a real medical diagnosis). If GPT‐4 determines that a true diagnosis cannot be reasonably inferred from the data or that a condition is not a real medical diagnosis (e.g., Do Not Resuscitate), we exclude that diagnosis. The number of hits divided by the number of ground truth diagnoses minus the number of excluded diagnoses defines the Patient Diagnostic Hit Rate of GPT‐4 in our study.

Patient Diagnostic Hit Rate = #hits / (#true diagnoses − #excluded diagnoses) = #hits / #inferable diagnoses. (1)

To establish the Average Hit Rate of GPT‐4 on a patient cohort, we may take the average of the Patient Diagnostic Hit Rate. Alternatively, we may calculate the Aggregate Hit Rate of GPT‐4 on a patient cohort using the following formula:

Aggregate Hit Rate = (Σ over patients in cohort of #hits) / (Σ over patients in cohort of #inferable diagnoses). (2)
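Spelled out in code, with toy counts rather than study data, the two cohort-level metrics can be sketched as follows; the per-patient rate follows Equation (1) and the cohort aggregate follows Equation (2).

```python
# Worked sketch of Equations (1) and (2); all counts below are toy values.

def patient_hit_rate(hits: int, true_dx: int, excluded: int) -> float:
    """Equation (1): hits over inferable (true minus excluded) diagnoses."""
    return hits / (true_dx - excluded)

def aggregate_hit_rate(cohort: list[tuple[int, int, int]]) -> float:
    """Equation (2): total hits over total inferable diagnoses."""
    total_hits = sum(h for h, _, _ in cohort)
    total_inferable = sum(t - e for _, t, e in cohort)
    return total_hits / total_inferable

cohort = [(4, 10, 5), (9, 12, 2)]  # (hits, true diagnoses, excluded) per patient
avg_rate = sum(patient_hit_rate(*p) for p in cohort) / len(cohort)  # Average Hit Rate
agg_rate = aggregate_hit_rate(cohort)                               # Aggregate Hit Rate
```

Note that the two metrics differ: the Average Hit Rate weights every patient equally, while the Aggregate Hit Rate weights patients by their number of inferable diagnoses.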

Three medical doctors (all authors of this paper, with 27 years of medical experience among them) set out to review the automated evaluation and find diagnoses that were incorrectly excluded by GPT‐4 or incorrectly classified as “hits.” This is because both incorrect exclusion and incorrect “hit” classification would inflate the estimated diagnostic hit rate. By asking the doctors to focus on this specific task, we not only reduced their review time but also increased their focus and thereby reduced their propensity for error. After the doctors' review, we're confident that the final sensitivity is, if anything, a conservative rather than an inflated estimate of the real diagnostic hit rate of the LLM.

To summarize, a “hit” is defined as a “diagnosis coded in the patient's electronic record which is directly related to a diagnosis generated by the LLM.” For example, “chronic kidney disease” would be accepted as a “hit” if “acute kidney injury” was suggested by the LLM. The rationale is that given the lack of patient medical history, both of these conditions are equally reasonable when one is suggesting alternative diagnoses based on the incomplete patient EHR data from MIMIC‐IV.

A “miss” is defined as “an inferable diagnosis coded in the patient's electronic record, not predicted by the LLM.” For example, if a true diagnosis of “hyperkalemia” was not detected, but the blood results showed a raised potassium level, then this would be considered a “miss.”
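The hit/miss/exclusion bookkeeping described in this section can be summarized in a short sketch; the function name and answer encoding are our own, not the authors':

```python
# Hypothetical classification of one ground-truth diagnosis from the answers
# GPT-4 gives to the two evaluation questions of Section 2.4.

def classify_diagnosis(is_medical: bool, q1_new_disease: str, q2_inferable: str) -> str:
    """Map evaluation answers to 'hit', 'miss', or 'excluded'."""
    if not is_medical:
        return "excluded"   # not a real medical diagnosis, e.g. "Do Not Resuscitate"
    if q1_new_disease == "No":
        return "hit"        # directly related to a predicted diagnosis
    if q2_inferable == "No":
        return "excluded"   # not inferable from the limited EHR data available
    return "miss"           # inferable from the data but not predicted
```

Only the “hit” and “miss” outcomes enter the denominator of Equation (1); “excluded” diagnoses are removed from it entirely.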

2.5. Major assumption

We make the assumption that the ground truth diagnoses are accurate, since the MIMIC‐IV data set stems from the Beth Israel Deaconess Medical Center, the teaching hospital of Harvard Medical School and arguably one of the most renowned and well‐respected hospitals in the world. However, it may be the case that some medical conditions that a patient evidently suffers from are not listed as part of the ground truth ICD diagnoses for that patient (reasons may vary, but the condition could be too minor or irrelevant for medical billing purposes). This is exactly why we do not quantify the false positive rate of the model (or, equivalently, metrics depending on the false positive rate, such as precision). As long as GPT‐4 gives a medically sound reason why a patient should have a certain condition based on the input data, it can be a useful suggestion for a doctor looking to diagnose the patient, regardless of whether that condition later ends up on the medical billing report. Indeed, the medical doctors who reviewed the data set found all the diagnostic suggestions GPT‐4 made reasonable, except for three seemingly erroneous suggestions noting the presence of an artificial heart, knee joint and shoulder joint, respectively. Given that GPT‐4 detected thousands of diagnoses across 1000 patients and only three of them were inaccurate, we believe the additional work required to remove these extremely rare unhelpful suggestions is negligible compared to the potential benefits of the technology. We do not quantify the true negative rate of the model either, since there is an arbitrarily large number of medical conditions that a patient doesn't have.

2.6. PaLM2

In this study we also evaluated the diagnostic ability of PaLM2 [18], Google's next‐generation LLM introduced in May 2023. We used the same diagnostic prompt described in Section 2.2 to get the suggested diagnoses from PaLM2; however, we employed GPT‐4 to evaluate the results. Note that the sole purpose of using GPT‐4 here is to get an automated proxy for the diagnostic ability of PaLM2: all the diagnostic predictions come from PaLM2. We use GPT‐4 rather than PaLM2 for the automated evaluation for two main reasons. First, we know from the manual review by doctors (see Section 2.4 and Results) that GPT‐4 is able to evaluate diagnostic predictions well. Second, in some hospital admission cases, the current maximum output token size (1024) of PaLM2 is not sufficient to iterate through all actual diagnoses and compare them to the predicted ones using the evaluation prompt described in Section 2.4.

3. RESULTS

Our automated evaluation results show an estimated Average Hit Rate of 94.1% and an estimated Aggregate Hit Rate of 94.3%. These numbers are based on our initial random sample of 1000 hospital admissions. During their review, the doctors found two false hits and uncovered a further 15 diagnoses which GPT‐4 falsely classified as non‐inferable. In the automated evaluation, the total number of inferable diagnoses was 4682, with 4414 hits. The number of non‐inferable diagnoses was 9661. After the manual review by doctors, the total number of hits decreased to 4412, the total number of inferable diagnoses increased to 4697 and the number of non‐inferable diagnoses decreased to 9646. Accounting for this, we report the final reviewed diagnostic hit rate (sensitivity) of GPT‐4 to be 93.9% (4412/4697). The 4412 hits correspond to 1116 unique ICD diagnoses. Based on this, we're able to say that GPT‐4 identified 1116 unique diagnoses by either naming the exact diagnosis or suggesting one that is directly related (for the definition of direct relatedness, see Section 2.4). In Table 1, we display the most common correctly identified diagnoses (hits) as well as the most common misses in our patient cohort.

Table 1.

Top 5 hits and misses.

Hit | Number of cases | Miss | Number of cases
Acute kidney failure | 192 | Anemia | 23
Diabetes mellitus without mention of complication | 128 | Unspecified essential hypertension | 11
Congestive heart failure | 98 | Essential primary hypertension | 11
Chronic kidney disease | 89 | Hypoxemia | 10
Acidosis | 86 | Hyposmolality and/or hypernatremia | 9

Next, we draw a comparison between the diagnostic ability of GPT‐4 and PaLM2. Our results show that the diagnostic ability of PaLM2 is inferior to that of GPT‐4: it achieved an Average Hit Rate of 87.0% and an Aggregate Hit Rate of 84.7%. The total number of inferable diagnoses was 4363, with 3694 hits. The number of non‐inferable diagnoses was 9978. The 3694 hits correspond to 1054 unique ICD diagnoses. Note that because of the inferior sensitivity, we have decided not to invest time in having the PaLM2 results reviewed by doctors.
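As a quick arithmetic cross-check, the headline sensitivities follow directly from the counts reported above (GPT‐4 after review: 4412 hits out of 4697 inferable diagnoses; PaLM2: 3694 out of 4363):

```python
# Re-derive the reported Aggregate Hit Rates from the stated counts.
gpt4_sensitivity = 4412 / 4697    # reviewed GPT-4 hits / inferable diagnoses
palm2_sensitivity = 3694 / 4363   # PaLM2 hits / inferable diagnoses

print(f"GPT-4: {gpt4_sensitivity:.1%}")   # 93.9%
print(f"PaLM2: {palm2_sensitivity:.1%}")  # 84.7%
```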

One might notice that the evaluation prompt first found 9661 non‐inferable diagnoses and then, when run to evaluate the predicted diagnoses obtained from PaLM2, found 9978 non‐inferable diagnoses on the same data set. This potentially raises concerns about the robustness and reproducibility of the automated evaluation results by GPT‐4; however, it is important to acknowledge that it is normal for LLMs to give slightly different answers each time to the same complex query, and we attenuated the randomness of GPT‐4 in the evaluation stage by setting the temperature to zero. Further, the inferability of some of these diagnoses from the data is debatable, and by having the data reviewed by doctors who were asked to find additional false positives, we believe we mitigated these reliability concerns and that the final reviewed diagnostic hit rate of 93.9% is a reliable estimate of the true diagnostic ability of baseline GPT‐4 on a complex EHR data set.

Example GPT‐4 prediction and evaluation on a real hospital admission case:

Patient data:

Blood report:

The patient stayed in the hospital and had the following statistics on the blood test measurements during the stay: min wbc: 4.3, min absolute basophils: 0.03, min absolute eosinophils: 0.03, min absolute monocytes: 1.16, min absolute neutrophils: 9.25, min basophils: 0.3, min eosinophils: 0.3, min lymphocytes: 10, min monocytes: 9.9, min neutrophils: 78.5, min immature granulocytes: 1, min ntprobnp: 4920, min aniongap: 15, min bun: 54, min calcium: 7.8, min chloride: 90, min creatinine: 2.5, min glucose: 87, min sodium: 134, min potassium: 3.5, min inr: 1.6, min pt: 17.8, min ptt: 40.2, min hematocrit: 22.9, min hemoglobin (g/dl): 7.2, min mch: 28.2, min mchc: 30.5, min mcv: 90, min platelet: 138, min rbc: 2.5, min rdw: 16.8, max wbc: 12.4, max absolute basophils: 0.03, max absolute eosinophils: 0.03, max absolute monocytes: 1.16, max absolute neutrophils: 9.25, max basophils: 0.3, max eosinophils: 0.3, max lymphocytes: 10, max monocytes: 9.9, max neutrophils: 78.5, max immature granulocytes: 1, max ntprobnp: 29612, max aniongap: 23, max bun: 86, max calcium: 9.2, max chloride: 101, max creatinine: 3.6, max glucose: 152, max sodium: 144, max potassium: 4.7, max inr: 3.4, max pt: 37.7, max ptt: 150, max hematocrit: 26.4, max hemoglobin (g/dl): 8.2, max mch: 29.6, max mchc: 32.4, max mcv: 95, max platelet: 195, max rbc: 2.85, max rdw: 18.1, avg wbc: 6.75, avg absolute basophils: 0.03, avg absolute eosinophils: 0.03, avg absolute monocytes: 1.16, avg absolute neutrophils: 9.25, avg basophils: 0.3, avg eosinophils: 0.3, avg lymphocytes: 10, avg monocytes: 9.9, avg neutrophils: 78.5, avg immature granulocytes: 1, avg ntprobnp: 17266, avg aniongap: 19.09, avg bun: 69.14, avg calcium: 8.33, avg chloride: 94.77, avg creatinine: 2.9, avg glucose: 104.55, avg sodium: 137.14, avg potassium: 4.01, avg inr: 2.18, avg pt: 23.86, avg ptt: 104.54, avg hematocrit: 24.26, avg hemoglobin (g/dl): 7.62, avg mch: 28.89, avg mchc: 31.41, avg mcv: 92.08, avg platelet: 175.77, avg rbc: 2.64, avg rdw: 17.23

Blood gas report:

71 h after admission the blood gas results from arterial blood are: ph: 7.48, po2: 166, pco2: 30, base excess: 0, total co2: 23, lactate: 0.9

Imaging report:

‐8 h after admission the radiologist wrote: INDICATION: ___ with CHF//New O2 requirement and Hx of CHF

TECHNIQUE: Single portable view of the chest.

COMPARISON: Chest x‐ray from ___. Chest CT from ___.

FINDINGS:

There are bilateral calcified pleural plaques, more extensive on the right than on the left. There are superimposed more confluent underlying opacities, particularly at the lung bases which have progressed since prior. There are also small to moderate bilateral pleural effusions and pulmonary edema. Cardiomediastinal silhouette is stable. Median sternotomy wires are intact. Atherosclerotic calcifications seen at the aorta.

IMPRESSION:

New small to moderate bilateral pleural effusions with pulmonary edema and more confluent regions of consolidation raising concern for superimposed infection.

43 h after admission the radiologist wrote: EXAMINATION: UNILAT LOWER EXT VEINS

INDICATION: ___ year old woman with CHF exacerbation, APLA syndrome w/history

of DVT on warfarin who has LLE pain and edema.//?DVT

TECHNIQUE: Gray scale, color, and spectral Doppler evaluation was performed on the left lower extremity veins.

COMPARISON: No prior similar exams for comparison.

FINDINGS:

There is normal compressibility, flow, and augmentation of the left common femoral, femoral, and popliteal veins. Normal color flow is demonstrated in the tibial and peroneal veins.

There is normal respiratory variation in the common femoral veins bilaterally.

IMPRESSION:

No evidence of deep venous thrombosis in the left lower extremity veins.

69 h after admission the radiologist wrote: INDICATION: ___ year old woman with history of CAD, HFpEF, moderate AS, Af,

APLA syndrome, admitted for acute on chronic HFpEF, now with worsening SOB and persistent 5 L O2 requirement.//?Interval change

TECHNIQUE: Frontal radiograph of the chest

COMPARISON: ___.

IMPRESSION:

Moderate left and small right pleural effusions may be slightly increased in size from prior exam. Moderate bilateral pulmonary edema appears similar. Opacity at the right lung base may represent pulmonary edema though consolidative opacity cannot be excluded. Median sternotomy wires appear unchanged. Cardiomegaly appears unchanged. No pneumothorax.

214 h after admission the radiologist wrote: INDICATION: ___ year old woman with PMH AFib, HFpEF, prior MVR with HF exacerbation after multiple admissions for PNA//please evaluate for PNA vs pulm edema interval change after diuresis.

TECHNIQUE: AP and lateral radiographs of the chest.

COMPARISON: Radiograph from ___.

FINDINGS:

Compared to the prior exam, an opacity at the right lung base appears improved. Moderate left and small right pleural effusions are grossly unchanged. Median sternotomy wires appear unchanged. Persistent dense left basilar atelectasis. Mild cardiomegaly is unchanged. There is no evidence of pneumothorax. Mild pulmonary edema appears slightly improved. Visualized osseous structures are unremarkable.

IMPRESSION:

1. Interval improvement of bibasilar opacities, right greater than left, however with persistent moderate left and small right pleural effusions.

2. Slight interval improvement of mild pulmonary edema.

3. Persistent dense left basilar atelectasis.

Predicted diagnoses:

1: Congestive heart failure (CHF)

Foundational data: “Bibasilar opacities, right greater than left; Persistent moderate left and small right pleural effusions; Slight interval improvement of mild pulmonary edema; Persistent dense left basilar atelectasis,” “maxntprobnp: 29612,” “avg ntprobnp: 17266.”

2: Atelectasis

Foundational data from Chest X‐Ray: “Persistent dense left basilar atelectasis.”

3: Pleural effusion, not elsewhere classified

Foundational data from Chest X‐Ray: “Persistent moderate left and small right pleural effusions.”

Further tests: Thoracentesis, CT chest

4: Hyperglycemia, unspecified

Foundational data: “max glucose: 152, avg glucose: 104.55.”

Further tests: Hemoglobin A1c (HbA1c) test

5: Acquired coagulation factor deficiency

Foundational data: “max inr: 3.4, max pt: 37.7, max ptt: 150.”

Further tests: Antiphospholipid Antibodies (APL), Protein C, Protein S, Antithrombin III, Factor V Leiden, Fibrinogen test

6. Unspecified acute kidney failure:

Foundational data: “max bun: 86, max creatinine: 3.6, avg bun: 69.14, avg creatinine: 2.9”

Further tests: Urine output measurements for oliguria, ultrasound to rule out obstruction

7. Presence of median sternotomy wires and history of mitral valve replacement (MVR).

Foundational data: The imaging report mentions: “Median sternotomy wires are intact.” and “___ year old woman with PMH AFib, HFpEF, prior MVR with HF exacerbation after multiple admissions for PNA”

8. Antiphospholipid syndrome (APLA syndrome)

Foundational data: The mention of “APLA syndrome” hints at this autoimmune disorder that can cause blood clots.

9. Chronic obstructive pulmonary disease (COPD) including chronic bronchitis.

Foundational data: “Bilateral calcified pleural plaques”

Further diseases based on these diagnoses:

10. Pulmonary heart disease, unspecified

Foundational data: CHF, pulmonary edema, Atelectasis, pleural effusions, and COPD are all diseases that can lead to Pulmonary heart disease.

11. Chronic Kidney Disease (CKD)

Foundational data: High BUN and creatinine levels suggest kidney dysfunction which requires long‐term monitoring to diagnose CKD.

12. Type 2 diabetes mellitus with renal complications

Foundational data: Elevated glucose levels and kidney dysfunction can be indicative of this concurrent condition. A confirmed diagnosis would require further analysis and monitoring by a health care professional.
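The predicted-diagnoses block above follows a regular layout: a numbered diagnosis, a "Foundational data" line citing the evidence, and an optional "Further tests" line. A minimal sketch of how such output could be parsed into structured records is shown below; the regex and field names are our illustrative assumptions, not the authors' actual parser.

```python
import re

def parse_predictions(text: str) -> list[dict[str, str]]:
    """Split numbered LLM diagnosis output into structured records.

    Assumes each diagnosis starts a line with "<n>:" or "<n>." and may be
    followed by "Foundational data:" and "Further tests:" lines (an
    assumption based on the example format, not a guaranteed schema).
    """
    records = []
    blocks = re.split(r"(?m)^\d+[.:]\s*", text)
    for block in blocks[1:]:  # blocks[0] is any preamble before diagnosis 1
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
        record = {"diagnosis": lines[0], "foundational_data": "", "further_tests": ""}
        for ln in lines[1:]:
            if ln.lower().startswith("foundational data"):
                record["foundational_data"] = ln.split(":", 1)[1].strip()
            elif ln.lower().startswith("further tests"):
                record["further_tests"] = ln.split(":", 1)[1].strip()
        records.append(record)
    return records

sample = """1: Congestive heart failure (CHF)
Foundational data: "avg ntprobnp: 17266."
2: Atelectasis
Foundational data: "Persistent dense left basilar atelectasis."
Further tests: CT chest"""
preds = parse_predictions(sample)
```

Structuring the output this way is what makes an automated comparison against coded ground-truth diagnoses feasible at the scale of 1000 records.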

Automated Evaluation summary:

Number of correctly identified diagnoses: 7

Number of diagnoses excluded because they are deemed to be unreasonable to infer from data: 22

Number of diagnoses missed by LLM: 3

Automated evaluation:

D1: Hypertensive heart and chronic kidney disease with heart failure and stage 1 through stage 4 chronic kidney disease, or unspecified chronic kidney disease

Question 1: No, this is similar to diagnosis 1: Congestive heart failure (CHF) and diagnosis 6: Unspecified acute kidney failure.

Question 2: N/A

D2: Acute on chronic diastolic (congestive) heart failure

Question 1: No, this is similar to diagnosis 1: Congestive heart failure (CHF).

Question 2: N/A

D3: Myocardial infarction type 2

Question 1: Yes

Question 2: No, additional data such as EKG changes, troponin levels, or chest pain symptoms would have been helpful to establish this diagnosis.

D4: Acute respiratory failure with hypoxia

Question 1: Yes

Question 2: No, additional data such as oxygen saturation levels, respiratory rate, or arterial blood gas analysis would have been helpful to establish this diagnosis.

D5: Acute kidney failure, unspecified

Question 1: No, this is similar to diagnosis 6: Unspecified acute kidney failure.

Question 2: N/A

D6: Antiphospholipid syndrome

Question 1: No, this is similar to diagnosis 8: Antiphospholipid syndrome (APLA syndrome).

Question 2: N/A

D7: Pulmonary hypertension, unspecified

Question 1: Yes

Question 2: No, additional data such as echocardiogram results or pulmonary artery pressure measurements would have been helpful to establish this diagnosis.

D8: Unspecified atrial fibrillation

Question 1: Yes

Question 2: No, additional data such as EKG results or patient symptoms would have been helpful to establish this diagnosis.

D9: Epilepsy, unspecified, not intractable, without status epilepticus

Question 1: Yes

Question 2: No, additional data such as EEG results, seizure history, or neurological examination findings would have been helpful to establish this diagnosis.

D10: Anemia in chronic kidney disease

Question 1: Yes

Question 2: Yes, the blood report mentions “min hemoglobin (g/dl): 7.2, max hemoglobin (g/dl): 8.2, avg hemoglobin (g/dl): 7.62” which indicates anemia, and the diagnosis of chronic kidney disease was also suggested.

D11: Hyperlipidemia, unspecified

Question 1: Yes

Question 2: No, additional data such as cholesterol levels would have been helpful to establish this diagnosis.

D12: Nonrheumatic aortic (valve) stenosis

Question 1: Yes

Question 2: No, additional data such as echocardiogram results or physical examination findings would have been helpful to establish this diagnosis.

D13: Atherosclerotic heart disease of native coronary artery without angina pectoris

Question 1: Yes

Question 2: No, additional data such as coronary angiogram results or patient symptoms would have been helpful to establish this diagnosis.

D14: Chronic kidney disease, unspecified

Question 1: No, this is similar to diagnosis 11: Chronic Kidney Disease (CKD).

Question 2: N/A

D15: Epistaxis

Question 1: Yes

Question 2: No, additional data such as patient symptoms or physical examination findings would have been helpful to establish this diagnosis.

D16: Elevated white blood cell count, unspecified

Question 1: Yes

Question 2: Yes, the blood report mentions “max wbc: 12.4, avg wbc: 6.75” which indicates elevated white blood cell count.

D17: Hypothyroidism, unspecified

Question 1: Yes

Question 2: No, additional data such as TSH or T4 levels would have been helpful to establish this diagnosis.

D18: Vitamin D deficiency, unspecified

Question 1: Yes

Question 2: No, additional data such as vitamin D levels would have been helpful to establish this diagnosis.

D19: Fever, unspecified

Question 1: Yes

Question 2: No, additional data such as temperature measurements would have been helpful to establish this diagnosis.

D20: Nonscarring hair loss, unspecified

Question 1: Yes

Question 2: No, additional data such as patient symptoms or physical examination findings would have been helpful to establish this diagnosis.

D21: Gout, unspecified

Question 1: Yes

Question 2: No, additional data such as uric acid levels or joint examination findings would have been helpful to establish this diagnosis.

D22: Contact with and (suspected) exposure to asbestos

Question 1: Yes

Question 2: No, additional data such as occupational history or lung biopsy results would have been helpful to establish this diagnosis.

D23: Coronary angioplasty status

Question 1: Yes

Question 2: No, additional data such as patient medical history or angiogram results would have been helpful to establish this diagnosis.

D24: Long term (current) use of anticoagulants

Question 1: Yes

Question 2: Yes, the patient data mentions “___ year old woman with CHF exacerbation, APLA syndrome w/history of DVT on warfarin”

D25: Presence of prosthetic heart valve

Question 1: No, this is similar to diagnosis 7: Presence of median sternotomy wires and history of mitral valve replacement (MVR).

Question 2: N/A

D26: Personal history of other malignant neoplasm of bronchus and lung

Question 1: Yes

Question 2: No, additional data such as patient medical history or biopsy results would have been helpful to establish this diagnosis.

D27: Personal history of other venous thrombosis and embolism

Question 1: Yes

Question 2: No, additional data such as patient medical history or imaging results would have been helpful to establish this diagnosis.

D28: Personal history of transient ischemic attack (TIA), and cerebral infarction without residual deficits

Question 1: Yes

Question 2: No, additional data such as patient medical history or neuroimaging results would have been helpful to establish this diagnosis.

D29: Personal history of pneumonia (recurrent)

Question 1: Yes

Question 2: No, additional data such as patient medical history or chest imaging results would have been helpful to establish this diagnosis.

D30: Presence of xenogenic heart valve

Question 1: No, this is similar to diagnosis 7: Presence of median sternotomy wires and history of mitral valve replacement (MVR).

Question 2: N/A

D31: Personal history of nicotine dependence

Question 1: Yes

Question 2: No, additional data such as patient medical history or patient symptoms would have been helpful to establish this diagnosis.

D32: Family history of ischemic heart disease and other diseases of the circulatory system

Question 1: Yes

Question 2: No, additional data such as family medical history would have been helpful to establish this diagnosis.
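The per-patient counts in the evaluation summary above (correctly identified, excluded, missed) map directly onto the hit-rate (sensitivity) metric reported in the Results. A minimal sketch of that bookkeeping follows; the function names are ours, not the authors' actual pipeline.

```python
def hit_rate(correct: int, missed: int) -> float:
    """Sensitivity over the ground-truth diagnoses retained for scoring.

    Diagnoses excluded as unreasonable to infer from the available data
    count neither for nor against the model.
    """
    retained = correct + missed
    return float("nan") if retained == 0 else correct / retained

def aggregate_hit_rate(per_patient: list[tuple[int, int]]) -> float:
    """Pool (correct, missed) counts across patients before dividing."""
    total_correct = sum(c for c, _ in per_patient)
    total_missed = sum(m for _, m in per_patient)
    return hit_rate(total_correct, total_missed)

# The example patient above: 7 correct, 22 excluded, 3 missed.
example_rate = hit_rate(correct=7, missed=3)  # 0.7 for this single record
```

Aggregated over the 1000 sampled records, this is the computation behind the headline estimates of 93.9% for GPT-4 and 84.7% for PaLM2.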

In the future, we plan to evaluate Google's MedPaLM2 (not yet available for public use) in a similar fashion to draw a direct comparison between the diagnostic ability of MedPaLM2 and that of GPT‐4. MedPaLM2 is a medical LLM built on PaLM2 as a base model with targeted medical domain‐specific fine‐tuning [19]. MedPaLM2 achieved 86.5% on MultiMedQA, a medical question‐answer data set [20], outperforming GPT‐4, which achieved 86.1% [19]. We then plan to use the remainder of the MIMIC‐IV data not employed in this study to fine‐tune both GPT‐4 and MedPaLM2, and to evaluate the diagnostic ability of these fine‐tuned LLMs using the method introduced here.

4. DISCUSSION

We believe that our results estimating the diagnostic ability of GPT‐4 are very promising, given that the diagnostic error rate (false negative rate) of GPT‐4 is lower than the estimated misdiagnosis rate in hospitals (around 10%, as discussed in the Introduction). We hypothesize that either a fine‐tuned version of GPT‐4 or a more sophisticated LLM will soon be able to help reduce hospital misdiagnosis to well below the prevailing level. Nevertheless, ethical, regulatory and liability challenges remain the main blockers to integrating AI into health care.

Navigating the labyrinth of ethical and regulatory complexities associated with the deployment of LLMs in the health care sector is a challenging task. Data acquisition, storage, and processing necessitate a continuous and thorough examination through the lens of legal and ethical principles. In the context of data acquisition, issues such as consent, de‐identification, and the minimum necessary principle are pertinent. Utilizing patient data for AI training without informed consent may infringe upon the patients' autonomy [21]. Furthermore, de‐identifying patient data, though seemingly straightforward, can pose challenges due to the potential for re‐identification, rendering privacy safeguards ineffective. As for data storage and processing, the risks of data breaches and unauthorized access are omnipresent, warranting robust cybersecurity mechanisms. Transparent data usage policies and compliance with the provisions of data protection regulations, such as the General Data Protection Regulation (GDPR) in the European Union and the Health Insurance Portability and Accountability Act (HIPAA) in the United States, are indispensable. Further concerns related to the lack of transparency (the “black box” problem) [22] and potential biases in AI predictions [23] may emerge. The unavailability of a clear audit trail for AI predictions could hamper accountability, while bias can inadvertently creep into AI outputs due to skewed representation or inherent prejudices in the training data. To address these concerns, regulatory mechanisms must ensure rigorous algorithmic auditing, facilitate interpretability, and promote the principle of algorithmic fairness.

One notable drawback of LLMs is the phenomenon of “hallucination”—the generation of outputs lacking verifiable correlation with the input data. In the medical sphere, the implications of such a limitation are significant, with the potential to critically impact patient outcomes [24]. Recent techniques, such as Retrieval Augmented Generation (RAG), have been proposed to combat hallucination [25]. Hence, establishing who or what is responsible in the case of medical errors attributed to LLMs is a crucial question. Determining liability in the event of adverse outcomes due to misdiagnoses instigated by AI hallucinations, or misinterpretation of AI outputs by health care professionals, is a complex conundrum. Traditional negligence theories may fall short in addressing these novel scenarios, prompting the need for new legal perspectives. The risk of over‐reliance on LLMs is another significant liability concern. The allure of AI, coupled with its capacity for speedy, high‐volume data analysis, may tempt health care providers to delegate critical clinical decision‐making to AI, which could lead to “automation bias”. This over‐reliance could potentially compromise the quality of patient care and fuel an upsurge in litigation from patients and hospitals. Liability insurance providers, too, would need to reassess their coverage parameters, factoring in the unique challenges posed by AI. On the one hand, AI tools could decrease malpractice claims by reducing human errors; on the other, novel forms of malpractice and unprecedented errors could occur due to AI malfunctions or misuse. This underscores the necessity for human practitioners to maintain ultimate responsibility for diagnosis and decision‐making.
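To make the RAG idea mentioned above concrete, the sketch below grounds a diagnostic prompt in retrieved reference text instead of relying on the model's parametric memory alone. The corpus, the naive keyword-overlap retriever, and the prompt template are all illustrative assumptions, not the authors' implementation or a production design (a real system would use embedding-based retrieval over vetted clinical sources).

```python
# Toy reference corpus standing in for vetted clinical guideline snippets.
CORPUS = {
    "heart_failure": "Elevated NT-proBNP supports a diagnosis of heart failure.",
    "aki": "A rising creatinine and BUN suggest acute kidney injury.",
    "anemia": "Hemoglobin below 8 g/dL indicates severe anemia.",
}

def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Rank passages by naive keyword overlap with the query
    (a stand-in for a real embedding-based retriever)."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        corpus.values(),
        key=lambda text: len(q_tokens & set(text.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(patient_summary: str, corpus: dict[str, str]) -> str:
    """Prepend retrieved reference passages so the LLM can cite them,
    reducing the room for unverifiable (hallucinated) claims."""
    evidence = "\n".join(retrieve(patient_summary, corpus))
    return (
        f"Reference material:\n{evidence}\n\n"
        f"Patient data:\n{patient_summary}\n\n"
        "Diagnoses grounded in the references:"
    )

prompt = build_prompt("max creatinine 3.6, rising bun 86", CORPUS)
```

The design point is that every generated diagnosis can then be traced back to a retrieved passage, which also provides the audit trail that the regulatory discussion above calls for.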

Further, the integration of LLMs into existing EPR systems presents a formidable technical challenge. The diverse array of EPR systems, each with its own complexities, makes the effective implementation of system‐agnostic AI software a difficult technical task. At the time of writing, in the UK alone there were eight different accredited enterprise‐wide EPR systems. There are ongoing efforts in the United States to produce standardized APIs for health records, but their outcome remains to be seen.

Practical barriers are not the only obstacle: psychological barriers, such as user adoption, also need addressing. The efficacy and continued improvement of AI tools are contingent on their endorsement and utilization by health care professionals. Concerns about AI superseding jobs, misconceptions regarding AI capabilities, and skepticism towards AI predictions could impede widespread acceptance. Consequently, developing user‐friendly AI tools and fostering AI literacy within the medical community are integral to this endeavor.

5. CONCLUSION

The evaluation of GPT‐4's diagnostic performance using the MIMIC‐IV data set represents a significant contribution to the ongoing dialog on the role of AI in health care. Our findings illuminate not only the potential of LLMs like GPT‐4 in enhancing diagnostic accuracy (we estimate a sensitivity of 93.9% based on 1000 patients, a considerable improvement over the real‐life benchmark of 85%–90%) but also underscore their inherent limitations and challenges. It is critical to remember that, while the diagnostic accuracy of GPT‐4 is impressive, AI tools like this should not replace medical practitioners in patient care. Instead, they should serve as supplementary tools that aid the diagnostic and decision‐making process. The human element in health care (empathy, understanding, and nuanced decision‐making based on a comprehensive understanding of the patient) is something AI cannot duplicate, yet.

Future research could profitably explore the fine‐tuning of foundational LLMs and investigate the real‐world impact of implementing LLM‐based tools in health care, including cost savings, effects on patient outcomes and changes in clinical workflows. We believe the future lies in employing AI to take rote, analytic and tedious tasks off the shoulders of physicians: rather than reading through and writing patient records and notes, they can focus on actively practicing patient care, increasing their productivity, which will be key in a world of aging populations and hospital staff shortages. This is not just our view but appears to be generally accepted by medical decision makers: the recent NHS workforce plan [26] highlights the importance of using AI in a medical setting to make doctors more productive and to reduce the desperate need for more staff.

In conclusion, while AI models like GPT‐4 herald a transformative potential in health care, their implementation is not devoid of challenges and considerations. These encompass data privacy concerns, user adoption hurdles, technical limitations, and ethical implications. A sustained engagement with these issues can facilitate a more nuanced and accurate understanding of AI's role in health care, paving the way towards improved patient care and streamlined health care delivery.

AUTHOR CONTRIBUTIONS

Peter Sarvari: Conceptualization—equal; data curation—equal; formal analysis—lead; investigation—equal; methodology—equal; project administration—equal. Zaid Al‐fagih: Conceptualization—equal; data curation—equal; investigation—equal; methodology—equal; project administration—equal. Abdullatif Ghuwel: Data curation—supporting; investigation—supporting. Othman Al‐fagih: Data curation—supporting; investigation—supporting.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

ETHICS STATEMENT

The research has been conducted in accordance with Wiley's Best Practice Guidelines on Publishing Ethics.

INFORMED CONSENT

Research was carried out in accordance with the Declaration of Helsinki, as revised in 2013.

Supporting information

Supporting information.

HCS2-3-3-s001.xlsx (1.3MB, xlsx)

ACKNOWLEDGMENTS

We would like to thank the MIMIC‐IV and PhysioNet authors as well as the staff and patients at Beth Israel Deaconess Medical Center for making it possible to conduct this study by making their data available for research purposes and thereby advancing science and health care research tremendously.

Sarvari P, Al‐fagih Z, Ghuwel A, Al‐fagih O. A systematic evaluation of the performance of GPT‐4 and PaLM2 to diagnose comorbidities in MIMIC‐IV patients. Health Care Sci. 2024;3:3–18. 10.1002/hcs2.79

DATA AVAILABILITY STATEMENT

MIMIC‐IV data is openly available for credentialed researchers on PhysioNet. GPT‐4 submission occurred via secure API after 1 March 2023, which means that the data are not retained by OpenAI for model training purposes.

REFERENCES

1. Newman‐Toker DE, Wang Z, Zhu Y, Nassery N, Saber Tehrani AS, Schaffer AC, et al. Rate of diagnostic errors and serious misdiagnosis‐related harms for major vascular events, infections, and cancers: toward a national incidence estimate using the “Big Three”. Diagnosis. 2020;8(1):67–84. 10.1515/dx-2019-0104
2. Neale G, Hogan H, Sevdalis N. Misdiagnosis: analysis based on case record review with proposals aimed to improve diagnostic processes. Clin Med. 2011;11(4):317–321. 10.7861/clinmedicine.11-4-317
3. Makary MA, Daniel M. Medical error—the third leading cause of death in the US. BMJ. 2016;353:i2139. 10.1136/bmj.i2139
4. Mello MM, Chandra A, Gawande AA, Studdert DM. National costs of the medical liability system. Health Aff. 2010;29(9):1569–1577. 10.1377/hlthaff.2009.0807
5. Newman‐Toker DE, McDonald KM, Meltzer DO. How much diagnostic safety can we afford, and how should we decide? A health economics perspective. BMJ Qual Safety. 2013;22(Suppl 2):ii11–ii20. 10.1136/bmjqs-2012-001616
6. Shrank WH, Rogstad TL, Parekh N. Waste in the US health care system: estimated costs and potential for savings. JAMA. 2019;322(15):1501–1509. 10.1001/jama.2019.13978
7. Thammasitboon S, Cutrer WB. Diagnostic decision‐making and strategies to improve diagnosis. Curr Probl Pediatr Adolesc Health Care. 2013;43(9):232–241. 10.1016/j.cppeds.2013.07.003
8. Graber ML, Franklin N, Gordon R. Diagnostic error in internal medicine. Arch Intern Med. 2005;165(13):1493–1499. 10.1001/archinte.165.13.1493
9. OpenAI. https://openai.com/gpt-4 (2023). Accessed 12 Oct. 2023.
10. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT‐4 on medical challenge problems. arXiv [Preprint]. 2023. 10.48550/arXiv.2303.13375
11. Levine DM, Tuwani R, Kompa B, Varma A, Finlayson SG, Mehrotra A, et al. The diagnostic and triage accuracy of the GPT‐3 artificial intelligence model. medRxiv [Preprint]. 2023. 10.1101/2023.01.30.23285067
12. Mehnen L, Gruarin S, Vasileva M, Knapp B. ChatGPT as a medical doctor? A diagnostic accuracy study on common and rare diseases. medRxiv [Preprint]. 2023. 10.1101/2023.04.20.23288859
13. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78–80. 10.1001/jama.2023.8288
14. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi LA, Mark R. MIMIC‐IV (version 2.2). PhysioNet. 10.13026/6mm1-ek67 (2023). Accessed 12 Oct. 2023.
15. Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC‐IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1. 10.1038/s41597-022-01899-x
16. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000;101(23):E215–E220. 10.1161/01.cir.101.23.e215
17. Masud JHB, Kuo C‐C, Yeh C‐Y, Yang H‐C, Lin M‐C. Applying deep learning model to predict diagnosis code of medical records. Diagnostics. 2023;13(13):2297. 10.3390/diagnostics13132297
18. Google. PaLM 2 Technical Report. 2023. https://ai.google/static/documents/palm2techreport.pdf. Accessed 12 Oct. 2023.
19. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, et al. Towards expert‐level medical question answering with large language models. arXiv [Preprint]. 2023. 10.48550/arXiv.2305.09617
20. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620:172–180. 10.1038/s41586-023-06291-2
21. Blanco‐González A, Cabezón A, Seco‐González A, Conde‐Torres D, Antelo‐Riveiro P, Piñeiro Á, et al. The role of AI in drug discovery: challenges, opportunities, and strategies. Pharmaceuticals. 2023;16(6):891. 10.3390/ph16060891
22. Wadden JJ. Defining the undefinable: the black box problem in healthcare artificial intelligence. J Med Ethics. 2022;48:764–768. 10.1136/medethics-2021-107529
23. Flores L, Kim S, Young SD. Addressing bias in artificial intelligence for public health surveillance. J Med Ethics. 2023. 10.1136/jme-2022-108875
24. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15(2):e35179. 10.7759/cureus.35179
25. Wang C, Ong J, Wang C, Ong H, Cheng R, Ong D. Potential for GPT technology to optimize future clinical decision‐making using retrieval‐augmented generation. Ann Biomed Eng. 2023. 10.1007/s10439-023-03327-6
26. NHS Long‐term Workforce Plan. https://www.england.nhs.uk/long-read/accessible-nhs-long-term-workforce-plan/2023 (2023). Accessed 12 Oct. 2023.



Articles from Health Care Science are provided here courtesy of Wiley on behalf of Tsinghua University Press
