2025 Nov 3;6:e74899. doi: 10.2196/74899

Table 2. Representative responses from the LLMs[a].

Model | Experiment | Patient characteristics | LLM response[b,c] | HT[d] treatment decision | Interpretation
Model: PaLM 2
Experiment: Raw
Patient characteristics: 56-year-old male; EuroSCORE[e] II: 0.55%; no comorbidities except diffuse, mild coronary atherosclerosis and arterial hypertension; no relevant anatomical aspects to consider
LLM response: “Based on the provided information, the recommended treatment for the patient would be transcatheter aortic valve replacement (TAVR). The patient has severe aortic stenosis, which is causing symptoms and has been confirmed by echocardiography and cardiac catheterization. The patient is also at high risk for surgical aortic valve replacement (SAVR) due to their age, comorbidities, and the presence of diffuse coronary artery disease[f]. TAVR is a less invasive procedure than SAVR and is associated with a lower risk of complications[f]. Therefore, TAVR[f] would be the preferred treatment option for this patient.”
HT treatment decision: SAVR[g]
Interpretation:
  • The LLM incorrectly recommended TAVR[h] based on falsely extracted and/or interpreted patient characteristics (factual inconsistency).
  • According to the literature, it has not yet been conclusively determined whether TAVR is associated with fewer complications than SAVR [20] (factual inconsistency/bias).

Model: BioGPT
Experiment: Raw
Patient characteristics: 69-year-old male; EuroSCORE II: 7.2%; postcardiac arrest syndrome, frailty, long-term mechanical ventilation, and liver cirrhosis; no relevant anatomical aspects to consider
LLM response: “Surgical replacement of the aortic valve. The patient was referred to the operating room for AVR. The operation was performed under general anesthesia and extracorporeal circulation ... The operation was performed with a median sternotomy. The ascending aorta was dissected free and opened…and the coronary arteries were visualized…[f]”
HT treatment decision: TAVR
Interpretation:
  • The LLM fabricated a surgical report for a procedure this patient never underwent and that was not mentioned in the prompt (context inconsistency).

Model: BioGPT
Experiment: Raw
Patient characteristics: 75-year-old female; EuroSCORE II: 2.4%; STS[i] score: 2.9%; COPD[j] (GOLD[k] Stage III); pulmonary hypertension and frailty; no relevant anatomical aspects to consider
LLM response: “Heart transplantation. - The patient’s family decided against surgery. - The patient died due to heart failure.[f]”
HT treatment decision: TAVR
Interpretation:
  • The LLM did not follow the instruction to choose between SAVR and TAVR but opted for heart transplantation (instruction inconsistency).
  • The LLM fabricated a clinical course that was not mentioned in the prompt (context inconsistency).

Model: GPT-3.5
Experiment: Sum
Patient characteristics: 72-year-old female; EuroSCORE II: 1.6%; STS score: 1.1%; no relevant comorbidities; no relevant anatomical aspects to consider
LLM response:
  • Response₁: “Based on the given information, the recommended treatment for a person with severe aortic valve stenosis and no other significant comorbidities would be SAVR[l] (Surgical Aortic Valve Replacement) if conservative therapy is not an option.”
  • Response₂: “Based on the provided information, the recommended treatment for a person with severe aortic valve stenosis, who does not have any contraindications or relevant comorbidities, would be TAVI[f] (Transcatheter Aortic Valve Implantation) rather than SAVR (Surgical Aortic Valve Replacement).”
HT treatment decision: SAVR
Interpretation:
  • The LLM gave opposite answers to the exact same question (unreliability).
  • The LLM incorrectly recommended TAVR based on the absence of comorbidities (factual inconsistency).

Model: PaLM 2
Experiment: Raw+
Patient characteristics: 56-year-old male; EuroSCORE II: 0.55%; STS score: 0.7%; no comorbidities except arterial hypertension; no relevant anatomical aspects to consider
LLM response: “Based on the guidelines, the recommended treatment would be transcatheter aortic valve replacement (TAVR[f]). The patient is 56 years old and at high surgical risk[f] (STS-PROM score >8% or EuroSCORE II >8 %) due to severe frailty[f] and previous cardiac surgery[f]. TAVR is the preferred treatment for patients who are inoperable or at high surgical risk.”
HT treatment decision: SAVR
Interpretation:
  • The LLM falsely claimed that the patient is at high surgical risk due to frailty and previous cardiac surgery, both of which were entirely fabricated (factual inconsistency).

Model: GPT-3.5
Experiment: Sum+
Patient characteristics: 81-year-old female; logistic EuroSCORE: 8.44%; EuroSCORE II: 1.82%; STS score: 4.33%; stage 3A chronic kidney disease; no relevant anatomical aspects to consider
LLM response: “Based on the guidelines provided, the recommended treatment for the patient would be SAVR[f] (surgical aortic valve replacement) if conservative therapy is not an option. This is because the patient is 81 years old and falls into the older age category[l], and their logistic EuroSCORE[f] is 8.44%, which is above the threshold for TAVR (transcatheter aortic valve replacement) recommendation.”
HT treatment decision: TAVR
Interpretation:
  • The LLM incorrectly drew an association between the logistic EuroSCORE and a surgical risk prohibiting SAVR (context inconsistency).
  • The LLM correctly assigned the patient to the older age category but then incorrectly recommended SAVR as the optimal therapy (logical inconsistency).

Model: GPT-4
Experiment: Raw
Patient characteristics: 65-year-old female; EuroSCORE II: 2.5%; STS score: 1.4%; no relevant comorbidities; ascending aortic aneurysm (48 mm) mentioned in the CT[m] scan report
LLM response: “Based on the guidelines, the recommended treatment would be Surgical Aortic Valve Replacement (SAVR[l]) as the patient has a severe aortic stenosis and an ascending aortic aneurysm[l]. Transcatheter Aortic Valve Implantation (TAVI) is typically recommended for patients who are at high or prohibitive surgical risk, which does not seem to be the case here.[l]”
HT treatment decision: SAVR
Interpretation:
  • Correct treatment decision, plus additional extraction of decision-relevant clinical data (the ascending aortic aneurysm from the CT scan report).

Model: LLaMA-2
Experiment: Sum+
Patient characteristics: 68-year-old male; EuroSCORE II: 1.29%; STS score: 3.04%; COPD; no relevant anatomical aspects to consider
LLM response:
  • Response₁: “Based on the guidelines provided, the recommended treatment for this patient would be TAVI. The patient’s age (68 y old) and low surgical risk[l] (EuroSCORE II 1.29 %, STS score 3.037 %) make them a suitable candidate for TAVI. Additionally, the patient does not have any contraindications for TAVI[f], such as severe frailty, previous cardiac surgery, or significant thorax deformity.[f]”
  • Response₂: “Based on the information provided, the patient’s EuroSCORE II is 1.29% and STS score is 3.037 %, which indicates that they are at low surgical risk[l]. The patient is also younger than 75 years[l] old. Therefore, according to the guidelines, SAVR[l] is the recommended treatment option for this patient.”
HT treatment decision: SAVR
Interpretation:
  • The LLM correctly identified low surgical risk but then incorrectly rejected SAVR as a therapeutic option (logical inconsistency).
  • The LLM incorrectly cited factors from the prompt that favor TAVR (context inconsistency).
  • Although the LLM correctly selected SAVR the second time, it gave the opposite response to the exact same question (unreliability).

Model: DeepSeek-R1
Experiment: Raw+
Patient characteristics: 65-year-old male; EuroSCORE II: 0.92%; STS score: 0.73%; end-stage renal disease requiring hemodialysis; no relevant anatomical aspects to consider
LLM response: “… Based on the 2021 ESC/EACTS guidelines and the patient’s clinical profile, the recommended treatment would be TAVR[f]. Reasoning: Patient age (65 y) is below 75 but has significant comorbidities, including terminal renal insufficiency, indicating high surgical risk[f]. Suitable tricuspid aortic valve anatomy for TAVR. High surgical risk factors favor TAVR over SAVR.”
HT treatment decision: SAVR
Interpretation:
  • The LLM overestimated the effect of end-stage renal disease on surgical risk, which remained relatively low overall, as reflected in the low surgical risk scores (factual inconsistency).

[a] LLM: large language model.
[b] The LLMs’ treatment responses included well-informed decisions but also hallucinations ranging from obvious misinformation to absurd treatment recommendations and logical errors. We largely adhered to the taxonomy for the description of hallucinations established by Huang et al [21].
[c] LLM responses with subscripts indicate responses to the same question (obtained during 10 runs).
[d] HT: heart team.
[e] EuroSCORE: European System for Cardiac Operative Risk Evaluation.
[f] The italicized part indicates an incorrect or harmful response.
[g] SAVR: surgical aortic valve replacement.
[h] TAVR: transcatheter aortic valve replacement.
[i] STS: Society of Thoracic Surgeons.
[j] COPD: chronic obstructive pulmonary disease.
[k] GOLD: Global Initiative for Chronic Obstructive Lung Disease.
[l] The italicized part indicates a correct or useful response.
[m] CT: computed tomography.