European Heart Journal. Digital Health. 2026 Jan 12;7(Suppl 1):ztaf143.011. doi: 10.1093/ehjdh/ztaf143.011

Large language model performance in clinical cardiology multiple-choice questions: has reasoning improved performance?

R Crichton 1, B Liu 2, S Hothi 3
PMCID: PMC12794861

Abstract

Introduction

Large language models (LLMs) have garnered significant attention in applications throughout medicine; however, many have struggled with more nuanced clinical challenges. Reasoning models such as GPT-o1 and DeepSeek R1 leverage reinforcement learning and chain-of-thought methodologies, which potentially improve performance in complex cognitive tasks. Benchmarking these models is challenging, however: whilst the underlying reasoning is often displayed, the reasoning process used to arrive at an answer is not always clear.

Purpose

This study aims to compare the ability of GPT-4o, GPT-4.5, GPT-o1, DeepSeek, and DeepSeek R1 to respond accurately to cardiology multiple-choice questions (MCQs) from a commonly used UK cardiology textbook.

Methods

This was a cross-sectional, in-silico benchmarking study. A question corpus of 236 text-only questions and 26 image-based questions was drawn from a popular UK-based, board-level cardiology textbook. Video questions were excluded, and image-based questions were excluded for the DeepSeek models owing to their lack of image-interpretation capabilities. Each model was presented with identical zero-shot prompts.
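The abstract does not specify the evaluation tooling. For illustration only, a minimal sketch of such a zero-shot MCQ loop, assuming the OpenAI Python client; the Question record, prompt wording, and parsing are hypothetical:

```python
# Illustrative sketch only: the study's actual tooling is not described in the
# abstract. Assumes the OpenAI Python client; the Question record and the
# prompt wording are hypothetical.
from dataclasses import dataclass
from openai import OpenAI

@dataclass
class Question:
    stem: str
    options: dict[str, str]  # e.g. {"A": "...", ..., "E": "..."}
    answer: str              # textbook answer key, one of "A"-"E"

PROMPT = (
    "Answer the following cardiology multiple-choice question. "
    "Respond with a single letter (A-E) only.\n\n{stem}\n\n{options}"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, q: Question) -> str:
    """Send one identical zero-shot prompt and return the model's letter choice."""
    options = "\n".join(f"{key}. {text}" for key, text in sorted(q.options.items()))
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(stem=q.stem, options=options)}],
    )
    # Naive parse: take the first character of the reply as the chosen option.
    return resp.choices[0].message.content.strip()[:1].upper()

def accuracy(model: str, questions: list[Question]) -> float:
    """Score the model's responses against the textbook answer key."""
    return sum(ask(model, q) == q.answer for q in questions) / len(questions)
```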

Responses were scored against the textbook answers. Additionally, two UK consultant cardiologists reviewed questions answered incorrectly by the majority of models to examine performance deficits. To ensure consistency, each reviewed question was either retained (in line with accepted guidance and evidence) or removed from the evaluation. The revised Bloom's Taxonomy was applied to classify all questions.

Results

All models completed 236 text MCQs with answer options A-E. An additional 26 image MCQs were tested with the GPT models. Owing to evidence updated since the source's publication, 5 text questions and 1 image question were removed, leaving 231 text and 25 image questions. Text-only accuracy ranged from 77.5% (GPT-4o) to 82.3% (GPT-o1). Image-based MCQ accuracy ranged from 38% (GPT-4.5) to 53% (GPT-4o).

When stratified by Bloom's Taxonomy, performance across all models was significantly higher in knowledge recall (90.0%) than in understanding (68.6%), application (76.2%), analysis (78.9%), and evaluation (75.9%). Owing to the nature of MCQs, no questions related to creation.
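Pooled per-category figures like these reduce to a simple group-by over per-question results; a minimal sketch with toy data (the study's analysis code and data layout are not described in the abstract):

```python
# Illustrative sketch with toy data: the study's analysis code is not described.
import pandas as pd

# One row per (model, question) pair, with the reviewer-assigned Bloom level.
results = pd.DataFrame(
    [
        ("GPT-4o", "recall", True),
        ("GPT-4o", "understanding", False),
        ("GPT-o1", "recall", True),
        ("GPT-o1", "analysis", True),
        ("DeepSeek R1", "evaluation", False),
    ],
    columns=["model", "bloom_level", "correct"],
)

# Pooled accuracy per Bloom level across all models, as a percentage.
by_level = results.groupby("bloom_level")["correct"].mean().mul(100).round(1)
print(by_level)
```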

Narrative review by the two consultant cardiologists noted one episode of hallucination, one instance in which the reasoning and the final answer were unrelated, and poor performance on double-negative questions.

Conclusions

Modest performance gains were observed across GPT-4o, GPT-4.5, and GPT-o1, though not between DeepSeek and DeepSeek R1. The reasoning models GPT-o1 and DeepSeek R1 did not significantly outperform their non-reasoning counterparts.

Existing literature often benchmarks LLM performance against standardised examination question sets; however, this poses a significant challenge when models are iterative and dynamic. Possible solutions include the use of very large question corpora, physician-graded testing, and real-world comparator testing.



Articles from European Heart Journal. Digital Health are provided here courtesy of Oxford University Press on behalf of the European Society of Cardiology
