JAMA. 2024 Mar 18;331(15):1320–1321. doi: 10.1001/jama.2023.27861

Comparative Analysis of Multimodal Large Language Model Performance on Clinical Vignette Questions

Tianyu Han 1, Lisa C Adams 2, Keno K Bressem 3, Felix Busch 3, Sven Nebelung 1, Daniel Truhn 1
PMCID: PMC10949144  PMID: 38497956

Abstract

This study compares 2 large language models and their performance vs that of competing open-source models.


Large language models (LLMs) have attracted attention for their ability to process text and other complex content.1,2 In medicine, an LLM-based advisory chat agent could assist in research, education, and clinical care.3

One widely used model is GPT-4 (OpenAI), released in March 2023. Initially, GPT-4’s limitation to text-only input posed a challenge for medical applications, which often rely on visual data. As an extension, GPT-4V(ision), a multimodal version of GPT-4 that can process visual input in addition to text, was released in October 2023. In December 2023, Google announced the release of its own multimodal model, Gemini Pro, as a competitor to GPT-4V. Although GPT-4 and Gemini Pro are proprietary, several open-source models have been developed that do not require sending data to third parties. We evaluated the performance of GPT-4V and Gemini Pro on a series of multimodal clinical vignette quiz questions with images from medical journals and compared them with competing open-source models.

Methods

We used Clinical Challenges from JAMA and Image Challenges from the New England Journal of Medicine (NEJM) published between January 1, 2017, and August 31, 2023, to assess the diagnostic accuracy of GPT-4V, Gemini Pro, and 4 language-only models: GPT-4, GPT-3.5, and 2 open-source models, Llama 2 (Meta) and Med42. Med42 is a variant of Llama 2 adapted for medical applications (eTable in Supplement 1).

Both JAMA and NEJM cases contain a case description, a medical image, and the questions “What would you do next?” (JAMA) and “What is the diagnosis?” (NEJM), with 4 (JAMA) or 5 (NEJM) answer choices. JAMA cases with different questions were excluded. For NEJM challenge items, we extracted statistics on NEJM subscriber responses and stratified question difficulty by percentage of correct human responses into 4 equal intervals.
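
For illustration, the difficulty stratification described above could be implemented as in the following minimal sketch. This is not the authors' code; it assumes a pandas DataFrame with a hypothetical column pct_human_correct holding the percentage of NEJM subscribers who answered each question correctly, and it bins questions into the 4 equal-width intervals used in Figure 2.

```python
# Minimal sketch of the difficulty stratification (not the authors' code).
# Assumes one row per NEJM question and a hypothetical column
# "pct_human_correct" with the percentage of correct subscriber responses.
import pandas as pd

nejm = pd.DataFrame({
    "question_id": [1, 2, 3, 4],
    "pct_human_correct": [35.0, 48.2, 71.5, 92.3],  # hypothetical values
})

# Bin questions into the 4 equal-width difficulty intervals reported in Figure 2:
# 20%-40%, 40%-60%, 60%-80%, and 80%-100% of human readers answering correctly.
bins = [20, 40, 60, 80, 100]
labels = ["20-40%", "40-60%", "60-80%", "80-100%"]
nejm["difficulty"] = pd.cut(nejm["pct_human_correct"], bins=bins, labels=labels)
print(nejm)
```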

Case descriptions and answer choices were fed into the models, and the models were asked to select the correct answer. For GPT-4V and Gemini Pro, we also provided the accompanying images along with the case descriptions. Statistical analyses were conducted with Python 3.7.10; 95% CIs were calculated from binomial distributions, and LLM accuracy was compared with 2-tailed t tests, with significance set at P < .05.
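
A minimal sketch of this kind of analysis is shown below. It is not the authors' code: the per-question correctness arrays are simulated for illustration, the binomial 95% CI is computed with statsmodels' proportion_confint, and two models are compared with a 2-tailed t test on their 0/1 correctness vectors.

```python
# Minimal sketch of the analysis described above (not the authors' code).
# Per-question correctness is represented as 0/1 arrays; statsmodels provides
# the binomial 95% CI and scipy the two-tailed t test between two models.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)

# Hypothetical per-question results (1 = correct, 0 = incorrect) for two models
# on the same 140-question set; real data would come from the model outputs.
gpt4v_correct = rng.binomial(1, 0.733, size=140)
gemini_correct = rng.binomial(1, 0.557, size=140)

def accuracy_with_ci(correct, alpha=0.05):
    """Return accuracy and a binomial 95% CI for a 0/1 correctness array."""
    n = len(correct)
    k = int(correct.sum())
    low, high = proportion_confint(k, n, alpha=alpha, method="normal")
    return k / n, (low, high)

acc, (lo, hi) = accuracy_with_ci(gpt4v_correct)
print(f"GPT-4V accuracy: {acc:.1%} (95% CI, {lo:.1%}-{hi:.1%})")

# Two-tailed test on the per-question correctness of the two models,
# with significance declared at P < .05 as in the Methods.
t_stat, p_value = stats.ttest_ind(gpt4v_correct, gemini_correct)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```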

Results

On 140 JAMA questions, GPT-4V consistently achieved the highest accuracy, with 73.3% (95% CI, 66.3%-80.9%) vs 55.7% (95% CI, 47.5%-63.9%) for Gemini Pro, 63.6% (95% CI, 55.6%-71.5%) for GPT-4, 50.7% (95% CI, 42.4%-59.0%) for GPT-3.5, 53.6% (95% CI, 45.3%-61.8%) for Med42, and 41.4% (95% CI, 33.3%-49.6%) for Llama 2 (all pairwise comparisons P < .001). Results were similar for the 348 NEJM questions, with GPT-4V and Gemini Pro correctly answering 88.7% (95% CI, 85.5%-92.1%) and 68.7% (95% CI, 63.8%-73.6%), respectively (Figure 1). Although Med42 outperformed GPT-3.5 on the JAMA challenge, it was inferior to GPT-3.5 on the NEJM challenge (59.9% [95% CI, 54.6%-64.9%] vs 61.7% [95% CI, 56.7%-66.9%]; P < .001). For the NEJM challenges, human readers correctly answered 51.4% (95% CI, 46.2%-56.7%) of questions.

Figure 1. Performance of Large Language Models on New England Journal of Medicine (NEJM) and JAMA Vignette Questions.


Performance of the proprietary models GPT-4V, Gemini Pro, GPT-4, GPT-3.5, and the open-source models Llama 2 and Med42 in answering the questions on the NEJM and JAMA question sets. The center of each bar indicates the mean; error bars indicate 95% CIs.

When stratified by question difficulty, the NEJM results were similar for the highest 3 difficulty levels, but GPT-4V, Gemini Pro, GPT-4, and Med42 all had 100% accuracy for the lowest-difficulty questions (Figure 2).

Figure 2. Human and Large Language Model (LLM) Performance in New England Journal of Medicine (NEJM) Image Challenge.


For the NEJM Image Challenge, statistics on the performance of human readers were available. Question difficulty was stratified into 4 categories defined by the percentage of human readers answering correctly: 20% to 40%, 40% to 60%, 60% to 80%, and 80% to 100%. In the 80% to 100% interval, GPT-4V, Gemini Pro, GPT-4, and Med42 achieved 100% accuracy and therefore have no error bars. For all other bars, the center indicates the mean; error bars indicate 95% CIs.

Discussion

In both the JAMA Clinical Challenge and the NEJM Image Challenge databases, GPT-4V demonstrated significantly better accuracy than Gemini Pro, its unimodal predecessors GPT-4 and GPT-3.5, and the open-source models Llama 2 and Med42, confirming that GPT-4V can interpret medical images even without dedicated fine-tuning. Although the findings are promising, caution is warranted because these were curated vignettes from medical journals and do not fully represent the medical decision-making skills required in clinical practice.4 The integration of artificial intelligence models must consider their role in different clinical scenarios and the broader ethical implications.

This study has limitations. First, the training data of the proprietary models may have included cases used in this study. Second, the clinical challenges do not simulate clinical practice because physicians would not provide multiple choice answers to a model to obtain help with diagnosis. Third, the study does not contain all available multimodal LLMs.

Section Editors: Kristin Walter, MD, and Jody W. Zylke, MD, Deputy Editors; Karen Lasser, MD, Senior Editor.

Supplement 1.

eTable. Overview of the Selected LLMs

jama-e2327861-s001.pdf (256.2KB, pdf)
Supplement 2.

Data Sharing Statement

jama-e2327861-s002.pdf (11.1KB, pdf)

References

1. Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616(7956):259-265. doi:10.1038/s41586-023-05881-4
2. Truhn D, Reis-Filho JS, Kather JN. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat Med. 2023;29(12):2983-2984. doi:10.1038/s41591-023-02594-z
3. Haupt CE, Marks M. AI-generated medical advice—GPT and beyond. JAMA. 2023;329(16):1349-1350. doi:10.1001/jama.2023.5321
4. Harris E. Large language models answer medical questions accurately, but can’t match clinicians’ knowledge. JAMA. 2023;330(9):792-794. doi:10.1001/jama.2023.14311

