Int J Surg. 2025 Jan 27;111(3):2727–2730. doi: 10.1097/JS9.0000000000002234

Multimodal large language models address clinical queries in laryngeal cancer surgery: a comparative evaluation of image interpretation across different models

Bingyu Liang a, Yifan Gao b,c, Taibao Wang a, Lei Zhang a, Qin Wang a,*
PMCID: PMC12372740  PMID: 39869389

Abstract

Background and objectives:

Recent advances in multimodal large language models (MLLMs) have shown promise in medical image interpretation, yet their utility in surgical contexts remains unexplored. This study evaluates six MLLMs’ performance in interpreting diverse imaging modalities for laryngeal cancer surgery.

Methods:

We analyzed 169 images (X-rays, CT scans, laryngoscopy, and pathology findings) from 50 patients using six state-of-the-art MLLMs. Model performance was assessed across 1084 clinically relevant questions by two independent physicians.

Results:

Claude 3.5 Sonnet achieved the highest accuracy (79.43%; 95% CI, 77.02%–81.84%). Performance varied significantly across imaging modalities and between commercial and open-source models, with a 19-percentage-point gap between the best commercial and open-source solutions.

Conclusion:

Advanced MLLMs show promising potential as clinical decision support tools in laryngeal cancer surgery, while performance variations suggest the need for specialized model development and clinical workflow integration. Future research should focus on developing specialized MLLMs trained on large-scale multi-center laryngeal cancer datasets.


HIGHLIGHTS

  • Six multimodal large language models (MLLMs) were evaluated across 6 image types, 169 images, and 1084 open-ended clinical questions in laryngeal cancer surgery.

  • Advanced MLLMs demonstrate high accuracy (up to 79.43%) in interpreting diverse image modalities, with commercial models outperforming open-source alternatives.

  • MLLMs show potential to enhance clinical decision-making across the surgical timeline of laryngeal cancer, from preoperative planning to post-operative care.

Introduction

Recent advances in multimodal large language models (MLLMs) represent a paradigm shift in artificial intelligence, distinguished by their ability to engage in natural conversation about images. Unlike traditional systems that typically output predefined classifications or measurements, these general-purpose models can interpret images through interactive dialogue. MLLMs have demonstrated significant potential in medical image understanding, exhibiting impressive performance in tasks such as answering clinical questions[1,2], disease classification[3,4], and report interpretation[5]. Recent studies[6] have also revealed their capacity to simulate stepwise clinical reasoning processes to some extent, indicating their applicability in diverse medical scenarios.

In surgical specialties, the ability to accurately interpret multiple image modalities is crucial for treatment planning and execution. Laryngeal cancer is one of the most prevalent types of head and neck cancers[7], where precise interpretation of diverse image findings directly influences surgical approach and outcomes[8]. The comprehensive assessment of laryngeal cancer relies on complementary image modalities: laryngoscopy for primary lesion visualization, CT imaging for evaluating tumor extension and nodal status, radiography for detecting esophageal involvement, and pathological examination for definitive histological diagnosis.

While artificial intelligence has broadly demonstrated promise in enhancing diagnostic accuracy[9,10], the potential utility of MLLMs in complex surgical contexts, specifically laryngeal cancer management, remains unexplored. This study aims to bridge this knowledge gap by evaluating the performance of six leading MLLMs in interpreting multiple image modalities and pathology findings crucial for laryngeal cancer surgical management. We hypothesize that MLLMs can interpret diverse image modalities and pathological data to provide accurate and clinically relevant insights for laryngeal cancer surgical management, potentially enhancing decision-making throughout the surgical process.

Methods

To test our hypothesis, we collected a private dataset of multimodal images related to laryngeal cancer from 50 patients during routine clinical visits. The dataset comprised 169 images of various types, including X-rays, CT scans, laryngoscopy images, and pre-, intra-, and post-operative pathology images. Images with resolution below 128 × 128 pixels were excluded to ensure reliable model performance. Not all patients had images for every modality. To facilitate analysis of sequential imaging data, such as CT and X-ray scans, we merged multiple slices into a single composite image, preserving their sequential order. Examples of the images are illustrated in Figure 1. The image data were sourced from an ethically approved database (approval number 5101293), and this secondary analysis of fully anonymized data was granted exemption from additional ethical approval and informed consent requirements by the institutional ethics committee.
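The slice-filtering and merging step described above can be sketched as follows. This is a minimal illustration using plain Python lists as grayscale images; the paper does not specify the preprocessing tooling, so the function names and grid layout here are assumptions.

```python
import math

MIN_RES = 128  # exclusion threshold from the Methods section


def keep_slice(img):
    """Keep only slices meeting the minimum resolution (height and width >= 128 px)."""
    h = len(img)
    w = len(img[0]) if h else 0
    return h >= MIN_RES and w >= MIN_RES


def merge_slices(slices, cols=None):
    """Tile equally sized slices left-to-right, top-to-bottom into one composite
    image, preserving their sequential (scan) order. Empty cells are zero-padded."""
    n = len(slices)
    cols = cols or math.ceil(math.sqrt(n))  # near-square grid by default
    rows = math.ceil(n / cols)
    h, w = len(slices[0]), len(slices[0][0])
    blank = [[0] * w for _ in range(h)]
    grid = slices + [blank] * (rows * cols - n)
    merged = []
    for r in range(rows):
        row_slices = grid[r * cols:(r + 1) * cols]
        for y in range(h):
            merged.append([px for s in row_slices for px in s[y]])
    return merged
```

In practice the same tiling would be done on pixel arrays with an imaging library, but the key property is the one shown: slice order is preserved row by row, so the model sees the scan sequence in reading order.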

Figure 1. Examples of different image types used in laryngeal cancer assessment. (A) X-ray, (B) CT scan, (C) laryngoscopy image, (D) pre-operative, (E) intra-operative, and (F) post-operative pathology images.

We assessed both commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) and open-source (LLaVA-Med, InternVL, HuatuoGPT-Vision) models on their ability to accurately answer clinical queries based on the selected imaging and pathological data.

The overall study design is illustrated in Figure 2. For each image type, we formulated a set of clinically relevant questions, totaling 1084 questions across all images, that reflected the key considerations in laryngeal cancer surgical planning and management. The complete list of questions is provided in Supplementary Table 1 (http://links.lww.com/JS9/D795).

Figure 2. Illustration of textual and visual inputs to multimodal large language models (MLLMs). We input these questions along with their corresponding images into each of the six MLLMs and collected their responses.

The accuracy of responses was validated against clinical case reports, with two independent physicians reviewing for factual correctness and clinical appropriateness. In cases where there was a discrepancy between the two evaluators, a senior physician was consulted to make the final determination. To maintain consistency and minimize bias, the evaluation process was conducted blindly, with evaluators unaware of which MLLM generated each response. We calculated the percentage of correct responses and 95% confidence intervals (CI) for each model across all image types.
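The accuracy and confidence-interval computation can be reproduced with a normal-approximation (Wald) interval for a proportion, which matches the intervals reported in the Results. The paper does not state which interval method was used, and the raw correct-response count is inferred from the reported percentage, so both are assumptions.

```python
import math


def accuracy_ci(correct, total, z=1.96):
    """Proportion of correct responses with a normal-approximation (Wald) 95% CI."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of a proportion
    return p, p - z * se, p + z * se


# Example: assuming 861 of 1084 questions correct, consistent with the
# reported 79.43% (95% CI, 77.02%-81.84%) for the best-performing model.
p, lo, hi = accuracy_ci(861, 1084)
```

For proportions near 0 or 1, or for small per-modality subgroups, a Wilson interval would be more robust than the Wald approximation shown here.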

Results

Figure 3 illustrates the performance of the six state-of-the-art MLLMs in interpreting diverse imaging and pathological data relevant to laryngeal cancer surgery. Claude 3.5 Sonnet exhibited the highest overall accuracy at 79.43% (95% CI, 77.02%–81.84%), closely followed by GPT-4o at 76.85% (95% CI, 74.33%–79.36%). Gemini 1.5 Pro showed moderate performance with an accuracy of 67.53% (95% CI, 64.74%–70.32%). Among the open-source models, HuatuoGPT-Vision and InternVL demonstrated similar capabilities, achieving accuracy rates of 60.52% (95% CI, 57.60%–63.43%) and 58.39% (95% CI, 55.46%–61.33%), respectively. LLaVA-Med had the lowest overall accuracy at 41.14% (95% CI, 38.21%–44.08%).

Figure 3. Performance of multimodal large language models (MLLMs) on different image types.

The performance gap between the best-performing commercial model (Claude 3.5 Sonnet) and the top open-source model (HuatuoGPT-Vision) was substantial, with a difference of approximately 19 percentage points. Subgroup analysis revealed differential performance across image modalities. Claude 3.5 Sonnet achieved an impressive 96.80% accuracy (95% CI, 94.60%–99.00%) on CT scans, while its performance on X-rays was 59.00% (95% CI, 52.12%–65.88%).

Discussion

This study provides a comprehensive evaluation of the performance of six state-of-the-art MLLMs in interpreting diverse imaging and pathological data relevant to laryngeal cancer surgery. Our findings reveal significant variability in the capabilities of these models, with important implications for their potential integration into clinical practice.

The superior performance of commercial models, particularly Claude 3.5 Sonnet and GPT-4o, with overall accuracies exceeding 75%, demonstrates the potential of advanced MLLMs to support clinical decision-making in laryngeal cancer surgery. The observed variations in model performance across different image modalities provide valuable insights into the strengths and limitations of current MLLMs. For instance, while performance on CT scans was consistently high across models, there was greater variability in the interpretation of laryngoscopy images and X-rays. This suggests that further refinement may be needed to enhance MLLM performance on these specific modalities.

The analysis of pathology interpretations across pre-, intra-, and post-operative stages reveals interesting patterns. The relatively consistent performance of top models across these stages is encouraging, suggesting potential applicability throughout the surgical timeline. However, the slight variations in accuracy across these stages warrant further investigation to ensure reliable support at all points of care.

While the results are promising, we acknowledge several limitations. Our evaluation protocol used predefined questions that may not fully capture the complexity of real-world clinical scenarios. Moreover, our single-institution dataset may not represent the diverse patient demographics seen in clinical practice, and model performance could vary across patient populations and anatomical variations. Importantly, these tools are designed to support, not replace, the clinical judgment of healthcare professionals; clinical expertise remains paramount in medical decision-making.

Conclusion

We conducted a comprehensive evaluation of six state-of-the-art MLLMs in interpreting diverse imaging and pathological data relevant to laryngeal cancer surgery. Our study demonstrates that advanced MLLMs, particularly commercial models, show promising performance in accurately answering clinical queries related to laryngeal cancer management. These models could serve as rapid consultation tools during clinical decision-making, providing additional diagnostic perspectives in complex cases. Future research should focus on developing specialized MLLMs trained on large-scale multi-center laryngeal cancer image datasets and validating their integration into clinical workflows to enhance reliability and practical utility in patient care.

Footnotes

Bingyu Liang and Yifan Gao contributed equally to this work as co-first authors.

Supplemental Digital Content is available for this article. Direct URL citations are provided in the HTML and PDF versions of this article on the journal’s website, www.lww.com/international-journal-of-surgery.

Published online 27 January 2025

Contributor Information

Bingyu Liang, Email: ahmuliangbingyu@163.com.

Yifan Gao, Email: yifangao@mail.ustc.edu.cn.

Taibao Wang, Email: wangtaibao0201@163.com.

Lei Zhang, Email: 2542005693@qq.com.

Qin Wang, Email: wangqin@ahmu.edu.cn.

Ethical approval

This study utilized fully de-identified medical imaging data collected from routine clinical visits, without involving human subjects or interventions. As our research focused solely on evaluating artificial intelligence models using retrospective, anonymized data, it did not require formal ethical approval as per institutional guidelines. All data handling procedures complied with relevant data protection regulations and institutional policies for secondary use of clinical data in research. The study was conducted in accordance with the principles of the Declaration of Helsinki.

Consent

Not applicable.

Sources of funding

This work was supported in part by the Foundation of Anhui Provincial Department of Education (Grant No. 2022AH051156) and in part by the 2023 Applied Medical Research Project of Hefei Municipal Health Commission (Grant No. Hwk2023zc002).

Author’s contribution

B.L.: Study concept and design, data collection, data analysis and interpretation, writing the original draft, and revision of the manuscript. Y.G.: Study design, data analysis and interpretation, development of computational methods, and critical revision of the manuscript for important intellectual content. T.W.: Data collection, clinical expertise in laryngeal cancer imaging interpretation, and validation of clinical relevance of the questions and results. L.Z.: Data collection, assistance with image preprocessing, and contribution to the methodology section of the manuscript. Q.W.: Study supervision, conceptualization, funding acquisition, project administration, and final approval of the manuscript. All authors have read and agreed to the published version of the manuscript.

Conflicts of interest disclosure

The authors declare no potential conflicts of interest.

Research registration unique identifying number (UIN)

Not applicable.

Guarantor

Qin Wang.

Provenance and peer review

Not applicable.

Data availability statement

The data are available upon reasonable request.

References

  • [1].Han T, Adams LC, Bressem KK, Busch F, Nebelung S, Truhn D. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA 2024 Apr 16;331:1320. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Suh PS, Shim WH, Suh CH, et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from Diagnosis Please cases. Radiology 2024 Jul 1;312:e240273. [DOI] [PubMed] [Google Scholar]
  • [3].Mihalache A, Huang RS, Popovic MM, et al. Accuracy of an artificial intelligence chatbot’s interpretation of clinical ophthalmic images. JAMA Ophthalmol 2024 Apr 1;142:321. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Zhou J, He X, Sun L, et al. Pre-trained multimodal large language model enhances dermatological diagnosis using skinGPT-4. Nat Commun 2024 Jul 5;15:5649. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Steimetz E, Minkowitz J, Gabutan EC, et al. Use of artificial intelligence chatbots in interpretation of pathology reports. JAMA Netw Open 2024 May 22;7:e2412767. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Lu MY, Chen B, Williamson DFK, et al. A multimodal generative AI copilot for human pathology. Nature 2024;634:466–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Huang J, Chan SC, Ko S, et al. Updated disease distributions, risk factors, and trends of laryngeal cancer: a global analysis of cancer registries. Int J Surg 2024;110:810. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Steuer CE, El-Deiry M, Parks JR, Higgins KA, Saba NF. An update on larynx cancer. CA Cancer J Clin 2017;67:31–50. [DOI] [PubMed] [Google Scholar]
  • [9].Barata C, Rotemberg V, Codella NCF, et al. A reinforcement learning model for AI-based decision support in skin cancer. Nat Med 2023;29:1941–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Zhu L, Mou W, Lai Y, et al. Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)’s ability to interpret radiological images. Int J Surg 2024;110:4096–102. [DOI] [PMC free article] [PubMed] [Google Scholar]



Articles from International Journal of Surgery (London, England) are provided here courtesy of Wolters Kluwer Health
