Assessing the Utility of Multimodal Large Language Models (GPT-4 Vision and Large Language and Vision Assistant) in Identifying Melanoma Across Different Skin Tones

Katrina Cirone, HBSc; Mohamed Akrout, BSCEN, MScAC; Latif Abid, BEng, HBA; Amanda Oakley, MBChB

doi:10.2196/55508

letter

2024 Mar 13;7:e55508. doi: 10.2196/55508

Editor: Robert Dellavalle

Reviewed by: Fuxiao Liu, Ery Ko, Gunnar Mattson, Anmol Sodhi

PMCID: PMC10973960 PMID: 38477960

Abstract

The multimodal large language models GPT-4 Vision and Large Language and Vision Assistant can differentiate between benign lesions and melanoma, although GPT-4 Vision was considerably more accurate, indicating potential for incorporation into dermatologic care, medical research, and education.

Keywords: melanoma, nevus, skin pigmentation, artificial intelligence, AI, multimodal large language models, large language model, large language models, LLM, LLMs, machine learning, expert systems, natural language processing, NLP, GPT, GPT-4V, dermatology, skin, lesion, lesions, cancer, oncology, visual

Introduction

Large language models (LLMs) are artificial intelligence (AI) tools trained on large quantities of human-generated text. They are adept at processing and synthesizing text and at mimicking human capabilities, to the point that their output can be nearly indistinguishable from human writing [1]. The versatility of LLMs in addressing various requests, coupled with their capabilities in handling complex concepts and engaging in real-time user interactions, indicates their potential for integration into health care and dermatology [1,2]. Within dermatology, studies have found that LLMs can retrieve, analyze, and summarize information to facilitate decision-making [3].

Multimodal LLMs with visual understanding, such as GPT-4 Vision (GPT-4V) [4] and Large Language and Vision Assistant (LLaVA) [5], can also analyze images, videos, and speech, a significant advance over language-only systems. Because they combine language and vision with inherent intelligence and reasoning, they can solve novel, intricate tasks that language-only systems cannot [4,5]. This study assesses the ability of publicly available multimodal LLMs to accurately recognize and differentiate between melanoma and benign melanocytic nevi across all skin tones.

Methods

Our data set comprised macroscopic images (900 × 1100 pixels; 96-dpi resolution) of melanomas (malignant) and melanocytic nevi (benign) obtained from the publicly available and validated MClass-D data set [6], DermNet NZ [7], and dermatology textbooks [8]. Each LLM was provided with 20 unique text-based prompts that were each tested on 3 images (n=60 unique image-prompt combinations). The prompts consisted of questions about “moles” (the term used for both benign and malignant lesions), instructions, and image-based prompts in which the image was annotated to alter the focus. Our prompts represented potential users, such as general physicians, providers in remote areas, or educational users and residents. The chat content was deleted before each submitted prompt to prevent repeated images from influencing responses, and testing was performed over a 1-hour timespan, which is insufficient for learning to take place. Prompts were designed either to involve conditioning on the ABCDE (asymmetry, border irregularity, color variation, diameter >6 mm, evolution) features of melanoma or to assess the effects of background skin color on predictions. Conditioning involved asking the LLM to differentiate between benign and malignant lesions where one feature (eg, symmetry, border irregularity, color, diameter) remained constant in both images, to determine whether the fixed element was involved in the overall reasoning. To assess the impact of color on melanoma recognition, the color distributions of nevi and melanomas were manipulated by decolorizing images or altering their colors.
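The color manipulations described above can be illustrated with a minimal sketch. The letter does not state which software was used or how strongly colors were altered, so everything here is an assumption made for illustration: the function names `decolorize` and `scale_brightness`, the use of standard BT.601 luma weights for the gray level, and the choice to alter every pixel rather than only the lesion.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel 0-255
Image = List[List[Pixel]]      # rows of pixels

def decolorize(image: Image) -> Image:
    """Replace each pixel with its gray level (BT.601 luma weights)."""
    result = []
    for row in image:
        new_row = []
        for r, g, b in row:
            gray = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((gray, gray, gray))
        result.append(new_row)
    return result

def scale_brightness(image: Image, factor: float) -> Image:
    """Darken (factor < 1) or lighten (factor > 1) every pixel.

    Hypothetical helper: the study altered lesion pigment, whereas this
    sketch scales the whole image and clamps channels to 0-255.
    """
    def clamp(value: float) -> int:
        return max(0, min(255, round(value)))

    return [
        [(clamp(r * factor), clamp(g * factor), clamp(b * factor))
         for r, g, b in row]
        for row in image
    ]

# Toy 1 x 2 "image" standing in for a lesion photograph.
toy = [[(200, 100, 50), (10, 20, 30)]]
gray_toy = decolorize(toy)
dark_toy = scale_brightness(toy, 0.5)
```

In practice an image library would replace the nested lists, but the per-pixel arithmetic is the same: decolorizing removes hue while keeping luminance, and brightness scaling shifts how dark the pigment appears.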

Results

Analysis revealed that GPT-4V outperformed LLaVA in all examined areas, with an overall accuracy of 85% compared to 45% for LLaVA, and that it consistently provided thorough descriptions of the relevant ABCDE features of melanoma (Table 1 and Multimedia Appendix 1). While both LLMs were able to identify melanoma in lighter skin tones and recognized that dermatologists should be consulted for diagnostic confirmation, LLaVA was unable to confidently recognize melanoma in skin of color or to comment on suspicious features, such as ulceration and bleeding.
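As a rough worked example of these figures, the sketch below relates the reported percentages to the study design. It assumes that overall accuracy was computed over the 60 image-prompt combinations described in the Methods. The letter does not state the denominator, so the implied counts are an inference rather than reported results, and the function names are hypothetical.

```python
PROMPTS = 20            # unique text-based prompts (from the Methods)
IMAGES_PER_PROMPT = 3   # images tested per prompt (from the Methods)

# 20 prompts x 3 images = 60 image-prompt combinations per model.
n_combinations = PROMPTS * IMAGES_PER_PROMPT

def accuracy(correct: int, total: int) -> float:
    """Fraction of responses judged correct."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total

def implied_correct(accuracy_pct: float, total: int) -> int:
    """Number of correct responses implied by a percentage.

    Assumption: the percentage was computed over `total` responses.
    """
    return round(accuracy_pct / 100 * total)

gpt4v_correct = implied_correct(85, n_combinations)   # GPT-4V, reported 85%
llava_correct = implied_correct(45, n_combinations)   # LLaVA, reported 45%
```

Under that assumption, both reported percentages correspond to whole numbers of responses, which is consistent with a denominator of 60 but does not confirm it.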

Table 1. Performance of Large Language and Vision Assistant (LLaVA) and GPT-4 Vision (GPT-4V) for melanoma recognition.

Feature		LLaVA	GPT-4V
Melanoma detection		Melanoma identified—referenced shape and color	Melanoma identified—referenced the other ABCDEs^a of melanoma
Feature conditioning
	Asymmetry	Melanoma identified—referenced size and color	Melanoma identified—referenced the other ABCDEs of melanoma
	Border irregularity	Melanoma identified—referenced size and color	Melanoma identified—referenced the other ABCDEs of melanoma
	Color	Melanoma identified—incorrectly commented on color distribution	Melanoma identified—referenced the other ABCDEs of melanoma
	Diameter	Melanoma missed—confused by the darker color	Melanoma identified—referenced the other ABCDEs of melanoma
	Color + diameter	Melanoma missed—confused by the darker color and morphology	Melanoma identified—referenced morphology, complexity, color, and border
	Evolution	Melanoma identified—referenced size and color	Melanoma identified—referenced the other ABCDEs of melanoma
Color bias
	Benign—darkened pigment	Darkened lesion classified as melanoma, became confused about other melanoma features	Darkened lesion classified as melanoma, became confused about other melanoma features
	Melanoma—darkened pigment	Darkened lesion classified as melanoma, became confused about the other ABCDEs of melanoma	Darkened lesion classified as melanoma, became confused about the other ABCDEs of melanoma
	Melanoma—lightened pigment	Unable to recognize malignancy and to identify that the image had been altered	Melanoma identified—referenced the other ABCDEs of melanoma and recognized that the altered image had been lightened
Skin of color
	Melanoma detection	Diagnostic uncertainty—unsure of lesion severity and diagnosis	Melanoma identified—referenced the other ABCDEs of melanoma
	Suspicious features	Did not identify suspicious features	Identified suspicious features and recommended medical evaluation—ulceration, bleeding, and skin distortion
Image manipulation
	Visual referring	Tricked into thinking the annotations indicated sunburned skin	Correctly identified that the annotations were artificially added and could be used to monitor skin lesion evolution or to communicate concerns between providers
	Rotation	Tricked into thinking an altered image orientation constituted a novel image	Correctly indicated it could not differentiate between the 2 images and accurately referenced the ABCDEs of melanoma

^aABCDE: asymmetry, border irregularity, color variation, diameter >6 mm, evolution.

Discussion

In the prompts analyzing feature conditioning, GPT-4V correctly identified the melanoma throughout, whereas LLaVA did not when color, diameter, or both were held constant (Figure 1). This suggests that color and diameter strongly influence melanoma detection in LLaVA, with less importance placed on symmetry and border. Both LLMs were susceptible to color bias: when the pigment of a lesion was darkened with all other features held constant, both judged the lesion to be malignant. Conversely, when the pigment was lightened, GPT-4V appropriately recognized the alteration, while LLaVA did not. Finally, image manipulation did not impair GPT-4V’s diagnostic abilities, whereas LLaVA was unable to detect these manipulations and was vulnerable to visual referring, in which annotations added to the image alter its focus. Red lines added around the edges of a nevus were interpreted by LLaVA as sunburned skin, while GPT-4V correctly recognized these annotations as useful for monitoring lesion evolution or communicating specific concerns between health care providers.

Figure 1. Melanoma detection when conditioned on color and diameter. GPT-4V: GPT-4 Vision; LLaVA: Large Language and Vision Assistant.
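The color-bias comparison can be made concrete with a short sketch that encodes the three "Color bias" rows of Table 1 and flags the cases in which a model's call disagreed with the true lesion type after the pigment was altered. The `Trial` record and the `color_errors` function are hypothetical names introduced for illustration. The reduction of each response to a yes/no melanoma call is also an assumption, with "unable to recognize malignancy" coded as not melanoma.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Trial:
    model: str
    is_melanoma: bool         # true lesion type
    alteration: str           # "darkened" or "lightened" pigment
    predicted_melanoma: bool  # model's call, simplified to yes/no

# Outcomes transcribed from the "Color bias" rows of Table 1.
TRIALS = [
    Trial("LLaVA", False, "darkened", True),
    Trial("GPT-4V", False, "darkened", True),
    Trial("LLaVA", True, "darkened", True),
    Trial("GPT-4V", True, "darkened", True),
    Trial("LLaVA", True, "lightened", False),
    Trial("GPT-4V", True, "lightened", True),
]

def color_errors(trials: List[Trial]) -> Dict[str, List[str]]:
    """Group, by model, the altered images that were misclassified."""
    errors: Dict[str, List[str]] = {}
    for t in trials:
        if t.predicted_melanoma != t.is_melanoma:
            lesion = "melanoma" if t.is_melanoma else "benign"
            errors.setdefault(t.model, []).append(f"{t.alteration} {lesion}")
    return errors

errors_by_model = color_errors(TRIALS)
```

Read this way, both models misclassified the darkened benign lesion, and only LLaVA also missed the lightened melanoma.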

Although limitations remain, GPT-4V can accurately differentiate between benign lesions and melanoma. Additional training of these LLMs on specific conditions may improve their overall performance. Despite these findings, it is critical to account for and address limitations such as the reproduction of existing biases, hallucinations, and vulnerability to visual prompt injection, and to incorporate validation checks before clinical uptake [9]. Recently, the integration of technology within medicine has accelerated, and AI has been used in dermatology to augment the diagnostic process and improve clinical decision-making [10]. There is an urgent global need to address the high volume of skin conditions that pose health concerns, and the integration of multimodal LLMs, such as GPT-4V, into health care has the potential to deliver material increases in efficiency and to improve education and patient care.

Abbreviations

ABCDE: asymmetry, border irregularity, color variation, diameter >6 mm, evolution
AI: artificial intelligence
GPT-4V: GPT-4 Vision
LLaVA: Large Language and Vision Assistant
LLM: large language model

Multimedia Appendix 1

The 20 unique text-based prompts provided to GPT-4 Vision and Large Language and Vision Assistant and the responses of both large language models depicted side by side.

derma_v7i1e55508_app1.docx (5.4 MB, docx)

Footnotes

Conflicts of Interest: None declared.

References

1. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt J, Laleh NG, Löffler CML, Schwarzkopf S, Unger M, Veldhuizen GP, Wagner SJ, Kather JN. The future landscape of large language models in medicine. Commun Med (Lond). 2023 Oct 10;3(1):141. doi: 10.1038/s43856-023-00370-1.
2. Shah NH, Entwistle D, Pfeffer MA. Creation and adoption of large language models in medicine. JAMA. 2023 Sep 05;330(9):866–869. doi: 10.1001/jama.2023.14217.
3. Sathe A, Seth I, Bulloch G, Xie Y, Hunter-Smith DJ, Rozen WM. The role of artificial intelligence language models in dermatology: opportunities, limitations and ethical considerations. Australas J Dermatol. 2023 Nov;64(4):548–552. doi: 10.1111/ajd.14133.
4. GPT-4V(ision) system card. OpenAI. [2024-04-05]. https://openai.com/research/gpt-4v-system-card
5. Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. arXiv. Preprint published online December 11, 2023. doi: 10.48550/arXiv.2304.08485.
6. Brinker TJ, Hekler A, Hauschild A, Berking C, Schilling B, Enk AH, Haferkamp S, Karoglan A, von Kalle C, Weichenthal M, Sattler E, Schadendorf D, Gaiser MR, Klode J, Utikal JS. Comparing artificial intelligence algorithms to 157 German dermatologists: the melanoma classification benchmark. Eur J Cancer. 2019 Apr;111:30–37. doi: 10.1016/j.ejca.2018.12.016. https://linkinghub.elsevier.com/retrieve/pii/S0959-8049(18)31562-4
7. Melanoma in situ images. DermNet. [2024-05-04]. https://dermnetnz.org/images/melanoma-in-situ-images
8. Donkor CA. Atlas of Dermatological Conditions in Populations of African Ancestry. Cham, Switzerland: Springer; 2021. Malignancies.
9. Guan T, Liu F, Wu X, Xian R, Li Z, Liu X, Wang X, Chen L, Huang F, Yacoob Y, Manocha D, Zhou T. HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. arXiv. Preprint published online October 23, 2023. doi: 10.48550/arXiv.2310.14566.
10. Haggenmüller S, Maron RC, Hekler A, Utikal JS, Barata C, Barnhill RL, Beltraminelli H, Berking C, Betz-Stablein B, Blum A, Braun SA, Carr R, Combalia M, Fernandez-Figueras M, Ferrara G, Fraitag S, French LE, Gellrich FF, Ghoreschi K, Goebeler M, Guitera P, Haenssle HA, Haferkamp S, Heinzerling L, Heppt MV, Hilke FJ, Hobelsberger S, Krahl D, Kutzner H, Lallas A, Liopyris K, Llamas-Velasco M, Malvehy J, Meier F, Müller CSL, Navarini AA, Navarrete-Dechent C, Perasole A, Poch G, Podlipnik S, Requena L, Rotemberg VM, Saggini A, Sangueza OP, Santonja C, Schadendorf D, Schilling B, Schlaak M, Schlager JG, Sergon M, Sondermann W, Soyer HP, Starz H, Stolz W, Vale E, Weyers W, Zink A, Krieghoff-Henning E, Kather JN, von Kalle C, Lipka DB, Fröhling S, Hauschild A, Kittler H, Brinker TJ. Skin cancer classification via convolutional neural networks: systematic review of studies involving human experts. Eur J Cancer. 2021 Oct;156:202–216. doi: 10.1016/j.ejca.2021.06.049. https://linkinghub.elsevier.com/retrieve/pii/S0959-8049(21)00444-5
