Skip to main content
JMIR Dermatology logoLink to JMIR Dermatology
letter
. 2025 Mar 21;8:e67551. doi: 10.2196/67551

Evaluating the Diagnostic Accuracy of ChatGPT-4 Omni and ChatGPT-4 Turbo in Identifying Melanoma: Comparative Study

Samantha S Sattler 1, Nitin Chetla 2, Matthew Chen 3, Tamer Rajai Hage 4,, Joseph Chang 5, William Young Guo 1, Jeremy Hugh 1
Editor: John Meisenheimer
PMCID: PMC11952272  PMID: 40117499

Abstract

ChatGPT is increasingly used in healthcare. Fields like dermatology and radiology could benefit from ChatGPT’s ability to help clinicians diagnose skin lesions. This study evaluates the accuracy of ChatGPT in diagnosing melanoma. Our analysis indicates that ChatGPT cannot be used reliably to diagnose melanoma, and further improvements are needed to reach this capability.

Introduction

Artificial Intelligence (AI) is being increasingly integrated into health care [1]. Multiple AI systems exist in medicine, including large language models (LLMs), neural networks, and predictive models. While studies have demonstrated AI’s mixed precision and accuracy, the technology is poised to assist with data-driven diagnostics in dermatology [2].

There has a been rapid popularization of the LLM, ChatGPT for home-based medical inquiries [3]. Minimal research exists on ChatGPT’s accuracy in detecting melanoma. Given that patients are increasingly presenting internet-derived diagnostics during cancer consultations, it is imperative to understand the capabilities of commonly used AI engines, such as ChatGPT [4]. In this study, we compare the capabilities of two models—ChatGPT-4 Omni (GPT-4o) and ChatGPT-4 Turbo (GPT-4 Turbo)—in identifying melanoma versus “not melanoma” skin lesions. These LLMs were chosen due to their accessibility and ability to answer image-based dermatology board-style questions correctly [5].

Methods

OpenAI was used to query GPT-4o and GPT-4 Turbo for classifying dermatoscopic images of melanoma versus “not melanoma” (ie, melanocytic nevi, basal cell carcinoma, actinic keratoses, dermatofibromas, and vascular lesions) selected from the HAM10K database, which contains >10,000 dermatoscopic images collected over 20 years from multiple populations, and verified by histopathology or confocal microscopy [6].

Five-hundred melanoma and “not melanoma” diagnoses were randomly selected with no image modifications. A standardized prompt was used: “This is an image of the step 1 examination. The multiple-choice question is as follows: Based on the image, does the patient have (A) melanoma (B) no melanoma? Only output the answer as A or B.” Incomplete responses were categorized as “not a number” and excluded.

To assess the effect of binary versus nonbinary prompting, an additional 1000 randomly selected “not melanoma” dermatoscopic images were classified by GPT-4o, given its higher sensitivity compared to GPT-4 Turbo. Manual classification was applied for “not a number” results when the response leaned towards “melanoma” or “not melanoma” but did not explicitly state “A” or “B.”

Results

The diagnostic accuracies of GPT-4 Turbo and GPT-4o were 0.546 (95% CI 0.515‐0.577) and 0.577 (95% CI 0.547‐0.608), respectively. There was no significant difference in accuracy between the two models (P=.10). GPT-4 Turbo demonstrated a sensitivity of 76.3%, specificity of 32.9%, and false-positive rate of 67.1% (Table 1). GPT-4o yielded a higher sensitivity of 96.8% (P<.001), lower specificity of 18.4% (P=.09), and higher false-positive rate of 81.6% (P<.001).

Table 1. GPT-4 Omni and GPT-4 Turbo demonstrate low accuracy and low specificity for melanoma diagnosis.

Statistical measure Chat-GPT 4 Turbo Chat-GPT 4 Omni
Accuracy, (95% CI) 0.546 (0.515‐0.577) 0.577 (0.547‐0.608)
Precision 0.532 0.544
Specificity, % (95% CI) 32.9 (0.288‐0.370) 18.4 (0.150‐0.218)
Sensitivity, % (95% CI) 76.3 (0.726‐0.801) 96.8 (0.952‐0.983)
F1-score 0.627 0.697
False-positive rate (%) 67.1 81.6

GPT-4o’s additional analysis of “not melanoma” images using nonbinary prompting yielded an accuracy of 6.56% (95% CI 4.94%‐8.18%), correctly classifying 59 of 899 images (Table 2). Binary prompting increased GPT-4o accuracy to 25.25% (95% CI 22.55%‐27.95%), with 252 of 998 images correctly identified as “not melanoma.” The confusion matrices associated with the statistical measures of GPT-4o and GPT-4 Turbo are shown in Multimedia Appendix 1.

Table 2. Accuracy of ChatGPT-4o in diagnosing melanoma and “not melanoma” with binary versus nonbinary prompting.

Statistical measure Nonbinary prompting (n=899) Binary prompting (n=998)
Accuracy, n (%) 59 (6.56) 252 (25.25)
95% CI (%) 4.94‐8.18 22.55‐27.95
False-positive rate (%) 81.6 67.1

Discussion

Currently, GPT engines demonstrate low accuracy for diagnosing melanoma. Higher diagnostic accuracies have been achieved using neural networks such as Moleanalyzer pro (87.7%) and ChatGPT Vision (85%); however, these studies included much smaller sample sizes of 100 and 60 images, respectively [7,8]. Our findings exhibit a higher-powered analysis of ChatGPT performance.

GPT-4o’s improved accuracy with binary versus nonbinary prompting aligns with prior AI research demonstrating that these search engines struggle without explicit direction [8]. When more intricate prompts are provided, results improve [7,8]. However, such a methodology is not generalizable to the average user. Patients using these engines to self-diagnose suspicious lesions at home are more likely to use nonbinary prompts without detailed instructions for the AI engine. Thus, our nonbinary prompting results reflect that ChatGPT would provide inaccurate outputs when used by the average patient.

The high false-positive rates of GPT-4o and GPT-4 Turbo in evaluating “not melanoma” suggest a conservative bias. This raises ethical concerns, as undue patient harm may result from AI’s overdiagnosis of “melanoma.” Patients receiving incorrect “melanoma” diagnoses from ChatGPT prior to their dermatology appointments may develop mistrust if the physician accurately contradicts AI diagnoses. These patients may feel unheard if they do not receive biopsies for their “suspicious” moles. Increased in-office counseling may be warranted to disentangle the biases AI imparts to patients.

Limitations included using a single dataset and dermatoscopic images without broader clinical information. The models were not specifically trained before querying. ChatGPT is a generative AI that may be less suitable than specialized AI systems in dermatoscopic image diagnoses [2]. Nevertheless, inherent flaws in the GPT4-o and GPT-4 Turbo systems are still evident. Therefore, patients should avoid ChatGPT diagnoses before evaluation of their suspected pigemented lesions by trained dermatologists.

Supplementary material

Multimedia Appendix 1. Confusion matrix of ChatGPT-4 Omni performance (top) and confusion matrix of ChatGPT-4 Turbo performance (bottom).
derma-v8-e67551-s001.png (55.1KB, png)
DOI: 10.2196/67551

Abbreviations

AI

artificial intelligence

GPT-4 Turbo

ChatGPT-4 Turbo

GPT-4o

ChatGPT-4 Omni

LLM

large language model

Footnotes

Conflicts of Interest: None declared.

References

  • 1.Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature New Biol. 2017 Feb 2;542(7639):115–118. doi: 10.1038/nature21056. doi. Medline. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Tejeda H, Kumar A, Smyth P, Steyvers M. AI-assisted decision-making: a cognitive modeling approach to infer latent reliance strategies. Comput Brain Behav. 2022 Dec;5(4):491–508. doi: 10.1007/s42113-022-00157-y. doi. [DOI] [Google Scholar]
  • 3.Chow JCL, Wong V, Li K. Generative pre-trained transformer-empowered healthcare conversations: current trends, challenges, and future directions in large language model-enabled medical chatbots. BioMedInformatics. 2024;4(1):837–852. doi: 10.3390/biomedinformatics4010047. doi. [DOI] [Google Scholar]
  • 4.Xu L, Sanders L, Li K, Chow JCL. Chatbot for health care and oncology applications using artificial intelligence and machine learning: systematic review. JMIR Cancer. 2021 Nov 29;7(4):e27850. doi: 10.2196/27850. doi. Medline. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Smith L, Hanna R, Hatch L, Hanna K. Computer vision meets large language models: performance of ChatGPT 4.0 on dermatology boards-style practice questions. SKIN J Cutan Med. 2024;8(5):1815–1821. doi: 10.25251/skin.8.5.5. doi. [DOI] [Google Scholar]
  • 6.Alexander Scarlat Melanoma: augmented dermoscopic pigmented skin lesions from HAM10k. Kaggle. [29-11-2024]. https://www.kaggle.com/datasets/drscarlat/melanoma URL. Accessed.
  • 7.Winkler JK, Blum A, Kommoss K, et al. Assessment of diagnostic performance of dermatologists cooperating with a convolutional neural network in a prospective clinical study: Human With Machine. JAMA Dermatol. 2023 Jun 1;159(6):621–627. doi: 10.1001/jamadermatol.2023.0905. doi. Medline. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Cirone K, Akrout M, Abid L, Oakley A. Assessing the utility of multimodal large language models (GPT-4 Vision and Large Language and Vision Assistant) in identifying melanoma across different skin tones. JMIR Dermatol. 2024 Mar 13;7:e55508. doi: 10.2196/55508. doi. Medline. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Multimedia Appendix 1. Confusion matrix of ChatGPT-4 Omni performance (top) and confusion matrix of ChatGPT-4 Turbo performance (bottom).
derma-v8-e67551-s001.png (55.1KB, png)
DOI: 10.2196/67551

Articles from JMIR Dermatology are provided here courtesy of JMIR Publications Inc.

RESOURCES