2025 Dec 18;242(2):225–234. doi: 10.1159/000550153

When Jack of All Trades Is a Master of None: Comparing the Performance of GPT-4 Omni against Specialised Neural Networks in Identifying Malignant Dermatological Lesions from Smartphone Images and Structured Clinical Data

Jiawen Deng a, Heather Jianbo Zhao a, Jaehyun Hwang b, Aya Alsefaou a, Eddie Guo c, Kiyan Heybati d, Myron Moskalyk e,f
PMCID: PMC12875639  PMID: 41411218

Abstract

Introduction

Artificial intelligence (AI) can potentially assist in triaging suspicious skin lesions as malignant or benign. General-purpose multimodal large language models (LLMs), such as GPT-4o, have not been rigorously evaluated for this task. This study assessed GPT-4o’s ability to triage skin lesions and compared its performance to specialised neural networks.

Methods

We evaluated GPT-4o using 1,000 random cases from the PAD-UFES-20 dataset with 50 repeated trials. GPT-4o was tested using clinical data-only, image-only, and multimodal inputs. GPT-4o’s performance, consistency, and fairness across demographic subgroups were evaluated. Its performance metrics were compared against specialised unimodal and multimodal neural networks trained on a separate subset of the PAD-UFES-20 dataset.

Results

GPT-4o exhibited poor triage performance across all modalities, with average balanced accuracies of 0.571, 0.602, and 0.622 for clinical data, image, and multimodal inputs, respectively. Sensitivity was consistently high (>0.95) with the trade-off of very low specificity. Mean agreement rates were high (>0.90); however, Fleiss’ κ indicated only moderate consistency due to a strong bias toward malignant classifications. Fairness evaluations showed poorer discriminative performance in younger patients compared to middle-aged and elderly patients but no notable differences between different sex and skin tone subgroups. Specialised neural networks significantly outperformed GPT-4o on most pairwise comparisons. Multimodal inputs significantly improved GPT-4o performance over unimodal inputs.

Conclusion

Although GPT-4o consistently triaged skin lesions with high sensitivity, its very low specificity limits clinical utility. Thus, general-purpose LLMs like GPT-4o are currently unsuitable for clinical dermatological diagnostics without significant field-specific developments and validation.

Keywords: Skin cancer, Artificial intelligence, Neural networks, GPT-4o, Multimodal models

Plain Language Summary

Skin cancer is common, and early detection greatly improves treatment outcomes. Because access to dermatologists often involves long wait times, there is growing interest in using artificial intelligence (AI) to help identify suspicious skin lesions. This study evaluated GPT-4o, a large language model (LLM) capable of interpreting both text and images, to see how well it could recognise skin cancers compared with AI systems designed specifically for this purpose. We tested GPT-4o on 1,000 publicly available skin lesion cases using three input types: images alone, clinical information alone (such as age, sex, and symptoms), and a combination of both. Each case was analysed 50 times to assess consistency. GPT-4o showed stable and highly sensitive performance. It was good at detecting cancers when present but often misclassified harmless lesions as malignant, leading to many false positives. In contrast, specialist dermatology AIs were much more accurate. GPT-4o performed slightly better when given both clinical and image data than when using either alone. In summary, GPT-4o’s tendency to over-diagnose limits its usefulness for real-world skin cancer screening. Purpose-built dermatology AI tools remain the safer and more reliable option.

Introduction

Skin cancer is one of the most common malignancies worldwide, with over 100,000 new melanoma cases and more than 8,000 related deaths projected in the USA for 2024 [1]. Despite ongoing efforts to expand dermatological services, patients in countries such as the USA, UK, and Canada often face wait times exceeding 6 months [2, 3]. Consequently, artificial intelligence (AI) tools are increasingly being explored to assist in triaging suspicious skin lesions before specialist referral [4].

Traditional computer vision models in dermatology have relied on convolutional neural networks (CNNs), which can classify lesions as malignant or benign based on specific image features (Fig. 1a) [5, 6]. Other information, such as patient demographics and comorbidities, can be further incorporated to create multimodal neural networks [7]. More recent transformer-based large language models (LLMs), such as ChatGPT, Gemini, and Claude, operate differently from CNNs, relying on natural text-based reasoning and response generation (Fig. 1b). While these models can produce responses that appear confident and authoritative, their reliance on probabilistic text generation also makes them prone to hallucinations and inconsistent outputs [8], raising safety concerns for their use by patients or clinicians in medical applications.

Fig. 1.


Comparison between the general architectures of convolutional neural networks and multimodal large language models. a Convolutional neural networks use successive convolutional layers to extract increasingly abstract image information. Early layers capture basic features such as shapes, colours, and brightness, while later layers identify anatomical structures or lesion characteristics, including size, depth, and symmetry. These visual features are converted into numerical representations and processed by a multilayer perceptron to generate final predictions. b Large language models divide input data into small units (such as text “tokens” or image “patches”) and transform them into numerical “embeddings.” These embeddings are processed through transformer layers, which use attention mechanisms to highlight the most relevant embeddings for the tasks at hand. Outputs from the transformer layers are then passed to a multilayer perceptron to produce the final results. The skin lesion images shown originate from the PAD-UFES-20 dataset [9] and they were made available under a CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).

Early studies assessing the use of LLMs in diagnosing skin cancer report encouraging accuracies (up to 85%) but have relied on small datasets and did not assess the reproducibility of their results across repeated sessions [10, 11]. Therefore, this study aimed to extensively evaluate GPT-4 Omni (GPT-4o) for triaging suspicious skin lesions using 1,000 patient cases across 50 repeated trials, and to compare its performance with unimodal and multimodal inputs against specialised CNN-based neural networks.

Methods

This study was conducted and reported in accordance with the TRIPOD-LLM checklist [12] (online suppl. Table S1; for all online suppl. material, see https://doi.org/10.1159/000550153).

Data Sources

Image and clinical data were obtained from the PAD-UFES-20 dataset (https://doi.org/10.17632/zr7vgbcyr2.1), which comprised 2,298 sets of lesion images captured by patients using smartphones along with associated clinical information. The dataset contained six lesion categories, including three categories of malignant diseases (basal cell carcinoma, melanoma, and squamous cell carcinoma) and three categories of benign skin lesions (actinic keratosis, melanocytic nevus, and seborrhoeic keratosis).

We randomly selected 1,000 patient cases for performance evaluation and used the remaining 1,298 cases for training specialised neural network models. Missing clinical data were imputed using 40 iterations of Multivariate Imputation by Chained Equations [13], separately for the performance evaluation and training sets to prevent data leakage [14]. Because the current study only used publicly available patient data, it is exempt from ethics review under the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS 2).
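The leakage-free protocol above (random split, then per-subset imputation) can be sketched as follows. This is an illustrative sketch on synthetic data, not the authors' pipeline: the study ran 40 iterations of Multivariate Imputation by Chained Equations, whereas here a simple per-subset mean imputation stands in for MICE to keep the example dependency-free. The key point illustrated is that imputation statistics are fitted separately within each subset.

```python
import random

def split_cases(cases, n_eval=1000, seed=42):
    """Randomly partition cases into an evaluation set and a training set."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    return shuffled[:n_eval], shuffled[n_eval:]

def impute_mean(subset, key):
    """Fill missing values of `key` with the mean computed WITHIN this subset only.
    (The study used 40 MICE iterations instead; mean imputation is a stand-in.)"""
    observed = [c[key] for c in subset if c[key] is not None]
    mean = sum(observed) / len(observed)
    return [dict(c, **{key: c[key] if c[key] is not None else mean}) for c in subset]

# Synthetic cases: age is sometimes missing (None).
cases = [{"age": a} for a in [30, 40, None, 50, 60, None, 70, 80]]
eval_set, train_set = split_cases(cases, n_eval=4)

# Fitting the imputer separately per subset prevents information from the
# evaluation set leaking into the training set, and vice versa.
eval_imputed = impute_mean(eval_set, "age")
train_imputed = impute_mean(train_set, "age")
```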

GPT-4o Configurations

All queries were executed using GPT-4o (version gpt-4o-2024-05-13) via OpenAI’s Batch API (https://platform.openai.com/docs/api-reference/batch). Each request was submitted using the openai package in Python and processed independently without chat history or shared context.

We tested three input formats: (1) image-only, (2) clinical data-only, and (3) multimodal (image + clinical data). Standardised prompts instructed GPT-4o to classify lesions as “suspected cancer” or “unlikely to be cancer” (see online suppl. Table S2 and Table S3). When an input format required clinical data, the values were automatically populated from the dataset into placeholders within the prompt. Images were stripped of EXIF metadata and filenames, then provided to the API in Base64 format. This ensured that GPT-4o received only the image content, without influence from image metadata, which has previously been demonstrated to affect model behaviour [15].

Each of the 1,000 test cases was evaluated 50 times per input format, totalling 150,000 requests. The temperature parameter, which ranges from 0 to 2 and controls the stochasticity of the model, was set to 1. Model outputs were parsed to extract the aforementioned classification statement and assign binary labels for performance evaluation. All prompts, inputs, and model responses are available on Mendeley Data (https://doi.org/10.17632/3kdcc8cf92.1).
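The request-construction and output-parsing steps can be sketched as below. The JSONL line shape follows OpenAI’s Batch API request format; the prompt text, `custom_id` scheme, and image bytes are illustrative stand-ins (the study’s actual prompts are in the online supplementary tables), as is the keyword-based label parser.

```python
import base64
import json

MODEL = "gpt-4o-2024-05-13"

def build_batch_line(case_id, trial, prompt_text, image_bytes=None):
    """Build one JSONL line for OpenAI's Batch API targeting /v1/chat/completions."""
    content = [{"type": "text", "text": prompt_text}]
    if image_bytes is not None:
        # Images are Base64-encoded into a data URI; EXIF metadata and
        # filenames are assumed to have been stripped beforehand.
        b64 = base64.b64encode(image_bytes).decode("ascii")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return json.dumps({
        "custom_id": f"{case_id}-trial{trial}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": MODEL,
                 "temperature": 1,  # as in the study
                 "messages": [{"role": "user", "content": content}]},
    })

def parse_label(response_text):
    """Map a free-text response to a binary triage label (1 = suspected cancer)."""
    text = response_text.lower()
    if "unlikely to be cancer" in text:
        return 0
    if "suspected cancer" in text:
        return 1
    return None  # unparseable output

# One multimodal request; 1,000 cases x 50 trials x 3 formats = 150,000 lines.
line = build_batch_line("PAT_1", 1, "Classify this lesion ...", b"\x89PNG...")
```

Because each line is an independent request, no chat history or shared context crosses between trials, matching the study’s stateless querying setup.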

Specialised Neural Network Configurations

The development process of the specialised neural networks has been thoroughly described in a previous publication [16]. The full codebase, datasets, and trained model files are available on Mendeley Data (https://doi.org/10.17632/2yv6rv3pzs.1). In brief, we developed a DenseNet-121 [17] based image-only neural network, a clinical data-only neural network, and a multimodal neural network. Classification thresholds were selected to maximise Matthews correlation coefficient (MCC), with secondary “high-sensitivity” thresholds derived to achieve sensitivities ≥0.90 and ≥0.95 for real-world triage applications.

Performance Evaluation

Model performance was assessed using standard binary classification metrics: area under the receiver operating characteristic curve (AUC-ROC), area under the precision-recall curve (AUC-PR), accuracy, balanced accuracy, MCC, sensitivity, and specificity. For GPT-4o, metrics were computed independently across the 50 repeated evaluations and then summarised using means and 95% confidence intervals (CIs).
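The per-trial summarisation for GPT-4o can be sketched as follows. The balanced-accuracy helper and the normal-approximation 95% CI are illustrative assumptions (the exact CI construction is not specified here), with synthetic labels standing in for the 1,000 test cases and 50 trials.

```python
import math
import statistics

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def summarise(values):
    """Mean and normal-approximation 95% CI across repeated trials."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# Each repeated trial yields one metric value; the trials are then summarised.
y_true = [1, 1, 0, 0]
trials = [[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0]]  # 3 illustrative trials
per_trial = [balanced_accuracy(y_true, t) for t in trials]
mean_ba, ci = summarise(per_trial)
```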

Performance Comparison

Performance comparisons between GPT-4o and each specialised neural network were stratified by input modality, and exact two-sided permutation tests were used. GPT-4o’s own performance across the three input modalities was also compared using Friedman tests followed by pairwise Wilcoxon signed-rank tests with Holm-Bonferroni correction. All statistical testing used a significance threshold of p < 0.05.
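A minimal exact two-sided permutation test might look like the sketch below. It assumes each model contributes a small set of per-trial metric values and enumerates every relabelling of the pooled values; the authors’ exact test construction (e.g., any pairing structure) may differ.

```python
from itertools import combinations

def exact_permutation_test(a, b):
    """Exact two-sided permutation test for a difference in mean metric
    between two models, enumerating every relabelling of the pooled values."""
    pooled = a + b
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    n, extreme, total = len(a), 0, 0
    for idx in combinations(range(len(pooled)), n):
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        diff = abs(sum(grp_a) / n - sum(grp_b) / len(grp_b))
        total += 1
        if diff >= observed - 1e-12:  # tolerance for float comparison
            extreme += 1
    return extreme / total

# Toy example: per-trial balanced accuracies for two models. Only the
# observed grouping and its mirror are at least as extreme: p = 2/20 = 0.1.
p_value = exact_permutation_test([0.55, 0.57, 0.60], [0.79, 0.80, 0.81])
```

Exhaustive enumeration is feasible only for small samples; with larger trial counts one would sample permutations instead.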

Fairness Evaluation

Given known disparities in skin cancer prevalence and outcomes across sex, age, and skin phototypes [18–20], post hoc fairness analyses examined whether GPT-4o’s performance varied by these demographic factors within each input modality. We focused on balanced accuracy, MCC, sensitivity, and specificity, as these metrics are less influenced by subgroup imbalance. Age was categorised into three clinically relevant bins [21]: ≤34 years old (young), 35–64 years old (middle-aged), and ≥65 years old (elderly). Sex was treated as a binary variable (male versus female). Skin phototype was grouped into light (Fitzpatrick skin phototype I–III) versus dark (Fitzpatrick skin phototype IV–VI) skin tones [16, 22].

Consistency Evaluation

GPT-4o’s output consistency across trials was assessed by mean agreement rate and Fleiss’ κ. For each test case, the proportion of 50 outputs matching the majority label was averaged to obtain the mean agreement rate with 95% CIs. Fleiss’ κ is a global measure that assesses overall inter-trial agreement beyond random chance [23, 24], where values of 1 and 0 indicate perfect and random agreement, respectively.
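Both consistency measures can be computed directly from the per-case trial labels. The sketch below uses synthetic binary ratings with only two trials per case rather than the study’s 50; Fleiss’ κ follows the standard formula.

```python
def mean_agreement(ratings):
    """Average, over cases, of the proportion of trials matching the majority label."""
    props = []
    for case in ratings:
        k = len(case)
        majority = max(case.count(0), case.count(1))
        props.append(majority / k)
    return sum(props) / len(props)

def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for N cases, each rated in k independent trials.
    It penalises chance agreement: a model that labels almost everything
    'malignant' scores near-perfect raw agreement but a much lower kappa."""
    n_cases = len(ratings)
    k = len(ratings[0])
    # Mean per-case observed agreement.
    p_bar = sum(
        (sum(case.count(c) ** 2 for c in categories) - k) / (k * (k - 1))
        for case in ratings
    ) / n_cases
    # Chance agreement from the marginal label distribution.
    p_e = sum(
        (sum(case.count(c) for case in ratings) / (n_cases * k)) ** 2
        for c in categories
    )
    return (p_bar - p_e) / (1 - p_e)

# Four cases, two trials each (the study used 50 trials per case).
ratings = [[0, 0], [1, 1], [0, 1], [0, 1]]
agreement = mean_agreement(ratings)   # 0.75
kappa = fleiss_kappa(ratings)         # 0.0: agreement no better than chance
```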

Results

Characteristics of the PAD-UFES-20 dataset are presented in Table 1. The performance of GPT-4o and the specialised neural networks across input data modalities is summarised in Table 2. The models’ receiver operating characteristic (ROC) and precision-recall (PR) curves are shown in Figure 2. Confusion matrices illustrating the distribution of correct and incorrect predictions are shown in Figure 3.

Table 1.

Characteristics of cases from the PAD-UFES-20 dataset used for performance evaluation and training of the specialised neural networks

Total (N = 2,298) Neural network training subset (N = 1,298) Performance evaluation subset (N = 1,000)
Age, years, mean (SD) 60.5 (15.9) 60.2 (16.1) 60.8 (15.6)
Age group, n (%)
 Young (≤34) 148 (6.4) 90 (6.9) 58 (5.8)
 Middle-aged (35–64) 1,143 (49.7) 640 (49.3) 503 (50.3)
 Elderly (≥65) 1,007 (43.8) 568 (43.8) 439 (43.9)
Sex, n (%)a
 Male 741 (49.6) 420 (50.1) 321 (49.0)
 Female 753 (50.4) 419 (49.9) 334 (51.0)
Fitzpatrick skin phototype, n (%)a
 Type I 153 (10.2) 79 (9.4) 74 (11.3)
 Type II 876 (58.6) 494 (58.9) 382 (58.3)
 Type III 392 (26.2) 217 (25.9) 175 (26.7)
 Type IV 62 (4.1) 42 (5.0) 20 (3.1)
 Type V 10 (0.7) 7 (0.8) 3 (0.5)
 Type VI 1 (0.1) 0 (0.0) 1 (0.2)
Has personal or family skin cancer history, n (%)a 681 (45.6) 389 (46.4) 292 (44.6)
Pathological diagnosis, n (%)
 Actinic keratosis 730 (31.8) 419 (32.3) 311 (31.1)
 Basal cell carcinoma 845 (36.8) 472 (36.4) 373 (37.3)
 Malignant melanoma 52 (2.3) 25 (1.9) 27 (2.7)
 Melanocytic nevus 244 (10.6) 135 (10.4) 109 (10.9)
 Squamous cell carcinoma 192 (8.4) 118 (9.1) 74 (7.4)
 Seborrhoeic keratosis 235 (10.2) 129 (9.9) 106 (10.6)
Lesion location, n (%)
 Abdomen 36 (1.6) 21 (1.6) 15 (1.5)
 Arm 192 (8.4) 112 (8.6) 80 (8.0)
 Back 248 (10.8) 136 (10.5) 112 (11.2)
 Chest 280 (12.2) 158 (12.2) 122 (12.2)
 Ear 73 (3.2) 42 (3.2) 31 (3.1)
 Face 570 (24.8) 348 (26.8) 222 (22.2)
 Foot 16 (0.7) 6 (0.5) 10 (1.0)
 Forearm 392 (17.1) 197 (15.2) 195 (19.5)
 Hand 126 (5.5) 78 (6.0) 48 (4.8)
 Lip 23 (1.0) 12 (0.9) 11 (1.1)
 Neck 93 (4.0) 47 (3.6) 46 (4.6)
 Nose 158 (6.9) 95 (7.3) 63 (6.3)
 Scalp 18 (0.8) 7 (0.5) 11 (1.1)
 Thigh 73 (3.2) 39 (3.0) 34 (3.4)
Maximum lesion diameter a,b 11.9 (8.9) 12.0 (8.4) 11.9 (8.6)
Lesion characteristics, n (%)
 Elevated c 1,433 (62.4) 799 (61.7) 634 (63.4)
 Itching d 1,455 (63.5) 830 (64.2) 625 (62.6)
 Recently grew e 925 (48.8) 527 (49.1) 398 (48.4)
 Painful f 397 (17.4) 234 (18.1) 163 (16.3)
 Bleeding d 614 (26.8) 348 (26.9) 266 (26.6)

Adapted from Table 1, Case characteristics in PAD-UFES-20 dataset and in the derivation and internal validation subsets, from Deng et al. [16] under a Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/).

Summary statistics are presented as mean (standard deviation) for continuous characteristics, and as N (%) for discrete characteristics.

a: 459 and 345 cases had missing sex, Fitzpatrick skin phototype, cancer history, and lesion measurement data in the neural network training and performance evaluation subsets, respectively.

b: Maximum lesion diameter is reported in millimetres.

c: Two cases had missing lesion elevation data in the neural network training subset.

d: Five and one cases had missing itching and bleeding data in the neural network training and performance evaluation subsets, respectively.

e: 225 and 177 cases had missing recent lesion growth data in the neural network training and performance evaluation subsets, respectively.

f: Seven and three cases had missing lesion pain data in the neural network training and performance evaluation subsets, respectively.

Table 2.

Summary of GPT-4o and specialised neural network performance metrics across different input data modalities

Metric | GPT-4o, mean (95% CI) | Specialised neural network (balanced / sensitivity ≥0.90 / sensitivity ≥0.95 thresholds)

Structured clinical data inputs
 AUC-ROC 0.571 (0.570–0.572) 0.859
 AUC-PR 0.512 (0.512–0.512) 0.866
 Accuracy 0.550 (0.549–0.551) 0.743 / 0.476 / 0.476
 Balanced accuracy 0.571 (0.570–0.572) 0.731 / 0.502 / 0.502
 MCC 0.235 (0.234–0.236) 0.529 / 0.043 / 0.043
 Sensitivity 0.973 (0.972–0.974) 0.498 / 1.000 / 1.000
 Specificity 0.169 (0.168–0.170) 0.964 / 0.004 / 0.004

Smartphone image inputs
 AUC-ROC 0.602 (0.601–0.603) 0.870
 AUC-PR 0.531 (0.531–0.531) 0.845
 Accuracy 0.583 (0.582–0.584) 0.794 / 0.790 / 0.769
 Balanced accuracy 0.602 (0.601–0.603) 0.795 / 0.794 / 0.777
 MCC 0.302 (0.301–0.303) 0.590 / 0.593 / 0.577
 Sensitivity 0.975 (0.974–0.976) 0.821 / 0.876 / 0.932
 Specificity 0.229 (0.228–0.230) 0.770 / 0.713 / 0.622

Multimodal inputs
 AUC-ROC 0.622 (0.621–0.623) 0.913
 AUC-PR 0.544 (0.544–0.544) 0.908
 Accuracy 0.604 (0.603–0.605) 0.841 / 0.801 / 0.762
 Balanced accuracy 0.622 (0.621–0.623) 0.840 / 0.807 / 0.772
 MCC 0.338 (0.337–0.339) 0.681 / 0.623 / 0.578
 Sensitivity 0.976 (0.975–0.977) 0.829 / 0.916 / 0.958
 Specificity 0.268 (0.267–0.269) 0.852 / 0.698 / 0.586

AUC-ROC and AUC-PR are threshold-independent metrics and do not change based on the threshold used for the specialised neural networks.

For the specialised neural network trained on structured clinical data, classification metrics from the two high-sensitivity thresholds are the same.

Because GPT-4o does not produce continuous risk probability estimates, its AUC-ROC is equal to the balanced accuracy.

GPT-4o metrics are presented as mean (95% CI).

GPT-4o, GPT-4 Omni; AUC-ROC, area under the curve of the receiver operating characteristic curve; AUC-PR, area under the curve of the precision-recall curve; CI, confidence interval; MCC, Matthews Correlation Coefficient.

Fig. 2.


Plots comparing the discriminative performance of GPT-4o and specialised neural networks for skin lesion classification. a Receiver operating characteristic curves. b Precision-recall curves. The averaged receiver operating characteristic and precision-recall curves with corresponding 95% confidence intervals for GPT-4o were generated via vertical averaging of the individual curves from all 50 trials. Metrics for GPT-4o are presented as mean (95% confidence interval). AUC-ROC, area under the curve of the receiver operating characteristic curve; AUC-PR, area under the curve of the precision-recall curve; GPT-4o, GPT-4 Omni.

Fig. 3.


Confusion matrices for GPT-4o and specialised neural network classifications. a Confusion matrices for classification using clinical data inputs. b Confusion matrices for classification using image inputs. c Confusion matrices for classification using multimodal inputs. GPT-4o, GPT-4 Omni.

GPT-4o Performance

GPT-4o demonstrated weak classification performance, with mean balanced accuracies of 0.571 (95% CI: 0.570–0.572), 0.602 (95% CI: 0.601–0.603), and 0.622 (95% CI: 0.621–0.623) for clinical data, image, and multimodal inputs, respectively. While GPT-4o achieved consistently high sensitivity (>0.95) across all three input modalities, it did so with the trade-off of very low specificity.

GPT-4o Consistency

The mean agreement rate for GPT-4o was 0.938 (95% CI: 0.920–0.951) for clinical data inputs, 0.928 (95% CI: 0.909–0.942) for image inputs, and 0.918 (95% CI: 0.899–0.933) for multimodal inputs. Overall, GPT-4o had moderate agreement between trials across all three input modalities, with Fleiss’ κ of 0.517 for clinical data inputs, 0.547 for image inputs, and 0.566 for multimodal inputs.

GPT-4o Fairness

GPT-4o performance stratified by demographic characteristics is summarised in online supplementary Tables S4–S9. Across age groups, GPT-4o showed the poorest discriminative performance in younger patients compared with middle-aged and elderly patients across all three input modalities, as reflected by the MCC. Across sex and skin phototype subgroups, GPT-4o showed only small performance differences; although some sensitivities and specificities differed significantly, there was no consistent directional bias towards one subgroup over the other.

Comparing GPT-4o against Specialised Neural Networks

When relying on clinical data inputs alone at the balanced threshold, GPT-4o was inferior to the specialised neural network on all metrics (exact p < 0.05) except sensitivity. At the two high-sensitivity thresholds, however, the specialised neural network performed substantially worse and was significantly outperformed by GPT-4o on all threshold-dependent metrics (exact p < 0.05).

When using image or multimodal inputs with any of the three thresholds, the specialised neural network significantly outperformed GPT-4o on all metrics (exact p < 0.05), except for sensitivity, where GPT-4o significantly outperformed the specialised neural networks (exact p < 0.05).

Comparing GPT-4o Performance between Different Modalities

Overall, there were significant differences in GPT-4o performance between different input modalities for all metrics (Friedman p < 0.01) except for sensitivity (Friedman p = 0.30). In post hoc pairwise comparisons of significantly different metrics, GPT-4o performed significantly better when using image inputs compared to clinical data inputs on all metrics (Holm-adjusted Wilcoxon p < 0.01), and it performed significantly better when using multimodal inputs compared to image inputs on all metrics (Holm-adjusted Wilcoxon p < 0.01).

Discussion

This study evaluated GPT-4o’s ability to triage skin lesions as malignant or benign using clinical, image, and multimodal inputs. Across 1,000 test cases with 50 repeated trials per modality, GPT-4o showed stable but consistently poor classification performance, especially when compared with specialised neural networks.

Notably, the mean agreement rate among GPT-4o trials exceeded 90%, suggesting that the model may rely on stable internal heuristics despite its probabilistic nature. However, Fleiss’ κ showed only moderate agreement among trials. This discrepancy arises because Fleiss’ κ penalises uniform ratings (e.g., consistently classifying lesions as malignant), and it highlights GPT-4o’s strong bias towards malignant classifications. This bias diminishes GPT-4o’s clinical value because of inflated false positives. The bias appeared stronger in younger patients, likely owing to sparse training data in this population and the precautionary bias typical of LLMs making less confident predictions [25, 26].

GPT-4o’s improved performance with multimodal compared to unimodal inputs suggests that GPT-4o does integrate meaningful clinical signals rather than defaulting to malignant classifications. This indicates potential for future transformer- and even LLM-based models in skin lesion assessment, especially if fine-tuned on clinical datasets and augmented through techniques such as retrieval-based or agentic frameworks to ground model outputs in validated clinical knowledge. However, as a standalone tool, GPT-4o remains unsuitable for real-world triaging or classification of skin lesions.

Limitations

This study has several important limitations. Firstly, the specialised neural networks were internally validated using the testing dataset, which may inflate performance estimates. While this does not affect conclusions about GPT-4o’s performance or fairness, it highlights the need to further externally validate our specialised networks. Secondly, the characteristics of the PAD-UFES-20 dataset may restrict the generalisability of our findings as it strictly represents a Brazilian patient population and is limited in its patient demographics and included skin phototypes. Lastly, we used fixed, standardised prompts for all GPT-4o evaluations and did not assess the impact of prompt variation. While GPT-4o is generally more resistant to influences from prompting than prior generative models [27], future studies would benefit from testing LLMs through a variety of usage scenarios and prompt syntaxes.

Conclusion

GPT-4o demonstrated consistency in its skin lesion triaging and yielded performance improvements from integrating multimodal inputs versus unimodal inputs. However, its inherent biases toward high sensitivity and poor specificity significantly limit its clinical utility.

Key Message

GPT-4o performs consistently but poorly in skin lesion triage, especially compared to specialised neural networks.

Acknowledgements

We would like to thank the Programa de Assistência Dermatológica e Cirúrgica at the Federal University of Espírito Santo and the Nature Inspired Computing Laboratory for developing the PAD-UFES-20 dataset and making it available for our model development.

Statement of Ethics

Because the current study only used publicly available patient data, it is exempt from ethics review under the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS 2).

Conflict of Interest Statement

Jiawen Deng is a member of the OpenAI Researcher Access Program and receives grants from OpenAI in the form of API credits for purposes of research involving large language models. Other authors report no conflict of interest.

Funding Sources

Access to the GPT-4o API was facilitated by a grant from OpenAI’s Researcher Access Program (#0000002403). OpenAI had no role in the conception and design of the study, nor was OpenAI involved in data collection, data analysis, interpretation of results, or the decision to submit the manuscript for publication. OpenAI did not have access to the manuscript and its associated data during the manuscript preparation, review, and submission processes.

Author Contributions

Jiawen Deng conceptualised and supervised the study, performed data retrieval and management, performed statistical coding, created all visualisations, and drafted the manuscript. Heather Jianbo Zhao was involved in data retrieval and management, drafted the manuscript, and made intellectually important manuscript edits. Jaehyun Hwang was involved in data retrieval and management and made intellectually important manuscript edits. Aya Alsefaou was involved in data retrieval and management, drafted the manuscript, and made intellectually important manuscript edits. Eddie Guo performed code validation and review and made intellectually important manuscript edits. Kiyan Heybati was involved in data retrieval and management and made intellectually important manuscript edits. Myron Moskalyk performed code validation and review, drafted the manuscript, and made intellectually important manuscript edits. All authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved and give final approval for the manuscript to be submitted and published in its current state.

Data Availability Statement

Prompts, image attachments, responses, and extracted binary flags of all GPT-4o requests are available on Mendeley Data: https://doi.org/10.17632/3kdcc8cf92.1. The full codebase and datasets required to reproduce the specialised neural networks, including the trained TensorFlow model files, are also available on Mendeley Data: https://doi.org/10.17632/2yv6rv3pzs.1.

Supplementary Material


References

  • 1. PDQ Adult Treatment Editorial Board. Melanoma treatment (PDQ®): health professional version. PDQ cancer information summaries. Bethesda, MD: National Cancer Institute (US); 2024.
  • 2. Drucker AM, Bai L, Eder L, Chan A-W, Pope E, Tu K, et al. Sociodemographic characteristics and emergency department visits and inpatient hospitalizations for atopic dermatitis in Ontario: a cross-sectional study. CMAJ Open. 2022;10(2):E491–9.
  • 3. Rinderknecht F-AB, Naik HB. Access to dermatologic care and provider impact on hidradenitis suppurativa care: global survey insights. Int J Womens Dermatol. 2024;10(1):e130.
  • 4. Jeong HK, Park C, Henao R, Kheterpal M. Deep learning in dermatology: a systematic review of current approaches, outcomes, and limitations. JID Innov. 2023;3(1):100150.
  • 5. Jogin M, Mohana MMS, Divya GD, Meghana RK, Apoorva S. Feature extraction using Convolution Neural Networks (CNN) and deep learning. In: 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT); 2018. p. 2319–23.
  • 6. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8(1):53.
  • 7. Bozzo A, Hollingsworth A, Chatterjee S, Apte A, Deng J, Sun S, et al. A multimodal neural network with gradient blending improves predictions of survival and metastasis in sarcoma. NPJ Precis Oncol. 2024;8(1):188.
  • 8. Deng J, Heybati K, Park Y-J, Zhou F, Bozzo A. Artificial intelligence in clinical practice: a look at ChatGPT. Cleve Clin J Med. 2024;91(3):173–80.
  • 9. Pacheco AGC, Lima GR, Salomão AS, Krohling B, Biral IP, de Angelo GG, et al. PAD-UFES-20: a skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Brief. 2020;32:106221.
  • 10. Stoneham S, Livesey A, Cooper H, Mitchell C. ChatGPT versus clinician: challenging the diagnostic capabilities of artificial intelligence in dermatology. Clin Exp Dermatol. 2024;49(7):707–10.
  • 11. Cirone K, Akrout M, Abid L, Oakley A. Assessing the utility of multimodal large language models (GPT-4 Vision and Large Language and Vision Assistant) in identifying melanoma across different skin tones. JMIR Dermatol. 2024;7:e55508.
  • 12. Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med. 2025;31(1):60–9.
  • 13. van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):1–67.
  • 14. Apicella A, Isgrò F, Prevete R. Don't push the button! Exploring data leakage risks in machine learning and transfer learning. arXiv [cs.LG]. 2024.
  • 15. Deng J, Heybati K, Shammas-Toma M. When vision meets reality: exploring the clinical applicability of GPT-4 with vision. Clin Imaging. 2024;108:110101.
  • 16. Deng J, Guo E, Zhao HJ, Venugopal K, Moskalyk M. Development of a transfer learning-based, multimodal neural network for identifying malignant dermatological lesions from smartphone images. Cancer Inform. 2025;24:11769351251349891.
  • 17. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. arXiv [cs.CV]. 2016.
  • 18. Berwick M, Armstrong BK, Ben-Porat L, Fine J, Kricker A, Eberle C, et al. Sun exposure and mortality from melanoma. J Natl Cancer Inst. 2005;97(3):195–9.
  • 19. Schwartz MR, Luo L, Berwick M. Sex differences in melanoma. Curr Epidemiol Rep. 2019;6(2):112–8.
  • 20. Olsen CM, Thompson JF, Pandeya N, Whiteman DC. Evaluation of sex-specific incidence of melanoma. JAMA Dermatol. 2020;156(5):553–60.
  • 21. Memon A, Bannister P, Rogers I, Sundin J, Al-Ayadhy B, James PW, et al. Changing epidemiology and age-specific incidence of cutaneous malignant melanoma in England: an analysis of the national cancer registration data by age, gender and anatomical site, 1981–2018. Lancet Reg Health Eur. 2021;2:100024.
  • 22. Aggarwal P. Performance of artificial intelligence imaging models in detecting dermatological manifestations in higher Fitzpatrick skin color classifications. JMIR Dermatol. 2021;4(2):e31697.
  • 23. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378–82.
  • 24. Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions. Hoboken, NJ: John Wiley & Sons; 2013.
  • 25. Vlassov VV. Precautionary bias. Eur J Public Health. 2017;27(3):389.
  • 26. Choudhury A, Chaudhry Z. Large language models and user trust: consequence of self-referential learning loop and the deskilling of healthcare professionals. J Med Internet Res. 2024;26:e56764.
  • 27. He J, Rungta M, Koleczek D, Sekhon A, Wang FX, Hasan S. Does prompt formatting have any impact on LLM performance? arXiv [cs.CL]. 2024.


Articles from Dermatology (Basel, Switzerland) are provided here courtesy of Karger Publishers
