Dermatology. 2023 Mar 21;239(4):499–513. doi: 10.1159/000530225

Artificial Intelligence for the Classification of Pigmented Skin Lesions in Populations with Skin of Color: A Systematic Review

Yuyang Liu a, Clare A Primiero b, Vishnutheertha Kulkarni a, H Peter Soyer b,c, Brigid Betz-Stablein b
PMCID: PMC10407827  PMID: 36944317

Abstract

Background

While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at later stages and have a poorer prognosis. The use of artificial intelligence (AI) models can potentially improve early detection of skin cancers; however, the lack of skin color diversity in training datasets may only widen the pre-existing racial discrepancies in dermatology.

Objective

The aim of this study was to systematically review the technique, quality, accuracy, and implications of studies using AI models trained or tested in populations with skin of color for classification of pigmented skin lesions.

Methods

PubMed was used to identify any studies describing AI models for classification of pigmented skin lesions. Only studies that used training datasets with at least 10% of images from people with skin of color were eligible. Outcomes on study population, design of AI model, accuracy, and quality of the studies were reviewed.

Results

Twenty-two eligible articles were identified. The majority of studies were trained on datasets obtained from Chinese (7/22), Korean (5/22), and Japanese populations (3/22). Seven studies used diverse datasets containing Fitzpatrick skin type I–III in combination with at least 10% from black Americans, Native Americans, Pacific Islanders, or Fitzpatrick IV–VI. AI models producing binary outcomes (e.g., benign vs. malignant) reported an accuracy ranging from 70% to 99.7%. Accuracy of AI models reporting multiclass outcomes (e.g., specific lesion diagnosis) was lower, ranging from 43% to 93%. Reader studies, where dermatologists’ classification is compared with AI model outcomes, reported similar accuracy in one study, higher AI accuracy in three studies, and higher clinician accuracy in two studies. A quality review revealed that dataset description and variety, benchmarking, public evaluation, and healthcare application were frequently not addressed.

Conclusions

While this review provides promising evidence of accurate AI models in populations with skin of color, the majority of the studies reviewed were obtained from East Asian populations and therefore provide insufficient evidence to comment on the overall accuracy of AI models for darker skin types. Large discrepancies remain in the number of AI models developed in populations with skin of color (particularly Fitzpatrick type IV–VI) compared with those of largely European ancestry. A lack of publicly available datasets from diverse populations is likely a contributing factor, as is the inadequate reporting of patient-level metadata relating to skin color in training datasets.

Keywords: Artificial intelligence, Skin of color, Race, Skin cancers

Introduction

Skin cancer is the most common malignancy worldwide, with melanoma representing the deadliest form. While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at a later stage and have a poorer prognosis when compared with Caucasian populations [1–3]. Even when diagnosed at the same stage, Hispanic, Native, Asian, and African Americans have significantly shorter survival times than Caucasian Americans (p < 0.05) [4]. Skin cancers in people with skin of color often present differently from those in Caucasian skin and are often underrepresented in dermatology training [5, 6].

The use of artificial intelligence (AI) algorithms for image analysis and detection of skin cancer has the potential to decrease healthcare disparities by removing unintended clinician bias and improving accessibility and affordability [7]. Skin lesion classification by AI algorithms to date has performed equivalently to [8] and, in some cases, better than dermatologists [9]. Human-computer collaboration can increase diagnostic accuracy further [10]. However, most AI advances have used homogeneous datasets [11–15] collected from countries with predominantly European ancestry [16]. Exclusion of skin of color in training datasets poses the risk of incorrect diagnosis or missing skin cancers entirely [8] and risks widening racial disparities that already exist in dermatology [8, 17].

While multiple reviews have compared AI-based model performances for skin cancer detection [18–20], the use of AI in populations with skin of color has not been evaluated. The objective of this study was to systematically review the current literature for AI models for classification of pigmented skin lesion images in populations with skin of color.

Methods

Literature Search

The systematic review follows the PRISMA guidelines [21]. A protocol was registered with PROSPERO (International Prospective Register of Systematic Reviews) and can be accessed at https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281347.

A PubMed search in March 2021 used search terms relating to artificial intelligence, skin cancer, and skin lesions (search strings in online suppl. eTable; for all online suppl. material, see www.karger.com/doi/10.1159/000530225). No date range was applied, language was restricted to English, and only original research was included. Covidence software was used for screening administration. Titles and abstracts of search results were screened by two independent reviewers (Y.L. 100% and B.B.S. 20%) using the eligibility criteria described in Table 1. The remaining articles were assessed for eligibility by reviewing the methods or full text. Disagreements were resolved following discussions with a third independent reviewer (C.P.).

Table 1.

Inclusion and exclusion criteria used for screening and assessing eligibility of articles

Inclusion criteria
1. Any computer modeling or use of AI on diagnosis of skin conditions
2. Datasets provide information on the population (racial or Fitzpatrick skin type breakdown), or datasets obtained from countries with a predominantly skin of color population
3. Uses dermoscopic, clinical, 3D, or other photographic images of the skin surface
4. Includes the assessment of malignant and/or non-malignant pigmented skin lesions

Exclusion criteria
1. No population description of the training datasets (demographic, racial, or ethnicity breakdown), or datasets from a country with a predominantly Caucasian population of European ancestry
2. Dataset description with >90% Caucasian population, fair skin type, or Fitzpatrick skin type I–III
3. Solely used images from ISIC [56], PH2 [13], IAD [57], ISBI [58], HAM10000 [12], MED-NODE [14], ILSVRC [59], DermaQuest [60], DERMIS [61], DERM IQA [62], DermNet NZ [63], or other datasets known to be of predominantly European ancestry

DERM IQA, Dermoscopic Image Quality Assessment [62]; DERMIS, dermatology information system; ISIC, International Skin Imaging Collaboration [56]; DermNet NZ, DermNet New Zealand [63]; IAD, Interactive Atlas of Dermoscopy [57]; ISBI, International Symposium on Biomedical Imaging [58]; ILSVRC, ImageNet Large Scale Visual Recognition Challenge [59]; HAM10000, Human Against Machine with 10,000 training images [12]; MED-NODE, computer-assisted MElanoma Diagnosis from NOn-DErmoscopic images [14].

Data Extraction and Synthesis

Data extraction was performed by author Y.L. using a standardized form and confirmed by V.K. The following parameters were recorded: reference, ethnicity/ancestry/race, lesion number, sex, age, location, skin condition, public availability of dataset, number of images, type of images, methods of confirmation, deep learning system, model output, comparison with human input, and any missing data reported. Algorithm performance measures were recorded as accuracy, sensitivity, specificity, and/or area under the receiver operating characteristic curve. A narrative synthesis of the extracted data was used to present findings, as a meta-analysis was not feasible due to heterogeneity in study design, AI systems, skin lesions, and outcomes.
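As an illustration only (not code from any of the reviewed studies), the sketch below shows how these binary performance measures can be computed from model outputs using scikit-learn; the labels and probabilities are invented toy values.

```python
# Toy illustration of the extracted performance measures (accuracy, sensitivity,
# specificity, AUC) for a binary benign/malignant classifier; values are invented.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = malignant, 0 = benign
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])   # predicted malignancy probability
y_pred = (y_prob >= 0.5).astype(int)                           # threshold at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)         # true-positive rate
specificity = tn / (tn + fp)         # true-negative rate
auc = roc_auc_score(y_true, y_prob)  # area under the ROC curve

print(f"Accuracy {accuracy:.2f}, sensitivity {sensitivity:.2f}, "
      f"specificity {specificity:.2f}, AUC {auc:.2f}")
```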

Quality Assessment

Quality was assessed using the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology (CLEAR Derm) Consensus Guidelines [22]. This 25-point checklist offers comprehensive recommendations on factors critical to the development, performance, and application of image-based AI algorithms in dermatology [22].

Author Y.L. performed the quality assessment of all included studies, and author B.B.S. assessed 20%. The inter-rater agreement rate was 87%, with disagreements resolved via a third independent reviewer (V.K.). Each criterion was evaluated as fully, partially, or not addressed and scored 1, 0.5, or 0, respectively, using the scoring rubric in online supplementary eTable 2.

Results

The database search identified 993 articles, including 13 duplicates. After screening titles/abstracts, 535 records were excluded, and the remaining 445 records were screened by methods, with 63 articles reviewed by full text. Forward and backward citations search revealed no additional articles. A total of 22 studies were included in the final review (PRISMA flow diagram in online supplementary eFig. 1).

Study Design

All 22 studies were performed between 2002 and 2021 [23–32], with 11 (50%) studies published between 2020 and 2021 [33–44]. An overview of study characteristics is displayed in Table 2. The median number of total images used in each study for all datasets combined was 5,846 (ranging from 212 to 185,192). The median dataset size for training, testing, and validation was 4,732 images (range: 247–22,608), 362 (range: 100–40,331), and 1,258 (range: 109–14,883), respectively.

Table 2.

Overview of study characteristics

Columns: first author and year; patient population (ethnicity/ancestry/race/location and dataset information); public availability of dataset; image type and image numbers; validation (H = histology, C = clinical diagnosis); deep learning system; model output
Piccolo et al. (2002) [23] Fitzpatrick I–V Lesion n = 341
Patient n = 289
F 65% (n = 188)
Average age 33.6
No Dermoscopy
Total 341
Training 0
Testing 341
Validation 0
All – H Network architecture DEM-MIPS (artificial neural network designed to evaluate different colorimetric and geometric parameters) Binary (melanoma, non-melanoma)
Iyatomi et al. (2008) [24] Italian
Austrian
Japanese
n/a No Dermoscopy
Total 1,258
Training 247
Testing NA
Validation 1,258
Dataset A and B– H
Dataset C – H + C
Network architecture
ANN (back-propagation artificial neural networks)
Binary (malignant or benign)
Malignancy (risk) score (0–100)
Chang et al. (2013) [25] Taiwanese Lesion n = 769
Patient n = 676
F 56% (n = 380)
Average age 47.6
No Clinical
Total 1,899
Training NA
Testing NA
Validation NA
Benign – C
Malignant – H
Network architecture
Computer-aided diagnosis (CADx) system built on 91 conventional features of shape, texture, and colors (developing software – MATLAB)
Binary (benign or malignant)
Chen et al. (2016) [26] American Indian, Alaska Native, Asian, Pacific Islander, black or African, American, Caucasian Community dataset
Patient n = 1,900
F 52.3% (n = 993)
Age >50% under 35
Partially
DermNet NZ – Yes
Community – No
Clinical
Total 12,000
Training 11,780
Testing 337
Validation NA
Community dataset (benign and malignant) – C
DermNet – H
Network architecture
Patented image-search algorithm that builds on proven computer vision methods from the field of CBIR
Binary (melanoma and non-melanoma)
Yang et al. (2017) [27] Korean Patient n = 110
F 50% (n = 55)
No Dermoscopy
Total 297
Training 0
Testing 297
Validation 0
All – H Network architecture
3 stage algorithm, pre-processing, stripe pattern detection and automatic discrimination (MATLAB)
Binary (LM, nevus)
Han et al. (2018) [28] Korean
Caucasian
n/a Partially
MED-NODE, Atlas, Edinburgh, Dermofit – yes
Others – no
Clinical
Total 182,044
Training 178,875
Testing 1,276
Validation 22,728
ASAN – C+ H
Multiple other dataset (5) used with unclear validation
Neural network – CNN
Network Architecture
Microsoft ResNet – 152
Google Inception
12-class skin tumor types
Yu et al. (2018) [29] Korean Lesion n = 275 No Dermoscopy
Total 724
Training 372
Testing 362
Validation 109
All – H Neural network – CNN
Network architecture modified VGG – 16
Binary (melanoma/non-melanoma)
Zhang et al. (2018) [30] Chinese Lesion n = 1,067 No Dermoscopy
Total 1,067
Training 4,867
Testing 1,142
Validation NA
Benign and malignant – C
Three dermatologists disagree – H
Neural network – CNN
Network architecture
GoogLeNet Inception v3
Four class classifier (BCC, SK, melanocytic nevus, and psoriasis)
Fujisawa et al. (2019) [31] Japanese Patients n = 2,296 No Clinical
Total 6,009
Training 4,867
Testing 1,142
Validation NA
Melanocytic nevus, split nevus, lentigo simplex – C
Others – PE
Neural network – DCNN
Network architecture
GoogLeNet DCNN model
1. Two class classifier (benign vs. malignant)
2. Four class classifier (malignant epithelial lesion, malignant melanocytic lesion, benign epithelial lesion, benign melanocytic lesion)
3. 14 class classification
4. 21 class classification
Jinnai et al. (2019) [38] Japanese Patient n = 3,551 No Dermoscopy
Total 5,846
Training 4,732
Testing 666
Validation NA
Malignant – H
Benign tumor – C
Neural network – FRCNN
Network architecture – VGG-16
Binary (benign/malignant)
Six-class classifications (6 skin conditions)
Zhao et al. (2019) [32] Chinese n/a No Clinical
Total 4,500
Training 3,375
Testing 1,125
Validation NA
Benign – C
Malignant – PE
Neural network – CNN
Network architecture – Xception
Risk (low/high/dangerous)
Cho et al. (2020) [33] Korean Patient n = 404 No Clinical
Total 2,254
Training 1,629
Testing 625
Validation NA
Benign – C
Malignant – H
Neural network – DCNN
Network architecture – Inception-ResNet-V2
Binary classification (benign or malignant)
Han et al. (2020) [36] Korean
Caucasian
ASNA, Normal, SNU
Patient n = 28,222
F 55% (n = 15,522)
Average age 40
MED-NODE, Web, Edinburgh
NA
Partially
Edinburgh – Yes
SNU – upon request
Clinical
Total 224,181
Training 220,680
Testing 2,441
Validation 3,501
ASAN – C+ H
Edinburgh – H
Med-Node – H
SNU – C + H
Web – image finding
Neural network – CNN
Network architecture
SENet
Se-ResNet-50
Visual geometry group (VGG-19)
Binary (malignant, non-malignant)
Binary (steroids, antibiotics, antivirals, antifungals)
Multiple class classification (134 skin disorders)
Han et al. (2020) [35] Korean Patients n = 673
Lesions n = 673
F 54% (n = 363)
Average age 58
No Clinical
Total 185,192
Training 182,348
Testing NA
Validation 2,844
All – H Neural network – CNN
Network architecture
SENet
SE-ResNeXt-50
SE-ResNet-50
Risk output (risk of malignancy)
Han et al. (2020) [34] Korean Patient n = 9,556
Lesion n = 10,426
F 55% (5,255)
Average age 52
No Clinical
Total 40,331
Training 1,106,886a
Testing NA
Validation 40,331
All – H Neural network – RCNN
Network architecture
SENet
Se-ResNeXt-50
Binary (malignant, non-malignant)
32 class classification
Huang et al. (2020) [37] Chinese Lesion n = 1,225 No Clinical
Total 3,299
Training 2,474
Testing 825
Validation NA
All – PE Neural network - CNN
Network architecture
Inception V3
Inception-ResNet V2
DenseNet 121
ResNet 50
Binary (SK/BCC)
Li (2020) [44] Chinese Patient n = 106 No Dermoscopy and clinical
Total 212
Training 200,000a
Testing 212
Validation NA
All – H Network architecture
Youzhi AI software (Shanghai Maise Information Technology Co., Ltd., Shanghai, China)
Binary (benign or malignant)
14 class classification
Liu et al. (2020) [39] Fitzpatrick type I–VI Patient n = 15,640
Lesion n = 20,676
No Clinical
Total 79,720
Training 64,837
Testing NA
Validation 14,483
Benign – C
Malignant – H
Neural network – DLS
Network architecture
Inception – v4
26 class classification (primary output)
419 class classification (secondary output)
Wang et al. (2020) [40] Chinese, with Fitzpatrick type IV n/a No Dermoscopy
Total 10,307
Training 8,246
Testing 1,031
Validation 1,031
BCC – C+H
Others – C
Neural network – CNN
Network architecture
GoogLeNet Inception v3
Binary classification (psoriasis and others)
Multi-class classification
Huang et al. (2021) [41] Taiwanese
Caucasian
KCGMH
Patient no. 1,222
F 52.4% (n = 640)
Average age 62
HAM10000 n/a
Partially
KCGMH – no
HAM10000 – yes
Clinical
Total 1,287
Training 1,031
Testing 128
Validation 128
All – H Neural network – CNN
Network architecture
DenseNet 121 – binary classification
EfficientNet E4 – five class classification
Binary (benign/malignant)
5 class classification (BCC, BK, MM, NV, SCC)
7 class (AK, BCC, BKL, SK, DF, MM, NV)
Minagawa et al. (2021) [42] Caucasian
Japanese
Patient n = 50 Partially
ISIC–yes
Shinshu – no
Dermoscopy
Total 12,948
Training 12,848
Testing 100
Validation NA
Benign – C
Malignant – H
Neural network - DNN
Neural architecture – Inception-ResNet-V2
4 class classification (MM/BCC/MN/BK)
Yang et al. (2021) [43] Chinese n/a No Clinical
Total 12,816
Training 10,414
Testing 300
Validation 2,102
All – C Neural network – DCNN
Neural architecture
DenseNet-96
ResNet-152
ResNet-99
Converged network (DenseNet – ResNet fusion)
6 class classification (Nevi, Melasma, cafe-au-lait, SK, and acquired nevi)

BCC, basal cell carcinoma; BK, benign keratosis; CNN OR DCNN, convolutional neural network; DF, dermatofibroma; SCC, squamous cell carcinoma; SK, seborrheic keratosis; MM, melanoma; MN, melanocytic; PE, pathological examination (insufficient information provided whether histopathology and/or clinical evaluation was used); n/a, not available; n, number; CBIR, content-based image retrieval.

aAlgorithm previously trained on a different dataset; therefore, dataset numbers are not included.

The majority of studies (15/22, 68%) analyzed clinical images (i.e., wide-field or regional images), while seven studies analyzed dermoscopy images [23, 24, 27, 29, 30, 40, 42], and one study included both [44]. All but one study included both malignant and benign pigmented skin lesions, with one investigating only benign pigmented facial lesions [43].

Histopathology was used as the ground truth for all malignant lesions in 15 studies and partially in two studies [24, 26], while one study used histopathology only to resolve clinician disagreements [23]. Seven studies used histopathology as the ground truth for benign lesions [23, 27, 29, 34, 35, 41, 44]. In nine studies, ground truth was established by consensus of experienced dermatologists [25, 30–32, 38–40, 42, 43]. In other studies, ground truth was established by a mix of both [24, 26, 33, 36] or was not clearly defined [28, 37].

The number of pigmented skin lesion classifications used for AI model evaluation ranged from binary outcomes (e.g., benign vs. malignant) to classification of up to 419 skin conditions [39]. While most studies (19/22, 86%) evaluated lesions across all body sites, one study exclusively analyzed the lips/mouth [33], another assessed only facial skin lesions [43], and one study specifically addressed acral melanoma [29].

Population

Homogeneous datasets were collected from Chinese/Taiwanese (n = 8, 36%) [25, 30, 32, 37, 40, 41, 43, 44], Korean (n = 5, 23%) [27–29, 33–35], and Japanese (n = 3, 14%) [31, 38, 42] populations. Seven studies (32%) included Caucasian/Fitzpatrick skin type I–III populations [23, 24, 26, 28, 36, 39, 42], combined with at least 10% American Indian [26], Alaska Native [26], black or African American [26], Pacific Islander [26], Native American [26], or Fitzpatrick IV–VI [23, 39] images in the training and/or test set (Table 2).

The majority of studies did not specify the sex distribution (n = 13, 59%) or participant age (n = 15, 68%). Seven studies included age specification, ranging from 18 to older than 85 years [23, 25, 26, 34–36, 41].

Outcome and Performance

The classification algorithms produced outcomes using either a diagnostic model, a categorical risk model (e.g., low, medium, or high), or a combination of both. An overview of AI model performance is provided in Table 3. The majority of studies (20/22, 91%) used a diagnostic model, either with binary classification of benign or malignant [23–27, 29, 33, 35, 37], multiclass classification of specific lesion diagnoses [28, 30, 32, 39, 42, 43], or both [31, 34, 36, 38, 40, 41, 44]. One study used categorical risk as the outcome [32]. Another study reported both a diagnostic and a categorical risk model [24].
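For readers less familiar with these model types, the difference between a binary and a multiclass diagnostic output typically lies only in the size of the final classification layer attached to a pretrained backbone. The sketch below is a generic, hypothetical example; the backbone choice and class counts are illustrative assumptions, not taken from any study in this review.

```python
# Generic, hypothetical sketch of binary vs. multiclass diagnostic output heads
# on an ImageNet-pretrained backbone; architecture and class counts are
# illustrative assumptions, not reproduced from any study in this review.
import torch
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes: int) -> nn.Module:
    """ResNet-50 backbone with a new classification head of num_classes outputs."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # replace the final layer
    return model

binary_model = build_classifier(num_classes=2)        # e.g., benign vs. malignant
multiclass_model = build_classifier(num_classes=12)   # e.g., 12 specific lesion diagnoses

# Forward pass on a dummy 3-channel 224x224 image batch.
logits = multiclass_model(torch.randn(1, 3, 224, 224))
probs = logits.softmax(dim=1)      # per-class probabilities; top-1 prediction = argmax
```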

Table 3.

Measures of output and performance for AI models included in the review

Reference Accuracy (%) Sensitivity (%) Specificity (%) AUC
Binary classification models
Piccolo et al. (2002) [23] n/a 92 74 n/a
Iyatomi et al. (2008) [24] n/a 86 86 0.93
Chang et al. (2013) [25] 91 86 88 0.95
Chen et al. (2016) [26] 91 90 92 n/a
Yang et al. (2017) [27] 99.7 100 99 n/a
Yu et al. (2018) [29] 82 93 72 0.80
Cho et al. (2020) [33] Accuracy: n/a
Sensitivity – Dataset 1: 76, Dataset 2: 70
Specificity – Dataset 1: 80, Dataset 2: 76
AUC – Dataset 1: 0.83, Dataset 2: 0.77
Huang et al. (2020) [37] 86 n/a n/a 0.92
Han et al. (2020) [35] n/a 77 91 0.91
Fujisawa et al. (2019) [31] 93 96 90 n/a
Jinnai et al. (2019) [38] 92 83 95 n/a
Han et al. (2020) [36] n/a n/a n/a Edinburgh dataset: 0.93
SNU dataset: 0.94
Han et al. (2020) [34] n/a Top 1: 63 Top 1: 90 0.86
Li et al. (2020) [44] 86 75 93 n/a
Wang et al. (2020) [40] 77 n/a n/a n/a
Multiclass classification models
Han et al. (2018) [28] Accuracy: n/a
Sensitivity – ASAN dataset: 86, Edinburgh dataset: 85
Specificity – ASAN dataset: 86, Edinburgh dataset: 81
Zhang et al. (2018) [30] Dataset A: 87
Dataset B: 87
n/a n/a
Fujisawa et al. (2019) [31] 77 n/a n/a
Jinnai et al. (2019) [38] 87 86 87
Liu et al. (2020) (26-class classification model) [39] Accuracy – Top 1: 71, Top 3: 93
Sensitivity – Top 1: 58, Top 3: 88
Specificity: n/a
Han et al. (2020) [36] Top 1
Edinburgh dataset: 57
SNU dataset: 45
Top 3
Edinburgh dataset: 84
SNU dataset: 69
Top 5
Edinburgh dataset: 92
SNU dataset: 78
n/a n/a
Han et al. (2020) [34] Top 1: 43
Top 3: 62
n/a n/a
Li et al. (2020) [44] 73 n/a n/a
Wang et al. (2020) [40] 82 n/a n/a
Minagawa et al. (2021) [42] 90 n/a n/a
Yang et al. (2021) [43] Accuracy – Algorithm A: 88, B: 77, C: 90, D: 87
Sensitivity – Algorithm A: 83, B: 63, C: 81, D: 80
Specificity – Algorithm A: 98, B: 90, C: 99, D: 98
Huang et al. (2021) [41] 5 class (KCGMH dataset): 72
7 class (HAM10000 dataset): 86
n/a n/a
Risk categorical classification
Zhao et al. (2019) [32] Accuracy: 83
Sensitivity – Benign: 93, Low risk: 85, High risk: 86
Specificity – Benign: 88, Low risk: 85, High risk: 91
AUC – Benign: 0.96, Low risk: 0.92, High risk: 0.95

Top: top-n accuracy indicates that the correct diagnosis is among the top n predictions output by the model.

For example, top-3 accuracy means that any of the 3 highest-probability predictions made by the model matches the expected diagnosis.

AUC, area under the curve.
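To make the top-n definition concrete, the toy sketch below computes top-1 and top-2 accuracy from per-class probabilities; the three-class example values are invented and are not taken from any reviewed study.

```python
# Toy illustration of top-n (top-k) accuracy from per-class probabilities;
# the three-class example values are invented, not taken from any reviewed study.
import numpy as np

def top_k_accuracy(y_true: np.ndarray, y_prob: np.ndarray, k: int) -> float:
    """Fraction of samples whose true class is among the k highest-probability predictions."""
    top_k = np.argsort(y_prob, axis=1)[:, -k:]                 # indices of the k largest probabilities
    hits = [y_true[i] in top_k[i] for i in range(len(y_true))]
    return float(np.mean(hits))

y_true = np.array([2, 0, 1])                                   # true class labels
y_prob = np.array([[0.1, 0.3, 0.6],                            # top-1 = class 2 (correct)
                   [0.2, 0.5, 0.3],                            # top-1 = class 1 (wrong); class 0 also not in top 2
                   [0.7, 0.2, 0.1]])                           # top-1 = class 0 (wrong); class 1 is in the top 2

print(top_k_accuracy(y_true, y_prob, k=1))                     # 0.33 – only the first sample is correct
print(top_k_accuracy(y_true, y_prob, k=2))                     # 0.67 – the third sample is also counted
```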

The AI models using binary classification (16/22) reported an accuracy ranging from 70% to 99.7%. Of these studies, 6/16 reported ≥90% accuracy [25–27, 31, 38, 41], three studies reported between 80 and 90% accuracy [29, 37, 44], and one study reported <80% accuracy [40]. Twelve AI models reported sensitivity and specificity as a measure of performance, which ranged from 58 to 100% and 72 to 99%, respectively. Eight studies provided an area under the curve (AUC), with 5/8 reporting values >0.9 [24, 25, 35–37]; the remaining three models scored between 0.77 and 0.86 [29, 33, 34].

For the 13 studies using multiclass output (i.e., >2 diagnoses), accuracy of models ranged from 43% to 93%. Six of these studies (6/13) scored <80% accuracy [31, 34, 36, 39, 41, 44], six others scored between 80 and 90% accuracy [30, 32, 38, 40, 42, 43], and one provided sensitivity and specificity of 86% and 86%, respectively, as a measure of performance [28].

Reader Studies

Reader studies, where the performance of AI models is compared with clinician classification, were performed in 14/22 studies, with results provided in Table 4 [23, 25, 29, 31–39, 42, 44]. Six studies compared AI outcomes to classification by experts, e.g., dermatologists [25, 32, 34, 36, 42, 44]. Eight studies compared outcomes for both experts and non-experts, e.g., dermatology residents and general practitioners [23, 29, 31, 33, 35, 37–39].

Table 4.

Reader studies between AI models and human experts (e.g., dermatologists), and non-experts (e.g., dermatology residents, GPs)

Reference AI performance Expert performance Non-expert performance
Piccolo et al. (2002) [23] Sensitivity: 92%
Specificity: 74%
Sensitivity: 92%
Specificity: 99%
Sensitivity: 69%
Specificity: 94%
Chang et al. (2013) [25] Accuracy Melanoma: 91%
Non-melanoma: 83%
Sensitivity: 86%
Specificity: 88%
Accuracy: 81%
Sensitivity: 83%
Specificity: 86%
Yu et al. (2018) [29] Accuracy: 82%
Sensitivity: 93%
Specificity: 72%
AUC: 0.80
Accuracy: 81%
Sensitivity: 97%
Specificity: 67%
AUC: 0.80
Accuracy: 65%
Sensitivity: 45%
Specificity: 84%
AUC: 0.65
Huang et al. (2020) [37] Sensitivity: 90%
AUC 0.94
Sensitivity: 85%
Specificity: 90%
Sensitivity: 66%
Specificity: 72%
Han et al. (2020) [35] Sensitivity: 89%
Specificity: 78%
AUC: 0.92
Sensitivity: 95%
Specificity: 72%
ROC: 0.91
Accuracy
Dermatology resident: 94%
Non-dermatology clinician: 77%
Sensitivity
Dermatology resident: 69%
Non-dermatology clinician: 65%
AUC
Dermatology resident: 0.88
Non-dermatology clinician: 0.73
Fujisawa et al. (2019) [31] Accuracy
Binary: 92%
Multiclass: 75%
Accuracy
Binary: 85%
Multiclass: 60%
Accuracy
Binary: 74%
Multiclass: 42%
Jinnai et al., (2019) [38] Accuracy: 92%
Sensitivity: 83%
Specificity: 95%
Accuracy: 87%
Sensitivity: 86%
Specificity: 87%
Accuracy: 85%
Sensitivity: 84%
Specificity: 86%
Zhao et al. (2019) [32] Sensitivity
Benign: 90%
Low risk: 90%
High risk: 75%
Sensitivity
Benign: 61%
Low risk: 50%
High risk: 64%
Cho et al. (2020) [33] Sensitivity
Dataset 1: 76%
Dataset 2: 70%
Specificity
Dataset 1: 80%
Dataset 2: 76%
AUC
Dataset 1: 0.83
Dataset 2: 0.77
Sensitivity
-Without algorithm: 90%
-With algorithm: 90%
Specificity
-Without algorithm: 58%
-With algorithm: 61%
Sensitivity
Dermatology resident
-Without algorithm: 80%
-With algorithm: 85%
Non-dermatology clinician
-Without algorithm: 65%
-With algorithm: 74%
Specificity
Dermatology resident
-Without algorithm: 53%
-With algorithm: 71%
Non-dermatology clinician
-Without algorithm: 46%
-With algorithm: 49%
AUC
Dermatology resident
-Without algorithm: 0.33
-With algorithm: 0.42
Non-dermatology clinician
-Without algorithm: 0.11
-With algorithm: 0.23
Han et al. (2020) [36] Multiclass model
Accuracy
Top 1: 45%
Top 3: 69%
Top 5: 78%
Multiclass model
Accuracy without algorithm
Top 1: 50%
Top 3: 67%
Accuracy with algorithm
Top 1: 53%
Top 3: 74%
Binary model
Accuracy
-Without algorithm: 77%
-With algorithm: 85%
Han et al. (2020) [34] Binary model
Sensitivity: 67%
Specificity: 87%
Multiclass accuracy
Top 1: 50%
Top 3: 70%
Binary model
Sensitivity: 66%
Specificity: 67%
Multiclass accuracy
Top 1: 38%
Top 3: 53%
Li et al. (2020) [44] Accuracy
Binary: 73%
Multiclass: 86%
Accuracy
Binary: 83%
Multiclass: 74%
Liu et al. (2020) [39] Accuracy
Top 1: 66%
Top 3: 90%
Accuracy
Top 1: 63%
Top 3: 75%
Accuracy
Primary care physician
Top 1: 44%
Top 3: 60%
Nurse practitioner
Top 1: 40%
Top 3: 55%
Minagawa et al. (2021) [42] Accuracy: 71% Accuracy: 90%

In reader studies comparing binary classification between AI and experts (n = 11), one study reported similar diagnostic accuracy/specificity [29], three showed higher accuracy for AI models [25, 31, 38], and two reported higher accuracy in experts [42, 44]. Five studies reported specificity, sensitivity, and AUC instead of accuracy, with varying outcomes [23, 32, 33, 35, 37]. For reader studies between AI and non-experts (n = 7), AI showed higher accuracy, specificity, sensitivity, and AUC in most studies [23, 29, 31, 33, 35, 37, 38]. In reader studies that compared multiclass classification between AI and expert readers (n = 5), three studies reported higher top-1 diagnosis accuracy (i.e., the diagnosis with the highest statistical probability) for AI [31, 34, 44], in one study readers performed better [36], and the last study had similar results between AI and readers [39]. In multiclass reader studies with non-experts (n = 2), AI showed higher accuracy in both studies [31, 39].

One study compared categorical risk classification with experts and showed AI to have higher sensitivity across all categories [32]. Two studies additionally showed significant increase in accuracy when human experts were aided by AI outcomes [33, 36].

Quality Assessments

Studies included in the review were evaluated against the 25-point CLEAR Derm Consensus Guidelines [22], covering four domains (data, technique, technical assessment, and application). For each checklist item, each study was assessed to determine whether it fully, partially, or did not address the criteria. A summary of results is provided in Table 5, with the individual study scores available in online supplementary eTable 3. Overall, an average score of 17.4 (±2.2) out of a possible 25 points was obtained.

Table 5.

Summary of quality assessment on AI models reviewed

Columns: domain; checklist item; quality assessment, n (%), reported as fully addressed, partially addressed, and not addressed
Data 1 Image types 21 (95) 1 (5) 0 (0)
2 Image artifacts 12 (55) 5 (23) 5 (23)
3 Technical acquisition details 22 (100) 0 (0) 0 (0)
4 Pre-processing procedures 20 (91) 0 (0) 2 (9)
5 Synthetic images made public if used 22 (100)a 0 (0) 0 (0)
6 Public images adequately referenced 22 (100) 0 (0) 0 (0)
7 Patient-level metadata 5 (23) 17 (77) 0 (0)
8 Skin tone information and procedure by which skin tone was assessed 3 (14) 16 (73) 3 (14)
9 Potential biases that may arise from use of patient information and metadata 9 (41) 7 (32) 6 (27)
10 Dataset partitions 12 (55) 9 (41) 1 (5)
11 Sample sizes of training, validation, and test sets 7 (32) 14 (64) 1 (5)
12 External test set 3 (14) 2 (9) 17 (77)
13 Multivendor images 20 (91) 2 (9) 0 (0)
14 Class distribution and balance 5 (23) 15 (68) 2 (9)
15 OOD images 2 (9) 7 (32) 13 (59)
Technique 16 Labeling method (ground truth, who did it) 15 (68) 7 (32) 0 (0)
17 References to common/accepted diagnostic labels 22 (100) 0 (0) 0 (0)
18 Histopathologic review for malignancies 16 (73) 2 (9) 4 (18)
19 Detailed description of algorithm development 14 (64) 6 (27) 2 (10)
Technical Assessment 20 How to publicly evaluate algorithm 5 (23) 0 (0) 17 (77)
21 Performance measures 9 (41) 13 (59) 0 (0)
22 Benchmarking, technical comparison, and novelty 15 (68) 0 (0) 7 (32)
23 Bias assessment 10 (45) 6 (27) 6 (27)
Application 24 Use cases and target conditions (inside distribution) 16 (73) 6 (27) 0 (0)
25 Potential impacts on the healthcare team and patients 3 (14) 13 (59) 6 (27)

aNo studies included synthetic images (checklist item 5), therefore marked as “fully addressed” to not negatively impact quality score.

OOD, out of distribution.

Data

The data domain consists of 15 checklist items; of these, six items were fully addressed by >90% of publications, including checklist items (1) image types, (3) technical acquisition details, (4) pre-processing procedures, and (6) public images adequately referenced. No studies included synthetic images (item 5). Checklist items (2) image artifacts, (9) potential biases, and (10) dataset partitions (e.g., lesion distribution) were only fully addressed by approximately half of the studies. About a third of studies fully addressed the sample size of training, validation, and test sets (item 11) [28, 29, 34, 35, 39, 40, 43]. The most poorly addressed criteria were patient-level metadata (e.g., sex, gender, ethnicity; item 7), skin color information (item 8), using an external test dataset (item 12), class distribution and balance of the images (item 14), and out of distribution images (item 15).

Patient-level metadata (item 7) was partially reported in most studies. While geographical location, hospital location, and race were adequately reported, sex and age specifications were limited, as were anatomical location, relevant past medical history, and history of presenting illness. Skin color information was reported as Fitzpatrick’s skin type in only three studies [23, 39, 40].

Technique

All four checklist items in the technique domain were fully addressed by most publications. Image labeling method (item 16), e.g., information on ground truth, was fully addressed in 68% (n = 15) of publications. All studies used commonly accepted diagnostic labels (item 17). Histopathologic review (item 18) was fully addressed in 73% (n = 16) of studies. A detailed description of algorithm development (item 19) was provided in 64% (n = 14) of studies.

Technical Assessment

Of the four checklist items in this domain, the most poorly addressed was public evaluation of the model (item 20), with only five papers fully addressing this item [29, 34–36, 39] and the majority (n = 17, 77%) not addressing it at all. Four studies reported public availability of their AI algorithm or offered a public-facing test interface for external testing [28, 34–36]. Performance measures (item 21) were fully addressed by nine studies. Benchmarking, technical comparison, and novelty relative to previously developed algorithms and human specialists (item 22) were fully addressed by 15 studies, and technical bias assessment (item 23) was discussed by 10 studies.

Application

The application domain consisted of two checklist items. Use cases and target conditions (item 24) were fully addressed by 16 (73%) studies, while potential impacts on the healthcare team and patients (item 25) were fully addressed by only three studies [38, 39, 41].

Discussion

We present, to our knowledge, the first systematic review summarizing existing AI image-based algorithms for the classification of skin lesions in people with skin of color. Our review identified 22 articles where the algorithms were primarily developed on or tested on images from people with skin of color.

Diversity and Transparency

Within the identified AI studies involving skin of color populations, there is further under-representation of darker skin types. We found that Asian populations (Chinese, Korean, and Japanese) were the major ethnic groups included, with limited datasets involving populations from the Middle East, India, Africa, and the Pacific Islands, and from Hispanic and Native Americans. Only three studies specifically reported skin color categories of Fitzpatrick type IV–VI and/or black or African American. Two North American studies reported only 4.3% [26] and 6.8% [39] of images from black or African American individuals, and one Italian-led study [23] reported 26.5% of images with Fitzpatrick skin type IV/V.

Adding to the issue was a lack of transparency in reporting skin color in image-based AI studies. A recently published scoping review [17] of 70 studies involving image-based AI algorithms for classification of skin conditions found only 14 (20%) provided details about race and seven (10%) provided detail about skin color. Furthermore, only three studies partially included Fitzpatrick skin type IV–VI populations. The lack of diversity in studies is likely fueled by the shortage of diverse publicly available datasets. A recent systematic review of 39 open access datasets and skin atlases reported that 12 of the 14 datasets that reported country of origin were from countries of predominantly European ancestry [16]. Only one dataset from Brazil reported that 5% of their population contained Fitzpatrick skin type IV–VI images [45], and another dataset comprised images from a South Korean population [36].

Model Performance

AI models in this review showed reasonably high classification performance using both dermoscopic and clinical images in populations with skin of color. Accuracy was greater than 70% for all binary models reviewed, which is similar to models developed using Caucasian datasets [46]. It has previously been suggested that instead of training algorithms on more diverse datasets, it may be beneficial to develop separate algorithms for different populations [7]. Han et al. [28] demonstrated a significant difference in performance of an AI model trained using skin images from exclusively Asian populations: it had greater accuracy for classifying basal cell carcinomas and melanomas in Asian populations (96% and 96%, respectively) than in Caucasian populations (90% and 88%, respectively).

There is some evidence that adding out-of-distribution images (i.e., those not originally represented in the training dataset) to a pre-existing dataset and re-training can improve classification accuracy among the newly added group while not affecting accuracy among the original group [22]. Another previously suggested method is to use artificially darkened lesion images in the training dataset [47].
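As a hedged sketch of the augmentation idea suggested in [47] (the exact procedure used there may differ), artificial darkening can be approximated with a simple brightness transform applied to training images; the file names below are hypothetical.

```python
# Hedged sketch of an artificial darkening augmentation using Pillow; this is an
# approximation for illustration only and may differ from the method in [47].
# File names are hypothetical.
import random
from PIL import Image, ImageEnhance

def darken(image: Image.Image, factor_range=(0.5, 0.8)) -> Image.Image:
    """Reduce overall brightness by a random factor (<1.0 darkens the image)."""
    factor = random.uniform(*factor_range)
    return ImageEnhance.Brightness(image).enhance(factor)

# Example usage on a single training image (hypothetical path).
img = Image.open("lesion_example.jpg").convert("RGB")
augmented = darken(img)
augmented.save("lesion_example_darkened.jpg")
```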

Quality Assessments

Our review identified several critical items from the CLEAR Derm checklist that were insufficiently addressed by the majority of AI imaging studies in populations with skin of color. Overall, we found inconsistent description of image artifacts, a lack of skin color information and other patient-level metadata, missing rationale for dataset sizes, and infrequent inclusion of external test sets. Image artifacts have been previously shown to affect algorithm performance in both dermoscopic images [48, 49] and clinical photographs [50]. Unclear descriptions of image pre-processing can contribute to incorrect diagnosis, as subtle alterations in color balance or rotation can result in melanoma being incorrectly classified as a benign nevus [49]. Furthermore, less than half of the studies reviewed reported anatomical locations of imaged lesions. The body site of a lesion can be informative; for example, skin of color populations have a high prevalence of acral melanomas, which are commonly found on the palms, soles, subungual region, and mucous membranes [51]. Future studies should consider adopting established consensus on image acquisition metrics [52] or dermatology-specific guidelines [22] to standardize images and develop robust archives of skin images used for research.

A major concern is that most AI algorithms have only been tested on internal datasets, i.e., those sampled from the same or similar populations. This issue was similarly highlighted in a recent review [18]. AI algorithms are prone to overfitting, which can result in inflated accuracy measures when only tested on internal datasets and a significant performance drop when tested with images from a different source [9]. The lack of standardized performance measures, publicly available diverse datasets, test interfaces, and reader studies presents a barrier to comparability and reproducibility of AI algorithms. An alternative solution is for algorithms to be evaluated on standardized public test datasets, with transparency on algorithm performance [22]. Testing models on external datasets and/or out-of-distribution images is a better indicator of AI model performance in a real-world setting. Given that public dermoscopic datasets with diverse skin colors are now available, future algorithms should include external testing as the gold-standard evaluation [53].

Lastly, only a few of the included studies addressed the potential impacts on the healthcare team and patients. Future studies would benefit from considering the clinical translation of AI models during the early stages of development and from evaluating the clinical utility of AI models, alongside their performance, in conjunction with their intended users [36].

Limitations and Outlook

Our systematic review is not without limitations. First, our search was limited to articles published in English. This is particularly of note for populations with skin of color, where English may often not be the primary language. Additionally, as many studies did not report on skin of color status, inclusion was based on assumptions using ethnicity and geographical location. There can be significant variability in skin color even for a population of the same race or geographical location [54]. Reporting bias may also influence the review, as higher performing algorithms are more likely to be reported and accepted for publication.

Conclusion

AI algorithms have great potential to complement skin cancer screening and diagnosis of pigmented lesions, leading to improved dermatology workflows and patient outcomes. The performance and reliability of an image-based AI algorithm depend significantly on the quality of the data on which it is trained and tested. While this review shows promising development of AI algorithms for skin of color in East Asian populations, there are still significant discrepancies in the number of algorithms developed for populations with skin of color, particularly darker skin types. Without inclusion of images from populations with skin of color, the incorporation of AI models into clinical practice could lead to missed and delayed diagnoses of neoplasms in people with skin of color, further widening existing health outcome disparities [55].

Statement of Ethics

An ethics statement is not applicable because this study is based exclusively on published literature.

Conflict of Interest Statement

H. Peter Soyer is a shareholder of MoleMap NZ Ltd. and e-derm-consult GmbH and undertakes regular teledermatological reporting for both companies. H. Peter Soyer is a medical consultant for Canfield Scientific Inc., MoleMap Australia Pty Ltd., and Blaze Bioscience Inc. and a medical advisor for First Derm. The remaining authors have no conflicts of interests to declare.

Funding Sources

H. Peter Soyer holds an NHMRC MRFF Next Generation Clinical Researchers Program Practitioner Fellowship (APP1137127). Clare Primiero is supported by an Australian Government Research Training Program Scholarship.

Author Contributions

Yuyang Liu: conceptualization (equal), data curation (lead), formal analysis (equal), methodology (lead), validation (equal), and writing – original draft (lead). Clare Primiero: conceptualization (equal), funding acquisition (equal), methodology (equal), supervision (equal), and writing – review and editing (equal). Vishnutheertha Kulkarni: data curation (supporting), formal analysis (supporting), validation (equal), and writing – review and editing (supporting). H. Peter Soyer: conceptualization (supporting), funding acquisition (equal), supervision (supporting), and writing – review and editing (supporting). Brigid Betz-Stablein: conceptualization (equal), data curation (supporting), formal analysis (supporting), methodology (equal), supervision (equal), and writing – review and editing (equal).

Data Availability Statement

This review is based exclusively on published literature. No publicly available datasets were used and no data were generated. Further inquiries can be directed to the corresponding author.


References

  • 1. Cormier JN, Xing Y, Ding M, Lee JE, Mansfield PF, Gershenwald JE, et al. Ethnic differences among patients with cutaneous melanoma. Arch Intern Med. 2006;166(17):1907–14. 10.1001/archinte.166.17.1907. [DOI] [PubMed] [Google Scholar]
  • 2. Jackson C, Maibach H. Ethnic and socioeconomic disparities in dermatology. J Dermatolog Treat. 2016;27(3):290–1. 10.3109/09546634.2015.1101409. [DOI] [PubMed] [Google Scholar]
  • 3. Howlader N, Noone AM, Krapcho M, Miller D, Brest A, Yu M, Ruhl J, et al. SEER cancer statistics review, 1975–2018. Bethesda, MD: National Cancer Institute; 2021. [Google Scholar]
  • 4. Dawes SM, Tsai S, Gittleman H, Barnholtz-Sloan JS, Bordeaux JS. Racial disparities in melanoma survival. J Am Acad Dermatol. 2016;75(5):983–91. 10.1016/j.jaad.2016.06.006. [DOI] [PubMed] [Google Scholar]
  • 5. Adamson AS. Should we refer to skin as “ethnic”? J Am Acad Dermatol. 2017;76(6):1224–5. 10.1016/j.jaad.2017.01.028. [DOI] [PubMed] [Google Scholar]
  • 6. Diao JA, Adamson AS. Representation and misdiagnosis of dark skin in a large-scale visual diagnostic challenge. J Am Acad Dermatol. 2022;86(4):950–1. 10.1016/j.jaad.2021.03.088. [DOI] [PubMed] [Google Scholar]
  • 7. Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol. 2018;154(11):1247–8. 10.1001/jamadermatol.2018.2348. [DOI] [PubMed] [Google Scholar]
  • 8. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8. 10.1038/nature21056. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Tschandl P, Codella N, Akay BN, Argenziano G, Braun RP, Cabo H, et al. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. Lancet Oncol. 2019;20(7):938–47. 10.1016/S1470-2045(19)30333-X. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Tschandl P, Rinner C, Apalla Z, Argenziano G, Codella N, Halpern A, et al. Human-computer collaboration for skin cancer recognition. Nat Med. 2020;26(8):1229–34. 10.1038/s41591-020-0942-0. [DOI] [PubMed] [Google Scholar]
  • 11. Rotemberg V, Kurtansky N, Betz-Stablein B, Caffery L, Chousakos E, Codella N, et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci Data. 2021;8(1):34. 10.1038/s41597-021-00815-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018;5:180161. 10.1038/sdata.2018.161. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Mendonca T, Ferreira PM, Marques JS, Marcal ARS, Rozeira J. PH2: a dermoscopic image database for research and benchmarking. Annu Int Conf IEEE Eng Med Biol Soc. 2013;2013:5437–40. 10.1109/EMBC.2013.6610779. [DOI] [PubMed] [Google Scholar]
  • 14. Giotis I, Molders N, Land S, Biehl M, Jonkman MF, Petkov N. MED-NODE: a computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Syst Appl. 2015;42(19):6578–85. 10.1016/j.eswa.2015.04.034. [DOI] [Google Scholar]
  • 15. de Faria SMM, Henrique M, Filipe JN, Pereira PMM, Tavora LMN, Assuncao PAA, et al. Light field image dataset of skin lesions. Annu Int Conf IEEE Eng Med Biol Soc. 2019;2019:3905–8. 10.1109/EMBC.2019.8856578. [DOI] [PubMed] [Google Scholar]
  • 16. Wen D, Khan SM, Ji Xu A, Ibrahim H, Smith L, Caballero J, et al. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit Health. 2022;4(1):e64–74. 10.1016/S2589-7500(21)00252-1. [DOI] [PubMed] [Google Scholar]
  • 17. Daneshjou R, Smith MP, Sun MD, Rotemberg V, Zou J. Lack of transparency and potential bias in artificial intelligence data sets and algorithms: a scoping review. JAMA Dermatol. 2021;157(11):1362–9. 10.1001/jamadermatol.2021.3129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Haggenmuller S, Maron RC, Hekler A, Utikal JS, Barata C, Barnhill RL, et al. Skin cancer classification via convolutional neural networks: systematic review of studies involving human experts. Eur J Cancer. 2021;156:202–16. 10.1016/j.ejca.2021.06.049. [DOI] [PubMed] [Google Scholar]
  • 19. Thomsen K, Iversen L, Titlestad TL, Winther O. Systematic review of machine learning for diagnosis and prognosis in dermatology. J Dermatolog Treat. 2020;31(5):496–510. 10.1080/09546634.2019.1682500. [DOI] [PubMed] [Google Scholar]
  • 20. Freeman K, Dinnes J, Chuchu N, Takwoingi Y, Bayliss SE, Matin RN, et al. Algorithm based smartphone apps to assess risk of skin cancer in adults: systematic review of diagnostic accuracy studies. BMJ. 2020;368:m127. 10.1136/bmj.m127. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: explanation and elaboration. Bmj. 2009;339:b2700. 10.1136/bmj.b2700. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Daneshjou R, Barata C, Betz-Stablein B, Celebi ME, Codella N, Combalia M, et al. Checklist for evaluation of image-based artificial intelligence reports in dermatology: CLEAR derm consensus guidelines from the international skin imaging collaboration artificial intelligence working group. JAMA Dermatol. 2022;158(1):90–6. 10.1001/jamadermatol.2021.4915. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Piccolo D, Ferrari A, Peris K, Diadone R, Ruggeri B, Chimenti S. Dermoscopic diagnosis by a trained clinician vs. a clinician with minimal dermoscopy training vs. computer-aided diagnosis of 341 pigmented skin lesions: a comparative study. Br J Dermatol. 2002;147(3):481–6. 10.1046/j.1365-2133.2002.04978.x. [DOI] [PubMed] [Google Scholar]
  • 24. Iyatomi H, Oka H, Celebi ME, Hashimoto M, Hagiwara M, Tanaka M, et al. An improved Internet-based melanoma screening system with dermatologist-like tumor area extraction algorithm. Comput Med Imaging Graph. 2008;32(7):566–79. 10.1016/j.compmedimag.2008.06.005. [DOI] [PubMed] [Google Scholar]
  • 25. Chang WY, Huang A, Yang CY, Lee CH, Chen YC, Wu TY, et al. Computer-aided diagnosis of skin lesions using conventional digital photography: a reliability and feasibility study. PLoS One. 2013;8(11):e76212. 10.1371/journal.pone.0076212. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Chen RH, Snorrason M, Enger SM, Mostafa E, Ko JM, Aoki V, et al. Validation of a skin-lesion image-matching algorithm based on computer vision technology. Telemed J E Health. 2016;22(1):45–50. 10.1089/tmj.2014.0249. [DOI] [PubMed] [Google Scholar]
  • 27. Yang S, Oh B, Hahm S, Chung KY, Lee BU. Ridge and furrow pattern classification for acral lentiginous melanoma using dermoscopic images. Biomed Signal Process Control. 2017;32:90–6. 10.1016/j.bspc.2016.09.019. [DOI] [Google Scholar]
  • 28. Han SS, Kim MS, Lim W, Park GH, Park I, Chang SE. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J Invest Dermatol. 2018;138(7):1529–38. 10.1016/j.jid.2018.01.028. [DOI] [PubMed] [Google Scholar]
  • 29. Yu C, Yang S, Kim W, Jung J, Chung K-Y, Lee SW, et al. Acral melanoma detection using a convolutional neural network for dermoscopy images. PLoS One. 2018;13(3):e0193321. 10.1371/journal.pone.0193321. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Zhang X, Wang S, Liu J, Tao C. Towards improving diagnosis of skin diseases by combining deep neural network and human knowledge. BMC Med Inform Decis Mak. 2018;18(Suppl 2):59. 10.1186/s12911-018-0631-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Fujisawa Y, Otomo Y, Ogata Y, Nakamura Y, Fujita R, Ishitsuka Y, et al. Deep-learning-based, computer-aided classifier developed with a small dataset of clinical images surpasses board-certified dermatologists in skin tumour diagnosis. Br J Dermatol. 2019;180(2):373–81. 10.1111/bjd.16924. [DOI] [PubMed] [Google Scholar]
  • 32. Zhao XY, Wu X, Li FF, Li Y, Huang WH, Huang K, et al. The application of deep learning in the risk grading of skin tumors for patients using clinical images. J Med Syst. 2019;43(8):283. 10.1007/s10916-019-1414-2. [DOI] [PubMed] [Google Scholar]
  • 33. Cho SI, Sun S, Mun JH, Kim C, Kim SY, Cho S, et al. Dermatologist-level classification of malignant lip diseases using a deep convolutional neural network. Br J Dermatol. 2020;182(6):1388–94. 10.1111/bjd.18459. [DOI] [PubMed] [Google Scholar]
  • 34. Han SS, Moon IJ, Kim SH, Na J-I, Kim MS, Park GH, et al. Assessment of deep neural networks for the diagnosis of benign and malignant skin neoplasms in comparison with dermatologists: a retrospective validation study. PLoS Med. 2020;17(11):e1003381. 10.1371/journal.pmed.1003381. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Han SS, Moon IJ, Lim W, Suh IS, Lee SY, Na JI, et al. Keratinocytic skin cancer detection on the face using region-based convolutional neural network. JAMA Dermatol. 2020;156(1):29–37. 10.1001/jamadermatol.2019.3807. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Han SS, Park I, Eun Chang S, Lim W, Kim MS, Park GH, et al. Augmented intelligence dermatology: deep neural networks empower medical professionals in diagnosing skin cancer and predicting treatment options for 134 skin disorders. J Invest Dermatol. 2020;140(9):1753–61. 10.1016/j.jid.2020.01.019. [DOI] [PubMed] [Google Scholar]
  • 37. Huang K, He X, Jin Z, Wu L, Zhao X, Wu Z, et al. Assistant diagnosis of basal cell carcinoma and seborrheic keratosis in Chinese population using convolutional neural network. J Healthc Eng. 2020;2020:1713904. 10.1155/2020/1713904. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Jinnai S, Yamazaki N, Hirano Y, Sugawara Y, Ohe Y, Hamamoto R. The development of a skin cancer classification system for pigmented skin lesions using deep learning. Biomolecules. 2020;10(8):1123. 10.3390/biom10081123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Liu Y, Jain A, Eng C, Way DH, Lee K, Bui P, et al. A deep learning system for differential diagnosis of skin diseases. Nat Med. 2020;26(6):900–8. 10.1038/s41591-020-0842-3. [DOI] [PubMed] [Google Scholar]
  • 40. Wang S-Q, Zhang X-Y, Liu J, Tao C, Zhu C-Y, Shu C, et al. Deep learning-based, computer-aided classifier developed with dermoscopic images shows comparable performance to 164 dermatologists in cutaneous disease diagnosis in the Chinese population. Chin Med J. 2020;133(17):2027–36. 10.1097/CM9.0000000000001023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Huang HW, Hsu BWY, Lee CH, Tseng VS. Development of a light-weight deep learning model for cloud applications and remote diagnosis of skin cancers. J Dermatol. 2021;48(3):310–6. 10.1111/1346-8138.15683. [DOI] [PubMed] [Google Scholar]
  • 42. Minagawa A, Koga H, Sano T, Matsunaga K, Teshima Y, Hamada A, et al. Dermoscopic diagnostic performance of Japanese dermatologists for skin tumors differs by patient origin: a deep learning convolutional neural network closes the gap. J Dermatol. 2021;48(2):232–6. 10.1111/1346-8138.15640. [DOI] [PubMed] [Google Scholar]
  • 43. Yang Y, Ge Y, Guo L, Wu Q, Peng L, Zhang E, et al. Development and validation of two artificial intelligence models for diagnosing benign, pigmented facial skin lesions. Skin Res Technol. 2021;27(1):74–9. 10.1111/srt.12911. [DOI] [PubMed] [Google Scholar]
  • 44. Li CX, Fei WM, Shen CB, Wang ZY, Jing Y, Meng RS, et al. Diagnostic capacity of skin tumor artificial intelligence-assisted decision-making software in real-world clinical settings. Chin Med J. 2020;133(17):2020–6. 10.1097/CM9.0000000000001002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Pacheco AGC, Lima GR, Salomão AS, Krohling B, Biral IP, de Angelo GG, et al. PAD-UFES-20: a skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Brief. 2020;32:106221. 10.1016/j.dib.2020.106221. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Takiddin A, Schneider J, Yang Y, Abd-Alrazaq A, Househ M. Artificial intelligence for skin cancer detection: scoping review. J Med Internet Res. 2021;23(11):e22934. 10.2196/22934. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Aggarwal P, Papay FA. Artificial intelligence image recognition of melanoma and basal cell carcinoma in racially diverse populations. J Dermatolog Treat. 2021:1–6. [DOI] [PubMed] [Google Scholar]
  • 48. Winkler JK, Fink C, Toberer F, Enk A, Deinlein T, Hofmann-Wellenhof R, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 2019;155(10):1135–41. 10.1001/jamadermatol.2019.1735. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Du-Harpur X, Arthurs C, Ganier C, Woolf R, Laftah Z, Lakhan M, et al. Clinically relevant vulnerabilities of deep machine learning systems for skin cancer diagnosis. J Invest Dermatol. 2021;141(4):916–20. 10.1016/j.jid.2020.07.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Phillips M, Marsden H, Jaffe W, Matin RN, Wali GN, Greenhalgh J, et al. Assessment of accuracy of an artificial intelligence algorithm to detect melanoma in images of skin lesions. JAMA Netw Open. 2019;2(10):e1913436. 10.1001/jamanetworkopen.2019.13436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Bellew S, Del Rosso JQ, Kim GK. Skin cancer in asians: part 2: melanoma. J Clin Aesthet Dermatol. 2009;2(10):34–6. [PMC free article] [PubMed] [Google Scholar]
  • 52. Katragadda C, Finnane A, Soyer HP, Marghoob AA, Halpern A, Malvehy J, et al. Technique standards for skin lesion imaging: a delphi consensus statement. JAMA Dermatol. 2017;153(2):207–13. 10.1001/jamadermatol.2016.3949. [DOI] [PubMed] [Google Scholar]
  • 53. Daneshjou R, Vodrahalli K, Liang W, Novoa R, Jenkins M, Rotemberg V, et al. Disparities in dermatology AI: assessments using diverse clinical images. 2021. [Google Scholar]
  • 54. Wei L, Xuemin W, Wei L, Li L, Ping Z, Yanyu W, et al. Skin color measurement in Chinese female population: analysis of 407 cases from 4 major cities of China. Int J Dermatol. 2007;46(8):835–9. 10.1111/j.1365-4632.2007.03192.x. [DOI] [PubMed] [Google Scholar]
  • 55. Adamson AS, Essien U, Ewing A, Daneshjou R, Hughes-Halbert C, Ojikutu B, et al. Diversity, race, and health. Med. 2021;2(1):6–10. 10.1016/j.medj.2020.12.020. [DOI] [PubMed] [Google Scholar]
  • 56. International Skin Imaging Collaboration. [Available from: https://www.isic-archive.com/#!/topWithHeader/onlyHeaderTop/gallery].
  • 57. Argenziano G, Soyer HP, De Giorgio V, et al. Interactive atlas of dermoscopy. 2000. [Google Scholar]
  • 58. Codella NCF, Gutman D, Celebi ME, Helba B, Marchetti MA, Dusza SW, et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 International symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). IEEE 15th Int Symposium Biomed Imaging (ISBI 2018). 2018:168–72. 10.1109/ISBI.2018.8363547. [DOI] [Google Scholar]
  • 59. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, et al. ImageNet large scale visual recognition challenge. IJCV. 2015. [Google Scholar]
  • 60. DermQuest. [Online]. (2012). [Available from: http://www.dermquest.com].
  • 61. Dermatology information system. [Online]. (2012). [Available from: http://www.dermis.net]. [Google Scholar]
  • 62. Alves J, Moreira D, Alves P, Rosado L, Vasconcelos MJM. Automatic focus assessment on dermoscopic images acquired with smartphones. Sensors. 2019;19(22):4957. 10.3390/s19224957. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. DermNet NZ. [Online]. [Available from: https://dermnetnz.org/].
