Abstract
Background
While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at later stages and have a poorer prognosis. The use of artificial intelligence (AI) models can potentially improve early detection of skin cancers; however, the lack of skin color diversity in training datasets may only widen the pre-existing racial discrepancies in dermatology.
Objective
The aim of this study was to systematically review the technique, quality, accuracy, and implications of studies using AI models trained or tested in populations with skin of color for classification of pigmented skin lesions.
Methods
PubMed was used to identify any studies describing AI models for classification of pigmented skin lesions. Only studies that used training datasets with at least 10% of images from people with skin of color were eligible. Outcomes on study population, design of AI model, accuracy, and quality of the studies were reviewed.
Results
Twenty-two eligible articles were identified. The majority of studies were trained on datasets obtained from Chinese (7/22), Korean (5/22), and Japanese populations (3/22). Seven studies used diverse datasets containing Fitzpatrick skin type I–III in combination with at least 10% from black Americans, Native Americans, Pacific Islanders, or Fitzpatrick IV–VI. AI models producing binary outcomes (e.g., benign vs. malignant) reported an accuracy ranging from 70% to 99.7%. Accuracy of AI models reporting multiclass outcomes (e.g., specific lesion diagnosis) was lower, ranging from 43% to 93%. Reader studies, where dermatologists’ classification is compared with AI model outcomes, reported similar accuracy in one study, higher AI accuracy in three studies, and higher clinician accuracy in two studies. A quality review revealed that dataset description and variety, benchmarking, public evaluation, and healthcare application were frequently not addressed.
Conclusions
While this review provides promising evidence of accurate AI models in populations with skin of color, the majority of the studies reviewed were obtained from East Asian populations and therefore provide insufficient evidence to comment on the overall accuracy of AI models for darker skin types. Large discrepancies remain in the number of AI models developed in populations with skin of color (particularly Fitzpatrick type IV–VI) compared with those of largely European ancestry. A lack of publicly available datasets from diverse populations is likely a contributing factor, as is the inadequate reporting of patient-level metadata relating to skin color in training datasets.
Keywords: Artificial intelligence, Skin of color, Race, Skin cancers
Introduction
Skin cancer is the most common malignancy worldwide, with melanoma representing the deadliest form. While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at a later stage and have a poorer prognosis compared with Caucasian populations [1–3]. Even when diagnosed at the same stage, Hispanic, Native, Asian, and African Americans have significantly shorter survival times than Caucasian Americans (p < 0.05) [4]. Skin cancers in people with skin of color often present differently from those in people with Caucasian skin and are often underrepresented in dermatology training [5, 6].
The use of artificial intelligence (AI) algorithms for image analysis and detection of skin cancer has the potential to decrease healthcare disparities by removing unintended clinician bias and improving accessibility and affordability [7]. Skin lesion classification by AI algorithms to date has performed equivalently to [8] and, in some cases, better than dermatologists [9]. Human-computer collaboration can increase diagnostic accuracy further [10]. However, most AI advances have used homogeneous datasets [11–15] collected from countries with predominantly European ancestry [16]. Exclusion of skin of color in training datasets poses the risk of incorrect diagnosis or missing skin cancers entirely [8] and risks widening racial disparities that already exist in dermatology [8, 17].
While multiple reviews have compared AI-based model performances for skin cancer detection [18–20], the use of AI in populations with skin of color has not been evaluated. The objective of this study was to systematically review the current literature for AI models for classification of pigmented skin lesion images in populations with skin of color.
Methods
Literature Search
The systematic review follows the PRISMA guidelines [21]. A protocol was registered with PROSPERO (International Prospective Register of Systematic Reviews) and can be accessed at https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281347.
A PubMed search in March 2021 used search terms relating to artificial intelligence, skin cancer, and skin lesions (search strings in online suppl. eTable; for all online suppl. material, see www.karger.com/doi/10.1159/000530225). No date range was applied, language was restricted to English, and only original research was included. Covidence software was used for screening administration. Search results were screened by reviewing titles/abstracts by two independent reviewers (Y.L. 100% and B.B.S. 20%) using eligibility criteria described in Table 1. Remaining articles were assessed for eligibility by reviewing methods or full text. Disagreements were resolved following discussions with a third independent reviewer (C.P.).
Table 1.
Inclusion criteria | Exclusion criteria |
---|---|
1. Any computer modeling or use of AI for diagnosis of skin conditions | 1. No population description of the training datasets (demographic, racial, or ethnicity breakdown), or dataset from a country with a predominantly Caucasian population of European ancestry |
2. Dataset provides information on the population (racial or Fitzpatrick skin type breakdown), or dataset obtained from a country with a predominantly skin of color population | 2. Dataset description with >90% Caucasian population, fair skin type, or Fitzpatrick skin type I–III |
3. Uses dermoscopic, clinical, 3D, or other photographic images of the skin surface | 3. Solely used images from ISIC [56], PH2 [13], IAD [57], ISBI [58], HAM10000 [12], MED-NODE [14], ILSVRC [59], DermQuest [60], DERMIS [61], DERM IQA [62], DermNet NZ [63], or other datasets known to be of predominantly European ancestry |
4. Includes the assessment of malignant and/or non-malignant pigmented skin lesions | |
DERM IQA, Dermoscopic Image Quality Assessment [62]; DERMIS, dermatology information system; ISIC, International Skin Imaging Collaboration [56]; DermNet NZ, DermNet New Zealand [63]; IAD, Interactive Atlas of Dermoscopy [57]; ISBI, International Symposium on Biomedical Imaging [58]; ILSVRC, ImageNet Large Scale Visual Recognition Challenge [59]; HAM10000, Human Against Machine with 10,000 training images [12]; MED-NODE, computer-assisted MElanoma Diagnosis from NOn-DErmoscopic images [14].
Data Extraction and Synthesis
Data extraction was performed by author Y.L. using a standardized form and confirmed by V.K. The following parameters were recorded: reference, ethnicity/ancestry/race, lesion number, sex, age, location, skin condition, public availability of dataset, number of images, type of images, methods of confirmation, deep learning system, model output, comparison with human input, and any missing data reported. Algorithm performance measures were recorded as accuracy, sensitivity, specificity, and/or area under the receiver operating characteristic curve (AUC). A narrative synthesis of the extracted data was used to present findings, as a meta-analysis was not feasible owing to heterogeneity of study designs, AI systems, skin lesions, and outcomes.
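To make these performance measures concrete, the following minimal sketch (not drawn from any reviewed study; scikit-learn and NumPy are assumed, and the label and score arrays are hypothetical) shows how accuracy, sensitivity, specificity, and AUC are derived from a binary classifier's output.

```python
# Minimal sketch (not from any reviewed study) of how the four performance
# measures extracted in this review are derived from a binary classifier's
# output. Assumes scikit-learn and NumPy; labels and scores are hypothetical.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = malignant, 0 = benign
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])  # model-predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                          # binary call at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # true-positive rate (malignant lesions correctly flagged)
specificity = tn / (tn + fp)          # true-negative rate (benign lesions correctly cleared)
auc = roc_auc_score(y_true, y_score)  # area under the receiver operating characteristic curve

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} AUC={auc:.2f}")
```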
Quality Assessment
Quality was assessed using the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology (CLEAR Derm) Consensus Guidelines [22]. This 25-point checklist offers comprehensive recommendations on factors critical to the development, performance, and application of image-based AI algorithms in dermatology [22].
Author Y.L. performed the quality assessment of all included studies, and author B.B.S. assessed 20%. The inter-rater agreement rate was 87%, with disagreements resolved via a third independent reviewer (V.K.). Each criterion was evaluated as fully, partially, or not addressed and scored 1, 0.5, or 0, respectively, using the scoring rubric in online supplementary eTable 2.
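As a concrete illustration of this rubric, the short sketch below sums the per-item points to give a study-level CLEAR Derm score out of 25; the example ratings are hypothetical and do not correspond to any included study.

```python
# Minimal sketch of the scoring arithmetic described above: each of the 25
# CLEAR Derm checklist items is rated fully (1), partially (0.5), or not (0)
# addressed, and the study-level score is the sum. Example ratings are hypothetical.
RATING_POINTS = {"fully": 1.0, "partially": 0.5, "not": 0.0}

def clear_derm_score(item_ratings: dict) -> float:
    """Sum rubric points over checklist items 1-25 for a single study."""
    assert set(item_ratings) == set(range(1, 26)), "all 25 items must be rated"
    return sum(RATING_POINTS[rating] for rating in item_ratings.values())

# Hypothetical study: items 1-15 fully, 16-20 partially, 21-25 not addressed.
ratings = {i: "fully" for i in range(1, 16)}
ratings.update({i: "partially" for i in range(16, 21)})
ratings.update({i: "not" for i in range(21, 26)})
print(clear_derm_score(ratings))  # 15*1 + 5*0.5 + 5*0 = 17.5 out of 25
```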
Results
The database search identified 993 articles, including 13 duplicates. After screening titles/abstracts, 535 records were excluded, and the remaining 445 records were screened by methods, with 63 articles reviewed by full text. Forward and backward citations search revealed no additional articles. A total of 22 studies were included in the final review (PRISMA flow diagram in online supplementary eFig. 1).
Study Design
All 22 studies were published between 2002 and 2021 [23–44], with 11 (50%) published between 2020 and 2021 [33–44]. An overview of study characteristics is displayed in Table 2. The median number of total images used in each study for all datasets combined was 5,846 (range: 212–185,192). The median dataset size for training, testing, and validation was 4,732 images (range: 247–22,608), 362 (range: 100–40,331), and 1,258 (range: 109–14,883), respectively.
Table 2.
First author, year | Patient population (ethnicity/ancestry/race/location; dataset information) | Public availability of dataset | Image type/image no. | Validation (H = histology, C = clinical diagnosis) | Deep learning system | Model output |
---|---|---|---|---|---|---|
Piccolo et al. (2002) [23] | Fitzpatrick I–V | Lesion n = 341 Patient n = 289 F 65% (n = 188) Average age 33.6 |
No | Dermoscopy Total 341 Training 0 Testing 341 Validation 0 |
All – H | Network architecture DEM-MIPS (artificial neural network designed to evaluate different colorimetric and geometric parameters) | Binary (melanoma, non-melanoma) |
Iyatomi et al. (2008) [24] | Italian Austrian Japanese |
n/a | No | Dermoscopy Total 1,258 Training 247 Testing NA Validation 1,258 |
Dataset A and B– H Dataset C – H + C |
Network architecture ANN (back-propagation artificial neural networks) |
Binary (malignant or benign) Malignancy (risk) score (0–100) |
Chang et al. (2013) [25] | Taiwanese | Lesion n = 769 Patient n = 676 F 56% (n = 380) Average age 47.6 |
No | Clinical Total 1,899 Training NA Testing NA Validation NA |
Benign – C Malignant – H |
Network architecture Computer-aided diagnosis (CADx) system built on 91 conventional features of shape, texture, and colors (developing software – MATLAB) |
Binary (benign or malignant) |
Chen et al. (2016) [26] | American Indian, Alaska Native, Asian, Pacific Islander, black or African American, Caucasian | Community dataset Patient n = 1,900 F 52.3% (n = 993) Age >50% under 35 |
Partially DermNet NZ – Yes Community – No |
Clinical Total 12,000 Training 11,780 Testing 337 Validation NA |
Community dataset (benign and malignant) – C DermNet – H |
Network architecture Patented image-search algorithm that builds on proven computer vision methods from the field of CBIR |
Binary (melanoma and non-melanoma) |
Yang et al. (2017) [27] | Korean | Patient n = 110 F 50% (n = 55) |
No | Dermoscopy Total 297 Training 0 Testing 297 Validation 0 |
All – H | Network architecture 3 stage algorithm, pre-processing, stripe pattern detection and automatic discrimination (MATLAB) |
Binary (LM, nevus) |
Han et al. (2018) [28] | Korean Caucasian |
n/a | Partially MED-NODE, Atlas, Edinburgh, Dermofit – yes Others – no |
Clinical Total 182,044 Training 178,875 Testing 1,276 Validation 22,728 |
ASAN – C+ H Multiple other dataset (5) used with unclear validation |
Neural network – CNN Network Architecture Microsoft ResNet – 152 Google Inception |
12-class skin tumor types |
Yu et al. (2018) [29] | Korean | Lesion n = 275 | No | Dermoscopy Total 724 Training 372 Testing 362 Validation 109 |
All – H | Neural network – CNN Network architecture modified VGG – 16 |
Binary (melanoma/non-melanoma) |
Zhang et al. (2018) [30] | Chinese | Lesion n = 1,067 | No | Dermoscopy Total 1,067 Training 4,867 Testing 1,142 Validation NA |
Benign and malignant – C Three dermatologists disagree – H |
Neural network – CNN Network architecture GoogLeNet Inception v3 |
Four class classifier (BCC, SK, melanocytic nevus, and psoriasis) |
Fujisawa et al. (2019) [31] | Japanese | Patients n = 2,296 | No | Clinical Total 6,009 Training 4,867 Testing 1,142 Validation NA |
Melanocytic nevus, split nevus, lentigo simplex – C Others – PE |
Neural network – DCNN Network architecture GoogLeNet DCNN model |
1. Two class classifier (benign vs. malignant) 2. Four class classifier (malignant epithelial lesion, malignant melanocytic lesion, benign epithelial lesion, benign melanocytic lesion) 3. 14 class classification 4. 21 class classification |
Jinnai et al. (2019) [38] | Japanese | Patient n = 3,551 | No | Dermoscopy Total 5,846 Training 4,732 Testing 666 Validation NA |
Malignant – H Benign tumor – C |
Neural network – FRCNN Network architecture – VGG-16 |
Binary (benign/malignant) Six-class classifications (6 skin conditions) |
Zhao et al. (2019) [32] | Chinese | n/a | No | Clinical Total 4,500 Training 3,375 Testing 1,125 Validation NA |
Benign – C Malignant – PE |
Neural network – CNN Network architecture – Xception |
Risk (low/high/dangerous) |
Cho et al. (2020) [33] | Korean | Patient n = 404 | No | Clinical Total 2,254 Training 1,629 Testing 625 Validation NA |
Benign – C Malignant – H |
Neural network – DCNN Network architecture – Inception-ResNet-V2 |
Binary classification (benign or malignant) |
Han et al. (2020) [36] | Korean Caucasian |
ASAN, Normal, SNU Patient n = 28,222 F 55% (n = 15,522) Average age 40 MED-NODE, Web, Edinburgh NA |
Partially Edinburgh – Yes SNU – upon request |
Clinical Total 224,181 Training 220,680 Testing 2,441 Validation 3,501 |
ASAN – C+ H Edinburgh – H Med-Node – H SNU – C + H Web – image finding |
Neural network – CNN Network architecture SENet Se-ResNet-50 Visual geometry group (VGG-19) |
Binary (malignant, non-malignant) Binary (steroids, antibiotics, antivirals, antifungals) Multiple class classification (134 skin disorders) |
Han et al. (2020) [35] | Korean | Patients n = 673 Lesions n = 673 F 54% (n = 363) Average age 58 |
No | Clinical Total 185,192 Training 182,348 Testing NA Validation 2,844 |
All – H | Neural network – CNN Network architecture SENet SE-ResNeXt-50 SE-ResNet-50) |
Risk output (risk of malignancy) |
Han et al. (2020) [34] | Korean | Patient n = 9,556 Lesion n = 10,426 F 55% (5,255) Average age 52 |
No | Clinical Total 40,331 Training 1,106,886a Testing NA Validation 40,331 |
All – H | Neural network – RCNN Network architecture SENet Se-ResNeXt-50 |
Binary (malignant, non-malignant) 32 class classification |
Huang et al. (2020) [37] | Chinese | Lesion n = 1,225 | No | Clinical Total 3,299 Training 2,474 Testing 825 Validation NA |
All – PE | Neural network - CNN Network architecture Inception V3 Inception-ResNet V2 DenseNet 121 ResNet 50 |
Binary (SK/BCC) |
Li (2020) [44] | Chinese | Patient n = 106 | No | Dermoscopy and clinical Total 212 Training 200,000a Testing 212 Validation NA |
All – H | Network architecture Youzhi AI software (Shanghai Maise Information Technology Co., Ltd., Shanghai, China) |
Binary (benign or malignant) 14 class classification |
Liu et al. (2020) [39] | Fitzpatrick type I–VI | Patient n = 15,640 Lesion n = 20,676 |
No | Clinical Total 79,720 Training 64,837 Testing NA Validation 14,483 |
Benign – C Malignant – H |
Neural network – DLS Network architecture Inception – v4 |
26 class classification (primary output) 419 class classification (secondary output) |
Wang et al. (2020) [40] | Chinese, with Fitzpatrick type IV | n/a | No | Dermoscopy Total 10,307 Training 8,246 Testing 1,031 Validation 1,031 |
BCC – C+H Others – C |
Neural network – CNN Network architecture GoogLeNet Inception v3 |
Binary classification (psoriasis and others) Multi-class classification |
Huang et al. (2021) [41] | Taiwanese Caucasian |
KCGMH Patient no. 1,222 F 52.4% (n = 640) Average age 62 HAM10000 n/a |
Partially KCGMH – no HAM10000 – yes |
Clinical Total 1,287 Training 1,031 Testing 128 Validation 128 |
All – H | Neural network – CNN Network architecture DenseNet 121 – binary classification EfficientNet E4 – five class classification |
Binary (benign/malignant) 5 class classification (BCC, BK, MM, NV, SCC) 7 class (AK, BCC, BKL, SK, DF, MM, NV) |
Minagawa et al. (2021) [42] | Caucasian Japanese |
Patient n = 50 | Partially ISIC–yes Shinshu – no |
Dermoscopy Total 12,948 Training 12,848 Testing 100 Validation NA |
Benign – C Malignant – H |
Neural network - DNN Neural architecture – Inception-ResNet-V2 |
4 class classification (MM/BCC/MN/BK) |
Yang et al. (2021) [43] | Chinese | n/a | No | Clinical Total 12,816 Training 10,414 Testing 300 Validation 2,102 |
All – C | Neural network – DCNN Neural architecture DenseNet-96 ResNet-152 ResNet-99 Converged network (DenseNet – ResNet fusion) |
6 class classification (Nevi, Melasma, cafe-au-lait, SK, and acquired nevi) |
BCC, basal cell carcinoma; BK, benign keratosis; CNN or DCNN, (deep) convolutional neural network; DF, dermatofibroma; SCC, squamous cell carcinoma; SK, seborrheic keratosis; MM, melanoma; MN, melanocytic nevus; PE, pathological examination (insufficient information provided on whether histopathology and/or clinical evaluation was used); n/a, not available; n, number; CBIR, content-based image retrieval.
aAlgorithm previously trained on a different dataset; therefore, dataset numbers are not included.
The majority of studies (15/22, 68%) analyzed clinical images (i.e., wide-field or regional images), while seven studies analyzed dermoscopy images [23, 24, 27, 29, 30, 40, 42], and one study included both [44]. All but one study included both malignant and benign pigmented skin lesions, with one investigating only benign pigmented facial lesions [43].
Histopathology was used as the ground truth in 15 studies for all malignant lesions and partially in two studies [24, 26], while one study only used histopathology to resolve clinician disagreements [23]. Seven studies used histopathology as ground truth for benign lesions [23, 27, 29, 34, 35, 41, 44]. In nine studies, ground truth was established by consensus of experienced dermatologists [25, 30–32, 38–40, 42, 43]. Other studies used a mix of both [24, 26, 33, 36] or were not clearly defined [28, 37].
The number of pigmented skin lesion classifications used for AI model evaluation ranged from binary outcomes (e.g., benign vs. malignant) to classification of up to 419 skin conditions [39]. While most studies (19/22, 86%) evaluated lesions across all body sites, one study exclusively analyzed the lips/mouth [33], another assessed only facial skin lesions [43], and one study specifically addressed acral melanoma [29].
Population
Homogeneous datasets were collected from Chinese/Taiwanese (n = 8, 36%) [25, 30, 32, 37, 40, 41, 43, 44], Korean (n = 5, 23%) [27–29, 33–35], and Japanese (n = 3, 14%) [31, 38, 42] populations. Seven studies (32%) included Caucasian/Fitzpatrick skin type I–III populations [23, 24, 26, 28, 36, 39, 42], with at least 10% American Indian [26], Alaska Native [26], black or African American [26], Pacific Islander [26], Native American [26], or Fitzpatrick IV–VI [23, 39] in the training and/or test set (Table 2).
The majority of studies did not specify the sex distribution (n = 13, 59%) or participant age (n = 15, 68%). Seven studies included age specification, ranging from 18 to older than 85 years [23, 25, 26, 34–36, 41].
Outcome and Performance
The classification algorithms produced either a diagnostic output, a categorical risk output (e.g., low, medium, or high), or a combination of both. An overview of AI model performance is provided in Table 3. The majority of studies (20/22, 91%) used a diagnostic model, either with binary classification of benign or malignant [23–27, 29, 33, 35, 37], multiclass classification of specific lesion diagnosis [28, 30, 32, 39, 42, 43], or both [31, 34, 36, 38, 40, 41, 44]. One study used categorical risk as the outcome [32]. Another study reported both a diagnostic model and a risk categorical model [24].
Table 3.
Reference | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC |
---|---|---|---|---|
Binary classification models | ||||
Piccolo et al. (2002) [23] | n/a | 92 | 74 | n/a |
Iyatomi et al. (2008) [24] | n/a | 86 | 86 | 0.93 |
Chang et al. (2013) [25] | 91 | 86 | 88 | 0.95 |
Chen et al. (2016) [26] | 91 | 90 | 92 | n/a |
Yang et al. (2017) [27] | 99.7 | 100 | 99 | n/a |
Yu et al. (2018) [29] | 82 | 93 | 72 | 0.80 |
Cho et al. (2020) [33] | n/a | Dataset 1: 76 Dataset 2: 70 |
Dataset 1: 80 Dataset 2: 76 |
Dataset 1: 0.83 Dataset 2: 0.77 |
Huang et al. (2020) [37] | 86 | n/a | n/a | 0.92 |
Han et al. (2020) [35] | n/a | 77 | 91 | 0.91 |
Fujisawa et al. (2019) [31] | 93 | 96 | 90 | n/a |
Jinnai et al. (2019) [38] | 92 | 83 | 95 | n/a |
Han et al. (2020) [36] | n/a | n/a | n/a | Edinburgh dataset: 0.93 SNU dataset: 0.94 |
Han et al. (2020) [34] | n/a | Top 1: 63 | Top 1: 90 | 0.86 |
Li et al. (2020) [44] | 86 | 75 | 93 | n/a |
Wang et al. (2020) [40] | 77 | n/a | n/a | n/a |
Multiclass classification models | ||||
Han et al. (2018) [28] | n/a | ASAN dataset: 86 Edinburgh dataset: 85 |
ASAN dataset: 86 Edinburgh dataset: 81 |
|
Zhang et al. (2018) [30] | Dataset A: 87 Dataset B: 87 |
n/a | n/a | |
Fujisawa et al. (2019) [31] | 77 | n/a | n/a | |
Jinnai et al. (2019) [38] | 87 | 86 | 87 | |
Liu et al. (2020) (26-class model) [39] | Top 1: 71 Top 3: 93 |
Top 1: 58 Top 3: 88 |
n/a | |
Han et al. (2020) [36] | Top 1 Edinburgh dataset: 57 SNU dataset: 45 Top 3 Edinburgh dataset: 84 SNU dataset: 69 Top 5 Edinburgh dataset: 92 SNU dataset: 78 |
n/a | n/a | |
Han et al. (2020) [34] | Top 1: 43 Top 3: 62 |
n/a | n/a | |
Li et al. (2020) [44] | 73 | n/a | n/a | |
Wang et al. (2020) [40] | 82 | n/a | n/a | |
Minagawa et al. (2021) [42] | 90 | n/a | n/a | |
Yang et al. (2021) [43] | Algorithm A: 88 Algorithm B: 77 Algorithm C: 90 Algorithm D: 87 |
Algorithm A: 83 Algorithm B: 63 Algorithm C: 81 Algorithm D: 80 |
Algorithm A: 98 Algorithm B: 90 Algorithm C: 99 Algorithm D: 98 |
|
Huang et al. (2021) [41] | 5 class (KCGMH dataset): 72 7 class (HAM10000 dataset): 86 |
n/a | n/a | |
Risk categorical classification | ||||
Zhao et al. (2019) [32] | 83 | Benign: 93 Low risk: 85 High risk: 86 |
Benign: 88 Low risk: 85 High risk: 91 |
Benign: 0.96 Low risk: 0.92 High risk: 0.95 |
Top: top-n accuracy indicates that the correct diagnosis is among the top n predictions output by the model.
For example, top-3 accuracy means that any of the model's 3 highest-probability predictions matches the expected answer.
AUC, area under the curve.
The AI models using binary classification (16/22) reported an accuracy ranging from 70% to 99.7%. Of these studies, 6/16 reported ≥90% accuracy [25–27, 31, 38, 41], three studies reported between 80 and 90% accuracy [29, 37, 44], and one study reported <80% accuracy [40]. Twelve AI models reported sensitivity and specificity as a measure of performance, which ranged from 58 to 100% and 72 to 99%, respectively. Eight studies provided an area under the curve (AUC) with 5/8 reporting values >0.9 [24, 25, 35–37], with the remaining three models scoring between 0.77 and 0.86 [29, 33, 34].
For the 13 studies using multiclass output (i.e., >2 diagnoses), accuracy of models ranged from 43% to 93%. Six of these studies (6/13) scored <80% accuracy [31, 34, 36, 39, 41, 44], six others scored between 80 and 90% accuracy [30, 32, 38, 40, 42, 43], and one provided sensitivity and specificity of 86% and 86%, respectively, as a measure of performance [28].
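For clarity, the sketch below illustrates how the top-n accuracy reported for several multiclass models (defined in the Table 3 footnote) is computed from per-class probabilities; the probability matrix and labels are hypothetical and are not outputs from any reviewed model.

```python
# Minimal sketch of top-n accuracy as defined in the Table 3 footnote: a case
# counts as correct if the true diagnosis is among the model's n highest-probability
# classes. The probability matrix (4 lesions x 5 classes) is hypothetical.
import numpy as np

def top_n_accuracy(probs: np.ndarray, true_labels: np.ndarray, n: int) -> float:
    """probs: (n_samples, n_classes) class probabilities; true_labels: (n_samples,) class indices."""
    top_n_classes = np.argsort(probs, axis=1)[:, -n:]  # indices of the n most probable classes per case
    hits = [label in row for label, row in zip(true_labels, top_n_classes)]
    return float(np.mean(hits))

probs = np.array([[0.10, 0.60, 0.10, 0.10, 0.10],
                  [0.30, 0.20, 0.40, 0.05, 0.05],
                  [0.20, 0.20, 0.20, 0.20, 0.20],
                  [0.05, 0.05, 0.10, 0.30, 0.50]])
true_labels = np.array([1, 0, 3, 4])

print(top_n_accuracy(probs, true_labels, n=1))  # top-1 accuracy: 0.5
print(top_n_accuracy(probs, true_labels, n=3))  # top-3 accuracy: 1.0
```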
Reader Studies
Reader studies, where the performance of AI models and clinician classification is compared, were performed in 14/22 studies, with results provided in Table 4 [23, 25, 29, 31–39, 42, 44]. Six studies compared AI outcomes to classification by experts, e.g., dermatologists [25, 32, 34, 36, 42, 44]. Eight studies compared outcomes for both experts and non-experts, e.g., dermatology residents and general practitioners [23, 29, 31, 33, 35, 37–39].
Table 4.
Reference | AI performance | Expert performance | Non-expert performance |
---|---|---|---|
Piccolo et al. (2002) [23] | Sensitivity: 92% Specificity: 74% |
Sensitivity: 92% Specificity: 99% |
Sensitivity: 69% Specificity: 94% |
Chang et al. (2013) [25] | Accuracy Melanoma: 91% Non-melanoma: 83% Sensitivity: 86% Specificity: 88% |
Accuracy: 81% Sensitivity: 83% Specificity: 86% |
|
Yu et al. (2018) [29] | Accuracy: 82% Sensitivity: 93% Specificity: 72% AUC: 0.80 |
Accuracy: 81% Sensitivity: 97% Specificity: 67% AUC: 0.80 |
Accuracy: 65% Sensitivity: 45% Specificity: 84% AUC: 0.65 |
Huang et al. (2020) [37] | Sensitivity: 90% AUC 0.94 |
Sensitivity: 85% Specificity: 90% |
Sensitivity: 66% Specificity: 72% |
Han et al. (2020) [35] | Sensitivity: 89% Specificity: 78% AUC: 0.92 |
Sensitivity: 95% Specificity: 72% AUC: 0.91 |
Accuracy Dermatology resident: 94% Non-dermatology clinician: 77% Sensitivity Dermatology resident: 69% Non-dermatology clinician: 65% AUC Dermatology resident: 0.88 Non-dermatology clinician: 0.73 |
Fujisawa et al. (2019) [31] | Accuracy Binary: 92% Multiclass: 75% |
Accuracy Binary: 85% Multiclass: 60% |
Accuracy Binary: 74% Multiclass: 42% |
Jinnai et al. (2019) [38] | Accuracy: 92% Sensitivity: 83% Specificity: 95% |
Accuracy: 87% Sensitivity: 86% Specificity: 87% |
Accuracy: 85% Sensitivity: 84% Specificity: 86% |
Zhao et al. (2019) [32] | Sensitivity Benign: 90% Low risk: 90% High risk: 75% |
Sensitivity Benign: 61% Low risk: 50% High risk: 64% |
|
Cho et al. (2020) [33] | Sensitivity Dataset 1: 76% Dataset 2: 70% Specificity Dataset 1: 80% Dataset 2: 76% AUC Dataset 1: 0.83 Dataset 2: 0.77 |
Sensitivity -Without algorithm: 90% -With algorithm: 90% Specificity -Without algorithm: 58% -With algorithm: 61% |
Sensitivity Dermatology resident -Without algorithm: 80% -With algorithm: 85% Non-dermatology clinician -Without algorithm: 65% -With algorithm: 74% Specificity Dermatology resident -Without algorithm: 53% -With algorithm: 71% Non-dermatology clinician -Without algorithm: 46% -With algorithm: 49% AUC Dermatology resident -Without algorithm: 0.33 -With algorithm: 0.42 Non-dermatology clinician -Without algorithm: 0.11 -With algorithm: 0.23 |
Han et al. (2020) [36] | Multiclass model Accuracy Top 1: 45% Top 3: 69% Top 5: 78% |
Multiclass model Accuracy (without algorithm) Top 1: 50% Top 3: 67% (with algorithm) Top 1: 53% Top 3: 74% Binary model Accuracy -Without algorithm: 77% -With algorithm: 85% |
|
Han et al. (2020) [34] | Binary model Sensitivity: 67% Specificity: 87% Multiclass accuracy Top 1: 50% Top 3: 70% |
Binary model Sensitivity: 66% Specificity: 67% Multiclass accuracy Top 1: 38% Top 3: 53% |
|
Li et al. (2020) [44] | Accuracy Binary: 73% Multiclass: 86% |
Accuracy Binary: 83% Multiclass: 74% |
|
Liu et al. (2020) [39] | Accuracy Top 1: 66% Top 3: 90% |
Accuracy Top 1: 63% Top 3: 75% |
Accuracy Primary care physician Top 1: 44% Top 3: 60% Nurse practitioner Top 1: 40% Top 3: 55% |
Minagawa et al. (2021) [42] | Accuracy: 71% | Accuracy: 90% |
In reader studies comparing binary classification between AI and experts (n = 11), one study reported similar diagnostic accuracy/specificity [29], three showed higher accuracy for AI models [25, 31, 38], and two reported higher accuracy in experts [42, 44]. Five studies reported specificity, sensitivity, and AUC instead of accuracy with varying outcomes [23, 32, 33, 35, 37]. For reader studies between AI and non-experts (n = 7), AI showed higher accuracy, specificity, sensitivity, and AUC in most studies [23, 29, 31, 33, 35, 37, 38]. In reader studies that compared multiclass classification between AI and expert readers (n = 5), three studies reported higher top-1 diagnosis (i.e., diagnosis with highest statistical probability) accuracy for AI [31, 34, 44], in one study readers performed better [36], and the last study had similar results between AI and readers [39]. In multiclass reader studies with non-experts, AI reported higher accuracy in both studies (n = 2) [31, 39].
One study compared categorical risk classification with experts and showed AI to have higher sensitivity across all categories [32]. Two studies additionally showed a significant increase in accuracy when human experts were aided by AI outcomes [33, 36].
Quality Assessments
Studies included in the review were evaluated against the 25-point CLEAR Derm Consensus Guidelines [22], covering four domains (data, technique, technical assessment, and application). For each checklist item, each study was assessed as having fully, partially, or not addressed the criterion. A summary of results is provided in Table 5, with the individual study scores available in online supplementary eTable 3. Overall, an average score of 17.4 (±2.2) out of a possible 25 points was obtained.
Table 5.
Domain | No. | Checklist item | Fully addressed, n (%) | Partially addressed, n (%) | Not addressed, n (%) |
---|---|---|---|---|---|
Data | 1 | Image types | 21 (95) | 1 (5) | 0 (0) |
2 | Image artifacts | 12 (55) | 5 (23) | 5 (23) | |
3 | Technical acquisition details | 22 (100) | 0 (0) | 0 (0) | |
4 | Pre-processing procedures | 20 (91) | 0 (0) | 2 (9) | |
5 | Synthetic images made public if used | 22 (100)a | 0 (0) | 0 (0) | |
6 | Public images adequately referenced | 22 (100) | 0 (0) | 0 (0) | |
7 | Patient-level metadata | 5 (23) | 17 (77) | 0 (0) | |
8 | Skin tone information and procedure by which skin tone was assessed | 3 (14) | 16 (73) | 3 (14) | |
9 | Potential biases that may arise from use of patient information and metadata | 9 (41) | 7 (32) | 6 (27) | |
10 | Dataset partitions | 12 (55) | 9 (41) | 1 (5) | |
11 | Sample sizes of training, validation, and test sets | 7 (32) | 14 (64) | 1 (5) | |
12 | External test set | 3 (14) | 2 (9) | 17 (77) | |
13 | Multivendor images | 20 (91) | 2 (9) | 0 (0) | |
14 | Class distribution and balance | 5 (23) | 15 (68) | 2 (9) | |
15 | OOD images | 2 (9) | 7 (32) | 13 (59) | |
Technique | 16 | Labeling method (ground truth, who did it) | 15 (68) | 7 (32) | 0 (0) |
17 | References to common/accepted diagnostic labels | 22 (100) | 0 (0) | 0 (0) | |
18 | Histopathologic review for malignancies | 16 (73) | 2 (9) | 4 (18) | |
19 | Detailed description of algorithm development | 14 (64) | 6 (27) | 2 (9) | |
Technical Assessment | 20 | How to publicly evaluate algorithm | 5 (23) | 0 (0) | 17 (77) |
21 | Performance measures | 9 (41) | 13 (59) | 0 (0) | |
22 | Benchmarking, technical comparison, and novelty | 15 (68) | 0 (0) | 7 (32) | |
23 | Bias assessment | 10 (45) | 6 (27) | 6 (27) | |
Application | 24 | Use cases and target conditions (inside distribution) | 16 (73) | 6 (27) | 0 (0) |
25 | Potential impacts on the healthcare team and patients | 3 (14) | 13 (59) | 6 (27) |
aNo studies included synthetic images (checklist item 5), therefore marked as “fully addressed” to not negatively impact quality score.
OOD, out of distribution.
Data
The data domain consists of 15 checklist items; of these, six items were fully addressed by >90% of publications, including checklist items (1) image types, (3) technical acquisition details, (4) pre-processing procedures, and (6) public images adequately referenced. No studies included synthetic images (item 5). Checklist items (2) image artifacts, (9) potential biases, and (10) dataset partitions (e.g., lesion distribution) were fully addressed by only approximately half of the studies. About a third of studies fully addressed the sample size of training, validation, and test sets (item 11) [28, 29, 34, 35, 39, 40, 43]. The most poorly addressed criteria were patient-level metadata (e.g., sex, gender, ethnicity; item 7), skin color information (item 8), use of an external test dataset (item 12), class distribution and balance of the images (item 14), and out-of-distribution (OOD) images (item 15).
Patient-level metadata (item 7) was partially reported in most studies. While geographical location, hospital location, and race were adequately reported, sex and age specifications were limited, as were anatomical location, relevant past medical history, and history of presenting illness. Skin color information was reported as Fitzpatrick skin type in only three studies [23, 39, 40].
Technique
All four checklist items in the technique domain were fully addressed by most publications. Image labeling method (item 16), e.g., information on ground truth, was fully addressed in 68% (n = 15) of publications. All studies used commonly accepted diagnostic labels (item 17). Histopathologic review (item 18) was fully addressed in 73% (n = 16) of studies. A detailed description of algorithm development (item 19) was provided in 64% (n = 14) of studies.
Technical Assessment
Of the four checklist items in this domain, the most poorly addressed was public evaluation of the model (item 20), with only five papers fully addressing this item [29, 34–36, 39] and the majority (n = 17, 77%) not addressing it at all. Four studies reported public availability of their AI algorithm or offered a public-facing test interface for external testing [28, 34–36]. Performance measures (item 21) were fully addressed by nine studies. Benchmarking, technical comparison, and novelty relative to previously developed algorithms and human specialists (item 22) were fully addressed by 15 studies, and technical bias assessment (item 23) was discussed by 10 studies.
Application
The application domain consisted of two checklist items. Use cases and target conditions (item 24) were fully addressed by 16 (73%) studies, while potential impacts on the healthcare team and patients (item 25) were fully addressed by only three studies [38, 39, 41].
Discussion
We present, to our knowledge, the first systematic review summarizing existing AI image-based algorithms for the classification of skin lesions in people with skin of color. Our review identified 22 articles where the algorithms were primarily developed on or tested on images from people with skin of color.
Diversity and Transparency
Within the identified AI studies involving skin of color populations, there is further under-representation of darker skin types. We found that Asian populations (Chinese, Korean, and Japanese) were the major ethnic groups included, with limited datasets involving populations from the Middle East, India, Africa, the Pacific Islands, or Hispanic and Native Americans. Only three studies specifically reported skin color categories of Fitzpatrick type IV–VI and/or black or African American. Two North American studies reported only 4.3% [26] and 6.8% [39] of images from black or African American individuals, and one Italian-led study [23] reported 26.5% of images with Fitzpatrick skin type IV/V.
Adding to the issue was a lack of transparency in reporting skin color in image-based AI studies. A recently published scoping review [17] of 70 studies involving image-based AI algorithms for classification of skin conditions found that only 14 (20%) provided details about race and seven (10%) provided details about skin color. Furthermore, only three studies partially included Fitzpatrick skin type IV–VI populations. The lack of diversity in studies is likely fueled by the shortage of diverse publicly available datasets. A recent systematic review of 39 open access datasets and skin atlases reported that 12 of the 14 datasets that reported country of origin were from countries of predominantly European ancestry [16]. Only one dataset, from Brazil, reported that 5% of its images were from individuals with Fitzpatrick skin type IV–VI [45], and another dataset comprised images from a South Korean population [36].
Model Performance
AI models in this review showed reasonably high classification performance using both dermoscopic and clinical images in populations with skin of color. Accuracy was greater than 70% for all binary models reviewed, which is similar to models developed using Caucasian datasets [46]. It has previously been suggested that, instead of training algorithms on more diverse datasets, it may be beneficial to develop separate algorithms for different populations [7]. Han et al. [28] demonstrated a significant difference in performance for an AI model trained using skin images from exclusively Asian populations, which had greater accuracy for classifying basal cell carcinomas and melanomas in Asian populations (96% and 96%, respectively) than in Caucasian populations (90% and 88%, respectively).
There is some evidence that adding out-of-distribution images (i.e., those not originally represented in the training dataset) to a pre-existing dataset and re-training can improve classification accuracy among the newly added group while not affecting accuracy among the original group [22]. Another previously suggested method is to use artificially darkened lesion images in the training dataset [47].
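As an illustration only, the following sketch shows one possible way to generate artificially darkened lesion images for training-set augmentation using Pillow and torchvision; the specific transform, brightness factor, and file path are assumptions for demonstration and do not reproduce the method of the cited study [47].

```python
# Illustrative sketch only: one way lesion images could be artificially darkened
# for training-set augmentation. The brightness factor, transform pipeline, and
# file path are assumptions and do not reproduce the method of the cited study [47].
# Assumes Pillow and torchvision are installed.
from PIL import Image, ImageEnhance
from torchvision import transforms

def darken(image: Image.Image, factor: float = 0.6) -> Image.Image:
    """Return a copy of the image with brightness scaled by `factor` (<1.0 darkens)."""
    return ImageEnhance.Brightness(image).enhance(factor)

# Training pipeline that darkens each image before standard augmentation steps.
train_transform = transforms.Compose([
    transforms.Lambda(lambda img: darken(img, factor=0.6)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

image = Image.open("lesion.jpg").convert("RGB")  # hypothetical input image path
augmented = train_transform(image)               # tensor ready for model training
```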
Quality Assessments
Our review identified several critical items from the CLEAR Derm checklist that were insufficiently addressed by the majority of AI imaging studies in populations with skin of color. Overall, we found inconsistent description of image artifacts, lack of skin color information and other patient-level metadata, missing rationale for dataset sizes, and absence of external test sets. Image artifacts have previously been shown to affect algorithm performance in both dermatological images [48, 49] and clinical photographs [50]. Unclear description of image pre-processing can also contribute to misdiagnosis: subtle alterations in color balance or rotation can result in a melanoma being incorrectly classified as a benign nevus [49]. Furthermore, less than half of the studies reviewed reported anatomical locations of imaged lesions. The body site of a lesion can be informative; for example, skin of color populations have a high prevalence of acral melanomas, which are commonly found on the palms, soles, subungual region, and mucous membranes [51]. Future studies should consider adopting established consensus on image acquisition metrics [52] or dermatology-specific guidelines [22] to standardize images and develop robust archives of skin images used for research.
A major concern is that most AI algorithms have only been tested on internal datasets, i.e., those sampled from the same or similar populations. This issue was similarly highlighted in a recent review [18]. AI algorithms are prone to overfitting, which can result in inflated accuracy measures when models are tested only on internal datasets and in significant performance drops when they are tested on images from a different source [9]. The lack of standardized performance measures, publicly available diverse datasets, test interfaces, and reader studies presents a barrier to the comparability and reproducibility of AI algorithms. An alternative solution is for algorithms to be evaluated on standardized public test datasets, with transparency on algorithm performance [22]. Testing models on external datasets and/or out-of-distribution images is a better indicator of AI model performance in a real-world setting. Given that public dermoscopic datasets with diverse skin colors are now available, future algorithms should include external testing as a gold-standard evaluation [53].
Lastly, only a limited number of the included studies addressed the potential impacts on the healthcare team and patients. Future studies would benefit from considering the clinical translation of AI models during the early stages of development and from evaluating the clinical utility of AI models, along with their performance, in conjunction with their intended users [36].
Limitations and Outlook
Our systematic review is not without limitations. First, our search was limited to articles published in English. This is particularly of note for populations with skin of color, where English may often not be the primary language. Additionally, as many studies did not report on skin of color status, inclusion was based on assumptions using ethnicity and geographical location. There can be significant variability in skin color even for a population of the same race or geographical location [54]. Reporting bias may also influence the review, as higher performing algorithms are more likely to be reported and accepted for publication.
Conclusion
AI algorithms have great potential to complement skin cancer screening and the diagnosis of pigmented lesions, leading to improved dermatology workflows and patient outcomes. The performance and reliability of image-based AI algorithms depend significantly on the quality of the data on which they are trained and tested. While this review identifies promising development of AI algorithms for skin of color in East Asian populations, significant discrepancies remain in the number of algorithms developed in populations with skin of color, particularly darker skin types. Without inclusion of images from populations with skin of color, the incorporation of AI models into clinical practice could lead to missed and delayed diagnoses of neoplasms in people with skin of color, further widening existing health outcome disparities [55].
Statement of Ethics
An ethics statement is not applicable because this study is based exclusively on published literature.
Conflict of Interest Statement
H. Peter Soyer is a shareholder of MoleMap NZ Ltd. and e-derm-consult GmbH and undertakes regular teledermatological reporting for both companies. H. Peter Soyer is a medical consultant for Canfield Scientific Inc., MoleMap Australia Pty Ltd., and Blaze Bioscience Inc. and a medical advisor for First Derm. The remaining authors have no conflicts of interests to declare.
Funding Sources
H. Peter Soyer holds an NHMRC MRFF Next Generation Clinical Researchers Program Practitioner Fellowship (APP1137127). Clare Primiero is supported by an Australian Government Research Training Program Scholarship.
Author Contributions
Yuyang Liu: conceptualization (equal), data curation (lead), formal analysis (equal), methodology (lead), validation (equal), and writing – original draft (lead). Clare Primiero: conceptualization (equal), funding acquisition (equal), methodology (equal), supervision (equal), and writing – review and editing (equal). Vishnutheertha Kulkarni: data curation (supporting), formal analysis (supporting), validation (equal), and writing – review and editing (supporting). H. Peter Soyer: conceptualization (supporting), funding acquisition (equal), supervision (supporting), and writing – review and editing (supporting). Brigid Betz-Stablein: conceptualization (equal), data curation (supporting), formal analysis (supporting), methodology (equal), supervision (equal), and writing – review and editing (equal).
Data Availability Statement
This review is based exclusively on published literature. No publicly available datasets were used and no data were generated. Further inquiries can be directed to the corresponding author.
Supplementary Material
References
- 1. Cormier JN, Xing Y, Ding M, Lee JE, Mansfield PF, Gershenwald JE, et al. Ethnic differences among patients with cutaneous melanoma. Arch Intern Med. 2006;166(17):1907–14. 10.1001/archinte.166.17.1907. [DOI] [PubMed] [Google Scholar]
- 2. Jackson C, Maibach H. Ethnic and socioeconomic disparities in dermatology. J Dermatolog Treat. 2016;27(3):290–1. 10.3109/09546634.2015.1101409. [DOI] [PubMed] [Google Scholar]
- 3. Howlader N, Noone AM, Krapcho M, Miller D, Brest A, Yu M, Ruhl J, et al., editors. SEER Cancer Statistics Review, 1975–2018. Bethesda, MD: National Cancer Institute; 2021. [Google Scholar]
- 4. Dawes SM, Tsai S, Gittleman H, Barnholtz-Sloan JS, Bordeaux JS. Racial disparities in melanoma survival. J Am Acad Dermatol. 2016;75(5):983–91. 10.1016/j.jaad.2016.06.006. [DOI] [PubMed] [Google Scholar]
- 5. Adamson AS. Should we refer to skin as “ethnic”? J Am Acad Dermatol. 2017;76(6):1224–5. 10.1016/j.jaad.2017.01.028. [DOI] [PubMed] [Google Scholar]
- 6. Diao JA, Adamson AS. Representation and misdiagnosis of dark skin in a large-scale visual diagnostic challenge. J Am Acad Dermatol. 2022;86(4):950–1. 10.1016/j.jaad.2021.03.088. [DOI] [PubMed] [Google Scholar]
- 7. Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol. 2018;154(11):1247–8. 10.1001/jamadermatol.2018.2348. [DOI] [PubMed] [Google Scholar]
- 8. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8. 10.1038/nature21056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Tschandl P, Codella N, Akay BN, Argenziano G, Braun RP, Cabo H, et al. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. Lancet Oncol. 2019;20(7):938–47. 10.1016/S1470-2045(19)30333-X. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Tschandl P, Rinner C, Apalla Z, Argenziano G, Codella N, Halpern A, et al. Human-computer collaboration for skin cancer recognition. Nat Med. 2020;26(8):1229–34. 10.1038/s41591-020-0942-0. [DOI] [PubMed] [Google Scholar]
- 11. Rotemberg V, Kurtansky N, Betz-Stablein B, Caffery L, Chousakos E, Codella N, et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci Data. 2021;8(1):34. 10.1038/s41597-021-00815-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018;5:180161. 10.1038/sdata.2018.161. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Mendonca T, Ferreira PM, Marques JS, Marcal ARS, Rozeira J. PH2: a dermoscopic image database for research and benchmarking. Annu Int Conf IEEE Eng Med Biol Soc. 2013;2013:5437–40. 10.1109/EMBC.2013.6610779. [DOI] [PubMed] [Google Scholar]
- 14. Giotis I, Molders N, Land S, Biehl M, Jonkman MF, Petkov N. MED-NODE: a computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Syst Appl. 2015;42(19):6578–85. 10.1016/j.eswa.2015.04.034. [DOI] [Google Scholar]
- 15. de Faria SMM, Henrique M, Filipe JN, Pereira PMM, Tavora LMN, Assuncao PAA, et al. Light field image dataset of skin lesions. Annu Int Conf IEEE Eng Med Biol Soc. 2019;2019:3905–8. 10.1109/EMBC.2019.8856578. [DOI] [PubMed] [Google Scholar]
- 16. Wen D, Khan SM, Ji Xu A, Ibrahim H, Smith L, Caballero J, et al. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit Health. 2022;4(1):e64–74. 10.1016/S2589-7500(21)00252-1. [DOI] [PubMed] [Google Scholar]
- 17. Daneshjou R, Smith MP, Sun MD, Rotemberg V, Zou J. Lack of transparency and potential bias in artificial intelligence data sets and algorithms: a scoping review. JAMA Dermatol. 2021;157(11):1362–9. 10.1001/jamadermatol.2021.3129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Haggenmuller S, Maron RC, Hekler A, Utikal JS, Barata C, Barnhill RL, et al. Skin cancer classification via convolutional neural networks: systematic review of studies involving human experts. Eur J Cancer. 2021;156:202–16. 10.1016/j.ejca.2021.06.049. [DOI] [PubMed] [Google Scholar]
- 19. Thomsen K, Iversen L, Titlestad TL, Winther O. Systematic review of machine learning for diagnosis and prognosis in dermatology. J Dermatolog Treat. 2020;31(5):496–510. 10.1080/09546634.2019.1682500. [DOI] [PubMed] [Google Scholar]
- 20. Freeman K, Dinnes J, Chuchu N, Takwoingi Y, Bayliss SE, Matin RN, et al. Algorithm based smartphone apps to assess risk of skin cancer in adults: systematic review of diagnostic accuracy studies. BMJ. 2020;368:m127. 10.1136/bmj.m127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: explanation and elaboration. BMJ. 2009;339:b2700. 10.1136/bmj.b2700. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Daneshjou R, Barata C, Betz-Stablein B, Celebi ME, Codella N, Combalia M, et al. Checklist for evaluation of image-based artificial intelligence reports in dermatology: CLEAR derm consensus guidelines from the international skin imaging collaboration artificial intelligence working group. JAMA Dermatol. 2022;158(1):90–6. 10.1001/jamadermatol.2021.4915. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Piccolo D, Ferrari A, Peris K, Diadone R, Ruggeri B, Chimenti S. Dermoscopic diagnosis by a trained clinician vs. a clinician with minimal dermoscopy training vs. computer-aided diagnosis of 341 pigmented skin lesions: a comparative study. Br J Dermatol. 2002;147(3):481–6. 10.1046/j.1365-2133.2002.04978.x. [DOI] [PubMed] [Google Scholar]
- 24. Iyatomi H, Oka H, Celebi ME, Hashimoto M, Hagiwara M, Tanaka M, et al. An improved Internet-based melanoma screening system with dermatologist-like tumor area extraction algorithm. Comput Med Imaging Graph. 2008;32(7):566–79. 10.1016/j.compmedimag.2008.06.005. [DOI] [PubMed] [Google Scholar]
- 25. Chang WY, Huang A, Yang CY, Lee CH, Chen YC, Wu TY, et al. Computer-aided diagnosis of skin lesions using conventional digital photography: a reliability and feasibility study. PLoS One. 2013;8(11):e76212. 10.1371/journal.pone.0076212. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Chen RH, Snorrason M, Enger SM, Mostafa E, Ko JM, Aoki V, et al. Validation of a skin-lesion image-matching algorithm based on computer vision technology. Telemed J E Health. 2016;22(1):45–50. 10.1089/tmj.2014.0249. [DOI] [PubMed] [Google Scholar]
- 27. Yang S, Oh B, Hahm S, Chung KY, Lee BU. Ridge and furrow pattern classification for acral lentiginous melanoma using dermoscopic images. Biomed Signal Process Control. 2017;32:90–6. 10.1016/j.bspc.2016.09.019. [DOI] [Google Scholar]
- 28. Han SS, Kim MS, Lim W, Park GH, Park I, Chang SE. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J Invest Dermatol. 2018;138(7):1529–38. 10.1016/j.jid.2018.01.028. [DOI] [PubMed] [Google Scholar]
- 29. Yu C, Yang S, Kim W, Jung J, Chung K-Y, Lee SW, et al. Acral melanoma detection using a convolutional neural network for dermoscopy images. PLoS One. 2018;13(3):e0193321. 10.1371/journal.pone.0193321. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Zhang X, Wang S, Liu J, Tao C. Towards improving diagnosis of skin diseases by combining deep neural network and human knowledge. BMC Med Inform Decis Mak. 2018;18(Suppl 2):59. 10.1186/s12911-018-0631-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Fujisawa Y, Otomo Y, Ogata Y, Nakamura Y, Fujita R, Ishitsuka Y, et al. Deep-learning-based, computer-aided classifier developed with a small dataset of clinical images surpasses board-certified dermatologists in skin tumour diagnosis. Br J Dermatol. 2019;180(2):373–81. 10.1111/bjd.16924. [DOI] [PubMed] [Google Scholar]
- 32. Zhao XY, Wu X, Li FF, Li Y, Huang WH, Huang K, et al. The application of deep learning in the risk grading of skin tumors for patients using clinical images. J Med Syst. 2019;43(8):283. 10.1007/s10916-019-1414-2. [DOI] [PubMed] [Google Scholar]
- 33. Cho SI, Sun S, Mun JH, Kim C, Kim SY, Cho S, et al. Dermatologist-level classification of malignant lip diseases using a deep convolutional neural network. Br J Dermatol. 2020;182(6):1388–94. 10.1111/bjd.18459. [DOI] [PubMed] [Google Scholar]
- 34. Han SS, Moon IJ, Kim SH, Na J-I, Kim MS, Park GH, et al. Assessment of deep neural networks for the diagnosis of benign and malignant skin neoplasms in comparison with dermatologists: a retrospective validation study. PLoS Med. 2020;17(11):e1003381. 10.1371/journal.pmed.1003381. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Han SS, Moon IJ, Lim W, Suh IS, Lee SY, Na JI, et al. Keratinocytic skin cancer detection on the face using region-based convolutional neural network. JAMA Dermatol. 2020;156(1):29–37. 10.1001/jamadermatol.2019.3807. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Han SS, Park I, Eun Chang S, Lim W, Kim MS, Park GH, et al. Augmented intelligence dermatology: deep neural networks empower medical professionals in diagnosing skin cancer and predicting treatment options for 134 skin disorders. J Invest Dermatol. 2020;140(9):1753–61. 10.1016/j.jid.2020.01.019. [DOI] [PubMed] [Google Scholar]
- 37. Huang K, He X, Jin Z, Wu L, Zhao X, Wu Z, et al. Assistant diagnosis of basal cell carcinoma and seborrheic keratosis in Chinese population using convolutional neural network. J Healthc Eng. 2020;2020:1713904. 10.1155/2020/1713904. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Jinnai S, Yamazaki N, Hirano Y, Sugawara Y, Ohe Y, Hamamoto R. The development of a skin cancer classification system for pigmented skin lesions using deep learning. Biomolecules. 2020;10(8):1123. 10.3390/biom10081123. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Liu Y, Jain A, Eng C, Way DH, Lee K, Bui P, et al. A deep learning system for differential diagnosis of skin diseases. Nat Med. 2020;26(6):900–8. 10.1038/s41591-020-0842-3. [DOI] [PubMed] [Google Scholar]
- 40. Wang S-Q, Zhang X-Y, Liu J, Tao C, Zhu C-Y, Shu C, et al. Deep learning-based, computer-aided classifier developed with dermoscopic images shows comparable performance to 164 dermatologists in cutaneous disease diagnosis in the Chinese population. Chin Med J. 2020;133(17):2027–36. 10.1097/CM9.0000000000001023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Huang HW, Hsu BWY, Lee CH, Tseng VS. Development of a light-weight deep learning model for cloud applications and remote diagnosis of skin cancers. J Dermatol. 2021;48(3):310–6. 10.1111/1346-8138.15683. [DOI] [PubMed] [Google Scholar]
- 42. Minagawa A, Koga H, Sano T, Matsunaga K, Teshima Y, Hamada A, et al. Dermoscopic diagnostic performance of Japanese dermatologists for skin tumors differs by patient origin: a deep learning convolutional neural network closes the gap. J Dermatol. 2021;48(2):232–6. 10.1111/1346-8138.15640. [DOI] [PubMed] [Google Scholar]
- 43. Yang Y, Ge Y, Guo L, Wu Q, Peng L, Zhang E, et al. Development and validation of two artificial intelligence models for diagnosing benign, pigmented facial skin lesions. Skin Res Technol. 2021;27(1):74–9. 10.1111/srt.12911. [DOI] [PubMed] [Google Scholar]
- 44. Li CX, Fei WM, Shen CB, Wang ZY, Jing Y, Meng RS, et al. Diagnostic capacity of skin tumor artificial intelligence-assisted decision-making software in real-world clinical settings. Chin Med J. 2020;133(17):2020–6. 10.1097/CM9.0000000000001002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Pacheco AGC, Lima GR, Salomão AS, Krohling B, Biral IP, de Angelo GG, et al. PAD-UFES-20: a skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Brief. 2020;32:106221. 10.1016/j.dib.2020.106221. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Takiddin A, Schneider J, Yang Y, Abd-Alrazaq A, Househ M. Artificial intelligence for skin cancer detection: scoping review. J Med Internet Res. 2021;23(11):e22934. 10.2196/22934. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Aggarwal P, Papay FA. Artificial intelligence image recognition of melanoma and basal cell carcinoma in racially diverse populations. J Dermatolog Treat. 2021:1–6. [DOI] [PubMed] [Google Scholar]
- 48. Winkler JK, Fink C, Toberer F, Enk A, Deinlein T, Hofmann-Wellenhof R, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 2019;155(10):1135–41. 10.1001/jamadermatol.2019.1735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Du-Harpur X, Arthurs C, Ganier C, Woolf R, Laftah Z, Lakhan M, et al. Clinically relevant vulnerabilities of deep machine learning systems for skin cancer diagnosis. J Invest Dermatol. 2021;141(4):916–20. 10.1016/j.jid.2020.07.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Phillips M, Marsden H, Jaffe W, Matin RN, Wali GN, Greenhalgh J, et al. Assessment of accuracy of an artificial intelligence algorithm to detect melanoma in images of skin lesions. JAMA Netw Open. 2019;2(10):e1913436. 10.1001/jamanetworkopen.2019.13436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Bellew S, Del Rosso JQ, Kim GK. Skin cancer in asians: part 2: melanoma. J Clin Aesthet Dermatol. 2009;2(10):34–6. [PMC free article] [PubMed] [Google Scholar]
- 52. Katragadda C, Finnane A, Soyer HP, Marghoob AA, Halpern A, Malvehy J, et al. Technique standards for skin lesion imaging: a delphi consensus statement. JAMA Dermatol. 2017;153(2):207–13. 10.1001/jamadermatol.2016.3949. [DOI] [PubMed] [Google Scholar]
- 53. Daneshjou R, Vodrahalli K, Liang W, Novoa R, Jenkins M, Rotemberg V, et al. Disparities in dermatology AI: assessments using diverse clinical images. 2021. [Google Scholar]
- 54. Wei L, Xuemin W, Wei L, Li L, Ping Z, Yanyu W, et al. Skin color measurement in Chinese female population: analysis of 407 cases from 4 major cities of China. Int J Dermatol. 2007;46(8):835–9. 10.1111/j.1365-4632.2007.03192.x. [DOI] [PubMed] [Google Scholar]
- 55. Adamson AS, Essien U, Ewing A, Daneshjou R, Hughes-Halbert C, Ojikutu B, et al. Diversity, race, and health. Med. 2021;2(1):6–10. 10.1016/j.medj.2020.12.020. [DOI] [PubMed] [Google Scholar]
- 56. International Skin Imaging Collaboration. [Available from: https://www.isic-archive.com/#!/topWithHeader/onlyHeaderTop/gallery].
- 57. Argenziano G, Soyer HP, De Giorgio V, et al. Interactive atlas of dermoscopy. 2000. [Google Scholar]
- 58. Codella NCF, Gutman D, Celebi ME, Helba B, Marchetti MA, Dusza SW, et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 International symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). IEEE 15th Int Symposium Biomed Imaging (ISBI 2018). 2018:168–72. 10.1109/ISBI.2018.8363547. [DOI] [Google Scholar]
- 59. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, et al. ImageNet large scale visual recognition challenge. IJCV. 2015. [Google Scholar]
- 60. DermQuest. [Online]. (2012). [Available from: http://www.dermquest.com].
- 61. Dermatology information system. [Online].(2012). [Available from: http://www.dermis.net]. [Google Scholar]
- 62. Alves J, Moreira D, Alves P, Rosado L, Vasconcelos MJM. Automatic focus assessment on dermoscopic images acquired with smartphones. Sensors. 2019;19(22):4957. 10.3390/s19224957. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. DermNet NZ. [Online]. [Available from: https://dermnetnz.org/].