PLOS One. 2024 Jan 19;19(1):e0297146. doi: 10.1371/journal.pone.0297146

Evaluating deep learning-based melanoma classification using immunohistochemistry and routine histology: A three center study

Christoph Wies 1,2,#, Lucas Schneider 1,#, Sarah Haggenmüller 1, Tabea-Clara Bucher 1, Sarah Hobelsberger 3, Markus V Heppt 4, Gerardo Ferrara 5, Eva I Krieghoff-Henning 1, Titus J Brinker 1,‡,*
Editor: Vincenzo L’Imperio 6
PMCID: PMC10798511  PMID: 38241314

Abstract

Pathologists routinely use immunohistochemical (IHC)-stained tissue slides against MelanA in addition to hematoxylin and eosin (H&E)-stained slides to improve their accuracy in diagnosing melanomas. The use of diagnostic Deep Learning (DL)-based support systems for automated examination of tissue morphology and cellular composition has been well studied in standard H&E-stained tissue slides. In contrast, there are few studies that analyze IHC slides using DL. Therefore, we investigated the separate and joint performance of ResNets trained on MelanA and corresponding H&E-stained slides. The MelanA classifier achieved an area under receiver operating characteristics curve (AUROC) of 0.82 and 0.74 on out-of-distribution (OOD) datasets, similar to the H&E-based benchmark classification of 0.81 and 0.75, respectively. A combined classifier using MelanA and H&E achieved AUROCs of 0.85 and 0.81 on the OOD datasets. DL MelanA-based assistance systems show performance similar to the benchmark H&E classification and may be improved by multi-stain classification to assist pathologists in their clinical routine.

Introduction

Melanoma diagnoses have increased in recent decades [1] and melanoma is the fifth most common cancer in the United States [2]. Despite its relatively high frequency, melanoma is often difficult to differentiate histopathologically from nevi, and a high diagnostic discordance rate has been reported even among experienced histopathologists [3]. If a melanoma is initially misclassified as a nevus and therefore diagnosed at a later stage, the patient’s chances of survival could be significantly reduced and therapy will probably have to be more intense. On the other hand, if harmless benign lesions are diagnosed as melanoma, the patient will suffer an unnecessary psychological and physical burden. In individual cases, overdiagnosis can even lead to unnecessary and stressful therapies, which are associated with high costs for the healthcare system and unnecessary toxicity for affected patients [4]. More precise diagnostic options could contribute to overcoming these problems.

Due to rapid technological advances of the last few years, AI-based assistance systems may become powerful tools for pathological cancer diagnostics. Deep Learning (DL) with Convolutional Neural Networks (CNN) has shown promise in studies aimed at distinguishing melanomas and nevi on digitized hematoxylin and eosin (H&E)-stained whole slide images (WSI) [5, 6]. In some cases, the DL approach could even outperform humans [7]. However, accuracy of these classifiers especially on external data still shows room for improvement.

In addition to standard H&E-stained slides, immunohistochemical (IHC)-stained tissue sections are often available for many cancer entities and represent a source of complementary prognostic and/or predictive information. The analysis of IHC-stained slides by DL models is a relatively new area of research. Recent studies have nevertheless employed DL successfully on IHC-stained slides of non-skin cancer entities, e.g., to determine HER2 status in breast cancer [8] and to use immune cell multistains as prognostic and predictive biomarkers in colorectal cancer [9]. Moreover, as shown in previous work, the fusion of different data modalities often improves generalizability and performance of DL models [9–11].

IHC-stains routinely used by pathologists to better differentiate other, usually benign lesions from melanomas are MelanA (MART-1) [12], HMB-45 [13], Ki-67 [14], tyrosinase [15], S100 [16] and PRAME [17]. The expression of melanocyte antigen MelanA (also called melanoma antigen recognized by T-cells 1 (MART1)) is a lineage-specific melanocytic marker which is commonly used by histopathologists for routine diagnosis of melanocytic neoplasms, since it highlights the cytomorphology and the distribution of melanocytes.

IHC expression of MelanA can be automatically analyzed using state-of-the-art artificial intelligence (AI) methods. In this study, we investigate the use of DL-based image analysis models on MelanA-stained tissue for melanoma classification in comparison and in addition to the standard H&E-based diagnosis.

Materials and methods

The presented study investigates melanoma suspicious lesions based on dermatoscopic investigation, which were verified histopathologically as melanoma or nevus. We use DL models to classify whether a lesion is a melanoma or a nevus based on MelanA or H&E stained tumor tissue or a combination of both stains.

Ethics approval was obtained from the ethics committees of the technical university in Dresden. Patients provided informed written consent. This work was performed in accordance with the Declaration of Helsinki [18].

Datasets

To be included in our study, participants had to be at least 18 years old and have melanoma-suspicious skin lesions that were biopsied after dermoscopic examination. Suspicious lesions that were pre-biopsied or located near the eye or under the fingernails or toenails were excluded. The ground truth labels were histopathologically confirmed by at least one reference dermatopathologist examining at least the H&E-stained reference slide. MelanA (MART-1) [19, 20] immunohistochemical (IHC) and hematoxylin and eosin (H&E)-stained tissue slides from the university hospital in Dresden were used for training, validation and hold-out testing. Slides from the university hospital in Erlangen and from the National Cancer Institute of Naples were used for out-of-distribution (OOD) testing. Table 1 describes the population of all three cohorts. The Dresden, Erlangen and Naples cohorts were collected prospectively, with data collection starting in April 2021 and ending in April 2023. Data were physically transferred in batches. Data received from the university hospital in Dresden before 2023 were used as the training set; data received later were used as the hold-out test set. The three cohorts differ in their MelanA staining: antibodies from different manufacturers were used at different dilutions at each site (Table 2). Representative WSI thumbnails from all three cohorts and for both stains are shown in the supplements in S1 Fig.

Table 1. Description of the population in our datasets.

For continuous features we report the median, range, and number of NAs; for categorical features we report the total number of observations per group. The training population and all three test populations are described. Melanoma in situ describes the early stage of a malignant melanoma that has not yet broken through the basement membrane. However, features at the cellular level do not differ between melanoma in situ and malignant melanoma.

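Cohort / Feature | Melanoma | Melanoma in situ | Nevi | All

Dresden (train)
Samples | 82 | 13 | 112 | 207
Age | 70 [29;95] | 74 [43;87] | 44 [18;94] | 61 [18;95]
Breslow | 0.7 [0.0;20.0], 8 NA | 0.4 [0.0;0.6], 6 NA | 112 NA | 0.6 [0.0;20.0], 126 NA
Gender: male | 52 | 10 | 49 | 111
Gender: female | 30 | 3 | 63 | 96
AJCC stage: 0 | 0 | 13 | 0 | 13
AJCC stage: I | 60 | 0 | 0 | 60
AJCC stage: II | 14 | 0 | 0 | 14
AJCC stage: III | 8 | 0 | 0 | 8
AJCC stage: NA | 0 | 0 | 112 | 112
Localisation: Extremities | 27 | 1 | 34 | 62
Localisation: Head | 7 | 6 | 12 | 25
Localisation: Trunk | 48 | 6 | 66 | 120

Dresden (test)
Samples | 45 | 15 | 66 | 126
Age | 73 [33;92] | 69 [43;92] | 59 [20;88] | 67 [20;92]
Breslow | 0.9 [0.3;6.5], 5 NA | 0.0 [0.0;0.3], 12 NA | 66 NA | 0.7 [0.0;6.5], 83 NA
Gender: male | 24 | 6 | 31 | 61
Gender: female | 21 | 9 | 35 | 65
AJCC stage: 0 | 0 | 15 | 0 | 15
AJCC stage: I | 29 | 0 | 0 | 29
AJCC stage: II | 10 | 0 | 0 | 10
AJCC stage: III | 3 | 0 | 0 | 3
AJCC stage: NA | 3 | 0 | 66 | 69
Localisation: Extremities | 15 | 7 | 21 | 43
Localisation: Head | 13 | 4 | 7 | 24
Localisation: Trunk | 17 | 4 | 38 | 59

Erlangen
Samples | 41 | 5 | 35 | 81
Age | 62 [34;93] | 64 [48;86] | 51 [23;83] | 57 [23;93]
Breslow | 0.5 [0.0;10.0], 3 NA | 0.5 [0.5;0.5], 4 NA | 35 NA | 0.5 [0.0;10.0], 39 NA
Gender: male | 22 | 3 | 23 | 48
Gender: female | 19 | 2 | 12 | 33
AJCC stage: 0 | 0 | 5 | 0 | 5
AJCC stage: I | 21 | 0 | 0 | 21
AJCC stage: II | 7 | 0 | 0 | 7
AJCC stage: III | 4 | 0 | 0 | 4
AJCC stage: IV | 1 | 0 | 0 | 1
AJCC stage: NA | 8 | 0 | 35 | 35
Localisation: Extremities | 14 | 1 | 17 | 32
Localisation: Head | 11 | 1 | 1 | 13
Localisation: Trunk | 16 | 3 | 17 | 36

Naples
Samples | 15 | 10 | 25 | 50
Age | 51 [28;71] | 66 [51;84] | 35 [21;80] | 49 [21;84]
Breslow | 2.0 [0.4;5.8] | 10 NA | 25 NA | 2.0 [0.4;5.8], 35 NA
Gender: male | 5 | 6 | 10 | 21
Gender: female | 10 | 4 | 15 | 29
AJCC stage: 0 | 0 | 10 | 0 | 10
AJCC stage: I | 7 | 0 | 0 | 7
AJCC stage: II | 0 | 0 | 0 | 0
AJCC stage: III | 7 | 0 | 0 | 7
AJCC stage: IV | 1 | 0 | 0 | 1
AJCC stage: NA | 0 | 0 | 25 | 25
Localisation: Extremities | 6 | 2 | 9 | 17
Localisation: Head | 1 | 2 | 4 | 7
Localisation: Trunk | 7 | 6 | 12 | 25
Localisation: NA | 1 | 0 | 0 | 1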

Table 2. Antibodies and parameters of staining methods used by the different clinics.

Hospital | Clone | Company | Stain machine | Kit | Dilution
Dresden | A103 | Agilent | Ventana Roche Benchmark Ultra | ultraView Red | 1:25
Erlangen | A103 | Millipore Sigma | Roche BenchMark XT | Fast Red | 1:200
Naples | A103 | Ventana | Ventana Roche Benchmark Ultra | ultraView Red | 1:1

Pre-processing

IHC and adjacent H&E slides from the Dresden, Erlangen and Naples cohorts were digitized with an Aperio® AT2 Slide Scanner with a 40× magnification resulting in WSIs with a resolution of 0.25 μm/px. Tumor boundaries were manually annotated under expert supervision with the QuPath digital pathology software version 0.3 [21]. WSIs were tessellated into patches of 237 px x 237 px by an in-house developed QuPath script for each slide in different (40x, 20x, 10x, 5x) magnifications for IHC WSIs and in 40x magnification for H&E WSIs. Tiles with 40x magnification were created with a size of 60 x 60 μm, which corresponds to 237px x 237px. Tile sizes at 20x, 10x and 5x magnification are 120x120 μm, 240x240 μm and 480x480 μm, respectively. All tiles used for training, validation and hold-out testing were extracted without stride/overlap.
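The relation between magnification, pixel size and tile footprint described above can be illustrated with a minimal sketch. It assumes the nominal 0.25 μm/px at 40x and a simple grid that drops partial border tiles; the helper names are hypothetical and this is not the in-house QuPath script. With the nominal pixel size a 237 px tile covers 59.25 μm, close to the 60 μm quoted in the text; the small gap presumably reflects rounding of the pixel size.

```python
# Nominal values taken from the text; the helper names are hypothetical.
TILE_PX = 237    # tile edge in pixels
BASE_MPP = 0.25  # micrometres per pixel at 40x (nominal)

# Downsample factor of each magnification relative to 40x.
MAGNIFICATIONS = {"40x": 1, "20x": 2, "10x": 4, "5x": 8}


def tile_edge_um(downsample: int, tile_px: int = TILE_PX,
                 base_mpp: float = BASE_MPP) -> float:
    """Physical edge length (micrometres) of a tile at a given downsample factor."""
    return tile_px * base_mpp * downsample


def tile_origins(width_px: int, height_px: int, tile_px: int = TILE_PX):
    """Top-left corners of a non-overlapping tile grid.

    Assumption: partial tiles at the right and bottom border are dropped.
    """
    return [
        (x, y)
        for y in range(0, height_px - tile_px + 1, tile_px)
        for x in range(0, width_px - tile_px + 1, tile_px)
    ]


if __name__ == "__main__":
    for name, factor in MAGNIFICATIONS.items():
        print(name, tile_edge_um(factor), "um per tile edge")
    print(len(tile_origins(1000, 500)), "tiles in a 1000 x 500 px region")
```

Because the stride equals the tile size, neighbouring tiles share no pixels, which matches the statement that tiles were extracted without overlap.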

Models

To classify pigmented lesions as melanomas or nevi, the ResNet architecture introduced by He et al. [22] was selected as the model for all data modalities. The hyperparameters of the different models were tuned individually using the Bayesian optimization framework Optuna [23] and five-fold cross-validation; all models were loaded from the timm library [24]. To avoid overfitting to slides containing a very large number of tiles, we used weighted sampling to train with a predefined number of tiles per slide in every epoch. The tuned hyperparameters were the size of the ResNet, the learning rate, the number of training epochs, the type of pooling, the number of tiles used per training epoch and whether or not the ResNet was initialized with ImageNet-pretrained weights. The parameterization of all models is shown in the supplements in S1 Table.
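The idea of training with a fixed number of tiles per slide can be sketched as follows. This is a simplified stand-in, not the authors' weighted sampler: it assumes that every slide contributes exactly the same number of tiles per epoch and that small slides are sampled with replacement. The function name and the tile identifiers are hypothetical.

```python
import random
from typing import Dict, List


def sample_epoch_tiles(tiles_by_slide: Dict[str, List[str]],
                       tiles_per_slide: int, seed: int) -> List[str]:
    """Draw the same number of tiles from every slide for one epoch.

    Slides with fewer tiles than requested are sampled with replacement,
    so slides with many tiles cannot dominate the epoch.
    """
    rng = random.Random(seed)
    epoch: List[str] = []
    for slide_id in sorted(tiles_by_slide):
        tiles = tiles_by_slide[slide_id]
        if not tiles:
            continue  # nothing to draw from an empty slide
        if len(tiles) >= tiles_per_slide:
            epoch.extend(rng.sample(tiles, tiles_per_slide))
        else:
            epoch.extend(rng.choices(tiles, k=tiles_per_slide))
    rng.shuffle(epoch)
    return epoch


if __name__ == "__main__":
    slides = {
        "large_slide": [f"large_{i}" for i in range(1000)],
        "small_slide": ["small_0", "small_1"],
    }
    epoch = sample_epoch_tiles(slides, tiles_per_slide=10, seed=0)
    print(len(epoch), "tiles in this epoch")
```

Passing a different seed per epoch gives each epoch a fresh draw while keeping the per-slide contribution constant.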

The slide prediction procedure for the different image modalities is as follows: The models were trained at the tile level (with resolutions of 0.25, 0.5, 1.0 and 2.0 μm/px), using the slide label for each tile of the slide. All tiles of the slide were predicted and the slide score was calculated by averaging all tile scores (Fig 1). To train models capable of handling domain shifts, the color jitter augmentation of PyTorch [25] was used during training, as suggested by Tellez et al. [26]. In contrast to features on H&E-stained slides, features of protein expression can be distributed over a larger area in the cytoplasm. We therefore used different magnifications to capture these larger features.
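The tile-to-slide aggregation described above amounts to a plain mean over tile scores. A minimal sketch, in which the function names and the default threshold of 0.5 are placeholders rather than values from the study:

```python
from statistics import fmean
from typing import Sequence


def slide_score(tile_scores: Sequence[float]) -> float:
    """Slide-level melanoma score: unweighted mean of all tile scores."""
    if not tile_scores:
        raise ValueError("a slide needs at least one tile score")
    return fmean(tile_scores)


def slide_label(tile_scores: Sequence[float], threshold: float = 0.5) -> str:
    """Binary call from the slide score; the threshold is a placeholder."""
    return "melanoma" if slide_score(tile_scores) >= threshold else "nevus"


if __name__ == "__main__":
    print(slide_score([0.2, 0.4, 0.9]))
    print(slide_label([0.9, 0.8, 0.7]))
```

Since every tile carries the same weight, a lesion in which only a small region looks malignant can be pulled towards the nevus class.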

Fig 1. Schematic diagram of the different models.

The red box shows the pipeline for MelanA-stained WSIs and the purple box the pipeline for H&E-stained WSIs. We tessellated MelanA-stained WSIs corresponding to different magnifications and trained individual models on each tile size. The class probabilities for each tile were predicted and aggregated into a slide score by averaging all tile scores. For the H&E-based model we proceeded in the same way.

Combined models

Unimodal classifiers were combined to build models based on multiple data modalities. A classifier based on all four MelanA magnifications was built, where predictions with higher certainty give a higher contribution to the combined prediction. Scores of the different magnification models were averaged and weighted based on their distance to the optimal decision threshold. Other fusion approaches like averaging the scores unweighted or weighted based on the model’s validation performances were investigated, all of which yielded comparable results (shown in the supplements, S2 Table). The H&E classifier was combined with the MelanA multiscale classifier using the same fusion method.
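One way to read the confidence-weighted fusion is sketched below. It assumes that each model's weight is the absolute distance of its slide score from that model's own decision threshold and that the fused score is the weighted mean. The exact weighting used in the study is not spelled out here, so the formula and the function name are assumptions.

```python
from typing import Sequence


def fuse_by_confidence(scores: Sequence[float],
                       thresholds: Sequence[float]) -> float:
    """Weighted mean of model scores.

    Assumed weighting: |score - threshold| per model, so predictions far
    from their threshold (more certain) contribute more.
    """
    if not scores or len(scores) != len(thresholds):
        raise ValueError("need one threshold per score and at least one score")
    weights = [abs(s - t) for s, t in zip(scores, thresholds)]
    total = sum(weights)
    if total == 0:
        # Every model sits exactly on its threshold: fall back to a plain mean.
        return sum(scores) / len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / total


if __name__ == "__main__":
    # A confident model (0.90) and a borderline one (0.55), both with threshold 0.5.
    print(fuse_by_confidence([0.90, 0.55], [0.5, 0.5]))
```

In the example the fused score lies much closer to 0.90 than the unweighted mean of 0.725 would, because the borderline model receives a small weight.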

Motivated by clinical practice, we investigated another setup, called the hierarchical setup, in which we first predict the label with the H&E classifier and add the MelanA-based classifier only for those lesions where the H&E WSI leads to an uncertain prediction.

To determine whether an H&E-based prediction was uncertain, we calculated confidence intervals (CIs) of the slide-level score via bootstrap and then checked whether the optimal decision threshold was contained in the 95% CI. For cases where the threshold was contained in the CI of the slide-level score, we added the MelanA-based classifier.
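The uncertainty check can be sketched as a percentile bootstrap over the tile scores of one slide. Several details are assumptions: that the tile scores are the resampled units, the number of resamples, the nearest-rank quantiles, and the plain average used as a stand-in for the fusion step. All function names are hypothetical.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def bootstrap_mean_ci(tile_scores: Sequence[float], n_boot: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI of the slide score (mean of tile scores)."""
    if not tile_scores:
        raise ValueError("a slide needs at least one tile score")
    rng = random.Random(seed)
    n = len(tile_scores)
    means = sorted(fmean(rng.choices(tile_scores, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * (n_boot - 1))]
    upper = means[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


def is_uncertain(tile_scores: Sequence[float], threshold: float) -> bool:
    """A slide is uncertain if the decision threshold lies inside its 95% CI."""
    lower, upper = bootstrap_mean_ci(tile_scores)
    return lower <= threshold <= upper


def hierarchical_score(he_tile_scores: Sequence[float], he_threshold: float,
                       melana_slide_score: float) -> float:
    """H&E score alone when certain; otherwise combined with the MelanA score.

    The plain average is a placeholder for the fusion step.
    """
    he_score = fmean(he_tile_scores)
    if is_uncertain(he_tile_scores, he_threshold):
        return (he_score + melana_slide_score) / 2
    return he_score


if __name__ == "__main__":
    print(is_uncertain([0.9] * 50, threshold=0.5))
    print(is_uncertain([0.4, 0.6] * 50, threshold=0.5))
```

Slides with few tiles produce wide intervals under this scheme, so very small lesions are more likely to be flagged as uncertain.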

Reporting

For all results, 95% CIs are given next to the corresponding areas under the receiver operating characteristic curve (AUROCs) of the model. CIs were calculated using the bootstrap method [27]: the predicted values of a cohort were resampled, and the AUROC was calculated for each resulting bootstrap cohort. After 10,000 repetitions, the 2.5% and 97.5% quantiles of these AUROCs were taken as the bounds of the 95% CI.
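The reporting procedure can be illustrated with a dependency-free sketch. It assumes binary labels (1 = melanoma, 0 = nevus), computes the AUROC as the fraction of correctly ordered melanoma/nevus pairs, redraws resamples that contain only one class, and uses nearest-rank quantiles. These details and the function names are assumptions, not the study's code.

```python
import random
from typing import Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels: Sequence[int], scores: Sequence[float],
                       n_boot: int = 10_000, alpha: float = 0.05,
                       seed: int = 0) -> Tuple[float, float, float]:
    """Point estimate plus percentile bootstrap CI of the AUROC."""
    point = auroc(labels, scores)  # also validates that both classes occur
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return point, lower, upper


if __name__ == "__main__":
    y = [0, 0, 0, 1, 1, 1]
    s = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]
    print(bootstrap_auroc_ci(y, s, n_boot=1000))
```

With cohorts as small as the external test sets, such intervals are wide, which is why overlapping CIs are common in Table 3.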

Results

The unimodal H&E classifier is based on previous works [5, 10, 28–30] and was adapted to the corresponding MelanA resolutions. The MelanA models were combined and fused with the H&E model into one multi-modal classifier. All described models were tested within internal distribution (InD) on the Dresden holdout set, and OOD on the Erlangen and Naples cohorts (see Table 1). AUROCs and bootstrapped CIs for all models are shown in Table 3.

Table 3. AUROC values as well as 95% bootstrapped CIs for the three test cohorts and all evolved models.

Stains and resolutions used | AUROC (Dresden) | AUROC (Erlangen) | AUROC (Naples)
H&E (0.25 μm/px) | 0.96 [0.94;0.99] | 0.75 [0.64;0.86] | 0.81 [0.67;0.92]
MelanA (0.25 μm/px) | 0.90 [0.84;0.97] | 0.75 [0.64;0.85] | 0.79 [0.66;0.91]
MelanA (0.50 μm/px) | 0.92 [0.87;0.97] | 0.78 [0.67;0.87] | 0.77 [0.63;0.89]
MelanA (1.00 μm/px) | 0.88 [0.82;0.95] | 0.73 [0.62;0.84] | 0.80 [0.67;0.92]
MelanA (2.00 μm/px) | 0.86 [0.78;0.95] | 0.67 [0.52;0.77] | 0.75 [0.60;0.87]
MelanA (all 4 combined) | 0.88 [0.80;0.96] | 0.74 [0.62;0.84] | 0.82 [0.68;0.92]
MelanA + H&E | 0.94 [0.89;0.98] | 0.81 [0.71;0.90] | 0.85 [0.73;0.94]
MelanA + H&E (hierarchical) | 0.96 [0.95;1.00] | 0.75 [0.64;0.86] | 0.83 [0.71;0.94]

All results differ significantly from random guessing, since no CI contains the critical value of 0.5. Thus, we are able to classify melanoma with all evolved MelanA-based models as well as with the benchmark H&E-based model. In addition, almost all models reach an AUROC above 0.7 on all cohorts, which makes the findings potentially relevant for clinical practice. However, note that CIs overlap in several cases, indicating that different models perform similarly and thus probably contain a high amount of shared information.

In addition, we investigated another hierarchical approach motivated by clinical practice, using only MelanA-stained slides for cases where the H&E-based model is uncertain.

The ROC diagrams of the MelanA-based, the H&E-based, and the combined models for all three cohorts are shown in Fig 2. Another representation of this plot, to better compare models within one cohort is shown in the supplements in S4 Fig. Additional ROC plots of the individual MelanA models, which consider only one magnification, and results of the hierarchical approach are shown in the supplementary material in S2 and S3 Figs.

Fig 2. ROC plots by data modality with corresponding AUROC values.

The different subplots show results for the individual evolved models: A: MelanA-based performance taking all magnifications into account; B: H&E-based performance; C: combined model using H&E as well as MelanA by aggregating the individual scores. The different colors of the ROC curves show from which data source site the results come: Red: internal results (Dresden), Blue: external results (Erlangen), Purple: external results (Naples).

MelanA-based classifiers

The curves of the different magnifications are shown in the supplementary material in S2 Fig. They overlap at several points in all cohorts, which means that for different sensitivity/specificity trade-offs, different magnifications lead to the best results. In the internal cohort, the classifiers reached AUROCs between 0.86 and 0.92, in the Erlangen cohort AUROCs between 0.67 and 0.78, and in the Naples cohort AUROCs between 0.75 and 0.80. The CIs of the different magnifications overlap in all cohorts, so no single magnification performs significantly better than the others.

The combination of all 4 magnifications, shown in Fig 2 A), was not significantly different from the models that use only one magnification.

In the Dresden (0.88) and Erlangen (0.74) cohorts, the AUROC of the combined MelanA model is numerically, i.e. without considering CIs, worse than that of the 0.50 μm/px model as a stand-alone classifier (0.92 and 0.78, respectively). For the Naples cohort, the AUROC of the combined MelanA classifier (0.82) is slightly, but not significantly, better than that of all individual models.

H&E-based classifier

The classifier using only H&E-stained tissue, as our benchmark, achieved an AUROC of 0.96 on the internal test set and AUROCs of 0.75 and 0.81 on the external cohorts, respectively. The ROC plot in Fig 2 B) and the results in Table 3 show that the internal performance is significantly better than the external performance. Performance on both external data sets is not significantly different.

Combined classifiers using H&E and MelanA

The model based on both data modalities, the H&E-stained tissue as well as the MelanA-stained tissue of all investigated resolutions, shown in Fig 2 C), performs numerically slightly worse than the H&E model on the Dresden cohort, reaching an AUROC of 0.94.

However, in the external cohorts the combined model performs best in absolute numbers, reaching AUROCs of 0.81 and 0.85 on the Erlangen and Naples cohorts, respectively. Nevertheless, the performance of the combined model is not significantly different from the MelanA-based model or from the H&E-based model for any of the investigated cohorts.

The hierarchical approach, in which MelanA predictions are only taken into account when the H&E-based prediction is uncertain and which better reflects the diagnostic path, leads to the ROC plots shown in S3 Fig. This approach resulted in the numerically best, albeit still not significantly different, performance on the internal cohort. It barely changed results on the smaller external cohorts, since the H&E-based model was uncertain for only one sample in the Naples cohort and was certain for all samples in the Erlangen cohort.

Discussion

In this work, we were able to predict melanoma/nevi classification across multiple datasets on MelanA slides with a similar accuracy as on benchmark H&E slides using DL-based image analysis. Furthermore, the results may suggest that the multistain approach has the potential to improve prediction accuracy and robustness, since at least on both external cohorts the combined model reached the highest AUROCs.

To integrate the presented work into clinical practice, a method for AI-pathologist interaction needs to be developed. For this purpose, we are developing an explainable artificial intelligence system in collaboration with dermatologists [31], which produces easily interpretable explanations based on dermatoscopic images and aims to be integrated as an AI tool into digital pathology and clinical practice. Such a system can be expanded to include other data modalities such as immunohistochemistry or routine histology.

In clinical practice, pathologists often use H&E-stained tissue sections for melanoma diagnosis and resort to IHC-stained tissue in uncertain cases [32]. While DL-assisted detection of melanoma on H&E sections has been well studied [7], few studies have been performed using additional routine IHC-stained slides. Digital image analysis by automated quantification of the proliferation marker Ki-67 was used to distinguish melanoma from nevi as a diagnostic and prognostic aid [33]. Recently, an improved DL annotation method for H&E/SOX10 dual stains was developed to better identify tumor cells in cutaneous melanoma [34]. In the study presented here, MelanA-stained tissue was selected as an additional diagnostic tool since it highlights the cytomorphology and the distribution of melanocytes, thereby allowing a more accurate evaluation of the architecture of any melanocytic tumor, along with the size and the shape of single cells. Other IHC stains such as HMB45, p16, and PRAME were excluded because they are useful only in selected cases. SOX10 was not chosen because it is a nuclear marker and gives no indication of the actual size of melanocytes or of the morphologic features of their dendritic processes. Finally, Ki-67, although widely used in routine practice, is of little help in the recognition of in situ and early invasive melanoma [13–16].

In the current pathological routine, IHC markers including MelanA are used heterogeneously in different hospitals and laboratories. At the university hospital in Dresden, generally all dermatologically melanoma-suspicious skin lesions are stained with MelanA, providing an unbiased training dataset for our study. In contrast, the OOD datasets likely contain more challenging lesions since MelanA-stained tissue was only prepared at the university hospital in Erlangen in case the H&E-stained slides provided uncertain pathological results. The Naples dataset contained 40% in situ melanomas, all of which are small in size and generate few tiles, making classification in general potentially difficult.

The Dresden test set apparently does not benefit from the inclusion of additional data (S4 Fig), since the H&E-stained tissue slides are already sufficient to yield maximum accuracy. This may be due to the rather unambiguous dataset, the resulting very high performance, and a broad data set with many subclasses. In contrast, the OOD datasets benefit from incorporating the additional MelanA-stained slides, making the classifier externally more robust. A combined classifier thus provides an advantage here, a finding we have already made in predicting BRAF status using H&E, clinical and methylation data in melanoma [10], suggesting that a multi-stain classifier can lead to better generalizability. Although the information contained in the H&E- and the MelanA-stained slides is probably partially redundant, one can still see a benefit of combining both stains on OOD data.

A detailed analysis of the various misclassifications revealed that it was not individual histopathological features that caused the model to underperform, but mainly technical artifacts such as overlapping section fragments, low staining intensity combined with strong pigmentation and very small lesions that resulted in only a small number of tiles. This also explains why the MelanA+H&E classifier did not perform better in the Dresden cohort. The simple H&E classifier is already sufficiently good; a combination in the Dresden hold out dataset does not lead to any improvement in individual cases, as there are individual unseen technical coloring artifacts here. This could be avoided by increasing the amount of data in the training set or by excluding the staining artifacts. In clinical pathology, MelanA staining is used in parallel as well as in addition to H&E staining to perform melanoma and nevi classifications. The combined model design was conceived as a parallel evaluation, whereby the model has the greatest possible information at its disposal.

Due to the cytoplasmic distribution of the MelanA protein, tiles from a higher magnification can potentially be too small to extract all relevant features. Pathologists frequently investigate the MelanA stains at lower magnifications to evaluate the silhouette and overall architecture of the lesion, which also contains valuable information. Our data could not show that there is an identifiable best magnification. However, each magnification seems to contain partly different information as the combination of all 4 magnifications brings a slight overall improvement, which can be attributed to ensembling.

Contrary to what clinical practice would suggest, the hierarchical approach did not lead to any improvement on external datasets. This shows that an unbiased dataset is preferable for training a DL model, since the network can make better decisions with larger datasets. Interestingly, the uncertain Dresden specimens are lesions with large diameters of 8 mm to 17 mm, where a melanoma has developed in the center of a nevus, with melanoma features smoothly merging into nevus features. This probably confuses the model, as all tiles are weighted equally in our model. In contrast, the uncertain lesion in Naples is very small, with a diameter of <1.0 mm.

Limitations

Overall, the major limitation of this study is the relatively small sample size of the external test sets. In addition, the above-mentioned variability in the pathological routine as well as the different staining protocols of the respective clinics complicate the comparison of the results and findings. Furthermore, considerable label noise must be taken into account: the labels were histopathologically verified according to the gold standard of care, but a high interrater variability must be assumed, as shown in previous studies [3, 35].

Conclusions

With DL analysis of MelanA-stained tissue, we were able to classify melanomas and nevi in two distinct OOD cohorts with similar accuracy as with H&E-stained tissue. The numerically, but not statistically significantly, better classification results achieved by combining H&E and MelanA classifiers suggest that the combination of these image modalities may lead to improved generalizability and performance. However, these results need to be confirmed in larger studies containing more lesions.

Supporting information

S1 Fig. Representative thumbnails for melanoma, melanoma in-situ, and nevus for all three cohorts.

(TIF)

S2 Fig. ROC plots by data source site with corresponding AUROC values.

A: Results from Dresden B: Results from Erlangen C: Results from Naples. Red: 40x magnification Blue: 20x magnification Purple: 10x magnification Gray: 5x magnification.

(TIF)

S3 Fig. ROC plot of the hierarchical compared to the combined approach with corresponding AUROC values by data source site.

A: Results from Dresden B: Results from Erlangen C: Results from Naples. Black: Results of the combined approach using H&E and MelanA for all lesions Red: Hierarchical approach using MelanA-stained tissue only for H&E-based uncertain lesions.

(TIF)

S4 Fig. ROC plots by data modality with corresponding AUROC values.

A: Results from Dresden B: Results from Erlangen C: Results from Naples. Red: MelanA-based performance taking all magnifications into account Purple: H&E-based performance Black: combined model using H&E as well as MelanA by aggregating the individual scores.

(TIF)

S1 Table. Hyperparameters of all evolved models.

(XLSX)

S2 Table. Additional results derived by using different fusion approaches: dist-opt means weighted by the distance to the individual models' optimal thresholds; dist-05 means weighted by the distance to the default threshold of 0.5; avg denotes the fusion by a simple average of all scores; perf means weighted based on the individual models' validation performance, such that better performing models contribute more to the fused result.

(XLSX)

S1 Dataset

(ZIP)

S2 Dataset

(ZIP)

S3 Dataset

(ZIP)

S4 Dataset

(ZIP)

S5 Dataset

(ZIP)

Abbreviations

AI: Artificial intelligence
AUROC: Area under receiver operating characteristics curve
CI: Confidence interval
CNN: Convolutional neural network
DL: Deep Learning
H&E: Hematoxylin and Eosin
IHC: Immunohistochemical
InD: Internal distribution
MART1: Melanoma antigen recognized by T-cells 1
MelanA: Melanocytic antigen
OOD: Out of distribution
WSI: Whole slide image

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The presented work was funded by the Federal Ministry of Health, Berlin, Germany (grants: Tumor Behavior Prediction Initiative (TPI) and Skin Classification Project 2 (SCP2)) and the Ministry of Social Affairs, Health and Integration of the Federal State Baden-Württemberg, Germany (grant: KTI); grant holder in all cases: Titus J. Brinker, German Cancer Research Center, Heidelberg, Germany. The sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

References

1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021 May;71(3):209–49. doi: 10.3322/caac.21660
2. Saginala K, Barsouk A, Aluru JS, Rawla P, Barsouk A. Epidemiology of Melanoma. Med Sci. 2021 Oct 20;9(4):63. doi: 10.3390/medsci9040063
3. Elmore JG, Barnhill RL, Elder DE, Longton GM, Pepe MS, Reisch LM, et al. Pathologists’ diagnosis of invasive melanoma and melanocytic proliferations: observer accuracy and reproducibility study. BMJ. 2017 Jun 28;357:j2813. doi: 10.1136/bmj.j2813
4. Niebling MG, Haydu LE, Karim RZ, Thompson JF, Scolyer RA. Pathology review significantly affects diagnosis and treatment of melanoma patients: an analysis of 5011 patients treated at a melanoma treatment center. Ann Surg Oncol. 2014 Jul;21(7):2245–51. doi: 10.1245/s10434-014-3682-x
5. Höhn J, Krieghoff-Henning E, Jutzi TB, von Kalle C, Utikal JS, Meier F, et al. Combining CNN-based histologic whole slide image analysis and patient data to improve skin cancer classification. Eur J Cancer Oxf Engl 1990. 2021 May;149:94–101. doi: 10.1016/j.ejca.2021.02.032
6. Li M, Abe M, Nakano S, Tsuneki M. Deep Learning Approach to Classify Cutaneous Melanoma in a Whole Slide Image. Cancers. 2023 Mar 22;15(6):1907. doi: 10.3390/cancers15061907
7. Brinker TJ, Schmitt M, Krieghoff-Henning EI, Barnhill R, Beltraminelli H, Braun SA, et al. Diagnostic performance of artificial intelligence for histologic melanoma recognition compared to 18 international expert pathologists. J Am Acad Dermatol. 2022 Mar;86(3):640–2. doi: 10.1016/j.jaad.2021.02.009
8. Tewary S, Mukhopadhyay S. AutoIHCNet: CNN architecture and decision fusion for automated HER2 scoring. Appl Soft Comput. 2022 Apr;119:108572.
9. Foersch S, Glasner C, Woerl AC, Eckstein M, Wagner DC, Schulz S, et al. Multistain deep learning for prediction of prognosis and therapy response in colorectal cancer. Nat Med [Internet]. 2023 Jan 9 [cited 2023 Jan 23]; Available from: https://www.nature.com/articles/s41591-022-02134-1 doi: 10.1038/s41591-022-02134-1
10. Schneider L, Wies C, Krieghoff-Henning EI, Bucher TC, Utikal JS, Schadendorf D, et al. Multimodal integration of image, epigenetic and clinical data to predict BRAF mutation status in melanoma. Eur J Cancer. 2023 Apr;183:131–8. doi: 10.1016/j.ejca.2023.01.021
11. Schneider L, Laiouar-Pedari S, Kuntz S, Krieghoff-Henning E, Hekler A, Kather JN, et al. Integration of deep learning-based image analysis and genomic data in cancer pathology: A systematic review. Eur J Cancer. 2022 Jan;160:80–91. doi: 10.1016/j.ejca.2021.10.007
12. Kawakami Y, Eliyahu S, Delgado CH, Robbins PF, Rivoltini L, Topalian SL, et al. Cloning of the gene coding for a shared human melanoma antigen recognized by autologous T cells infiltrating into tumor. Proc Natl Acad Sci. 1994 Apr 26;91(9):3515–9. doi: 10.1073/pnas.91.9.3515
13. Gown AM, Vogel AM, Hoak D, Gough F. Monoclonal Antibodies Specific for Melanocytic Tumors Distinguish Subpopulations of Melanocytes. 1986;9.
14. Soyer HP. Ki 67 immunostaining in melanocytic skin tumors. Correlation with histologic parameters. J Cutan Pathol. 1991 Aug;18(4):264–72.
15. Chen YT, Stockert E, Tsang S, Coplan KA, Old LJ. Immunophenotyping of melanomas for tyrosinase: implications for vaccine development. Proc Natl Acad Sci. 1995 Aug 29;92(18):8125–9. doi: 10.1073/pnas.92.18.8125
16. Cho KH, Hashimoto K, Taniguchi Y, Pietruk T, Zarbo RJ, An T. Immunohistochemical study of melanocytic nevus and malignant melanoma with monoclonal antibodies against s-100 subunits. Cancer. 1990 Aug 15;66(4):765–71.
17. Watari K, Tojo A, Nagamura-Inoue T, Nagamura F, Takeshita A, Fukushima T, et al. Identification of a melanoma antigen, PRAME, as a BCR/ABL-inducible gene. FEBS Lett. 2000 Jan 28;466(2–3):367–71. doi: 10.1016/s0014-5793(00)01112-1
  • 18.Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015. Oct 28;351:h5527. doi: 10.1136/bmj.h5527 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Coulie PG, Brichard V, Van Pel A, Wölfel T, Schneider J, Traversari C, et al. A new gene coding for a differentiation antigen recognized by autologous cytolytic T lymphocytes on HLA-A2 melanomas. J Exp Med. 1994. Jul 1;180(1):35–42. doi: 10.1084/jem.180.1.35 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Kawakami Y, Eliyahu S, Sakaguchi K, Robbins PF, Rivoltini L, Yannelli JR, et al. Identification of the immunodominant peptides of the MART-1 human melanoma antigen recognized by the majority of HLA-A2-restricted tumor infiltrating lymphocytes. J Exp Med. 1994. Jul 1;180(1):347–52. doi: 10.1084/jem.180.1.347 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Bankhead P, Loughrey MB, Fernández JA, Dombrowski Y, McArt DG, Dunne PD, et al. QuPath: Open source software for digital pathology image analysis. Sci Rep. 2017. Dec 4;7(1):16878. doi: 10.1038/s41598-017-17204-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Internet]. Las Vegas, NV, USA: IEEE; 2016 [cited 2023 Jun 6]. p. 770–8. Available from: http://ieeexplore.ieee.org/document/7780459/
  • 23.Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework [Internet]. arXiv; 2019. [cited 2022 Dec 13]. Available from: http://arxiv.org/abs/1907.10902 [Google Scholar]
  • 24.Wightman R, Raw N, Soare A, Arora A, Ha C, Reich C, et al. rwightman/pytorch-image-models: v0.8.10dev0 Release [Internet]. Zenodo; 2023. [cited 2023 Nov 20]. Available from: https://zenodo.org/record/4414861 [Google Scholar]
  • 25.Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library [Internet]. arXiv; 2019. [cited 2023 Jun 5]. Available from: http://arxiv.org/abs/1912.01703 [Google Scholar]
  • 26.Tellez D, Litjens G, Bándi P, Bulten W, Bokhorst JM, Ciompi F, et al. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med Image Anal. 2019. Dec;58:101544. doi: 10.1016/j.media.2019.101544 [DOI] [PubMed] [Google Scholar]
  • 27.Efron B. Bootstrap Methods: Another Look at the Jackknife. Ann Stat. 1979;7(1):1–26. [Google Scholar]
  • 28.Zhou Z, Ren Y, Zhang Z, Guan T, Wang Z, Chen W, et al. Digital histopathological images of biopsy predict response to neoadjuvant chemotherapy for locally advanced gastric cancer. Gastric Cancer. 2023. Sep;26(5):734–42. doi: 10.1007/s10120-023-01407-z [DOI] [PubMed] [Google Scholar]
  • 29.Kulkarni PM, Robinson EJ, Sarin Pradhan J, Gartrell-Corrado RD, Rohr BR, Trager MH, et al. Deep Learning Based on Standard H&E Images of Primary Melanoma Tumors Identifies Patients at Risk for Visceral Recurrence and Death. Clin Cancer Res. 2020. Mar 1;26(5):1126–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Wessels F, Schmitt M, Krieghoff-Henning E, Kather JN, Nientiedt M, Kriegmair MC, et al. Deep learning can predict survival directly from histology in clear cell renal cell carcinoma. Huk M, editor. PLOS ONE. 2022. Aug 17;17(8):e0272656. doi: 10.1371/journal.pone.0272656 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Chanda T, Hauser K, Hobelsberger S, Bucher TC, Garcia CN, Wies C, et al. Dermatologist-like explainable AI enhances trust and confidence in diagnosing melanoma. arxiv [Internet]. Available from: https://arxiv.org/abs/2303.12806 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Kim SW, Roh J, Park CS. Immunohistochemistry for Pathologists: Protocols, Pitfalls, and Tips. J Pathol Transl Med. 2016. Nov;50(6):411–8. doi: 10.4132/jptm.2016.08.08 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Nielsen PS, Riber-Hansen R, Raundahl J, Steiniche T. Automated quantification of MART1-verified Ki67 indices by digital image analysis in melanocytic lesions. Arch Pathol Lab Med. 2012. Jun;136(6):627–34. doi: 10.5858/arpa.2011-0360-OA [DOI] [PubMed] [Google Scholar]
  • 34.Nielsen PS, Georgsen JB, Vinding MS, Østergaard LR, Steiniche T. Computer-Assisted Annotation of Digital H&E/SOX10 Dual Stains Generates High-Performing Convolutional Neural Network for Calculating Tumor Burden in H&E-Stained Cutaneous Melanoma. Int J Environ Res Public Health. 2022. Nov 2;19(21):14327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Lodha S, Saggar S, Celebi JT, Silvers DN. Discordance in the histopathologic diagnosis of difficult melanocytic neoplasms in the clinical setting. J Cutan Pathol. 2008. Apr;35(4):349–52. doi: 10.1111/j.1600-0560.2007.00970.x [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Vincenzo L'Imperio

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

16 Oct 2023

PONE-D-23-29059

Evaluating Deep Learning-based Melanoma Classification using Immunohistochemistry and Routine Histology: A Three Center Study

PLOS ONE

Dear Dr. Brinker,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 26 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vincenzo L'Imperio, MD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for including your ethics statement: "Ethics approval was obtained from the ethics committees of the respective universities before the study was initiated. Patients provided informed written consent. This work was performed in accordance with the Declaration of Helsinki".

Please amend your current ethics statement to include the full name of the ethics committee/institutional review board(s) that approved your specific study. 

Once you have amended this/these statement(s) in the Methods section of the manuscript, please add the same text to the “Ethics Statement” field of the submission form (via “Edit Submission”).

For additional information about PLOS ONE ethical requirements for human subjects research, please refer to http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research.

3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections does not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

4. Thank you for stating the following in the Competing Interests section: 

[TJB would like to disclose that he is the owner of Smart Health Heidelberg GmbH (Handschuhsheimer Landstr. 9/1, 69120 Heidelberg, Germany) which develops mobile apps, outside of the submitted work. SHo reports clinical trial support from Almirall and speaker’s honoraria from Almirall, UCB and AbbVie and has received travel support from the following companies: UCB, Janssen Cilag, Almirall, Novartis, Lilly, LEO Pharma and AbbVie outside the submitted work.]. 

Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials." (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf. 

5. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

6. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

7. We note that you have included the phrase “data not shown” in your manuscript. Unfortunately, this does not meet our data sharing requirements. PLOS does not permit references to inaccessible data. We require that authors provide all relevant data within the paper, Supporting Information files, or in an acceptable, public repository. Please add a citation to support this phrase or upload the data that corresponds with these findings to a stable repository (such as Figshare or Dryad) and provide any URLs, DOIs, or accession numbers that may be used to access these data. Or, if the data are not a core part of the research being presented in your study, we ask that you remove the phrase that refers to these data.

8. Your ethics statement should only appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please ensure that your ethics statement is included in your manuscript, as the ethics statement entered into the online submission form will not be published alongside your manuscript. 

9. We notice that your supplementary figures and tables are included in the manuscript file. Please remove them and upload them with the file type 'Supporting Information'. Please ensure that each Supporting Information file has a legend listed in the manuscript after the references list.

10. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. 

11. We note that Figure 1 in your submission contains copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

A. You may seek permission from the original copyright holder of Figure 1 to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

B. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

12. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Titus J. Brinker et al.’s work presents an original computational approach to melanocytic lesions, utilizing clear and elegant methods, yielding promising yet not optimal results for clinical translation. Specifically, the study employs Deep-learning models to analyze histological information extracted from both H&E and Melan-A immunostain, aiming to classify lesions into benign nevus or melanoma. The study is well-executed and well-designed, however, as mentioned in the paper, this approach has certain limitations. One of its inherent shortcomings is the inability to provide a comprehensive evaluation of the entire lesion at low magnification, which is usually crucial in the evaluation of melanocytic lesions, overlooking important features such as the symmetry. Another limitation is the lack of a direct link to histological characteristics and the absence of accompanying histological images, which could significantly improve the accessibility and clarity of the study. On the other hand, the study's main strengths include the evaluation of different immunohistochemical clones, a multicentric approach with a well-balanced dataset and training conducted at various magnifications.

Overall, the study could be improved with a few adjustments:

- In the introduction, it is mentioned that pathologists have up to a 25% interobserver discordance rate, but the references cited are relatively dated (from 15 to 27 years ago), a period during which the evaluation of melanocytic lesions has gained new markers and insights. More recent references would be advisable.

- In the same section, from the sentence beginning with 'If melanoma is initially misclassified...' to the end of the paragraph, references are lacking.

- In the Methods section, Table 1 categorizes cases into three classes, including melanoma in situ. While it is implied that melanoma in situ belongs to the melanoma class, the table's presentation may still cause confusion. To enhance clarity, it is advisable to explicitly specify the relationship between melanoma in situ and the melanoma class.

- In the same section it is stated that a color augmentation approach was used, but it is not established whether this led to improvements in metrics and domain shift. At least some commentary on this approach, which sparks debate on model cutoff changes, would be recommended.

- In the results section, the initial part appears somewhat repetitive, and it would be advisable to condense this portion within the methods to make the entire methodological process more straightforward.

- Still in the results section, the legend in Figure 2 appears to reverse Figures 2A and 2B.

- In the discussion, no explanation or hypothesis is provided for these misclassifications, and it is not described from a histological perspective in which cases the model underperformed. It remains unclear whether this underperformance was solely due to variations in technical slide characteristics or if there are specific subsets of lesions where the model struggled more.

Reviewer #2: Since the objective of the study is not simply the development of a DL-based model to diagnose melanoma on WSI, but a comparison of the performance of DL systems using H&E versus MelanA, without lengthening the introduction, I would integrate it by briefly mentioning a few numbers, namely: (1a) the current state-of-the-art performance of published DL models in melanoma diagnosis, (1b) whether there are public datasets to gauge performance, and (2) the reported accuracy boosts by combining H&E with IHC (in other use-cases if there are none in melanoma).

It would be useful for the reader if the authors provided a few example WSI thumbnails and fields from the three cohorts, for each staining modality, to gauge the extent of staining/scanning variation between cohorts.

Am I correct in understanding that the patches were extracted without stride/overlap? Was the same done for inference? I would state this clearly in the manuscript.
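
The distinction raised in this question can be illustrated with a minimal sketch, assuming a simple grid tiler over a rectangular region measured in pixels. The function name `tile_origins` and its `stride` parameter are hypothetical and are not taken from the manuscript; only the 237 px tile size is mentioned in the review. A stride equal to the tile size yields non-overlapping patches, while a smaller stride yields overlapping ones.

```python
def tile_origins(width, height, tile=237, stride=None):
    """Return top-left (x, y) origins of full tiles inside a width x height region.

    stride=None means stride == tile, i.e. non-overlapping tiles.
    Partial tiles at the right and bottom edges are dropped (an assumption).
    """
    if stride is None:
        stride = tile
    if tile <= 0 or stride <= 0:
        raise ValueError("tile and stride must be positive")
    return [
        (x, y)
        for y in range(0, height - tile + 1, stride)
        for x in range(0, width - tile + 1, stride)
    ]


# Hypothetical 500 x 300 px region, used only for illustration.
no_overlap = tile_origins(500, 300)               # stride == tile size
overlapping = tile_origins(500, 300, stride=100)  # neighbouring tiles share pixels
```

In this toy region the non-overlapping grid produces two tiles and the 100 px stride produces three, which shows why the same setting would need to be stated for both training and inference.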

Some details regarding model architecture are lacking. In addition to the mentioned pooling strategy, where were models cut, and what custom heads were used?

What hardware was used? What's the inference time per tile and per WSI?

L127. Training the model at tile level using the slide label has some risks. Were there WSIs in which multiple categories coexisted (i.e. melanoma ± in-situ melanoma ± benign nevus)? How were these handled in training? And in inference?

The model input is (a batch of) 237×237px patches, but what about its output? Figure 1 seems to show a binary output (melanoma vs benign?) but you also had in-situ melanoma in the slide categories. What exactly was the output of the model?

L127 "All tiles of the slide were predicted and the slide score was calculated by averaging all tile scores (see Figure 1)". Can you provide more detail? The answer to this question depends on the previous one. Was the model a binary melanoma/benign model, and was your final label "benign" if benign tiles outnumbered malignant tiles? Was your model a multicategory one (melanoma - melanoma in situ - benign nevus) and did you predict the most common category as the slide label?
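
The two readings contrasted in this question can be made concrete with a minimal sketch, assuming each tile receives a melanoma score in [0, 1] from a binary classifier. The function names and the 0.5 threshold are hypothetical and are not taken from the manuscript; the sketch only shows that averaging tile scores and counting thresholded tiles can disagree on the same slide.

```python
from statistics import mean


def slide_score_mean(tile_scores):
    """Average the per-tile melanoma scores into a single slide score."""
    if not tile_scores:
        raise ValueError("slide has no tiles")
    return mean(tile_scores)


def slide_label_majority(tile_scores, threshold=0.5):
    """Label the slide by majority vote over individually thresholded tiles."""
    if not tile_scores:
        raise ValueError("slide has no tiles")
    malignant = sum(score >= threshold for score in tile_scores)
    return "melanoma" if malignant > len(tile_scores) / 2 else "benign"


# Hypothetical slide: two confident melanoma tiles, three mildly benign tiles.
tile_scores = [0.9, 0.9, 0.4, 0.4, 0.4]

mean_score = slide_score_mean(tile_scores)
mean_label = "melanoma" if mean_score >= 0.5 else "benign"
majority_label = slide_label_majority(tile_scores)
```

For this made-up slide the averaged score crosses the assumed threshold while the majority vote does not, so the choice of aggregation rule changes the slide label.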

How did you handle tiles which were not melanoma, melanoma in-situ or nevus? In the training set you annotated and extracted only tiles from the lesion, but what about the test set? Did you run inference on the whole tissue, including epidermis, dermis and subcutis uninvolved by melanocytic tumor? Did you use a tissue detector to filter out empty tiles?

The reported performances of the H&E-only and MelanA-only models make me think. The pathologists among the authors of this work are invited to discuss their interpretation of the fact that the MelanA+H&E classifier does not seem to work better than the simple H&E classifier, at least on the Dresden cohort. What does the pathologist's thought process look like when evaluating MelanA in addition to H&E, and was the combined model designed coherently?

L288 while I agree that each magnification contains partly different information, I don't think you can infer that from the observation that the combination of all 4 magnifications brings a slight overall improvement. Unless you prove that it is not due to ensembling.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Giorgio Cazzaniga

Reviewer #2: Yes: Alessandro Caputo

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jan 19;19(1):e0297146. doi: 10.1371/journal.pone.0297146.r002

Author response to Decision Letter 0


22 Nov 2023

Dear Dr. L'Imperio, dear Dr. Cazzaniga, dear Dr. Caputo,

We thank the reviewers for their constructive suggestions and questions, which helped improve the quality of this work.

We have attached an additional document, "Review-Letter_with_comments", to the submission, in which we respond point by point to all questions, covering the journal requirements as well as the specific questions from the two reviewers.

We hope that we have addressed all open points.

Sincerely,

Dr. Titus J. Brinker

Attachment

Submitted filename: Review-Letter_with_comments.pdf

Decision Letter 1

Vincenzo L'Imperio

29 Dec 2023

Evaluating Deep Learning-based Melanoma Classification using Immunohistochemistry and Routine Histology: A Three Center Study

PONE-D-23-29059R1

Dear Dr. Brinker,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vincenzo L'Imperio, MD

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: From my side, the authors have aptly addressed the comments. The refined manuscript is now well-prepared for publication.

Reviewer #2: All comments have been addressed adequately. The limitations that remain are explained in the manuscript text.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Giorgio Cazzaniga

Reviewer #2: Yes: Alessandro Caputo

**********

Acceptance letter

Vincenzo L'Imperio

9 Jan 2024

PONE-D-23-29059R1

PLOS ONE

Dear Dr. Brinker,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Vincenzo L'Imperio

Academic Editor

PLOS ONE

Associated Data


    Supplementary Materials

    S1 Fig. Representative thumbnails for melanoma, melanoma in-situ, and nevus for all three cohorts.

    (TIF)

    S2 Fig. ROC plots by data source site with corresponding AUROC values.

    A: Results from Dresden; B: Results from Erlangen; C: Results from Naples. Red: 40x magnification; Blue: 20x magnification; Purple: 10x magnification; Gray: 5x magnification.

    (TIF)

    S3 Fig. ROC plot of the hierarchical compared to the combined approach with corresponding AUROC values by data source site.

    A: Results from Dresden; B: Results from Erlangen; C: Results from Naples. Black: Results of the combined approach using H&E and MelanA for all lesions; Red: Hierarchical approach using MelanA-stained tissue only for H&E-based uncertain lesions.

    (TIF)

    S4 Fig. ROC plots by data modality with corresponding AUROC values.

    A: Results from Dresden; B: Results from Erlangen; C: Results from Naples. Red: MelanA-based performance taking all magnifications into account; Purple: H&E-based performance; Black: combined model using H&E as well as MelanA by aggregating the individual scores.

    (TIF)

    S1 Table. Hyperparameters of all evolved models.

    (XLSX)

    S2 Table. Additional results obtained with different fusion approaches. dist-opt: weighted by the distance to the individual models' optimal thresholds; dist-05: weighted by the distance to the default threshold of 0.5; avg: fusion by a simple average of all scores; perf: weighted by the individual models' validation performance, so that better-performing models contribute more to the fused result.

    (XLSX)

    S1 Dataset

    (ZIP)

    S2 Dataset

    (ZIP)

    S3 Dataset

    (ZIP)

    S4 Dataset

    (ZIP)

    S5 Dataset

    (ZIP)

    Attachment

    Submitted filename: Review-Letter_with_comments.pdf
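The four fusion rules named in the S2 Table caption (avg, dist-05, dist-opt, perf) can be illustrated with a minimal Python sketch. The caption does not give the exact formulas, so everything here is an assumption made for illustration: the function names are hypothetical, each model is assumed to output one melanoma score in [0, 1] per lesion, the distance weights are taken as the absolute difference between a score and its threshold, and the performance weights are used as given. This is not the authors' implementation.

```python
def weighted_mean(scores, weights):
    """Weighted average of scores; falls back to the plain mean if all weights are zero."""
    total = sum(weights)
    if total == 0:
        return sum(scores) / len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / total


def fuse_avg(scores):
    """avg: simple average of all model scores."""
    return sum(scores) / len(scores)


def fuse_dist(scores, thresholds):
    """dist-opt: weight each score by its distance to that model's threshold.

    Assumption: the weight is |score - threshold|, so a model whose score lies
    far from its decision threshold contributes more to the fused score.
    """
    weights = [abs(s - t) for s, t in zip(scores, thresholds)]
    return weighted_mean(scores, weights)


def fuse_dist_05(scores):
    """dist-05: same weighting, using the default threshold 0.5 for every model."""
    return fuse_dist(scores, [0.5] * len(scores))


def fuse_perf(scores, val_performance):
    """perf: weight each score by that model's validation performance.

    Assumption: the performance values (for example validation AUROCs) are
    used directly as weights.
    """
    return weighted_mean(scores, val_performance)


# Hypothetical scores for one lesion from two models (e.g. H&E and MelanA).
scores = [0.9, 0.6]
fused = {
    "avg": fuse_avg(scores),
    "dist-05": fuse_dist_05(scores),
    "dist-opt": fuse_dist(scores, [0.7, 0.4]),  # hypothetical optimal thresholds
    "perf": fuse_perf(scores, [3.0, 1.0]),      # hypothetical performance weights
}
print(fused)
```

Under these assumptions all four rules are weighted averages that differ only in how the weights are chosen, which is why a single helper covers them. Other reasonable readings of the caption, such as normalising the performance weights or rescaling the distances, would change the fused values.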

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLOS ONE are provided here courtesy of PLOS
