JAMA. 2017 Dec 12;318(22):2199–2210. doi: 10.1001/jama.2017.14585

Table 2. Test Data Set Results of the 32 Submitted Algorithms vs Pathologists for Tasks 1 and 2 in the CAMELYON16 Challenge^a

| Codename^b | Task 1: Metastasis Identification, FROC Score (95% CI)^c | Task 2: Metastases Classification, AUC (95% CI)^c | P Value, Algorithm vs Pathologists WTC^d | Deep Learning Architecture | Comments |
|---|---|---|---|---|---|
| HMS and MIT II | 0.807 (0.732-0.889) | 0.994 (0.983-0.999) | <.001 | GoogLeNet | Ensemble of 2 networks; stain standardization; extensive data augmentation; hard negative mining |
| HMS and MGH III | 0.760 (0.692-0.857) | 0.976 (0.941-0.999) | <.001 | ResNet | Fine-tuned pretrained network; fully convolutional network |
| HMS and MGH I | 0.596 (0.578-0.734) | 0.964 (0.928-0.989) | <.001 | GoogLeNet | Fine-tuned pretrained network |
| CULab III | 0.703 (0.605-0.799) | 0.940 (0.888-0.980) | <.001 | VGG-16 | Fine-tuned pretrained network; fully convolutional network |
| HMS and MIT I | 0.693 (0.600-0.819) | 0.923 (0.855-0.977) | .11 | GoogLeNet | Ensemble of 2 networks; hard negative mining |
| ExB I | 0.511 (0.363-0.620) | 0.916 (0.858-0.962) | .02 | ResNet | Varied class balance during training |
| CULab I | 0.544 (0.467-0.629) | 0.909 (0.851-0.954) | .04 | VGG-Net | Fine-tuned pretrained network |
| HMS and MGH II | 0.729 (0.596-0.788) | 0.908 (0.846-0.961) | .04 | ResNet | Fine-tuned pretrained network |
| CULab II | 0.527 (0.335-0.627) | 0.906 (0.841-0.957) | .16 | VGG-Net & ResNet | Fine-tuned pretrained network; cascaded a VGG-Net that operated on low-magnification images with a ResNet that refined the results |
| DeepCare I | 0.243 (0.197-0.356) | 0.883 (0.806-0.943) | >.99 | GoogLeNet | Fine-tuned pretrained network |
| Quincy Wong I | 0.367 (0.250-0.521) | 0.865 (0.789-0.924) | >.99 | SegNet | Fine-tuned pretrained network |
| Middle East Technical University I | 0.389 (0.272-0.512) | 0.864 (0.786-0.927) | >.99 | 4-layer CNN | Custom confidence filtering for postprocessing |
| NLP LOGIX I | 0.386 (0.255-0.511) | 0.830 (0.742-0.899) | >.99 | AlexNet | Second-stage random forest classifier to generate slide scores |
| Smart Imaging II | 0.339 (0.239-0.420) | 0.821 (0.753-0.894) | >.99 | GoogLeNet | Ensemble of the output from the team's first entry and the GoogLeNet model |
| University of Toronto I | 0.382 (0.286-0.515) | 0.815 (0.722-0.886) | >.99 | VGG-Net | Combined the outputs of multiple CNNs trained on different magnifications by computing their mean |
| Warwick-Qatar University I | 0.305 (0.219-0.397) | 0.796 (0.711-0.871) | >.99 | U-Net | Stain normalization |
| Radboudumc I | 0.575 (0.446-0.659) | 0.779 (0.694-0.860) | >.99 | VGG-Net | Extensive data augmentation; second-stage CNN to generate slide-level scores |
| Hochschule für Technik und Wirtschaft-Berlin I | 0.187 (0.112-0.250) | 0.768 (0.665-0.853) | >.99 | CRFasRNN | Fine-tuned pretrained network |
| University of Toronto II | 0.352 (0.292-0.511) | 0.762 (0.659-0.846) | >.99 | VGG-Net | Combined the outputs of multiple CNNs trained on different magnifications by using an additional CNN |
| Tampere I | 0.257 (0.171-0.376) | 0.761 (0.662-0.837) | >.99 | Random Forests | Large set of intensity and texture features |
| Smart Imaging I | 0.208 (0.119-0.306) | 0.757 (0.663-0.839) | >.99 | SVM & Adaboost | Cascaded SVM and Adaboost classifiers using texture features |
| CAMP-TUM II | 0.273 (0.194-0.379) | 0.735 (0.633-0.819) | >.99 | GoogLeNet | Hard negative mining |
| Osaka University I | 0.347 (0.234-0.463) | 0.732 (0.629-0.824) | >.99 | GoogLeNet | |
| University of South Florida I | 0.179 (0.116-0.242) | 0.727 (0.611-0.823) | >.99 | Random Forests | Various intensity and texture features |
| NSS I | 0.165 (0.116-0.195) | 0.727 (0.635-0.810) | >.99 | Rule-based | Multiple thresholds on several nucleus-based features |
| Tampere II | 0.252 (0.149-0.350) | 0.713 (0.612-0.801) | >.99 | 7-layer CNN | Self-designed network architecture |
| CAMP-TUM I | 0.184 (0.127-0.243) | 0.691 (0.580-0.787) | >.99 | Agg-Net | Multiscale approach for analyzing the images |
| Minsk Team I | 0.227 (0.181-0.264) | 0.689 (0.568-0.804) | >.99 | GoogLeNet | Separate models for different data sets; hard negative mining |
| VISILAB I | 0.142 (0.080-0.203) | 0.653 (0.551-0.748) | >.99 | Random Forests | Haralick texture features |
| VISILAB II | 0.116 (0.063-0.177) | 0.651 (0.549-0.742) | >.99 | 3-layer CNN | Self-designed network architecture |
| Anonymous I | 0.097 (0.049-0.158) | 0.628 (0.530-0.717) | >.99 | Random Forests | |
| Laboratoire d'Imagerie Biomédicale I | 0.120 (0.079-0.182) | 0.556 (0.434-0.654) | >.99 | SVM | Various color and texture features |
| Pathologist WOTC | 0.724 (0.643-0.804) | 0.966 (0.927-0.998) | | | Expert pathologist who assessed the slides without a time constraint |
| Mean pathologists WTC | | 0.810 (0.750-0.869) | | | Mean performance of 11 pathologists in a simulation exercise designed to mimic the routine workflow of diagnostic pathology, with a flexible 2-hour time limit |

Abbreviations: AUC, area under the receiver operating characteristic curve; CAMELYON16, Cancer Metastases in Lymph Nodes Challenge 2016; CAMP-TUM, Computer Aided Medical Procedures and Augmented Reality-Technical University of Munich; CNN, convolutional neural network; CULab, Chinese University Lab; FROC, free-response receiver operating characteristic; HMS, Harvard Medical School; MGH, Massachusetts General Hospital; MIT, Massachusetts Institute of Technology; WOTC, without time constraint; WTC, with time constraint.

^a For contact information and a detailed description of each algorithm, see eTable 3 and eMethods in the Supplement. For a glossary of deep learning terminology, see the eAppendix in the Supplement.

^b Algorithms are ranked from highest (top of table) to lowest (bottom of table) performance on task 2. The Roman numeral in each team's algorithm name indicates the submission number; teams were allowed a maximum of 3 submissions.

^c The percentile bootstrap method was used to construct 95% CIs for FROC true-positive fraction scores (FROC scores) and AUCs.
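As a rough illustration of the percentile bootstrap, the sketch below resamples slides with replacement, recomputes the AUC on each resample, and takes the 2.5th and 97.5th percentiles of the bootstrap distribution as the CI bounds. The data, function name, and replicate count are hypothetical; the challenge's exact resampling procedure is specified in the eMethods, not here.

```python
# Minimal sketch of a percentile bootstrap 95% CI for an AUC.
# Hypothetical data and helper names; the challenge's actual
# procedure (resampling unit, replicate count) is in its eMethods.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05):
    """Resample cases with replacement, recompute the AUC each time,
    and take the alpha/2 and 1-alpha/2 percentiles of the results."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    n, stats = len(labels), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)      # resample slide indices
        if labels[idx].min() == labels[idx].max():
            continue                          # AUC needs both classes present
        stats.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Toy usage: 20 hypothetical slides (1 = slide contains metastasis).
y = np.repeat([0, 1], 10)
s = rng.random(20)
print(bootstrap_auc_ci(y, s))
```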

^d Results of the significance test, using multireader multicase (MRMC) ROC analysis, comparing each individual algorithm with the pathologists WTC. P values were adjusted for multiple comparisons with the Bonferroni correction, in which each P value is multiplied by the number of comparisons (32; the 32 submitted algorithms each compared with the panel of pathologists).
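To make the Bonferroni adjustment concrete, here is a minimal sketch with made-up raw P values (they are not the paper's MRMC results): each raw value is multiplied by the 32 comparisons and capped at 1, which is why many adjusted values in the table appear as >.99.

```python
# Minimal sketch of the Bonferroni correction described above,
# using hypothetical raw P values.
import numpy as np

raw_p = np.array([1e-5, 0.0035, 0.0006, 0.031])    # made-up examples
n_comparisons = 32                                 # 32 submitted algorithms
adjusted = np.minimum(raw_p * n_comparisons, 1.0)  # multiply, cap at 1
print(adjusted)  # [0.00032, 0.112, 0.0192, 0.992]
```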