JAMA. 2017 Dec 12;318(22):2199–2210. doi: 10.1001/jama.2017.14585

Table 2. Test Data Set Results of the 32 Submitted Algorithms vs Pathologists for Tasks 1 and 2 in the CAMELYON16 Challenge^a

| Codename^b | Task 1: Metastasis Identification, FROC Score (95% CI)^c | Task 2: Metastases Classification, AUC (95% CI)^c | P Value, Algorithm vs Pathologists WTC^d | Deep Learning Architecture | Comments |
|---|---|---|---|---|---|
| HMS and MIT II | 0.807 (0.732-0.889) | 0.994 (0.983-0.999) | <.001 | GoogLeNet | Ensemble of 2 networks; stain standardization; extensive data augmentation; hard negative mining |
| HMS and MGH III | 0.760 (0.692-0.857) | 0.976 (0.941-0.999) | <.001 | ResNet | Fine-tuned pretrained network; fully convolutional network |
| HMS and MGH I | 0.596 (0.578-0.734) | 0.964 (0.928-0.989) | <.001 | GoogLeNet | Fine-tuned pretrained network |
| CULab III | 0.703 (0.605-0.799) | 0.940 (0.888-0.980) | <.001 | VGG-16 | Fine-tuned pretrained network; fully convolutional network |
| HMS and MIT I | 0.693 (0.600-0.819) | 0.923 (0.855-0.977) | .11 | GoogLeNet | Ensemble of 2 networks; hard negative mining |
| ExB I | 0.511 (0.363-0.620) | 0.916 (0.858-0.962) | .02 | ResNet | Varied class balance during training |
| CULab I | 0.544 (0.467-0.629) | 0.909 (0.851-0.954) | .04 | VGG-Net | Fine-tuned pretrained network |
| HMS and MGH II | 0.729 (0.596-0.788) | 0.908 (0.846-0.961) | .04 | ResNet | Fine-tuned pretrained network |
| CULab II | 0.527 (0.335-0.627) | 0.906 (0.841-0.957) | .16 | VGG-Net & ResNet | Fine-tuned pretrained network; cascaded a VGG-Net that operated on low-magnification images with a ResNet that refined the results |
| DeepCare I | 0.243 (0.197-0.356) | 0.883 (0.806-0.943) | >.99 | GoogLeNet | Fine-tuned pretrained network |
| Quincy Wong I | 0.367 (0.250-0.521) | 0.865 (0.789-0.924) | >.99 | SegNet | Fine-tuned pretrained network |
| Middle East Technical University I | 0.389 (0.272-0.512) | 0.864 (0.786-0.927) | >.99 | 4-layer CNN | Custom confidence filtering for postprocessing |
| NLP LOGIX I | 0.386 (0.255-0.511) | 0.830 (0.742-0.899) | >.99 | AlexNet | Second-stage random forest classifier to generate slide scores |
| Smart Imaging II | 0.339 (0.239-0.420) | 0.821 (0.753-0.894) | >.99 | GoogLeNet | Ensemble of the output from the team's first entry and the GoogLeNet model |
| University of Toronto I | 0.382 (0.286-0.515) | 0.815 (0.722-0.886) | >.99 | VGG-Net | Combined the outputs of multiple CNNs trained on different magnifications by computing their mean |
| Warwick-Qatar University I | 0.305 (0.219-0.397) | 0.796 (0.711-0.871) | >.99 | U-Net | Stain normalization |
| Radboudumc I | 0.575 (0.446-0.659) | 0.779 (0.694-0.860) | >.99 | VGG-Net | Extensive data augmentation; second-stage CNN to generate slide-level scores |
| Hochschule für Technik und Wirtschaft-Berlin I | 0.187 (0.112-0.250) | 0.768 (0.665-0.853) | >.99 | CRFasRNN | Fine-tuned pretrained network |
| University of Toronto II | 0.352 (0.292-0.511) | 0.762 (0.659-0.846) | >.99 | VGG-Net | Combined the outputs of multiple CNNs trained on different magnifications by using an additional CNN |
| Tampere I | 0.257 (0.171-0.376) | 0.761 (0.662-0.837) | >.99 | Random Forests | Large set of intensity and texture features |
| Smart Imaging I | 0.208 (0.119-0.306) | 0.757 (0.663-0.839) | >.99 | SVM & Adaboost | Cascaded SVM and Adaboost classifiers using texture features |
| CAMP-TUM II | 0.273 (0.194-0.379) | 0.735 (0.633-0.819) | >.99 | GoogLeNet | Hard negative mining |
| Osaka University I | 0.347 (0.234-0.463) | 0.732 (0.629-0.824) | >.99 | GoogLeNet | |
| University of South Florida I | 0.179 (0.116-0.242) | 0.727 (0.611-0.823) | >.99 | Random Forests | Various intensity and texture features |
| NSS I | 0.165 (0.116-0.195) | 0.727 (0.635-0.810) | >.99 | Rule-based | Multiple thresholds on several nucleus-based features |
| Tampere II | 0.252 (0.149-0.350) | 0.713 (0.612-0.801) | >.99 | 7-layer CNN | Self-designed network architecture |
| CAMP-TUM I | 0.184 (0.127-0.243) | 0.691 (0.580-0.787) | >.99 | Agg-Net | Multiscale approach for analyzing the images |
| Minsk Team I | 0.227 (0.181-0.264) | 0.689 (0.568-0.804) | >.99 | GoogLeNet | Separate models for different data sets; hard negative mining |
| VISILAB I | 0.142 (0.080-0.203) | 0.653 (0.551-0.748) | >.99 | Random Forests | Haralick texture features |
| VISILAB II | 0.116 (0.063-0.177) | 0.651 (0.549-0.742) | >.99 | 3-layer CNN | Self-designed network architecture |
| Anonymous I | 0.097 (0.049-0.158) | 0.628 (0.530-0.717) | >.99 | Random Forests | |
| Laboratoire d'Imagerie Biomédicale I | 0.120 (0.079-0.182) | 0.556 (0.434-0.654) | >.99 | SVM | Various color and texture features |
| Pathologist WOTC | 0.724 (0.643-0.804) | 0.966 (0.927-0.998) | | | Expert pathologist who assessed the slides without a time constraint |
| Mean pathologists WTC | | 0.810 (0.750-0.869) | | | Mean performance of 11 pathologists in a simulation exercise designed to mimic the routine workflow of diagnostic pathology, with a flexible 2-hour time limit |

Abbreviations: AUC, area under the receiver operating characteristic curve; CAMELYON16, Cancer Metastases in Lymph Nodes Challenge 2016; CAMP-TUM, Computer Aided Medical Procedures and Augmented Reality-Technical University of Munich; CNN, convolutional neural network; CULab, Chinese University Lab; FROC, free-response receiver operating characteristic; HMS, Harvard Medical School; MGH, Massachusetts General Hospital; MIT, Massachusetts Institute of Technology; WOTC, without time constraint; WTC, with time constraint.

^a For contact information and a detailed description of each algorithm, see eTable 3 and eMethods in the Supplement. For a glossary of deep learning terminology, see the eAppendix in the Supplement.

^b Algorithms are ranked from highest (top of table) to lowest (bottom of table) performance on task 2. The Roman numeral in each team's algorithm name indicates the submission number; teams were allowed a maximum of 3 submissions.

^c The percentile bootstrap method was used to construct 95% CIs for FROC true-positive fraction scores (FROC scores) and AUCs.
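As a rough illustration of the percentile bootstrap, the sketch below resamples slides with replacement, recomputes the AUC on each resample, and takes the 2.5th and 97.5th percentiles of the bootstrap distribution as the CI bounds. The data, function name, and replicate count are hypothetical; the challenge's exact resampling procedure is specified in the eMethods, not here.

```python
# Minimal sketch of a percentile bootstrap 95% CI for an AUC.
# Hypothetical data and helper names; the challenge's actual
# procedure (resampling unit, replicate count) is in its eMethods.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05):
    """Resample cases with replacement, recompute the AUC each time,
    and take the alpha/2 and 1-alpha/2 percentiles of the results."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    n, stats = len(labels), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)      # resample slide indices
        if labels[idx].min() == labels[idx].max():
            continue                          # AUC needs both classes present
        stats.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Toy usage: 20 hypothetical slides (1 = slide contains metastasis).
y = np.repeat([0, 1], 10)
s = rng.random(20)
print(bootstrap_auc_ci(y, s))
```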

^d Results of the significance test, using multireader multicase (MRMC) ROC analysis, comparing each individual algorithm with the pathologists WTC. P values were adjusted for multiple comparisons with the Bonferroni correction, in which each P value is multiplied by the number of comparisons (32; the 32 submitted algorithms each compared with the panel of pathologists).
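To make the Bonferroni adjustment concrete, here is a minimal sketch with made-up raw P values (they are not the paper's MRMC results): each raw value is multiplied by the 32 comparisons and capped at 1, which is why many adjusted values in the table appear as >.99.

```python
# Minimal sketch of the Bonferroni correction described above,
# using hypothetical raw P values.
import numpy as np

raw_p = np.array([1e-5, 0.0035, 0.0006, 0.031])    # made-up examples
n_comparisons = 32                                 # 32 submitted algorithms
adjusted = np.minimum(raw_p * n_comparisons, 1.0)  # multiply, cap at 1
print(adjusted)  # [0.00032, 0.112, 0.0192, 0.992]
```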