Table 2. Test Data Set Results of the 32 Submitted Algorithms vs Pathologists for Tasks 1 and 2 in the CAMELYON16 Challenge^a

| Codename^b | Task 1: Metastasis Identification, FROC Score (95% CI)^c | Task 2: Metastases Classification, AUC (95% CI)^c | P Value for Comparison of the Algorithm vs Pathologists WTC^d | Deep Learning | Architecture | Comments |
|---|---|---|---|---|---|---|
| HMS and MIT II | 0.807 (0.732-0.889) | 0.994 (0.983-0.999) | <.001 | ✓ | GoogLeNet | Ensemble of 2 networks; stain standardization; extensive data augmentation; hard negative mining |
| HMS and MGH III | 0.760 (0.692-0.857) | 0.976 (0.941-0.999) | <.001 | ✓ | ResNet | Fine-tuned pretrained network; fully convolutional network |
| HMS and MGH I | 0.596 (0.578-0.734) | 0.964 (0.928-0.989) | <.001 | ✓ | GoogLeNet | Fine-tuned pretrained network |
| CULab III | 0.703 (0.605-0.799) | 0.940 (0.888-0.980) | <.001 | ✓ | VGG-16 | Fine-tuned pretrained network; fully convolutional network |
| HMS and MIT I | 0.693 (0.600-0.819) | 0.923 (0.855-0.977) | .11 | ✓ | GoogLeNet | Ensemble of 2 networks; hard negative mining |
| ExB I | 0.511 (0.363-0.620) | 0.916 (0.858-0.962) | .02 | ✓ | ResNet | Varied class balance during training |
| CULab I | 0.544 (0.467-0.629) | 0.909 (0.851-0.954) | .04 | ✓ | VGG-Net | Fine-tuned pretrained network |
| HMS and MGH II | 0.729 (0.596-0.788) | 0.908 (0.846-0.961) | .04 | ✓ | ResNet | Fine-tuned pretrained network |
| CULab II | 0.527 (0.335-0.627) | 0.906 (0.841-0.957) | .16 | ✓ | VGG-Net & ResNet | Fine-tuned pretrained network; cascaded a VGG-Net that operated on low-magnification images with a ResNet that refined the results |
| DeepCare I | 0.243 (0.197-0.356) | 0.883 (0.806-0.943) | >.99 | ✓ | GoogLeNet | Fine-tuned pretrained network |
| Quincy Wong I | 0.367 (0.250-0.521) | 0.865 (0.789-0.924) | >.99 | ✓ | SegNet | Fine-tuned pretrained network |
| Middle East Technical University I | 0.389 (0.272-0.512) | 0.864 (0.786-0.927) | >.99 | ✓ | 4-layer CNN | Custom confidence filtering for postprocessing |
| NLP LOGIX I | 0.386 (0.255-0.511) | 0.830 (0.742-0.899) | >.99 | ✓ | AlexNet | Used a second-stage random forest classifier to generate slide scores |
| Smart Imaging II | 0.339 (0.239-0.420) | 0.821 (0.753-0.894) | >.99 | ✓ | GoogLeNet | Used an ensemble of the output from the team's first entry and the GoogLeNet model |
| University of Toronto I | 0.382 (0.286-0.515) | 0.815 (0.722-0.886) | >.99 | ✓ | VGG-Net | Combined the output of multiple CNNs trained on different magnifications by computing their mean |
| Warwick-Qatar University I | 0.305 (0.219-0.397) | 0.796 (0.711-0.871) | >.99 | ✓ | U-Net | Used stain normalization |
| Radboudumc I | 0.575 (0.446-0.659) | 0.779 (0.694-0.860) | >.99 | ✓ | VGG-Net | Extensive data augmentation; second-stage CNN to generate slide-level scores |
| Hochschule für Technik und Wirtschaft-Berlin I | 0.187 (0.112-0.250) | 0.768 (0.665-0.853) | >.99 | ✓ | CRFasRNN | Fine-tuned pretrained network |
| University of Toronto II | 0.352 (0.292-0.511) | 0.762 (0.659-0.846) | >.99 | ✓ | VGG-Net | Combined the output of multiple CNNs trained on different magnifications by using an additional CNN |
| Tampere I | 0.257 (0.171-0.376) | 0.761 (0.662-0.837) | >.99 | | Random Forests | Used a large set of intensity and texture features |
| Smart Imaging I | 0.208 (0.119-0.306) | 0.757 (0.663-0.839) | >.99 | | SVM & AdaBoost | Cascaded SVM and AdaBoost classifiers using texture features |
| CAMP-TUM II | 0.273 (0.194-0.379) | 0.735 (0.633-0.819) | >.99 | ✓ | GoogLeNet | Hard negative mining |
| Osaka University I | 0.347 (0.234-0.463) | 0.732 (0.629-0.824) | >.99 | ✓ | GoogLeNet | |
| University of South Florida I | 0.179 (0.116-0.242) | 0.727 (0.611-0.823) | >.99 | | Random Forests | Used various intensity and texture features |
| NSS I | 0.165 (0.116-0.195) | 0.727 (0.635-0.810) | >.99 | | Rule-based | Multiple thresholds on several nucleus-based features |
| Tampere II | 0.252 (0.149-0.350) | 0.713 (0.612-0.801) | >.99 | ✓ | 7-layer CNN | Self-designed network architecture |
| CAMP-TUM I | 0.184 (0.127-0.243) | 0.691 (0.580-0.787) | >.99 | ✓ | Agg-Net | Multiscale approach for analyzing the images |
| Minsk Team I | 0.227 (0.181-0.264) | 0.689 (0.568-0.804) | >.99 | ✓ | GoogLeNet | Separate models for different data sets; hard negative mining |
| VISILAB I | 0.142 (0.080-0.203) | 0.653 (0.551-0.748) | >.99 | | Random Forests | Used Haralick texture features |
| VISILAB II | 0.116 (0.063-0.177) | 0.651 (0.549-0.742) | >.99 | ✓ | 3-layer CNN | Self-designed network architecture |
| Anonymous I | 0.097 (0.049-0.158) | 0.628 (0.530-0.717) | >.99 | | Random Forests | |
| Laboratoire d'Imagerie Biomédicale I | 0.120 (0.079-0.182) | 0.556 (0.434-0.654) | >.99 | | SVM | Used various color and texture features |
| Pathologist WOTC | 0.724 (0.643-0.804) | 0.966 (0.927-0.998) | | | | Expert pathologist who assessed the slides without a time constraint |
| Mean pathologists WTC | | 0.810 (0.750-0.869) | | | | Mean performance of 11 pathologists in a simulation exercise designed to mimic the routine workflow of diagnostic pathology, with a flexible 2-hour time limit |
Abbreviations: AUC, area under the receiver operating characteristic curve; CAMELYON16, Cancer Metastases in Lymph Nodes Challenge 2016; CAMP-TUM, Computer Aided Medical Procedures and Augmented Reality-Technical University of Munich; CNN, convolutional neural network; CULab, Chinese University Lab; FROC, free-response receiver operating characteristic; HMS, Harvard Medical School; MGH, Massachusetts General Hospital; MIT, Massachusetts Institute of Technology; MRMC, multireader multicase; SVM, support vector machine; WOTC, without time constraint; WTC, with time constraint.
^a For algorithms, contact information, and a detailed description of each algorithm, see eTable 3 and eMethods in the Supplement. For a glossary of deep learning terminology, see eAppendix in the Supplement.
^b Algorithms are ranked from highest (top of table) to lowest (bottom of table) according to their performance on task 2. The submission number is indicated by the Roman numeral in each team's algorithm name. Teams were allowed a maximum of 3 submissions.
^c The percentile bootstrap method was used to construct 95% CIs for the FROC true-positive fraction scores (FROC scores) and the AUCs.
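For readers who want to reproduce this style of interval, the sketch below shows a percentile bootstrap for the 95% CI of an AUC: slides are resampled with replacement, the metric is recomputed on each resample, and the 2.5th and 97.5th percentiles of the resulting distribution form the interval. The array names, resample count, and seed are illustrative assumptions, not values from the study; the same resampling scheme would apply to FROC scores with the lesion-level metric substituted.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def percentile_bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (illustrative, not the study's code).

    Resamples slides with replacement, recomputes the AUC on each resample,
    and returns the alpha/2 and 1 - alpha/2 percentiles of the distribution.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    n = len(labels)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample slide indices with replacement
        if len(np.unique(labels[idx])) < 2:
            continue  # AUC is undefined when a resample contains only one class
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```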
^d Results of the significance test using multireader multicase (MRMC) ROC analysis for the comparison of each individual algorithm with the pathologists WTC. The P values were adjusted for multiple comparisons using the Bonferroni correction, in which each P value is multiplied by the number of comparisons (32; comparison of the 32 submitted algorithms with the panel of pathologists).
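The Bonferroni adjustment in footnote d is simple to state in code: each raw P value is multiplied by the number of comparisons and capped at 1. A minimal sketch, assuming the raw P values are available as a list; the example values are placeholders, not the study's results.

```python
def bonferroni_adjust(p_values, n_comparisons=32):
    """Bonferroni correction: multiply each raw P value by the number of
    comparisons (here, 32 algorithm-vs-pathologists tests) and cap at 1."""
    return [min(1.0, p * n_comparisons) for p in p_values]

# Placeholder raw P values, not taken from the study:
print(bonferroni_adjust([0.0001, 0.005, 0.9]))  # [0.0032, 0.16, 1.0]
```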