Author manuscript; available in PMC: 2022 Aug 30.

Published in final edited form as: Nat Med. 2021 Jan 11;27(2):244–249. doi: 10.1038/s41591-020-01174-9

Table 1:

Summary of additional model evaluation.

Additional DM & DBT Evaluation
Dataset	Location	Manufacturer	Model	Input Type	AUC
OMI-DB	UK	Hologic	2D	DM	0.963 ± 0.003
Site A – DM	Oregon	GE	2D	DM	0.927 ± 0.008
Site E	China	Hologic	2D	DM	0.971 ± 0.005
Site E (resampled)	China	Hologic	2D	DM	0.956 ± 0.020
Site A – DBT	Oregon	Hologic	2D*	DBT manufacturer synthetics	0.922 ± 0.016
Site A – DBT	Oregon	Hologic	3D	DBT slices	0.947 ± 0.012
Site A – DBT	Oregon	Hologic	2D + 3D	DBT manufacturer synthetics + slices	0.957 ± 0.010

All results use the “index” exam for cancer cases and “confirmed” negatives for non-cancer cases, except for Site E, where the negatives are unconfirmed. “Pre-index” results, where available, and additional analyses are included in Extended Data Figure 8. Rows 1–2: Performance of the 2D deep learning model on held-out test sets of the OMI-DB (1,205 cancers, 1,538 negatives) and Site A (254 cancers, 7,697 negatives) datasets. Rows 3–4: Performance on a dataset collected at a Chinese hospital (Site E; 533 cancers, 1,000 negatives). The dataset consists entirely of diagnostic exams, given the low prevalence of screening mammography in China. Nevertheless, even after adjusting for tumor size by bootstrap resampling to approximate the distribution of tumor sizes expected in an American screening population (see Methods), the model still achieves high performance (Row 4). Rows 5–7: Performance on DBT data (Site A – DBT; 78 cancers, 518 negatives). Row 5 contains the results of the 2D model fine-tuned on the manufacturer-generated synthetic 2D images, which are created to augment or substitute for DM images in a DBT study (* indicates this fine-tuned model). Row 6 contains the results of the weakly-supervised 3D model, illustrating strong performance when evaluated on the MSP images computed from the DBT slices. We note that when the DBT volume is scored as the maximum bounding box score over all of the slices, the strongly-supervised 2D model used to create the MSP images exhibits an AUC of 0.865 ± 0.020. Thus, fine-tuning this model on the MSP images significantly improves its performance. Row 7: Results when combining predictions across the final 3D model and 2D models. The standard deviation for each AUC value was calculated via bootstrapping.
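As an illustration of how a bootstrapped standard deviation of the kind reported in the AUC column can be obtained, the following is a minimal sketch and not the authors' implementation. It assumes that whole exams are resampled with replacement, that replicates containing only one class are redrawn, and that 1,000 replicates are used; none of these details are stated in the table, and the function names `auc` and `bootstrap_auc_std` are hypothetical.

```python
import random
import statistics


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U); tied scores receive average ranks."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Ranks i+1 .. j share the average rank (i + 1 + j) / 2.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(lbl for _, lbl in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def bootstrap_auc_std(labels, scores, n_boot=1000, seed=0):
    """Standard deviation of the AUC over bootstrap resamples of the exams.

    Assumption: exams are resampled with replacement, and a resample that
    contains only one class is discarded and redrawn.
    """
    auc(labels, scores)  # fails early if the full data set has only one class
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        boot_labels = [labels[k] for k in idx]
        if 0 < sum(boot_labels) < n:
            estimates.append(auc(boot_labels, [scores[k] for k in idx]))
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Toy data: 1 = cancer (index exam), 0 = negative.
    y = [0, 0, 1, 1, 0, 1, 0, 1]
    s = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.30, 0.70]
    print(f"AUC = {auc(y, s):.3f} ± {bootstrap_auc_std(y, s):.3f}")
```

The same resampling loop could, under further assumptions, be weighted to match a target tumor-size distribution, as described for Row 4; that variant is not shown here because the weighting scheme is specified only in the Methods.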