Take-Away Points
■ Major Focus: To evaluate the performance of three commercially available artificial intelligence algorithms in screening mammography assessment in an external population both independently and in combination with radiologists.
■ Key Result: One of the three artificial intelligence algorithms had significantly higher accuracy for breast cancer detection (area under the receiver operating characteristic curve of 0.96 vs 0.92 vs 0.92; P < .001). Combining this best algorithm with first-reader radiologists achieved the highest accuracy, at 88.6% sensitivity and 93.0% specificity.
■ Impact: A commercially available artificial intelligence algorithm can assess screening mammograms with diagnostic accuracy comparable to that of radiologists. Whether such an algorithm can act as an independent reader prospectively in a screening population remains unanswered.
Screening mammography is currently the clinical standard for early detection of breast cancer. However, the performance of individual radiologists in screening mammography programs varies widely. Efforts to develop artificial intelligence (AI) algorithms for screening mammography have yielded mixed outcomes owing to variability in the quality of training data and in study methodology. External validation studies comparing the performance of different AI algorithms in a screening population are lacking.
This external validation study compared the diagnostic performance of three commercially available AI algorithms, both independently and in combination with radiologists, in a retrospective population-based screening cohort. Women with implants or prior breast cancer were excluded. The study population included 739 women with pathology-verified breast cancer diagnosed at screening or within 12 months and 8066 randomly sampled healthy controls with negative findings confirmed by 2-year cancer-free follow-up. Each AI algorithm rated suspicion of cancer on a continuous scale from 0 to 1.0, with the decision threshold between normal and abnormal set to match the mean specificity of the first-reader radiologists (96.6%).
Algorithm 1 performed significantly better than the other two algorithms, with an area under the receiver operating characteristic curve of 0.96 versus 0.92 versus 0.92 (P < .001). Its sensitivity did not differ significantly from that of radiologist consensus at the same specificity (81.9% vs 85.0%, P = .11), despite the algorithm having no inputs from prior imaging or clinical history. At an operating point corresponding to the U.S. Breast Cancer Surveillance Consortium benchmark of 88.9% specificity, algorithm 1's sensitivity of 88.6% was comparable to the benchmark sensitivity of 86.9%.
When algorithm 1 was combined with a radiologist, with an examination deemed abnormal if flagged by either the algorithm or the radiologist, the cancer detection rate increased 8%, from 5.32 per 1000 for the algorithm alone to 5.76 per 1000 in combination with the first reader, with a corresponding 77% increase in recall rate, from 39.1 to 69.1 per 1000.
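The percentage changes in this combined-reader analysis follow directly from the reported rates. A minimal arithmetic sketch (using the per-1000 rates stated above; the "either flags it" rule means both detection and recall rates can only rise relative to a single reader):

```python
# Sanity check of the combined-reader arithmetic reported in the study.
# Under an OR combination rule, an examination is recalled if either the
# algorithm or the radiologist flags it, so detection and recall rates
# both increase. Rates per 1000 examinations are taken from the study;
# the percentage increases are recomputed here.

def relative_increase(before: float, after: float) -> float:
    """Percentage increase from `before` to `after`."""
    return 100 * (after - before) / before

# Cancer detection rate per 1000 examinations
cdr_algorithm_alone = 5.32
cdr_combined = 5.76

# Recall rate per 1000 examinations
recall_algorithm_alone = 39.1
recall_combined = 69.1

print(f"Detection rate increase: "
      f"{relative_increase(cdr_algorithm_alone, cdr_combined):.0f}%")    # ≈8%
print(f"Recall rate increase: "
      f"{relative_increase(recall_algorithm_alone, recall_combined):.0f}%")  # ≈77%
```

This illustrates the trade-off at the heart of the result: the OR rule buys a modest gain in detection at the cost of a much larger rise in recalls.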
The most accurate algorithm was trained on the largest training set and used pixel-level annotations and a higher-capacity network architecture, despite being trained on a patient population and vendor system different from those of the external validation data set. In conclusion, only one of the three algorithms performed independently at a level comparable to radiologists, setting the groundwork for prospective trial evaluation of AI systems in screening mammography.
Highlighted Article
Salim M, Wåhlin E, Dembrower K, et al. External evaluation of 3 commercial artificial intelligence algorithms for independent assessment of screening mammograms. JAMA Oncol 2020;6(10):1581–1588. doi: 10.1001/jamaoncol.2020.3321
