Radiology: Artificial Intelligence. 2023 Mar 22;5(3):e220146. doi: 10.1148/ryai.220146

Impact of Different Mammography Systems on Artificial Intelligence Performance in Breast Cancer Screening

Clarisse F. de Vries, Samantha J. Colosimo, Roger T. Staff, Jaroslaw A. Dymiter, Joseph Yearsley, Deirdre Dinneen, Moragh Boyle, David J. Harrison, Lesley A. Anderson,* and Gerald Lip*; on behalf of the iCAIRD Radiology Collaboration
PMCID: PMC10245180  PMID: 37293340

Abstract

Artificial intelligence (AI) tools may assist breast screening mammography programs, but limited evidence supports their generalizability to new settings. This retrospective study used a 3-year dataset (April 1, 2016–March 31, 2019) from a U.K. regional screening program. The performance of a commercially available breast screening AI algorithm was assessed with a prespecified and a site-specific decision threshold to evaluate whether its performance was transferable to a new clinical site. The dataset consisted of women (aged approximately 50–70 years) who attended routine screening, excluding self-referrals, women with complex physical requirements, women who had undergone a previous mastectomy, and screens with technical recalls or without the four standard image views. In total, 55 916 screening attendees (mean age, 60 years ± 6 [SD]) met the inclusion criteria. The prespecified threshold resulted in a high recall rate (48.3%, 21 929 of 45 444), which fell to 13.0% (5896 of 45 444) after threshold calibration, closer to the observed service level (5.0%, 2774 of 55 916). Recall rates also increased approximately threefold following a software upgrade on the mammography equipment, requiring per–software version thresholds. Using software-specific thresholds, the AI algorithm would have recalled 277 of 303 (91.4%) screen-detected cancers and 47 of 138 (34.1%) interval cancers. AI performance and thresholds should be validated for new clinical settings before deployment, and quality assurance systems should monitor AI performance for consistency.

Keywords: Breast, Screening, Mammography, Computer Applications–Detection/Diagnosis, Neoplasms-Primary, Technology Assessment

Supplemental material is available for this article.

© RSNA, 2023



Summary

Artificial intelligence (AI) performance in breast cancer screening was affected by mammography equipment and software used, highlighting the importance of local clinical settings and technology for effective AI implementation.

Key Points

■ A mammography equipment software upgrade resulted in a threefold increase in the recall rate of a commercially available breast cancer screening artificial intelligence (AI) algorithm.

■ Calibration of the AI decision threshold reduced recall rates from 47.7% to 13.0%.

■ Implementation of AI into clinical practice requires local retrospective evaluation and ongoing quality assurance.

Introduction

A recent U.K. National Screening Committee review (1,2) concluded that evidence was insufficient to support the implementation of artificial intelligence (AI) in routine breast cancer screening. The review identified limited evidence on sources of variability, impact on interval cancers (ICs) detected between screening cycles, and performance of a preset threshold to classify recall or no recall. In addition, evidence for the transferability of AI models is inconsistent (3–5).

We evaluated commercial AI software (6) by using data from a U.K. screening program to determine whether its performance transferred to an external dataset generated with different mammography equipment. The AI software is Conformité Européenne marked, indicating compliance with applicable European Union regulations. This study evaluated the generalizability of the AI tool by using consecutively acquired clinical data, comparing its stand-alone performance with the dual reporting system used in the U.K. screening service.

Materials and Methods

Sample

The Proportionate Review Subcommittee of the London-Bloomsbury Research Ethics Committee approved this retrospective study (reference no. 20/LO/0563). Secondary use of de-identified data negated the requirement for individual consent. Public Benefit and Privacy Panel approval was obtained (reference no. 1920–0258).

National Health Service (NHS) Grampian clinical data and mammograms were collected from the Scottish Breast Screening Service (SBSS) (February 12, 2016–March 31, 2020). Full-field digital mammograms were acquired with five mammography units of the same make and model (Selenia Dimensions; Hologic) with no known differences at study commencement. All units conform to NHS breast cancer screening quality standards (7). The standard imaging protocol consisted of two views per breast (craniocaudal and mediolateral oblique). As part of routine screening, two readers interpreted each set of images, with a third reader arbitrating in cases of disagreement. During the study period, mammograms in the screening center were routinely read by a pool of 11 readers with 1 to 20 years of experience each, led by one reader (G.L.).

The evaluation dataset was limited to a 3-year U.K. screening cycle (April 1, 2016–March 31, 2019) of women (aged approximately 50–70 years) attending routine screening. Figure 1 shows exclusions.

Figure 1:

Flow diagram shows the generation and composition of the original, test, and validation datasets. Exclusions are indicated in the white boxes. The vendor-recommended exclusions are indicated in the shaded outer box. Confirmed positive cases are women with histologically confirmed cancer. Confirmed negative cases are women with negative findings for cancer with a negative 3-year follow-up screening and no interval cancer. DICOM = Digital Imaging and Communications in Medicine.

Data Processing

SBSS clinical data were transferred to the Grampian Data Safe Haven (DaSH). Mammograms from the breast screening picture archiving and communication system were transferred to the Safe Haven Artificial Intelligence Platform (SHAIP) developed by Canon Medical Research Europe (8). “Hiding in Plain Sight” (9) de-identification was performed.

Mia (version 2.0.1), developed by Kheiron Medical Technologies, the vendor in this study, assessed mammograms in SHAIP for potential malignancies. Mia was previously trained and tested on images acquired with Hologic, GE Healthcare, Siemens, and IMS Giotto mammography equipment. Mia, an ensemble of deep learning algorithms, employs the four standard image views (full-field digital mammography craniocaudal and mediolateral oblique views for each breast) to generate a continuous output ranging from 0 to 1 (malignancy prediction value). The malignancy prediction values were linked to the clinical data in DaSH. Mia's performance was evaluated by using a predefined threshold (≥0.1117 indicates recall) (6) and site-specific threshold.
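For illustration only, the R sketch below shows how a continuous malignancy prediction value can be converted into a binary recall opinion at the prespecified threshold; the function and variable names are ours and are not part of the Mia software.

```r
# Minimal sketch (not vendor code): convert continuous malignancy prediction
# values into binary recall opinions at the prespecified threshold.
prespecified_threshold <- 0.1117

recall_opinion <- function(scores, threshold = prespecified_threshold) {
  scores >= threshold  # TRUE = recall opinion, FALSE = no-recall opinion
}

# Hypothetical example scores
recall_opinion(c(0.03, 0.12, 0.86))  # FALSE TRUE TRUE
```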

Mia's performance was evaluated by academic health data scientists (C.F.d.V., J.A.D.) in DaSH (10), which the vendor could not access (meaning authors affiliated with the vendor had no control of the data). The vendor ran Mia within SHAIP with no access to the clinical outcomes to provide the Mia malignancy prediction values. The vendor also provided the Mia decision thresholds.

Threshold Calibration

Mia was not previously evaluated on images from Hologic Selenia Dimensions mammography equipment. The initial evaluation identified variability in algorithm performance. The vendor was provided with a validation dataset (16 204 screens) to generate a site-specific decision threshold. This subset included all screening data from 200 confirmed positive cases (women with histologically confirmed cancer), 4000 confirmed negative cases (women with negative findings for cancer with a negative 3-year follow-up screening and no IC), and 8000 unconfirmed negative cases (Appendix S1).
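The vendor's calibration procedure is not described here. As a hedged sketch only, one simple approach would be to select the lowest candidate threshold whose recall rate on the validation scores does not exceed a chosen target; the function below illustrates that idea and is not the vendor's method.

```r
# Hypothetical calibration sketch (not the vendor's procedure): pick the
# lowest candidate threshold whose recall rate on the validation scores
# does not exceed the target recall rate.
calibrate_threshold <- function(val_scores, target_recall) {
  candidates  <- sort(unique(val_scores))
  recall_rate <- sapply(candidates, function(t) mean(val_scores >= t))
  candidates[which(recall_rate <= target_recall)[1]]
}

# Hypothetical usage with simulated scores and a 13% target recall rate
set.seed(1)
calibrate_threshold(runif(16204), target_recall = 0.13)
```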

Statistical Analysis

A receiver operating characteristic (ROC) curve was plotted, and the area under the ROC curve (AUC) and CI (DeLong method) (11) were calculated. Positive screens were defined as histologically confirmed cancers detected through standard screening.
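For illustration, this ROC analysis can be reproduced with the pROC package used in the study; the data frame and column names below (screens, cancer, malignancy_score) are assumptions for the sketch, not names from the study code.

```r
# Illustrative only (assumed data frame and column names): ROC curve, AUC,
# and DeLong 95% CI using the pROC package cited in the text.
library(pROC)

roc_obj <- roc(response  = screens$cancer,           # 1 = screen-detected cancer, 0 = not
               predictor = screens$malignancy_score, # Mia malignancy prediction value
               levels = c(0, 1), direction = "<")

auc(roc_obj)                        # area under the ROC curve
ci.auc(roc_obj, method = "delong")  # 95% CI by the DeLong method
plot(roc_obj)                       # ROC curve
```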

Sensitivity, specificity, and positive and negative predictive values, as well as cancer detection and recall rates of Mia, with CIs (Clopper-Pearson method) (12), were calculated for the prespecified and site-specific thresholds. Cancer detection rate was quantified as the number of screen-detected cancers with a (Mia) recall opinion divided by the total number of screens. The prespecified threshold was evaluated on the entire dataset after exclusions (original dataset) and on the subset not used to calibrate the threshold (test dataset). The site-specific threshold was evaluated using the test dataset. Furthermore, Mia's performance was compared with the performance of the first reader (reader 1). Mia was not compared with the second reader, as in the United Kingdom, the second reader can access the first reader's opinion and therefore does not read independently.
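As a minimal sketch of these performance measures, the code below computes each proportion with an exact (Clopper-Pearson) 95% CI via binom.test(); the 2 × 2 counts are approximate values consistent with the test-dataset figures reported in the Results and are for illustration only.

```r
# Sketch only: performance measures with exact (Clopper-Pearson) 95% CIs.
prop_ci <- function(x, n) {
  ci <- binom.test(x, n)$conf.int  # exact Clopper-Pearson interval
  c(estimate = x / n, lower = ci[1], upper = ci[2])
}

# Approximate test-dataset counts (illustration): Mia recall opinion versus
# screen-detected cancer status.
tp <- 277; fn <- 26; fp <- 5619; tn <- 39522
n  <- tp + fn + fp + tn

prop_ci(tp, tp + fn)   # sensitivity
prop_ci(tn, tn + fp)   # specificity
prop_ci(tp, tp + fp)   # positive predictive value
prop_ci(tn, tn + fn)   # negative predictive value
prop_ci(tp + fp, n)    # recall rate
prop_ci(tp, n)         # cancer detection rate (per screen)
```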

As an exploratory subanalysis, the site-specific threshold performance on the test dataset was stratified by mammography unit. Differences across units were assessed using Pearson χ2 (specificity, recall, and cancer detection rate) and Fisher exact (sensitivity) tests. Additionally, sensitivity was compared between small (<15 mm) and large (≥15 mm) tumors using a χ2 test.
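For illustration, the lesion-size comparison can be expressed as a 2 × 2 table and tested with chisq.test() or fisher.test(); the counts below are taken from the figures reported in the Results, and the exact test options used in the study are not otherwise specified.

```r
# Illustrative 2 x 2 comparison of Mia sensitivity for small (<15 mm) versus
# large (>=15 mm) tumors, using counts reported in the Results.
tab <- rbind(detected = c(small = 162, large = 104),
             missed   = c(small = 178 - 162, large = 111 - 104))

chisq.test(tab, correct = FALSE)  # Pearson chi-square test (no continuity correction)
fisher.test(tab)                  # Fisher exact test, used where expected counts are small
```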

ICs (cancers not detected during routine screening but identified between screening rounds) were analyzed separately. Following individual review, all readers in the clinical team regularly met to form a consensus on cancer visibility on prior screening mammograms, using the following categories (13): 1 = no visible lesion, 2 = lesion visible on review in hindsight, 3 = lesion clearly visible, and occult = lesion not visible at screening or subsequent symptomatic imaging. The proportion of IC patients Mia indicated to recall (with the updated threshold) was determined and stratified by consensus opinion.
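A sketch of the stratified summary is shown below; the data frame and column names are assumptions for illustration.

```r
# Sketch with assumed names: `interval_cancers` is a hypothetical data frame
# with a logical `recall` column (Mia opinion at the site-specific threshold)
# and a `consensus` factor (1, 2, 3, occult, or uncategorized).
aggregate(recall ~ consensus, data = interval_cancers,
          FUN = function(x) c(recalled = sum(x), proportion = round(mean(x), 3)))
```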

Statistical analyses were performed in R (version 4.0.3) (Appendix S2). ROC curves, AUCs, and CIs were generated using the pROC package (14). Sample size information is available in Appendix S3. P value less than .05 was considered to indicate a statistically significant difference.

Data Availability

The statistical output alongside the relevant R code is available in Appendix S2. Access to the raw SBSS data and mammograms (with de-identified participant data) is subject to the required approvals (eg, Public Benefit and Privacy Panel, NHS Research & Development, Research Ethics Committee approval) and data agreements being in place. More information can be found on the DaSH website: https://www.abdn.ac.uk/iahs/facilities/grampian-data-safe-haven.php.

Results

Cohort Characteristics

After the application of vendor-recommended exclusions (3.9% [2293 of 58 209]) (15), an evaluation dataset of 55 916 screens was used (Fig 1). Of these, 2774 (5.0%) were recalled.

The mean age was 60 years (SD, 6.0 years); 450 patients had histologically confirmed screen-detected breast cancer, and 156 ICs were detected at follow-up (Table 1).

Table 1:

U.K. Breast Screening Program Cohort Characteristics


AI Performance Prethreshold Calibration

Figure 2A shows the Mia ROC curve. The AUC was 0.95 (95% CI: 0.94, 0.96). The Mia precision-recall curve can be found in Appendix S4.

Figure 2:

The artificial intelligence required threshold calibration, with software-specific thresholds, for optimal performance. (A) Mia receiver operating characteristic curve on the original dataset with prespecified threshold. The original dataset was not used to establish the prespecified threshold. (B) Rise in recall rate after an event for the four mammography units. The vertical dashed line indicates the date of a software upgrade. A fifth unit, a mobile unit, was not upgraded during the study timeline and is not included in this figure. AUC = area under the receiver operating characteristic curve.

For the prespecified threshold (original dataset: 55 916 screens and 450 cancers), sensitivity and specificity were 97.3% and 52.7%, respectively (Table 2). The recall rate was 47.7% and the cancer detection rate was 7.8 per 1000. For the test dataset (45 444 screens and 303 cancers, excluding screens used for threshold calibration), sensitivity and specificity were 98.3% and 52.1%, respectively; recall rate was 48.3%, and cancer detection rate was 6.6 per 1000.

Table 2:

Mia Performance on Screen-detected Cancers


Threshold Calibration

An initial site-specific threshold of 0.2938 was generated. This threshold revealed a step change in recall rate at set points for each mammography unit (Fig 2B). Review of image headers revealed that the increase in recalls correlated with a mammography unit software update. The AI algorithm was not updated during the study. All units had the same software before the update (version 1.7). The software running on units 1 to 4 was upgraded to version 1.8 at different time points. The monthly recall rate for software version 1.7 ranged from 8.3% (63 of 760) to 13.2% (183 of 1382); for version 1.8, it ranged from 23.8% (79 of 332) to 38.6% (86 of 223). In comparison, the reader 1 monthly recall rate ranged from 3.8% (37 of 966) to 6.9% (84 of 1218) before the software update and from 2.5% (seven of 282) to 7.9% (13 of 164) after the software update. Reader 1 sensitivity and specificity changed from 85.4% (328 of 384) to 87.9% (58 of 66) and from 95.1% (43 075 of 45 276) to 95.6% (9746 of 10 190), respectively.

Per–software version thresholds were generated to ensure stability of recall rates (Appendix S1). Due to a small number of positive studies in the post–software update subset, the vendor was provided with 35 additional positive studies (from mammography unit 4, after software upgrade) to reduce the threshold's susceptibility to noise.

Two site-specific thresholds were generated across all mammography units: 0.2712 before upgrade and 0.4319 after upgrade.
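A minimal sketch of applying the per–software version thresholds is shown below; the data frame and column names are assumptions, with the software version assumed to be read from the DICOM headers.

```r
# Sketch only (assumed column names): apply the per-software version
# thresholds reported above to each screen.
version_thresholds <- c("1.7" = 0.2712, "1.8" = 0.4319)

screens$recall <- screens$malignancy_score >=
  version_thresholds[as.character(screens$software_version)]
```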

Applying the new thresholds to the test dataset resulted in a sensitivity of 91.4%, specificity of 87.6%, recall rate of 13.0%, and cancer detection rate of 6.1 per 1000 (Table 2). By comparison, reader 1 sensitivity, specificity, recall rate, and cancer detection rate were 86.1%, 95.2%, 5.4%, and 5.7 per 1000, respectively. Reader 1 detected 261 of 303 (86.1%) screening-diagnosed cancers, while Mia would have detected 277 of 303 (91.4%) cancers.

AI Performance Split by Mammography Unit and Lesion Size

Mia performance with the site-specific thresholds was significantly different across mammography units for specificity (P < .001) and recall rate (P < .001), but not for sensitivity (P = .51) or cancer detection rate (P = .93) (Table 2). We found no evidence of a difference in the sensitivity of Mia between small and large tumors (91.0% [162 of 178] and 93.7% [104 of 111], respectively; P = .55).

IC Recall

The test dataset contained 138 ICs. Using the site-specific thresholds, Mia would have recalled 47 (34.1%) ICs. Mia indicated to recall 15 of 56 category 1 ICs (no visible lesion); four of 14 category 2 ICs (lesion visible on review in hindsight); three of three category 3 ICs (lesion clearly visible on previous screening mammograms); and two of nine occult ICs. Mia would have recalled a further 24 of 57 ICs not yet categorized by consensus opinion (due to COVID-19–related delays in IC review).

Discussion

AI performance could be affected by different mammography systems, impacting deployment in new settings. In this study, local calibration and per–software version thresholds were required to reduce recall rates from 47.7% to 13.0%. After threshold optimization, Mia had a higher recall rate than reader 1 (13.0% vs 5.4%) but would have detected more cancers (277 vs 261), including those missed by routine dual reporting (47 of 138). The U.K. acceptable recall rate is less than 9% in a double reading setting with arbitration (16). The Mia false-positive rate was higher than that in routine clinical practice, suggesting that Mia would be best used combined with human reader input, as recommended by the vendor. Economic and operational evaluations are required across possible implementation scenarios.

Our results are supported by previous research observing issues relating to the generalizability of radiology AI models (3,5,17). Furthermore, we have established that AI performance can be influenced by different mammography systems. The AI had previously been calibrated on a range of mammography units, including the Hologic Lorad Selenia, an older model of the unit employed in this study (Hologic Selenia Dimensions). The software update applied to the mammography units included several enhancements that may affect image characteristics. Human reader performance was not adversely affected following the update. Independent verification of vendor-reported transferability of thresholds using the same mammography unit and software version elsewhere is needed.

A user-definable threshold could allow centers to perform threshold recalibration themselves. However, many centers may struggle to gather enough data and/or lack the technological expertise to adjust the thresholds successfully. A national implementation and validation framework for AI in breast cancer screening, alongside representative national datasets, could help set AI decision thresholds and quality assurance standards.

Study strengths included using a retrospective unenriched dataset consecutively acquired in a dual reporting screening setting, with sufficient follow-up to capture screen-detected cancers and ICs. The AI was not trained on the dataset. Exclusions were minimal (3.9%).

Study limitations included the following: the evaluation of one AI product, a single-center setting, a predominantly White patient sample, and the unavailability of consensus review information for some ICs because of COVID-19–related delays. Also, post hoc analyses of performance stratified by mammography unit and lesion size were not adequately powered and require further evaluation in larger studies.

As different mammography systems can substantially affect AI performance, AI performance and decision thresholds should be validated when applied in new clinical settings. Quality assurance systems, including change management, should monitor AI algorithms for consistent performance.

Acknowledgment

We would like to thank the DaSH team, including Joanne Lumsden, PhD, for their technical support.

Members of the iCAIRD Radiology Collaboration team are listed at the end of this article.

*L.A.A. and G.L. are co–senior authors.

Supported by the Industrial Centre for Artificial Intelligence Research in Digital Diagnostics (iCAIRD), which is funded by Innovate UK on behalf of UK Research and Innovation (UKRI) (project no. 104690).

iCAIRD Radiology Collaborators: Corri Black, Alison D. Murray, and Katie Wilde, University of Aberdeen; James D. Blackwood, NHS Greater Glasgow and Clyde; Claire Butterly and John Zurowski, University of Glasgow; Jon Eilbeck and Colin McSkimming, NHS Grampian; Canon Medical Research Europe–SHAIP platform.

Disclosures of conflicts of interest: C.F.d.V. No relevant relationships. S.J.C. No relevant relationships. R.T.S. No relevant relationships. J.A.D. No relevant relationships. J.Y. Employed by Kheiron Medical Technologies; support for attending meetings/travel from Kheiron Medical Technologies; patents planned, issued, or pending with Kheiron Medical Technologies; stock or stock options in Kheiron Medical Technologies. D.D. Full-time employee of Kheiron Medical Technologies, supplier of the medical device evaluated in this project; grant from Innovate UK via iCAIRD, the industrial center for AI research in digital diagnostics, all parties received grant monies for the work done; associate member of the Faculty for Clinical Informatics and a health executive in residence for the UCL Global Business School for Health (unpaid, volunteer role); stock or stock options in Kheiron Medical Technologies (employee share options benefit scheme). M.B. iCAIRD funded by Innovate UK, under the UK Research and Innovation (UKRI) Industrial Strategy Challenge Fund “From Data to Early Diagnosis in Precision Medicine” challenge. D.J.H. Receipt of research award (chief investigator) from Innovate UK/UKRI, this funding underpinned the research infrastructure and some staff time. L.A.A. Funding from Innovate UK. G.L. No relevant relationships.

Abbreviations:

AI = artificial intelligence
AUC = area under the ROC curve
DaSH = Grampian Data Safe Haven
IC = interval cancer
NHS = National Health Service
ROC = receiver operating characteristic
SBSS = Scottish Breast Screening Service
SHAIP = Safe Haven Artificial Intelligence Platform

Contributor Information

Clarisse F. de Vries, Email: clarisse.devries@abdn.ac.uk.

Corri Black, University of Aberdeen.

Alison D. Murray, University of Aberdeen.

Katie Wilde, University of Aberdeen.

James D. Blackwood, NHS Greater Glasgow and Clyde.

Claire Butterly, University of Glasgow.

John Zurowski, University of Glasgow.

Jon Eilbeck, NHS Grampian; Canon Medical Research Europe–SHAIP platform.

Colin McSkimming, NHS Grampian; Canon Medical Research Europe–SHAIP platform.

Collaborators: Corri Black, Alison D. Murray, Katie Wilde, James D. Blackwood, Claire Butterly, John Zurowski, Jon Eilbeck, and Colin McSkimming

References


