BMJ. 2021 Sep 2;374:n1872. doi: 10.1136/bmj.n1872

Table 4.

Summary of test accuracy outcomes

Study | Index test (manufacturer)/comparator | TP | FP | FN | TN | % Sensitivity (95% CI) | Δ % Sensitivity, P value or (95% CI) | % Specificity (95% CI) | Δ % Specificity, P value or (95% CI)
Standalone AI (5 studies):
 Lotter 2021 [28] (index cancer):
AI (in-house) at reader’s specificity 126 51 5 103 96.2 (91.7 to 99.2) +14.2, P<0.001 66.9 Set to be equal
AI (in-house) at reader’s sensitivity 107 14 24 140 82.0 Set to be equal 90.9 (84.9 to 96.1) +24.0, P<0.001
Comparator: average single reader† NA NA NA NA 82.0 66.9
 McKinney 2020 [29]* AI (in-house) NR NR NR NR 56.24 +8.1, P<0.001 84.29 +3.46, P=0.02
Comparator: original single reader NR NR NR NR 48.1 80.83
 Rodriguez-Ruiz 2019 [33] AI (Transpara version 1.4.0) 63 25 16 95 80 (70 to 90) +3 (-6.2 to 12.6) 79 (73 to 86) Set to be equal
Comparator: average single reader§ NA NA NA NA 77 (70 to 83) 79 (73 to 86)
 Salim 2020 [35] AI-1 (anonymised) 605 NR NR NR 81.9 (78.9 to 84.6) See below 96.6 (96.5 to 96.7) Set to be equal
AI-2 (anonymised) 495 NR NR NR 67.0 (63.5 to 70.4) −14.9 v AI-1 (P<0.001) 96.6 (96.5 to 96.7) Set to be equal
AI-3 (anonymised) 498 NR NR NR 67.4 (63.9 to 70.8) −14.5 v AI-1 (P<0.001) 96.7 (96.6 to 96.8) Set to be equal
Comparator: original reader 1 572 NR NR NR 77.4 (74.2 to 80.4) −4.5 v AI-1 (P=0.03) 96.6 (96.5 to 96.7)
Comparator: original reader 2 592 NR NR NR 80.1 (77.0 to 82.9) −1.8 v AI-1 (P=0.40) 97.2 (97.1 to 97.3) +0.6 v AI-1 (NR)
Comparator: original consensus reading 628 NR NR NR 85.0 (82.2 to 87.5) +3.1 v AI-1 (P=0.11) 98.5 (98.4 to 98.6) +1.9 v AI-1 (NR)
 Schaffter 2020 [36] Top-performing AI (in-house) NR NR NR NR 77.1 Set to be equal 88 −8.7 v reader 1 (NR)
Ensemble method (CEM; in-house) NR NR NR NR 77.1 Set to be equal 92.5 −4.2 v reader 1 (NR)
Comparator: original reader 1 NR NR NR NR 77.1 96.7 (96.6 to 96.8)
 Schaffter 2020 [36] Top-performing AI (in-house) NR NR NR NR 83.9 Set to be equal 81.2 −17.3 v consensus (NR)
Comparator: original consensus reading NR NR NR NR 83.9 98.5
AI for triage pre-screen (4 studies):
 Balta 2020 [25] AI as pre-screen (Transpara version 1.6.0):
 AI score ≤2: ~15% low risk 114 15 028 0 2754 100.0 NA 15.49 NA
 AI score ≤5: ~45% low risk 109 9791 5 7991 95.61 NA 44.94 NA
 AI score ≤7: ~65% low risk 105 6135 9 11 647 92.11 NA 65.50 NA
 Lång 2020 [27] AI as pre-screen (Transpara version 1.4.0):
 AI score ≤2: ~19% low risk 68 7684 0 1829 100.0 NA 19.23 NA
 AI score ≤5: ~53% low risk 61 4438 7 5075 89.71 NA 53.35 NA
 AI score ≤7: ~73% low risk 57 2541 11 6972 83.82 NA 73.29 NA
 Raya-Povedano 2021 [31] AI as pre-screen (Transpara version 1.6.0); AI score ≤7: ~72% low risk 100 4450 13 11 424 88.5 (81.1 to 93.7) NA 72.0 (71.3 to 72.7) NA
 Dembrower 2020 [26]§ AI as pre-screen (Lunit version 5.5.0.16):
 AI score ≤0.0293: 60% low risk¶ 347 29 787 0 45 200 100.0 NA 60.28 NA
 AI score ≤0.0870: 80% low risk¶ 338 14 729 9 60 258 97.41 NA 80.36 NA
AI for triage post-screen (1 study):
 Dembrower 2020 [26]§ AI as post-screen (Lunit version 5.5.0.16); prediction of interval cancers:
 AI score ≥0.5337: ~2% high risk 32 1413 168 73 921 16 NA 98.12 NA
 Dembrower 2020 [26]§ AI as post-screen (Lunit version 5.5.0.16); prediction of interval and next round screen detected cancers:
 AI score ≥0.5337: ~2% high risk 103 1342 444 73 645 19 NA 98.21 NA
AI as reader aid (3 studies):
 Pacilè 2020 [30] AI support§ (MammoScreen version 1) NA NA NA NA 69.1 (60.0 to 78.2) +3.3, P=0.02 73.5 (65.6 to 81.5) +1.0, P=0.63
Comparator: average single reader** NA NA NA NA 65.8 (57.4 to 74.3) 72.5 (65.6 to 79.4)
 Rodriguez-Ruiz 2019 [32] AI support (Transpara version 1.3.0) 86 29 14 111 86 (84 to 88) +3, P=0.05 79 (77 to 81) +2, P=0.06
Comparator: average single reader 83 32 17 108 83 (81 to 85) 77 (75 to 79)
 Watanabe 2019 [37] AI support** (cmAssist) NA NA NA NA 62 (range 41 to 75) +11, P=0.03 77.2 −0.9 (NR)
Comparator: average single reader** NA NA NA NA 51 (range 25 to 71) 78.1

AI=artificial intelligence; CEM=challenge ensemble method of eight top performing AIs from DREAM challenge; CI=confidence interval; DREAM=Dialogue on Reverse Engineering Assessment and Methods; FN=false negatives; FP=false positives; NA=not applicable; NR=not reported; TN=true negatives; TP=true positives.
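The percentages in the table follow directly from the 2×2 counts, with sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP). A minimal sketch in Python (standard definitions, not code from any included study) reproducing two of the rows above:

```python
def sensitivity(tp, fn):
    """Proportion of cancers correctly flagged: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of non-cancer examinations correctly cleared: TN / (TN + FP)."""
    return tn / (tn + fp)

# Lotter 2021, AI at reader's specificity: TP=126, FP=51, FN=5, TN=103
print(round(100 * sensitivity(126, 5), 1))   # 96.2
print(round(100 * specificity(103, 51), 1))  # 66.9

# Balta 2020, AI score <=7 (~65% of examinations triaged as low risk):
# TP=105, FP=6135, FN=9, TN=11647
print(round(100 * sensitivity(105, 9), 2))       # 92.11
print(round(100 * specificity(11647, 6135), 2))  # 65.5 (shown as 65.50 in the table)
```

The same two formulas apply to the triage rows, where the threshold on the AI score determines which examinations count as test positive.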

* Inverse probability weighting: negative cases were upweighted to account for the spectrum enrichment of the study population. Patients associated with negative biopsies were downweighted by 0.64. Patients who were not biopsied were upweighted by 23.61.
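A minimal sketch of how inverse probability weights of this kind enter a specificity estimate. Only the weights 0.64 and 23.61 come from the footnote; the split of non-cancer examinations into "biopsy negative" and "not biopsied" groups, and the counts shown, are purely illustrative.

```python
# Weighted specificity sketch: each non-cancer examination contributes its
# inverse probability weight instead of a count of 1. Only the weights
# 0.64 (negative biopsy) and 23.61 (not biopsied) come from the footnote;
# the per-group counts below are hypothetical.
groups = [
    # (weight, true_negatives, false_positives)
    (0.64,  40, 20),   # biopsied, biopsy negative
    (23.61, 55,  8),   # not biopsied
]

weighted_tn = sum(w * tn for w, tn, _ in groups)
weighted_fp = sum(w * fp for w, _, fp in groups)
print(round(100 * weighted_tn / (weighted_tn + weighted_fp), 2))
```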

† Applied an inverse probability weighted bootstrapping (1000 samples) with a 14:1 ratio of healthy women to women receiving a diagnosis of cancer to simulate a study population with a cancer prevalence matching a screening cohort.
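A minimal sketch of a resampling scheme of this shape: healthy and cancer cases are drawn separately at a fixed 14:1 ratio (a stratified bootstrap, used here as a stand-in for the study's weighted procedure) and accuracy is recomputed in each replicate. The case-level reads and sample sizes are placeholders; only the 1000 replicates and the 14:1 ratio come from the footnote.

```python
import random

random.seed(0)

# Hypothetical case-level reads (True = test positive). The study's actual
# enriched reader-study data are not reproduced here.
cancer_reads  = [random.random() < 0.85 for _ in range(100)]
healthy_reads = [random.random() < 0.20 for _ in range(300)]

n_cancer = 50                 # cancers drawn per replicate (illustrative)
n_healthy = 14 * n_cancer     # 14:1 healthy-to-cancer ratio from the footnote

sens_reps, spec_reps = [], []
for _ in range(1000):         # 1000 bootstrap samples, as in the footnote
    c = random.choices(cancer_reads, k=n_cancer)
    h = random.choices(healthy_reads, k=n_healthy)
    sens_reps.append(sum(c) / n_cancer)
    spec_reps.append(sum(not x for x in h) / n_healthy)

sens_reps.sort(); spec_reps.sort()
# approximate 95% percentile intervals from the 1000 replicates
print(sens_reps[25], sens_reps[974])
print(spec_reps[25], spec_reps[974])
```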

‡ In addition, the challenge ensemble method prediction was combined with the original radiologist assessment. At the first reader’s sensitivity of 77.1%, CEM+reader 1 resulted in a specificity of 98.5% (95% confidence interval 98.4% to 98.6%), higher than the specificity of the first reader alone of 96.7% (95% confidence interval 96.6% to 96.8%; P<0.001). At the consensus readers’ sensitivity of 83.9%, CEM+consensus did not significantly improve on the consensus interpretations alone (98.1% v 98.5% specificity, respectively). These simulated results of the hypothetical integration of AI with radiologists’ decisions were excluded because they did not incorporate radiologist behaviour when AI is applied.
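The footnote does not state how the CEM prediction and the reader's assessment were combined. The sketch below uses one simple, assumed rule (averaging the continuous AI score with the binary recall decision) purely to illustrate the idea of fixing the operating threshold at the reader's sensitivity and then reading off specificity; all data in it are simulated, and only the 77.1% target comes from the footnote.

```python
import random
random.seed(0)

# Hypothetical screening examinations: (is_cancer, cem_score, reader_recall).
cases = []
for _ in range(5000):
    is_cancer = random.random() < 0.05
    cem_score = min(1.0, max(0.0, random.gauss(0.7 if is_cancer else 0.3, 0.15)))
    reader_recall = random.random() < (0.77 if is_cancer else 0.03)
    cases.append((is_cancer, cem_score, reader_recall))

# One simple way to combine the two signals (an assumption, not the study's
# published method): average the continuous CEM score with the binary recall.
def combined(score, recall):
    return 0.5 * score + 0.5 * recall

cancer_scores = sorted((combined(s, r) for c, s, r in cases if c), reverse=True)
# Threshold chosen so the combined rule operates at the reader's sensitivity (77.1%).
threshold = cancer_scores[max(0, int(0.771 * len(cancer_scores)) - 1)]

healthy_scores = [combined(s, r) for c, s, r in cases if not c]
specificity = sum(s < threshold for s in healthy_scores) / len(healthy_scores)
print(round(100 * specificity, 1))
```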

§ Applied 11 times upsampling of the 6817 healthy women, resulting in 74 987 healthy women and a total simulated screening population of 75 534.
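The arithmetic behind the simulated population, for reference (the 547 cancers correspond to the 103 + 444 cancers in the post-screen prediction row above):

```python
# Each of the 6817 healthy women is counted 11 times; the 547 cancers are the
# 103 interval plus 444 other cancers in the post-screen row of the table.
healthy_original = 6817
cancers = 103 + 444
healthy_upsampled = 11 * healthy_original
print(healthy_upsampled)            # 74987
print(healthy_upsampled + cancers)  # 75534
```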

¶ Specificity estimates not based on exact numbers; the numbers were calculated by the reviewers from reported proportions applied to 75 334 women (347 screen detected cancers and 74 987 healthy women).
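A minimal sketch of that back-calculation, using the rounded percentages shown in the table; small discrepancies against the tabulated counts are expected because the reviewers presumably worked from more precise proportions.

```python
def counts_from_proportions(sens_pct, spec_pct, n_cancer=347, n_healthy=74_987):
    """Back-calculate approximate 2x2 counts from reported percentages."""
    tp = round(sens_pct / 100 * n_cancer)
    fn = n_cancer - tp
    tn = round(spec_pct / 100 * n_healthy)
    fp = n_healthy - tn
    return tp, fp, fn, tn

# Dembrower 2020, AI score <=0.0870 (80% low risk): reported 97.41% / 80.36%
print(counts_from_proportions(97.41, 80.36))
# -> (338, 14727, 9, 60260); the table shows 338 / 14 729 / 9 / 60 258,
#    the small gap reflecting rounding of the reported percentages.
```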

** In enriched test set multiple reader multiple case laboratory studies, where multiple readers assess the same images, there are considerable problems in summing 2×2 test data across readers.
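A small hypothetical example of why those 2×2 cells are reported as NA rather than summed: with three readers reading the same 100 enriched cases, averaging per-reader estimates respects the true number of cases, whereas adding the tables treats correlated reads of the same images as independent observations.

```python
# Hypothetical MRMC data: 3 readers, the SAME 100 cases (30 cancers, 70 healthy).
readers = [
    # (TP, FP, FN, TN)
    (24, 10, 6, 60),
    (21, 14, 9, 56),
    (27, 18, 3, 52),
]

# Averaging per-reader estimates (what the table reports) keeps n = 100 cases.
avg_sens = sum(tp / (tp + fn) for tp, _, fn, _ in readers) / len(readers)
print(round(100 * avg_sens, 1))  # 80.0

# Naively summing the tables pretends there were 300 independent cases,
# double-counting the same images and ignoring correlation between readers.
tp = sum(r[0] for r in readers); fn = sum(r[2] for r in readers)
print(tp + fn)  # 90 "cancers" from only 30 distinct tumours
```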