Table 4.
Summary of test accuracy outcomes
| Study | Index test (manufacturer)/comparator | TP | FP | FN | TN | % Sensitivity (95% CI) | Δ % Sensitivity, P value or (95% CI) | % Specificity (95% CI) | Δ % Specificity, P value or (95% CI) |
|---|---|---|---|---|---|---|---|---|---|
| Standalone AI (5 studies): | |||||||||
| Lotter 2021,28 index cancer | AI (in-house) at reader’s specificity | 126 | 51 | 5 | 103 | 96.2 (91.7 to 99.2) | +14.2, P<0.001 | 66.9 | Set to be equal |
| | AI (in-house) at reader’s sensitivity | 107 | 14 | 24 | 140 | 82.0 | Set to be equal | 90.9 (84.9 to 96.1) | +24.0, P<0.001 |
| | Comparator: average single reader† | NA | NA | NA | NA | 82.0 | — | 66.9 | |
| McKinney 202029* | AI (in-house) | NR | NR | NR | NR | 56.24 | +8.1, P<0.001 | 84.29 | +3.46, P=0.02 |
| | Comparator: original single reader | NR | NR | NR | NR | 48.1 | — | 80.83 | — |
| Rodriguez-Ruiz 201933 | AI (Transpara version 1.4.0) | 63 | 25 | 16 | 95 | 80 (70 to 90) | +3 (−6.2 to 12.6) | 79 (73 to 86) | Set to be equal |
| | Comparator: average single reader§ | NA | NA | NA | NA | 77 (70 to 83) | — | 79 (73 to 86) | — |
| Salim 202035† | AI-1 (anonymised) | 605 | NR | NR | NR | 81.9 (78.9 to 84.6) | See below | 96.6 (96.5 to 96.7) | Set to be equal |
| | AI-2 (anonymised) | 495 | NR | NR | NR | 67.0 (63.5 to 70.4) | −14.9 v AI-1 (P<0.001) | 96.6 (96.5 to 96.7) | Set to be equal |
| | AI-3 (anonymised) | 498 | NR | NR | NR | 67.4 (63.9 to 70.8) | −14.5 v AI-1 (P<0.001) | 96.7 (96.6 to 96.8) | Set to be equal |
| | Comparator: original reader 1 | 572 | NR | NR | NR | 77.4 (74.2 to 80.4) | −4.5 v AI-1 (P=0.03) | 96.6 (96.5 to 96.7) | — |
| | Comparator: original reader 2 | 592 | NR | NR | NR | 80.1 (77.0 to 82.9) | −1.8 v AI-1 (P=0.40) | 97.2 (97.1 to 97.3) | +0.6 v AI-1 (NR) |
| | Comparator: original consensus reading | 628 | NR | NR | NR | 85.0 (82.2 to 87.5) | +3.1 v AI-1 (P=0.11) | 98.5 (98.4 to 98.6) | +1.9 v AI-1 (NR) |
| Schaffter 202036‡ | Top-performing AI (in-house) | NR | NR | NR | NR | 77.1 | Set to be equal | 88 | −8.7 v reader 1 (NR) |
| | Ensemble method (CEM; in-house) | NR | NR | NR | NR | 77.1 | Set to be equal | 92.5 | −4.2 v reader 1 (NR) |
| | Comparator: original reader 1 | NR | NR | NR | NR | 77.1 | — | 96.7 (96.6 to 96.8) | |
| Schaffter 202036 | Top-performing AI (in-house) | NR | NR | NR | NR | 83.9 | Set to be equal | 81.2 | −17.3 v consensus (NR) |
| | Comparator: original consensus reading | NR | NR | NR | NR | 83.9 | — | 98.5 | — |
| AI for triage pre-screen (4 studies): | |||||||||
| Balta 202025 | AI as pre-screen (Transpara version 1.6.0): | ||||||||
| | AI score ≤2: ~15% low risk | 114 | 15 028 | 0 | 2754 | 100.0 | NA | 15.49 | NA |
| | AI score ≤5: ~45% low risk | 109 | 9791 | 5 | 7991 | 95.61 | NA | 44.94 | NA |
| | AI score ≤7: ~65% low risk | 105 | 6135 | 9 | 11 647 | 92.11 | NA | 65.50 | NA |
| Lång 202027 | AI as pre-screen (Transpara version 1.4.0): | ||||||||
| | AI score ≤2: ~19% low risk | 68 | 7684 | 0 | 1829 | 100.0 | NA | 19.23 | NA |
| | AI score ≤5: ~53% low risk | 61 | 4438 | 7 | 5075 | 89.71 | NA | 53.35 | NA |
| | AI score ≤7: ~73% low risk | 57 | 2541 | 11 | 6972 | 83.82 | NA | 73.29 | NA |
| Raya-Povedano 202131 | AI as pre-screen (Transpara version 1.6.0); AI score ≤7: ~72% low risk | 100 | 4450 | 13 | 11 424 | 88.5 (81.1 to 93.7) | NA | 72.0 (71.3 to 72.7) | NA |
| Dembrower 202026§ | AI as pre-screen (Lunit version 5.5.0.16): | ||||||||
| | AI score ≤0.0293: 60% low risk¶ | 347 | 29 787 | 0 | 45 200 | 100.0 | NA | 60.28 | NA |
| | AI score ≤0.0870: 80% low risk¶ | 338 | 14 729 | 9 | 60 258 | 97.41 | NA | 80.36 | NA |
| AI for triage post-screen (1 study): | |||||||||
| Dembrower 202026§ | AI as post-screen (Lunit version 5.5.0.16); prediction of interval cancers: AI score ≥0.5337: ~2% high risk | 32 | 1413 | 168 | 73 921 | 16 | NA | 98.12 | NA |
| Dembrower 202026§ | AI as post-screen (Lunit version 5.5.0.16); prediction of interval and next round screen detected cancers: AI score ≥0.5337: ~2% high risk | 103 | 1342 | 444 | 73 645 | 19 | NA | 98.21 | NA |
| AI as reader aid (3 studies): | |||||||||
| Pacilè 202030 | AI support§ (MammoScreen version 1) | NA | NA | NA | NA | 69.1 (60.0 to 78.2) | +3.3, P=0.02 | 73.5 (65.6 to 81.5) | +1.0, P=0.63 |
| | Comparator: average single reader** | NA | NA | NA | NA | 65.8 (57.4 to 74.3) | — | 72.5 (65.6 to 79.4) | — |
| Rodriguez-Ruiz 201932 | AI support (Transpara version 1.3.0) | 86 | 29 | 14 | 111 | 86 (84 to 88) | +3, P=0.05 | 79 (77 to 81) | +2, P=0.06 |
| | Comparator: average single reader | 83 | 32 | 17 | 108 | 83 (81 to 85) | — | 77 (75 to 79) | — |
| Watanabe 201937 | AI support** (cmAssist) | NA | NA | NA | NA | 62 (range 41 to 75) | +11, P=0.03 | 77.2 | −0.9 (NR) |
| | Comparator: average single reader** | NA | NA | NA | NA | 51 (range 25 to 71) | — | 78.1 | — |
AI=artificial intelligence; CEM=challenge ensemble method of eight top performing AIs from DREAM challenge; CI=confidence interval; DREAM=Dialogue on Reverse Engineering Assessment and Methods; FN=false negatives; FP=false positives; NA=not applicable; NR=not reported; TN=true negatives; TP=true positives.
\* Inverse probability weighting: negative cases were upweighted to account for the spectrum enrichment of the study population. Patients associated with negative biopsies were downweighted by 0.64. Patients who were not biopsied were upweighted by 23.61.
† Applied an inverse probability weighted bootstrapping (1000 samples) with a 14:1 ratio of healthy women to women receiving a diagnosis of cancer to simulate a study population with a cancer prevalence matching a screening cohort.
‡ In addition, the challenge ensemble method prediction was combined with the original radiologist assessment. At the first reader’s sensitivity of 77.1%, CEM+reader 1 resulted in a specificity of 98.5% (95% confidence interval 98.4% to 98.6%), higher than the specificity of the first reader alone of 96.7% (95% confidence interval 96.6% to 96.8%; P<0.001). At the consensus readers’ sensitivity of 83.9%, CEM+consensus did not significantly improve the consensus interpretations alone (98.1% v 98.5% specificity, respectively). These simulated results of the hypothetical integration of AI with radiologists’ decisions were excluded as they did not incorporate radiologist behaviour when AI is applied.
§ Applied 11 times upsampling of the 6817 healthy women, resulting in 74 987 healthy women and a total simulated screening population of 75 534.
¶ Specificity estimates not based on exact numbers; the numbers were calculated by reviewers from reported proportions applied to 75 334 women (347 screen detected cancers and 74 987 healthy women).
\*\* In enriched test set multiple reader multiple case laboratory studies where multiple readers assess the same images, there are considerable problems in summing 2×2 test data across readers.
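
As a reading aid, the point estimates in the table can be reproduced from the 2×2 counts wherever TP, FP, FN, and TN are reported. The Python sketch below is illustrative only and is not taken from any of the included studies: the `TwoByTwo` class and the variable names are hypothetical, the counts are copied from the Balta, Lång, and Dembrower rows, and the population totals are simple sums of the figures given in the footnotes and rows. It does not attempt to reproduce the confidence intervals, because the interval methods used by the studies are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one table row (hypothetical helper, not from the studies)."""
    tp: int
    fp: int
    fn: int
    tn: int

    def sensitivity(self) -> float:
        # Percentage of cancers classified as test positive: TP / (TP + FN).
        return 100.0 * self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Percentage of non-cancers classified as test negative: TN / (TN + FP).
        return 100.0 * self.tn / (self.tn + self.fp)

    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn


# Counts copied from the table rows.
rows = {
    "Balta 2020, AI score <=2": TwoByTwo(tp=114, fp=15028, fn=0, tn=2754),
    "Lang 2020, AI score <=5": TwoByTwo(tp=61, fp=4438, fn=7, tn=5075),
    "Dembrower 2020 pre-screen, 60% low risk": TwoByTwo(tp=347, fp=29787, fn=0, tn=45200),
    "Dembrower 2020 post-screen, interval + next round": TwoByTwo(tp=103, fp=1342, fn=444, tn=73645),
}

for name, row in rows.items():
    print(f"{name}: sensitivity {row.sensitivity():.2f}%, "
          f"specificity {row.specificity():.2f}%, n={row.total()}")

# Population arithmetic from the footnotes: 11 times upsampling of 6817 healthy women.
healthy_upsampled = 11 * 6817
# Pre-screen rows: upsampled healthy women plus 347 screen detected cancers.
pre_screen_total = healthy_upsampled + 347
# Post-screen rows: the row counts themselves sum to the larger simulated population.
post_screen_total = rows["Dembrower 2020 post-screen, interval + next round"].total()
print(healthy_upsampled, pre_screen_total, post_screen_total)
```

For the triage rows, a "negative" test means the AI score fell at or below the low risk threshold, so specificity here equals the proportion of women without cancer who would be removed from radiologist reading.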