2021 Sep 2;374:n1872. doi: 10.1136/bmj.n1872

Table 4.

Summary of test accuracy outcomes

Study	Index test (manufacturer)/comparator	TP	FP	FN	TN	% Sensitivity (95% CI)	Δ % Sensitivity, P value or (95% CI)	% Specificity (95% CI)	Δ % Specificity, P value or (95% CI)
Standalone AI (5 studies):
Lotter 2021,²⁸ Index cancer	AI (in-house) at reader’s specificity	126	51	5	103	96.2 (91.7 to 99.2)	+14.2, P<0.001	66.9	Set to be equal
	AI (in-house) at reader’s sensitivity	107	14	24	140	82.0	Set to be equal	90.9 (84.9 to 96.1)	+24.0, P<0.001
	Comparator: average single reader†	NA	NA	NA	NA	82.0	—	66.9	—
McKinney 2020²⁹*	AI (in-house)	NR	NR	NR	NR	56.24	+8.1, P<0.001	84.29	+3.46, P=0.02
McKinney 2020²⁹*	Comparator: original single reader	NR	NR	NR	NR	48.1	—	80.83	—
Rodriguez-Ruiz 2019³³	AI (Transpara version 1.4.0)	63	25	16	95	80 (70 to 90)	+3 (-6.2 to 12.6)	79 (73 to 86)	Set to be equal
Rodriguez-Ruiz 2019³³	Comparator: average single reader§	NA	NA	NA	NA	77 (70 to 83)	—	79 (73 to 86)	—
Salim 2020³⁵†	AI-1 (anonymised)	605	NR	NR	NR	81.9 (78.9 to 84.6)	See below	96.6 (96.5 to 96.7)	Set to be equal
	AI-2 (anonymised)	495	NR	NR	NR	67.0 (63.5 to 70.4)	−14.9 v AI-1 (P<0.001)	96.6 (96.5 to 96.7)	Set to be equal
	AI-3 (anonymised)	498	NR	NR	NR	67.4 (63.9 to 70.8)	−14.5 v AI-1 (P<0.001)	96.7 (96.6 to 96.8)	Set to be equal
	Comparator: original reader 1	572	NR	NR	NR	77.4 (74.2 to 80.4)	−4.5 v AI-1 (P=0.03)	96.6 (96.5 to 96.7)	—
	Comparator: original reader 2	592	NR	NR	NR	80.1 (77.0 to 82.9)	−1.8 v AI-1 (P=0.40)	97.2 (97.1 to 97.3)	+0.6 v AI-1 (NR)
	Comparator: original consensus reading	628	NR	NR	NR	85.0 (82.2 to 87.5)	+3.1 v AI-1 (P=0.11)	98.5 (98.4 to 98.6)	+1.9 v AI-1 (NR)
Schaffter 2020³⁶‡	Top-performing AI (in-house)	NR	NR	NR	NR	77.1	Set to be equal	88	−8.7 v reader 1 (NR)
Schaffter 2020³⁶‡	Ensemble method (CEM; in-house)	NR	NR	NR	NR	77.1	Set to be equal	92.5	−4.2 v reader 1 (NR)
	Comparator: original reader 1	NR	NR	NR	NR	77.1	—	96.7 (96.6 to 96.8)	—
Schaffter 2020³⁶	Top-performing AI (in-house)	NR	NR	NR	NR	83.9	Set to be equal	81.2	−17.3 v consensus (NR)
	Comparator: original consensus reading	NR	NR	NR	NR	83.9	—	98.5	—
AI for triage pre-screen (4 studies):
Balta 2020²⁵	AI as pre-screen (Transpara version 1.6.0):
	AI score ≤2: ~15% low risk	114	15 028	0	2754	100.0	NA	15.49	NA
	AI score ≤5: ~45% low risk	109	9791	5	7991	95.61	NA	44.94	NA
	AI score ≤7: ~65% low risk	105	6135	9	11 647	92.11	NA	65.50	NA
Lång 2020²⁷	AI as pre-screen (Transpara version 1.4.0):
	AI score ≤2: ~19% low risk	68	7684	0	1829	100.0	NA	19.23	NA
	AI score ≤5: ~53% low risk	61	4438	7	5075	89.71	NA	53.35	NA
	AI score ≤7: ~73% low risk	57	2541	11	6972	83.82	NA	73.29	NA
Raya-Povedano 2021³¹	AI as pre-screen (Transpara version 1.6.0); AI score ≤7: ~72% low risk	100	4450	13	11 424	88.5 (81.1 to 93.7)	NA	72.0 (71.3 to 72.7)	NA
Dembrower 2020²⁶§	AI as pre-screen (Lunit version 5.5.0.16):
	AI score ≤0.0293: 60% low risk¶	347	29 787	0	45 200	100.0	NA	60.28	NA
	AI score ≤0.0870: 80% low risk¶	338	14 729	9	60 258	97.41	NA	80.36	NA
AI for triage post-screen (1 study):
Dembrower 2020²⁶§	AI as post-screen (Lunit version 5.5.0.16); prediction of interval cancers: AI score ≥0.5337: ~2% high risk	32	1413	168	73 921	16	NA	98.12	NA
Dembrower 2020²⁶§	AI as post-screen (Lunit version 5.5.0.16); prediction of interval and next round screen detected cancers: AI score ≥0.5337: ~2% high risk	103	1342	444	73 645	19	NA	98.21	NA
AI as reader aid (3 studies):
Pacilè 2020³⁰	AI support§ (MammoScreen version 1)	NA	NA	NA	NA	69.1 (60.0 to 78.2)	+3.3, P=0.02	73.5 (65.6 to 81.5)	+1.0, P=0.63
Pacilè 2020³⁰	Comparator: average single reader**	NA	NA	NA	NA	65.8 (57.4 to 74.3)	—	72.5 (65.6 to 79.4)	—
Rodriguez-Ruiz 2019³²	AI support (Transpara version 1.3.0)	86	29	14	111	86 (84 to 88)	+3, P=0.05	79 (77 to 81)	+2, P=0.06
Rodriguez-Ruiz 2019³²	Comparator: average single reader	83	32	17	108	83 (81 to 85)	—	77 (75 to 79)	—
Watanabe 2019³⁷	AI support** (cmAssist)	NA	NA	NA	NA	62 (range 41 to 75)	+11, P=0.03	77.2	−0.9 (NR)
Watanabe 2019³⁷	Comparator: average single reader**	NA	NA	NA	NA	51 (range 25 to 71)	—	78.1	—

AI=artificial intelligence; CEM=challenge ensemble method of eight top performing AIs from DREAM challenge; CI=confidence interval; DREAM=Dialogue on Reverse Engineering Assessment and Methods; FN=false negatives; FP=false positives; NA=not applicable; NR=not reported; TN=true negatives; TP=true positives.

^*

Inverse probability weighting: negative cases were upweighted to account for the spectrum enrichment of the study population. Patients associated with negative biopsies were downweighted by 0.64. Patients who were not biopsied were upweighted by 23.61.
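To illustrate how weights of this kind change a specificity estimate, here is a minimal Python sketch. The two weights (0.64 and 23.61) come from the footnote above; the per-group counts and the function name `weighted_specificity` are hypothetical, and the source does not describe the estimator in more detail than the weights themselves, so this is one plausible reading rather than the study's actual procedure.

```python
# Weights reported in the footnote.
W_NEGATIVE_BIOPSY = 0.64   # patients with a negative biopsy (downweighted)
W_NOT_BIOPSIED = 23.61     # patients who were not biopsied (upweighted)


def weighted_specificity(groups):
    """Specificity from (true_negatives, false_positives, weight) tuples.

    Each negative case contributes its group's weight instead of 1.
    """
    tn = sum(weight * t for t, _, weight in groups)
    fp = sum(weight * f for _, f, weight in groups)
    return tn / (tn + fp)


# Hypothetical counts, chosen only to show the direction of the effect.
groups = [
    (40, 60, W_NEGATIVE_BIOPSY),  # negative biopsy: 40 TN, 60 FP
    (90, 10, W_NOT_BIOPSIED),     # not biopsied:    90 TN, 10 FP
]

unweighted = weighted_specificity([(t, f, 1.0) for t, f, _ in groups])
weighted = weighted_specificity(groups)

print(f"unweighted specificity: {unweighted:.3f}")
print(f"weighted specificity:   {weighted:.3f}")
```

Because only negative cases are reweighted, sensitivity is unchanged in this sketch. Specificity moves towards the value seen in the heavily upweighted group of women who were not biopsied.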

^†

Applied an inverse probability weighted bootstrapping (1000 samples) with a 14:1 ratio of healthy women to women receiving a diagnosis of cancer to simulate a study population with a cancer prevalence matching a screening cohort.

^‡

In addition, the challenge ensemble method prediction was combined with the original radiologist assessment. At the first reader’s sensitivity of 77.1%, CEM+reader 1 resulted in a specificity of 98.5% (95% confidence interval 98.4% to 98.6%), higher than the specificity of the first reader alone of 96.7% (95% confidence interval 96.6% to 96.8%; P<0.001). At the consensus readers’ sensitivity of 83.9%, CEM+consensus did not significantly improve on the consensus interpretations alone (98.1% v 98.5% specificity, respectively). These simulated results of the hypothetical integration of AI with radiologists’ decisions were excluded as they did not incorporate radiologist behaviour when AI is applied.

^§

Applied 11 times upsampling of the 6817 healthy women, resulting in 74 987 healthy women and a total simulated screening population of 75 534.

^¶

Specificity estimates not based on exact numbers; the numbers were calculated by reviewers from reported proportions applied to 75 334 women (347 screen detected cancers and 74 987 healthy women).
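The arithmetic behind these reconstructed counts can be sketched in Python. The population sizes (6817 healthy women upsampled 11 times, 347 screen detected cancers) and the false negative counts are taken from the table and footnotes; the function `reconstruct` and the rounding step are assumptions. The source does not give the reviewers' exact procedure, so this is simply one reading that reproduces the tabulated values.

```python
HEALTHY_ORIGINAL = 6817
UPSAMPLING_FACTOR = 11
SCREEN_DETECTED = 347

healthy = HEALTHY_ORIGINAL * UPSAMPLING_FACTOR  # 74 987 healthy women
population = healthy + SCREEN_DETECTED          # 75 334 women


def reconstruct(low_risk_fraction, fn):
    """Rebuild (TP, FP, FN, TN) from the proportion triaged as low risk.

    Assumes the low-risk group consists of the false negatives plus the
    true negatives, with FN taken from the table.
    """
    low_risk = round(low_risk_fraction * population)
    tn = low_risk - fn
    fp = healthy - tn
    tp = SCREEN_DETECTED - fn
    return tp, fp, fn, tn


for fraction, fn in [(0.60, 0), (0.80, 9)]:
    tp, fp, fn, tn = reconstruct(fraction, fn)
    specificity = 100 * tn / (tn + fp)
    print(fraction, (tp, fp, fn, tn), round(specificity, 2))
```

Under these assumptions the 60% threshold gives 347, 29 787, 0 and 45 200, and the 80% threshold gives 338, 14 729, 9 and 60 258, matching the pre-screen rows for Dembrower 2020.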

^**

In enriched test set, multiple reader, multiple case laboratory studies, where multiple readers assess the same images, there are considerable problems in summing 2×2 test data across readers.
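For rows that do report a single 2×2 table (TP, FP, FN, TN), the sensitivity and specificity point estimates follow directly from the counts. The short Python sketch below recomputes three rows as a consistency check. The function name `accuracy` is illustrative, and confidence intervals are left out because the table does not state how they were derived.

```python
def accuracy(tp, fp, fn, tn):
    """Return (% sensitivity, % specificity), rounded to two decimals."""
    sensitivity = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    return round(sensitivity, 2), round(specificity, 2)


# Counts copied from the triage pre-screen rows of the table.
rows = {
    "Balta 2020, AI score <=7": (105, 6135, 9, 11647),
    "Lång 2020, AI score <=5": (61, 4438, 7, 5075),
    "Dembrower 2020, AI score <=0.0870": (338, 14729, 9, 60258),
}

for label, counts in rows.items():
    sens, spec = accuracy(*counts)
    print(f"{label}: sensitivity {sens}%, specificity {spec}%")
```

The recomputed values (92.11 and 65.50, 89.71 and 53.35, 97.41 and 80.36) agree with the percentages reported in the table.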