Frontiers in Radiology. 2026 Mar 18;6:1747783. doi: 10.3389/fradi.2026.1747783

Artificial intelligence–assisted breast ultrasound: modest AUROC improvement and shorter interpretation time without significant change in diagnostic accuracy

Hyuksool Kwon 1,2,3, Sun Mi Kim 4,*, Seok Hwan Oh 1,3,5, Mijung Jang 4, Myeong-Gee Kim 1,3,5, Hyeon-Min Bae 3,5, Sang Il Choi 4, Su Min Cho 4, Youngmin Kim 5, Guil Jung 5, Hyeon-Jik Lee 5, Sang-Yun Kim 5
PMCID: PMC13039021  PMID: 41929405

Abstract

Objectives

To evaluate whether Vis-BUS, a commercial artificial intelligence (AI) breast ultrasound detection and analysis software, improves diagnostic discrimination and interpretation efficiency in breast ultrasound examinations.

Materials and methods

This retrospective multi-reader study included 258 breast ultrasound examinations (129 malignant and 129 benign lesions). Six radiologists independently interpreted all cases without AI and, after a two-week washout, with AI assistance. Diagnostic performance metrics, including the area under the receiver operating characteristic curve (AUROC), area under the precision–recall curve (AUPRC), accuracy, sensitivity, and specificity, were compared using multi-reader analysis. Median interpretation time per case was recorded and compared using paired statistical tests.

Results

Vis-BUS assistance modestly increased the pooled AUROC (0.921 vs. 0.953, p = 0.002) and halved the median reading time (6.0 vs. 3.0 s, p < 0.001). Accuracy (79.1% vs. 83.9%, p = 0.061), sensitivity (94.2% vs. 96.3%, p = 0.243), specificity (64.0% vs. 71.6%, p = 0.069), and AUPRC did not differ significantly. Subgroup analyses demonstrated significant AUROC improvements for dense breasts and tumors ≤ 2 cm (p < 0.001 for both).

Conclusion

Vis-BUS AI assistance was associated with improved diagnostic discrimination and shorter interpretation time. However, accuracy, sensitivity, and specificity did not differ significantly. These findings suggest potential efficiency benefits, while the clinical impact remains to be confirmed in prospective multi-center studies.

Keywords: artificial intelligence, breast cancer, computer-aided diagnosis, diagnostic accuracy, ultrasound imaging

Introduction

Breast cancer remains the most common malignancy among women worldwide and represents a leading cause of cancer-related mortality (1). Mammography is the standard screening modality; however, its diagnostic performance is significantly reduced in women with dense breast tissue, in whom cancers may be masked by fibroglandular structures (2, 3). Supplemental breast ultrasound (US) provides real-time, radiation-free imaging and demonstrates higher sensitivity for small and mammographically occult cancers (4). This clinical relevance is particularly pronounced in East Asia. In South Korea, where over 50% of screening-age women have heterogeneously or extremely dense breasts, the Korean National Health Insurance Service began reimbursing supplemental breast ultrasound in 2021 (5). The ongoing MUST-BE trial is evaluating combined mammography-plus-ultrasound screening in women aged 40–59 (6). In this context, optimizing ultrasound interpretation addresses a clinical need of national importance. Nonetheless, US interpretation suffers from operator dependence, variable specificity, and longer interpretation times, leading to higher recall rates and unnecessary biopsies compared with mammography (7).

To address these challenges, artificial intelligence (AI) solutions for breast US have been developed. AI-based computer-aided detection (CADe) and diagnosis (CADx) systems can automatically identify suspicious regions (8, 9). Prior reader studies have suggested that such AI systems may improve diagnostic accuracy, reduce inter-reader variability, and enhance efficiency (10, 11). More recently, commercial AI breast US tools have demonstrated potential to support radiologists in both diagnostic and screening settings (12). Although much of the literature comprises retrospective, laboratory-style reader studies, to our knowledge no prior study has simultaneously quantified both diagnostic discrimination and interpretation efficiency in a fully crossed multi-reader multi-case design using a commercial AI system integrated into the reading workflow with automated time logging and calibration analysis (13).

We developed Vis-BUS, an AI breast ultrasound detection and analysis software designed to assist radiologists by providing lesion localization and analysis. Given the limitations of traditional US interpretation and the emerging role of AI, we conducted a multi-reader study to evaluate whether Vis-BUS assistance could improve diagnostic discrimination and reduce interpretation time in breast US. We hypothesized that AI support would enhance lesion-level diagnostic performance and efficiency, particularly in challenging subgroups such as dense breasts and small tumors.

Materials and methods

Study design and ethical approval

This was a single-center, retrospective, multi-reader, multi-case study approved by the Institutional Review Board of Seoul National University Bundang Hospital (protocol no. E-2310-858-301). The requirement for informed consent was waived because all data were de-identified and the study posed minimal risk.

Study population

From the institutional PACS archives, 258 breast ultrasound examinations were retrospectively collected between 2015 and 2020. All examinations were diagnostic breast ultrasound studies performed for evaluation of known or suspected breast lesions; screening ultrasound examinations were not included in this cohort. Therefore, recall from screening does not apply to this study population. Eligible patients were women aged 19–79 years who underwent diagnostic breast US for solid breast lesions ≥ 5 mm with a definitive reference diagnosis. A total of 129 malignant lesions (112 invasive carcinomas and 17 ductal carcinoma in situ confirmed by pathology) and 129 benign lesions (55 biopsy-proven and 74 confirmed stable after ≥ 12 months of imaging follow-up) were included. The proportion of DCIS among malignant cases (17/129, 13.2%) is consistent with reported institutional and registry data, in which DCIS typically represents 10%–20% of pathologically confirmed breast cancers.

In cases with multiple lesions, only the index lesion was selected for analysis, defined as the lesion with pathologic confirmation or, if unavailable, the most clinically relevant lesion with long-term follow-up. This approach minimized redundancy and ensured independence of cases. Breast density was abstracted from the closest mammogram within ±12 months when available; density was unknown in 35/258 (13.6%) because some patients underwent diagnostic US without mammography in routine practice. We retained such cases to reflect real-world workflows but explicitly acknowledge this as a limitation and provide subgroup analyses by density where available. A prior history of breast cancer or lesion was permitted if the patient was not undergoing active cancer treatment at the time of imaging and the ultrasound was not obtained within one month after biopsy or surgery, in line with our exclusion criteria. The detailed inclusion and exclusion criteria are provided in Supplementary Material S1 and Supplementary Table S1.

Case selection and sampling

We used an enrichment strategy to stabilize multi-reader variance: all eligible malignant lesions during the accrual window were enumerated, and benign cases were randomly sampled from the pool of eligible benign examinations to achieve a 1:1 ratio (129:129). This design increases the precision of reader-study estimates but limits the generalizability of prevalence-dependent metrics (e.g., PPV/NPV and AUPRC), which are therefore interpreted cautiously and contextualized in the Discussion/Limitations.
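The 1:1 enrichment step can be sketched as follows. This is an illustration only: the benign pool size, case identifiers, and random seed are hypothetical; the study sampled from all eligible benign examinations in the accrual window.

```python
import random

# Illustrative sketch of the 1:1 enrichment sampling described above.
# Pool size, IDs, and seed are hypothetical, not the study's values.
rng = random.Random(2024)

malignant_ids = [f"M{i:03d}" for i in range(129)]  # all eligible malignant cases
benign_pool = [f"B{i:04d}" for i in range(1500)]   # hypothetical eligible benign pool

# Randomly sample benign cases to match the malignant count (1:1 ratio)
benign_sample = rng.sample(benign_pool, k=len(malignant_ids))
cohort = malignant_ids + benign_sample  # 258 cases total
```

Sampling without replacement (as `random.sample` does) guarantees case independence within the benign arm.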

The accrual window (2015–2020) was chosen to ensure temporal separation from the Vis-BUS training data (acquired through 2019). All eligible malignant diagnostic breast US cases meeting the prespecified inclusion and exclusion criteria were enumerated; the resulting count of 129 reflected the application of strict eligibility requirements (solid lesions ≥5 mm, definitive reference standard, no post-procedural imaging within one month). Readers were informed that the study dataset was enriched with approximately equal proportions of malignant and benign cases, consistent with published recommendations for MRMC reader studies (14). This transparency minimizes the risk of threshold distortion associated with undisclosed enrichment.

Image acquisition

All examinations were performed using high-resolution ultrasound systems from multiple vendors (Samsung Medison, GE Healthcare, Philips Healthcare), equipped with high-frequency linear transducers. At least two orthogonal B-mode images of each lesion were acquired and stored.

Reader panel and reading sessions

Six board-certified radiologists independently interpreted all cases. The panel included two breast imaging specialists (14 and 17 years of experience) and four general radiologists (1–38 years of experience). Reader labeling was as follows: R1 (17 y) and R3 (14 y) were breast imaging specialists; R2 (1 y), R4 (38 y), R5 (1 y), and R6 (1 y) were general radiologists.

In session 1 (unaided), radiologists reviewed images without AI assistance. After a two-week washout, the same set of cases was re-randomized and interpreted again with AI assistance (session 2). Although readers may have recognized previously seen lesions, the washout interval and randomization of case order were intended to minimize recall bias. For each case, radiologists determined recall vs. no recall, drew a region of interest (ROI), and assigned a malignancy score.

AI tool

Vis-BUS (Barreleye Inc., Seoul, South Korea) is a commercially available AI breast ultrasound detection and analysis software that has received regulatory clearance from the Korean Ministry of Food and Drug Safety (MFDS). The system was developed by Barreleye Inc., whose relationship to the study authors is disclosed in the Conflict of Interest statement. Vis-BUS was trained on 190,000 ultrasound images obtained from five tertiary hospitals between 2010 and 2019. Expert radiologists annotated lesion boundaries and malignancy status, and each annotation was independently verified by a second reader. The dataset was split into 70% training, 15% validation, and 15% internal test sets at the patient level to prevent data leakage. Key hyperparameters were tuned on the validation set (learning rate = 1 × 10⁻⁴, weight decay = 1 × 10⁻⁵, batch size = 16). Additional details are provided in Supplementary Material S2.

The model's output is reported as the Cancer Probability Score (CPS; −100 = definitely benign to +100 = definitely malignant), a continuous malignancy confidence score that conveys the algorithm's graded assessment. Clinically, a positive CPS (>0) is intended to prompt consideration of biopsy, while the full −100 to +100 range supports ROC-based discrimination analysis. A probabilistic score allows clinicians to adjust decision thresholds to patient context and risk tolerance, facilitates nuanced discussions with patients, and supports downstream triage or follow-up strategies. The symmetric −100 to +100 scale centers the decision boundary at 0, aligning with the no-biopsy vs. biopsy dichotomy; it mirrors the sign of the model's log-odds while preserving a familiar 200-point range, and it simplifies up- or down-titrating thresholds (e.g., screening vs. diagnostic contexts). Models were optimized using AdamW with learning rates on the order of 10⁻⁴, and standard data-augmentation and early-stopping strategies were applied to improve generalization (Figure 1). Additional details on dataset composition, exact train–validation–test splits, specific hyperparameters, and training protocols are provided in the Supplementary Material.
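Because the article states only that the CPS mirrors the sign of the model's log-odds on a symmetric −100 to +100 scale, one plausible mapping can be sketched as a scaled tanh of the log-odds. The function below is our illustration, not the vendor's published transform.

```python
import math

def cps_from_logodds(logodds: float) -> float:
    """Map a model log-odds value to a Cancer Probability Score on the
    symmetric -100..+100 scale. Illustrative mapping only: a scaled
    tanh preserves the sign of the log-odds (log-odds 0 -> CPS 0, the
    no-biopsy/biopsy boundary) and saturates at +/-100. The actual
    Vis-BUS transform is not published in this article.
    """
    return 100.0 * math.tanh(logodds / 2.0)

# A confidently malignant prediction (large positive log-odds) maps
# near +100, a confidently benign one near -100, uncertainty near 0.
high = cps_from_logodds(8.0)
low = cps_from_logodds(-8.0)
```

Any monotone, sign-preserving squashing function would serve the same purpose; the key design property is that the threshold CPS = 0 coincides with probability 0.5.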

Figure 1.

Diagram illustrating an AI system for ultrasound lesion analysis, showing the workflow from B-mode input image cropping and preprocessing, through lesion detection and segmentation networks, to feature extraction with convolutional layers, and fusion encoding for final lesion characterization.

Overall configuration of the Vis-BUS neural networks. Vis-BUS integrates two primary components: Lesion Detection AI (LD-AI) and Lesion Analysis AI (LA-AI). LD-AI uses B-mode ultrasound images (B_img ∈ ℝ^(256×256)) to identify the location of the lesion, while LA-AI analyzes the characteristics of the breast B-mode image.

Assistance protocol

Each radiologist conducted two reading sessions separated by a two-week washout. In Session 1 (unaided) the reader, blinded to clinical and pathologic data, reviewed every case, decided recall vs. no-recall, drew an ROI box around the lesion, assigned a cancer-probability score (CPS; −100 = definitely benign to +100 = definitely malignant), and the workstation automatically logged interpretation time. In Session 2 (AI-assisted) the same 258 cases were re-randomized; Vis-BUS displayed its bounding box and malignancy likelihood, after which the reader could adjust the ROI and CPS following the identical workflow while interpretation time was again recorded.

Two complementary endpoints were collected per case: (1) a binary action decision (recall vs. no-recall), representing the reader's clinical recommendation for biopsy or further workup; and (2) a continuous Cancer Probability Score (CPS; −100 = definitely benign to +100 = definitely malignant), used for ROC-based discrimination analysis. Sensitivity and specificity were derived from the binary recall decision, while AUROC was derived from the continuous CPS. The ground truth was the definitive diagnosis: pathology-confirmed malignancy, biopsy-proven benign, or benign by stable imaging follow-up (≥12 months).
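The binary endpoint maps directly onto sensitivity and specificity; a minimal sketch with hypothetical per-case decisions (not study data) makes the computation explicit.

```python
def sens_spec(recalled, malignant):
    """Sensitivity and specificity from the per-case binary endpoint:
    recalled[i] is the reader's recall decision (1 = recall) and
    malignant[i] the reference standard (1 = malignant).
    Illustrative helper, not the study's analysis code."""
    tp = sum(1 for r, m in zip(recalled, malignant) if r and m)
    tn = sum(1 for r, m in zip(recalled, malignant) if not r and not m)
    n_pos = sum(malignant)
    n_neg = len(malignant) - n_pos
    return tp / n_pos, tn / n_neg

# 4 malignant and 4 benign cases; one missed cancer, one false recall
recalled  = [1, 1, 1, 0, 1, 0, 0, 0]
malignant = [1, 1, 1, 1, 0, 0, 0, 0]
sens, spec = sens_spec(recalled, malignant)
```

With one missed cancer out of four and one false recall out of four benign cases, both sensitivity and specificity evaluate to 0.75.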

Outcome measures

The primary outcomes were diagnostic discrimination metrics, including the area under the receiver operating characteristic curve (AUROC) and the area under the precision–recall curve (AUPRC), comparing unaided and AI-assisted readings.

The secondary outcomes included overall diagnostic accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), interpretation time per case, and inter-reader agreement. Subgroup analyses were conducted according to patient age (< 50 years vs. ≥ 50 years), breast density (dense vs. non-dense), tumor size (≤ 2 cm vs. > 2 cm), and lesion morphology. Lesion size was measured as the maximal diameter on B-mode ultrasound for all cases. Pathologic tumor size was available for surgically resected malignant cases and was used for T-stage classification.

Microcalcification-dominant lesions were defined as US-visible lesions in which echogenic foci consistent with mammographic microcalcifications constituted the predominant imaging feature, with or without an accompanying mass; classification was determined by the consensus of two radiologists based on paired mammogram-US review.

Statistical analysis

Statistical analyses were performed using R (version 4.5). Multi-reader multi-case analysis of variance was applied using the Obuchowski–Rockette framework, treating readers and cases as random effects. Differences in AUROC and AUPRC were assessed with paired comparisons, while accuracy, sensitivity, specificity, PPV, and NPV were analyzed with generalized linear mixed models. Reading times were compared using the Wilcoxon signed-rank test.
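The paired reading-time comparison can be reproduced exactly for small samples; the pure-Python sketch below implements the exact two-sided Wilcoxon signed-rank test under the stated assumptions (no zero differences, no tied absolute differences). The reading times are synthetic; the study used standard software on the per-case times logged by the workstation.

```python
from itertools import product

def wilcoxon_exact(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test for small
    samples (assumes no zero differences and no tied |differences|).
    Returns (T+, p). Sketch only; the study used standard software."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Rank the absolute differences from smallest (rank 1) to largest
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 each rank enters T+ with probability 1/2; enumerate all
    # 2^n sign assignments and count those at least as extreme.
    mu = n * (n + 1) / 4
    extreme = sum(
        1 for signs in product((0, 1), repeat=n)
        if abs(sum(r for r, s in zip(range(1, n + 1), signs) if s) - mu)
        >= abs(t_plus - mu)
    )
    return t_plus, extreme / 2 ** n

# Synthetic paired per-case reading times (seconds), unaided vs AI
without_ai = [9.0, 5.7, 3.6, 2.6, 6.0, 8.2, 4.1, 7.5]
with_ai    = [3.0, 5.2, 2.7, 2.0, 3.1, 4.0, 2.9, 3.5]
t_plus, p = wilcoxon_exact(without_ai, with_ai)
```

Because every synthetic case is faster with AI, T+ takes its maximum value (36 for n = 8) and the exact two-sided p-value is 2/256 ≈ 0.008.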

In this reader study, sensitivity was defined as the proportion of pathology-confirmed malignancies for which the reader recommended recall, and specificity as the proportion of confirmed benign lesions for which the reader recommended no recall. These definitions reflect each reader's independent, blinded assessment of pre-selected cases and are standard in MRMC reader study methodology (16).

Both continuous CPS data (for AUROC) and binary recall decisions (for sensitivity/specificity) were collected from each reader for every case. This dual-endpoint approach is standard in MRMC reader studies and is explicitly endorsed by the FDA guidance for CAD clinical performance assessment (2022). The continuous CPS enables full ROC curve estimation across all possible thresholds, while the binary recall decision provides a clinically interpretable operating point. These complementary endpoints are analyzed with separate statistical methods and answer distinct questions about discriminative ability vs. point-estimate diagnostic accuracy (17).
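The AUROC derived from the continuous CPS is equivalent to the probability that a randomly chosen malignant case outscores a randomly chosen benign case (the Mann–Whitney interpretation), which the following sketch computes directly. The scores are made up; the study used MRMC analysis software.

```python
def auroc(scores, labels):
    """AUROC as the fraction of malignant-benign case pairs in which
    the malignant case has the higher score (ties count 1/2).
    Illustrative; the study used MRMC analysis software."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical CPS values on the -100..+100 scale (1 = malignant)
cps   = [85, 40, -10, 25, -60, -95]
truth = [1, 1, 1, 0, 0, 0]
pooled = auroc(cps, truth)
```

Here one malignant case (CPS −10) is outscored by one benign case (CPS 25), so 8 of the 9 pairs are correctly ordered and the AUROC is 8/9 ≈ 0.889, close to the study's unaided pooled value.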

Our sample size determination followed the recommended hybrid approach: all eligible consecutive malignant cases from the accrual window (2015–2020) were enumerated (n = 129), and benign cases were randomly sampled to achieve a 1:1 ratio (n = 129), yielding 258 total cases. An a priori power calculation using conjectured variance parameters from published MRMC studies confirmed that a minimum of 239 cases was required to detect an absolute AUROC increase of 0.03 with 85% power at a two-sided α of 0.05 (15). The final cohort of 258 cases exceeded this requirement, providing a buffer for potential case exclusion. Key statistical definitions and formulas are summarized in Supplementary Material S3 (Supplementary Table S2).

Results

Patient population and case characteristics

A total of 258 breast ultrasound cases (129 malignant and 129 benign) from 258 patients were included (Figure 2), and each radiologist completed two reading sessions separated by a two-week washout (Figure 3). Patient age ranged from 20 to 78 years (mean 48.2 ± 11.0; median 47 years). Malignant cases were on average older than benign cases (median age 50 vs. 46 years, p < .001). Most patients (196, 76.0%) had dense breasts on mammography. A prior breast lesion or cancer history was present in 34 cases (13.2%); such patients remained eligible under the prespecified criteria because they were not undergoing active cancer treatment at the time of imaging and the index lesion had an independent reference standard, while cases with post-procedural imaging (after vacuum-assisted biopsy or surgical excision) were excluded per protocol. All 129 malignancies were confirmed by pathology (112 invasive cancers and 17 ductal carcinoma in situ), while of the 129 benign lesions, 55 (42.6%) were biopsy-proven benign and 74 (57.4%) were confirmed stable on ≥1-year imaging follow-up (BI-RADS 2 or 3 outcome). Lesion size ranged from 0.5 to 8.5 cm (median, 1.3 cm); malignant lesions were larger on average (median 2.0 cm vs. 0.9 cm for benign). By T stage, 70/129 (54.3%) cancers were ≤2 cm, 48 (37.2%) were 2–5 cm, and 11 (8.5%) exceeded 5 cm. Sonographic morphology differed markedly between benign and malignant lesions: 96.1% of cancers appeared irregular in shape (vs. 51.2% of benign lesions), whereas most benign lesions were oval (45.0% vs. 3.1% of cancers). Margins were predominantly indistinct (68.2% overall), with spiculated margins in 25/129 cancers (19.4%) and 1/129 benign lesions (0.8%). Sonographic morphology descriptors (shape, margin, echogenicity) were abstracted from the original clinical ultrasound reports and are presented in Table 1.

Figure 2.

Flowchart illustrating case selection for a breast lesion study, showing inclusion and exclusion criteria, resulting in 258 cases divided into 129 malignant biopsy-proven lesions and 129 benign lesions confirmed by pathology or follow-up.

Case enrollment. Flowchart illustrating the retrospective case selection process and lesion classification pipeline. After applying the inclusion criteria and then excluding cases that met any exclusion criteria, a final cohort of 258 eligible ultrasound cases was obtained. The final dataset comprised 129 malignant cases and 129 benign cases. Malignant lesions were all confirmed as cancers by biopsy, whereas benign lesions were confirmed either by benign pathology or by stable imaging findings over ≥1 year. An enrichment sampling strategy was used to ensure equal numbers of malignant and benign cases in the study cohort.

Figure 3.

Workflow diagram comparing unaided and AI-assisted reader assessments for ultrasound images; includes steps for recall decision, region of interest placement, and cancer probability scoring, along with outcomes for AI analysis and reader actions.

Vis-BUS–assisted versus unaided reader assessments in breast ultrasound. (A) Session 1 (unaided; steps 1–3): radiologists, blinded to clinical and pathologic data, reviewed each of the 258 cases, decided “recall” versus “no recall,” drew a rectangular region of interest (ROI) around the lesion, and assigned a cancer probability score (CPS; −100 = definitely benign to +100 = definitely malignant). Interpretation time for each case was automatically recorded by the workstation. (B) Session 2 (AI-assisted; step 4): after a two-week washout, the same cases were re-randomized. Vis-BUS automatically overlaid its bounding box indicating lesion location and displayed a malignancy likelihood score. Readers then adjusted the ROI and CPS following the identical workflow, with interpretation time again logged.

Table 1.

Patient and case characteristics.

Characteristic All patients (N = 258) Benign (n = 129) Malignant (n = 129) P-value
Age (years) 48.2 ± 11.0 (20–78) 44.9 ± 10.0 (20–78) 51.6 ± 10.9 (22–78) <.001
Lesion size (cm) 1.7 ± 1.4, median 1.3 [0.8–2.1] 1.0 ± 0.6, median 0.9 [0.6–1.2] 2.4 ± 1.6, median 2.0 [1.4–3.0] <.001
Breast density, n (%) <.001
 ACR A 2 (0.8%) 0 (0.0%) 2 (1.6%)
 ACR B 25 (9.7%) 3 (2.3%) 22 (17.1%)
 ACR C 148 (57.4%) 78 (60.5%) 70 (54.3%)
 ACR D 48 (18.6%) 26 (20.2%) 22 (17.1%)
 Unknown 35 (13.6%) 22 (17.1%) 13 (10.1%)
Lesion shape, n (%) <.001
 Oval 62 (24.0%) 58 (45.0%) 4 (3.1%)
 Round 6 (2.3%) 5 (3.9%) 1 (0.8%)
 Irregular 190 (73.6%) 66 (51.2%) 124 (96.1%)
Margin, n (%) <.001
 Circumscribed 43 (16.7%) 41 (31.8%) 2 (1.6%)
 Indistinct 176 (68.2%) 85 (65.9%) 91 (70.5%)
 Angular 13 (5.0%) 2 (1.6%) 11 (8.5%)
 Spiculated 26 (10.1%) 1 (0.8%) 25 (19.4%)
Echogenicity, n (%) <.001
 Hypoechoic 210 (81.4%) 96 (74.4%) 114 (88.4%)
 Isoechoic 28 (10.9%) 25 (19.4%) 3 (2.3%)
 Heterogeneous 15 (5.8%) 6 (4.7%) 9 (7.0%)
 Complex/Hyperechoic 5 (1.9%) 2 (1.6%) 3 (2.3%)
BI-RADS category, n (%) <.001
 C2 35 (13.6%) 35 (27.1%) 0 (0.0%)
 C3 48 (18.6%) 48 (37.2%) 0 (0.0%)
 C4 69 (26.7%) 46 (35.7%) 23 (17.8%)
 C5 106 (41.1%) 0 (0.0%) 106 (82.2%)

Age is given as mean ± SD with range in parentheses; lesion size as mean ± SD with median [IQR]; categorical variables are number of patients, with percentages in parentheses.

ACR, American College of Radiology; BI-RADS, Breast Imaging Reporting and Data System; IQR, interquartile range; SD, standard deviation; cm, centimeter.

Reader accuracy and diagnostic performance

Interpretation time was defined as the elapsed time from case display to submission of the recall decision and CPS score, automatically logged by the workstation. This interval encompasses the complete per-case reading task (image review, recall decision, ROI placement, and CPS assignment) but does not include clinical scanning, mammographic correlation, or report generation.

All six radiologists demonstrated improved diagnostic discrimination with Vis-BUS AI assistance (Table 2). At the pooled level, the area under the ROC curve (AUROC) increased from 0.921 (95% CI 0.889–0.954) without AI to 0.953 (95% CI 0.930–0.975)—a gain of 0.032 (95% CI 0.013–0.050; p = 0.002). In the pooled analysis, AUPRC was 0.932 without AI vs. 0.933 with AI; this difference was not statistically significant (Δ = 0.001; 95% CI −0.013 to 0.015; p = 0.944) (Figure 4).

Table 2.

Per-reader diagnostic performance without vs. with AI assistance.

Reader (year of experience) Metric Without AI (95% CI) With AI (95% CI) Δ (95% CI) p-value
Pooled AUROC 0.921 (0.889–0.954) 0.953 (0.930–0.975) +0.032 (0.013–0.050) 0.002
AUPRC 0.932 (0.921–0.943) 0.933 (0.923–0.942) +0.001 (−0.013–0.015) 0.944
Sensitivity 94.2% (90.6–96.5) 96.3% (93.5–97.9) +2.1% (–1.6–5.6) 0.243
Specificity 64.0% (57.9–69.6) 71.6% (65.5–76.9) +7.6% (–0.5–15.8) 0.069
PPV 58.4% (51.8–64.7) 63.9% (57.1–70.2) +5.5% (–3.7–14.8) 0.243
NPV 86.3% (80.1–90.9) 90.8% (85.1–94.4) +4.5% (–2.6–11.5) 0.219
Accuracy 79.1% (75.1–82.6) 83.9% (80.2–87.0) +4.8% (–2.4–11.9) 0.061
Reader 1 (17) AUROC 0.944 (0.918–0.970) 0.952 (0.926–0.978) +0.008 (–0.044–0.060) 0.339
AUPRC 0.922 (0.895–0.945) 0.947 (0.927–0.965) +0.025 (−0.003–0.057) 0.495
Sensitivity 90.7% (84.3–95.1) 96.9% (92.3–99.1) +6.2% (–2.8–14.8) 0.013
Specificity 75.2% (66.8–82.4) 63.6% (54.6–71.9) –11.6% (–27.8–5.1) 0.005
PPV 78.5% (71.1–84.8) 72.7% (65.4–79.2) –5.8% (–19.4–8.1) 0.015
NPV 89.0% (81.6–94.2) 95.3% (88.5–98.7) +6.3% (–5.7–17.1) 0.019
Accuracy 82.9% (77.8–87.3) 80.2% (74.8–84.9) –2.7% (–12.5–7.1) 0.296
Reader 2 (1) AUROC 0.913 (0.878–0.948) 0.937 (0.908–0.966) +0.024 (–0.040–0.088) 0.078
AUPRC 0.917 (0.890–0.942) 0.917 (0.888–0.942) +0.001 (−0.039–0.038) 0.965
Sensitivity 96.9% (92.3–99.1) 94.6% (89.1–97.8) –2.3% (–10.0–5.5) 0.45
Specificity 51.9% (43.0–60.8) 72.9% (64.3–80.3) +21.0% (3.5–37.3) <.001
PPV 66.8% (59.6–73.5) 77.7% (70.4–84.0) +10.9% (–3.1–24.4) <.001
NPV 94.4% (86.2–98.4) 93.1% (86.2–97.2) –1.3% (–12.2–11.0) 0.66
Accuracy 74.4% (68.6–79.6) 83.7% (78.6–88.0) +9.3% (–1.0–19.4) <.001
Reader 3 (14) AUROC 0.931 (0.902–0.960) 0.974 (0.960–0.989) +0.043 (0.000–0.087) <.001
AUPRC 0.931 (0.903–0.953) 0.942 (0.920–0.960) +0.011 (−0.018–0.043) 0.546
Sensitivity 91.5% (85.3–95.7) 96.1% (91.2–98.7) +4.6% (–4.5–13.4) 0.114
Specificity 73.6% (65.2–81.0) 81.4% (73.6–87.7) +7.8% (–7.4–22.5) 0.034
PPV 77.6% (70.2–84.0) 83.8% (76.8–89.3) +6.2% (–7.2–19.1) 0.008
NPV 85.4% (80.3–89.5) 86.6% (81.8–90.5) +1.2% (–7.7–10.2) 0.604
Accuracy 82.6% (77.4–87.0) 88.8% (84.3–92.3) +6.2% (–2.7–14.9) 0.005
Reader 4 (38) AUROC 0.942 (0.916–0.969) 0.970 (0.953–0.987) +0.028 (–0.016–0.071) 0.008
AUPRC 0.965 (0.949–0.977) 0.938 (0.915–0.959) −0.027 (−0.052–0.002) 0.478
Sensitivity 95.3% (90.2–98.3) 97.7% (93.4–99.5) +2.4% (–4.9–9.3) 0.450
Specificity 76.0% (67.7–83.1) 72.9% (64.3–80.3) –3.1% (–18.8–12.6) 0.556
PPV 79.9% (72.7–85.9) 78.3% (71.1–84.4) –1.6% (–14.8–11.7) 0.533
NPV 88.2% (83.5–92.0) 86.6% (81.8–90.5) –1.6% (–10.2–7.0) 0.462
Accuracy 85.7% (80.8–89.7) 85.3% (80.3–89.4) –0.4% (–9.4–8.6) 1
Reader 5 (1) AUROC 0.890 (0.851–0.930) 0.941 (0.912–0.970) +0.051 (–0.018–0.119) 0.003
AUPRC 0.948 (0.913–0.974) 0.908 (0.871–0.938) −0.040 (−0.086–0.004) 0.531
Sensitivity 96.9% (92.3–99.1) 97.7% (93.4–99.5) +0.8% (–5.7–7.2) 1.000
Specificity 41.1% (32.5–50.1) 61.2% (52.3–69.7) +20.1% (2.2–37.2) <.001
PPV 62.2% (55.1–68.9) 71.6% (64.3–78.1) +9.4% (–4.6–23.0) <.001
NPV 88.6% (82.7–93.0) 86.6% (81.8–90.5) –2.0% (–11.2–7.8) 0.436
Accuracy 69.0% (63.0–74.6) 79.5% (74.0–84.2) +10.5% (–0.6–21.2) <.001
Reader 6 (1) AUROC 0.903 (0.865–0.941) 0.937 (0.904–0.970) +0.034 (–0.037–0.105) 0.081
AUPRC 0.911 (0.880–0.937) 0.945 (0.922–0.965) +0.035 (−0.003–0.072) 0.493
Sensitivity 93.8% (88.1–97.3) 94.6% (89.1–97.8) +0.8% (–8.2–9.7) 1.000
Specificity 65.9% (57.0–74.0) 77.5% (69.3–84.4) +11.6% (–4.7–27.4) 0.012
PPV 73.3% (65.9–79.9) 80.8% (73.6–86.7) +7.5% (–6.3–20.8) 0.006
NPV 86.5% (81.5–90.6) 86.6% (81.8–90.5) +0.1% (–8.8–9.0) 0.967
Accuracy 79.8% (74.4–84.6) 86.0% (81.2–90.0) +6.2% (–3.4–15.6) 0.015

AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision–recall curve; CI, confidence interval; PPV, positive predictive value; NPV, negative predictive value; Δ = absolute change (With AI minus Without AI).

Figure 4.

Bar chart illustration shows sensitivity, specificity, and accuracy percentages for readers R1 to R6, comparing results with and without AI. Each graph uses orange for without AI and blue for with AI.

Receiver operating characteristic (ROC) and precision–recall (PR) curves for each reader and the pooled analysis, comparing unaided readings (solid lines) with Vis-BUS AI–assisted readings (dashed lines). In the pooled analysis, the AUROC rose from 0.921 (95% CI, 0.889–0.954) without AI to 0.953 (95% CI, 0.930–0.975; Δ = 0.032; p = 0.002), while the AUPRC increased from 0.932 (95% CI, 0.921–0.943) to 0.933 (95% CI, 0.923–0.942; Δ = 0.001; p = 0.944). Reader 1's AUROC/AUPRC changed from 0.944/0.922 to 0.952/0.947 (p = 0.339/0.495); Reader 2's from 0.913/0.917 to 0.937/0.917 (p = 0.078/0.965); Reader 3's from 0.931/0.931 to 0.974/0.942 (p < .001/0.546); Reader 4's from 0.942/0.965 to 0.970/0.938 (p = 0.008/0.478); Reader 5's from 0.890/0.948 to 0.941/0.908 (p = 0.003/0.531); and Reader 6's from 0.903/0.911 to 0.937/0.945 (p = 0.081/0.493).

Pooled sensitivity increased from 94.2% (90.6–96.5) to 96.3% (93.5–97.9), an improvement of 2.1 points (95% CI –1.6–5.6; p = 0.243), while pooled specificity improved from 64.0% (57.9–69.6) to 71.6% (65.5–76.9), a gain of 7.6 points (95% CI –0.5–15.8; p = 0.069). Positive predictive value (PPV) rose from 58.4% to 63.9% (+5.5 points; 95% CI –3.7–14.8; p = 0.243), negative predictive value (NPV) from 86.3% to 90.8% (+4.5 points; 95% CI –2.6–11.5; p = 0.219), and overall accuracy from 79.1% to 83.9% (+4.8 points; 95% CI –2.4–11.9; p = 0.061).

Individually, all six readers experienced AUROC gains with AI, with significant improvements in three readers (R3: 0.931→0.974, p < .001; R4: 0.942→0.970, p = 0.008; R5: 0.890→0.941, p = 0.003). Changes in AUPRC varied by reader (Δ –0.040 to +0.035) and did not reach statistical significance for any individual. Sensitivity increased in five of six readers, achieving significance only for Reader 1 (90.7%→96.9%, p = 0.013), while specificity improvements were significant for Readers 2 (+21.0 points, p < .001), 3 (+7.8 points, p = 0.034), 5 (+20.1 points, p < .001), and 6 (+11.6 points, p = 0.012). PPV and NPV trends mirrored these patterns, and accuracy improved significantly in four readers.

Reader efficiency and interpretation time

AI assistance significantly improved reading efficiency for most readers. With Vis-BUS, the median interpretation time per case decreased for four of the six readers (Table 3). For example, Reader 1's median read time dropped from 9 s (IQR 4–40) without AI to 3 s (IQR 2–4) with AI (p < .001). Reader 4's median time fell from 3.6 (2.7–5.3) to 2.7 (2.0–4.9) seconds (p < .001), and Reader 5's from 2.6 to 2.0 s (p < .001). In contrast, two readers showed no meaningful change in speed: Reader 3's median was 5.7 s without vs. 5.2 s with AI (p = 0.3), and Reader 6's times were 2.1 vs. 2.2 s (p = 0.7). Pooled across all readings, the median interpretation time fell from 6.0 s without AI to 3.0 s with AI (Wilcoxon p < .001). Mean reading time per case also declined from 28.0 ± 5.6 s to 9.5 ± 1.9 s with AI. Figure 5 illustrates the distribution of per-case reading times by condition, showing a general left shift (faster reads) with AI assistance.

Table 3.

Per-reader interpretation time with and without AI assistance.

Reader (year of experience) Without AI, sec (IQR) With AI, sec (IQR) Δ (95% CI) P-value
Reader 1 (17) 9 (4, 40) 3 (2, 4) 5.7 (3.6–8.6) <.001
Reader 2 (1) 3.0 (1.9, 6.2) 2.4 (1.9, 4.3) 0.2 (0–0.4) <.001
Reader 3 (14) 5.7 (4.0, 8.5) 5.2 (3.5, 8.6) 0.4 (−0.3–0.8) 0.3
Reader 4 (38) 3.6 (2.7, 5.3) 2.7 (2.0, 4.9) 0.8 (0.5–1.1) <.001
Reader 5 (1) 2.6 (2.0, 4.0) 2.0 (1.6, 2.8) 0.5 (0.4–0.7) <.001
Reader 6 (1) 2.1 (1.7, 3.2) 2.2 (1.5, 3.4) 0.2 (0–0.4) 0.7

CI, confidence interval; IQR, interquartile range.

P-values from paired two-sided Wilcoxon signed-rank tests (comparing unaided vs AI-assisted conditions).

Figure 5.

Box plot chart comparing interpretation time in seconds for readers two to six, showing left bars for without artificial intelligence and right bars for with artificial intelligence. Statistically significant reductions in interpretation time are observed for readers two, four, and five based on p-values less than zero point zero zero one, while non-significant differences are shown for readers three and six.

Per-reader interpretation times without vs. with AI assistance (Readers 2–6). Box-and-whisker plots show the distribution of per-case ultrasound interpretation times for five radiologists (Readers 2–6), comparing unaided readings ("Without AI," white boxes) with AI-assisted readings ("With AI," orange boxes). Boxes span the interquartile range (25th–75th percentiles), whiskers extend to the most extreme values within 1.5 × IQR, and horizontal lines denote medians. The p-value from a two-sided paired Wilcoxon signed-rank test is annotated above each reader's pair, indicating the statistical significance of time reductions with Vis-BUS assistance. Reader 1 is omitted from this panel because that reader's extreme range (median 9 s, IQR 4–40 s) would compress the y-axis and obscure detail.
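As a concrete illustration of the paired analysis reported in Table 3, the sketch below runs a two-sided Wilcoxon signed-rank test on simulated per-case reading times. The values are synthetic stand-ins generated for illustration, not the study's raw data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Simulated paired per-case reading times (seconds) for one reader;
# these values are synthetic stand-ins, not the study's measurements.
rng = np.random.default_rng(0)
t_without_ai = rng.lognormal(mean=1.8, sigma=0.6, size=258)  # roughly 6 s median
t_with_ai = t_without_ai * rng.uniform(0.3, 0.9, size=258)   # generally faster

# Two-sided paired Wilcoxon signed-rank test, as used for Table 3
stat, p = wilcoxon(t_without_ai, t_with_ai, alternative="two-sided")
delta = np.median(t_without_ai - t_with_ai)
print(f"median paired reduction = {delta:.1f} s, p = {p:.2g}")
```

With real data, `t_without_ai` and `t_with_ai` would hold each reader's per-case times in matched case order, since the signed-rank test depends on within-case pairing.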

Subgroup performance and error patterns

Radiologist performance improved significantly with AI support in specific challenging subgroups (Table 4). For small tumors (T1, ≤2 cm), AI assistance increased the average reader AUROC from 0.889 without AI to 0.933 with AI (Δ = 0.044; p < 0.001).

Table 4.

Diagnostic performance across subgroups with and without AI.

Subgroup Without AI With AI Δ (95% CI) P-value
AUROC
 Age < 50 y 0.906 0.946 +0.041 (0.017 to 0.065) 0.002
 Age ≥ 50 y 0.911 0.953 +0.043 (0.014 to 0.073) 0.004
 Dense breasts 0.902 0.944 +0.042 (0.021 to 0.064) <0.001
 Calcifications present 0.911 0.956 +0.045 (0.008 to 0.085) 0.012
 Calcifications absent 0.912 0.947 +0.035 (0.012 to 0.059) 0.002
 Tumor size ≤ 2 cm 0.889 0.933 +0.044 (0.019 to 0.070) <0.001
 Tumor size 2–5 cm 0.891 0.952 +0.060 (−0.002 to 0.129) 0.062
 Irregular shape 0.891 0.942 +0.050 (0.028 to 0.073) <0.001
 General radiologists¹ 0.922 0.955 +0.034 (0.010 to 0.057) 0.003
 Breast-imaging specialists¹ 0.904 0.946 +0.042 (0.015 to 0.070) 0.002
Sensitivity (%)
 Age < 50 y 93.1 96.0 +2.9 (−2.3 to 8.1) 0.263
 Age ≥ 50 y 95.2 96.5 +1.3 (−3.5 to 6.0) 0.599
 Dense breasts 93.5 95.5 +2.0 (−2.2 to 6.2) 0.346
 Calcifications present 94.4 96.7 +2.3 (−2.3 to 6.8) 0.311
 Calcifications absent 93.9 95.8 +1.9 (−3.6 to 7.3) 0.500
 Tumor size ≤ 2 cm 92.1 94.0 +1.9 (−3.7 to 7.5) 0.502
 Tumor size 2–5 cm 96.2 99.0 +2.8 (−1.7 to 7.3) 0.159
 Irregular shape 94.3 96.1 +1.9 (−1.8 to 5.5) 0.304
 General radiologists¹ 94.8 95.7 +1.0 (−2.8 to 4.7) 0.611
 Breast-imaging specialists¹ 93.0 97.3 +4.3 (−0.2 to 8.7) 0.060
Specificity (%)
 Age < 50 y 64.8 72.0 +7.3 (−2.3 to 16.9) 0.140
 Age ≥ 50 y 61.6 70.2 +8.6 (−6.8 to 24.0) 0.277
 Dense breasts 65.7 73.1 +7.4 (−1.8 to 16.5) 0.117
 Calcifications present 66.7 70.5 +3.8 (−23.7 to 31.4) 0.784
 Calcifications absent 63.6 71.7 +8.0 (−0.5 to 16.6) 0.067
 Tumor size ≤ 2 cm 64.9 72.5 +7.6 (−0.7 to 15.9) 0.074
 Tumor size 2–5 cm 44.4 52.8 +8.3 (−31.0 to 47.7) 0.679
 Irregular shape 38.9 52.0 +13.1 (−3.9 to 30.1) 0.134
 General radiologists¹ 58.1 73.3 +15.1 (6.6 to 23.6) 0.001
 Breast-imaging specialists¹ 75.6 68.2 −7.4 (−16.5 to 1.8) 0.118

AUROC, area under the receiver operating characteristic curve; CI, confidence interval.

AI assistance significantly improved AUROC in challenging subgroups, with consistent but nonsignificant numeric gains in sensitivity and specificity. In dense breasts (BI-RADS density categories C/D), the AUROC increased from 0.902 to 0.944 (Δ = 0.042; p < 0.001), while sensitivity rose from 93.5% to 95.5% (Δ = 2.0%; p = 0.346) and specificity from 65.7% to 73.1% (Δ = 7.4%; p = 0.117). For lesions with calcifications, the AUROC increased from 0.911 to 0.956 (Δ = 0.045; p = 0.012), sensitivity from 94.4% to 96.7% (Δ = 2.3%; p = 0.311), and specificity from 66.7% to 70.5% (Δ = 3.8%; p = 0.784). AUROC gains in the remaining subgroups (e.g., tumors 2–5 cm) did not reach statistical significance (p > 0.05). AI also narrowed the gap between reader types: the AUROC rose from 0.922 to 0.955 in general radiologists (Δ = 0.034; p = 0.003) and from 0.904 to 0.946 in breast-imaging specialists (Δ = 0.042; p = 0.002), reducing the between-group difference to 0.009 (not significant). Sensitivity for general radiologists changed little (94.8% to 95.7%; Δ = 1.0%; p = 0.611), whereas their specificity increased markedly, from 58.1% to 73.3% (Δ = 15.1%; p = 0.001).

Error-pattern analysis compared the standalone AI binary classification (CPS > 0 = positive) against unaided reader performance. Of the 33 cases misclassified by the standalone AI, 12 were cancers that the AI failed to detect (AI false negatives) but that were correctly recalled by at least one of the six unaided readers. Conversely, 5 cancers were correctly flagged by the standalone AI (CPS > 0) but missed by all six unaided readers (i.e., all six decided "no recall"). This complementary error pattern, in which the AI and human readers tended to miss different subsets of cancers, suggests a potential benefit of combining AI assistance with radiologist interpretation to maximize overall detection (Table 5). Notably, however, within these 33 AI-misclassified cases the median number of correct readers per case fell from 3 without AI to 2 with AI (p = 0.003; Table 5), indicating a risk of automation bias when the AI output is wrong.

Table 5.

Radiologist behavior in cases misclassified by AI alone (matching_result = N): reader-accuracy distribution and Wilcoxon signed-rank analysis.

No. of readers correct per case Unaided reading (no AI), no. of cases (%) AI-assisted reading, no. of cases (%)
0 7 (21%) 8 (24%)
1 5 (15%) 7 (21%)
2 4 (12%) 5 (15%)
3 4 (12%) 4 (12%)
4 4 (12%) 5 (15%)
5 5 (15%) 2 (6%)
6 4 (12%) 2 (6%)
Median [IQR] correct readers per case 3 [1–5] 2 [1–4]
Wilcoxon signed-rank test† z = −2.99, p = 0.003

IQR, interquartile range.

Data are based on the 33 cases in which the AI system (Vis-BUS) misclassified the lesion (matching_result = N). Values in parentheses are percentages of these 33 cases. The comparison is a per-case paired analysis of reader performance without vs. with AI assistance (Vis-BUS). †The Wilcoxon signed-rank test compares the number of correct readers (out of six) per case between conditions.
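The complementary-error tally described above can be computed directly from binary recall decisions. The arrays below are hypothetical toy data (eight cases, six readers) invented for illustration, not the study's 258 cases:

```python
import numpy as np

# Hypothetical binary data (1 = malignant / recall, 0 = benign / no recall);
# these toy arrays are illustrative only.
truth = np.array([1, 1, 1, 1, 0, 0, 1, 0])   # ground truth per case
ai = np.array([0, 0, 1, 1, 1, 0, 1, 0])      # standalone AI call (CPS > 0)
readers = np.array([                          # unaided calls, 6 readers x 8 cases
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
])

cancer = truth == 1
any_reader = readers.any(axis=0)              # recalled by >= 1 unaided reader
# Cancers the AI missed but at least one unaided reader recalled
ai_fn_caught = int(np.sum(cancer & (ai == 0) & any_reader))
# Cancers the AI flagged but all six unaided readers missed
ai_tp_all_missed = int(np.sum(cancer & (ai == 1) & ~any_reader))
print(ai_fn_caught, ai_tp_all_missed)         # prints: 2 3
```

The same two counts, computed over the study data, correspond to the 12 AI false negatives caught by readers and the 5 cancers caught only by the AI.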

AI assistance also reduced variability among radiologists: the intraclass correlation coefficient for inter-reader agreement increased from moderate reliability (ICC = 0.70; 95% CI 0.65–0.75) to good reliability (ICC = 0.78; 95% CI 0.74–0.82), reflecting more consistent interpretations.
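An inter-reader agreement coefficient of this kind can be sketched as a two-way random-effects, absolute-agreement, single-rater ICC, i.e., ICC(2,1). The article does not state which ICC variant was computed, so the variant choice and the simulated ratings below are assumptions for illustration only:

```python
import numpy as np

def icc2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x is an (n_subjects, k_raters) array of ratings."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)    # between-subject SS
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)    # between-rater SS
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual SS
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Simulated example: 30 cases rated by 6 readers with moderate rating noise
rng = np.random.default_rng(1)
latent = rng.normal(0, 1, 30)                          # "true" case scores
ratings = latent[:, None] + rng.normal(0, 0.5, (30, 6))
icc = icc2_1(ratings)
print(f"ICC(2,1) = {icc:.2f}")
```

With noise variance 0.25 against case variance 1.0, the population ICC here is 0.8; shrinking the rating noise raises the estimate toward 1, mirroring the tighter agreement seen with AI assistance.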

Figure 6 illustrates how Vis-BUS AI assistance affects diagnostic performance and reading efficiency for each reader: bar charts display sensitivity, specificity, and accuracy with and without AI for both pooled and individual data. Calibration analysis showed that the AI tended to overestimate malignancy probabilities at higher CPS values (Supplementary Figure S2); a detailed calibration analysis is provided in Supplementary Material S4. At a CPS threshold of +20 for recommending biopsy, sensitivity and specificity were 91.5% and 68.2%, respectively, an operating point that approximates BI-RADS 4A management recommendations.
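The operating-point metrics quoted for the CPS ≥ +20 biopsy rule follow directly from the binary confusion counts at that threshold. The sketch below shows the computation on hypothetical scores (the CPS values and labels are invented, not the study data):

```python
import numpy as np

def sens_spec(scores, truth, threshold):
    """Sensitivity/specificity of the rule 'recommend biopsy if score >= threshold'.
    truth: 1 = malignant, 0 = benign."""
    pred = scores >= threshold
    malignant = truth.astype(bool)
    sens = np.mean(pred[malignant])     # fraction of cancers flagged
    spec = np.mean(~pred[~malignant])   # fraction of benign lesions cleared
    return float(sens), float(spec)

# Hypothetical Cancer Probability Scores on a -100..+100 scale
scores = np.array([85, 40, 10, -10, 63, 25, -55, 30, -20, 5])
truth = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
sens, spec = sens_spec(scores, truth, threshold=20)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

Sweeping `threshold` over the CPS range traces out the ROC curve; the +20 cut is simply one point on it, chosen here to mimic a BI-RADS 4A-like recall rule.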

Figure 6.


Effect of AI on diagnostic performance and efficiency. Bar charts summarize sensitivity, specificity, and overall accuracy for each reader in unaided (yellow) and AI-assisted (orange) conditions. Across most metrics and readers, Vis-BUS improved or maintained performance relative to unaided readings.

By expertise, the AUROC increased from 0.904 to 0.946 in specialists (Δ = 0.042; p = 0.002) and from 0.922 to 0.955 in general radiologists (Δ = 0.034; p = 0.003), narrowing the gap between groups. Reader 3 (specialist, 14 years), Reader 4 (general, 36 years), and Reader 5 (general, 1 year) showed significant individual AUROC gains.

Discussion

In this retrospective multi-reader study, use of Vis-BUS, a commercial AI breast ultrasound detection and analysis software, was associated with a statistically significant improvement in AUROC but no significant differences in AUPRC, accuracy, sensitivity, or specificity between unaided and AI-assisted interpretations. These findings suggest that while AI may aid in lesion-level discrimination, its overall diagnostic impact in terms of clinical accuracy remains limited. This interpretation is consistent with previous work showing that supplemental US enhances cancer detection in dense breasts but with increased false positives and reduced specificity (2, 3).

The reduction in median interpretation time from 6.0 to 3.0 s was statistically significant. However, the clinical impact of this finding requires caution. Time savings in image reading alone may not substantially influence the overall workflow, as the major time components in breast ultrasound include scanning the patient and drafting the report. Still, in high-volume practice settings, cumulative reductions across thousands of cases could produce meaningful efficiency gains, particularly in screening or outpatient environments (7, 18).

Subgroup analysis highlighted improved diagnostic discrimination for dense breasts and small tumors (≤2 cm). These are clinically important contexts, since dense tissue can obscure lesions on mammography and small cancers are often challenging to detect. Similar benefits of AI support in difficult diagnostic scenarios have been reported, showing increased radiologist performance when interpreting subtle lesions (12, 19).

Another notable finding was improved inter-reader consistency with AI assistance. This aligns with prior reports that AI support can reduce variability in BI-RADS classification and improve agreement among radiologists with varying levels of experience (9, 20). Such effects may be especially beneficial for less-experienced readers, narrowing the performance gap with experts and supporting training and education (9). Nevertheless, given that accuracy, sensitivity, and specificity did not differ significantly, the equalizing role of AI should be considered preliminary.

From a broader perspective, AI tools for breast imaging have demonstrated variable impact across modalities. In mammography, large-scale studies such as MASAI have confirmed that AI-supported reading can achieve performance non-inferior to double reading (10). Real-world nationwide implementations also support the feasibility and safety of AI deployment (11). For ultrasound, feasibility studies have shown that AI can function in real time to support detection and diagnosis (12), yet consistent evidence of substantial diagnostic benefit remains limited. Furthermore, questions regarding interpretability, integration with PACS, and clinician trust remain critical to adoption (8, 21). External validations and prospective trials are required to demonstrate robustness across institutions and populations (22).

The observed AUROC improvement of 0.031 was statistically significant, consistent with our a priori power target of detecting an absolute increase ≥0.03. However, the clinical significance of such a small absolute improvement warrants careful consideration. Prior methodological work suggests that AUROC improvements of this magnitude, while statistically detectable with adequate sample sizes, may not reliably translate to meaningful changes in patient outcomes such as recall rates, biopsy yields, or cancer detection rates (15, 23). Prospective studies with patient-level outcome endpoints are needed to determine whether the observed discrimination improvement translates to clinical benefit.
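One way to gauge whether an AUROC gain of this magnitude is distinguishable from noise is to attach a paired-bootstrap confidence interval to it. The sketch below uses the Mann-Whitney rank identity for AUROC and resamples cases while keeping each case's unaided and AI-assisted scores paired; all scores are simulated stand-ins, and this is a simplified single-reader sketch, not the MRMC analysis the study itself used:

```python
import numpy as np
from scipy.stats import rankdata

def auroc(scores, truth):
    """AUROC via the Mann-Whitney rank identity; ties receive average ranks."""
    ranks = rankdata(scores)
    n_pos = int(truth.sum())
    n_neg = truth.size - n_pos
    return (ranks[truth.astype(bool)].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Paired bootstrap CI for the AUROC gain of AI-assisted over unaided reads;
# the scores below are simulated, not the study data.
rng = np.random.default_rng(42)
n = 258
truth = np.repeat([1, 0], n // 2)              # 129 malignant, 129 benign
unaided = 1.2 * truth + rng.normal(0, 1, n)    # weaker discriminator
assisted = 1.8 * truth + rng.normal(0, 1, n)   # stronger discriminator

deltas = []
for _ in range(1000):
    idx = rng.integers(0, n, n)                # resample cases, keeping pairing
    deltas.append(auroc(assisted[idx], truth[idx]) - auroc(unaided[idx], truth[idx]))
lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"AUROC gain, 95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

Because cases are resampled as matched pairs, the interval reflects the correlated nature of the two reading conditions; an interval excluding zero indicates a gain beyond case-sampling noise, though, as noted above, statistical detectability does not by itself establish clinical relevance.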

This study has several limitations. First, it was retrospective and conducted at a single institution, which limits generalizability. Second, the dataset was enriched with equal numbers of malignant and benign lesions, differing from real-world prevalence and potentially inflating diagnostic metrics. Third, only static B-mode images were analyzed; as breast ultrasound is dynamic, some clinically relevant findings may not have been captured. Fourth, in cases with multiple lesions, only an index lesion was selected, which does not reflect the complexity of multifocal disease. Fifth, the fixed reading order (unaided first, then AI-assisted) may introduce order effects, although this sequential design is supported by the MRMC methodology literature (24, 25) and FDA guidance (2022) as providing greater statistical power while minimizing intra-reader variability. The unaided-first order prevents contamination of baseline performance by prior AI exposure. Although the two-week washout period and case re-randomization were intended to minimize recall bias, we cannot exclude the possibility that radiologists subconsciously remembered certain lesions; a counterbalanced crossover design would provide additional protection against memory effects and should be considered in future studies. Sixth, our sample size was calculated to detect an AUROC increase of 0.03, which, while statistically valid, may have limited clinical meaning. Seventh, this study was performed in a controlled laboratory setting using static B-mode images archived in PACS, rather than during real-time clinical scanning. This is a fundamental limitation inherent to all retrospective breast ultrasound reader studies (18). In clinical practice, radiologists perform dynamic scanning with real-time transducer manipulation, assess lesion compressibility and mobility, and integrate palpation findings and mammographic correlation, none of which are captured in static images.
The measured interpretation times (median 3–6 s per image) reflect the image review component only and should not be extrapolated to total clinical examination time, which typically requires 10–15 min including scanning, correlation, and reporting. The clinical significance of per-image time savings can only be assessed in prospective real-time studies, where AI may provide its greatest value through concurrent decision support during scanning. Prospective, multi-center studies with real-time AI integration are needed to validate the clinical applicability of these findings. Finally, an alternative study design in which readers are randomly assigned to two independent groups—one interpreting all cases without AI and the other interpreting all cases with AI—would eliminate memory bias entirely by removing repeated exposure to the same cases. Such a parallel-group (between-reader) design avoids the carryover and recall confounders inherent in within-reader crossover studies, although it requires a larger reader panel to achieve equivalent statistical power and introduces between-group variability as a potential confounder (26, 27). Future investigations should consider this approach to provide more robust evidence of AI-assisted diagnostic benefit.

In conclusion, Vis-BUS AI assistance improved diagnostic discrimination and reduced interpretation time, while accuracy, sensitivity, and specificity did not differ significantly. These findings support AI as a potential adjunct to radiologist interpretation in breast ultrasound, but further validation in prospective, multi-center trials is required before widespread adoption.

Acknowledgments

We extend our deepest gratitude to the dedicated breast radiologists who served as expert readers throughout the duration of this study: Mijung Jang, Yoon-ah Do, Bo-ra Yun, Hyoung-sim Suh, Hun-jong Lim, and Min-su Kim. From the initial lesion localization and ROI delineation to the final consensus review, their meticulous attention to detail ensured that each case was assessed with the highest standards of accuracy. They willingly committed to multiple reading sessions, including both unaided and AI-assisted phases, providing invaluable insights into Cancer Probability Score assessments and workflow timing metrics. Their constructive feedback during post-session debriefs directly informed refinements to our annotation protocols and contributed to enhanced inter-rater consistency. Beyond their technical contributions, their collaborative spirit—coordinating schedules, sharing best practices, and engaging in critical discussions—played a pivotal role in the seamless execution of this multi-reader study. We are profoundly appreciative of their sustained effort and expertise, which have been instrumental in demonstrating how Vis-BUS AI assistance can both elevate diagnostic performance and streamline breast ultrasound interpretation.

Funding Statement

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by Barreleye, Inc. (Seoul, South Korea).

Footnotes

Edited by: Amine Amyar, Harvard Medical School, United States

Reviewed by: Magnus Dustler, Lund University, Sweden

Olga Nanova, State Budget-Funded Health Care Institution of the City of Moscow "Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department", Russia

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by the Institutional Review Board of Seoul National University Bundang Hospital (protocol no. E-2310-858-301). The studies were conducted in accordance with the local legislation and institutional requirements. The ethics committee/institutional review board waived the requirement of written informed consent for participation from the participants or the participants' legal guardians/next of kin because the study retrospectively analyzed previously collected data.

Author contributions

HK: Validation, Resources, Visualization, Data curation, Project administration, Formal analysis, Methodology, Conceptualization, Writing – review & editing, Investigation, Supervision, Software, Writing – original draft, Funding acquisition. SK: Supervision, Methodology, Writing – review & editing, Validation, Investigation, Conceptualization, Data curation, Resources, Formal analysis, Project administration, Funding acquisition. SO: Conceptualization, Writing – review & editing, Data curation, Software, Methodology. MJ: Writing – review & editing, Data curation, Validation, Resources, Methodology. M-GK: Data curation, Project administration, Formal analysis, Writing – review & editing. H-MB: Funding acquisition, Investigation, Writing – review & editing, Project administration, Conceptualization. SC: Resources, Supervision, Validation, Writing – review & editing. SC: Validation, Writing – review & editing, Supervision, Resources. YK: Data curation, Formal analysis, Writing – review & editing. GJ: Supervision, Writing – review & editing, Validation, Software. H-JL: Software, Writing – review & editing, Validation. S-YK: Validation, Software, Writing – review & editing.

Conflict of interest

HK, SO, M-GK, and H-MB report equity ownership in and employment by Barreleye, Inc.

The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fradi.2026.1747783/full#supplementary-material

Table1.docx (126.5KB, docx)

References

  • 1.Arnold M, Morgan E, Rumgay H, Mafra A, Singh D, Laversanne M, et al. Current and future burden of breast cancer: global statistics for 2020 and 2040. Breast. (2022) 66:15–23. 10.1016/j.breast.2022.08.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Ohuchi N, Suzuki A, Sobue T, Kawai M, Yamamoto S, Zheng Y-F, et al. Sensitivity and specificity of mammography and adjunctive ultrasonography to screen for breast cancer in the Japan strategic anti-cancer randomized trial (J-START). Lancet. (2016) 387:341–8. 10.1016/S0140-6736(15)00774-6 [DOI] [PubMed] [Google Scholar]
  • 3.Brown AL, Vijapura C, Patel M, De La Cruz A, Wahab R. Breast cancer in dense breasts: detection challenges and supplemental screening opportunities. Radiographics. (2023) 43:e230024. 10.1148/rg.230024 [DOI] [PubMed] [Google Scholar]
  • 4.Berg WA, Blume JD, Cormack JB, Mendelson EB, Lehrer D, Böhm-Vélez M, et al. Combined screening with ultrasound and mammography vs mammography alone in women at elevated risk of breast cancer. JAMA. (2008) 299:2151–63. 10.1001/jama.299.18.2151 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Won SY, Lee GY, Lim KJ, et al. Supplemental breast ultrasound screening in women with dense breasts: a systematic review and meta-analysis. Korean J Radiol. (2021) 22:159–67. 10.3348/kjr.2019.0588 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Park SH, Kim SY, Seo M. Mammography and ultrasonography study for breast cancer screening effectiveness (MUST-BE): a study protocol. Ultrasonography. (2019) 38:190–8. 10.14366/usg.18012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Bahl M, Chang JM, Mullen LA, Berg WA. Artificial intelligence for breast ultrasound: AJR expert panel narrative review. AJR Am J Roentgenol. (2024) 223:e2330645. 10.2214/AJR.23.30645 [DOI] [PubMed] [Google Scholar]
  • 8.Díaz O, Rodríguez-Ruíz A, Sechopoulos I. Artificial intelligence for breast cancer detection: technology, challenges, and prospects. Eur J Radiol. (2024) 175:111457. 10.1016/j.ejrad.2024.111457 [DOI] [PubMed] [Google Scholar]
  • 9.Baek J, Kim J, Kim HJ, Yoon JH, Park HY, Lee J, et al. Clinical application of artificial intelligence in breast ultrasound. J Korean Soc Radiol. (2025) 86:216–26. 10.3348/jksr.2025.0019 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Lång K, Josefsson V, Larsson AM, Larsson S, Högberg C, Sartor H, et al. Artificial intelligence–supported screen reading versus standard double reading in the mammography screening with artificial intelligence (MASAI) randomised, controlled, non-inferiority, single-blinded, screening accuracy study. Lancet Oncol. (2023) 24:936–44. 10.1016/S1470-2045(23)00298-X [DOI] [PubMed] [Google Scholar]
  • 11.Eisemann N, Bunk S, Mukama T, Baltus H, Elsner SA, Gomille T, et al. Nationwide real-world implementation of AI for cancer detection in population-based mammography screening. Nat Med. (2025) 31:917–24. 10.1038/s41591-024-03408-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Lee J, Kim WH, Kim J, Kim HJ, Moon JS, Kang B, et al. Efficacy of a real-time artificial intelligence ultrasound system with CAD for breast cancer: a feasibility study. J Breast Cancer. (2025) 28:206–14. 10.4048/jbc.2024.0303 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Brot HF, Mango VL. Artificial intelligence in breast ultrasound: application in clinical practice. Ultrasonography. (2024) 43:3–14. 10.14366/usg.23116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Gallas BD, Chan HP, D'Orsi CJ, Dodd LE, Giger ML, Gur D, et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad Radiol. (2012) 19:463–77. 10.1016/j.acra.2011.12.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Martens FK, Tonk ECM, Kers JG, Janssens ACJW. Small improvement in the area under the receiver operating characteristic curve indicated small changes in predicted risks. J Clin Epidemiol. (2016) 79:159–64. 10.1016/j.jclinepi.2016.07.002 [DOI] [PubMed] [Google Scholar]
  • 16.Obuchowski NA, Bullen J. Multireader diagnostic accuracy imaging studies: fundamentals of design and analysis. Radiology. (2022) 303:26–34. 10.1148/radiol.211593 [DOI] [PubMed] [Google Scholar]
  • 17.Metz CE. ROC methodology in radiologic imaging. Invest Radiol. (1986) 21:720–33. 10.1097/00004424-198609000-00009 [DOI] [PubMed] [Google Scholar]
  • 18.Mango VL, Wynn RT, Ha R, et al. Quantitative assessment of an artificial intelligence–powered decision support on diagnostic performance in breast ultrasound. Breast J. (2020) 26:1540–8. [Google Scholar]
  • 19.Lai YC, Chen HH, Hsu JF, Hong Y-J, Chiu T-T, Chiou H-J. Evaluation of physician performance using a concurrent-read artificial intelligence system to support breast ultrasound interpretation. Breast. (2022) 65:124–35. 10.1016/j.breast.2022.07.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Cho E, Kim EK, Song MK, Yoon JH. Application of computer-aided diagnosis on breast ultrasonography: evaluation of diagnostic performances and agreement of radiologists according to different levels of experience. J Ultrasound Med. (2017) 37(1):209–16. 10.1002/jum.14332 [DOI] [PubMed] [Google Scholar]
  • 21.Choi JS, Han BK, Ko ES, Bae JM, Ko EY, Song SH, et al. Effect of a deep learning framework-based computer-aided diagnosis system on the diagnostic performance of radiologists in differentiating between malignant and benign masses on breast ultrasonography. Korean J Radiol. (2019) 20(1):749–58. 10.3348/kjr.2018.0530 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Salim M, Wåhlin E, Dembrower K, Azavedo E, Foukakis T, Liu Y, et al. External evaluation of three commercially available artificial intelligence algorithms for independent assessment of screening mammograms. JAMA Oncol. (2020) 6:1581–8. 10.1001/jamaoncol.2020.3321 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Anderson AW, Marinovich ML, Houssami N, Lee CI, Lowry KP, Elmore JG, et al. Independent external validation of artificial intelligence algorithms for automated interpretation of screening mammography: a systematic review. J Am Coll Radiol. (2022) 19(2 Pt A):259–73. doi: 10.1016/j.jacr.2021.11.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Beiden SV, Wagner RF, Doi K, Nishikawa RM, Freedman M, Ben Lo S, et al. Independent versus sequential reading in ROC studies of computer-assist modalities: analysis of components of variance. Acad Radiol. (2002) 9:1036–43. 10.1016/S1076-6332(03)80479-8 [DOI] [PubMed] [Google Scholar]
  • 25.Schalekamp S, van Ginneken B, Koedam E, Snoeren PR, Tiehuis AM, Wittenberg R, et al. Computer-aided detection improves detection of pulmonary nodules in chest radiographs beyond the support by bone-suppressed images. Radiology. (2014) 272:252–61. 10.1148/radiol.14131315 [DOI] [PubMed] [Google Scholar]
  • 26.Gallas BD, Pennello GA, Myers KJ. Multireader multicase variance analysis for binary data. J Opt Soc Am A. (2007) 24:B70–80. 10.1364/JOSAA.24.000B70 [DOI] [PubMed] [Google Scholar]
  • 27.Schalekamp S, et al. Independent versus sequential reading using computer-aided detection. J Med Imaging. (2014) 1:015501. 10.1117/1.JMI.1.1.015501 [DOI] [PMC free article] [PubMed] [Google Scholar]


