NPJ Digital Medicine. 2024 Nov 30;7:349. doi: 10.1038/s41746-024-01328-w

Impact of human and artificial intelligence collaboration on workload reduction in medical image interpretation

Mingyang Chen 1, Yuting Wang 1, Qiankun Wang 1, Jingyi Shi 1, Huike Wang 1, Zichen Ye 1, Peng Xue 1, Youlin Qiao 1
PMCID: PMC11608314  PMID: 39616244

Abstract

Clinicians face increasing workloads in medical imaging interpretation, and artificial intelligence (AI) offers potential relief. This meta-analysis evaluates the impact of human-AI collaboration on image interpretation workload. Four databases were searched for studies comparing reading time or quantity for image-based disease detection before and after AI integration. The Quality Assessment of Diagnostic Accuracy Studies tool was modified to assess risk of bias. Workload reduction and relative diagnostic performance were pooled using a random-effects model. Thirty-six studies were included. AI concurrent assistance reduced reading time by 27.20% (95% confidence interval, 18.22%–36.18%). The reading quantity decreased by 44.47% (40.68%–48.26%) and 61.72% (47.92%–75.52%) when AI served as the second reader and as a pre-screening tool, respectively. Overall relative sensitivity and specificity were 1.12 (1.09, 1.14) and 1.00 (1.00, 1.01), respectively. Despite these promising results, caution is warranted due to significant heterogeneity and uneven study quality.

Subject terms: Diagnosis, Translational research

Introduction

Imaging examinations are increasingly used for disease detection owing to their noninvasiveness and convenience, resulting in an escalating workload for clinicians. In Europe, double-reading of mammograms is recommended for breast cancer screening, with 100,000 women needing to be screened annually. However, the cancer detection rate is only ~6 per 1000 screened participants1,2. The limited availability of clinicians, coupled with stringent time constraints, exacerbates this strain, potentially compromising the quality of patient care. Medical imaging, as a specialty, facilitates the integration of artificial intelligence (AI) owing to its established digital workflow and universal standards for image storage. Over 200 commercial AI products based on image detection have been approved by the Food and Drug Administration3. Integrating AI into clinical practice holds the promise of reducing the workload by reshaping the role of clinicians and streamlining clinical workflows, thereby freeing scarce experts to engage in more critical tasks.

Previous studies have predominantly focused on the comparative evaluation of standalone AI and human clinicians, with findings indicating equivalent diagnostic performance4,5. However, deploying AI without human oversight raises significant ethical and legal concerns6. Human-AI collaboration, harnessing the strengths of both AI and clinicians, represents the most pertinent solution for current clinical scenarios. Numerous AI integration strategies lie on the spectrum between fully manual and fully automatic review. Examples include AI assisting clinicians in real-time image review7, serving as a second reader to validate findings alongside humans8, and acting as a pre-screening tool to identify low-risk cases that do not require human review9. These strategies have the potential to reduce clinicians’ workload, both in terms of time spent on image interpretation and the quantity of images reviewed.

In the medical literature, optimistic findings regarding the accuracy of medical AI are often presented as assertions of its capability to enhance efficiency. This blending of concepts is inaccurate and misleading10. Although workload lacks a precise definition, it encompasses at least two dimensions: image review time and quantity. To date, apart from one protocol11, few systematic reviews or meta-analyses have investigated the potential reduction in workload associated with the integration of AI into clinical workflows for image-based disease detection. Therefore, we aimed to systematically review the research status in this field and quantitatively analyze the impact of different human-AI collaboration strategies on workload and diagnostic performance.

Results

Study selection

Figure 1 shows a flowchart of study selection. In total, 4302 studies were identified using electronic databases and manual searches. After removing duplicates, 3197 studies underwent title and abstract screening. A total of 134 full-text articles were assessed for eligibility, and 85 were included in the systematic review. Forty-nine studies were excluded because of insufficient information, leaving 36 studies for meta-analysis.

Fig. 1. Flowchart of study selection.


Study characteristics

Of the 85 studies included in the systematic review, 60 (70.6%) focused on reading time7,12–70, while 25 (29.4%) focused on reading quantity2,8,9,71–92. The number of studies showed an increasing trend over time, rising from 5 (5.9%) in 2019 to 30 (35.3%) in 2023. Most of the research data were sourced from Asia (n = 45), followed by Europe (n = 26), and the United States (n = 14). The studies covered various imaging modalities, with X-ray accounting for 44.7%, followed by computed tomography (CT), magnetic resonance imaging, histopathology, endoscopy, digital breast tomosynthesis, ultrasound, and cytology. The most studied health conditions were breast cancer (n = 29), chest abnormalities (n = 16), and fractures (n = 11), followed by prostate cancer, cardiovascular and cerebrovascular diseases, small bowel disease, esophageal cancer, and gastric cancer. The study characteristics are summarized in Table 1 and Supplementary Table 1.

Table 1.

Characteristics of included studies

| Characteristic | Overall (n = 85) | Reading time (n = 60) | Reading quantity (n = 25) |
| --- | --- | --- | --- |
| Publication year | | | |
| 2019 | 5 (5.9) | 2 (3.3) | 3 (12.0) |
| 2020 | 8 (9.4) | 6 (10.0) | 2 (8.0) |
| 2021 | 16 (18.8) | 13 (21.7) | 3 (12.0) |
| 2022 | 26 (30.6) | 18 (30.0) | 8 (32.0) |
| 2023 | 30 (35.3) | 21 (35.0) | 9 (36.0) |
| Data source regionᵃ | | | |
| Asia | 45 (52.9) | 39 (65.0) | 6 (24.0) |
| Europe | 26 (30.6) | 12 (20.0) | 14 (56.0) |
| USA | 14 (16.5) | 9 (15.0) | 4 (16.0) |
| Other | 5 (5.9) | 3 (5.0) | 2 (8.0) |
| Imaging modalityᵇ | | | |
| X-rayᶜ | 38 (44.7) | 23 (38.4) | 15 (60.0) |
| CT | 18 (21.2) | 18 (30.0) | 0 (0.0) |
| MRI | 9 (10.6) | 5 (8.3) | 4 (16.0) |
| Histopathology | 5 (5.9) | 4 (6.7) | 1 (4.0) |
| Endoscope | 5 (5.9) | 2 (3.3) | 3 (12.0) |
| DBT | 5 (5.9) | 3 (5.0) | 2 (8.0) |
| Ultrasound | 4 (4.7) | 4 (6.7) | 0 (0.0) |
| Cytology | 2 (2.4) | 1 (1.7) | 1 (4.0) |
| Health condition | | | |
| Breast cancer | 29 (34.1) | 11 (18.4) | 18 (72.0) |
| Chest abnormalities | 16 (18.8) | 14 (23.4) | 2 (8.0) |
| Fracture | 11 (12.9) | 11 (18.3) | 0 (0.0) |
| Prostate cancer | 5 (5.9) | 5 (8.3) | 0 (0.0) |
| CVD | 5 (5.9) | 5 (8.3) | 0 (0.0) |
| Small bowel disease | 4 (4.7) | 1 (1.7) | 3 (12.0) |
| Esophageal cancer | 3 (3.5) | 2 (3.3) | 1 (4.0) |
| Gastric cancer | 2 (2.4) | 2 (3.3) | 0 (0.0) |
| Othersᵈ | 10 (11.8) | 9 (15.0) | 1 (4.0) |
| Role of AIᵉ | | | |
| As concurrent aid | 60 (70.6) | 60 (100.0) | 0 (0.0) |
| As second reader | 3 (3.5) | 0 (0.0) | 3 (12.0) |
| As pre-screening | 23 (27.1) | 0 (0.0) | 23 (92.0) |

Cells are frequencies (percentages).

CT computed tomography, MRI magnetic resonance imaging, DBT digital breast tomosynthesis, CVD cardiovascular and cerebrovascular disease.

ᵃ The total does not equal 85 because some studies are cross-regional.

ᵇ The total does not equal 85 because some studies include more than one imaging modality.

ᶜ Including mammogram.

ᵈ “Others” includes one each of bone metastases, brain metastases, cervical cytology abnormalities, fetal intracranial malformations, brain abnormalities, intussusception, lumbar spine stenosis, oral squamous cell carcinoma, soft tissue diseases, and thyroid nodules.

ᵉ The total does not equal 85 because one study includes AI both as second reader and as pre-screening.

For the 60 studies comparing reading times, X-ray (n = 23) and CT (n = 18) were the most commonly studied modalities, with chest abnormalities (n = 14) being the most frequently studied health condition. AI was exclusively utilized as a concurrent aid for image interpretation. For the 25 studies comparing reading quantity, X-ray (n = 15) took precedence as the most commonly examined modality, predominantly concerning breast cancer diagnosis (n = 18). AI served as a second reader in three studies and as a pre-screening tool in 23 studies.

Among the 36 studies included in the meta-analysis, 21 (58.3%) focused on reading time12,13,18,23,24,27,31,32,41,43,44,49,50,52,55,57,61,65–67,69, including six on chest abnormalities, six on fractures, five on cancer, two on cardiovascular and cerebrovascular disease, one on fetal intracranial malformations, and one on intussusception. The other 15 studies (41.7%) focused on reading quantity8,9,74–81,83,86–88,92, including nine on breast cancer, two on chest abnormalities, two on small bowel disease, one on cervical abnormalities, and one on esophageal cancer.

Workload reduction

As shown in Fig. 2 and Table 2, with the integration of AI, there was a notable reduction in the human workload for image interpretation, resulting in a time-saving of 27.20% (95% CI, 18.22%–36.18%) and a quantity-saving of 58.48% (95% CI, 46.83%–70.14%). Substantial heterogeneity was observed, with I² exceeding 99% (p < 0.001).

Fig. 2. Pooled human workload reduction after AI integration.

With the integration of AI, there was a notable reduction in reading time and reading quantity. a Pooled reading time reduction; b Pooled reading quantity reduction.

Table 2.

Human workload reduction and relative diagnostic performance after AI integration

| Subgroup | No. of studies | Workload reduction, % | p for subgroup difference | Relative sensitivity | p for subgroup difference | Relative specificity | p for subgroup difference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reading time reduction | | | | | | | |
| Overall | 21 | 27.20 (18.22, 36.18) | | 1.12 (1.08, 1.16) | | 1.00 (0.98, 1.01) | |
| Image modalityᵃ | | | 0.750 | | 0.215 | | 0.068 |
| X-ray | 10 | 21.08 (11.11, 31.04) | | 1.13 (1.08, 1.19) | | 0.99 (0.97, 1.00) | |
| CT | 6 | 25.61 (17.79, 33.43) | | 1.12 (1.06, 1.19) | | 1.05 (0.99, 1.11) | |
| Ultrasound | 3 | 30.34 (-13.04, 73.71) | | 1.06 (1.00, 1.13) | | 1.07 (0.96, 1.18) | |
| Health conditionᵇ | | | 0.932 | | 0.562 | | 0.005 |
| Chest abnormalities | 6 | 27.24 (15.53, 38.95) | | 1.12 (1.06, 1.20) | | 0.98 (0.96, 1.01) | |
| Fracture | 6 | 25.53 (16.44, 34.63) | | 1.18 (1.09, 1.28) | | 1.03 (1.00, 1.06) | |
| Cancer | 5 | 30.72 (1.77, 59.66) | | 1.12 (1.04, 1.21) | | 1.04 (0.97, 1.13) | |
| Reader experienceᶜ | | | 0.425 | | <0.001 | | 0.878 |
| Junior | 8 | 22.16 (8.74, 35.58) | | 1.24 (1.14, 1.35) | | 1.03 (0.99, 1.06) | |
| Senior | 8 | 16.67 (14.91, 18.41) | | 1.09 (1.05, 1.14) | | 1.03 (1.00, 1.06) | |
| Reading quantity reduction | | | | | | | |
| Overall | 15 | 58.48 (46.83, 70.14) | | 1.11 (1.08, 1.15) | | 1.01 (1.00, 1.01) | |
| Role of AIᵈ | | | 0.018 | | <0.001 | | 0.056 |
| As second reader | 3 | 44.47 (40.68, 48.26) | | 0.99 (0.96, 1.01) | | 1.00 (1.00, 1.00) | |
| As pre-screening | 13 | 61.72 (47.92, 75.52) | | 1.15 (1.10, 1.19) | | 1.01 (1.00, 1.01) | |

95% confidence interval in parentheses.

CT computed tomography.

ᵃ Two studies (one cytology and one histopathology) are not displayed in the imaging modality subgroup analysis for reading time reduction because there were too few studies for pooling.

ᵇ Four studies (two on cardiovascular and cerebrovascular disease, one on fetal intracranial malformation, and one on intussusception) are not displayed in the health condition subgroup analysis for reading time reduction because there were too few studies for pooling.

ᶜ Only 8 studies categorized readers as junior and senior.

ᵈ The total does not equal 15 because one study included AI both as second reader and as pre-screening.

Meta-regression did not reveal any interaction between reading time reduction and the factors of imaging modality, health condition, or reader experience. Time reductions of 21.08% (95% CI, 11.11%–31.04%) and 25.61% (95% CI, 17.79%–33.43%) were observed in X-ray and CT scans, respectively. However, no significant time-saving was found in ultrasound (30.34%; 95% CI, −13.04% to 73.71%). The time required for detecting chest abnormalities, fractures, and cancers decreased by 27.24% (95% CI, 15.53%–38.95%), 25.53% (95% CI, 16.44%–34.63%), and 30.72% (95% CI, 1.77%–59.66%), respectively. Only eight studies distinguished between the expertise levels of the readers, showing similar time decreases for juniors (22.16%; 95% CI, 8.74%–35.58%) and seniors (16.67%; 95% CI, 14.91%–18.41%).

After stratifying by AI role, AI used as a pre-screening tool reduced reading quantity more than AI used as a second reader (61.72% (95% CI, 47.92%–75.52%) versus 44.47% (95% CI, 40.68%–48.26%), p = 0.018). Figure 3 shows a schematic diagram of the workload reduction for different AI roles.
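As a rough plausibility check on the reported subgroup difference, the comparison can be approximated from the published estimates alone. The sketch below back-calculates standard errors from the 95% confidence intervals and applies a two-subgroup Q test with one degree of freedom. It assumes normally distributed, independent subgroup estimates, which is only approximately true here because one study contributes to both subgroups. The function names are illustrative and are not taken from the authors' analysis code.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def se_from_ci(lower, upper):
    """Back-calculate a standard error from a symmetric 95% CI."""
    return (upper - lower) / (2 * Z_95)


def subgroup_difference(est_a, ci_a, est_b, ci_b):
    """Two-subgroup Q test (1 degree of freedom), equivalent to a z-test.

    Assumption: the two pooled estimates are independent and normal.
    """
    se_a = se_from_ci(*ci_a)
    se_b = se_from_ci(*ci_b)
    z = (est_a - est_b) / math.sqrt(se_a**2 + se_b**2)
    q = z**2
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return q, p


# Pooled reading quantity reductions (%) as reported in Table 2
q, p = subgroup_difference(61.72, (47.92, 75.52), 44.47, (40.68, 48.26))
print(f"Q = {q:.2f}, p = {p:.3f}")
```

Under these assumptions the approximation gives a p-value close to the reported 0.018, which suggests the reported intervals and subgroup test are mutually consistent.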

Fig. 3. Schematic diagram of workload reduction for different AI integration strategies.

AI serving as a concurrent aid to humans reduced the reading time by 27.20%; when humans acted as the first reader and AI as the second reader, with arbitration of discordant readings, it reduced the reading quantity by 44.47%; when AI was used for pre-screening to identify relatively high-risk cases for human review, it reduced the reading quantity by 61.72%. Red boxes indicate instances where AI replaces readers or could have a direct influence on readers’ behavior. Yellow boxes denote where AI could have an indirect effect on readers’ behavior. Green boxes indicate the corresponding workload reduction after AI integration.

Relative diagnostic performance

As shown in Fig. 4 and Table 2, relative sensitivity and specificity were 1.12 (95% CI, 1.09–1.14) and 1.00 (95% CI, 1.00–1.01), respectively. For studies evaluating time reduction, AI integration yielded a 12% increase (1.12; 95% CI, 1.09–1.16) in sensitivity, while specificity remained unchanged (1.00; 95% CI, 0.98–1.01). In studies assessing quantity reduction, AI integration resulted in an 11% increase (1.11; 95% CI, 1.08–1.15) in sensitivity and a 1% increase (1.01; 95% CI, 1.00–1.01) in specificity.

Fig. 4. Pooled human relative diagnostic performance after AI integration.

With AI integration, sensitivity significantly increased (left), while specificity remained unchanged (right).

After stratifying by reader experience, juniors showed a greater improvement in sensitivity with AI integration than seniors (24% versus 9%, p < 0.001). The diagnostic performance after AI integration was non-inferior to that before AI integration across different modalities and conditions.

Before AI integration, the overall absolute sensitivity and specificity were 0.79 (95% CI, 0.76–0.82) and 0.93 (95% CI, 0.89–0.96), respectively. After AI integration, the overall absolute sensitivity and specificity were 0.88 (95% CI, 0.83–0.92) and 0.95 (95% CI, 0.92–0.97), respectively. The corresponding summary receiver operating characteristic (SROC) curves are provided in Supplementary Fig. 1.

Quality assessment

Summary plots of study quality, risk of bias, and applicability concerns for each study are presented in Supplementary Figs. 2 and 3. Twenty-nine studies were considered to have a high or unclear risk of bias for the participant selection domain, and eight studies did not clarify the qualifications of the participants. For the index test domain, 17 studies were at high risk of bias since they did not explain the functions of AI systems and user interfaces. In the reference standard domain, 22 studies faced a high or unclear risk of bias, mainly due to the absence of clear definitions and measures for reading time and quantity. Regarding flow and timing, 12 studies were deemed to have a high or unclear risk of bias, as the authors did not mention whether there was an appropriate interval between the index test and reference standard. For the applicability concern domain, 22 studies raised high or unclear concerns regarding participant selection. Twenty-four studies exhibited high concerns in the index test due to deviations from clinical practice. Additionally, five studies had high concerns in the reference standard domain. Supplementary Fig. 4 indicates potential publication bias for the outcomes of reading time reduction, as evidenced by asymmetry in the funnel plots and a statistically significant Egger’s test (p < 0.001).

Discussion

Our results demonstrated that human-AI collaboration significantly reduced the workload in image-based disease detection without compromising diagnostic performance. About 27% of the reading time was saved when readers were aided by concurrent AI. The reading quantity decreased by 44.47% and 61.72% when AI served as the second reader and as a pre-screening tool, respectively. Additionally, sensitivity increased by 12% and specificity remained unchanged after the integration of AI. Although this study presents an optimistic view of AI, caution is warranted, as a gap remains between these findings and clinical practice.

Our findings suggest that the concurrent use of AI with image interpretation reduces time across various imaging modalities and health conditions. However, caution is needed when extrapolating this result owing to significant heterogeneity and the fair quality of the included studies. Disease prevalence, AI algorithms, and reader characteristics are all potential contributors to this heterogeneity. Several studies have noted decreased reading time with lower AI scores but unchanged or increased time with higher AI scores, representing disease risk7,41,47,55. This could be attributed to clinicians reporting more confidence in normal cases after referring to the AI results, thus enabling quicker decision-making. Conversely, when abnormalities are flagged by AI, clinicians may spend more time scrutinizing the validity of the AI assessment, irrespective of the displayed AI accuracy. Therefore, the proportion of abnormal cases in different studies is one source of heterogeneity. Moreover, differences in AI functionalities also impact interpretation time32. In cytological slide interpretation, AI can focus on and outline abnormal cells from thousands of cells, reducing the interpretation time from 4 min to less than 1 min32. However, in some AI implementations, such as displaying a heat map on an entire chest X-ray, the time-saving effect was not as apparent12. Half of the studies were at high risk of bias in the index test domain due to unclear explanations of AI system functions and user interfaces, potentially affecting AI performance evaluation. Similarly, unclear bias in the reference standard might lead to inconsistent measurements of reading time and quantity, affecting result reliability. Reader attitudes and acceptance toward AI also contribute to heterogeneity, as those embracing AI may experience smoother processes and save more time, while resistance to AI may disrupt the conventional interpretation workflow. Additionally, in most studies, readers examined a test set outside clinical practice, which is likely to result in the laboratory effect.

While AI concurrently aids in reducing reading time, clinicians are still required to review all images. However, when AI replaces the second reader for arbitration in cases of discrepancy with the first reader, it can reduce the image volume by 44.47%. Furthermore, utilizing AI as a pre-screening method to eliminate cases that do not require human review can decrease the interpretation quantity by 61.72%. As AI transitions from serving as a concurrent aid to a second reader and then to a pre-screening tool, it progressively assumes a higher degree of substitution for human involvement, leading to a more substantial reduction in workload. Our study revealed a broader range of conditions, including chest abnormalities, fractures, cancer, and others, when evaluating reading time. In contrast, research focusing on the decrease in reading quantity appears to be more concentrated on specific conditions, with 60% of studies focusing on breast cancer. In scenarios with high demand for screening but low disease prevalence, such as breast cancer screening, AI-driven solutions with a high degree of substitution are particularly well-suited. The world’s first randomized controlled trial on AI-supported breast cancer screening demonstrated that using AI as a second reader reduced image volume by 44.3% while maintaining a similar cancer detection rate2. It would have taken one radiologist 4.6 months less to read 46,345 screening examinations in the AI integration group compared with 83,231 in the control group.
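The trial figures quoted above can be checked with the same relative-reduction arithmetic used for the pooled estimates. The short sketch below treats the two counts from the text as the reading volumes with and without AI; the function name `percent_reduction` is illustrative only.

```python
def percent_reduction(before, after):
    """Relative workload reduction in percent: (before - after) / before."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


reads_control = 83_231  # reading volume in the control group (from the text)
reads_with_ai = 46_345  # reading volume in the AI integration group (from the text)

print(round(percent_reduction(reads_control, reads_with_ai), 1))  # 44.3
```

The result agrees with the 44.3% reduction reported for the trial.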

The stable diagnostic accuracy of AI systems is a prerequisite and foundation for further discussion on whether they can reduce the workload. However, despite showcasing immense promise in research, the seamless transition of AI into real clinical settings encounters significant challenges. Most studies have a high or unclear risk of bias for the participant selection, which could lead to insufficient representativeness of the sample and limit the generalizability of the results. Specifically, models underlying AI systems are often evaluated only within the context of their training settings. Even with external validation, it largely consists of retrospective studies, such as multi-reader and multi-case studies, and lacks a prospective design or validation across diverse clinical contexts31,43,61. The performance of AI may degrade in the real world due to the diversity in healthcare systems, patient demographics, and clinical practice, raising concerns about its generalization and limiting its clinical usefulness93,94. Enhancing transparency and rigor with the use of checklists represents a pivotal approach to verifying the proper implementation of AI models and ensuring adequate reproducibility. These checklists serve to standardize the reporting of crucial components pertaining to the construction of AI algorithms, including data sources, eligibility criteria, demographic characteristics, and imaging equipment utilized during data acquisition. Furthermore, post-deployment monitoring is crucial for refining AI models by regularly updating the training data and incorporating feedback from clinicians. This iterative process improves model performance and ensures that they meet the evolving needs of clinicians.

The successful use of AI depends on effective human-AI collaboration. Studies indicate that the benefits of AI assistance vary among different clinicians, with the least experienced clinicians gaining the most from AI-based support95, which is consistent with our findings. However, many of the included studies failed to specify how readers were chosen and their level of experience. There is a possibility that less-experienced readers are intentionally selected to favor AI results, underrepresenting their real expertise in clinical settings. Another potential risk in human-AI collaboration is the human tendency to overly rely on AI systems, leading to automation bias. AI systems will inevitably make mistakes, including biased outcomes, “hallucinations,” and AI drift. Blindly following AI results can cause serious harm to patients. Readers, regardless of expertise, can be subject to automation bias, with inexperienced readers being more vulnerable to following incorrect suggestions from AI96. Novices, overconfident in the capabilities of AI systems and lacking confidence in their own abilities, may experience short-term skill improvements. However, in the long run, this may hinder their development of independent diagnostic skills and critical thinking abilities, rendering them less effective without the help of AI.

Specific strategies to mitigate overreliance and automation bias should be considered to maximize the benefits of human-AI collaboration. An increasing number of studies have embraced the decision-referral approach to refine threshold selections9,75,92. This approach typically involves both predictive and referral AI components. The referral AI utilizes the confidence score generated by the predictive AI to determine whether to apply the predictive AI model or to refer the given medical image to the standard clinical workflow for evaluation. It selectively makes decisions on a subset of cases with a high degree of accuracy while still ensuring the central role of humans in the decision-making process. Another strategy involves exploring explainable AI (XAI) to dissect the “black box” nature of AI and allow inferences about how a certain judgment by the system is generated. Various types of XAI methods have been reported to enhance transparency and provide insight into different medical specialities97,98. Educating clinicians on the reasoning processes of AI systems will facilitate easier comprehension and enable them to make reasonable decisions. Additionally, instilling a sense of accountability in AI users for their own decisions can also aid in reducing automation bias.
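The decision-referral idea can be illustrated with a minimal sketch. It assumes the predictive AI outputs a disease probability between 0 and 1 and that the referral step is a simple pair of thresholds; the class, function names, and threshold values are hypothetical placeholders and do not describe any of the cited systems.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    ai_score: float  # hypothetical predictive-AI probability of disease, 0..1


def route(case, low=0.05, high=0.95):
    """Let the AI decide only when it is confident; otherwise refer to a human.

    The thresholds are illustrative, not values from any included study.
    """
    if case.ai_score <= low:
        return "ai_normal"
    if case.ai_score >= high:
        return "ai_abnormal"
    return "human_review"


def human_fraction(cases, low=0.05, high=0.95):
    """Share of cases that still require human reading."""
    if not cases:
        return 0.0
    routed = [route(c, low, high) for c in cases]
    return routed.count("human_review") / len(routed)


cases = [Case("a", 0.01), Case("b", 0.50), Case("c", 0.99), Case("d", 0.20)]
print([route(c) for c in cases])
print(human_fraction(cases))
```

Widening the gap between the two thresholds sends more cases to human review, trading workload reduction for tighter human control over uncertain cases.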

The introduction of new technologies often entails systematic changes, and caution should be exercised regarding short-term efficiency gains that may result in long-term workload increases. Although the average reading time decreased from 180 s to 22 s per case with AI assistance when interpreting cervical cytology slides99, only 400 slides can be scanned in 20 h because of the high requirements for image scanning to avoid image blur and information loss caused by overlapping cells. This additional time consumption also includes training clinicians on AI usage, adapting AI integration into clinical workflows, and regulating AI to foster a healthy ecosystem. Some optimists believe that once clinicians become familiar with an AI system, they can anticipate significant increases in efficiency99. However, their willingness to acquire new skills and undertake new responsibilities influences whether the possible benefits of AI integration will be realized. Omitting the considerable human labor and time dedicated to the development and validation of AI systems also paints an overly optimistic picture regarding the total human effort required. Moreover, continuous input such as expert-annotated images will be indispensable even after AI systems are integrated into daily workflows to maintain system accuracy. The systematic effect can often only be assessed years after the introduction of AI with the help of sociologists, historians, and empirical observations of users’ daily experiences.

This study represents the first quantitative assessment of the impact of different AI integration strategies on workload reduction. However, several limitations of this study should be considered before extrapolating the conclusions. First, five researchers were involved in literature screening and data extraction, making it inevitable to introduce some subjectivity. However, multiple rigorous sessions were undertaken to ensure consistency among participants. Furthermore, some subgroups had a limited number of included studies, potentially undermining the stability of the pooled results, such as those involving ultrasound and AI as second readers. Many studies have treated reading time as a secondary outcome, reporting only point estimates without standard deviations or confidence intervals. Additionally, numerous studies have reported sensitivity without specificity, precluding their inclusion in the meta-analysis. Furthermore, the significant publication bias suggests that studies with non-significant or negative results may have been underreported, potentially leading to an overestimation of the actual effects of AI in this study. Advocating for the pre-registration of future studies can help reduce the selective reporting of significant results, enhancing research transparency. Additionally, future research should prioritize standardized reporting practices to foster comparability and synthesis across studies, thereby obtaining higher levels of evidence.

Our study examined only the reduction of workload during image-based disease detection. However, AI can also enhance clinical efficiency by performing tasks such as image acquisition and reconstruction, image segmentation, and the measurement of anatomical structures. Moreover, the rapid advancements in large language models (LLMs) hold the potential to revolutionize medical imaging100. For instance, in the pre-imaging stages, LLMs can provide decision support, integrate clinical history, and assist in protocoling. LLMs can also improve reporting workflows in various ways, such as automatically generating report impressions by summarizing key findings, converting free text into structured reports, and proposing relevant differential diagnoses. These additional tasks underscore the broad spectrum of AI’s potential to enhance clinical efficiency beyond disease detection. Future research should evaluate the impact of AI on efficiency in these areas as well.

In conclusion, our systematic review and meta-analysis quantitatively elucidate the reduction in clinician workload and improvement in diagnostic performance through human-AI collaboration, providing a positive perspective on the integration of AI into clinical practice. However, prospective studies aligned with real clinical settings and further research into AI’s broader impact beyond disease detection tasks are essential for bridging the gap between research and practice and maximizing its potential to revolutionize medical imaging practices.

Methods

The study protocol was registered with PROSPERO (CRD42023443557) and was conducted and reported following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) 2020 guidelines101. Please see Supplementary Note 1 for details.

Search strategy and eligibility criteria

We searched PubMed, Embase, Cochrane, and IEEE Xplore databases for studies published between January 2018 and December 2023 with no language restrictions. Medical Subject Heading (MeSH) terms and keywords related to AI, workload, and medical imaging were used to identify comparative studies assessing human workload before and after the introduction of AI. The full search strategies are listed in Supplementary Note 2. We also conducted manual searches of bibliographies, citations, and related articles from the included studies to identify additional relevant articles not captured in the initial searches.

The eligibility assessment involved two reviewers who independently screened the titles and abstracts of the search results, resolving discrepancies with a third reviewer. Studies were included if they compared the workload for detecting diseases from medical imaging with and without AI integration. The workload refers to either the time spent on image interpretation or the quantity of images reviewed. We excluded studies that: (1) used medical waveform data graphics; (2) did not focus on classification tasks; and (3) were based on animal or non-human samples. Additionally, case reports, review articles, editorials, letters, comments, conference abstracts, proceedings, and duplicates were excluded.

Data extraction

The primary outcome was the percentage of workload reduction, including relative reductions in reading time and quantity. It was defined as the difference between the pre-AI and post-AI workloads (time or quantity of image reviews), divided by the pre-AI workload. The secondary outcome was the relative diagnostic performance, defined as the post-AI diagnostic performance divided by the pre-AI diagnostic performance, consisting of relative sensitivity and specificity. Given the paired study design of the original studies, relative sensitivity and specificity may be more valuable and informative than absolute sensitivity and specificity. For studies with multiple clinicians, we used the overall result if it was reported. If the overall result was not provided and only individual or group-specific results were available, we pooled them into a single effect, preventing overlapping information and inflated confidence from separate calculations102,103. For diagnostic performance, we used the arithmetic mean of diagnostic contingency tables (true positive, true negative, false positive, false negative) from each individual reader as the pooled effects. For studies reporting multiple reading quantities at different AI thresholds or in different scenarios within the same patient group, we selected either the primary outcome reported by the original author or the more conservative result based on sensitivity analysis.
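The reader-level pooling and the relative performance measures described above can be sketched as follows. The two readers and their contingency tables are invented purely for illustration, and the helper names are assumptions rather than the authors' code.

```python
def mean_table(tables):
    """Arithmetic mean of per-reader 2x2 tables given as (tp, fp, fn, tn)."""
    n = len(tables)
    return tuple(sum(t[i] for t in tables) / n for i in range(4))


def sensitivity(tp, fp, fn, tn):
    return tp / (tp + fn)


def specificity(tp, fp, fn, tn):
    return tn / (tn + fp)


# Hypothetical example: two readers, each reading the same
# 100 diseased and 200 non-diseased cases without and with AI.
pre_ai = [(70, 20, 30, 180), (80, 10, 20, 190)]
post_ai = [(85, 20, 15, 180), (90, 10, 10, 190)]

pre = mean_table(pre_ai)    # one pooled table per study, before AI
post = mean_table(post_ai)  # one pooled table per study, after AI

relative_sensitivity = sensitivity(*post) / sensitivity(*pre)
relative_specificity = specificity(*post) / specificity(*pre)
print(round(relative_sensitivity, 3), round(relative_specificity, 3))
```

Averaging the tables first yields a single effect per study, so a study with many readers does not count as many separate studies in the pooled analysis.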

Two reviewers independently extracted data using standardized forms, encompassing study characteristics, patient characteristics, reader characteristics, AI characteristics, and outcomes of interest. Any disagreements were resolved by a third reviewer. The data were obtained through direct extraction or indirect calculation.

Quality assessment

We modified a risk assessment tool, the Quality Assessment of Studies of Diagnostic Accuracy-Revised (QUADAS-2)104,105, by streamlining the criteria and incorporating elements relevant to our study aim. Our modified version assessed bias across four domains (participant selection, index test, reference standard, and flow and timing), similar to QUADAS-2, and evaluated applicability to the review question across the first three domains. Each domain included a tailored signaling question. The risk of bias and applicability in each domain were evaluated based on the responses to these questions. Supplementary Note 3 illustrates the adapted QUADAS-2. Two independent reviewers applied the modified tool to assess bias and applicability, resolving disagreements through discussion.

Data analysis

Studies reporting mean reading time and standard deviation (SD), or reading quantity and diagnostic contingency tables, both before and after AI introduction were eligible for inclusion in the meta-analysis. We pooled the reading time reduction rate and the reading quantity reduction rate, along with 95% confidence intervals, using a random-effects model given the considerable heterogeneity between studies. A bivariate random-effects model was used to pool the relative and absolute sensitivity and specificity, since these two indicators are often correlated. Summary receiver operating characteristic (SROC) curves were also plotted. The I² statistic and Cochran's Q test were used to assess the degree and statistical significance of study heterogeneity. Meta-regression was conducted based on imaging modality, health condition, reader experience, and the role of AI, provided that there were at least three studies in each subgroup. Subgroup interactions were tested using the chi-squared test. Egger's test was used to assess publication bias in reading time and quantity reduction, and Deeks' funnel plot asymmetry test was used to evaluate publication bias in diagnostic performance before and after AI integration. We used the meta package in R (version 4.1.0) for the analysis of workload reduction and the metadta command in STATA 18 for the analysis of relative diagnostic performance. Statistical significance was set at p < 0.05.
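As a rough illustration of random-effects pooling with Cochran's Q and I², the following Python sketch uses the DerSimonian-Laird estimator for the between-study variance. This is an assumption: the text does not state which estimator was used, and the analysis itself was run with the meta package in R, not with this code. The function name and the inputs (per-study effects and their variances) are hypothetical.

```python
import math
from typing import Dict, Sequence


def random_effects_pool(
    effects: Sequence[float], variances: Sequence[float]
) -> Dict[str, float]:
    """Pool study effects with a DerSimonian-Laird random-effects model.

    effects:   per-study effect sizes (e.g. reduction rates as proportions)
    variances: per-study sampling variances (must be positive)
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I^2: share of total variability attributed to heterogeneity (%).
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each study's variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    sum_w_re = sum(w_re)
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum_w_re
    se = math.sqrt(1.0 / sum_w_re)

    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,   # normal-approximation 95% CI
        "ci_high": pooled + 1.96 * se,
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


if __name__ == "__main__":
    # Hypothetical reduction rates (proportions) and variances.
    result = random_effects_pool([0.10, 0.50], [0.01, 0.01])
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```

When the between-study variance is estimated as zero, the random-effects weights equal the fixed-effect weights and the two models give the same pooled estimate.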

Acknowledgements

This study was supported by CAMS Innovation Fund for Medical Sciences (CIFMS 2021-I2M-1-004), China Postdoctoral Science Foundation (2023M740323, 2024T170072), and Postdoctoral Fellowship Program of CPSF (GZB20230076).

Author contributions

M.C. and P.X. contributed to the study design and conceptualization. M.C. and Z.Y. contributed to the literature search. M.C., Y.W., Q.W., J.S., and H.W. extracted data, conducted the analysis, and evaluated the quality of included studies. M.C. wrote the initial manuscript. P.X. and Y.Q. revised the manuscript. All authors approved the final version of the manuscript and take accountability for all aspects of the work.

Data availability

The datasets used during the current study are available from the corresponding author on reasonable request.

Code availability

The codes used in the analysis of this study will be made available from the corresponding author upon reasonable request.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Peng Xue, Email: xuepeng_pumc@foxmail.com.

Youlin Qiao, Email: qiaoy@cicams.ac.cn.

Supplementary information

The online version contains supplementary material available at 10.1038/s41746-024-01328-w.

References

  • 1.Schünemann, H. J. et al. Breast cancer screening and diagnosis: a synopsis of the European Breast Guidelines. Ann. Intern. Med.172, 46–56 (2020). [DOI] [PubMed] [Google Scholar]
  • 2.Lång, K. et al. Artificial intelligence-supported screen reading versus standard double reading in the Mammography Screening with Artificial Intelligence trial (MASAI): a clinical safety analysis of a randomised, controlled, non-inferiority, single-blinded, screening accuracy study. Lancet Oncol.24, 936–944 (2023). [DOI] [PubMed] [Google Scholar]
  • 3.Rajpurkar, P. & Lungren, M. P. The current and future state of AI interpretation of medical images. N. Engl. J. Med.388, 1981–1990 (2023). [DOI] [PubMed] [Google Scholar]
  • 4.Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health1, e271–e297 (2019). [DOI] [PubMed] [Google Scholar]
  • 5.Xue, P. et al. Deep learning in image-based breast and cervical cancer detection: a systematic review and meta-analysis. NPJ Digit Med.5, 19 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.WHO. Ethics and governance of artificial intelligence for health: WHO guidance executive summary. https://www.who.int/publications/i/item/9789240037403 (2021).
  • 7.Shin, H. J., Han, K., Ryu, L. & Kim, E. K. The impact of artificial intelligence on the reading times of radiologists for chest radiographs. NPJ Digit Med.6, 82 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Marinovich, M. L. et al. Artificial intelligence (AI) for breast cancer screening: BreastScreen population-based cohort study of cancer detection. eBioMedicine.90, 104498 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Leibig, C. et al. Combining the strengths of radiologists and AI for breast cancer screening: a retrospective analysis. Lancet Digit Health4, e507–e519 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Jongsma, K. R., Sand, M. & Milota, M. Why we should not mistake accuracy of medical AI for efficiency. NPJ Digit Med.7, 57 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Wenderott, K., Gambashidze, N. & Weigl, M. Integration of artificial intelligence into sociotechnical work systems-effects of artificial intelligence solutions in medical imaging on clinical efficiency: protocol for a systematic literature review. JMIR Res. Protoc.11, e40485 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Ahn, J. S. et al. Association of artificial intelligence-aided chest radiograph interpretation with reader performance and efficiency. JAMA Netw. Open5, e2229289 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Ajmera, P. et al. Validation of a deep learning model for detecting chest pathologies from digital chest radiographs. Diagnostics13, 557 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Andre, F. et al. Human AI teaming for coronary CT angiography assessment: impact on imaging workflow and diagnostic accuracy. Diagnostics13, 3574 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Bao, C. et al. Evaluation of an artificial intelligence support system for breast cancer screening in Chinese people based on mammogram. Cancer Med.12, 3718–3726 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Barinov, L. et al. Improving the efficacy of ACR TI-RADS through deep learning-based descriptor augmentation. J. Digit Imaging36, 2392–2401 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Buchlak, Q. D. et al. Effects of a comprehensive brain computed tomography deep learning model on radiologist detection accuracy. Eur. Radiol.34, 810–822 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Canoni-Meynet, L. et al. Added value of an artificial intelligence solution for fracture detection in the radiologist’s daily trauma emergencies workflow. Diagn. Interv. Imaging103, 594–600 (2022). [DOI] [PubMed] [Google Scholar]
  • 19.Chen, J. et al. Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography. Sci. Rep.10, 19196 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Chen, W. et al. Improving the diagnosis of acute ischemic stroke on non-contrast CT using deep learning: a multicenter study. Insights Imaging13, 184 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Conant, E. F. et al. Improving accuracy and efficiency with concurrent use of artificial intelligence for digital breast tomosynthesis. Radiol. Artif. Intell.1, e180096 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Dang, L. A. et al. Impact of artificial intelligence in breast cancer screening with mammography. Breast Cancer29, 967–977 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Duron, L. et al. Assessment of an AI aid in detection of adult appendicular skeletal fractures by emergency physicians and radiologists: a multicenter cross-sectional diagnostic study. Radiology.300, 120–129 (2021). [DOI] [PubMed] [Google Scholar]
  • 24.Eloy, C. et al. Artificial intelligence–assisted cancer diagnosis improves the efficiency of pathologists in prostatic biopsies. Virchows Archiv.482, 595–604 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Fu, T. et al. Assessing the potential of a deep learning tool to improve fracture detection by radiologists and emergency physicians on extremity radiographs. Acad. Radiol.31, 1989–1999 (2023). [DOI] [PubMed] [Google Scholar]
  • 26.Watanabe, Y. et al. Improvement of the diagnostic accuracy for intracranial haemorrhage using deep learning-based computer-assisted detection. Neuroradiology.63, 713–720 (2021). [DOI] [PubMed] [Google Scholar]
  • 27.Wei, X. et al. Artificial intelligence assistance improves the accuracy and efficiency of intracranial aneurysm detection with CT angiography. Eur. J. Radiol.149, 110169 (2022). [DOI] [PubMed] [Google Scholar]
  • 28.Winkel, D. J. et al. A novel deep learning based computer-aided diagnosis system improves the accuracy and efficiency of radiologists in reading biparametric magnetic resonance images of the prostate: results of a multireader, multicase study. Invest. Radiol.56, 605–613 (2021). [DOI] [PubMed] [Google Scholar]
  • 29.Yacoub, B. et al. Impact of artificial intelligence assistance on chest CT interpretation times: a prospective randomized study. Am. J. Roentgenol.219, 743–751 (2022). [DOI] [PubMed] [Google Scholar]
  • 30.Yang, S. Y. et al. Histopathology-based diagnosis of oral squamous cell carcinoma using deep learning. J. Dent. Res.101, 1321–1327 (2022). [DOI] [PubMed] [Google Scholar]
  • 31.Yang, W. et al. Diagnostic performance of deep learning-based vessel extraction and stenosis detection on coronary computed tomography angiography for coronary artery disease: a multi-reader multi-case study. Radiol. Med.128, 307–315 (2023). [DOI] [PubMed] [Google Scholar]
  • 32.Yao, B. et al. Artificial intelligence assisted cytological detection for early esophageal squamous epithelial lesions by using low-grade squamous intraepithelial lesion as diagnostic threshold. Cancer Med.12, 1228–1236 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Yao, L. et al. Rib fracture detection system based on deep learning. Sci. Rep.11, 23513 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Yin, S. et al. Development and validation of a deep-learning model for detecting brain metastases on 3D post-contrast MRI: a multi-center multi-reader evaluation study. Neuro Oncol.24, 1559–1570 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Yuan, X. L. et al. Artificial intelligence for diagnosing gastric lesions under white-light endoscopy. Surg. Endosc.36, 9444–9453 (2022). [DOI] [PubMed] [Google Scholar]
  • 36.Zhang, B. et al. Improving rib fracture detection accuracy and reading efficiency with deep learning-based detection software: a clinical evaluation. Br. J. Radiol.94, 20200870 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Zhang, P. et al. Development of a deep learning system to detect esophageal cancer by barium esophagram. Front. Oncol.12, 766243 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Zhou, Q. Q. et al. Automatic detection and classification of rib fractures on thoracic CT using convolutional neural network: accuracy and feasibility. Korean J. Radiol.21, 869–879 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Rodríguez-Ruiz, A. et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology.290, 305–314 (2019). [DOI] [PubMed] [Google Scholar]
  • 40.Song, Y. B. et al. Comparison of detection performance of soft tissue calcifications using artificial intelligence in panoramic radiography. Sci. Rep.12, 19115 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Sun, Y. et al. Deep learning model improves radiologists’ performance in detection and classification of breast lesions. Chin. J. Cancer Res.33, 682–693 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Sun, Z. et al. A multicenter study of artificial intelligence-aided software for detecting visible clinically significant prostate cancer on mpMRI. Insights Imaging14, 72 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Sung, J. et al. Added value of deep learning-based detection system for multiple major findings on chest radiographs: a randomized crossover study. Radiology.299, 450–459 (2021). [DOI] [PubMed] [Google Scholar]
  • 44.Tan, H. et al. The value of deep learning-based computer aided diagnostic system in improving diagnostic performance of rib fractures in acute blunt trauma. BMC Med. Imaging23, 55 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Tejani, A. et al. Deep learning for detection of pneumothorax and pleural effusion on chest radiographs: validation against computed tomography, impact on resident reading time, and interreader concordance. J. Thorac. Imaging39, 185–193 (2023). [DOI] [PubMed] [Google Scholar]
  • 46.Uematsu, T. et al. Comparisons between artificial intelligence computer-aided detection synthesized mammograms and digital mammograms when used alone and in combination with tomosynthesis images in a virtual screening setting. Jpn J. Radiol.41, 63–70 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.van Winkel, S. L. et al. Impact of artificial intelligence support on accuracy and reading time in breast tomosynthesis image interpretation: a multi-reader multi-case study. Eur. Radiol.31, 8682–8691 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Wang, K. et al. Artificial intelligence as diagnostic aiding tool in cases of Prostate Imaging Reporting and Data System category 3: the results of retrospective multi-center cohort study. Abdom. Radiol.48, 3757–3765 (2023). [DOI] [PubMed] [Google Scholar]
  • 49.Lin, M. et al. Deep learning system improved detection efficacy of fetal intracranial malformations in a randomized controlled trial. NPJ Digit Med.6, 191 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Meng, F. et al. AI support for accurate and fast radiological diagnosis of COVID-19: an international multicenter, multivendor CT study. Eur. Radiol.33, 4280–4291 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Meng, X. H. et al. A fully automated rib fracture detection system on chest CT images and its impact on radiologist performance. Skeletal Radiol.50, 1821–1828 (2021). [DOI] [PubMed] [Google Scholar]
  • 52.Mu, L. et al. Fine-tuned deep convolutional networks for the detection of femoral neck fractures on pelvic radiographs: a multicenter dataset validation. IEEE Access9, 78495–78503 (2021). [Google Scholar]
  • 53.Nam, J. G. et al. Development and validation of a deep learning algorithm detecting 10 common abnormalities on chest radiographs. Eur. Respir. J.57, 2003061 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Noguchi, S. et al. Deep learning-based algorithm improved radiologists’ performance in bone metastases detection on CT. Eur. Radiol.32, 7976–7987 (2022). [DOI] [PubMed] [Google Scholar]
  • 55.Pacilè, S. et al. Improving breast cancer detection accuracy of mammography with the concurrent use of an artificial intelligence tool. Radiol. Artif. Intell.2, e190208 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Park, J. et al. Artificial intelligence that determines the clinical significance of capsule endoscopy images can increase the efficiency of reading. PLoS ONE15, e0241474 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Pei, Y. et al. A deep-learning pipeline to diagnose pediatric intussusception and assess severity during ultrasound scanning: a multicenter retrospective-prospective study. NPJ Digit Med.6, 182 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Pinto, M. C. et al. Impact of artificial intelligence decision support using deep learning on breast cancer screening interpretation with single-view wide-angle digital breast tomosynthesis. Radiology.300, 529–536 (2021). [DOI] [PubMed] [Google Scholar]
  • 59.Lee, J. H. et al. Improving the performance of radiologists using artificial intelligence-based detection support software for mammography: a multi-reader study. Korean J. Radiol.23, 505–516 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Lan, J. et al. Using less annotation workload to establish a pathological auxiliary diagnosis system for gastric cancer. Cell Rep. Med.4, 101004 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Lai, Y. C. et al. Evaluation of physician performance using a concurrent-read artificial intelligence system to support breast ultrasound interpretation. Breast.65, 124–135 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Kozuka, T. et al. Efficiency of a computer-aided diagnosis (CAD) system with deep learning in detection of pulmonary nodules on 1-mm-thick images of computed tomography. Jpn J. Radiol.38, 1052–1061 (2020). [DOI] [PubMed] [Google Scholar]
  • 63.Kim, J. H. et al. Clinical validation of a deep learning algorithm for detection of pneumonia on chest radiographs in emergency department patients with acute febrile respiratory illness. J. Clin. Med.9, 1981 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Jung, M. et al. Artificial intelligence system shows performance at the level of uropathologists for the detection and grading of prostate cancer in core needle biopsy: an independent external validation study. Mod. Pathol.35, 1449–1457 (2022). [DOI] [PubMed] [Google Scholar]
  • 65.Hsu, H. H. et al. Performance and reading time of lung nodule identification on multidetector CT with or without an artificial intelligence-powered computer-aided detection system. Clin. Radiol.76, 626.e623–626.e632 (2021). [DOI] [PubMed] [Google Scholar]
  • 66.Hendrix, N. et al. Musculoskeletal radiologist-level performance by using deep learning for detection of scaphoid fractures on conventional multi-view radiographs of hand and wrist. Eur. Radiol.33, 1575–1588 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Hempel, H. L. et al. Higher agreement between readers with deep learning CAD software for reporting pulmonary nodules on CT. Eur. J. Radiol. Open9, 100435 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Hallinan, J. et al. A226: improved productivity using deep learning assisted reporting for MRI lumbar spine. Global Spine J.13, 135S (2023). [Google Scholar]
  • 69.Guermazi, A. et al. Improving radiographic fracture recognition performance and efficiency using artificial intelligence. Radiology.302, 627–636 (2022). [DOI] [PubMed] [Google Scholar]
  • 70.Kim, E. Y. et al. Concordance rate of radiologists and a commercialized deep-learning solution for chest X-ray: Real-world experience with a multicenter health screening cohort. PLoS ONE17, e0264383 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Aoki, T. et al. Comparison of clinical utility of deep learning-based systems for small-bowel capsule endoscopy reading. J. Gastroenterol. Hepatol.39, 157–164 (2023). [DOI] [PubMed] [Google Scholar]
  • 72.Bhowmik, A. et al. Automated triage of screening breast MRI examinations in high-risk women using an ensemble deep learning model. Invest. Radiol.58, 710–719 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Dembrower, K. et al. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. Lancet Digit Health2, e468–e474 (2020). [DOI] [PubMed] [Google Scholar]
  • 74.Ding, Z. et al. Gastroenterologist-level identification of small-bowel diseases and normal variants by capsule endoscopy using a deep-learning model. Gastroenterology.157, 1044–1054.e1045 (2019). [DOI] [PubMed] [Google Scholar]
  • 75.Dvijotham, K. D. et al. Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians. Nat. Med.29, 1814–1820 (2023). [DOI] [PubMed] [Google Scholar]
  • 76.Frazer, H. M. L. et al. Integrated AI reader development and evaluation provides clinically-relevant guidance for human-AI collaboration in population mammographic screening. Preprint at https://www.medrxiv.org/content/10.1101/2022.11.23.22282646v2 (2022).
  • 77.Xie, X. et al. Development and validation of an artificial intelligence model for small bowel capsule endoscopy video review. JAMA Netw. Open5, e2221992 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Xue, P. et al. Improving the accuracy and efficiency of abnormal cervical squamous cell detection with cytologist-in-the-loop artificial intelligence. Mod. Pathol.36, 100186 (2023). [DOI] [PubMed] [Google Scholar]
  • 79.Yala, A. et al. A deep learning model to triage screening mammograms: a simulation study. Radiology.293, 38–46 (2019). [DOI] [PubMed] [Google Scholar]
  • 80.Yoon, S. H. et al. Use of artificial intelligence in triaging of chest radiographs to reduce radiologists’ workload. Eur. Radiol.34, 1094–1103 (2023). [DOI] [PubMed] [Google Scholar]
  • 81.Raya-Povedano, J. L. et al. AI-based strategies to reduce workload in breast cancer screening with mammography and tomosynthesis: a retrospective evaluation. Radiology.300, 57–65 (2021). [DOI] [PubMed] [Google Scholar]
  • 82.Rodriguez-Ruiz, A. et al. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur. Radiol.29, 4825–4832 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Shoshan, Y. et al. Artificial intelligence for reducing workload in breast cancer screening with digital breast tomosynthesis. Radiology.303, 69–77 (2022). [DOI] [PubMed] [Google Scholar]
  • 84.Verburg, E. et al. Deep learning for automated triaging of 4581 breast MRI examinations from the DENSE trial. Radiology.302, 29–36 (2022). [DOI] [PubMed] [Google Scholar]
  • 85.Verburg, E. et al. Validation of combined deep learning triaging and computer-aided diagnosis in 2901 breast MRI examinations from the second screening round of the dense tissue and early breast neoplasm screening trial. Invest. Radiol.58, 293–298 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature.577, 89–94 (2020). [DOI] [PubMed] [Google Scholar]
  • 87.Park, J. et al. Identification of active pulmonary tuberculosis among patients with positive interferon-gamma release assay results: value of a deep learning-based computer-aided detection system in different scenarios of implementation. J. Thorac. Imaging38, 145–153 (2023). [DOI] [PubMed] [Google Scholar]
  • 88.Lauritzen, A. D. et al. An artificial intelligence–based mammography screening protocol for breast cancer: outcome and radiologist workload. Radiology.304, 41–49 (2022). [DOI] [PubMed] [Google Scholar]
  • 89.Larsen, M. et al. Possible strategies for use of artificial intelligence in screen-reading of mammograms, based on retrospective data from 122,969 screening examinations. Eur. Radiol.32, 8238–8246 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Lång, K. et al. Identifying normal mammograms in a large screening population using artificial intelligence. Eur. Radiol.31, 1687–1692 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Jing, X. et al. Using deep learning to safely exclude lesions with only ultrafast breast MRI to shorten acquisition and reading time. Eur. Radiol.32, 8706–8715 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Gehrung, M. et al. Triage-driven diagnosis of Barrett’s esophagus for early detection of esophageal adenocarcinoma using deep learning. Nat. Med.27, 833–841 (2021). [DOI] [PubMed] [Google Scholar]
  • 93.Voter, A. F., Larson, M. E., Garrett, J. W. & Yu, J. J. Diagnostic accuracy and failure mode analysis of a deep learning algorithm for the detection of cervical spine fractures. Am. J. Neuroradiol.42, 1550–1556 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Hsu, W. et al. External validation of an ensemble model for automated mammography interpretation by artificial intelligence. JAMA Netw. Open5, e2242343 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Tschandl, P. et al. Human-computer collaboration for skin cancer recognition. Nat. Med.26, 1229–1234 (2020). [DOI] [PubMed] [Google Scholar]
  • 96.Dratsch, T. et al. Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology.307, e222176 (2023). [DOI] [PubMed] [Google Scholar]
  • 97.Band, S. et al. Application of explainable artificial intelligence in medical health: a systematic review of interpretability methods. Inform. Med. Unlocked40, 101286 (2023). [Google Scholar]
  • 98.Bienefeld, N. et al. Solving the explainable AI conundrum by bridging clinicians’ needs and developers’ goals. NPJ Digit Med.6, 94 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Bai, X. et al. Assessment of efficacy and accuracy of cervical cytology screening with artificial intelligence assistive system. Mod. Pathol.37, 100486 (2024). [DOI] [PubMed] [Google Scholar]
  • 100.Bhayana, R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology.310, e232756 (2024). [DOI] [PubMed] [Google Scholar]
  • 101.Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ.372, n71 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Higgins, J. P. T., Li, T. & Deeks, J. J. Cochrane Handbook for Systematic Reviews of Interventions. https://training.cochrane.org/handbook/current/chapter-06#section-6-5-2 (2024).
  • 103.López-López, J. A., Page, M. J., Lipsey, M. W. & Higgins, J. P. T. Dealing with effect size multiplicity in systematic reviews and meta-analyses. Res. Synth. Methods.9, 336–351 (2018). [DOI] [PubMed] [Google Scholar]
  • 104.Sounderajah, V. et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat. Med.27, 1663–1665 (2021). [DOI] [PubMed] [Google Scholar]
  • 105.Whiting, P. F. et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med.155, 529–536 (2011). [DOI] [PubMed] [Google Scholar]

Articles from NPJ Digital Medicine are provided here courtesy of Nature Publishing Group