Abstract
Background
Systematic reviews are fundamental to evidence-based medicine, but the process of screening studies is time-consuming and prone to errors, especially when conducted by a single reviewer. False exclusions of relevant studies can significantly impact the quality and reliability of reviews. Artificial intelligence (AI) tools have emerged as potential secondary reviewers for detecting such false exclusions, yet empirical evidence comparing their performance is limited.
Methods
This study protocol outlines a comprehensive evaluation of four AI tools (ASReview, DistillerSR Artificial Intelligence System [DAISY], Evidence for Policy and Practice Information [EPPI]-Reviewer, and Rayyan) in their capacity to act as secondary reviewers during single-reviewer title and abstract screening for systematic reviews. Utilizing a database of single-reviewer screening decisions from two published systematic reviews, we will assess how effective AI tools are at detecting false exclusions while assisting single-reviewer screening compared to the dual-reviewer reference standard. Additionally, we aim to determine the overall screening performance of AI tools in assisting single-reviewer screening.
Discussion
This research seeks to provide valuable insights into the potential of AI-assisted screening for detecting falsely excluded studies during single screening. By comparing the performance of multiple AI tools, we aim to guide researchers in selecting the most effective assistive technologies for their review processes.
Systematic review registration
Open Science Framework: https://osf.io/dky26
Keywords: Rapid reviews, AI tools, Falsely excluded studies
Background
Systematic reviews are fundamental to evidence-based medicine, providing a comprehensive synthesis of the available evidence on specific clinical questions. A critical phase in systematic reviews is the title and abstract screening, where reviewers determine which studies to include based on the information contained in the studies’ titles and abstracts. The accepted international standard for systematic reviews is dual-reviewer screening, which aims to mitigate the risk of false exclusions—the erroneous exclusion of relevant studies [1–3].
However, due to several factors, such as decision-making urgency and resource availability, single-reviewer screening is often used, especially in rapid reviews. While faster, single-reviewer screening is prone to higher error rates, leading to the potential exclusion of relevant studies that can impact the validity and comprehensiveness of a review [4, 5].
Gartlehner et al. (2019) demonstrated that single-reviewer abstract screening missed, on average, 13% of the relevant studies, highlighting a significant limitation in this approach [4]. Similarly, a systematic review by Waffenschmidt et al. (2019) found that the median proportion of missed studies was 5% (range 0 to 58%) in single-reviewer screenings [6]. Comparatively, dual screening missed only 3% of the relevant studies on average, underscoring its robustness [4]. Despite these findings, the Cochrane Handbook for Systematic Reviews of Interventions acknowledges that initial screening by a single reviewer is acceptable, but, ideally, screening should be conducted in duplicate [2].
In light of these challenges, advancements in artificial intelligence (AI) offer potential solutions to enhance the efficiency and accuracy of title and abstract screening in systematic reviews, particularly when only one reviewer is involved. Recent studies evaluating the performance of machine-assisted abstract screening report that AI tools are not ready to fully replace human screeners and that head-to-head comparisons of AI tools are lacking [7]. However, supporting software that uses AI for screening tasks, such as ASReview, DistillerSR Artificial Intelligence System (DAISY), Evidence for Policy and Practice Information (EPPI)-Reviewer, and Rayyan, can potentially reduce the rate of false exclusions and improve the reliability of single-reviewer screening by acting as a secondary reviewer for quality control rather than as an independent reviewer. These AI tools are designed to assist the abstract screening process by identifying abstracts that may have been falsely excluded by the human screener [7]. This approach is often described as “human in the loop”: a workflow in which humans remain involved (rather than being entirely replaced) through semi-automation, i.e., machine learning indicates to a human reviewer whether a record is likely suitable for inclusion and the human makes the final decision, rather than allowing machine learning to decide alone [8].
By leveraging machine learning algorithms, AI tools can analyze and identify relevant studies that might be overlooked by human reviewers, thereby enhancing the overall quality of systematic reviews. For this study, we chose ASReview, DAISY, EPPI-Reviewer, and Rayyan because of their widespread use in the research community and our review team’s extensive experience with these tools. Additionally, these tools have demonstrated robust performance in assisting with title and abstract screening, identifying 95% to 100% of the relevant studies while saving 37% to 92% of the screening workload [7].
Empirical evidence comparing AI tools head to head as secondary reviewers in detecting false exclusions is limited. This study addresses this gap by evaluating the performance of four AI tools in detecting false exclusions while assisting in the screening phase of systematic reviews. Additionally, it determines the overall performance of single-reviewer abstract screening with AI assistance. By assessing AI tools’ performance as a secondary reviewer, this study seeks to provide insights into the potential of AI tools to transform current practices in evidence synthesis, ultimately contributing to more efficient and accurate systematic reviews.
Methods
We will conduct an evaluation study with a prospective experimental design. Our study protocol was registered on the Open Science Framework (https://osf.io/dky26).
Aim and objectives
The primary objective of this study is to evaluate the performance of four AI tools embedded in systematic review software (ASReview [9], DAISY [10], EPPI-Reviewer [11], and Rayyan [12]) in detecting falsely excluded references as a secondary reviewer when assisting a single reviewer during the title and abstract screening phase. The secondary objective is to evaluate the overall performance when using AI-assisted single-reviewer screening with different AI tools across all screening decisions. Table 1 provides all terms and definitions used in this study protocol.
Table 1.
Terms and definitions
| Terms | Definitions |
|---|---|
| General terms | |
| Performance | Includes all the AI tools’ performance metrics and measures of the AI’s ability to identify abstracts correctly as includes or excludes compared to the reference standard |
| False exclusion | A relevant abstract that was incorrectly excluded during the screening process according to the reference standard |
| Reference standard | The final set of included and excluded abstracts determined by the original review authors through dual-reviewer screening |
| Single-reviewer screening | The process where only one reviewer screens a title and abstract for inclusion/exclusion |
| Performance metrics | |
| Rescued false exclusions | Determines the proportion of relevant abstracts incorrectly excluded by the single reviewer that were identified by the AI tools |
| Rescue precision | Determines the proportion of correctly identified false exclusions among all the abstracts flagged for rescue by the AI tools |
| Missed false exclusions | Determines the proportion of all falsely excluded abstracts from the single reviewer that the AI tool failed to identify (missed) |
| Recall (sensitivity) | Measures the percentage of truly relevant abstracts that the AI correctly identified |
| Specificity | Measures the percentage of truly irrelevant abstracts that the AI tool correctly excluded |
| Accuracy | Evaluates the AI tools’ ability to correctly identify relevant and irrelevant abstracts out of all the abstracts examined |
| Precision | Evaluates the AI tools’ ability to correctly identify relevant abstracts among all the abstracts that were predicted as relevant by the AI tool. It evaluates how often the AI tools’ “relevant” predictions are actually correct |
With our study, we aim to answer the following research questions:
What is the performance of AI tools (ASReview, DAISY, EPPI-Reviewer, and Rayyan) in detecting falsely excluded abstracts when used for AI-assisted quality control of single-reviewer screening decisions in systematic reviews?
What is the overall performance (recall, specificity, accuracy, etc.) of AI-assisted quality control of single-reviewer screening decisions in systematic reviews?
Comparisons of interest are as follows:
Single-reviewer abstract screening vs. single-reviewer abstract screening + AI assistance
Single-reviewer abstract screening + AI assistance vs. dual-reviewer abstract screening (reference standard)
Comparison of four AI assistance tools
Data sources
Single-screening decisions
We will use a database provided by Gartlehner et al. (2020) of a crowd-based, parallel-group randomized controlled trial that assessed the accuracy of single-reviewer abstract screening [13]. Two published systematic reviews, one on a pharmacological topic (pharmacological versus nonpharmacological interventions for depression) [14] and the other on a public health topic (environmental interventions to reduce the consumption of sugar-sweetened beverages) [15], including interrupted time series, nonrandomized controlled trials, nonrandomized studies of interventions, and randomized controlled trials, served as the sources for abstract screening. The database consists of 24,942 inclusion and exclusion decisions for 2000 abstracts (including 1000 pharma abstracts and 1000 public health abstracts) made by 280 reviewers. Each abstract was reviewed multiple times (10–15 decisions per abstract).
Reference standard
The final included (n = 80) and excluded (n = 1920) studies determined by the original review authors of the two systematic reviews will serve as the reference standard [14, 15]. These study inclusion decisions were made through dual-reviewer screening, where the abstracts were screened independently by two human reviewers, with discrepancies resolved through consensus or by involving a third reviewer if necessary.
Study methods
To simulate a real-world workflow of AI-assisted quality control in single-reviewer screening, we will first create a training set that includes independent decisions made by two reviewers, mimicking a dual-reviewer screening process. The AI tool will then be applied to a test set of single-reviewer decisions for quality control. Finally, the AI-assisted single-reviewer screening will be evaluated against the reference standard, reflecting a practical application of AI support in single-reviewer workflows (see Fig. 1).
Fig. 1.
Overview of the study workflow
Training set
The training set is created to mimic a dual-reviewer screening process. We will implement a stratified random sampling approach to create a balanced dataset. Using R version 4.1.0 (08-02-2021), we will write a script that randomly extracts 300 unique reference IDs. Each reference ID represents a pair of screening decisions from two different reviewers, drawn from either the public health or the pharmacological topic. Each resulting dataset of 600 screening decisions from 300 unique reference IDs will contain at least 15% abstracts coded as “Include” and at least 5% abstracts of studies that were ultimately included in the reference standard. We will sample with replacement across iterations, so that a single abstract can be part of different training sets. For transparency and reproducibility, we will document the frequency of resampled abstracts and set a specific random seed for each iteration. To obtain independent learning curves, we will set up a separate project for each dataset in each AI tool’s user interface and upload the training dataset using the tool’s standard settings. Each AI tool will then have two decisions for 300 unique reference IDs to learn from; some may conflict and some may align. Conflicting decisions will not be resolved. For each training set, we will calculate and report inter-rater reliability (Cohen’s kappa) between the reviewers and document the distribution of agreement/disagreement patterns.
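To make this sampling procedure concrete, the following minimal R sketch shows one way such a constrained draw could be implemented. The data frame and column names (ref_id, reviewer_id, decision, final_include_ids) and the constraint check, which is simplified here to the decision level, are illustrative assumptions rather than the final script.

```r
# Minimal sketch of one training-set draw (illustrative, not the final script).
# 'decisions' is assumed to be a data frame with columns ref_id, reviewer_id,
# and decision; 'final_include_ids' holds the reference-standard inclusions.
draw_training_set <- function(decisions, final_include_ids, seed,
                              n_refs = 300, min_includes = 0.15, min_final = 0.05) {
  set.seed(seed)                                  # documented seed per iteration
  repeat {
    ids <- sample(unique(decisions$ref_id), n_refs)
    # keep one pair of decisions from two reviewers per sampled abstract
    pairs <- do.call(rbind, lapply(ids, function(id) {
      d <- decisions[decisions$ref_id == id, ]
      d[sample(nrow(d), 2), ]
    }))
    # accept the draw only if both composition constraints are met
    if (mean(pairs$decision == "Include") >= min_includes &&
        mean(ids %in% final_include_ids) >= min_final) {
      return(pairs)
    }
  }
}
# Inter-rater reliability per training set could then be computed, for example,
# with irr::kappa2() on the two decisions per abstract reshaped to wide format.
```

Using a different documented seed for each repeated draw allows the same abstract to appear in several training sets, matching the sampling-with-replacement design described above.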
Test set
The test set will represent a single-reviewer screening scenario with AI assistance. We will write a script to randomly extract one single-screening decision per abstract from the pool of 1000 abstracts for each topic (1000 pharmacological and 1000 public health abstracts), excluding the 300 abstracts already allocated to the training set. This will result in a test set of 700 single-screening decisions for each topic. The test set will then be uploaded to the AI tools for evaluation.
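A comparable R sketch for drawing the test set, again with assumed column names (ref_id, topic, decision) and an assumed vector of training-set IDs, might look like this:

```r
# Minimal sketch: draw one single-reviewer decision per remaining abstract of a
# topic (illustrative column names: ref_id, topic, decision).
draw_test_set <- function(decisions, training_ids, topic_label, seed) {
  set.seed(seed)
  pool <- decisions[decisions$topic == topic_label &
                      !(decisions$ref_id %in% training_ids), ]
  # sample exactly one screening decision per remaining abstract (700 per topic)
  keep <- tapply(seq_len(nrow(pool)), pool$ref_id,
                 function(rows) rows[sample(length(rows), 1)])
  pool[unlist(keep), ]
}
```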
The AI tools will use the previously provided responses to assign a likelihood score for inclusion to each reference. A predefined probability threshold of 80% will be applied: references receiving a score of 80% or higher will be classified as inclusions, while those below this threshold will be classified as exclusions. These AI-assisted inclusion and exclusion decisions will be used to evaluate the performance of AI tools in detecting falsely excluded abstracts and overall performance compared to the reference standard. To ensure robustness, this process will be repeated 100 times for each AI tool.
To assess the impact of different inclusion probability thresholds on model performance, we will conduct sensitivity analyses using alternative cutoffs (70%, 75%, 85%, and 90%). For each threshold, we will calculate the number of inclusions and exclusions, false exclusion and inclusion rates, and overall performance metrics to evaluate how the chosen threshold influences the balance between sensitivity and specificity.
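To illustrate how exported likelihood scores could be dichotomized at the primary and sensitivity-analysis cutoffs, a minimal R sketch is shown below. The data frame 'scores' and its columns (ref_id, ai_score on a 0 to 1 scale) are illustrative assumptions; each tool's actual export format may differ.

```r
# Example input (illustrative): one AI likelihood score per test-set reference
scores <- data.frame(ref_id = c("R1", "R2", "R3"),
                     ai_score = c(0.92, 0.40, 0.81))

# Classify references at a given inclusion-probability cutoff
classify_at <- function(scores, threshold) {
  transform(scores, ai_included = ai_score >= threshold)  # TRUE = AI inclusion
}

# Primary cutoff (0.80) plus the sensitivity-analysis cutoffs
thresholds <- c(0.70, 0.75, 0.80, 0.85, 0.90)
classified <- lapply(thresholds, classify_at, scores = scores)
names(classified) <- paste0("cutoff_", thresholds * 100)
```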
Data collection
We will collect the following data:
False exclusion metrics (see Table 2)
Table 2.
False exclusion metrics
| Prediction | Single-reviewer false exclusion | Single-reviewer correct exclusion |
|---|---|---|
| Total AI rescue flags as false exclusion | True positive (TP): Correctly identified false exclusions | False positive (FP): Incorrectly identified false exclusions |
| AI does not flag | False negative (FN): Unidentified false exclusions | True negative (TN): Correctly identified exclusions |
| Total | Total false exclusions by single reviewers | Total correct exclusions by single reviewers |
The number of falsely excluded abstracts (not) identified by the AI tools* is as follows:
True positives (TP_false exclusion): Correctly identified false exclusions (AI correctly identifies abstracts that the reviewer falsely excluded).
False negatives (FN_false exclusion): Unidentified false exclusions (AI misses abstracts that the reviewer falsely excluded).
The number of correctly excluded abstracts (not) identified by the AI tools* is as follows:
True negatives (TN_false exclusion): Correctly identified exclusions (AI confirms the reviewer’s correct exclusions).
False positives (FP_false exclusion): Incorrectly identified false exclusions (AI incorrectly identifies the reviewers’ correctly excluded abstracts as false exclusions).
Overall screening performance metrics for all screening decisions (see Table 3)
Table 3.
Overall screening performance metrics
| Prediction | Reference standard inclusions | Reference standard exclusions |
|---|---|---|
| Total AI inclusions | True positive (TP): Correctly identified relevant abstracts | False positive (FP): Incorrectly identified as relevant abstracts |
| Total AI exclusions | False negative (FN): Unidentified relevant abstracts | True negative (TN): Correctly identified irrelevant abstracts |
| Total | Total inclusions | Total exclusions |
The number of correctly included abstracts (not) identified by the AI tools* is as follows:
True positives (TP): Correctly identified relevant abstracts (AI correctly identifies relevant abstracts).
False negatives (FN): Unidentified relevant abstracts (AI misses relevant abstracts).
The number of correctly excluded abstracts (not) identified by the AI tools* is as follows:
True negatives (TN): Correctly identified irrelevant abstracts (AI correctly identifies abstracts that should be excluded).
False positives (FP): Incorrectly identified as relevant abstracts (AI falsely includes abstracts that should be excluded).
*Threshold for AI probability score for inclusion: 70%, 75%, 80%, 85%, and 90%.
Data analysis
We will use R version 4.1.0 (08-02-2021) for all statistical analyses. The data analysis will comprehensively evaluate the performance of the AI tools (ASReview, DAISY, EPPI-Reviewer, and Rayyan) in detecting false exclusions during single-reviewer abstract screening. The training set batch will be discarded from the analysis. We will assess the following primary and secondary outcomes; for each metric, we will calculate the mean across all datasets together with confidence intervals.
Primary outcomes: How well the AI tools detect reviewers’ false exclusions (single-reviewer screening vs. AI tool predictions)
- Rescued false exclusions: Quantifies the proportion of relevant abstracts incorrectly excluded by the single reviewer but identified by the AI tools. This metric highlights the effectiveness of the AI tools in detecting relevant abstracts that were falsely excluded by a single reviewer:
  Correctly Identified False Exclusions (TP_false exclusion) / Total False Exclusions by Single Reviewers
- Rescue precision: Determines the proportion of correctly identified false exclusions among all the studies flagged for rescue by the AI tools. Precision measures the proportion of true rescues (actual false exclusions correctly identified by the AI tools) out of all the abstracts flagged for rescue review by the AI tools (including those unnecessarily flagged):
  Correctly Identified False Exclusions (TP_false exclusion) / Total AI Rescue Flags as False Exclusions (TP_false exclusion + FP_false exclusion)
- Missed false exclusions: Determines the proportion of all false exclusions from the single reviewers that the AI tool failed to identify (missed):
  Unidentified False Exclusions (FN_false exclusion) / Total False Exclusions by Single Reviewers
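As an illustration, the three primary metrics above could be derived from a per-abstract data frame along the following lines; the logical column names (reviewer_excluded, in_reference_standard, ai_flagged_for_rescue) are illustrative assumptions rather than the final analysis code.

```r
# Minimal sketch of the primary (false-exclusion) metrics; 'df' is assumed to
# hold one row per test-set abstract with the logical columns named above.
false_exclusion_metrics <- function(df) {
  fe <- df$reviewer_excluded & df$in_reference_standard             # reviewer's false exclusions
  tp <- sum(fe & df$ai_flagged_for_rescue)                          # rescued by the AI tool
  fn <- sum(fe & !df$ai_flagged_for_rescue)                         # missed by the AI tool
  fp <- sum(df$reviewer_excluded & !fe & df$ai_flagged_for_rescue)  # unnecessary rescue flags
  c(rescued_false_exclusions = tp / sum(fe),
    rescue_precision         = tp / (tp + fp),
    missed_false_exclusions  = fn / sum(fe))
}
```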
Secondary outcomes: How well the AI tools perform in screening decisions (reference standard vs. AI tool predictions)
- Recall (sensitivity): Evaluates the ability of the AI tools to correctly identify relevant abstracts that were included in the reference standard. Recall specifically measures the proportion of true positives (relevant abstracts correctly identified by the AI tools) out of all the actual positives (relevant abstracts that should have been identified, including those missed):
  Correctly Identified Relevant Abstracts (TP) / (Correctly Identified Relevant Abstracts [TP] + Unidentified Relevant Abstracts [FN])
- Specificity: Evaluates the ability of the AI tools to correctly identify irrelevant abstracts. Specificity measures the proportion of true negatives (irrelevant abstracts correctly identified as such by the AI tools) out of all the actual negatives (irrelevant abstracts that should have been identified, including those incorrectly identified):
  Correctly Identified Irrelevant Abstracts (TN) / (Correctly Identified Irrelevant Abstracts [TN] + Incorrectly Identified as Relevant Abstracts [FP])
- Accuracy: Evaluates the ability of the AI tools to correctly identify relevant (true positives) and irrelevant abstracts (true negatives) out of all the abstracts examined:
  (Correctly Identified Relevant Abstracts [TP] + Correctly Identified Irrelevant Abstracts [TN]) / Total Number of Abstracts
- Precision: Evaluates the ability of the AI tools to correctly identify relevant abstracts among all the abstracts that were predicted as relevant by the AI tool:
  Correctly Identified Relevant Abstracts (TP) / (Correctly Identified Relevant Abstracts [TP] + Incorrectly Identified as Relevant Abstracts [FP])
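A corresponding sketch for the secondary (overall screening) metrics, and for summarizing a metric across the 100 repeated draws, could look as follows; again, the column names (ai_included, in_reference_standard) and the percentile interval are illustrative assumptions, since the protocol only specifies means with confidence intervals.

```r
# Minimal sketch of the overall screening metrics against the reference standard;
# 'df' is assumed to have logical columns ai_included and in_reference_standard.
overall_metrics <- function(df) {
  tp <- sum(df$ai_included & df$in_reference_standard)
  fp <- sum(df$ai_included & !df$in_reference_standard)
  fn <- sum(!df$ai_included & df$in_reference_standard)
  tn <- sum(!df$ai_included & !df$in_reference_standard)
  c(recall      = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / (tp + fp + fn + tn),
    precision   = tp / (tp + fp))
}

# One possible way to summarize a metric across the 100 repeated draws:
# the mean together with a simple percentile 95% interval.
summarise_runs <- function(x) c(mean = mean(x), quantile(x, c(0.025, 0.975)))
```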
Performance metrics measuring the AI tools’ ability to detect single reviewers’ false exclusions (rescued false exclusions, rescue precision, and missed false exclusions) will be calculated, as will overall screening performance metrics comparing the reference standard to the AI tools’ predictions across all screening decisions (recall, specificity, accuracy, and precision). We will use data visualization techniques, such as heatmaps, confusion matrices, and comparison plots, to represent the AI tools’ performance. Additionally, we will conduct sensitivity analyses to assess the robustness of our findings across different conditions (e.g., dataset sizes, topics, and cases with reviewer disagreement) and to understand the impact of these key parameters on the AI tools’ performance.
Discussion
This study aims to evaluate the effectiveness of four AI tools (ASReview, DAISY, EPPI-Reviewer, and Rayyan) in detecting false exclusions during single-reviewer abstract screening for systematic reviews. By comparing the performance of these tools against single-reviewer screening alone and the dual-reviewer reference standard, we expect to gain valuable insights into the potential of AI-assisted screening to improve the efficiency and accuracy of the systematic review process.
This study could demonstrate that AI-assisted screening significantly reduces false exclusions compared to single-reviewer screening alone. This finding would have important implications for the conduct of systematic reviews, potentially allowing for a more efficient use of resources without compromising the quality of the review. Additionally, the comparison between the AI tools may reveal differences in their performance, providing guidance for researchers in selecting the most appropriate tool for their needs.
This study has several limitations. It uses data from only two published systematic reviews, and the performance of AI tools may vary depending on the subject area, complexity, and quality of the abstracts. Therefore, the results will not be fully generalizable to all types of systematic reviews. The success of machine learning also depends on the quality of the training set created by human reviewers, which requires a high level of precision (i.e., correctly labeling included and excluded records based on the title/abstract information) to train AI tools effectively [16]. The AI tools’ performance is therefore heavily dependent on the quality and representativeness of the training data: if the single-reviewer decisions used for training contain biases or errors, these may be propagated into the AI tools’ predictions.

While dual-reviewer screening is considered the gold standard, it is not infallible. There may be cases where both reviewers incorrectly classify an abstract, which could affect the evaluation of the AI tools’ performance. Each AI tool may also have its own limitations or biases that are not fully captured in this study design; for example, differences in the underlying algorithms or pretraining data could affect performance in ways that are not directly comparable. The choice of decision thresholds for the inclusion and exclusion of references can significantly impact performance, and although sensitivity analyses are planned, finding the optimal threshold that balances sensitivity and specificity may be challenging and context dependent.

Additionally, the results of this study may become outdated as new versions of these tools or entirely new tools become available. Finally, this study does not fully capture the potential synergies or conflicts that might arise from human reviewers working alongside AI tools in real time; the dynamics of this interaction could influence the overall effectiveness of AI-assisted screening.
Despite these limitations, this study will represent an important step in understanding the potential of AI-assisted abstract screening in systematic reviews. By providing empirical evidence on the head-to-head performance of different AI tools in detecting false exclusions, this research will contribute to the ongoing development of more efficient and accurate systematic review methodologies. As AI technologies continue to evolve, it is crucial to rigorously evaluate their performance and understand their limitations to ensure their responsible and effective integration into the systematic review process.
Acknowledgements
ChatGPT (https://chatgpt.com) was used for the initial grammar proofreading in English. The manuscript was then further reviewed by a human proofreader.
Authors’ contributions
LA and GG developed the study concept. LA wrote the protocol, which GG and JK critically revised. The authors read and approved the final version of the submitted manuscript.
Funding
No funding.
Data availability
Not applicable.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
References
- 1. Institute for Quality and Efficiency in Health Care (IQWiG). General Methods: Version 6.0. 2020.
- 2. Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA, editors. Cochrane Handbook for Systematic Reviews of Interventions, version 6.4 (updated August 2023). Cochrane; 2023.
- 3. Centre for Reviews and Dissemination. CRD’s guidance for undertaking reviews in health care. York: York Publishing Services; 2009.
- 4. Gartlehner G, Affengruber L, Titscher V, Noel-Storr A, Dooley G, Ballarini N, et al. Single-reviewer abstract screening missed 13 percent of relevant studies: a crowd-based, randomized controlled trial. J Clin Epidemiol. 2020. 10.1016/j.jclinepi.2020.01.005.
- 5. Nama N, Hennawy M, Barrowman N, O’Hearn K, Sampson M, McNally JD. Successful incorporation of single reviewer assessments during systematic review screening: development and validation of sensitivity and work-saved of an algorithm that considers exclusion criteria and count. Syst Rev. 2021;10(1):98.
- 6. Waffenschmidt S, Knelangen M, Sieben W, Bühn S, Pieper D. Single screening versus conventional double screening for study selection in systematic reviews: a methodological systematic review. BMC Med Res Methodol. 2019;19(1):132.
- 7. Affengruber L, van der Maten MM, Spiero I, Nussbaumer-Streit B, Mahmić-Kaknjo M, Ellen ME, et al. An exploration of available methods and tools to improve the efficiency of systematic review production - a scoping review. BMC Med Res Methodol. 2024. 10.1186/s12874-024-02320-4.
- 8. Marshall IJ, Wallace BC. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst Rev. 2019;8(1):163.
- 9. Utrecht University. ASReview LAB v1.5. 2023.
- 10. DistillerSR Inc. DistillerSR. Version 2.35. 2022.
- 11. EPPI Centre, UCL Social Research Institute, University College London. EPPI-Reviewer: advanced software for systematic reviews, maps and evidence synthesis. 2023.
- 12. Rayyan Systems Inc. Rayyan. 2023.
- 13. Gartlehner G, Affengruber L, Titscher V, Noel-Storr A, Dooley G, Ballarini N, et al. Single-reviewer abstract screening missed 13 percent of relevant studies: a crowd-based, randomized controlled trial. J Clin Epidemiol. 2020;121:20–8.
- 14. Gartlehner G, Gaynes BN, Amick HR, Asher GN, Morgan LC, Coker-Schwimmer E, et al. Comparative benefits and harms of antidepressant, psychological, complementary, and exercise treatments for major depression: an evidence report for a clinical practice guideline from the American College of Physicians. Ann Intern Med. 2016;164(5):331–41.
- 15. von Philipsborn P, Stratil JM, Burns J, Busert LK, Pfadenhauer LM, Polus S, et al. Environmental interventions to reduce the consumption of sugar-sweetened beverages and their effects on health. Cochrane Database Syst Rev. 2019;6(6):CD012292.
- 16. Hamel C, Kelly SE, Thavorn K, Rice DB, Wells GA, Hutton B. An evaluation of DistillerSR’s machine learning-based prioritization tool for title/abstract screening - impact on reviewer-relevant outcomes. BMC Med Res Methodol. 2020;20(1):256.