PLOS ONE. 2020 Jan 14;15(1):e0227742. doi: 10.1371/journal.pone.0227742

Error rates of human reviewers during abstract screening in systematic reviews

Zhen Wang 1,2,*, Tarek Nayfeh 2, Jennifer Tetzlaff 3, Peter O’Blenis 3, Mohammad Hassan Murad 1,2
Editor: Sompop Bencharit
PMCID: PMC6959565  PMID: 31935267

Abstract

Background

Automated approaches to improve the efficiency of systematic reviews are greatly needed. When testing any of these approaches, the criterion standard of comparison (gold standard) is usually human reviewers. Yet, human reviewers make errors in inclusion and exclusion of references.

Objectives

To determine citation false inclusion and false exclusion rates during abstract screening by pairs of independent reviewers. These rates can help in designing, testing and implementing automated approaches.

Methods

We identified all systematic reviews conducted between 2010 and 2017 by an evidence-based practice center in the United States. Eligible reviews had to follow standard systematic review procedures with dual independent screening of abstracts and full texts, in which citation inclusion by one reviewer prompted automatic inclusion through the next level of screening. Disagreements between reviewers during full text screening were reconciled via consensus or arbitration by a third reviewer. A false inclusion or exclusion was defined as a decision made by a single reviewer that was inconsistent with the final included list of studies.

Results

We analyzed a total of 139,467 citations that underwent 329,332 inclusion and exclusion decisions made by 85 unique reviewers. The final systematic reviews included 5.48% of the potential references identified through the bibliographic database searches (95% confidence interval (CI): 2.38% to 8.58%). After abstract screening, the total error rate (false inclusions and false exclusions) was 10.76% (95% CI: 7.43% to 14.09%).

Conclusions

This study suggests that human reviewers make false inclusions and exclusions at appreciable rates. When assessing the validity of a future automated study selection algorithm, it is important to keep in mind that the gold standard is not perfect and that achieving error rates similar to those of humans may be adequate and can save resources and time.

Introduction

A systematic review is a process to identify, select, synthesize, and appraise all empirical evidence that fits pre-specified criteria in order to answer a specific research question. Since Archie Cochrane criticized the lack of reliable evidence in medical care and called for a “critical summary, by specialty or subspecialty, adapted periodically, of all relevant randomized control trials” in the 1970s [1], the systematic review has become a foundation of modern evidence-based medicine. The annual number of published systematic reviews is estimated to have increased by 2,728%, from 1,024 in 1991 to 28,959 in 2014.[2]

Despite the surging number of published systematic reviews in recent years, many systematic reviews employ suboptimal methodological approaches.[2–4] Rigorous systematic reviews require strict procedures with at least eight time-consuming steps.[5, 6] Significant time and resources are needed, with an estimated 0.9 minutes, 7 minutes, and 53 minutes spent per reference per reviewer on abstract screening, full text screening, and data extraction, respectively.[7, 8] A review with 1,000 potential studies retrieved from the literature search has been estimated to require 952 hours to complete.[9] Therefore, methods to improve the efficiency of systematic reviews without jeopardizing their validity are greatly needed.

In recent years, innovations have been proposed to accelerate the process of systematic reviews, including methods to simplify steps of systematic reviews (e.g., rapid systematic reviews)[10–13] and technology to facilitate literature retrieval, screening, and extraction.[7, 8, 14–21] Automation tools for systematic reviews, based on machine learning, text mining, and natural language processing, have been particularly popular, with an estimated workload reduction of 30% to 70%.[14] As of July 2019, 39 tools had been completed and were available for “real-world” use.[22] However, innovations are not always perfect and may introduce additional “unintended” errors. A recent study found that an automation tool used by health systems to identify patients with complex health needs introduced significant racial bias.[23] Assessment of these automation tools is thus critical for their wide adoption in practice.[19, 24] No large-scale test has been conducted, and no conclusions have been reached on whether and how to implement these tools. Theoretically, assessment of an automation tool can be treated as a classification problem: determining whether a citation should be included or excluded. The standard outcome metrics, such as sensitivity, specificity, area under the curve, and positive predictive value, are used. The standard of comparison (i.e., the gold standard) is usually human reviewers. Yet, human reviewers make errors, and there is little evidence on the rate of human error in the systematic review process.
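As an illustration of these standard metrics, the sketch below (in Python, with hypothetical counts that are not data from this study) shows how sensitivity, specificity, and positive predictive value would be computed when an automated screener is compared against inclusion decisions treated as the gold standard.

```python
# Minimal sketch: standard classification metrics for an automated screener
# evaluated against a reference ("gold standard") set of decisions.
# All counts below are hypothetical and for illustration only.

def screening_metrics(tp, fp, fn, tn):
    """Return sensitivity (recall), specificity, and positive predictive value."""
    sensitivity = tp / (tp + fn)   # eligible citations the tool correctly includes
    specificity = tn / (tn + fp)   # ineligible citations the tool correctly excludes
    ppv = tp / (tp + fp)           # proportion of the tool's inclusions that are truly eligible
    return sensitivity, specificity, ppv

# Hypothetical example: 1,000 citations, 100 of which are truly eligible.
sens, spec, ppv = screening_metrics(tp=90, fp=45, fn=10, tn=855)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, PPV={ppv:.2f}")
```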

Thus, we conducted this study to determine the citation selection error rate (false inclusion and false exclusion rates) during abstract screening in systematic reviews conducted by pairs of independent human reviewers. These rates are currently unknown and can help in designing, testing, and implementing automated approaches.

Materials and methods

Study design and data source

We identified all systematic reviews conducted by an evidence-based practice center in the United States. The center is one of the 12 evidence-based practice centers designated and funded by the U.S. Agency for Healthcare Research and Quality (AHRQ). It specializes in conducting systematic reviews and meta-analyses and in developing clinical practice guidelines, evidence dissemination and implementation tools, and related methodological research. Eligible systematic reviews had to: 1) be started and finished between June 1, 2010 and December 31, 2017; 2) follow standard systematic review procedures [5], namely (a) dual independent screening of titles and abstracts, in which inclusion of an abstract by one reviewer prompted automatic inclusion for full text screening, and (b) dual independent screening of full texts, in which disagreements between reviewers were reconciled via consensus or arbitration by a third reviewer (the final included list of studies consisted of the studies retained after both abstract and full text screening); 3) use a web-based commercial systematic review software (DistillerSR, Evidence Partners Incorporated, Ottawa, Canada); and 4) be led by at least one of the core investigators of the evidence-based practice center. The investigation team consisted of a core group (10–15 investigators at any given time) and external collaborators with either methodological or content expertise. DistillerSR was used to facilitate abstract and full text screening and to track all inclusion and exclusion decisions made by human reviewers. We did not use any automation algorithm in the included systematic reviews.

Outcomes

The main outcome of interest was the error rate of human reviewers during abstract screening. An error was defined as a decision made by a single reviewer in abstract screening that was inconsistent (i.e., false inclusion or false exclusion) with the final included list of studies, that is, the studies that passed abstract screening and full text screening and were eligible for data extraction and analysis (see Fig 1). We calculated the error rate as the number of errors divided by the total number of screened abstracts (the total number of citations × 2). We also estimated the overall abstract inclusion rate (defined as the number of eligible studies after abstract screening divided by the total number of citations) and the final inclusion rate (defined as the number of finally included studies divided by the total number of citations). In this study, we did not compare the performance of human reviewers with the automation algorithms integrated in DistillerSR.

Fig 1. Errors occurred during systematic review abstract screening.

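To make the outcome definitions above concrete, the following is a minimal sketch in Python with hypothetical per-review counts; the variable names are illustrative and are not taken from the study database.

```python
# Minimal sketch of the outcome definitions above (hypothetical counts).
n_citations = 5000        # citations retrieved by the literature search for one review
n_errors = 1050           # single-reviewer decisions inconsistent with the final included list
n_after_abstract = 900    # citations retained after abstract screening
n_final = 270             # studies in the final included list

error_rate = n_errors / (n_citations * 2)          # every citation is screened by two reviewers
abstract_inclusion_rate = n_after_abstract / n_citations
final_inclusion_rate = n_final / n_citations

print(f"error rate: {error_rate:.2%}")
print(f"abstract inclusion rate: {abstract_inclusion_rate:.2%}")
print(f"final inclusion rate: {final_inclusion_rate:.2%}")
```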

Statistical analysis

We calculated the outcomes of interest for each eligible systematic review. The mean of each outcome across systematic reviews was the average of the review-level outcomes, weighted in inverse proportion to the variance of the denominator (the total number of screened abstracts or the total number of citations). The variance was estimated using the following formula:

V = \frac{n}{(n-1)\sum_{i} w_i} \sum_{i} w_i (x_i - \bar{x})^2

where \bar{x} is the weighted mean, w_i is the weight of review i, and n is the number of reviews.

All analyses were implemented with Stata version 15.1 (StataCorp LP, College Station, TX, USA).
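For readers who want to reproduce the weighting scheme, the following is a minimal sketch in Python (not the authors' Stata code) of the weighted mean and the variance formula above; the review-level values and the choice of the number of screened abstracts as the weight are assumptions for illustration.

```python
# Weighted mean and variance across reviews, following the formula above.
# x[i] is a review-level outcome (e.g., its error rate); w[i] is its weight
# (e.g., its total number of screened abstracts). Values are hypothetical.

def weighted_mean_variance(x, w):
    n = len(x)
    sum_w = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sum_w                  # weighted mean
    v = (n / ((n - 1) * sum_w)) * sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return xbar, v

rates = [0.12, 0.08, 0.15]        # hypothetical error rates of three reviews
weights = [10000, 4000, 6000]     # hypothetical numbers of screened abstracts
mean, var = weighted_mean_variance(rates, weights)
print(f"weighted mean = {mean:.4f}, variance = {var:.6f}")
```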

Results

A total of 25 systematic reviews were included in the analyses. These systematic reviews comprised 139,467 citations, representing 329,332 inclusion and exclusion decisions made by 85 unique reviewers. Twenty-eight reviewers were core investigators from the evidence-based practice center; 57 were external collaborators with content or methodological expertise. Table 1 lists the characteristics of the included systematic reviews.

Table 1. Characteristics of the included systematic reviews.

Characteristics Results
Systematic reviews 25
Time period June 2010 to December 2017
Citations from literature search 139,467
Inclusion and exclusion decisions 329,332
Decisions after abstract screening 278,934
Decisions after full text screening 50,398
Systematic reviewers 85
From the core team 28
External content or methodological experts 57
Clinical area
Cardiovascular medicine 1
Mental health 2
Primary care 3
Pulmonology and critical care 2
Cardiovascular medicine 1
Endocrinology 7
Hematology 2
Health care delivery research 4
Urology 3
Review question type
Methodology 4
Diagnostic/Screening/Prognostic 4
Treatment 17

The abstract screening inclusion rate was 18.07% of the citations identified through the literature search (95% CI: 12.65% to 23.48%). The final inclusion rate after full text screening was 5.48% of the citations identified through the literature search (95% CI: 2.38% to 8.58%). The total error rate was 10.76% (95% CI: 7.43% to 14.09%). The error rates and inclusion rates varied by clinical area and type of review question (Table 2).

Table 2. Error and inclusion rates by topic area and type of review question.

Category Final inclusion rate (95% CI), dual process Abstract inclusion rate (95% CI), dual process Error rate (95% CI)
Overall (n = 25) 5.48% (2.38% to 8.58%) 18.07% (12.65% to 23.48%) 10.76% (7.43% to 14.09%)
Clinical area
Cardiovascular medicine (n = 1) 1.59% 23.92% 17.73%
Mental health (n = 2) 2.10% (0% to 20.15%) 9.92% (0.77% to 19.06%) 6.43% (0% to 18.19%)
Primary care (n = 3) 5.39% (0% to 13.19%) 28.00% (20.85% to 35.15%) 21.11% (5.15% to 37.08%)
Pulmonology and critical care (n = 2) 1.12% (0% to 2.43%) 9.13% (0% to 38.18%) 6.68% (0% to 42.04%)
Cardiovascular medicine (n = 1) 1.93% 18.56% 19.16%
Endocrinology (n = 7) 5.91% (2.89% to 8.94%) 20.40% (9.70% to 31.09%) 12.23% (4.76% to 19.70%)
Hematology (n = 2) 6.69% (0% to 37.32%) 11.00% (0% to 40.13%) 5.76% (0% to 24.27%)
Health care delivery research (n = 4) 2.18% (0% to 6.86%) 14.55% (0% to 43.35%) 8.73% (0% to 29.03%)
Urology (n = 3) 23.77% (0% to 57.85%) 42.85% (24.92% to 60.79%) 17.17% (0% to 36.07%)
Review question type
Methodology (n = 4) 2.18% (0% to 6.86%) 14.55% (0% to 43.35%) 8.73% (0% to 29.03%)
Diagnostic/Screening/Prognostic (n = 4) 7.83% (3.56% to 12.11%) 25.99% (0% to 57.82%) 14.97% (0% to 34.38%)
Treatment (n = 17) 5.86% (1.47% to 10.26%) 17.64% (12.00% to 23.29%) 10.57% (7.32% to 13.83%)

Discussion

In this cohort of 25 systematic reviews, covering 9 clinical areas and 3 types of clinical questions, a total of 329,332 screening decisions (inclusion vs. exclusion) were made by 85 human reviewers. The error rate (false inclusion and false exclusion) during abstract screening was 10.76% and varied from 5.76% to 21.11% depending on clinical area and question type.

Implications

A rigorous systematic review follows strict procedures and requires significant resources and time to complete, typically taking a team of human reviewers 6–18 months.[25] Automation tools have the potential to mimic human activities in systematic review tasks and have gained popularity in academia and industry. However, the validity of these automation tools has yet to be established.[19, 24] It is intuitive to assume that these tools should achieve a zero error rate (i.e., 100% sensitivity and 100% specificity) in order to be implemented to generate evidence used for decision-making.

Human reviewers have been used as the “gold standard” in evaluating automation tools. However, as with the “gold standards” used in clinical medicine, 100% accuracy is unlikely in reality. We found a 10.76% error rate among human reviewers in abstract screening (roughly 1 error per 9 abstract decisions). This error rate also varied by topic and type of question. Thus, when developing and refining an automation tool, achieving error rates similar to those of humans may be adequate. If this is the case, then such a tool could serve as a single reviewer that is paired with a second human reviewer.

Limitations

The sample size is relatively small, especially when further stratified by clinical area. The findings may not be generalizable to other systematic review questions or topics. The human reviewers who conducted these systematic reviews had a wide range of content knowledge and methodological experience (from a minimum of 1 year to over 10 years), which can be quite different from other review teams. In our practice, citations were automatically included after abstract screening when the two independent reviewers disagreed. The abstract inclusion rate and final inclusion rate resulting from this approach can be higher than those of teams that resolve conflicts at the abstract screening stage. When both reviewers agreed on excluding an abstract, that abstract disappeared from the process; thus, a dual erroneous exclusion could not be assessed. We were not able to evaluate the error rate during full text screening because we did not track conflicts between reviewers at that stage. Lastly, although we label study selection judgments that are inconsistent with the final included list as errors, we acknowledge that some of these errors could be due to poor reporting and insufficient data in the published abstract; such errors may not be avoidable and are not the fault of human reviewers.

In summary, this study is an initial step toward evaluating human errors in systematic reviews. Future studies need to evaluate different systematic review approaches (e.g., rapid systematic reviews, scoping reviews), clinical areas, and review questions. It is also important to increase the number of systematic reviews involved in the evaluation and to include other evidence-based practice center (EPC) and non-EPC institutions.

Conclusions

This study of 329,332 abstract screening decisions made by a large, diverse group of systematic reviewers suggests that human reviewers make false inclusions and exclusions at appreciable rates. When assessing the validity of a future automated study selection algorithm, it is important to keep in mind that the gold standard is not perfect and that achieving error rates similar to those of humans is likely adequate and can save resources and time.

Supporting information

S1 Appendix

(DOCX)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Cochrane AL. 1931–1971: a critical review, with particular reference to the medical profession. Medicines for the Year 2000. 1979:1.
  • 2. Ioannidis JP. The Mass Production of Redundant, Misleading, and Conflicted Systematic Reviews and Meta-analyses. Milbank Q. 2016;94(3):485–514. Epub 2016/09/14. doi: 10.1111/1468-0009.12210
  • 3. Page MJ, Altman DG, McKenzie JE, Shamseer L, Ahmadzai N, Wolfe D, et al. Flaws in the application and interpretation of statistical analyses in systematic reviews of therapeutic interventions were common: a cross-sectional analysis. J Clin Epidemiol. 2018;95:7–18. Epub 2017/12/06. doi: 10.1016/j.jclinepi.2017.11.022
  • 4. Baudard M, Yavchitz A, Ravaud P, Perrodeau E, Boutron I. Impact of searching clinical trial registries in systematic reviews of pharmaceutical treatments: methodological systematic review and reanalysis of meta-analyses. BMJ. 2017;356:j448. Epub 2017/02/19. doi: 10.1136/bmj.j448
  • 5. Higgins JP, Green S. Cochrane Handbook for Systematic Reviews of Interventions. John Wiley & Sons; 2011.
  • 6. Murad MH, Montori VM, Ioannidis JP, Jaeschke R, Devereaux P, Prasad K, et al. How to read a systematic review and meta-analysis and apply the results to patient care: users' guides to the medical literature. JAMA. 2014;312(2):171–9. doi: 10.1001/jama.2014.5559
  • 7. Wang Z, Asi N, Elraiyah TA, Abu Dabrh AM, Undavalli C, Glasziou P, et al. Dual computer monitors to increase efficiency of conducting systematic reviews. J Clin Epidemiol. 2014;67(12):1353–7. Epub 2014/08/03. doi: 10.1016/j.jclinepi.2014.06.011
  • 8. Wang Z, Noor A, Elraiyah T, Murad M, editors. Dual monitors to increase efficiency of conducting systematic reviews. 21st Cochrane Colloquium; 2013.
  • 9. Allen IE, Olkin I. Estimating time to conduct a meta-analysis from number of citations retrieved. JAMA. 1999;282(7):634–5. Epub 1999/10/12. doi: 10.1001/jama.282.7.634
  • 10. Khangura S, Konnyu K, Cushman R, Grimshaw J, Moher D. Evidence summaries: the evolution of a rapid review approach. Syst Rev. 2012;1:10. Epub 2012/05/17. doi: 10.1186/2046-4053-1-10
  • 11. Hailey D, Corabian P, Harstall C, Schneider W. The use and impact of rapid health technology assessments. Int J Technol Assess Health Care. 2000;16(2):651–6. doi: 10.1017/s0266462300101205
  • 12. Patnode CD, Eder ML, Walsh ES, Viswanathan M, Lin JS. The use of rapid review methods for the US Preventive Services Task Force. Am J Prev Med. 2018;54(1):S19–S25.
  • 13. Ganann R, Ciliska D, Thomas H. Expediting systematic reviews: methods and implications of rapid reviews. Implement Sci. 2010;5:56. Epub 2010/07/21. doi: 10.1186/1748-5908-5-56
  • 14. O'Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4(1):5.
  • 15. Li D, Wang Z, Wang L, Sohn S, Shen F, Murad MH, et al. A Text-Mining Framework for Supporting Systematic Reviews. Am J Inf Manag. 2016;1(1):1–9. Epub 2017/10/27.
  • 16. Li D, Wang Z, Shen F, Murad MH, Liu H, editors. Towards a multi-level framework for supporting systematic review—a pilot study. 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2014: IEEE.
  • 17. Alsawas M, Alahdab F, Asi N, Li DC, Wang Z, Murad MH. Natural language processing: use in EBM and a guide for appraisal. BMJ Evidence-Based Medicine. 2016;21(4):136–8.
  • 18. Cohen AM, Hersh WR, Peterson K, Yen P-Y. Reducing workload in systematic review preparation using automated citation classification. J Am Med Inform Assoc. 2006;13(2):206–19. doi: 10.1197/jamia.M1929
  • 19. Beller E, Clark J, Tsafnat G, Adams C, Diehl H, Lund H, et al. Making progress with the automation of systematic reviews: principles of the International Collaboration for the Automation of Systematic Reviews (ICASR). Syst Rev. 2018;7(1):77. Epub 2018/05/21. doi: 10.1186/s13643-018-0740-7
  • 20. O'Connor AM, Tsafnat G, Gilbert SB, Thayer KA, Shemilt I, Thomas J, et al. Still moving toward automation of the systematic review process: a summary of discussions at the third meeting of the International Collaboration for Automation of Systematic Reviews (ICASR). Syst Rev. 2019;8(1):57. Epub 2019/02/23. doi: 10.1186/s13643-019-0975-y
  • 21. Bannach-Brown A, Przybyla P, Thomas J, Rice ASC, Ananiadou S, Liao J, et al. Machine learning algorithms for systematic review: reducing workload in a preclinical review of animal studies and reducing human screening error. Syst Rev. 2019;8(1):23. Epub 2019/01/17. doi: 10.1186/s13643-019-0942-7
  • 22. SR Toolbox. 2019 [cited 2019 August 6]. http://systematicreviewtools.com/index.php.
  • 23. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. doi: 10.1126/science.aax2342
  • 24. O'Connor AM, Tsafnat G, Thomas J, Glasziou P, Gilbert SB, Hutton B. A question of trust: can we build an evidence base to gain trust in systematic review automation technologies? Syst Rev. 2019;8(1):143. doi: 10.1186/s13643-019-1062-0
  • 25. Smith V, Devane D, Begley CM, Clarke M. Methodology in conducting a systematic review of systematic reviews of healthcare interventions. BMC Med Res Methodol. 2011;11(1):15. Epub 2011/02/05. doi: 10.1186/1471-2288-11-15

Decision Letter 0

Sompop Bencharit

12 Dec 2019

PONE-D-19-26633

Error rates of human reviewers during abstract screening in systematic reviews

PLOS ONE

Dear Dr. Wang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

This is an interesting study. However, as pointed out by two reviewers, there are some issues needed to be addressed in particular the methodology. Since the work is quite unique, more explanation in the background (see Reviewer #1) as well as the Methods (see Reviewers #1 and 3), are needed. 

==============================

We would appreciate receiving your revised manuscript by Jan 26 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Sompop Bencharit, DDS, MS, PhD, FACP

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

1. Please ensure that you include a title page within your main document. You should list all authors and all affiliations as per our author instructions and clearly indicate the corresponding author.

2. Please upload a copy of Figure, to which you refer in your text on page 3. If the figure is no longer to be included as part of the submission please remove all reference to it within the text.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

Reviewer #3: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for this study which contributes to our knowledge of the process of conducting systematic reviews.

Data Availability: The list of systematic reviews examined is provided; however, there are no data regarding the full sets of references obtained from the original searches of the examined systematic reviews, nor the inclusion/exclusion decisions of each reviewer for each systematic review. Therefore, readers would not be able to replicate this analysis with the data currently provided.

Statistical Analysis: the descriptive statistics were calculated correctly.

Background: You mention that systematic reviews employ suboptimal methodological approaches and the potential for human errors; however, you do not acknowledge that automated systems and algorithms could introduce errors, potentially systematic errors which could introduce bias into a systematic review, such as that found in this study https://science.sciencemag.org/content/366/6464/447

Methods: Consider explaining/justifying the eligibility criteria #3 use a single software program, Distiller SR - were reviews excluded because they used a different software program? For eligibility criteria #2 - was the citation inclusion by one reviewer prompting automatic inclusion happen at only the abstract screening level (vs. full text screening)? under what circumstances were disagreements between reviewers reconciled via consensus or arbitration.

Results: Consider breaking out the error rate to errors of exclusion and errors of inclusion, as these may be different and of interest to readers and would allow you to provide rates of specificity and sensitivity, typical metrics for classification problems as you mention in the Background section and as was done in the study by Bannach-Brown referenced below.

Further, I would like to see a breakdown of errors from the abstract screening vs. the full text screening (and within this stratification, report exclusion vs. inclusion errors). This is important and interesting, because as you note in the limitations, the errors of inclusion at the abstract screening level reflect the fact that more information is needed to make a decision, rather than commission of an error by the reviewer.

Limitations: You note that "In our practice, citations from abstract screening were automatically included when conflicts between two independent reviewers emerged" - perhaps note that this may have resulted in a spuriously lower error rate than "truth."

Consider referencing the following article in the background or implications as this study calculated error rates from a Machine Learning screening algorithm: Bannach-Brown, A., Przybyła, P., Thomas, J., Rice, A. S., Ananiadou, S., Liao, J., & Macleod, M. R. (2019). Machine learning algorithms for systematic review: reducing workload in a preclinical review of animal studies and reducing human screening error. Systematic reviews, 8(1), 23.

Limitations or perhaps conclusions: As this is a novel study, consider contextualizing the findings as an initial calculation of human errors observed in a small number of systematic reviews and discussing the need for replication of this study using reviews conducted by other EPCs or other non-EPC institutions, using different software, and most important to increase the sample size of studies used to inform our knowledge of valid human error rates.

Typo: 3rd paragraph of background section, In recent year should be In recent years

Reviewer #2: Dear authors - Thank you for investigating this question. It's of interest to me.

I agree that the sample is small and I wondered how meaningful the 10% error rate is. It's expected that human reviewers will make mistakes and if each reviewer errors on 10% of decisions, then... what? That's one reason why there are two reviewers. However, the use of that number as a benchmark to test automated systems puts it in an interesting context. If the automated process is as effective as a human reviewer, then we can save significant human time by eliminating one of the reviewers. But, again, the sample is small, and I imagine the type of question can influence the error rate, and some screening questions might be more conducive to human review / automation. In other words, a 10% error rate for one question may vary drastically for another. Some of this is touched on in the 'implications' section, which I'd like to see expanded, but I understand that such discussion can easily go beyond the study.

Overall, the small sample and the topic variability in the sample and the variability of the screening questions in each of the reviews in the sample lead me to question the significance of the 10%. And that makes me question its utility as a benchmark. But it's a thought-provoking topic and the authors use an interesting method to address it (Distiller data).

Thank you, by the way, for including the complete list of systematic reviews included in the analysis. This helps with reproducibility.

Reviewer #3: Dear Authors,

I have attached a revised version of your manuscript and a PDF file including my specific comments. My major concerns are related to the methodology you followed in your study.

Please, check both.

Best regards.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Mark MacEachern

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PONE-D-19-26633_reviewer.pdf

Attachment

Submitted filename: PONE-D-19-26633_comments.pdf

Decision Letter 1

Sompop Bencharit

30 Dec 2019

Error rates of human reviewers during abstract screening in systematic reviews

PONE-D-19-26633R1

Dear Dr. Wang,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Sompop Bencharit, DDS, MS, PhD, FACP

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

The authors have sufficiently addressed all comments from the reviewers.


Acceptance letter

Sompop Bencharit

2 Jan 2020

PONE-D-19-26633R1

Error rates of human reviewers during abstract screening in systematic reviews

Dear Dr. Wang:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sompop Bencharit

Academic Editor

PLOS ONE
