Abstract
Objectives
The surge in publications increases screening time required to maintain high-quality literature reviews. One of the most time-consuming phases is title and abstract screening. Machine learning tools have semi-automated this process for systematic reviews, with limited success for scoping reviews. ChatGPT, a chatbot based on a large language model, might support scoping review screening by identifying key concepts and themes. We hypothesize that ChatGPT outperforms the semi-automated tool Rayyan, increasing efficiency at acceptable costs while maintaining a low type II error.
Materials and Methods
We conducted a retrospective study using human screening decisions on a scoping review of 15 307 abstracts as a benchmark. A training set of 100 abstracts was used for prompt engineering for ChatGPT and for training Rayyan. Screening decisions for all abstracts were obtained via an application programming interface for ChatGPT and manually for Rayyan. We calculated performance metrics, including accuracy, sensitivity, and specificity, using Stata.
Results
ChatGPT 4.0 decided upon 15 306 abstracts, vastly outperforming Rayyan. ChatGPT 4.0 demonstrated an accuracy of 68%, specificity of 67%, sensitivity of 88%-89%, a negative predictive value of 99%, and an 11% false negative rate when compared to human researchers’ decisions. Workload savings amounted to 64% at reasonable costs.
Discussion and Conclusion
This study demonstrated ChatGPT’s potential to be applied in the first phase of the literature appraisal process for scoping reviews. However, human oversight remains paramount. Additional research on ChatGPT’s parameters, the prompts and screening scenarios is necessary in order to validate these results and to develop a standardized approach.
Keywords: scoping review, ChatGPT, artificial intelligence, automation, large language model, screening
Introduction
Given the exponential increase in scientific publications in recent years, there is a need for timely literature reviews to summarize the available evidence and inform healthcare guidelines.1 However, these reviews are labor-intensive and time-consuming, with systematic reviews often taking over a year to publish.1–8 This has led to a growing interest in leveraging artificial intelligence tools to support the review process.4,5,8 A particularly time-consuming effort lies in the screening of titles and abstracts, typically involving thousands of abstracts reviewed by two independent researchers.7,9 This phase is crucial for the quality and validity of the review, as results are dependent on a comprehensive database search and a thorough, unbiased screening process.10,11 The abstract screening phase averages 33 days, with the yield rate of relevant articles ranging on average from 1% to 2.9%.2,3,8,12 This highlights the potential benefits of artificial intelligence support in this area.2,4
Text mining, an artificial intelligence technique, has supported automation in systematic reviews since 2006.4 The underlying artificial intelligence techniques have since been refined, with review tools such as Rayyan, Covidence, and EPPI-Reviewer employing semi-automated machine learning algorithms to rank references by relevance.7,13,14 These tools learn from human decisions to develop an initial ranking and improve it with subsequent human input.5 Natural language processing and text mining techniques (including word tokenization, stop word removal, stemming, and data mining) extract features to structure the textual data and infer implicit knowledge.2,4,12 These features are then used to train classifiers to predict the inclusion likelihood of an unclassified abstract and to rank accordingly.2,4,7,13,14 Learning algorithms vary, with most tools using support vector machines and others relying on Naive Bayes or k-nearest neighbours.2,4,15 Several studies estimate that machine learning tools can halve the screening workload while detecting 95% of relevant articles.5,12,14
While automation tools have proven effective for systematic reviews, they have not proven equally beneficial for scoping reviews.12 Scoping reviews are increasingly utilized to synthesize evidence and identify evidence gaps.16 Unlike systematic reviews, which address narrow, well-defined research questions, scoping reviews tackle broader questions with wider search and screening criteria.17,18 Their inclusion criteria typically focus on broad concepts and contexts rather than specific interventions and comparators.16 For instance, a scoping review on digital tools for interprofessional communication in healthcare encompasses a variety of digital solutions across diverse professional contexts including physicians and nurses.19 Scoping reviews further incorporate evidence from various sources, including primary research and non-empirical evidence.20
Given that chatbots employing large language models have demonstrated potential to identify key concepts and themes within texts and have shown promising results when screening abstracts for systematic reviews, they might also enhance screening processes for scoping reviews.6,10,11,21 Large language models are trained on extensive datasets to learn patterns and understand context, followed by fine-tuning for expected outcomes.22 They use deep neural networks to process information through multiple interconnected layers.23 In the case of OpenAI’s ChatGPT, the input layer tokenizes text. The tokens are then converted into high-dimensional vectors in the embedding layer to capture semantic meaning. This information is then passed to several transformer blocks, with the output being processed through several layers to generate the final probability distribution over the vocabulary.24 This structure enables ChatGPT to respond to queries without prior training from end-users.11,25
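The processing stages described above can be sketched in a toy, conceptual form; the following is not OpenAI’s implementation but a minimal illustration with random weights, a five-word vocabulary, and a stand-in transformer block.

```python
# Conceptual sketch (not OpenAI's implementation): tokenization -> embedding ->
# transformer blocks -> probability distribution over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["include", "exclude", "abstract", "the", "study"]   # toy vocabulary
d_model = 8                                                  # embedding dimension
embedding = rng.normal(size=(len(vocab), d_model))           # embedding layer

def tokenize(text: str) -> list[int]:
    """Input layer: map words to token ids (real tokenizers use subwords)."""
    return [vocab.index(w) for w in text.split() if w in vocab]

def transformer_block(x: np.ndarray) -> np.ndarray:
    """Stand-in for the self-attention and feed-forward layers of one block."""
    attn = x @ x.T                                            # toy attention scores
    weights = np.exp(attn) / np.exp(attn).sum(axis=-1, keepdims=True)
    return weights @ x                                        # weighted mixing of token vectors

def next_token_distribution(text: str) -> np.ndarray:
    ids = tokenize(text)
    x = embedding[ids]                                        # token ids -> vectors
    for _ in range(2):                                        # stack of transformer blocks
        x = transformer_block(x)
    logits = x[-1] @ embedding.T                              # project onto the vocabulary
    return np.exp(logits) / np.exp(logits).sum()              # probability distribution

print(next_token_distribution("the study abstract"))
```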
Currently, semi-automated tools such as Rayyan, a review tool designed for systematic reviews, have also been used to support scoping reviews. We hypothesize, however, that using ChatGPT in the first screening phase of scoping reviews is superior to using Rayyan, and increases efficiency at acceptable costs while maintaining a low type II error (ie, missing relevant articles). To test this hypothesis, we compared the decisions made by ChatGPT and Rayyan to those of human researchers during the first screening phase of a sample scoping review and assessed the performance.
Methodology
This retrospective study used human researchers’ final decisions at the first screening phase as a benchmark.
Scoping review description
In December 2022, we conducted a comprehensive search across five scientific databases for publications related to digitally supported interprofessional communication and collaboration in healthcare. We aimed to identify studies that involved a digital tool facilitating interaction between (1) healthcare workers from different professions or (2) healthcare workers with the same background employed in different organizations. We included all types of primary research from any healthcare setting and location, written in English, French, German, Portuguese, or Spanish. We excluded abstracts, book chapters, and literature focused on interactions with students or patients. Details on the search strategy, criteria and research process are published elsewhere.19 After removing 11 767 duplicates in Covidence (Veritas Health Innovation), 15 307 unique records were exported to Microsoft Excel (Microsoft Corporation). Nearly all abstracts were written in English; seven were available only in German, four each in French and Spanish, and one in Portuguese. Overall, 3852 abstracts were initially missing from the dataset, of which 3612 were retrieved and added manually. This refined dataset formed the basis for analysis.
Selection of chatbot and comparator
We selected ChatGPT 4.0 as a representative of large language models due to its widespread use and strong performance across various tasks.23,25 Its previous version, ChatGPT 3.5, was also included due to its lower cost. As the comparator we chose Rayyan (Qatar Computing Research Institute), a free, well-established web-based screening tool using artificial intelligence that provided the best user experience in a comparative study of various screening tools.26 Unlike other literature review tools, Rayyan provides a quantifiable rating of the inclusion likelihood of abstracts.13 Rayyan uses text mining methods, extracting words from titles and abstracts and converting them into numerical features (n-grams) to structure the data. A support vector machine classifier analyses these features and creates a predictive model based on the patterns identified in reviewers’ decisions, updating its predictions as reviewers proceed with screening. An abstract’s inclusion likelihood is displayed as a 5-star rating.12,13 In our study, a rating of <2.5 out of 5 stars was deemed an exclusion, a rating of >2.5 stars an inclusion, and a rating of exactly 2.5 stars was “not rated.”
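To illustrate the class of method described above, the sketch below builds a small n-gram plus support vector machine ranker and applies the star-rating thresholds used in this study. It is a simplified stand-in rather than Rayyan’s actual implementation, and the scaling of classifier scores to a 0-5 star range is an assumption made purely for illustration.

```python
# Minimal sketch of an n-gram + support vector machine relevance ranking in the spirit
# of the description above; texts, labels, and the star scaling are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

train_texts = ["digital tool for nurse physician communication",
               "surgical outcomes of knee replacement",
               "messaging app supporting interprofessional collaboration",
               "drug trial for hypertension treatment"]
train_labels = [1, 0, 1, 0]                       # human include/exclude decisions

vectorizer = CountVectorizer(ngram_range=(1, 2))  # word uni- and bigrams as features
X_train = vectorizer.fit_transform(train_texts)
clf = LinearSVC().fit(X_train, train_labels)      # classifier learns from human decisions

unlabelled = ["telehealth platform for care team coordination",
              "biomechanics of running injuries"]
scores = clf.decision_function(vectorizer.transform(unlabelled))
stars = MinMaxScaler((0, 5)).fit_transform(scores.reshape(-1, 1)).ravel()

for text, star in zip(unlabelled, stars):
    # Thresholds as applied in this study: <2.5 exclude, >2.5 include, 2.5 not rated
    decision = "exclude" if star < 2.5 else "include" if star > 2.5 else "not rated"
    print(f"{star:.1f} stars -> {decision}: {text}")
```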
Training the tools
To create a training set of 100 abstracts, we randomly selected 50 excluded and 50 included abstracts according to the screening decisions of two independent researchers, with a third researcher resolving discrepancies. Of the 50 included abstracts, 30 were excluded after full text screening. This training set served (1) to train Rayyan, which requires at least 50 decisions including five inclusions and five exclusions, and (2) to engineer the prompt for ChatGPT.27 All abstracts were uploaded to Rayyan, and the training abstracts were labelled according to the human screening results. Rayyan’s ratings were computed and recorded in a Microsoft Excel sheet (Microsoft Corporation).
The first author (K.N.) conducted iterative prompt engineering, refining the prompt based on the decisions and explanations that ChatGPT provided for inclusion or exclusion, rather than relying on formalized prompt engineering techniques. The final prompt, including both the instructions and the abstract (see Figure 1), resulted in a sensitivity of 69% and a specificity of 82%. Five days later, the training set was re-evaluated with the same prompt to test ChatGPT’s decision consistency, yielding 68% sensitivity and 79% specificity, with a moderate interrater reliability of 0.75 between the two runs.28,29 The results of both trials were deemed acceptable, and we proceeded with full dataset analysis on February 21, 2024, with a second analysis on February 27, 2024. We obtained ChatGPT’s decisions, including an explanation for each decision, through an application programming interface (API) from Excel, using version 6.9 of an Excel macro (ListenData, 2023).30 The “AIAssistant” function was used without word count restriction, prompting ChatGPT to generate the decision and explanation without recalling prior interactions. The temperature was set to 0.7 throughout all ratings to balance repeatability and creativity. The temperature, ranging from 0 to 2, controls the randomness of the output: a value of 0 selects the most probable words, leading to less creative but repeatable outputs, while a temperature of 1 or higher allows for more creative but less repeatable outputs.31
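For illustration, a single screening request of this kind could also be issued programmatically as sketched below. The study itself used an Excel macro, so the model identifier, prompt wording, and example abstract here are placeholders rather than the exact configuration used.

```python
# Illustrative sketch of requesting one screening decision via the OpenAI API
# (the study used an Excel macro; model name and prompt text are placeholders).
from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable

PROMPT_TEMPLATE = (
    "You screen abstracts for a scoping review on digitally supported "
    "interprofessional communication in healthcare. Decide 'Include' or 'Exclude' "
    "according to the criteria below and briefly explain your decision.\n\n"
    "Inclusion criteria: ...\nExclusion criteria: ...\n\nAbstract:\n{abstract}"
)

def screen_abstract(abstract: str) -> str:
    """Send one abstract per request so no prior interaction is recalled."""
    response = client.chat.completions.create(
        model="gpt-4",                      # placeholder; the study used ChatGPT 4.0 and 3.5
        temperature=0.7,                    # value used throughout the study
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(abstract=abstract)}],
    )
    return response.choices[0].message.content

print(screen_abstract("We evaluated a secure messaging app used by nurses and physicians ..."))
```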
Figure 1.
ChatGPT prompt with example responses for inclusion and exclusion. (A) ChatGPT prompt featuring a brief explanation of the overall task, followed by the inclusion criteria and a clarification on the abstract’s focus, as well as the exclusion criteria. The prompt concludes with the abstract, illustrated here with a placeholder text. (B) An example of ChatGPT’s decision to include an abstract. (C) An example of ChatGPT’s decision to exclude an abstract.
Data analysis
We conducted descriptive analysis with Stata version 16 (StataCorp LLC). Using the human decisions as a benchmark, we calculated for each tool and model the number of falsely included and excluded abstracts, as well as those correctly included and excluded. From these, we computed the accuracy (proportion of abstracts correctly judged by AI among all abstracts), sensitivity (proportion of abstracts included by AI among all included as determined by the reference standard, also referred to as recall), specificity (proportion of irrelevant abstracts that were excluded by AI), precision (proportion of abstracts included by AI that were relevant as determined by the reference standard) and negative predictive value (proportion of abstracts excluded by AI that were irrelevant as determined by the reference standard) (see Figure 2).10,11,32 Additionally, we derived the false negative rate (proportion of abstracts falsely excluded by AI among all included as determined by the reference standard), the proportion missed (proportion of abstracts falsely excluded by AI among irrelevant articles as determined by the reference standard) and the workload saving (proportion of abstracts correctly excluded by AI among all abstracts).11,32 We then calculated the interrater reliability, excluding the not-rated abstracts from each interrater reliability calculation, and the costs involved per screening method. As we conducted two sets of ratings for ChatGPT 4.0, we used the average cost. For ChatGPT 4.0 and 3.5, we also qualitatively analysed the output structure and performed a cost evaluation across all tools.
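To show how these metrics relate to the four cells of the confusion matrix, the sketch below computes them from example counts. The study performed the equivalent calculations in Stata; the counts used here are invented for demonstration only.

```python
# Equivalent sketch of the reported screening metrics, computed from a confusion matrix.
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,                 # correctly judged among all abstracts
        "sensitivity (recall)": tp / (tp + fn),        # relevant abstracts the AI included
        "specificity": tn / (tn + fp),                 # irrelevant abstracts the AI excluded
        "precision": tp / (tp + fp),                   # AI inclusions that were truly relevant
        "negative predictive value": tn / (tn + fn),   # AI exclusions that were truly irrelevant
        "false negative rate": fn / (fn + tp),         # relevant abstracts the AI excluded
        "workload saving": tn / total,                 # correctly excluded among all abstracts
    }

# Example with made-up counts resembling a highly imbalanced screening task
print(screening_metrics(tp=590, fp=4800, tn=9830, fn=80))
```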
Figure 2.
Metrics of ChatGPT’s and Rayyan’s decisions: ChatGPT 4.0 (rating 1, n = 15 306; rating 2, n = 15 306), ChatGPT 3.5 (n = 15 213), and Rayyan (n = 4). (A) Confusion matrix illustrating the AI tool’s decisions vs human decisions as true negatives and positives, as well as false negatives and positives, alongside other relevant metrics. (B) Additional metrics to complement the confusion matrix.
Study dissemination
Study results were presented at the European Conference of Public Health.33
Results
Manual screening results
Manual title and abstract screening excluded 14 633 manuscripts (95.6%). The mean interrater reliability across reviewer pairs (calculated among the five researchers involved in the abstract screening process) was 0.29 with a standard error (SE) of 0.07, ranging from 0.08 (SE: 0.14) to 0.62 (SE: 0.08). Screening occurred between December 2022 and February 2023.
Description of the (semi-)automated screening results
Reasons for not-rated abstracts
After training on 100 abstracts, Rayyan rated only four abstracts with a rating other than 2.5 stars: two with 1.5 stars (rated as “exclude”) and one each with 3.5 and 4.5 stars (rated as “include”). ChatGPT 4.0 did not rate one abstract because it exceeded the 8192 token limit. Conversely, ChatGPT 3.5 rated that abstract but did not rate 94 other abstracts, 93 of which had no abstract (39% of all missing abstracts). In all other no-abstract cases, ChatGPT 3.5 labelled the entry as “exclude.” ChatGPT 4.0 labelled all entries without an abstract as “exclude.”
Errors and other indicators
After excluding not-rated abstracts, all ChatGPT models showed similar levels of accuracy (68%), precision (11%), negative predictive value (99%), and specificity (67%). The sensitivity of ChatGPT 3.5 (84%) was lower than that of ChatGPT 4.0 (rating 1: 88%; rating 2: 89%). Rayyan’s 100% achievement across these indicators relied on only four classified abstracts. ChatGPT 4.0 and 3.5 displayed false negative rates of 11% and 16%, respectively. Compared to final full-text decisions, 3% (ChatGPT 4.0) and 4% (ChatGPT 3.5) of abstracts were falsely excluded. Using ChatGPT for screening achieved a workload saving of 64% across all ChatGPT models.
Interrater reliability
Rayyan’s interrater reliability with human researchers was 1.00 (SE: 0.50) but based on four abstracts. The interrater reliability between humans and all ChatGPT ratings ranged from 0.12 to 0.13 (SE: 0.00) (see Figure 3). For 1648 abstracts, the evaluations of the two ChatGPT 4.0 ratings differed, resulting in an interrater reliability of 0.76 (SE: 0.01). The interrater reliability between ChatGPT 3.5 and ChatGPT 4.0 ratings was lower (0.41, SE: 0.01).
Figure 3.
Interrater reliability with SE. A score below 0 indicates a poor, 0.00-0.20 a slight, 0.21-0.40 a fair, 0.41-0.60 a moderate, 0.61-0.80 a substantial, and 0.81-1.00 an almost perfect interrater reliability.34 N refers to the total number of abstracts evaluated by human raters and/or tools. (A) Average interrater reliability of human raters. (B) Interrater reliability between the human benchmark and the decisions of the different ChatGPT models and ratings. “4.0 (1.)” refers to the first rating of ChatGPT 4.0, “4.0 (2.)” to the second rating and “3.5” to the decisions of ChatGPT 3.5. (C) Interrater reliability between the different models and ratings of ChatGPT.
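Assuming Cohen’s kappa underlies the interrater reliability figures reported here (the cited interpretation bands match Landis and Koch), a minimal sketch of the calculation and its verbal interpretation could look as follows; the rating vectors are invented.

```python
# Sketch of pairwise Cohen's kappa with the Landis and Koch interpretation bands.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / n**2
    return (observed - expected) / (1 - expected)

def landis_koch(kappa: float) -> str:
    if kappa < 0:
        return "poor"
    bands = [(0.81, "almost perfect"), (0.61, "substantial"), (0.41, "moderate"),
             (0.21, "fair"), (0.0, "slight")]
    return next(label for lower, label in bands if kappa >= lower)

human   = ["include", "exclude", "exclude", "include", "exclude"]  # invented decisions
chatgpt = ["include", "include", "exclude", "include", "exclude"]
k = cohens_kappa(human, chatgpt)
print(f"kappa = {k:.2f} ({landis_koch(k)})")
```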
Output format
ChatGPT 4.0 consistently structured its output by leading with the decision (Include or Exclude), followed by an explanation. In contrast, ChatGPT 3.5 used three different patterns:
Decision (eg, “Include”) followed by the explanation (n = 11 711),
Textual decision (eg, “This abstract should be excluded based on the inclusion and exclusion criteria provided.”) followed by the explanation (n = 1374) and
Explanation followed by the decision (n = 2129).
Cost evaluation
The costs incurred for screening differed according to the tool and model used:
Rayyan: free of charge.
ChatGPT 3.5: $9.06 (API usage fee only).
ChatGPT 4.0: $505.72 (base subscription $20 + average API usage fee* $485.72).
Note: *The API usage fee represents the average of both ratings (rating 1: $491.91; rating 2: $479.53), as in some cases the Excel file crashed, requiring reanalysis.
The API usage fee depends on the number of tokens per prompt and output. In English, a token is approximately equivalent to 4 characters.35
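The relationship between text length, tokens, and cost can be sketched as below. The character-to-token approximation follows the figure quoted above, while the per-token prices and text lengths are placeholders, not OpenAI’s actual rates or this study’s usage.

```python
# Rough cost estimator using the ~4 characters per token approximation; prices are placeholders.
def estimate_cost(prompt_chars: int, output_chars: int, n_abstracts: int,
                  usd_per_1k_input_tokens: float, usd_per_1k_output_tokens: float) -> float:
    input_tokens = prompt_chars / 4           # ~4 characters per token in English text
    output_tokens = output_chars / 4
    per_abstract = (input_tokens / 1000) * usd_per_1k_input_tokens \
                 + (output_tokens / 1000) * usd_per_1k_output_tokens
    return per_abstract * n_abstracts

# Example: ~2000-character prompt plus abstract, ~400-character answer, 15 306 abstracts,
# with hypothetical prices of $0.01 / $0.03 per 1000 input/output tokens.
print(f"${estimate_cost(2000, 400, 15306, 0.01, 0.03):.2f}")
```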
Discussion
Summary of results
ChatGPT 4.0 screened and decided upon 15 306 abstracts, compared to four abstracts decided upon by the semi-automated screening tool Rayyan. All ChatGPT models showed consistent levels of accuracy (68%), precision (11%), negative predictive value (99%), specificity (67%) and workload savings (64%). Sensitivity was 88% and 89% for ChatGPT 4.0 and 84% for ChatGPT 3.5, with false negative rates of 11% and 16%, respectively. ChatGPT’s high sensitivity demonstrated its strong ability to correctly identify relevant studies, while its lower specificity and accuracy reflected a tendency towards over-inclusion, reducing precision. With a negative predictive value of 99%, ChatGPT reliably identified irrelevant studies. The false negative rates of 11% (ChatGPT 4.0) and 16% (ChatGPT 3.5) highlight the necessity for human oversight to ensure inclusion of all relevant studies and to maintain the review’s quality. At a model temperature of 0.7, the interrater reliability between the two ChatGPT 4.0 ratings was substantial, and moderate when compared with ChatGPT 3.5. Both models demonstrated only slight reliability when compared with human researchers’ decisions. While ChatGPT 4.0 consistently used the same output format, ChatGPT 3.5 produced different output patterns. The cost of deployment varied: Rayyan was free of charge, whereas using ChatGPT 3.5 cost $9.06 and ChatGPT 4.0 amounted to $505.72. The higher ability of ChatGPT 4.0 to correctly identify potentially relevant articles justified the higher cost.
Comparison with semi-automated tools
Our study demonstrates that chatbots are a more feasible alternative for scoping review screening than current semi-automated tools, which require substantial human interaction to train and refine their models.11 Despite being trained with double the recommended sample, Rayyan decided on only four articles.27 However, larger training samples and iterative retraining may improve Rayyan’s performance.32
In systematic review screening, Rayyan achieved a sensitivity of 78% and a proportion of missed references of 0.5%.36 Semi-automated tools such as Abstrackr and Distiller AI have shown similar sensitivities, with specificities ranging from 72% to 95% and for Distiller AI a precision of 16%.32,37 Rayyan’s low performance in this study highlights the need for alternative tools in scoping reviews. Semi-automated tools also perform worse for systematic reviews with complex inclusion criteria or multiple research questions, with some studies suggesting limiting their use to reviews only including randomized controlled trials.32,38,39
Efforts to reduce workload by using semi-automated tools often compromise recall (ie, missing relevant articles) for three main reasons: (1) an unclear stopping point, (2) imbalanced datasets, and (3) biased researchers.1,12,40 Researchers use various metrics to determine when to discontinue manual screening, such as a certain number of consecutive articles excluded, a minimum prediction score, a pre-determined number of articles screened, or the time spent on screening.1,12,32,41 The optimal stopping point likely varies by review and is only known in retrospect.12,14,41 A heuristic estimate of 50% is widely accepted across tools, with studies reporting that 95% of relevant abstracts are identified after screening 29.5%-47.1% of all abstracts.7,12 However, even low-ranked articles may be relevant.12,14 Imbalanced datasets can bias algorithms towards exclusion.13,39 Lastly, ranking the articles according to relevance might influence researchers’ decisions, potentially causing complacency and a tendency to underestimate the importance of articles presented at a later stage.12
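One of the stopping heuristics mentioned above (stop after a run of consecutive exclusions in a relevance-ranked list) is sketched below for illustration; the decision sequence and threshold are invented, and the example shows how a late relevant article can be missed.

```python
# Illustrative "stop after k consecutive exclusions" heuristic on a ranked list.
def screen_until_stable(ranked_decisions: list[bool], max_consecutive_exclusions: int = 50) -> int:
    """Return how many abstracts are screened before stopping.

    ranked_decisions: human include (True) / exclude (False) decisions, in ranked order.
    """
    consecutive = 0
    for screened, included in enumerate(ranked_decisions, start=1):
        consecutive = 0 if included else consecutive + 1
        if consecutive >= max_consecutive_exclusions:
            return screened
    return len(ranked_decisions)

# Example: relevant abstracts clustered near the top of the ranking
decisions = [True] * 5 + [False] * 200 + [True]   # the last relevant abstract would be missed
print(screen_until_stable(decisions, max_consecutive_exclusions=50))  # -> 55
```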
Comparison with large language models
Chatbots like ChatGPT demonstrate advantages for screening processes compared to semi-automated tools.12,42 First, as a zero-shot model, ChatGPT requires no prior training by the end-user, nor are seed articles needed.10,43 Second, ChatGPT’s ability to analyse fuzzy or unstructured data enables it to infer, for example, whether “review” refers to a literature review or a customer review.42 Like human researchers, ChatGPT can provide explanations for its decisions, potentially highlighting errors in reasoning.10 Lastly, ChatGPT’s multilingual understanding enables researchers to extend their search to non-English research articles, which are often underrepresented in reviews.44
Although the performance metrics were slightly lower than those of semi-automated tools frequently used for systematic reviews, ChatGPT showed promising results: several studies demonstrated high efficacy of ChatGPT in systematic review screening, which addresses narrower research questions than scoping reviews.10,11,21,32,37 One study using ChatGPT 4.0 for scoping review screening reported an accuracy of 94%, a specificity of 94%, and a sensitivity of 100%.10 In our study, sensitivity was higher than specificity, suggesting that ChatGPT was effective at including relevant abstracts but less efficient at excluding irrelevant ones.
Repeating the same prompt on different days resulted in a substantial ChatGPT 4.0 interrater reliability of 0.76, aligning with prior findings.29 The slight interrater reliability with human researchers in our study was lower than reported for systematic reviews but consistent with other scoping reviews.10,11 This difference is likely because scoping reviews have less clearly defined inclusion and exclusion criteria than systematic reviews. Human researchers also demonstrated lower interrater reliability in scoping reviews than in systematic reviews.10,11 To foster trust, good alignment between the decisions of human researchers and ChatGPT is imperative.11
Next steps
Different strategies for reducing screening workload have been proposed in the literature, such as using semi-automated tools or relying on a single reviewer. However, these approaches often affect recall: For instance, a single reviewer might miss around 13% of relevant studies.7 ChatGPT-supported screening, with its speedy analysis of large datasets, opens up new possibilities for supporting human researchers while ensuring accurate screening and a high recall.10,45,46 Potential strategies include:
Sequential approach: This approach includes an initial screening using ChatGPT to exploit its speed and scalability, with human researchers reviewing only the abstracts included by ChatGPT.11,32,42 This strategy, based on ChatGPT’s high negative predictive value, could effectively limit the number of abstracts to be screened by researchers, possibly improving researchers’ concentration and motivation.41
Hybrid approach: A hybrid approach combines decisions by one human researcher and the Chatbot, with conflicts being resolved by an additional researcher.2,29 This might balance the high sensitivity of chatbots with the high specificity of human raters.2,11
Multiple chatbot voting rounds: In this approach, ChatGPT conducts multiple screening rounds. Abstracts are then included if they are voted for inclusion at least once, or once they reach a minimum number of inclusion votes (as sketched after this list).46 This approach can be combined with either the sequential approach or the hybrid approach.
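A hypothetical sketch of such a voting rule, aggregating several chatbot screening rounds into one decision per abstract, is shown below; the abstract identifiers, votes, and threshold are invented for illustration.

```python
# Hypothetical aggregation of several ChatGPT screening rounds into one decision per abstract.
def aggregate_votes(rounds: list[dict[str, bool]], min_votes: int = 1) -> dict[str, bool]:
    """Include an abstract once it collects at least `min_votes` inclusion votes."""
    abstract_ids = rounds[0].keys()
    return {aid: sum(r[aid] for r in rounds) >= min_votes for aid in abstract_ids}

round_1 = {"a1": True,  "a2": False, "a3": True}
round_2 = {"a1": True,  "a2": True,  "a3": False}
round_3 = {"a1": False, "a2": False, "a3": False}

print(aggregate_votes([round_1, round_2, round_3], min_votes=1))  # any inclusion vote suffices
print(aggregate_votes([round_1, round_2, round_3], min_votes=2))  # stricter majority-style rule
```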
Regardless of the chosen approach, the workload and cost savings possibly limit screening fatigue and enable researchers to use a broader search string, maximizing sensitivity and improving recall.13,15,41,42
Several considerations are necessary when using ChatGPT: First, accurate outputs rely on complete, correct and non-biased inputs.21 Therefore, input data need to be meticulously prepared, including the manual addition of abstracts. Second, clear and specific instructions (prompts) to the model improve the accuracy of the answers.47,48 Using the PCC scheme (Population-Concept-Context), frequently applied in scoping reviews, to provide inclusion criteria may be beneficial. Initial trials with prompts using the PICOS scheme (Population-Intervention-Comparison-Outcome-Study design) for systematic review screening with ChatGPT demonstrated good results.11 Providing more information, such as study type and year, might further enhance the chatbot’s performance.10 Iterative engineering of the prompts is necessary for good human-chatbot interrater reliability.29 Trials with different chatbot parameters, such as temperature, which affects the randomness of the generated output, might also be beneficial.31 Lastly, human oversight and trust are crucial. The uptake of ChatGPT might be slow, as researchers tend to be hesitant to use tools not yet widely accepted in the scientific community.38 While automatic deduplication of references is an accepted standard procedure, proprietary large language models in particular are perceived as black boxes, owing to the complexity and lack of transparency of their output generation process and because system parameters, in terms of both technology and researchers’ input, are not commonly shared.42 Continuous updates of the tool might yield different results, complicating replication efforts.11 To gain researchers’ trust and acceptance in the scientific community, human oversight and a standardized, transparent approach are needed, alongside studies demonstrating the AI tool’s non-inferiority to human researchers in specific phases of a literature review.7,10,37,38 Furthermore, allowing researchers to set the AI tool’s decision threshold might reduce risk and increase trust in the tool.37
Future research
To deepen our understanding of ChatGPT’s decisions and improve performance, future research is needed to qualitatively analyze ChatGPT’s explanations and compare its reasoning to that of human researchers. Special attention should be given to decisions where ChatGPT diverged from human judgment or from its own judgment in another rating.11 Additionally, we recommend prospectively evaluating the performance and workload savings of ChatGPT when used alongside a researcher, compared against a researcher pair and investigating the different screening approaches discussed above.13
Beyond abstract screening, ChatGPT offers potential for implementation in various stages of the review process, including search strategy derivation, full text screening and data extraction.5,11,21,42 ChatGPT could generate search terms and adapt them to different databases.21 As ChatGPT’s token restriction has been expanded, it could become a viable tool for the full text screening phase, as well.10 Additionally, ChatGPT’s ability to understand context suggests its usefulness in data extraction.14
Limitations
Key strengths of our study are eliciting reasons for ChatGPT’s decisions, repeating the rating with ChatGPT 4.0 and using the current gold standard (final decisions of two independent researchers with another researcher settling differences) as the reference.32 However, this study also has some limitations. First, despite being the gold standard, research has shown that human decisions are not flawless, as they depend on expertise, experience and language proficiency.5,11,36 Reviewers are trained to be over-inclusive, as evidenced by a 70% exclusion rate during full text screening, yet they still missed 3% of relevant studies.37,46,49 Second, our results are based on a single scoping review with a well-defined scope. Further research is needed to investigate the generalizability to other disciplines and broader topics.10–12 Third, we did not apply a systematic approach to prompt engineering during the refinement process. Utilizing a more structured method for prompt engineering could potentially have improved ChatGPT’s performance. Lastly, due to practical constraints, we compared only one chatbot (two models) and one semi-automated tool, whose core technology (support vector machine) may not reflect the most advanced technology.13
Conclusions
With an exponentially growing body of research, maintaining quality in reviews will likely require increased screening time. Our study demonstrates ChatGPT’s potential to be applied in the first screening phase of scoping reviews, demonstrating good levels of accuracy, specificity, and sensitivity and vastly outperforming the semi-automated machine learning tool Rayyan. ChatGPT also demonstrated a negative predictive value of 99% and workload savings of 64%.
Despite the promising results, caution is warranted in solely relying on ChatGPT, as its decisions resulted in a false negative rate of 11% for ChatGPT 4.0 and 16% for ChatGPT 3.5. Human oversight remains paramount. Further research on ChatGPT’s parameters, the prompt, screening scenarios and fields of research is necessary in order to validate these results and develop a standardized approach.
Acknowledgments
The authors want to extend their gratitude to Marie-Christin Redlich and Patricia Möbius-Lerch for supporting the screening process of the scoping review, without which the comparison in this present feasibility study would not have been possible.
Contributor Information
Kim Nordmann, Bavarian Research Center for Digital Health and Social Care, Kempten University of Applied Sciences, Kempten 87437, Germany.
Michael Schaller, Bavarian Research Center for Digital Health and Social Care, Kempten University of Applied Sciences, Kempten 87437, Germany.
Stefanie Sauter, Bavarian Research Center for Digital Health and Social Care, Kempten University of Applied Sciences, Kempten 87437, Germany.
Florian Fischer, Bavarian Research Center for Digital Health and Social Care, Kempten University of Applied Sciences, Kempten 87437, Germany; Institute of Public Health, Charité—Universitätsmedizin Berlin, Berlin 10117, Germany.
Author contributions
Kim Nordmann: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Data curation, Visualization, Project administration, Writing—original draft. Stefanie Sauter: Writing—review & editing. Michael Schaller: Methodology, Writing—review & editing. Florian Fischer: Supervision, Methodology, Writing—review & editing
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflicts of interest
The author(s) declare no competing interests.
Data availability
The data that support the findings of this study are available from the corresponding author, K.N., upon reasonable request.
Ethics approval
As the study did not involve sensitive data, no ethical clearance was necessary.
Declaration of generative AI in scientific writing
During the preparation of this work, the authors used ChatGPT 4.0 in order to improve readability and language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
References
1. van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, et al. Artificial intelligence in systematic reviews: promising when appropriately used. BMJ Open. 2023;13:e072254. 10.1136/bmjopen-2023-072254
2. Blaizot A, Veettil SK, Saidoung P, et al. Using artificial intelligence methods for systematic review in health sciences: a systematic review. Res Synth Methods. 2022;13:353-362. 10.1002/jrsm.1553
3. Borah R, Brown AW, Capers PL, et al. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7:e012545. 10.1136/bmjopen-2016-012545
4. de La Torre-López J, Ramírez A, Romero JR. Artificial intelligence to automate the systematic review of scientific literature. Computing. 2023;105:2171-2194.
5. Tsafnat G, Glasziou P, Keen Choong M, et al. Systematic review automation technologies. Syst Rev. 2014;3:74.
6. Christou P. How to use artificial intelligence (AI) as a resource, methodological and analysis tool in qualitative research? TQR. 2023;28:1968-1980.
7. Hamel C, Hersi M, Kelly SE, et al. Guidance for using artificial intelligence for title and abstract screening while conducting knowledge syntheses. BMC Med Res Methodol. 2021;21:285. 10.1186/s12874-021-01451-2
8. Zhang Y, Liang S, Feng Y, et al. Automation of literature screening using machine learning in medical evidence synthesis: a diagnostic test accuracy systematic review protocol. Syst Rev. 2022;11:11. 10.1186/s13643-021-01881-5
9. Muthu S. The efficiency of machine learning-assisted platform for article screening in systematic reviews in orthopaedics. Int Orthop. 2023;47:551-556. 10.1007/s00264-022-05672-y
10. Guo E, Gupta M, Deng J, et al. Automated paper screening for clinical reviews using large language models: data analysis study. J Med Internet Res. 2024;26:e48996. 10.2196/48996
11. Issaiy M, Ghanaati H, Kolahi S, et al. Methodological insights into ChatGPT's screening performance in systematic reviews. BMC Med Res Methodol. 2024;24:78. 10.1186/s12874-024-02203-8
12. Chai KEK, Lines RLJ, Gucciardi DF, et al. Research screener: a machine learning tool to semi-automate abstract screening for systematic reviews. Syst Rev. 2021;10:93. 10.1186/s13643-021-01635-3
13. Valizadeh A, Moassefi M, Nakhostin-Ansari A, et al. Abstract screening using the automated tool Rayyan: results of effectiveness in three diagnostic test accuracy systematic reviews. BMC Med Res Methodol. 2022;22:160. 10.1186/s12874-022-01631-8
14. Marshall IJ, Wallace BC. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst Rev. 2019;8:163. 10.1186/s13643-019-1074-9
15. van de Schoot R, de Bruin J, Schram R, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3:125-133.
16. Peters MDJ, Godfrey C, McInerney P, et al. Best practice guidance and reporting items for the development of scoping review protocols. JBI Evid Synth. 2022;20:953-968. 10.11124/JBIES-21-00242
17. Pollock D, Tricco AC, Peters MDJ, et al. Methodological quality, guidance, and tools in scoping reviews: a scoping review protocol. JBI Evid Synth. 2022;20:1098-1105.
18. Huang Y, Procházková M, Lu J, et al. Family related variables' influences on adolescents' health based on the Health Behaviour in School-Aged Children database, an AI-assisted scoping review, and narrative synthesis. Front Psychol. 2022;13:871795. 10.3389/fpsyg.2022.871795
19. Nordmann K, Sauter S, Möbius-Lerch P, et al. Conceptualizing interprofessional digital communication and collaboration in health care: protocol for a scoping review. JMIR Res Protoc. 2023;12:e45179. 10.2196/45179
20. Campbell F, Tricco AC, Munn Z, et al. Mapping reviews, scoping reviews, and evidence and gap maps (EGMs): the same but different - the "big picture" review family. Syst Rev. 2023;12:45. 10.1186/s13643-023-02178-5
21. Alshami A, Elsayed M, Ali E, et al. Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems. 2023;11:351.
22. Alberts IL, Mercolli L, Pyka T, et al. Large language models (LLM) and ChatGPT: what will the impact on nuclear medicine be? Eur J Nucl Med Mol Imaging. 2023;50:1549-1552. 10.1007/s00259-023-06172-w
23. Buhr CR, Smith H, Huppertz T, et al. ChatGPT versus consultants: blinded evaluation on answering otorhinolaryngology case-based questions. JMIR Med Educ. 2023;9:e49183. 10.2196/49183
24. Belatrix. ChatGPT System Architecture: exploring the basics of AI, ML, and NLP. 2024. Accessed February 21, 2025. https://www.pentalog.com/blog/tech-trends/chatgpt-fundamentals/
25. Naveed H, Khan AU, Qiu S, et al. A comprehensive overview of large language models. ACM Trans Intell Syst Technol. 2025;16:5;106. 10.1145/3744746
26. Harrison H, Griffin SJ, Kuhn I, et al. Software tools to support title and abstract screening for systematic reviews in healthcare: an evaluation. BMC Med Res Methodol. 2020;20:7. 10.1186/s12874-020-0897-3
27. Rayyan. Using Rayyan's predictions classifier for relevance ranking? 2024. Accessed February 21, 2025. https://help.rayyan.ai/hc/en-us/articles/17461088734353-Using-Rayyan-s-Predictions-Classifier-for-Relevance-Ranking
28. Stephens LD, Jacobs JW, Adkins BD, et al. Battle of the (chat)bots: comparing large language models to practice guidelines for transfusion-associated graft-versus-host disease prevention. Transfus Med Rev. 2023;37:150753. 10.1016/j.tmrv.2023.150753
29. Huang Y-M, Rocha T, eds. Innovative Technologies and Learning: 6th International Conference, ICITL 2023, Porto, Portugal, August 28-30, 2023, Proceedings. Springer Nature Switzerland; 2023.
30. Bhalla D. 3 ways to integrate ChatGPT into Excel. 2023. Accessed February 21, 2025. https://www.listendata.com/2023/03/how-to-run-chatgpt-inside-excel.html
31. Davis J, van Bulck L, Durieux BN, et al. The temperature feature of ChatGPT: modifying creativity for clinical research. JMIR Hum Factors. 2024;11:e53559. 10.2196/53559
32. Carey N, Harte M, Mc Cullagh L. A text-mining tool generated title-abstract screening workload savings: performance evaluation versus single-human screening. J Clin Epidemiol. 2022;149:53-59. 10.1016/j.jclinepi.2022.05.017
33. Nordmann K, Fischer F. Capability of ChatGPT to support the screening process of scoping reviews: a feasibility study. Eur J Public Health. 2024;34:ckae144.686. 10.1093/eurpub/ckae144.686
34. OpenAI. Pricing: simple and flexible. Only pay for what you use. 2024. Accessed February 21, 2025. https://openai.com/api/pricing/
35. Dos Reis AHS, de Oliveira ALM, Fritsch C, et al. Usefulness of machine learning softwares to screen titles of systematic reviews: a methodological study. Syst Rev. 2023;12:68. 10.1186/s13643-023-02231-3
36. Gartlehner G, Affengruber L, Titscher V, et al. Single-reviewer abstract screening missed 13 percent of relevant studies: a crowd-based, randomized controlled trial. J Clin Epidemiol. 2020;121:20-28. 10.1016/j.jclinepi.2020.01.005
37. Gates A, Johnson C, Hartling L. Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool. Syst Rev. 2018;7:45. 10.1186/s13643-018-0707-8
38. Rathbone J, Hoffmann T, Glasziou P. Faster title and abstract screening? Evaluating Abstrackr, a semi-automated online screening program for systematic reviewers. Syst Rev. 2015;4:80. 10.1186/s13643-015-0067-6
39. Shemilt I, Khan N, Park S, et al. Use of cost-effectiveness analysis to compare the efficiency of study identification methods in systematic reviews. Syst Rev. 2016;5:140. 10.1186/s13643-016-0315-4
40. Oude Wolcherink MJ, Pouwels XGLV, van Dijk SHB, et al. Can artificial intelligence separate the wheat from the chaff in systematic reviews of health economic articles? Expert Rev Pharmacoecon Outcomes Res. 2023;23:1049-1056. 10.1080/14737167.2023.2234639
41. Wagner G, Lukyanenko R, Paré G. Artificial intelligence and the conduct of literature reviews. J Inf Technol. 2022;37:209-226.
42. Wiggers K. The emerging types of language models and why they matter. 2022. Accessed February 21, 2025. https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/
43. Neimann Rasmussen L, Montgomery P. The prevalence of and factors associated with inclusion of non-English language studies in Campbell systematic reviews: a survey and meta-epidemiological study. Syst Rev. 2018;7:129. 10.1186/s13643-018-0786-6
44. Ruksakulpiwat S, Phianhasin L, Benjasirisan C, et al. Assessing the efficacy of ChatGPT versus human researchers in identifying relevant studies on mHealth interventions for improving medication adherence in patients with ischemic stroke when conducting systematic reviews: comparative analysis. JMIR Mhealth Uhealth. 2024;12:e51526. 10.2196/51526
45. O'Mara-Eves A, Thomas J, McNaught J, et al. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4:5. 10.1186/2046-4053-4-5
46. Mahuli SA, Rai A, Mahuli AV, et al. Application ChatGPT in conducting systematic reviews and meta-analyses. Br Dent J. 2023;235:90-92.
47. Qureshi R, Shaughnessy D, Gill KAR, et al. Are ChatGPT and large language models "the answer" to bringing us closer to systematic review automation? Syst Rev. 2023;12:72. 10.1186/s13643-023-02243-z
48. Wang Z, Nayfeh T, Tetzlaff J, et al. Error rates of human reviewers during abstract screening in systematic reviews. PLoS One. 2020;15:e0227742. 10.1371/journal.pone.0227742
49. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.