PLoS One. 2025 Jan 7;20(1):e0313401. doi: 10.1371/journal.pone.0313401

ChatGPT-4o can serve as the second rater for data extraction in systematic reviews

Mette Motzfeldt Jensen 1,2,*, Mathias Brix Danielsen 1,2, Johannes Riis 1,2, Karoline Assifuah Kristjansen 3,4, Stig Andersen 1,2, Yoshiro Okubo 5, Martin Grønbech Jørgensen 1,2
Editor: Weiqiang (Albert) Jin
PMCID: PMC11706374  PMID: 39774443

Abstract

Background

Systematic reviews distill a large body of evidence and support the transfer of knowledge from clinical trials to guidelines. Yet, they are time-consuming. Artificial intelligence (AI), like ChatGPT-4o, may streamline data extraction, but its efficacy requires validation.

Objective

This study aims to (1) evaluate the validity of ChatGPT-4o for data extraction compared to human reviewers, and (2) test the reproducibility of ChatGPT-4o’s data extraction.

Methods

We conducted a comparative study using papers from an ongoing systematic review on exercise to reduce fall risk. Data extracted by ChatGPT-4o were compared to a reference standard: data extracted by two independent human reviewers. The validity was assessed by categorizing the extracted data into five categories ranging from completely correct to false data. Reproducibility was evaluated by comparing data extracted in two separate sessions using different ChatGPT-4o accounts.

Results

ChatGPT-4o extracted a total of 484 data points across 11 papers. The AI’s data extraction was 92.4% accurate (95% CI: 89.5% to 94.5%) and produced false data in 5.2% of cases (95% CI: 3.4% to 7.4%). The reproducibility between the two sessions was high, with an overall agreement of 94.1%. Reproducibility decreased when information was not reported in the papers, with an agreement of 77.2%.

Conclusion

The validity and reproducibility of ChatGPT-4o were high for data extraction in systematic reviews. ChatGPT-4o qualified as a second reviewer for systematic reviews and showed potential for future advancements in summarizing data.

Introduction

Systematic reviews synthesize all available research and represent the highest level of evidence, which is crucial for knowledge transfer and informing best practices and clinical guidelines. Conducting systematic reviews is thus important but time-consuming, typically taking between 6 and 12 months, with some reviews extending longer for complex or voluminous topics [1, 2].

Artificial intelligence (AI), such as large language models (LLMs) (e.g., ChatGPT-4o by OpenAI), shows promise for certain procedures in a systematic review like literature search, screening, data extraction, analysis, and quality assessment [1]. Recent literature has explored the potential of AI in research methodologies, particularly in article screening. Van Dijk et al. investigated AI for title and abstract screening and found it time-saving, efficient, and useful when applied correctly [3]. Feng et al. evaluated AI for automated literature searches, using human investigators as a reference standard, and found a high recall rate but concluded that human involvement is indispensable for selecting relevant studies [4]. Ghosh et al. (2024) employed an In-Context Learning framework to extract PICO elements from clinical trials, achieving state-of-the-art results [5].

LLMs have shown promise in automating complex tasks, such as extracting data from clinical trial documents. Data extraction traditionally requires significant manual effort and expertise, and it directly affects the accuracy of conclusions. Errors or incomplete data can lead to misinterpretations and skew clinical recommendations. Manual extraction is labor-intensive, time-consuming, and prone to error, which impacts reproducibility and complicates handling large datasets. As the volume of published research grows, AI-based models are increasingly relevant for automating extraction, reducing human errors, and improving consistency.

There is consensus that AI is a promising tool for conducting systematic reviews, but it faces challenges and limitations that necessitate human assessment and emphasize the need for validation [6–8].

A 2021 systematic review concluded that the benefits of AI for data extraction were unclear, with the AI tools available at the time deemed insufficient and unreliable [6]. More recently, Santos et al. (2023) concluded that AI approaches such as machine learning and natural language processing models were efficient for systematic reviews and clinical guidelines but noted limitations that need addressing before full automation can ensure accuracy and efficiency [7]. In a recent study, Alyasiri et al. (2024) demonstrated that ChatGPT-4 shows promise in reference retrieval. However, the accuracy and reliability of AI-generated references remain a concern, as the model occasionally generates incorrect or fabricated citations. These findings underscore the importance of human oversight in AI-assisted research and provide context for evaluating ChatGPT-4’s capabilities in data extraction for systematic reviews [9].

On May 13, 2024, OpenAI launched its latest and improved LLM, ChatGPT-4o, which is capable of interpreting figures and tables. Our present study therefore explores whether this enhanced version can replace a human for data extraction during a systematic review.

Aim

This study aims to (1) evaluate the validity of ChatGPT-4o for data extraction in systematic reviews for randomized controlled trials (RCTs) compared to two independent human reviewers as the reference standard, and (2) test the reproducibility of ChatGPT-4o’s data extraction.

Methods

Protocol

The protocol was registered on OSF in November 2023 and updated in May 2024 to specify the aim and validation methods and report the use of the latest version of ChatGPT, ChatGPT-4o (https://osf.io/8gn4p/).

Design

To test the validity and reproducibility of ChatGPT-4o data extraction compared to traditional manual data extraction, we used papers from our ongoing systematic review on exercise to reduce fall risk, currently being performed by the author group [10]. The manual data extraction was performed by two independent reviewers, and a third reviewer was involved to settle any disagreements. The consensus results of the manual data extraction from the ongoing source review were used as the reference standard in this study.

The validity of ChatGPT-4o’s data extraction was tested by comparing the reference standard to two sessions of data extraction using ChatGPT-4o. Each session of ChatGPT-4o data extraction was performed independently by one of two authors (MMJ and MD) with separate ChatGPT-4o accounts. Data extraction in ChatGPT-4o was performed from 14 to 17 May 2024, following the release of the updated ChatGPT-4o version on 13 May 2024.

To test the reproducibility of ChatGPT-4o’s data extraction, we compared the results of the two sessions.

An overview of the study design is shown in Fig 1.

Fig 1. Overview of the study design.


This study (orange) originates from a source review (blue). Articles included in the source review were used for data extraction with ChatGPT-4o. The ’Consensus on extracted data’ from the source review was considered as the reference standard, against which ChatGPT-4o’s responses were tested for validity and reproducibility (black).

ChatGPT-4o data extraction

Questions (prompts) for each data point were written without special knowledge of LLMs, phrased as they would be to another researcher. These questions covered baseline information, intervention description, participant baseline information, drop-out rates, and evaluation results for two outcomes (daily-life falls and laboratory falls). The questions were inserted directly as prompts in ChatGPT-4o. The list of prompts can be found in the supplementary material (S1 Table).

The relevant article PDFs included in the source review were uploaded to ChatGPT-4o. Two independent authors (MD and MMJ) entered all prompts into ChatGPT-4o and copied the responses into an Excel sheet. Only the first answer was used, with no regeneration or modification of answers. The authors used independent ChatGPT-4o accounts from different computers with the setting “Improve the model for everyone” turned off, and a new chat was used for each study. Turning the setting “Improve the model for everyone” off means that the input given to ChatGPT-4o is not used for model training and therefore prevents interference between the two data extraction sessions on the same paper.

Reference standard

Data extraction in the source review was conducted according to usual quality standards. Two authors extracted data from the included studies independently and afterwards reached consensus on the extracted data. The consensus data was used as a reference standard for the current study. Half of the included studies (n = 11) in the ongoing review were randomly selected to test ChatGPT-4o’s ability compared to humans [10].

Comparison of data extracted by ChatGPT-4o with the reference standard

The data extracted by ChatGPT-4o were compared to the reference standard by the two independent authors who also entered the prompts into ChatGPT. The responses were assessed and categorized according to pre-specified categories (a brief coding sketch of the scheme is given after the list below). The categories and goals for validity were published in the OSF protocol (https://osf.io/8gn4p/). The categories were described as follows:

Category 1. Completely correct (only relevant information): to achieve this category, the results must fulfill the following criteria: a) correctly answer the question, b) no false information, c) no missing information, and d) no unnecessary information.

Category 2. Satisfactory (all the relevant information, but also non-relevant information): to achieve this category, the results must fulfill the following criteria: a) correctly answer the question, b) no missing information, c) no false information on the question asked, and d) additional or unnecessary information is acceptable, even if false, as long as the answer to the question is correct.

Category 3. Lacking information (some of the relevant information but not all): to achieve this category, the results must fulfill the following criteria: a) answers part of the question with correct information, b) missing some information, c) no false information on the question asked, and d) additional or unnecessary information is acceptable, even if false, as long as the answer to the question is correct.

Category 4. Missing all information (none of the relevant information): to achieve this category, the results must fulfill the following criteria: a) do not answer the question, and b) do not provide false information on the question.

Category 5. False data (any amount of false information i.e. hallucinations): To achieve this category, the results must answer the question with false and misleading information.
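For illustration, the five response categories can be represented as a small enumeration when tabulating ratings programmatically. The following is a minimal Python sketch with names of our own choosing; the study itself recorded the evaluations manually in Excel, so this is not code from the study.

from dataclasses import dataclass
from enum import IntEnum

class ResponseCategory(IntEnum):
    # The five pre-specified response categories; lower values are better.
    COMPLETELY_CORRECT = 1       # only relevant, correct information
    SATISFACTORY = 2             # correct answer plus unnecessary extras
    LACKING_INFORMATION = 3      # partially correct, some information missing
    MISSING_ALL_INFORMATION = 4  # no answer, but no false information
    FALSE_DATA = 5               # any false or misleading information (hallucination)

@dataclass
class DataPointRating:
    # One rated ChatGPT-4o response for a single prompt and paper.
    paper_id: str      # hypothetical identifier for the included trial
    prompt_id: int     # index into the 22 prompts (S1 Table)
    category: ResponseCategory

# Example: recording one rated data point (values are illustrative only)
rating = DataPointRating(paper_id="paper_01", prompt_id=7,
                         category=ResponseCategory.SATISFACTORY)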

Validity goals and expected utility of ChatGPT as a data extraction tool

Goals for acceptable validity and expected utility of ChatGPT as a reviewer were pre-defined, with categories ranging from “Single Reviewer” to “Useless” based on accuracy and proportion of false data, as stated in our protocol https://osf.io/8gn4p/. This assessment method was developed to ensure that the quality of the data extraction is as good or better than that of a single, second, or third reviewer. See Table 1.

Table 1. Response assessment categories.
Validity criteria: Expected utility of ChatGPT-4o
100% correct (answers Category 1 on all questions): Single reviewer
80–99% correct, < 80% of correct answers in Category 2, and < 10% false: Second reviewer
50–79% correct and < 20% false: Third reviewer
Less than 50% correct and/or > 20% false: Useless

Our pre-specified assessment of the validity of ChatGPT-4o based on the category and degree of correct and false answers (before each colon), and the consequent implementation of ChatGPT-4o (after each colon).
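Read as a decision rule, Table 1 can be written out as follows. This Python sketch is our own interpretation of the pre-specified thresholds, not code used in the study; the input percentages refer to the response categories defined above.

def expected_utility(pct_correct: float, pct_cat2_of_correct: float, pct_false: float) -> str:
    # pct_correct:         % of responses in Category 1 or 2
    # pct_cat2_of_correct: % of those correct responses that are Category 2
    # pct_false:           % of responses in Category 5 (false data)
    if pct_correct == 100 and pct_cat2_of_correct == 0:
        return "Single reviewer"   # Category 1 on all questions
    if pct_correct >= 80 and pct_cat2_of_correct < 80 and pct_false < 10:
        return "Second reviewer"
    if pct_correct >= 50 and pct_false < 20:
        return "Third reviewer"
    return "Useless"

# With the observed results (92.4% correct, 5.2% false; the Category 2 share is
# assumed here for illustration), the rule returns "Second reviewer".
print(expected_utility(pct_correct=92.4, pct_cat2_of_correct=30.0, pct_false=5.2))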

Statistical methods

Frequencies and percentages were used to describe categorical data. Validity was analyzed by pooling the data extracted by both raters/authors for each question, to account for the non-deterministic performance of the model. Proportions of each response category were visualized using histograms with 95% confidence intervals. Post hoc subgroup analyses were performed based on the presence or absence of information in the uploaded PDFs and the type of information.
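To show how such proportions and their confidence intervals can be reproduced, a minimal Python sketch follows. The original analyses were run in R 4.1.2; the use of statsmodels and of the Wilson interval here are our assumptions, since the paper does not name the interval method, so the bounds may differ slightly from those reported.

from statsmodels.stats.proportion import proportion_confint

n_total = 484    # stacked data points from both extraction sessions
n_correct = 447  # 'completely correct' or 'satisfactory' responses
n_false = 25     # 'false data' responses

for label, k in [("correct/satisfactory", n_correct), ("false data", n_false)]:
    low, high = proportion_confint(k, n_total, alpha=0.05, method="wilson")
    print(f"{label}: {k / n_total:.1%} (95% CI {low:.1%} to {high:.1%})")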

Reproducibility was tested by comparing the stability of the response categorization for each data point between the two data extractions. Because the agreement between the two data extractions was generally high, Kappa values were deemed unreliable due to the "Kappa paradox" [11]. Instead, we reported the percentage agreement as the primary measure together with Gwet’s AC2, which is more reliable when agreement is high [12]. This coefficient functions like a Kappa, ranging between 0 and 1, with values closer to 1 indicating better agreement. Quadratic weighting was used to penalize greater discrepancies between ratings.
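A minimal sketch of the two reproducibility measures is given below, assuming two raters (the two extraction sessions) rating the same items on the five ordered response categories. The study used R 4.1.2; this Python implementation of Gwet's AC2 with quadratic weights follows the standard formulation and is provided for illustration only.

import numpy as np

def quadratic_weights(q: int) -> np.ndarray:
    # Quadratic agreement weights for q ordered categories (1.0 on the diagonal).
    k = np.arange(q)
    return 1.0 - ((k[:, None] - k[None, :]) / (q - 1)) ** 2

def agreement_and_ac2(r1, r2, q: int = 5):
    # r1, r2: category codes (1..q) assigned to the same items in the two sessions.
    r1 = np.asarray(r1) - 1
    r2 = np.asarray(r2) - 1
    w = quadratic_weights(q)
    n = len(r1)

    percent_agreement = np.mean(r1 == r2)   # unweighted percentage agreement
    pa = np.mean(w[r1, r2])                 # weighted observed agreement

    # Chance agreement: pi_k is the mean proportion of ratings in category k
    counts = np.bincount(r1, minlength=q) + np.bincount(r2, minlength=q)
    pi = counts / (2 * n)
    t_w = w.sum()
    pe = t_w / (q * (q - 1)) * np.sum(pi * (1 - pi))

    return percent_agreement, (pa - pe) / (1 - pe)

# Hypothetical category codes for ten data points in the two sessions
session1 = [1, 1, 2, 5, 1, 3, 1, 2, 1, 4]
session2 = [1, 1, 2, 1, 1, 3, 1, 2, 1, 4]
agreement, ac2 = agreement_and_ac2(session1, session2)
print(f"Agreement: {agreement:.1%}, Gwet's AC2 (quadratic weights): {ac2:.2f}")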

All analyses were performed using R version 4.1.2.

Results

We included 22 prompts (S1 Table), each of whose ChatGPT-4o responses served as a data point, to be extracted from each of the 11 papers in the source systematic review. This resulted in 242 data points extracted by each of the two authors, comparing data from ChatGPT-4o directly with the standardized forms of the reference standard. Each data point was evaluated against one of the five response categories. Both datasets with evaluations are available as supplementary material (S1 and S2 Datasets). In the test of validity, the 242 data points extracted by each author were stacked for a total of 484 data points. Of the extracted data points, 48 (15.5%) were not reported in the studies, requiring ChatGPT to recognize this information as missing. The time to complete data extraction with ChatGPT was 3.5 hours within a week, while the human raters took approximately 25–30 hours over 6–7 weeks (part-time).

Agreement between AI and human data extraction

The overall number of data points categorized as “completely correct” or “satisfactory” was 447 (92.4%, 95% CI: 89.5% to 94.5%). The number of data points categorized as false data was 25 (5.2%, 95% CI: 3.4% to 7.4%). The results for each category are shown in Fig 2. Based on these results, ChatGPT meets the pre-specified goal for validity required to function as a second reviewer.

Fig 2. ChatGPT-4o response assessment.


Evaluation of ChatGPT-4o responses compared to the reference standard. Response categories included ’Completely correct,’ ’Satisfactory,’ ’Lacking information,’ ’Missing all information,’ and ’False information’.

Results stratified by the presence or absence of information in the uploaded PDFs revealed a lower number of "completely correct" and "satisfactory" responses but higher frequencies of "false information" when data points were not reported in the study (Fig 3). Similar tendencies were found when dividing results based on the category of extracted data, with outcome data showing lower frequencies of completely correct or satisfactory information (Fig 4).

Fig 3. Stratification of ChatGPT-4o responses by outcome reporting status.


The proportion of ChatGPT-4o responses in each response category stratified based on whether the articles reported the outcome or not.

Fig 4. Proportion of correct and satisfactory ChatGPT-4o responses across four data domains.


The proportion of ’Completely correct’ and ’Satisfactory’ responses from ChatGPT-4o compared to the reference standard across four data domains: General information, Population, Intervention and comparator, and Outcome.

Reproducibility of ChatGPT data extraction

The overall reproducibility between the two data extractions using ChatGPT was high, with an overall agreement of 94.1% and Gwet’s AC2 of 0.93 (0.89 to 0.96) (Table 2). However, reproducibility was lower when the information was not reported in the paper, with agreements of 77.2% and Gwet’s AC2 of 0.43 (0.10 to 0.75). This finding was further supported by the analysis of information domains, where agreement was lower for outcome data, and a larger proportion of information was not reported.

Table 2. Overall and domain-specific results of reproducibility of data extraction by ChatGPT-4o.

Domain Percentage agreement Gwet’s AC2 (95% CI)
Overall results
All datapoints 94.1% 0.93 (0.89 to 0.96)
Data to extract reported vs. not reported in paper
Reported in paper 98.2% 0.98 (0.96 to 0.99)
Not reported in the paper 77.2% 0.43 (0.10 to 0.75)
Based on domain
General information 98.3% 0.98 (0.96 to 1.00)
Population 95.4% 0.94 (0.89 to 1.00)
Intervention and comparator 99.3% 0.99 (0.98 to 1.00)
Outcome 78.7% 0.60 (0.35 to 0.85)

Results of reproducibility across all data points (top), including analyses based on whether data was reported or not in the paper, and by individual domains of extracted data (bottom).

Discussion

This study aimed to evaluate the validity of ChatGPT-4o for data extraction in systematic reviews for RCTs compared to a final consensus dataset established by two independent human reviewers and secondly to test its reproducibility. We found a 92.4% agreement between ChatGPT-4o and the human datasets, with only 5.2% completely false data points. Overall, reproducibility was high; however, when requested information was absent in the paper, the likelihood of false data was higher.

We are the first to explore the validity and reproducibility of ChatGPT-4o in data extraction for systematic reviews. Previous studies did not extensively examine data extraction using prior versions of ChatGPT. In a brief feature article, the use of ChatGPT for data extraction and Risk of Bias assessment was explored with the first public version, ChatGPT-3.5, released by OpenAI in November 2022 [8]. The study found that while ChatGPT can aid in the Risk of Bias and data extraction process, it could not fully replace human expertise. The report also emphasized the importance of clear instructions for effective AI assistance and noted the limitations of ChatGPT-3.5. The updated ChatGPT-4o has an enhanced ability to extract relevant information from large datasets, directly analyze uploaded PDFs, and interpret data from tables and figures within documents, making it a more robust tool for systematic reviews. In the present study, we show that while ChatGPT-4o cannot fully replace human expertise, it can serve as a second reviewer, reducing the workload and time otherwise required of a second human researcher. Exploring the role of AI for data extraction is particularly relevant, as it may also address the existing inefficiencies and limitations of manual approaches. Compared with error rates in human reviewers’ data extraction (17.7% for a single rater and 14.5% for double raters), ChatGPT-4o produced errors in only 5.2% of data points, which qualifies it as an efficient second reviewer [13]. A recent study on the use of ChatGPT-3.5 Turbo in systematic reviews concluded that the model may be used as a second reviewer for title and abstract screening [14].

While we demonstrated that ChatGPT-4o can serve as a second reviewer for data extraction, there is still room for improvement. A significant portion of the errors found in this study can be classified as hallucinations, where the model generated plausible but factually incorrect information, often occurring when specific data points were not explicitly reported in the articles. In such cases, instead of indicating missing information, ChatGPT-4o attempted to infer or generate data based on patterns from its training, leading to inaccuracies.

To overcome the limitation of incorrect data reporting, we prompted the AI to flag missing data rather than infer it. This was achieved by first asking whether the study collected the relevant data, followed by a request for specific details, e.g., ’Does the study collect data on laboratory falls? If yes, how is a laboratory fall defined, and how is information on laboratory falls collected?’ Despite this approach, the model would still provide incorrect data or hallucinate. Further optimization or prompt engineering, designed to ask more specific, targeted questions or to guide the model to recognize missing data, could improve ChatGPT-4o’s data extraction accuracy and reliability.

While we manually entered each prompt in the web version of ChatGPT-4o, which took 3.5 hours, this time could be drastically reduced by using a purpose-built plugin or pipeline for ChatGPT-4o.
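As an illustration of what such a purpose-built pipeline could look like, the sketch below loops over prompts against locally extracted PDF text via the OpenAI API. It is not the workflow used in this study: we used the web interface, which parses uploaded PDFs (including tables and figures) natively, whereas this sketch assumes plain text extracted with pypdf, so results would not be identical. The model name, prompt list, and file handling here are illustrative assumptions.

from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Example of the two-step style of prompt described above
PROMPTS = [
    "Does the study collect data on laboratory falls? If yes, how is a laboratory "
    "fall defined, and how is information on laboratory falls collected?",
]

def extract_from_paper(pdf_path: str) -> list[str]:
    # Run each extraction prompt against the text of one trial report.
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    answers = []
    for prompt in PROMPTS:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You extract data for a systematic review. If the requested "
                            "information is not reported, answer 'Not reported'."},
                {"role": "user", "content": f"{prompt}\n\nArticle text:\n{text}"},
            ],
        )
        answers.append(response.choices[0].message.content)
    return answers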

We have only explored the use of ChatGPT for data extraction; this focus is nevertheless a significant advantage, as data extraction is among the most time-consuming steps in systematic reviews [13]. However, there is still a need to investigate the use of the latest version of ChatGPT or other similar AIs for other time-consuming aspects of systematic reviews, such as risk of bias assessment and abstract screening.

Our study is based on a source systematic review of RCTs exploring training interventions for fall prevention in older adults. We assume ChatGPT-4o would produce similar results when used for data extraction in other fields. While this study demonstrates the potential of ChatGPT-4o as an auxiliary tool for data extraction in systematic reviews, it is important to acknowledge its limitations. Validity and reproducibility were markedly lower when the information was not reported in the primary article. Newer RCTs often follow standardized reporting guidelines to minimize missing information [15]. When ChatGPT-4o is applied for data extraction in other fields or other types of studies, the risk of missing information must be considered, along with its potential to overlook complex contextual relationships. Although the model can handle large volumes of information, it may miss subtle interactions between study variables. Another concern is the presence of inherent biases arising from the data the model was trained on, which may overrepresent certain perspectives while underrepresenting others, potentially skewing or misinterpreting the results of a systematic review. Future research should consider using AI-based data extraction alongside human reviewers: AI tools can assist in automating routine tasks and handling large datasets, while human expertise can focus on interpreting complex relationships and resolving ambiguous cases.

With an agreement between the two sessions of ChatGPT data extraction of 94.1%, the reproducibility was good; however, it was lower when the information was not reported in the paper, with an agreement of 77.2%. Furthermore, ChatGPT-4o performed best at answering general information about articles but faced challenges in the results section. Still, the tool is rapidly evolving, and the time required for data extraction is significantly shorter compared to human reviewers. Difficulties primarily arose when data was not reported, which could impact validity depending on the research area and quality of reporting.

Ethical considerations in AI-driven systematic reviews

As AI models like ChatGPT-4o are increasingly integrated into research, ethical issues arise, particularly concerning accountability and transparency. While AI tools offer promising advancements in automating data extraction and improving efficiency, researchers must retain control to ensure these technologies are used responsibly [16]. Hence, it is important to follow recent AI reporting guidelines [17].

It is essential that the role of AI is clearly disclosed in the research process. Accountability remains with the human researchers, who must oversee the AI’s outputs and verify the accuracy of the extracted data. AI tools should be treated as complementary aids, with human reviewers ultimately responsible for validating the results and ensuring the overall quality of the review.

Addressing ethical issues will be critical as AI tools become prevalent in research, ensuring their use enhances, rather than undermines, the integrity of systematic reviews.

Strengths and limitations

The study is based on only 11 RCTs in falls research, potentially limiting generalizability. While the sample size is relatively small and lacks stratification by study type, we ensured a robust methodology by including a large number of data points (484 in total), which allows for a meaningful comparison with ChatGPT-4o. Expanding the study to include a broader range of topics or study types would, however, improve the external validity of our findings. Future research should explore ChatGPT-4o’s performance across a wider variety of study designs and incorporate more diverse data sources to further assess its generalizability across a wider range of systematic reviews.

ChatGPT-4o’s responses were compared to a manual consensus reference standard, which is naturally subject to some degree of interpretation by the reviewers. Consequently, the evaluation of ChatGPT-4o’s accuracy depends on how closely its responses align with the human reviewers’ consensus, rather than an entirely objective measure. However, it is important to recognize that subjective variability is a common factor in data extraction for systematic reviews, even when performed solely by human reviewers.

The majority of the data extraction items in our study—such as mean age, gender distribution, and the number of dropouts—are based on objective information with little room for subjective influence, as there is typically only one correct answer for these data points. While we acknowledge that some level of subjectivity may persist, this approach is consistent with standard procedures in systematic reviews, where consensus-based human extraction is commonly used as the reference standard.

Finally, the authors do not have specific expertise in LLMs, and better-written prompts might improve the results. However, our approach reflects the knowledge of the average researcher rather than expertise in the inner workings of LLMs, providing real-world results on usability for the general researcher.

Conclusion

We found the agreement between ChatGPT-4o and human reviewers to be greater than 80% for data extraction in systematic reviews, with a 92.4% agreement between ChatGPT-4o and human reviewers and a 5.2% error rate in the data extracted by ChatGPT-4o. Hence, ChatGPT-4o can be used as a second reviewer in the data extraction process for systematic reviews. Large language models have developed rapidly in recent years and hold great promise for summarizing data to support clinical decision making, as illustrated by the present evaluation of the advancement and validity of ChatGPT-4o. Performance may improve further, but it remains important to retain control of the process through further studies monitoring the accuracy and reliability of LLMs.

Supporting information

S1 Table. List of prompts and domains.

(DOCX)

pone.0313401.s001.docx (19KB, docx)
S1 Dataset. Data extraction from reviewer 1.

Full data extraction in ChatGPT-4o and evaluation of responses compared to the reference standard.

(XLSX)

pone.0313401.s002.xlsx (131.7KB, xlsx)
S2 Dataset. Data extraction from reviewer 2.

Full data extraction in ChatGPT-4o and evaluation of responses compared to the reference standard.

(XLSX)

pone.0313401.s003.xlsx (126.8KB, xlsx)

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? Syst Rev. 2023 Apr 29;12(1):72. doi: 10.1186/s13643-023-02243-z
  • 2. Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017 Feb 27;7(2):e012545. doi: 10.1136/bmjopen-2016-012545
  • 3. van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, van der Palen J, Doggen CJM, Lenferink A. Artificial intelligence in systematic reviews: promising when appropriately used. BMJ Open. 2023 Jul 7;13(7):e072254. doi: 10.1136/bmjopen-2023-072254
  • 4. Feng Y, Liang S, Zhang Y, Chen S, Wang Q, Huang T, et al. Automated medical literature screening using artificial intelligence: a systematic review and meta-analysis. J Am Med Inform Assoc. 2022 Jul 12;29(8):1425–32. doi: 10.1093/jamia/ocac066
  • 5. Ghosh M, Mukherjee S, Ganguly A, Basuchowdhuri P, Naskar SK, Ganguly D. AlpaPICO: Extraction of PICO frames from clinical trial documents using LLMs. Methods. 2024 Jun;226:78–88. doi: 10.1016/j.ymeth.2024.04.005
  • 6. Blaizot A, Veettil SK, Saidoung P, Moreno-Garcia CF, Wiratunga N, Aceves-Martins M, et al. Using artificial intelligence methods for systematic review in health sciences: A systematic review. Res Synth Methods. 2022 May;13(3):353–62. doi: 10.1002/jrsm.1553
  • 7. Santos ÁO Dos, da Silva ES, Couto LM, Reis GVL, Belo VS. The use of artificial intelligence for automating or semi-automating biomedical literature analyses: A scoping review. J Biomed Inform. 2023 Jun;142:104389. doi: 10.1016/j.jbi.2023.104389
  • 8. Mahuli SA, Rai A, Mahuli AV, Kumar A. Application ChatGPT in conducting systematic reviews and meta-analyses. Br Dent J. 2023 Jul;235(2):90–2. doi: 10.1038/s41415-023-6132-y
  • 9. Alyasiri OM, Salman AM, Akhtom D, Salisu S. ChatGPT revisited: Using ChatGPT-4 for finding references and editing language in medical scientific articles. J Stomatol Oral Maxillofac Surg. 2024 Mar 21;101842. doi: 10.1016/j.jormas.2024.101842
  • 10. Szabo IZ, Danielsen MB, Norgaard JE, Lord S, Okubo Y, Andersen S, et al. The Effects of Perturbation-based Balance Training on Daily-life and Laboratory Falls in Community-dwelling: A Systematic Review and Meta-Analysis. PROSPERO 2022 CRD42022343368. Available from: https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42022343368
  • 11. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990;43(6):543–9. doi: 10.1016/0895-4356(90)90158-l
  • 12. Wongpakaran N, Wongpakaran T, Wedding D, Gwet KL. A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med Res Methodol. 2013 Apr 29;13:61. doi: 10.1186/1471-2288-13-61
  • 13. Buscemi N, Hartling L, Vandermeer B, Tjosvold L, Klassen TP. Single data extraction generated more errors than double data extraction in systematic reviews. J Clin Epidemiol. 2006 Jul;59(7):697–703. doi: 10.1016/j.jclinepi.2005.11.010
  • 14. Tran VT, Gartlehner G, Yaacoub S, Boutron I, Schwingshackl L, Stadelmaier J, et al. Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses. Ann Intern Med. 2024 Jun;177(6):791–9. doi: 10.7326/M23-3389
  • 15. Lapping K, Marsh DR, Rosenbaum J, Swedberg E, Sternin J, Sternin M, et al. The positive deviance approach: challenges and opportunities for the future. Food Nutr Bull. 2002 Dec;23(4 Suppl):130–7.
  • 16. Astorp MS, Emmersen J, Andersen S. ChatGPT in medicine: A novel case of Dr Jekyll and Mr Hyde. Ethics Med Public Health. 2023 Aug;29:100923.
  • 17. Flanagin A, Pirracchio R, Khera R, Berkwits M, Hswen Y, Bibbins-Domingo K. Reporting Use of AI in Research and Scholarly Publication-JAMA Network Guidance. JAMA. 2024 Apr 2;331(13):1096–8. doi: 10.1001/jama.2024.3471

Decision Letter 0

Weiqiang (Albert) Jin

29 Aug 2024

PONE-D-24-28092: ChatGPT-4o Can Serve as the Second Rater for Data Extraction in Systematic Reviews (PLOS ONE)

Dear Dr. Motzfeldt Jensen,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: After going through the manuscript and the reviewers' comments, most reviewers and I suggest a Minor Revision. Please review the following details.

==============================

Please submit your revised manuscript by Oct 13 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Weiqiang (Albert) Jin, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:   

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process.

3. Please include a separate caption for each figure in your manuscript.

4. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear authors:

After going through the manuscript and the reviewers' comments, I suggest a Minor Revision (as an associate editor).

The study is promising and offers some great insights into using AI / gpt-4o for data extraction in systematic reviews, but there are a few areas that need a bit more work. The key points to address include adding more details about the methods—like how the sample size was chosen, the specific prompts used, and how the human reviewers did their work. Also, expanding the results and discussion sections to clarify the types of errors found, ways to reduce them, and any potential limitations would help. It's also important to make sure the manuscript follows the proper guidelines for reporting AI research and to polish the language for better readability.

Please consider all the suggestions given by the four reviewers and revise it, and provide us with a one by one detailed response letter.

regards,

Weiqiang Jin.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Yes

Reviewer #4: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Editor and Authors

Thank you for submitting your manuscript. I have carefully read through the entire paper and believe this study presents an innovative approach with notable potential for enhancing data extraction processes in systematic reviews. The experimental design is robust, and the results are promising. However, there are a few areas where the manuscript could benefit from minor revisions to further improve clarity and impact. In summary, I recommend minor revision. Below are my specific comments, which I hope will assist you in revising and improving the paper.

1.Insufficient Detail in Background Description: The background section mentions that systematic reviews are a key tool for translating clinical trial evidence into guidelines but does not sufficiently explain why data extraction is a crucial step in systematic reviews and how it impacts the quality of the review. It is recommended to add a discussion on the limitations of current manual data extraction methods and the potential benefits of AI intervention in this section.

2.Results Section Needs More Comparative Detail: The results section compares data extraction by ChatGPT-4o with that of human reviewers but does not provide detailed information about the data extraction process used by human reviewers. To enhance the persuasiveness of the comparison, it is suggested to include detailed information about the background and data extraction criteria of human reviewers to ensure fairness and accuracy. Additionally, a description of the human reviewers' workflow could help better understand the differences and advantages between AI and human review processes.

3.Emphasis on Study Limitations in Discussion: While the conclusion affirms the potential of ChatGPT-4o as an auxiliary tool for systematic reviews, the study does not address its limitations, such as the potential for AI to overlook complex contextual relationships or inherent biases. It is recommended to discuss these potential limitations in the discussion section and explore ways to address these issues in future research.

4.Accuracy of Keywords: The keywords cover the core concepts of the study, but it may be beneficial to include more targeted terms such as “automated data extraction” and “machine learning” to enhance search relevance.

5.Data Sources and Representativeness: The study mentions randomly selecting 11 articles to test the capabilities of ChatGPT-4o but does not elaborate on the specifics of the random selection method or the representativeness of the sample. It is recommended to provide more information on how the sample was ensured to be representative, such as whether different types of studies or data reporting quality were considered, to ensure the generalizability of the test results.

Incorporating these adjustments will enhance the manuscript's overall quality and provide readers with a clearer understanding of the study's implications and applications.

Thank you for considering these suggestions.

Best regards,

Reviewer #2: This manuscript contributes valuable insights into the use of AI, specifically ChatGPT-4o, for data extraction in systematic reviews. However, there are some points that need further clarification and refinement.

1. External Validity:

1.1 The study's focus on a single ongoing systematic review raises concerns about the generalizability of the findings. Systematic reviews can encompass a wide range of research questions, including interventional, diagnostic, and prognostic studies, etc. It would be beneficial to consider whether the inclusion of different types of studies or systematic reviews might improve the validity and reproducibility of ChatGPT-4o's data extraction capabilities.

1.2 The sample size of 11 studies in one systematic review used to evaluate validity is relatively small. This limited sample size may not be sufficient to draw robust conclusions about the general applicability of ChatGPT-4o. Further justification for the adequacy of this sample size would strengthen the manuscript.

2. Introduction:

2.1 The Introduction would benefit from a more comprehensive review of existing literature on the use of large language models (LLMs) for data extraction. For example, including references such as DOI:10.1016/j.ymeth.2024.04.005, among others, would help contextualize the novelty and contributions of the current study.

3. Methods:

3.1 ChatGPT-4o data extraction:

The manuscript lacks details on the specific prompts used for data extraction with ChatGPT-4o. Providing this information would enhance the transparency and reproducibility of the study.

3.2 Comparison of data extracted by ChatGPT 4o with the reference standard:

It is unclear how the comparison between ChatGPT-4o and human data extraction was conducted. Specifically, were the evaluations conducted by one or two researchers? Clarifying this aspect of the methodology is important for assessing the rigor of the study.

3.3 Validity goals and expected utility of ChatGPT as a data extraction tool:

Providing a clearer reference to the categories for validity assessment would improve the clarity of the methods.

3.4 The manuscript mentions 22 data points extracted from each study but does not specify what these data points are. Detailing the specific data points would aid readers in understanding the scope of the data extraction process.

4. Results:

4.1 Consider providing examples of the typical data extraction results in the appendix. This would allow readers to better assess the performance of ChatGPT-4o.

4.2 The manuscript mentions that 5.2% of the data extracted by ChatGPT-4o was incorrect. A brief discussion of the types of hallucinations observed and potential strategies for identifying or mitigating these errors would be useful.

Other Comments:

5. Adherence to Reporting Guidelines:

The manuscript should align with reporting guidelines for AI-related clinical research, such as those mentioned in Flanagin et al. (2024), "Reporting Use of AI in Research and Scholarly Publication-JAMA Network Guidance." Providing details on the prompts used, the time frame of ChatGPT-4o usage, and other relevant methodological aspects would be beneficial.

6. Language and Clarity:

The manuscript contains some language that could be further refined for clarity. For example, the sentence "this finding was supported when looking across information domains where agreement was lower for outcome data and a larger proportion of information was not reported" could be simplified for better readability.

Reviewer #3: This is an interesting and novel study exploring the use of AI, LLM specifically, in collecting data for systematic reviews. However, I have a few comments that I would like the authors of this study to address.

1. Please provide how you arrived that a sample size of 11 would be adequate to test the validity and reproducibility of ChatGPT 4o.

2. I would be interested to know the agreement rate between the two authors? Suppose there was frequent disagreement between the two assessors, requiring a third author to intervene. Does this mean the data extracted by human standards was subjective and hence assessing ChatGPT against this subjective measure would have not shown its true capacity for extracting data from full-text papers?

3. How did you arrive at the prespecified assessment of the validity? Were there any previous similar studies using this model?

4. What was the prespecified goal of determining acceptable reproducibility?

5. All figures and tables should have a legend summarising and explaining their results. All figures and tables from this study are missing legends. Graphs are also missing titles.

Reviewer #4: 1. The study notes that the questions were written without special knowledge of LLMs, which reflects an average user’s experience. However, exploring how optimized prompts could enhance AI performance might provide valuable insights. This could lead to recommendations on best practices for researchers using ChatGPT-4o.

2. The study mentions that 5.2% of the data extracted by ChatGPT-4o was false, particularly when information was not reported in the papers. A deeper analysis of these errors—beyond reporting their frequency—could provide useful guidance on how to mitigate such risks.

3. While the manuscript discusses the validity and reproducibility of ChatGPT-4o, it might benefit from a brief discussion on the ethical implications of using AI in systematic reviews, particularly regarding accountability and transparency when AI-generated data is integrated into research outputs.

4. The study is based on a specific set of 11 RCTs focusing on fall prevention in older adults. While the methodology is robust, the generalizability to other fields or types of studies remains uncertain. What are the author's suggestions to expand the study to include a broader range of topics or study types that would strengthen the conclusions?

5. To add relevant information that supports the manuscript and benefits the readers, a suggested paper for investigating how ChatGPT can find and return real references. You may cite it as follows: https://doi.org/10.1016/j.jormas.2024.101842.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Suodi zhai

Reviewer #3: No

Reviewer #4: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Jan 7;20(1):e0313401. doi: 10.1371/journal.pone.0313401.r002

Author response to Decision Letter 0


3 Oct 2024

Dear editor and reviewers,

I would like to sincerely thank you for your time and effort in reviewing our manuscript. We appreciate your valuable feedback and thoughtful suggestions, which have helped to clarify and refine our manuscript. We hope that the revised version meets your expectations, and we look forward to your further assessment.

Sincerely,

Mette Motzfeldt Jensen

On behalf of the co-authors.

Attachment

Submitted filename: Response to reviewers.docx

pone.0313401.s004.docx (52.7KB, docx)

Decision Letter 1

Weiqiang (Albert) Jin

24 Oct 2024

ChatGPT-4o Can Serve as the Second Rater for Data Extraction in Systematic Reviews

PONE-D-24-28092R1

Dear Dr. Authors of Paper PONE-D-24-28092R1,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Weiqiang (Albert) Jin, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Congratulations! The reviewers have expressed their appreciation for your work and have acknowledged its quality by recommending acceptance of your article. Well done.

Before the final proofreading, please ensure that all citations in the manuscript adhere to the publication's formatting guidelines.  Additionally, verify the accuracy of information for each referenced article, prioritizing published dois over preprints like arXiv.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author:

I recommend citing the following two references that utilize GPT for information processing:

ChatAgri: Exploring potentials of ChatGPT on cross-linguistic agricultural text classification [DOI: 10.1016/j.neucom.2023.126708]

Prompt learning for metonymy resolution: Enhancing performance with internal prior knowledge of pre-trained language models [DOI: 10.1016/j.knosys.2023.110928]

Reviewer #1: All comments have been addressed

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #4: Thank you for addressing the comments raised in the previous round of review. I feel that this manuscript is now acceptable for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #4: No

**********

Acceptance letter

Weiqiang (Albert) Jin

5 Nov 2024

PONE-D-24-28092R1

PLOS ONE

Dear Dr. Motzfeldt Jensen,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Weiqiang (Albert) Jin

Academic Editor

PLOS ONE
