Abstract
Objectives
We developed and validated a large language model (LLM)-assisted system for conducting systematic literature reviews in health technology assessment (HTA) submissions.
Materials and Methods
We developed a five-module system using abstracts acquired from PubMed: (1) literature search query setup; (2) study protocol setup using population, intervention/comparison, outcome, and study type (PICOs) criteria; (3) LLM-assisted abstract screening; (4) LLM-assisted data extraction; and (5) data summarization. The system incorporates a human-in-the-loop design, allowing real-time PICOs criteria adjustment. This is achieved by collecting information on disagreements between the LLM and human reviewers regarding inclusion/exclusion decisions and their rationales, enabling informed PICOs refinement. We generated four evaluation sets covering relapsed and refractory multiple myeloma (RRMM) and advanced melanoma to evaluate the LLM's performance in three key areas: (1) recommending inclusion/exclusion decisions during abstract screening, (2) providing valid rationales for abstract exclusion, and (3) extracting relevant information from included abstracts.
Results
The system demonstrated relatively high performance across all evaluation sets. For abstract screening, it achieved an average sensitivity of 90%, F1 score of 82, accuracy of 89%, and Cohen's κ of 0.71, indicating substantial agreement between human reviewers and LLM-based results. In identifying specific exclusion rationales, the system attained accuracies of 97% and 84%, and F1 scores of 98 and 89 for RRMM and advanced melanoma, respectively. For data extraction, the system achieved an F1 score of 93.
Discussion
Results showed high sensitivity, Cohen's κ, and PABAK for abstract screening, and high F1 scores for data extraction. This human-in-the-loop AI-assisted SLR system demonstrates the potential of GPT-4's in-context learning capabilities by eliminating the need for manually annotated training data. In addition, this LLM-based system offers subject matter experts greater control through prompt adjustment and real-time feedback, enabling iterative refinement of PICOs criteria based on performance metrics.
Conclusion
The system demonstrates potential to streamline systematic literature reviews, reducing time, cost, and human error while enhancing evidence generation for HTA submissions.
Keywords: GPT-4, large language model, human-in-the-loop AI, systematic literature review, information extraction
Introduction
Background and significance
Systematic literature reviews (SLRs) are comprehensive analyses of existing literature that address specific research questions, serving as the highest form of evidence in evidence-based medicine.1 These reviews are crucial for informing clinical practice, guiding research, and shaping healthcare policy. However, the traditional approach to SLRs is time-consuming, labor-intensive, and costly, often taking 6-16 months or even up to two years for the most comprehensive reviews.2–4
Health technology assessment (HTA) is a comprehensive evaluation of health technologies, bridging research and policymaking.5 SLR is an integral part of HTA that synthesizes evidence on clinical efficacy, safety, economic, and humanistic aspects of treatments.6 HTA requirements for SLRs vary by country, but up-to-date clinical efficacy and safety reviews (clinical SLRs) are required by almost all HTA agencies.6 Living SLRs, introduced nearly a decade ago to rapidly synthesize evolving evidence, have gained popularity due to their timely impact on healthcare and policy.7 However, continuously updating literature data is resource-intensive with traditional SLR methods.7 Recent evaluations have identified potential resource savings of 61-79% for title and abstract screening phases using machine learning methods,8 indicating an opportunity to employ large language models (LLMs) in living SLRs or HTAs.
The SLR process involves literature search, screening, retrieval, data extraction, appraisal, and reporting.9 The quality of SLRs depends heavily on comprehensive database searches, thorough abstract screening, rigorous study selection, and the expertise of the project team. Literature screening is often considered the most time-consuming part of the SLR2 and is prone to reviewer disputes.10 Systematic literature reviews often retrieve thousands of citations through initial keyword-based database searches. However, most citations are excluded during screening because keyword searching alone does not consider contextual relevance to specific research questions; approximately 85% of citations are typically excluded based solely on titles and abstracts.11,12 This process is challenging to replicate, adjust, or tailor due to the substantial human effort required to review such a large volume of citations. Overlooking relevant articles may lead to bias,13 whereas including irrelevant studies can be corrected during full article screening. The manual nature of abstract screening introduces potential threats to internal validity, such as fatigue, researcher bias, and inconsistent application of selection criteria.14 HTA agencies recommend that clinical SLRs primarily include clinical trial results and sometimes real-world evidence studies, using the population, intervention/comparison, outcome, and study type (PICOs) framework for selection and documenting reasons for inclusion or exclusion.6
Previous attempts to automate title and abstract screening, and data extraction often relied on labor-intensive labeling and computationally intensive machine/deep learning models.15–21 These models were typically limited by several factors: sensitivity to potentially inaccurate training labels generated by human experts, imbalanced data distributions, SLR-specific constraints, and lack of generalizability. Recent advancements in LLMs have made it possible to automate literature screening and data extraction in a scalable, generalizable way without requiring annotation sets or model training, with studies showing varying results. Guo et al22 demonstrated GPT-4's high accuracy in screening clinical research abstracts compared to human reviewers. Alshami et al23 used GPT-3.5 to automate SLRs, achieving high accuracy in article classification but low accuracy in information extraction for Internet of Things (IoT) applications in water management. Syriani et al24 employed GPT-3.5 Turbo for software engineering SLRs, outperforming traditional machine learning classifiers despite using simpler screening criteria. Khraisha et al25 utilized GPT-4 for both abstract and full article screening, as well as data extraction, reporting poor to moderate performance for abstract screening, human-like levels for full article screening, and moderate performance for data extraction. Notably, none of these studies leveraged the PICOs framework, a key methodology in developing well-formed clinical questions for evidence-based medicine. This gap highlights an opportunity for further research in applying LLMs to SLRs while incorporating established clinical research methodologies.
We propose an AI-assisted SLR (AI-SLR) system using GPT-4 in a zero-shot setting to facilitate clinical SLRs. The system covers literature search, abstract screening, data extraction, and reporting. It incorporates the PICOs framework for abstract screening and allows user-specified data field descriptions for extraction. A performance dashboard helps users revise and optimize PICOs criteria. We evaluated the system's performance using studies of relapsed and refractory multiple myeloma (RRMM) and advanced melanoma to examine its generalizability across multiple indications.
Key contributions:
Introduction of the PICOs framework for structured abstract screening prompts, with comprehensive efficiency evaluation and pilot data extraction from abstracts.
Development of a human-in-the-loop system for real-time PICOs criteria refinement, collecting AI-human disagreement information for informed modifications.
Materials and methods
The proposed AI-SLR system consists of five modules shown in Figure 1: (1) searching the PubMed and/or Embase databases using medical terms with Boolean strategies, (2) setting up the study protocol using the PICOs framework, (3) performing abstract screening by PICOs criteria using GPT-4, (4) implementing targeted data extraction using GPT-4 on accepted abstracts, and (5) generating structured summary reports to assist researchers in developing a detailed data synthesis strategy. Specifically, the PICOs criteria (module 2) are an input for the LLM prompt in module 3, and the data field descriptions (module 4) become part of the prompt to extract data. Users can provide background knowledge related to disease areas to guide the LLM in screening and extraction. Users can iterate between modules 2 and 3 until screening performance meets their expectations, such as retrieving at least 90% of relevant citations.
Figure 1.
Overview of the AI-assisted systematic literature review system.
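To make the module composition concrete, the following Python skeleton sketches the control flow between the modules. It is a minimal illustration, not the authors' implementation: the callables `screen`, `review`, `refine`, and `extract` are hypothetical stand-ins for Module 3 screening, the human review step, Module 2 refinement, and Module 4 extraction, respectively, and the 90% recall threshold mirrors the example expectation above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Decision:
    include: bool
    exclusion_reasons: List[str] = field(default_factory=list)  # PICOs-based reasons

def run_ai_slr(
    abstracts: List[str],                                    # output of Module 1
    picos: str,                                              # Module 2: PICOs criteria text
    screen: Callable[[str, str], Decision],                  # Module 3: LLM screening
    review: Callable[[List[str], List[Decision]], Tuple[float, list]],  # human-in-the-loop
    refine: Callable[[str, list], str],                      # PICOs adjustment from feedback
    extract: Callable[[str], dict],                          # Module 4: data extraction
    target_recall: float = 0.90,
) -> List[dict]:
    """Iterate Modules 2-3 until screening performance is acceptable,
    then extract data from the included abstracts (Module 4)."""
    while True:
        decisions = [screen(a, picos) for a in abstracts]
        recall, feedback = review(abstracts, decisions)      # sample-based human review
        if recall >= target_recall:                          # eg, >=90% of relevant citations
            break
        picos = refine(picos, feedback)                      # informed PICOs refinement
    included = [a for a, d in zip(abstracts, decisions) if d.include]
    return [extract(a) for a in included]                    # feeds Module 5 summarization
```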
Module 1: searching literature database for abstract retrieval
We used the following queries to retrieve relevant abstracts from the PubMed database, resulting in 3713 abstracts for RRMM and 1720 for advanced melanoma.
RRMM: (("clinical trial" OR "real world" OR "RWE" OR "review") AND ("myeloma" OR "kahler")) AND ("relapsed" OR "recurrent" OR "refractory" OR "R/R" OR "RRMM")
Melanoma: ((melanoma) AND ("clinical trial" OR "RCT" OR "review") AND "systemic") AND (("2010"[Date - Create] : "3000"[Date - Create]))
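Such queries can be executed programmatically. The snippet below is one possible implementation using Biopython's Entrez client; the choice of client, the email address, and the `retmax` limit are assumptions, as the paper does not specify how the queries were run.

```python
# Sketch of Module 1 with Biopython's Entrez API (an assumption; the paper
# does not state which client was used to query PubMed).
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI requires a contact email

query = ('(("clinical trial" OR "real world" OR "RWE" OR "review") '
         'AND ("myeloma" OR "kahler")) '
         'AND ("relapsed" OR "recurrent" OR "refractory" OR "R/R" OR "RRMM")')

# Retrieve matching PMIDs (the paper reports 3713 for this RRMM query).
handle = Entrez.esearch(db="pubmed", term=query, retmax=5000)
pmids = Entrez.read(handle)["IdList"]

# Fetch titles and abstracts for the retrieved PMIDs.
handle = Entrez.efetch(db="pubmed", id=",".join(pmids),
                       rettype="abstract", retmode="text")
abstracts_text = handle.read()
```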
Module 2: setting up PICOs inclusion and exclusion criteria
The PICOs criteria used in this study were generated by reviewers with both systematic literature review (SLR) expertise and clinical background in the relevant therapeutic areas. Figure 2 describes the disease-specific PICOs criteria for RRMM and advanced melanoma.
Figure 2.
Descriptions of PICOs criteria used for refractory and relapsed multiple myeloma and advanced melanoma. (A) Refractory and relapsed multiple myeloma PICOs. (B) Advanced melanoma PICOs.
Module 3: iterative abstract screening and PICOs adjustment
This module combines the PICOs criteria generated in Module 2 with the specific instructions shown in Table S1 to form a prompt for GPT-4 to perform abstract screening. In addition, the module requires users to review a random sample of at least 10% of retrieved abstracts, or 30 abstracts for searches yielding fewer than 300 results. When users agree with GPT-4's screening decision and related reasons for excluding an abstract, the module records these results. When users disagree with GPT-4's screening decision and/or reasons for exclusion, the system prompts them to indicate which specific PICOs criteria they disagree with, or to input free text describing any reasons not captured by the current criteria. The module reports performance metrics including precision, recall, and accuracy based on this review process. These mechanisms help users refine the PICOs criteria until they are satisfied with the system outputs. Details of the iterative process between Modules 2 and 3 are provided in Methods S1. A simplified sketch of the screening call follows.
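The sketch below illustrates how a PICOs-conditioned screening call might look. The prompt text is a simplified stand-in for the actual instructions in Table S1, and the model name and temperature setting are assumptions.

```python
# Illustrative screening call for Module 3. The authors' instructions are in
# Table S1 (not reproduced here); this prompt is a simplified stand-in.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def screen_abstract(abstract: str, picos_criteria: str) -> str:
    prompt = (
        "You are screening abstracts for a clinical systematic literature "
        "review. Apply the PICOs criteria below. If you are uncertain or the "
        "abstract lacks sufficient information, INCLUDE it.\n\n"  # favors sensitivity, per the paper
        f"PICOs criteria:\n{picos_criteria}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Answer with 'Include' or 'Exclude'. If excluding, name the specific "
        "PICOs criterion and give your rationale."
    )
    response = client.chat.completions.create(
        model="gpt-4",                 # model family used in the paper
        temperature=0,                 # assumption: deterministic screening
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

In the full module, each disagreement, together with the contested PICOs criterion or free-text rationale, would be logged to drive the refinement loop described above.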
Module 4: extracting information for data fields of interest
In this work, we identified three kinds of information of interest: (1) study details, such as study cohort, interventions, trial phase, and registry number; (2) patient characteristics, such as participant age and gender; and (3) study outcomes, such as tumor response and adverse events of interest, together with their related information. Among these, the study outcomes of interest are predefined, and their related information is extracted, including the specific study group, number of patients, percentage of patients, median, hazard ratio, and other important data. The detailed descriptions are shown in Table 1, the study outcomes of interest are in Table S2, and the prompt used for data extraction is provided in Table S3; a simplified sketch of assembling such field descriptions into an extraction prompt follows Table 1.
Table 1.
Information types and descriptions of data fields for instructing GPT-4.
| Information type | Data field | Description used to instruct GPT-4 |
|---|---|---|
| Study details | Study cohort | The main population/cohort involved in the study along with their sizes |
| | Interventions | Interventions mentioned in the abstract |
| | Publication type | Choose from these four options: Original Research Article, Meta-analysis, Review, Other |
| | Study design | Choose the study design from these five options: Clinical trial study, Real world evidence study, Systematic review, Other/unspecified review, Other/unspecified study |
| | Trial phase | The phase of the clinical trial mentioned in the abstract |
| | Supplementary information | All supplementary appendix-related information mentioned in the abstract |
| | Registry number | Registry number mentioned in the abstract |
| Patient characteristics | Age | Patient age |
| | Gender | Patient gender |
| Study outcomes | Study outcome | Common study outcomes and adverse event categories (Table S2) |
| | Group description | Show “whole population” if the outcome is related to the whole study population, or the respective study group name if the outcome is related to any specific group |
| | Number of patients | The number of patients associated with this study outcome or adverse event. If multiple numbers are mentioned for specific sub-cohorts, please show in the format “sub-cohort: number” for each sub-cohort separated by commas. “NA” if no information is available in the abstract |
| | Percentage of patients | The percentage of patients associated with this study outcome or adverse event. If multiple percentages are mentioned for specific sub-cohorts, please show in the format “sub-cohort: percentage” for each sub-cohort separated by commas. “NA” if no information is available in the abstract |
| | Median | The median value associated with this study outcome. If multiple median values are mentioned for specific sub-cohorts, please show in the format “sub-cohort: median value” for each sub-cohort separated by commas. “NA” if no information is available in the abstract |
| | Hazard ratio | The hazard ratio associated with this study outcome. If multiple hazard ratios are mentioned for specific sub-cohorts, please show in the format “sub-cohort: hazard ratio” for each sub-cohort separated by commas. “NA” if no information is available in the abstract |
| | Other information | Any additional information associated with this study outcome or adverse event including negated values such as “no” or “not observed” or “no significant differences.” Please show in the format “description: value” such as “risk reduction: value” and “partial response: no” |
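The following sketch shows how field descriptions like those in Table 1 could be assembled into an extraction prompt requesting structured output. It is illustrative only; the authors' actual prompt is given in Table S3.

```python
# Sketch of Module 4: turning Table 1 field descriptions into an extraction
# prompt that requests structured (JSON) output. Simplified illustration only.

FIELD_DESCRIPTIONS = {
    "Study cohort": "The main population/cohort involved in the study along with their sizes",
    "Interventions": "Interventions mentioned in the abstract",
    "Registry number": "Registry number mentioned in the abstract",
    # ... remaining fields from Table 1
}

def build_extraction_prompt(abstract: str) -> str:
    fields = "\n".join(f"- {name}: {desc}"
                       for name, desc in FIELD_DESCRIPTIONS.items())
    return (
        "Extract the following data fields from the abstract. "
        "Return a JSON object keyed by field name; use \"NA\" when the "
        "abstract gives no information for a field.\n\n"
        f"Fields:\n{fields}\n\nAbstract:\n{abstract}"
    )
```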
Module 5: data summary
This module consolidates the inclusion/exclusion decisions and exclusion reasons from abstract screening, together with the extracted information, into a comprehensive overview. Users can examine, sort, and download the synthesized data. The module produces concise reports that include essential details such as trial ID, PMID, clinical outcomes, article title, authors, publication year, and other extracted data points.
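A minimal sketch of this consolidation step is shown below, using pandas as an assumed reporting backend and fabricated placeholder records; the paper does not name the library used.

```python
# Sketch of Module 5: consolidating screening decisions and extracted fields
# into a sortable, downloadable report. Records below are placeholders.
import pandas as pd

records = [
    {"PMID": "12345678", "decision": "Include", "exclusion_reason": None,
     "trial_id": "NCT00000000", "title": "Example included trial", "year": 2021},
    {"PMID": "23456789", "decision": "Exclude",
     "exclusion_reason": "Population: not RRMM", "trial_id": None,
     "title": "Example excluded study", "year": 2019},
]

report = pd.DataFrame(records).sort_values(["decision", "year"])
report.to_csv("slr_summary.csv", index=False)  # downloadable summary
```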
Evaluation
We evaluated the AI-SLR system's abstract screening and data extraction modules using human-generated reference standards. Performance was assessed using precision, recall (sensitivity), accuracy, specificity, negative predictive value (NPV), and F-1 scores. Sensitivity is the preferred metric for abstract screening in clinical SLRs because relevant citations missed at this stage cannot be recovered later, whereas falsely included ones are removed during full-text review. NPV measures the proportion of correctly excluded abstracts, assessing the system's exclusion accuracy. We evaluated the system's potential as an independent reviewer by comparing its performance to human reviewers using the human-generated reference set.26 Cohen's κ was used as the primary inter-rater reliability metric, measuring agreement adjusted for chance.27,28 To address Cohen's κ's limitations with imbalanced datasets, we also employed prevalence-adjusted bias-adjusted kappa (PABAK), which corrects for prevalence and bias effects.29
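These statistics can all be derived from a 2x2 confusion matrix of system-versus-human include/exclude decisions. The worked example below back-calculates the counts for RRMM evaluation set 1 from Table 2 (the actual confusion matrices are in Table S5) and reproduces its reported recall of 89%, precision of 80%, Cohen's κ of 0.74, and PABAK of 0.76.

```python
# Evaluation metrics from a 2x2 confusion matrix of include/exclude decisions.
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    n = tp + fp + fn + tn
    po = (tp + tn) / n                       # observed agreement (= accuracy)
    # Chance agreement from the marginals, as in Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "recall": tp / (tp + fn),            # sensitivity, preferred for screening
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        "accuracy": po,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "cohens_kappa": (po - pe) / (1 - pe),
        "pabak": 2 * po - 1,                 # prevalence- and bias-adjusted kappa
    }

# Counts back-calculated from the RRMM evaluation set 1 row of Table 2.
print(screening_metrics(tp=16, fp=4, fn=2, tn=27))
```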
Four evaluation datasets were generated by researchers with advanced medical degrees and 5 to 15 years of experience conducting SLRs for HTA submissions. The first dataset consisted of randomly selected abstracts (49 for RRMM and 50 for advanced melanoma) reviewed by authors J.G. and C.L. for PICOs relevance, with 18 RRMM and 21 advanced melanoma abstracts deemed relevant. The second dataset comprised PICOs reasons for excluded abstracts. The third dataset, based on relevant abstracts from the first set, included study details, patient characteristics, and outcomes extracted by human reviewers (as shown in Table 1). The fourth dataset was created from two vendor-conducted SLRs typically used for HTA submissions: an RRMM study that screened 3665 abstracts with 2071 (56.5%) meeting inclusion criteria, and an advanced melanoma study that screened 2753 abstracts with 145 (5.3%) meeting inclusion criteria. To maintain high sensitivity, we instructed the model to include abstracts whenever it was not certain or confident, whether because of insufficient information in the abstract or for other reasons.
We compared the system's inclusion/exclusion decisions with evaluation sets 1 and 4; the PICOs criteria used for the RRMM sets are shown in Figure 2A, and those applied to the advanced melanoma sets 1 and 4 are shown in Figure 2B and Table S4, respectively. The GPT-4-generated PICOs reasons for excluded abstracts were assessed against evaluation set 2. The data extraction module was evaluated using evaluation set 3: study details and patient characteristics were assessed field by field, and study outcomes were considered correct only when all seven data fields matched between the system's output and the evaluation set.
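A minimal sketch of this all-or-nothing matching rule follows; the exact string normalization is our assumption, as the paper does not specify how field values were compared.

```python
# An extracted study outcome counts as correct only if all seven fields
# agree with the human reference (normalization below is an assumption).
OUTCOME_FIELDS = ("outcome", "group_description", "number_of_patients",
                  "percentage_of_patients", "median", "hazard_ratio",
                  "other_information")

def outcome_is_correct(extracted: dict, reference: dict) -> bool:
    norm = lambda v: str(v).strip().lower()
    return all(norm(extracted.get(f, "NA")) == norm(reference.get(f, "NA"))
               for f in OUTCOME_FIELDS)
```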
Results
The overall performance measures for inclusion/exclusion decision-making during abstract screening are shown in Table 2, with detailed confusion matrices in Table S5. Table 3 presents the performance measures for PICOs reasons for excluded abstracts. The frequencies of exclusion reasons identified by GPT-4 for the RRMM and advanced melanoma abstracts it predicted to exclude in evaluation set 4 are shown in Tables S6 and S7. The performance measures for data extraction are shown in Table 4. Sample outputs of the system for abstract screening and data extraction are provided in Tables S8-S10.
Table 2.
Performance of GPT-4 in screening titles and abstracts against human reviewers' output.
| Evaluation set | Total no. of abstracts | No. of included by human experts | Recall (%) | Precision (%) | F-1 Score | Specificity (%) | Accuracy (%) | Cohen’s Kappa | PABAKb |
|---|---|---|---|---|---|---|---|---|---|
| RRMMa—Set 1 | 49 | 18 | 89 | 80 | 84 | 87 | 88 | 0.74 | 0.76 |
| Advanced melanoma—Set 1 | 50 | 21 | 90 | 90 | 90 | 93 | 92 | 0.84 | 0.84 |
| RRMMa—Set 4 | 3665 | 2071 | 97 | 75 | 85 | 59 | 80 | 0.57 | 0.61 |
| Advanced melanoma—Set 4 | 2753 | 145 | 82 | 60 | 69 | 97 | 96 | 0.67 | 0.92 |
| Average Performance | NA | NA | 90 | 76 | 82 | 84 | 89 | 0.71 | 0.78 |
a RRMM: relapsed and refractory multiple myeloma.
b PABAK: prevalence-adjusted and bias-adjusted kappa.
Table 3.
Performance of GPT-4 in identifying exclusion criteria.a
| Exclusion Category | Criterion | TP | FP | TN | FN | Spec | NPV | Acc | P | R | F-1 Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| a. Performance of GPT-4 in identifying exclusion criteria for RRMM abstracts compared to human reviewers in evaluation set 2 | |||||||||||
| Population | Studies exclusively involving patients under the age of 18. | 47 | 0 | 1 | 1 | 100 | 50 | 97 | 100 | 97 | 98 |
| Population | Studies exclusively centered on newly diagnosed or treatment-naive multiple myeloma patients and did not include recurrent/refractory multiple myeloma (RRMM) patients. | 46 | 0 | 1 | 2 | 100 | 33 | 95 | 100 | 95 | 97 |
| Population | Studies not targeting multiple myeloma (MM) patients. | 45 | 0 | 3 | 1 | 100 | 75 | 97 | 100 | 97 | 98 |
| Intervention/Comparators | Studies that do not mention treatment for multiple myeloma. | 44 | 1 | 4 | 0 | 80 | 100 | 97 | 97 | 100 | 98 |
| Intervention/Comparators | Studies primarily involving stem cell transplantation (SCT) and total body irradiation before SCT as interventions when not for 2nd line of therapy. | 45 | 0 | 3 | 1 | 100 | 75 | 97 | 100 | 97 | 98 |
| Outcomes | Studies that lack reporting on any outcomes mentioned in the inclusion criteria. | 45 | 0 | 4 | 0 | 100 | 100 | 100 | 100 | 100 | 100 |
| Study type | Studies not either clinical trials or real world evidence study | 22 | 3 | 24 | 0 | 88 | 100 | 93 | 88 | 100 | 93 |
| Other | Studies not in English. | 49 | 0 | 0 | 0 | NA | NA | 100 | 100 | 100 | 100 |
| Average Performance | | | | | | 95 | 76 | 97 | 98 | 98 | 98 |
| b. Performance of GPT-4 in identifying exclusion criteria for advanced melanoma abstracts compared to human reviewers in evaluation set 2 | |||||||||||
| Population | Studies exclusively involving patients under the age of 12. | 49 | 0 | 1 | 0 | 100 | 100 | 100 | 100 | 100 | 100 |
| Population | Studies exclusively centered on in situ (stage 0), early stage (stage I-II), or resectable stage III melanoma and do not include the 1st line of systemic therapy patients. | 36 | 0 | 0 | 14 | NA | 0 | 72 | 100 | 72 | 83 |
| Population | Studies not targeting melanoma patients. | 43 | 2 | 5 | 0 | 71 | 100 | 96 | 95 | 100 | 97 |
| Intervention/Comparators | Studies primarily involve local, regional, or intraregional therapy options only without systemic therapies. | 34 | 2 | 3 | 11 | 60 | 21 | 74 | 94 | 75 | 83 |
| Intervention/Comparators | Neoadjuvant, or Adjuvant therapy studies. | 44 | 0 | 5 | 1 | 100 | 83 | 98 | 100 | 97 | 98 |
| Intervention/Comparators | Studies related to the second or later lines of therapies after progression on or after the first line of systemic therapy. | 46 | 3 | 1 | 0 | 25 | 100 | 94 | 93 | 100 | 96 |
| Outcomes | Studies that lack reporting on any outcomes. | 37 | 8 | 2 | 3 | 20 | 40 | 78 | 82 | 92 | 86 |
| Study type | Studies not either clinical trials or systematic review or Review article—Other/unspecified review or Meta-analysis | 22 | 7 | 12 | 9 | 63 | 57 | 68 | 75 | 70 | 72 |
| Other | Studies not in English. | 50 | 0 | 0 | 0 | NA | NA | 100 | 100 | 100 | 100 |
| Other | Studies that do not include “randomized clinical trial study”. | 24 | 9 | 8 | 9 | 47 | 47 | 64 | 72 | 72 | 72 |
| Average Performance | | | | | | 61 | 61 | 84 | 91 | 88 | 89 |
a TP (true positive): the abstract should be included, and GPT-4 includes it. TN (true negative): the abstract should be excluded due to a specific criterion, and GPT-4 excludes it for the same reason. FP (false positive): the abstract should be excluded due to a specific criterion, but GPT-4 includes it. FN (false negative): the abstract should be included, but GPT-4 excludes it due to a specific criterion. NA (not applicable): none of the abstracts were considered as either true negative or false negative for this criterion. Spec: specificity; NPV: negative predictive value; Acc: accuracy; P: precision; R: recall. Values for Spec, NPV, Acc, P, and R are percentages (%).
Table 4.
Performance of GPT-4 in data extraction.
| a. Performance of GPT-4 in extracting study details, patient characteristics, and study outcomes |||||||
|---|---|---|---|---|---|---|
| Evaluation Case | | Number of abstracts | Number of data fields | Precision (%) | Recall (%) | F-1 Score |
| RRMM | Study details | 18 | 144 | 100 | 100 | 100 |
| | Patient characteristics | 18 | 36 | 100 | 100 | 100 |
| | Study outcomesa | 18 | 95 | 88 | 83 | 86 |
| Advanced Melanoma | Study details | 21 | 168 | 94 | 99 | 97 |
| | Patient characteristics | 21 | 42 | 80 | 100 | 89 |
| | Study outcomesa | 21 | 98 | 86 | 83 | 84 |
| Average Performance | | NA | NA | 91 | 94 | 93 |
| b. Performance of GPT-4 in extracting details for study outcomes |||||||||
|---|---|---|---|---|---|---|---|---|
| Outcome value category | RRMM |||| Advanced Melanoma ||||
| | Number of Evaluated Data Fields | Precision (%) | Recall (%) | F-1 Score | Number of Evaluated Data Fields | Precision (%) | Recall (%) | F-1 Score |
| Number of patients | 24 | 81 | 88 | 85 | 34 | 75 | 77 | 76 |
| Percentage of patients | 40 | 84 | 82 | 83 | 56 | 80 | 82 | 81 |
| Hazard ratio | 6 | 71 | 63 | 67 | 11 | 86 | 75 | 80 |
| Median | 24 | 89 | 89 | 89 | 20 | 96 | 96 | 96 |
| Other information | 35 | 88 | 74 | 81 | 14 | 79 | 69 | 73 |
a Each study outcome consists of seven elements: outcome, group description, number of patients, percentage of patients, hazard ratio, median, and other relevant information. For the system's extraction to be considered correct, it must accurately extract all seven elements. Precision, recall, and F-1 scores are based on the number of data fields.
We performed an error analysis for both abstract screening and data extraction. False negatives occur when the system predicts abstracts as “Excluded” while human experts deem them as “Included.” Conversely, false positives occur when the system predicts abstracts as “Included” while human experts deem them as “Excluded.”
Based on the RRMM evaluation set 1, we identified 2 false negatives and 4 false positives. One false negative occurred because the system excluded a study that involved lymphoma patients. The other false negative resulted from the system concluding that the study primarily focused on stem cell transplantation. Among the four false positives, three were excluded by human reviewers because the studies were neither clinical trials nor real-world evidence studies. The fourth false positive was excluded by human reviewers because the abstract did not mention any multiple myeloma treatment.
Based on the advanced melanoma evaluation set 1, we identified 2 false negatives and 2 false positives. The two false negatives occurred because the system incorrectly predicted that: (1) systemic therapies were not involved, and (2) the study did not focus on first-line systemic therapy. The two false positives occurred because the system failed to recognize that the studies did not meet the PICOs criteria for intervention/comparators and study type.
Based on the RRMM evaluation set 4, we identified 60 false negatives and 661 false positives, from which 30 false negatives and 30 false positives were randomly selected for further error analysis. Upon review, the human reviewers (JG and CL) agreed with the system's predictions for 23 of the 30 false negative cases, concluding that these 23 cases were correctly excluded by the system. The human reviewers further confirmed that the exclusion reasons given by the system were correct, indicating that the results generated by the clinical SLR vendor were incorrect for these cases. Our analysis of the 30 false positives revealed two major reasons for misclassification: challenges in distinguishing original research studies from review studies (ie, neither clinical trials nor real-world evidence studies), and incorrectly including studies that lack outcomes.
Based on the advanced melanoma evaluation set 4, we identified 26 false negatives and 81 false positives; all 26 false negatives and 30 randomly selected false positives were further reviewed. The human reviewers agreed with the system's decision to exclude abstracts for 15 of the 26 false negatives, and considered an additional 5 of the 26 ambiguous and likely to cause disagreement even among human reviewers. Regarding the 30 false positives, the reviewers judged the system correct in including abstracts for 8 of the 30 cases. The remaining 22 were misclassified, as the system had difficulty identifying exclusion reasons related to non-randomized clinical trial studies (ie, retrospective analyses, observational studies, or phase I trials).
For data extraction, the system performed well in extracting study details and patient characteristics. However, for study outcomes extraction, which involves capturing more detail, the system had difficulty identifying the study groups associated with the outcomes. Specifically, for RRMM, 2 of the 18 abstracts contained incorrect study group predictions, whereas for melanoma, 4 of the 21 abstracts contained incorrect groups. For instance, for one RRMM abstract, the system missed 2 of 4 study groups associated with the “Tumor response—Objective response rate” outcome and falsely identified an additional study group. The category-level outcome values indicate that F-1 scores are low for hazard ratio and other information. A probable reason is the relative scarcity of these two categories in the data (8 instances of hazard ratio for RRMM and 12 for melanoma, and 11 instances of other information for melanoma). Moreover, other information captures a wider variety of outcome values that are not represented by the other four categories (eg, “no significant differences,” “recurrent disease”).
Discussion
We developed a human-in-the-loop AI-assisted SLR system using GPT-4's zero-shot capabilities for abstract screening with PICOs criteria and data extraction in clinical SLRs. This approach allows iterative refinement of PICOs criteria based on performance metrics and creation of data element descriptions for extraction. By eliminating the need for manually annotated training data, the system offers a time-efficient and highly generalizable solution for various research topics.
GPT-4's abstract screening performance showed high sensitivity (82-97%), F1-scores (69-90), and accuracy (80-96%). Inter-rater reliability measures (Cohen's κ: 0.57-0.84; PABAK: 0.61-0.92) varied between evaluation sets. Sensitivity remained consistently high (82-97%), aligning with clinical SLR practices of inclusive abstract screening, as irrelevant articles can be excluded during full-text review. Cohen's κ scores showed stronger agreement between the system and evaluation set 1 (RRMM: 0.74, moderate; advanced melanoma: 0.84, strong) than with evaluation set 4 (0.57, weak; 0.67, moderate).26,27 This discrepancy may be due to the different methods used to generate evaluation sets 1 and 4. Evaluation set 1 was created by two of the authors (JG and CL) using only title and abstract information, the same information presented to GPT-4. In contrast, evaluation set 4 was generated by clinical SLR vendors, who may have considered additional information such as citation metadata and their prior knowledge of the related therapeutic area. Furthermore, the clinical SLR vendor that generated the advanced melanoma set claimed to have used an AI-assisted tool with an abstract screening process involving an active learning model,30 which may have had significant implications for the final inclusion/exclusion decisions. Upon reviewing a random subset of 56 disagreed cases between the system and advanced melanoma evaluation set 4, the authors (JG and CL) disagreed with 23 decisions made by the clinical SLR vendor but agreed with the decisions made by the system. Clinical SLR for RRMM is more challenging than for advanced melanoma because the relevant RRMM literature is more diverse and complex, owing to the lack of a standard of care and the large number of interventions (more than 40 regimens) evaluated in various study settings (ranging from RCTs and single-arm trials to observational studies based on claims databases). Hanegraaf et al26 reported an average κ score of 0.82 for abstract screening in human literature reviews, based on 45 clinical SLR articles. They also found that 37 surveyed researchers expected κ scores between 0.6 and 0.9 for publishable reviews. Our κ scores from evaluation set 1 fell within this expected range, while those from evaluation set 4 (0.57 and 0.67) were at or below the lower end.
GPT-4's data extraction performance, compared to evaluation set 3, showed precision, recall, and F1 scores ranging from 80-100%, 83-100%, and 84-100, respectively (Table 4a). Study outcomes, comprising seven data fields (Table 1), were deemed correct only when all fields were accurately extracted. Itemized performance metrics are given in Table 4b. Hazard ratio extraction for RRMM showed lower performance (precision: 71%, recall: 63%, F1: 67), possibly due to the limited sample (n = 6).
Our study achieved higher sensitivity (average 89.64%) than previous GPT-based abstract screening studies by Guo et al,22 Alshami et al,23 and Khraisha et al25 (81%, 80%, and 42%, respectively). This improvement may be attributable to our prompt strategy instructing GPT-4 to include citations even with low confidence. While previous studies reported consistently high specificity (90-93%), our study showed high specificity in 3 of 4 cases (87-97%), the exception being RRMM in evaluation set 4 (59%). Notably, this set had an atypically high relevance rate (56.5%) compared with usual SLRs (3-20%).
Our study's Cohen's κ scores for abstract screening were consistently higher than those reported by Guo et al22 (0.71 vs 0.23) and Khraisha et al25 (0.71 vs 0.34). PABAK exceeded Cohen's κ by at most 0.04 for three evaluation cases but by 0.25 for advanced melanoma in evaluation set 4 (Table 2). This difference arises because PABAK accounts for category distribution (included/excluded). Evaluation set 1 and the RRMM portion of evaluation set 4 had relatively balanced categories, resulting in only slight PABAK increases. However, for advanced melanoma in evaluation set 4, which was highly imbalanced (5.3% relevance rate), PABAK increased to 0.92. This observation aligns with findings from Guo et al22 and Khraisha et al.25 In Guo et al's study, highly imbalanced evaluation sets led to a PABAK (0.93) far above Cohen's κ (0.23). Conversely, Khraisha et al's balanced evaluation set yielded identical PABAK and Cohen's κ values (0.34).
Our data extraction sensitivity reached 94% (Table 4) for included abstracts, surpassing Khraisha et al's 75% sensitivity using full texts, despite our more stringent evaluation requiring all seven outcome fields to be correct. We omitted accuracy, Cohen's κ, and PABAK due to the lack of true negative data.
Seven software platforms (NestedKnowledge,31 SWIFT ActiveScreener,32 DistillerSR,33 EPPI-Reviewer,34 LaserAI,35 Made.AI,36 and EasySLR37) support multiple SLR stages, some using active learning to prioritize potentially relevant references without fully automating screening.30 SWIFT-ActiveScreener estimates screening completeness, potentially allowing early termination of manual screening; typically, 95% of relevant references are found after screening 40% of the total.32 While these platforms mainly support manual data extraction, RobotReviewer38 focuses on automated extraction using machine learning and neural networks, though interpretability remains challenging. EPPI-Reviewer recently integrated GPT-4 for abstract information extraction, but evaluation results are pending.39 LaserAI developed an AI-assisted literature review system that spans literature search to full-text data extraction, highlighting its value for conducting living SLRs.40 While some researchers have published review articles using LaserAI,41 published benchmarking data for the current system are limited. No peer-reviewed publications are available for Made.AI and EasySLR.
Most HTA agencies require clinical SLRs, conducted by at least two reviewers, to evaluate drug effectiveness against standard care. PICOs requirements vary by country, resulting in different outcomes for the same SLRs, and searches are typically expected to occur 3 months to 1 year before submission.42 EU HTA legislation requires responding to PICOs requested by EU Joint Clinical Assessment (JCA) assessors and co-assessors within tight timelines: 100 days for new molecular entities (NMEs) and just 60 days for line extensions.43 These varying requirements and tight timelines burden life science researchers in industry, academia, HTA agencies, and regulatory bodies.
Unlike traditional machine learning models requiring extensive training data and tuning, LLMs offer subject matter experts more control through prompt adjustment and real-time feedback, streamlining the process. LLMs could enable living SLRs, enhancing efficiency in screening and data extraction and potentially reducing the number of human reviewers needed. They may improve consistency, reduce errors and fatigue (key threats to the internal validity of a manually conducted SLR), and comprehensively capture exclusion reasons that human reviewers typically do not record under time pressure. Our study demonstrates GPT-4's reasonable performance in abstract screening, classification of PICOs-specific exclusion reasons, and data extraction. Performance may be influenced by prompt quality, particularly the clarity of inclusion/exclusion criteria. As LLMs advance, we anticipate further performance improvements. Future work will focus on several key areas. First, we plan to assess the system's performance in medical fields beyond oncology to examine its broader generalizability. Second, we will evaluate the effectiveness of the current human-in-the-loop mechanism. Third, we aim to extend the system to handle screening and data extraction using the full text of articles. Fourth, feedback collected from human experts could enable future versions of the system to incorporate a reinforcement learning process for automated refinement of PICOs criteria. Finally, we plan to benchmark the system in real-world settings, measuring time savings for reviewers with both SLR expertise and clinical background in the relevant therapeutic areas.
Conclusion
We developed a generalizable, end-to-end LLM-based AI-SLR system. To our knowledge, this is the first system to use PICOs criteria to instruct an LLM for abstract screening. The system includes a human-in-the-loop module that displays real-time LLM performance, allowing end users to adjust their prompts accordingly. Results showed high sensitivity, Cohen's κ, and PABAK for abstract screening, and high F1 scores for data extraction. Our system can potentially reduce the time, cost, and human errors associated with traditional SLRs, ultimately contributing to more timely and comprehensive evidence generation.
Acknowledgment
We thank Dr Jingcheng Du for insightful discussions.
Contributor Information
Ying Li, Regeneron Pharmaceuticals, Inc., Tarrytown, NY 10591, United States.
Surabhi Datta, IMO Health, Inc., Rosemont, IL 60018, United States.
Majid Rastegar-Mojarad, IMO Health, Inc., Rosemont, IL 60018, United States.
Kyeryoung Lee, IMO Health, Inc., Rosemont, IL 60018, United States.
Hunki Paek, IMO Health, Inc., Rosemont, IL 60018, United States.
Julie Glasgow, IMO Health, Inc., Rosemont, IL 60018, United States.
Chris Liston, IMO Health, Inc., Rosemont, IL 60018, United States.
Long He, IMO Health, Inc., Rosemont, IL 60018, United States.
Xiaoyan Wang, IMO Health, Inc., Rosemont, IL 60018, United States.
Yingxin Xu, Regeneron Pharmaceuticals, Inc., Tarrytown, NY 10591, United States.
Author contributions
Ying Li (Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing), Surabhi Datta (Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review & editing), Majid Rastegar-Mojarad (Data curation, Formal analysis, Methodology, Project administration, Writing – original draft, Writing – review & editing), Kyeryoung Lee (Data curation, Formal analysis, Investigation, Resources, Writing – original draft), Hunki Paek (Data curation, Formal analysis, Resources), Julie Glasgow (Data curation, Formal analysis, Resources), Chris Liston (Data curation, Formal analysis, Investigation), Long He (Resources, Software), Xiaoyan Wang (Conceptualization), and Yingxin Xu (Conceptualization, Funding acquisition, Investigation, Resources, Supervision, Writing – original draft, Writing – review & editing)
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Conflicts of interest
Y.L. and Y.X. are currently employees of Regeneron Pharmaceuticals, Inc. S.D, M.R.-M, K.L., H.P., J.G., C.L., L.H., and X.W. are currently employees of Intelligence Medical Objective, Inc. The affiliations played no role in the design, execution, interpretation, or reporting of this research. The authors affirm that the manuscript is an honest, accurate, and transparent account of the study being reported. All views and opinions expressed in this manuscript are solely those of the authors and do not necessarily represent the views of the companies.
Data availability
The data underlying this article will be shared on reasonable request to the corresponding author.
References
- 1. Chandler J, Cumpston M, Li T, et al. Cochrane Handbook for Systematic Reviews of Interventions. Wiley; 2019.
- 2. Borah R, Brown AW, Capers PL, et al. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7:e012545.
- 3. Michelson M, Reuter K. The significant cost of systematic reviews and meta-analyses: a call for greater involvement of machine learning to assess the promise of clinical trials. Contemp Clin Trials Commun. 2019;16:100443.
- 4. Fiorini N, Canese K, Starchenko G, et al. Best match: new relevance search for PubMed. PLoS Biol. 2018;16:e2005343.
- 5. World Health Organization. Health technology assessment. https://www.who.int/health-topics/health-technology-assessment
- 6. Wright C, Swanston A, Nicholson L, et al. HTA44 systematic literature review requirements for health technology assessment in European markets. Value Health. 2024;27:S253.
- 7. Thokala P, Srivastava T, Smith R, et al. Living health technology assessment: issues, challenges and opportunities. Pharmacoeconomics. 2023;41:227-237.
- 8. Abogunrin S, Queiros L, Witzmann A, et al. ML1 do machines perform better than humans at systematic review of published literature? A case study of prostate cancer clinical evidence. Value Health. 2020;23:S404.
- 9. Tawfik GM, Dila KAS, Mohamed MYF, et al. A step by step guide for conducting a systematic review and meta-analysis with simulation data. Trop Med Health. 2019;47:46-49.
- 10. Wang Z, Nayfeh T, Tetzlaff J, et al. Error rates of human reviewers during abstract screening in systematic reviews. PLoS One. 2020;15:e0227742.
- 11. Rathbone J, Carter M, Hoffmann T, et al. Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Syst Rev. 2015;4:6.
- 12. Polanin JR, Pigott TD, Espelage DL, et al. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res Synth Methods. 2019;10:330-342.
- 13. Gartlehner G, Affengruber L, Titscher V, et al. Single-reviewer abstract screening missed 13 percent of relevant studies: a crowd-based, randomized controlled trial. J Clin Epidemiol. 2020;121:20-28.
- 14. Mallett R, Hagen-Zanker J, Slater R, et al. The benefits and challenges of using systematic reviews in international development research. J Develop Effect. 2012;4:445-455.
- 15. Du J, Soysal E, Wang D, et al. Machine learning models for abstract screening task - a systematic literature review application for health economics and outcome research. BMC Med Res Methodol. 2024;24:108.
- 16. O'Mara-Eves A, Thomas J, McNaught J, et al. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4:5-22.
- 17. Blaizot A, Veettil SK, Saidoung P, et al. Using artificial intelligence methods for systematic review in health sciences: a systematic review. Res Synth Methods. 2022;13:353-362.
- 18. Kebede MM, Le Cornet C, Fortner RT. In-depth evaluation of machine learning methods for semi-automating article screening in a systematic review of mechanistic literature. Res Synth Methods. 2023;14:156-172.
- 19. Khalil H, Ameen D, Zarnegar A. Tools to support the automation of systematic reviews: a scoping review. J Clin Epidemiol. 2022;144:22-42.
- 20. Moreno-Garcia CF, Jayne C, Elyan E, et al. A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Decision Anal J. 2023;6:100162.
- 21. Du J, Wang D, Lin B, et al. Use of deep learning for full-text data elements extraction for systematic literature review tasks. Preprint. 2024. https://doi.org/10.21203/rs.3.rs-4426541/v1
- 22. Guo E, Gupta M, Deng J, et al. Automated paper screening for clinical reviews using large language models: data analysis study. J Med Internet Res. 2024;26:e48996.
- 23. Alshami A, Elsayed M, Ali E, et al. Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems. 2023;11:351.
- 24. Syriani E, David I, Kumar G. Screening articles for systematic reviews with ChatGPT. J Comput Languages. 2024;80:101287.
- 25. Khraisha Q, Put S, Kappenberg J, et al. Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Res Synth Methods. 2024;15:616-626.
- 26. Hanegraaf P, Wondimu A, Mosselman JJ, et al. Inter-reviewer reliability of human literature reviewing and implications for the introduction of machine-assisted systematic reviews: a mixed-methods review. BMJ Open. 2024;14:e076912.
- 27. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22:276-282.
- 28. Park CU, Kim HJ. Measurement of inter-rater reliability in systematic review. Hanyang Med Rev. 2015;35:44-49.
- 29. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46:423-429.
- 30. Schmidt L, Sinyor M, Webb RT, et al. A narrative review of recent tools and innovations toward automating living systematic reviews and evidence syntheses. Z Evid Fortbild Qual Gesundhwes. 2023;181:65-75.
- 31. Sauca M, Tarchand R, Kallmes K. HTA361 living systematic review (LSR) in health technology assessment (HTA): current guidance, methods, and challenges. Value Health. 2023;26:S390.
- 32. Howard BE, Phillips J, Tandon A, et al. SWIFT-Active Screener: accelerated document screening through active learning and integrated recall estimation. Environ Int. 2020;138:105623.
- 33. Kamra S, Hyderboini R, Sirumalla Y, et al. MSR70 pilot study to evaluate efficiency of DistillerSR's artificial intelligence (AI) tool over manual screening process in literature review. Value Health. 2022;25:S532.
- 34. Thomas J, Graziosi S, Brunton J, Ghouze Z, O'Driscoll P, Bond M, Koryakina A. EPPI-Reviewer: advanced software for systematic reviews, maps and evidence synthesis. 2022. https://eppi.ioe.ac.uk/cms/Default.aspx?tabid=3384
- 35. LaserAI. https://laser.ai/
- 36. Made.AI. https://pharmanlp.com/
- 37. EasySLR. https://www.easyslr.com/
- 38. RobotReviewer. https://www.robotreviewer.net/
- 39. EPPI-Reviewer. https://eppi.ioe.ac.uk/cms/Default.aspx?tabid=3921
- 40. Borowiack E, Sadowska E, Nowak A, et al. MSR61 AI support reduced screening burden in a systematic review with costs and cost-effectiveness outcomes (SR-CCEO) for cost-effectiveness modeling. Value Health. 2023;26:S288.
- 41. Brozek J, Borowiack E, Sadowska E, et al. Patients' values and preferences for health states in allergic rhinitis: an artificial intelligence supported systematic review. Allergy. 2024;79:1812-1830.
- 42. Ostawal A, Arca E, Braun N, et al. PNS242 balancing global HTA requirements for literature reviews across Europe, North America, and Asia. Value Health. 2019;22:S802.
- 43. European Commission. Health Technology Assessment: Joint Clinical Assessments of Medicinal Products.