Abstract
Objectives
Systematic literature reviews (SLRs) are essential for synthesising research evidence and guiding informed decision-making. However, they are resource-intensive and impose a substantial workload. The introduction of artificial intelligence (AI) tools can reduce this workload. This study investigates SLR authors' preferences regarding AI tools for screening, focusing on the trade-offs between tool attributes.
Design
A discrete choice experiment (DCE) was performed in which participants completed 13 or 14 choice tasks featuring AI tools with varying attributes.
Setting
Data were collected via an online survey, where participants provided background on their education and experience.
Participants
Professionals who had published SLRs indexed in PubMed, or who were affiliated with a recent Health Economics and Outcomes Research conference, were included as participants.
Interventions
The use of a hypothetical AI tool with different attributes in SLRs was considered by the participants. Key attributes for AI tools were identified through a literature review and expert consultations. These attributes included the AI tool's role in screening, required user proficiency, sensitivity, workload reduction and the investment needed for training.
Primary outcome measures
The participants' adoption of the AI tool, that is, the likelihood of preferring the AI tool in the choice experiment under different configurations of attribute levels, as captured through the DCE choice tasks. Statistical analysis was performed using a conditional (multinomial) logit model. An additional analysis included demographic characteristics (such as education, experience with SLR publication and familiarity with AI) as interaction variables.
Results
The study received responses from 187 participants with diverse experience in performing SLRs and in using AI. Familiarity with AI was generally low, with 55.6% of participants being (very) unfamiliar with AI. Requiring intermediate proficiency in AI tools was positively associated with adoption (p=0.030), and greater workload reduction was strongly associated with adoption (p<0.001). Interestingly, when expert proficiency was required, authors with more years of scientific experience were less likely to adopt the AI tool (p=0.009), whereas more experience specifically with SLR publications increased the likelihood of adoption (p=0.001).
Conclusions
The findings suggest that workload reduction is not the only consideration for SLR reviewers when using AI tools. The key to AI adoption in SLRs is creating reliable, workload-reducing tools that assist rather than replace human reviewers, with moderate proficiency requirements and high sensitivity.
Keywords: Artificial Intelligence, Systematic Review, Decision Making
STRENGTHS AND LIMITATIONS OF THIS STUDY.
- This is the first assessment of preferences regarding artificial intelligence (AI) tools in systematic literature reviews (SLRs) that considers multiple criteria. 
- A mix of literature and expert input was used to construct the attributes, their levels and the alternatives. 
- The choice tasks were hypothetical, and AI tools for SLRs are developing rapidly, so the scenarios may not reflect the latest tools. 
- Potential selection bias may have influenced results due to recruitment shortfall and possible under-representation of AI-sceptical participants. 
Introduction
Systematic literature reviews (SLRs) are essential for organising and synthesising scientific knowledge.1 However, the growing volume of publications in scientific databases makes it increasingly difficult to conduct timely reviews and provide comprehensive, up-to-date overviews of relevant studies. While SLRs are vital for evidence-based research, they also involve highly manual, error-prone and labour-intensive tasks.2 Additionally, 15% of all SLR studies become outdated within a year, and up to 23% within 2 years.3 The title and abstract screening phase, in particular, is often tedious and thankless, requiring multiple experts to manually evaluate numerous titles and abstracts that may not meet the inclusion criteria.4 Exploring strategies to streamline this stage could therefore be highly beneficial.5
Previous research has highlighted workload-related barriers to conducting SLRs, such as a lack of human resources and resistance to using multiple reviewers.5 The resource demands of the process can be alleviated by developing an appropriately sensitive and specific search strategy for each SLR research question and by using software for de-duplication,5 6 thus averting the need to manually process duplicate records. However, even with these best practices, the screening phase still demands significant time, effort and money.7 Alternatively, various artificial intelligence (AI) tools—computer systems designed to perform tasks associated with human intelligence—have been developed, or are in development, to assist or even replace one or two human reviewers, leading to automation of the SLR process.8 9 Such automation would free highly trained human capital from repetitive and algorithmic tasks, allowing them to focus on other aspects of research and potentially improving the overall quality of the research conducted.
The use of AI in SLRs has shown promise in reducing human labour. However, there is significant doubt within the reviewer community about the actual utility and reliability of these newly emerging AI platforms.10 A recent study investigated the level of inter-reviewer reliability (IRR) that authors of SLRs expect for both human and machine-assisted reviews.11 While human-performed SLRs are likely to show moderate agreement between reviewers, authors expect machine learning-assisted SLRs to perform better, indicating that AI is held to a higher standard and suggesting reluctance to use it in the SLR process. Besides IRR, incorporating AI into the SLR process involves trade-offs between various attributes and benefits. Although AI tools offer exciting possibilities for streamlining SLR title and abstract screening, it is important to understand these trade-offs before considering their use. Several studies have evaluated the performance of AI tools such as ASReview, DistillerSR, SWIFT, EPPI-Reviewer and Abstrackr.12–19 However, most of these studies focus on just one or two metrics of the tool (eg, workload reduction), which makes the results difficult to generalise. Simply choosing the tool that offers the most dramatic workload reduction might not be the best strategy,10 and other factors, for example, the sensitivity of the tool or the required user proficiency, may need to be considered in such decisions. Since the use of AI in performing SLRs is a novel phenomenon, there is a research gap regarding the preferences of SLR authors towards these tools. Therefore, this study aims to understand SLR authors' preferences regarding the purpose and use of AI tools in conducting SLRs. It focuses on the trade-offs between AI tool reliability, the (time) investment required and the related benefits these tools provide.
Methods
Rationale for discrete choice experiment
A discrete choice experiment (DCE) was conducted to investigate professionals' preferences towards using AI tools in SLRs. A DCE is a quantitative method that captures individuals' preferences regarding new technologies and their roles in processes, providing robust data for future use. Originating from economic and marketing research, DCEs are now used in various scientific fields. Given that AI-supported SLRs are relatively new, revealed preference data are unavailable, making DCEs ideal for exploring attitudes towards this novel solution. DCEs have previously been used in decision-making for AI tools in healthcare, such as studying physicians' preferences for AI-based assistance tools and people's preferences for AI clinicians before and during the COVID-19 pandemic.20 21 DCEs create realistic scenarios to capture preferences and understand acceptable trade-offs between characteristics, estimating the perceived utility of the tool, including subject-related interactions. The objective of DCEs is to ensure practical data collection on preferences for AI tool characteristics while modelling the complex factors influencing decision-making, such as personal preferences and situational context as reflected in participants' background and experience. For our DCE of AI-supported SLRs, the study design follows The Professional Society for Health Economics and Outcomes Research (ISPOR) Good Research Practices for Conjoint Analysis Task Force checklist (online supplemental appendix A).22 In the following sections, we outline the identification and selection of the AI tool characteristics, that is, the attributes, and their corresponding levels (eg, high or low sensitivity of the AI tool). This framework enables the construction of realistic choice tasks, allowing us to explore how participants value different AI tool attributes. In addition, the choice task construction and instrument design are described in detail.
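For reference, the analysis of such choice data typically rests on a random utility framework. A standard formulation, consistent with the conditional logit model used later in this paper (the notation is ours: $\mathbf{x}_{nj}$ denotes the attribute levels of alternative $j$ shown to respondent $n$, $\boldsymbol{\beta}$ the preference weights and $C_n$ the choice set), is:

$$
U_{nj} = \mathbf{x}_{nj}^{\top}\boldsymbol{\beta} + \varepsilon_{nj},
\qquad
P_{n}(j) = \frac{\exp\!\left(\mathbf{x}_{nj}^{\top}\boldsymbol{\beta}\right)}{\sum_{k \in C_{n}} \exp\!\left(\mathbf{x}_{nk}^{\top}\boldsymbol{\beta}\right)},
$$

where $\varepsilon_{nj}$ is an independent extreme-value error term; in this study, each choice set $C_n$ contains the two alternative AI tool configurations of a choice task.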
Data collection instrument
The participant's view of the instrument included a general introduction explaining the context of the decision-making process, including the background, purpose and aim of the study, the voluntary nature of their involvement and the potential value of the research. This was followed by a short survey questionnaire (online supplemental appendix B) inquiring about the participants' academic qualifications, scientific experience, number of SLR publications, primary purpose for conducting SLRs, experience with AI in SLRs, the minimum citation set size at which they would seek AI assistance, and their primary occupation sector, followed by the DCE itself (online supplemental appendix C). Each DCE choice task presented two alternative AI tool configurations, differing in attribute levels, and participants were asked to select the option they preferred in each scenario. No opt-out option (ie, a 'no AI' choice) was offered, in order to increase data clarity.
Microsoft Forms was used to administer the survey and DCE. Participant anonymity was guaranteed through the use of a unique anonymous identifier; hence, ethical approval is waived under Dutch law for observational studies posing a minor burden on respondents. The involvement and time investment of the participants were justified by the potential to inform the integration of AI tools in systematic literature reviews, in light of evidence-based medicine. All participants gave informed consent before completing the survey.
Attribute and level identification and selection
Attribute identification was based on a scan of both scientific and grey literature, creating an overview of AI(-assisted) use in SLRs and the related metrics and trade-offs. This scan was supplemented with a targeted review of large language model (LLM) use in SLRs and a targeted search of performance reports for existing AI tools in use. Based on this identification, an initial selection of attributes was made according to the attributes' importance as assessed by the review team and the potential correlation between attributes. The final selection of attributes was decided in a workshop-style discussion within the project team.
Given the diversity in the use and interpretation of AI tools, three categories of AI tool types currently employed in SLRs were used for the attribute selection in this DCE: general classifiers, prioritisation algorithms and large language models.
General classifiers (GCs) are task-based AI tools designed to execute predefined actions or workflows based on the provided input. They can classify individual elements of studies (eg, Population, Intervention, Comparison, Outcome, Study design (PICOS)) as well as multiple elements simultaneously, functioning as an inclusion/exclusion classifier.23 Manual definition of classifier elements, such as inclusion/exclusion criteria, is required in the review process. GCs need to be developed and validated for specific search filters. For instance, a machine learning algorithm was developed to identify randomised controlled trials (RCTs), with the primary objective of filtering out non-RCT articles.24 Another example is the development of specific search filters in MEDLINE for identifying articles related to degenerative cervical myelopathy.25
The prioritisation algorithm (PA) incorporates an active learning component, where the algorithm learns from a reviewer's inclusion or exclusion of an article. It employs machine learning algorithms to prioritise the articles deemed most relevant for inclusion. Examples of PAs include DistillerSR,26 ASReview27 and Abstrackr.28 Although a PA does not replace human decision-making, it assists in prioritising articles for review, potentially saving time by focusing attention on the most significant literature first. Authors still need to decide when to stop reviewing, that is, the stopping criteria (for example, after reviewing 20–30 consecutive irrelevant articles), and the rationale for including or excluding articles.
LLMs are AI tools designed to operate on textual and other input. They analyse, process or generate text-based data using natural language processing techniques, making their application more flexible and widespread. Text-based LLMs assist in classification tasks by suggesting labels or categories for articles based on their content.29 Additionally, they aid in prioritising articles by analysing their relevance to the review topic and recommending which articles should be given higher priority for review. Examples of SLR tools that use LLMs are Pitts30 and EasySLR.31 LLMs are more flexible in their application than the other two categories of AI tools described above, but their effectiveness in SLRs depends on various factors, including the design of the prompts used to interact with the model and the expertise of the individuals using the tool.
The following logical and comparable attributes were identified for the three AI tools described above and used to construct the choices provided to respondents: (1) investment needed to train or prompt engineer the AI, (2) validation of classifications by the trained AI, (3) necessary proficiency (from the author) in the use of the tool, (4) estimated sensitivity, (5) workload reduction and (6) role of AI tool in decision making (see online supplemental appendix D for detailed description of attributes).
Choice task construction
The initial level selection was based on internal project team discussion and on the value ranges identified in the literature overview produced during the attribute and level identification and selection stage described above, with the individual levels being mutually exclusive. Level selection drew as much as possible on the sourced literature; when data were lacking, logical assumptions were made. Together with the research team, we refined the choice tasks through multiple discussion rounds to identify the attributes with the highest sensitivity and specificity for capturing variation in participant preferences (table 1). For a detailed description of each choice set, see online supplemental appendix E.
Table 1. All possible combinations of levels within each tool.
| AI tool | Set | Necessary proficiency (from the author) in the tool use | Investment needed to train or prompt engineer the AI | Validation | Sensitivity | Workload reduction in the process* (% of citations that do not have to be screened) |
|---|---|---|---|---|---|---|
| GC | Set 1.1 | No proficiency needed | No investment needed | No validation needed | 98% | 20% |
| GC | Set 1.2 | Novice | 5% of citation set | 2.5% of citation set | 95% | 30% |
| GC | Set 1.3 | Novice | 10% of citation set | 5% of citation set | 93% | 40% |
| PA | Set 2.1 | No proficiency needed | No investment needed | No validation needed | 87% | 80% |
| PA | Set 2.2 | No proficiency needed | No investment needed | No validation needed | 91% | 70% |
| PA | Set 2.3 | No proficiency needed | No investment needed | No validation needed | 95% | 60% |
| LLM | Set 3.1 | Proficient/intermediate | 5% of citation set | 2.5% of citation set | 95% | 70% |
| LLM | Set 3.2 | Expert in AI | 10% of citation set | 5% of citation set | 98% | 80% |
*These values relate to the full replacement of the human role with an AI tool. The workload reduction for partial replacement would be half that of full replacement, and the workload reduction in the assistance role would be half that of partial replacement.
AI, artificial intelligence; GC, general classifier; LLM, large language model; PA, prioritisation algorithm.
The complete set of choice tasks, as presented in table 1, resulted in 276 distinct alternatives for comparing the different AI tools across various attributes. To avoid overwhelming participants with choice sets we deemed unrealistic or unfeasible, we applied the following restrictions to the choices:
- Sensitivity lower than 95% was not considered acceptable for fully replacing humans and was therefore only used in the roles of partial replacement or assistance. 
- The minimum workload reduction was set at 15%. 
By applying the aforementioned restrictions, 136 different alternatives remained. To make sure that all AI tools would be compared with each other, the following steps were taken to select the final alternatives:
- Step 1: Comparing all options across the different attributes and levels, leaving the role of AI out, which resulted in 28 alternatives. 
- Step 2: Assigning the role of the AI tool to each of the 28 alternatives such that the levels of each attribute would be as close to each other as possible, making the choices as difficult as possible. 
- Step 3: Internal discussion of the role of the AI tool for each of the alternatives. 
- Step 4: A pilot test within the project team and with four external reviewers was used to validate the survey and the selection of attributes, metric levels and scenario choices. Where the pilot uncovered mismatched combinations, adjustments were made to ensure the relevance and accuracy of the approach. The final selection consisted of 27 alternatives; one alternative was dropped because one of its options was always chosen (ie, it was dominant). 
Online supplemental appendix E shows the 27 alternative choice sets that were finally selected. To ensure a high participation rate and avoid overwhelming respondents, we used a partial design for the DCE, where only a random subset of attribute combinations was presented to participants, in line with the ISPOR Good Research Practices for Conjoint Analysis Task Force recommendations of a maximum of 8–16 tasks.22 The choices were divided into two shorter DCE sets of 13 and 14 questions each, distributed equally among the potential participants.
The three roles in which the AI applications could be used (full replacement, partial replacement or assistance) introduced some disparities. Across all three roles, estimated sensitivity, proficiency, validation and the investment in AI-tool training were assumed to remain constant. The key attribute that varied between the roles was workload reduction: we assumed that the workload reduction under partial replacement would be half that of full replacement, and that the workload reduction in the assistance role would be half that of partial replacement.
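As an illustration of this halving rule and of the two restrictions listed above, the following R sketch (our own illustrative code with hypothetical variable names and only a few of the base configurations from table 1; it does not reproduce the exact enumeration or counts used in the study) derives the role-specific workload reduction and filters out disallowed combinations:

```r
# Illustrative sketch: derive role-specific workload reduction by successive
# halving and apply the two restrictions described above.
base <- data.frame(set           = c("1.1", "2.1", "3.2"),
                   sensitivity   = c(0.98, 0.87, 0.98),
                   workload_full = c(0.20, 0.80, 0.80))

roles <- data.frame(role = c("full replacement", "partial replacement", "assistance"))
grid  <- merge(base, roles, by = NULL)  # cross join: every configuration x role

# Partial replacement = half of full replacement; assistance = half of partial.
grid$workload <- grid$workload_full *
  ifelse(grid$role == "full replacement", 1,
         ifelse(grid$role == "partial replacement", 0.5, 0.25))

# Restriction 1: sensitivity below 95% is not acceptable for full replacement.
# Restriction 2: workload reduction must be at least 15%.
allowed <- subset(grid, !(sensitivity < 0.95 & role == "full replacement") &
                        workload >= 0.15)
allowed
```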
Sample approach
Participants were identified through a PubMed database search targeting authors who had published SLRs since 2006. Additionally, the study was promoted face-to-face at the 2023 ISPOR Europe conference, and a snowballing approach was used to identify authors of SLRs from our own network. Eligible individuals were contacted via email following European General Data Protection Regulation (GDPR) rules.32 A two-wave email approach was employed: first, an introductory email with general study information and a link to the online survey and DCE was sent; a follow-up reminder email was sent 2 weeks later.
Although there is no standard method to determine the minimum number of respondents in a DCE,33 based on Johnson and Orme's34 recommendations we estimated that around 222 respondents would be ideal. Assuming a 5% response rate, we approached around 4000 potential participants. Given that our analysis prioritised the general relevance of the attributes over precise estimates, a smaller sample size was deemed sufficient. The survey and DCE were advertised and remained open until August 2024.
Statistical analysis
Following data collection, respondent answers were extracted and analysed using R software.35 The quality of DCE responses was assessed, and responses in which the same option was chosen for every choice task were excluded from the analysis. Descriptive statistics were used to summarise participant characteristics. The primary outcome (dependent variable) of our analysis was the participants' likelihood of selecting an AI tool configuration in the DCE, based on varying combinations of attribute levels. A conditional logit regression, supported by the 'support.CEs' package for R, was used to analyse the DCE responses, incorporating all attribute levels simultaneously as predictors without adjustment for covariates. This approach allowed the probability of choosing an alternative to be estimated while accounting for the influence of all attributes in the choice set. Each categorical attribute level was dummy-coded, with the lowest level of the attribute as the reference. In line with standard practice in discrete choice experiments, we report the beta coefficients from the conditional logit model rather than ORs. These coefficients represent the relative preference weight of each attribute level and are used to compare preferences across levels and attributes, as well as to calculate measures such as relative attribute importance.20 22
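A minimal sketch of such a conditional logit fit is shown below. It assumes a long-format data frame with hypothetical variable names (one row per alternative per choice task per respondent); the study itself prepared the design and dataset with the 'support.CEs' package, so this is illustrative rather than the exact analysis code.

```r
library(survival)  # provides clogit() for conditional logit estimation

# dce_long (hypothetical): one row per alternative per choice task per respondent
#   chosen   - 1 if this alternative was selected, 0 otherwise
#   task_id  - identifier of the respondent-by-choice-task combination
#   asc      - alternative-specific constant (1 for the second alternative)
#   role_partial, role_full                      - dummies (reference: assistance)
#   prof_novice, prof_intermediate, prof_expert  - dummies (reference: no proficiency needed)
#   investment, sensitivity, workload            - continuous attribute levels

fit <- clogit(
  chosen ~ asc + role_partial + role_full +
    prof_novice + prof_intermediate + prof_expert +
    investment + sensitivity + workload + strata(task_id),
  data = dce_long
)

summary(fit)  # beta coefficients are interpreted as relative preference weights
```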
The probability of choosing an alternative (the dependent variable) was calculated as the mean frequency with which that alternative was chosen across all respondents. The relative importance of each attribute level was calculated relative to the reference level within each attribute, as represented by the different β coefficients. Additionally, the relative importance of each attribute itself was estimated by calculating the range of its estimated values and expressing it as a proportion of the total range summed across all attributes.20
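Expressed formally, one common way to compute the relative importance of an attribute $a$ from the estimated part-worths (a sketch of the calculation described above, with the reference level's coefficient fixed at zero; the notation is ours) is:

$$
\mathrm{RI}_a = \frac{\max_{\ell}\,\hat{\beta}_{a\ell} - \min_{\ell}\,\hat{\beta}_{a\ell}}{\sum_{a'}\left(\max_{\ell}\,\hat{\beta}_{a'\ell} - \min_{\ell}\,\hat{\beta}_{a'\ell}\right)} \times 100\%
$$

For continuous attributes such as sensitivity and workload reduction, the within-attribute range can be taken as the coefficient multiplied by the span of levels presented to participants.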
An additional model analysis was performed based on sociodemographic and expertise characteristics, such as education and previous experience with AI tools in literature reviews. Respondent characteristics were not included as covariates, as this would imply that choice alternatives are inherently preferred based on these characteristics rather than on the attributes themselves. Instead, interaction terms were used to explore how preferences for specific attribute levels varied across subgroups. Interaction terms were added with the support of the 'MASS' package in R to explore potential moderating effects. A stepwise selection procedure based on the lowest Akaike information criterion (AIC) was applied to an initial model containing all possible interactions, balancing model fit and complexity.
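A rough sketch of this interaction and stepwise-AIC step is given below, continuing the hypothetical dce_long data frame from the previous sketch with respondent characteristics merged in. Because clogit() is a wrapper around coxph(), the model is written here in its coxph form, which combines smoothly with MASS::stepAIC(); the variable names and the set of interactions are illustrative assumptions, not the study's exact specification.

```r
library(survival)
library(MASS)  # provides stepAIC()

# dce_long now also carries (hypothetical) respondent characteristics:
#   experience_years, slr_publications, ai_familiarity, citation_set_size

full_fit <- coxph(
  Surv(rep(1, nrow(dce_long)), chosen) ~
    (role_partial + role_full + prof_novice + prof_intermediate + prof_expert +
       investment + sensitivity + workload) *
      (experience_years + slr_publications + ai_familiarity + citation_set_size) +
    strata(task_id),
  data = dce_long,
  ties = "exact"   # matches the conditional logit likelihood
)

# Stepwise selection retaining the set of terms with the lowest AIC; in practice
# the scope would be constrained so that main effects and strata() are kept.
reduced_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)
summary(reduced_fit)
```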
In an additional subgroup analysis, the sample was dichotomised into groups with lower and higher experience (in terms of both years in research and number of SLR publications) and with low and high familiarity with AI tools, to assess how important the attributes are for reviewers new to the technology compared with more experienced reviewers.
Data sharing statement
No additional data available.
Patient and public involvement
A pilot test was performed with four external reviewers.
Results
Sample and characteristics
A total of 187 participants completed the survey between June and August 2024. Among them, the majority (63.6%) held a Doctor of Philosophy (PhD) degree as highest education level. The distribution of experience since first publication was even, with a slight majority (56.1%) having over 10 years of experience. Additionally, 23.0% of participants reported having conducted more than 10 SLRs. Familiarity with AI was generally low, with 55.6% of participants being (very) unfamiliar with AI, and 15.5% indicated they would never seek AI assistance. Most participants were primarily employed in academia (62.0%) (table 2).
Table 2. Background characteristics of respondents.
| Variable | N (%) | 
|---|---|
| Highest education level | |
| Bachelor’s | 8 (4.3) | 
| Master’s | 38 (20.3) | 
| PhD | 119 (63.6) | 
| Professor | 2 (1.1) | 
| Other* | 20 (10.7) | 
| Experience since first publication | |
| 0–5 years | 33 (17.6) | 
| 6–10 years | 49 (26.2) | 
| 11–15 years | 35 (18.7) | 
| 16–20 years | 27 (14.4) | 
| More than 20 years | 43 (23.0) | 
| Literature review experience | |
| 0–1 SLRs | 38 (20.3) | 
| 2–4 SLRs | 59 (31.6) | 
| 5–7 SLRs | 35 (18.7) | 
| 8–9 SLRs | 12 (6.4) | 
| >10 SLRs | 43 (23.0) | 
| Familiarity with AI | |
| Very unfamiliar with AI | 66 (35.3) | 
| Unfamiliar with AI | 38 (20.3) | 
| Somewhat familiar with AI | 52 (27.8) | 
| Familiar with AI | 20 (10.7) | 
| Very familiar with AI | 11 (5.9) | 
| Minimal size of citation set to seek AI assistance | |
| Below 5k citations | 21 (11.2) | 
| 5k citations | 22 (11.8) | 
| 10k citations | 21 (11.2) | 
| 15k citations | 2 (1.1) | 
| Above 15k citations | 6 (3.2) | 
| Seek AI assistance regardless of the size of citation set | 86 (46.0) | 
| Not use AI at all | 29 (15.5) | 
| Sector of primary occupation | |
| Academia | 116 (62.0) | 
| Healthcare | 39 (20.9) | 
| Government | 5 (2.7) | 
| Industry | 22 (11.8) | 
| Other† | 5 (2.7) | 
*Includes: 7× MD, MD/PhD, MD/DSc, MD/MS/MPH, 3× habilitation, PharmD/MSc, ScD, 2× fellowship, resident in Ophthalmology, Associate's degree, JD.
†Includes: 2× academia/industry, academia/healthcare/industry, IT, public health.
AI, artificial intelligence; SLRs, systematic literature reviews.
Discrete choice experiment results
The analysis identified key factors influencing the adoption of AI in SLRs (table 3). The most notable finding is a strong resistance to fully replacing human reviewers with AI, compared with using AI as assistance only, as shown by a negative coefficient (b (SE)=−0.493 (0.157)) and a statistically significant p value (p=0.002). In contrast, none of the proficiency levels in AI tools was significantly associated with adoption, indicating that, overall, reviewers did not favour tools on the basis of the required level of expertise. A critical factor is increased workload reduction, which is strongly linked to adoption, as evidenced by a positive coefficient (b (SE)=1.001 (0.453)) and a significant p value (p=0.0271), showing that tools perceived to ease workloads are more readily embraced. Although the estimated higher sensitivity of AI tools shows a strong positive coefficient (b (SE)=4.423 (2.296)), implying that accuracy is highly valued, this result is not statistically significant in this sample. Similarly, the negative association (b (SE)=−2.910 (2.412)) between the higher investment required for training or prompt engineering and adoption is not statistically significant, suggesting that while it may be a concern, it is not a decisive factor within the context and estimated range of attribute levels. This also corresponds with the relative importance of the estimated sensitivity (39.2%) and the workload reduction (48.4%), compared with the lower importance of the other attributes. Overall, participants valued workload reduction as the most important attribute. However, values beyond the ranges used in this experiment, such as an estimated sensitivity below 87% or an investment to train or prompt engineer the AI tool above 10% of the citation set, could significantly affect the choice of adoption.
Table 3. Conditional logit estimations for the combined attribute model.
| | β coefficient (SE) | P value | Relative importance |
|---|---|---|---|
| Alternative specific coefficient | 0.174 (0.046) | <0.001* | |
| Role of AI tool | | | 0.6% |
| Assistance (reference) | | | |
| Partial replacement | 0.186 (0.112) | 0.098 | |
| Full replacement | −0.493 (0.157) | 0.002* | |
| Necessary proficiency in the use of the tool | | | 0.2% |
| No proficiency needed (reference) | | | |
| Novice | 0.117 (0.200) | 0.557 | |
| Intermediate | 0.233 (0.172) | 0.174 | |
| Expert | −0.053 (0.329) | 0.872 | |
| Investment needed to train or prompt engineer the AI tool | −2.910 (2.412) | 0.228 | 11.7% |
| Estimated sensitivity | 4.423 (2.296) | 0.054 | 39.2% |
| Workload reduction | 1.001 (0.453) | 0.0271* | 48.4% |
*Statistically significant.
AI, artificial intelligence.
Analysis including interaction terms
The results of the interaction analysis are shown in table 4. Researchers with more scientific experience, and those with high familiarity with AI, are slightly more inclined to accept AI fully replacing human reviewers, as indicated by significant positive coefficients (b=0.145, p=0.044 and b=0.215, p<0.001, respectively). Additionally, larger citation sets also increase the likelihood of accepting full AI replacement (b=0.119, p=0.004). Interestingly, if expert proficiency is needed for the AI, authors with more scientific experience in their profession are less likely to adopt AI (b=−0.202, p=0.008). However, more experience specifically with SLR publications increases the likelihood of adoption when expert proficiency is needed (b=0.239, p=0.001). Regarding the estimated sensitivity of the AI tool, participants with more scientific experience in their profession value high sensitivity (b=2.745, p=0.008), though those with more experience specifically with SLRs are less interested in a high estimated sensitivity (b=−3.409, p<0.001). Finally, larger citation sets decrease the likelihood of adopting AI for workload reduction (b=−0.289, p=0.005), suggesting a preference for more human oversight in extensive tasks.
Table 4. Conditional logit estimations for the combined attribute model including the final selection of interaction terms.
| | β coefficient (SE) | P value |
|---|---|---|
| Alternative specific coefficient | 0.173 (0.046) | <0.001* |
| Role of AI tool | | |
| Assistance (reference) | | |
| Partial replacement | 0.096 (0.261) | 0.712 |
| Full replacement | −1.929 (0.361) | <0.001* |
| Necessary proficiency in the use of the tool | | |
| No proficiency needed (reference) | | |
| Novice | 0.102 (0.201) | 0.610 |
| Intermediate | 0.575 (0.266) | 0.030* |
| Expert | −0.136 (0.399) | 0.733 |
| Investment needed to train or prompt engineer the AI | −2.778 (2.430) | 0.258 |
| Estimated sensitivity | 5.851 (3.745) | 0.118 |
| Workload reduction | 2.219 (0.623) | <0.001* |
| Role of AI (partial replacement) × experience years | 0.120 (0.080) | 0.135 |
| Role of AI (partial replacement) × experience with SLR publications | −0.098 (0.062) | 0.114 |
| Role of AI (full replacement) × experience years | 0.145 (0.072) | 0.044* |
| Role of AI (full replacement) × familiarity with use of AI | 0.215 (0.055) | <0.001* |
| Role of AI (full replacement) × size of citation set | 0.119 (0.042) | 0.004* |
| Proficiency (intermediate) × experience years | −0.116 (0.066) | 0.080 |
| Proficiency (expert) × experience years | −0.202 (0.077) | 0.008* |
| Proficiency (expert) × experience with SLR publications | 0.239 (0.075) | 0.001* |
| Estimated sensitivity × experience years | 2.745 (1.037) | 0.008* |
| Estimated sensitivity × experience with SLR publications | −3.409 (0.991) | <0.001* |
| Workload reduction × size of citation set | −0.289 (0.103) | 0.005* |
*Statistically significant.
AI, artificial intelligence; SLR, systematic literature review.
Subgroup analysis
The subgroup analysis reveals several patterns (see online supplemental appendix F for details). Authors with less than 10 years of scientific experience show strong resistance to AI as a full replacement (b=−0.790, p=0.001), while this resistance is weaker and non-significant for those with more than 10 years of experience (b=−0.286, p=0.173). Intermediate proficiency requirements are significant for authors with less than 10 years of scientific experience (b=0.565, p=0.038), suggesting they are open to gaining some proficiency in order to use AI. Among participants with fewer than five SLR publications, there is strong reluctance towards AI as a full replacement (b=−0.799, p<0.001), along with heightened interest in the estimated sensitivity of the AI tool (b=6.512, p=0.044). This effect diminishes for those with more than five SLR publications. Finally, AI familiarity reduces opposition to AI replacing human roles. Participants unfamiliar with AI tools strongly resist full AI replacement (b=−0.777, p<0.001), while participants who are familiar with AI tools show less resistance (b=−0.169, p=0.461). Participants who are unfamiliar with AI tools also place significant value on workload reduction (b=1.450, p=0.023).
Discussion
The findings of this study provide valuable insights into the factors influencing SLR authors’ preferences for AI tools, addressing the aim of assessing the trade-offs between AI tool reliability, the time investment required, and the related benefits in conducting SLRs. While over half of the participants were (very) unfamiliar with such AI tools, only a small proportion indicated they would never use AI in the SLR process. However, a critical observation from the study is the need for less experienced reviewers to approach the use of AI in literature synthesis with caution. Particularly, they should focus on academic integrity, as AI may present risks regarding the depth and accuracy of the synthesised content, potentially leading to the production of fraudulent material.36
One of the most significant takeaways is the strong aversion to fully replacing human reviewers with AI, reflected by the consistently negative and significant coefficients across various subgroups. The unpredictability, fear of unknown errors and the 'black-box' nature of AI algorithms likely contribute to this mistrust, making researchers hesitant to rely fully on AI.37 This suggests that while AI is recognised as a tool with potential benefits, there remains a deep-rooted concern about the reliability of AI in decision-making processes. Interestingly, previous research has shown that acceptable error rates for AI are significantly lower than for humans, likely due to a general mistrust of AI. This mistrust may stem from AI's limited social capacity and lack of likeability compared with human colleagues, making it less likely to be forgiven for errors.37 The resistance is particularly pronounced among researchers with less experience and fewer SLR publications, indicating that familiarity with the field and with AI tools may play a role in shaping perceptions. Researchers are likely cautious about AI's ability to handle the nuances of literature synthesis, where errors in screening and selecting studies can have major implications. Notably, partial automation scored more positive coefficients than using AI purely as an assistant, indicating an openness to some level of automation within the SLR process, consistent with findings and opinions in the field of AI use in medical applications.38 39 Consequently, it is important for future developments in AI tools for SLRs to focus on hybrid models, where AI supports but does not replace human expertise. This approach may bridge the gap between the need for efficiency and the desire for reliability.
Workload reduction emerged as another critical factor influencing AI adoption. Participants who were unfamiliar with AI were significantly more likely to value workload reduction, indicating that researchers new to AI view it as a means to streamline processes.40 This suggests that AI tools designed to reduce administrative burden without compromising on accuracy would likely see higher adoption rates, especially among less experienced researchers. Interestingly, the importance of workload reduction diminished in reviewers using the hypothetical tool for larger citation sets. This could imply that researchers working with more extensive datasets may prioritise human oversight over automation, potentially due to concerns about accuracy in complex tasks as most of the AI tools currently still depend on possibly outdated methodologies.41 This highlights the importance of designing AI tools that are scalable and capable of maintaining accuracy across diverse datasets. The ability to adjust AI assistance based on the complexity of the task could enhance adoption among researchers working on large-scale projects.
Proficiency with AI tools also plays a key role in adoption preferences. Intermediate proficiency, rather than novice or expert levels, was significantly associated with higher AI adoption. This indicates that SLR authors may prefer AI tools that strike a balance between ease of use and advanced functionality, allowing for effective utilisation without the need for extensive training or technical expertise. Researchers with less than 10 years of experience are particularly open to gaining some proficiency in AI, which suggests that AI tools with moderate learning curves could encourage wider adoption among early-career researchers. In contrast, more experienced researchers are less likely to adopt AI if expert-level proficiency is needed, likely because they already have established methods and may not see the need to invest additional time in mastering new tools. This suggests that SLR AI tool developers could focus on creating customisable AI solutions that offer both basic and advanced settings, enabling reviewers of different experience levels to tailor the tools to their needs.
AI tool reliability, measured through estimated sensitivity, also plays an important role, although the findings were mixed. Sensitivity is highly valued by participants with fewer than five SLR publications, indicating that these researchers are concerned about the risk of missing relevant studies. However, this concern diminishes as researchers gain more experience, suggesting that seasoned SLR authors may rely more on their expertise to mitigate potential AI errors. This points to a potential trade-off in AI adoption, where less experienced reviewers might be more drawn to tools that guarantee high sensitivity, while more experienced reviewers might be comfortable with AI tools that offer efficiency gains but require human intervention to ensure accuracy.
Interestingly, the investment required to train or prompt-engineer AI did not significantly impact adoption, even though it was negatively associated. The negative impact of the investment required to train the AI tool could be attributed to the fact that investing time in training AI may seem counterintuitive when the primary goal is to enhance efficiency.36 This finding suggests that while the time and effort required to set up AI tools are a consideration, they are not a decisive barrier to adoption. Researchers might be willing to invest some time in AI tools if they are convinced of the potential long-term benefits, such as workload reduction and improved efficiency. Ultimately, the next phase of AI development in SLRs may involve improving user interfaces and reducing the learning curve for researchers, further eliminating barriers to entry.
This study has several strengths that contribute to research on the integration of AI in SLRs. First, to the best of our knowledge, this is the first study to systematically assess the preferences of SLR authors regarding AI tools using a DCE, a robust method for eliciting preferences around adoption. The attributes and levels used in the DCE were based on both literature and expert consultation, enhancing the realism of the scenarios presented to participants. The study also included a diverse sample of professionals with varying levels of experience with SLRs and AI, allowing for subgroup analyses that provide nuanced insights into adoption drivers. However, our results should be interpreted in light of several limitations. Given the hypothetical nature of the DCE and the fast development of AI tools for SLRs, the scenarios may not have fully captured real-world decision-making, and participant preferences may differ from actual behaviour in practice. In addition, the sample, while diverse in, for example, experience and familiarity with using AI, may still be subject to selection bias, considering that we fell short of our recruitment target (187 recruited of the 222 targeted). Participants with a positive attitude towards AI use (in SLRs) might have been more inclined to take part in our study, which might limit the generalisability of our results to those more sceptical about AI applications in SLRs. Given that our DCE focused on preferences for different attributes of AI tools rather than on the decision to adopt AI tools at all, the impact of this selection bias on our results is likely limited. In addition, although we accounted for respondents' background characteristics through interaction terms to assess moderation effects, unmeasured confounders, such as gender, may have biased the outcomes. AIC is one of the most common measures for determining model fit.42 While AIC can increase the risk of overfitting when applied to models with a large number of choice sets, this concern is minimal in our case, as the model includes only two choice sets.42 Moreover, our discrete choice experiment did not explicitly incorporate ethical considerations related to the use of AI in SLRs, such as algorithmic bias or transparency in decision-making. While these considerations could be critical in the choice to adopt an AI tool, they are complex to translate into discrete, measurable attributes suitable for inclusion in a choice experiment. The attributes used, such as the role of the AI tool in the SLR (assistance, partial replacement of a human or full replacement of a human), implicitly take some of these considerations into account. Future research could examine the ethical component of the decision to adopt AI tools in SLRs. Finally, our DCE did not account for the environmental impact of AI tools, such as the carbon footprint associated with server infrastructure and computational demands. As awareness of the potential contribution of AI to climate change grows, the carbon emission levels of different AI tools could shift preferences towards more selective use of AI in SLRs, or towards the adoption of sustainable AI alternatives.
Conclusion
In conclusion, the study highlights several key factors influencing the adoption of AI tools in SLRs. These include the perceived reliability of the tool, its ability to significantly reduce workload and the level of proficiency required to use it effectively. SLR authors generally express reluctance to fully replace human reviewers with AI, probably due to concerns about reliability. However, there is greater openness towards AI tools that assist human reviewers. Overall, AI tools that require moderate proficiency, offer workload reduction and maintain high sensitivity are likely to see greater acceptance in the SLR community. These findings provide actionable insights for developers of AI-enabled SLR tools and for the healthcare decision-makers who use such tools, suggesting that hybrid approaches—where AI supports but does not entirely replace human expertise—could foster broader adoption and trust in AI-assisted SLR processes.
Supplementary material
Footnotes
Funding: This work was funded by F. Hoffmann-La Roche, Basel, Switzerland (grant number: N/A).
Prepublication history and additional supplemental material for this paper are available online. To view these files, please visit the journal online (https://doi.org/10.1136/bmjopen-2025-099921).
Provenance and peer review: Not commissioned; externally peer reviewed.
Patient consent for publication: Not applicable.
Ethics approval: All participants were asked to fill in an informed consent form before participation in the study. Participant anonymity is guaranteed by the use of a unique anonymous identifier; hence, ethical approval for observational studies has been waived under Dutch law. Participants gave informed consent to participate in the study before taking part.
Patient and public involvement: Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the Methods section for further details.
Data availability statement
All data relevant to the study are included in the article or uploaded as supplementary information.
References
- 1. Boren SA, Moxley D. Systematically reviewing the literature: building the evidence for health care quality. Mo Med. 2015;112:58–62.
- 2. Marshall C. Tool support for systematic reviews in software engineering. Available: https://keele-repository.worktribe.com/output/407199
- 3. Shojania KG, Sampson M, Ansari MT, et al. How quickly do systematic reviews go out of date? A survival analysis. Ann Intern Med. 2007;147:224–33. doi: 10.7326/0003-4819-147-4-200708210-00179.
- 4. Polanin JR, Pigott TD, Espelage DL, et al. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res Synth Methods. 2019;10:330–42. doi: 10.1002/jrsm.1354.
- 5. Booth A, Sutton A, Papaioannou D. Systematic approaches to a successful literature review. SAGE Publications Ltd; 2016. Available: https://uk.sagepub.com/en-gb/eur/systematic-approaches-to-a-successful-literature-review/book270933
- 6. Bennett NR, Cumberbatch C, Francis DK. There are challenges in conducting systematic reviews in developing countries: the Jamaican experience. J Clin Epidemiol. 2015;68:1095–8. doi: 10.1016/j.jclinepi.2014.09.026.
- 7. O'Dwyer LC, Wafford QE. Addressing challenges with systematic review teams through effective communication: a case report. J Med Libr Assoc. 2021;109:643–7. doi: 10.5195/jmla.2021.1222.
- 8. van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, et al. Artificial intelligence in systematic reviews: promising when appropriately used. BMJ Open. 2023;13:e072254. doi: 10.1136/bmjopen-2023-072254.
- 9. van Dinter R, Tekinerdogan B, Catal C. Automation of systematic literature reviews: a systematic literature review. Inf Softw Technol. 2021;136:106589. doi: 10.1016/j.infsof.2021.106589.
- 10. Blaizot A, Veettil SK, Saidoung P, et al. Using artificial intelligence methods for systematic review in health sciences: a systematic review. Res Synth Methods. 2022;13:353–62. doi: 10.1002/jrsm.1553.
- 11. Hanegraaf P, Wondimu A, Mosselman JJ, et al. Inter-reviewer reliability of human literature reviewing and implications for the introduction of machine-assisted systematic reviews: a mixed-methods review. BMJ Open. 2024;14:e076912. doi: 10.1136/bmjopen-2023-076912.
- 12. van de Schoot R, de Bruin J, Schram R, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3:125–33. doi: 10.1038/s42256-020-00287-7.
- 13. Gartlehner G, Wagner G, Lux L, et al. Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study. Syst Rev. 2019;8:277. doi: 10.1186/s13643-019-1221-3.
- 14. Hamel C, Kelly SE, Thavorn K, et al. An evaluation of DistillerSR's machine learning-based prioritization tool for title/abstract screening - impact on reviewer-relevant outcomes. BMC Med Res Methodol. 2020;20. doi: 10.1186/s12874-020-01129-1.
- 15. Howard BE, Phillips J, Tandon A, et al. SWIFT-Active Screener: accelerated document screening through active learning and integrated recall estimation. Environ Int. 2020;138:105623. doi: 10.1016/j.envint.2020.105623.
- 16. Howard BE, Phillips J, Miller K, et al. SWIFT-Review: a text-mining workbench for systematic review. Syst Rev. 2016;5:87. doi: 10.1186/s13643-016-0263-z.
- 17. Gates A, Johnson C, Hartling L. Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool. Syst Rev. 2018;7. doi: 10.1186/s13643-018-0707-8.
- 18. Rathbone J, Hoffmann T, Glasziou P. Faster title and abstract screening? Evaluating Abstrackr, a semi-automated online screening program for systematic reviewers. Syst Rev. 2015;4. doi: 10.1186/s13643-015-0067-6.
- 19. Tsou AY, Treadwell JR, Erinoff E, et al. Machine learning for screening prioritization in systematic reviews: comparative performance of Abstrackr and EPPI-Reviewer. Syst Rev. 2020;9:73. doi: 10.1186/s13643-020-01324-7.
- 20. von Wedel P, Hagist C. Physicians' preferences and willingness to pay for artificial intelligence-based assistance tools: a discrete choice experiment among German radiologists. BMC Health Serv Res. 2022;22. doi: 10.1186/s12913-022-07769-x.
- 21. Liu T, Tsang W, Xie Y, et al. Preferences for artificial intelligence clinicians before and during the COVID-19 pandemic: discrete choice experiment and propensity score matching study. J Med Internet Res. 2021;23:e26997. doi: 10.2196/26997.
- 22. Bridges JFP, Hauber AB, Marshall D, et al. Conjoint analysis applications in health - a checklist: a report of the ISPOR Good Research Practices for Conjoint Analysis Task Force. Value Health. 2011;14:403–13. doi: 10.1016/j.jval.2010.11.013.
- 23. Boudin F, Nie J-Y, Bartlett JC, et al. Combining classifiers for robust PICO element detection. BMC Med Inform Decis Mak. 2010;10. doi: 10.1186/1472-6947-10-29.
- 24. Marshall IJ, Noel-Storr A, Kuiper J, et al. Machine learning for identifying randomized controlled trials: an evaluation and practitioner's guide. Res Synth Methods. 2018;9:602–14. doi: 10.1002/jrsm.1287.
- 25. Davies BM, Goh S, Yi K, et al. Development and validation of a MEDLINE search filter/hedge for degenerative cervical myelopathy. BMC Med Res Methodol. 2018;18. doi: 10.1186/s12874-018-0529-3.
- 26. Systematic review and literature review software by DistillerSR. 2024. Available: https://www.distillersr.com/
- 27. ASReview – active learning for systematic reviews. 2024. Available: https://asreview.nl/
- 28. Abstrackr: home. 2024. Available: http://abstrackr.cebm.brown.edu/account/login
- 29. Dennstädt F, Zink J, Putora PM, et al. Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain. Syst Rev. 2024;13:158. doi: 10.1186/s13643-024-02575-4.
- 30. Living systematic review software | Pitts. 2024. Available: https://pitts.ai/
- 31. EasySLR: fly through reviews - EasySLR. 2024. Available: https://www.easyslr.com/
- 32. Radley-Gardner O, Beale HG, Zimmermann R, editors. Fundamental Texts on European Private Law. 2nd edn. Oxford, UK: Hart Publishing; 2016.
- 33. Speckemeier C, Krabbe L, Schwenke S, et al. Discrete choice experiment to determine preferences of decision-makers in healthcare for different formats of rapid reviews. Syst Rev. 2021;10. doi: 10.1186/s13643-021-01647-z.
- 34. Johnson R, Orme B. Sample size issues for conjoint analysis studies. 2010. Available: https://sawtoothsoftware.com/resources/technical-papers/sample-size-issues-for-conjoint-analysis-studies
- 35. Posit (RStudio). Posit. 2024. Available: https://www.posit.co/
- 36. Khalifa M, Albadawy M. Using artificial intelligence in academic writing and research: an essential productivity tool. Computer Methods and Programs in Biomedicine Update. 2024;5:100145. doi: 10.1016/j.cmpbup.2024.100145.
- 37. Lenskjold A, Nybing JU, Trampedach C, et al. Should artificial intelligence have lower acceptable error rates than humans? BJR Open. 2023;5:20220053. doi: 10.1259/bjro.20220053.
- 38. Sezgin E. Artificial intelligence in healthcare: complementing, not replacing, doctors and healthcare providers. Digit Health. 2023;9:20552076231186520. doi: 10.1177/20552076231186520.
- 39. Alami H, Lehoux P, Papoutsi C, et al. Understanding the integration of artificial intelligence in healthcare organisations and systems through the NASSS framework: a qualitative study in a leading Canadian academic centre. BMC Health Serv Res. 2024;24. doi: 10.1186/s12913-024-11112-x.
- 40. Fabiano N, Gupta A, Bhambra N, et al. How to optimize the systematic review process using AI tools. JCPP Adv. 2024;4:e12234. doi: 10.1002/jcv2.12234.
- 41. Bolaños F, Salatino A, Osborne F, et al. Artificial intelligence for literature reviews: opportunities and challenges. Artif Intell Rev. 2024;57:259. doi: 10.1007/s10462-024-10902-3.
- 42. Hauber AB, González JM, Groothuis-Oudshoorn CGM, et al. Statistical methods for the analysis of discrete choice experiments: a report of the ISPOR Conjoint Analysis Good Research Practices Task Force. Value Health. 2016;19:300–15. doi: 10.1016/j.jval.2016.04.004.
