BMC Medical Research Methodology
. 2025 Mar 18;25:75. doi: 10.1186/s12874-025-02528-y

Using artificial intelligence for systematic review: the example of elicit

Nathan Bernard 1,2,, Yoshimasa Sagawa Jr 1,2, Nathalie Bier 3,4, Thomas Lihoreau 1,5,6, Lionel Pazart 1,2,6, Thomas Tannou 1,2,3,7
PMCID: PMC11921719  PMID: 40102714

Abstract

Background

Artificial intelligence (AI) tools are increasingly being used to assist researchers with various research tasks, particularly in the systematic review process. Elicit is one such tool; its ability to generate a summary of the evidence retrieved for the question asked sets it apart from other AI tools. The aim of this study is to determine whether AI-assisted research using Elicit adds value to the systematic review process compared to traditional screening methods.

Methods

We compared the results of an umbrella review conducted independently of AI with the results of an AI-based search using the same criteria. Elicit's contribution was assessed against three criteria: repeatability, reliability and accuracy. For repeatability, the search process was repeated three times on Elicit (trials 1, 2 and 3). For accuracy, the articles obtained with Elicit were reviewed using the same inclusion criteria as the umbrella review. Reliability was assessed by comparing the publications identified with and without the AI-based search.

Results

The repeatability test yielded 246, 169 and 172 results for trials 1, 2 and 3 respectively. Concerning accuracy, 6 articles were included at the conclusion of the selection process. Regarding reliability, the comparison revealed 3 common articles, 3 articles exclusively identified by Elicit and 14 articles exclusively identified by the AI-independent umbrella review search.

Conclusion

Our findings suggest that AI research assistants, like Elicit, can serve as valuable complementary tools for researchers when designing or writing systematic reviews. However, AI tools have several limitations and should be used with caution. When using AI tools, certain principles must be followed to maintain methodological rigour and integrity. Improving the performance of AI tools such as Elicit and contributing to the development of guidelines for their use during the systematic review process will enhance their effectiveness.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12874-025-02528-y.

Keywords: Artificial intelligence tools, Systematic review writing, Reliability, Accuracy

Background

The systematic review is a rigorous and time-consuming process that requires a high degree of completeness. Several artificial intelligence (AI) tools have been developed to automate or semi-automate key milestones in this process [1, 2]. Recent publications show that researchers predominantly use these tools for screening (73%), risk of bias assessment (13%) and data extraction (13%) [2]. However, ongoing research is examining the extent to which AI can be relied upon to assist researchers in the systematic review process [3]. This raises the question of whether these tools are as effective as traditional human searching.

Among AI tools, Elicit, based on language models such as GPT-3, is positioned as a powerful research assistant [4]. Elicit uses semantic similarity to identify papers relevant to a research question across multiple databases, even when those papers do not employ the exact keywords. Elicit claims to identify the most relevant papers and then generates a summary of the answer to the question by analysing each abstract [4]. Its distinctive ‘custom report’ feature sets it apart from other AI tools by automatically providing a comprehensive overview of a specific topic. To refine results, Elicit offers filters for specific keywords, article type and publication years. Furthermore, Elicit facilitates the organization of information by displaying results according to population, intervention or study methodology. Elicit operates as a process-based system: it retrieves, analyses and summarizes the eight articles most likely to answer the question, making it easier to find an answer. These features explain the growing interest in exploring the use of Elicit to assist researchers in producing systematic literature reviews.

Our research team's work focuses on developing smart living environments that support older people in ageing in place. We recently published an umbrella review on this topic [5] without employing AI tools. This raised the question: does systematic review screening with Elicit add value to the process compared to classical screening methods? Firstly, we hypothesized that screening with Elicit would include the same studies as the classical screening method; secondly, that it would include new studies; and finally, that the AI-assisted search would add detail to the conclusions of our original umbrella review.

Methods

To answer this question, the screening method was carried out in several stages. The first step involved using Elicit as a database to identify relevant studies. Unlike traditional databases, Elicit does not allow keyword-based searches in the conventional manner. Instead, the same research question as in Tannou et al. [5] was entered into Elicit: “What is the effectiveness of smart living environments in supporting ageing in place?”. Filters were applied to refine the search by article type (systematic review) and publication years (2005 to 2021), according to the inclusion criteria of the original umbrella review. The results were then sorted chronologically, in ascending order by year. To enhance coverage, the search was repeated for nine different year-based searches (2005, 2010, 2015, 2016, 2017, 2018, 2019, 2020 and 2021). As part of its algorithm, Elicit focuses on the eight most relevant studies; we therefore used the “show more” feature until Elicit could no longer retrieve additional papers and displayed the message “no more papers”. For earlier search years (e.g. 2005), results included studies published between 2005 and 2021. To minimize duplicates, only studies published in the specific search year were retained (e.g. a search in 2016 yielded only studies published in 2016).
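Elicit exposes no public search API that was used here, so the year-restriction and pooling step above can only be illustrated schematically. The sketch below uses a hypothetical record structure (`title`/`year` dicts) and helper names of our own; it shows the retained-record logic: each year-based search keeps only records actually published in that search year, and the pooled set drops duplicate titles.

```python
def filter_to_search_year(records, search_year):
    """Keep only records published in the year the search targeted.

    `records` is a list of dicts with 'title' and 'year' keys; this
    structure is an assumption for illustration, not Elicit's format.
    """
    return [r for r in records if r["year"] == search_year]


def pool_unique(results_by_year):
    """Pool year-restricted results across searches, dropping
    duplicate titles (case- and whitespace-insensitive)."""
    seen, pooled = set(), []
    for year, records in sorted(results_by_year.items()):
        for r in filter_to_search_year(records, year):
            key = r["title"].strip().lower()
            if key not in seen:
                seen.add(key)
                pooled.append(r)
    return pooled
```

In practice, real deduplication would also compare DOIs or normalized author-year pairs; title matching alone is the simplest possible stand-in.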

The remaining steps of the systematic review process (study selection, data extraction and data comparison) were conducted solely by the first author of the original umbrella review [5].

Elicit's contribution was assessed against three criteria: repeatability, reliability and accuracy.

Repeatability

In our study, repeatability was defined as Elicit's ability to provide consistent results under identical conditions at different times. To assess this, the search method described above was replicated three times at different moments (varying by hour and day), using the same research query. The results were compared across these three trials, following the approach described by Kitchenham et al. [6]. The searches were conducted on April 18 and 19, 2023.

Accuracy

In our study, accuracy was defined as Elicit's ability to retrieve relevant articles. A mixed method was used to assess it: studies were retrieved using the methodology described above, and the first author of the umbrella review then manually assessed the relevance of the identified articles using the same inclusion/exclusion criteria as the original umbrella review [5].

Reliability

In our study, reliability was defined as the agreement between the Elicit-assisted screening and the classical screening method from the study by Tannou et al. [5]. This was assessed by comparing the publications identified by both methods. Descriptive statistics (percentages) were used to evaluate the overlap between the two approaches.
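As a minimal sketch (the function name and inputs are hypothetical, not part of the study's tooling), the overlap statistic reduces to a set intersection divided by the size of the classical reference set:

```python
def overlap_percentage(ai_included, classical_included):
    """Percentage of classically included studies that the
    AI-assisted screen also identified (agreement measure).

    Inputs are any iterables of study identifiers; duplicates
    are ignored via set conversion.
    """
    classical = set(classical_included)
    common = set(ai_included) & classical
    return 100 * len(common) / len(classical)
```

With the counts reported in the Results (3 common articles out of 17 classically included studies), this measure comes to roughly 17.6%.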

Results

For the repeatability test, the results varied across trials (see Table 1): 246 results were obtained in trial 1, 169 in trial 2 and 172 in trial 3 (see Additional files 1, 2 and 3 respectively). After pooling the articles from all three trials and removing duplicates, a total of 241 articles were identified (see Fig. 1).

Table 1.

Number of articles by search year and trial, with search date and time

| Year | Trial 1 date/time | Trial 1 articles | Trial 2 date/time | Trial 2 articles | Trial 3 date/time | Trial 3 articles |
|---|---|---|---|---|---|---|
| 2005 | 18/04/2023 07:43 | 1 | 18/04/2023 15:55 | 1 | 18/04/2023 20:10 | 3 |
| 2010 | 18/04/2023 07:54 | 11 | 18/04/2023 16:27 | 6 | 18/04/2023 20:18 | 10 |
| 2015 | 18/04/2023 08:10 | 11 | 18/04/2023 19:06 | 7 | 19/04/2023 08:57 | 8 |
| 2016 | 18/04/2023 08:20 | 16 | 18/04/2023 19:16 | 8 | 19/04/2023 09:12 | 9 |
| 2017 | 18/04/2023 08:28 | 14 | 18/04/2023 19:19 | 10 | 19/04/2023 09:31 | 12 |
| 2018 | 18/04/2023 08:42 | 22 | 18/04/2023 19:34 | 22 | 19/04/2023 11:39 | 19 |
| 2019 | 18/04/2023 09:03 | 47 | 18/04/2023 19:46 | 45 | 19/04/2023 12:21 | 27 |
| 2020 | 18/04/2023 13:54 | 64 | 18/04/2023 19:50 | 32 | 19/04/2023 13:19 | 39 |
| 2021 | 18/04/2023 14:58 | 60 | 18/04/2023 19:56 | 38 | 19/04/2023 16:03 | 45 |
| Total | | 246 | | 169 | | 172 |

Fig. 1.


The diagram from Tannou et al. study (left), Elicit (right), and the comparisons at different steps of the systematic review (a, b, c)

Concerning accuracy, the 241 identified articles were screened on title and abstract using the same inclusion criteria as the umbrella review (see Additional file 4). A full-text review was then conducted on 29 articles (see Additional file 5). As illustrated in Fig. 1, 6 articles were included at the conclusion of the selection process (see Additional file 6) [7–12].

Concerning reliability, comparing the articles excluded on title and abstract by Tannou et al. and by Elicit identified 32 common articles (Fig. 1a) (see Additional file 7). Comparing the articles excluded on full text identified 17 common articles (Fig. 1b) (see Additional file 8). Finally, comparing the included studies from Tannou et al. and from Elicit identified 3 common articles (Fig. 1c) (see Additional file 9), representing 17.6% of the studies finally included using the classic screening method. Additionally, 3 articles [10–12] were exclusively identified by Elicit and 14 articles were exclusively identified by Tannou et al.

Discussion

The aim of the present study was to assess whether using Elicit in the screening process of systematic reviews adds value compared to the traditional screening process. Our findings suggest that AI research assistants such as Elicit are relevant complementary tools for researchers during the systematic literature review process. However, they have not yet reached a level of development at which they can fully replace traditional approaches. Contrary to our first hypothesis, Elicit showed a lack of repeatability and reliability compared to the classical screening method: only 17.6% of the studies finally included by Tannou et al. were identified by Elicit. In line with our second hypothesis, however, Elicit identified 3 articles that had not been included by Tannou et al., all of which were confirmed as eligible for inclusion by the first author of the Tannou et al. study, showing its potential to enhance the comprehensiveness of the classical screening method. Contrary to our last hypothesis, though, a full-text review of these 3 articles showed that their inclusion would not have altered the conclusion of the umbrella review: they lacked specificity and did not provide substantial additional information. Viewed from a different perspective, had only the 6 articles found with Elicit been considered, compared to the 17 from the published umbrella review [5], the conclusions on the research question would have remained the same. This shows that all the articles found using Elicit were genuinely relevant to the research question, but the smaller number of articles reduced the level of precision and nuance, which are key objectives of the systematic review process [13]. Despite these limitations, Elicit can provide relevant assistance in producing a systematic review; it cannot, however, be used on its own to produce one.

Nevertheless, Elicit's accuracy is also influenced by the formulation of the research question. For instance, posing our question to Elicit using three slightly different wordings (see Additional file 10) yielded a similar conclusion, but the details provided and the cited articles differed. Indeed, the few articles Elicit used to produce its custom report referred to the umbrella review protocol rather than the completed review. This raises questions about the validity of Elicit's conclusions, given that its recommendations were derived from a protocol.

Enhancing the systematic literature review process through AI tools must address several issues, as evidenced by our experience. Our results showed problems with repeatability and reliability, compromising both methodological rigour and relevance. One factor relates to how Elicit operates: it does not presently employ a keyword-based search [4], which required sorting results by year to ensure comprehensiveness. A second factor relates to Elicit's coverage, as it draws on articles referenced in a single database, Semantic Scholar [14]. This limits the number of possible results when studies are not referenced in that database and precludes comprehensiveness. In our searches, trial 1 consistently produced the highest number of results per year; all included studies were found by trials 1 and 3, but trial 2 found only 4 of the 6 included studies. This first affected accuracy: of the 3 newly included studies, only 2 were retrieved by trial 2. It also affected reliability, as only 2 of the 3 articles in common with Tannou et al. [5] were identified, reducing the percentage of finally included studies identified to 11.7%. Comparing the trials in detail, 137 articles were common to all three trials, 24 were found by trials 1 and 3 only, 25 by trials 1 and 2 only, and 2 by trials 2 and 3 only (see Additional files 11, 12, 13 and 14). Research noise (non-relevant results) may partly explain the variation in the number of articles per trial, but it is not a sufficient explanation. This variation can also affect accuracy and reliability through the differing numbers of finally included studies, and it raises the questions of whether Elicit needs to be run more than once and how many trials are required to achieve stable results.
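The trial-overlap counts discussed above are plain set operations. A sketch (with toy article identifiers, not the study's data; function and key names are ours) of how the "common to all three trials" and "found by exactly two trials" groups can be computed:

```python
def trial_overlaps(t1, t2, t3):
    """Partition three trial result sets into the overlap groups
    reported in the text: articles found by all three trials, and
    articles found by each pair of trials exclusively."""
    t1, t2, t3 = set(t1), set(t2), set(t3)
    return {
        "all_three":  t1 & t2 & t3,
        "t1_t2_only": (t1 & t2) - t3,
        "t1_t3_only": (t1 & t3) - t2,
        "t2_t3_only": (t2 & t3) - t1,
    }
```

Running this over the three pooled trial exports would reproduce counts of the kind reported here (e.g. the 137 articles common to all three trials), provided the article identifiers are deduplicated consistently across exports.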
A keyword-based search and improvements to Elicit's retrieval capabilities would enhance methodological relevance, an essential step for AI tools used in systematic reviews. Elicit's accuracy also needs improvement: the two most common reasons for excluding articles screened by Elicit (206 articles) were inappropriate outcomes or interventions. Another important limitation is Elicit's problematic use of references when producing its summary. Elicit incorrectly cited the umbrella review protocol, a fairly serious error that makes human verification of references obligatory.

Implications

AI tools have a place in supporting researchers carrying out systematic literature reviews. Such reviews commonly face challenges relating, first, to the comprehensiveness of the article search and, second, to the risk of cognitive bias on the part of the researcher, which can be mitigated. This is demonstrated by several similar studies [6, 15–17], which generally conclude that two similar systematic literature reviews conducted independently on the same research question can yield different results, both in the studies included and in the conclusions drawn. One explanation is that several biases may influence the results at different stages of the systematic review: the author's level of expertise, decisions and judgement, and subject area may influence first the included studies and then the conclusions. These issues define the role of AI during the systematic review process. AI can be used during data screening, extraction or risk of bias assessment to improve comprehensiveness and accuracy, as well as to reduce human biases related to expertise level, decision-making processes or cognitive biases [1, 2, 18]. The limitations of AI tools (lack of comprehensiveness, accuracy and screening errors) [1, 2, 18, 19], coupled with the results of this study, highlight the need to maintain human oversight when using them.

Conclusion

The systematic review process, which is very time-consuming [20], can be improved in terms of time efficiency, optimization and rigour by combining AI tools with human analysis. This allows for a better understanding of the complexity of the research [19]. Currently, Elicit can assist researchers with specific tasks throughout the systematic review process: first, for the formulation of research questions, based on its functionality; and second, as explored in this study, for screening and data extraction, provided the tool's performance improves.

Given the multiplicity of available AI tools [18, 21], certain principles must be respected to maintain methodological rigour and integrity. Firstly, it is essential to remember that these tools still lack evidence on their validity, reliability and accuracy. Secondly, as mentioned earlier, an AI tool should be used only at certain stages of the systematic review process, not to automate the entire process. Thirdly, for transparency and reproducibility, it is crucial to report the use of AI tools in the methodology section. Finally, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) AI reporting guidelines, currently under development, will provide a useful framework for the use of AI [22].

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1 (33.5KB, xlsx)
Supplementary Material 2 (26.2KB, xlsx)
Supplementary Material 3 (26.7KB, xlsx)
Supplementary Material 4 (33.4KB, xlsx)
Supplementary Material 5 (12.9KB, xlsx)
Supplementary Material 6 (9.8KB, xlsx)
Supplementary Material 7 (13.6KB, xlsx)
Supplementary Material 8 (11.9KB, xlsx)
Supplementary Material 9 (10.4KB, xlsx)
Supplementary Material 11 (33.4KB, xlsx)
Supplementary Material 13 (14.5KB, xlsx)
Supplementary Material 14 (11.4KB, xlsx)

Abbreviations

AI

Artificial intelligence

Author contributions

N.B., Y.S., L.P. and T.T. conceived and designed the work; N.B. acquired, analysed and interpreted the data; N.B. and T.T. drafted the work; Y.S., N.Bi., T.L., L.P. and T.T. substantially revised the work. All authors reviewed the manuscript.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability

The datasets supporting the conclusions of this article are included within the article (and its Additional files).

Declarations

Ethical approval and consent to participate

Not Applicable.

Consent to publish on the part of authors

All authors read and approved the final manuscript.

Consent to publish on the part of participants

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Kebede MM, Le Cornet C, Fortner RT. In-depth evaluation of machine learning methods for semi-automating article screening in a systematic review of mechanistic literature. Res Synth Methods. 2023;14(2):156–72.
  • 2.Blaizot A, Veettil SK, Saidoung P, Moreno-Garcia CF, Wiratunga N, Aceves-Martins M, et al. Using artificial intelligence methods for systematic review in health sciences: a systematic review. Res Synth Methods. 2022;13(3):353–62.
  • 3.Zhang Y, Liang S, Feng Y, Wang Q, Sun F, Chen S, et al. Automation of literature screening using machine learning in medical evidence synthesis: a diagnostic test accuracy systematic review protocol. Syst Rev. 2022;11(1):11.
  • 4.FAQ | Elicit. [cited 23 May 2023]. Available from: https://elicit.org/faq#what-is-elicit
  • 5.Tannou T, Lihoreau T, Couture M, Giroux S, Wang RH, Spalla G, et al. Is research on “smart living environments” based on unobtrusive technologies for older adults going in circles? Evidence from an umbrella review. Ageing Res Rev. 2023;84:101830.
  • 6.Kitchenham B, Brereton P, Li Z, Budgen D, Burn A. Repeatability of systematic literature reviews. In: 15th Annual Conference on Evaluation & Assessment in Software Engineering (EASE 2011). 2011. pp. 46–55. Available from: https://ieeexplore.ieee.org/abstract/document/6083161
  • 7.Khosravi P, Ghapanchi AH. Investigating the effectiveness of technologies applied to assist seniors: a systematic literature review. Int J Med Inform. 2016;85(1):17–26.
  • 8.Lussier M, Lavoie M, Giroux S, Consel C, Guay M, Macoir J, et al. Early detection of mild cognitive impairment with in-home monitoring sensor technologies using functional measures: a systematic review. IEEE J Biomed Health Inform. 2019;23(2):838–47.
  • 9.Reeder B, Meyer E, Lazar A, Chaudhuri S, Thompson HJ, Demiris G. Framing the evidence for health smart homes and home-based consumer health technologies as a public health intervention for independent aging: a systematic review. Int J Med Inform. 2013;82(7):565–79.
  • 10.Lenouvel E, Novak L, Nef T, Klöppel S. Advances in sensor monitoring effectiveness and applicability: a systematic review and update. Gerontologist. 2020;60(4):e299–308.
  • 11.Richardson MX, Ehn M, Stridsberg SL, Redekop K, Wamala-Andersson S. Nocturnal digital surveillance in aged populations and its effects on health, welfare and social care provision: a systematic review. BMC Health Serv Res. 2021;21(1):622.
  • 12.Morris M, Adair B, Miller K, Ozanne E, Hampson R, Pearce A, et al. Smart-home technologies to assist older people to live well at home. J Aging Sci. 2013;1:101.
  • 13.Aromataris E, Pearson A. The systematic review: an overview. Am J Nurs. 2014;114(3):53.
  • 14.Gusenbauer M. Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases. Scientometrics. 2019;118(1):177–214.
  • 15.Kitchenham B, Brereton P, Budgen D. Mapping study completeness and reliability: a case study. 2012. pp. 126–35.
  • 16.Wohlin C, Runeson P, da Mota Silveira Neto PA, Engström E, do Carmo Machado I, de Almeida ES. On the reliability of mapping studies in software engineering. J Syst Softw. 2013;86(10):2594–610.
  • 17.MacDonell S, Shepperd M, Kitchenham B, Mendes E. How reliable are systematic reviews in empirical software engineering? IEEE Trans Softw Eng. 2010;36(5):676–87.
  • 18.Fabiano N, Gupta A, Bhambra N, Luu B, Wong S, Maaz M, et al. How to optimize the systematic review process using AI tools. JCPP Adv. 2024;4(2):e12234.
  • 19.van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, van der Palen J, Doggen CJM, Lenferink A. Artificial intelligence in systematic reviews: promising when appropriately used. BMJ Open. 2023;13(7):e072254.
  • 20.Beller EM, Chen JKH, Wang ULH, Glasziou PP. Are systematic reviews up-to-date at the time of publication? Syst Rev. 2013;2:36.
  • 21.Beller E, Clark J, Tsafnat G, Adams C, Diehl H, Lund H, et al. Making progress with the automation of systematic reviews: principles of the International Collaboration for the Automation of Systematic Reviews (ICASR). Syst Rev. 2018;7(1):77.
  • 22.Cacciamani GE, Chu TN, Sanford DI, Abreu A, Duddalwar V, Oberai A, et al. PRISMA AI reporting guidelines for systematic reviews and meta-analyses on AI in healthcare. Nat Med. 2023;29(1):14–5.
