PLOS Digital Health. 2025 May 12;4(5):e0000849. doi: 10.1371/journal.pdig.0000849

Artificial intelligence’s contribution to biomedical literature search: revolutionizing or complicating?

Rui Yip 1,2, Young Joo Sun 1,2, Alexander G Bassuk 3, Vinit B Mahajan 1,2,4,*
Editor: Luis Filipe Nakayama
PMCID: PMC12068611  PMID: 40354425

Abstract

There is a growing number of articles about conversational AI (e.g., ChatGPT) for generating scientific literature reviews and summaries. Yet comparative evidence lags behind its wide adoption by clinicians and researchers. We explored ChatGPT’s utility for literature search from an end-user perspective, through the lens of clinicians and biomedical researchers. We quantitatively compared the utility of basic versions of ChatGPT against conventional search methods such as Google and PubMed. We further tested whether ChatGPT user-support tools (i.e., plugins, the web-browsing function, prompt engineering, and custom GPTs) could improve its responses across four common and practical literature search scenarios: (1) high-interest topics with an abundance of information, (2) niche topics with limited information, (3) scientific hypothesis generation, and (4) newly emerging clinical practice questions. Our results demonstrated that basic ChatGPT functions had limitations in consistency, accuracy, and relevancy. User-support tools showed improvements, but the limitations persisted. Interestingly, each literature search scenario posed different challenges: an abundance of secondary information sources for high-interest topics, and sparse, less compelling literature for new or niche topics. This study tested practical examples that highlight both the potential and the pitfalls of integrating conversational AI into literature search processes, and it underscores the necessity for rigorous comparative assessments of AI tools in scientific research.

Author summary

As generative artificial intelligence (AI) tools become increasingly functional, the promise of this technology is creating a wave of excitement and anticipation around the globe, including in the wider scientific and biomedical community. Despite this growing excitement, researchers seeking robust, reliable, reproducible, and peer-reviewed findings have raised concerns about AI’s current limitations, particularly its potential to spread and promote misinformation. This emphasizes the need for continued discussion on how to appropriately employ AI to streamline current research practices. We, as members of the scientific community and also end-users of conversational AI tools, sought to explore practical ways to incorporate AI for streamlining research practices. Here, we probed whether a text-based research task—scientific literature mining—can be outsourced to ChatGPT and to what extent human adjudication might be necessary. We tested different models of ChatGPT as well as augmentations such as plugins and custom GPTs under different contexts of biomedical literature searching. Our results show that, at present, ChatGPT does not meet the level of reliability needed for wide adoption in scientific literature searching. However, as conversational AI tools rapidly advance (a trend highlighted by the augmentations explored in this article), we envision a time when ChatGPT could become a great time saver for literature searches and make scientific information easily accessible.

Introduction

Artificial intelligence (AI), in its many forms, has been heavily incorporated into scientific research, bringing significant benefits in big data analysis and the automation of routine tasks [1–6]. Among these advancements, large language models (LLMs) and their application as conversational AI (processing and generating human-like, conversational text; e.g., ChatGPT) enable those without AI expertise to easily leverage these capabilities [7]. The rapid development of LLMs has driven members of the scientific community to widely investigate their application in the biomedical field, where they have been shown to be effective in diagnostics, medical education, and even gene-editing protein design [8–12].

Scientific information mining is one of the fundamental steps of the scientific discovery process. It involves a combined process of literature search followed by a subjective evaluation to identify good-quality references on a specific topic [13]. However, this is a laborious and time-consuming task. Surveys indicate that scientists spend about 7 hours weekly on literature searching, while a literature review on a particular topic takes an average of 41 weeks for a five-person research team [14]. Because it can generate an immediate response, ChatGPT has been investigated for whether it can be effectively utilized to streamline this process, by asking it to write and synthesize literature reviews and summaries. Yet the findings showed limitations, with persistent issues in accuracy (commonly known as hallucination, when ChatGPT fabricates information), consistency (when ChatGPT gives different responses to the same query), and relevance (when ChatGPT returns irrelevant information). These limitations can be critical, especially in a literature review context, as biomedical researchers strictly require robust, reproducible, and accurate information and findings. Many have therefore concluded that ChatGPT, at present, may not be a robust tool for generating research reviews [15–22].

We evaluated ChatGPT’s performance in a more straightforward task: its ability to identify high-quality references under various timely biomedical literature search scenarios. In doing so, we identified ChatGPT’s potential and its current limitations when used for biomedical literature searching.

Results and discussions

Strategies of employing ChatGPT for scientific literature search

Conventional approaches in scientific literature searching.

Literature searches conventionally rely on web-based search engines and scientific literature archives/databases (e.g., Google or PubMed) and generally follow the steps below (Fig 1A):

Fig 1. Can scientific literature search be aided by conversational AI?


  1. Keyword Formulation: List essential keywords representing your research focus.

  2. Search Execution: Input keywords into search engines or databases.

  3. Initial Screening: Skim through the returned results, assessing titles and abstracts for relevance.

  4. Deep Dive: Read relevant papers in detail to understand the authors’ findings, extract information, and evaluate and incorporate the authors’ findings, arguments, discussions, and opinions.

  5. Reference Mining: Examine cited works in selected papers to uncover further relevant reading that offers additional or contrasting insights or information not present in the initial studies.

  6. Iterative Review: Continue the cycle of reading and reference mining until either no new relevant information is identified, or enough information is gathered/acquired.

Using conventional web-based search engines involves time-consuming steps (3–6), as users must sift through excessive information and discard what is irrelevant. This challenge intensifies for scientists lacking prior knowledge of the field, topic, or technologies being searched. Conversational AI like ChatGPT potentially offers a more time-efficient route by automating the search, extraction, and analysis of information (steps 2–4), providing an immediate response to specific inquiries such as “Give me 6 vitreous proteomics papers in age-related macular degeneration (AMD)”. To investigate this, we tested ChatGPT for its efficacy in identifying vitreous proteomics studies in AMD. Throughout this process, we continuously proofread the results and provided a quantitative analysis to assess its performance.

Before AI tools, literature searches began by asking web-based search engines (i.e., Google, PubMed, and Google Scholar) to look for “Vitreous proteomics studies in AMD”. After a series of exhaustive literature searches using traditional platforms, we identified 10 relevant papers. Google returned 116,000 search results, with the top 10 results identifying six relevant papers (Fig 1B). Querying the same prompt in PubMed returned six results, only one of which was relevant. It is notable that PubMed offers functions for optimizing search results, such as Boolean operators or filtering results by journal category and study type (Fig 1C). For this study, we did not employ optimization strategies, as our goal was to compare search outcomes from the perspective of an average end-user in typical research scenarios. Google Scholar, another popular academic search platform, produced 4,460 results for the same query. Upon proofreading the top ten results, four suggested links correctly identified relevant papers (Fig 1D).
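For readers who prefer a scripted equivalent of this conventional step, the sketch below shows how such a query could be issued programmatically against PubMed through NCBI’s public E-utilities (esearch) endpoint; the Boolean search term and the ten-result cutoff are our illustrative choices, not the exact queries used in this study.

```python
import requests

# Minimal sketch: querying PubMed's E-utilities (esearch) for the same topic.
# The Boolean operators and [Title/Abstract] field tags illustrate the kind of
# optimization PubMed supports but that we deliberately did not use here.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": '(vitreous[Title/Abstract]) AND (proteomics[Title/Abstract]) '
            'AND ("macular degeneration"[Title/Abstract])',
    "retmax": 10,          # top ten results, mirroring our manual screening
    "retmode": "json",
}
result = requests.get(ESEARCH, params=params, timeout=30).json()["esearchresult"]
print(result["count"], "results; top PMIDs:", result["idlist"])
```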

Scientific searches with basic functions of ChatGPT.

Employing the same inquiry, “Give me 6 vitreous proteomics studies in AMD,” we tested both ChatGPT’s GPT-3.5 and ChatGPT Classic (GPT-4 with no additional capabilities) models (Fig 1E). Concerns from the scientific community regarding ChatGPT include inconsistent outputs, inaccurate or fabricated references, and inclusion of irrelevant articles [16,23,24]. To address this, we prompted the same question 10 times and evaluated these concerns for the basic versions of ChatGPT (i.e., GPT-3.5 and ChatGPT Classic).

GPT-3.5 generated inconsistent results, failing to suggest relevant publications in six instances and providing inaccurate references in others, often advising users to consult PubMed and Google Scholar. In contrast, while ChatGPT Classic generated lists of publications in every iteration, it also failed to provide accurate references, often featuring rephrased words in titles, fabricated authorships, or incorrect publication dates or journals (Fig 1F). As such, based on their basic functions alone, neither GPT-3.5 nor ChatGPT Classic demonstrated a clear advantage over conventional search methods, showing limitations in consistency, accuracy, and relevancy that could lead to scientific misinformation.
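All queries in this study were entered through the ChatGPT web interface. Purely as an illustration of the repeated, independent-session design, the sketch below shows how the same ten-iteration test could be reproduced against the OpenAI chat completions API; the `openai` Python package, an API key in the environment, and the "gpt-3.5-turbo" model name are our assumptions, not part of the interface-based procedure actually used.

```python
from openai import OpenAI

# Sketch only: mirrors the repeated, independent-request design via the API.
# Assumed: the `openai` package, OPENAI_API_KEY set, and "gpt-3.5-turbo" as a
# stand-in for the GPT-3.5 model tested in the ChatGPT interface.
client = OpenAI()
PROMPT = "Give me 6 vitreous proteomics studies in AMD"

responses = []
for _ in range(10):  # ten independent iterations, no shared conversation history
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
    )
    responses.append(completion.choices[0].message.content)

# Each response would then be manually proofread for accuracy and relevancy.
```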

Scientific literature searching with augmented ChatGPT.

The use of conversational AI involves three elements: the AI model (the large language model, LLM), the data source, and the user input. Augmentations to each can enhance ChatGPT’s output quality—through user-support tools/functions and prompt engineering (https://platform.openai.com/docs/guides/prompt-engineering) [25,26]. To assess whether these augmentations could improve ChatGPT’s consistency, accuracy, and relevancy, we tested them within an analogy—the use of conversational AI seen as “a student going into a library to ask a specific question.” Here, the student is the user, and the librarian is the LLM, providing answers based on the information available in the library (the data source). Our augmentation approach for each element is shown below (Fig 2):

Fig 2. Can the search for scientific literature be improved by user-support tools? Illustrations in panels a, b, and c were generated using Adobe Illustrator’s generative AI features.


Relevant peer-reviewed publications retrieved: P1 [34], P2 [35], P3 [36], P4 [37], P5 [38], P6 [39], P7 [40], P8 [41], P9 [42], P10 [43], P11 [44].

  1. Expansion of the library: use of online and/or plugin data sources.

    • Access to real-time web through ChatGPT’s web-browsing function.

  2. A change to a specialized librarian: use of plugins specialized in scientific literature search.

    • Tailoring how the large language model interprets and conveys information with the “Scholarly” plugin.

  3. Asking better questions: prompt engineering.

    • Refining the input to enhance the quality and relevance of responses from LLM.

Using the same prompt, “Give me 6 vitreous proteomics studies in AMD,” we ran ten separate iterations for each augmentation strategy. Following the November 6, 2023 update, ChatGPT-4’s default model included a built-in web-browsing function. In five of ten iterations, at least one relevant study was identified. Surprisingly, in one iteration all six answers had clear and accurate references and five were relevant, including a study that we had not previously identified. The incorporation of the web-browsing function enabled the LLM to access more current information available on the internet and improved search responses. However, issues with inconsistency and accuracy persisted. For instance, one hyperlink erroneously led to a paper on sugar tolerance instead of vitreous proteomics (Fig 2A).

We then employed prompt engineering in the multimodal ChatGPT-4 model, using a more detailed prompt with clearer instructions that stated the purpose of the inquiry (“conducting a literature search”), specified the information source (“peer-reviewed articles from academic and scientific journals”), and detailed the study designs sought (“…patients with AMD” and “included mass spectrometry and/or multiplex ELISA data”) (Fig 2B). With these refinements, ChatGPT demonstrated an enhanced ability to distinguish relevant from irrelevant studies. These refinements align with the prompt engineering strategy of providing clear instructions, as recommended by OpenAI. A clearer and more detailed prompt improves the LLM’s response by narrowing the scope and reducing errors caused by ambiguity. Additionally, the extra information acts as a filter, guiding the LLM to exclude studies that do not meet the additional criteria (https://platform.openai.com/docs/guides/prompt-engineering). However, it is important to note that challenges with inconsistency and accuracy persisted.
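As an illustration only, the snippet below reconstructs the structure of such a refined prompt from the elements described above; the wording is our paraphrase, not the verbatim prompt used in the study.

```python
# Illustrative reconstruction of a prompt-engineered query (paraphrased, not verbatim).
# Each added clause narrows the scope: purpose, source type, population, and study design.
refined_prompt = (
    "I am conducting a literature search. "                                      # purpose of the inquiry
    "Using only peer-reviewed articles from academic and scientific journals, "  # information source
    "give me 6 vitreous proteomics studies in patients with AMD "                # population and topic
    "that included mass spectrometry and/or multiplex ELISA data."               # study design sought
)
print(refined_prompt)
```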

Next, we integrated a plugin for searching academic and journal-published articles (i.e., Scholarly) (Fig 2C). In ten iterations, it consistently identified six papers with accurate references, repeatedly returning the same four relevant papers. This improvement stems from Scholarly’s ability to access specialized academic repositories and its built-in constraints, which ensure that the LLM retrieves data exclusively from these sources [27].

On January 10, 2024, OpenAI introduced the GPT Store, allowing users to create and use customized ChatGPT versions tailored for specific purposes. These customized models can have a specified knowledge base, pre-configured prompts, and additional functionalities, essentially combining the three augmentation approaches mentioned earlier into one framework [28]. We tested Consensus, one of the popular GPTs used for academic searching [29], and noted that in ten iterations it identified three additional papers that had not been discovered using the other search approaches. This improvement may be due to Consensus’s access to the Semantic Scholar database, its fine-tuned LLM specifically trained to understand and summarize research articles, and its advanced built-in search algorithm that improves literature search responses [30].
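Because Consensus draws on Semantic Scholar, the sketch below illustrates the kind of scholarly-repository access such a custom GPT builds on by querying the public Semantic Scholar Graph API directly; this is not Consensus’s actual implementation, and the query string is our example.

```python
import requests

# Sketch: querying the public Semantic Scholar Graph API directly (not Consensus's
# implementation) to illustrate the scholarly-repository access such tools add.
url = "https://api.semanticscholar.org/graph/v1/paper/search"
params = {
    "query": "vitreous proteomics age-related macular degeneration",  # our example query
    "fields": "title,year,venue,externalIds",
    "limit": 6,
}
data = requests.get(url, params=params, timeout=30).json()
for paper in data.get("data", []):
    print(paper.get("year"), "-", paper.get("title"), "-", paper.get("venue"))
```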

Overall, augmentations significantly improved ChatGPT’s ability to return relevant research manuscripts with accurate sources. Yet they could not fully resolve issues with consistency, accuracy, and relevancy. It is important to highlight that ChatGPT’s web-browsing function is restricted to openly accessible information, excluding subscription-based journals or publications. While there are multiple academic search plugins, including “Scholarly”, that can be paired with ChatGPT to access a broader range of literature databases, users should remain cognizant of the limitations in coverage and timeliness of these sources.

Practical evaluation of ChatGPT-based scientific literature search in various scenarios

Using ChatGPT for literature searches in information-rich topics.

Augmented ChatGPT, unlike its basic counterpart, proved useful when we focused on specific scientific findings for one of the most prevalent diseases in ophthalmology. We extended our tests to COVID-19, a topic of high public interest with a high volume of publications, to observe whether the abundance of information would affect ChatGPT’s responses.

We asked ChatGPT to “show all studies that discuss genetic risk factors for Long COVID-19” (Fig 3A), testing this five times in both the basic and augmented ChatGPT models. The basic ChatGPT models failed to return any answers and suggested conventional search methods for this task. Interestingly, in all five iterations, web-browsing ChatGPT-4 identified six articles with clear sources, but these were predominantly news articles rather than peer-reviewed studies (Fig 3B). In contrast, the Scholarly-augmented version consistently returned the same three or four relevant research articles (Fig 3D).

Fig 3. Literature search results with augmented ChatGPT on a high-interest topic with an abundance of information (COVID-19).


Relevant peer-reviewed publications retrieved: P1 [34], P2 [35], P3 [36], P4 [37], P5 [38], P6 [39], P7 [40], P8 [41].

To enhance answer quality, we refined our prompt to specify the need for peer-reviewed article sources and retested all four approaches (Fig 3A). The web-browsing ChatGPT-4 model returned more research articles and fewer news articles per iteration (Fig 3C); however, the refinement had little effect on the responses from either basic ChatGPT or ChatGPT augmented with Scholarly (Fig 3E). Our findings indicate that basic ChatGPT models were not particularly effective for literature searches in this context. Although the prompt-engineered, augmented model could return relevant literature, the vast amount of available information appeared to hinder its ability to consistently identify peer-reviewed studies, which are often the primary interest of scientists.

Using ChatGPT for hypothesis generation.

Literature searches not only assess the current literature landscape but also aid in hypothesis generation, with a focus on novelty, plausibility, and testability. We evaluated whether web-browsing ChatGPT-4 could return relevant literature and contextualize it within the logical frame of our hypothesis. We hypothesized that “creatine intake is associated with hair loss (A→C)” based on our rationale that “creatine intake increases DHT (A→B)” and “an increase in DHT is associated with hair loss (B→C)” (Fig 4A). We initiated with the query (A→C): “Provide relevant peer-reviewed studies that are related to our hypothesis: creatine intake can lead to hair loss.” Across five iterations, the responses were inconsistent, returning either a single relevant research article that supported only the A→B correlation or a review paper discussing potential side effects of creatine intake (Fig 4B). Recently, it was suggested that ChatGPT may generate better-quality responses when it is prompted to reason through steps by splitting complex tasks into simpler, intermediate tasks. This prompting method, termed “Chain-of-Thought Prompting,” can enable the LLM to focus on fewer variables at a time with more computation, resulting in lower error rates and improving the LLM’s reasoning ability (https://platform.openai.com/docs/guides/prompt-engineering) [31].

Fig 4. Can augmented ChatGPT assist researchers in hypothesis generation?


Therefore, we engineered the prompts to have ChatGPT follow the human logical flow of generating a hypothesis, by querying separately in an A→B, then B→C manner (Fig 4C). This strategy generally improved the number of relevant peer-reviewed articles retrieved. The literature was better contextualized within the logical framework of our hypothesis, and the responses also included proposals for future directions. Yet it is important to note that the prompt-engineered approach still showed inconsistency and variability in response quality. As such, employing a step-by-step inquiry approach that mirrors human logical reasoning might optimize the utility of ChatGPT in hypothesis generation.
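A minimal sketch of this stepwise querying strategy is shown below, assuming the `openai` Python package and "gpt-4" as a stand-in for the web-browsing ChatGPT-4 model we actually used through the web interface; the prompts are paraphrases of our sub-questions, not verbatim transcripts.

```python
from openai import OpenAI

# Sketch of the stepwise (A->B, then B->C) querying strategy within one conversation.
# Assumed: the `openai` package and "gpt-4" as a stand-in for web-browsing ChatGPT-4.
client = OpenAI()
history = [{"role": "system",
            "content": "You assist with peer-reviewed biomedical literature searches."}]

steps = [
    "Provide peer-reviewed studies on whether creatine intake increases DHT levels.",        # A -> B
    "Provide peer-reviewed studies on whether increased DHT is associated with hair loss.",  # B -> C
    "Based on the studies above, evaluate our hypothesis that creatine intake is "
    "associated with hair loss, and suggest how it could be tested.",                        # synthesis (A -> C)
]

for step in steps:
    history.append({"role": "user", "content": step})
    reply = client.chat.completions.create(model="gpt-4", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})  # keep the logical chain in context
    print(answer, "\n---")
```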

Using ChatGPT for searching clinical practice guidelines.

Decision-making in clinical practice is typically guided by established standards of care. However, clinicians often face cases or scenarios where these standards are not well established due to contradictory or inconclusive evidence. For instance, semaglutide (commonly known as Ozempic), a drug commonly used for the treatment of type 2 diabetes, has been shown to impact gastric emptying, which may pose a risk for operations that require general anesthesia [32]. Although the American Society of Anesthesiologists has published a consensus-based guideline, it recognized the “limited” evidence for establishing preoperative fasting standards for patients taking Ozempic [33]. In clinical practice, where guidelines require accuracy and conclusiveness for patient care, the precision of information is crucial. To assess ChatGPT’s utility in a clinical setting, we asked it to “show me all studies that discuss preoperative fasting guidelines for Ozempic”. We hoped that it would not only identify relevant prior studies and the most up-to-date guidelines but also acknowledge the lack of evidence. The basic ChatGPT models failed to identify relevant guidelines, and even the augmented ChatGPT with Scholarly, despite its previously promising performance, failed to find any pertinent studies related to our query in all five iterations. In contrast, ChatGPT-4 with web-browsing functionality identified the guideline, but noted the lack of evidence in only two of the five iterations. These results indicate that, as of now, no version of ChatGPT may be sufficiently reliable for literature searches specifically aimed at finding clinical guidelines.

Conclusion

This manuscript provides a snapshot of LLMs’ utility in literature searching at the time of testing, from November 2023 to early 2024. As of March 2025, LLM and generative AI capabilities for literature search are no longer available solely in the form of conversational AI such as ChatGPT. Many of the search engines referred to as “conventional” in this manuscript, such as Google, have already incorporated AI features that enhance literature searches, such as AI-generated summaries in Google search responses. This manuscript evaluated ChatGPT’s utility by conducting search scenarios in five to ten independent iterations. As LLMs quickly advance, we suggest that future studies increase the number of iterations for a more holistic understanding of ChatGPT’s limitations across different search scenarios.

At first glance, AI holds great promise for assisting scientific researchers with time-consuming and mundane tasks such as literature searches. However, its inconsistent accuracy underscores the need for careful human oversight. Despite this, conversational AI tools are advancing rapidly, with LLMs continuing to be optimized and plugins adding new functions. We envision a time when AI becomes a strong and reliable ally in streamlining and reshaping scientific research practices. Yet the question persists: is now truly the right time for the scientific community to fully embrace the utilization of such tools? As we navigate the balance between AI’s potential benefits and the imperative for rigorous scientific integrity, this question remains central to the ongoing discourse on the role of AI in research (this concluding paragraph was suggested by ChatGPT based on this article, with several iterations of human refinement).

Methods

Artificial Intelligence and web-based search engines

Queries with web-based search engines (Google, PubMed, and Google Scholar) were conducted on November 28, 2023. For evaluating the basic functions of ChatGPT, ChatGPT 3.5 and ChatGPT Classic were employed on November 28, 2023. Access to these models was facilitated through the drop-down menu in the ChatGPT user interface. To verify activation of the web-browsing feature of ChatGPT-4, we looked for responses that explicitly referenced web sources. The Scholarly plugin was accessed via the “ChatGPT Plugin” option from the drop-down menu in the ChatGPT interface. Additionally, Consensus GPT was accessed through the ChatGPT GPT Store.

Evaluation on consistency, relevancy and accuracy

Each query was tested in 5–10 independent iterations, where each iteration was conducted as a new ChatGPT session to prevent context retention bias. To assess accuracy, we manually verified whether the manuscript titles in ChatGPT responses matched actual publications identifiable in the Google, Google Scholar, or PubMed repositories. For cases where a hyperlink was provided in the response (e.g., from web-browsing ChatGPT, plugins, or custom GPTs), we added an additional criterion checking whether the provided link correctly directed to the article cited in the response. Note that an “accurate reference” in the scientific context also includes additional criteria such as correct authorship, publication date, and journal name. We did not include these in our criteria, as we focused on assessing ChatGPT’s utility in real-world scenarios for identifying literature.

A response could only be assessed for relevancy if it provided accurate references as defined above. We proofread each returned article and checked whether it contained relevant information pertaining to the query being tested. Responses from all iterations for a specific query were compared to evaluate consistency across independent sessions. For queries conducted on search engines (i.e., Google, PubMed, and Google Scholar), we proofread and evaluated the top ten results.
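A minimal sketch of the title-verification step is shown below, assuming a check against PubMed via the E-utilities esearch endpoint; the helper function and example title are ours for illustration, and a positive match was still followed by manual proofreading of the record.

```python
import requests

def title_in_pubmed(title: str) -> bool:
    """Rough accuracy check: does an exact-title PubMed search return any record?
    In our workflow, a hit was still followed by manual proofreading of the record."""
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "pubmed", "term": f'"{title}"[Title]', "retmode": "json"},
        timeout=30,
    )
    return int(resp.json()["esearchresult"]["count"]) > 0

# Hypothetical title as ChatGPT might return it:
print(title_in_pubmed("Proteomics of vitreous humor of patients "
                      "with exudative age-related macular degeneration"))
```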

Supporting information

S1 File. ChatGPT conversation transcripts.

(PDF)

pdig.0000849.s001.pdf (821.3KB, pdf)

Acknowledgments

We express our appreciation to Julian Wolf and Charles Meno Theodore Deboer at Stanford Ophthalmology for their insights regarding data presentation and evaluations of queries investigated in this manuscript. We thank MaryAnn Mahajan and Joel Andrew Franco at Stanford Ophthalmology for their proofreading of the manuscript.

Data Availability

Data can be accessed through the Supplementary Information file.

Funding Statement

VBM is supported by NIH grants (R01EY031952, R01EY031360, R01EY030151, and P30EY026877), the Stanford Center for Optic Disc Drusen, and Research to Prevent Blindness, New York, New York. AGB is supported by NIH grants R01EY030151 and R01EY031952. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Quazi S. Artificial intelligence and machine learning in precision and genomic medicine. Med Oncol. 2022;39(8):120. doi: 10.1007/s12032-022-01711-1 [DOI] [PMC free article] [PubMed] [Google Scholar] [Retracted]
  • 2.Malhotra A, Molloy EJ, Bearer CF, Mulkey SB. Emerging role of artificial intelligence, big data analysis and precision medicine in pediatrics. Pediatr Res. 2023;93(2):281–3. doi: 10.1038/s41390-022-02422-z [DOI] [PubMed] [Google Scholar]
  • 3.Leung E, Lee A, Tsang H, Wong MCS. Data-driven service model to profile healthcare needs and optimise the operation of community-based care: A multi-source data analysis using predictive artificial intelligence. Hong Kong Med J. 2023;29(6):484–6. doi: 10.12809/hkmj235154 [DOI] [PubMed] [Google Scholar]
  • 4.Abbasimehr H, Paki R. Prediction of COVID-19 confirmed cases combining deep learning methods and Bayesian optimization. Chaos Solitons Fractals. 2021;142:110511. doi: 10.1016/j.chaos.2020.110511 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Magrabi F, Lyell D, Coiera E. Automation in Contemporary Clinical Information Systems: a Survey of AI in Healthcare Settings. Yearb Med Inform. 2023;32(1):115–26. doi: 10.1055/s-0043-1768733 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Hinson JS, Klein E, Smith A, Toerper M, Dungarani T, Hager D, et al. Multisite implementation of a workflow-integrated machine learning system to optimize COVID-19 hospital admission decisions. NPJ Digit Med. 2022;5(1):94. doi: 10.1038/s41746-022-00646-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614(7947):224–6. doi: 10.1038/d41586-023-00288-7 [DOI] [PubMed] [Google Scholar]
  • 8.Hou W, Ji Z. Assessing gpt-4 for cell type annotation in single-cell rna-seq analysis. Nat Methods. 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi: 10.1371/journal.pdig.0000198 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Tenner ZM, Cottone MC, Chavez MR. Harnessing the open access version of ChatGPT for enhanced clinical opinions. PLOS Digit Health. 2024;3(2):e0000355. doi: 10.1371/journal.pdig.0000355 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Ruffolo JA, Nayfach S, Gallagher J, Bhatnagar A, Beazer J, Hussain R. Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences. bioRxiv. 2024. doi: 10.1101/2024.04.22.590591 [Google Scholar]
  • 12.Qu Y, Huang K, Cousins H, Johnson WA, Yin D, Shah MM. Crispr-gpt: An llm agent for automated design of gene-editing experiments. bioRxiv. 2024. doi: 10.1101/2024.04.25.591003 [DOI] [Google Scholar]
  • 13.Grewal A, Kataria H, Dhawan I. Literature search for research planning and identification of research problem. Indian J Anaesth. 2016;60(9):635–9. doi: 10.4103/0019-5049.190618 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Wiggers K. Elicit is building a tool to automate scientific literature review. 2023. https://techcrunch.com/ [Google Scholar]
  • 15.Bockting CL, van Dis EAM, van Rooij R, Zuidema W, Bollen J. Living guidelines for generative AI - why scientists must oversee its use. Nature. 2023;622(7984):693–6. doi: 10.1038/d41586-023-03266-1 [DOI] [PubMed] [Google Scholar]
  • 16.AlZaabi A, ALAmri A, Albalushi H, Aljabri R, AalAbdulsalam A. ChatGPT applications in academic research: a review of benefits, concerns, and recommendations. bioRxiv. 2023. doi: 10.1101/2023.08.17.553688 [DOI] [Google Scholar]
  • 17.Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus. 2023;15(2):e35179. doi: 10.7759/cureus.35179 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Zhu J, Jiang J, Yang M, Ren Z. Chatgpt and environmental research. Environ Sci Technol. 2023;57(46):17667–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Zhong Q, Tan X, Du R, Liu J, Liao L, Wang C. Is ChatGPT a reliable source for writing review articles in catalysis research? A case study on CO2 hydrogenation to higher alcohols. Preprints. 2023. [Google Scholar]
  • 20.Ruppar T. Artificial intelligence in research dissemination. West J Nurs Res. 2023;45(4):291–2. [DOI] [PubMed] [Google Scholar]
  • 21.D’Amico RS, White TG, Shah HA, Langer DJ. I Asked a ChatGPT to Write an Editorial About How We Can Incorporate Chatbots Into Neurosurgical Research and Patient Care…. Neurosurgery. 2023;92(4):663–4. doi: 10.1227/neu.0000000000002414 [DOI] [PubMed] [Google Scholar]
  • 22.Hutson M. Could AI help you to write your next paper?. Nature. 2022;611(7934):192–3. doi: 10.1038/d41586-022-03479-w [DOI] [PubMed] [Google Scholar]
  • 23.Walters WH, Wilder EI. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep. 2023;13(1):14045. doi: 10.1038/s41598-023-41032-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Blanchard F, Assefi M, Gatulle N, Constantin J-M. ChatGPT in the world of medical research: From how it works to how to use it. Anaesth Crit Care Pain Med. 2023;42(3):101231. doi: 10.1016/j.accpm.2023.101231 [DOI] [PubMed] [Google Scholar]
  • 25.White J, Fu Q, Hays S, Sandborn M, Olea C, Gilbert H. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv. 2023. [Google Scholar]
  • 26.Chen B, Zhang Z, Langrené N, Zhu S. Unleashing the potential of prompt engineering in large language models: A comprehensive review. arXiv. 2023. [Google Scholar]
  • 27.Lendahire. AI Education Technology, ChatGPT 4 Plugin: Exploring the Scholarly Plugin for Enhanced ChatGPT Searches. [Blog post discussing the use of the Scholarly plugin in ChatGPT for improving academic and educational search capabilities]. Lendahire; 2024. https://lendahire.com/exploring-the-scholarly-plugin-for-enhanced-chatgpt-searches/ [Google Scholar]
  • 28.OpenAI. Introducing GPTs. 2023. [Google Scholar]
  • 29.Consensus. Introducing: Consensus GPT, your AI research assistant. 2024. [Google Scholar]
  • 30.AI C. How it Works & Consensus FAQ’s. [Overview of how the Consensus academic search engine works, including methodology and features]. Consensus NLP; 2025. https://consensus.app/home/blog/welcome-to-consensus/ [Google Scholar]
  • 31.Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst. 2022;35:24824–37. [Google Scholar]
  • 32.Wilding JPH, Batterham RL, Calanna S, Davies M, Van Gaal LF, Lingvay I, et al. Once-Weekly Semaglutide in Adults with Overweight or Obesity. N Engl J Med. 2021;384(11):989–1002. doi: 10.1056/NEJMoa2032183 [DOI] [PubMed] [Google Scholar]
  • 33.consensus-based guideline for ozempic - Google Search. n.d.
  • 34.Schori C, Trachsel C, Grossmann J, Zygoula I, Barthelmes D, Grimm C. The proteomic landscape in the vitreous of patients with age-related and diabetic retinal disease. Invest Ophthalmol Vis Sci. 2018;59(4):AMD31–40. [DOI] [PubMed] [Google Scholar]
  • 35.Koss MJ, Hoffmann J, Nguyen N, Pfister M, Mischak H, Mullen W, et al. Proteomics of vitreous humor of patients with exudative age-related macular degeneration. PLoS One. 2014;9(5):e96895. doi: 10.1371/journal.pone.0096895 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Nobl M, Reich M, Dacheva I, Siwy J, Mullen W, Schanstra JP, et al. Proteomics of vitreous in neovascular age-related macular degeneration. Exp Eye Res. 2016;146:107–17. doi: 10.1016/j.exer.2016.01.001 [DOI] [PubMed] [Google Scholar]
  • 37.Santos FM, Ciordia S, Mesquita J, Cruz C, Sousa JPCE, Passarinha LA, et al. Proteomics profiling of vitreous humor reveals complement and coagulation components, adhesion factors, and neurodegeneration markers as discriminatory biomarkers of vitreoretinal eye diseases. Front Immunol. 2023;14:1107295. doi: 10.3389/fimmu.2023.1107295 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Dos Santos FM, Ciordia S, Mesquita J, de Sousa JPC, Paradela A, Tomaz CT, et al. Vitreous humor proteome: unraveling the molecular mechanisms underlying proliferative and neovascular vitreoretinal diseases. Cell Mol Life Sci. 2022;80(1):22. doi: 10.1007/s00018-022-04670-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Santos FM, Mesquita J, Castro-de-Sousa JP, Ciordia S, Paradela A, Tomaz CT. Vitreous Humor Proteome: Targeting Oxidative Stress, Inflammation, and Neurodegeneration in Vitreoretinal Diseases. Antioxidants (Basel). 2022;11(3):505. doi: 10.3390/antiox11030505 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Guo H, Li J, Lu P. Systematic review and meta-analysis of mass spectrometry proteomics applied to ocular fluids to assess potential biomarkers of age-related macular degeneration. BMC Ophthalmol. 2023;23(1):507. doi: 10.1186/s12886-023-03237-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.García-Quintanilla L, Rodríguez-Martínez L, Bandín-Vilar E, Gil-Martínez M, González-Barcia M, Mondelo-García C. Recent advances in proteomics-based approaches to studying age-related macular degeneration: a systematic review. Int J Mol Sci. 2022;23(23). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Ecker SM, Pfahler SM, Hines JC, Lovelace AS, Glaser BM. Sequential in-office vitreous aspirates demonstrate vitreous matrix metalloproteinase 9 levels correlate with the amount of subretinal fluid in eyes with wet age-related macular degeneration. Mol Vis. 2012;18:1658–67. [PMC free article] [PubMed] [Google Scholar]
  • 43.Kim TW, Kang JW, Ahn J, Lee EK, Cho K-C, Han BNR, et al. Proteomic analysis of the aqueous humor in age-related macular degeneration (AMD) patients. J Proteome Res. 2012;11(8):4034–43. doi: 10.1021/pr300080s [DOI] [PubMed] [Google Scholar]
  • 44.Kersten E, Paun CC, Schellevis RL, Hoyng CB, Delcourt C, Lengyel I, et al. Systemic and ocular fluid compounds as potential biomarkers in age-related macular degeneration. Surv Ophthalmol. 2018;63(1):9–39. doi: 10.1016/j.survophthal.2017.05.003 [DOI] [PubMed] [Google Scholar]
PLOS Digit Health. doi: 10.1371/journal.pdig.0000849.r002

Decision Letter 0

Yuzhe Yang

21 Feb 2025

PDIG-D-24-00172
Artificial Intelligence’s Contribution to Biomedical Literature Search: Revolutionizing or Complicating?
PLOS Digital Health

Dear Dr. Mahajan,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days (Mar 23 2025 11:59PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to any formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Luis Filipe Nakayama, M.D.
Academic Editor
PLOS Digital Health

Leo Anthony Celi
Editor-in-Chief
PLOS Digital Health
orcid.org/0000-0001-6712-6626

Journal Requirements:

1. We have amended your Competing Interest statement to comply with journal style. We kindly ask that you double check the statement and let us know if anything is incorrect.

2. Please provide an Author Summary. This should appear in your manuscript between the Abstract (if applicable) and the Introduction, and should be 150–200 words long. The aim should be to make your findings accessible to a wide audience that includes both scientists and non-scientists. Sample summaries can be found on our website under Submission Guidelines: https://journals.plos.org/digitalhealth/s/submission-guidelines#loc-parts-of-a-submission

3. In the online submission form, you indicated that “Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.” All PLOS journals now require all data underlying the findings described in their manuscript to be freely available to other researchers, either 1. In a public repository, 2. Within the manuscript itself, or 3. Uploaded as supplementary information. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board.

If your data cannot be made publicly available for ethical or legal reasons (e.g., public availability would compromise patient privacy), please explain your reasons by return email and your exemption request will be escalated to the editor for approval. Your exemption request will be handled independently and will not hold up the peer review process, but will need to be resolved should your manuscript be accepted for publication. One of the Editorial team will then be in touch if there are any issues.

4. Some material included in your submission may be copyrighted. According to PLOS’s copyright policy, authors who use figures or other material (e.g., graphics, clipart, maps) from another author or copyright holder must demonstrate or obtain permission to publish this material under the Creative Commons Attribution 4.0 International (CC BY 4.0) License used by PLOS journals. Please closely review the details of PLOS’s copyright requirements here: PLOS Licenses and Copyright. If you need to request permissions from a copyright holder, you may use PLOS's Copyright Content Permission form. Please respond directly to this email or email the journal office and provide any known details concerning your material's license terms and permissions required for reuse, even if you have not yet obtained copyright permissions or are unsure of your material's copyright compatibility.

Potential Copyright Issues: Figure 2: Please confirm whether you drew the images / clip-art within the figure panels by hand. If you did not draw the images, please provide (a) a link to the source of the images or icons and their license / terms of use; or (b) written permission from the copyright holder to publish the images or icons under our CC-BY 4.0 license. Alternatively, you may replace the images with open source alternatives. See these open source resources you may use to replace images / clip-art:
- https://commons.wikimedia.org
- https://openclipart.org/

Additional Editor Comments (if provided):

Review of “Artificial Intelligence’s Contribution to Biomedical Literature Search: Revolutionizing or Complicating?”. This manuscript addresses an important topic, highlighting a crucial issue regarding the role of LLMs in research. The study is relevant, given the increasing reliance on AI for literature searches in biomedical research.

1) I suggest improving the PubMed search strategy, clarifying the use of a prompt as a search strategy with an acronym instead of a keyword search.

2) I suggest improving the quality of the figures, as the details are difficult to read in their current form. Clearer visuals would significantly aid reviewers and readers’ comprehension.

3) The Methods section would benefit from reorganization, as some methodological descriptions appear within the Results and Discussion sections.

Reviewers' Comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria ? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript investigates the effectiveness of ChatGPT, alongside traditional literature search methods such as PubMed and Google, in assisting biomedical researchers and clinicians with literature searches. The study evaluates ChatGPT in various scenarios, testing its basic functions and exploring enhancements through user-support tools (plugins, prompt engineering, web-browsing). The authors highlight both the limitations and potential of ChatGPT in retrieving consistent, relevant, and accurate scientific literature.

while the paper provides a timely and impactful evaluation of ChatGPT's potential to assist researchers with literature searches, offering valuable insights into its current limitations and highlighting the future potential of AI-augmented search tools for scientific research, it also has potential areas for Improvement:

- While the comparison with PubMed and Google is insightful, the manuscript states that no search optimization strategies (e.g., Boolean operators) were used in PubMed to ensure fairness. However, PubMed’s strength lies in its ability to narrow results using advanced search functionalities, not using these functions could understate the potential of conventional methods, mainly when additional settings such as prompt engineering or web search are included in ChatGPT.

- While plugins and prompt engineering are tested, the explanation of how these modifications improved or failed is somewhat superficial. For example, how exactly does “prompt engineering” improve accuracy and relevance? A more granular analysis of the specific improvements introduced by each augmentation would strengthen the conclusions. Additionally, sharing the prompts used could improve the reliability and reproducibility of the experiments.

- Although the manuscript mentions "consistency," "accuracy," and "relevance" as evaluation metrics, these terms are not defined with clear, measurable criteria. Providing concrete definitions or scoring systems for these terms would allow for more rigorous assessment. For instance, how is "accuracy" quantified when ChatGPT provides a reference—what makes a reference “accurate” beyond its correct citation formatting?

- The manuscript limits each search scenario to five or ten iterations, which is a relatively small sample size for assessing the variability in outputs. This is especially important when evaluating AI, as its performance can vary significantly depending on input nuances. Expanding the number of iterations per scenario and including other statistical measures could provide a more robust understanding of ChatGPT's limitations and strengths.

- The quality of the images in the manuscript needs improvement, as they are difficult to review in their current form. Additionally, the accompanying descriptions lack clarity and should provide more detailed explanations to ensure the figures effectively support the findings and are easily interpretable by readers.

Reviewer #2: ChatGPT and Biomedical Literature Search, A Review

One could not think of a timelier and more relevant subject to review and explore than this topic. Especially for a researcher exploring the bioethics of artificial intelligence (AI) and an end user of various forms of generative AI in reading, analyzing and researching medical, surgical and academic texts. More relevant items discussed in this preprint are ophthalmic and diabetes related examples. One can summarize the work steps that the authors followed as steps of comparisons, functions and steps in the field of research in general and the field of research using ChatGPT, as a representation of generative AI. The preprint describes the trend in using generative AI in biomedical research denoting the excitement and the anticipation in adopting its models in the field. Despite the excitement, there are few concerns mentioned in the preprint and its listed references.

The Comparisons:

• Comparing between conventional web-based search methods and generative AI, namely ChatGPT. The conventional search follows a list of few steps to execute the search task, while ChatGPT provides time efficiency through automating the search, extraction and analysis of the information to give immediate response.

• Comparison of basic ChatGPT versions and augmented ChatGPT versions. The basic versions showed limitations in consistency, accuracy and relevancy. The augmented ChatGPT used tools like plugins, web browsing functions, prompt engineering and custom-GPTs.

• Comparison based on scenarios and topics. For example, high interest topics and niche topics (with limited information resources).

• Comparison based on functionalities, like hypothesis generation and newly emerging clinical practice questions.

The Functions:

• The literature search execution by conventional search methodologies, by basic ChatGPT and by augmented ChatGPT.

• The evaluation metrics were the consistency in responses to repeated search queries, the accuracy evaluated by the verification of the references, the authors and other contents of the search, and the relevancy as evaluated by proofreading the “articles returned and checked if it contains relevant information pertaining to the query in testing”

The Tools:

• Basic ChatGPT models: GPT-3.5, ChatGPT Classic.

• Augmented ChatGPT using:

1- web browsing function for real-time data access.

2- plugins specialized in scientific literature research to tailor how the LLM interprets and conveys information with the “Scholarly” plugin.

3- Prompt engineering, which ensures asking the right questions and providing clear extraction.

4- Custom GPT. For example, Consensus GPT for academic research.

• Conventional search engines: Google, PubMed, and Google Scholar

Another way to review this preprint is to put it in another framework, as follows:

• Methodology including description of the conventional search engines and detailed steps of employing ChatGPT for literature research.

• The evaluation and analysis included comparisons between conventional search engines and ChatGPT, quantitative assessment based on the number of iterations and times, and the numbers of results rendered by the various search engines. Qualitative analysis included exploring challenges in different scenarios. For example, high interest topics and niche topics. Functions like hypothesis generation and clinical practice guidelines.

• The preprint also discussed prospects for involving ChatGPT in literature research. These prospects included the rapid advancement and the augmentation of conversational AI, which will come with the two mentioned advantages of time saving and improving accessibility to research resources. Hypothesis generation can also be developed for better utilization of generative AI in research. All the mentioned limitations are prospected to be advantages for ChatGPT in the future.

• Considering human oversight, some reviews discuss it as a disadvantage of ChatGPT and other conversational AI models. This review takes the viewpoint that human oversight is crucial for the role of generative AI now and in the future. Other studies presented a hybrid approach to literature research, where human experts have a more active role in the research to augment ChatGPT in the research process [1].

• Ethical implications include the integrity of research and the potential misuse of ChatGPT and other LLMs in biomedical literature research. Other scientific papers called for setting guidelines for LLM biomedical research deployment [2]. Despite the growing research body about the topic, it seems too early to configure the guidelines, and more programmed work and comparisons between the research outcomes are needed to discuss the guidelines.

• ChatGPT augmentation strategies improved accuracy of the returned resources, enhanced consistency, gave access to real-time web through real-time web browsing function, enhanced the output quality using prompt engineering as a mode of input modification, different GPTs might render different results, and some can come with more papers than others with the continuously increased iterations.

• Prompt engineering enhances quality and relevance of the responses, but it needs multiple iterations. In addition to the persistence of inconsistency and inaccuracy mentioned in the paper, it can mean the infusion of bias into the research structure and this bias might double fold with the increased iterations. The continuously changing research trends in healthcare also prompt continuous update of the engineered prompts in the deployment of ChatGPT [3].

• ChatGPT is already being tried for integration with the conventional search engines [4]. Conventional search engines, including Google, Microsoft Bing, and others, are already implementing functionalities like ChatGPT to make their search more conversational. So, conventional search engines mentioned in the study are no longer conventional.

References:

1. Temsah, O., Khan, S.A., Chaiah, Y., Senjab, A., Alhasan, K., Jamal, A., Aljamaan, F., Malki, K.H., Halwani, R., Al-Tawfiq, J.A. and Temsah, M.H., 2023. Overview of early ChatGPT’s presence in medical literature: insights from a hybrid literature review by ChatGPT and human experts. Cureus, 15(4).

2. Sallam, M., 2023, March. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. In Healthcare (Vol. 11, No. 6, p. 887). MDPI.

3. Abhari, S., Fatahi, S., Saragadam, A., Chumachenko, D. and Morita, P.P., 2024. A Road Map of Prompt Engineering for ChatGPT in Healthcare: A Perspective Study. Studies in Health Technology and Informatics. IOS Press. https://doi.org/10.3233/SHTI240578

4. Stokel-Walker, C., 2023. AI chatbots are coming to search engines—can you trust the results?.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean? ). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy .

Reviewer #1: Yes:  David Restrepo

Reviewer #2: Yes:  Yasser Abdullah

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility: To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLOS Digit Health. doi: 10.1371/journal.pdig.0000849.r004

Decision Letter 1

Yuzhe Yang

3 Apr 2025

Artificial Intelligence’s Contribution to Biomedical Literature Search: Revolutionizing or Complicating?

PDIG-D-24-00172R1

Dear Prof. Mahajan,

We are pleased to inform you that your manuscript 'Artificial Intelligence’s Contribution to Biomedical Literature Search: Revolutionizing or Complicating?' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Luis Filipe Nakayama, M.D.

Academic Editor

PLOS Digital Health

***********************************************************

Additional Editor Comments (if provided):

Thank you for your revisions. The manuscript is now improved and suitable for publication.

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria ? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have comprehensively addressed all the comments. The study robustly illustrates both the limitations and the potential of integrating AI tools into biomedical literature searches. The paper thereby contributes valuable insights into how AI can both contribute to and complicate the research process, emphasizing the need for ongoing refinements and rigorous evaluations as these tools evolve.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean? ). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy .

Reviewer #1: Yes:  David Restrepo

**********
