Abstract
Navigating clinical guidelines can be complex for real-time health care decision making. Our study evaluates the chat generative pretrained transformer (ChatGPT)-4 in improving responses to clinical questions by integrating guidelines on Clostridioides difficile infection and colon polyp surveillance. We assessed ChatGPT-4’s responses to questions before and after guideline integration and noted a clear improvement in accuracy: ChatGPT-4 consistently provided guideline-aligned answers. Further analysis showed its ability to summarize information from conflicting guidelines, highlighting its utility in complex clinical scenarios. The findings suggest that large language models such as ChatGPT-4 can enhance clinical decision making and patient education by providing quick, conversational, and accurate responses. This approach opens a path for using artificial intelligence to deliver reliable responses in health care, supporting clinicians in real-time decision making and improving patient care.
Large language models (LLMs) such as the chat generative pretrained transformer (ChatGPT)-4 are artificial intelligence (AI)-enabled chatbots with the potential to enhance clinical decision making, health care delivery, and patient education.1,2 The capabilities of these technologies are under wide investigation, with varying degrees of competence observed because these models are trained for general cognitive skills, which leads to limitations in specialized domains.3,4 Studies suggest a role for LLMs as an interactive information source for questions about colorectal cancer screening or common conditions such as gastroesophageal reflux disease.1,2,5 If clinicians rely on such a tool, an incorrect response may compromise patient care and safety, underscoring an urgent need to improve accuracy. ChatGPT-4 has been assessed for proficiency in medical note taking, medical knowledge, and answering a “consult” question; although it performed with accuracy, “hallucinations” were observed, and the responses would not serve specific or complicated scenarios.4
We developed enhancements to the ability of ChatGPT-4 to learn and apply clinical guidelines, with the aim of improving its skills beyond those of a general language model. We chose 2 commonly encountered clinical scenarios: Clostridioides difficile infection (CDI) management and colon polyp surveillance.6 Our aims were to demonstrate that ChatGPT-4 can be trained on clinical guidelines to enhance its skills and to develop an infrastructure that clinicians can use to enhance patient care.
Methods
We evaluated responses from ChatGPT-4 to 10 multiple-choice board-style questions and open-ended clinical questions pertaining to CDI management and colon polyp surveillance. Questions were selected on the basis of their real-world clinical relevance and alignment with the uploaded guidelines and covered a mix of clinical scenarios that physicians typically encounter in practice, to explore ChatGPT-4’s ability to handle different types of clinical questions. The questions were entered into a new ChatGPT-4 session to avoid bias from previous use. We used the “askyourpdf” plugin with ChatGPT-4 to upload PDF files of the CDI guidelines from the American College of Gastroenterology (ACG) and the colon polyp surveillance guidelines from the US Multi-Society Task Force on Colorectal Cancer.6,7 To extract information from the guidelines, we provided a PDF URL to the “askyourpdf” platform, obtained a “doc_id,” and supplied that doc_id to ChatGPT-4, which downloaded and stored the content so that it could be queried. ChatGPT-4 was then able to query the guideline documents, answer questions, and include page numbers for reference. We reentered the same board-style questions to compare ChatGPT-4’s responses before and after uploading the guidelines to assess whether its clinical performance could be enhanced with clinical guidelines. The investigators assessed ChatGPT-4’s responses against the guideline content; a response that aligned with the guidelines was deemed correct.
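The workflow above used the “askyourpdf” plugin inside ChatGPT-4. As a rough programmatic analogue only, and not the pipeline used in this study, the following sketch assumes the guideline PDFs have been downloaded locally and uses the pypdf and openai Python libraries to extract the guideline text and supply it as grounding context when querying the model; the file name, model identifier, and helper functions are illustrative assumptions.

```python
# Illustrative sketch only: the study used the "askyourpdf" plugin within ChatGPT-4.
# This analogue assumes locally downloaded guideline PDFs and the pypdf/openai libraries;
# file names and the model string are placeholders, not those used by the authors.
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def load_pdf_text(path: str) -> str:
    """Extract plain text from every page of a guideline PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def ask_with_guideline(guideline_text: str, question: str) -> str:
    """Query the model with the guideline supplied as grounding context.
    Long guidelines may need to be truncated or chunked to fit the context window."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[
            {
                "role": "system",
                "content": "Answer using only the clinical guideline provided, "
                           "and cite the relevant section or page where possible.\n\n"
                           + guideline_text,
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    cdi_guideline = load_pdf_text("acg_cdi_guideline_2021.pdf")  # hypothetical file name
    print(ask_with_guideline(
        cdi_guideline,
        "What is the recommended first-line treatment for an initial episode of "
        "nonsevere Clostridioides difficile infection?",
    ))
```

In practice, full guideline documents can exceed a model’s context window, which is one reason plugin- and retrieval-based approaches such as “askyourpdf” index the document and retrieve relevant passages rather than passing the entire text with every query.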
The evaluation of the model’s performance was conducted in 2 phases. In the initial phase, ChatGPT-4 was tested without clinical guidelines. The model was asked 5 multiple-choice questions about CDI diagnosis and management and 5 multiple-choice questions on colon polyp surveillance guidelines. Of the 10 total questions posed, ChatGPT-4 answered 50% correctly before uploading guidelines. In the second phase, after guidelines were integrated within ChatGPT-4, it answered all questions correctly, achieving a 100% accuracy rate. The responses from ChatGPT-4 included explanations and citations from the guidelines with page numbers. A summary of all questions and answers is presented in Supplemental Table 1 (available online at https://www.mcpdigitalhealth.org/).
The same 10 questions were entered into ChatGPT-4 as open-ended clinical scenarios to simulate day-to-day clinical decision making. Before integration of the guidelines, ChatGPT-4’s accuracy was 70%; however, each response explicitly mentioned the model’s knowledge cutoff of September 2021. After integration of the guidelines, ChatGPT-4 answered all questions correctly and provided guideline-based explanations. A summary of the questions and responses is detailed in Supplemental Table 2 (available online at https://www.mcpdigitalhealth.org/).
Given that there are 2 prominent guidelines for managing CDI, from the ACG and the Infectious Diseases Society of America,8 we conducted a secondary analysis to explore how ChatGPT-4 navigates recommendations when provided with multiple guidelines. This exploration aimed to mirror real-world clinical scenarios in which health care professionals often weigh varying recommendations to formulate patient-centered decisions. Our findings indicate that ChatGPT-4 demonstrated a good understanding of the different guidelines and effectively summarized their recommendations for clinicians, even when the guidelines differed (Supplemental Tables 3 and 4, available online at https://www.mcpdigitalhealth.org/).
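The multiple-guideline comparison can be illustrated with the same kind of sketch, under the same assumptions as above (pypdf and openai libraries, hypothetical file names, placeholder model string): both guideline texts are supplied in one prompt and the model is asked for a reconciled summary.

```python
# Illustrative sketch of querying across two uploaded guidelines; not the study pipeline.
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()


def pdf_text(path: str) -> str:
    """Extract plain text from every page of a PDF."""
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


acg = pdf_text("acg_cdi_guideline_2021.pdf")        # hypothetical file name
idsa = pdf_text("idsa_shea_cdi_update_2017.pdf")    # hypothetical file name

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model identifier
    messages=[
        {
            "role": "system",
            "content": "Two clinical guidelines follow. Summarize where their CDI "
                       "recommendations agree and where they differ, citing each guideline.\n\n"
                       "=== ACG guideline ===\n" + acg +
                       "\n\n=== IDSA/SHEA guideline ===\n" + idsa,
        },
        {
            "role": "user",
            "content": "How do the two guidelines differ on treatment of a first "
                       "recurrence of C difficile infection?",
        },
    ],
)
print(response.choices[0].message.content)
```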
Discussion
Since the release of ChatGPT in November 2022, enthusiasm for using LLMs in health care has grown rapidly, and these models have the potential to enhance clinical care if trained and used appropriately. We compared the performance of ChatGPT-4 before and after training it with specific clinical guidelines. ChatGPT-4 answered all questions correctly after integration of and training on the respective clinical guidelines for CDI management and colon polyp surveillance (Figure).
Figure.
ChatGPT-4 response before and after guideline integration. An example of a response to an open-ended clinical question before and after clinical guideline integration into ChatGPT-4.
Studies have evaluated the performance of different versions of ChatGPT in answering questions such as the ACG self-assessment test, with 65% accuracy seen for ChatGPT-3 and 62.4% for ChatGPT-4 using general preexisting model training.9 In a study that assessed the ability of ChatGPT to answer prompts on colon cancer screening, the model answered incorrectly regarding the age at which to begin screening and the appropriate surveillance interval in a patient with a history of colon polyps.10 In our study, the introduction of clinical guidelines led to a remarkable improvement in the performance of ChatGPT-4, with the model achieving a 100% accuracy rate. We tested ChatGPT-4 with both open-ended and multiple-choice questions; its ability to answer both question formats mirrors the concept of shared clinical decision making that incorporates best evidence. We found that ChatGPT-4 was highly adaptable and responded accurately even when questions were modified, and its performance remained consistent regardless of the varied ways in which the queries were framed. This demonstrates ChatGPT-4’s flexibility and its capability to be tailored by different users to elicit specific responses.
Although LLMs do not have human intelligence, they are built on trainable neural networks. Our approach of training ChatGPT-4 with guideline documents provides an efficient means of accessing and curating content within these resources. Our study focused on health care providers using ChatGPT-4 integrated with clinical guidelines; the detailed and technical nature of the questions reflects the advanced knowledge and specific inquiry style of these professionals and indicates that the primary users of this advanced tool are health care providers rather than patients. This has tremendous applications in clinical care because providers constantly navigate the complexities of diverse patient populations with unique concerns that require individualized care within limited time. Guideline-enhanced LLMs can be useful in both primary care and specialty practices, where health maintenance based on published guidelines is encountered frequently, and guidelines for common medical conditions could be easily accessed to improve efficiency and to educate clinicians and patients.
Integrating clinical guidelines into ChatGPT-4 improves efficiency and accuracy in practice. Although rule-based systems within electronic health records address some clinical scenarios, they are prone to alert fatigue and delays in incorporating newer recommendations, and they do not account for the nuances presented in guidelines, a limitation that can be overcome by using guideline-enhanced LLMs. In addition, although numerous platforms provide comprehensive, peer-reviewed, and reliable information for in-depth clinical insights, AI models such as ChatGPT, particularly when augmented with guideline-driven accuracy, offer the distinct advantage of delivering quick, conversational, and easily accessible responses, which is especially beneficial in time-sensitive or point-of-care contexts.
Our study highlights the ability of LLMs to adapt to domain-specific knowledge and provides a framework to develop and maintain resource-enhanced ChatGPT-4 models. These models can be queried without patient identifiers to obtain relevant, guideline-based information in real time. Augmenting AI capabilities to meet clinical needs leverages technology to supplement human expertise and improve efficiency in clinical practice. However, although these initial results are promising, a more comprehensive analysis of the tool’s clinical utility would still be beneficial. Such an analysis could involve a broader range of questions, cover more diverse clinical scenarios, and include feedback from a wider group of health care professionals; this would help confirm the tool’s efficacy and reliability in varied real-world clinical situations and provide deeper insights into its potential impact on health care practices.
In conclusion, although LLMs offer notable advantages in providing rapid, conversational responses to clinical inquiries, their effective and ethical implementation necessitates a rigorous, multifaceted approach. Such an approach should focus not only on evaluating the accuracy of these models but also on user interaction, ethical and legal adherence, integration into clinical workflows, and continuous validation. Future research and development in this arena should be steered by a collaborative, interdisciplinary approach to navigate these complexities and harness the full potential of LLMs in health care.
Potential Competing Interests
Dr Khanna reports research grants from Rebiotix/Ferring, Seres, Finch, and Vedanta and consulting fees from Takeda, Immuron, Niche, and ProbioTech outside of the submitted work.
Supplemental Online Material
References
1. Lee T.C., Staller K., Botoman V., Pathipati M.P., Varma S., Kuo B. ChatGPT answers common patient questions about colonoscopy. Gastroenterology. 2023;165(2):509–511.e7. doi: 10.1053/j.gastro.2023.04.033.
2. Tariq R., Malik S., Khanna S. Evolving landscape of large language models: an evaluation of ChatGPT and Bard in answering patient queries on colonoscopy. Gastroenterology. 2024;166(1):220–221. doi: 10.1053/j.gastro.2023.08.033.
3. Minssen T., Vayena E., Cohen I.G. The challenges for regulating medical use of ChatGPT and other large language models. JAMA. 2023;330(4):315–316. doi: 10.1001/jama.2023.9651.
4. Lee P., Bubeck S., Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233–1239. doi: 10.1056/NEJMsr2214184.
5. Henson J.B., Glissen Brown J.R., Lee J.P., Patel A., Leiman D.A. Evaluation of the potential utility of an artificial intelligence chatbot in gastroesophageal reflux disease management. Am J Gastroenterol. 2023;118(12):2276–2279. doi: 10.14309/ajg.0000000000002397.
6. Kelly C.R., Fischer M., Allegretti J.R., et al. ACG clinical guidelines: prevention, diagnosis, and treatment of Clostridioides difficile infections. Am J Gastroenterol. 2021;116(6):1124–1147. doi: 10.14309/ajg.0000000000001278.
7. Gupta S., Lieberman D., Anderson J.C., et al. Recommendations for follow-up after colonoscopy and polypectomy: a consensus update by the US Multi-Society Task Force on Colorectal Cancer. Gastroenterology. 2020;158(4):1131–1153.e5. doi: 10.1053/j.gastro.2019.10.026.
8. McDonald L.C., Gerding D.N., Johnson S., et al. Clinical practice guidelines for Clostridium difficile infection in adults and children: 2017 update by the Infectious Diseases Society of America (IDSA) and Society for Healthcare Epidemiology of America (SHEA). Clin Infect Dis. 2018;66(7):987–994. doi: 10.1093/cid/ciy149.
9. Suchman K., Garg S., Trindade A.J. Chat generative pretrained transformer fails the multiple-choice American College of Gastroenterology self-assessment test. Am J Gastroenterol. 2023;118(12):2280–2282. doi: 10.14309/ajg.0000000000002320.
10. Mukherjee S., Durkin C., PeBenito A.M., et al. Assessing ChatGPT’s ability to reply to queries regarding colon cancer screening based on multi-society guidelines. Gastro Hep Adv. 2023;2(8):1040–1043. doi: 10.1016/j.gastha.2023.07.008.