PLOS Digit Health. 2024 Feb 5;3(2):e0000355. doi: 10.1371/journal.pdig.0000355

Harnessing the open access version of ChatGPT for enhanced clinical opinions

Zachary M Tenner 1,*, Michael C Cottone 1, Martin R Chavez 1,2
Editor: Jennifer N Avari Silva
PMCID: PMC10843476  PMID: 38315648

Abstract

With the advent of Large Language Models (LLMs) like ChatGPT, the integration of Generative Artificial Intelligence (GAI) into clinical medicine is becoming increasingly feasible. This study aimed to evaluate the ability of the freely available ChatGPT-3.5 to generate complex differential diagnoses, comparing its output to case records of the Massachusetts General Hospital published in the New England Journal of Medicine (NEJM). Forty case records were presented to ChatGPT-3.5, prompting it to provide a differential diagnosis and then narrow it down to the most likely diagnosis. The results indicated that the final diagnosis was included in ChatGPT-3.5’s original differential list in 42.5% of the cases. After narrowing, ChatGPT correctly determined the final diagnosis in 27.5% of the cases, demonstrating a decrease in accuracy compared to previous studies using common chief complaints. These findings emphasize the necessity for further investigation into the capabilities and limitations of LLMs in clinical scenarios while highlighting the potential role of GAI as an augmented clinical opinion. Anticipating the growth and enhancement of GAI tools like ChatGPT, physicians and other healthcare workers will likely find increasing support in generating differential diagnoses. However, continued exploration and regulation are essential to ensure the safe and effective integration of GAI into healthcare practice. Future studies may seek to compare newer versions of ChatGPT or investigate patient outcomes with physicians integrating this GAI technology. Understanding and expanding GAI’s capabilities, particularly in differential diagnosis, may foster innovation and provide additional resources, especially in underserved areas in the medical field.

Author summary

Integrating artificial intelligence (AI) into clinical medicine has long been a technological goal. Since its release in November 2022, ChatGPT has gained popularity, sparking questions about its proficiency in enhancing patient care. AI has demonstrated its ability to answer multiple-choice questions and pass exams at a level equivalent to that of medical students. It also excels in scenarios involving common chief complaints. However, ChatGPT’s ability to participate in advanced clinical conversations and diagnose difficult patient cases has been largely unexplored. In this study, we investigated the capability of ChatGPT-3.5 to generate complex differential diagnoses by presenting it with 40 clinical case reports sourced from the New England Journal of Medicine (NEJM). Overall, ChatGPT-3.5 identified the correct final diagnosis 27.5% of the time. As we transition towards a medical landscape where physicians may leverage AI as a clinical tool, this study emphasizes both the limitations and potential of ChatGPT. We underscore the ongoing need to define AI capabilities to ensure its safe integration into medical practice and advocate for the sustained open accessibility of generative AI for patient care.

Introduction

Since the 20th century, research and speculation regarding the integration of artificial intelligence (AI) into physician reasoning have been ongoing. In 1987, Schwartz et al. asserted that “major intellectual and technical problems must be solved before we can produce truly reliable [healthcare] consulting programs” [1]. While models of clinical problem-solving have been described for years, it is only recently that technology has advanced sufficiently to investigate the role of AI in clinical medicine. OpenAI’s ChatGPT (Generative Pre-trained Transformer), one of the world’s first widely used Large Language Models (LLMs), uses billions of parameters to generate user-informed text. In the healthcare sector, this Generative Artificial Intelligence (GAI) encompasses a wide range of medical knowledge that can be tailored to the user’s needs, from assisting medical students with United States Medical Licensing Exam (USMLE) questions to creating next-generation sequencing reports with treatment options for oncologists [2,3]. Since its release, professionals have been assessing the value of ChatGPT by pushing its limits within medical knowledge; however, it is imperative to explore ChatGPT’s role in patient care to best demonstrate and provide direction for how health professionals will work with AI as the technology continues to develop [4,5].

ChatGPT has distinguished itself by achieving passing scores on the USMLE examination, equivalent to those of a third-year medical student [2]. This accomplishment opens the door to potential applications of the model in medical education, serving as an interactive tool for medical school and a general support for clinical thinking. Radiology and pathology have received significant attention in GAI research, with efforts focused on enhancing LLMs to better understand images and detect cancers. Despite receiving no specific training in either subject, “ChatGPT nearly passed a radiology board-style examination without images,” and demonstrated accuracy in “[solving] higher-order reasoning questions in pathology” [6,7]. Ali et al. identified ChatGPT’s ability to perform well on the neurosurgery oral boards examination while emphasizing the limitation of using multiple-choice examinations to assess a neurosurgeon’s expertise in patient care [8]. Although ChatGPT has proven effective in choosing from a list of options, the role of LLMs in clinical management has been highlighted as an area requiring further research.

Mirroring the progression of a medical student, the next logical step is to evaluate the chatbot’s ability to generate differential diagnoses. These are fundamental to clinical medicine, and the proficiency of ChatGPT in producing medically rational differential diagnoses remains largely unexplored. Hirosawa et al. determined that ChatGPT can successfully create comprehensive diagnosis lists for common chief complaints [9]. Additionally, Rao et al. assessed ChatGPT’s ability to generate differential diagnoses for issues routinely encountered in healthcare settings and found that “the LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9%” [10]. Previous research has thoroughly assessed ChatGPT’s ability to pass multiple-choice exams and to provide differential diagnoses for standard chief complaints with high accuracy; however, the generalizability of ChatGPT to more complex clinical scenarios must be examined [11].

To comprehensively assess the potential of GAI and LLMs in complex medical reasoning, we conducted a study to evaluate the ability of the freely available ChatGPT-3.5 to provide differentials on case records of the Massachusetts General Hospital published in the New England Journal of Medicine (NEJM). Our research takes a unique approach by utilizing clinical case reports identified by the journal as establishing novel medical or biological understanding, thereby further evaluating the chatbot’s language capabilities. At the time our study was conducted, ChatGPT-3.5 had a knowledge cutoff date of September 2021. Consequently, we examined ChatGPT’s ability to use clinical reasoning to diagnose 2022 case reports, avoiding reliance on its search function to locate published articles. The primary aim of this study is to evaluate the proficiency of the freely available ChatGPT-3.5 in generating complex differential diagnoses. We compare the chatbot’s complete diagnosis list and final diagnosis against the published final diagnosis for the NEJM case reports. We hypothesized that the differential diagnoses generated by ChatGPT-3.5 would include the NEJM final diagnosis approximately 50% of the time. By elucidating ChatGPT’s potential in offering differential diagnoses, we propose that future clinical problem-solving efforts consider utilizing GAI as an augmented clinical opinion.

Methods

We presented forty case records from the Massachusetts General Hospital, as published in the New England Journal of Medicine (NEJM) in 2022, to ChatGPT-3.5. All text preceding the “Differential Diagnosis” headline was included, excluding figures. ChatGPT was initially prompted with the instruction, “Provide a differential diagnosis from the following clinical case.” Following the generation of a complete list of differential diagnoses, we further inquired, “Can you narrow down the differential to the most likely diagnosis?” Subsequently, we recorded whether the final diagnosis, as referenced in the NEJM, was included in ChatGPT’s complete differential diagnosis list. Additionally, we noted whether ChatGPT’s “most likely diagnosis” aligned with the final diagnosis noted in the NEJM.
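For illustration, the two-step prompt sequence above can be reproduced programmatically. The sketch below is a minimal example rather than the procedure used in the study (cases were presented to ChatGPT-3.5 interactively); it assumes the OpenAI Python client and the gpt-3.5-turbo model, and the case_text argument stands in for the NEJM case text described above.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def query_case(case_text: str) -> tuple[str, str]:
    # Step 1: request a complete differential diagnosis list.
    messages = [{
        "role": "user",
        "content": "Provide a differential diagnosis from the following "
                   "clinical case.\n\n" + case_text,
    }]
    first = client.chat.completions.create(model="gpt-3.5-turbo",
                                           messages=messages)
    differential = first.choices[0].message.content

    # Step 2: continue the same conversation and ask for the single
    # most likely diagnosis.
    messages += [
        {"role": "assistant", "content": differential},
        {"role": "user",
         "content": "Can you narrow down the differential to the most "
                    "likely diagnosis?"},
    ]
    second = client.chat.completions.create(model="gpt-3.5-turbo",
                                            messages=messages)
    return differential, second.choices[0].message.content

For each case, one would then record whether the NEJM final diagnosis appears in the first response and whether it matches the second.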

Results

Among the 40 cases presented to ChatGPT-3.5, the final diagnosis was absent from the original differential list in 23 cases (57.5%). The average length of the original differential list generated by ChatGPT was 7±2 possible diagnoses, ranging from a low of 3 to a high of 12, with no apparent pattern to the list length across cases. In 17 cases (42.5%), ChatGPT did include the final diagnosis in its original differential list. After narrowing down its differential list, ChatGPT correctly identified the final diagnosis in 11 cases (27.5%) and eliminated the correct diagnosis in 6 cases (15%). These results are presented in Fig 1.
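As a consistency check, the reported proportions can be re-derived from the counts above; the snippet below is illustrative only and introduces no data beyond what the Results state.

# Counts reported in the Results and Fig 1.
n_cases = 40
in_differential = 17     # final diagnosis appeared in the original list
narrowed_correct = 11    # retained as the single most likely diagnosis
narrowed_out = 6         # present initially but eliminated when narrowing

# The two narrowing outcomes partition the cases whose lists held the diagnosis.
assert narrowed_correct + narrowed_out == in_differential

print(f"In original differential: {in_differential / n_cases:.1%}")   # 42.5%
print(f"Correct after narrowing:  {narrowed_correct / n_cases:.1%}")  # 27.5%
print(f"Eliminated on narrowing:  {narrowed_out / n_cases:.1%}")      # 15.0%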

Fig 1. Flowchart of outcomes for the 40 case records of the Massachusetts General Hospital, published in the NEJM, after presentation to ChatGPT.


Discussion

The role of Generative AI (GAI) and Large Language Models (LLMs) in clinical medicine is a rapidly growing area of research. Assessing the potential and limitations of ChatGPT (v3.5) within the scope of patient care is essential to determine how and where it can best be utilized. We decided to focus on the complimentary version of ChatGPT to ensure that the largest possible audience has access to this technology. Presenting 40 case records from the New England Journal of Medicine (NEJM) to ChatGPT allowed us to delve deeper into the role of LLMs in healthcare, specifically studying their success rates in producing differential diagnoses for complex patient presentations. ChatGPT accurately identified the correct final diagnosis 27.5% of the time. Notably, the differential list accuracy of ChatGPT, when presented with clinical vignettes of common chief complaints, has been reported to be over 80% [9]. However, this accuracy fell by roughly half when we increased the difficulty from common chief complaints to complex clinical cases using NEJM case reports.

Furthermore, our results can be compared to those of Kanjee et al. [12], who utilized NEJM clinicopathologic conferences as challenging medical cases. In their assessment, GPT-4 “provided the correct diagnosis in its differential in 64% of challenging cases and as its top diagnosis in 39%.” Of note, the open-access version of ChatGPT in our assessment scored about 20 percentage points lower when including the diagnosis in its differential and about 12 percentage points lower when selecting the final diagnosis. GPT-4 has also been compared with medical-journal readers to assess its ability to solve complex clinical cases, correctly diagnosing 57% of them [13]. Our diagnosis percentages fell slightly below those of Kanjee et al. and Eriksen et al., sparking a conversation about the clinical abilities of GPT-3.5 versus GPT-4. The open-access aspect of GPT-3.5 remains important to research, encouraging its use among the medical community without a financial investment in GAI. Establishing baseline limitations of ChatGPT allows for future comparisons of its growth and development and ensures cautious use in patient care. Furthermore, it can provide insight into how best to adjust ChatGPT’s settings and identify the categories in which it scores highest.

In the imminent future, physicians and other healthcare workers will likely practice in a world where the latest research journals and electronic medical records are directly linked to chat-like software. With these upcoming additions to GAI, we anticipate its continued growth in developing differential diagnoses. It therefore becomes increasingly important for the field of medicine to understand the capabilities and limitations of these tools. In both primary care and specialty settings, GAI provides a new medium for physicians to cultivate new ideas, consider novel diagnoses, and consult with a “colleague” when one may not be readily available, especially in rural settings [14].

Future studies may look to expand on our baseline findings. For example, newer versions of ChatGPT no longer have a fixed knowledge cutoff date and can instead retrieve up-to-date information from the internet. How do these newer versions compare to ChatGPT-3.5? Do patients experience improved outcomes when their physicians integrate ChatGPT into their care? These questions, among others, require elucidation through further experimentation. However, before ChatGPT becomes a new tool within a physician’s practice, it is imperative to continue defining and describing its abilities to ensure safe and appropriate reliance on GAI. We strongly advocate for technology companies to consistently offer complimentary versions of generative artificial intelligence. Such accessibility not only maximizes its utilization but also fosters innovation, particularly in the field of medicine.

Data Availability

All relevant data are within the manuscript.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Schwartz WB, Patil RS, Szolovits P. Artificial intelligence in medicine. New England Journal of Medicine. 1987;316(11):685–8.
2. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.
3. Hamilton Z, Naffakh N, Reizine NM, Weinberg F, Jain S, Gadi VK, et al. Relevance and accuracy of ChatGPT-generated NGS reports with treatment recommendations for oncogene-driven NSCLC. American Society of Clinical Oncology; 2023.
4. Haug CJ, Drazen JM. Artificial intelligence and machine learning in clinical medicine, 2023. New England Journal of Medicine. 2023;388(13):1201–8. doi: 10.1056/NEJMra2302038
5. Eysenbach G. The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers. JMIR Med Educ. 2023;9:e46885. doi: 10.2196/46885
6. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: Insights into current strengths and limitations. Radiology. 2023:230582. doi: 10.1148/radiol.230582
7. Sinha RK, Roy AD, Kumar N, Mondal H, Sinha R. Applicability of ChatGPT in assisting to solve higher order problems in pathology. Cureus. 2023;15(2).
8. Ali R, Tang OY, Connolly ID, Fridley JS, Shin JH, Zadnik Sullivan PL, et al. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. medRxiv. 2023:2023.04.06.23288265. doi: 10.1227/neu.0000000000002551
9. Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study. International Journal of Environmental Research and Public Health. 2023;20(4):3378. doi: 10.3390/ijerph20043378
10. Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow. medRxiv. 2023:2023.02.21.23285886. doi: 10.1101/2023.02.21.23285886
11. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. New England Journal of Medicine. 2019;380(14):1347–58. doi: 10.1056/NEJMra1814259
12. Kanjee Z, Crowe B, Rodman A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA. 2023;330(1):78–80. doi: 10.1001/jama.2023.8288
13. Eriksen AV, Möller S, Ryg J. Use of GPT-4 to Diagnose Complex Clinical Cases. NEJM AI. 2024;1(1):AIp2300031.
14. Balas M, Ing EB. Conversational AI Models for ophthalmic diagnosis: Comparison of ChatGPT and the Isabel Pro Differential Diagnosis Generator. JFO Open Ophthalmology. 2023:100005.
PLOS Digit Health. doi: 10.1371/journal.pdig.0000355.r001

Decision Letter 0

Jennifer N Avari Silva

12 Dec 2023

PDIG-D-23-00320

Harnessing the Open Access Version of ChatGPT for Enhanced Clinical Opinions

PLOS Digital Health

Dear Dr. Tenner,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days (by Jan 11 2024 11:59PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Jennifer N Avari Silva, MD

Section Editor

PLOS Digital Health

Journal Requirements:

1. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

2. We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex.

3. Please provide separate figure files in .tif or .eps format only and remove any figures embedded in your manuscript file. Please also ensure that all files are under our size limit of 10MB.

For more information about figure files please see our guidelines:

https://journals.plos.org/digitalhealth/s/figures

https://journals.plos.org/digitalhealth/s/figures#loc-file-requirements

Additional Editor Comments (if provided):


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Partly

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Well-Written Introduction:

The introduction highlights the idea of the entire paper. It raises interest in learning more.

Conceptually Correct:

The authors presented enough mathematical evidence and reasons to support their conceptual correctness.

Overall Presentation:

The overall presentation of the paper, which includes figures, tables, and references, is good. However, a minor revision will be necessary for the final approval of this paper.

Weakness of the Paper

Grammatical mistakes:

The paper has around 20 grammatical mistakes and several sentence formation problems. A careful revision is mandatory to correct these mistakes.

No Comparison:

The authors did not compare their results with any of the papers mentioned in the literature review. It is beyond the scope to validate the methodology's effectiveness without proper comparison.

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Sandeep Trivedi

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000355.r003

Decision Letter 1

Jennifer N Avari Silva

11 Jan 2024

Harnessing the Open Access Version of ChatGPT for Enhanced Clinical Opinions

PDIG-D-23-00320R1

Dear Mr. Tenner,

We are pleased to inform you that your manuscript 'Harnessing the Open Access Version of ChatGPT for Enhanced Clinical Opinions' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Jennifer N Avari Silva, MD

Section Editor

PLOS Digital Health

***********************************************************

Thank you for your responsiveness to the previous review.

Reviewer Comments (if any, and for reference):

Associated Data


    Supplementary Materials

    Attachment

    Submitted filename: Response to Reviewers.pdf


