Abstract
Pre-clinical studies suggest that large language models (e.g., ChatGPT) could be used in the diagnostic process to distinguish inflammatory rheumatic diseases (IRDs) from other diseases. We therefore aimed to assess the diagnostic accuracy of ChatGPT-4 in comparison to rheumatologists. For the analysis, the data set of Gräf et al. (2022) was used. Previous patient assessments were analyzed using ChatGPT-4 and compared to the rheumatologists' assessments. ChatGPT-4 listed the correct diagnosis as the top diagnosis comparably often to the rheumatologists (35% vs 39%, p = 0.30), as well as among the top 3 diagnoses (60% vs 55%, p = 0.38). In IRD-positive cases, ChatGPT-4 provided the correct top diagnosis in 71% vs 62% in the rheumatologists' analysis, and the correct diagnosis was among the top 3 in 86% (ChatGPT-4) vs 74% (rheumatologists). In non-IRD cases, ChatGPT-4 provided the correct top diagnosis in 15% vs 27% in the rheumatologists' analysis, and the correct diagnosis was among the top 3 in 46% (ChatGPT-4) vs 45% (rheumatologists). If only the first diagnostic suggestion was considered, ChatGPT-4 correctly classified 58% of cases as IRD compared to 56% for the rheumatologists (p = 0.52). ChatGPT-4 showed slightly higher accuracy for the top 3 overall diagnoses than the rheumatologists' assessment. ChatGPT-4 was able to provide the correct differential diagnosis in a relevant number of cases and achieved better sensitivity for detecting IRDs than the rheumatologists, at the cost of lower specificity. These pilot results highlight the potential of this new technology as a triage tool for the diagnosis of IRD.
Supplementary Information
The online version contains supplementary material available at 10.1007/s00296-023-05464-6.
Keywords: Large language models, ChatGPT, Rheumatology, Triage, Diagnostic process, Artificial intelligence
Introduction
Recent diagnostic and therapeutic advances in rheumatology are still counterbalanced by a shortage of specialists [1], resulting in a significant diagnostic delay [2]. Early and correct diagnosis is, however, essential to prevent persistent joint damage.
In this context, artificial intelligence applications including patient-facing symptom checkers represent a field of interest and could facilitate patient triage and accelerate diagnosis [3, 4]. In 2022, we were able to show that the symptom-checker Ada had a significantly higher diagnostic accuracy than physicians in the evaluation of rheumatological case vignettes [5].
Currently, the introduction of large language models (LLMs) such as ChatGPT has raised expectations for their use in medicine [6]. The impact of ChatGPT arises from its ability to engage in conversation and from performance that is close to or on par with human capabilities in various cognitive tasks [7]. For instance, ChatGPT has achieved satisfactory scores on the United States Medical Licensing Examination [8], and some authors suggest that LLM applications might be suitable for clinical, educational, or research environments [9, 10].
Interestingly, pre-clinical studies suggest that this technology could also be used in the diagnostic process [11, 12] to distinguish inflammatory rheumatic diseases from other conditions.
We therefore aimed to assess the diagnostic accuracy of ChatGPT-4 in comparison to a previous analysis including physicians and symptom checkers regarding rheumatic and musculoskeletal diseases (RMDs).
Methods
For the analysis, the data set of Gräf et al. [5] was used, with minor updates to the grouping of diagnoses. The previous patient assessments collected with the symptom-checker app were analyzed using ChatGPT-4 and compared to the previous results of Ada and to the diagnostic ranking of the blinded rheumatologists. ChatGPT-4 was instructed to name the top five differential diagnoses based on the information available from the Ada assessment (see Supplement 1).
All diagnostic suggestions were manually reviewed. If an inflammatory rheumatic disease (IRD) was among the top three suggestions (D3) or, for ChatGPT-4, the top five suggestions (D5), the case was classified as IRD-positive (even if non-IRD diagnoses were also among the suggestions). Proportions of correctly classified patients were compared between the groups using McNemar's test. Classification of IRD status was additionally assessed.
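To illustrate this paired comparison, the following is a minimal sketch, not the authors' analysis code; the function names, the boolean per-case representation, and the use of Python with statsmodels are assumptions made here for clarity.

```python
# Minimal sketch: top-k hit scoring and McNemar's exact test on paired
# per-case outcomes (ChatGPT-4 vs rheumatologists). Hypothetical helpers,
# not the study's actual pipeline.
from statsmodels.stats.contingency_tables import mcnemar

def top_k_hit(suggestions, correct_diagnosis, k):
    """True if the correct diagnosis appears among the first k suggestions."""
    return correct_diagnosis in suggestions[:k]

def compare_paired(hits_gpt, hits_rheum):
    """Build the 2x2 table of paired outcomes and run McNemar's exact test."""
    both       = sum(g and r for g, r in zip(hits_gpt, hits_rheum))
    gpt_only   = sum(g and not r for g, r in zip(hits_gpt, hits_rheum))
    rheum_only = sum(r and not g for g, r in zip(hits_gpt, hits_rheum))
    neither    = sum((not g) and (not r) for g, r in zip(hits_gpt, hits_rheum))
    table = [[both, gpt_only], [rheum_only, neither]]
    return mcnemar(table, exact=True)

# Hypothetical usage with per-case top-3 hits (lists of booleans):
# result = compare_paired(hits_gpt_top3, hits_rheum_top3)
# print(result.pvalue)
```

The test uses only the discordant pairs (cases where one rater was correct and the other was not), which is why a paired test such as McNemar's, rather than a comparison of independent proportions, is appropriate here.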
Results
ChatGPT-4 listed the correct diagnosis as the top diagnosis comparably often to the physicians (35% vs 39%, p = 0.30), as well as among the top 3 diagnoses (60% vs 55%, p = 0.38). In IRD-positive cases, ChatGPT-4 provided the correct top diagnosis in 71% vs 62% in the physician analysis. The correct diagnosis was among the top 3 in 86% (ChatGPT-4) vs 74% (physicians). In non-IRD cases, ChatGPT-4 provided the correct top diagnosis in 15% vs 27% in the physician analysis. The correct diagnosis was among the top 3 in non-IRD cases in 46% of the ChatGPT-4 group vs 45% of the physician group (Fig. 1).
Fig. 1.
Percentage of correctly classified diagnoses by rank
If only the first diagnostic suggestion was considered, ChatGPT-4 correctly classified 58% of cases as IRD compared to 56% for the rheumatologists (p = 0.52). If the top 3 diagnoses were considered, ChatGPT-4 classified 36% of the cases correctly as IRD vs 52% for the rheumatologists (p = 0.01) (see Fig. 1). ChatGPT-4 suggested at least one inflammatory diagnosis in every non-IRD case.
Discussion
ChatGPT-4 showed slightly higher accuracy (60% vs 55%) for the top 3 overall diagnoses compared to the rheumatologists' assessment. It had higher sensitivity than the rheumatologists in determining IRD status, but considerably lower specificity, suggesting that ChatGPT-4 may be particularly useful for detecting IRD patients, in whom timely diagnosis and treatment initiation are critical. It could therefore potentially be used as a triage tool for digital pre-screening and facilitate quicker referrals of patients with suspected IRDs.
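For readers less familiar with these measures, the trade-off follows directly from the confusion matrix of the binary IRD call. The sketch below uses made-up counts, not the study data, purely to show how the two quantities are computed.

```python
# Minimal sketch: sensitivity and specificity of an IRD-vs-non-IRD call,
# computed from hypothetical confusion-matrix counts (not the study data).
def sensitivity_specificity(tp, fn, tn, fp):
    """tp/fn: IRD cases flagged/missed; tn/fp: non-IRD cases cleared/flagged as IRD."""
    sensitivity = tp / (tp + fn)  # proportion of IRD cases correctly flagged
    specificity = tn / (tn + fp)  # proportion of non-IRD cases correctly cleared
    return sensitivity, specificity

# A model that rarely misses IRD but frequently flags non-IRD cases shows
# high sensitivity and low specificity, e.g. with these invented counts:
print(sensitivity_specificity(tp=30, fn=5, tn=10, fp=25))  # (~0.86, ~0.29)
```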
Our results are in line with those of Kanjee et al. [12], who demonstrated an accuracy of 64% for ChatGPT-4 when evaluating the top 5 differential diagnoses of the New England Journal of Medicine clinicopathological conferences.
Interestingly, in the cross-sectional study of Ayers et al. [13], chatbot responses to medical questions posted on a public social media forum were preferred over physician responses and were rated significantly higher for both quality and empathy, highlighting the potential of this technology as a first point of contact and source of information for patients. In summary, ChatGPT-4 was able to provide the correct differential diagnosis in a relevant number of cases and achieved better sensitivity for detecting IRDs than the rheumatologists, at the cost of lower specificity.
Although this analysis has some shortcomings, namely the small sample size and the limited information (only the Ada assessments were available, without further clinical data), it highlights the potential of this new technology as a triage tool that could support or even speed up the diagnosis of RMDs.
As digital self-assessment and remote care options are difficult for some patients due to limited digital health competencies [14], up-to-date studies should examine how accurately patients can express their symptoms and complaints using AI and symptom-checker applications, so that these technologies can be used more effectively.
Until satisfactory results are obtained, the use of artificial intelligence by GPs for effective referral, rather than for diagnosis, can be expanded, and larger prospective studies are recommended to further evaluate the technology. Furthermore, issues such as ethics, patient consent, and data privacy in the context of artificial intelligence in medical decision-making are crucial; critical guidelines for the application of LLM technologies such as ChatGPT are needed [15].
Author contribution
Conceptualization: MK and NR; data curation: all authors; formal analysis: MK and JC; funding acquisition: not applicable; investigation: all authors; methodology: MK, JC, and JK; software: MK; validation: all authors; visualization: MK and JC; writing—original draft: MK and NR; writing—review and editing: all authors.
Funding
Open Access funding enabled and organized by Projekt DEAL. MK: speaker fee and scientific funding from Ada. JC: speaker fees from Janssen-Cilag, Pfizer, and Idorsia, all unrelated to this work.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Rheumadocs und Arbeitskreis Junge Rheumatologie (AGJR), Krusche M, Sewerin P, Kleyer A, Mucke J, Vossen D, et al. Facharztweiterbildung quo vadis? Z Rheumatol. 2019;78(8):692–697.
- 2. Miloslavsky EM, Marston B. The challenge of addressing the rheumatology workforce shortage. J Rheumatol. 2022;49(6):555–557. doi: 10.3899/jrheum.220300.
- 3. Fuchs F, Morf H, Mohn J, Mühlensiepen F, Ignatyev Y, Bohr D. Diagnostic delay stages and pre-diagnostic treatment in patients with suspected rheumatic diseases before special care consultation: results of a multicenter-based study. Rheumatol Int. 2023;43(3):495–502. doi: 10.1007/s00296-022-05223-z.
- 4. Knitza J, Mohn J, Bergmann C, Kampylafka E, Hagen M, Bohr D. Accuracy, patient-perceived usability, and acceptance of two symptom checkers (Ada and Rheport) in rheumatology: interim results from a randomized controlled crossover trial. Arthritis Res Ther. 2021;23(1):112. doi: 10.1186/s13075-021-02498-8.
- 5. Gräf M, Knitza J, Leipe J, Krusche M, Welcker M, Kuhn S. Comparison of physician and artificial intelligence-based symptom checker diagnostic accuracy. Rheumatol Int. 2022;42(12):2167–2176. doi: 10.1007/s00296-022-05202-4.
- 6. Hügle T. The wide range of opportunities for large language models such as ChatGPT in rheumatology. RMD Open. 2023;9(2):e003105. doi: 10.1136/rmdopen-2023-003105.
- 7. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–1940. doi: 10.1038/s41591-023-02448-8.
- 8. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi: 10.1371/journal.pdig.0000198.
- 9. Thirunavukarasu AJ, Hassan R, Mahmood S, Sanghera R, Barzangi K, El Mukashfi M. Trialling a large language model (ChatGPT) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care. JMIR Med Educ. 2023;9:e46599. doi: 10.2196/46599.
- 10. Verhoeven F, Wendling D, Prati C. ChatGPT: when artificial intelligence replaces the rheumatologist in medical writing. Ann Rheum Dis. 2023;82(8):1015–1017. doi: 10.1136/ard-2023-223936.
- 11. Ueda D, Mitsuyama Y, Takita H, Horiuchi D, Walston SL, Tatekawa H. ChatGPT's diagnostic performance from patient history and imaging findings on the Diagnosis Please quizzes. Radiology. 2023;308(1):e231040. doi: 10.1148/radiol.231040.
- 12. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78. doi: 10.1001/jama.2023.8288.
- 13. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589–596. doi: 10.1001/jamainternmed.2023.1838.
- 14. de Thurah A, Bosch P, Marques A, Meissner Y, Mukhtyar CB, Knitza J. EULAR points to consider for remote care in rheumatic and musculoskeletal diseases. Ann Rheum Dis. 2022;81(8):1065–1071. doi: 10.1136/annrheumdis-2022-222341.
- 15. Tools such as ChatGPT threaten transparent science; here are our ground rules for their use. Nature. 2023;613(7945):612.