ChatGPT (Chat Generative Pre-trained Transformer), an artificial intelligence (AI) chatbot, was launched in 2022. It is built on a large language model (LLM) trained on text derived from websites, Internet forums, digital books, and video subtitles. After registering on openai.com, users can prompt ChatGPT at chat.openai.com to answer questions on virtually any topic.
Research and clinical communities are currently signaling both opportunities and pitfalls of relying on ChatGPT to write scientific papers or provide information about clinical issues 1. Importantly, few resources are available to guide the uptake of ChatGPT in health care education, e.g. concerning its performance in answering the clinical questions that professionals are confronted with in everyday practice. Indeed, many researchers and clinicians worry about incorrect content and a lack of nuance in AI-generated information 2,3. On the other hand, given the large inequities in medical education opportunities and in the availability of medical knowledge and full-text research publications across the globe 4, low- and middle-income countries (LMICs) in particular may benefit from AI, as Internet access on a device is the only prerequisite for using free chatbots.
To address the current knowledge gap about the reliability of ChatGPT in answering questions about clinical psychiatry, we examined the accuracy, completeness and nuance of its answers to a diverse set of questions, as well as the speed at which it generates answers compared with other sources of information.
Our approach was divided into two layers: first, an author‐rated analysis of the accuracy, completeness and nuance of ChatGPT's answers; second, an analysis comparing the accuracy, completeness, nuance and speed between answers provided by respondents using ChatGPT and respondents using other information sources.
In the first layer, two raters conceived 40 questions (20 questions each) representing a diversity of topics related to epidemiology, diagnosis and treatment in psychiatry (see supplementary information). Each rater assessed the accuracy, completeness and nuance of the answers given by ChatGPT (version 3; Dec. 15, 2022 release) to the questions conceived by the other rater. ChatGPT's answers were rated on a scale from 0 to 2 (0, insufficient; 1, reasonable to good; 2, very good to perfect) for each of the quality criteria (accuracy, completeness and nuance). Average scores and standard deviations (SDs) were computed.
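As an illustration of this first-layer scoring, a minimal R sketch is shown below. The data frame, its column names and the values in it are our own assumptions for illustration, not study data or the authors' code:

```r
# Minimal sketch of the first-layer scoring (illustrative only).
# 'ratings' holds one row per question with the 0-2 scores assigned
# by the cross-rating author; all names and values are hypothetical.
ratings <- data.frame(
  accuracy     = c(2, 2, 1, 2, 0),   # toy values, not study data
  completeness = c(2, 1, 1, 2, 2),
  nuance       = c(2, 2, 2, 1, 2)
)

# Average score and SD per quality criterion
means <- sapply(ratings, mean)
sds   <- sapply(ratings, sd)
round(rbind(mean = means, sd = sds), 2)
```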
In the second layer, 85 psychiatrists and psychiatry residents working at institutes in The Netherlands, Germany and the US, not including the raters, were asked to participate in an online survey. Participants were randomized either to ChatGPT or to any other source of information they preferred, except other chatbots. After randomization, each participant was requested to answer 10 of the same questions as in the first layer, with all questions having the same number of respondents in the two groups. Two raters, blinded to group (ChatGPT vs. other), then assessed the accuracy, completeness and nuance of each answer. Squared weighted kappas were computed to assess inter-rater reliability between the blinded raters. The times taken to answer the questions were compared between the ChatGPT group and the other group.
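The inter-rater reliability step could be reproduced along the lines of the following R sketch. The study does not name a package, so the use of the irr package, and all variable names and values, are assumptions:

```r
# Minimal sketch of the inter-rater reliability analysis (not the
# authors' actual code). Assumes the 'irr' package is installed.
library(irr)

# Toy composite scores (0-6) from the two blinded raters for ten
# answers; values are invented for illustration.
scores <- data.frame(
  rater1 = c(6, 5, 6, 4, 3, 6, 5, 2, 6, 4),
  rater2 = c(6, 4, 6, 4, 4, 5, 5, 3, 6, 4)
)

# Squared (quadratically) weighted kappa for the two raters
kappa2(scores, weight = "squared")
```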
All analyses were performed using R version 4.2.3. The average of all accuracy, completeness and nuance scores was used as the main outcome measure in all analyses and is referred to as the composite score. Additional outcomes included the individual accuracy, completeness and nuance scores, as well as response speed. For the composite score, means and SDs were divided by 6 (the maximum score) and multiplied by 10, to translate the original 0-6 range to a 0-10 scale. To obtain mean and SD values for the individual accuracy, completeness and nuance scores, original values were divided by 2 (the maximum score) and multiplied by 10, to translate the 0-2 range to 0-10. The Mann-Whitney U test was used to compare scores between the two groups (ChatGPT vs. other). Total response times for all questions were also compared using the Mann-Whitney U test. Finally, odds ratios (ORs) with 95% confidence intervals (CIs) were computed to assess the chance of obtaining the maximum composite, accuracy, completeness and nuance scores when using ChatGPT vs. not using it. Statistical significance of the ORs was assessed using Fisher's exact test. The threshold for statistical significance was Bonferroni-corrected for multiple testing (dividing 0.05 by the number of tests performed).
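For concreteness, the second-layer comparisons could be implemented along the lines of the following R sketch. The data frame, its column names, the toy values and the placeholder number of tests are assumptions for illustration, not the authors' code or data:

```r
# Minimal sketch of the second-layer comparisons (illustrative only).
set.seed(1)
answers <- data.frame(
  group     = rep(c("ChatGPT", "other"), each = 20),
  composite = sample(0:6, 40, replace = TRUE),   # toy 0-6 composite scores
  time_sec  = round(runif(40, 30, 300))          # toy response times
)

# Rescale the 0-6 composite to the 0-10 scale described above
answers$composite10 <- answers$composite / 6 * 10

# Mann-Whitney U tests for scores and total response times
wilcox.test(composite10 ~ group, data = answers)
wilcox.test(time_sec ~ group, data = answers)

# OR (with 95% CI) and Fisher's exact p value for obtaining the
# maximum composite score, from the 2x2 contingency table
tab <- table(answers$group, answers$composite == 6)
fisher.test(tab)

# Bonferroni-corrected significance threshold
n_tests <- 8                 # placeholder for the number of tests performed
alpha   <- 0.05 / n_tests
```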
In the first layer of the study, we found the following average 0‐10 scale scores for ChatGPT: composite 8.0 (SD=2.8), accuracy 8.4 (SD=2.9), completeness 7.6 (SD=3.0), and nuance 8.1 (SD=3.3). In the answers to the 40 questions, we detected 4 erroneous information units (average of 0.1 per question).
In the second layer of the study, 38 respondents participated (25 psychiatrists and 13 residents). The average weighted kappa across raters was 0.65. For participants using ChatGPT, the average 0‐10 scale scores were as follows: composite 7.6 (SD=2.9), accuracy 8.1 (SD=3.1), completeness 7.3 (SD=3.2), and nuance 7.2 (SD=3.5).
We detected significantly higher composite scores in ChatGPT users than in non-users (7.6 vs. 6.7, p=0.0016). ChatGPT users were on average 19% faster in completing the questionnaire than users of other sources, although this difference was not significant. ChatGPT users had greater odds of obtaining maximum scores than non-ChatGPT users: ORs were 2.34 (composite), 1.96 (completeness) and 2.89 (nuance), with 95% CIs not encompassing 1, and corrected p values of 0.0037, 0.022 and 3.09×10⁻⁵, respectively. The OR for accuracy was 1.33 (not significant). ChatGPT answered questions about pharmacotherapy (particularly about interactions and specific indications) less accurately than other questions, possibly because reliable online information on these topics is scarce and textbooks remain their main source.
In sum, in what we believe is the first study about the reliability of ChatGPT in answering questions about clinical psychiatry, we found that ChatGPT answered a 40‐item test with high accuracy, completeness and nuance. Participants using ChatGPT performed better than those using other resources.
A strength of our study is its comprehensive, two-layered approach, which goes beyond similar studies in other specialties in the number of users, number of questions asked, number of outcomes, and the complementary use of two methods 5. A limitation is a potential lack of power to detect significant differences in response speed between ChatGPT and non-ChatGPT users. In addition, improvement over time in the performance of ChatGPT could be examined using longitudinal designs.
We conclude that ChatGPT scores well on accuracy, completeness, nuance and speed when generating answers to clinical questions in psychiatry. It may therefore represent a tool providing rapid access to reliable information about clinical psychiatry that is (at the time of writing) freely available to medical students, residents and physicians across the globe. It may thus also contribute to bridging gaps in health care education between richer countries and LMICs. However, we highlight the need for research into the ethical issues of relying on medical knowledge derived from AI 6.
Supplementary information on this study is available at https://github.com/flgerritse/ChatGPT_psychiatry/.
REFERENCES
1. Haupt CE, Marks M. JAMA 2023;329:1349-50.
2. Sallam M. Healthcare 2023;11:887.
3. Anonymous. Nature 2023;613:612.
4. Frenk J, Chen L, Bhutta ZA et al. Lancet 2010;376:1923-58.
5. Grünebaum A, Chervenak J, Pollet SL et al. Am J Obstet Gynecol 2023; doi: 10.1016/j.ajog.2023.03.009.
6. Flanagin A, Bibbins-Domingo K, Berkwits M et al. JAMA 2023;329:637-9.