2024 Feb 17;81(11):7489–7499. doi: 10.1111/jan.16101

A comparative vignette study: Evaluating the potential role of a generative AI model in enhancing clinical decision‐making in nursing

Mor Saban 1, Ilana Dubovi 1
PMCID: PMC12535322  PMID: 38366690

Abstract

Aim

This study explores the potential of a generative artificial intelligence tool (ChatGPT) as clinical support for nurses. Specifically, we aim to assess whether ChatGPT can demonstrate clinical decision‐making equivalent to that of expert nurses and novice nursing students. This will be evaluated by comparing ChatGPT's responses to clinical scenarios with those of nurses at different levels of experience.

Design

This is a cross‐sectional study.

Methods

Emergency room registered nurses (i.e. experts; n = 30) and nursing students (i.e. novices; n = 38) were recruited during March–April 2023. Clinical decision‐making was measured using three validated clinical scenarios involving an initial assessment and reevaluation. Clinical decision‐making aspects assessed were the accuracy of initial assessments, the appropriateness of recommended tests and resource use and the capacity to reevaluate decisions. Performance was also compared by timing response generations and word counts. Expert nurses and novice students completed online questionnaires (via Qualtrics), while ChatGPT responses were obtained from OpenAI.

Results

Compared to both novices and experts, and concerning aspects of clinical decision‐making: (1) ChatGPT exhibited indecisiveness in initial assessments; (2) ChatGPT tended to suggest unnecessary diagnostic tests; (3) when new information required re‐evaluation, ChatGPT's responses demonstrated inaccurate understanding and inappropriate modifications. In terms of performance, ChatGPT's answers used 27–41 times more words than those of both experts and novices, and its responses were generated approximately four times faster than those of novices and twice as fast as those of expert nurses. ChatGPT responses maintained logical structure and clarity.

Conclusions

A generative AI tool demonstrated indecisiveness and a tendency towards over‐triage compared to human clinicians.

Impact

The study shows that the implementation of ChatGPT as a nurse's digital assistant must be approached with caution. Further research is needed to optimize the model's training and algorithms so that it can provide accurate healthcare support that aids clinical decision‐making.

Reporting method

This study adhered to relevant EQUATOR guidelines for reporting observational studies.

Patient or public contribution

Patients were not directly involved in the conduct of this study.

Keywords: ChatGPT, Clinical decision‐making, Generative AI, Novice, Nurse education, Nurses

1. INTRODUCTION

Clinical decision‐making (CDM) is a crucial nursing competency that involves comprehensive assessment, analytical evaluation and complex judgement in order to provide safe, quality patient care (Benner, 1982). While CDM occupies a central role in nursing practice, research has identified deficiencies in this ability among novice nurses and new graduates entering the workforce (Nibbelink & Brewer, 2018; Saintsing et al., 2011). Developing robust CDM skills is challenging yet vital for professional preparation and readiness to practice independently. Further investigation is needed to identify strategies that can strengthen novices' CDM competency before assuming full clinical responsibilities (Archibald & Clark, 2023; Jin & Ji, 2021; Wong & Kowitlawakul, 2020).

Artificial intelligence (AI) technologies have been developing rapidly and may be able to enhance certain aspects of healthcare (Gaube et al., 2021; Huh, 2023; Kao et al., 2023). Chat‐based generative pre‐trained transformer models such as ChatGPT utilize large language models to comprehend patient scenarios described in natural language and generate probable responses. ChatGPT shows potential for supporting clinical tasks through natural language interactions (Rosen & Saban, 2023a, 2023b). However, its ability to accurately employ principles of comprehensive CDM when managing various patient presentations, compared to experienced clinicians, remains uncertain (Rosen & Saban, 2023a). Evaluating ChatGPT's performance in responding to clinical vignettes and comparing this to human clinicians may provide insight into how this AI system could support or supplement certain elements of the decision‐making process without replacing human expertise (Miao & Journal, 2023; 2789 AAARJ of MS (ISSN), 2023). Accordingly, this study aimed to evaluate ChatGPT's performance relative to novice and expert nurses when responding to clinical vignettes involving assessment of presenting signs/symptoms, differential diagnosis, diagnostic testing and re‐evaluation based on new patient information.

2. BACKGROUND

OpenAI launched ChatGPT on 30 November 2022, quickly gaining widespread popularity (Scerri & Morin, 2023). It is considered one of the most advanced language generation models available to the public, capable of generating text for a wide range of applications. ChatGPT was trained on a diverse set of Internet texts and can render text in various styles and formats, including human‐like conversation, poetry and news articles. However, despite its success in other areas, ChatGPT was not initially developed for healthcare applications. To explore how AI could potentially support elements of CDM, recent studies have investigated ChatGPT's performance. For example, one study compared physician and ChatGPT responses to patient questions, demonstrating that ChatGPT produced better quality responses with greater empathy expression (Ayers et al., 2023). Another study by Kung et al. (2023) assessed ChatGPT's performance in medical licensing exams and found that its responses provided a role‐model deductive reasoning process, which could be implemented in medical education. Similarly, Fijačko et al. (2023) observed that ChatGPT provided insightful explanations in response to life support scenario‐based questions, suggesting its potential to support the human learning process. Finally, Rosen and Saban (2023a) demonstrated that ChatGPT was able to correctly identify 60% of complex clinical cases, suggesting its potential and the need for further research.

Given this, and with a focus on nursing, the present study seeks to investigate the potential of ChatGPT for enhancing CDM, a crucial component of nursing practice. The concept of CDM has evolved from the application of decision‐making to the healthcare professions (AS. E., 1995). This term was first introduced into the nursing literature in the 1980s and has since become an emblem of the nursing profession (Benner, 1982; Jones, 1988). CDM is defined as the art and science of considering a patient with a health problem, whether future or present, and reaching a decision that maximizes what the patient values while conforming to or improving clinicians’ workflow (Johansen & O'Brien, 2016; Nibbelink & Carrington, 2019). The decision‐making process includes interrelated concepts: critical thinking, clinical reasoning and clinical judgement. This process can only be achieved through strategies appropriately designed to individually evaluate each of the methods used by the nurse to analyse (critically think), apply (clinically reason) and act (clinically judge) (Lee & Park, 2019). Although essential, the nursing literature describes CDM as a tacit, intuitive and extremely complex process, especially for novice nurses (Victor‐Chmil, 2013).

Research indicates that CDM is influenced by one's level of experience. It has been shown that experienced nurses make better decisions, especially in the case of more complicated patient‐care decisions, than do inexperienced nurses (Hoffman et al., 2004; Nibbelink & Brewer, 2018). This can be explained by the seminal work produced by Patricia Benner (Benner, 1982, 1984), which established the relationship between intuition and expert clinical nursing practice (Benner, 1982; Epstein, 2013). Benner's (1984) Novice to Expert model proposes that nurses with more clinical experience (experts) use more intuition in CDM processes, while novice nurses (beginners) are primarily task‐oriented and rely on analytical processes to make clinical judgements (Epstein, 2013). To mitigate this novice‐to‐expert gap and foster CDM skills, novel pedagogies and various simulation technologies have been adopted into nursing education (Hoffman et al., 2004). However, multiple barriers still impede the development of sufficient CDM, especially among novice nurses (Alowais et al., 2023; Fazlollahi et al., 2022; Saintsing et al., 2011). The experience gap between novices and experts remains sizable.

Recent research suggests that artificial intelligence tools showing human‐level performance on language‐based tasks may help close this gap (Alshehri et al., 2023; Brown et al., 2023). By simulating the clinical reasoning of experienced practitioners, AI has the potential to provide novices with individualized guidance and feedback that leverages large training datasets (Wong & Kowitlawakul, 2020). Preliminary studies indicate that AI tutoring systems improve diagnostic accuracy and reduce cognitive biases in other domains (Shen et al., 2023). If AI is able to accurately reflect expert nursing judgement, it could expand the availability of high‐fidelity decision‐making practice opportunities necessary for competency development (Fu et al., 2023).

This study, therefore, aims to compare ChatGPT's responses to clinical vignettes involving nursing assessment, diagnosis and initial care planning, to responses by novice and expert nurses. Since CDM is especially essential in critical settings such as emergency rooms (ER), where rapid decision‐making is necessary, we selected ER triage for the current study's clinical setting. In this context, the triage nurse's ability plays a unique role in efficient emergency care, requiring accurate assessment of patients in time‐sensitive and critical situations.

2.1. Research aim

The aim of this study is to explore the potential of a generative artificial intelligence tool (ChatGPT) as clinical support for nurses. Specifically, we aim to assess whether ChatGPT can demonstrate clinical decision‐making equivalent to that of expert nurses and novice nurses. To achieve this goal, we evaluated the chatbot's ability to provide high‐quality responses to triage clinical scenarios by comparing its responses to those of expert nurses and novice nurses.

3. METHODS

3.1. Research design, participants and procedure

The study was conducted between March 1–May 30, 2023 at Tel Aviv University, located in Tel Aviv, Israel. Registered nurse recruitment took place from March 1–April 30, 2023 through an online forum for Israeli nurses with over 5000 members. Final‐year BSN nursing students at Tel Aviv University were recruited during orientation activities from March 1–31, 2023. Data were collected by administering a CDM assessment tool to participants online via the Qualtrics platform. Expert nurses completed the assessment online using Qualtrics from March 1 to April 30, 2023 during work hours at various hospitals located throughout Israel. Novice nursing students responded using Qualtrics from their university computers between March 31–April 15, 2023. ChatGPT's responses were elicited through the OpenAI interface from April 15–30, 2023.

The conversations with ChatGPT were conducted using a sampling temperature of 0.9 (Eva & Cunnington, 2006). Temperature is a parameter in GPT‐3.5 language models that controls the randomness of output—lower temperatures (e.g. 0.5) lead to more predictable, on‐target responses based closely on the training data, while higher temperatures (e.g. 1.0) introduce more novelty but can cause irrelevant responses. A temperature of 0.9 was selected to balance coherent, sensible responses while still allowing for some creative interpretation on the part of ChatGPT (Eva & Cunnington, 2006).
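The mechanism behind the temperature parameter can be illustrated with a minimal, self‐contained sketch. This is a generic temperature‐scaled softmax over hypothetical token scores, not OpenAI's internal implementation; the logits shown are invented for illustration only:

```python
import math

def temperature_softmax(logits, temperature):
    """Convert raw model scores (logits) into sampling probabilities.
    Lower temperature sharpens the distribution toward the top-scoring
    token; higher temperature flattens it toward uniform sampling."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.1]
low = temperature_softmax(logits, 0.5)   # more deterministic
high = temperature_softmax(logits, 1.0)  # more varied
# The top token receives a larger share of probability at T = 0.5
# than at T = 1.0, which is why lower temperatures feel "on-target".
```

At a temperature of 0.9, as used in the study, the distribution sits close to the model's unscaled probabilities, permitting some variability between generations.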

All participant data were anonymized upon collection. Following completion of data collection, responses were analysed from May 1–30, 2023 to compare the CDM process across groups (Ding et al., 2019; Mamede et al., 2008). An a priori statistical power analysis was performed for sample‐size estimation using G*Power software (version 3.1). With an alpha of 0.05 to detect an effect size of d = 0.25 (Antaki et al., 2023) and a power (1 − β) of 0.80, a sample of approximately 68 participants was calculated as sufficient to perform t‐test comparisons between the two groups.

3.1.1. Eligibility criteria

Eligibility criteria for nurses were as follows: (1) currently working in an ER department and (2) having a minimum of 1 year of experience working in the ER. Final‐year nursing students enrolled in the BSN program who had completed clinical practicum requirements were eligible to participate.

3.1.2. Recruitment and the selection process

Convenience sampling was used to recruit ER nurses. Informational flyers were distributed to all nurses working in the ER during the specified recruitment period. Nurses who volunteered were screened for eligibility. Eligible nursing students were directly invited to participate during a scheduled class meeting.

3.2. Ethical approval and consent to participate

The study was conducted after receiving the approval of the ethics committee of Tel Aviv University (#0006223‐2). All data were collected anonymously. Experts and novices were informed of their right to withdraw their participation in the study at any time and for any reason.

All nurse and student participants were provided with a digital information sheet containing details of the study purpose, procedures, risks and benefits of participation and the voluntary nature of participation via the Qualtrics platform. Informed consent was obtained digitally, whereby participants were required to confirm they had read the information sheet and provided their consent by clicking ‘I agree’ before accessing the clinical scenarios questionnaire.

3.3. Data collection instruments

3.3.1. The CDM process

The CDM process was assessed using three patient scenarios related to emergency triage. The scenarios were adapted from previously published studies by Eva and Cunnington (2006), Johnson et al. (2023), Mamede et al. (2008), and Pardos and Bhandari (2023), and one scenario was developed by the first author based on a real patient case (see Appendix A for an example). All scenarios contained some degree of ambiguity, with two probable diagnoses or management options to consider. Participants' CDM was examined based on their responses to these scenarios. All cases were validated by two emergency nurses with a PhD degree.

The CDM process was assessed for (1) the accuracy of the Canadian Triage and Acuity Scale (CTAS) urgency score; (2) the triage accuracy for use of resources, including laboratory or imaging tests; and (3) triage reevaluation accuracy in light of newly acquired clinical examinations, laboratory and imaging results. Performance was also compared by timing response generations and word counts.

3.3.2. Demographic characteristics

The questionnaire was presented only to experts and novices. Questions included gender, age and education level. Registered nurses were also asked in which department or clinic they were working and about their length of experience.

3.4. Procedure

After obtaining informed consent, participants were emailed a link to the online clinical scenarios questionnaire hosted on the Qualtrics platform. For nurses and students, this link was sent in March; for ChatGPT‐3.5 (Ding et al., 2019; Mamede et al., 2008), the scenarios were presented through the OpenAI interface, also in March. The clinical scenarios were presented online and each consisted of two phases. In Phase 1, participants were shown the initial patient information including history, signs/symptoms and past medical history. They were asked to assign an urgency score per the CTAS scale (Sun & Hoelscher, 2023) and recommend initial tests. In Phase 2, after completing Phase 1, additional examination and test result details were displayed. Participants were then asked if they would change their original urgency score and test recommendations. To progress between phases, participants clicked a button on the online interface.

Each participant was asked to read and respond to 3 clinical scenarios sequentially. The scenarios depicted initial patient assessments requiring CDM and re‐evaluation. Nurses and students were asked open‐ended questions about their reasoning for the care plan and tests/treatments ordered. ChatGPT's responses were automatically generated through natural language interactions.

To ensure standardization, the scenarios and questions were identically worded online and through the conversation interfaces. Participant responses were directly entered or generated within the respective platforms and later extracted for analysis.

3.5. Data analysis

All responses were coded jointly by both authors. Descriptive statistical analyses were performed using means and standard deviations for continuous variables and frequencies and percentages for categorical variables. Chi‐square tests or t‐tests were used to compare groups where appropriate. Statistical analyses were performed using SPSS Version 28 software.
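For a 2 × 2 contingency table such as the gender comparison reported below, the chi‐square test the authors ran in SPSS can be sketched in a few lines of standard‐library Python. This is an illustrative reimplementation, not the authors' code; it omits the Yates continuity correction, so its output may differ slightly from SPSS depending on the correction settings used:

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (df = 1, no continuity
    correction) for a 2x2 contingency table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For df = 1, P(X >= stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Gender counts from Table 1: nurses (18 F, 12 M) vs. students (30 F, 8 M)
stat, p = chi_square_2x2([[18, 12], [30, 8]])
```

With these counts the uncorrected statistic lands near the χ² ≈ 2.98, p = .085 reported in Table 1, with the small discrepancy attributable to correction and rounding choices in SPSS.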

4. RESULTS

The study comprised a total of 68 participants, including 30 registered nurses from nine different hospitals who work in the ER and 38 nursing students (novices). As shown in the summary of the participants’ demographic characteristics in Table 1, the novices were younger, with fewer years of education than the registered nurses. Registered nurses reported working a mean of 8.7 ± 7.6 years in the ER.

TABLE 1.

Comparison of the sociodemographic characteristics of nurses and nursing students (N = 68).

Variables                      Nurses (n = 30)   Nursing students (n = 38)   p        Test statistic
Age (years), M (SD) a          37.7 (±9.6)       27.6 (±7.1)                 <.0001   t = 86.7
Education (years), M (SD) a    17 (±2.6)         15 (±3.1)                   <.0001   t = 54.08
Gender, N (%) b                                                              .085     χ2 = 2.975
  Female                       18 (60%)          30 (80%)
  Male                         12 (40%)          8 (20%)

a t‐test for continuous variables.

b Chi‐square test for categorical variables.

4.1. Triage urgency

To illustrate the effect of expertise level (expert vs. novice) on urgency scoring, chi‐square tests were performed for each of the three clinical scenarios or cases. As apparent from Table 2, for Case 1 there was no statistically significant difference in scoring between experts and novices (χ2 = 3.192, p = .526). A plurality of experts (43%) and novices (42%) scored Case 1 at CTAS level 2, which was in line with the score level provided by ChatGPT as well as with the evaluators. For ChatGPT responses, see Appendix B.

TABLE 2.

Triage urgency, triage accuracy and triage reevaluation (descriptive statistics).

Case 1 (subarachnoid haemorrhage and MI)
  Triage urgency, Nurses / Students: P1 (Resuscitation) 6 (20%) / 13 (34%); P2 (Emergent) 13 (43%) / 16 (42%); P3 (Urgent) 7 (23%) / 7 (19%); P4 (Less urgent) 3 (10%) / 2 (5%); P5 (Non‐urgent) 1 (3%) / 0 (0%). ChatGPT: ‘CTAS score of 2’.
  Triage test accuracy, Nurses / Students: Accurate triage 14 (47%) / 22 (58%); Missing triage 15 (50%) / 12 (32%); Over triage 1 (3%) / 4 (10%). ChatGPT: over‐triaged by suggesting the need for urine tests and lumbar puncture.
  Triage re‐evaluation, Nurses / Students: Changed the decision accurately 14 (47%) / 13 (34%). ChatGPT: ‘Yes, the new laboratory and physical examination results significantly changed the initial decision‐making and urgency score’.

Case 2 (pneumonia vs. pulmonary embolus)
  Triage urgency, Nurses / Students: P1 4 (14%) / 9 (24%); P2 9 (32%) / 17 (45%); P3 14 (50%) / 12 (32%); P4 1 (4%) / 0 (0%); P5 0 (0%) / 0 (0%). ChatGPT: ‘CTAS level 2 or 3’.
  Triage test accuracy, Nurses / Students: Accurate triage 19 (64%) / 13 (34%); Missing triage 10 (33%) / 16 (42%); Over triage 1 (3%) / 9 (24%). ChatGPT: over‐triaged by suggesting the need for a blood culture.
  Triage re‐evaluation, Nurses / Students: Changed the decision accurately 7 (23%) / 6 (16%). ChatGPT: ‘The patient may not need to undergo further testing for a pulmonary embolism’.

Case 3 (GI vs. ectopic pregnancy)
  Triage urgency, Nurses / Students: P1 3 (9%) / 5 (13%); P2 8 (27%) / 13 (34%); P3 19 (64%) / 17 (45%); P4 0 / 2 (5%); P5 0 / 1 (3%). ChatGPT: ‘CTAS level 2’.
  Triage test accuracy, Nurses / Students: Accurate triage 17 (57%) / 9 (22%); Missing triage 12 (40%) / 22 (58%); Over triage 1 (3%) / 7 (19%). ChatGPT: missed the pregnancy test and over‐triaged by suggesting stool tests and endoscopy.
  Triage re‐evaluation, Nurses / Students: Changed the decision accurately 11 (38%) / 11 (29%). ChatGPT: remained unchanged.

Similarly, for Case 2 there was no statistically significant difference in the CTAS scoring between experts and novices (χ 2 = 4.118, p = .249). Most of the participants scored the case at CTAS level 2 (39%) or 3 (39%). Interestingly, ChatGPT too did not provide one definite score but stated: ‘According to the Canadian Triage and Acuity Scale for this case, it would likely be a CTAS level 2 or 3’ (Appendix B). Evaluators scored the urgency here as CTAS level 2.

Finally, for Case 3 there was similarly no statistically significant difference in scoring between experts and novices (χ 2 = 3.109, p = .540). While the majority of participants suggested an urgency scoring of 3 (52%), ChatGPT suggested a CTAS level of 2, which was in line with the evaluators.

4.2. Triage test accuracy

As shown in Table 2, we assessed the impact of expertise level on triage test accuracy. Interestingly, for Case 1 there was no statistically significant difference in triage accuracy between experts and novices (χ 2 = 3.012, p = .222). However, ChatGPT over‐triaged by suggesting not only the necessary CT scan or MRI of the brain, ECG and cardiac enzymes, but also urine tests and lumbar puncture, which were unnecessary at that stage of evaluation.

For Cases 2 and 3, the accuracy level was significantly different between experts and novices (χ2 = 10.296, p < .05; χ2 = 7.714, p = .021, respectively). For Case 2, the majority of nurses (64%) used the laboratory and imaging tests accurately (Table 2). However, only 34% of novices did so accurately (42% missed requiring certain necessary tests while 24% over‐triaged; Table 2). Here again, ChatGPT over‐triaged by suggesting not merely the essential tests such as ECG, chest X‐ray, d‐dimer test and arterial blood gas, but also a blood culture test, which was not required at that stage of triage.

Finally, for Case 3, as shown in Table 2, the majority (57%) of experts chose to use the lab resources and imaging tests accurately, while the majority of novices (58%) missed requesting all the necessary tests. ChatGPT missed requesting the required beta hCG pregnancy tests (due to the patient's age and ache in the left shoulder) but also over‐triaged by suggesting stool tests and endoscopy which were nonessential at that initial stage of triage.

4.3. Triage reevaluation

During the second phase of each of the three clinical scenarios, additional information was provided to facilitate triage reevaluation. Participants were then asked to reevaluate their decisions regarding urgency scoring and tests and, if needed, make changes and clarifications to their previous decisions. In Case 1, 51% clarified their decision regarding CTAS and tests, 34% did so for Case 2 and 38% for Case 3.

For Case 1 there was no statistically significant difference in the level of triage reevaluation accuracy between experts and novices (χ2 = 3.995, p = .262). The chatbot, however, changed its initial decision and explained this (Table 2). Similarly, in Case 2 there were no statistically significant differences between experts and novices in the level of triage reevaluation accuracy (Table 2; χ2 = 3.526, p = .327). Here, ChatGPT's CDM process was accurate since the findings were consistent with a bacterial infection. However, according to the evaluators, ChatGPT ruled out further testing for a pulmonary embolism too early. Finally, for Case 3, no statistically significant differences were found between experts and novices (Table 2; χ2 = 2.836, p = .242). Here, however, in contrast to the first two clinical cases, ChatGPT stated that its initial recommendations remained unchanged. This CDM process was inaccurate, since an ultrasound conducted by a gynaecologist showed fluid in the patient's abdominal cavity.

4.4. Performance time and text characteristics

Analysis of time performance demonstrated a significant mean difference between experts and novices for all three triage clinical scenarios (Figure 1a; t = −3.678, p < .001; t = −3.120, p < .001; and t = −2.650, p < .01, respectively). Novices took almost twice as long as expert nurses to generate CDM responses. ChatGPT, however, was four times faster than novices in generating CDM responses (Figure 1a).

FIGURE 1.

Performance time and word count. (a) Mean time performance (minutes) for response generation by nurses, students and chatbot. (b) The mean number of words in the responses by nurses, students and chatbot.

The word count analysis revealed no statistically significant difference between the mean number of words used by novices and experts (Figure 1b; t = −1.875, p = .064). However, ChatGPT utilized a vastly larger mean number of words to construct its CDM responses than both experts and novices (greater by 27 to 41 times), with a mean of 659 words per clinical scenario. Examination of the linguistic structure within ChatGPT responses revealed a consistent pattern. The opening sentence typically offered a concise response to the question posed, often commencing with the phrase ‘based on’ in reference to the information provided. This was followed by a series of explanatory sentences that elaborated upon and provided a logical sequence of deductive reasoning to justify the initial response. The majority of the responses incorporated a summary sentence at the end of the response to encapsulate the principal decision and its main rationale.

The linguistic analysis revealed that ChatGPT employed hedging techniques in 34 instances to avoid making definitive statements. Expressions such as ‘most likely’, ‘suggest’, ‘may include’, ‘would likely require’, ‘it is possible’ and ‘may be necessary’ were prevalent, indicating a carefully calibrated approach to conveying uncertainty or variability. Moreover, the responses by ChatGPT occasionally incorporated additional information beyond the specific query, such as potential diagnoses or further diagnostic consultation beyond the scope of triage evaluation.
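A hedging count like the one above could be reproduced with a simple case‐insensitive phrase match. This is only a sketch of one plausible approach, since the authors do not describe their exact coding procedure, and the sample sentence is invented for illustration:

```python
import re

# Hedging expressions reported in the linguistic analysis
HEDGES = ['most likely', 'suggest', 'may include', 'would likely require',
          'it is possible', 'may be necessary']

def count_hedges(text, phrases=HEDGES):
    """Return a dict mapping each hedging phrase to its number of
    case-insensitive occurrences in the given text."""
    lowered = text.lower()
    return {p: len(re.findall(re.escape(p), lowered)) for p in phrases}

# Invented example in the style of a ChatGPT triage response
sample = ("Based on the information provided, it would likely require "
          "a CT scan; it is possible that further tests may be necessary.")
counts = count_hedges(sample)
```

Summing the counts over all 9 ChatGPT responses (3 scenarios × initial assessment, tests and reevaluation) would yield a total comparable to the 34 hedging instances reported.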

5. DISCUSSION

This study examined the potential of ChatGPT as a platform for supporting novices' CDM processes. ChatGPT is an easily accessible low‐cost platform capable of suggesting CDM processes and supporting novices throughout their clinical practice as a consultant (Archibald & Clark, 2023). Despite the numerous potential advantages associated with integrating ChatGPT to support CDM, our study also highlights certain risks that warrant consideration.

To evaluate ChatGPT's ability to guide novices through the CDM process, we compared its accuracy in triage with that of ER nurses and final‐year novices. The findings indicated no significant disparities in urgency scoring between experts and novices. However, when it came to ChatGPT's scoring, the results revealed only partial accuracy, as the model exhibited some indecisiveness in determining the level of urgency. ChatGPT's language model was also inaccurate in its recommendations of laboratory and imaging tests. Specifically, it tended to over‐triage by suggesting unnecessary tests while simultaneously not requesting necessary tests, such as the beta hCG pregnancy test for Case 3. Even when provided with additional information, ChatGPT still produced inaccurate responses, indicating potential gaps in care provision.

Our findings are supported by recent studies showing that the medical and nursing domains still pose a significant challenge for the large language models underlying ChatGPT. Kung et al. (2023) evaluated the performance of ChatGPT on the United States Medical Licensing Exam and demonstrated that the chatbot did not reach the passing threshold, achieving an overall accuracy of only 50% (Kung et al., 2023). In addition, Antaki et al. (2023) showed only 42.7% to 55.8% accuracy in simulated ophthalmology exams. Further analysis of ChatGPT's responses to medical queries found that inaccurate answers most often occurred when responding to questions rated as highly difficult by physicians. These questions typically had binary yes/no answers or definitive diagnostic outcomes. While ChatGPT may demonstrate encouraging performance, more medical and nursing domain‐specific pre‐training is necessary to improve its performance.

Interestingly, we observed that in the clinical scenario questions, ChatGPT provided not only the answer to the question, as it does for standalone/individual questions, but also insightful explanations to support the given answer. ChatGPT's responses to clinical scenarios provided not just answers but also coherent, logical explanations that were easy for humans to follow, indicating potential as an educational supplement pending further study. Moreover, the chatbot's use of deductive reasoning in its responses aligns well with the process of CDM, making it a valuable role model for novices. Although our study revealed that ChatGPT's responses had a significantly higher word count compared to responses by experts and novices, it was able to generate them much faster compared to human responses. Such an AI‐assisted approach could unlock untapped productivity and become a self‐learning tool with immediate detailed feedback.

5.1. Limitations

The study has several limitations. First, the current study included only three clinical scenarios, which may not fully capture the range of complexity and variability in real‐world triage situations. Second, we tested only one sampling temperature (0.9) without varying this parameter. The temperature setting affects how randomly ChatGPT samples from possible responses, thereby influencing response quality and coherence. By conducting conversations at a single temperature level, we are unable to draw any conclusions about whether different temperatures may elicit more appropriate or knowledgeable responses for different clinical situations. Future research should systematically investigate ChatGPT's outputs across a range of temperatures to better understand potential impacts. Finally, larger‐scale studies should involve a more representative sample of nurses as experts and students as novices.

As an exploratory study of a general‐purpose conversational agent, it was important for us to conduct a careful initial analysis of ChatGPT before broader implementation or recruitment efforts, given the rapidly progressive pace of development in this technology. The field of conversational AI is advancing quickly, so model capabilities considered during our early 2023 study may differ in new and unexpected ways just a short time later. However, larger studies in the future may consider strategies to recruit across a longer period and widen inclusion criteria to increase the sample.

6. CONCLUSION

The promising applications of ChatGPT can induce significant shifts in standard nursing practice. However, this AI chatbot should be used in clinical practice as a supporting tool only with extreme caution, considering its potential limitations. As we show, an AI‐based tool such as ChatGPT can generate irrelevant, inappropriate and uncertain responses, especially in the medical and nursing field of triage. Moreover, as discussed earlier, other studies have also raised concerns regarding machine‐generated tutoring, showing clear human tutoring superiority as well as relatively low accuracy in clinical diagnosis identification (Rosen & Saban, 2023a). Other studies have also raised concerns regarding academic integrity and theft of intellectual property. Despite all these concerns, nurses must understand the rapid development and scalability of ChatGPT and recognize its potential. As we show, ChatGPT's linguistic structure is very cohesive, detailed and easy to comprehend, suggesting its potential for augmenting the nursing education process. Accordingly, future studies should explore the most effective ways of integrating ChatGPT into nursing practice, training and education as a digital assistant. Such studies can help pave the way for ChatGPT to be at the forefront of innovative approaches for enhancing clinical decision support systems or as a tool in nursing education.

AUTHOR CONTRIBUTIONS

MS and ID designed the study, interpreted the data, drafted the initial manuscript, and edited and approved the final version of the article. ID designed the methods section, analysed the data and reviewed and revised the manuscript. ID and MS critically reviewed the manuscript for important intellectual content.

FUNDING INFORMATION

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

CONFLICT OF INTEREST STATEMENT

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

PEER REVIEW

The peer review history for this article is available at https://www.webofscience.com/api/gateway/wos/peer‐review/10.1111/jan.16101.

ETHICS STATEMENT

The study protocol was approved by the ethics committee of the Tel‐Aviv University (#0006223–2).

CONSENT FOR PUBLICATION

None.

Supporting information

Data S1.

JAN-81-7489-s001.docx (20.3KB, docx)

ACKNOWLEDGEMENTS

None.

APPENDIX A.

A.1. DESCRIPTION OF CASE NO. 3

Mrs. L.M. is a 25‐year‐old woman who arrives at the emergency department via ambulance due to severe stomachache, diarrhoea, vomiting and weakness. At admission, she is fully conscious, alert, pale and anxious. Mrs. L.M. reports that she has visited the restroom at least ten times per day for the last 2 days. Besides the stomachache, she complains of pain in the left shoulder.

Mrs. L.M. is in generally good health. Her menstrual periods are irregular.

Vital signs on admission:

BP, 98/65 mm Hg; pulse, 110 bpm; temperature, 37.0°C; oxygen saturation (SpO2), 100% on room air.

A.1.1. Questions (Part A)

  1. What is the urgency score from 1 (most urgent) to 5 (not urgent at all)?

  2. Which symptoms would you attend to first? Explain why.

  3. To which physician/s would you refer the patient? Explain why.

  4. For which tests would you refer the patient? (e.g. ECG, labs)

A.1.2. Initial results

Lab results.

Stool colour—normal. Vomiting—contents of the stomach.

WBC—11,500/mcL; Hgb—11.3 g/dl; blood chemistry—normal. The serum lactate level is 1 mEq/L.

Urine—leukocytes, nitrites and erythrocytes.

US by gynaecologist: fluid in the abdominal cavity.

A.1.3. Questions (Part B)

  5. Following the lab and physical examination results, would you change your urgency score decision above (1–4)?

  6. If yes, why? Elaborate on your alternative hypothesis.

APPENDIX B.

B.1. CHATGPT RESPONSES FOR CASE NO. 3

B.1.1. ChatGPT answer (Part A)

Q1. According to the Canadian Triage and Acuity Scale (CTAS), the urgency score of a patient is determined by their presenting signs and symptoms. The scale ranges from 1 (most urgent) to 5 (least urgent).

In this case, Mrs. L.V. is experiencing severe stomachache, diarrhoea, vomiting and weakness. These symptoms indicate a potentially serious condition that requires urgent medical attention. The left‐shoulder ache is also concerning as it could be a sign of referred pain from an abdominal issue.

Based on this information, the urgency score of this case would likely be a CTAS level 2. CTAS level 2 indicates that the patient is at a high risk of becoming unstable and requires urgent assessment and treatment within 15 min to 1 h.

Q2. In this case, Mrs. L.V. is presenting with multiple symptoms, including severe stomachache, diarrhoea, vomiting, weakness and left shoulder ache. The symptoms that should be attended to first depend on their potential severity and the likelihood that they may indicate a life‐threatening condition.

Based on the information provided, the most urgent symptoms to attend to first are severe stomachache, diarrhoea and vomiting. These symptoms could indicate a potentially serious gastrointestinal condition, such as dehydration, infection, or inflammation. If left untreated, these conditions could lead to significant complications, such as electrolyte imbalances, sepsis, or organ failure. After addressing the gastrointestinal symptoms, the left shoulder ache should be evaluated. Referred pain to the left shoulder can be a symptom of several conditions, including pancreatitis, which is a medical emergency that requires urgent treatment.

Finally, the patient's weakness and anxiety should also be addressed. Weakness could be a sign of dehydration or electrolyte imbalances, while anxiety could be a natural response to the patient's discomfort and uncertainty about their condition. In summary, the most urgent symptoms to attend to first in this case are severe stomachache, diarrhoea and vomiting. These symptoms could indicate a potentially serious gastrointestinal condition that requires urgent assessment and treatment.

Q3. Based on the presenting symptoms, Mrs. L.V. requires urgent medical attention from a healthcare provider who is trained to manage gastrointestinal conditions and emergencies. In this case, the patient should be referred to a gastroenterologist or an emergency medicine physician. A gastroenterologist is a medical specialist who is trained in the diagnosis and treatment of gastrointestinal disorders, including acute and chronic conditions. They are also trained in endoscopic procedures, which may be necessary to evaluate the patient's condition.

An emergency medicine physician is also an appropriate physician to refer the patient to, as they are trained to manage a wide range of medical emergencies, including gastrointestinal conditions. In the emergency department, the physician will assess the patient's condition, stabilize them if necessary and order diagnostic tests and treatments as needed. Ultimately, the choice of physician to refer the patient to will depend on the availability of healthcare providers and the specific needs of the patient. The primary goal is to ensure that the patient receives prompt and appropriate medical attention to manage their symptoms and prevent further complications.

Q4. In this case, the choice of tests to be ordered for Mrs. L.V. will depend on the physician's assessment of the patient's condition and their suspected underlying cause of the presenting symptoms. However, some tests that may be ordered to help diagnose and manage the patient's symptoms are:

Blood tests: Blood tests can be ordered to assess the patient's electrolyte balance, kidney function, liver function and the levels of inflammation. These tests can also help identify potential infections or other underlying conditions that may be causing the symptoms.

Stool tests: Stool tests can be ordered to evaluate the presence of pathogens or other abnormalities that may be causing the patient's gastrointestinal symptoms.

Imaging tests: Imaging tests, such as abdominal ultrasound, CT scan, or MRI, can be used to visualize the patient's internal organs and identify any abnormalities that may be causing the symptoms.

Endoscopy: Endoscopy is a procedure that involves inserting a flexible tube with a camera into the patient's digestive tract. This procedure can help visualize any abnormalities in the digestive tract, obtain tissue samples for analysis and even provide treatment if necessary.

Electrocardiogram (ECG): An ECG can be ordered to evaluate the patient's heart function, especially if they are experiencing left shoulder pain, which can be a sign of referred pain from the heart.

The choice of tests will depend on the patient's specific symptoms, medical history and physical exam findings. The physician will evaluate the patient's condition and order tests that will help diagnose and manage the underlying cause of the presenting symptoms.

B.1.2. ChatGPT answer (Part B)

Q5. Based on the new lab and physical examination results, my initial recommendations for urgent medical attention, referral to a surgeon or specialist in infectious diseases and diagnostic testing such as a CT scan or abdominal ultrasound remain unchanged. The new information provides additional evidence for the presence of a potentially serious intra‐abdominal condition, such as peritonitis or abscess, that requires immediate medical attention.

The lab results suggest the presence of leukocytosis, fluids in the abdominal cavity and erythrocytes in the urine, which may indicate inflammation or infection in the abdominal region. The low haemoglobin level is also a concerning finding and requires further evaluation.

Therefore, the patient should be evaluated and managed promptly by a specialist in a hospital setting. Further diagnostic tests and treatment, such as intravenous antibiotics and drainage of any abscesses or collections, may be necessary to identify and manage the underlying cause of the patient's symptoms.

Q6. Not applicable.

Saban, M. , & Dubovi, I. (2025). A comparative vignette study: Evaluating the potential role of a generative AI model in enhancing clinical decision‐making in nursing. Journal of Advanced Nursing, 81, 7489–7499. 10.1111/jan.16101

DATA AVAILABILITY STATEMENT

The analysed data will be made available to researchers upon a reasonable request.

REFERENCES

  1. ChatGPT and nursing education: Challenges and opportunities. (2023). Al‑Rafidain Journal of Medical Sciences (ISSN 2789‑3219), 4, 50–51. 10.54133/ajms.v4i.110 [DOI] [Google Scholar]
  2. Alowais, S. A. , Alghamdi, S. S. , Alsuhebany, N. , Alqahtani, T. , Alshaya, A. I. , Almohareb, S. N. , Aldairem, A. , Alrashed, M. , Bin Saleh, K. , Badreldin, H. A. , al Yami, M. S. , al Harbi, S. , & Albekairy, A. M. (2023). Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Medical Education, 23(1), 689. 10.1186/s12909-023-04698-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Alshehri, F. D. , Jones, S. , & Harrison, D. (2023). The effectiveness of high‐fidelity simulation on undergraduate nursing students' clinical reasoning‐related skills: A systematic review. Nurse Education Today, 121, 105679. 10.1016/J.NEDT.2022.105679 [DOI] [PubMed] [Google Scholar]
  4. Antaki, F. , Touma, S. , Milad, D. , El‐Khoury, J. , & Duval, R. (2023). Evaluating the performance of ChatGPT in ophthalmology: An analysis of its successes and shortcomings. Ophthalmology Science, 3(4), 100324. 10.1016/J.XOPS.2023.100324 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Archibald, M. M. , & Clark, A. M. (2023). ChatGTP: What is it and how can nursing and health science education use it? Journal of Advanced Nursing, 79(10), 3648–3651. 10.1111/JAN.15643 [DOI] [PubMed] [Google Scholar]
  6. Elstein, A. S. (1995). Clinical reasoning in medicine. In Clinical reasoning in the health professions (pp. 49–59). 10.15063/RIGAKU.KJ00003223460 [DOI]
  7. Ayers, J. W. , Poliak, A. , Dredze, M. , Leas, E. C. , Zhu, Z. , Kelley, J. B. , Faix, D. J. , Goodman, A. M. , Longhurst, C. A. , Hogarth, M. , & Smith, D. M. (2023). Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Internal Medicine, 183(6), 589–596. 10.1001/JAMAINTERNMED.2023.1838 [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Benner, P. (1982). From novice to expert. AJN the American Journal of Nursing, 82(3), 402–407. [PubMed] [Google Scholar]
  9. Benner, P. (1984). From novice to expert. Menlo Park, 84(1480), 10–1097. [Google Scholar]
  10. Brown, C. , Nazeer, R. , Gibbs, A. , Le, P. P. , & Mitchell, A. R. (2023). Breaking bias: The role of artificial intelligence in improving clinical decision‐making. Cureus, 15(3), e36415. 10.7759/CUREUS.36415 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Ding, Y. , Park, E. , Nagarajan, M. , & Grafstein, E. (2019). Patient prioritization in emergency department triage systems: An empirical study of the Canadian triage and acuity scale (CTAS). Manufacturing and Service Operations Management, 21(4), 723–741. 10.1287/MSOM.2018.0719 [DOI] [Google Scholar]
  12. Epstein, R. M. (2013). Whole mind and shared mind in clinical decision‐making. Patient Education and Counseling, 90(2), 200–206. 10.1016/J.PEC.2012.06.035 [DOI] [PubMed] [Google Scholar]
  13. Eva, K. W. , & Cunnington, J. P. W. (2006). The difficulty with experience: Does practice increase susceptibility to premature closure? The Journal of Continuing Education in the Health Professions, 26(3), 192–198. 10.1002/CHP.69 [DOI] [PubMed] [Google Scholar]
  14. Fazlollahi, A. M. , Bakhaidar, M. , Alsayegh, A. , Yilmaz, R. , Winkler‐Schwartz, A. , Mirchi, N. , Langleben, I. , Ledwos, N. , Sabbagh, A. J. , Bajunaid, K. , Harley, J. M. , & del Maestro, R. F. (2022). Effect of artificial intelligence tutoring vs expert instruction on learning simulated surgical skills among medical students: A randomized clinical trial. JAMA Network Open, 5(2), e2149008. 10.1001/JAMANETWORKOPEN.2021.49008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Fijačko, N. , Gosak, L. , Štiglic, G. , Picard, C. T. , & Douma, M. J. (2023). Can ChatGPT pass the life support exams without entering the American heart association course? Resuscitation, 185, 109732. [DOI] [PubMed] [Google Scholar]
  16. Fu, Y. , Litu, M. O. , Yuhao, C. , Hao Peng, W. , & Khot, T. (2023). Chain‐of‐thought hub: A continuous effort to measure large language models’ reasoning performance . Published online May 26, Retrieved 22 August 2023 from https://arxiv.org/abs/2305.17306v1
  17. Gaube, S. , Suresh, H. , Raue, M. , Merritt, A. , Berkowitz, S. J. , Lermer, E. , Coughlin, J. F. , Guttag, J. V. , Colak, E. , & Ghassemi, M. (2021). Do as AI say: Susceptibility in deployment of clinical decision‐aids. npj Digital Medicine, 4(1), 1–8. 10.1038/s41746-021-00385-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Hoffman, K. , Donoghue, J. , & Duffield, C. (2004). Decision‐making in clinical nursing: Investigating contributing factors. Journal of Advanced Nursing, 45(1), 53–62. 10.1046/J.1365-2648.2003.02860.X [DOI] [PubMed] [Google Scholar]
  19. Huh, S. (2023). Are ChatGPT's knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination? A descriptive study. Journal of Educational Evaluation for Health Professions, 20, 1. 10.3352/JEEHP.2023.20.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Jin, M. , & Ji, C. (2021). The correlation of metacognitive ability, self‐directed learning ability and critical thinking in nursing students: A cross‐sectional study. Nursing Open, 8(2), 936–945. 10.1002/NOP2.702 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Johansen, M. L. , & O'Brien, J. L. (2016). Decision making in nursing practice: A concept analysis. Nurs Forum (Auckl)., 51(1), 40–48. 10.1111/NUF.12119 [DOI] [PubMed] [Google Scholar]
  22. Johnson, D. , Goodman, R. , Patrinely, J. , Stone, C. , Zimmerman, E. , Donald, R. , Chang, S. , Berkowitz, S. , Finn, A. , Jahangir, E. , Scoville, E. , Reese, T. , Friedman, D. , Bastarache, J. , van der Heijden, Y. , Wright, J. , Carter, N. , Alexander, M. , Choe, J. , … Wheless, L. (2023). Assessing the accuracy and reliability of AI‐generated medical responses: An evaluation of the Chat‐GPT model . Published online 10.21203/rs.3.rs-2566942/v1 [DOI]
  23. Jones, J. A. (1988). Clinical reasoning in nursing. Journal of Advanced Nursing, 13(2), 185–192. 10.1111/J.1365-2648.1988.TB01407.X [DOI] [PubMed] [Google Scholar]
  24. Kao, H. J. , Chien, T. W. , Wang, W. C. , Chou, W. , & Chow, J. C. (2023). Assessing ChatGPT's capacity for clinical decision support in pediatrics: A comparative study with pediatricians using KIDMAP of Rasch analysis. Medicine, 102(25), e34068. 10.1097/MD.0000000000034068 [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Kung, T. H. , Cheatham, M. , Medenilla, A. , Sillos, C. , de Leon, L. , Elepaño, C. , Madriaga, M. , Aggabao, R. , Diaz‐Candido, G. , Maningo, J. , & Tseng, V. (2023). Performance of ChatGPT on USMLE: Potential for AI‐assisted medical education using large language models. PLOS Digital Health, 2(2), e0000198. 10.1371/JOURNAL.PDIG.0000198 [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Lee, D. , & Park, J. (2019). A review for concept clarification of critical thinking, clinical reasoning, and clinical judgment in nursing education. The Journal of Korean Academic Society of Nursing Education, 25(3), 378–387. 10.5977/JKASNE.2019.25.3.378 [DOI] [Google Scholar]
  27. Mamede, S. , Schmidt, H. G. , & Penaforte, J. C. (2008). Effects of reflective practice on the accuracy of medical diagnoses. Medical Education, 42(5), 468–475. 10.1111/J.1365-2923.2008.03030.X [DOI] [PubMed] [Google Scholar]
  28. Miao, H. , & Journal, H. (2023). Impact of ChatGPT on interdisciplinary nursing education and research. apinj.jmir.org . Retrieved 25 August 2023 from https://apinj.jmir.org/2023/1/e48136/ [DOI] [PMC free article] [PubMed]
  29. Nibbelink, C. W. , & Brewer, B. B. (2018). Decision‐making in nursing practice: An integrative literature review. Journal of Clinical Nursing, 27(5–6), 917–928. 10.1111/JOCN.14151 [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Nibbelink, C. W. , & Carrington, J. M. (2019). Nurse decision making in acute care. CIN—Computers Informatics Nursing., 37(3), 151–160. 10.1097/CIN.0000000000000521 [DOI] [PubMed] [Google Scholar]
  31. Pardos, Z. A. , & Bhandari, S. (2023). Learning gain differences between ChatGPT and human tutor generated algebra hints. Published online February 14. Retrieved 25 August 2023 from http://arxiv.org/abs/2302.06871
  32. Rosen, S. , & Saban, M. (2023a). Can ChatGPT assist with the initial triage? A case study of stroke in young females. International Emergency Nursing, 70, 101340. 10.1016/J.IENJ.2023.101340 [DOI] [Google Scholar]
  33. Rosen, S. , & Saban, M. (2023b). Evaluating the reliability of ChatGPT as a tool for imaging test referral: A comparative study with a clinical decision support system. European Radiology, 1, 1–12. 10.1007/s00330-023-10230-0 [DOI] [PubMed] [Google Scholar]
  34. Saintsing, D. , Gibson, L. M. , & Pennington, A. W. (2011). The novice nurse and clinical decision‐making: How to avoid errors. Journal of Nursing Management, 19(3), 354–359. 10.1111/j.1365-2834.2011.01248.x [DOI] [PubMed] [Google Scholar]
  35. Scerri, A. , & Morin, K. H. (2023). Using chatbots like ChatGPT to support nursing practice. Journal of Clinical Nursing, 32, 4211–4213. 10.1111/JOCN.16677 [DOI] [PubMed] [Google Scholar]
  36. Shen, Y. , Heacock, L. , Elias, J. , Hentel, K. D. , Reig, B. , Shih, G. , & Moy, L. (2023). ChatGPT and other large language models are double‐edged swords. Radiology, 307(2), e230163. 10.1148/radiol.230163 [DOI] [PubMed] [Google Scholar]
  37. Sun, G. H. , & Hoelscher, S. H. (2023). The ChatGPT storm and what faculty can do. Nurse Educator, 48(3), 119–124. 10.1097/NNE.00000000000013 [DOI] [PubMed] [Google Scholar]
  38. Victor‐Chmil, J. (2013). Critical thinking versus clinical reasoning versus clinical judgment: Differential diagnosis. Nurse Educator, 38(1), 34–36. 10.1097/NNE.0B013E318276DFBE [DOI] [PubMed] [Google Scholar]
  39. Wong, S. H. V. , & Kowitlawakul, Y. (2020). Exploring perceptions and barriers in developing critical thinking and clinical reasoning of nursing students: A qualitative study. Nurse Education Today, 95, 104600. 10.1016/J.NEDT.2020.104600 [DOI] [PubMed] [Google Scholar]

