Skip to main content
Archives of Academic Emergency Medicine logoLink to Archives of Academic Emergency Medicine
. 2024 Jul 30;12(1):e60. doi: 10.22037/aaem.v12i1.2384

Diagnostic Accuracy of ChatGPT for Patients’ Triage; a Systematic Review and Meta-Analysis

Navid Kaboudi 1, Saeedeh Firouzbakht 2, Mohammad Shahir Eftekhar 3, Fatemeh Fayazbakhsh 4, Niloufar Joharivarnoosfaderani 5, Salar Ghaderi 6, Mohammadreza Dehdashti 7, Yasmin Mohtasham Kia 8, Maryam Afshari 9, Maryam Vasaghi-Gharamaleki 10, Leila Haghani 11, Zahra Moradzadeh 12, Fattaneh Khalaj 13, Zahra Mohammadi 12, Zahra Hasanabadi 14, Ramin Shahidi 7
PMCID: PMC11407534  PMID: 39290765

Abstract

Introduction:

Artificial intelligence (AI), particularly ChatGPT developed by OpenAI, has shown the potential to improve diagnostic accuracy and efficiency in emergency department (ED) triage. This study aims to evaluate the diagnostic performance and safety of ChatGPT in prioritizing patients based on urgency in ED settings.

Methods:

A systematic review and meta-analysis were conducted following PRISMA guidelines. Comprehensive literature searches were performed in Scopus, Web of Science, PubMed, and Embase. Studies evaluating ChatGPT's diagnostic performance in ED triage were included. Quality assessment was conducted using the QUADAS-2 tool. Pooled accuracy estimates were calculated using a random-effects model, and heterogeneity was assessed with the I² statistic.

Results:

Fourteen studies with a total of 1,412 patients or scenarios were included. ChatGPT 4.0 demonstrated a pooled accuracy of 0.86 (95% CI: 0.64-0.98) with substantial heterogeneity (I² = 93%). ChatGPT 3.5 showed a pooled accuracy of 0.63 (95% CI: 0.43-0.81) with significant heterogeneity (I² = 84%). Funnel plots indicated potential publication bias, particularly for ChatGPT 3.5. Quality assessments revealed varying levels of risk of bias and applicability concerns.

Conclusion:

ChatGPT, especially version 4.0, shows promise in improving ED triage accuracy. However, significant variability and potential biases highlight the need for further evaluation and enhancement.

Key Words: ChatGPT, triage, emergency department, diagnostic performance

1. Introduction:

Artificial intelligence (AI) has rapidly become a cornerstone of modern healthcare, revolutionizing various aspects of medical practice and research. AI technologies, including machine learning and natural language processing, have significantly improved diagnostic accuracy, treatment planning, and patient management. For instance, AI-driven systems can analyze vast amounts of medical data to identify patterns that are often undetectable to human clinicians, leading to earlier and more accurate diagnoses (1).

ChatGPT, developed by OpenAI, represents a significant milestone in the evolution of conversational AI. Initially released in November 2022 and powered by the GPT-3.5 architecture, ChatGPT quickly demonstrated its capabilities in generating coherent and contextually relevant responses across various domains. The model was trained on extensive datasets, enabling it to perform tasks such as code generation, text summarization, and complex problem-solving with remarkable accuracy (2, 3). Also, ChatGPT can assist in generating medical documentation, providing accurate and efficient patient information, and even aiding in diagnostic processes by analyzing patient data and suggesting possible diagnoses. Moreover, ChatGPT's ability to handle large volumes of data quickly and accurately makes it a valuable tool for streamlining workflows and improving overall healthcare efficiency (4).

Triage plays a crucial role in emergency departments (EDs) by prioritizing patients based on the severity of their conditions, ensuring that those who need urgent care receive it promptly. This process is essential for managing the high volume of patients and maintaining efficient workflow in busy EDs. The implementation of advanced triage protocols has been shown to improve patient outcomes and reduce waiting times, enhancing the overall efficiency of emergency care. Additionally, effective triage systems can help in better resource allocation, ensuring that critical cases receive immediate attention while non-urgent cases are managed appropriately (5).

The COVID-19 pandemic has highlighted some limitations of current triage systems, particularly in handling public health emergencies. The surge in patient numbers during the pandemic exposed weaknesses in existing triage protocols, including inadequate resources and the need for rapid adaptability. Traditional triage systems were not designed to cope with such unprecedented demand, leading to increased patient wait times and strained healthcare facilities. Moreover, there were ethical and logistical challenges in triage decisions, as protocols often failed to address disparities and ensure equitable care for all patients, particularly those from vulnerable populations (6, 7).

However, Evaluating the diagnostic performance of ChatGPT and its safety for ED triage is essential due to its potential to significantly enhance clinical decision-making and patient management. Studies have shown that while ChatGPT can achieve impressive diagnostic accuracy, it still faces challenges such as high rates of unsafe triage decisions compared to other systems like Ada and WebMD Symptom Checkers. Ensuring the accuracy and safety of ChatGPT in triage can help prevent misdiagnosis and improve patient outcomes, making it a valuable tool in emergency medicine (8).

Thus, this study aimed to assess the diagnostic performance of ChatGPT and its safety in performing ED triage. By evaluating the accuracy, reliability, and effectiveness of ChatGPT in prioritizing patients based on urgency, this study seeks to provide a comprehensive understanding of its potential role in emergency care settings.

2. Methods:

2.1 Study design and design

This systematic review and meta-analysis was conducted following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines (9). The protocol for this systematic review and meta-analysis was registered in the PROSPERO database with the registration code (CRD42024531858).

2.2 Study Questions and PICOS Framework

This study aimed to answer the following questions:

  1. What is the diagnostic accuracy of ChatGPT in performing emergency department triage?

  2. How does the performance of ChatGPT 3.5 compare to ChatGPT 4.0 in this context?

  3. What are the potential biases and limitations of using ChatGPT for ED triage?

To address these questions, we used the PICOS framework:

  • Participants (P): Patients or scenarios presented in emergency department settings.

  • Interventions (I): Use of ChatGPT 3.5 and ChatGPT 4.0 for triage.

  • Comparisons (C): Comparison of ChatGPT's diagnostic performance with human triage or other triage systems.

  • Outcomes (O): Diagnostic accuracy, and inter-rater reliability.

  • Study Design (S): Cross-sectional studies, retrospective analyses, and other relevant study designs assessing the performance of ChatGPT in ED triage.

2.3 Search Strategy

A comprehensive literature search was conducted across four major databases: Scopus, Web of Science, PubMed, and Embase. The search strategy was designed to identify studies evaluating the diagnostic performance of ChatGPT in ED triage. Keywords included combinations of terms such as "ChatGPT," and "triage". The search was restricted to studies published from the inception of these databases until 19 March 2024. Reference lists of the identified studies and relevant reviews were also screened to ensure completeness. Details of keywords and search strategies for each database are provided in the Supplementary table 1 and 2. No limits were used in our search.

Table 1.

Characteristics of included studies

First author/
year/
Country
Study Design Triage setting Evaluated scenarios Large Language Model Used Comparison Group LLM's diagnostic performance Inter-rater reliability Main Findings Limitations discussed in the included studies
Sinan Paslı et al.
2024 Turkey
Observational study Patients who presented to the ED 758 ChatGPT 4.0 Triage team and emergency medicine specialist 77,77 % to 100 % sensitivity for triaging the patients Near-perfect agreement with gold standard (Cohen's Kappa 0.899) for triaging ChatGPT showed excellent predictive skills in triaging with high agreement with gold standard Using local emergency rules, subjective general condition assessment
Rick Kye Gan et al.
2023 Malaysia
Cross-sectional study Mass casualty incident cases 15 ChatGPT 3.5 and Google Bard Medical students' performance from a previous study and Google Bard ChatGPT: 66.67% over-triaged, Correctly-triaged 26.67%, 6.67% under-triaged; Google Bard:60% Correctly-triaged, 40% over-triaged, no under-triage reported Not reported Google Bard was superior to ChatGPT in triage accuracy Small sample size, only START triage protocol evaluated
Arian Zaboli et al.
2024 Italy
Observational study Common cases in ED 30 ChatGPT 3.5 Two triage nurses using Manchester Triage System (MTS) Chat-GPT 3.5 - sensitivity (55.5%). Chat-GPT 3.5 - specificity (57.1%) Agreement in MTS code assignment between Comparison group and Chat-GPT 3.5 was 0.278. (Cohen's Kappa) This study showed Insufficient reliability and effectiveness of ChatGPT 3.5 to replace triage nurses. Small sample size (30 vignettes), ChatGPT was not trained for ED triage.
Jae Hyuk Kim et al.
2024 Online (no specific location)
Inter-rater reliability study on virtual cases Common cases in ED 202 ChatGPT 3.5 and ChatGPT 4.0 4 human raters (emergency staff) - experienced emergency medicine specialist Not reported Human raters with ChatGPT 3.5 (Fleiss’ kappa=0.320) and Human raters with ChatGPT 4.0 (Fleiss’ kappa= 0.523) ChatGPT have a Potential for ED triage but it's agreement with human raters is insufficient. Use of virtual scenarios, not real patients
Jeffrey Michael Franc et al.
2024 Online (no specific location)
Crossed gauge repeatability and reproducibility study Common cases in ED 61 ChatGPT 3.5 Emergency medicine specialists ChatGPT 3.5: 47.5% Correctly-triaged, 13.7% under-triaged, and 38.7% over-triaged Not reported ChatGPT is not effective for triaging patients using the Canadian Triage and Acuity Scale, showing low repeatability and only 47% accuracy in its assessments. One LLM evaluated, small number of prompts, prompts not optimized, no fine-tuning for triage
Marc Ayoub et al.
2023 Online (no specific location)
Cross-sectional study Common cases in ED 9 ChatGPT 3.5 No direct comparison; Physicians scored responses Average score 4.2/5 (SD 0.7) by 5 physicians for triage performance of ChatGPT 3.5 Not reported ChatGPT has the potential to augment clinical decision-making. Small number of scenarios, clinician agreement difficulty, need for further legal/accuracy evaluation
Max Ward et al.
2024
USA
Comparative analysis - LLM vs. medical professionals Cases needed Neurosurgical attention 30 ChatGPT 3.5 and ChatGPT 4.0 Neurosurgical attendings, residents, physician assistants, subinterns ChatGPT 3.5: 92.59% Correctly-triaged; ChatGPT 4.0: 100% Correctly-triaged. Not reported ChatGPT 4.0 performed at senior resident level and ChatGPT 3.5 performed near PGY-1 level for triaging the patients. Continuously changing nature of LLMs
Stefan Bushuven et al.
2023 Online (no specific location)
Cross-sectional study Prehospital Basic Life Support and Paediatric Advanced Life Support Cases 22 ChatGPT 3.5 and ChatGPT 4.0 No human comparison Not reported Poor to acceptable reliability ChatGPT/GPT-4 correctly identified 12 of 22 scenarios (54.5%) as emergencies of high urgency (Inconsistent emergency triage performance ). No human comparison, selection bias, prototypical cases, regional variability
Hamish Fraser et al.
2023
USA
Observational study Patients who presented to the ED 40 ChatGPT 3.5 and ChatGPT 4.0 Ada symptom checker, WebMD symptom checker, ED physicians ChatGPT 3.5: 59% Correctly-triaged, 41% under-triaged, and 0% over-triaged; ChatGPT 4.0: 76% Correctly-triaged, 22% under-triaged, and 3% over-triaged. Not reported ChatGPT 4.0 was better at triage than ChatGPT 3.5 but worse in diagnostic accuracy. Data presentation differences, not simulating real patient queries, changes in versions
Riley J. Lyons et al.
2023
USA
Cross-sectional study Cases with common ophthalmic complaints 44 ChatGPT 4.0 Ophthalmology trainees, Bing Chat, WebMD Symptom Checker ChatGPT 4.0: 98% Correctly-triaged. Not reported ChatGPT 4.0 showed high accuracy comparable to trainees for ophthalmology triaging the patient and There were no harmful statements generated with this LLM. Probability of generating incorrect information from ChatGPT and propagating biases in triage
Rick Kye Gan et al.
2023 Online (no specific location)
Cross-sectional study Mass casualty incident cases 15 ChatGPT 3.5 Medical students' triage performance ChatGPT 3.5: 80% Correctly-triaged, 0% under-triaged, and 20% over-triaged Not reported ChatGPT 3.5 showed Higher triage performance compared to medical students after teaching No explicit limitations identified
Georges Gebrael et al.
2023
USA
Retrospective study Metastatic prostate cancer patients 56 ChatGPT 4.0 Emergency medicine physicians ChatGPT 4.0: 50% Correctly-triaged, 4% under-triaged, and 46% over-triaged Not reported ChatGPT 4.0 showed High sensitivity for admission of but lower specificity in discharges along with accurate diagnoses and treatment recommendations Lower specificity in discharges, no association between ESI scores and outcomes
İbrahim Sarbay et al.
2023 Turkey
Cross-sectional study Common cases in ED 50 ChatGPT 3.5 Three emergency medicine specialists ChatGPT 3.5: 60% Correctly-triaged, 18% under-triaged, and 22% over-triaged Cohen's Kappa: 0.341 (fair agreement) ChatGPT had better performance for high acuity cases Small sample size, lack of generalizability, no validation on real data
Roya Zandi et al.
2024
USA
In silico study Common Ophthalmic Complaints 80 ChatGPT 4.0 Three ophthalmologists and Google Bard ChatGPT 4.0: 85% Correctly-triaged Not reported GPT-4 outperformed Google Bard in triage and had lower potential for harm Moderate sample size, couldn't fully assess hallucinatory responses or variability

Table 2.

Quality Assessment using QUADAS-2

graphic file with name aaem-12-e60-g004.jpg

2.4 Selection Process and Inclusion Criteria

The selection process involved two independent reviewers who screened titles and abstracts of all identified records. Disagreements were resolved through discussion, and a third reviewer was consulted if necessary. Full-text articles were retrieved for records deemed potentially relevant. Studies were included if they Evaluated the diagnostic performance of ChatGPT in ED triage. Studies were excluded if they were letters to the editor, guidelines, reviews, or did not focus on triage performance.

2.5 Data Extraction Process

Two reviewers independently extracted data, including the first author, publication year, study design, type of medical cases triaged, number of evaluated scenarios or patients, country, version of used ChatGPT, comparison group, ChatGPT's diagnostic performance, inter-rater reliability, main findings, and study limitations. Data were extracted using a predefined Excel format document. Discrepancies were resolved through discussion or by consulting a third reviewer.

2.6 Quality Assessment

The quality of included studies was assessed using the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) tool (10). This tool evaluates the risk of bias in four domains: patient selection, index test, reference standard, and flow and timing. Applicability concerns were also assessed for each domain except flow and timing. Two reviewers independently performed the quality assessment, and disagreements were resolved through discussion.

2.7 Statistical Analysis and Data Synthesis

Statistical analyses were conducted using R version 4.4. We used the metaprop function for proportion meta-analysis. This meta-analysis evaluated the proportion of correct triage decisions (events) out of the total number of triage decisions (correct triage decisions + not-correct triage decisions). By doing so, we were able to determine the accuracy of ChatGPT's performance in triaging patients. The principal summary measure used in this analysis was the proportion of correct triage decisions, which directly reflects the diagnostic accuracy of ChatGPT. A random-effects model was employed to account for variability among studies. Heterogeneity was assessed using the I² statistic, with values above 50% indicating substantial heterogeneity. Publication bias was evaluated using funnel plots. Sensitivity analyses were conducted to assess the robustness of the pooled estimates by excluding studies with high heterogeneity.

3. Results:

3.1 Literature Review

A comprehensive literature search identified a total of 162 records across four databases: Scopus, Web of Science, PubMed, and Embase. After removing 92 duplicates, 70 records remained for screening. Through title and abstract screening, 27 records were excluded, leaving 43 full-text articles for detailed assessment. Of these, 29 were excluded for various reasons, such as being letters to the editor, guidelines, or not relevant to triage. Ultimately, 14 studies were included in our systematic review and meta-analysis (Figure 1).

Figure 1.

Figure 1

PRISMA Flowchart of study.

3.2 Study Characteristics

Across these 14 included studies, a total of 1,412 patients or scenarios were evaluated. These studies were conducted between 2023 and 2024. The included studies predominantly employed a cross-sectional study design, focusing on a variety of medical cases in emergency departments (EDs). Specifically, seven studies addressed common cases in EDs (8, 11-16), two studies focused on mass casualty incidents (17, 18), one study examined cases needing neurosurgical attention (19), one study evaluated pre-hospital basic life support and pediatric advanced life support cases (20), two studies dealt with ophthalmic conditions (21, 22), and one study focused on metastatic prostate cancer patients (23).

Regarding the large language models used, six studies exclusively utilized ChatGPT-3.5 (12, 14-18), four studies used ChatGPT 4.0 (11, 21-23), and four studies employed both versions of ChatGPT in their analyses (8, 13, 19, 20). Inter-rater reliability varied among the studies. One study reported near-perfect agreement with the gold standard, with a Cohen's kappa of 0.899 (11). Another study noted an agreement in MTS code assignment between the comparison group and ChatGPT-3.5, with a Cohen's kappa of 0.278 (14). Additionally, one study reported the agreement between human raters and ChatGPT-3.5 with Fleiss' kappa of 0.320, and between human raters and ChatGPT-4.0 with Fleiss' kappa of 0.523 (13). Conversely, another study highlighted the poor acceptable reliability of ChatGPT (20), and one study reported a Cohen's kappa of 0.341 for the agreement between ChatGPT and the reference standard (16). The most commonly identified limitation across the studies was the small sample size. Table 1 summarizes the characteristics of the included studies.

3.3 Diagnostic Performance of ChatGPT 3.5 and 4 for Triage

The following forest plots provide insights into the diagnostic performance of different versions of ChatGPT in triaging patients.

The pooled accuracy for ChatGPT 4.0 was 0.86 (95% CI: 0.64-0.98), with substantial heterogeneity among studies (I² = 93%). Individual study sensitivities ranged from 0.50 to 1.00 (Figure 2).

Figure 2.

Figure 2

Forest Plot of ChatGPT 4.0 Performanc

The pooled accuracy for ChatGPT 3.5 was 0.63 (95% CI: 0.43-0.81), with significant heterogeneity (I² = 84%). Sensitivities in individual studies varied from 0.27 to 0.93. These results indicate that while ChatGPT 4.0 shows higher diagnostic performance for triaging ED patients compared to ChatGPT 3.5, there is considerable variability among studies (Figure 3).

Figure 3.

Figure 3

Forest Plot of ChatGPT 3.5 Performance.

3.4 Sensitivity Analysis

Sensitivity analyses were conducted to evaluate the robustness of the pooled estimates. For ChatGPT 4.0, the sensitivity analysis (Figure 1S) demonstrated that omitting individual studies did not significantly alter the pooled accuracy, indicating stability in the results. For ChatGPT 3.5, the sensitivity analysis (Figure 1S) also showed consistent results, although the exclusion of some studies slightly affected the pooled estimates (19), reflecting moderate robustness (Figure 1S).

3.5 Publication bias

Publication bias was assessed using funnel plots for both ChatGPT 4.0 and ChatGPT 3.5. The funnel plots (Figures 2S) showed some asymmetry, suggesting potential publication bias, particularly for ChatGPT 3.5. Due to the low number of studies, we couldn’t perform the Egger test to evaluate the funnel plot asymmetry (Figure 2S).

3.6 Quality Assessment

The quality assessment using the QUADAS-2 tool indicated varying levels of risk of bias and applicability concerns across the studies. Patient selection generally had a low risk of bias, though a few studies showed high or unclear risk. The index test was mostly low risk, indicating appropriate conduct of the tests. The reference standard was low risk in most studies. Flow and timing were also low risk in the majority of studies, indicating appropriate design execution. However, a few studies had high or unclear concerns regarding patient selection, index test, and reference standard applicability (Table 2).

4. Discussion:

The diagnostic performance of ChatGPT 4.0 was higher than ChatGPT 3.5, with pooled accuracies of 0.86 and 0.63, respectively. There was substantial heterogeneity among the studies for both versions. Sensitivity analyses indicated that the results were robust, though ChatGPT 3.5 showed some variability. Publication bias was suggested for both versions particularly ChatGPT 3.5 based on funnel plot asymmetry. The quality assessment using QUADAS-2 indicated varying levels of risk of bias and applicability concerns across the studies.

Our findings are consistent with those of Hirosawa et al. (24), who found a high diagnostic accuracy of 93.3% for ChatGPT 3.5 within differential-diagnosis lists, suggesting that AI chatbots can generate accurate diagnosis lists for common complaints. Rao et al. (25) also reported overall accuracy for ChatGPT at 71.7%, with the highest performance in final diagnosis and the lowest in initial differential diagnosis.

Mehnen et al. (26) noted that ChatGPT 4 requires more suggestions to solve rare diseases compared to common cases, aligning with our findings of variable diagnostic performance. Moreover, Williams et al. (27) found that GPT-3.5 achieved 84% accuracy in determining higher acuity patients.

The results indicate that ChatGPT, particularly version 4.0, has the potential to improve triage accuracy and reduce unsafe decisions in emergency settings. However, the variability among studies highlights the need for further evaluation and improvements. Fraser et al. (8) caution against the unsupervised use of ChatGPT for triage without enhancements to accuracy and clinical validation. Gebrael et al. (23) suggest that ChatGPT can assist healthcare providers in improving patient triage, while Knebel et al. (28) found that although ChatGPT provides appropriate measures, there is a potential for harmful recommendations.

5. Limitations

This meta-analysis provides a comprehensive evaluation of ChatGPT's diagnostic performance in ED triage, incorporating a diverse range of studies and medical cases. However, the substantial heterogeneity and potential publication bias identified are limitations that must be considered. The varying levels of risk of bias and applicability concerns across studies further underscore the need for a cautious interpretation of the results.

Another limitation of our study is that most of the included studies did not provide sensitivity and specificities or ROC curves. Due to the absence of these data, we were unable to perform a diagnostic accuracy meta-analysis using true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Instead, we performed a proportion meta-analysis to obtain the accuracy of our interventions, using 'accuracy' as a single metric for evaluating the performance of our interventions in triaging patients.

6. Conclusion:

This systematic review and meta-analysis assessed the diagnostic performance of ChatGPT 3.5 and ChatGPT 4.0 in ED triage. ChatGPT 4.0 demonstrated higher diagnostic accuracy (0.86) compared to ChatGPT 3.5 (0.63), but substantial heterogeneity and potential publication bias were noted. The variability in performance across different medical domains highlights the need for further evaluation and improvements.

Supplementary materials

Supplementary table 1.

Used keywords for search design

Keywords Terms
LLM
MeSH terms -
Emtree terms large language model; ChatGPT
SYNONYMS Large language model; Bard; Gemini; Chat-GPT; ChatGPT
Triage
MeSH terms Triage
Emtree terms Patient triage
SYNONYMS Triage; triaging

LLM; Large language model

Supplementary table.

Search query

Data base Search query
PubMed
Results: 39 articles
(Large language model[tiab] OR Bard[tiab] OR Gemini[tiab] OR Chat-GPT[tiab] OR ChatGPT[tiab]) AND ("Triage"[mesh] OR Triage*[tiab] OR triaging[tiab])
Embase
Results: 55 articles
(‘large language model’/exp OR ‘ChatGPT’/exp OR ‘Large language model’:ab,ti OR ‘Bard’:ab,ti OR ‘Gemini’:ab,ti OR ‘Chat-GPT’:ab,ti OR ‘ChatGPT’:ab,ti) AND (‘patient triage’/exp OR ‘Triage*’:ab,ti OR ‘triaging’:ab,ti)
Scopus
Results: 38 articles
(TITLE-ABS (“Large language model” OR “Bard” OR “Gemini” OR “Chat-GPT” OR “ChatGPT”)) AND (TITLE-ABS (“Triage*” OR “triaging”))
Web of science
Results: 30 articles
(TS= (“Large language model” OR “Bard” OR “Gemini” OR “Chat-GPT” OR “ChatGPT”)) AND (TS= (“Triage*” OR “triaging”))

Time of search: 19 March 2024.

Figure 1S.

Figure 1S

Sensitivity analysis of ChatGPT 4.0 (left) and ChatGPT 3.5 (right) pooled accuracy.

Figure 2S.

Figure 2S

Funnel plot of ChatGPT 4.0 (left) and ChatGPT 3.5 (right).

7. Declarations:

7.1 Author contributions

All authors are accountable for all sections of the manuscript and declare that it is written originally and there is no data fabrication; data falsification including deceptive manipulation of images and plagiarism. Details of authors contributions are as follows:

(1) The conception and design of the study: Ramin Shahidi, Navid Kaboudi

(2) Acquisition of data: Fatemeh Fayazbakhsh, Salar Ghaderi, Mohammadreza Dehdashti, Maryam Vasaghi-Gharamaleki, Zahra Moradzadeh, Yasmin Mohtasham Kia

(3) Analysis and interpretation of data: Mohammad Shahir Eftekhar, Maryam Afshari, Fattaneh Khalaj

(4) Drafting the article: Ramin Shahidi, Niloufar Joharivarnoosfaderani, Zahra Mohammadi, Zahra Hasanabadi, Saeedeh Firouzbakht, Salar Ghaderi, Leila Haghani, Zahra Moradzadeh

(5) Revising it critically for important intellectual content: Ramin Shahidi, Navid Kaboudi, Saeedeh Firouzbakht, Leila Haghani

(6) Final approval of the version to be submitted: Ramin Shahidi, Navid Kaboudi, Leila Haghani

The final version was read and approved by all author.

7.2 Conflicts of interest

There are no conflicts of interest to declare.

7.3 Funding

This study didn’t receive any funding.

7.4 Acknowledgment

None

7.5 Data Availability

Data of this study are available and will be provided if anyone needs them.

7.6 Ethics approval and consent to participate

Not applicable

7.7 Consent to publication

Not applicable

7.8 Using artificial intelligence chatbots

During the preparation of this work, the authors used ChatGPT 3.5 by OpenAI to improve paper readability. After using this service, the authors reviewed and edited the content as needed and took full responsibility for the publication's content.

References

  • 1.Quazi S, Saha RP, Singh MK. Applications of artificial intelligence in healthcare. Journal of Experimental Biology and Agricultural Sciences. 2022;10(1):211–26. [Google Scholar]
  • 2.Currie G. A conversation with ChatGPT. Journal of Nuclear Medicine Technology. 2023;51(3):255–60. doi: 10.2967/jnmt.123.265864. [DOI] [PubMed] [Google Scholar]
  • 3.Wu T, He S, Liu J, Sun S, Liu K, Han Q-L, Tang Y. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica. 2023;10(5):1122–36. [Google Scholar]
  • 4.Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Frontiers in artificial intelligence. 2023;6:1169595. doi: 10.3389/frai.2023.1169595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Alumran A, Alkhaldi O, Aldroorah Z, Alsayegh Z, Alsafwani F, Almaghraby N. Utilization of an electronic triage system by emergency department nurses. Journal of Multidisciplinary Healthcare. 2020:339–44. doi: 10.2147/JMDH.S250962. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Manoochehry S, Saboori F, Faraji M, Behzadnia MJ. Coronavirus disease 2019: a revolution in biological triage in the emergency setting. Universa Medicina. 2020;39(3):212–23. [Google Scholar]
  • 7.Tolchin B, Latham SR, Bruce L, Ferrante LE, Kraschel K, Jubanyik K, et al. Developing a triage protocol for the COVID-19 pandemic: allocating scarce medical resources in a public health emergency. The Journal of clinical ethics. 2020;31(4):303–17. [PubMed] [Google Scholar]
  • 8.Fraser H, Crossland D, Bacher I, Ranney M, Madsen T, Hilliard R. Comparison of diagnostic and triage accuracy of Ada health and WebMD symptom checkers, ChatGPT, and physicians for patients in an emergency department: clinical data analysis study. JMIR mHealth and uHealth. 2023;11(1):e49995. doi: 10.2196/49995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Takkouche B, Norman G. PRISMA statement. Epidemiology. 2011;22(1):128. doi: 10.1097/EDE.0b013e3181fe7825. [DOI] [PubMed] [Google Scholar]
  • 10.Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Annals of internal medicine. 2011;155(8):529–36. doi: 10.7326/0003-4819-155-8-201110180-00009. [DOI] [PubMed] [Google Scholar]
  • 11.Paslı S, Şahin AS, Beşer MF, Topçuoğlu H, Yadigaroğlu M, İmamoğlu M. Assessing the precision of artificial intelligence in emergency department triage decisions: Insights from a study with ChatGPT. The American Journal of Emergency Medicine. 2024;78:170–5. doi: 10.1016/j.ajem.2024.01.037. [DOI] [PubMed] [Google Scholar]
  • 12.Ayoub M, Ballout AA, Zayek RA, Ayoub NF. Mind+ Machine: ChatGPT as a Basic Clinical Decisions Support Tool. Cureus. 2023;15:8. doi: 10.7759/cureus.43690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Kim JH, Kim SK, Choi J, Lee Y. Reliability of ChatGPT for performing triage task in the emergency department using the Korean Triage and Acuity Scale. Digital Health. 2024;10:20552076241227132. doi: 10.1177/20552076241227132. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Zaboli A, Brigo F, Sibilio S, Mian M, Turcato G. Human intelligence versus artificial intelligence: Who performs better in correctly classifying patients in triage? The American Journal of Emergency Medicine. 2024 doi: 10.1016/j.ajem.2024.02.008. [DOI] [PubMed] [Google Scholar]
  • 15.Franc JM, Cheng L, Hart A, Hata R, Hertelendy A. Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale. Canadian Journal of Emergency Medicine. 2024;26(1):40–6. doi: 10.1007/s43678-023-00616-w. [DOI] [PubMed] [Google Scholar]
  • 16.Sarbay İ, Berikol GB, Özturan İU. Performance of emergency triage prediction of an open access natural language processing based chatbot application (ChatGPT): A preliminary, scenario-based cross-sectional study. Turkish Journal of Emergency Medicine. 2023;23(3):156–61. doi: 10.4103/tjem.tjem_79_23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Gan RK, Ogbodo JC, Wee YZ, Gan AZ, González PA. Performance of Google bard and ChatGPT in mass casualty incidents triage. The American journal of emergency medicine. 2024;75:72–8. doi: 10.1016/j.ajem.2023.10.034. [DOI] [PubMed] [Google Scholar]
  • 18.Gan RK, Uddin H, Gan AZ, Yew YY, González PA. ChatGPT's performance before and after teaching in mass casualty incident triage. Scientific Reports. 2023;13(1):20350. doi: 10.1038/s41598-023-46986-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Ward M, Unadkat P, Toscano D, Kashanian A, Lynch DG, Horn AC, et al. A Quantitative Assessment of ChatGPT as a Neurosurgical Triaging Tool. Neurosurgery. 2022 ;10:1227. doi: 10.1227/neu.0000000000002867. [DOI] [PubMed] [Google Scholar]
  • 20.Bushuven S, Bentele M, Bentele S, Gerber B, Bansbach J, Ganter J, et al. “ChatGPT, can you help me save my child’s life?”-Diagnostic Accuracy and Supportive Capabilities to lay rescuers by ChatGPT in prehospital Basic Life Support and Paediatric Advanced Life Support cases–an in-silico analysis. Journal of Medical Systems. 2023;47(1):123. doi: 10.1007/s10916-023-02019-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Lyons RJ, Arepalli SR, Fromal O, Choi JD, Jain N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Canadian Journal of Ophthalmology. 2023 doi: 10.1016/j.jcjo.2023.07.016. [DOI] [PubMed] [Google Scholar]
  • 22.Zandi R, Fahey JD, Drakopoulos M, Bryan JM, Dong S, Bryar PJ, et al. Exploring Diagnostic Precision and Triage Proficiency: A Comparative Study of GPT-4 and Bard in Addressing Common Ophthalmic Complaints. Bioengineering. 2024;11(2):120. doi: 10.3390/bioengineering11020120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Gebrael G, Sahu KK, Chigarira B, Tripathi N, Mathew Thomas V, Sayegh N, et al. Enhancing triage efficiency and accuracy in emergency rooms for patients with metastatic prostate cancer: a retrospective analysis of artificial intelligence-assisted triage using ChatGPT 4 0. Cancers. 2023;15(14):3717. doi: 10.3390/cancers15143717. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study. International journal of environmental research and public health. 2023;20(4):3378. doi: 10.3390/ijerph20043378. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, et al. Assessing the utility of ChatGPT throughout the entire clinical workflow. MedRxiv. 2023:23285886. doi: 10.2196/48659. 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Mehnen L, Gruarin S, Vasileva M, Knapp B. ChatGPT as a medical doctor? A diagnostic accuracy study on common and rare diseases. medRxiv. 2023;2023:23288859. [Google Scholar]
  • 27.Williams CY, Zack T, Miao BY, Sushil M, Wang M, Butte AJ. Assessing clinical acuity in the Emergency Department using the GPT-3 5 Artificial Intelligence Model. medRxiv. 2023; 2023:23293795. [Google Scholar]
  • 28.Knebel D, Priglinger S, Scherer N, Klaas J, Siedlecki J, Schworm B. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies–An Analysis of 10 Fictional Case Vignettes. Klinische Monatsblätter für Augenheilkunde. 2023 doi: 10.1055/a-2149-0447. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Data of this study are available and will be provided if anyone needs them.


Articles from Archives of Academic Emergency Medicine are provided here courtesy of Shahid Beheshti University of Medical Sciences

RESOURCES