Abstract
Aim
In Japan, a substantial proportion of emergency ambulance dispatches involve minor cases that require only outpatient services, underscoring the need for better public guidance on when to seek emergency care. This study evaluated both the medical plausibility of the GPT model in helping laypersons determine the need for emergency medical care and laypersons' interpretations of its outputs.
Methods
This cross‐sectional study was conducted from December 10, 2023, to March 7, 2024. We input clinical scenarios into the GPT model and evaluated the need for emergency visits based on the outputs. A total of 314 scenarios were labeled with red tags (emergency, immediate emergency department [ED] visit) and 152 with green tags (less urgent). Seven medical specialists assessed the outputs' validity, and 157 laypersons interpreted them via a web‐based questionnaire.
Results
Experts reported that the GPT model accurately identified important information in 95.9% (301/314) of red‐tagged scenarios and recommended immediate ED visits in 96.5% (303/314). However, only 43.0% (135/314) of laypersons interpreted those outputs as indicating urgent hospital visits. The model identified important information in 99.3% (151/152) of green‐tagged scenarios and advised against immediate visits in 88.8% (135/152). However, only 32.2% (49/152) of laypersons considered them routine follow‐ups.
Conclusions
Expert evaluations revealed that the GPT model could be highly accurate in advising on emergency visits. However, laypersons frequently misinterpreted its recommendations, highlighting a substantial gap in understanding AI‐generated medical advice.
Keywords: emergency care, emergency visit, GPT-3.5, GPT model, triage
In this cross‐sectional study, we assessed the medical plausibility of ChatGPT's outputs as evaluated by seven experts and the interpretation of these outputs by 157 laypersons. ChatGPT demonstrated high accuracy in advising regarding emergency visits when evaluated by experts, but laypersons frequently misinterpreted its recommendations.
INTRODUCTION
As of 2022, Japan recorded 7,229,572 emergency ambulance dispatches and 6,217,283 transports, reflecting year‐on‐year increases of 16.7% and 13.2%, respectively. 1 Among these cases, 47.3% involved minor illnesses that did not necessitate emergency care. 1 The Fire and Disaster Management Agency (FDMA) seeks to prevent critical delays and fatalities by promoting appropriate ambulance use, offering patient guidelines, 2 the National Emergency Medical Consultation App (Q‐suke), 3 and a telephone consultation system (#7119). 4 , 5 Despite some success, 4 , 5 low public awareness and rising transports suggest the need for additional measures. 6
GPT models (OpenAI Inc., San Francisco, CA, USA) have garnered global attention due to advances in large language model (LLM) technology that enable human‐like conversations, question‐answering, and content generation. 7 , 8 , 9 , 10 However, concerns persist regarding reliability and safety, particularly “hallucinations” that introduce inaccurate information. 11 , 12 Studies on its validity as a self‐diagnostic or medical assistant tool show mixed results, especially in emergency medical settings, where safety, efficacy, and ethics are paramount. 13 , 14 , 15 , 16 , 17 , 18 , 19
The GPT model offers ease of use, free access, and quick multilingual operation, suggesting potential in helping patients assess the need for emergency care. However, the medical validity of the GPT model's outputs concerning emergency medicine is not well established, and the impact of its advice on the actions of the general public has yet to be studied. Therefore, this study aimed to evaluate both the medical plausibility of the outputs generated by the GPT model regarding the need for emergency medical care and the interpretations of these outputs by laypersons.
MATERIALS AND METHODS
Ethical considerations
This cross‐sectional study protocol was approved by the Nippon Sports Science University Ethics Committee (reference number: 023‐H131). Consistent with Japan's Ethical Guidelines for Medical and Health Research Involving Human Subjects, only participants who consented to the use of anonymized data for statistical analysis were included.
Study design
We evaluated the model's clinical validity and layperson perceptions in three steps (Figure 1). First, we entered simulated clinical information into the GPT model to determine urgency (i.e., the need to call an ambulance or visit the ED immediately). Second, medical specialists reviewed the outputs' validity. Finally, a web‐based questionnaire assessed whether laypersons could take appropriate actions based on these outputs.
FIGURE 1.
Protocol time selection flowchart.
Clinical scenario
Scenarios were developed from the FDMA “Emergency Severity Assessment Protocol Ver. 3: Emergency Consultation Guide (Home Self‐Judgment).” 2 Each item described typical symptoms and provided guidance on visiting EDs. Answers were categorized as red (emergency, immediate visit to ED), yellow (urgent ED visit), green (less urgent, regular follow‐up), or white (non‐emergency, observation). For instance, “I have sudden shortness of breath” was a red‐tagged scenario. 2 (https://www.fdma.go.jp/mission/enrichment/appropriate/appropriate002.html).
We selected 338 red‐tagged and 179 green‐tagged clinical scenarios from the protocol and excluded the yellow‐ and white‐tagged scenarios to avoid ambiguous recommendations. We excluded 15 negative‐sentence scenarios that provided green‐tag recommendations based on the absence of multiple emergency symptoms, because these descriptions were too long to input. Additionally, 24 red‐tagged and 12 green‐tagged items were excluded by consensus of at least two of seven medical expert reviewers (six board‐certified emergency physicians and one senior paramedic, all holding either a PhD or a Master of Public Health), who judged that the situations described in these scenarios were unclear or that the urgency assessments did not necessarily conform to current standards. We analyzed the remaining 466 clinical scenarios.
The symptom descriptions were rewritten in natural, non‐technical Japanese before being entered as prompts. The original inputs from the FDMA and the outputs generated by the GPT model are included in Data S1. The manipulation process is detailed on GitHub (https://github.com/JAAM‐Technology‐Committee/ChatGPT_ER_visit_prompt_test).
Encoding
We developed prompts that asked the GPT model to evaluate the urgency of the simulated clinical scenarios, following previous literature recommending that prompts ask for a specific description. The standardized prompt was "Please evaluate whether I need to visit an ED," followed by the modified symptom descriptions based on the clinical scenarios described above. For pediatric scenarios, we changed the prompt to "Please evaluate whether my child needs to visit an ED."
We accessed the GPT model through the OpenAI application programming interface (API) with Python on December 10, 2023. The prompts were entered into the model individually, with the settings that allow inputs to be used for model training turned off. The output responses were collected and processed with Python for further evaluation. GPT‐3.5 was used because, at the time of the study, it was available free of charge, whereas GPT‐4 required a paid subscription; we therefore judged that members of the general public, the intended users in this setting, would be more likely to choose the free model. 7
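As an illustration of this step, the following minimal Python sketch shows how prompts of this form could be sent to the OpenAI API and the responses collected. It is not the study's actual pipeline (which is shared in the GitHub repository above); the model identifier "gpt-3.5-turbo", the English prompt wording, the scenario structure, and the output file name are assumptions for illustration only.

```python
# Minimal sketch (not the study's actual code): send each scenario to the
# OpenAI API as a standalone prompt and collect the responses.
# Assumptions: openai Python package >= 1.0, model id "gpt-3.5-turbo",
# and illustrative English prompt wording (the study used Japanese).
import csv

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ADULT_PREFIX = "Please evaluate whether I need to visit an ED. "
CHILD_PREFIX = "Please evaluate whether my child needs to visit an ED. "

scenarios = [
    {"id": "red_001", "pediatric": False, "symptoms": "I have sudden shortness of breath."},
    # ... remaining red- and green-tagged scenarios
]

rows = []
for s in scenarios:
    prefix = CHILD_PREFIX if s["pediatric"] else ADULT_PREFIX
    prompt = prefix + s["symptoms"]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    rows.append({"id": s["id"], "prompt": prompt,
                 "output": response.choices[0].message.content})

# Save the input-output pairs for the expert and layperson evaluations.
with open("gpt_outputs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "prompt", "output"])
    writer.writeheader()
    writer.writerows(rows)
```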
Medical evaluations of outputs
The validity of the GPT model's outputs was reviewed from a medical standpoint by expert reviewers. For each set of inputs and outputs, a pair of reviewers evaluated the quality of the responses using three binary criteria: (1) identification of the most important symptoms, (2) accuracy of the recommendation regarding the necessity of an emergency visit, and (3) medical appropriateness of the explanations and supplementary information provided. In cases of disagreement between the reviewers, a third board‐certified emergency physician was consulted to reach a consensus.
We defined under‐triage as failure to recommend an immediate ED visit when needed and over‐triage as recommending an emergency visit when unnecessary.
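Given these definitions, the under‐ and over‐triage rates follow directly from the expert judgments: under‐triage is the proportion of red‐tagged scenarios without an immediate‐visit recommendation, and over‐triage is the proportion of green‐tagged scenarios with one. The short sketch below illustrates the calculation using the counts later reported in Table 1; the function names are ours and hypothetical.

```python
# Sketch of the triage-rate definitions, using the counts reported in Table 1.
def under_triage_rate(red_total: int, red_advised_immediate: int) -> float:
    """Proportion of red-tagged (emergency) scenarios NOT advised an immediate ED visit."""
    return (red_total - red_advised_immediate) / red_total

def over_triage_rate(green_total: int, green_advised_not_immediate: int) -> float:
    """Proportion of green-tagged (less urgent) scenarios advised an immediate ED visit."""
    return (green_total - green_advised_not_immediate) / green_total

print(f"Under-triage: {under_triage_rate(314, 303):.1%}")  # 3.5%
print(f"Over-triage:  {over_triage_rate(152, 135):.1%}")   # 11.2%
```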
Questionnaire of laypersons
We conducted a web‐based questionnaire survey of laypersons to understand their interpretation and perception of the GPT model's responses. The survey was administered by an independent external agency to 157 laypersons from February 29, 2024, to March 7, 2024. We collected the following demographics as baseline variables: age, sex, having children under 15 years of age, annual income, and occupation. The target population included laypersons aged 15–64 years, excluding pregnant women. During a preliminary screening phase, which identified unsuitable responses and confirmed eligibility, the agency prepared a panel reflecting the demographic distribution of age and gender. The main survey was then distributed to respondents who passed this screening, ensuring that data collection adhered to the intended population composition ratios.
The participants were asked to assume they had the symptoms described in the prompts and input them into the GPT model. We investigated the clarity and practicality of their responses from a non‐medical perspective by asking them to report their interpretation of the recommendations. They selected one choice from each color tag to show their actions based on the advice given.
Each of the 466 pairs of input prompts and output responses was evaluated by three layperson reviewers. A response was deemed accurately interpreted for red‐tagged symptoms if at least two of the three laypersons identified the need for an immediate ED visit; similarly, a response was deemed accurately interpreted for green‐tagged symptoms if at least two of the three laypersons did not identify the need for an immediate visit. Participants were randomly assigned a maximum of 10 input–output pairs and provided their responses by selecting options. Only participants with children were assigned pediatric clinical scenarios.
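The accuracy rule applied to the layperson ratings is a simple two‐out‐of‐three majority. A minimal sketch is shown below; the function name and the example ratings are hypothetical and for illustration only.

```python
# Sketch of the two-out-of-three majority rule for the layperson ratings.
# `needs_immediate_visit` holds the three laypersons' answers (True/False)
# to whether the GPT output indicates an immediate ED visit is needed.
def layperson_interpretation_accurate(tag: str, needs_immediate_visit: list[bool]) -> bool:
    votes_for_immediate = sum(needs_immediate_visit)
    if tag == "red":    # accurate if >= 2 of 3 saw an immediate visit as needed
        return votes_for_immediate >= 2
    if tag == "green":  # accurate if >= 2 of 3 did NOT see an immediate visit as needed
        return votes_for_immediate <= 1
    raise ValueError(f"unexpected tag: {tag}")

print(layperson_interpretation_accurate("red", [True, True, False]))     # True
print(layperson_interpretation_accurate("green", [False, True, False]))  # True
```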
Participants were first asked the baseline question: “Do you know ChatGPT? If yes, have you used it before?” After completing the entire questionnaire, they were asked two additional questions to evaluate their perceptions of the GPT model's outputs: “Do you think the response is trustworthy, and would you follow its recommendation?” and “Do the responses from ChatGPT make you feel relieved or anxious?”
Statistical analysis
Continuous variables were expressed as medians (interquartile ranges [IQRs]), and categorical variables as numbers and percentages (%). To enhance the representativeness of our data, additional analyses compared the income distribution of our participants with national data published by the Ministry of Health, Labour and Welfare in Japan. Responses were standardized to Japan's income distribution by applying a weight to each income category based on the ratio between its proportion among our survey participants and its proportion in the national income structure. All data were analyzed using SPSS software (version 28; IBM Corp., Armonk, NY, USA) and R software packages (version 4.3.1; R Foundation for Statistical Computing, Vienna, Austria).
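The income standardization described above is a form of post‐stratification weighting. The sketch below illustrates the idea under the standard convention (weight = national share divided by sample share for each income category); the category labels, proportions, and function name are illustrative placeholders rather than the national statistics or code used in the study.

```python
# Sketch of income-based post-stratification. Proportions are placeholders;
# the study used the Ministry of Health, Labour and Welfare distribution.
# Assumed convention: weight(category) = national proportion / sample proportion.
sample_share = {"low": 0.35, "middle": 0.45, "high": 0.20}    # from the survey
national_share = {"low": 0.30, "middle": 0.50, "high": 0.20}  # from national data

weights = {cat: national_share[cat] / sample_share[cat] for cat in sample_share}

def weighted_proportion(responses: list[tuple[str, bool]]) -> float:
    """Weighted share of positive responses; each response is (income_category, answer)."""
    total = sum(weights[cat] for cat, _ in responses)
    positive = sum(weights[cat] for cat, ans in responses if ans)
    return positive / total

print(weighted_proportion([("low", True), ("middle", False), ("high", True)]))
```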
RESULTS
Table 1 summarizes the medical expert evaluations of the GPT model's outputs. In the red‐tag (emergency) scenarios, the GPT model accurately identified critical symptoms in 95.9% of cases, correctly recommended immediate ED visits in 96.5% (Figure 2), and provided a reasonable explanation in 68.0% of those recommendations. In the green‐tag (less urgent) scenarios, it identified important symptoms in 99.3%, correctly advised against immediate hospital visits in 88.8%, and provided a reasonable explanation in 97.0% of those recommendations. The under‐triage and over‐triage rates were 3.5% and 11.2%, respectively.
TABLE 1.
Validity of the GPT model outputs evaluated by board‐certified emergency physicians and senior paramedics.
Red‐tagged (number of scenarios = 314) | Judgment (Yes) |
---|---|
Did the GPT model recognize the important symptoms? | 301/314 (95.9%) |
Did the GPT model advise the patient to visit the emergency department immediately? | 303/314 (96.5%) |
If the answer to the question above is yes, did the response contain a reasonable explanation? | 206/303 (68.0%) |

Green‐tagged (number of scenarios = 152) | Judgment (Yes) |
---|---|
Did the GPT model recognize the important symptoms? | 151/152 (99.3%) |
Did the GPT model advise the patient not to go to the emergency department immediately? | 135/152 (88.8%) |
If the answer to the question above is yes, did the answer contain a reasonable explanation? | 131/135 (97.0%) |
FIGURE 2.
Interpretation of the response from the GPT model. (A) The GPT model accurately recommended immediate emergency department visits in 96.5% of red‐tagged cases, as evaluated by emergency specialists. In contrast, only 43.0% of layperson reviewers correctly interpreted those recommendations, with the income‐standardized layperson group showing similar trends. (B) The GPT model accurately recommended that an emergency visit was not required in 88.8% of green‐tagged cases, as evaluated by emergency specialists. In contrast, only 32.2% of layperson reviewers correctly interpreted those recommendations, with similar findings observed in the income‐standardized layperson group.
Table 2 outlines the demographics of the 157 laypersons who participated in the questionnaire survey. Of these, 120 (76.4%) had heard of ChatGPT, and 37 (23.6%) had used it before. The median age of the participants was 46 years, and 90 (57.3%) were male. Figure 3 shows the income distribution of study participants compared with the Japanese national income distribution.
TABLE 2.
Backgrounds of laypersons.
Variables | All participants (n = 157) |
---|---|
Male | 90 (57.3%) |
Age, years | 46 (38–53) |
Having a child/children aged ≤15 years | 79 (50.3%) |
Annual income in Japanese yen | |
<2,000,000 | 15 (9.6%) |
2,000,000–4,000,000 | 29 (18.5%) |
4,000,000–6,000,000 | 32 (20.4%) |
6,000,000–8,000,000 | 24 (15.3%) |
8,000,000–10,000,000 | 13 (8.3%) |
10,000,000–12,000,000 | 1 (0.6%) |
12,000,000–15,000,000 | 7 (4.5%) |
>15,000,000 | 1 (0.6%) |
Unknown | 35 (22.3%) |
Occupation | |
Public officer | 2 (1.3%) |
Manager, executive, or president | 2 (1.3%) |
Employee (office) | 24 (15.3%) |
Employee (others) | 26 (16.6%) |
Self‐employed | 6 (3.8%) |
Freelance | 1 (0.6%) |
Housewife | 18 (11.5%) |
Part‐time employee | 33 (21.0%) |
Student | 9 (5.7%) |
Others | 21 (13.4%) |
Familiarity with ChatGPT | |
Unaware of ChatGPT | 37 (23.6%) |
Aware of ChatGPT but never used it | 83 (52.9%) |
Have used ChatGPT | 37 (23.6%) |
Note: Data are presented as the number of observations (percentage) or median (interquartile range). For each variable, the number of missing observations was obtained as the difference between the total number of participants and the total number of observations.
FIGURE 3.
Income distribution overview. The figure compares the income distribution of study participants (dark blue) with the Japanese national income distribution (light blue), as published by the Ministry of Health, Labour and Welfare.
Of the 314 red‐tag scenarios, 135 (43.0%) were interpreted by layperson reviewers as the GPT model correctly recommending immediate emergency hospital visits, with the income‐standardized layperson group showing similar trends. Of the 152 green‐tag scenarios, 49 (32.2%) were interpreted as the GPT model suggesting regular follow‐up (i.e., non‐emergency), with similar findings observed in the income‐standardized layperson group.
Table 3 shows the potential actions of laypersons in response to the GPT model's answers. Of the 157 laypersons surveyed, 84 (53.5%) reported trusting the GPT model's answers and following them, while 43 (27.4%) trusted the outputs but chose not to follow the recommendations provided. Moreover, 9 (5.7%) did not trust the answers but still followed them. Regarding anxiety levels, 58 (36.9%) participants said they felt relieved by the GPT model's suggestions, whereas 20 (12.7%) reported increased anxiety after consulting the model.
TABLE 3.
Layperson's perception toward the output of the GPT model.
Variables | All participants (n = 157) |
---|---|
Reliability | |
Reliable and willing to follow the answer | 84 (53.5%) |
Reliable but not willing to follow the answer | 43 (27.4%) |
Neither reliable nor willing to follow the answer | 21 (13.4%) |
Not reliable but willing to follow the answer | 9 (5.7%) |
Change of feeling | |
Relieved | 58 (36.9%) |
Unchanged | 79 (50.3%) |
Anxious | 20 (12.7%) |
Note: Data are presented as the number of observations (percentage).
DISCUSSION
In this cross‐sectional study, we assessed both the medical plausibility of the GPT model's outputs as evaluated by seven experts and the interpretation of these outputs by 157 laypersons. The experts judged that the GPT model accurately identified important symptoms in 95.9% of the red‐tag scenarios and correctly recommended immediate ED visits in 96.5% of these cases. However, only 43.0% of layperson reviewers interpreted the GPT model's recommendations as indicating the need for an immediate emergency hospital visit in the same scenarios. For the 152 green‐tagged cases, the experts judged that the model correctly advised against an immediate ED visit in 88.8%, whereas only 32.2% of layperson reviewers correctly interpreted those responses.
This study determined the need for emergency care using scenarios created based on the triage protocol published by the FDMA. The GPT model generated recommendations for these scenarios. The strength of the current study lies in its dual evaluation of the GPT model's responses by clinical experts, including board‐certified emergency physicians and paramedics with doctoral degrees, and by over 150 laypersons. This approach provides a balanced assessment of the GPT model's recommendations from both professional and lay perspectives, thereby enhancing the study's validity and relevance.
Our study suggested that the clinical experts deemed most of the GPT model's advice regarding the necessity of visiting an ED appropriate. However, laypersons frequently misinterpreted it, highlighting a disparity between medical experts and the general public in how the GPT model's responses are understood. The options presented to laypersons were 'urgent visit' or 'non‐urgent visit.' However, because of their limited medical knowledge, laypersons may not base their choices on the severity of disease or trauma, medical urgency, or the distinction between an emergency visit and a general outpatient visit. This misunderstanding could account for the discrepancy between the experts' judgments and the laypersons' comprehension. It underscores the need for clearer communication in AI‐generated outputs and the importance of consulting medically trained professionals to ensure informed decision‐making and mitigate the risks associated with overreliance on AI tools.
In an emergency setting, under‐triage can allow disease or injury to deteriorate, whereas over‐triage leads to unnecessary hospital visits. Estimating the risk of under‐triage is crucial because it can directly harm health; at the same time, non‐urgent hospital visits can overcrowd EDs and delay necessary medical care, so the risk of over‐triage must also be evaluated. Our results showed over‐triage and under‐triage rates of 11.2% and 3.5%, respectively. In emergency medicine, previous research has reported that an ideal prehospital trauma triage scheme prioritizes an under‐triage rate of ≤5% while permitting a higher over‐triage rate of ≥35%. By these criteria, the GPT model's performance may be considered sufficiently useful. 20
Considering the high accuracy of the GPT model, when evaluated by experts, AI systems may be used as supplementary tools in triage. Notably, two previous studies that applied the GPT model for triage in EDs yielded varying results; one suggested that the GPT model can improve triage accuracy, whereas the other indicated that the GPT model is an insufficient substitute for triage nurses' expertise. 13 , 21 Thus, it may be better to regard the GPT model as a reliable, supportive method for medical staff and not a substitute. Integrating AI tools into clinical workflows requires careful planning and concurrent evaluation to address potential safety concerns and ensure that these tools complement, rather than replace, human judgment.
Concerning laypersons' attitudes toward the GPT model regarding urgent medical visits, our findings were mixed. While 53.5% of participants trusted and followed the outputs, 27.4% trusted them but did not follow the recommendations, and 5.7% did not trust the outputs yet followed them. Furthermore, 36.9% of participants felt relieved by the GPT model's suggestions, whereas 12.7% reported increased anxiety after consultation. These results suggest that laypersons hold ambivalent feelings toward AI‐generated recommendations: the model may at times serve as a reassuring companion but at other times fall short of user expectations.
This study has several limitations. First, the case scenarios were based on a telephone triage protocol rather than actual clinical cases, which may affect the results' applicability to real‐world settings. Second, we did not assess the reproducibility of the GPT model's outputs due to budget constraints and the design of our layperson questionnaire, which allowed for only one response per scenario. This approach may not fully capture the variability inherent in language model outputs. Additionally, these outputs could vary with future versions, such as GPT‐4. Third, the evaluation by two independent reviewers, though experienced, may be subject to bias. Fourth, the layperson evaluations may not fully represent the general public in Japan. Our study excluded individuals aged 65 years or older, as they are less likely to use ChatGPT. This exclusion may limit the generalizability of our findings to populations with different age distributions. Moreover, the level of medical knowledge among layperson participants was not assessed, which may have influenced the study's findings. Lastly, the study's findings, derived from a Japanese context and language, may not be generalizable to other settings.
CONCLUSION
This study evaluated both the medical plausibility of the GPT model's outputs using scenarios derived from the Japanese government's telephone triage protocol for emergency visits, as assessed by medical experts, along with laypersons' comprehension of these outputs. The findings suggest that while the GPT model achieved high accuracy in advising on emergency visits when reviewed by experts, laypersons frequently misinterpreted its recommendations. This highlights a significant gap in the understanding and interpretation of AI‐generated medical advice.
CONFLICT OF INTEREST STATEMENT
Prof. Jun Oda and Prof. Shoji Yokobori, who are Editorial Board members of the AMS Journal, are co‐authors of this article. To prevent any potential bias, they were excluded from all editorial decisions regarding the acceptance of this article for publication.
ETHICS STATEMENT
Approval of the research protocol: The Nippon Sports Science University Ethics Committee approved this study (reference number: 023‐H131).
Informed consent: This study analyzed anonymous data. Therefore, the requirement for written informed consent was waived.
Registry and the registration no. of the study/trial: N/A.
Animal studies: N/A.
Supporting information
Data S1.
ACKNOWLEDGEMENTS
This study was performed on behalf of the Special Committee on the Utilization of Advanced Technology in Emergency Medicine, Japanese Association for Acute Medicine.
Tanaka C, Kinoshita T, Okada Y, Satoh K, Homma Y, Suzuki K, et al. Medical validity and layperson interpretation of emergency visit recommendations by the GPT model: A cross‐sectional study. Acute Med Surg. 2025;12:e70042. 10.1002/ams2.70042
DATA AVAILABILITY STATEMENT
All the code used for data management and analyses is openly shared online for review and re‐use. All iterations of the study design are archived with version control. https://github.com/JAAM‐Technology‐Committee/ChatGPT_ER_visit_prompt_test.
REFERENCES
- 1. Fire and Disaster Management Agency. Current Situation of Emergency and Rescue Services in Japan (in Japanese). https://www.fdma.go.jp/publication/rescue/post‐5.html. Accessed 14 July 2024.
- 2. Fire and Disaster Management Agency. Emergency Severity Assessment Protocol Ver.3 Emergency Consultation Guide (Home Self‐judgement). https://www.fdma.go.jp/mission/enrichment/appropriate/appropriate002.html. Accessed 14 July 2024.
- 3. Fire and Disaster Management Agency. National Emergency Medical Service Consultation App (Q‐suke). https://www.fdma.go.jp/mission/enrichment/appropriate/appropriate003.html. Accessed 14 July 2024.
- 4. Morimura N, Aruga T, Sakamoto T, Aoki N, Ohta S, Ishihara T, et al. The impact of an emergency telephone consultation service on the use of ambulances in Tokyo. Emerg Med J. 2011;28(1):64–70.
- 5. Sakurai A, Oda J, Muguruma T, Kim S, Ohta S, Abe T, et al. Revision of the protocol of the telephone triage system in Tokyo, Japan. Emerg Med Int. 2021;2021:8832192. 10.1155/2021/8832192
- 6. Kadooka Y, Asai A, Enzo A, Okita T. Misuse of emergent healthcare in contemporary Japan. BMC Emerg Med. 2017;17(1):23.
- 7. OpenAI. ChatGPT [Internet]. https://chat.openai.com/. Accessed 3 July 2024.
- 8. Secinaro S, Calandra D, Secinaro A, Muthurangu V, Biancone P. The role of artificial intelligence in healthcare: a structured literature review. BMC Med Inform Decis Mak. 2021;21(1):125.
- 9. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019;25(1):30–36.
- 10. Hopkins AM, Logan JM, Kichenadasse G, Sorich MJ. Artificial intelligence chatbots will revolutionize how cancer patients access information: ChatGPT represents a paradigm‐shift. JNCI Cancer Spectr. 2023;7(2):pkad010. 10.1093/jncics/pkad010
- 11. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel). 2023;11(6):887. 10.3390/healthcare11060887
- 12. Sharun K, Banu SA, Pawde AM, Kumar R, Akash S, Dhama K, et al. ChatGPT and artificial hallucinations in stem cell research: assessing the accuracy of generated references ‐ a preliminary study. Ann Med Surg. 2023;85(10):5275–5278.
- 13. Dahdah JE, Kassab J, Helou MCE, Gaballa A, Sayles S 3rd, Phelan MP. ChatGPT: a valuable tool for emergency medical assistance. Ann Emerg Med. 2023;82(3):411–413.
- 14. Fraser H, Crossland D, Bacher I, Ranney M, Madsen T, Hilliard R. Comparison of diagnostic and triage accuracy of Ada health and WebMD symptom checkers, ChatGPT, and physicians for patients in an emergency department: clinical data analysis study. JMIR Mhealth Uhealth. 2023;11:e49995.
- 15. Kim JH, Kim SK, Choi J, Lee Y. Reliability of ChatGPT for performing triage task in the emergency department using the Korean triage and acuity scale. Digit Health. 2024;10:20552076241227132.
- 16. Kuroiwa T, Sarcon A, Ibara T, Yamada E, Yamamoto A, Tsukamoto K, et al. The potential of ChatGPT as a self‐diagnostic tool in common orthopedic diseases: exploratory study. J Med Internet Res. 2023;25:e47621.
- 17. Franc JM, Cheng L, Hart A, Hata R, Hertelendy A. Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale. CJEM. 2024;26(1):40–46.
- 18. Stokel‐Walker C. AI bot ChatGPT writes smart essays ‐ should professors worry? Nature. 2022. 10.1038/d41586-022-04397-7
- 19. Zuniga Salazar G, Zuniga D, Vindel CL, Yoong AM, Hincapie S, Zuniga AB, et al. Efficacy of AI chats to determine an emergency: a comparison between OpenAI's ChatGPT, Google Bard, and Microsoft Bing AI chat. Cureus. 2023;15(9):e45473.
- 20. Newgard CD, Fischer PE, Gestring M, Michaels HN, Jurkovich GJ, Lerner EB, et al. National guideline for the field triage of injured patients: recommendations of the National Expert Panel on field triage, 2021. J Trauma Acute Care Surg. 2022;93(2):e49–e60.
- 21. Zaboli A, Brigo F, Sibilio S, Mian M, Turcato G. Human intelligence versus chat‐GPT: who performs better in correctly classifying patients in triage? Am J Emerg Med. 2024;79:44–47.