Global Spine Journal. 2025 Feb 17;15(7):3199–3220. doi: 10.1177/21925682251321837

Evaluating Artificial Intelligence in Spinal Cord Injury Management: A Comparative Analysis of ChatGPT-4o and Google Gemini Against American College of Surgeons Best Practices Guidelines for Spine Injury

Alexander Yu 1, Albert Li 1, Wasil Ahmed 1, Michael Saturno 1, Samuel K Cho 1
PMCID: PMC11833805  PMID: 39959933

Abstract

Study Design

Comparative Analysis.

Objectives

The American College of Surgeons developed the 2022 Best Practice Guidelines to provide evidence-based recommendations for managing spinal injuries. This study aims to assess the concordance of ChatGPT-4o and Gemini Advanced with the 2022 ACS Best Practice Guidelines, offering the first expert evaluation of these models in managing spinal cord injuries.

Methods

The 2022 ACS Trauma Quality Program Best Practices Guidelines for Spine Injury were used to create 52 questions based on key clinical recommendations. These were grouped into informational (8), diagnostic (14), and treatment (30) categories and posed to ChatGPT-4o and Google Gemini Advanced. Responses were graded for concordance with ACS guidelines and validated by a board-certified spine surgeon.

Results

ChatGPT was concordant with ACS guidelines on 38 of 52 questions (73.08%) and Gemini on 36 (69.23%). Most non-concordant answers were due to insufficient information. The models disagreed on 8 questions, with ChatGPT concordant in 5 and Gemini in 3. Both achieved 75% concordance on clinical information; Gemini outperformed on diagnostics (78.57% vs 71.43%), while ChatGPT had higher concordance on treatment questions (73.33% vs 63.33%).

Conclusions

ChatGPT-4o and Gemini Advanced demonstrate potential as valuable assets in spinal injury management by providing responses aligned with current best practices. The marginal differences in concordance rates suggest that neither model exhibits a superior ability to deliver recommendations concordant with validated clinical guidelines. Despite LLMs' increasing sophistication and utility, existing limitations currently prevent them from being clinically safe and practical in trauma-based settings.

Keywords: artificial intelligence, search engine, spinal cord injuries, spinal injuries, surgeons, large language model, ChatGPT

Introduction

Among all fractures that commonly result from traumatic injury, spinal column fractures pose uniquely significant challenges that can portend long-term physical, social, and financial consequences for patients and their loved ones. These injuries require complex management, particularly when complicated by the risk of irreversible neural damage and the presence of other life-threatening injuries.1 The decision-making process for spinal injury care remains intensely debated and often revolves around discussions of operative vs non-operative management, appropriate timing for surgery, mode and duration of steroid administration, and optimal hemodynamic targets prior to intervention.2,3 Despite advances in spinal injury classification and severity assessment, the approach to spinal cord injury management remains controversial. Such uncertainty has driven the creation and implementation of national guidelines that keep spine surgeons abreast of the newest literature and practice recommendations.

To assist with such intricate and multifactorial considerations when managing spinal cord injuries, the American College of Surgeons (ACS) developed the Best Practice Guidelines: Spine Injury in 2022. The ACS guidelines offer the most recently published evidence-based practical recommendations for the evaluation and management of adult patients with a spine injury, which encompasses both spinal column fracture and spinal cord injury.3 The guidelines put forth a series of summative recommendations based on an expert working group evaluation of the available medical literature.

As these guidelines continue to be widely used as a central source of information on spine injury and its management, artificial intelligence (AI) chatbots are growing in parallel and are increasingly used by physicians and patients as both a supportive adjunct to clinical decision-making and a general information-gathering tool. OpenAI’s most recent ChatGPT model, ChatGPT-4o, and Google’s Gemini Advanced are large language models (LLMs) capable of integrating, understanding, and generating massive volumes of human language text. Both models have been trained on a vast array of publicly available and licensed data up until at least 2023, and Gemini additionally offers internet access for queries.

The utility of such chatbots in the medical context continues to be assessed, especially given the evidence that earlier ChatGPT models are capable of passing the USMLE exams.4 Research in the fields of ophthalmology and urology, among others, has demonstrated ChatGPT’s commendable capacity to offer clinically accurate advice to patients seeking medical care.5,6 With regard to orthopedic spine surgery, ChatGPT’s clinical accuracy has been shown to be widely variable when queried on subjects such as antibiotic and thromboembolic prophylaxis, low back pain, degenerative spondylolisthesis, and cervical radiculopathy.7-11 Although these studies provide valuable preliminary insights into the potential role of chatbots in managing various spine-related conditions, none specifically address spinal cord injuries or compare ChatGPT with other leading LLM services, such as Google’s Gemini.

Considering this limitation, the goal of the present study was to evaluate the concordance of ChatGPT-4o and Gemini Advanced when compared to the 2022 ACS Best Practice Guidelines for spine injury. In doing so, we aimed to put forth the first expert assessment of ChatGPT-4o’s and Gemini’s capacities to provide accurate practice recommendations for managing spinal fractures and spinal cord injuries.

Methods

Institutional review board approval was not required for this study, as ChatGPT-4o and Google Gemini Advanced are publicly available resources, and no clinical data or patient information was used. The 2022 American College of Surgeons (ACS) Best Practices Guidelines: Spine Injury contains recommendations across 21 relevant sections, including imaging, spinal cord injury classification, spinal shock, and analgesia in spinal cord injury.3 These 21 sections provide 52 key points to guide clinical decision-making in various spinal cord injury scenarios.

To simulate the interaction between a surgeon and the AI chatbots, the 52 ACS key points were rephrased into questions designed to elicit recommendations from the LLMs for each specific scenario. These questions were categorized into three groups: “Informational,” “Diagnostic,” and “Treatment.” Given that the wording of questions can influence LLM responses, the phrasing was carefully crafted to closely mirror the original language and tone of the ACS key points, ensuring reproducibility and minimizing subjective bias. A board-certified spine surgeon then reviewed the questions to ensure clinical relevance. In total, 52 questions (8 informational, 14 diagnostic, and 30 treatment-related) were generated and posed to ChatGPT-4o and Google Gemini Advanced on August 13, 2024. Each question was posed to the LLMs only once, simulating a “zero-shot” learning scenario to assess their baseline capabilities without any prior learning or training bias. This approach also reflects typical clinical use, where clinicians or patients are unlikely to pose the same question to LLMs multiple times. Furthermore, to prevent stored memory from affecting subsequent responses, memory settings were disabled, and a new chat session was initiated for each question posed to the LLMs.
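
The grading described above reduces to simple proportions. As a minimal sketch (not the authors' analysis code), the Python below tallies concordance by subgroup. The question counts come from the text; the per-subgroup concordant counts are not stated directly and are back-calculated here from the percentages reported in the abstract, so they should be read as an assumption. The names `QUESTIONS`, `CONCORDANT`, `rate`, and `summarize` are illustrative only.

```python
# Question counts per subgroup, as stated in the Methods.
QUESTIONS = {"informational": 8, "diagnostic": 14, "treatment": 30}

# ASSUMPTION: concordant-answer counts per subgroup are not given in the text.
# They are back-calculated from the percentages reported in the abstract.
CONCORDANT = {
    "ChatGPT-4o": {"informational": 6, "diagnostic": 10, "treatment": 22},
    "Gemini Advanced": {"informational": 6, "diagnostic": 11, "treatment": 19},
}


def rate(concordant: int, total: int) -> float:
    """Return concordance as a percentage rounded to two decimals."""
    return round(100 * concordant / total, 2)


def summarize(model: str) -> dict:
    """Return per-subgroup and overall concordance rates for one model."""
    rates = {group: rate(CONCORDANT[model][group], n) for group, n in QUESTIONS.items()}
    rates["overall"] = rate(sum(CONCORDANT[model].values()), sum(QUESTIONS.values()))
    return rates


if __name__ == "__main__":
    for model in CONCORDANT:
        print(model, summarize(model))
```

Under these assumed counts, the subgroup rates match the reported figures and the totals sum to 38 and 36 of 52. Rounded to two decimals, 38 of 52 is 73.08%; truncating instead gives 73.07%.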

The ACS recommendations were compared to the responses provided by ChatGPT-4o and Google Gemini Advanced for each scenario. A complete list of the questions, corresponding ACS key points, and their subgroup categorization can be found in Table 3.

Table 3. American College of Surgeons guideline-based questions and our concordance grading rationale for ChatGPT and Gemini.

ACS Guideline Section Subgroup ACS Guideline-Based Question ChatGPT Concordance (Y/N) ChatGPT Grading Rationale Gemini Concordance (Y/N) Gemini Grading Rationale
Epidemiology Informative What are the leading mechanisms of spinal cord injury? Y Correctly identifies vehicular trauma and falls as the top 2 mechanisms of SCI Y Correctly identifies vehicular trauma and falls as the top 2 mechanisms of SCI
Epidemiology Informative What populations are most susceptible to spinal cord injury and why? Y Correctly identifies that elderly individuals (ages 65 and older) are more susceptible to spinal injuries due to falls and degenerative spinal conditions like arthritis and osteoporosis Y Correctly identifies that elderly individuals (ages 65 and older) are more susceptible to spinal injuries due to falls and degenerative spinal conditions like arthritis and osteoporosis
Pre-hospital spinal motion restriction Treatment How can spinal motion restriction (SMR) be achieved? Y The ChatGPT response accurately covers the concept of spinal motion restriction (SMR) and mentions several methods and devices such as backboards, scoop stretchers, and vacuum mattresses that align with the ACS key point. Although it does not explicitly mention an ambulance cot, the general principles of SMR and the use of similar devices are well-covered Y The response accurately mentions various methods of spinal motion restriction, including the use of a long spine board, scoop stretcher, vacuum mattress, and ambulance cot, which align with the devices mentioned in the ACS key point
Pre-hospital spinal motion restriction Treatment When is it necessary to apply spinal motion restriction to the entire spine? Y The response aligns with the ACS key point by emphasizing the importance of applying spinal motion restriction (SMR) to the entire spine when there is suspicion of spinal injury. It also mentions various conditions and symptoms that would necessitate full spinal immobilization, which corresponds with the key point about addressing the entire spine to mitigate the risk of noncontiguous injuries Y The response aligns with the ACS key point by emphasizing that spinal motion restriction (SMR) should be applied to the entire spine when there is a suspected or potential spinal injury. The response explains that the spine is an interconnected structure, and an injury in one part can affect other areas, which aligns with the rationale behind applying SMR to the entire spine to mitigate the risk of noncontiguous injuries
Pre-hospital spinal motion restriction Informative Why is it important to ensure that a sufficient number of properly trained individuals are available to assist with patient transfers in the case of spinal cord injury? Y The response aligns with the ACS key point by emphasizing the importance of having a sufficient number of properly trained individuals available to assist with patient transfers, specifically to minimize the risk of exacerbating a potentially unstable spinal injury. The response also highlights the critical aspects of preventing further injury, maintaining spinal alignment, and ensuring patient safety, which directly address the ACS guideline Y The response is consistent with the ACS key point. It emphasizes the need for a sufficient number of properly trained individuals to assist with patient transfers, particularly in the context of spinal cord injuries. The response addresses the critical aspects of preventing further injury, ensuring patient comfort, reducing the risk of falls, and promoting efficient care, which aligns with the goal of minimizing the risk of displacement of an unstable spinal injury
Cervical collar clearance Treatment Under what conditions can the cervical collar be discontinued without additional radiographic imaging in an awake, asymptomatic, adult trauma patient? Y The response accurately reflects the ACS key point by outlining the conditions under which a cervical collar can be safely discontinued without additional radiographic imaging. It references clinical decision rules such as the NEXUS criteria and the Canadian C-spine rule, which align with the criteria mentioned in the ACS key point: normal neurological exam, no high-risk injury mechanism, free range of cervical motion, and no neck tenderness N, insufficient Although Gemini follows the spirit of the key points, it does not specifically mention the listed criterion that the cervical collar can be discontinued when there is no high-risk injury mechanism. It does mention the two criteria/rules used in the ACS guideline, but does not state this criterion directly in its response
Cervical collar clearance Treatment When is the removal of a cervical collar recommended for adult blunt trauma patients? N, insufficient The response discusses various imaging modalities and clinical criteria but does not emphasize the specific condition of having a negative helical CT as the definitive criterion for collar removal N, insufficient The response does mention the role of CT imaging, but it does not clearly emphasize the combination of negative CT results and the absence of neurological symptoms as the primary criteria for collar removal
Cervical collar clearance Diagnostic What is recommended as sufficient to remove a C-collar in an adult blunt trauma patient who is obtunded/unevaluable? N, insufficient While the response provides a thorough discussion of the process for evaluating an obtunded or unevaluable blunt trauma patient, it does not explicitly state that a negative helical cervical CT scan alone is sufficient for removing the C-collar N, insufficient This response does not clearly state that a negative helical cervical CT scan is definitively sufficient, as recommended by the ACS key point
Imaging Diagnostic Why are plain radiographs of the cervical and thoracolumbar spine not recommended in the initial screening of spinal trauma? Y The response accurately aligns with the ACS key point by explaining that plain radiographs are not recommended for the initial screening of spinal trauma due to their low sensitivity Y The response clearly aligns with the ACS key point by explaining that plain radiographs (x-rays) are not recommended for the initial screening of spinal trauma due to their low sensitivity
Imaging Diagnostic What is the initial imaging modality of choice to evaluate the cervical and thoracolumbar spine? N, Contradictory The response incorrectly states that x-rays (radiography) are the initial imaging modality of choice for evaluating the cervical and thoracolumbar spine, which directly contradicts the ACS key point that recommends non-contrast MDCT as the initial imaging modality N, Contradictory The response incorrectly identifies x-rays as the initial imaging modality of choice for evaluating the cervical and thoracolumbar spine. This contradicts the ACS key point
Imaging Diagnostic Which imaging modality is the only one capable of evaluating the internal structure of the spinal cord? Y The response is fully aligned with the ACS key point, stating that MRI is the only imaging modality capable of evaluating the internal structure of the spinal cord in detail Y The response is fully aligned with the ACS key point, stating that MRI is the only imaging modality capable of evaluating the internal structure of the spinal cord in detail
Imaging Diagnostic When should universal screening for blunt cerebrovascular injury using a whole-body CT scan be considered? N, insufficient The response outlines specific scenarios and criteria for screening blunt cerebrovascular injury (BCVI), which focuses on selective screening based on high-risk factors, injury patterns, and clinical signs. However, it does not align with the ACS key point, which suggests considering universal screening for all major trauma patients using a whole-body CT scan. The response emphasizes selective screening rather than the universal approach recommended by the ACS key point Y The response is aligned with the ACS key point by advocating for universal screening for blunt cerebrovascular injury (BCVI) using a whole-body CT scan, including CT angiography of the neck
Physical examination Diagnostic How is the neurologic level of injury determined? Y The response accurately aligns with the ACS key point by describing the process of determining the neurologic level of injury through the assessment of sensory and motor levels Y Clearly explains that the neurologic level of injury (NLI) is determined through a comprehensive assessment of both sensory and motor function
Spine injury classification systems Diagnostic What do spine trauma classification systems include? Y The response aligns with the ACS key point by describing how spine trauma classification systems encompass specific injury characteristics, such as the anatomical location, type of injury, spinal stability, and morphology. It also addresses the patient’s medical and neurologic status, including the neurological assessment and the use of grading systems like the ASIA Impairment Scale Y The response aligns with the ACS key point by discussing spine trauma classification systems that categorize spinal injuries based on fracture morphology, neurological status, and other factors like ligamentous injury
Spine injury classification systems Diagnostic How are patient scores for the classification system used in decision making? Y ChatGPT successfully answers the question in line with the ACS key point given it states “guides healthcare providers in determining the level of monitoring, the intensity of treatment, and the urgency of interventions”. This refers to inpatient vs outpatient management Y Gemini successfully answers the question in line with the ACS key point, as it mentions treatment decisions, which are implied as surgical vs non-surgical
Spinal cord injury classification Diagnostic When should you complete the assessment to accurately assign an ASIA impairment grade? Y The ACS key point is straightforward in its instruction that the ASIA impairment grade should be assigned only after the period of spinal shock has passed Y The response aligns with the ACS key point by acknowledging the importance of timing in the ASIA impairment scale (AIS) assessment. It correctly notes that spinal shock can mask the true extent of neurological impairment and emphasizes that the assessment may need to be repeated after spinal shock resolves to ensure accuracy
Spinal cord injury classification Diagnostic How is the level of injury defined after assessing sensation and motor function? Y The response aligns with the ACS key point by describing how the neurological level of injury (NLI) is determined through the assessment of sensation and motor function. It correctly explains that the NLI is the most caudal spinal segment with normal sensory and motor function on both sides of the body. The response also clarifies that the motor level is defined as the lowest segment where muscle function has a grade of at least 3 (active movement against gravity), which aligns with the concept of “anti-gravity motor function” Y It explains that the level of injury is determined by assessing the lowest spinal segment with normal sensory function and motor function, specifically mentioning the requirement for muscle strength to be at least 3 out of 5 on the MRC scale, which corresponds to anti-gravity motor function
Nonoperative management Treatment How can occipital condyle fractures without neural compression or cranio-cervical misalignment be managed successfully? Y The response aligns with the ACS key point by discussing the conservative management of occipital condyle fractures without neural compression or cranio-cervical misalignment. It mentions that these fractures can often be managed with a cervical collar, which corresponds to the use of a rigid or semi-rigid cervical orthosis Y The response aligns with the ACS key point by recommending the use of a rigid cervical collar (a type of rigid or semi-rigid cervical orthosis) for managing occipital condyle fractures without neural compression or cranio-cervical misalignment. It accurately describes immobilization as the primary management strategy for these stable injuries and outlines additional steps, such as pain management and physical therapy, that are consistent with the ACS guideline
Nonoperative management Treatment How should treatment for cervical fractures be selected? Y The response aligns with the ACS key point by emphasizing the need for individualized treatment plans for cervical fractures. It discusses the importance of assessing the fracture type, stability, neurological status, and patient-specific factors such as age and overall health when determining the appropriate treatment approach. The response covers both non-surgical and surgical management options, as well as the need for a multidisciplinary approach, which is consistent with the ACS guideline Y The response aligns with the ACS key point by emphasizing the need to select treatment for cervical fractures based on the specific characteristics of the fracture (eg, stability, type) and patient factors such as neurological status and overall health. It covers both conservative and surgical treatment options, and it mentions the importance of considering the patient's overall health, which includes age, in the decision-making process
Nonoperative management Treatment What is best practice for managing stable thoracolumbar fractures without neurologic deficits? N, Contradictory The response emphasizes the use of bracing (eg, a thoracolumbosacral orthosis, or TLSO) as a part of the treatment for stable thoracolumbar fractures, which directly contradicts the ACS key point that suggests managing these fractures without a brace N, Contradictory The response contradicts the ACS key point by recommending the use of bracing (eg, thoracolumbar sacral orthosis, TLSO, or Jewett brace) as part of the management strategy for stable thoracolumbar fractures
Penetrating spinal injury Informative What is the typical outcome for the vast majority of penetrating spinal cord level injuries according to the ASIA impairment scale? Y The response aligns with the ACS key point by accurately stating that the vast majority of penetrating spinal cord injuries result in complete injuries, classified as ASIA A. It explains that these injuries often cause severe and irreversible damage to the spinal cord, leading to a total loss of sensory and motor function below the level of injury Y The response aligns with the ACS key point by clearly stating that the typical outcome for the vast majority of penetrating spinal cord injuries is ASIA A, indicating a complete injury with no motor or sensory function preserved below the level of injury. This directly reflects the ACS guideline
Penetrating spinal injury Treatment How often do gunshot injuries of the spinal cord require surgical stabilization? Y The response is aligned with the ACS key point by suggesting that a significant proportion of spinal gunshot injuries are managed non-operatively Y The response is aligned with the ACS key point by clearly stating that gunshot injuries to the spinal cord rarely require surgical stabilization
Penetrating spinal injury Treatment What is the recommendation for the use of steroids in penetrating spinal injury? Y The response aligns with the ACS key point by clearly stating that steroids are not recommended for penetrating spinal injuries. It highlights the lack of evidence supporting the use of steroids in these cases, the potential risks associated with steroid use, and the recommendations from authoritative guidelines that advise against their use N, Contradictory The response does not fully align with the ACS key point because it suggests that the use of steroids in penetrating spinal cord injury might be considered on a case-by-case basis. This contradicts the ACS guideline, which clearly states that steroids are not recommended for penetrating spinal injuries
Concomitant Injuries affecting timing of spinal intervention Treatment How does clinical judgment play a role in determining the optimal timing of spinal surgery in polytrauma patients? Y The response is aligned with the ACS key point by emphasizing the importance of clinical judgment in determining the timing of spinal surgery in polytrauma patients. It covers the need to balance early spinal stabilization with the patient’s hemodynamic stability, addressing factors such as overall stability, injury prioritization, neurological status, and patient-specific considerations Y The response aligns with the ACS key point by emphasizing the importance of clinical judgment in determining the optimal timing of spinal surgery in polytrauma patients. It covers critical considerations such as patient stability, neurological status, injury severity, coexisting injuries, and individual patient factors
Concomitant Injuries affecting timing of spinal intervention Treatment Why must resuscitation and positioning the patient on the operating room (OR) table be planned? Y The response aligns with the ACS key point by emphasizing the importance of careful planning in both resuscitation and patient positioning in the OR. It covers various aspects, such as patient safety, surgical access, anesthesia considerations, and team coordination, all of which are essential to managing potential complications associated with other injuries Y The response aligns with the ACS key point by emphasizing the need for careful planning of resuscitation and positioning to manage potential complications. It highlights crucial aspects such as emergency preparedness, patient safety, surgical efficiency, team communication, and customized care, all of which are necessary to address the complexities of managing patients with multiple injuries
Neurogenic shock and systemic pressure-directed therapy Diagnostic Injuries to what region of the spine will cause vasoplegia and neurogenic shock and why? Y The response aligns with the ACS key point by accurately explaining that injuries to the cervical and high thoracic spine result in vasoplegia and neurogenic shock due to the loss of sympathetic tone. It details the physiological mechanisms behind these conditions, such as the loss of sympathetic control leading to vasodilation and bradycardia Y The response aligns with the ACS key point by accurately explaining that injuries to the cervical and upper thoracic spine (above T6) lead to vasoplegia and neurogenic shock due to the disruption of sympathetic nervous system control. It highlights the mechanisms by which these injuries cause widespread vasodilation and hypotension, which are key features of neurogenic shock
Neurogenic shock and systemic pressure-directed therapy Treatment What should be considered when deciding to use mean arterial pressure (MAP) goals of 85-90 mmHg for 7 days in patients with SCI? Y The response aligns with the ACS key point by discussing the importance of avoiding hypotension in patients with spinal cord injury (SCI) and considering the potential benefits of maintaining MAP goals of 85-90 mmHg. The response also appropriately emphasizes the need to weigh this approach against the risks and limitations, including the use of vasopressors, the requirement for invasive monitoring, and the burden on critical care resources Y The response aligns with the ACS key point by discussing the importance of maintaining MAP goals of 85-90 mmHg in patients with spinal cord injury (SCI) while also emphasizing the need to carefully weigh the decision against potential risks and limitations. The response covers the limited evidence available, the need for close monitoring, and the importance of individualized care, which are all key aspects of the ACS guideline
Neurogenic shock and systemic pressure-directed therapy Treatment What is recommended to treat both hypotension and bradycardia associated with symptomatic denervation? N, insufficient The response provides a comprehensive overview of managing hypotension and bradycardia associated with symptomatic denervation but does not specifically mention the use of an agent with both alpha- and beta-adrenergic activity, which is the key recommendation in the ACS guideline N, insufficient The response provides a comprehensive overview of managing hypotension and bradycardia associated with symptomatic denervation but does not specifically mention the use of an agent with both alpha- and beta-adrenergic activity, which is the key recommendation in the ACS guideline
Pharmacologic management of spinal cord injury Treatment Should methylprednisolone be administered within 8 hours? Y The response accurately reflects the current consensus that the use of methylprednisolone within 8 hours of spinal cord injury (SCI) is not definitively recommended. It acknowledges the historical context based on the NASCIS trials but correctly highlights the shift in guidelines due to concerns about potential risks and inconsistent evidence of benefit N, insufficient The response provided does not align with the ACS key point regarding the specific context of administering methylprednisolone in spinal cord injury (SCI). The response generalizes the timing of methylprednisolone administration without addressing the specific controversy and evidence related to its use within 8 hours after SCI
Pharmacologic management of spinal cord injury Treatment Have any other potential therapeutic agents demonstrated efficacy for motor recovery and neuroprotection? N, Contradictory The ChatGPT response contradicts the ACS guideline's key point by suggesting that there are therapeutic agents with potential efficacy in motor recovery and neuroprotection, while the ACS guideline asserts that no other agents have demonstrated efficacy N, Contradictory The response contradicts the ACS guideline's key point by implying that several therapeutic agents have shown promise in motor recovery and neuroprotection, even though it acknowledges that further research is needed
Venous thromboembolism prophylaxis Treatment When should chemoprophylaxis be initiated to reduce the risk of venous thromboembolism (VTE)? N, insufficient The response discusses various scenarios where chemoprophylaxis for VTE might be appropriate, detailing specific patient groups and conditions. However, it does not specifically emphasize the timing of initiation, particularly the recommendation to start within 72 hours of injury, which is the critical aspect of the ACS key point N, insufficient The response discusses various scenarios where chemoprophylaxis for VTE might be appropriate, detailing specific patient groups and conditions. However, it does not specifically emphasize the timing of initiation, particularly the recommendation to start within 72 hours of injury, which is the critical aspect of the ACS key point
Venous thromboembolism prophylaxis Treatment How should the duration of chemoprophylaxis be determined in the case of spinal cord injury? Y The response aligns with the ACS key point by emphasizing that the duration of chemoprophylaxis should be individualized based on factors such as injury severity, mobility status, bleeding risk, and other comorbidities Y The response aligns well with the ACS key point by emphasizing that the duration of chemoprophylaxis should be individualized based on factors such as the severity of the spinal cord injury, mobility status, additional risk factors, and the presence of other complications
Venous thromboembolism prophylaxis Diagnostic When can surveillance duplex ultrasound for VTE be considered in asymptomatic patients in the case of spinal cord injury? Y The response aligns with the ACS key point by acknowledging that routine surveillance duplex ultrasound (DUS) is not universally recommended for all asymptomatic patients with spinal cord injury (SCI). It further highlights that DUS may be considered in specific high-risk scenarios, particularly when pharmacological prophylaxis is contraindicated or delayed, which directly corresponds to the ACS guideline N, insufficient The response suggests that surveillance duplex ultrasound (DUS) for VTE may be considered more broadly in asymptomatic patients with spinal cord injury (SCI), including baseline screening upon admission to rehabilitation and regular surveillance during the acute phase. However, it does not emphasize the ACS guideline's specific point that routine surveillance DUS is generally not recommended and should be reserved for high-risk patients who cannot receive chemoprophylaxis
Spinal shock Diagnostic What is spinal shock, and what are its characteristics? Y The response aligns well with the ACS key point by accurately describing spinal shock as a condition characterized by the total or near-total loss of reflexes (areflexia) and the complete loss or suppression of motor function and sensation below the level of the injury. The response provides a detailed explanation of the key features of spinal shock, including flaccid paralysis, loss of sensation, and autonomic dysfunction, which are consistent with the ACS definition Y The response aligns with the ACS key point by accurately describing spinal shock as a condition involving a sudden loss or impairment of spinal cord function, including areflexia, flaccid paralysis, sensory loss, and autonomic dysfunction below the level of injury
Spinal shock Informative How long can spinal shock persist, and what factors can prolong it? Y It mentions complications such as infection, making it concordant Y It mentions complications such as infection, making it concordant
Spinal shock Diagnostic How is the end of spinal shock typically observed in patients? N, insufficient It does not mention the specific sequence of reflex recovery outlined in the ACS key point, such as the deep plantar reflex, cremasteric reflex, ankle jerk, Babinski sign, and knee jerk, which are important indicators in the progression of recovery Y The response aligns well with the ACS key point by mentioning the return of specific reflexes such as the deep tendon reflexes (knee jerk, ankle jerk), Babinski reflex, and bulbocavernosus reflex, which are key indicators of the end of spinal shock
Spinal cord injury-induced bradycardia Informative What is the most common dysrhythmia occurring during the acute phase following spinal cord injury? Y The response aligns with the ACS key point by identifying bradycardia as the most common dysrhythmia during the acute phase following a spinal cord injury. The response accurately describes the condition, its causes, and its clinical significance, which are consistent with the ACS guideline Y The response accurately identifies sinus bradycardia as the most common dysrhythmia during the acute phase following spinal cord injury, which aligns with the ACS key point. It also correctly highlights that this is particularly common in cases of cervical spinal cord injury and provides a timeline for when bradycardia typically peaks and resolves
Spinal cord injury-induced bradycardia Informative What often precipitates cardiovascular instability following spinal cord injury? N, insufficient It does not specifically address the key point that cardiovascular instability is often precipitated by suctioning, turning, and hypoxia N, insufficient It does not specifically address the key point that cardiovascular instability is often precipitated by suctioning, turning, and hypoxia
Spinal cord injury-induced bradycardia Treatment What treatments may be included for persistent bradycardia or intermittent episodes of severe bradycardia following spinal cord injury? N, insufficient The response discusses various treatments for bradycardia following spinal cord injury, including some of the medications mentioned in the ACS key point, such as atropine, dopamine, and theophylline. However, it does not explicitly mention the use of beta-2 adrenergic agonists like albuterol or phosphodiesterase inhibitors like aminophylline N, insufficient The response correctly mentions several medications that are part of the recommended treatment for bradycardia following spinal cord injury, including atropine, dopamine, epinephrine, and theophylline. However, it does not mention the use of a beta-2 adrenergic agonist like albuterol, which is specifically included in the ACS key point
Ventilator management in high spinal cord injury Treatment Why is early tracheostomy recommended for patients with spinal cord injury (SCI)? Y The response aligns with the ACS key point by recommending early tracheostomy for patients with spinal cord injury (SCI). It provides a detailed explanation of the benefits of early tracheostomy, such as preventing respiratory complications, reducing ventilator-associated pneumonia, facilitating ventilator weaning, and improving patient comfort Y The response aligns with the ACS key point by recommending early tracheostomy for patients with spinal cord injury (SCI). It highlights several key benefits of early tracheostomy, such as aiding in respiratory management, reducing the risk of complications, improving patient comfort, facilitating weaning from mechanical ventilation, and enhancing communication and swallowing
Ventilator management in high spinal cord injury Treatment Why should stimulation of the diaphragm be considered in high spinal cord injury (SCI) patients? Y The response aligns with the ACS key point by advocating for the consideration of diaphragm stimulation in patients with high spinal cord injury (SCI). It correctly emphasizes the role of diaphragm stimulation in preserving respiratory function, reducing dependence on mechanical ventilation, and potentially improving the patient’s ability to wean from the ventilator, which directly corresponds to the ACS guideline Y The response aligns with the ACS key point by recommending diaphragm stimulation for patients with high spinal cord injury (SCI). It emphasizes the benefits of diaphragm stimulation, including the potential to restore breathing, reduce dependence on mechanical ventilation, and improve overall quality of life, which directly supports the ACS guideline's emphasis on using diaphragm stimulation to plan long-term ventilator strategies and assess the potential for weaning from the ventilator
Placement of tracheostomy following cervical stabilization Treatment When can tracheostomy be performed after anterior cervical spinal stabilization without increasing the risk of infection or other wound complications? N, contradictory The response suggests caution in timing a tracheostomy after anterior cervical spinal stabilization, recommending a waiting period of 3 to 7 days based on concerns about infection and wound complications. This contrasts with the ACS key point, which states that early tracheostomy can be performed without increasing the risk of infection or other wound complications Y The response aligns with the ACS key point by stating that early tracheostomy after anterior cervical spinal stabilization is generally safe and does not significantly increase the risk of infection or wound complications. It mentions that research supports the safety of performing tracheostomy within the first few days post-surgery, which is consistent with the ACS guideline
Placement of tracheostomy following cervical stabilization Treatment What are considered safe techniques for tracheostomy following spinal cord injury? Y ChatGPT successfully answers the question with respect to the ACS key point and the referenced techniques Y The response aligns with the ACS key point by describing both percutaneous dilatational tracheostomy (PDT) and open surgical tracheostomy as safe techniques for patients with spinal cord injury (SCI). It provides information on the contexts in which each technique might be preferred, emphasizing their safety and effectiveness depending on patient-specific factors
Analgesia in spinal cord injury Treatment Why is pain management a priority in the care of acutely injured spinal cord injury (SCI) patients? Y The response aligns with the ACS key point by emphasizing that pain management is crucial in the care of acutely injured spinal cord injury (SCI) patients. It highlights the importance of pain control not only for relieving suffering but also for preventing complications such as autonomic dysreflexia (a form of dysautonomia) triggered by pain N, insufficient The response effectively emphasizes the importance of pain management in acutely injured SCI patients, highlighting its impact on quality of life, recovery, psychological well-being, and prevention of chronic pain. However, it does not explicitly mention the prevention of dysautonomia symptoms, such as autonomic dysreflexia, which can be triggered by uncontrolled pain
Analgesia in spinal cord injury Treatment What approach should be implemented for the acute pain management of patients with spinal cord injury (SCI)? Y The response aligns with the ACS key point by advocating for a multimodal approach to pain management in patients with spinal cord injury (SCI). It comprehensively outlines a strategy that includes both pharmacological and non-pharmacological interventions, tailored to the specific needs of SCI patients Y The response aligns with the ACS key point by advocating for a multimodal approach to managing acute pain in patients with spinal cord injury (SCI). It includes a combination of pharmacological and non-pharmacological interventions, tailored to the individual needs of the patient, which is consistent with the ACS guideline's emphasis on a comprehensive pain management strategy
Avoidance of associated symptoms of spinal cord injury Treatment How should acute autonomic dysreflexia be treated? Y The response aligns well with the ACS key point by outlining the immediate treatment steps for acute autonomic dysreflexia, including sitting the patient upright, removing tight clothing, identifying and correcting the inciting stimulus, and administering quick-onset, short-acting antihypertensive medications if necessary Y The response aligns with the ACS key point by accurately describing the immediate steps to manage acute autonomic dysreflexia (AD). It covers sitting the patient upright, removing tight clothing, identifying and removing the triggering stimulus, monitoring blood pressure, and administering medications such as nifedipine or nitrates if necessary to lower blood pressure
Avoidance of associated symptoms of spinal cord injury Treatment How is spasticity managed following spinal cord injury? Y The response aligns with the ACS key point by emphasizing the use of physical therapy as the primary non-pharmacological approach to managing spasticity and also discusses the use of anti-spasticity medications in certain cases Y The response aligns with the ACS key point by outlining the use of physical therapy as a primary method for managing spasticity and discussing the use of anti-spasticity medications when necessary. It provides a balanced approach, detailing both non-pharmacological (physical therapy, orthotics, heat/cold therapy) and pharmacological (oral medications, intrathecal baclofen, botulinum toxin) interventions, which are in line with the ACS guideline
Avoidance of associated symptoms of spinal cord injury Informative How can skin breakdown and decubitus ulcers be prevented following spinal cord injury? N, insufficient While the response provides a comprehensive overview of strategies to prevent skin breakdown and decubitus ulcers in individuals with spinal cord injury (SCI), it does not specifically mention the use of clinical judgment and validated assessment tools to assess skin breakdown risks, which is a crucial element of the ACS key point. Additionally, it does not explicitly discuss the importance of avoiding known modifiable risk factors like pressure, shear force, and moisture to the skin N, insufficient While the response provides comprehensive strategies for preventing skin breakdown and decubitus ulcers, it does not explicitly mention the use of clinical judgment or a validated assessment tool to assess skin breakdown risks. Additionally, the response does not specifically address the importance of avoiding known modifiable risk factors such as pressure, shear force, and moisture to the skin, which are critical aspects of the ACS key point
Neurogenic bowel and bladder acute care management Treatment What should be initiated for all patients with acute spinal cord injury? N, insufficient The response is insufficient because it fails to address the need for initiating a bowel management program, which is a critical aspect of care for patients with acute SCI as outlined by the ACS key point N, insufficient The response is insufficient because it fails to address the need for initiating a bowel management program, which is a critical aspect of care for patients with acute SCI as outlined by the ACS key point
Neurogenic bowel and bladder acute care management Treatment What is the goal of effective bladder management after acute spinal cord injury, and how should it be customized? Y The response effectively aligns with the ACS key point by focusing on the primary goals of bladder management in patients with acute spinal cord injury (SCI). It emphasizes the prevention of urinary tract infections (UTIs) and the protection of upper urinary tract structures, which are central to the ACS guideline. Furthermore, the response clearly outlines the importance of customizing bladder management based on individual factors, including the level and completeness of the injury, patient lifestyle, and personal preferences Y It accurately emphasizes the primary goals of bladder management, which are to preserve upper urinary tract structures and minimize urinary tract infections (UTIs). It also addresses the importance of customizing bladder management based on individual factors such as the type of bladder dysfunction, level and completeness of the injury, individual preferences, and the presence of other medical conditions
Mobilization and rehabilitation for acute traumatic spinal cord injury Treatment When should physical and occupational therapy treatment begin for patients with acute SCI? Y The response accurately reflects the key points from the ACS guideline, emphasizing the importance of beginning physical and occupational therapy as soon as the patient is medically stable, typically within the first week after injury Y It emphasizes the importance of early initiation of physical and occupational therapy for patients with acute spinal cord injury (SCI), even in the ICU setting, which aligns with the ACS recommendation of starting therapy within the first week post-injury, provided the patient is medically ready
Mobilization and rehabilitation for acute traumatic spinal cord injury Treatment Where should patients with an acute SCI be discharged to when possible? Y ChatGPT successfully answers the question with respect to the ACS key point Y The response aligns with the ACS key point by emphasizing that patients with acute spinal cord injuries (SCI) should ideally be discharged to an inpatient rehabilitation facility specializing in SCI care

After collecting responses from both LLMs, four independent reviewers evaluated each answer as either “concordant” or “non-concordant” with the ACS guidelines. In cases of disagreement between the reviewers, a joint discussion was held until a consensus was reached. A board-certified spine surgeon further validated all assessments. Responses were graded as “concordant” if they accurately reflected all key aspects of the ACS recommendations. Responses that deviated from the guidelines were classified as “non-concordant” and further categorized as:

  1. Insufficient: The response omitted one or more key elements of the guideline or lacked adequate specificity.

  2. Contradictory: The response directly contradicted the ACS recommendations.

As an example, given the following diagnostic-related ACS key point:

“Plain radiographs of the cervical and thoracolumbar spine are not recommended in the initial screening of spinal trauma because of their low sensitivity” (Section: Imaging, page 14).

Figure 1 displays the respective question to ChatGPT and the response it elicited.

Figure 1.

ChatGPT’s concordant response to the ACS guideline-based clinical question.

In this instance, ChatGPT’s response was graded concordant for accurately stating that plain radiographs are not recommended for initial spinal trauma screening due to low sensitivity.

As another example, given the following treatment-related ACS Key Point:

“Tracheostomy can be performed early after anterior cervical spinal stabilization without increasing the risk of infection or other wound complications.” (Section: Placement of Tracheostomy following Cervical Stabilization, page 58).

Figure 2 details the corresponding question to ChatGPT and the response it yielded.

Figure 2.

ChatGPT’s non-concordant, contradictory response to the ACS guideline-based clinical question.

This response was graded as non-concordant and contradictory. ChatGPT suggests caution in timing a tracheostomy after anterior cervical spinal stabilization, recommending a 3 to 7-day waiting period based on concerns about infection and wound complications. This contradicts the ACS key point, which states that early tracheostomy can be performed without increasing the risk of infection or other wound complications.

The concordance rates of ChatGPT and Gemini with the ACS guidelines were compared using chi-squared tests, with the significance level set at α = 0.05.

Results

Of the 52 total clinical questions, ChatGPT provided concordant responses to 38 (73.07%) questions, while Gemini’s responses were concordant for 36 (69.23%) questions (Table 1). Despite ChatGPT’s marginally higher concordance rate, the chi-square test revealed no statistically significant difference between the models’ overall performance (P = 0.829). Both models generated a comparable number of non-concordant responses, with 14 (26.93%) non-concordant answers for ChatGPT and 16 (30.77%) non-concordant answers for Gemini. Among these non-concordant responses, ChatGPT and Gemini showed similar tendencies in the reasons for their non-concordance. In the overall analysis, 71.43% of ChatGPT’s non-concordant answers were classified as insufficient and 28.57% as contradictory. For Gemini, 66.67% of non-concordant responses were insufficient, and 33.33% were contradictory. When both models failed to provide concordant answers, they more often missed key aspects of the ACS recommendations than produced directly contradictory advice.
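The overall and subgroup comparisons can be reproduced with a short calculation. The sketch below is illustrative and rests on an assumption not stated in the Methods: that the chi-squared tests were run with Yates' continuity correction. Under that assumption, the concordant counts reported in this section give the P values listed in Table 1. The function name `yates_chi_square_2x2` and the layout of `counts` are hypothetical, chosen only for this example.

```python
import math


def yates_chi_square_2x2(a, b, c, d):
    """Return (chi2, p) for the 2x2 table [[a, b], [c, d]].

    Assumes Yates' continuity correction and non-zero row/column totals.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed[i][j] - expected)
            # Yates' correction shrinks each deviation by 0.5, never below zero.
            corrected = max(diff - 0.5, 0.0)
            chi2 += corrected ** 2 / expected
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# (ChatGPT concordant, Gemini concordant, questions per model), from the Results text.
counts = {
    "Overall": (38, 36, 52),
    "Informative": (6, 6, 8),
    "Diagnostic": (10, 11, 14),
    "Treatment": (22, 19, 30),
}

p_values = {}
for label, (chatgpt_yes, gemini_yes, total) in counts.items():
    _, p = yates_chi_square_2x2(
        chatgpt_yes, total - chatgpt_yes, gemini_yes, total - gemini_yes
    )
    p_values[label] = p
    print(f"{label}: P = {p:.3f}")
```

Without the continuity correction, the same overall counts give a noticeably smaller P value, which is why the correction is assumed here.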

Table 1.

ChatGPT and Gemini Cumulative performance.

Concordant
ChatGPT Gemini P Value
Overall, n (%) 38/52 (73.1%) 36/52 (69.2%) 0.829
Informative, n (%) 6/8 (75%) 6/8 (75%) 1.000
Diagnostic, n (%) 10/14 (71.4%) 11/14 (78.6%) 1.000
Treatment, n (%) 22/30 (73.3%) 19/30 (63.3%) 0.579

Overall, the two models conflicted on 8 out of 52 questions (15.38%), where one model was concordant with ACS guidelines while the other was not. In these instances of disagreement, ChatGPT provided the correct response in 5 out of 8 cases (62.5%), whereas Gemini was correct in 3 out of 8 cases (37.5%).

Subgroup Performance

In the subset of 8 informational questions, ChatGPT and Gemini each answered six questions (75%) in concordance with ACS guidelines (Table 2). Both models produced two non-concordant responses, and neither provided contradictory responses in this category. For the 14 diagnostic questions, ChatGPT answered 10 (71.43%) in concordance with ACS guidelines, while Gemini answered 11 (78.57%) (Table 2). The models produced the same number of contradictory responses (1 each), and ChatGPT provided 3 insufficient responses compared to 2 from Gemini. Among the 30 treatment-related questions, ChatGPT provided concordant responses to 22 (73.33%), outperforming Gemini, which answered 19 (63.33%) questions concordantly (Table 2). Both models had four contradictory responses, though Gemini produced more insufficient answers (7 vs 4 for ChatGPT). There were no significant differences between ChatGPT and Gemini within any of the subgroups.

Table 2.

Stratification of ChatGPT and Gemini Non-Concordance.

ChatGPT Gemini P Value
Overall, n (%)
 Contradictory 4/14 (28.6%) 4/16 (25%) 1.000
 Insufficient 10/14 (71.4%) 12/16 (75%) 1.000
Informative, n (%)
 Contradictory 0/2 (0%) 0/2 (0%) 1.000
 Insufficient 2/2 (100%) 2/2 (100%) 1.000
Diagnostic, n (%)
 Contradictory 1/4 (25%) 1/3 (33.3%) 1.000
 Insufficient 3/4 (75%) 2/3 (66.7%) 1.000
Treatment, n (%)
 Contradictory 4/8 (50%) 4/11 (36.4%) 0.658
 Insufficient 4/8 (50%) 7/11 (63.6%) 0.658
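
The Methods mention only chi-squared tests, but the treatment-row counts in Table 2 (4 of 8 contradictory for ChatGPT vs 4 of 11 for Gemini) return P = 0.658 under a two-sided Fisher's exact test, a common choice when cell counts are small. The sketch below illustrates that calculation. Treating Fisher's test as the method behind Table 2 is an assumption, and the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1 = a + b          # first-row total (ChatGPT non-concordant answers)
    col1 = a + c          # first-column total (all contradictory answers)
    n = a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * tolerance
    )


# Rows: ChatGPT, Gemini. Columns: contradictory, insufficient (Table 2).
p_treatment = fisher_exact_two_sided(4, 4, 4, 7)
p_overall = fisher_exact_two_sided(4, 10, 4, 12)
print(f"Treatment: P = {p_treatment:.3f}")
print(f"Overall:   P = {p_overall:.3f}")
```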

A detailed rationale for the grading of each ChatGPT and Gemini response to our ACS guideline-based questions is recorded in Table 3.

Discussion

This study is the first to compare the performance of OpenAI’s ChatGPT-4o and Google’s Gemini Advanced against an evidence-based clinical guideline for traumatic spinal injuries. ChatGPT-4o demonstrated a marginally higher concordance rate than Gemini Advanced, but the difference was not statistically significant. Additionally, both models exhibited similar trends in their responses, with the majority of non-concordant responses from both LLMs being due to insufficient information rather than contradictory advice. When subgroup analyses were conducted based on guideline categories, each model displayed varied strengths and weaknesses in each category. Despite these differences, we found no statistically significant differences between ChatGPT-4o and Gemini Advanced in response concordance and tendencies. These findings are consistent with several previous comparison studies, which found that versions of ChatGPT and Gemini (previously Bard) could assist clinical decision-making but struggle to provide nuanced clinical guidance.8,11-14

Stratified Subgroup Findings

In our analysis of the subgroup questions, both ChatGPT-4o and Gemini Advanced displayed strengths in specific categories. This is again consistent with previous studies that indicate that LLMs can provide valuable information in the clinical setting.7,8,10,11,13

Informative

ChatGPT-4o and Gemini Advanced performed equally in the informative question subgroup, answering 6 out of 8 questions (75%) correctly. For questions that were answered concordantly with ACS Guidelines, both were able to correctly identify the content of the question and address all important aspects that were provided within the ACS Guideline’s key points. For example, when asked, “What is the typical outcome for the vast majority of penetrating spinal cord level injuries according to the ASIA impairment scale?” both ChatGPT-4o and Gemini Advanced correctly identified that the vast majority of such injuries result in complete injuries, classified as ASIA A, and result in total loss of sensory and motor function below the level of injury.

However, there were instances in which ChatGPT-4o and Gemini Advanced provided non-concordant answers that lacked sufficient detail. For the question “How can skin breakdown and decubitus ulcers be prevented following spinal cord injury?” both models offered comprehensive strategies for preventing skin breakdown and decubitus ulcers in patients with spinal cord injuries, but neither mentioned the use of clinical judgment and validated assessment tools like the Spinal Cord Injury Pressure Ulcer Scale (SCIPUS) or Braden Scale, nor did either emphasize avoiding known modifiable risk factors.15,16 This information is a critical component of the ACS Guidelines for preventing skin breakdown and decubitus ulcers, and its omission suggests that the models overlook the finer points of spinal cord injury care necessary for proper injury management. This inconsistency and lack of sufficient detail are mirrored in other medical applications of LLMs. For instance, a study by Pirkle et al found a significant number of non-concordant answers regarding pediatric orthopedics, further highlighting the imperfect state of LLMs in their ability to provide consistent and accurate medical recommendations. 17 It is also noteworthy that both ChatGPT-4o and Gemini Advanced provided non-concordant answers to the same questions due to insufficient detail, suggesting a common gap in their training data or a more general deficiency in clinical understanding. Similar issues have been posited in other studies within orthopedic management, implying that AI models likely lack the depth required for accurate, complex clinical decision-making.7,11-13,17

Diagnostic

In the diagnostic subgroup, Gemini Advanced provided concordant responses for 11 out of 14 questions (78.57%), while ChatGPT-4o provided concordant responses for 10 out of 14 (71.43%), with most non-concordant responses from both models due to insufficient detail. In this subgroup, both models correctly recommended advanced imaging techniques, particularly CT scans, over plain radiographs for initial assessment of suspected spinal injury. As such, both LLMs accurately identified trauma-related contexts for spinal cord injuries and correctly identified best practices in line with ACS guidelines.

ChatGPT-4o and Gemini Advanced were both non-concordant on two questions. When asked about the initial imaging modality of choice for the cervical and thoracolumbar spine, ChatGPT-4o and Gemini Advanced contradicted the ACS Guidelines by suggesting that X-rays are the ideal initial imaging modality. The ACS Guidelines prioritize CT scans due to their superior sensitivity and specificity, and this direct contradiction further reinforces the need for verification of LLM responses, which is counter to the goal of reducing physician burden in the clinical setting. This issue of reliance on older, less accurate protocols is consistent with findings in the study by Howard et al, which analyzed ChatGPT’s recommendations in infectious disease management and highlighted the model’s tendency to rely on outdated information when no current literature is available. 18 Similarly, Sosa et al’s study on orthopedic management highlighted the tendency for LLMs to reference outdated and inaccurate information. 12 Such discrepancies may also stem from both models’ reliance on older clinical protocols in their training data, which historically emphasized radiographs before the broader adoption of CT scanning.19,20

Additionally, there were three questions to which only one of the models responded non-concordantly. One example arose when ChatGPT-4o was asked, “When should universal screening for blunt cerebrovascular injury using a whole-body CT scan be considered?”. The model did not recommend universal screening for all major trauma patients using a whole-body CT scan and instead suggested that screening should be selective, determined by high-risk factors, injury patterns, and clinical signs. This approach reflects older screening protocols prevalent in the literature before the recent shift toward universal screening.21-25 Because of how LLMs are trained, the data that ChatGPT-4o was trained on may have been outdated or may have led it to treat the older approach as the majority consensus. In contrast, Gemini Advanced did not fully answer the question, “When can surveillance duplex ultrasound for VTE be considered in asymptomatic patients in the case of spinal cord injury?” It suggested that surveillance duplex ultrasound may be used broadly in asymptomatic patients, whereas the ACS Guidelines generally recommend reserving such imaging for high-risk patients who cannot receive chemoprophylaxis. This suggests that Gemini may overlook the nuanced clinical practicality of certain recommendations, as the high cost and low yield of routine surveillance in low-risk patients make such practices impractical. 26

Treatment

ChatGPT-4o displayed a relatively strong performance in the Treatment Subgroup with 22 out of 30 (73.33%) concordant responses, outperforming Gemini Advanced’s 19 out of 30 (63.33%). Examples of prompts that were successfully answered include a question on safe techniques for tracheostomy following spinal cord injury. Both ChatGPT-4o and Gemini Advanced responded with the two techniques recommended by ACS Guidelines, suggesting that both open and percutaneous tracheostomies are safe for spinal cord injury patients. This is further supported by Lorenzi et al, who examined ChatGPT-4 and Gemini Advanced’s ability to provide accurate recommendations for head and neck malignancies, finding similar concordance rates for surgical advice. 14

ChatGPT-4o and Gemini Advanced answered the same 7 of the 30 questions non-concordantly, and consistently for the same reasons. One instance in which both models gave a non-concordant response was the prompt, “What should be initiated for all patients with acute spinal cord injury?”. Both responses were marked insufficient because they omitted the recommendation to initiate a bowel management program, which is a critical aspect of care outlined in the ACS Guidelines. This mirrors findings from Sosa et al, who observed that LLMs often omitted essential treatment protocols in orthopedic care guidance. 12 Another involved the prompt, “Have any other potential therapeutic agents demonstrated efficacy for motor recovery and neuroprotection?” where both ChatGPT-4o and Gemini Advanced were marked contradictory for suggesting that there are therapeutic agents that may benefit motor recovery and neuroprotection. This is in direct contradiction with the ACS Guidelines, which assert that no other agents have demonstrated efficacy in this regard. The issue of incorrect recommendations has been similarly highlighted in several studies. Lum et al examined the accuracy of generative AI in orthopedic resident-level decision-making, while Kumar et al focused on comparing the ability of LLMs to generate differential diagnoses for neurosurgical disorders. Both studies underscore the potential risks of relying on AI for critical medical decisions without a deeper understanding of its limitations.27-29 This pattern continues to indicate that although the use of LLMs in clinical practice is promising as a supportive tool for supplementary information and clinical guidance, current limitations require careful clinician verification, which can make their use redundant. Additionally, there is a risk that AI-generated information could reinforce pre-existing incorrect clinician assumptions without such oversight, emphasizing the need for these tools to serve strictly as adjuncts rather than primary sources of decision-making.

In instances where only one model responded non-concordantly, ChatGPT-4o had isolated non-concordant responses to 1 out of 30 questions, while Gemini Advanced had isolated non-concordant responses to 4 out of 30 questions. ChatGPT-4o contradicted the ACS Guidelines when prompted, “When can tracheostomy be performed after anterior cervical spinal stabilization without increasing the risk of infection or other wound complications?”. Instead of stating that tracheostomy can be performed early without increasing the risk of infection or other wound complications, ChatGPT-4o recommended a waiting period of up to 7 days before tracheostomy. This is clinically significant: a physician who followed ChatGPT’s recommendation to delay tracheostomy after anterior cervical spinal stabilization could expose the patient to multiple adverse outcomes. Delayed tracheostomy, defined as occurring more than 7 days post-ACSF, has been associated with increased morbidity, longer ICU stays, prolonged mechanical ventilation, and extended hospital stays.30-33 For cases where only Gemini Advanced responded non-concordantly, 3 of the 4 responses were insufficient, while 1 was contradictory. An example of an insufficient response is when Gemini Advanced was prompted, “Why is pain management a priority in the care of acutely injured spinal cord injury (SCI) patients?”. In its response, Gemini Advanced did not explicitly provide information regarding preventing dysautonomia symptoms that may be triggered by uncontrolled pain, a critical component of ACS Guideline reasoning. This omission in Gemini’s recommendation could lead to adverse consequences if followed by a physician, as preventing dysautonomia symptoms, particularly autonomic dysreflexia, is crucial after spinal cord injury due to the severe cardiovascular and systemic complications associated with the condition.34,35 In one instance, Gemini Advanced provided a contradictory response, suggesting that steroids can be used in penetrating spinal cord injuries when asked, “What is the recommendation for the use of steroids in penetrating spinal injury?” However, the ACS Guidelines clearly state that steroids are not recommended for such cases. The isolated errors suggest that while ChatGPT-4o may have a slightly higher concordance rate overall, both models exhibit unique limitations depending on the clinical context. As pointed out by Zaidat et al, AI models require continuous updates and integration of high-quality evidence to enhance their reliability in clinical care before their use can be consistently recommended. 10

Future Directions

While this study provides valuable insight into the comparative performance of ChatGPT-4o and Gemini Advanced in providing clinical guidance in a spinal trauma setting, several key areas require further exploration and development to enhance the clinical utility of LLMs in healthcare.

Firstly, future research should ensure that LLMs prioritize training on the latest medical literature. As previous studies have found, AI models can become outdated if they are not updated with the most recent evidence, ultimately limiting their clinical accuracy and relevance.7-9,11-14,28,36 Such updating could be performed in conjunction with the customization of LLMs like ChatGPT-4o and Gemini Advanced to specific guidelines relevant to medical subspecialties like orthopedics. The results of this study and others suggest that LLMs are useful because of their flexibility and broad base of understanding.7,10,12,27 However, this breadth limits the ability of such LLMs to process context and produce recommendations for complex and specialized situations. Thus, future research should explore how LLMs can be trained with specialized data sets, like clinical guidelines for treating spinal trauma or back pain, which could improve the LLM’s ability to provide contextually appropriate recommendations. Such an approach could involve incorporating case-based training to refine models’ understanding of specific clinical situations and guidelines, alongside uploading clinical guidelines that are influential in a given subspecialty and prompting the LLM to focus on their reasoning. One avenue that could be explored is OpenAI’s “Create a GPT” function, where users can customize a GPT to function based on specific prompts, uploaded documents, and extra functionalities offered by OpenAI.37

Given the many controversies in acute SCI management, the alignment of AI platforms with clinical guidelines may be shaped not only by their intrinsic capabilities but also by the inherently contentious nature of the subject matter.38-40 Our findings highlight an important consideration: non-concordance in AI-generated recommendations may stem from the lack of consensus within the medical community or insufficient evidence supporting a unified guideline. These challenges reflect the difficulty of translating nuanced or controversial clinical topics into standardized guidance. Alternatively, variability in platform responses could indicate differences in the quality and scope of their training data, the sophistication of their natural language processing algorithms, or their ability to interpret and incorporate clinical context effectively. Future studies could stratify the questions posed to AI platforms into two categories:

  • (1) Topics with well-established consensus based on robust clinical guidelines.

  • (2) Topics that remain contentious within the medical community.
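The proposed stratification could be tallied along the following lines. This is a minimal sketch only: the records are placeholders invented for illustration and are not study data, and the names `graded_responses`, `concordance_by_stratum`, `model_a`, and `model_b` are hypothetical.

```python
from collections import defaultdict

# Placeholder records for illustration only; these are NOT study data.
# Each tuple: (stratum, model, concordant_with_guideline)
graded_responses = [
    ("consensus", "model_a", True),
    ("consensus", "model_a", True),
    ("consensus", "model_b", True),
    ("consensus", "model_b", False),
    ("contentious", "model_a", False),
    ("contentious", "model_a", True),
    ("contentious", "model_b", False),
    ("contentious", "model_b", False),
]


def concordance_by_stratum(records):
    """Return {(stratum, model): fraction of concordant responses}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for stratum, model, concordant in records:
        totals[(stratum, model)] += 1
        hits[(stratum, model)] += int(concordant)
    return {key: hits[key] / totals[key] for key in totals}


rates = concordance_by_stratum(graded_responses)
for (stratum, model), rate in sorted(rates.items()):
    print(f"{stratum:12s} {model}: {rate:.0%}")
```

Reporting concordance separately for each stratum would help distinguish non-concordance that reflects a model limitation from non-concordance that reflects a lack of consensus in the underlying evidence.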

Additionally, it is valuable to continue exploring how to combine AI’s strengths with clinician oversight. Given the gaps in accuracy displayed by ChatGPT-4o and Gemini Advanced in this study, AI may be best suited as a tool to validate clinical judgment. This role is already being explored within areas like Electronic Medical Records (EMR), with companies like Epic Systems implementing AI in various roles.41

To successfully integrate AI into clinical practice, it is essential to prioritize reducing bias and enhancing transparency in LLMs. Since LLMs operate as “black boxes,” with their internal reasoning and response generation largely opaque, gaining a thorough understanding of their mechanisms is crucial before they can be fully trusted as primary clinical tools.42,43 Although we can guide LLMs to focus on specific areas, the uncertainty about how they generate their answers means that clinicians and patients may never fully trust their recommendations or be confident that inherent biases are adequately addressed without a clearer understanding of the frameworks within which these models operate.

Limitations

This study has several limitations that warrant consideration. Firstly, the responses generated by ChatGPT, which is based on an LLM trained on data up to April 2023, may lack awareness of significant discoveries or updates in the field made after this date. Similar to how clinical guidelines require periodic revisions to incorporate the latest evidence, LLMs also need regular updates to align with the most current medical literature. Additionally, since the ACS guidelines were published in 2022, the LLM’s access to more recent data may have influenced the concordance. Secondly, the scope of this study was limited, as it only evaluated a specific set of questions related to a single set of guidelines for spinal cord injury. Therefore, the findings may not be generalizable to other medical conditions or interventions beyond spinal cord injuries. Although no current guidelines are universally regarded as the gold standard for SCI treatment, the ACS guidelines were chosen for this analysis because of their comprehensive scope, encompassing epidemiology, diagnosis, conservative treatment, surgical intervention, and special considerations, all based on the best available evidence at the time. In contrast, the American Association of Neurological Surgeons and Congress of Neurological Surgeons Guidelines for the Management of Acute Cervical Spine and Spinal Cord Injuries, last updated in 2013, may not reflect the most current research.44 Similarly, while the AO Spine Clinical Practice Guideline for the Management of Acute Spinal Cord Injury is comprehensive, providing recommendations on surgical timing, anticoagulant thromboprophylaxis, preoperative imaging, and rehabilitation, its guidelines are based on literature published up until 2017.45 The AO Spine & Praxis Spinal Cord Institute Guidelines for the Management of Acute Spinal Cord Injury, published in 2024, have a relatively narrower focus on surgical decompression timing, spinal cord perfusion optimization, and intra-operative SCI management.46 However, the 2024 AO Spine guidelines identify critical knowledge gaps and propose future research directions, emphasizing that all guidelines are dynamic, evolving with emerging evidence, and must be applied judiciously in clinical practice based on individual patient factors such as presentation, frailty, and comorbidities.47 These recommendations are inherently shaped by current data limitations and the need for continuous refinement to enhance SCI management. Future research could explore the alignment of these evolving guidelines with LLMs to expand AI applications in SCI management, assess their potential for real-time adaptation to new evidence, and evaluate their capability for personalized risk assessment.

Additionally, while a board-certified spine surgeon verified the grading of the LLMs’ concordance with guideline categories, the process was inherently subjective and did not represent a precise quantitative measure of the models’ accuracy. This subjectivity may introduce bias into the evaluation. Despite these limitations, this study provides valuable insights into the ability of ChatGPT-4o and Gemini Advanced to generate evidence-based recommendations for spinal cord injury management, offering a useful starting point for further exploration of LLMs in clinical decision-making.

Conclusion

Our analysis indicates that both ChatGPT-4o and Gemini Advanced have the potential to be valuable assets to healthcare providers by providing responses aligned with current best practices in spinal injury management. However, the marginal and insignificant differences in concordance rates suggest that neither ChatGPT-4o nor Gemini Advanced has a superior ability to provide recommendations that are concordant with a validated clinical guideline. As such, our findings highlight the current state of AI and LLMs in healthcare: although AI models like ChatGPT-4o and Gemini Advanced are becoming increasingly sophisticated and useful, their performance still exhibits limitations that currently bar them from being clinically safe and practical in a trauma-based setting.

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Disclosure: Samuel K. Cho, MD, FAAOS. AAOS: Board or committee member. American Orthopaedic Association: Board or committee member. AOSpine North America: Board or committee member. Cerapedics: Fellowship support. Cervical Spine Research Society: Board or committee member. Globus Medical: IP royalties, fellowship support. North American Spine Society: Board or committee member. Scoliosis Research Society: Board or committee member. SI-Bone: Paid consultant.

Ethical Statement

Ethical Approval

Ethical approval and informed consent were not required for this study as it did not involve human or animal subjects.

ORCID iDs

Alexander Yu https://orcid.org/0000-0002-7246-2269

Albert Li https://orcid.org/0009-0004-8597-0459

Wasil Ahmed https://orcid.org/0009-0001-0904-1891

Michael Saturno https://orcid.org/0000-0002-1132-0662

References

  • 1. Karsy M, Hawryluk G. Modern medical management of spinal cord injury. Curr Neurol Neurosci Rep. 2019;19(9):65.
  • 2. Hejrati N, Moghaddamjou A, Pedro K, et al. Current practice of acute spinal cord injury management: a global survey of members from the AO Spine. Glob Spine J. 2024;14(2):546-560.
  • 3. Best practices guidelines ACS. https://www.facs.org/quality-programs/trauma/quality/best-practices-guidelines/. Accessed October 11, 2024.
  • 4. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
  • 5. Bernstein IA, Zhang YV, Govil D, et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw Open. 2023;6(8):e2330320.
  • 6. Connors C, Gupta K, Khusid JA, et al. Evaluation of the current status of artificial intelligence for endourology patient education: a blind comparison of ChatGPT and google bard against traditional information resources. J Endourol. 2024;38(8):843-851.
  • 7. Hoang T, Liou L, Rosenberg AM, et al. An analysis of ChatGPT recommendations for the diagnosis and treatment of cervical radiculopathy. J Neurosurg Spine. 2024;41(3):385-395.
  • 8. Ahmed W, Saturno M, Rajjoub R, et al. ChatGPT versus NASS clinical guidelines for degenerative spondylolisthesis: a comparative analysis. Eur Spine J. 2024;33:4182-4203. doi: 10.1007/s00586-024-08198-6
  • 9. Duey AH, Nietsch KS, Zaidat B, et al. Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations. Spine J. 2023;23(11):1684-1691.
  • 10. Zaidat B, Shrestha N, Rosenberg AM, et al. Performance of a large language model in the generation of clinical guidelines for antibiotic prophylaxis in spine surgery. Neurospine. 2024;21(1):128-146.
  • 11. Shrestha N, Shen Z, Zaidat B, et al. Performance of ChatGPT on nass clinical guidelines for the diagnosis and treatment of low back pain: a comparison study. Spine. 2024;49(9):640-651.
  • 12. Sosa BR, Cung M, Suhardi VJ, et al. Capacity for large language model chatbots to aid in orthopedic management, research, and patient queries. J Orthop Res. 2024;42(6):1276-1282. doi: 10.1002/jor.25782
  • 13. Lang SP, Yoseph ET, Gonzalez-Suarez AD, et al. Analyzing large language models’ responses to common lumbar spine fusion surgery questions: a comparison between ChatGPT and bard. Neurospine. 2024;21(2):633-641. doi: 10.14245/ns.2448098.049
  • 14. Lorenzi A, Pugliese G, Maniaci A, et al. Reliability of large language models for advanced head and neck malignancies management: a comparison between ChatGPT 4 and gemini advanced. Eur Arch Otorhinolaryngol. 2024;281(9):5001-5006. doi: 10.1007/s00405-024-08746-2
  • 15. Bergstrom N, Braden B, Boynton P, Bruch S. Using a research-based assessment scale in clinical practice. Nurs Clin. 1995;30(3):539-551. https://pubmed.ncbi.nlm.nih.gov/7567578/. Accessed October 15, 2024.
  • 16. Krishnan S, Brick RS, Karg PE, et al. Predictive validity of the Spinal Cord Injury Pressure Ulcer Scale (SCIPUS) in acute care and inpatient rehabilitation in individuals with traumatic spinal cord injury. NeuroRehabilitation. 2016;38(4):401-409. doi: 10.3233/NRE-161331
  • 17. Pirkle S, Yang J, Blumberg TJ. Do ChatGPT and gemini provide appropriate recommendations for pediatric orthopaedic conditions? J Pediatr Orthop. 2024;45:e66-e71. doi: 10.1097/BPO.0000000000002797
  • 18. Howard A, Hope W, Gerada A. ChatGPT and antimicrobial advice: the end of the consulting infection doctor? Lancet Infect Dis. 2023;23(4):405-406. doi: 10.1016/S1473-3099(23)00113-5
  • 19. Cingolani E, Siddi C, Ranaldi G, et al. Standard X-rays for the victims of severe trauma: time for a change. Crit Care. 2010;14(1):P283.
  • 20. France JC, Bono CM, Vaccaro AR. Initial radiographic evaluation of the spine after trauma: when, what, where, and how to image the acutely traumatized spine. J Orthop Trauma. 2005;19(9):640-649. doi: 10.1097/01.bot.0000188036.69078.ef
  • 21. Black JA, Abraham PJ, Abraham MN, et al. Universal screening for blunt cerebrovascular injury. J Trauma Acute Care Surg. 2021;90(2):224-231. doi: 10.1097/TA.0000000000003010
  • 22. Vogt K, Kaminsky M, Joos E, Ball CG, Evidence Based Reviews in Surgery EBRS Group. Universal screening for blunt cerebrovascular injury: a critical appraisal. Evidence-based reviews in surgery. J Trauma Acute Care Surg. 2021;91(6):e142-e145. doi: 10.1097/TA.0000000000003403
  • 23. Leichtle SW, Banerjee D, Schrader R, et al. Blunt cerebrovascular injury: the case for universal screening. J Trauma Acute Care Surg. 2020;89(5):880-886. doi: 10.1097/TA.0000000000002824
  • 24. Ali A, Broome JM, Tatum D, et al. Cost-effectiveness of universal screening for blunt cerebrovascular injury: a markov analysis. J Am Coll Surg. 2023;236(3):468-475. doi: 10.1097/XCS.0000000000000490
  • 25. Paulus EM, Fabian TC, Savage SA, et al. Blunt cerebrovascular injury screening with 64-channel multidetector computed tomography: more slices finally cut it. J Trauma Acute Care Surg. 2014;76(2):279-283. doi: 10.1097/TA.0000000000000101
  • 26. Ley EJ, Brown CVR, Moore EE, et al. Updated guidelines to reduce venous thromboembolism in trauma patients: a western trauma association critical decisions algorithm. J Trauma Acute Care Surg. 2020;89(5):971-981.
  • 27. Lum ZC, Collins DP, Dennison S, et al. Generative artificial intelligence performs at a second-year orthopedic resident level. Cureus. 2024;16(3):e56104. doi: 10.7759/cureus.56104
  • 28. Kumar RP, Sivan V, Bachir H, et al. Can artificial intelligence mitigate missed diagnoses by generating differential diagnoses for neurosurgeons? World Neurosurg. 2024;187:e1083-e1088. doi: 10.1016/j.wneu.2024.05.052
  • 29. Rao A, Pang M, Kim J, et al. Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. J Med Internet Res. 2023;25:e48659. doi: 10.2196/48659
  • 30. Kelly EM, Fleming AM, Lenart EK, et al. Delayed tracheostomy after cervical fixation is not associated with improved outcomes: a trauma quality improvement program analysis. Am Surg. 2023;89(7):3064-3071.
  • 31. Balas M, Jaja BNR, Harrington EM, et al. Earlier tracheostomy reduces complications in complete cervical spinal cord injury in real-world practice: analysis of a multicenter cohort of 2001 patients. Neurosurgery. 2023;93(6):1305-1312.
  • 32. Essa A, Shakil H, Malhotra AK, et al. Quantifying the association between surgical spine approach and tracheostomy timing after traumatic cervical spinal cord injury. Neurosurgery. 2024;95(2):408-417.
  • 33. Wang XR, Zhang Q, Ding WS, Zhang W, Zhou M, Wang HB. Comparison of clinical outcomes of tracheotomy in patients with acute cervical spinal cord injury at different timing. Clin Neurol Neurosurg. 2021;210(106947):106947.
  • 34. Eldahan KC, Rabchevsky AG. Autonomic dysreflexia after spinal cord injury: systemic pathophysiology and methods of management. Auton Neurosci. 2018;209:59-70.
  • 35. Walters ET. How is chronic pain related to sympathetic dysfunction and autonomic dysreflexia following spinal cord injury? Auton Neurosci. 2018;209:79-89.
  • 36. Teixeira-Marques F, Medeiros N, Nazaré F, et al. Exploring the role of ChatGPT in clinical decision-making in otorhinolaryngology: a ChatGPT designed study. Eur Arch Otorhinolaryngol. 2024;281(4):2023-2030. doi: 10.1007/s00405-024-08498-z
  • 37. Creating a GPT. https://help.openai.com/en/articles/8554397-creating-a-gpt. Accessed October 23, 2024.
  • 38. Ahuja CS, Schroeder GD, Vaccaro AR, Fehlings MG. Spinal cord injury-what are the controversies? J Orthop Trauma. 2017;31(Suppl 4):S7-S13.
  • 39. Fehlings MG, Wilson JR, Dvorak MF, Vaccaro A, Fisher CG. The challenges of managing spine and spinal cord injuries: an evolving consensus and opportunities for change. Spine. 2010;35(21 Suppl):S161-S165.
  • 40. Chen WT, Zhou YP, Zhang GS. The progress and controversies regarding steroid use in acute spinal cord injury. Eur Rev Med Pharmacol Sci. 2023;27(13):6101-6110.
  • 41. Artificial intelligence. https://www.epic.com/software/ai/. Accessed October 15, 2024.
  • 42. Denecke K, May R, Rivera Romero O, LLMHealthGroup. Potential of large language models in health care: delphi study. J Med Internet Res. 2024;26:e52399. doi: 10.2196/52399
  • 43. Tam TYC, Sivarajkumar S, Kapoor S, et al. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit Med. 2024;7(1):258. doi: 10.1038/s41746-024-01258-7
  • 44. Walters BC, Hadley MN, Hurlbert RJ, et al. Guidelines for the management of acute cervical spine and spinal cord injuries: 2013 update. Neurosurgery. 2013;60(CN_suppl_1):82-91.
  • 45. Fehlings MG, Tetreault LA, Wilson JR, et al. A clinical practice guideline for the management of acute spinal cord injury: introduction, rationale, and scope. Glob Spine J. 2017;7(3 Suppl):84S-94S.
  • 46. Kwon BK, Tetreault LA, Evaniew N, Skelly AC, Fehlings MG. AO Spine/Praxis clinical practice guidelines for the management of acute spinal cord injury: an introduction to a focus issue. Glob Spine J. 2024;14(3_suppl):5S-9S.
  • 47. Fehlings MG, Moghaddamjou A, Evaniew N, et al. The 2023 AO Spine-Praxis guidelines in acute spinal cord injury: what have we learned? What are the critical knowledge gaps and barriers to implementation? Glob Spine J. 2024;14(3_suppl):223S-230S.

Articles from Global Spine Journal are provided here courtesy of SAGE Publications
