BMC Neurology. 2025 Jul 1;25:264. doi: 10.1186/s12883-025-04280-8

Evaluating ChatGPT and DeepSeek in postdural puncture headache management: a comparative study with international consensus guidelines

Jiayi Deng 1,2,#, Xu Qiu 1,2,#, Chengqi Dong 1,2, Li Xu 1,2, Xiaoxue Dong 4,5, Shiyue Yang 6, Qinghua Li 2,1, Tao Mei 2,1, Shi Chen 2,1, Yali Wu 2,1, Jianliang Sun 2,1, Feifang He 3,, Hanbin Wang 2,1,, Liang Yu 2,1,
PMCID: PMC12211737  PMID: 40597769

Abstract

Objective

To evaluate the use of ChatGPT and DeepSeek in clinical practice to provide healthcare professionals with accurate information on the prevention, diagnosis, and management of post-dural puncture headache (PDPH), and in particular to compare the responses of ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and DeepSeek with Deep Think (DeepSeek-R1) against the consensus practice guidelines for headache after dural puncture.

Background

Post-dural puncture headache (PDPH) is a common complication of dural puncture. Evidence-based guidance on the prevention, diagnosis, and management of PDPH has historically been lacking; the 2023 consensus guidelines now provide comprehensive recommendations. With the development and popularization of AI, more and more people, including patients and doctors, are using AI models. However, the quality of the answers these models provide has not yet been tested.

Methods

Responses from ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and DeepSeek-R1 were evaluated against PDPH guidelines using four dimensions: Accuracy (guideline adherence), Overconclusiveness (unjustified recommendations), Supplementary information (additional relevant details), and Incompleteness (omission of critical guidelines). A 5-point Likert scale further assessed response accuracy and completeness.

Results

All four models showed high accuracy and completeness. Across the 10 guideline questions evaluated, ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and DeepSeek-R1 all achieved 100% accuracy (10/10) (p = 1). None of the four models gave overly conclusive answers (p = 1). For supplementary information, ChatGPT-4o, ChatGPT-4o mini, and DeepSeek-R1 each scored 100% (10/10), and DeepSeek-V3 scored 90% (9/10) (p = 1). For incompleteness, ChatGPT-4o scored 80% (8/10), DeepSeek-R1 70% (7/10), and ChatGPT-4o mini and DeepSeek-V3 60% (6/10) each (p = 0.729).

Conclusion

All four AI models demonstrate clinical validity, with ChatGPT-4o and DeepSeek-R1 showing stronger guideline alignment. Though largely accurate, their responses omitted guideline content in 60–80% of cases. Healthcare professionals must exercise caution when using AI tools and should critically evaluate outputs before clinical application. While promising, the models' partial guideline coverage requires careful human oversight. Further validation research is essential before these models can reliably support clinical decision-making for complex conditions like PDPH.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12883-025-04280-8.

Keywords: Artificial intelligence, ChatGPT, DeepSeek, Postdural puncture headache

Introduction

The emergence of generative artificial intelligence (AI) models like ChatGPT and DeepSeek marks a transformative phase in healthcare applications. These models leverage large-scale language processing technology to enable context-aware interactions [1], with DeepSeek employing a hybrid training approach combining supervised learning and reinforcement learning. Supervised learning internalizes knowledge from curated biomedical corpora, while reinforcement learning calibrates clinical reasoning through expert feedback. This dual strategy enhances both knowledge absorption and adaptability to clinical scenarios, achieving professional-level diagnostic accuracy in relevant test data [2]. While ethical considerations persist regarding AI-assisted decision-making, emerging evidence suggests that systems like DeepSeek could enhance diagnostic efficiency and reduce cognitive burdens when implemented as clinical decision support tools under human supervision [3].

Postdural puncture headache (PDPH) is a recognized complication of unintentional dural puncture during epidural analgesia or of intentional dural puncture for spinal anesthesia or for diagnostic or interventional neuraxial procedures [4]. Despite extensive reviews on the prevention and management of PDPH, the absence of structured recommendations in most of them highlights the need for innovative consensus frameworks. Recent studies highlight that PDPH, characterized by severe headache following spinal procedures, lacks standardized prevention, diagnosis, and treatment protocols despite its significant clinical burden [5]. This gap prompted the development of a novel consensus practice guideline for PDPH, published in 2023, which integrates multidisciplinary expertise to address these unmet needs [6].

However, the rapid adoption of AI technologies in healthcare raises critical questions about their alignment with clinical guidelines. Large language models (LLMs), such as ChatGPT, are increasingly accessible to patients and clinicians, raising concerns about their reliability in providing medically accurate advice without explicit validation [7]. While existing research has evaluated LLMs in orthopedic pathology [8], their application to PDPH, a condition marked by heterogeneity in etiology and response to interventions, remains unexplored. To address this, our study rigorously assesses how well these LLMs align with the newly proposed PDPH consensus guidelines. By analyzing the models' responses to clinical scenarios, we aim to determine their fidelity to evidence-based recommendations and to identify potential gaps in knowledge representation. The findings have implications for integrating LLMs into practice, ensuring patient safety, and reducing physician workload through automated, guideline-aligned recommendations.

Methods

This study utilized the publicly available ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and DeepSeek with Deep Think (DeepSeek-R1) models and did not require institutional review board approval. When applying AI models, a priming effect can arise: the model's output may be guided or suggested by earlier input data (or context), producing particular trends or patterns. To avoid priming effects, we maintained strict separation between sessions and created a new chat window for each question. Additionally, ChatGPT and DeepSeek are publicly accessible and have demonstrated relevance in the current medical literature, showing potential to support clinical workflows [9, 10].

A 5-point Likert scale was used to assess the accuracy and completeness of ChatGPT and DeepSeek responses.

Accuracy:

  1. Completely incorrect

  2. More incorrect than correct [> 75% incorrect]

  3. Approximately equal correct and incorrect

  4. More correct than incorrect [> 75% correct]

  5. Completely correct

Completeness:

  1. Very incomplete [0–25%]

  2. Incomplete [25–50%]

  3. Moderate [50–75%]

  4. Complete [> 75%]

  5. Very complete [100%]

The concordance of the ChatGPT and DeepSeek responses was also evaluated by comparing them with the answers provided by the guidelines under the following four criteria: accuracy, overconclusiveness, supplementary information, and incompleteness [10].

The grading criteria are described in detail below:

  1. Accuracy: Is the ChatGPT or DeepSeek response accurate with respect to the guidelines?
    a. If YES, the response did not violate the guidelines, either by complying with them or by being unrelated to them.
    b. If NO, the response contradicted the guidelines.
  2. Overconclusiveness: If the guidelines concluded that there was insufficient evidence to provide a recommendation, did ChatGPT or DeepSeek provide one?
    a. If YES, the model made a recommendation although the guidelines did not provide one.
    b. If NO, either the guidelines provided a recommendation or neither the guidelines nor the model provided one.
  3. Supplementary information: Did ChatGPT or DeepSeek include additional information relevant to the question that the guidelines did not specify?
    a. If YES, the response included significant additional information, such as references to peer-reviewed articles or further explanations not included in the guidelines.
    b. If NO, the response did not contribute additional information relevant to the question.
  4. Incompleteness: If the ChatGPT or DeepSeek response was accurate, did it omit any relevant details that the guidelines included?
    a. If YES, the response failed to provide relevant information that was included in the guidelines.
    b. If NO, the response did not omit any relevant guideline content.
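To make the decision logic of this rubric concrete, the sketch below encodes one grader's judgment of a single AI response as a small data structure. This is an illustrative encoding written for this article, not software used in the study, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ResponseGrade:
    """One grader's judgment of a single AI response against the guideline."""
    contradicts_guideline: bool      # response conflicts with a guideline statement
    guideline_inconclusive: bool     # guideline found insufficient evidence
    model_made_recommendation: bool  # model still issued a recommendation
    adds_extra_information: bool     # relevant detail beyond the guideline
    omits_guideline_detail: bool     # relevant guideline content missing

    def accurate(self) -> bool:
        # YES when the response complies with, or is unrelated to, the guideline
        return not self.contradicts_guideline

    def overconclusive(self) -> bool:
        # YES only when the guideline withheld a recommendation but the model gave one
        return self.guideline_inconclusive and self.model_made_recommendation

    def supplementary(self) -> bool:
        return self.adds_extra_information

    def incomplete(self) -> bool:
        # Only accurate responses are assessed for omissions
        return self.accurate() and self.omits_guideline_detail

# Hypothetical grading of one response: accurate but missing a guideline detail
grade = ResponseGrade(False, False, False, True, True)
print(grade.accurate(), grade.overconclusive(), grade.supplementary(), grade.incomplete())
# -> True False True True
```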

The evaluation of the ChatGPT and DeepSeek responses was conducted by three separate reviewers (pain specialists) to confirm the reliability of the grading process. In cases of disagreement, an additional author was consulted for resolution. SPSS 26.0 was used for statistical analysis. We compared the accuracy, overconclusiveness, supplementary information, and completeness of ChatGPT and DeepSeek using a chi-square test. We used one-way analysis of variance to compare the Likert-scale accuracy and completeness scores between ChatGPT and DeepSeek. Fleiss' kappa, a generalization of Cohen's kappa to multiple raters, was computed in SPSS to evaluate the consistency among the three raters for the "Accuracy" and "Completeness" ratings of the ChatGPT and DeepSeek responses. p < 0.05 was considered statistically significant.
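As an illustration of these comparisons (the study itself used SPSS 26.0), the Python sketch below runs a chi-square test on the incompleteness counts reported in the Results (8, 6, 6, and 7 incomplete responses out of 10 per model) and shows how Fleiss' kappa can be computed with statsmodels. The ratings matrix is hypothetical, and the exact p-values may differ slightly from the SPSS output depending on the test variant.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Chi-square test on incompleteness: rows are models, columns are
# (incomplete, complete) counts out of the 10 guideline questions.
incompleteness = np.array([
    [8, 2],  # ChatGPT-4o
    [6, 4],  # ChatGPT-4o mini
    [6, 4],  # DeepSeek-V3
    [7, 3],  # DeepSeek-R1
])
chi2, p, dof, _ = chi2_contingency(incompleteness)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.3f}")  # p ~ 0.73, non-significant

# Fleiss' kappa across three raters; each row is one rated item and each
# entry is a 1-5 Likert rating. These example ratings are hypothetical.
ratings = np.array([
    [5, 5, 5],
    [4, 5, 4],
    [5, 5, 4],
    [3, 4, 4],
])
table, _ = aggregate_raters(ratings)  # counts of raters per category, per item
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```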

Results

There were a total of 10 clinical scenarios included in the consensus practice guidelines on PDPH. ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and DeepSeek-R1 all showed 100% accuracy (10/10), with no contradictions of the guidelines. None of the four models gave overly conclusive answers. In 10 of its responses (100%), ChatGPT-4o offered supplementary information, often explaining the procedures involved in various medical interventions or detailing how diagnostic tests are conducted. There were 8 responses (80%) in which ChatGPT-4o was incomplete, failing to provide key information that was included in the guidelines; for example, its answers on the pathogenesis of PDPH and on treatment measures were incomplete in several respects. Among the ChatGPT-4o mini responses, 10 (100%) were supplementary and 6 (60%) were incomplete. Among the DeepSeek-V3 responses, 9 (90%) were supplementary and 6 (60%) were incomplete. Among the DeepSeek-R1 responses, 10 (100%) were supplementary and 7 (70%) were incomplete (Table 1, Fig. 1).

Table 1.

Summary of concordance between consensus practice guidelines on postdural puncture headache and AI responses

1 When Should PDPH Be Suspected?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response emphasizes that PDPH should be suspected when a patient develops a headache after a spinal procedure, especially within 5 days, and the headache is typically worse when sitting or standing and improves when lying down. This aligns well with the guideline's statement and recommendation
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
ChatGPT-4o commentary: The ChatGPT-4o response details that PDPH should be suspected when a headache develops after a dural puncture procedure, especially within 5 days, with symptoms worsening in an upright position and improving when lying flat. This closely aligns with the guideline's statement and recommendation
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
DeepSeek-V3 commentary: The DeepSeek-V3 response outlines that PDPH should be suspected when a severe, throbbing headache develops within 5 days of a dural puncture procedure, especially when symptoms worsen in an upright position and improve when lying down, which aligns well with the guideline's timeframe and positional characteristics mentioned in the statement
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
DeepSeek-R1 commentary: The DeepSeek-R1 response details that PDPH should be suspected when a positional headache emerges within 5 days of a dural puncture, aligning with the guideline's timeframe and symptom characteristics
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
2 What Patient Factors Are Associated With the Incidence of PDPH?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response aligns well with the guideline's statement on patient factors associated with PDPH, as both highlight age, sex, pregnancy, history of headaches, and BMI as significant risk factors, though the ChatGPT-4o mini response provides more detailed explanations and additional factors
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
ChatGPT-4o commentary: The ChatGPT-4o response aligns well with the guideline's statement on patient factors associated with PDPH, as both highlight age, sex, BMI, history of headaches, pregnancy, and other related factors as significant risk factors, though the ChatGPT-4o response provides more detailed explanations and additional factors
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response aligns well with the guidelines by identifying patient factors associated with the incidence of PDPH, including younger age, female gender, lower BMI, pregnancy, and history of headaches or prior PDPH, as significant risk factors
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-R1 commentary: The DeepSeek-R1 response accords with the guidelines in identifying patient factors (e.g., younger age, female gender, low BMI, prior PDPH, pregnancy) associated with PDPH incidence and highlighting high-risk patients
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
3 What Procedural Characteristics Are Associated With PDPH?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response aligns well with the guideline's statement on procedural factors associated with PDPH, as both highlight needle type, size, insertion technique, and multiple attempts as significant risk factors, though the ChatGPT-4o mini response provides more detailed explanations and additional factors
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
ChatGPT-4o commentary: The ChatGPT-4o response aligns well with the guideline's statement on procedural factors associated with PDPH, as both highlight needle type, size, insertion technique, and multiple attempts as significant risk factors
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response aligns with the guidelines in identifying procedural factors like needle type (non-cutting preferred), size (smaller gauge better), orientation (bevel parallel for cutting needles), attempts (multiple risky), operator experience, patient positioning, and others associated with PDPH risk
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-R1 commentary: The DeepSeek-R1 response aligns with the guidelines in identifying procedural factors for PDPH, like needle type (non-cutting/small gauge preferred), orientation (bevel parallel for cutting), attempts (multiple risky), and anesthesia type differences
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
4 What Measures May Be Used to Prevent PDPH?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response aligns well with the guideline's statement on PDPH prevention, as both highlight needle selection, proper technique, and the limited evidence for routine prophylactic measures like EBP and bed rest
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
ChatGPT-4o commentary: The ChatGPT-4o response aligns with the guidelines in identifying PDPH prevention measures like using atraumatic/smaller gauge needles, correct orientation, minimizing attempts, and early dural puncture identification, with some added details and differences on prophylactic measures
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response aligns with the guidelines in emphasizing selective prophylactic EBP use and proper procedural techniques for PDPH prevention, and further details needle selection, orientation, and other preventive measures
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-R1 commentary: The DeepSeek-R1 response aligns with the guidelines in agreeing that routine prophylactic EBP isn't recommended due to insufficient evidence and highlighting the importance of proper procedural techniques (e.g., small, atraumatic needles)
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
5 What Conservative Measures May Be Used to Treat PDPH?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response aligns with the guidelines in agreeing that bed rest isn't routinely recommended for PDPH treatment, emphasizing hydration and caffeine use, regular multimodal analgesia, and short-term opioids, with the ChatGPT-4o mini response further covering other measures
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
ChatGPT-4o commentary: The ChatGPT-4o response aligns with the guidelines in noting that bed rest isn't routine, emphasizing hydration, caffeine, multimodal analgesia, and short-term opioids, with the ChatGPT-4o response adding abdominal binders, support, and observation
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response aligns with the guidelines in noting that bed rest isn't routine, emphasizing hydration and caffeine use, multimodal analgesia, and short-term opioids, with the DeepSeek-V3 response adding other measures like abdominal binders and monitoring
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
DeepSeek-R1 commentary: The DeepSeek-R1 response aligns with the guidelines in noting that bed rest isn't routine, emphasizing hydration and caffeine use, multimodal analgesia, and short-term opioids as needed, with the AI adding abdominal binders, theophylline, gabapentin, positioning, and education
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
6 What Procedural Interventions May Be Used to Treat PDPH?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response and the guidelines both highlight the limited efficacy and significant uncertainties associated with various procedural interventions for treating PDPH, with the guidelines emphasizing low levels of certainty and specific contraindications for certain treatments
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
ChatGPT-4o commentary: The ChatGPT-4o response and the guidelines concur that EBP is the most effective procedural treatment for PDPH with high success rates, while other interventions like greater occipital nerve blocks, epidural saline, and sphenopalatine ganglion blocks are mentioned but have lower efficacy and are considered adjuncts or alternatives in specific cases
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response and guidelines concur that EBP is the gold standard for PDPH treatment, with other interventions less effective and used in specific cases
Accurate: YES, Overconclusive: NO, Supplemental: NO, Incomplete: YES
DeepSeek-R1 commentary: The DeepSeek-R1 response and guidelines both identify EBP as the gold standard for PDPH treatment, with other interventions like fibrin glue injection and surgical repair having limited efficacy and higher risks, reserved for refractory cases
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
7 Is Imaging Required in PDPH Management?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response aligns with the guidelines, noting that imaging is generally not required for routine PDPH management but may be considered in cases of severe or atypical headache, neurological deficits, or when alternative diagnoses are suspected
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
ChatGPT-4o commentary: The ChatGPT-4o response and guidelines both agree that imaging is not typically required in the management of PDPH, with both suggesting it may be considered in cases of atypical headache or when other complications or alternative diagnoses need to be ruled out
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response and guidelines both agree that imaging is not routinely required for typical cases of PDPH but may be necessary in cases with atypical symptoms, neurological deficits, or prolonged headache to rule out other serious conditions or complications
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
DeepSeek-R1 commentary: The DeepSeek-R1 response and guidelines agree that imaging isn't routinely needed for typical PDPH but should be considered for red flags, atypical cases, or refractory ones to rule out other diagnoses or complications
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
8 What Are the Contraindications to an EBP?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response and guidelines both highlight similar contraindications for an epidural blood patch (EBP), including active infection, coagulopathy, increased intracranial pressure, and severe spinal abnormalities, and emphasize caution in patients with low platelet counts or those receiving antithrombotics
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
ChatGPT-4o commentary: The ChatGPT-4o response and guidelines highlight similar EBP contraindications like active infection, coagulopathy, increased ICP, spinal abnormalities, and caution with anticoagulation/low platelet counts
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response and guidelines identify similar EBP contraindications, including patient refusal, infection, coagulopathy, increased ICP, and caution in pregnant patients or those with mild coagulopathy
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
DeepSeek-R1 commentary: The DeepSeek-R1 response and guidelines identify similar EBP contraindications like patient refusal, active infection, uncorrected coagulopathy, elevated ICP, and severe allergies, and note caution in pregnancy complications, neurological conditions, etc.
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: NO
9 When and How Should an EBP Be Performed?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response aligns with the guidelines, emphasizing EBP for refractory PDPH, blood volume, sterile technique, guidance as needed, and monitoring, and reflecting the guidelines' moderate certainty on procedural aspects and success rates
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
ChatGPT-4o commentary: The ChatGPT-4o response aligns with the guidelines, emphasizing EBP for refractory PDPH, blood volume, sterile technique, guidance, and monitoring, and reflecting the guidelines' certainty and success rate variability
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response aligns with the guidelines, recommending EBP for refractory PDPH with 15–20 mL of blood, aseptic technique, and monitoring, and reflecting the guidelines' emphasis on variable success rates and moderate-to-low certainty on procedural specifics
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-R1 commentary: The DeepSeek-R1 response aligns with the guidelines, recommending EBP for refractory PDPH after conservative failure, using 15–20 mL of blood, aseptic technique, and immobilization, and reflecting the guidelines' uncertainty on timing and success rate
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
10 What Are the Long-Term Complications of PDPH, and How Should Patients Be Followed Up?
ChatGPT-4o mini commentary: The ChatGPT-4o mini response aligns with the guidelines, emphasizing PDPH complications, recommending follow-up and referrals for unresolved symptoms, and reflecting the guidelines' uncertainty on EBP efficacy and the evidence on sequelae mitigation
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
ChatGPT-4o commentary: The ChatGPT-4o response aligns with the guidelines, highlighting PDPH complications, recommending follow-up for unresolved symptoms, and reflecting the guidelines' uncertainty on EBP's role and the limited efficacy evidence
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-V3 commentary: The DeepSeek-V3 response aligns with the guidelines, highlighting PDPH complications, recommending follow-up for unresolved symptoms, and reflecting the guidelines' uncertainty on EBP efficacy
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES
DeepSeek-R1 commentary: The DeepSeek-R1 response aligns with the guidelines, emphasizing PDPH complications (chronic headache, hematoma), recommending follow-up and referrals for unresolved symptoms, and reflecting the guidelines' uncertainty on EBP efficacy
Accurate: YES, Overconclusive: NO, Supplemental: YES, Incomplete: YES

Fig. 1. Accuracy, overconclusiveness, supplementary information, and incompleteness of ChatGPT and DeepSeek recommendations compared with the guidelines

The differences between ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and DeepSeek-R1 in accuracy (p = 1), overconclusiveness (p = 1), supplementary information (p = 1), and completeness (p = 0.729) were not statistically significant. A detailed overview of the ChatGPT and DeepSeek responses is provided in Supplementary Table 1.

On the 5-point Likert scale, ChatGPT-4o, ChatGPT-4o mini, and DeepSeek-R1 achieved a mean accuracy rating of 5.0, while DeepSeek-V3 scored 4.93; the difference was not statistically significant (p = 0.108). Similarly, mean completeness scores were 4.03 for ChatGPT-4o mini, 4.23 for ChatGPT-4o, 4.17 for DeepSeek-V3, and 4.23 for DeepSeek-R1, again with no statistically significant differences between the groups (p = 0.695). These results suggest comparable performance across all models on both accuracy and completeness metrics (Table 2, Fig. 2). Fleiss' kappa values of 0.492 for accuracy and 0.759 for completeness indicated moderate and substantial agreement among the raters, respectively (Table 3).

Table 2.

Accuracy and completeness scores for the four models

                     ChatGPT-4o mini  ChatGPT-4o  DeepSeek-V3  DeepSeek-R1  P
Mean accuracy        5                5           4.93         5            0.108
Median accuracy      5 (5,5)          5 (5,5)     5 (5,5)      5 (5,5)      -
Mean completeness    4.03             4.23        4.17         4.23         0.659
Median completeness  4 (3,5)          4 (4,5)     4 (4,5)      4 (4,5)      -

Fig. 2. The accuracy, completeness, and reliability of ChatGPT and DeepSeek

Table 3.

Results of Fleiss’ kappa analysis

Item          Fleiss' kappa  Interpretation         P
Accuracy      0.492          Moderate agreement     < 0.05
Completeness  0.759          Substantial agreement  < 0.05

Interpretation of Fleiss' kappa values: poor agreement (< 0.01); slight agreement (0.01–0.20); fair agreement (0.21–0.40); moderate agreement (0.41–0.60); substantial agreement (0.61–0.80); almost perfect agreement (0.81–1.00)

Discussion

PDPH, a recognized complication of spinal anesthesia and neuraxial procedures, poses significant clinical challenges due to its impact on patient recovery and quality of life. Despite its prevalence, comprehensive practice guidelines for the perioperative and interventional management of PDPH were long absent [11]. The current multisociety practice guidelines were developed to fill this gap and to provide comprehensive, patient-centered recommendations for the prevention, diagnosis, and management of PDPH. As various AI systems are applied in medical settings, the reliability of AI-generated content needs to be fully assessed. This study evaluates the concordance of ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and the iterative reasoning model DeepSeek-R1 with the 2023 international consensus guidelines for PDPH. By rigorously assessing the concordance of these models with international standards, our research provides critical evidence for the potential role of AI in perioperative medicine. The findings highlight both the strengths and limitations of LLMs in supporting clinical decision-making, emphasizing the need for careful integration and human oversight to ensure optimal patient care.

The responses from all four models demonstrated high consistency with the guidelines. None of the models exhibited overly conclusive tendencies. While slight differences existed among the models in terms of information supplementation and completeness, these differences were not statistically significant. However, the AI responses exhibited a high degree of incompleteness (60–80%), suggesting that the models may prioritize brevity over comprehensiveness. This raises concerns that relying solely on AI outputs might undermine the systematic value of guidelines. For instance, omitting critical details on preventing adverse reactions during epidural blood patch (EBP) procedures could lead to underutilization of guideline recommendations and introduce clinical treatment deviations.

The evaluated AI models demonstrated perfect accuracy in adhering to the guidelines, indicating their reliability in providing medically sound information. This consistency is crucial because clinical guidelines represent evidence-based consensus from medical experts and serve as a cornerstone of standardized care. The absence of overconclusiveness further suggests that these models are cautious in their recommendations, avoiding unwarranted assertions when guidelines are inconclusive. For the 10 questions in the guidelines, AI's ability to provide supplementary information beyond the guidelines presents both opportunities and challenges. In our analysis, DeepSeek-V3 provided additional detail in 90% of responses, and the other models, including DeepSeek-R1, showed a 100% supplementary content rate. This extra information can enhance clinical decision-making by providing broader context or recent research insights not yet incorporated into guidelines. For instance, AI might elaborate on the pathophysiology of PDPH or discuss emerging therapies under investigation, which could be valuable for patient education or for clinicians seeking updated knowledge. However, this supplementary information must be carefully evaluated. As noted by Giuffrè et al., the accuracy and relevance of such details can vary, and there is a risk of information overload or distraction from guideline-centered care [7]. Future research should explore how to optimize the balance between guideline adherence and informative supplementation in AI responses to maximize their utility in clinical settings.

However, the incompleteness of all the models raises concerns, especially as ChatGPT-4o gave incomplete answers to 8 of the 10 questions. In clinical practice, omitting key details from guidelines could lead to suboptimal decision-making, especially in complex cases where comprehensive information is essential; for instance, gaps in management protocols could result in inadequate treatment plans. Incompleteness does not conflict with the provision of supplementary information: an AI answer may omit content covered by the guidelines while adding content the guidelines do not cover. Incompleteness and supplementary information co-occur when AI, while adding details not in the guidelines, overlooks specific recommendations on PDPH prevention, management, and follow-up. For instance, it might miss contraindications for certain treatments or essential steps in diagnostic procedures, despite providing extra context or emerging research. Such oversights could lead to incomplete clinical pictures, potentially affecting care quality.

We conducted a thorough evaluation of the accuracy and completeness of the AI-generated answers from the two systems. The results demonstrated a high level of consistency between the two systems, suggesting that their outputs are reliable and robust. This consistency, combined with the systems' performance metrics, indicates that AI-generated answers from these systems could serve as a valuable and trustworthy reference in clinical practice. Such findings underscore the potential utility of AI tools in supporting clinical decision-making processes. Beyond these findings, it is important to note that the guidelines provide corresponding references for the content of related questions, whereas AI responses often lack such references. This discrepancy may stem from the fact that AI models are trained on vast amounts of data but do not directly incorporate or cite specific sources in their outputs. Many patients and healthcare professionals increasingly rely on AI for quick answers, making the accuracy and reliability of AI responses critical. However, the absence of references in AI outputs can limit the transparency and traceability of the information provided, potentially affecting users' trust in the responses [1].

While AI models like ChatGPT and DeepSeek demonstrate strong performance in aligning with guidelines, their responses often lack the depth and granularity found in guideline recommendations. For instance, guidelines typically include detailed protocols, contraindications, and evidence-based justifications, whereas AI responses may offer broader insights without the same level of specificity. This gap highlights the need for future advancements in AI systems to incorporate structured references or citations directly into their outputs, enhancing the credibility and usability of their recommendations.

Comparing our results with existing literature provides valuable context. Previous studies assessing AI in different medical domains have reported similar patterns. For example, Mejia et al. found that ChatGPT's recommendations for lumbar disc herniation treatment deviated from clinical guidelines in certain aspects, highlighting the need for careful validation before clinical deployment [12]. Similarly, Ponzo et al. observed that while AI chatbots can offer useful nutritional advice, their responses often lack the completeness required for reliable patient guidance [13]. These parallels suggest that incompleteness may be a common challenge for AI systems across various medical applications, potentially stemming from limitations in training data or knowledge representation frameworks.

Notably, both ChatGPT-4o mini and DeepSeek-R1 showed strong clinical support capabilities. In our research, ChatGPT-4o mini provided more comprehensive answers than ChatGPT-4o, which is contrary to previous findings [14]. This discrepancy may be related to differences in model optimization strategies, training data, or the evaluation criteria used across studies.

Our study also revealed that DeepSeek, a Chinese-developed AI, did not incorporate traditional Chinese medicine (TCM) techniques specific to China when addressing questions related to the prevention and treatment of PDPH. This indicates that AI responses are not influenced by cultural or geographical factors. Such objectivity is crucial for ensuring the universality and reliability of AI systems in providing medical advice across different cultural contexts. It underscores the AI's capability to adhere strictly to the international consensus practice guidelines, regardless of its origin.

The implications of our findings for clinical practice are significant. Healthcare professionals considering AI as a decision-support tool must be aware of its strengths and limitations. The current generation of AI models can serve as a quick reference for guideline-aligned information, particularly when time constraints or resource limitations hinder thorough manual research. However, clinicians should complement AI suggestions by conducting systematic literature reviews in established medical databases, especially when critical details are missing [15]. If an AI recommendation omits important contraindications for a particular treatment, the clinician's experience becomes vital in preventing adverse events.

Given that ChatGPT-4o mini and DeepSeek are available for free use, they represent accessible and reliable tools for healthcare professionals seeking guideline-aligned support in clinical decision-making. Their strong performance in accuracy and supplementary information, combined with their free availability, positions them as promising candidates for integration into medical workflows.

The study has several limitations. First, AI output is variable: although we initiated a new dialogue for each question, some variability is inevitable given the inherent unpredictability of AI models across dialogues. Second, our evaluation of AI responses is subjective, which may introduce inconsistencies into the results. Finally, the study included only ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and DeepSeek-R1; newer versions may improve clinical applicability, and future work should include a broader range of AI models for comparison.

In the future, several research directions and practical applications of AI in clinical settings deserve attention. AI model improvements can tackle output incompleteness, while integrating medical knowledge bases and enhancing information retrieval can ensure guideline coverage. Collaboration between developers and professionals can create specialized models for specific fields, while multi-center studies can provide stronger evidence of AI's effectiveness and safety. Additionally, real-time scoring metrics could be embedded in clinical workflows, such as a "COMPL Index" that triggers alerts for human review when guideline adherence falls below 90%, as sketched below. AI integration with hospital information systems could generate risk alerts based on laboratory data, for example suggesting alternative treatments when low platelet counts are detected. To ensure safety, AI systems should require attending physicians to electronically sign off on recommendations before implementation. These approaches can enhance clinical decision-making while maintaining ethical standards.
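As a minimal sketch of the alerting idea above: the snippet computes a hypothetical "COMPL Index" (the fraction of applicable guideline items covered by an AI response) and flags the response for human review below a 90% threshold. The index name, threshold, item names, and data structures are illustrative assumptions, not an implemented system.

```python
# Hypothetical "COMPL Index": fraction of applicable guideline items that an
# AI response covers. Name, threshold, and inputs are illustrative only.
REVIEW_THRESHOLD = 0.90

def compl_index(covered_items: set[str], applicable_items: set[str]) -> float:
    if not applicable_items:
        return 1.0
    return len(covered_items & applicable_items) / len(applicable_items)

def needs_human_review(covered_items: set[str], applicable_items: set[str]) -> bool:
    return compl_index(covered_items, applicable_items) < REVIEW_THRESHOLD

# Example: the guideline lists five items for EBP; the AI response covers four.
guideline = {"indication", "timing", "blood_volume", "aseptic_technique", "contraindications"}
response = {"indication", "timing", "blood_volume", "aseptic_technique"}
print(compl_index(response, guideline))         # 0.8
print(needs_human_review(response, guideline))  # True -> route to a clinician
```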

Conclusion

This comparative study demonstrates that ChatGPT-4o, ChatGPT-4o mini, DeepSeek-V3, and DeepSeek-R1 achieve high guideline-adherence accuracy in their post-dural puncture headache (PDPH) management recommendations without exhibiting overconclusiveness. ChatGPT-4o mini and DeepSeek are available free of charge, offering great convenience to both medical workers and patients. However, critical limitations persist: all models omitted some guideline content and provided supplementary suggestions without supporting references. These findings underscore artificial intelligence's potential as an accuracy-preserving clinical decision support tool. Notably, the observed pattern of selective guideline coverage coupled with unsubstantiated auxiliary content necessitates mandatory human verification in clinical applications, emphasizing the indispensable role of expert oversight in AI-assisted medical decision-making.

Supplementary Information

Supplementary Material 1 (95.2KB, docx)

Acknowledgements

We used the AI tools ChatGPT and DeepSeek to answer the questions in the research survey; the full answers are provided in the supplementary materials.

Abbreviations

AI

Artificial intelligence

PDPH

Post-dural puncture headache

LLMs

Large language models

Authors’ contributions

JY-D, XQ, and HB-W wrote the manuscript. JY-D analyzed and plotted the data. CQ-D, LX, XX-D, SY-Y, QH-L, TM, SC, YL-W, and JL-S validated the data. FF-H and LY supervised all parts.

Funding

This study received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data availability

The data are available from the corresponding author on request: yuliang@hospital.westlake.edu.cn.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jiayi Deng and Xu Qiu contributed equally to this work.

Contributor Information

Feifang He, Email: hefeifang@zju.edu.cn.

Hanbin Wang, Email: wanghanbin@hospital.westlake.edu.cn.

Liang Yu, Email: yuliang@hospital.westlake.edu.cn.

References

  1. Stokel-Walker C, Van Noorden R. What ChatGPT and generative AI mean for science. Nature. 2023;614(7947):214–6. 10.1038/d41586-023-00340-6.
  2. Goh E, Gallo R, Hom J, Strong E, Weng Y, Kerman H, Cool JA, Kanjee Z, Parsons AS, Ahuja N, Horvitz E, Yang D, Milstein A, Olson A, Rodman A, Chen JH. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open. 2024;7(10):e2440969. 10.1001/jamanetworkopen.2024.40969.
  3. Temsah A, Alhasan K, Altamimi I, Jamal A, Al-Eyadhy A, Malki KH, Temsah MH. DeepSeek in Healthcare: Revealing Opportunities and Steering Challenges of a New Open-Source Artificial Intelligence Frontier. Cureus. 2025;17(2):e79221. 10.7759/cureus.79221.
  4. Schyns-van den Berg A, Gupta A. Postdural puncture headache: Revisited. Best Pract Res Clin Anaesthesiol. 2023;37(2):171–87. 10.1016/j.bpa.2023.02.006.
  5. Kuczkowski KM. Post-dural puncture headache in the obstetric patient: an old problem, new solutions. Minerva Anestesiol. 2004;70(12):823–30.
  6. Uppal V, Russell R, Sondekoppam R, Ansari J, Baber Z, Chen Y, DelPizzo K, Dîrzu DS, Kalagara H, Kissoon NR, Kranz PG, Leffert L, Lim G, Lobo CA, Lucas DN, Moka E, Rodriguez SE, Sehmbi H, Vallejo MC, Volk T, Narouze S. Consensus Practice Guidelines on Postdural Puncture Headache From a Multisociety, International Working Group: A Summary Report. JAMA Netw Open. 2023;6(8):e2325387. 10.1001/jamanetworkopen.2023.25387.
  7. Giuffrè M, Kresevic S, You K, Dupont J, Huebner J, Grimshaw AA, Shung DL. Systematic review: The use of large language models as medical chatbots in digestive diseases. Aliment Pharmacol Ther. 2024;60(2):144–66. 10.1111/apt.18058.
  8. Mejia MR, Arroyave JS, Saturno M, Ndjonko L, Zaidat B, Rajjoub R, Ahmed W, Zapolsky I, Cho SK. Use of ChatGPT for Determining Clinical and Surgical Treatment of Lumbar Disc Herniation With Radiculopathy: A North American Spine Society Guideline Comparison. Neurospine. 2024;21(1):149–58. 10.14245/ns.2347052.526.
  9. Chen YQ, Yu T, Song ZQ, Wang CY, Luo JT, Xiao Y, Qiu H, Wang QQ, Jin HM. Application of Large Language Models in Drug-Induced Osteotoxicity Prediction. J Chem Inf Model. 2025. 10.1021/acs.jcim.5c00275.
  10. Duey AH, Nietsch KS, Zaidat B, Ren R, Ndjonko L, Shrestha N, Rajjoub R, Ahmed W, Hoang T, Saturno MP, Tang JE, Gallate ZS, Kim JS, Cho SK. Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations. Spine J. 2023;23(11):1684–91. 10.1016/j.spinee.2023.07.015.
  11. Uppal V, Russell R, Sondekoppam RV, Ansari J, Baber Z, Chen Y, DelPizzo K, Dirzu DS, Kalagara H, Kissoon NR, Kranz PG, Leffert L, Lim G, Lobo C, Lucas DN, Moka E, Rodriguez SE, Sehmbi H, Vallejo MC, Volk T, Narouze S. Evidence-based clinical practice guidelines on postdural puncture headache: a consensus report from a multisociety international working group. Reg Anesth Pain Med. 2024;49(7):471–501. 10.1136/rapm-2023-104817.
  12. Seas A, Abd-El-Barr MM. Commentary on "Use of ChatGPT for Determining Clinical and Surgical Treatment of Lumbar Disc Herniation With Radiculopathy: A North American Spine Society Guideline Comparison." Neurospine. 2024;21(1):159–61. 10.14245/ns.2448248.124.
  13. Ponzo V, Rosato R, Scigliano MC, Onida M, Cossai S, De Vecchi M, Devecchi A, Goitre I, Favaro E, Merlo FD, Sergi D, Bo S. Comparison of the Accuracy, Completeness, Reproducibility, and Consistency of Different AI Chatbots in Providing Nutritional Advice: An Exploratory Study. J Clin Med. 2024;13(24). 10.3390/jcm13247810.
  14. Wang S, Wang Y, Jiang L, Chang Y, Zhang S, Zhao K, Chen L, Gao C. Assessing the clinical support capabilities of ChatGPT 4o and ChatGPT 4o mini in managing lumbar disc herniation. Eur J Med Res. 2025;30(1):45. 10.1186/s40001-025-02296-x.
  15. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, Wang Y, Dong Q, Shen H, Wang Y. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017;2(4):230–43. 10.1136/svn-2017-000101.


