Scientific Reports. 2025 Oct 28;15:37600. doi: 10.1038/s41598-025-21065-8

International survey-based assessment of the reliability, validity, and interpretability of the TDN grade for neurosurgical adverse events

Alexis Paul Romain Terrapon 1,2, Vincens Kälin 1, Anna Maria Zeitlberger 1, Jonathan Weller 3, Cédric Kissling 2, Nicolas Neidert 4, Malte Mohme 5, Ahmed El-Garci 6,7, Tareq A Juratli 8, Philip Dao Trong 9, Martin N Stienen 1, Isabel Charlotte Hostettler 1,10, Morgan Broggi 11, Johannes Sarnthein 12, Luca Regli 12, Oliver Bozinov 1, Marian Christoph Neidert 1; TDN Study Group
PMCID: PMC12569239  PMID: 41152337

Abstract

Neurosurgical adverse events (AE) are frequent and may have dramatic consequences on quality of life. The lack of a standardized classification of their severity hinders evaluation and improvement of the safety of procedures. The Therapy-Disability-Neurology (TDN) grade, validated in 2021 on 6071 interventions, overcomes limitations of previous grading-systems by addressing the severity of neurologic and disabling AEs. The aim of the current study is to assess the reliability, validity and applicability of the TDN grade. We conducted an online survey involving participants with varying levels of neurosurgical expertise. Participants assessed the TDN grade for 16 case vignettes and reviewed the validity, interpretability, logicality, simplicity, and usefulness of the grading-system. The TDN grade showed substantial inter-rater (α = 0.66) and intra-rater (α = 0.79) reliability. Most participants recommended reporting its separate dimensions, which demonstrated substantial to almost perfect reliability (inter-rater: α = 0.74; intra-rater: α = 0.85). Online calculation tools significantly improved agreement and participants’ scores. The TDN grade was considered fairly useful, very logical, fairly simple to use and interpret, and its separate dimensions were considered a very valid measure of the severity of AEs. Neurosurgical AEs should be systematically reported, and surveyed neurosurgeons recommend the use of the TDN grade along its separate dimensions for this purpose.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-21065-8.

Keywords: Cranial, Classification, Complication, Complications, Outcome, Spinal

Subject terms: Outcomes research, Medical ethics, Neurology

Introduction

Despite all precautions and cutting-edge technological developments to guarantee the safety of neurosurgery, up to one-fifth of patients suffer perioperative adverse events (AE)1. To compare and improve the safety and quality of neurosurgical procedures, AEs must be monitored and documented in a standardized manner.

There is a growing effort to unify the reporting of neurosurgical AEs, and many grading systems have been proposed in the last few years2–7. The Clavien-Dindo-Grading Scheme (CDG)8–10 is best established in general surgery and has been slightly adapted for neurosurgery into the Landriel-Ibañez classification (LIC)2. Both the CDG and the LIC grade the severity of AEs mainly by the extent of therapy used to counteract them. However, CDG and LIC may not adequately reflect the severity of AEs that often cannot be treated. For example, severe complications of neurosurgical procedures, such as infarction with hemiparesis, are classified as low-grade AEs (usually grade 1) despite their significant impact on quality of life. This limitation forces neurosurgeons to choose between reporting AEs using classifications that fail to capture severity, describing them in a purely qualitative manner, or using broad terms such as “minor” or “major”, none of which allows for meaningful impact analysis.

In 2021, a new tool to classify the severity of neurosurgical AEs was introduced11. It integrates the therapy used to treat the AE, the occurrence of new neurologic deficits, and the disability that resulted from the AE (Fig. 1, Table 1). The Therapy-Disability-Neurology (TDN) grade was validated on 6071 interventions from the broad neurosurgical spectrum and was shown to correlate with the length of hospital stay, treatment cost, and deterioration of functional status at discharge as well as follow-up11. The aim of this study is to evaluate the inter- and intra-rater reliability of the TDN grade, as well as its subjective validity, interpretability, logicality, simplicity and usefulness as perceived by neurosurgeons from different countries and experience levels.

Fig. 1.

The Therapy-Disability-Neurology (TDN) grade. The TDN grade as well as its separate dimensions can be measured using this flowchart. First, choose the severity of the Therapy (T) dimension (based on the Clavien-Dindo-Classification, CDG, and Landriel-Ibañez Classification, LIC)2,10 and follow the arrow. Second, answer the question of the Disability (D) dimension (based on the modified Rankin Scale, mRS)40: follow the green arrow to the next dimension if the answer is “no”, and the red arrow if the answer is “yes”. Third, answer the question of the Neurology (N) dimension and follow the corresponding arrow to get the overall TDN grade. To report dimensions separately, use the flowchart separately for each dimension while ignoring the other dimensions (or use Table 1). Grade 5 adverse events always correspond to T5D5N2.

Table 1.

The Therapy-Disability-Neurology grade for neurosurgical adverse events.

TDN | Definition | Therapy | Disability | Neurology
TDN 1 | Any adverse event without the need for a treatment or an intervention, that does not impact daily life activities, and does not result in any new neurologic deficit. Allowed therapeutic modalities are drugs such as antiemetics, antipyretics, analgesics, diuretics and electrolytes, as well as physiotherapy and bedside opening of wound infections | T1 (CDG I, LIC Ia) | D1 (mRS 0–1) | N1 (no deficit)
TDN 2 | Any adverse event requiring pharmacological treatment (including blood transfusions and total parenteral nutrition), or hindering at least one activity of daily living, or resulting in a new neurologic deficit | T2 (CDG II, LIC Ib) | D2 (mRS 2–3) | N2 (any new deficit)
TDN 3 | Any adverse event requiring an invasive procedure, hindering walking, or preventing the patient from attending to their own bodily needs | T3 (CDG III, LIC II) | D3 (mRS 4) | -
TDN 4 | Any life-threatening adverse event requiring management in intensive care, or leaving the patient bedridden, in need of constant help, or incontinent | T4 (CDG IV, LIC III) | D4 (mRS 5) | -
TDN 5 | Any adverse event resulting in the death of a patient | T5 (CDG V, LIC IV) | D5 (mRS 6) | -

Each dimension of the Therapy-Disability-Neurology (TDN) grade can be measured separately11. The “Therapy” dimension is graded from T1 to T5 based on the Clavien-Dindo classification (CDG)10 and the Landriel-Ibañez Classification (LIC)2. The “Disability” dimension is graded D1 to D5 based on the modified Rankin Scale (mRS)40. The “Neurology” dimension is based on a binary definition (N1 = no new neurologic deficit, N2 = any new neurologic deficit). The dimensions of disability and neurology should only be considered when these have deteriorated since the preoperative state. This impairment must result from the adverse event (AE). The TDN grade is equal to the worst dimension (e.g. T3D4N1 = TDN 4). Grade 5 AEs are always TDN 5, T5D5N2.
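
As a minimal sketch of the rule stated in this footnote (the overall grade equals the worst dimension), the following Python snippet derives the TDN grade from its three dimensions. The class name `TDN`, its field names, and the strict check for grade 5 are illustrative assumptions and are not part of the published calculation tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TDN:
    """Hypothetical container for one graded adverse event."""
    therapy: int     # T1..T5
    disability: int  # D1..D5
    neurology: int   # N1 = no new deficit, N2 = any new deficit

    def __post_init__(self):
        if not 1 <= self.therapy <= 5:
            raise ValueError("therapy must be between 1 and 5")
        if not 1 <= self.disability <= 5:
            raise ValueError("disability must be between 1 and 5")
        if self.neurology not in (1, 2):
            raise ValueError("neurology must be 1 or 2")
        # Assumption: a lethal event is only accepted as T5D5N2,
        # following "Grade 5 AEs are always TDN 5, T5D5N2".
        dims = (self.therapy, self.disability, self.neurology)
        if 5 in dims[:2] and dims != (5, 5, 2):
            raise ValueError("grade 5 adverse events are always T5D5N2")

    @property
    def grade(self) -> int:
        """Overall TDN grade: the worst (highest) of the three dimensions."""
        return max(self.therapy, self.disability, self.neurology)

    @property
    def label(self) -> str:
        """Separate dimensions in the notation used in the text."""
        return f"T{self.therapy}D{self.disability}N{self.neurology}"


event = TDN(therapy=3, disability=4, neurology=1)
print(event.label, "= TDN", event.grade)
```

Keeping the three dimensions in one object makes it straightforward to report both the overall grade and the separate dimensions, as recommended by most survey participants.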

Methods

Design

The TDN grade, developed and validated in 2021 on 6,071 neurosurgical interventions11, was evaluated in this study through a two-part online survey to assess its reliability, subjective validity, and other user-perceived properties. We conducted a longitudinal study based on two online surveys created using SurveyMonkey (SurveyMonkey Inc)12. Participants were recruited by email by the authors or by other participants and were free to take part in only the first survey or in both surveys. Only physicians or medical students involved in neurosurgery were eligible. Participants were recruited from multiple countries to maximize the number of respondents and to evaluate the applicability of the TDN grade across different healthcare systems and training backgrounds, thereby improving the generalizability of our results. The survey was divided into five sections (see Supplementary Information 5 [Survey Part 1] online). The first three sections delineated the study’s background and objectives, obtained electronic consent and demographic data (level of neurosurgical expertise, country of residence), and explained the functioning of the TDN grade with a flowchart similar to Fig. 1 along with links to online calculation tools13–15. The fourth section contained 16 case vignettes from 7 specific neurosurgical sub-specialties. Each vignette described the setting, the patient’s preoperative state, the intra- and postoperative course, and the postoperative state. In each case, an AE occurred intra- or postoperatively, and participants were asked to calculate its severity according to each separate dimension as well as the overall TDN grade. No lethal AEs were included, as they are always graded as TDN 5 (T5D5N2).

The vignettes were developed based on the authors’ clinical experience, with particular attention to including various neurosurgical subspecialties as well as various types of adverse events, especially those frequently misreported (e.g., electrolyte imbalance requiring substitution, which is often classified as grade 2 instead of grade 1). No formal external validation was performed, but the vignette-based approach and rating method followed established practices from prior reliability studies of AE grading systems4,5,10. After each case vignette, participants were asked how well they considered that the TDN grade reflected the severity of the AE in the particular case (0 = very badly to 100 = very well). All participants received the same case vignettes but in a random order to reduce habituation and order bias. The last section contained general questions about the grading of neurosurgical AEs (Table 2) to assess the subjective validity, interpretability, logicality, simplicity, and usefulness of the grading system.

Table 2.

Subjective validity, interpretability, logicality, simplicity and usefulness of the TDN grade.

Question | First survey | Second survey
Please rate how the TDN grade reflects the severity of adverse events in your opinion (0–100) | Median: 75; IQR: 22.5; N: 84 | Median: 76; IQR: 21.5; N: 59
Please rate how the TDN dimensions reflect the severity of adverse events in your opinion (0–100) | Median: 80; IQR: 30.25; N: 84 | Median: 81; IQR: 16.0; N: 59
Please rate the interpretability of the TDN grade (0–100) | Median: 65; IQR: 33.5; N: 84 | Median: 70; IQR: 26.5; N: 59
Please rate the logicality of the TDN grade (0–100) | Median: 73; IQR: 28.0; N: 84 | Median: 80; IQR: 20.0; N: 59
Please rate the simplicity of the TDN grade (0–100) | Median: 61.5; IQR: 47.0; N: 82 | Median: 70; IQR: 28.5; N: 59
Please rate the usefulness of the TDN grade (0–100) | Median: 75; IQR: 35.5; N: 84 | Median: 75; IQR: 25.0; N: 59
Would you recommend the reporting of the TDN grade as separate dimensions to increase interpretability? (Yes/No) | Yes: 53, 88.3%; No: 7, 11.7% | Yes: 50, 83.3%; No: 10, 20.0%
Do you know of an adverse event grading system better suited for neurosurgery? (Yes/No) | Yes: 4*, 4.8%; No: 80, 95.2% | Yes: 0, 0.0%; No: 60, 100.0%
Would you recommend to use the TDN grade to classify the severity of adverse events in neurosurgery? (Yes/No) | NA | Yes: 53, 88.3%; No: 7, 11.7%

The last section of each survey contained the abovementioned questions. The questions were chosen to reflect measurement properties inspired from the Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) Taxonomy of Measurement Properties16.

*Of the four participants that declared to know an AE grading system better suited for neurosurgery, one proposed the Common Terminology Criteria for Adverse Events (CTCAE) that was developed for medical AEs following chemotherapy, two proposed the Clavien-Dindo classification (CDG) and one the modified Rankin Scale (mRS) which are both part of the Therapy-Disability-Neurology (TDN) grade. Interquartile Range (IQR), N (Number of participants).

The questions were chosen to reflect measurement properties inspired by the Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) Taxonomy of Measurement Properties16. Participants could provide their email addresses in order to participate in the intra-rater reliability assessment. The second part of the survey (see Supplementary Information 6 [Survey Part 2] online) was sent by email 6 weeks after completion of the first. The second survey was similar to the first and included the same case vignettes (detailed in Supplementary Information 7 [Table—Case vignettes and authors answers] online), in another random order. In the second survey, participants were asked if they used online tools to measure the severity of AEs. This question was added to the first survey only during the course of the study. On the 7th of January 2024, a brief explanation of the TDN grade was provided to ChatGPT 3.5 (OpenAI)17. Next, we provided the 16 case vignettes to the chatbot, which was prompted to determine the TDN grade for each case (see Supplementary Information 3 [Conversation with Chat GPT] online).

Statistical analysis

All statistical analyses and figures were computed using the statistical programming language R (R Core Team)18,19. When a participant did not answer a question, they were excluded from the corresponding analysis (pairwise deletion). We report means with standard deviations (SD) and medians with interquartile ranges (IQR). Inter- and intra-rater reliability were measured using Krippendorff’s α, reported with the 95% confidence interval (CI) computed by bootstrapping (1000 samples)20. Agreement coefficients (α) were interpreted as follows: 0.0–0.2: slight agreement; 0.2–0.4: fair agreement; 0.4–0.6: moderate agreement; 0.6–0.8: substantial agreement; 0.8–1.0: almost perfect agreement21. The performance of each participant (including ChatGPT) was scored by comparing their answers with the authors’ answers (percentage of correct answers, from 0 to 100%); unanswered questions were not considered in the scoring. The authors’ answers corresponded to the grading that the authors considered correct when designing the vignettes, based on their application of the TDN criteria. These predetermined answers served as the reference standard (“correct” answers) for scoring. The correlation between duration of neurosurgical training and scores was measured with Spearman’s rank correlation coefficient rho, which is reported with the proportion of shared variance Rs². Since the data were not normally distributed, medians were compared using the Wilcoxon rank sum test. P-values were corrected for multiple comparisons using the Benjamini–Hochberg procedure and were considered statistically significant when below 0.05. When participants were asked to evaluate the properties of the grading system from 0 to 100, results were interpreted as follows (e.g. for usefulness): 0–20: not (useful), 20–40: not very (useful), 40–60: moderately (useful), 60–80: fairly (useful), 80–100: very (useful).
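
For readers unfamiliar with Krippendorff’s α, the following Python sketch shows how the coefficient can be computed as 1 − D_o/D_e (observed over expected disagreement) and how a percentile bootstrap interval can be obtained. It is not the R code used in this study. The choice of difference function (nominal or interval), the resampling of items with replacement, the function names, and the example ratings are all assumptions made for illustration.

```python
import random
from itertools import permutations


def nominal(a, b):
    """Squared difference for unordered categories: 0 if equal, else 1."""
    return 0.0 if a == b else 1.0


def interval(a, b):
    """Squared difference for numeric grades treated as equally spaced."""
    return float((a - b) ** 2)


def krippendorff_alpha(units, delta=nominal):
    """Krippendorff's alpha = 1 - D_o / D_e.

    `units` is a list of lists: the ratings each item (e.g. a vignette)
    received. Missing ratings are simply left out of the inner list.
    """
    pairable = [u for u in units if len(u) >= 2]  # items rated at least twice
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two ratings")
    # Observed disagreement: ordered pairs of ratings within each item.
    d_o = sum(
        sum(delta(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n
    # Expected disagreement: ordered pairs across all pairable ratings.
    d_e = sum(delta(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings identical; alpha is undefined")
    return 1.0 - d_o / d_e


def bootstrap_ci(units, delta=nominal, n_boot=1000, level=0.95, seed=0):
    """Percentile CI from resampling items with replacement (an assumption)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(units) for _ in units]
        try:
            stats.append(krippendorff_alpha(sample, delta))
        except ValueError:
            continue  # skip degenerate resamples without any variation
    if not stats:
        raise ValueError("no valid bootstrap resample")
    stats.sort()
    lo = stats[int((1 - level) / 2 * len(stats))]
    hi = stats[max(int((1 + level) / 2 * len(stats)) - 1, 0)]
    return lo, hi


# Hypothetical ratings: 12 items, each graded by two or three raters.
units = [[2, 2], [1, 1], [3, 3], [3, 3, 4], [4, 4, 4], [1, 3],
         [2, 2], [1, 1], [1, 1], [3, 3], [3, 3], [3, 4]]
print(round(krippendorff_alpha(units, nominal), 3))
print(round(krippendorff_alpha(units, interval), 3))
print(bootstrap_ci(units, nominal))
```

Passing the difference function as a parameter keeps the disagreement formula unchanged while allowing grades to be treated either as categories or as numeric levels.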

Ethical considerations

This study was reported in accordance with the Consensus-Based Checklist for Reporting of Survey Studies (CROSS, see Supplementary Information 1 [CROSS Checklist] online)22. All participants provided informed electronic consent. Participants had the choice to remain anonymous or to provide contact data in order to participate in the second part of the study. This survey was coordinated from Switzerland and involved only the evaluation of case vignettes, without collection of patient data or health-related personal data. According to Article 2 of the Swiss Human Research Act (HFG), such research falls outside the scope of ethics committee review. The institutional review board (IRB) of Eastern Switzerland confirmed this by waiving the need for approval (BASEC-ID Req-2024-01086, EKOS 24/156, 15 August 2024). In line with jurisdictional guidance, no additional ethics approvals were required in Germany, the UK, the US, or Italy.

Results

Demographics and general results

A total of 91 respondents completed the questionnaire, of whom 61/91 (67%) took part in the second part after 6 weeks (median time interval 48.1 days, IQR 30.4). Of the participants, 39/91 (42.9%) were board-certified neurosurgeons; the others were neurosurgeons in training (49/91, 53.8%), residents from other specialties involved in neurosurgery (2/91, 2.2%), and one medical student (1.1%). The mean duration of neurosurgical training among participants was 6.9 years (SD 6.8), ranging from less than one year to 35 years. Participants were practicing neurosurgery in five different countries: Germany (44/91, 48.4%), Switzerland (35/91, 38.5%), Italy (7/91, 7.7%), the United Kingdom (3/91, 3.3%), and the United States (2/91, 2.2%).

Inter- and intra-rater reliability

Figure 2 provides all individual answers to the case vignettes alongside the authors’ recommended answers. There was substantial inter-rater agreement regarding the TDN grade (Krippendorff’s α = 0.66, CI: 0.50–0.79). The agreement was higher between raters who used online tools to calculate the grade (with tools: α = 0.77, CI: 0.62–0.87; no tools: α = 0.57, CI: 0.34–0.77; CI 0.19–0.21, p-value < 0.0001). The agreement regarding separate dimensions of the TDN grade was even higher (α = 0.74, CI: 0.65–0.81), and almost perfect when raters used online tools (α = 0.81, CI: 0.62–0.86). The median intra-rater reliability was substantial (median α = 0.79, IQR 0.24) for the TDN grade and almost perfect for the separate dimensions (median α = 0.85, IQR 0.14). All α values including sub-analyses are displayed in Fig. 3.
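
The verbal labels attached to these coefficients follow the bands given in the Methods. A small Python sketch of that mapping is shown here. The function name is hypothetical, and treating each upper bound as inclusive (so that exactly 0.80 counts as substantial) is an assumption, since the bands in the text share their boundaries.

```python
def interpret_alpha(alpha: float) -> str:
    """Map an agreement coefficient to its verbal label.

    Assumption: upper bounds are inclusive. Values outside 0-1 are
    rejected because the bands in the text only cover that range.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("bands are only defined between 0 and 1")
    bands = [
        (0.2, "slight"),
        (0.4, "fair"),
        (0.6, "moderate"),
        (0.8, "substantial"),
        (1.0, "almost perfect"),
    ]
    for upper, label in bands:
        if alpha <= upper:
            return label
    raise AssertionError("unreachable for values within 0-1")


# Coefficients reported in this section.
for name, value in [
    ("TDN grade, inter-rater", 0.66),
    ("TDN grade, with online tools", 0.77),
    ("TDN grade, without online tools", 0.57),
    ("separate dimensions, inter-rater", 0.74),
    ("separate dimensions, with online tools", 0.81),
    ("TDN grade, intra-rater (median)", 0.79),
    ("separate dimensions, intra-rater (median)", 0.85),
]:
    print(f"{name}: {value:.2f} -> {interpret_alpha(value)} agreement")
```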

Fig. 2.

Results of the survey (Case vignettes). Each panel corresponds to a case vignette (see Supplementary Information 7 [Table—Case vignettes and authors answers] online). Participants were asked to assess the severity of AEs using the TDN grade and its separate dimensions (Therapy: “T”, Disability: “D”, Neurology: “N”). All individual answers are displayed with black points. For each case vignette, participants were asked the question “How well does the TDN grade reflect severity in this case”. Original answers ranged from 0 to 100 and were rescaled for the figure (0 = “very badly” = red, 5 = “very well” = green). Big green circles represent the authors’ recommended answer and horizontal bars the participants’ median answer (black when concordant with the authors’ recommendations, red when discordant).

Fig. 3.

Inter- and intra-rater reliability of the TDN grade and its separate dimensions. Krippendorff’s α is provided for each sub-analysis and displayed alongside the interpretation of agreement coefficients proposed by Landis et al.21. TDN: TDN grade; T/D/N: all separate dimensions of the TDN grade together; T: Therapy dimension; D: Disability dimension; N: Neurology dimension; CI: 95% confidence interval; IQR: interquartile range.

Participant scores

The median score among all participants was 68.8% (IQR 25.0) for the TDN grade and 81.3% (IQR 12.5) for its separate dimensions. Participants who used online tools scored significantly better (median 81.3%, IQR 18.8) than participants who did not (median 62.5%, IQR 12.5; CI 0.42–25.00, p-value = 0.0288, see Fig. 4). There was only a weak correlation between the number of years of neurosurgical training and participants’ scores (rho 0.22, Rs2 4.86%, CI 0.01–0.42, p-value = 0.043), and board-certified neurosurgeons and residents had the same median score (Table 3). ChatGPT scored better than any subgroup of participants.
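
To make the scoring concrete: with 16 vignettes, a participant who matches the authors’ grade in 11 of them scores 68.75%, which is the median reported above when rounded. The Python sketch below illustrates the scoring rule from the Methods (unanswered vignettes are excluded) and the shared variance derived from Spearman’s rho. The function names and the example grades are hypothetical. Squaring the rounded rho of 0.22 gives 4.84%; the 4.86% in the text was presumably computed from the unrounded coefficient.

```python
def percent_correct(answers, reference):
    """Percentage of answered vignettes graded like the reference.

    `None` marks an unanswered vignette, which is excluded from scoring.
    """
    if len(answers) != len(reference):
        raise ValueError("answers and reference must have the same length")
    graded = [(a, r) for a, r in zip(answers, reference) if a is not None]
    if not graded:
        raise ValueError("no answered vignettes to score")
    correct = sum(a == r for a, r in graded)
    return 100.0 * correct / len(graded)


def shared_variance(rho):
    """Proportion of shared variance (in %) implied by a correlation."""
    return 100.0 * rho ** 2


# Hypothetical reference grades for 16 vignettes (not the study's answers).
reference = [1, 2, 3, 4] * 4
# A hypothetical participant who misgrades the first five vignettes.
answers = [r % 4 + 1 for r in reference[:5]] + reference[5:]

print(percent_correct(answers, reference))                # 11 of 16 correct
print(percent_correct([None] + answers[1:], reference))   # 11 of 15 answered
print(round(shared_variance(0.22), 2))
```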

Fig. 4.

Differences in participants’ score according to the use of online tools or artificial intelligence. Participant scores were measured through a comparison of their responses (overall TDN grade, not individual dimensions) with the authors’ responses (percentage of correct answers), with unanswered questions excluded from the scoring. The difference in the median score of participants who did or did not use online calculating tools was compared using the Wilcoxon rank sum test. Violin plots present the density of scores, where broader sections indicate higher density. Boxplots within each violin offer key summary statistics (median, quartiles, outliers). Furthermore, after receiving instructions concerning the functioning of the TDN grade, ChatGPT 3.5 (OpenAI)17 was prompted to determine the grade of AEs for each case vignette, and its score was calculated in the same way as for other participants.

Table 3.

Baseline characteristics and comparison of results.

Characteristic | Residents | Board-certified | P-value
Survey 1: N participants | 52 (57.1%) | 39 (42.9%) | -
Survey 2: N participants | 37 (60.7%) | 24 (39.3%) | -
Mean years of experience (SD) | 2.6 (1.9) | 12.4 (7.0) | -
Median score (IQR) | 68.75 (19.4) | 68.75 (25.0) | .9379
Inter-rater reliability | α = 0.69 | α = 0.64 | < .0001
Intra-rater reliability | α = 0.79 | α = 0.80 | .9379

This table presents the baseline characteristics and results of the two subgroups. Inter-rater reliability was slightly higher for residents, while overall scores and intra-rater reliability were similar between groups. To simplify interpretation, the medical student was included in the “residents” subgroup. Scores were computed by comparing participants’ answers with those of the authors (percentage of correct answers, ranging from 0 to 100%), with unanswered questions excluded from scoring. Inter- and intra-rater reliability were assessed using Krippendorff’s α. The difference in α between groups was evaluated using bootstrapping, with confidence intervals (CI) derived from the bootstrap distribution. P-values for both score and reliability comparisons were obtained using the Wilcoxon rank sum test.
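
The Methods state that p-values were corrected for multiple comparisons with the Benjamini–Hochberg procedure. A minimal Python sketch of that adjustment follows. The function name and the example p-values are hypothetical and are not the study’s data. The example also shows that the adjustment can assign identical corrected values to different tests.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is multiplied by m / rank (rank 1 = smallest p-value),
    then a running minimum from the largest rank downwards keeps the
    adjusted values monotone and capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    verdict = "significant" if q < 0.05 else "not significant"
    print(f"raw {p:.3f} -> adjusted {q:.3f} ({verdict})")
```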

Validity and applicability

For each case vignette, participants rated on a scale from 0 (“very badly”) to 100 (“very well”) how accurately they thought the TDN grade reflected the severity of the AE, providing a measure of the system’s face validity. For each participant, the median validity among all 16 case vignettes was computed. The median subjective validity of the TDN grade as evaluated by the whole cohort was 79.0% (IQR 19.6), indicating that the grading system was generally perceived as a faithful representation of AE severity in most cases. At the end of the survey, participants were asked general questions about the validity, interpretability, logicality, simplicity and usefulness of the TDN grade. All results are shown in Table 2.

Discussion

Key results

The inter-rater reliability analysis showed substantial agreement regarding the TDN grade and its separate dimensions. The use of online calculation tools increased inter-rater reliability for both the TDN grade and its separate dimensions (reaching almost perfect agreement for the latter) and significantly increased participants’ scores, but artificial intelligence was superior to all subgroups. Residents performed as well as board-certified neurosurgeons in grading AEs correctly, and we found only a weak correlation between years of training and participants’ scores. The median intra-rater reliability was substantial for the TDN grade and almost perfect for the separate dimensions. Participants who took part in both surveys found that the TDN dimensions were a very valid reflection of AE severity and recommended reporting the dimensions separately along with the TDN grade. Participants considered the grading system fairly useful, very logical, and fairly simple to use and interpret. Almost all (95%) participants of the first survey and all participants of the second did not consider any classification to be better suited for neurosurgical AEs, and 88% of participants would recommend the use of the TDN grade to report AEs.

Interpretation

Standardized and transparent assessment of AEs is of paramount importance to monitor, compare and improve quality of treatment, as well as to address research questions. Unfortunately, AEs are often omitted in the neurosurgical literature, or reported either without specifying their severity or with only vague terms such as “minor” or “major”. This lacks objectivity and does not allow for meaningful analyses and conclusions. In general surgery, the introduction of the CDG greatly harmonized the reporting of AEs8–10,23. A series of classification systems have been proposed in the neurosurgical literature in the hope of achieving such consistency2–7, but none has gained wide acceptance. To our knowledge, the two most used neurosurgery-specific grading systems are the Landriel-Ibañez Classification (LIC) and the Spinal Adverse Events Severity System and its derivations (SAVES-V2 and SAVES-N)2,4,24. The LIC was based on the CDG and no fundamental change was introduced. It measures the severity of AEs based on the therapy used to counteract them, which is irrelevant for non-treatable neurological adverse events. As a result, severe and disabling neurologic deficits are still considered as Grade 1 AEs by both the CDG and the LIC. The SAVES-V2 uses the occurrence of neurological AEs and the likely duration of their effect on outcome to measure their severity, but does not consider the extent of the disability experienced by patients, which is probably the most important aspect of AEs. As the introduction of prospective patient registries in many centers is giving rise to a multitude of pro- and retrospective outcome research projects1,25–32, the limitations of existing grading systems and the urgent need for a standardized and neurosurgery-specific classification are being increasingly discussed7,25,33–39.

Unlike any previous classification, the TDN grade takes not only the therapy used to counteract AEs, but also the disability experienced by the patient and the occurrence of neurologic deficits into account. Corresponding to the Clavien-Dindo classification (CDG grades 1 to 5 are equal to TDN T1 to T5, Table 1)10, the “Therapy” dimension is very objective and encompasses AEs that may not result in any disability but still compromise patients’ health and generate costs (e.g. infection of a neurostimulator requiring explantation, antibiotics, and reimplantation). The “Disability” dimension assesses the severity of the functional impairment caused by the AE (e.g. severe hemiparesis), regardless of the type of neurologic deficit or the deployed treatment, and is also graded from 1 to 5 based on the modified Rankin Scale (mRS, Table 1)40. The “Disability” dimension already includes disabling neurologic deficits, but the “Neurology” dimension was added because some neurologic deficits may cause distress without being disabling and with no treatment available (e.g. severe facial paresis). As a result, the “Neurology” dimension has only two grades (N1 = no new neurologic deficit, N2 = new neurologic deficit, Table 1). Because of its multidimensionality, the TDN grade not only summarizes the severity of AEs from 1 to 5, which is very useful for comparisons and analyses, but also provides a separate grading for each dimension, offering a nuanced understanding of each AE. Thus, like most survey participants, we recommend reporting the TDN grade along with its separate dimensions.

Our findings highlight that online calculation tools not only simplify assessments but also enhance reliability. To improve the generalizability of our results, we designed a survey that mimicked real-world conditions, where neurosurgeons were not specifically trained to use the grading system. The survey provided only minimal guidance beyond the flowchart (Fig. 1) and the original publication11. While the flowchart enables the application of the TDN grade to any clinical scenario with minimal neurosurgical knowledge—evidenced by residents performing as well as board-certified surgeons—it may initially seem complex, potentially discouraging routine use. In contrast, online tools allow users to calculate the TDN grade and its dimensions in just three to five clicks, without requiring prior familiarity with the grading system. These tools are freely accessible from any platform, which may encourage wider adoption in clinical practice. A large portion of neurosurgical research, however, relies on retrospective data extracted from electronic records—a process that remains time-consuming and resource-intensive. We included ChatGPT (version 3.5, OpenAI)17 as a proof-of-concept to assess whether a large language model could calculate the TDN grade directly from case descriptions. ChatGPT was provided with the same brief instructions and vignettes as participants and asked to assign the TDN grade without iterative prompting (see Supplementary Information 3 [Conversation with ChatGPT]). This approach follows emerging literature where ChatGPT demonstrated high accuracy in grading postoperative complications using the CDG41, or in classifying diseases using the International Classification of Diseases 10 coding system42. In this context, our finding that ChatGPT outperformed human participants, even those using online tools, supports its potential as an adjunct for AE grading in future clinical workflows. 
Such systems could integrate with electronic health records to auto-extract case details, apply the TDN algorithm, and flag patterns suggestive of complications for earlier detection.

Since its introduction, the TDN grade has been used in several institutions and patient cohorts31,38,43–46. Vecchio et al. compared the TDN grade to the LIC in patients with diffuse lower-grade gliomas, of whom 110 suffered AEs38. They found that the TDN grade captured more AEs of higher severity corresponding to new neurologic or functional deficits. In their cohort, the proportion of patients with a reduction in quality of life was higher (and the percentage with an improvement lower) in patients who suffered TDN 3–5 AEs as compared to TDN 1–2. However, the sample size for this analysis was small (n = 27) and they found no statistically or clinically significant difference between the groups. Li et al. used the TDN grade in one of the largest elderly patient cohorts to date and found that AEs of higher severity were associated with an increase in mortality45. To summarize, the TDN grade was shown to be concordant with functional outcome (KPS)11, mortality45, costs, and length of hospitalization11, but its association with quality of life still has to be explored.

Limitations

Survey studies have inherent limitations, including selection bias due to our network-based and snowball sampling approach. We recruited participants through direct outreach, professional associations, and previous research networks, while encouraging them to share the survey further. Consequently, the total number of recipients is unknown. Our sample size is inherently limited by the specialized nature of neurosurgical adverse events and the substantial survey completion time. However, it remains comparable to similar studies and exceeds the sample sizes of previous reliability assessments of neurosurgical AE classifications (e.g., inter-rater reliability of SAVES-N and SAVES-V2 was validated with 10 and 51 participants, respectively)4,5,10. To mimic real-world scenarios, minimal training on the scoring system was provided, which may have led to misunderstanding of the TDN grade (e.g. electrolyte substitution was often considered as a grade 2 AE instead of grade 1, Fig. 2). The survey length may have contributed to rushed responses at its end, though question randomization mitigated this bias. Although the same 16 vignettes were used for both surveys—which could theoretically inflate intra-rater reliability through recall—the six-week gap, absence of feedback, and randomized vignette order likely minimized this effect. The surveys were provided in English independently of the participants’ mother tongue, and we did not assess respondents’ English proficiency. This may have introduced language-related limitations, as differences in comprehension could have influenced interpretation of vignette details and AE grading, lowering participants’ score as well as reliability measures. Lastly, while we included diverse AEs, our results may not fully represent the entire spectrum of neurosurgical complications and cannot be generalized to all neurosurgeons.

Conclusion

Neurosurgical AEs should be reported uniformly across the literature. The inter- and intra-rater reliability of the TDN grade ranges from substantial to almost perfect and is increased by the use of online calculation tools. The grading system was considered a valid and fairly useful measure of the severity of AEs, while remaining very logical and fairly simple to use and interpret. As a result, a consensus was reached among surveyed neurosurgeons supporting the adoption of the TDN grade for reporting neurosurgical adverse events.
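
For readers who wish to check the descriptive labels used above, the following minimal Python sketch maps an agreement coefficient to the Landis and Koch bands21. It assumes the conventional cut-offs (below 0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect), which were originally proposed for kappa statistics. The function name `landis_koch_label` and the treatment of values falling between two-decimal band edges are illustrative choices and not part of the study's analysis code. The α values are those reported in the abstract.

```python
# Illustrative sketch only: descriptive labels after Landis & Koch (1977).
# Upper bounds are inclusive; a value such as 0.805 is assigned to the
# higher band, which is an assumption (the original bands use two decimals).
LANDIS_KOCH_UPPER_BOUNDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def landis_koch_label(coefficient: float) -> str:
    """Return the Landis & Koch descriptive label for an agreement coefficient."""
    if not -1.0 <= coefficient <= 1.0:
        raise ValueError("agreement coefficient must lie in [-1, 1]")
    if coefficient < 0.0:
        return "poor"
    for upper_bound, label in LANDIS_KOCH_UPPER_BOUNDS:
        if coefficient <= upper_bound:
            return label
    return "almost perfect"  # not reached for valid input; kept for safety


# Reliability coefficients as reported in the abstract of this article.
reported_alpha = {
    "TDN grade, inter-rater": 0.66,
    "TDN grade, intra-rater": 0.79,
    "separate dimensions, inter-rater": 0.74,
    "separate dimensions, intra-rater": 0.85,
}

for name, alpha in reported_alpha.items():
    print(f"{name}: alpha = {alpha:.2f} ({landis_koch_label(alpha)})")
```

Under these assumptions, the three values between 0.61 and 0.80 are labelled substantial and 0.85 is labelled almost perfect, consistent with the wording of the conclusion.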

Acknowledgements

We thank all participants for taking the time to complete the survey.

Author contributions

APRT, VK, AMZ, OB, and MCN formulated the study question and designed the study. All authors recruited participants and collected data. APRT prepared the data, conducted the statistical analysis, and created the figures and tables. All authors contributed to writing, data interpretation, and critical revision of the work, and approved the final version.

Data availability

Anonymous data is provided in Supplementary Information 2 [Raw Data].

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A list of authors and their affiliations appears at the end of the paper.

Contributor Information

Marian Christoph Neidert, Email: marianneidert@hotmail.com.

TDN Study Group:

Alexis Paul Romain Terrapon, Vincens Kälin, Anna Maria Zeitlberger, Jonathan Weller, Cédric Kissling, Nicolas Neidert, Malte Mohme, Ahmed El-Garci, Tareq A. Juratli, Philip Dao Trong, Martin N. Stienen, Isabel Charlotte Hostettler, Morgan Broggi, Johannes Sarnthein, Luca Regli, Oliver Bozinov, Marian Christoph Neidert, Erik Schulz, Francis Kissling, Jun Thorsteinsdottir, Francescco Restelli, Michael Hugelshofer, Sarah Stricker, Francesco Marchi, Anne-Katrin Hickmann, Meltem Gönel, Mukesch Johannes Shah, Veit Stoecklein, Antonia Wehn, Michal Ziga, Svenja Maschke, Michael Schmutzer-Sondergeld, Philipp Karschnia, Felix C. Stengel, Vittorio Stumpo, Max Schrammel, Marie T. Krüger, Manuel Kramer, Lorenzo Bertulli, Witold H. Polanski, Piotr Sumislawski, Tobias Greve, Frederic Thiele, Daniel Hoffmann Ayala, Biyan Nathanael Harapan, Sebastian Siller, Ulrich Hubbe, Arian Karbe, Sven Richter, Schirin Hunziker, Christian V. Eisenring, Emanuele La Corte, André N. J. Sagerer, Katharina Janosovits, Costanza Maria Zattra, Manou Overstijns, David M. Hasan, Jacopo Falco, Sivani Sivanrupan, Emanuele Rubiu, Stefanie Ott, Menno Germans, Christoph Scholz, Richard Drexler, Diederik Bulters, Christine Steiert, Florian Volz, Alice Senta Ryba, Soham Bandyopadhyay, Chibueze Agwu, Gregor Fischer, Markus Florian Oertel, Luis Padevit, Oliver Bichsel, Alexandra Grob, Victor E. Staartjes, Elisa Colombo, and Alexander Hoyningen

References

  • 1. Sarnthein, J., Staartjes, V. E. & Regli, L. Neurosurgery outcomes and complications in a monocentric 7-year patient registry. Brain Spine 2, 100860 (2022).
  • 2. Landriel Ibanez, F. A. et al. A new classification of complications in neurosurgery. World Neurosurg. 75, 709–715 (2011).
  • 3. Houkin, K. et al. Quantitative analysis of adverse events in neurosurgery. Neurosurgery 65, 587–594 (2009).
  • 4. Rampersaud, Y. R., Anderson, P. A., Dimar, J. R. 2nd & Fisher, C. G. Spinal adverse events severity system, version 2 (SAVES-V2): inter- and intraobserver reliability assessment. J. Neurosurg. Spine 25, 256–263 (2016).
  • 5. Castle-Kirszbaum, M. D. et al. Interobserver reliability of spinal adverse events severity system-neuro (SAVES-N): A prospective adverse event reporting system for neurosurgical cases. World Neurosurg. 116, e882–e888 (2018).
  • 6. Gozal, Y. M. et al. Defining a new neurosurgical complication classification: lessons learned from a monthly Morbidity and Mortality conference. J. Neurosurg. 132, 272–276 (2019).
  • 7. Chandra Venkata Vemula, R., Prasad, B. C. M. & Kumar, K. Prospective study of complications in neurosurgery and their impact on the health related quality of life (HRQOL)—Proposal of a new complication grading in neurosurgery based on HRQOL. Interdiscipl. Neurosurg. 23, 101002 (2021).
  • 8. Clavien, P. A., Sanabria, J. R. & Strasberg, S. M. Proposed classification of complications of surgery with examples of utility in cholecystectomy. Surgery 111, 518–526 (1992).
  • 9. Clavien, P. A. et al. The Clavien-Dindo classification of surgical complications: five-year experience. Ann. Surg. 250, 187–196 (2009).
  • 10. Dindo, D., Demartines, N. & Clavien, P. A. Classification of surgical complications: a new proposal with evaluation in a cohort of 6336 patients and results of a survey. Ann. Surg. 240, 205–213 (2004).
  • 11. Terrapon, A. P. R. et al. Adverse events in neurosurgery: The Novel Therapy-Disability-Neurology grade. Neurosurgery 89, 236–245 (2021).
  • 12. SurveyMonkey Inc. SurveyMonkey. https://www.surveymonkey.com (2024).
  • 13. MDCalc. Therapy-Disability-Neurology (TDN) Grade. https://www.mdcalc.com/calc/10461/therapy-disability-neurology-tdn-grade (2024).
  • 14. MDApp. Therapy-Disability-Neurology Grade (TDN Grade). https://www.mdapp.co/therapy-disability-neurology-grade-tdn-grade-615/ (2022).
  • 15. QxMD. Therapy-Disability-Neurology Grade. https://qxmd.com/calculate/calculator_870/therapy-disability-neurology-grade (2023).
  • 16. Mokkink, L. B. et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J. Clin. Epidemiol. 63, 737–745 (2010).
  • 17. OpenAI. ChatGPT v. 3.5. https://chatgpt.com (2024).
  • 18. R Foundation for Statistical Computing. R: A Language and Environment for Statistical Computing (2019).
  • 19. Kassambara, A. ggpubr: ‘ggplot2’ Based Publication Ready Plots v. 0.6.0 (2023).
  • 20. Canty, A. & Ripley, B. D. boot: Bootstrap R (S-Plus) Functions. R Package Version 1.3-30 (2024).
  • 21. Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977).
  • 22. Sharma, A. et al. A consensus-based checklist for reporting of survey studies (CROSS). J. Gen. Intern. Med. 36, 3179–3187 (2021).
  • 23. Balvardi, S. et al. Systematic review of grading systems for adverse surgical outcomes. Can. J. Surg. 64, E196–E204 (2021).
  • 24. Rampersaud, Y. R., Neary, M. A. & White, K. Spine adverse events severity system: content validation and interobserver reliability assessment. Spine 35, 790–795 (2010).
  • 25. Sarnthein, J., Stieglitz, L., Clavien, P. A. & Regli, L. A patient registry to improve patient safety: recording general neurosurgery complications. PLoS ONE 11, e0163154 (2016).
  • 26. Ferroli, P. et al. Predicting functional impairment in brain tumor surgery: the Big Five and the Milan Complexity Scale. Neurosurg. Focus 39, E14 (2015).
  • 27. Strömqvist, B., Fritzell, P., Hägg, O. & Jönsson, B. The Swedish Spine Register: development, design and utility. Eur. Spine J. 18(Suppl 3), 294–304 (2009).
  • 28. Corell, A. et al. Neurosurgical treatment and outcome patterns of meningioma in Sweden: a nationwide registry-based study. Acta Neurochir. (Wien) 161, 333–341 (2019).
  • 29. Bydon, M. et al. Building and implementing an institutional registry for a data-driven national neurosurgical practice: experience from a multisite medical center. Neurosurg. Focus 51, E9 (2021).
  • 30. Lohmann, S. et al. Development and validation of prediction scores for nosocomial infections, reoperations, and adverse events in the daily clinical setting of neurosurgical patients with cerebral and spinal tumors. J. Neurosurg. 134, 1226–1236 (2020).
  • 31. Dao Trong, P., Olivares, A., El Damaty, A. & Unterberg, A. Adverse events in neurosurgery: a comprehensive single-center analysis of a prospectively compiled database. Acta Neurochir. 165, 585–593 (2023).
  • 32. Asher, A. L., McCormick, P. C., Selden, N. R., Ghogawala, Z. & McGirt, M. J. The national neurosurgery quality and outcomes database and neuropoint alliance: rationale, development, and implementation. Neurosurg. Focus 34, E2 (2013).
  • 33. Bellut, D. et al. Validating a therapy-oriented complication grading system in lumbar spine surgery: a prospective population-based study. Sci. Rep. 7, 11752 (2017).
  • 34. Ferroli, P. et al. Complications in neurosurgery: application of Landriel Ibanez classification and preliminary considerations on 1000 cases. World Neurosurg. 82, e576–e577 (2014).
  • 35. Schiavolin, S. et al. The impact of neurosurgical complications on patients’ health status: a comparison between different grades of complications. World Neurosurg. 84, 36–40 (2015).
  • 36. Schenker, P. et al. Patients with a normal pressure hydrocephalus shunt have fewer complications than do patients with other shunts. World Neurosurg. 110, e249–e257 (2018).
  • 37. Rybkin, I. et al. Unique neurosurgical morbidity and mortality conference characteristics: a comprehensive literature review of neurosurgical morbidity and mortality conference practices with proposed recommendations. World Neurosurg. 135, 48–57 (2020).
  • 38. Gómez Vecchio, T. et al. Classification of adverse events following surgery in patients with diffuse lower-grade gliomas. Front. Oncol. 11, 792878 (2021).
  • 39. Guo, Z. Q. et al. A nomogram for predicting the risk of major postoperative complications for patients with meningioma. Neurosurg. Rev. 46, 288 (2023).
  • 40. Quinn, T. J., Dawson, J., Walters, M. R. & Lees, K. R. Functional outcome measures in contemporary stroke trials. Int. J. Stroke 4, 200–205 (2009).
  • 41. Staubli, S. M. et al. Decoding the Clavien-Dindo classification: artificial intelligence (AI) as a novel tool to grade postoperative complications. Ann. Surg. 281, 273–279 (2025).
  • 42. Mustafa, A., Naseem, U. & Rahimi Azghadi, M. Large language models vs human for classifying clinical documents. Int. J. Med. Inform. 195, 105800 (2025).
  • 43. Millward, C. P. et al. Cranioplasty with hydroxyapatite or acrylic is associated with a reduced risk of all-cause and infection-associated explantation. Br. J. Neurosurg. 36, 385–393 (2022).
  • 44. Clynch, A. L. et al. Cranial meningioma with bone involvement: surgical strategies and clinical considerations. Acta Neurochir. (Wien) 165, 1355–1363 (2023).
  • 45. Li, H. et al. Decision-making tree for surgical treatment in meningioma: a geriatric cohort study. Neurosurg. Rev. 46, 196 (2023).
  • 46. Yildiz, Y. et al. Subarachnoid hemorrhage due to pituitary adenoma apoplexy—case report and review of the literature. Neurol. Sci. 45, 997–1005 (2024).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
