Key Points
Question
What is the interrater reliability of the skin-specific scores of the National Institutes of Health response criteria for chronic graft-vs-host disease?
Findings
In this study of 10 physicians (6 blood and marrow transplant specialists and 4 dermatologists) who evaluated 8 patients with cutaneous chronic graft-vs-host disease, interrater agreement across all groups was best for range of motion scoring. Dermatologists had acceptable agreement for the skin graft-vs-host disease and skin feature scores, near perfect agreement in identifying sclerosis, and poor agreement for skin sclerosis grading.
Meaning
Although dermatologists had significant agreement in identifying cutaneous sclerosis, methods of grading severity of cutaneous chronic graft-vs-host disease appear to need improvement.
This study evaluates the interrater agreement and reliability of skin-specific and range of motion variables of the 2014 National Institutes of Health response criteria for patients with chronic graft-vs-host disease as well as a skin sclerosis grading scale.
Abstract
Importance
Cutaneous chronic graft-vs-host disease (cGVHD) is common after allogeneic hematopoietic stem cell transplant and is often associated with poor patient outcomes. A reliable and practical method for assessing disease severity and response to therapy among these patients is urgently needed.
Objective
To evaluate the interrater agreement and reliability of skin-specific and range of motion (ROM) variables of the 2014 National Institutes of Health (NIH) response criteria for cGVHD and a skin sclerosis grading scale (SSG).
Design, Setting, and Participants
In this observational study performed at a single tertiary academic center, 6 academic blood and marrow transplant specialists and 4 medical dermatologists examined 8 patients with diagnosed cutaneous cGVHD on July 10, 2015. The patient cohort was enriched for patients with sclerotic features. Each patient was evaluated by using the skin-specific and ROM criteria of the 2014 NIH response criteria for cGVHD and an SSG ranging from 0 to 3. Each patient was also asked to complete quality-of-life scoring instruments. Interrater agreement and reliability were estimated by calculating the Krippendorff α and Cohen κ statistics. Data were analyzed from September 29, 2015, through November 22, 2018.
Main Outcomes and Measures
Estimation of interrater agreement by interclass coefficient (Krippendorff α and Cohen κ statistics) for the skin-specific and ROM components of the 2014 NIH Response Criteria for Chronic GVHD and for the SSG.
Results
The median age of the patients evaluated was 54 years (range, 46-58 years). Patients were predominantly male (6 [75%]). Six of the 8 patients had a predominantly sclerotic cutaneous phenotype. Interrater agreement among our experts was acceptable for NIH skin feature score (0.68; 95% CI, 0.30-0.86) and good for NIH ROM scoring (0.80; 95% CI, 0.68-0.86). Dermatologists had acceptable agreement for NIH skin GVHD score (0.69; 95% CI, 0.25-0.82) and skin feature score (0.78; 95% CI, 0.17-0.98), good agreement in ROM grading (0.85; 95% CI, 0.69-0.90), and near perfect agreement in identifying sclerosis (0.82; 95% CI, 0.27-0.97).
Conclusions and Relevance
Although dermatologists had acceptable agreement in the NIH skin GVHD score and skin features score and near perfect agreement in identifying cutaneous sclerosis, better agreement in grading the severity of cutaneous cGVHD, especially in the intermediate grades, appears to be needed.
Introduction
Cutaneous chronic graft-vs-host disease (cGVHD) is a common complication after allogeneic hematopoietic stem cell transplant.1 The sclerotic variant is difficult to treat and is associated with poor outcomes.2 An accurate clinical scoring tool that has good interrater reliability is important in research and clinical management of this complex disease. Clinical tools that incorporate a grade for the degree of sclerosis and body surface area involvement have shown promise in quantifying disease severity.3 The National Institutes of Health (NIH) Working Group for cGVHD developed consensus criteria for grading and measuring therapeutic response in 2005 and revised these in 2014.4,5 Herein, we examined the interrater agreement among academic blood and marrow transplant (BMT) specialists and academic medical dermatologists when using (1) skin-specific variables and range of motion (ROM) severity grading of the 2014 NIH cGVHD response criteria and (2) a body site skin sclerosis grade (SSG) of 0 to 3 (higher scores indicate greater severity). Finally, we correlated skin-specific variables of the NIH cGVHD response criteria with patient-reported severity scores and quality-of-life metrics.
Methods
Study Participants
Our study was approved by the institutional review board of Duke University, Durham, North Carolina. Written informed consent was obtained from all patients included in this study. Six BMT specialists and 4 medical dermatologists participated as physician evaluators. Written and online training for evaluators was provided 1 week before the study date. Training included (1) a publicly available online video developed by the Fred Hutchinson Cancer Research Center for evaluating cGVHD (https://vimeo.com/20901528); (2) visual training for the evaluation of erythema, scaling, and induration and identification of papulosquamous features; and (3) visual guides for evaluating the body surface area.
A medical dermatologist with expertise in GVHD (A.R.C.) screened and enrolled adult patients who had clinically active cutaneous cGVHD4; this dermatologist was excluded from evaluating the patients for interrater agreement. All 10 physicians evaluated 8 patients with cGVHD on July 10, 2015, using the skin-specific variables and ROM scoring of the 2014 NIH cGVHD response criteria.4 The SSG was administered for 20 body sites.6 Patients were asked to complete the patient-reported cGVHD assessment form included in the 2014 NIH response criteria,5 the cGVHD symptoms scale (Lee symptom scale),7 and the Functional Assessment of Cancer Therapy–Bone Marrow Transplant (FACT-BMT)8 questionnaire (version 4).
The NIH skin GVHD score (SGS) identifies active skin involvement and depends on body surface area (0 indicates none; 1, 1%-18%; 2, 19%-50%; and 3, >50%). The NIH skin features score (SFS) stratifies the severity of sclerotic features (0 indicates no sclerotic features; 2, superficial sclerotic features; and 3, deep sclerotic features, hidebound skin, impaired mobility, or ulceration). The NIH ROM quantifies limitation of motion in the shoulders, elbows, wrists/fingers, and ankles on a scale from 1 to 5 or 7, depending on the joint involved. A photograph accompanies each option in the scoring sheet to serve as a guide for the evaluator.
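The body surface area bands of the NIH SGS can be written as a small lookup. The following is a minimal sketch, not the scoring instrument itself: the function name `nih_skin_gvhd_score` is hypothetical, and the handling of boundary and fractional values (any involvement above 0% up to 18% scored as 1, and exactly 50% scored as 2) is an assumption made here for illustration.

```python
def nih_skin_gvhd_score(bsa_percent: float) -> int:
    """Map percent body surface area (BSA) involved to an SGS of 0 to 3.

    Assumptions (not stated in the text): any involvement above 0% and up
    to 18% scores 1, and exactly 50% scores 2.
    """
    if not 0 <= bsa_percent <= 100:
        raise ValueError("BSA must be between 0 and 100 percent")
    if bsa_percent == 0:
        return 0  # no active skin involvement
    if bsa_percent <= 18:
        return 1
    if bsa_percent <= 50:
        return 2
    return 3


# Illustrative values only; these are not patient data from the study.
print([nih_skin_gvhd_score(b) for b in (0, 10, 35, 60)])
```

Because the score collapses a continuous estimate into 4 bands, two evaluators whose body surface area estimates differ by only a few percentage points near a cutoff will still receive different scores.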
Statistical Analysis
Data were analyzed from September 29, 2015, through November 22, 2018. Our primary outcome was interrater agreement among evaluators, as estimated by interclass coefficient. A minimum of 8 observers was needed to detect an interclass coefficient of 0.30 or above with 80% power and α = .05 for our primary outcome. Interrater agreement for variables with multiple ordinal ratings (NIH SGS, skin features score [NIH SFS], NIH ROM, and body site SSG) was estimated using the Krippendorff α statistic.9 Agreement for dichotomous variables (identification of the presence of cGVHD clinical phenotypes and skin features) was determined using the Cohen κ statistic.10 Interrater agreement was measured among all raters and within the subgroups of dermatologists and BMT physicians. Interpretation of the α and κ statistics was based on recommendations by Krippendorff9 and Landis and Koch,11 respectively. Correlation between physician scores and patient-reported outcomes was determined using Spearman correlation. Krippendorff α was calculated using R software (R Foundation for Statistical Computing).12 All other analyses were performed using Stata software version 5 (StataCorp, LLC). Tests were 2-tailed, with significance set at P = .05.
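For readers unfamiliar with the Krippendorff α statistic for ordinal ratings, the calculation can be sketched as follows. This is an illustrative Python implementation of the standard coincidence-matrix formulation, not the R code used in this study; the function name `krippendorff_alpha_ordinal` and the toy ratings are hypothetical, and the sketch returns a point estimate only (no confidence interval).

```python
from collections import Counter
from itertools import permutations
from typing import Optional, Sequence


def krippendorff_alpha_ordinal(units: Sequence[Sequence[Optional[int]]]) -> float:
    """Krippendorff alpha with the ordinal distance metric.

    `units` holds one inner sequence per rated item (for example, one
    patient); each inner sequence lists the ratings that item received,
    with None marking a missing rating.
    """
    # Coincidence matrix: each ordered pair of ratings from different
    # raters within a unit contributes 1 / (m - 1), where m is the number
    # of ratings available for that unit.
    coincidences: Counter = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than 2 ratings cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    if not coincidences:
        raise ValueError("no unit has at least 2 ratings")

    categories = sorted({value for pair in coincidences for value in pair})
    totals = {c: sum(coincidences[(c, k)] for k in categories) for c in categories}
    n = sum(totals.values())

    def delta_sq(c: int, k: int) -> float:
        # Ordinal distance: ratings falling between the two categories
        # (inclusive), minus half of the two end-category totals, squared.
        low, high = min(c, k), max(c, k)
        span = sum(totals[g] for g in categories if low <= g <= high)
        return (span - (totals[low] + totals[high]) / 2.0) ** 2

    observed = sum(
        coincidences[(c, k)] * delta_sq(c, k) for c in categories for k in categories
    ) / n
    expected = sum(
        totals[c] * totals[k] * delta_sq(c, k) for c in categories for k in categories
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


# Toy example: 4 items, 3 raters, one missing rating (not study data).
toy = [[0, 0, 1], [2, 2, 2], [3, 2, None], [1, 1, 1]]
print(round(krippendorff_alpha_ordinal(toy), 3))
```

The ordinal metric penalizes a disagreement between grades 0 and 3 more heavily than one between adjacent grades, which is why it suits graded scales such as the SGS, SFS, ROM, and SSG better than a nominal agreement statistic.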
Results
The median age of the patients evaluated was 54 years (range, 46-58 years). Six patients were male (75%) and 2 were female (25%). All but 1 patient had undergone transplant more than 12 months before evaluation. Six of the 8 patients had a predominantly sclerotic cutaneous phenotype.
Interrater Agreement
Evaluators had acceptable agreement (α range, 0.67-0.80) using the NIH SFS and good agreement (α > 0.80) in grading the NIH ROM (Table 1). Dermatologists had acceptable agreement with NIH SGS (0.69; 95% CI, 0.25-0.82) and NIH SFS (0.78; 95% CI, 0.17-0.98) and good agreement in grading the NIH ROM (0.85; 95% CI, 0.69-0.90), whereas BMT specialists had acceptable agreement in grading the NIH ROM (0.76; 95% CI, 0.59-0.83). Agreement was not acceptable when using body site–specific SSG.
Table 1. Interrater Agreement Using Krippendorff α for Multiple Ordinal Ratings.
| Rating^a | Overall, Krippendorff α (95% CI)^b | BMT Specialists, Krippendorff α (95% CI)^b | Dermatologists, Krippendorff α (95% CI)^b |
|---|---|---|---|
| 2014 NIH therapeutic response criteria | | | |
| SGS | 0.43 (0.10 to 0.63) | 0.37 (−0.08 to 0.63) | 0.69 (0.25 to 0.82) |
| SFS | 0.68 (0.30 to 0.86) | 0.59 (0.19 to 0.78) | 0.78 (0.17 to 0.98) |
| ROM | 0.80 (0.68 to 0.86) | 0.76 (0.59 to 0.83) | 0.85 (0.69 to 0.90) |
| SSG (0-3) for each body site | 0.54 (0.46 to 0.62) | 0.55 (0.47 to 0.63) | 0.57 (0.47 to 0.64) |
Abbreviations: BMT, blood and marrow transplant; NIH, National Institutes of Health; ROM, range of motion; SFS, skin features score; SGS, skin GVHD (graft-vs-host disease) score; SSG, skin sclerosis grade.
^a Includes 10 raters and 8 observations for SGS and SFS; 10 raters and 32 observations for ROM; and 10 raters and 140 to 160 observations for SSG (with some missing values). Information on the rating tools is given in the Methods section.
^b Interpretation uses suggested α > 0.80, with 0.67 as the lowest cutoff.9
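The interpretation cutoffs for α can be expressed as a short helper. This is a sketch only: the function name `interpret_alpha` is hypothetical, and treating a reported value of exactly 0.80 as good is an assumption, made so that values rounded to 2 decimals are labeled the way the Results text labels them.

```python
def interpret_alpha(alpha: float) -> str:
    """Label a Krippendorff alpha using the cutoffs of 0.80 and 0.67.

    Assumption: a value of exactly 0.80 is labeled "good".
    """
    if alpha >= 0.80:
        return "good"
    if alpha >= 0.67:
        return "acceptable"
    return "not acceptable"


# Overall point estimates from Table 1.
for rating, alpha in [("SGS", 0.43), ("SFS", 0.68), ("ROM", 0.80), ("SSG", 0.54)]:
    print(rating, interpret_alpha(alpha))
```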
Agreement in dichotomous variables ranged from none to moderate among all evaluators (Table 2). Agreement in the identification of sclerotic disease was near perfect among dermatologists (0.82; 95% CI, 0.27-0.97; P < .001).
Table 2. Interrater Agreement Using Cohen κ for Dichotomous Variables.
| Rating^a | Overall, Cohen κ (95% CI)^b | P Value | BMT Specialists, Cohen κ (95% CI)^b | P Value | Dermatologists, Cohen κ (95% CI)^b | P Value |
|---|---|---|---|---|---|---|
| Presence of skin GVHD types | | | | | | |
| Maculopapular | 0.10 (0.00 to 0.50) | .04 | 0.03 (−0.05 to 0.46) | .38 | 0.11 (−0.08 to 0.53) | .22 |
| Lichen planus–like | 0.02 (−0.01 to 0.51) | .34 | −0.01 (0.06 to 0.41) | .56 | 0.14 (−0.01 to 0.77) | .16 |
| Sclerotic | 0.45 (0.08 to 0.81) | <.001 | 0.24 (0.01 to 0.73) | .01 | 0.82 (0.27 to 0.97) | <.001 |
| Papulosquamous or ichthyotic | 0.11 (0.00 to 0.52) | .03 | 0.03 (−0.05 to 0.42) | .60 | 0.23 (−0.02 to 0.66) | .06 |
| Keratosis pilaris | −0.03 (−0.00 to 0.77) | .72 | −0.04 (−0.01 to 0.71) | .65 | NA | NA |
| Presence of skin features | | | | | | |
| Deep sclerotic | 0.40 (0.14 to 0.72) | <.001 | 0.31 (0.05 to 0.67) | .001 | 0.41 (0.07 to 0.74) | .003 |
| Hidebound | 0.37 (0.10 to 0.71) | <.001 | 0.30 (0.04 to 0.68) | .002 | 0.33 (0.02 to 0.70) | .01 |
| Impaired mobility | 0.33 (0.09 to 0.67) | <.001 | 0.23 (0.01 to 0.61) | .01 | 0.36 (0.04 to 0.72) | .008 |
| Ulceration | 0.36 (0.03 to 0.82) | <.001 | 0.38 (0.03 to 0.83) | .002 | 0.17 (−0.01 to 0.78) | .13 |
Abbreviations: BMT, blood and marrow transplant; GVHD, graft-vs-host disease; NA, not applicable.
^a Includes 10 raters and 8 observations. Dichotomous ratings are from the skin- and joint-specific variables in the 2014 National Institutes of Health therapeutic response criteria for chronic GVHD.
^b Interpretation by Landis and Koch11: 0 indicates poor agreement; 0.20 or less, slight; 0.21 to 0.40, fair; 0.41 to 0.60, moderate; 0.61 to 0.80, substantial; and 0.81 to 1.00, near perfect.
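To make the κ values and their labels concrete, the sketch below computes Cohen κ for 2 raters on a dichotomous variable and maps the result to the Landis and Koch categories. It is illustrative only: the function names `cohen_kappa` and `landis_koch_label` are hypothetical, the ratings are invented, grouping negative values with 0 as poor is an assumption, and the text does not describe how agreement was pooled across more than 2 raters, so only the 2-rater case is shown.

```python
from typing import Sequence


def cohen_kappa(rater1: Sequence[int], rater2: Sequence[int]) -> float:
    """Cohen kappa for 2 raters scoring the same items."""
    a, b = list(rater1), list(rater2)
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(a)
    # Observed proportion of items on which the raters agree.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    p_expected = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    if p_expected == 1:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (p_observed - p_expected) / (1 - p_expected)


def landis_koch_label(kappa: float) -> str:
    """Label kappa by the Landis and Koch bands (values at or below 0: poor)."""
    if kappa <= 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "near perfect"


# Invented ratings for 8 patients (1 = feature present, 0 = absent).
rater_a = [1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 1, 0, 0, 1]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2), landis_koch_label(kappa))  # 0.47 moderate
```

In this example the raters agree on 6 of 8 patients (75%), yet κ is only 7/15 because more than half of that agreement is expected by chance from the marginal frequencies. With 8 observations, a single discordant rating shifts κ substantially, which is consistent with the wide confidence intervals in Table 2.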
Correlation With Patient-Reported Severity
We found a positive correlation between NIH SGS and the patient-reported symptom severity score (Spearman ρ, 0.76; P = .05) (Table 3). The NIH SGS also had a positive correlation with the patient-reported FACT-BMT physical well-being score (Spearman ρ, 0.77; P = .03). Finally, correlation between the NIH SFS and patient-reported FACT-BMT physical well-being score fell short of statistical significance (Spearman ρ, 0.70; P = .06).
Table 3. Correlation Between Patient- and Physician-Reported Severity Scores.
| Severity Score | NIH SGS, Spearman ρ | NIH SGS, P Value | NIH SFS, Spearman ρ | NIH SFS, P Value |
|---|---|---|---|---|
| NIH | | | | |
| Global rating | 0.56 | .19 | 0.69 | .08 |
| Symptom severity score | 0.76 | .05 | 0.52 | .24 |
| Chronic GVHD symptom scale (skin) | 0.21 | .62 | 0.28 | .50 |
| FACT-BMT | | | | |
| Physical well-being | 0.77 | .03 | 0.70 | .06 |
| Social and familial | −0.51 | .19 | −0.45 | .26 |
| Emotional | 0.45 | .27 | 0.43 | .29 |
| Functional | −0.38 | .35 | −0.13 | .76 |
Abbreviations: FACT-BMT, Functional Assessment of Cancer Therapy–Bone Marrow Transplant; GVHD, graft-vs-host disease; NIH, National Institutes of Health; SFS, skin features score; SGS, skin GVHD score.
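The Spearman correlation reported in Table 3 is the Pearson correlation of the ranked scores, with tied values sharing their average rank. The sketch below shows that calculation in plain Python under those standard definitions; the function names `average_ranks` and `spearman_rho` and the example scores are hypothetical, and no P value is computed.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    if spread_x == 0 or spread_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return covariance / (spread_x * spread_y)


# Invented scores for 8 patients: physician SGS vs a patient-reported score.
physician_sgs = [1, 2, 2, 3, 3, 1, 2, 3]
patient_score = [2, 4, 5, 8, 7, 3, 3, 9]
print(round(spearman_rho(physician_sgs, patient_score), 2))
```

With only 8 patients and a 4-level score, many ranks are tied, so the coefficient moves in coarse steps and even a large ρ can fall near the significance threshold, as in Table 3.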
Discussion
Chronic GVHD occurs in approximately half of patients who have undergone an allogeneic hematopoietic stem cell transplant, and the skin is a commonly affected organ. The sclerotic variant is the most severe subtype of cutaneous cGVHD and the most difficult to manage. Patients with sclerotic cGVHD are more likely to have longer courses of immunosuppressive therapy.2 Identifying the appropriate time to institute or reduce immunosuppressive therapy can be challenging. Having a reliable, reproducible measure for grading disease severity and progression or response to therapy will help us improve clinical care and advance research.
We found higher agreement among the academic dermatologists who participated in our study, especially in the identification of clinical phenotypes. This agreement underscores the value of a multidisciplinary approach to cGVHD, as previously expressed by other experts13 and echoed at our institution. Increased and repeated training on using clinical scoring tools may improve interrater agreement,14 although the effects of such training may not be long lasting.6 Interrater agreement for NIH ROM grade was good, perhaps owing to incorporation of a detailed visual aid with the scoring sheet. The value of patient-reported outcomes in assessing disease severity should also not be overlooked. The patient-reported 2014 NIH response criteria symptom severity score and the previously validated physical well-being component of the FACT-BMT correlated with the NIH SGS. A 2005 NIH SGS of 3 has previously been validated to correlate with poor outcomes.15 Finally, we found no acceptable agreement in body site SSG, but the addition of objective measures of cutaneous sclerosis, such as imaging techniques that interrogate tissue stiffness, may also improve assessment of cutaneous cGVHD.16
Limitations
This study was limited by the small number of patients and evaluators and the lack of an opportunity to test intrarater reliability. The predominance of the sclerotic phenotype among the evaluated patients limits the applicability of our data to patients with sclerotic cutaneous cGVHD. Time limitations for evaluation may affect interrater agreement, but similar limitations exist in real-life clinical care. Determining diagnostic accuracy was beyond the scope of this study and is the subject of future follow-up efforts.
Conclusions
In summary, the 2014 NIH cGVHD response criteria remain a promising clinical and research tool in evaluating skin sclerosis among patients with cGVHD. Inclusion of dermatologists in the evaluation of patients and incorporation of patient-reported outcomes are important in clinical care and research. Further investigation into how to improve the reliability of our grading tools appears to be needed.
References
- 1. Goerner M, Gooley T, Flowers ME, et al. Morbidity and mortality of chronic GVHD after hematopoietic stem cell transplantation from HLA-identical siblings for patients with aplastic or refractory anemias. Biol Blood Marrow Transplant. 2002;8(1):47-56. doi: 10.1053/bbmt.2002.v8.pm11858190
- 2. Inamoto Y, Storer BE, Petersdorf EW, et al. Incidence, risk factors, and outcomes of sclerosis in patients with chronic graft-versus-host disease. Blood. 2013;121(25):5098-5103. doi: 10.1182/blood-2012-10-464198
- 3. Greinix HT, Pohlreich D, Maalouf J, et al. A single-center pilot validation study of a new chronic GVHD skin scoring system. Biol Blood Marrow Transplant. 2007;13(6):715-723. doi: 10.1016/j.bbmt.2007.02.007
- 4. Jagasia MH, Greinix HT, Arora M, et al. National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease: I. The 2014 Diagnosis and Staging Working Group report. Biol Blood Marrow Transplant. 2015;21(3):389-401.e1. doi: 10.1016/j.bbmt.2014.12.001
- 5. Lee SJ, Wolff D, Kitko C, et al. Measuring therapeutic response in chronic graft-versus-host disease: National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease, IV: the 2014 Response Criteria Working Group report. Biol Blood Marrow Transplant. 2015;21(6):984-999. doi: 10.1016/j.bbmt.2015.02.025
- 6. Ionescu R, Rednic S, Damjanov N, et al. Repeated teaching courses of the modified Rodnan skin score in systemic sclerosis. Clin Exp Rheumatol. 2010;28(2)(suppl 58):S37-S41.
- 7. Lee SK, Cook EF, Soiffer R, Antin JH. Development and validation of a scale to measure symptoms of chronic graft-versus-host disease. Biol Blood Marrow Transplant. 2002;8(8):444-452. doi: 10.1053/bbmt.2002.v8.pm12234170
- 8. McQuellon RP, Russell GB, Cella DF, et al. Quality of life measurement in bone marrow transplantation: development of the Functional Assessment of Cancer Therapy–Bone Marrow Transplant (FACT-BMT) scale. Bone Marrow Transplant. 1997;19(4):357-368. doi: 10.1038/sj.bmt.1700672
- 9. Krippendorff K. Content Analysis: An Introduction to Its Methodology. 3rd ed. Los Angeles, CA: SAGE; 2013.
- 10. Gisev N, Bell JS, Chen TF. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Res Social Adm Pharm. 2013;9(3):330-338. doi: 10.1016/j.sapharm.2012.04.004
- 11. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159-174. doi: 10.2307/2529310
- 12. R: a language and environment for statistical computing [computer program]. Vienna, Austria: R Foundation for Statistical Computing; 2018.
- 13. Baird K, Steinberg SM, Grkovic L, et al. National Institutes of Health chronic graft-versus-host disease staging in severely affected patients: organ and global scoring correlate with established indicators of disease severity and prognosis. Biol Blood Marrow Transplant. 2013;19(4):632-639. doi: 10.1016/j.bbmt.2013.01.013
- 14. Mitchell SA, Jacobsohn D, Thormann Powers KE, et al. A multicenter pilot evaluation of the National Institutes of Health chronic graft-versus-host disease (cGVHD) therapeutic response measures: feasibility, interrater reliability, and minimum detectable change. Biol Blood Marrow Transplant. 2011;17(11):1619-1629. doi: 10.1016/j.bbmt.2011.04.002
- 15. Jacobsohn DA, Kurland BF, Pidala J, et al. Correlation between NIH composite skin score, patient-reported skin score, and outcome: results from the Chronic GVHD Consortium. Blood. 2012;120(13):2545-2552. doi: 10.1182/blood-2012-04-424135
- 16. Lee SY, Cardones AR, Doherty J, Nightingale K, Palmeri M. Preliminary results on the feasibility of using ARFI/SWEI to assess cutaneous sclerotic diseases. Ultrasound Med Biol. 2015;41(11):2806-2819. doi: 10.1016/j.ultrasmedbio.2015.06.007
