Author manuscript; available in PMC: 2013 Nov 1.
Published in final edited form as: Biol Blood Marrow Transplant. 2012 Jun 9;18(11):1649–1655. doi: 10.1016/j.bbmt.2012.05.005

Poor Agreement between Clinician Response Ratings and Calculated Response Measures in Patients with Chronic Graft-versus-host Disease

Jeanne M Palmer 1, Stephanie J Lee 2, Xiaoyu Chai 2, Barry E Storer 2, Mary ED Flowers 2, Kirk R Schultz 3, Yoshihiro Inamoto 2, Corey Cutler 4, Joseph Pidala 5, Mukta Arora 6, David A Jacobsohn 2, Paul A Carpenter 2, Steven Z Pavletic 8, Paul J Martin 2
PMCID: PMC3448865  NIHMSID: NIHMS384640  PMID: 22691695

Abstract

In 2005, an NIH consensus conference was held to refine methods for research in chronic graft-versus-host disease (GVHD), including proposed objective response measures and a provisional algorithm for calculating organ-specific and overall response. In this study, we used weighted kappa statistics to evaluate the level of agreement between clinician response ratings and calculated response categories in patients with chronic GVHD. The study included 290 patients who had paired enrollment and follow-up visits. Based on a set of objective measures, 37% of the patients had an overall complete or partial response, whereas clinicians reported an overall complete or partial response rate of 71% (slight to fair agreement, weighted kappa 0.20). Agreement between calculated organ-specific responses and clinician-reported changes in skin, mouth and eye was fair to moderate (weighted kappa 0.28–0.54). We conclude that for both overall and organ-specific comparisons, clinician response ratings did not agree well with calculated response categories. Possible reasons for this discrepancy include a high clinical sensitivity for detecting response, a clinical predisposition to recognize selective improvements as overall response, the large change in objective measures proposed to define response, and the high incidence of progressive disease based on new manifestations. Conclusions from prior literature reporting high overall response rates based on clinician judgment would not be supported if the provisional algorithm had been applied to calculate response. Our analysis also highlights the need to define an overall response measure that incorporates both patient-reported and objective measures, and accurately reflects the outcome in patients with a mixed response where one organ or site improves while another shows new involvement.

Introduction

Chronic graft-versus-host disease (GVHD) affects 40–70% of patients after allogeneic hematopoietic cell transplantation (HCT) and is associated with significant morbidity and mortality [1]. Historically, chronic GVHD was defined as any GVHD occurring more than 100 days after HCT [2]. In 2005, the National Institutes of Health (NIH) formed a consensus committee to clarify the diagnosis of chronic GVHD [3]. Rather than being defined as a function of time after HCT, it was proposed that chronic GVHD should be defined according to diagnostic signs such as lichen planus-like lesions, sclerosis, or fasciitis, or by distinctive manifestations confirmed by biopsy or laboratory testing [3]. The NIH criteria define mild, moderate, and severe chronic GVHD according to scores for signs and symptoms involving skin, fascia and joints, eyes, mouth, gastrointestinal tract, liver, genital tract and lungs [3]. These definitions are predictive of overall survival and quality of life [4,5]. No widely accepted gold standard is currently available for determining activity of chronic GVHD or the response to treatment. Many clinical trials have evaluated treatment of chronic GVHD, but most relied upon the overall judgment of the clinician in reporting response, and response criteria were not precisely defined [6–12]. Only a single randomized study relied upon an entirely objective primary end point for measuring treatment success [13]. Generally, complete resolution of all signs or symptoms has been classified as complete response (CR), and any significant reduction in signs or symptoms without resolution of all manifestations has been classified as partial response (PR). Progressive disease (PD) has been defined as significant worsening of symptoms or development of new organ involvement, while stable disease (SD) has been defined as the absence of significant improvement or worsening.

The 2005 NIH consensus conference addressed the complex considerations involved in assessing response to treatment in patients with chronic GVHD. In its report [14], the Response Criteria Working Group recommended that the measures used to assess response should be practical for use both by transplantation and non-transplantation medical providers, adaptable for use in adults and in children, and focused on the most important chronic GVHD manifestations. The measures should also give preference to quantitative, rather than semi-quantitative, measures, capture information regarding signs, symptoms, and function separately from each other, and use validated scales whenever possible to demonstrate improved patient outcomes to meet requirements for regulatory approval of novel agents. The Working Group proposed a set of objective measures to be considered for use in clinical trials, and forms for data collection were developed. Provisional algorithms for calculating complete response, partial response, and progression were proposed for each organ and for overall response, based on the objective response measures [14]. The provisional algorithms to calculate response categories were based on expert consensus opinion and were intended to improve consistency in the conduct and reporting of chronic GVHD trials. These definitions were not intended to be implemented without validation, and in fact are currently being assessed in a large multicenter clinical study (BMT CTN 0801: NCT01106833)[15].

In the current analysis, we used data from an observational study to compare response assessed by clinicians versus the calculated response categories. Understanding this relationship will be helpful for evaluating current therapies when results are analyzed according to the provisional response categories in comparison to previously studied therapies where response was determined by the overall judgment of the clinician.

Methods

Chronic GVHD Consortium: Description of study cohort and cohort for this analysis

A cohort of HCT recipients affected by chronic GVHD was prospectively assembled in a multicenter observational study [15]. The protocol was approved by the Institutional Review Board at each site, and all subjects provided written informed consent. Patients enrolled in the cohort were allogeneic HCT recipients at least 2 years of age with chronic GVHD requiring systemic immunosuppressive therapy, including both those with classic chronic GVHD and those with overlap syndrome. Cases were classified as incident (enrollment less than 3 months after chronic GVHD diagnosis) or prevalent (enrollment three or more months but less than three years after transplantation). Primary disease relapse, inability to comply with study procedures, and anticipated survival of less than 6 months were exclusion criteria. At enrollment and every 6 months thereafter, clinicians and patients reported standardized information summarizing chronic GVHD organ involvement and symptoms after a clinic visit. Patients were allowed to take the self-administered survey home and return it by mail. At each time point, the same clinician reported all the clinician-reported items. Incident cases had an additional assessment time point at 3 months after enrollment. Objective medical data including ancillary testing and laboratory results, medical complications, and medication profiles were abstracted through standardized chart review after each visit. Clinicians were not given the calculated NIH overall or organ-specific response at any time during this study, nor were they provided with previously completed study forms.

Measures

Several response measures were used in the comparisons.

  1. Objective response measures and calculated overall and organ-specific responses. The objective measures used in this study were derived from the recommendations in Part IV of the National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease [14]. For example, the skin assessment was based on the extent of involved body surface, the eye assessment was based on Schirmer’s test, the oral assessment was based on the 15-point Schubert score, and the liver assessment was based on laboratory values relative to the upper limit of normal. Evaluation of the gastrointestinal (GI) tract used the NIH response scales separately for the esophagus, upper GI tract and lower GI tract. Provisional overall response categories were assigned based on change from enrollment to follow-up, with complete resolution of organ dysfunction considered a complete response. The definition of partial response was based on the principle that more than 50% relative improvement or 25% absolute improvement from baseline, whichever is greater, without worsening elsewhere would be evidence of clinical benefit and within the range of change detectable by clinicians [16]. More than 25% worsening in a previously involved organ was considered progression. Since the amount of new organ involvement that would define PD was not clearly stated in the provisional response algorithm, any new organ involvement at the assessment time compared to baseline was also considered PD for purposes of this analysis. All other categories were considered stable disease.

  2. Four-point clinician response assessment (“4-point assessment”). At follow-up visits, clinicians were asked whether the patient had a CR, PR, SD or PD.

  3. Collapsed 8-point clinician response assessment (“8-point assessment”). At follow-up visits, clinicians rated their perception of change in overall GVHD and in organ-specific involvement of skin, mouth, eye, and fascia and joints on an 8-point scale, which was categorized as CR (resolved = 1), PR (very much better = 2, moderately better = 3), SD (a little better = 4, about the same = 5, a little worse = 6), and PD (moderately worse = 7, very much worse = 8).

  4. Change in NIH 0–3 severity scores. At each assessment time, the clinician determined overall severity or organ-specific severity on a scale of 0–3 [3]. The overall severity score had the following descriptions associated with it: none (0), mild (1), moderate (2), or severe (3). The organ-specific severity scores were determined by both physical findings and activity limitations. Change in severity scores was determined by comparing the enrollment and follow-up score. A decrease in the score to 0 was considered CR. Improvement indicated by a decrease in the score without a final score of 0 was considered a PR (for example, a change from 3 at enrollment to 2 at follow-up). Worsening indicated by an increase in the score was considered PD.
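The two clinician-derived categorizations in items 3 and 4 can be written down compactly. The following is a minimal Python sketch, not the study's actual code: the function and variable names are hypothetical, and treating an unchanged severity score as SD is an assumption, since the text above defines only CR, PR, and PD for that measure.

```python
# Illustrative sketch only; all names here are hypothetical.

# Item 3: collapse the 8-point clinician scale into four response categories.
EIGHT_POINT_TO_CATEGORY = {
    1: "CR",  # resolved
    2: "PR",  # very much better
    3: "PR",  # moderately better
    4: "SD",  # a little better
    5: "SD",  # about the same
    6: "SD",  # a little worse
    7: "PD",  # moderately worse
    8: "PD",  # very much worse
}


def collapse_eight_point(rating: int) -> str:
    """Map an 8-point rating (1-8) to CR, PR, SD, or PD."""
    if rating not in EIGHT_POINT_TO_CATEGORY:
        raise ValueError(f"rating must be an integer from 1 to 8, got {rating!r}")
    return EIGHT_POINT_TO_CATEGORY[rating]


def severity_change_category(enrollment: int, follow_up: int) -> str:
    """Item 4: classify the change in an NIH 0-3 severity score."""
    for score in (enrollment, follow_up):
        if score not in (0, 1, 2, 3):
            raise ValueError(f"severity score must be 0, 1, 2, or 3, got {score!r}")
    if follow_up > enrollment:
        return "PD"  # any increase in the score
    if follow_up < enrollment:
        # A decrease to 0 is CR; any other decrease is PR.
        return "CR" if follow_up == 0 else "PR"
    return "SD"  # assumption: an unchanged score is treated as stable disease
```

For example, `severity_change_category(3, 2)` returns `"PR"`, matching the example given in item 4.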

Statistical Methods

Patient socio-demographics, transplantation characteristics, and GVHD organ severities at enrollment were presented as median and range for continuous variables, and as frequency and percentage for categorical variables.

Agreement between the various overall and organ-specific responses at the 3- or 6-month visit was tested by the weighted kappa statistic for ordinal measures with Fleiss-Cohen weights [17]. The kappa coefficient was interpreted empirically (0, no agreement; 0–0.2, slight agreement; 0.2–0.4, fair agreement; 0.4–0.6, moderate agreement; 0.6–0.8, substantial agreement; 0.8–1.0, almost perfect agreement). Chi-square tests were used to evaluate differences in the proportions of responses. Results are presented for the entire cohort, although incident and prevalent cases were also analyzed separately to ensure that conclusions were similar. Statistical analyses were conducted using SAS/STAT software, Version 9.2 (SAS Institute).
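For readers unfamiliar with the statistic, the sketch below shows one way a weighted kappa with Fleiss-Cohen (quadratic) weights can be computed for two sets of ordinal ratings. It is an illustration in plain Python, not the SAS procedure used in the study; the function names, the category order CR < PR < SD < PD, and the handling of values that fall exactly on a band boundary are assumptions made for the example.

```python
from typing import Sequence

CATEGORIES = ("CR", "PR", "SD", "PD")  # assumed ordinal order, best to worst


def weighted_kappa(rater_a: Sequence[str], rater_b: Sequence[str],
                   categories: Sequence[str] = CATEGORIES) -> float:
    """Weighted kappa with quadratic (Fleiss-Cohen) disagreement weights."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same non-zero number of ratings")
    k = len(categories)
    index = {name: i for i, name in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions for each pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: 0 on the diagonal, 1 at maximal distance.
            d = (i - j) ** 2 / (k - 1) ** 2
            observed_disagreement += d * observed[i][j]
            expected_disagreement += d * row[i] * col[j]

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - observed_disagreement / expected_disagreement


def interpret_kappa(kappa: float) -> str:
    """Label a kappa value using the bands quoted in the text.

    Assumption: each band includes its upper bound.
    """
    if kappa <= 0:
        return "no agreement"
    bands = ((0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
             (0.8, "substantial"), (1.0, "almost perfect"))
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")
```

With four patients rated CR, PR, SD, PD by one method and PR, PR, SD, PD by the other, the single one-step disagreement gives a weighted kappa of 0.875. Quadratic weights penalize a CR-versus-PD disagreement nine times as heavily as a CR-versus-PR disagreement.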

Results

Patient characteristics

As of September 2010, 290 patients who had at least one follow-up visit 3 or 6 months after enrollment were included. The median age of these patients was 51 years (range 2–74) (Table 1); 14 patients were <20 years old at the time of enrollment. At enrollment, the sites most frequently affected by chronic GVHD were the mouth (62%) and skin (61%). Less frequently involved organs were the liver (50%), eye (48%), lung (47%), fascia and joints (28%) and gastrointestinal tract (26%).

Table 1.

Patient characteristics (N = 290)

| Characteristic | Data |
| --- | --- |
| Case type, N (%) | |
| Incident | 160 (55) |
| Prevalent | 130 (45) |
| Patient age at enrollment, median years (range) | 51 (2–74) |
| Months from transplant to enrollment, median (range) | 12 (3–39) |
| Adults, N (%) | 279 (96) |
| Male, N (%) | 169 (58) |
| Disease status, N (%) | |
| Early | 101 (35) |
| Intermediate | 127 (44) |
| Advanced | 62 (21) |
| Transplant source, N (%) | |
| Peripheral blood | 260 (89) |
| Bone marrow | 19 (7) |
| Cord blood | 11 (4) |
| Transplant type, N (%) | |
| Myeloablative | 162 (56) |
| Non-myeloablative | 128 (44) |
| Donor match, N (%) | |
| HLA-matched related | 141 (49) |
| HLA-matched unrelated | 108 (37) |
| HLA-mismatched | 41 (14) |
| Donor and recipient gender, N (%)* | |
| Female into male | 82 (28) |
| Other | 206 (72) |
| Prior acute GVHD, N (%) | 188 (65) |
| NIH severity score at enrollment, N (%) | |
| None | 2 (1) |
| Mild | 37 (13) |
| Moderate | 167 (57) |
| Severe | 84 (29) |

* Gender information is not available for 2 donors.

Response rates, according to different overall outcome measures

Overall complete plus partial response rates differed strikingly depending on the measure used to evaluate response, primarily because of differences in the proportion of patients with PR and PD (Table 2). By the calculated response categories, 37% of the patients had a CR or PR, much lower than the 71% rate observed with the 4-point assessment (P < 0.001) and also lower than the 48% rate observed with the collapsed 8-point assessment (P = 0.004). These results demonstrate that the algorithm for calculating response is much more stringent than clinician ratings in concluding that a response has occurred. In contrast, the CR plus PR rate derived from the change in the overall 0–3 severity score was 27%, lower than the 37% rate observed according to the provisional response categories (P < 0.001).

Table 2.

Distribution of response categories according to method of outcome assessment

| Method of outcome assessment | CR, N (%) | PR, N (%) | SD, N (%) | PD, N (%) |
| --- | --- | --- | --- | --- |
| Calculated response | 24 (8) | 83 (29) | 25 (9) | 158 (54) |
| 4-point assessment* | 31 (11) | 171 (60) | 30 (10) | 56 (19) |
| Collapsed 8-point assessment* | 19 (7) | 117 (41) | 127 (44) | 22 (8) |
| Change in overall severity score* | 15 (5) | 63 (22) | 177 (62) | 32 (11) |

* Two patients had data missing for the 4-point assessment, 5 patients had data missing for the 8-point assessment, and 3 patients had data missing for the change in overall severity score.

The proportion of patients with progressive disease was substantially higher with the provisional algorithm for calculating response than with any of the other measures (Table 2). In many cases, categorization of PD was based on the development of a new manifestation that was not reported at the baseline visit. In fact, raising the threshold to 50% worsening in order to define PD minimally changed Table 2 (data not shown). Among 98 patients with discrepant response measures who had PD according to the calculated response categories and CR or PR according to the 4-point clinician assessment, PD was based on new organ involvement in 79 (81%) patients, 63 with only a single new manifestation, 15 with two new manifestations, and 1 with three new manifestations. Newly involved organs included the gastrointestinal tract (n = 30), skin (n = 27), liver (n = 20), eye as measured by Schirmer’s test (n = 11), oral cavity (n = 7) and lung (n = 1). Among the 85 new manifestations involving the gastrointestinal tract, skin, liver, oral cavity and lung, 11 (13%) had grade 0 or 1 severity according to the NIH 0–3 scale (Table 3). Among the eleven cases where the eye was the reason for PD based on Schirmer’s test results, only 5 were categorized as PD based on NIH 0–3 eye scale, and two were not even categorized as abnormal based on NIH 0–3 eye scale. Among the 27 patients who had PD based on new skin manifestations, 15 had different skin manifestations at the baseline visit (e.g. movable sclerosis at baseline, erythema at the follow-up assessment). If the NIH 0–3 skin score had been used to measure the skin response for these 15 cases, four would have improved, eight would have remained the same, and only three would have been considered PD.

Table 3.

Characteristics of new organ involvement at 3 or 6 months in patients without progression in a previously involved site.

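| Organ | Number of patients | NIH 0 | NIH 1 | NIH 2 | NIH 3 |
| --- | --- | --- | --- | --- | --- |
| Gastrointestinal tract | 30 | 1 | 3 | 17 | 9 |
| Skin | 27 | 0 | 2 | 18 | 7 |
| Liver | 20 | 0 | 4 | 12 | 4 |
| Eye | 11 | 0 | 0 | 9 | 2 |
| Mouth | 7 | 0 | 1 | 4 | 2 |
| Lung | 1 | 0 | 0 | 0 | 1 |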
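The counts in Table 3 can be cross-checked against the figures quoted in the text (96 new manifestations in 79 patients, of which 85 were outside the eye and 11 of those had NIH grade 0 or 1). The short Python sketch below does that arithmetic; the variable names are hypothetical and the data are copied from Table 3 and the preceding paragraph.

```python
# New organ involvement by NIH 0-3 severity (NIH 0, 1, 2, 3), copied from Table 3.
TABLE_3 = {
    "Gastrointestinal tract": (1, 3, 17, 9),
    "Skin": (0, 2, 18, 7),
    "Liver": (0, 4, 12, 4),
    "Eye": (0, 0, 9, 2),
    "Mouth": (0, 1, 4, 2),
    "Lung": (0, 0, 0, 1),
}

# Total new manifestations across all organs.
total_all = sum(sum(counts) for counts in TABLE_3.values())

# The text reports 63 patients with one, 15 with two, and 1 with three
# new manifestations; these should add up to the same total.
total_from_patients = 63 * 1 + 15 * 2 + 1 * 3

# The eye is excluded because it was assessed by Schirmer's test.
non_eye = {organ: counts for organ, counts in TABLE_3.items() if organ != "Eye"}
total_non_eye = sum(sum(counts) for counts in non_eye.values())
grade_0_or_1 = sum(counts[0] + counts[1] for counts in non_eye.values())
percent_grade_0_or_1 = round(100 * grade_0_or_1 / total_non_eye)
```

Both totals come to 96, and the non-eye share with grade 0 or 1 severity is 11 of 85, or 13% after rounding.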

Overall response comparisons

We compared the calculated response categories to ratings that reflected the clinician impression, measured as the 4-point assessment, where clinicians categorized the outcome as CR, PR, SD or PD, or as the collapsed 8-point assessment. The calculated response categories did not agree well with the 4-point assessment, the collapsed 8-point assessment, or the change in overall severity score, as indicated by the low weighted kappa statistics of 0.20, 0.24, and 0.25, respectively (Table 4). Agreement was better among the clinician-reported measures. The change in 0–3 overall severity had moderate agreement with the 4-point assessment and the 8-point assessment, as indicated by the weighted kappa statistics of 0.42 and 0.52, respectively (Table 4). Results were similar regardless of whether the same or different clinicians did the enrollment and follow-up assessments. The best agreement was seen between the 4-point assessment and the collapsed 8-point assessment, which are both clinician-reported measures (weighted kappa = 0.69).

Table 4.

Comparisons between overall outcome categories

| Comparison | Weighted kappa, all evaluations | Weighted kappa, same clinician | Weighted kappa, different clinician |
| --- | --- | --- | --- |
| Calculated response versus 4-point assessment | 0.20 | 0.28 | 0.11 |
| Calculated response versus collapsed 8-point assessment | 0.24 | 0.28 | 0.19 |
| Calculated response versus change in overall severity score | 0.25 | 0.35 | 0.16 |
| 4-point versus collapsed 8-point assessment | 0.69 | 0.62 | 0.75 |
| 4-point assessment versus change in overall severity score | 0.42 | 0.45 | 0.40 |
| 8-point assessment versus change in overall severity score | 0.52 | 0.53 | 0.51 |

Organ-specific response comparisons

We also examined the different measures for response in each organ individually and compared them wherever possible. Specifically, calculated response categories, clinician assessments and changes in severity scores were available for skin, mouth, and eyes. For the skin, the clinician response assessment and the change in severity score showed moderate to substantial agreement with the calculated response category, but evaluations of the eyes and mouth showed only fair to moderate agreement (Table 5).

Table 5.

Comparisons between organ-specific outcome measures

| Organ and outcome measure | CR, N (%) | PR, N (%) | SD, N (%) | PD, N (%) | Weighted kappa |
| --- | --- | --- | --- | --- | --- |
| Skin | | | | | |
| Calculated response | 63 (33) | 21 (11) | 46 (24) | 60 (32) | Reference |
| Collapsed 8-point assessment | 49 (24) | 49 (24) | 91 (45) | 15 (7) | 0.54 |
| Change in 0–3 severity score | 55 (27) | 27 (13) | 74 (37) | 47 (23) | 0.65 |
| Mouth | | | | | |
| Calculated response | 42 (18) | 21 (9) | 132 (57) | 37 (16) | Reference |
| Collapsed 8-point assessment | 42 (18) | 82 (36) | 94 (41) | 11 (5) | 0.44 |
| Change in 0–3 severity score | 56 (27) | 24 (11) | 91 (43) | 40 (19) | 0.45 |
| Eye | | | | | |
| Calculated response | 17 (17) | 7 (7) | 44 (44) | 32 (32) | Reference |
| Collapsed 8-point assessment | 10 (6) | 23 (14) | 116 (72) | 13 (8) | 0.28 |
| Change in 0–3 severity score | 24 (13) | 16 (9) | 79 (43) | 64 (35) | 0.23 |

Discussion

Overall comparisons of the response assessments (CR, PR, SD, PD) or perceptions of change (“completely resolved” to “very much worse”) reported by clinicians did not agree well with the calculated response categories. These data suggest that conclusions from prior literature reporting high overall CR+PR rates based on clinician judgment would not be supported if the provisional algorithm to calculate response based on objective measures [14] had been used. Neither the clinician-reported responses nor the calculated response categories can be considered to represent a ‘gold standard’, but our analysis provides an opportunity to compare them and determine why such marked differences exist.

Many cases classified as PD by the provisional algorithm were based on the development of mild manifestations in a previously uninvolved site, together with improvement in other organs. This high proportion of PD contributed to the poor agreement with clinician ratings. In most of these cases, the new manifestations had grade 1 or 2 severity based on the NIH 0–3 scale. Clinicians might have concluded that the effects of new manifestations were outweighed by improvements in other manifestations. The assessment of overall response is further complicated when an organ can have multiple manifestations. For example, a change from 10% non-movable sclerosis at baseline to 8% non-movable and 2% movable sclerosis at the follow-up assessment could reflect inter-observer variability [18], softening of preexisting sclerosis (i.e., improvement) or extension of sclerosis into previously unaffected areas (i.e., worsening). In the current analysis, the provisional algorithm would have assigned this patient to the PD category. In addition, some of the objective measures might not be reliable. For example, the proposed objective scale for ocular involvement is based entirely on Schirmer’s test. In a previous study, we showed that the NIH 0–3 scale appears to be the best measure of chronic GVHD involving the eye, but this scale does not agree well with results of Schirmer’s test [19]. Finally, clinicians could have attributed new manifestations involving the gastrointestinal tract or liver to acute GVHD or other causes that were not considered relevant to the assessment of chronic GVHD [20]. These insights from our study suggest that special consideration is needed in situations where one organ improves significantly while another shows clinically insignificant worsening. Standardization of such a response categorization system would likely be enormously complex.
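To make the mixed-response problem concrete, the sketch below shows how an overall category could be derived from organ-level changes using the rules as summarized in the Methods of this analysis. It is a simplified illustration, not the provisional algorithm itself [14]: the status labels and the function name are hypothetical, and the organ-level thresholds are assumed to have been applied already.

```python
from typing import Mapping

# Hypothetical organ-level status labels, assumed to be assigned upstream:
#   "resolved"   - involved at baseline, no involvement at follow-up
#   "improved"   - met the partial-response threshold
#   "unchanged"  - neither improved nor worsened beyond the thresholds
#   "worsened"   - more than 25% worsening in a previously involved organ
#   "new"        - not involved at baseline, involved at follow-up
#   "uninvolved" - not involved at either visit
VALID_STATUSES = {"resolved", "improved", "unchanged", "worsened", "new", "uninvolved"}


def overall_response(organ_changes: Mapping[str, str]) -> str:
    """Combine organ-level changes into CR, PR, SD, or PD (simplified sketch)."""
    unknown = set(organ_changes.values()) - VALID_STATUSES
    if unknown:
        raise ValueError(f"unknown status labels: {sorted(unknown)}")
    statuses = [s for s in organ_changes.values() if s != "uninvolved"]
    if not statuses:
        raise ValueError("at least one organ must be involved at some visit")
    # Any worsening or any new organ involvement counts as progression,
    # regardless of improvement elsewhere.
    if any(s in ("worsened", "new") for s in statuses):
        return "PD"
    if all(s == "resolved" for s in statuses):
        return "CR"
    if any(s in ("resolved", "improved") for s in statuses):
        return "PR"
    return "SD"
```

Under these rules, a patient whose skin improved and whose mouth involvement resolved but who developed new liver involvement, `overall_response({"skin": "improved", "mouth": "resolved", "liver": "new"})`, is classified as PD, which is the pattern behind many of the discrepant cases described above.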

In attempting to validate the provisional algorithm for calculating response, we chose the clinician’s perceived assessment of response for comparison, since this measure clearly has value, especially for purposes of clinical care. While this may suggest that we are advocating clinical judgment as the gold standard, this measure is not sufficiently objective to use as an endpoint in clinical trials. Clinicians were more likely to report responses than would be calculated from the objective measures using the provisional algorithm. In some cases, it appears that clinicians reported that patients were improving, even when their own documentation of organ dysfunction at enrollment and follow-up did not support this conclusion. There was evidence that some clinicians were focusing their attention on the most bothersome or worrisome manifestation without due consideration of other changes (e.g., liver function tests). It is also possible that the magnitude of response required for PR according to the provisional algorithm is too stringent, as suggested by the high number of patients with SD. Simplification of the criterion to require only an absolute 25% improvement, even in patients with more than 50% abnormality at baseline, might more accurately reflect the evaluation by clinicians. A study by Mitchell et al. [16] demonstrated that the minimal detectable change was 10–28% of BSA for erythema, 18–26% for movable sclerosis, and 17–32% for non-movable sclerosis. Clinicians might benefit from training on best practices in performing a chronic GVHD evaluation [21] to encourage a comprehensive and systematic approach in assessing treatment response. This would also lead to greater precision in documenting chronic GVHD manifestations, which would increase the likelihood that changes in reported organ involvement reflect change in chronic GVHD disease activity and not measurement variations.
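The effect of the proposed simplification can be illustrated numerically. The sketch below assumes a measure expressed on a 0–100 percentage scale (for example, percent of body surface area) and reads "50% relative or 25% absolute improvement, whichever is greater" as requiring an absolute improvement of at least the larger of the two amounts. The function names, the scale, and the use of "at least" rather than "more than" at the threshold are assumptions made for illustration.

```python
def required_improvement(baseline_pct: float, simplified: bool = False) -> float:
    """Absolute improvement (percentage points) needed for a partial response.

    simplified=False: the larger of 50% of baseline and 25 points.
    simplified=True:  25 points regardless of baseline (the suggested simplification).
    """
    if not 0 < baseline_pct <= 100:
        raise ValueError("baseline_pct must be greater than 0 and at most 100")
    if simplified:
        return 25.0
    return max(0.5 * baseline_pct, 25.0)


def meets_partial_response(baseline_pct: float, follow_up_pct: float,
                           simplified: bool = False) -> bool:
    """Return True if the change qualifies as PR under the chosen rule."""
    if not 0 <= follow_up_pct <= 100:
        raise ValueError("follow_up_pct must be between 0 and 100")
    if follow_up_pct == 0:
        return False  # complete resolution is CR, not PR
    improvement = baseline_pct - follow_up_pct
    return improvement >= required_improvement(baseline_pct, simplified)
```

For a patient with 80% involvement at baseline and 50% at follow-up, the 30-point improvement falls short of the 40 points required under the first rule but satisfies the simplified 25-point rule. Under this reading, a baseline below 25 points cannot reach PR without resolving completely.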

This study has several limitations. The frequent dissociation between the beginning of new treatment and the baseline evaluation did not mimic conditions that would apply in the context of a clinical trial, and the results cannot be used to estimate the response rates in a clinical trial setting. In our study, clinicians were asked to assess changes over a 3–6 month interval, and the clinician performing the evaluation did not have access to the previous detailed documentation by the tools proposed for use in evaluating response. Clinicians were not asked to apply the proposed algorithms for response assessment and were not given any training in these calculations. In many cases, the baseline and subsequent evaluations were done by different clinicians. Therefore, the clinician ratings depended on either patient or clinician memory or standard medical record documentation. The number of different clinicians performing the exams may be seen as a weakness, but this reflects clinical practice, and the objective criteria were designed to provide consistency across providers even if a patient saw different providers. In addition, the objective measure used for certain sites of involvement was not optimal, as exemplified by the assessment of ocular GVHD. Another limitation was the low number of pediatric patients evaluated in this study, making it difficult to apply these findings to a pediatric population. Although the inclusion of both incident and prevalent cases could be questioned, the purpose of our study was not to determine the actual response rates, but to evaluate the level of agreement between the different measures of response, which was not affected by incident versus prevalent status at enrollment. Finally, the measures used for the current study did not include patient-reported outcomes, including quality of life, which is the subject of another analysis.

Our current results indicate the need for caution when results of studies that use the provisional algorithm to calculate response are compared with those of previous clinical trials that relied primarily on clinician ratings. The response criteria working group in the 2005 NIH consensus conference on chronic GVHD recognized that many outcome measures could be used to evaluate the response of chronic GVHD [14], but the current analysis was focused specifically on clinician response ratings and calculated response measures. Our results highlight the need to identify the most reliable and informative clinician-reported and patient-reported measures of chronic GVHD activity, so that changes between the baseline and the assessment time can be used to assess treatment effect. This measure should accurately reflect the outcome in patients with a mixed response where one organ or site improves, while another shows new involvement. More sophisticated algorithms for measurement of response should be developed and tested in prospective studies.

Acknowledgments

This work was supported by grant CA 118953 (PI: Lee, SJ) from the NIH/NCI.

Footnotes

Author contributions:

JP proposed the study concept, analyzed data, and wrote the manuscript. BS and XC performed statistical analyses and contributed to writing the manuscript. MF, KS, YI, CC, JP, MA, DJ, PC, SP contributed to data analysis and critical review of the manuscript. PM and SJL contributed to development of study concept, data analysis, and writing of the manuscript.

Disclosure: The authors report no relevant conflict of interest.


References

  1. Lee SJ, Vogelsang G, Flowers MED. Chronic graft-versus-host disease. Biology of Blood and Marrow Transplantation. 2003;9:215–233. doi: 10.1053/bbmt.2003.50026.
  2. Shulman H, Sullivan KM, Weiden PL, et al. Chronic graft-versus-host syndrome in man: A long-term clinicopathologic study of 20 Seattle patients. American Journal of Medicine. 1980;69:204–217. doi: 10.1016/0002-9343(80)90380-0.
  3. Filipovich AH, Weisdorf D, Pavletic S, et al. National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease: I. Diagnosis and Staging Working Group Report. Biology of Blood and Marrow Transplantation. 2005;11:945–956. doi: 10.1016/j.bbmt.2005.09.004.
  4. Pidala J, Kurland B, Chai X, et al. Patient-reported quality of life is associated with severity of chronic graft-versus-host disease as measured by NIH criteria: report on baseline data from the Chronic GVHD Consortium. Blood. 2011;117:4651–4657. doi: 10.1182/blood-2010-11-319509.
  5. Arai S, Jagasia M, Storer B, et al. Global and organ-specific chronic graft-versus-host disease severity according to the 2005 NIH Consensus Criteria. Blood. 2011;118:4242–4249. doi: 10.1182/blood-2011-03-344390.
  6. Arora M, Wagner JE, Davies SM, et al. Randomized clinical trial of thalidomide, cyclosporine, and prednisone versus cyclosporine and prednisone as initial therapy for chronic graft-versus-host disease. Biology of Blood and Marrow Transplantation. 2001;7:265–273. doi: 10.1053/bbmt.2001.v7.pm11400948.
  7. Busca A, Locatelli F, Marmont F, Audisio E, Falda M. Response to mycophenolate mofetil therapy in refractory chronic graft-versus-host disease. Haematologica. 2003;88:837–839.
  8. Chen GL, Arai S, Flowers MED, et al. A phase 1 study of imatinib for corticosteroid-dependent/refractory chronic graft-versus-host disease: response does not correlate with anti-PDGFRA antibodies. Blood. 2011;118:4070–4078. doi: 10.1182/blood-2011-03-341693.
  9. Couriel DR, Saliba R, Escalón MP, et al. Sirolimus in combination with tacrolimus and corticosteroids for the treatment of resistant chronic graft-versus-host disease. British Journal of Haematology. 2005;130:409–417. doi: 10.1111/j.1365-2141.2005.05616.x.
  10. Couriel DR, Hosing C, Saliba R, et al. Extracorporeal photochemotherapy for the treatment of steroid-resistant chronic GVHD. Blood. 2006;107:3074–3080. doi: 10.1182/blood-2005-09-3907.
  11. Cutler C, Miklos D, Kim HT, et al. Rituximab for steroid-refractory chronic graft-versus-host disease. Blood. 2006;108:756–762. doi: 10.1182/blood-2006-01-0233.
  12. Jacobsohn DA, Chen AR, Zahurak M, et al. Phase II Study of Pentostatin in Patients With Corticosteroid-Refractory Chronic Graft-Versus-Host Disease. Journal of Clinical Oncology. 2007;25:4255–4261. doi: 10.1200/JCO.2007.10.8456.
  13. Flowers MED, Apperley JF, van Besien K, et al. A multicenter prospective phase 2 randomized study of extracorporeal photopheresis for treatment of chronic graft-versus-host disease. Blood. 2008;112:2667–2674. doi: 10.1182/blood-2008-03-141481.
  14. Pavletic SZ, Martin P, Lee SJ, et al. Measuring Therapeutic Response in Chronic Graft-versus-Host Disease: National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease: IV. Response Criteria Working Group Report. Biology of Blood and Marrow Transplantation. 2006;12:252–266. doi: 10.1016/j.bbmt.2006.01.008.
  15. The Chronic GVHD Consortium. Rationale and Design of the Chronic GVHD Cohort Study: Improving Outcomes Assessment in Chronic GVHD. Biology of Blood and Marrow Transplantation. 2011;17:1114–1120. doi: 10.1016/j.bbmt.2011.05.007.
  16. Mitchell SA, Jacobsohn D, Thormann Powers KE, et al. A Multicenter Pilot Evaluation of the National Institutes of Health Chronic Graft-versus-Host Disease (cGVHD) Therapeutic Response Measures: Feasibility, Interrater Reliability, and Minimum Detectable Change. Biology of Blood and Marrow Transplantation. 2011;17:1619–1629. doi: 10.1016/j.bbmt.2011.04.002.
  17. Fleiss J, Cohen J. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability. Educational and Psychological Measurement. 1973;33:613–619.
  18. Jacobsohn DA, Rademaker A, Kaup M, Vogelsang GB. Skin response using NIH consensus criteria vs Hopkins scale in a phase II study for steroid-refractory chronic GVHD. Bone Marrow Transplant. 2009:1–7. doi: 10.1038/bmt.2009.84.
  19. Inamoto Y, Chai X, Kurland BF, et al. Validation of Measurement Scales in Ocular Graft-versus-Host Disease. Ophthalmology. 2012;119:487–493. doi: 10.1016/j.ophtha.2011.08.040.
  20. Pidala J, Vogelsang G, Martin P, et al. Overlap subtype of chronic graft vs. host disease is associated with adverse prognosis, functional impairment, and inferior patient reported outcomes: a chronic graft vs. host disease Consortium study. Haematologica. 2011;96:1678–1684. doi: 10.3324/haematol.2011.055186.
  21. Carpenter PA. How I conduct a comprehensive chronic graft-versus-host disease assessment. Blood. 2011;118:2679–2687. doi: 10.1182/blood-2011-04-314815.
