INTRODUCTION
Chronic graft-versus-host disease (cGVHD) is the leading cause of non-relapse long-term morbidity and mortality in patients after allogeneic haematopoietic cell transplantation (HCT) [1]. Skin is the earliest and most commonly affected organ and has a central role in evaluating treatment efficacy and disease progression [2]. Through National Institutes of Health (NIH) scoring, the affected cutaneous body surface area (BSA) was incorporated into the trial designs that tested all three currently FDA-approved cGVHD treatments. For example, one inclusion criterion for the pivotal trial of ibrutinib was a minimum of 25% body surface area erythema [3]. However, visual assessment of cGVHD suffers from low reliability, limiting the ability to effectively follow patient response. This gap in measuring affected BSA in patients with known cGVHD could be addressed by automated image analysis techniques leveraging artificial intelligence (AI). To this end, we trained a deep learning neural network [4] to mark (segment) affected skin and tested its performance by leave-one-patient-out validation in 36 patients, so that each model was evaluated on a patient it had never seen. We further benchmarked the AI against precise human measurements of affected BSA in 3D photographs.
METHODS
Annotated photo dataset
Thirty-six patients (characteristics in Supplemental Table S1) were photographed with a Vectra H1 3D camera (Canfield Scientific, Inc., Parsippany, NJ), yielding a dataset of 360 three-dimensional cross-polarized photographs. The camera uses ranging lights to ensure precise measurement of the skin surface area captured, in square centimetres, and a consistent field of view for each acquired photograph. When the entire body is photographed, the sum of the skin surface areas from all photos equals the total body surface area. As human ground truth, areas affected by cGVHD in each photo were marked by an annotator (KP) with >6 months of cGVHD training (Supplemental Methods: Ground truth annotations). In total, 179 photos from 25 patients were marked as containing skin affected by cutaneous cGVHD; the remaining 181 photos showed unaffected skin (Supplemental Table S2).
Artificial intelligence segmentation algorithm
We trained a deep learning U-Net algorithm [4] to identify skin affected by cGVHD in patient photos (Supplemental Methods: Algorithm development and model training). For leave-one-patient-out validation, each of 36 models was trained on the ground truth annotations of 35 patients, and its performance was tested on the photos of the single held-out (unseen) patient, as sketched below.
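For illustration, the following is a minimal sketch of a leave-one-patient-out loop using scikit-learn's LeaveOneGroupOut splitter. The arrays `photos`, `masks`, and `patient_ids` and the functions `train_unet` and `evaluate` are hypothetical placeholders; the actual training procedure is described in the Supplemental Methods.

```python
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_patient_out(photos, masks, patient_ids, train_unet, evaluate):
    """Train one model per held-out patient; return per-patient test scores.

    photos, masks, patient_ids: parallel NumPy arrays (hypothetical).
    train_unet, evaluate: placeholders for the U-Net training and metric
    computation detailed in the Supplemental Methods.
    """
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(
            photos, masks, groups=patient_ids):
        model = train_unet(photos[train_idx], masks[train_idx])  # 35 patients
        patient = patient_ids[test_idx][0]                       # 1 unseen patient
        scores[patient] = evaluate(model, photos[test_idx], masks[test_idx])
    return scores
```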
Evaluation of algorithm performance
Two segmentation performance metrics were used: (i) the Dice coefficient, a widely used computer vision metric of spatial overlap; and (ii) the surface area error, defined as the difference between the fractions of skin pixels marked as affected by the human and by the algorithm. A computational sketch of both metrics is given below.
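As a minimal NumPy sketch of these two metrics on binary per-photo masks (the restriction to skin pixels via a `skin` mask and the use of an absolute difference are assumptions of this illustration; the exact implementation is in the Supplemental Methods):

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Spatial overlap of two binary masks: 0 (no overlap) to 1 (perfect)."""
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

def surface_area_error(pred, truth, skin):
    """Difference, in percent of skin pixels, between the fractions marked
    affected by the algorithm (pred) and by the human (truth).
    `skin` is an assumed binary mask of all skin pixels in the photo."""
    return 100.0 * abs(float(pred.sum()) - float(truth.sum())) / skin.sum()
```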
An independent board-certified dermatologist (LW), blinded to all clinical and technical details, visually evaluated the quality of both the human ground truth and the algorithm output. He scored how well each contour marked abnormal skin on a 5-point subjective scale (1, poor quality; 3, clinically acceptable; 5, expert level).
RESULTS AND DISCUSSION
Algorithm agreement with human annotations
Figure 1 shows representative algorithm output. In identifying whether affected skin was present in a photograph, the algorithm achieved an accuracy of 0.92, a positive predictive value of 0.92, and a negative predictive value of 0.92 at the image level (Supplemental Table S2). Segmentation performance was assessed only in the 179 photographs containing affected skin (per human ground truth). The algorithm achieved a median Dice coefficient of 0.74 (interquartile range: 0.40–0.89) and a median surface area error of 8.89% (interquartile range: 3.69–22.10%) (Figure 2a). Performance varied substantially between patients (Figure 2c) and was significantly lower in Fitzpatrick skin types IV–VI, which were underrepresented in our study (Figure 2b). Whereas the algorithm's median surface area error was under 9% overall, it was 24% for these darker skin types, likely due to the paucity of training examples. For context, Mitchell et al. reported a minimum detectable change for clinicians of approximately 20% [5]. Disease type showed no consistent effect on per-patient algorithm performance (Figure S2).
Figure 1.
Example contour overlays of human annotation (ground truth, green) and algorithm segmentation (blue). (a, b) Examples where both the human and the algorithm were scored as expert level (5/5) by the board-certified dermatologist. (c) Example where the algorithm scored higher (5) than the human (4). (d) Example where the human scored higher (5) than the algorithm (2).
Figure 2.
Deep learning algorithm performance on photos of cGVHD patients in the leave-one-patient-out experiment. (a) Segmentation performance assessed by the Dice coefficient of agreement (0, no overlap; 1, perfect overlap) and the surface area error (0–100%). Each point corresponds to a photograph (n=179) of an unseen patient (N=25). (b) Performance on affected photos (n=179) split into Fitzpatrick skin types I–III (blue, N=23) and IV–VI (green, N=2). ****P < 0.0001 by Mann-Whitney U test. (c) Segmentation performance per patient with three or more affected photos. Each dot represents a different photo of skin affected by cGVHD (n=167) and each box represents a different patient (N=17). Patients with Fitzpatrick skin types I–III are shown in blue and types IV–VI in green. Note that eight type I–III patients (n=12 affected photos) from panels a and b are excluded here because they had fewer than three affected photos. (d) Dermatologist assessment of the cGVHD-affected skin contours marked by the algorithm and the human. The histogram shows the percentage of photos given each score. (e) Histogram of differences between algorithm and human segmentation scores. Positive x-axis values represent cases where the algorithm contour scored higher; negative values, where the human contour scored higher. (f) Bland-Altman plot of the estimated percent body surface area (BSA) marked as affected by the human annotator and the algorithm for patients with at least three photographed body sites (N=31).
External dermatologist evaluation of algorithm segmentation contours
Algorithm output was scored as clinically acceptable or better (3+) in 77% of photos, at expert level (5/5) in 20% of photos (Figure 2d), and noninferior to the human annotation in 52% of photos (Figure 2e). The mean score was 3.45 for algorithm contours versus 4.17 for human contours. The median difference between the two scores for an individual photo was 0 (interquartile range: −1 to 0; Supplementary Figure S1). Thus, for most photos the algorithm's annotation was rated as acceptable as the ground truth human annotation.
Reliability of BSA estimation
Our calibrated 3D photography allowed the skin surface area captured in each photograph, along with the area marked by the human annotator, to be calculated in square centimetres. To estimate the reliability of the algorithm in predicting affected BSA, for each patient with at least three photos (N=31) we compared the proportion of total photographed skin surface marked as affected by the human to the proportion of skin pixels marked by the algorithm. Bland-Altman analysis revealed algorithm error comparable to the 20% error reported for clinicians [5] (Figure 2f).
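As an illustrative sketch of this comparison (assuming per-photo surface areas in square centimetres and a standard Bland-Altman bias with 95% limits of agreement; all variable names here are hypothetical):

```python
import numpy as np

def affected_bsa_percent(photo_areas_cm2, affected_areas_cm2):
    """Per-patient affected BSA: affected area summed over that patient's
    photos, as a percentage of the total photographed surface area."""
    return 100.0 * np.sum(affected_areas_cm2) / np.sum(photo_areas_cm2)

def bland_altman(human_bsa, algo_bsa):
    """Bias (mean difference) and 95% limits of agreement between paired
    per-patient BSA estimates from the human annotator and the algorithm."""
    diff = np.asarray(algo_bsa, dtype=float) - np.asarray(human_bsa, dtype=float)
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)
    return bias, (bias - half_width, bias + half_width)
```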
CONCLUSIONS
We created an automated algorithm that accurately recognized and segmented cGVHD-affected skin. Despite the relatively small dataset, algorithm predictions were assessed by an independent dermatologist as clinically acceptable or better for most patients. Our results suggest that AI could reliably assess clinically affected cGVHD surface area. As this measurement is central to NIH scoring [6], AI holds promise for facilitating reliable cGVHD staging. However, our training photos did not adequately cover the spectrum of disease appearance, and the algorithm performed significantly worse in under-represented darker skin types, reflecting the limited pool of patients available at recruitment. To advance this approach, recruitment of patients from underrepresented groups must be expanded to improve AI training and address this performance shortfall. Future studies must assess the clinical utility of AI in practice, its ability to track surface area changes over time, and its potential for capturing the early changes that are key to enabling early intervention [7].
Supplementary Material
Data S1. Detailed Methods.
Figure S1. Confusion matrix showing the dermatologist assessment of human and algorithm segmentation contours. Each row represents the instances of human contours given the corresponding score by the dermatologist, and each column represents the algorithm score on the same image. All instances on the main diagonal represent images assigned the same score for both human and algorithm contours. The total number of images assigned each score is shown on the top and right axes for the algorithm and human, respectively.
Figure S2. Segmentation performance of the algorithm on affected images, grouped by disease type. As in Figure 2 of the main article, each dot represents a different image. Patients with only erythema are shown in red (N=5), only sclerosis in yellow (N=4), both erythema and sclerosis in blue (N=7), and neither in grey (vitiligo-type, N=1). (a) Performance on all 179 affected images, split by disease type. (b) Each box represents a patient with three or more affected images, arranged by decreasing median surface area error. There was no statistically significant difference in per-photo performance across patient disease types (Mann-Whitney U test, p > 0.05 for all three disease-type pairs).
Table S1. Race, gender, ethnicity, and Fitzpatrick skin types of study patients.
Table S2. Performance of the algorithm at identifying images affected by cGVHD (with 95% confidence intervals (CI)).
ACKNOWLEDGMENTS
This work was supported by a Career Development Award to Eric Tkaczyk from the United States Department of Veterans Affairs Clinical Sciences R&D Service (IK2 CX001785), a Discovery Research Grant from Vanderbilt University, the Vanderbilt University Medical Center Departments of Medicine and Dermatology, the National Institutes of Health (K12 CA090625, R21 AR074589), and the European Regional Development Fund (1.1.1.2/VIAA/4/20/665). Lee Wheless is funded by grants from the Skin Cancer Foundation and the Dermatology Foundation.
CONFLICT OF INTEREST STATEMENT
The authors state no conflict of interest.
REFERENCES
- [1] Socié G and Ritz J, "Current issues in chronic graft-versus-host disease," Blood, vol. 124, no. 3, pp. 374–384, 2014, doi: 10.1182/blood-2014-01-514752.
- [2] Rodgers CJ, Burge S, Scarisbrick J, and Peniket A, "More than skin deep? Emerging therapies for chronic cutaneous GVHD," Bone Marrow Transplant., vol. 48, no. 3, pp. 323–337, 2013, doi: 10.1038/bmt.2012.96.
- [3] Miklos D et al., "Ibrutinib for chronic graft-versus-host disease after failure of prior therapy," Blood, vol. 130, no. 21, pp. 2243–2250, 2017, doi: 10.1182/blood-2017-07-793786.
- [4] Ronneberger O, Fischer P, and Brox T, "U-Net: Convolutional Networks for Biomedical Image Segmentation," Lect. Notes Comput. Sci., vol. 9351, pp. 234–241, 2015, doi: 10.1007/978-3-319-24574-4_28.
- [5] Mitchell SA et al., "A multicenter pilot evaluation of the National Institutes of Health chronic graft-versus-host disease (cGVHD) therapeutic response measures: Feasibility, interrater reliability, and minimum detectable change," Biol. Blood Marrow Transplant., vol. 17, no. 11, pp. 1619–1629, 2011, doi: 10.1016/j.bbmt.2011.04.002.
- [6] Jagasia MH et al., "National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease: I. The 2014 Diagnosis and Staging Working Group report," Biol. Blood Marrow Transplant., vol. 21, no. 3, pp. 389–401.e1, 2015, doi: 10.1016/j.bbmt.2014.12.001.
- [7] Kitko CL et al., "National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease: IIa. The 2020 Clinical Implementation and Early Diagnosis Working Group Report," Transplant. Cell. Ther., vol. 27, no. 7, pp. 545–557, 2021, doi: 10.1016/j.jtct.2021.03.033.