Abstract
Purpose:
To create an unsupervised cross domain segmentation algorithm for segmenting intraretinal fluid and retinal layers on normal and pathologic macular optical coherence tomography (OCT) images from different manufacturers and camera devices.
Design:
We sought to use Generative Adversarial Networks (GANs) to generalize a segmentation model trained on one OCT device to segment B-scans obtained from a different OCT device manufacturer in a fully unsupervised approach, without labeled data from the latter manufacturer.
Subjects:
A total of 732 OCT B-scans from four different OCT devices (Heidelberg Spectralis, Topcon 1000, Maestro2, and Zeiss Plex Elite 9000).
Methods:
We developed an unsupervised GAN model, GANSeg, to segment seven retinal layers and intraretinal fluid in Topcon 1000 OCT images (domain B) while only having access to labeled data on Heidelberg Spectralis images (domain A). GANSeg was unsupervised with respect to the target domain, as it only had access to 110 labeled Heidelberg OCTs and 556 raw, unlabeled Topcon 1000 OCTs. To validate GANSeg segmentations, three masked graders independently manually segmented 60 OCTs from an external Topcon 1000 test dataset. To test the limits of GANSeg, graders also manually segmented 3 OCTs each from the Zeiss Plex Elite 9000 and Topcon Maestro2. A U-Net was trained on the same labeled Heidelberg images as a baseline. The GANSeg repository with labeled annotations is at https://github.com/uw-biomedical-ml/ganseg.
Main Outcome Measures:
Dice scores comparing segmentation results from GANSeg and the U-Net model to the manually segmented images.
Results:
While GANSeg and U-Net achieved Dice scores comparable to human experts on the labeled Heidelberg test dataset, only GANSeg maintained comparable Dice on the Topcon 1000 domain, with the best performance for the GCL+IPL layer (90%, 95% CI: 68%–96%) and the worst performance for intraretinal fluid (58%, 95% CI: 18%–89%), which was statistically similar to human graders (79%, 95% CI: 43%–94%). GANSeg significantly outperformed the U-Net model. Moreover, GANSeg generalized to B-scans from the swept-source Zeiss Plex Elite 9000 and the Topcon Maestro2, domains it had never encountered before.
Conclusions:
GANSeg enables the transfer of supervised deep learning (DL) algorithms across OCT devices without labeled data, thereby greatly expanding the applicability of DL algorithms.
Keywords: Unsupervised learning, cross domain learning, optical coherence tomography, retina, macula
Precis
Deep learning does not generalize across different optical coherence tomography devices. We generalize a model trained on a single imaging platform to segment retinal layers on B-scans from a different device using Generative Adversarial Networks.
Introduction
The introduction of deep learning has transformed medical image analysis, with meaningful clinical applications such as segmentation of anatomic features, prediction of clinical outcomes, and suggestions of possible treatment approaches.1–3 In the field of ophthalmology, deep learning has shown strong performance in the detection of diabetic retinopathy,4–7 glaucoma,4,8 and age-related macular degeneration.4,9 While these supervised models work by learning representational features from their input data, they also require considerable effort from experts to manually label the training data to serve as the ground truth.10
One major roadblock for deep learning is the problem of domain shift, where models trained on a particular dataset may experience significant performance degradation when applied to slightly different datasets resulting from different hospitals, imaging protocols, and device manufacturers. This challenge of generalizability is exemplified by the first commercial artificial intelligence-based device used to screen patients for diabetic retinopathy from retinal fundus imaging, whose FDA approval was based solely on its use with one particular camera type due to concerns about domain shift-related performance degradation.11,12 To date, most strategies to address this problem require additional labeled data from the new setting, which can be challenging, time-consuming, and expensive to obtain,11 along with additional resources to fine-tune the model on the new labeled data.
Optical coherence tomography (OCT) has become widely used throughout ophthalmology to capture high resolution ocular structures.13 With regard to retinal diseases, the segmentation of retinal layers on OCT can be crucial for early diagnosis and subsequent treatment. Several deep learning frameworks have been successfully applied to automated segmentation of the retinal layers; however, the training data and reported performance of each were limited to a single OCT device manufacturer.14–16 Ideally, automated deep learning models in ophthalmology practice should be able to analyze OCT images from different device manufacturers without any significant loss in performance.
Several methods have been proposed to address this issue. Transfer learning, in which the network is pre-trained on a large general image database and then fine-tuned on a smaller, more task-specific dataset, is one possible approach. However, this solution still requires substantial manual labeling of data, and the model must be re-trained.17 Style transfer is an alternative approach. It can be performed directly on paired images via encoder and decoder networks with feature alignment.18 Alternatively, styles can be transferred between unpaired source and target domains using a Cycle-Consistent GAN (CycleGAN), a type of Generative Adversarial Network (GAN),19 with the effectiveness of the style transfer demonstrated by evaluating a supervised model trained on the source domain on target domain images that have been style-transferred into the source domain.20–22
Separately, domain adaptation combines style transfer and transfer learning. In domain adaptation, style transfer is often accomplished with GANs that generate synthetic versions of the source domain images in the target domain style; the downstream model is then fine-tuned on both the source domain images and the style-transferred source domain images, with the GAN and the downstream model trained end-to-end. This approach is often used in the computer vision literature to domain adapt Cityscapes23 to SYNTHIA24 and vice versa.25,26 Yan et al. used this type of approach to classify cine MRI images, but included feature alignment between the source and target domain images.27
In this study, we sought to expand upon the domain adaptation approach by combining a GAN with a U-Net segmentation algorithm into one model in order to create an unsupervised cross domain segmentation algorithm. The GAN and the segmentation network are learned end-to-end, providing real time feedback and improved generalization performance.
Method
This study was conducted in accordance with the Declaration of Helsinki. The Heidelberg OCT dataset is publicly available from Chiu et al 28. The Topcon 1000 dataset is from the UK Biobank study, for which all participants provided written informed consent. The Zeiss Plex Elite 9000 and Topcon Maestro2 dataset images were obtained as part of the Eye ACT study with institutional review board approval from the Kaiser Permanente Washington Health Research Institute; informed consent was obtained from all participants.
We developed a cross-domain unsupervised learning method that is similar in philosophy to Cycada.25 Our model (GANSeg), as shown in Figure 1b, combines a GAN, specifically a modified version of U-GAT-IT,29 with a supervised model, in this case a U-Net. In contrast to the traditional supervised training paradigm (Figure 1a), which trains a segmenter (U) with respect to the labeled data for A, GANSeg trains the segmenter (U) to be proficient on images of type A, images with A content but B style (A2B), and finally reconstructed images with A content and A style (A2B2A). GANSeg is unsupervised with respect to the target domain B, as it has no labels for B, only labels for the source domain A; this is likely to be the typical practical case of cross-domain adaptation of supervised models. To allow direct comparison of GANSeg to a traditional supervised learning model, the segmenter U of GANSeg was chosen to be a U-Net.30 GANSeg is trained end-to-end with a total loss combining the supervised segmentation loss and the GAN loss:
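The two loss components are described below and in the Supplemental Appendix. A minimal sketch of the total loss, assuming a simple weighted sum of the supervised and GAN terms (the weighting λ is an assumption, not stated in the text), is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{supervised}} + \lambda\,\mathcal{L}_{\text{GAN}}$$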
Figure 1a.
Schematic of traditional supervised deep learning framework, with A being training images, Alabel labeled masks and U a U-Net. 1b. Schematic of GANSeg, where GA2B denotes the generator of style A images into style B, and vice versa for GB2A. U is a single U-Net that segments all versions of images derived from A (A, A2B, and A2B2A) and compares them to Alabel. 1c. Sample 256*256 cropped optical coherence tomography B-scans from Duke Heidelberg (left) and Topcon 1000 (right) highlighting differences between the devices.
GANSeg’s supervised loss (Lsupervised) applies a normalized focal loss,31 instead of the traditional cross entropy loss, to each of the three branches. The GAN portion of GANSeg is illustrated in Supplemental Figure 1 (available at https://www.aaojournal.org), and the GAN loss is described in detail in the Supplemental Appendix. Ablation studies on the choice of the normalized focal loss and on the number of supervised-loss branches are analyzed in the supplemental materials and summarized in Supplemental Table 1. In addition, we compare the performance of GANSeg to state-of-the-art cross-domain models in Supplemental Table 2.
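As a concrete illustration, the following PyTorch sketch shows one way the three-branch supervised term could be computed with a normalized focal loss in the spirit of Ma et al.31; the function names, the γ value, and the equal weighting of the branches are assumptions and are not taken from the GANSeg repository.

```python
import torch
import torch.nn.functional as F

def normalized_focal_loss(logits, target, gamma=2.0, eps=1e-8):
    """Normalized focal loss: the focal term of the true class divided by the
    sum of the focal terms over all classes (per Ma et al. 2020)."""
    log_p = F.log_softmax(logits, dim=1)          # (N, C, H, W)
    p = log_p.exp()
    focal_all = -((1.0 - p) ** gamma) * log_p     # focal term for every class
    numer = focal_all.gather(1, target.unsqueeze(1)).squeeze(1)  # true-class term
    denom = focal_all.sum(dim=1) + eps            # normalizer over all classes
    return (numer / denom).mean()

def supervised_loss(unet, a, a2b, a2b2a, a_label, gamma=2.0):
    """Apply the same segmenter U to the A, A2B, and A2B2A branches and
    compare each prediction against the domain-A labels."""
    return sum(normalized_focal_loss(unet(x), a_label, gamma)
               for x in (a, a2b, a2b2a))
```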
Data
We obtained 110 manually segmented OCT B-scans (Heidelberg Spectralis, Heidelberg Engineering, Heidelberg, Germany) from the volume scans of ten patients from Chiu et al 28 for our source domain A. For each patient, 11 of the 61 B-scans in the volume were manually labeled by two Duke graders for intraretinal fluid (IRF) and 7 retinal layers: 1) anterior aspect of the internal limiting membrane (ILM) to the posterior limit of the retinal nerve fiber layer (RNFL), ILM+RNFL; 2) anterior aspect of the ganglion cell layer (GCL) to the posterior aspect of the inner plexiform layer (IPL), GCL+IPL; 3) inner nuclear layer (INL); 4) outer plexiform layer (OPL); 5) anterior aspect of the outer nuclear layer (ONL) to the anterior aspect of the ellipsoid zone (EZ), ONL-EZ; 6) anterior aspect of the EZ to the posterior aspect of the interdigitation zone (IZ), EZ-IZ; and 7) retinal pigment epithelium (RPE). The 7 retinal layers and IRF were labeled separately, so the labeled segmentation mask for each image was obtained by superimposing the IRF labels on the 7-layer labels. The labels by Duke Grader 1 were used as the ground truth for training and the reference standard for Dice comparisons. These 110 labeled B-scans were split 80-10-10 at the patient level into training, validation, and test sets, yielding 88, 11, and 11 B-scans, respectively. While the Heidelberg B-scans had dimensions of 496*768 px, only the middle 512 pixels were manually labeled, so the effective labeled dimensions were 496*512 px. The images were automatically cropped to 256 pixels in height by centering on the retina using intensity percentiles; the automatically cropped images were then manually checked, and there were no cases of edema that extended beyond 256 pixels. Each cropped B-scan was then split into 2 non-overlapping images of 256*256 px, yielding a total of 176 training, 22 validation, and 22 test images for the source domain (Supplemental Table 3).
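A minimal NumPy sketch of this preprocessing is given below, assuming the retina is located from the rows whose intensity exceeds a high percentile; the exact percentile and centering rule are assumptions, as the paper only states that intensity percentiles were used.

```python
import numpy as np

def crop_retina_patches(bscan, out_h=256, out_w=256, pct=99):
    """Crop a 496x768 Heidelberg B-scan to 256 px in height, centered on the
    retina (bright band), keep the labeled middle 512 columns, and split them
    into two non-overlapping 256x256 patches. The 99th-percentile threshold
    and mean-row centering are illustrative assumptions."""
    h, w = bscan.shape
    bright_rows = np.where((bscan > np.percentile(bscan, pct)).any(axis=1))[0]
    center = int(bright_rows.mean()) if bright_rows.size else h // 2
    top = int(np.clip(center - out_h // 2, 0, h - out_h))
    crop = bscan[top:top + out_h]
    left = (w - 2 * out_w) // 2          # middle 512 columns of the 768-px width
    mid = crop[:, left:left + 2 * out_w]
    return mid[:, :out_w], mid[:, out_w:]
```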
For our target domain B, we extracted Topcon 1000 (3D OCT-1000 Mark II, Topcon, Japan) images from the UK Biobank. The patients in the target domain are distinct from those in the source domain, and the OCT B-scans were not paired. Each Topcon 1000 OCT volume consists of 128 B-scans of dimension 650*512 px, covering an area of 6 mm * 6 mm. One B-scan was randomly selected from slices 1 to 128 for each of the 556 normal UK Biobank patients, and 5 B-scans were randomly selected for the 198 UK Biobank patients with IRF (Supplemental Table 3).
To validate the performance of the unsupervised segmenter, an external test dataset was created by having three retina specialists from Moorfields Eye Hospital manually segment, independently of the other graders and of the GANSeg segmentations, 30 Topcon 1000 OCT scans from normal patients, 25 scans from patients with IRF, and 5 scans with subretinal fluid. Segmentations were done using ITK-SNAP (version 3.8.0). A manual with a standardized protocol for the segmentation tasks was created (AO-B, RY, and YW) to train the Moorfields graders and was available for review during the annotations. An inter-observer comparison of the three graders was performed with Moorfields Grader 1 as the reference standard. A U-Net trained on only Heidelberg images served as the baseline comparison. The segmentations from GANSeg and the U-Net were compared to the graders in terms of the Dice score, Dice = 2TP / (2TP + FP + FN), where TP, FP, and FN are true positive, false positive, and false negative pixels, respectively. In the supplementary materials, we also provide the segmentation metrics in terms of intersection over union (IoU) (Supplemental Appendix).
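For reference, a minimal NumPy sketch of the per-class Dice computation defined above; the handling of empty classes is an illustrative assumption, and the evaluation code actually used is in the GANSeg repository.

```python
import numpy as np

def dice_per_class(pred, truth, n_classes):
    """Per-class Dice = 2*TP / (2*TP + FP + FN) between two integer label
    masks of the same shape."""
    scores = {}
    for c in range(n_classes):
        p, t = pred == c, truth == c
        tp = np.logical_and(p, t).sum()
        fp = np.logical_and(p, ~t).sum()
        fn = np.logical_and(~p, t).sum()
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else float("nan")
    return scores
```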
To understand labeling differences between Duke and Moorfields graders, Moorfields Grader 1, Grader 2 and Grader 3 manually labeled all 11 B-scans in the Heidelberg test set. The inter-institute inter-observer Dice, as well as GANSeg’s and U-Net’s Dice to the ground truth of Duke Grader 1 were computed for these Heidelberg test B-scans (Supplemental Table 4).
We noticed substantial inter-institution differences in the segmentation of intraretinal fluid (IRF) on the test Heidelberg B-scans between the two Duke graders and our three clinical experts, but smaller intra-institutional differences in IRF grading (Supplemental Table 4). Therefore, we had the Moorfields graders re-segment IRF in the Heidelberg training and validation images to create new training and validation datasets. The GANSeg and U-Net models retrained on these new datasets are denoted GANSeg M and U-Net M, respectively, in the results and discussion, while the GANSeg and U-Net models trained on the Duke Grader 1 segmentation labels are denoted GANSeg D and U-Net D, respectively.
Finally, we tested GANSeg M and U-Net M, without any training or tuning, on three B-scans each of normal patients from the Zeiss Plex Elite 9000 (Zeiss, Dublin, CA, USA) and the Topcon Maestro2 (Topcon, Japan). Their segmentation performance was measured against the manual labels of two retina specialists from Moorfields. Since the resolutions of the Zeiss (1536*1536 px) and Topcon Maestro2 (885*1024 px) B-scans are significantly larger than GANSeg’s training patches (256*256 px), we split the Zeiss and Maestro2 B-scans into 256*256 px patches, predicted on the patches, and combined the patch predictions to obtain predictions at native resolution.
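A minimal sketch of this tile-and-stitch inference is given below, assuming non-overlapping patches and edge padding; the paper does not specify how borders or patch overlap were handled.

```python
import numpy as np

def segment_full_bscan(predict_patch, bscan, patch=256):
    """Tile a large B-scan into non-overlapping patch x patch blocks, run the
    segmenter on each block, and stitch the label maps back together at the
    native resolution."""
    h, w = bscan.shape
    pad_h, pad_w = (-h) % patch, (-w) % patch
    padded = np.pad(bscan, ((0, pad_h), (0, pad_w)), mode="edge")
    labels = np.zeros(padded.shape, dtype=np.int64)
    for y in range(0, padded.shape[0], patch):
        for x in range(0, padded.shape[1], patch):
            tile = padded[y:y + patch, x:x + patch]
            labels[y:y + patch, x:x + patch] = predict_patch(tile)  # 256x256 label map
    return labels[:h, :w]
```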
Results
Table 1 shows the median Dice for each layer against the reference standard grader for the 11 Heidelberg and the 55 Topcon 1000 B-scans. Figure 2 provides a visual representation of the full distribution of the layer Dice against the reference standard. From the left half of Table 1 and Figure 2, the deep learning methods GANSeg M and U-Net M perform as well as human graders in terms of Dice for all classes on the held-out Heidelberg test B-scans. For most layers, Dice scores ranged from 70% to 100%. For IRF, GANSeg M and U-Net M achieved median Dice scores of 44% and 40%, comparable to the median inter-observer Dice of 42%, 30%, 51%, and 53% for Duke G1, Duke G2, Moorfields G2, and Moorfields G3, respectively. The 95% CIs of the Dice scores overlap for the layers and IRF. The inter-observer Dice for IRF is much lower than that of the other classes, suggesting that IRF is the most difficult class for human experts. Figure 3a exemplifies the main differences in IRF segmentation on the Heidelberg test B-scans. Duke Grader 1 uses thicker brush strokes in his/her labeling and also labels more IRF than Moorfields Grader 1. GANSeg M looks qualitatively similar to Moorfields Grader 1 in terms of its IRF predictions, while U-Net M overcalls IRF compared to Moorfields Grader 1.
Table 1.
Median Dice by segmentation layer and the corresponding 95% confidence intervals for 11 held-out Heidelberg B-scans and 55 Topcon 1000 B-scans against Moorfields Grader 1. Inter-observer Dice are bounded in blue. GANSeg M and U-Net M overlap with human experts in terms of layer Dice for Heidelberg. However, only GANSeg M generalizes to Topcon 1000 B-scans and it significantly outperforms U-Net M on Topcon 1000.
Device | Heidelberg | | | | | | Topcon 1000 | | |
---|---|---|---|---|---|---|---|---|---|---
Reference Standard | Moorfields G1 | | | | | | Moorfields G1 | | |
Layer | Duke G1 | Duke G2 | Moorfields G2 | Moorfields G3 | GANSeg M | U-Net M | Moorfields G2 | Moorfields G3 | GANSeg M | U-Net M
BG | 99% (99%, 99%) | 99% (99%, 99%) | 99% (99%, 100%) | 99% (98%, 99%) | 99% (99%, 99%) | 99% (99%, 100%) | 100% (98%, 100%) | 100% (99%, 100%) | 99% (88%, 100%) | 47% (35%, 54%) |
ILM+RNFL | 84% (80%, 89%) | 85% (72%, 90%) | 86% (83%, 92%) | 85% (76%, 90%) | 83% (73%, 87%) | 87% (79%, 92%) | 88% (70%, 93%) | 84% (69%, 92%) | 80% (64%, 91%) | 38% (13%, 47%) |
GCL+IPL | 90% (84%, 93%) | 87% (83%, 92%) | 89% (85%, 93%) | 90% (80%, 93%) | 89% (83%, 92%) | 87% (80%, 90%) | 93% (87%, 96%) | 92% (79%, 94%) | 90% (68%, 96%) | 60% (2%, 80%) |
INL | 76% (62%, 85%) | 73% (60%, 78%) | 75% (70%, 81%) | 72% (49%, 80%) | 77% (59%, 85%) | 75% (59%, 83%) | 82% (65%, 88%) | 76% (62%, 86%) | 80% (48%, 90%) | 0% (0%, 1%) |
OPL | 74% (66%, 82%) | 68% (57%, 76%) | 75% (67%, 80%) | 71% (61%, 77%) | 73% (66%, 85%) | 73% (61%, 78%) | 79% (63%, 86%) | 76% (59%, 84%) | 74% (45%, 86%) | 0% (0%, 5%) |
ONL+EZ | 84% (72%, 93%) | 81% (70%, 93%) | 91% (85%, 95%) | 91% (86%, 93%) | 91% (85%, 95%) | 89% (82%, 93%) | 94% (79%, 97%) | 93% (85%, 97%) | 88% (57%, 94%) | 0% (0%, 5%) |
EZ+IZ | 90% (88%, 93%) | 86% (82%, 91%) | 90% (85%, 93%) | 90% (89%, 92%) | 92% (90%, 93%) | 92% (90%, 93%) | 89% (80%, 94%) | 92% (83%, 94%) | 86% (56%, 92%) | 3% (0%, 14%) |
RPE | 85% (83%, 90%) | 79% (74%, 84%) | 86% (79%, 88%) | 83% (79%, 86%) | 87% (83%, 91%) | 84% (77%, 88%) | 73% (32%, 83%) | 76% (55%, 84%) | 81% (62%, 87%) | 0% (0%, 3%) |
IRF | 42% (28%, 60%) | 30% (5%, 52%) | 51% (13%, 70%) | 53% (39%, 78%) | 44% (28%, 72%) | 40% (2%, 54%) | 79% (43%, 94%) | 81% (54%, 95%) | 58% (18%, 89%) | 0% (0%, 5%) |
Below RPE | 99% (99%, 100%) | 99% (99%, 100%) | 100% (99%, 100%) | 100% (99%, 100%) | 100% (99%, 100%) | 99% (99%, 99%) | 100% (100%, 100%) | 100% (100%, 100%) | 99% (97%, 100%) | 5% (2%, 26%) |
BG, background; Below RPE, background below RPE; GCL, ganglion cell layer; EZ, ellipsoid zone; ILM, inner limiting membrane; INL, inner nuclear layer; IPL, inner plexiform layer; IZ, interdigitation zone; OPL, outer plexiform layer; RNFL, retinal nerve fiber layer; IRF, intraretinal fluid; RPE, retinal pigment epithelium.
Figure 2.
Dice by device for each layer compared against Moorfields Grader 1. The layer Dice for Heidelberg overlap, demonstrating that the deep learning methods GANSeg M and U-Net M are comparable to human expert segmentations. However, only GANSeg M retains its Dice performance on Topcon 1000, while U-Net M performs significantly worse.
Figure 3a.
Three example segmentations from the Heidelberg test set. The columns are respectively the raw B-scan, segmentation labels by Duke Grader 1, labels by Moorfields Grader 1, GANSeg M and U-Net M predictions. Figure 3b. Three example segmentations from the Topcon 1000 test set. The columns are respectively the raw B-scan, labels by Moorfields Grader 1, GANSeg M and U-Net M predictions.
The right half of Table 1 and Figure 2 show that GANSeg M maintains its Dice performance on Topcon 1000 B-scans, but U-Net M does not. All the Dice scores of GANSeg M on Topcon 1000 are within the range of the Moorfields inter-observer Dice, as shown in Figure 2. For IRF, GANSeg M has a median Dice of 58%, which is lower than the median Moorfields inter-observer Dice of 79% and 81%, but the 95% CI of GANSeg M’s Dice (18%–89%) overlaps with the inter-observer CIs of 43%–94% and 54%–95%.
Figure 3b shows three sample Topcon 1000 test B-scans as rows. GANSeg M appears qualitatively similar to Moorfields Grader 1. In comparison, U-Net M is confused by the Topcon 1000 B-scans and fails to segment the layers or the IRF. GANSeg M did not cleanly segment the epiretinal membrane in Figure 3b; however, no epiretinal membrane examples were present in Duke’s labeled Heidelberg dataset.
Table 2 and Figure 4 show the performance of GANSeg M in terms of Dice on Zeiss Plex Elite 9000 and Topcon Maestro2 B-scans. While GANSeg M retains its performance, U-Net M does not. Example segmentations for Zeiss and Topcon Maestro2 are provided in Figures 5 and 6.
Table 2.
Median Dice by segmentation layer and the corresponding 95% confidence intervals against Moorfields Grader 1 for three Zeiss Plex Elite 9000 and three Topcon Maestro2 B-scans. GANSeg generalizes to Zeiss and Topcon Maestro2, and significantly outperforms U-Net on them.
Device | Zeiss | | | Topcon Maestro2 | |
---|---|---|---|---|---|---
Gold Standard | Moorfields G1 | | | Moorfields G1 | |
Layer | Moorfields G2 | GANSeg M | U-Net M | Moorfields G2 | GANSeg M | U-Net M
BG | 100% (100%, 100%) | 100% (94%, 100%) | 100% (96%, 100%) | 100% (100%, 100%) | 99% (99%, 99%) | 79% (63%, 82%) |
ILM+RNFL | 87% (86%, 87%) | 73% (70%, 74%) | 73% (71%, 81%) | 87% (79%, 88%) | 75% (73%, 76%) | 78% (65%, 82%) |
GCL+IPL | 92% (91%, 94%) | 87% (82%, 88%) | 88% (81%, 89%) | 93% (5%, 95%) | 93% (91%, 93%) | 83% (79%, 83%) |
INL | 81% (80%, 83%) | 78% (77%, 82%) | 51% (20%, 61%) | 83% (80%, 87%) | 81% (79%, 84%) | 61% (44%, 61%) |
OPL | 80% (75%, 81%) | 78% (74%, 82%) | 53% (48%, 66%) | 71% (4%, 76%) | 76% (72%, 77%) | 62% (61%, 63%) |
ONL+EZ | 82% (79%, 93%) | 93% (91%, 93%) | 73% (63%, 75%) | 93% (92%, 93%) | 88% (87%, 92%) | 64% (41%, 68%) |
EZ+IZ | 74% (72%, 88%) | 86% (84%, 88%) | 77% (71%, 84%) | 87% (83%, 89%) | 88% (88%, 89%) | 66% (49%, 71%) |
RPE | 67% (66%, 72%) | 80% (68%, 86%) | 70% (63%, 78%) | 74% (59%, 77%) | 81% (80%, 82%) | 50% (45%, 60%) |
Below RPE | 100% (98%, 100%) | 96% (96%, 100%) | 98% (95%, 100%) | 100% (100%, 100%) | 100% (100%, 100%) | 90% (82%, 93%) |
BG, background; Below RPE, background below RPE; GCL, ganglion cell layer; EZ, ellipsoid zone; ILM, inner limiting membrane; INL, inner nuclear layer; IPL, inner plexiform layer; IZ, interdigitation zone; OPL, outer plexiform layer; RNFL, retinal nerve fiber layer; RPE, retinal pigment epithelium.
Figure 4.
Dice by device for each retinal layer using Moorfields Grader 1 as the reference standard. GANSeg M generalizes to Zeiss Plex Elite 9000 and Topcon Maestro2, and significantly outperforms U-Net M.
Figure 5.
Three example Zeiss Plex Elite 9000 segmentations. The columns are respectively the raw B-scan, labels by Moorfields Grader 1, GANSeg M predicted segmentation and U-Net M predictions. The predictions are post-processed so that anything below the predicted RPE layer is considered background. The raw predictions are given in Supplemental Figure 2.
Figure 6.
Three example Topcon Maestro2 segmentations. The columns are respectively the raw B-scan, labels by Moorfields Grader 1, GANSeg M predicted segmentation and U-Net M predictions.
Figure 5 shows 3 sample Zeiss images as rows, with the columns being respectively: i) the raw image, ii) Moorfields Grader 1’s labels, iii) GANSeg M’s predicted mask with post-processing, and iv) U-Net M’s predicted mask with post-processing. The predictions are post-processed so that anything below the predicted RPE layer is considered background. The raw predictions are given in Supplemental Figure 2 and show that GANSeg M can be confused by the swept-source Zeiss B-scans, which it had never encountered before, predicting IRF below the RPE; this is an even bigger problem for U-Net M. For transparency, the median Dice scores and 95% CIs for GANSeg M and U-Net M on Zeiss with post-processing are given in Supplemental Table 3 and Supplemental Figure 3. GANSeg M still significantly outperforms U-Net M in terms of Dice.
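A minimal sketch of this below-RPE post-processing step; the class indices are placeholders and the column-wise rule is an assumption based on the description above.

```python
import numpy as np

def mask_below_rpe(pred, rpe_class, below_rpe_class):
    """For each column (A-scan), relabel every pixel below the lowest pixel
    predicted as RPE to the below-RPE background class."""
    out = pred.copy()
    for col in range(pred.shape[1]):
        rpe_rows = np.where(pred[:, col] == rpe_class)[0]
        if rpe_rows.size:
            out[rpe_rows.max() + 1:, col] = below_rpe_class
    return out
```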
Figure 6 shows 3 sample Topcon Maestro2 images as rows, with the columns being respectively: i) the raw image, ii) Moorfields Grader 1’s labels, iii) GANSeg M’s predicted mask, and iv) U-Net M’s predicted mask. No post-processing was performed on the Topcon Maestro2 predictions. U-Net M calls IRF in these B-scans of normal patients. In addition, the layer boundaries of U-Net M show more variability than those of GANSeg M, especially in the second and third rows.
Discussion
To our knowledge, this is the first implementation of a cross domain unpaired and unsupervised deep learning model trained end-to-end applied to macular OCT images across different device manufacturers. Our approach is novel in two ways. First, GANSeg is completely unsupervised with respect to the target domain, Topcon 1000, as it did not have any labeled Topcon 1000 data. GANSeg only had access to raw, unlabeled Topcon 1000 B-scans. These raw Topcon 1000 B-scans helped the GAN component of GANSeg learn the Topcon 1000 style, so that it could be applied to the Heidelberg scans. In this way, the GAN component augments the dataset by pairing Heidelberg labels with Heidelberg scans in the style of Topcon 1000, thereby allowing the U segmenter to perform well on both Heidelberg and Topcon 1000 styles.
Second, the GANSeg model innovatively combines a domain transfer GAN component directly with a segmentation network to perform IRF and 7-layer retinal segmentation. All the GANSeg networks were trained end-to-end. Specifically, we explicitly enforce that GANSeg’s GAN style generators must allow its U segmenter to segment the layers of the Heidelberg scans in the style of Topcon 1000 in the same anatomical positions by propagating the loss end-to-end.
The Dice results validate our approach. While both the U-Net and GANSeg achieved Dice performance comparable to human experts on the labeled Heidelberg test dataset, only GANSeg was able to effectively segment Topcon 1000 B-scans, whereas the U-Net could not. GANSeg learned to generalize to Topcon 1000 OCTs by making its U segmenter robust to Heidelberg images (A), GAN-generated Heidelberg images in the style of Topcon 1000 (A2B), and GAN-reconstructed Heidelberg images (A2B2A).
While GANSeg was expected to be robust to Topcon 1000 B-scans, our findings significantly add to the literature by showing that GANSeg was able to generalize to B-scans from the Zeiss Plex Elite 9000 and Topcon Maestro2 without the need for any additional training or fine-tuning. Furthermore, the Zeiss Plex Elite 9000 is a swept-source OCT (SS-OCT) device, a variation of Fourier-domain OCT with faster scanning speeds that provides improved visualization of structures beneath the RPE,13,32 whereas the Heidelberg and Topcon 1000 are spectral-domain OCT (SD-OCT) devices, a fact that further stresses the relevance of our study findings. GANSeg did not have any prior exposure to OCT images from either device, yet it was still able to achieve Dice scores comparable to those of the human graders and far exceeding the performance of the traditional U-Net (Table 2).
In the current machine learning landscape, the issue of model generalizability is a major hindrance for model deployment into clinical practice.33 Significant amounts of time, manual labeling, and computational resources are required to clean and compile medical images into sufficiently large and diverse datasets. Furthermore, the lead time required to construct reliable machine learning pipelines for specific manufacturer domains would not be sustainable, as the time required for data collection, model development, and validation may exceed the time it takes for newer devices to arrive on the market.
While transfer learning can alleviate the relabeling problem, it would still require substantial additional manual labeling and re-training, and would also not apply to newer devices with increasingly larger image resolutions. In contrast, GANSeg does not require any labeled data from new devices, as it is completely unsupervised. Therefore, our work, as a successful demonstration of unsupervised cross domain learning, is a significant step that would reduce much of the aforementioned challenges in supervised learning approaches.
The limitations of GANSeg include a modest training set of 88 Heidelberg B-scans, which is unlikely to capture the full spectrum of OCT diversity and anatomical variation. For example, when tasked with segmenting Topcon 1000 B-scans with subretinal fluid, GANSeg can mistake subretinal fluid for intraretinal fluid (Supplemental Figure 4a) or even fail to predict any fluid (Supplemental Figure 4b). Furthermore, GANSeg layer segmentation breaks down on Topcon Maestro2 around the optic nerve, which it had never previously encountered, as shown in Supplemental Figure 4c. While the training set of 88 Heidelberg B-scans is small, it is comparable to the datasets used in several previous studies that relied on small numbers of annotations and mostly normal data.34–37 For B-scans that are not too dissimilar to its training corpus, unlike the failure cases highlighted in Supplemental Figure 4, GANSeg performs comparably to human experts.
In conclusion, our unsupervised cross domain segmentation algorithm GANSeg achieved Dice comparable to human graders on all tested OCT devices: i) Heidelberg Spectralis, ii) Topcon 1000, iii) Topcon Maestro2, and iv) Zeiss Plex Elite 9000. Our results suggest that supervised models can be transferred across imaging domains, thus removing traditional manufacturer- or camera version-derived imaging limitations and greatly increasing the generalizability of deep learning algorithms for supervised tasks such as classification and segmentation.
Supplementary Material
Financial Support:
NIH/NIA R01AG060942 (Cecilia S. Lee); NIH/NEI K23EY029246 (Aaron Y. Lee), NIA/NIH U19AG066567 (Cecilia S. Lee, Aaron Y. Lee), Mexican Council of Science and Technology grant #2018-000009-01EXTF-00573 (AO-B), Research to Prevent Blindness Career Development Award (Aaron Y. Lee); Latham Vision Innovation Award (Cecilia S. Lee, Aaron Y. Lee), and an unrestricted grant from Research to Prevent Blindness (Cecilia S. Lee, Aaron Y. Lee). The sponsor or funding organization had no role in the design or conduct of this research.
Footnotes
Conflict of Interest: Aaron Lee reports support from the US Food and Drug Administration, grants from Santen, Regeneron, Carl Zeiss Meditec, and Novartis, and personal fees from Genentech, Roche, and Johnson and Johnson, outside of the submitted work. This article does not reflect the opinions of the Food and Drug Administration. The remaining authors have no financial disclosures to report.
Meeting Presentation: Poster presentation at the Association for Research in Vision and Ophthalmology, Annual Meeting Denver, CO, May 4, 2022
This article contains supplemental material. The following should appear as supplemental material: Supplemental Figures 1–11, Supplemental Tables 1–9, and the Supplemental Appendix.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
References
- 1.Ker J, Wang L, Rao J, Lim T. Deep Learning Applications in Medical Image Analysis. IEEE Access. 2018;6:9375–9389. [Google Scholar]
- 2.Ting DSW, Pasquale LR, Peng L, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103(2):167–175. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Esteva A, Chou K, Yeung S, et al. Deep learning-enabled medical computer vision. NPJ Digit Med. 2021;4(1):5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Ting DSW, Cheung CYL, Lim G, et al. Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases Using Retinal Images From Multiethnic Populations With Diabetes. JAMA. 2017;318(22):2211–2223. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Gulshan V, Peng L, Coram M, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA. 2016;316(22):2402–2410. [DOI] [PubMed] [Google Scholar]
- 6.Lee CS, Tyring AJ, Deruyter NP, Wu Y, Rokem A, Lee AY. Deep-learning based, automated segmentation of macular edema in optical coherence tomography. Biomed Opt Express. 2017;8(7):3440–3448. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Abràmoff MD, Lou Y, Erginay A, et al. Improved Automated Detection of Diabetic Retinopathy on a Publicly Available Dataset Through Integration of Deep Learning. Invest Ophthalmol Vis Sci. 2016;57(13):5200–5206. [DOI] [PubMed] [Google Scholar]
- 8.Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a Deep Learning System for Detecting Glaucomatous Optic Neuropathy Based on Color Fundus Photographs. Ophthalmology. 2018;125(8):1199–1206. [DOI] [PubMed] [Google Scholar]
- 9.Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated Grading of Age-Related Macular Degeneration From Color Fundus Images Using Deep Convolutional Neural Networks. JAMA Ophthalmol. 2017;135(11):1170–1176. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Wang SY, Pershing S, Lee AY, AAO Taskforce on AI and AAO Medical Information Technology Committee. Big data requirements for artificial intelligence. Curr Opin Ophthalmol. 2020;31(5):318–323. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Wang F, Casalino LP, Khullar D. Deep Learning in Medicine-Promise, Progress, and Challenges. JAMA Intern Med. 2019;179(3):293–294. [DOI] [PubMed] [Google Scholar]
- 12.Indications for Use - US - Digital Diagnostics. Published May 24, 2021. Accessed November 20, 2021. https://www.digitaldiagnostics.com/products/eye-disease/indications-for-use-us/
- 13.Gabriele ML, Wollstein G, Ishikawa H, et al. Optical coherence tomography: history, current status, and laboratory work. Invest Ophthalmol Vis Sci. 2011;52(5):2425–2436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Li Q, Li S, He Z, et al. DeepRetina: Layer Segmentation of Retina in OCT Images Using Deep Learning. Transl Vis Sci Technol 2020;9(2):61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Borkovkina S, Camino A, Janpongsri W, Sarunic MV, Jian Y. Real-time retinal layer segmentation of OCT volumes with GPU accelerated inferencing using a compressed, low-latency neural network. Biomed Opt Express. 2020;11(7):3968–3984. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Sousa JA, Paiva A, Silva A, et al. Automatic segmentation of retinal layers in OCT images with intermediate age-related macular degeneration using U-Net and DexiNed. PLoS One. 2021;16(5):e0251591. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Karimi D, Warfield SK, Gholipour A. Transfer learning in medical image segmentation: New insights from analysis of the dynamics of model parameters and learned representations. Artif Intell Med. 2021;116:102078. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Liu Z, Yang X, Gao R, et al. Remove Appearance Shift for Ultrasound Image Segmentation via Fast and Universal Style Transfer. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI).; 2020:1824–1828. [Google Scholar]
- 19.Goodfellow IJ, Pouget-Abadie J, Mirza M, et al. Generative Adversarial Nets. Advances in Neural Information Processing Systems 27. Published online 2014:2672–2680. [Google Scholar]
- 20.Romo-Bucheli D, Seeböck P, Orlando JI, et al. Reducing image variability across OCT devices with unsupervised unpaired learning for improved segmentation of retina. Biomed Opt Express. 2020;11(1):346–363. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Viedma IA, Alonso-Caneiro D, Read SA, Collins MJ. OCT retinal image-to-image translation: Analysing the use of CycleGAN to improve retinal boundary semantic segmentation. 2021 Digital Image Computing: Techniques and Applications (DICTA). Published online 2021. doi: 10.1109/dicta52665.2021.9647266 [DOI] [Google Scholar]
- 22.Wollmann T, Eijkman CS, Rohr K. Adversarial domain adaptation to improve automatic breast cancer grading in lymph nodes. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018).; 2018:582–585. [Google Scholar]
- 23.Cordts M, Omran M, Ramos S, et al. The Cityscapes Dataset for Semantic Urban Scene Understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. http://openaccess.thecvf.com/content_cvpr_2016/html/Cordts_The_Cityscapes_Dataset_CVPR_2016_paper.html [Google Scholar]
- 24.Ros G, Sellart L, Materzynska J, et al. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Ros_The_SYNTHIA_Dataset_CVPR_2016_paper.html [Google Scholar]
- 25.Hoffman J, Tzeng E, Park T, et al. Cycada: Cycle-consistent adversarial domain adaptation. In: International Conference on Machine Learning. PMLR; 2018:1989–1998. [Google Scholar]
- 26.Chen M, Xue H, Cai D. Domain adaptation for semantic segmentation with maximum squares loss. In: Proceedings of the IEEE/CVF International Conference on Computer Vision.; 2019:2090–2099. [Google Scholar]
- 27.Yan W, Wang Y, Gu S, et al. The Domain Shift Problem of Medical Image Segmentation and Vendor-Adaptation by Unet-GAN. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Springer International Publishing; 2019:623–631. [Google Scholar]
- 28.Chiu SJ, Allingham MJ, Mettu PS, Cousins SW, Izatt JA, Farsiu S. Kernel regression based segmentation of optical coherence tomography images with diabetic macular edema. Biomed Opt Express. 2015;6(4):1172–1194. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Kim J, Kim M, Kang H, Lee K. U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation. arXiv [csCV]. Published online July 25, 2019. http://arxiv.org/abs/1907.10830 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2015. Springer International Publishing; 2015:234–241. [Google Scholar]
- 31.Ma X, Huang H, Wang Y, Romano S, Erfani S, Bailey J. Normalized Loss Functions for Deep Learning with Noisy Labels. In: Iii HD, Singh A, eds. Proceedings of the 37th International Conference on Machine Learning. Vol 119. Proceedings of Machine Learning Research. PMLR; 2020:6543–6553. [Google Scholar]
- 32.Miller AR, Roisman L, Zhang Q, et al. Comparison Between Spectral-Domain and Swept-Source Optical Coherence Tomography Angiographic Imaging of Choroidal Neovascularization. Invest Ophthalmol Vis Sci. 2017;58(3):1499–1505. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Yasaka K, Abe O. Deep learning and artificial intelligence in radiology: Current applications and future directions. PLoS Med. 2018;15(11):e1002707. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Pazos M, Dyrda AA, Biarnés M, et al. Diagnostic Accuracy of Spectralis SD OCT Automated Macular Layers Segmentation to Discriminate Normal from Early Glaucomatous Eyes. Ophthalmology. 2017;124(8):1218–1228. [DOI] [PubMed] [Google Scholar]
- 35.Mehta N, Lee CS, Mendonça LSM, et al. Model-to-Data Approach for Deep Learning in Optical Coherence Tomography Intraretinal Fluid Segmentation. JAMA Ophthalmol. 2020;138(10):1017–1024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Liefers B, Colijn JM, González-Gonzalo C, et al. A Deep Learning Model for Segmentation of Geographic Atrophy to Study Its Long-Term Natural History. Ophthalmology. 2020;127(8):1086–1096. [DOI] [PubMed] [Google Scholar]
- 37.Van Brummen A, Owen JP, Spaide T, et al. PeriorbitAI: Artificial Intelligence Automation of Eyelid and Periorbital Measurements. Am J Ophthalmol. 2021;230:285–296. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Chen Z, Duan Y, Wang W, et al. Vision Transformer Adapter for Dense Predictions. arXiv [csCV]. Published online May 17, 2022. http://arxiv.org/abs/2205.08534 [Google Scholar]