Author manuscript; available in PMC 2023 Apr 5. Published in final edited form as: Proc SPIE Int Soc Opt Eng 12567, 125670D (6 March 2023). doi: 10.1117/12.2669772

Self-Supervised Equivariant Regularization Reconciles Multiple Instance Learning: Joint Referable Diabetic Retinopathy Classification and Lesion Segmentation

Wenhui Zhu^a, Peijie Qiu^b, Natasha Lepore^c, Oana M. Dumitrascu^d, Yalin Wang^a,*
PMCID: PMC10074924  NIHMSID: NIHMS1842111  PMID: 37026019

Abstract

Lesion appearance is a crucial clue for medical providers to distinguish referable diabetic retinopathy (rDR) from non-referable DR. Most existing large-scale DR datasets contain only image-level labels rather than pixel-based annotations. This motivates us to develop algorithms that classify rDR and segment lesions from image-level labels alone. This paper leverages self-supervised equivariant learning and attention-based multiple instance learning (MIL) to tackle this problem. MIL is an effective strategy to differentiate positive and negative instances, helping us discard background regions (negative instances) while localizing lesion regions (positive ones). However, MIL only provides coarse lesion localization and cannot distinguish lesions that span adjacent patches. Conversely, a self-supervised equivariant attention mechanism (SEAM) generates a segmentation-level class activation map (CAM) that can guide patch extraction of lesions more accurately. Our work integrates both methods to improve rDR classification accuracy. We conduct extensive validation experiments on the Eyepacs dataset, achieving an area under the receiver operating characteristic curve (AU ROC) of 0.958, outperforming current state-of-the-art algorithms.

Keywords: Weakly-Supervised Lesion Segmentation, Multiple Instance Learning, Self-Supervised Method, Diabetic Retinopathy, Classification

1. INTRODUCTION

The International Clinical Diabetic Retinopathy Disease Severity Scale grades the severity of diabetic retinopathy (DR) according to the characteristic lesion areas, separating cases into the following classes: no retinopathy, mild non-proliferative DR (NPDR), moderate NPDR, severe NPDR, and proliferative DR (PDR).1 No retinopathy or mild NPDR, which presents no obvious pathological features, is defined as non-referable DR; moderate NPDR or worse is defined as referable DR (rDR). Delayed diagnosis of rDR may cause severe vision loss and is likely to result in blindness. An automatic DR diagnosis framework is therefore crucial in clinical practice to help patients receive timely treatment, lowering the risk of severe vision damage.

The critical difference between referable and non-referable DR is the appearance of various lesions, such as retinal haemorrhages or exudates, which are the main biomarkers that help ophthalmologists differentiate the two. The semantic information of lesions, e.g., boundary and intensity, helps localize and categorize different lesions and hence facilitates rDR grading. However, most existing large-scale DR datasets lack information on lesion regions, e.g., pixel-level annotations. Manually annotating lesion segmentation masks is laborious and requires medical expertise. To mimic how ophthalmologists diagnose rDR and to take advantage of the large amount of weakly annotated (i.e., image-level labeled) retinal images, we propose self-supervised equivariant regularization combined with multiple instance learning (MIL) to classify rDR and segment lesion regions.

In recent years, there has been significant progress in deep learning-based rDR diagnosis. Most work to date has focused on rDR classification in supervised frameworks, taking advantage of different architectures. Some studies2,3 explored the effects of different pre-processing techniques, including image pyramids, enhancement selection, and improved training strategies, on neural network performance. For instance, Huang et al.4 demonstrated that ResNet505 achieved better classification performance after optimizing the pre-processing and training strategies. Attention-based multiple instance learning was used in6 to localize the patches within a retinal image that contribute most to the final prediction. However, that method only provided patch-level lesion localization, and each lesion is likely to cover only a small portion of a patch. It is also challenging to differentiate positive and negative instances when a lesion region is split across adjacent patches. Sadafi et al.7 mitigated this problem by using an R-CNN to filter out potential instances before feeding them into an attention-based MIL. Nevertheless, this method cannot provide segmentation-level localization of lesions. Several weakly supervised segmentation methods8-10 leveraged class activation mapping (CAM)11 to produce a segmentation mask. In addition, Wang et al.12 proposed a self-supervised equivariant attention mechanism (SEAM) to narrow the gap between classification and segmentation via equivariant regularization and self-attention-based CAM refinement. SEAM, however, is a semantic segmentation framework whose main task is refining lesion maps; its segmentation results still need to be coupled to the classification task to improve accuracy.

In this paper, we propose a novel method based on self-supervised equivariant regularization and attention-based multiple instance learning to jointly classify rDR and produce lesion segmentations. As shown in Fig. 1, the refined CAM from the SEAM mechanism localizes lesion regions accurately, which assists the MIL in selecting accurate positive and negative patches. Simultaneously, the MIL fine-tunes the SEAM module to produce lesion-localized CAMs. The three main contributions of our work are: (1) a self-supervised method that provides an implicit regularization to guide the MIL to select accurate positive and negative instances; (2) a framework to obtain segmentation-level CAMs from weakly-supervised image-level labels; (3) a MIL prediction module, based on the accurate lesion selection provided by SEAM, that considerably improves rDR classification performance. To the best of our knowledge, this is the first work to classify rDR while simultaneously generating lesion segmentations from only image-level labels.

Figure 1.

The generated class activation maps (CAM) from the training monitoring process: (a) CAM from the original image branch in the Siamese network, (b) refined CAM from the original image branch, (c) CAM from the affine-transformed branch in the Siamese network, (d) refined CAM from the affine-transformed branch. There is a difference between the CAM of the original image and that of the affine-transformed image. The refined CAM produces a better localization boundary of lesion regions compared to the original CAM.

2. METHODS

The proposed method consists of two main modules: a SEAM module and a modified attention-based MIL module. The SEAM mechanism localizes the lesion areas and produces a fine-grained CAM as a byproduct that approximates the ground-truth lesion segmentation. The equivariant regularization narrows the gap between the classification and segmentation tasks by moving the CAM toward the ground-truth segmentation of the sought-after object. The MIL module further localizes pathologically important regions for DR grading within an image by representing the image as a bag of features. Our modified attention-based MIL module forms bags of instances directly in the feature space instead of feeding every single patch into the network, saving considerable computation compared to conventional patch-based MIL. The rationale is that differentiating positive and negative instances in the feature space is more effective than in patch-based MIL, because the equivariant regularization helps the feature space encode meaningful semantic information about lesions. Our proposed method is implemented as a Siamese network13 with ResNet3814 as the backbone; the network architecture is shown in Fig. 2.

Figure 2.

The network architecture of our proposed method, implemented as a Siamese network. $CAM_{orig}$ and $CAM_{rm}$ denote the original and refined CAM, respectively, from the original-image branch; $CAM_{AF}$ and $CAM_{rm}^{AF}$ denote the CAM and refined CAM, respectively, from the affine-transformed branch.

2.1. Self-Supervised Equivariant Attention Module

Equivariant Regularization.

A fundamental assumption of DL-based segmentation is that feeding an affine-transformed (e.g., rescaled, rotated, or flipped) input image to a segmentation network yields approximately the same affine transformation of the segmentation output. In contrast, the pooling layers in classification networks destroy this property.8 Penalizing this inequivariance pushes the CAM of a target object toward its corresponding segmentation. The equivariant regularization (ER) loss is given by

$$\mathcal{L}_{ER} = \left\| AF(CAM_{orig}) - CAM_{AF} \right\|_1, \tag{1}$$

where $CAM_{orig}$ and $CAM_{AF}$ denote the CAM of the original image and of the affine-transformed image, respectively, and $AF(\cdot)$ denotes the affine transformation applied to the original image. Using the L1 norm in the ER loss encourages sparsity of the CAM, because lesions typically account for only a small portion of a retinal image.
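For concreteness, a minimal PyTorch sketch of this loss follows, assuming the affine transformation is the rescaling used later in our experiments; the function name and the mean-reduced L1 term are illustrative choices, not taken from a released implementation.

```python
import torch
import torch.nn.functional as F

def equivariant_loss(cam_orig: torch.Tensor, cam_af: torch.Tensor) -> torch.Tensor:
    """ER loss (Eq. 1): compare AF(CAM_orig) with the CAM of the
    affine-transformed input. AF(.) is assumed here to be a rescaling,
    matching the factor-0.4 rescale used in our training setup.

    cam_orig: CAM from the original branch, shape (B, C, H, W).
    cam_af:   CAM from the affine-transformed branch, shape (B, C, H', W').
    """
    # Apply the same affine transform (a rescale) to the original CAM.
    cam_orig_af = F.interpolate(cam_orig, size=cam_af.shape[-2:],
                                mode="bilinear", align_corners=False)
    # Mean absolute difference: a (scaled) L1 norm that keeps CAMs sparse.
    return (cam_orig_af - cam_af).abs().mean()
```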

CAM Refinement.

The CAM is further refined by exploiting the correlation between pixels, based on the premise that similar pixels ought to belong to the same object. The correlation is computed as the cosine similarity between pixel embeddings:

$$\mathrm{Corr}(f_i, f_j) = \mathrm{ReLU}\!\left( \frac{\theta(f_i)^{T}\,\theta(f_j)}{\|\theta(f_i)\|\,\|\theta(f_j)\|} \right), \tag{2}$$

where $f_i$ and $f_j$ denote the feature vectors at spatial locations $i$ and $j$, and $\theta$ is an embedding function, implemented as a 1×1 convolution layer, that reduces the number of channels. The ReLU activation avoids negative values in the correlation matrix. The refined CAM is then computed as:

$$CAM_{rm}(f_i) = \frac{1}{N(f_i)} \sum_j \mathrm{Corr}(f_i, f_j)\, CAM_{orig}(f_j), \tag{3}$$

where $N(f_i) = \sum_j \mathrm{Corr}(f_i, f_j)$ is a normalization constant.
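A compact PyTorch sketch of this refinement step is shown below; the module name `CAMRefine` and the channel sizes `in_ch`/`embed_ch` are illustrative, not from released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAMRefine(nn.Module):
    """Refine a CAM by aggregating over similar pixels (Eqs. 2-3)."""

    def __init__(self, in_ch: int, embed_ch: int):
        super().__init__()
        # theta: 1x1 conv that reduces the channel dimension.
        self.theta = nn.Conv2d(in_ch, embed_ch, kernel_size=1)

    def forward(self, feat: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feat.shape
        emb = self.theta(feat).flatten(2)             # (B, E, HW)
        emb = F.normalize(emb, dim=1)                 # unit norm per pixel
        corr = torch.relu(emb.transpose(1, 2) @ emb)  # (B, HW, HW) cosine sim
        norm = corr.sum(dim=2, keepdim=True).clamp_min(1e-5)  # N(f_i)
        cam_flat = cam.flatten(2)                     # (B, C, HW)
        # Weighted average of CAM values over correlated pixels j (Eq. 3).
        cam_rm = cam_flat @ (corr / norm).transpose(1, 2)
        return cam_rm.view(b, -1, h, w)
```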

Equivariant Cross Regularization.

In our experiments, we found that the affine-transformed branch of the Siamese network identifies more lesion areas than the original branch. Moreover, the refined CAMs of the two branches each lose information, which lets the original ER loss fall into a local minimum where all pixels are assigned to a single class. This problem can be mitigated by cross-regularizing the original CAM and refined CAM across the two branches of the Siamese network, because the CAMs of the two branches focus on different regions. The equivariant cross regularization (ECR) loss, which reduces CAM degeneration and helps escape such local minima, is given by

$$\mathcal{L}_{ECR} = \left\| AF(CAM_{orig}) - CAM_{rm}^{AF} \right\|_1 + \left\| AF(CAM_{rm}) - CAM_{AF} \right\|_1, \tag{4}$$

where $CAM_{rm}^{AF}$ denotes the refined CAM of the affine-transformed input image.
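A sketch of this loss under the same rescaling-as-AF assumption as the ER sketch above; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_regularization_loss(cam_orig, cam_rm, cam_af, cam_rm_af):
    """ECR loss (Eq. 4): cross-regularize original and refined CAMs
    across the two Siamese branches."""
    target = cam_af.shape[-2:]
    def af(x):  # the affine transform, here a bilinear rescale
        return F.interpolate(x, size=target, mode="bilinear",
                             align_corners=False)
    return ((af(cam_orig) - cam_rm_af).abs().mean()
            + (af(cam_rm) - cam_af).abs().mean())
```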

2.2. Attention-Based Multiple Instance Learning Module

We view the MIL task as solving two fundamental problems: how to extract instances and how to aggregate them. For the former, conventional patch-based instance extraction divides an image into N patches of size p×p, where each patch serves as an instance. In our early experiments, however, patch-based MIL6 with attention pooling had difficulty differentiating positive from negative instances: the contribution weights of positive and negative instances were close to each other, particularly when 1) the lesion region is extremely small, or 2) a single lesion spans adjacent patches. For the latter problem, different pooling strategies15-17 have been proposed to aggregate instance-level representations. We use the attention pooling proposed by Ilse et al.6 to assign a weight to each instance via an attention mechanism.

The feature map learned under self-supervised equivariant regularization encodes the global distribution of lesions. Instead of using patch-based MIL, we take advantage of the global lesion feature maps $f_{orig} \in \mathbb{R}^{L \times HW}$ and $f_{rv} \in \mathbb{R}^{L \times HW}$. Each element $f_{i,j}$ at spatial location $i,j$ of a feature map is treated as an instance, so the number of instances is $H \times W$. We aggregate $f_{orig}$ and $f_{rv}$ and increase the number of channels to $K$ with a 1×1 convolution layer to enrich the feature space, yielding a final feature map $f \in \mathbb{R}^{K \times HW}$. The MIL attention mechanism, which weights the contribution of each instance, is given by

$$A = \mathrm{Sigmoid}\!\left( w_1^{T}\, \mathrm{ReLU}(w_2 f) \right), \tag{5}$$

where $w_1 \in \mathbb{R}^{D \times 1}$ and $w_2 \in \mathbb{R}^{D \times K}$. We apply the attention map $A$ to the feature map $f \in \mathbb{R}^{K \times HW}$:

$$f_{MIL} = f \odot A, \tag{6}$$

where $\odot$ denotes element-wise multiplication (with $A$ broadcast across channels). The weighted feature map is fed into a fully-connected layer with a softmax activation to output probabilistic predictions.
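A PyTorch sketch of this module follows; the mean aggregation of weighted instances before the classifier and the hidden width $D$ are our assumptions, since the text does not fix the aggregation details.

```python
import torch
import torch.nn as nn

class MILAttention(nn.Module):
    """Attention over per-pixel instances of the fused feature map (Eqs. 5-6)."""

    def __init__(self, k: int, d: int, num_classes: int = 2):
        super().__init__()
        self.w2 = nn.Linear(k, d, bias=False)  # w2 in R^{D x K}
        self.w1 = nn.Linear(d, 1, bias=False)  # w1 in R^{D x 1}
        self.classifier = nn.Linear(k, num_classes)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, K, HW); every spatial location is one instance.
        inst = f.transpose(1, 2)                               # (B, HW, K)
        a = torch.sigmoid(self.w1(torch.relu(self.w2(inst))))  # (B, HW, 1)
        f_mil = inst * a         # Eq. 6: element-wise weighting of instances
        # Aggregate weighted instances, then classify;
        # softmax is applied inside the cross-entropy loss.
        return self.classifier(f_mil.mean(dim=1))              # (B, classes)
```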

2.3. Loss Design

The network is jointly trained with self-supervised equivariant regularization and MIL, outputting a classification prediction and a fine-grained lesion CAM. The overall objective function is

$$\mathcal{L} = \mathcal{L}_{multi\text{-}class} + \mathcal{L}_{ER} + \mathcal{L}_{ECR} + \mathcal{L}_{cross\text{-}entropy}. \tag{7}$$

$\mathcal{L}_{cross\text{-}entropy}$ is the cross-entropy loss on the MIL output, used to optimize the classification prediction:

$$\mathcal{L}_{cross\text{-}entropy} = -\sum_{c=1}^{C} y_c \log(p_c), \tag{8}$$

where $y_c \in \{0, 1\}$ is the binary label indicating whether the image is rDR, and $p_c$ is the predicted probability of class $c$. $\mathcal{L}_{multi\text{-}class}$ is a multi-label soft margin loss, which accommodates multiple labels per image and is used to optimize the CAM:

$$\mathcal{L}_{multi\text{-}class} = -\frac{1}{C_1} \sum_{c=1}^{C_1} \left[ y_c \log\!\left( \frac{1}{1 + e^{-y_{o,c}}} \right) + (1 - y_c) \log\!\left( \frac{e^{-y_{o,c}}}{1 + e^{-y_{o,c}}} \right) \right]. \tag{9}$$

Here $C_1$ denotes the number of target classes (we have only one foreground object, the lesion, in our case), $y_c \in \{0, 1\}$ denotes whether class $c$ is present, and $y_{o,c}$ is the predicted CAM output after an adaptive average pooling layer. Note that the label vector covers both foreground and background, indicating whether a target object or the background is present, respectively. In our case, we one-hot encode the foreground and background for each image: $\{1, 1\}$ represents referable DR with background, and $\{1, 0\}$ represents non-referable DR with background, where the first entry indicates whether background is present and the second whether a lesion is present.
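For illustration, the overall objective (Eq. 7) could be wired together as sketched below; the CAM losses come from the sketches above, and all variable names are hypothetical.

```python
import torch
import torch.nn as nn

multi_class = nn.MultiLabelSoftMarginLoss()  # Eq. 9 over {background, lesion}
cross_entropy = nn.CrossEntropyLoss()        # Eq. 8 on the MIL logits

def total_loss(cam_logits, cam_targets, mil_logits, rdr_labels,
               loss_er, loss_ecr):
    # cam_logits:  pooled CAM predictions, shape (B, 2)
    # cam_targets: one-hot {background, lesion} labels, e.g. [1, 1] for
    #              rDR with background, [1, 0] for non-referable DR
    # mil_logits:  MIL classifier output, shape (B, 2)
    # rdr_labels:  integer rDR labels, shape (B,)
    return (multi_class(cam_logits, cam_targets.float())
            + loss_er + loss_ecr
            + cross_entropy(mil_logits, rdr_labels))
```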

3. EXPERIMENTS

We validate our proposed method on the Eyepacs dataset18 through three experiments: (i) ResNet38, (ii) attention-based MIL with ResNet38 as backbone, and (iii) our proposed self-supervised equivariant regularized attention-based MIL with ResNet38 as backbone. We also compare the performance of our model against other state-of-the-art work.

Dataset.

The Eyepacs dataset consists of 35,126 training images, 10,906 validation images, and 42,670 testing images. Beyond the no-retinopathy level, the dataset grades DR into four levels: mild, moderate, and severe non-proliferative DR, and proliferative DR. In clinical practice, DR beyond the mild stage is recognized as referable DR; otherwise, it is non-referable. All images are center-cropped and resized to 512×512.

Data Augmentation.

To prevent overfitting and increase generalizability, extensive data augmentation is performed, including random horizontal and vertical flips, random crops, color jitter, random rotations, and random translations. The same data augmentation is used in all our experiments.
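One plausible torchvision composition of these augmentations is sketched below; the exact magnitudes and crop scale are illustrative, not from the paper.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.RandomResizedCrop(512, scale=(0.9, 1.0)),   # random crop back to 512x512
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.RandomAffine(degrees=15, translate=(0.05, 0.05)),  # rotation + translation
    T.ToTensor(),
])
```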

Implementation Details.

ResNet38,14 excluding the global average pooling layer and final fully-connected layer, is used as the backbone for all our experiments because of its state-of-the-art performance on semantic segmentation. A rescaling transformation with a factor of 0.4 serves as the affine transformation when training our proposed method. All models were trained with a stochastic gradient descent (SGD) optimizer with different learning rates per parameter group: lower learning rates for the ResNet38 parameters and larger ones for the rest. The initial learning rate is 0.001 with polynomial decay of power 0.9. A weight decay of 0.0005 is applied to further prevent overfitting. All experiments were run on an Nvidia GeForce RTX 2080 with a batch size of 3 for 100 training epochs.
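A minimal sketch of this optimizer setup follows; the 10x head learning rate and the momentum value are our assumptions, while the initial learning rate, weight decay, decay power, and epoch count follow the text.

```python
import torch
import torch.nn as nn

def build_optimizer(backbone: nn.Module, heads: nn.Module,
                    steps_per_epoch: int, epochs: int = 100):
    """Poly-decay SGD with per-group learning rates, as described above."""
    base_lr = 0.001
    optimizer = torch.optim.SGD(
        [{"params": backbone.parameters(), "lr": base_lr},      # ResNet38
         {"params": heads.parameters(), "lr": 10 * base_lr}],   # new layers
        momentum=0.9, weight_decay=0.0005)
    max_steps = epochs * steps_per_epoch
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: (1.0 - step / max_steps) ** 0.9)
    return optimizer, scheduler
```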

Evaluation Metrics.

We evaluate rDR classification using accuracy, F1 score, and area under the receiver operating characteristic curve (AU ROC).
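These metrics can be computed with scikit-learn as sketched below; the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(labels: np.ndarray, probs: np.ndarray, threshold: float = 0.5):
    """labels: binary rDR labels; probs: predicted rDR probabilities."""
    preds = (probs >= threshold).astype(int)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds),
            "au_roc": roc_auc_score(labels, probs)}
```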

3.1. Baseline Experiment

The baseline experiment uses the official EyePacs training/validation/test split and evaluates the contribution of each module by comparing accuracy, F1 score, and AU ROC. As shown in Table 1, comparative experiments are conducted on ResNet38, the MIL module, and the SEAM module under the same training configuration. Our proposed method achieves the best performance, with an AU ROC of 0.958 on the test set. The original CAM, refined CAM, and segmentation were generated on the test set to visually inspect and validate our proposed method, as shown in Fig. 3.

Table 1.

Comparison of our proposed method (ResNet38 + SEAM + MIL) with the baselines: accuracy, F1 score, and area under the ROC curve (AU ROC) on the validation and test sets.

Method                        Dataset   Accuracy  F1 Score  AU ROC
ResNet38                      Val set   0.931     0.809     0.954
                              Test set  0.928     0.805     0.950
ResNet38 + MIL (patch-based)  Val set   0.936     0.818     0.957
                              Test set  0.932     0.810     0.953
ResNet38 + SEAM + MIL (ours)  Val set   0.939     0.829     0.961
                              Test set  0.936     0.823     0.958
Figure 3.

(a) Original input image, (b) original CAM, (c) lesion segmentation from original CAM, (d) refined CAM, (e) lesion segmentation from refined CAM. The segmentation generated by the original CAM is likely to lose lesion information. Our proposed method provides better localization in terms of lesion boundary and preserves more information about lesion regions.

3.2. Comparison Experiments

We compare our results with state-of-the-art work in terms of AU ROC in Table 2. Our proposed method outperforms the listed methods. Note that we trained our model on the official training split with only data augmentation and evaluated it on the official validation and test splits, whereas the other methods included additional preprocessing and training strategies, as detailed in refs. 2, 3, 19, and 20.

Table 2.

Comparison of our proposed method with other state-of-the-art methods on the Eyepacs dataset for rDR classification.

Dataset Split                   Method             AU ROC
Training (all) / Testing (all)  Leibig et al.21    0.927
                                Rakhlin et al.19   0.920
                                Pires et al.20     0.946
                                Graham et al.2     0.951
                                Quellec et al.3    0.954
                                MIL-SEAM (ours)    0.958

4. CONCLUSION

In this paper, we proposed to regularize multiple instance learning with a self-supervised equivariant mechanism so that it bags positive and negative instances more accurately for rDR classification. As a byproduct, we obtain a fine-grained CAM and a coarse lesion segmentation from weakly supervised image-level labels, which can guide medical providers in diagnosing rDR. Our ablation study on the Eyepacs dataset showed the improvement of our proposed method over the patch-based attention MIL method. Our method also compares favorably with state-of-the-art work, achieving an AU ROC of 0.958.

Our proposed method can be extended to many other medical imaging applications for which only image-level annotations are available and where specific visual patterns are directly related to the prediction target. We provided an example of adapting SEAM to multiple instance learning to mitigate its drawbacks for our task; predictive models other than multiple instance learning could likewise be adapted to other tasks.

REFERENCES

  • [1] Wu L, Fernandez-Loaiza P, Sauma J, Hernandez-Bogantes E, and Masís M, "Classification of diabetic retinopathy and diabetic macular edema," World Journal of Diabetes 4(6), 290–294 (2013).
  • [2] Graham B, "Kaggle diabetic retinopathy detection competition report," University of Warwick, 24–26 (2015).
  • [3] Quellec G, Charrière K, Boudi Y, Cochener B, and Lamard M, "Deep image mining for diabetic retinopathy screening," Medical Image Analysis 39, 178–193 (2017).
  • [4] Huang Y, Lin L, Cheng P, Lyu J, and Tang X, "Identifying the key components in ResNet-50 for diabetic retinopathy grading from fundus images: a systematic investigation," arXiv:2110.14160 (2021).
  • [5] He K, Zhang X, Ren S, and Sun J, "Deep residual learning for image recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778 (2016).
  • [6] Ilse M, Tomczak JM, and Welling M, "Attention-based deep multiple instance learning," arXiv:1802.04712 (2018).
  • [7] Sadafi A, Makhro A, Bogdanova AY, Navab N, Peng T, Albarqouni S, and Marr C, "Attention based multiple instance learning for classification of blood cell disorders," arXiv:2007.11641 (2020).
  • [8] Ahn J and Kwak S, "Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4981–4990 (2018).
  • [9] Kolesnikov A and Lampert CH, "Seed, expand and constrain: Three principles for weakly-supervised image segmentation," arXiv:1603.06098 (2016).
  • [10] Fan J, Zhang Z, and Tan T, "CIAN: Cross-image affinity net for weakly supervised semantic segmentation," AAAI (2020).
  • [11] Zhou B, Khosla A, Lapedriza A, Oliva A, and Torralba A, "Learning deep features for discriminative localization," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2921–2929 (2016).
  • [12] Wang Y, Zhang J, Kan M, Shan S, and Chen X, "Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12272–12281 (2020).
  • [13] Bromley J, Bentz JW, Bottou L, Guyon I, LeCun Y, Moore C, Säckinger E, and Shah R, "Signature verification using a 'Siamese' time delay neural network," International Journal of Pattern Recognition and Artificial Intelligence (1993).
  • [14] Wu Z, Shen C, and van den Hengel A, "Wider or deeper: Revisiting the ResNet model for visual recognition," arXiv:1611.10080 (2019).
  • [15] Pinheiro PHO and Collobert R, "From image-level to pixel-level labeling with convolutional networks," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1713–1721 (2015).
  • [16] Zhu W, Lou Q, Vang YS, and Xie X, "Deep multi-instance networks with sparse label assignment for whole mammogram classification," MICCAI (2017).
  • [17] Feng J and Zhou Z-H, "Deep MIML network," AAAI (2017).
  • [18] Cuadros J and Bresnick G, "EyePACS: an adaptable telemedicine system for diabetic retinopathy screening," Journal of Diabetes Science and Technology 3(3), 509–516 (2009).
  • [19] Rakhlin A, "Diabetic retinopathy detection through integration of deep learning classification framework," bioRxiv, 225508 (2018).
  • [20] Pires R, Avila S, Wainer J, Valle E, Abràmoff MD, and Rocha A, "A data-driven approach to referable diabetic retinopathy detection," Artificial Intelligence in Medicine 96, 93–106 (2019).
  • [21] Leibig C, Allken V, Ayhan MS, Berens P, and Wahl S, "Leveraging uncertainty information from deep neural networks for disease detection," Scientific Reports 7(1), 1–14 (2017).
