Abstract
The availability of big data can transform biomedical research and generate greater scientific insights when expert labeling is available to facilitate supervised learning. However, data annotation can be labor-intensive and cost-prohibitive when pixel-level precision is required. Weakly supervised semantic segmentation (WSSS) with image-level labeling has emerged as a promising solution in medical imaging. However, most existing WSSS methods in the medical domain are designed for single-class segmentation per image, overlooking the complexities arising from the co-existence of multiple classes in a single image. Additionally, multi-class WSSS methods from the natural image domain cannot produce comparable accuracy for medical images, given the challenge of substantial variation in lesion scales and occurrences. To address this issue, we propose a novel anomaly-guided mechanism (AGM) for multi-class segmentation in a single retinal optical coherence tomography (OCT) image using only image-level labels. AGM leverages anomaly detection and self-attention to integrate weak abnormal signals and global contextual information into the training process. Furthermore, we include an iterative refinement stage to guide the model to focus more on potential lesions while suppressing less relevant regions. We validate the performance of our model on two public datasets and one challenging private dataset. Experimental results show that our approach achieves new state-of-the-art performance in WSSS for lesion segmentation on OCT images.
Keywords: Weakly supervised segmentation, Multi-label classification, Retinal OCT lesion segmentation, Anomaly detection, Self-attention
1. Introduction
The management of a variety of vision-threatening retinal conditions can be substantially improved with the aid of imaging technologies that reveal lesions as diagnostic and prognostic imaging biomarkers. Optical Coherence Tomography (OCT) is one such imaging modality, providing high-resolution, cross-sectional images of the retina for improved detection and monitoring of retinal diseases. In clinical practice, the ability to recognize these lesions can facilitate treatment planning. Moreover, segmentation and quantification of these imaging biomarkers have been shown to provide a further nuanced understanding of their contribution to disease activity (Schmidt-Erfurth et al., 2021).
Semantic segmentation is one of the fundamental computer vision tasks that aim to obtain pixel-level segmentation results for given images. However, annotating medical images at a pixel-level can be time-consuming and costly, especially for biomedical images that typically require domain knowledge. To overcome this barrier, efforts have been dedicated to weakly supervised semantic segmentation (WSSS). WSSS uses weaker forms of supervision, such as image-level labels (Pinheiro and Collobert, 2015; Ahn and Kwak, 2018; Kolesnikov and Lampert, 2016; Kwak et al., 2017; Niu et al., 2023), scribbles (Lin et al., 2016; Vernaza and Chandraker, 2017; Luo et al., 2022; Valvano et al., 2021), and bounding boxes (Kervadec et al., 2019; Ma et al., 2022; Oh et al., 2021; Dai et al., 2015), to create pixel-level predicted segmentation. One common approach of WSSS is to create pseudo labels for training segmentation networks using classification task byproducts, such as Class Activation Maps (CAMs) (Zhou et al., 2016), which provide a heatmap of salient regions for the predicted target class. In this study, we focus on developing a WSSS method that only utilizes image-level supervision to segment lesions in multi-label OCT images.
In recent years, various CAM-based WSSS methods have been proposed (Wu et al., 2021; Ahn and Kwak, 2018; Chen et al., 2022b). However, the focus has mainly been on natural images, and applying these models directly to medical datasets can be problematic due to the inherent differences between natural and medical images. These differences include variations in image intensities, object appearance, and diverse scales of anatomical structures (Xing et al., 2021; van Engeland et al., 2006; Prince and Links, 2006; Chen et al., 2022a). In particular, both the raw CAMs and the refined CAMs, which are generated by extending seed CAMs to entire objects, remain relatively coarse because lesions in OCT images are small, noisy, and low-contrast compared to natural objects.
In medical imaging, especially in OCT images, current CAM-based methods (Wang et al., 2021; Roth et al., 2021; Zhang et al., 2022; Liu et al., 2023) face two primary challenges: they are often tuned for specific diseases or simplified to binary segmentation per image, and struggle to detect all lesion regions when those regions are low-contrast. Recognizing the importance of detecting subtle or unexpected abnormalities, which can indicate underlying pathologies, anomaly detection has become a popular research direction to highlight abnormal regions (Schlegl et al., 2019; Zhou et al., 2020; Liu et al., 2023). However, the limitation of anomaly detection is its inability to differentiate between lesion types.
Drawing inspiration from anomaly detection and CAM-based WSSS, our work seeks to integrate abnormal signals with CAMs in multi-class WSSS, which remains underexplored in the literature to the best of our knowledge. We propose an anomaly-guided mechanism (AGM) in this paper that can capture rich anomalous information from a multi-label OCT image. In particular, our proposed method exploits the anomaly-discriminative representation with the aid of GAN-generated healthy counterpart of the same retina (Akcay et al., 2018) to provide a more robust representation of the lesion. We further enhance the model’s ability to localize small lesions with spatial constraints by incorporating self-attention, and an iterative refinement learning step to leverage anomalous features. Fig. 1 illustrates the effectiveness of AGM on small low-contrast lesions in comparison with a plain backbone without anomaly guidance. In summary, our main contributions are threefold:
We introduce a novel anomaly-guided WSSS method with image-level supervision specifically designed for medical lesion segmentation.
We leverage anomaly information for the detection of small lesions. Instead of using anomaly knowledge in the pre/post-processing, we utilize the self-attention mechanism to enhance small lesion localization by capturing global lesion information and develop an efficient refinement learning approach to further direct the attention by utilizing anomalous features.
We perform comprehensive experiments on two public and one private OCT datasets, achieving superior performance compared to current state-of-the-art methods in lesion segmentation with image-level labels only.
Fig. 1:

An illustration of examples generated by our proposed AGM. (a) Original image. (b) CAM of the PED lesion. (c) CAM of the SRF lesion. (d) Pseudo labels after binarizing (b) and (c): the green region represents PED and the red region represents SRF. (e) Pseudo label generated by ResNet-50 for comparison. (f) Pixel-level ground truth.
2. Related Works
2.1. Weakly Supervised Semantic Segmentation
Due to the limited availability of expert-annotated pixel-level labels in many application domains, the development of WSSS has taken a great leap in computer vision tasks. Most of the methods exploit CAMs to segment class objects (Zhou et al., 2016; Selvaraju et al., 2017; Wang et al., 2020a; Ramaswamy et al., 2020; Wu et al., 2021; Kim et al., 2021; Wang et al., 2020c; Ru et al., 2022; Chen et al., 2022b; Shi et al., 2021). The main idea is to highlight the discriminative regions of an image that lead to the classification decision, and then use these regions as pixel-level pseudo labels to train an independent segmentation network, mimicking fully-supervised segmentation. For example, (Wu et al., 2021) embeds an attention mechanism to capture class-specific affinity. Approaches like (Choe et al., 2020) and (Jo and Yu, 2021) drop the most discriminative parts or remove patches of images to expand object regions. (Kolesnikov and Lampert, 2016) expands initial seeds with a seed-expand-constrain framework. These approaches are more applicable to natural images, in which objects are often relatively larger and more distinct than those in medical images. Therefore, applying them directly to the medical domain can misinterpret the finer structures, leading to over-activated small regions of interest (ROIs), such as tumors and blood vessel exudate. Precise segmentation in medical images directly impacts diagnostic outcomes.
Compared to the computer vision domain, there is a scarcity of research (Ouyang et al., 2019; Li et al., 2022; Patel and Dolz, 2022; Viniavskyi et al., 2020; Belharbi et al., 2021; Zhang et al., 2022) on medical image segmentation using only image-level labels, and the approaches are typically driven by domain knowledge. For instance, OEEM (Li et al., 2022) utilizes patch-level labels to improve gland segmentation with the initial CAM seed that is specifically designed for histology images. Similarly, (Patel and Dolz, 2022) introduces a constraint that is invariant to cross-modality, but it necessitates the existence of multiple modalities in the corresponding domain. TransWS (Zhang et al., 2022) derives finer CAMs from Transformer-based architecture for histology images. C-CAM (Chen et al., 2022a) introduces category causality and anatomy causality to solve the ambiguous boundary and co-occurrence problems on MRI images. In addition, image-level supervision is also investigated for the segmentation of lesions in OCT B-scans. For example, (Xing et al., 2021; Wang et al., 2020b) focus on retinal detachment segmentation, with two-stage refinement learning and feature propagation learning, respectively. TSSK (Liu et al., 2023) proposes a distillation-based framework that is optimized by pre-training on OCT images.
However, most of the existing methods mentioned above in the medical image domain, especially OCT segmentation tasks, are only for single-class segmentation, where only one lesion class can be found in an image. Our proposed method can tackle the challenging task of segmenting multiple coexisting biomarkers in a single OCT image, given the diverse morphologies of lesions.
2.2. Pseudo Label Refinement
Pseudo labels generated by the original CAM tend to be coarse or to cover only the most discriminative part of the class region, so their performance in the segmentation step is not satisfactory. To address this challenge, many studies have added refinement steps to enhance the initial CAMs and construct a more complete object. Some papers use random walk algorithms to expand the object regions with semantically similar pixels (Ahn et al., 2019; Ahn and Kwak, 2018; Viniavskyi et al., 2020). AffinityNet (Ahn and Kwak, 2018) predicts semantic affinities between adjacent coordinate pairs. IRNet (Ahn et al., 2019) provides class boundary maps by computing pairwise semantic affinities. Alternatively, (Wu et al., 2021) applies thresholding to a saliency map that highlights the ROI of a given image, detecting insignificant parts of the foreground to improve the segmented region. ReCAM (Chen et al., 2022b) proposes a method of leveraging softmax cross-entropy loss to reactivate more pixels in pseudo labels. In the studies mentioned, the refinement step is performed as post-processing or as an independent subsequent training stage of WSSS; thus, the classification model does not benefit from the refined CAM. Moreover, in the OCT lesion context, the refinement strategy of expanding CAMs does not perform effectively on small lesions, as the initial CAMs tend to be broad or capture less relevant areas. In our proposed refinement technique, we improve the training model by sharing signals from refinement learning in each iteration to enhance the targeted regions.
2.3. Anomaly Detection
Anomaly detection approaches identify abnormal patterns that deviate from normalcy by learning a representation of normal data and detecting deviations from it in an unsupervised manner. Medical image anomaly detection is challenging due to high variability and the lack of annotated data. Therefore, recent studies have focused on developing generative adversarial networks (GANs) (Goodfellow et al., 2020) to improve detection efficacy and robustness. In GANomaly (Akcay et al., 2018), the authors propose a semi-supervised adversarial training approach in which anomalies receive larger anomaly scores under the learned data distribution. In GANomaly, the training dataset contains only normal images, while the test dataset contains both normal and abnormal images. Similarly, in (Schlegl et al., 2019), the authors use a fast mapping technique on the latent space to identify anomalous images, and anomalies are detected using a combined anomaly score. The method closest to our concept is proposed in (Wang et al., 2021), which uses CycleGAN to create fake healthy images and reconstruct healthy OCT images from abnormal inputs in order to segment lesions. However, it does not generalize to other lesion types due to the strict lesion-driven post-processing required.
Some literature combines anomaly detection with CAMs (Chen et al., 2020; Meissen et al., 2021; Silva-Rodríguez et al., 2022; Liu et al., 2021). For instance, (Liu et al., 2021) proposes to combine the residual outcomes from the anomaly detection model with the attention map derived from weakly supervised classification to achieve finer segmentation on OCT images. Anomaly detection tasks are typically applicable to single-class segmentation per image. Contrary to these anomaly detection works, our study explores segmenting instances of multiple coexisting lesion types that are significantly different in shape and size within a single multi-label image. Particularly, we integrate weak abnormal signals by anomaly detection into WSSS architecture, enabling a more robust WSSS model to segment each image into multiple lesion types.
2.4. Self-Attention
The attention mechanism (Vaswani et al., 2017; Wang et al., 2018; Zhu et al., 2019; Yang et al., 2021) has attracted significant interest in recent years across various research fields. (Wang et al., 2018) proposes a non-local convolution network to overcome the local dependency of convolutional networks. In another study, (Ramachandran et al., 2019) replaces ResNet spatial convolution layers with self-attention layers; their technique considers all pixel positions within a window of a given size around each pixel and computes query, key, and value vectors for these pixels. Due to the local dependency of convolutional networks, many studies (Ru et al., 2022; Choe et al., 2020) adopt self-attention mechanisms in WSSS to extract global contextual information. For example, SEAM (Wang et al., 2020c) proposes an equivariant attention mechanism between affine-transformed images to improve performance. Additionally, the attention mechanism proposed in (Wu et al., 2021) highlights similar regions for a given class category and enhances the predicted pseudo labels. In this work, we aim to utilize the benefits of self-attention to isolate and extract subtle abnormal signals, in contrast to prior literature that relies on the use of original inputs. This is essential for medical imaging applications where the target regions are very small and only the anomaly regions need to be emphasized.
3. Method
In this section, we introduce our proposed method designed for generating high-quality pseudo labels, which can serve as the ground truth for semantic segmentation in a fully-supervised manner. The pseudo label generation stage consists of three essential components, as depicted in Fig. 2: 1) synthesizing a fake healthy image from a given input image, which could be either diseased or healthy, by employing a well-trained GANomaly network, 2) generating CAM by performing classification on anomaly features using image-level class labels, and 3) iterative refinement learning to produce the final pseudo labels.
Fig. 2:

Overview of the proposed framework for the pseudo label generation. (left) GANomaly training. (right) Anomaly-guided mechanism process.
The proposed model is trained on a dataset containing $N$ images, which include both healthy and diseased samples, denoted as $\mathcal{X} = \{x_i\}_{i=1}^{N}$. Each image is labeled with $C$ classes, denoted as $y \in \{0, 1\}^{C}$, where $C = C_l + 1$, including $C_l$ lesion classes and one non-lesion/background class.
3.1. Anomaly-Discriminative Representations
Inspired by anomaly detection, our model focuses on anomalous regions by taking advantage of the abundance of normal OCT images in the public domain. The first step is to train a GANomaly network with non-anomalous retinal images to generate ‘fake healthy’ OCT images. This step, as shown in the left part of Fig. 2, serves as the foundation for building anomaly-discriminative representations in subsequent stages.
GANomaly consists of one generator $G$, one reconstruction encoder $E$, and one discriminator $D$. As shown in the left part of Fig. 2, given a large training dataset with normal/healthy images, $G$ is the encoding generator that maps the input $x$ to the latent vector $z$ and produces the reconstruction $\hat{x}$, while $E$ is the encoder network that maps the generated $\hat{x}$ to a vector $\hat{z}$. The function $f(\cdot)$ yields an intermediate layer of the discriminator model to learn the normal data distribution and minimize the output anomaly score.
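As a concrete toy illustration of this encoder-decoder-encoder scheme, the sketch below computes a GANomaly-style anomaly score as the distance between the input's latent code and the reconstruction's latent code. The linear maps `G_E`, `G_D`, and `E` are hypothetical stand-ins for the real convolutional sub-networks, so this is a sketch of the scoring idea only, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for GANomaly's sub-networks (the real model
# uses convolutional encoders/decoders trained on healthy images only).
G_E = rng.standard_normal((8, 64)) * 0.1   # generator encoder: image -> z
G_D = rng.standard_normal((64, 8)) * 0.1   # generator decoder: z -> x_hat
E = rng.standard_normal((8, 64)) * 0.1     # reconstruction encoder: x_hat -> z_hat

def anomaly_score(x):
    """Distance between the two latent codes; large for inputs far from
    the learned healthy distribution."""
    z = G_E @ x          # latent code of the input
    x_hat = G_D @ z      # generated reconstruction ('fake healthy' image)
    z_hat = E @ x_hat    # latent code of the reconstruction
    return float(np.abs(z - z_hat).sum())

score = anomaly_score(rng.standard_normal(64))
```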
We incorporate the pre-trained generator $G$ from GANomaly to concurrently generate the healthy counterparts of the input during the training of the classification model, as depicted in the right part of Fig. 2. These healthy counterparts are utilized to construct anomaly-discriminative representations. These representations are then fed into our proposed network to achieve multi-label classification and generate CAMs. The network comprises two branches that accept anomaly-discriminative representations as input: 1) a classification backbone, and 2) a self-attention module to extract global anomalous features.
Let $x^h = G(x)$ represent the generated healthy counterpart. The input of the backbone branch is the concatenation of the original image $x$ and the fake healthy image $x^h$, denoted as $X_b = [x; x^h]$. Moreover, the self-attention module input, $X_a$, is obtained by normalizing the absolute subtraction between $x$ and $x^h$. This subtraction $|x - x^h|$, as shown in Fig. 3(c), reflects the structural difference between the original image and its healthy counterpart, enabling the self-attention module to capture weak abnormal signals and deliver global anomalous features.
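The construction of the two branch inputs can be sketched as follows; the shapes and the small stabilizer constant are illustrative assumptions:

```python
import numpy as np

def build_branch_inputs(x, x_healthy):
    """Form the two branch inputs from an image and its GANomaly-generated
    healthy counterpart (array shapes and names are illustrative)."""
    # Backbone input: channel-wise concatenation of original and fake healthy.
    x_b = np.concatenate([x, x_healthy], axis=0)        # (2C, H, W)
    # Self-attention input: normalized absolute difference, which keeps
    # only the structural deviation from the healthy counterpart.
    d = np.abs(x - x_healthy)
    x_a = (d - d.min()) / (d.max() - d.min() + 1e-8)    # values in [0, 1]
    return x_b, x_a

x = np.random.rand(1, 32, 32)
x_h = np.random.rand(1, 32, 32)
x_b, x_a = build_branch_inputs(x, x_h)
```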
Fig. 3:

The illustration of our proposed anomaly-discriminative representation. (a) Original diseased image. (b) GANomaly-generated healthy counterpart. Please note that it is not a completely realistic normal image and can exhibit structural distortions, particularly when there’s a large lesion present. (c) Anomaly-discriminative representation for the self-attention branch. (d) Pixel-level ground truth.
3.2. Anomaly-Guided Dual-Branch Network
In this section, we present our anomaly-guided mechanism for the classification and initial CAMs generation, designed to localize smaller disease areas while reducing the focus on background or noise. Meanwhile, we leverage the anomaly-discriminative representations generated in Sec. 3.1 to improve classification performance.
Backbone Branch.
The backbone branch employs synthetic images generated by GANomaly alongside the original image, providing the model with additional information to learn more complex representations of the input data. Specifically, we use a ResNet-50 as the convolutional backbone $\Phi_b$ for the multi-label classification task. Given a representation $X_b \in \mathbb{R}^{C_{in} \times H \times W}$, the feature map $F_b$ extracted by the backbone is calculated as Eq. 1, where $F_b \in \mathbb{R}^{C' \times H' \times W'}$. The $C_{in}$, $H$, and $W$ denote the number of channels, height, and width of the input, respectively. Similarly, $C'$, $H'$, and $W'$ denote the dimensions of the output features.

$$F_b = \Phi_b(X_b) \tag{1}$$
Anomaly Self-Attention Module Branch.
To identify global dependencies and positional relationships among lesion pixels within anomalous patterns, we propose the anomaly self-attention module (ASAM), which extracts feature maps from using multi-head self-attention layers that learn multiple distinct representations of the input. Our self-attention module is built on the attention layer introduced in SASA (Ramachandran et al., 2019), which utilizes a sliding window-based approach to capture long-range interactions and relative pixel position.
We first embed a convolutional encoder $\Phi_e$ before the self-attention layers for three purposes: 1) obtaining local features of anomaly-discriminative representations, 2) downsampling the input to the same dimensions as the $F_b$ from the other branch for effective feature fusion, and 3) reducing the computational cost. The process is denoted as Eq. 2, where $F_e \in \mathbb{R}^{C' \times H' \times W'}$.

$$F_e = \Phi_e(X_a) \tag{2}$$
Next, the anomaly-discriminative features $F_e$ are fed into the self-attention module. Given a single pixel $x_{ij}$, the local region of pixels in positions $(a, b) \in \mathcal{N}_k(i, j)$ within spatial extent $k$ centered around $(i, j)$ is selected, named the memory block. Thus, the single-headed self-attention layer is formulated as:

$$y_{ij} = \sum_{(a,b) \in \mathcal{N}_k(i,j)} \operatorname{softmax}_{ab}\!\left(q_{ij}^{\top} k_{ab}\right) v_{ab} \tag{3}$$

In Eq. 3, $y_{ij}$ indicates the pixel output. The queries, keys, and values are $q_{ij} = W_Q x_{ij}$, $k_{ab} = W_K x_{ab}$, and $v_{ab} = W_V x_{ab}$, respectively, where $W_Q$, $W_K$, and $W_V$ are learned projections. A softmax function is applied to all logits computed in the memory block. The memory blocks are computed for every pixel in the given image to obtain global information. In practice, we implement multi-head attention to learn multiple distinct representations of the anomaly features. Let $N_h$ be the number of heads; $F_e$ is partitioned into $N_h$ groups along the channel dimension. The pixel correlation is generated by single-headed attention on each group separately, and the results are subsequently concatenated.

Furthermore, the relative position information is encoded as well. The relative position defines the relative distance of $(a, b)$ to $(i, j)$, denoted as the row offset $a - i$ and the column offset $b - j$. The row and column offset embeddings are concatenated to form $r_{a-i,\, b-j}$. The final attention between the target pixel and surrounding regions is formulated in Eq. 4. Next, we obtain the feature map $F_a$ of the ASAM branch by Eq. 5, where $F_a \in \mathbb{R}^{C' \times H' \times W'}$ and $\Phi_{sa}$ denotes the stacked self-attention layers.

$$y_{ij} = \sum_{(a,b) \in \mathcal{N}_k(i,j)} \operatorname{softmax}_{ab}\!\left(q_{ij}^{\top} k_{ab} + q_{ij}^{\top} r_{a-i,\, b-j}\right) v_{ab} \tag{4}$$

$$F_a = \Phi_{sa}(F_e) \tag{5}$$
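The sliding-window attention of Eqs. 3 and 4 can be sketched in plain numpy as follows. The projection matrices, window handling, and relative-position embeddings are simplified stand-ins for the SASA layer, so this is a single-head illustration rather than the paper's implementation:

```python
import numpy as np

def local_self_attention(x, Wq, Wk, Wv, r, k=3):
    """Single-head sliding-window self-attention with relative-position
    embeddings, in the spirit of SASA (a simplified numpy sketch).
    x: (C, H, W) features; Wq, Wk, Wv: (C, C) projections;
    r: (k, k, C) relative-position embeddings for the k x k memory block."""
    C, H, W = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            q = Wq @ x[:, i, j]                       # query for pixel (i, j)
            mem = xp[:, i:i + k, j:j + k]             # memory block around (i, j)
            keys = np.einsum('cd,dab->cab', Wk, mem)
            vals = np.einsum('cd,dab->cab', Wv, mem)
            # content logit q^T k_ab plus positional logit q^T r_{a-i, b-j}
            logits = np.einsum('c,cab->ab', q, keys) + np.einsum('abc,c->ab', r, q)
            w = np.exp(logits - logits.max())
            w /= w.sum()                              # softmax over the block
            out[:, i, j] = np.einsum('ab,cab->c', w, vals)
    return out

C, H, W, k = 4, 6, 6, 3
rng = np.random.default_rng(1)
y = local_self_attention(rng.standard_normal((C, H, W)),
                         rng.standard_normal((C, C)),
                         rng.standard_normal((C, C)),
                         rng.standard_normal((C, C)),
                         rng.standard_normal((k, k, C)))
```

Multi-head attention would simply run this routine on channel groups of `x` and concatenate the outputs.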
Initial CAMs Extraction.
We now fuse the feature maps by Eq. 6, performing element-wise multiplication between $F_b$ and $F_a$ from the backbone and ASAM branches, respectively. It provides an efficient way to impose the global anomalous constraints on the local anomalous features.

To enhance the model's ability to differentiate between healthy and typically small pathological regions, we use global maximum pooling (GMP) instead of global average pooling (GAP) before the fully connected (FC) layer. This encourages the model to focus on the most salient features from the fused feature map $F$. Now the classification result, $z$, can be formulated by Eq. 7.

$$F = F_b \odot F_a \tag{6}$$

$$z = \mathrm{FC}\!\left(\mathrm{GMP}(F)\right) \tag{7}$$
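Eqs. 6 and 7 amount to a few array operations; the sketch below uses hypothetical feature maps and a plain linear FC layer for illustration:

```python
import numpy as np

def fuse_and_classify(f_b, f_a, w_fc, b_fc):
    """Element-wise fusion of the backbone and ASAM feature maps,
    global maximum pooling, then a fully connected layer (Eqs. 6-7 sketch)."""
    f = f_b * f_a                  # global anomalous constraint on local features
    g = f.max(axis=(1, 2))         # GMP keeps the most salient response per channel
    return w_fc @ g + b_fc         # multi-label class logits

C, H, W, n_classes = 8, 14, 14, 4
f_b = np.random.rand(C, H, W)
f_a = np.random.rand(C, H, W)
logits = fuse_and_classify(f_b, f_a, np.random.rand(n_classes, C), np.zeros(n_classes))
```

With GAP, a few high-activation lesion pixels would be averaged away; GMP keeps exactly those responses, which is why it suits small lesions.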
Then, we use the multi-label binary cross-entropy loss for training as follows,
$$\mathcal{L}_{cls} = -\frac{1}{C} \sum_{n=1}^{C} \left[\, y_n \log \sigma(z_n) + (1 - y_n) \log\left(1 - \sigma(z_n)\right) \right] \tag{8}$$

Here, $n$ denotes the $n$-th class, $y_n$ is the image-level label, and $\sigma$ is the sigmoid function.
As the CAMs are used as the pseudo labels for fully supervised semantic segmentation, the quality of the pseudo labels is critical. In this study, we use GradCAM to generate the pseudo label for the foreground lesion classes $c \in \{1, \ldots, C-1\}$. The optimal performance in our experiments is achieved using GradCAM with augmentation smoothing, which reduces the noise in the CAMs. We can obtain the initial GradCAM heat map $M^c$ for a single class $c$ by:
$$\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial z^c}{\partial A_{ij}^k} \tag{9}$$

$$M^c = \operatorname{ReLU}\!\left(\sum_{k} \alpha_k^c A^k\right) \tag{10}$$

Here, $A$ is the feature map activations of a convolutional layer, and $k$ denotes the $k$-th channel of the target layers. The weights $\alpha_k^c$ are defined as the average derivatives of the class score $z^c$ with respect to each pixel in the activation $A^k$, with $Z$ the number of pixels.
Furthermore, we implement augmentation smoothing steps, including horizontal flipping and image scaling of the input. This process yields multiple CAMs, which are subsequently averaged to produce augmented CAMs. The final GradCAM heat map is obtained by averaging the augmented CAMs per class $c$.
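The augmentation-smoothing step can be sketched as follows; only horizontal flipping is shown (the paper also uses scaling), and `mean_cam` is a hypothetical stand-in for the real GradCAM computation:

```python
import numpy as np

def smoothed_cam(cam_fn, x):
    """Average CAMs over augmented views to reduce noise.
    cam_fn is any function mapping a (C, H, W) image to a (1, H, W) map."""
    cam = cam_fn(x)
    cam_flipped = cam_fn(x[:, :, ::-1])[:, :, ::-1]   # flip back before averaging
    return (cam + cam_flipped) / 2.0

# Hypothetical CAM function for illustration only (channel-mean activation).
mean_cam = lambda img: img.mean(axis=0, keepdims=True)
x = np.random.rand(3, 16, 16)
cam = smoothed_cam(mean_cam, x)
```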
3.3. Iterative Anomaly-Guided Refinement
To further refine initial CAMs, we propose an iterative anomaly-guided refinement learning approach. After obtaining the initial GradCAM heat maps when the model converges, we create a weighted ROI mask from these heat maps to enhance the input. This forces the model to generate smaller and more precise GradCAM in subsequent iterations. The weighted ROI mask, which assigns higher weights to the most relevant regions and down-weights less important areas, is derived as follows:
$$SM_i(u) = \sum_{c=1}^{C-1} \sigma(z_i^c)\, M_i^c(u) \tag{11}$$

$$\widehat{SM}_i(u) = \frac{SM_i(u) - \min_u SM_i(u)}{\max_u SM_i(u) - \min_u SM_i(u)} \tag{12}$$

where $i$ indexes the $i$-th image $x_i$, and $u$ represents a certain pixel. $M_i^c$ is the heat map of the $c$-th class generated by Eq. 10, and $\sigma(z_i^c)$ is the prediction by Eq. 7. The mask $SM_i$ is the sum of the GradCAMs weighted by the classification results over the foreground lesion classes. Consequently, we obtain a normalized weighted ROI mask $\widehat{SM}_i$.
To incorporate the anomaly-weighted ROI mask in iterative refinement learning, we redefine the original input $x_i$ using $\widehat{SM}_i$ to create an enhanced version, $\tilde{x}_i = x_i \odot \widehat{SM}_i$. It is important to note that we only refine the original image for the backbone branch. In other words, the anomaly-discriminative representation of the backbone is denoted as $X_b = [\tilde{x}_i; x_i^h]$, where $x_i^h$ is the generated healthy counterpart. To prevent over-eliminating image information, we do not enhance the input of the ASAM branch, as the anomaly representation already roughly filters out non-anomaly areas.
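One refinement iteration can be sketched in a few lines, assuming the per-class heat maps and predicted class probabilities are already available (names and the stabilizer constant are illustrative):

```python
import numpy as np

def refine_backbone_input(x, cams, probs):
    """Anomaly-guided refinement sketch (Eqs. 11-12): sum the class heat
    maps weighted by predicted probabilities, normalize to [0, 1], and
    re-weight the original image for the backbone branch."""
    sm = sum(p * cam for p, cam in zip(probs, cams))      # weighted ROI mask
    sm = (sm - sm.min()) / (sm.max() - sm.min() + 1e-8)   # normalized mask
    return x * sm                                         # enhanced input

x = np.random.rand(24, 24)
cams = [np.random.rand(24, 24) for _ in range(2)]         # one map per lesion class
x_ref = refine_backbone_input(x, cams, probs=[0.9, 0.4])
```

Because the mask lies in [0, 1], the enhanced input only attenuates pixels, pushing the next iteration's CAMs toward the previously salient regions.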
3.4. Pseudo Labels
Before generating the final pseudo labels following refinement learning, we apply a simple morphological transformation to extract the retina mask $RM$ to be applied to the CAMs. The value 0 represents the background while 1 is the retinal foreground. It is important to note that the transformation may not be precise due to the low resolution and noise of the OCT modality, as illustrated in Fig. 4. Thus, we convert it into a soft version $RM_s$ to gently suppress the background regions by replacing 0 with 0.5.
Fig. 4:

Illustration of retina mask on good (left pair) and bad (right pair) examples.
We then update the GradCAM heat map by multiplying it with the soft retina mask $RM_s$. Afterward, we normalize the updated GradCAM and apply a threshold to suppress the background. Finally, we obtain the final pseudo labels for segmentation by applying argmax along the lesion classes. Our training strategy is detailed in Algorithm 1.
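The final pseudo-label construction can be sketched as follows; the threshold value and the class-id convention (0 for background) are illustrative assumptions:

```python
import numpy as np

def final_pseudo_labels(cams, retina_mask, thresh=0.3):
    """Pseudo-label sketch: apply the soft retina mask, normalize,
    threshold the background, then argmax over lesion classes.
    cams: (C_lesion, H, W); retina_mask: binary (H, W)."""
    soft = np.where(retina_mask > 0, 1.0, 0.5)   # soft retina mask: 0 -> 0.5
    masked = cams * soft                          # gently suppress non-retina
    masked = masked / (masked.max() + 1e-8)       # normalize to [0, 1]
    labels = masked.argmax(axis=0) + 1            # lesion class ids start at 1
    labels[masked.max(axis=0) < thresh] = 0       # low activation -> background
    return labels

cams = np.random.rand(2, 16, 16)
retina = np.ones((16, 16))
labels = final_pseudo_labels(cams, retina)
```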
4. Dataset and Experimental Settings
4.1. Datasets
We evaluate our proposed method on one private and two publicly available OCT B-scan datasets, which include both healthy and diseased retinal images. Fig. 5 displays the distribution of images per dataset. It is important to emphasize that for all experiments, we only utilize image-level labels for training, while only the validation set has pixel-level labels for evaluation purposes. Besides, all OCT scans from the same patient are assigned to the same set (either training or validation) to ensure no information leakage between them.
Fig. 5:

Data composition for each dataset.

Our private dataset contains four different types of lesions with known prognostic values: intraretinal fluid (IRF), subretinal fluid (SRF), ellipsoid zone disruption (EZ), and hyperreflective dots (HRD). Due to the lack of standard OCT lesion definitions for these biomarkers, especially for EZ and HRD, experts have to reach a consensus on the rules for both image-level and pixel-level labeling. More details on the definition and the measurement amongst expert graders can be found in the supplementary material. The original images are collected from two different sources (Kermany et al., 2018; Gholami et al., 2020), which were initially proposed for disease classification with disease labels only. For this reason, two ophthalmologists relabeled them with lesion types under consideration. We use 8006 images with image-level labels for training and 342 images with pixel-level labels for validation. Additionally, we collect healthy images from (Kermany et al., 2018) to train our GANomaly model, as it contains a large number of healthy OCT images. In the end, the dataset includes normal subjects, diabetic macular edema (DME), and diabetic retinopathy (DR) pathologies.
The first public dataset is the Retinal Edema Segmentation Challenge Dataset (RESC) (Hu et al., 2019; Zhou et al., 2020) provided by a competition platform called “AI Challenger”, which contains subretinal fluid (SRF) and pigment epithelial detachments (PED) lesion images. Note that the RESC provides annotations for retina edema area (REA) as well but we recognize it as background in our experiment. This dataset includes a standard training and validation split and provides pixel-level annotations for both. There are 8960 and 1920 images with normal subjects and DME pathology for training and validation, respectively.
The second public dataset is the Duke SD-OCT (Chiu et al., 2015) dataset of which all the images are macular-centered with severe DME pathology. This dataset is proposed for layer and fluid segmentation with 110 pixel-level annotated images (11 per patient), but only 78 of them contain fluid. We use these 78 images as the validation set to evaluate the performance of pseudo labels. To create the training set, we collect all images with fluid (IRF & SRF) from RESC and our private datasets. Since the original Duke dataset does not include healthy images, we randomly select healthy retinal images from (Kermany et al., 2018) to create a new Duke dataset for our study. In the end, we obtain 15,235 images for training and 178 images for validation.
4.2. Baselines and Evaluation Metrics
To evaluate the performance of our proposed method, we compare the quality of pseudo labels with existing WSSS methods that utilize image-level supervision. These methods include IRNet (Ahn et al., 2019), ReCAM (Chen et al., 2022b), SEAM (Wang et al., 2020c), WSMIS (Viniavskyi et al., 2020), MSCAM (Ma et al., 2020), TransWS (Zhang et al., 2022), and DFP (Wang et al., 2020b). In our experiments, we consider both the base versions of SEAM and ReCAM and their corresponding extensions (see Tab. 1). In addition to assessing the final pseudo labels' performance, we report segmentation results from a network trained on these pseudo labels in a fully-supervised setting (see Tab. 2), and also present classification outcomes for a comprehensive evaluation of our method's effectiveness (see Tab. 3).
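For reference, the two metrics reported in the tables can be computed per class on binary masks as follows (a minimal numpy sketch):

```python
import numpy as np

def dice_and_iou(pred, gt):
    """Per-class Dice Similarity Coefficient and IoU on binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
    iou = inter / (union + 1e-8)
    return dice, iou

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
dsc, miou = dice_and_iou(pred, gt)   # one overlapping pixel, two predicted, one true
```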
Table 1:
Pseudo label performance on different datasets per class. The table shows the Dice Similarity Coefficient (DSC) and mean Intersection over Union (mIoU) scores for each model, including our proposed AGM. The rows are organized by the dataset used for evaluation and its corresponding lesion classes, while the columns present the performance metrics for each model. For each dataset, the overall performance is highlighted in blue, while the ‘Overall (w/o bg)’ performance, which represents lesions-only, is marked in light blue. The best results are highlighted in bold.
| Dataset | Lesions | Metrics | IRNet | SEAM | SEAM+ | ReCAM | ReCAM+ | WSMIS | MSCAM | TransWS | DFP | AGM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RESC | BG | DSC | 98.88% | 98.69% | 98.68% | 98.81% | 98.96% | 96.90% | 98.59% | 99.07% | 98.83% | 99.15% |
| RESC | BG | mIoU | 97.78% | 97.43% | 97.65% | 97.66% | 97.96% | 95.64% | 97.25% | 98.18% | 97.72% | 98.34% |
| RESC | SRF | DSC | 49.18% | 46.44% | 54.19% | 31.19% | 12.90% | 45.91% | 18.52% | 52.44% | 20.39% | 57.84% |
| RESC | SRF | mIoU | 33.75% | 34.13% | 47.29% | 14.23% | 12.71% | 24.64% | 10.14% | 34.88% | 6.40% | 43.94% |
| RESC | PED | DSC | 22.98% | 28.09% | 26.37% | 31.99% | 33.32% | 10.34% | 17.03% | 30.28% | 31.39% | 34.03% |
| RESC | PED | mIoU | 14.66% | 10.71% | 3.49% | 19.11% | 36.99% | 2.96% | 11.97% | 17.22% | 15.64% | 22.33% |
| RESC | Overall (w/o bg) | DSC | 36.08% | 37.27% | 40.28% | 31.59% | 23.11% | 28.12% | 17.78% | 41.36% | 25.89% | 45.94% |
| RESC | Overall (w/o bg) | mIoU | 24.21% | 22.42% | 25.39% | 16.67% | 24.85% | 13.80% | 11.06% | 26.05% | 11.02% | 33.14% |
| RESC | Overall | DSC | 57.01% | 57.74% | 59.74% | 54.00% | 48.39% | 51.05% | 44.71% | 60.60% | 50.20% | 63.68% |
| RESC | Overall | mIoU | 48.73% | 47.42% | 49.48% | 43.67% | 49.22% | 41.08% | 39.79% | 50.09% | 39.92% | 54.87% |
| Duke | BG | DSC | 99.02% | 98.48% | 98.33% | 98.16% | 98.40% | 98.16% | 98.98% | 99.06% | 99.10% | 99.13% |
| Duke | BG | mIoU | 98.10% | 97.03% | 96.75% | 96.41% | 96.87% | 96.41% | 98.00% | 98.15% | 98.24% | 98.29% |
| Duke | Fluid | DSC | 17.79% | 25.48% | 22.59% | 18.91% | 22.38% | 0.42% | 29.93% | 37.58% | 27.53% | 40.17% |
| Duke | Fluid | mIoU | 20.45% | 17.87% | 18.64% | 11.67% | 13.86% | 0.42% | 17.98% | 27.01% | 15.14% | 30.06% |
| Duke | Overall (w/o bg) | DSC | 17.79% | 25.48% | 22.59% | 18.91% | 22.38% | 0.42% | 29.93% | 37.58% | 27.53% | 40.17% |
| Duke | Overall (w/o bg) | mIoU | 20.45% | 17.87% | 18.64% | 11.67% | 13.86% | 0.42% | 17.98% | 27.01% | 15.14% | 30.06% |
| Duke | Overall | DSC | 58.41% | 61.98% | 60.46% | 58.54% | 60.39% | 49.29% | 64.46% | 68.32% | 63.32% | 69.65% |
| Duke | Overall | mIoU | 59.27% | 57.45% | 57.69% | 54.04% | 55.37% | 48.41% | 57.99% | 62.58% | 56.69% | 64.17% |
| Our Dataset | BG | DSC | 99.09% | 98.88% | 98.93% | 98.48% | 98.66% | 99.28% | 98.66% | 98.39% | 99.15% | 98.55% |
| Our Dataset | BG | mIoU | 98.18% | 97.87% | 97.95% | 97.13% | 97.47% | 98.57% | 97.36% | 97.52% | 98.31% | 97.09% |
| Our Dataset | SRF | DSC | 0.02% | 24.21% | 23.47% | 0.06% | 0.45% | 33.67% | 37.98% | 9.93% | 30.57% | 48.98% |
| Our Dataset | SRF | mIoU | 0.01% | 13.52% | 13.30% | 0.10% | 1.01% | 26.71% | 22.50% | 9.41% | 18.05% | 31.76% |
| Our Dataset | IRF | DSC | 10.97% | 20.71% | 20.32% | 13.86% | 9.80% | 13.23% | 1.76% | 21.00% | 18.28% | 14.82% |
| Our Dataset | IRF | mIoU | 11.29% | 14.45% | 14.68% | 9.42% | 9.21% | 8.93% | 1.47% | 11.21% | 10.06% | 11.17% |
| Our Dataset | EZ | DSC | 0.02% | 2.37% | 2.28% | 2.12% | 1.53% | 0.03% | 6.50% | 0.36% | 1.34% | 2.48% |
| Our Dataset | EZ | mIoU | 0.07% | 1.61% | 1.71% | 1.02% | 0.68% | 0.08% | 3.25% | 0.11% | 0.67% | 1.74% |
| Our Dataset | HRD | DSC | 0.31% | 4.57% | 4.50% | 3.42% | 0.78% | 0.17% | 0.34% | 3.83% | 10.81% | 1.63% |
| Our Dataset | HRD | mIoU | 1.17% | 2.25% | 2.29% | 1.54% | 0.72% | 0.17% | 0.14% | 1.96% | 5.72% | 0.57% |
| Our Dataset | Overall (w/o bg) | DSC | 2.83% | 12.97% | 12.64% | 4.87% | 3.15% | 11.78% | 11.65% | 8.78% | 15.25% | 16.98% |
| Our Dataset | Overall (w/o bg) | mIoU | 3.13% | 7.96% | 8.00% | 3.02% | 2.91% | 8.97% | 6.84% | 5.67% | 8.62% | 11.30% |
| Our Dataset | Overall | DSC | 22.08% | 30.15% | 29.90% | 23.59% | 22.25% | 29.28% | 29.05% | 26.70% | 32.03% | 33.29% |
| Our Dataset | Overall | mIoU | 22.14% | 25.94% | 25.99% | 21.84% | 21.82% | 26.89% | 24.94% | 24.04% | 26.56% | 28.46% |
Table 2:
Comparison of segmentation results (mIoU) between SEAM+ and our proposed AGM on the RESC and Duke datasets, using DeepLabV3+ with ResNet-50 and ResNet-101 backbones. The upper bound is provided for reference, representing a fully supervised segmentation method employing pixel-level ground truth with DeepLabV3+ (ResNet-101).
| Method | Backbone | RESC | Duke |
|---|---|---|---|
| SEAM+ | DeepLabV3+ (ResNet-50) | 49.24% | 58.28% |
| SEAM+ | DeepLabV3+ (ResNet-101) | 49.62% | 59.50% |
| AGM | DeepLabV3+ (ResNet-50) | 52.11% | 66.26% |
| AGM | DeepLabV3+ (ResNet-101) | 53.87% | 66.42% |
| Upper Bound | DeepLabV3+ (ResNet-101) | 71.65% | − |
Table 3:
Classification performance on different datasets. The best results are highlighted in bold.
| Dataset | Metrics | IRNet | SEAM | ReCAM | TransWS | AGM |
|---|---|---|---|---|---|---|
| RESC | Acc | 93.12% | 98.62% | 74.06% | 98.80% | 97.55% |
| RESC | F1 | 43.90% | 81.47% | 29.71% | 82.19% | 73.22% |
| RESC | AUC | 86.00% | 99.36% | 77.38% | 99.52% | 99.68% |
| Duke | Acc | 99.37% | 97.75% | 98.12% | 98.88% | 99.44% |
| Duke | F1 | 99.33% | 97.70% | 98.00% | 98.86% | 99.43% |
| Duke | AUC | 99.98% | 99.87% | 99.99% | 99.99% | 99.99% |
| Our Dataset | Acc | 93.59% | 93.42% | 94.37% | 92.98% | 94.96% |
| Our Dataset | F1 | 71.40% | 69.95% | 75.62% | 68.13% | 78.31% |
| Our Dataset | AUC | 98.12% | 98.62% | 98.61% | 98.04% | 98.97% |
We employ commonly used evaluation metrics for assessing our method’s performance. For a comprehensive assessment of pixel-level prediction, we use the macro-averaging Dice Similarity Coefficient (DSC) and the micro-averaging mean Intersection over Union (mIoU). For classification assessment, we report Accuracy (Acc), F1 score, and Area Under the ROC Curve (AUC).
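As a concrete reference, the per-class DSC and IoU reported in the tables can be computed from integer label maps as follows. This is a minimal sketch: the function name and the convention of scoring an absent class as perfect are our illustrative choices, not the paper's exact evaluation code.

```python
import numpy as np

def dice_and_iou(pred, gt, num_classes):
    """Per-class Dice Similarity Coefficient and IoU from integer label maps."""
    dsc, iou = [], []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        denom = p.sum() + g.sum()
        dsc.append(2.0 * inter / denom if denom else 1.0)  # absent class scores 1.0
        iou.append(inter / union if union else 1.0)
    return dsc, iou

# Toy 3x3 label maps with classes {0: BG, 1: lesion A, 2: lesion B}.
pred = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 0]])
gt   = np.array([[0, 0, 1], [0, 1, 0], [2, 2, 2]])
dsc, iou = dice_and_iou(pred, gt, num_classes=3)
```

Averaging `dsc` and `iou` over the foreground classes only yields the ‘Overall (w/o bg)’ rows; including class 0 yields the ‘Overall’ rows.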
4.3. Implementation Details
We resize the input images to 512 × 512 for all baselines and our model. The encoder extracts a feature map of dimension 2048 × 16 × 16 through convolutional layers. For the ASAM branch (Eq. 4), we set the number of attention heads to 4 (see Tab. 6). In the iterative anomaly-guided refinement step, we employ GradCAM without augmentation smoothing to complete the image refinement. For both our method and the baselines, we determine the threshold for the final pseudo labels by traversing values from 0 to 1 and selecting the one that achieves the best mIoU across classes on the validation set. The initial learning rate is 0.001, reduced to 0.0001 for the refinement learning step. We train the network with a batch size of 8 for 40 epochs, followed by refinement learning with a batch size of 4 for another 10 epochs. The model is trained using binary cross-entropy loss with the SGD optimizer and a momentum of 0.9. For data augmentation, we apply random horizontal flipping, random rotation, and random color jittering. Detailed parameter settings for GANomaly training and the baselines can be found in the supplementary materials.
For semantic segmentation, we employ ResNet-50/101 as the backbones for DeepLabV3+ (Chen et al., 2018), both pre-trained on ImageNet (Deng et al., 2009). We implement our method with PyTorch on 2 NVIDIA GeForce GTX 1080 Ti GPUs. The code and trained models are available at: https://github.com/yangjiaqidig/WSSS-AGM.
5. Results
5.1. Quantitative Results
5.1.1. Pseudo Label
The experimental results on pseudo labels are reported in Tab. 1, where we use DSC and mIoU for each lesion per dataset to assess the similarity between the generated pseudo labels and the pixel-level ground truth. In each dataset, the first row, BG, denotes the background, while the last two rows, Overall (w/o bg) and Overall, report the mean scores computed over the foreground classes only and over all classes including the background, respectively, on the validation sets. The SEAM+ baseline trains an AffinityNet (Ahn and Kwak, 2018) on the CAMs produced by SEAM to improve the synthesized pseudo labels; similarly, ReCAM+ refines the CAMs generated by ReCAM with IRNet. Please note that we traverse the background threshold in steps of 0.01 and report the best mIoU of pseudo labels for each method; the optimal threshold differs across methods (e.g., SEAM and ReCAM+ peak at different values on the RESC dataset). The impact of the background threshold will be discussed in Sec. 5.2.
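The per-method threshold traversal described above can be sketched as follows. This is a simplified single-lesion version; `best_background_threshold` and `miou_fg_bg` are our illustrative names, and the real evaluation sweeps all classes on the validation set.

```python
import numpy as np

def miou_fg_bg(pred, gt):
    """Mean IoU over background (0) and foreground (1) for binary label maps."""
    ious = []
    for cls in (0, 1):
        p, g = (pred == cls), (gt == cls)
        union = np.logical_or(p, g).sum()
        ious.append(np.logical_and(p, g).sum() / union if union else 1.0)
    return float(np.mean(ious))

def best_background_threshold(cam, gt, step=0.01):
    """Traverse threshold values and keep the one with the best mIoU."""
    best_t, best_miou = 0.0, -1.0
    for t in np.arange(0.0, 1.0 + 1e-9, step):
        pseudo = (cam >= t).astype(int)  # pixels at or above t become foreground
        m = miou_fg_bg(pseudo, gt)
        if m > best_miou:
            best_t, best_miou = float(t), m
    return best_t, best_miou

# Toy 2x2 CAM: the optimal threshold separates activations 0.8/0.9 from 0.1/0.2.
cam = np.array([[0.1, 0.8], [0.9, 0.2]])
gt  = np.array([[0, 1], [1, 0]])
t, m = best_background_threshold(cam, gt)
```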
Our proposed AGM consistently outperforms other WSSS methods on various datasets (RESC, Duke, and our dataset) in terms of the overall DSC and mIoU scores, demonstrating the effectiveness of our approach in generating high-quality pseudo labels. The proposed model yields 1% to 3% and 2% to 5% improvement in terms of DSC and mIoU, respectively, compared to the second-best performing method. As evident in the table, achieving optimal overall scores does not necessarily guarantee the best performance for individual foreground classes. This observation is consistent with the notion expressed in previous works (Lee et al., 2021; Chen et al., 2022b; Wang et al., 2020c; Patel and Dolz, 2022), which highlights the inherent trade-off between false positives (FP) and false negatives (FN).
For the RESC dataset, our proposed method ranks second in mIoU for the SRF and PED classes. However, the best-performing baselines for SRF (SEAM+) and PED (ReCAM+) exhibit extremely low scores on the other lesion class (PED at 3.49% and SRF at 12.71%, respectively). The Duke dataset contains only one lesion class (Fluid), which is relatively large compared to the other lesions, and our method achieves an mIoU roughly 3% higher than the second-best result, obtained by TransWS (27.01%). Our private dataset is the most challenging, as it contains four different types of lesions. EZ disruption and HRD exhibit structural characteristics that are extremely distinct from edema-related lesions such as IRF and SRF, and they sometimes consist of only a few pixels. Our method lags behind SEAM’s performance on IRF, yet it surpasses both SEAM variants on the SRF class, with an improvement of approximately 25% in DSC and 18% in mIoU. Meanwhile, DFP posts competitive scores in both overall evaluations, with and without the background; it works particularly well on the HRD class, while its performance on the other lesions lags behind AGM.
5.1.2. Semantic Segmentation
The pseudo labels obtained by our AGM can be utilized to train a semantic segmentation model in a fully-supervised manner. Tab. 2 displays the mIoU results of semantic segmentation models trained on RESC and Duke datasets using pseudo labels from two WSSS methods, SEAM+ and our proposed AGM. These pseudo labels are used as ground truth for training DeepLabV3+ with ResNet-50 and ResNet-101 backbones. The results indicate that models trained with pseudo labels generated by AGM outperform those trained with SEAM+. Specifically, DeepLabV3+ (ResNet-50) achieves a mIoU of 52.11% and 66.26% on RESC and Duke datasets, respectively. This performance is further enhanced with DeepLabV3+ (ResNet-101), reaching a mIoU of 53.87% on RESC and 66.42% on Duke. For comparison, the table also includes an upper bound for segmentation results, which represents a fully supervised segmentation method using pixel-level ground truth with DeepLabV3+ (ResNet-101), achieving a mIoU of 71.65% on the RESC dataset. Since the Duke dataset does not have pixel-level annotations for the training set, we mark it as ‘−’ in the Upper Bound.
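The training recipe above, where pseudo labels stand in for pixel-level ground truth, can be sketched with a toy convolutional network in place of DeepLabV3+. The architecture here is a deliberately tiny stand-in, not the actual model; the point is only that pseudo labels are consumed exactly like real annotations by a standard cross-entropy segmentation loss.

```python
import torch
import torch.nn as nn

# Tiny stand-in for DeepLabV3+ (illustrative only): 3 output classes (BG, SRF, PED).
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 3, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(2, 1, 32, 32)          # toy-sized OCT B-scans
pseudo = torch.randint(0, 3, (2, 32, 32))   # AGM pseudo labels used as targets

for _ in range(5):                          # a few SGD steps
    opt.zero_grad()
    loss = criterion(model(images), pseudo)
    loss.backward()
    opt.step()
```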
However, we note that on the RESC dataset the mIoU (53.87%) of the semantic segmentation model trained with pseudo labels does not exceed the mIoU (54.87%) of the pseudo labels themselves, as shown in Tab. 1, while on the Duke dataset it does. This discrepancy could be attributed to the low mIoU (22.33%) of the PED pseudo labels, which fail to provide sufficient information for the segmentation model to accurately capture the lesion details. This is particularly evident for the RESC dataset, which exhibits an imbalanced lesion distribution as shown in Fig. 5 (a); the suboptimal PED pseudo labels may widen the gap. To gain a better understanding, we visualize the pseudo labels and semantic segmentation results in Fig. 6. The first two rows are examples from RESC, while the last row is from Duke; green, red, and yellow represent PED, SRF, and Fluid, respectively. Although the semantic segmentation results (SEG), trained using pseudo labels generated by our AGM, demonstrate smoother and more consistent lesion detection, the model struggles with PED prediction in the first example. It is worth emphasizing that our aim is not to surpass fully supervised learning, but rather to narrow the gap between weakly and fully supervised segmentation models.
Fig. 6:

Visual examples of pseudo labels and segmentation results. (a) Input images. (b) GT represents the ground truth per example. Columns (c) and (d) depict pseudo labels generated by SEAM and AGM, respectively. (e) SEG refers to the semantic segmentation results trained with pseudo labels generated by our AGM.
5.1.3. Classification
Even though the primary objective of our study is to generate improved pixel-level pseudo labels under the supervision of multi-label image classification, it is also crucial for clinical analysis that the classifier be accurate and reliable. Tab. 3 presents the classification performance of different models on the various datasets, reporting the average Acc, F1, and AUC across lesions. Our proposed method demonstrates superior or competitive performance across datasets, indicating its effectiveness for lesion classification with the support of the anomaly-guided mechanism. AGM achieves the top AUC scores on all datasets and the highest Acc and F1 on both Duke and our private dataset. However, it lags slightly behind TransWS in Acc (97.55% vs. 98.80%) and F1 (73.22% vs. 82.19%) on the RESC dataset. Given the inherent data imbalance, as shown in Fig. 5, a thorough assessment requires considering these metrics together. Besides, the classification result of IRNet can be regarded as the performance of ResNet-50, since its CAM improvement step does not affect the classifier.
5.2. Ablation Studies
In this ablation study section, we present a comprehensive analysis of our proposed AGM. Please note that we use the same threshold value per dataset, which is selected in increments of 0.1 within the range of 0 to 1, to simplify and ensure a fair comparison.
Effect of Different Modules:
In Tab. 4, we present an evaluation of each sub-module in our proposed AGM by reporting the overall mIoU scores across lesions on two public datasets. The term BASE refers to the backbone with only the original OCT image input, while ADR denotes the input of anomaly-discriminative representations. ASAM represents the anomaly self-attention module (Sec. 3.2), and Ref indicates the iterative refinement learning (Sec. 3.3). Additionally, GAP and GMP denote global average pooling and global max pooling, respectively. The results demonstrate that incorporating each sub-module step by step results in consistent improvements in the mIoU scores for both RESC and Duke datasets. Module II validates our hypothesis that pairing the original image with its synthetic counterpart enables the model to robustly learn the discrepancy features between them. Furthermore, recognizing lesions with anomaly attention emphasis, ASAM (III), boosts the performance by 12.68% on RESC. Replacing GAP with GMP leads to an increase in performance, ranging from 2% to 4% (IV). Overall, the complete AGM (V) achieves gains of 17.67% and 9.11% over the BASE model on the two datasets, respectively.
Table 4:
Ablation study for proposed AGM on public datasets (mIoU). Roman numerals in the first column correspond to different combinations of these components, as detailed in the rows.
| ID | BASE | ADR | ASAM | GAP | GMP | Ref | RESC | Duke |
|---|---|---|---|---|---|---|---|---|
| I | ✓ | | | ✓ | | | 35.78% | 55.06% |
| II | ✓ | ✓ | | ✓ | | | 37.12% | 55.38% |
| III | ✓ | ✓ | ✓ | ✓ | | | 48.46% | 58.70% |
| IV | ✓ | ✓ | ✓ | | ✓ | | 50.54% | 62.10% |
| V | ✓ | ✓ | ✓ | | ✓ | ✓ | 53.45% | 64.17% |
We also provide a visual representation of the advantages offered by each sub-module in Fig. 7. The number in each column corresponds to the row number in Tab. 4 for various submodule combinations. The top row presents the original image and the CAMs associated with each sub-module. Similarly, the bottom row shows the ground truth and pseudo labels obtained after applying the threshold on CAMs. The combination of backbone and ADR with GAP (II) leads the network to generate a more focused PED region. The localization performance is further enhanced upon incorporating ASAM with GMP (IV). The finest localization is achieved with the addition of refinement learning (V).
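The GAP-versus-GMP choice ablated above amounts to swapping the pooling operator in the multi-label classification head. A minimal sketch follows; `MultiLabelHead` and its structure are our illustration, not the paper's exact implementation. The intuition is that global max pooling preserves the peak response of a small lesion, whereas global average pooling dilutes it across the whole feature map.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Multi-label classification head over a CNN feature map, with GAP or GMP."""
    def __init__(self, in_ch, num_classes, pooling="gmp"):
        super().__init__()
        self.pool = (nn.AdaptiveMaxPool2d(1) if pooling == "gmp"
                     else nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, feat):                 # feat: (B, C, H, W)
        pooled = self.pool(feat).flatten(1)  # (B, C)
        return self.fc(pooled)               # per-lesion logits

feat = torch.randn(2, 2048, 16, 16)          # matches the 2048 x 16 x 16 map
logits_gmp = MultiLabelHead(2048, 4, pooling="gmp")(feat)
logits_gap = MultiLabelHead(2048, 4, pooling="gap")(feat)
```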
Fig. 7:

Efficacy of sub-modules in AGM for PED lesion visualization. The first column displays the original image alongside its corresponding ground truth label, followed by CAMs and pseudo labels in the remaining columns.
Combination of CAMs from Different Layers:
Tab. 5 shows the performance of pseudo labels when CAMs are computed from different layers. Layer1 to layer4 are the four stages of ResNet-50, while the pre-attention layer (Attn) is the last layer before the self-attention module. The localization from the later convolutional layers, which contain more high-level semantic information, outperforms that of the earlier layers. Consequently, combining all layers (bottom row) dilutes the more discriminative features from the deeper layers, leading to sub-optimal CAM localization. We calculate the final heatmap by averaging GradCAM across the specified layers. The best results are obtained when combining the pre-attention layer with the last two layers of ResNet-50 for RESC, and with only the last layer for Duke. Ultimately, we choose the combination of the pre-attention layer with only the last layer of ResNet-50 for all experiments in our work, as it achieves a high score while slightly reducing the computational cost.
Table 5:
Ablation study on different convolutional layers (mIoU).
| layer1 | layer2 | layer3 | layer4 | Attn | RESC | Duke |
|---|---|---|---|---|---|---|
| ✓ | | | | | 36.40% | 50.76% |
| | ✓ | | | | 32.85% | 49.40% |
| | | ✓ | | | 32.64% | 49.49% |
| | | | ✓ | | 41.58% | 51.43% |
| | | | | ✓ | 51.92% | 59.33% |
| ✓ | | | | ✓ | 39.15% | 51.94% |
| | ✓ | | | ✓ | 34.76% | 51.84% |
| | | ✓ | | ✓ | 47.76% | 55.35% |
| | | | ✓ | ✓ | 53.45% | 64.17% |
| | | ✓ | ✓ | ✓ | 54.73% | 61.82% |
| | ✓ | ✓ | ✓ | ✓ | 50.41% | 59.02% |
| ✓ | ✓ | ✓ | ✓ | ✓ | 48.44% | 57.36% |
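The layer-combination step, averaging GradCAMs from the selected layers into one heatmap, can be sketched as follows. Per-layer min-max normalization before averaging is our assumption; the source only states that GradCAM is averaged across the specified layers.

```python
import numpy as np

def combine_cams(cams):
    """Min-max normalize each layer's CAM to [0, 1], then average them."""
    combined = np.zeros_like(cams[0], dtype=float)
    for cam in cams:
        lo, hi = cam.min(), cam.max()
        combined += (cam - lo) / (hi - lo + 1e-8)
    return combined / len(cams)

# Toy CAMs for the chosen combination: pre-attention layer + ResNet-50 layer4.
cam_attn   = np.array([[0.0, 2.0], [4.0, 0.0]])
cam_layer4 = np.array([[0.0, 1.0], [1.0, 0.0]])
heatmap = combine_cams([cam_attn, cam_layer4])
```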
Number of Heads:
We additionally assess the performance with varying numbers of heads in our Anomaly Self-Attention Module (ASAM) as shown in Tab. 6. This table demonstrates the changes in DSC and mIoU scores on the RESC dataset. The performance peaks with 4 heads, which is the configuration adopted in our final model.
Table 6:
Ablation study of the number of heads in ASAM on RESC. Columns 2–4 show the background and lesion mIoU scores, while the last two columns display the overall DSC and mIoU. We employ 4 heads in our work (best results in bold).
| Controller | BG | SRF | PED | DSC | mIoU |
|---|---|---|---|---|---|
| 1-Head | 96.48% | 22.35% | 9.78% | 48.94% | 42.87% |
| 2-Heads | 97.44% | 38.83% | 11.80% | 55.67% | 49.36% |
| 4-Heads | 98.04% | 46.72% | 15.58% | 60.34% | 53.45% |
| 8-Heads | 96.70% | 34.26% | 10.85% | 54.11% | 47.27% |
| 16-Heads | 96.93% | 27.43% | 11.51% | 54.04% | 45.29% |
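The multi-head configuration can be illustrated with PyTorch's built-in attention layer. The embedding dimension, sequence layout, and the assignment of anomaly features as queries over the global feature map are our assumptions for the sketch, not the paper's exact ASAM design.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 4  # 4 heads, the configuration adopted in the paper
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Flattened 16x16 feature maps as token sequences of length 256.
feats   = torch.randn(2, 16 * 16, embed_dim)  # global contextual features
anomaly = torch.randn(2, 16 * 16, embed_dim)  # anomaly-discriminative tokens

# Anomaly tokens attend over the global feature map.
out, weights = attn(query=anomaly, key=feats, value=feats)
```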
Masking Threshold:
Fig. 8 illustrates the impact of the background threshold on the RESC dataset, which is used to binarize the CAMs. The responses of IRNet (orange), SEAM (red), and MSCAM (cyan) for the two lesions are more polarized, whereas the remaining baseline models and our proposed AGM (green) exhibit more uniform distributions among lesions, ensuring that the optimal global threshold incurs fewer trade-offs. As the threshold shifts from 0 to 1, there is a trade-off between FP and FN: fewer FPs (over-activation) but more FNs (under-activation) for individual lesions. Therefore, less bias between lesions results in a more consistent overall improvement across threshold values. Our proposed method achieves the best overall mIoU, indicating that our anomaly-guided strategy contributes to consistency among lesions during training.
Fig. 8:

Ablation study for background threshold on RESC.
Different CAMs:
Tab. 7 shows the effectiveness of different CAM models and their corresponding running time per image. GradCAM++ (Chattopadhay et al., 2018) uses second-order gradients. GradCAM with eigen smoothing (Muhammad and Yeasin, 2020), which utilizes the principal components of the feature maps from the convolutional layers only, achieves the lowest mIoU with the longest running time. GradCAM with augmentation smoothing, adopted in our work, achieves the best performance, but with a longer running time than GradCAM without smoothing.
Table 7:
Ablation study on RESC for different CAM models.
| Models | mIoU | Time (sec) |
|---|---|---|
| GradCAM | 51.73% | 0.199 |
| GradCAM++ | 46.85% | 0.222 |
| GradCAM & eigen | 38.83% | 1.929 |
| GradCAM & aug | 53.45% | 1.279 |
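Augmentation smoothing averages CAMs computed on augmented copies of the input, each mapped back to the original geometry before averaging. A minimal sketch using only a horizontal flip follows (implementations such as pytorch-grad-cam additionally vary image intensity; `cam_with_aug_smooth` and the toy CAM function are our illustrations):

```python
import numpy as np

def cam_with_aug_smooth(image, cam_fn):
    """Average CAMs over augmented copies (here: a horizontal flip),
    mapping each CAM back to the original orientation before averaging."""
    cam = cam_fn(image)
    cam_flip = cam_fn(image[:, ::-1])[:, ::-1]  # flip, compute CAM, un-flip
    return (cam + cam_flip) / 2.0

# Toy CAM: just the intensity normalized by its max, for demonstration only.
toy_cam = lambda img: img / (img.max() + 1e-8)

image = np.array([[1.0, 3.0], [2.0, 4.0]])
smooth = cam_with_aug_smooth(image, toy_cam)
```

The averaging suppresses activations that appear under only one augmentation, at the cost of one extra forward/backward pass per augmented copy, which is consistent with the longer running time in Tab. 7.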
Area Ratio Analysis:
To further understand the source of improvement of our proposed AGM, we perform an over- and under-segmentation analysis; the results can be found in Tab. 8. Following the notation in (Patel and Dolz, 2022), we calculate the under-segmentation ratio as false negatives divided by true positives, R_u = FN/TP, and the over-segmentation ratio as false positives divided by true positives, R_o = FP/TP. Tab. 8 reports the area ratios on both the RESC and Duke datasets, compared against the SEAM+ model. We can observe that our proposed AGM effectively reduces over-activated CAMs compared to SEAM+, especially on the smaller lesion PED (R_o of 2.19 vs. 11.80). In terms of under-segmentation, our method slightly underperforms SEAM+ for both large and small lesions in the RESC dataset, showing our model’s inclination to reduce over-segmentation at the expense of under-segmentation. However, for both methods, R_o still exceeds R_u, indicating that over-segmentation remains more challenging than under-segmentation in this study. Nonetheless, our method can locate the lesions more precisely, as expected, with anomaly guidance and the effective refinement step.
Table 8:
Area ratios on the RESC and Duke datasets. R_u denotes the under-segmentation ratio and R_o the over-segmentation ratio.
| Lesions | Ratio ↓ | SEAM+ | AGM |
|---|---|---|---|
| SRF (RESC) | R_u | 0.64 | 0.69 |
| SRF (RESC) | R_o | 1.81 | 1.45 |
| PED (RESC) | R_u | 0.91 | 1.26 |
| PED (RESC) | R_o | 11.80 | 2.19 |
| Fluid (Duke) | R_u | 0.74 | 0.61 |
| Fluid (Duke) | R_o | 9.95 | 9.92 |
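The area ratios can be computed directly from binary masks, following the R_u = FN/TP and R_o = FP/TP definitions (the function name is ours):

```python
import numpy as np

def area_ratios(pred, gt):
    """Under-segmentation ratio R_u = FN/TP and over-segmentation R_o = FP/TP."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    return fn / tp, fp / tp

pred = np.array([[1, 1, 0], [1, 0, 0]])
gt   = np.array([[1, 0, 0], [1, 1, 0]])
r_u, r_o = area_ratios(pred, gt)  # TP = 2, FN = 1, FP = 1
```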
5.3. Qualitative Results
Fig. 9 provides a visual comparison of the results obtained from our proposed AGM and other baseline techniques. Note that the displayed pseudo labels are generated using the optimal threshold for each method across the entire validation set. We observe that ReCAM fails to generate acceptable pseudo labels when multiple lesions co-exist in an image, while IRNet exhibits instability with small lesions (rows 3, 5, 6) due to its sensitivity to the global threshold setting on CAMs. The SEAM model produces better visual results, although its region coverage is slightly over-segmented. TransWS shows good results on RESC and Duke samples (rows 1–5) but struggles with accurate lesion localization on samples from our dataset (rows 6–7). These observations align with the quantitative metrics presented in Tab. 1. Overall, the pseudo labels generated by our AGM align more closely with the ground truth, indicating the superiority of our approach.
Fig. 9:

Qualitative results of predicted pseudo labels by baselines and our proposed AGM on three validation sets. Each row represents a different image, and the second column shows the corresponding ground truth. The pseudo labels are overlaid on the original images for better visualization. The red, blue, yellow, orange, and green colors correspond to SRF, PED, Fluid, EZ disruption, and IRF, respectively.
Fig. 10 illustrates the comparison between the results before and after the iterative refinement learning steps. The first column, labeled ‘Inputs’, shows how the input to the backbone branch changes from the original representation to its anomaly-guided refined version, as detailed in Sec. 3.3. In this example from the RESC dataset, the localization of SRF becomes more precise after refinement learning (see the ‘Pseudo Labels’ column); specifically, the mIoU score for the SRF lesion in this example improves from 32% to 44%.
Fig. 10:

Visualization of the refinement effectiveness for lesion SRF. The first row shows the results before refinement, while the second row shows the effect after refinement. The first column presents differences in inputs for the backbone branch before and after applying anomaly-guided refinement.
Tab. 8 quantitatively shows that even though our AGM dramatically reduces the FP regions, over-segmentation still outweighs under-segmentation. Fig. 11 qualitatively presents some failure cases of our AGM. We observe that our model sometimes segments unrelated regions, as demonstrated in the first row for PED, resulting in over-segmentation. This issue is more pronounced when the anomaly is exceptionally small, leading the model to identify regions far from the actual lesions. The problem could be traced back to inherent OCT image noise and the model’s sensitivity to minor variations in the image data. On the other hand, for images where distinguishing lesion boundaries is difficult due to extensive edema tissue, our method occasionally under-segments these large edema regions. These failure cases underline potential areas for improvement in our AGM approach, particularly in enhancing the robustness of lesion delineation.
Fig. 11:

Examples of failure cases generated by our AGM. The blue, yellow, and green regions represent SRF, PED, and IRF, respectively.
6. Discussion and Conclusion
In this paper, we have presented a novel anomaly-guided mechanism designed to accurately segment lesions in medical images under weakly supervised learning, utilizing only image-level class labels. Our approach constructs anomaly-discriminative representations to incorporate anomalies into the training process and leverages global anomaly information to improve the precision of pseudo labels for segmentation. The effectiveness of our method has been validated through extensive experiments conducted on three OCT datasets with prognostic lesion classes across various scales. Our proposed method achieves new state-of-the-art performance in weakly-supervised lesion segmentation on OCT images and shows its adaptability in handling lesion-related tasks.
Despite the significant improvements over baseline methods, we acknowledge some limitations that warrant further investigation. The anomaly-guided mechanism is built on the concept of normal versus abnormal and is currently limited to lesion segmentation, where lesions are identified as abnormal. It may not be directly applicable to organ segmentation or natural image object detection: for the former, the objects of interest are considered normal and are always present; for the latter, a corresponding notion of normality is lacking. Nevertheless, our method is already capable of addressing a broad range of problems related to abnormality detection, making it especially valuable for lesion segmentation in the medical domain for the discovery of diagnostic and prognostic imaging biomarkers. This underscores the utility and potential of our approach, even though it may not be applicable to all segmentation tasks.
Additionally, one requirement that might be considered a limitation is the amount of normal data required for training the generative model. However, in most cases, normal data is far more plentiful than abnormal data. If there is insufficient normal data to train a robust GAN, data augmentation techniques, such as various transformations, can improve the generative model’s performance by providing more diverse and representative normal data. Leveraging transfer learning is another potential solution: by utilizing pre-trained models that have already learned generic features from large-scale natural image datasets, the generative model can be fine-tuned with a smaller amount of task-specific normal data. This approach can enhance the model’s performance even with limited normal data by exploiting the knowledge gained from pre-training.
Although weakly supervised models tend to underperform fully supervised ones, as evidenced in previous works (Roy et al., 2020; Patel and Dolz, 2022), their utility extends beyond direct performance comparison. As highlighted in our supplementary material, the low agreement between experts on lesion regions indicates a potential role for the pseudo labels generated by WSSS as an additional reference in clinical studies. Furthermore, studies (Ouali et al., 2020; Lee et al., 2021) have shown that combining these pseudo labels with strongly labeled examples in a semi-supervised segmentation network can yield performance comparable to fully supervised learning. In this context, our proposed AGM delivers a considerable performance improvement over existing techniques under the same level of supervision, indicating its potential to narrow this performance gap. In conclusion, our proposed AGM approach represents a significant step forward in weakly supervised lesion segmentation for medical images. We believe it has the potential to reduce the demand for label-intensive image annotation in studies of novel biomarkers in biomedical research, ultimately improving patient care.
Supplementary Material
- Presents weakly-supervised learning with image-level labels for medical lesion segmentation.
- Constructs anomaly-discriminative representations to incorporate anomalies in the training.
- Utilizes global anomaly information to improve pseudo label precision for segmentation.
- Demonstrates adaptability in handling various types of diseases or lesions.
- Validates the effectiveness across three OCT datasets.
Acknowledgments
This work was partially supported by grants PSC-CUNY Research Award 65406-00 53, NSF CCF-2144901, and NIH R21CA258493.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Declaration of Generative AI and AI-assisted technologies in the writing process
Statement: During the preparation of this work the author(s) used ChatGPT (GPT-4) in order to improve readability and check grammar. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Ahn J, Cho S, Kwak S, 2019. Weakly supervised learning of instance segmentation with inter-pixel relations, in: CVPR, pp. 2209–2218.
- Ahn J, Kwak S, 2018. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation, in: CVPR, pp. 4981–4990.
- Akcay S, Atapour-Abarghouei A, Breckon TP, 2018. Ganomaly: Semi-supervised anomaly detection via adversarial training, in: ACCV, pp. 622–637.
- Belharbi S, Rony J, Dolz J, Ayed IB, McCaffrey L, Granger E, 2021. Deep interpretable classification and weakly-supervised segmentation of histology images via max-min uncertainty. TMI, 702–714.
- Chattopadhay A, Sarkar A, Howlader P, Balasubramanian VN, 2018. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks, in: WACV, IEEE, pp. 839–847.
- Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H, 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation, in: ECCV, pp. 801–818.
- Chen X, You S, Tezcan KC, Konukoglu E, 2020. Unsupervised lesion detection via image restoration with a normative prior. MedIA, 101713.
- Chen Z, Tian Z, Zhu J, Li C, Du S, 2022a. C-cam: Causal cam for weakly supervised semantic segmentation on medical image, in: CVPR, pp. 11676–11685.
- Chen Z, Wang T, Wu X, Hua XS, Zhang H, Sun Q, 2022b. Class reactivation maps for weakly-supervised semantic segmentation, in: CVPR, pp. 969–978.
- Chiu SJ, Allingham MJ, Mettu PS, Cousins SW, Izatt JA, Farsiu S, 2015. Kernel regression based segmentation of optical coherence tomography images with diabetic macular edema. Biomedical Optics Express, 1172–1194.
- Choe J, Lee S, Shim H, 2020. Attention-based dropout layer for weakly supervised single object localization and semantic segmentation. TPAMI, 4256–4271.
- Dai J, He K, Sun J, 2015. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation, in: ICCV, pp. 1635–1643.
- Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L, 2009. Imagenet: A large-scale hierarchical image database, in: CVPR, pp. 248–255.
- van Engeland S, Snoeren PR, Huisman H, Boetes C, Karssemeijer N, 2006. Volumetric breast density estimation from full-field digital mammograms. TMI, 273–282.
- Gholami P, Roy P, Parthasarathy MK, Lakshminarayanan V, 2020. Octid: Optical coherence tomography image database. Computers & Electrical Engineering, 106532.
- Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y, 2020. Generative adversarial networks. Communications of the ACM, 139–144.
- Hu J, Chen Y, Yi Z, 2019. Automated segmentation of macular edema in oct using deep neural networks. MedIA, 216–227.
- Jo S, Yu IJ, 2021. Puzzle-cam: Improved localization via matching partial and full features, in: ICIP, pp. 639–643.
- Kermany D, Zhang K, Goldbaum M, et al., 2018. Labeled optical coherence tomography (oct) and chest x-ray images for classification. Mendeley Data, 651.
- Kervadec H, Dolz J, Tang M, Granger E, Boykov Y, Ayed IB, 2019. Constrained-cnn losses for weakly supervised segmentation. MedIA, 88–99.
- Kim B, Han S, Kim J, 2021. Discriminative region suppression for weakly-supervised semantic segmentation, in: AAAI, pp. 1754–1761.
- Kolesnikov A, Lampert CH, 2016. Seed, expand and constrain: Three principles for weakly-supervised image segmentation, in: ECCV, pp. 695–711.
- Kwak S, Hong S, Han B, 2017. Weakly supervised semantic segmentation using superpixel pooling network, in: AAAI.
- Lee J, Kim E, Yoon S, 2021. Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation, in: CVPR, pp. 4071–4080.
- Li Y, Yu Y, Zou Y, Xiang T, Li X, 2022. Online easy example mining for weakly-supervised gland segmentation from histology images, in: MICCAI, pp. 578–587.
- Lin D, Dai J, Jia J, He K, Sun J, 2016. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation, in: CVPR, pp. 3159–3167. [Google Scholar]
- Liu X, Liu Q, Zhang Y, Wang M, Tang J, 2023. Tssk-net: Weakly supervised biomarker localization and segmentation with image-level annotation in retinal oct images. Computers in Biology and Medicine, 106467. [DOI] [PubMed] [Google Scholar]
- Liu X, Liu Z, Zhang Y, Wang M, Li B, Tang J, 2021. Weakly-supervised automatic biomarkers detection and classification of retinal optical coherence tomography images, in: ICIP, pp. 71–75. [Google Scholar]
- Luo X, Hu M, Liao W, Zhai S, Song T, Wang G, Zhang S, 2022. Scribble-supervised medical image segmentation via dual-branch network and dynamically mixed pseudo labels supervision. arXiv preprint arXiv:2203.02106, 528–538 [Google Scholar]
- Ma T, Wang Q, Zhang H, Zuo W, 2022. Delving deeper into pixel prior for box-supervised semantic segmentation. TIP, 1406–1417. [DOI] [PubMed] [Google Scholar]
- Ma X, Ji Z, Niu S, Leng T, Rubin DL, Chen Q, 2020. Ms-cam: Multi-scale class activation maps for weakly-supervised segmentation of geographic atrophy lesions in sd-oct images. IEEE Journal of Biomedical and Health Informatics, 3443–3455. [DOI] [PubMed] [Google Scholar]
- Meissen F, Kaissis G, Rueckert D, 2021. Challenging current semi-supervised anomaly segmentation methods for brain mri, in: International MICCAI brainlesion workshop, pp. 63–74. [Google Scholar]
- Muhammad MB, Yeasin M, 2020. Eigen-cam: Class activation map using principal components, in: IJCNN, IEEE. pp. 1–7. [Google Scholar]
- Niu S, Xing R, Gao X, Liu T, Chen Y, 2023. A fine-to-coarse-to-fine weakly supervised framework for volumetric sd-oct image segmentation. IET Computer Vision, 123–134. [Google Scholar]
- Oh Y, Kim B, Ham B, 2021. Background-aware pooling and noise-aware loss for weakly-supervised semantic segmentation, in: CVPR, pp. 6913–6922. [Google Scholar]
- Ouali Y, Hudelot C, Tami M, 2020. Semi-supervised semantic segmentation with cross-consistency training, in: CVPR, pp. 12674–12684. [Google Scholar]
- Ouyang X, Xue Z, Zhan Y, Zhou XS, Wang Q, Zhou Y, Wang Q, Cheng JZ, 2019. Weakly supervised segmentation framework with uncertainty: A study on pneumothorax segmentation in chest x-ray, in: MICCAI, pp. 613–621. [Google Scholar]
- Patel G, Dolz J, 2022. Weakly supervised segmentation with cross-modality equivariant constraints. MedIA, 102374. [DOI] [PubMed] [Google Scholar]
- Pinheiro PO, Collobert R, 2015. Weakly supervised semantic segmentation with convolutional networks, in: CVPR, p. 6. [Google Scholar]
- Prince JL, Links JM, 2006. Medical imaging signals and systems. Pearson Prentice Hall; Upper Saddle River. [Google Scholar]
- Ramachandran P, Parmar N, Vaswani A, Bello I, Levskaya A, Shlens J, 2019. Stand-alone self-attention in vision models, in: NeurIPS [Google Scholar]
- Ramaswamy HG, et al. , 2020. Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization, in: WACV, pp. 983–991. [Google Scholar]
- Roth HR, Yang D, Xu Z, Wang X, Xu D, 2021. Going to extremes: weakly supervised medical image segmentation. Machine Learning and Knowledge Extraction, 507–524. [Google Scholar]
- Roy AG, Siddiqui S, Pölsterl S, Navab N, Wachinger C, 2020. ‘squeeze & excite’guided few-shot segmentation of volumetric images. MedIA, 101587. [DOI] [PubMed] [Google Scholar]
- Ru L, Zhan Y, Yu B, Du B, 2022. Learning affinity from attention: End-to-end weakly-supervised semantic segmentation with transformers, in: CVPR, pp. 16846–16855. [Google Scholar]
- Schlegl T, Seeböck P, Waldstein SM, Langs G, Schmidt-Erfurth U, 2019. f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. MedIA, 30–44. [DOI] [PubMed] [Google Scholar]
- Schmidt-Erfurth U, Reiter GS, Riedl S, Seeböck P, Vogl WD, Blodi BA, Domalpally A, Fawzi A, Jia Y, Sarraf D, et al. , 2021. Ai-based monitoring of retinal fluid in disease activity and under therapy. Progress in retinal and eye research, 100972. [DOI] [PubMed] [Google Scholar]
- Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D, 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization, in: ICCV, pp. 618–626. [Google Scholar]
- Shi X, Khademi S, Li Y, van Gemert J, 2021. Zoom-cam: Generating fine-grained pixel annotations from image labels, in: ICPR, pp. 10289–10296. [Google Scholar]
- Silva-Rodríguez J, Naranjo V, Dolz J, 2022. Constrained unsupervised anomaly segmentation. MedIA, 102526. [DOI] [PubMed] [Google Scholar]
- Valvano G, Leo A, Tsaftaris SA, 2021. Self-supervised multi-scale consistency for weakly supervised segmentation learning, in: DART, pp. 14–24. [Google Scholar]
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I, 2017. Attention is all you need, in: NeurIPS. [Google Scholar]
- Vernaza P, Chandraker M, 2017. Learning random-walk label propagation for weakly-supervised semantic segmentation, in: CVPR, pp. 7158–7166. [Google Scholar]
- Viniavskyi O, Dobko M, Dobosevych O, 2020. Weakly-supervised segmentation for disease localization in chest x-ray images, in: AIME, pp. 249–259. [Google Scholar]
- Wang H, Wang Z, Du M, Yang F, Zhang Z, Ding S, Mardziel P, Hu X, 2020a. Score-cam: Score-weighted visual explanations for convolutional neural networks, in: CVPR, pp. 24–25. [Google Scholar]
- Wang J, Li W, Chen Y, Fang W, Kong W, He Y, Shi G, 2021. Weakly supervised anomaly segmentation in retinal oct images using an adversarial learning approach. Biomedical optics express, 4713–4729. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang T, Niu S, Dong J, Chen Y, 2020b. Weakly supervised retinal detachment segmentation using deep feature propagation learning in sd-oct images, in: Ophthalmic Medical Image Analysis: 7th International Workshop, OMIA 2020, pp. 146–154 [Google Scholar]
- Wang X, Girshick R, Gupta A, He K, 2018. Non-local neural networks, in: CVPR, pp. 7794–7803. [Google Scholar]
- Wang Y, Zhang J, Kan M, Shan S, Chen X, 2020c. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation, in: CVPR, pp. 12275–12284. [Google Scholar]
- Wu T, Huang J, Gao G, Wei X, Wei X, Luo X, Liu CH, 2021. Embedded discriminative attention mechanism for weakly supervised semantic segmentation, in: CVPR, pp. 16765–16774. [Google Scholar]
- Xing R, Niu S, Gao X, Liu T, Fan W, Chen Y, 2021. Weakly supervised serous retinal detachment segmentation in sd-oct images by two-stage learning. Biomedical Optics Express, 2312–2327. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang J, Hu X, Chen C, Tsai C, 2021. A topological-attention convlstm network and its application to em images, in: MICCAI, pp. 217–228. [Google Scholar]
- Zhang S, Zhang J, Xia Y, 2022. Transws: Transformer-based weakly supervised histology image segmentation, in: International Workshop on MLMI, pp. 367–376. [Google Scholar]
- Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A, 2016. Learning deep features for discriminative localization, in: CVPR, pp. 2921–2929. [Google Scholar]
- Zhou K, Xiao Y, Yang J, Cheng J, Liu W, Luo W, Gu Z, Liu J, Gao S, 2020. Encoding structure-texture relation with p-net for anomaly detection in retinal images, in: ECCV, pp. 360–377. [Google Scholar]
- Zhu Z, Xu M, Bai S, Huang T, Bai X, 2019. Asymmetric non-local neural networks for semantic segmentation, in: ICCV, pp. 593–602. [Google Scholar]