Author manuscript; available in PMC: 2026 Mar 13.
Published before final editing as: IEEE Trans Pattern Anal Mach Intell. 2024 Sep 13;PP:10.1109/TPAMI.2024.3461321. doi: 10.1109/TPAMI.2024.3461321

Mine yOur owN Anatomy: Revisiting Medical Image Segmentation with Extremely Limited Labels

Chenyu You 1,*, Weicheng Dai 2,*, Fenglin Liu 3, Yifei Min 4, Nicha C Dvornek 5, Xiaoxiao Li 6, David A Clifton 7,8, Lawrence Staib 9, James S Duncan 10
PMCID: PMC11903367  NIHMSID: NIHMS2034236  PMID: 39269798

Abstract

Recent studies on contrastive learning have achieved remarkable performance solely by leveraging a few labels in the context of medical image segmentation. Existing methods mainly focus on instance discrimination and invariant mapping (i.e., pulling positive samples closer and pushing negative samples apart in the feature space). However, they face three common pitfalls: (1) tailness: medical image data usually follow an implicit long-tail class distribution, so blindly leveraging all pixels in training can lead to data imbalance issues and degraded performance; (2) consistency: it remains unclear whether a segmentation model has learned meaningful yet consistent anatomical features, due to the intra-class variations between different anatomical features; and (3) diversity: the intra-slice correlations within the entire dataset have received significantly less attention. This motivates us to seek a principled approach for strategically making use of the dataset itself to discover similar yet distinct samples from different anatomical views. In this paper, we introduce a novel semi-supervised 2D medical image segmentation framework termed Mine yOur owN Anatomy (MONA), and make three contributions. First, prior work argues that every pixel matters equally to model training; we observe empirically that this alone is unlikely to define meaningful anatomical features, mainly due to the lack of a supervision signal. We show two simple solutions towards learning invariances: stronger data augmentations and nearest neighbors. Second, we construct a set of objectives that encourage the model to decompose medical images into a collection of anatomical features in an unsupervised manner. Lastly, we demonstrate, both empirically and theoretically, the efficacy of MONA on three benchmark datasets, achieving new state-of-the-art results under different labeled semi-supervised settings.
MONA makes minimal assumptions about domain expertise, and hence constitutes a practical and versatile solution for medical image analysis. We provide PyTorch-like pseudo-code in the supplementary material. Code will be made available.

Keywords: Semi-supervised Learning, Contrastive Learning, Imbalanced Learning, Long-tailed Medical Image Segmentation

1. Introduction

With the advent of deep learning, medical image segmentation has drawn great attention and substantial research effort in recent years. Traditional supervised training schemes coupled with large-scale annotated data can achieve remarkable performance. However, training with massive high-quality annotated data is infeasible in clinical practice, since expert annotation of medical data requires considerable clinical expertise and time. This poses the question of how models can benefit from a large amount of unlabelled data during training. Recently emerged methods based on contrastive learning (CL) significantly reduce the labeling cost by learning strong visual representations in an unsupervised manner [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. A popular way of formulating this idea is to impose feature consistency across differently augmented views of the same image, treating each view as an individual instance.

Despite great promise, the main technical challenges remain: (1) How far is CL from becoming a principled framework for medical image segmentation? (2) Is there any better way to implicitly learn some intrinsic properties from the original data (i.e., the inter-instance relationships and intra-instance invariance)? (3) What will happen if models can only access a few labels in training?

To address the above challenges, we outline three principles: (1) tailness: existing approaches inevitably suffer from class collapse, wherein similar pairs from the same latent class are assumed to have the same representation [11], [12], [13]. This assumption, however, rarely holds for real-world clinical data. The long-tail distribution problem has received increasing attention in the computer vision community [14], [15], [16], [17], [18]; in contrast, there has been little prior long-tail work on medical image segmentation. As illustrated in Figure 1, most medical images follow a Zipf-like long-tail distribution in which different anatomical features have very different class frequencies, which can degrade performance; (2) consistency: given the scarcity of medical data in practice, augmentation is a widely adopted pretext task for learning meaningful representations. Intuitively, anatomical features should be semantically consistent across different transformations and deformations, so it is important to assess whether the model is robust to diverse views of the anatomy; (3) diversity: recent work [19], [20], [21] pointed out that going beyond simple augmentations to create more diverse views yields more discriminative anatomical features. At the same time, it is particularly challenging to introduce sufficient diversity while preserving the anatomy of the original data, especially in data-scarce clinical scenarios. To deploy models in the wild, we need to quantify and address these three research gaps from different anatomical views.

Fig. 1.

Fig. 1.

Examples of three benchmarks (i.e., ACDC, LiTS, MMWHS) with long-tail class distributions. As observed, the ratios of the different label classes are imbalanced across all three benchmarks.

In this paper, we present Mine yOur owN Anatomy (MONA), a novel contrastive semi-supervised 2D medical segmentation framework, based on different anatomical views. The workflow of MONA is illustrated in Figure 2. The key innovation in MONA is to seek diverse views (i.e., augmented/mined views) of different samples whose anatomical features are homogeneous within the same class type, yet distinctive across class types. We make the following contributions. First, we consider the problem of tailness. One issue is that label classes within medical images typically exhibit a long-tail distribution. Another, technically more challenging, issue is that only a few labeled samples and large quantities of unlabeled ones are available during training. Intuitively, we would like to sample more pixel-level representations from tail classes. Thus, we go beyond the naïve instance-discrimination setting in CL [4], [5], [6] by decomposing images into diverse yet consistent anatomical features, each belonging to a different class. In particular, we propose to use pseudo labeling and knowledge distillation to learn better pixel-level representations across multiple semantic classes within a training mini-batch. Since performing pixel-level CL on medical images is impractical in terms of both memory cost and training time, we adopt active sampling strategies [22], such as mining in-batch hard negative pixels, to better discriminate representations at a larger scale.

Fig. 2.

Fig. 2.

Overview of the MONA framework, which includes two stages: (1) GLCon is designed to seek both augmented and mined views for instance discrimination (ℒ_inst) in both global and local manners. The global instance discrimination exploits the correlations among views within the latent feature space generated by the encoders, while the local instance discrimination leverages the correlations among views (specifically, local regions of the image) within the output feature space produced by the decoder (see Section 3.1); (2) our proposed anatomical contrastive reconstruction fine-tuning (see Section 3.2). Note that U and L denote unlabeled and labeled data, respectively.

We further address the two other challenges: consistency and diversity. The success of the common CL theme is mainly attributed to invariant mapping [23] and instance discrimination [1], [4]. Starting from these two key aspects, we further improve segmentation quality. More specifically, we suggest that consistency under transformation (equivariance) is an effective strategy to establish invariances (i.e., of anatomical features and shape variance) to various image transformations. Furthermore, we investigate two ways to include diversity-promoting views in sample generation. First, we incorporate a memory buffer to alleviate the demand for large batch sizes, enabling much more efficient training without sacrificing segmentation quality. Second, we leverage stronger augmentations and nearest neighbors to mine positive views with more semantically similar contexts.

Extensive experiments are conducted on a variety of datasets and the latest CL frameworks (i.e., MoCo [5], SimCLR [4], BYOL [6], and ISD [24]), consistently demonstrating the effectiveness of our proposed MONA. For example, MONA establishes new state-of-the-art performance, outperforming all state-of-the-art semi-supervised approaches across different label ratios (i.e., 1%, 5%, 10%). Moreover, we present a systematic evaluation analyzing why our approach performs so well and how different factors contribute to the final performance (see Section 4.4). Theoretically, we show the efficacy of MONA in label efficiency (see Section A). Empirically, we also study whether these principles can effectively complement each CL framework (see Section 4.7). We hope our findings will provide useful insights on medical image segmentation to other researchers.

To summarise, our contributions are as follows: ❶ we carefully examine the problem of semi-supervised 2D medical image segmentation with extremely limited labels, and identify three principles for addressing such challenging tasks; ❷ we construct a set of objectives that significantly improve segmentation quality by addressing both the long-tail class distribution and anatomical feature consistency; ❸ we analyze, both empirically and theoretically, several critical components of our method and conduct thorough ablation studies to validate their necessity; ❹ combining the different components, we establish state-of-the-art results under SSL settings on all three challenging benchmarks.

2. Related work

Medical Image Segmentation.

Medical image segmentation aims to assign a class label to each pixel in an image, and plays a major role in real-world applications such as assisting radiologists in disease diagnosis and reducing costs. With sufficient annotated training data, significant progress has been achieved since the introduction of fully convolutional networks (FCNs) [25] and UNet [26]. Follow-up works fall into two main directions. One direction improves segmentation network design: many CNN-based [27], [28] and Transformer-like [29], [30] model variants [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41] have been proposed since then. For example, some works [32], [35], [42] use dilated/atrous/deformable convolutions with larger receptive fields to capture denser anatomical features, while others [36], [37], [38], [39], [40], [41] include Transformer blocks to capture longer-range information, achieving impressive performance. A parallel direction selects proper optimization strategies, designing loss functions that learn meaningful representations [43], [44], [45]. However, these methods assume access to a large, labeled dataset, a restrictive assumption that makes them challenging to deploy in most real-world clinical settings. In contrast, our MONA is more practical, as it leverages only a few labeled data and large quantities of unlabeled data in the learning stage.

Semi-Supervised Learning (SSL).

The goal of SSL is to improve medical segmentation performance by taking advantage of large amounts of unlabelled data during training. Existing work can be roughly categorized into three groups: (1) self-training, which generates pseudo-labels for performance gains, such as pseudo-label estimation [46], [47], [48], [49], [50], model uncertainty [51], [52], [53], confidence estimation [54], [55], [56], and noisy student [57]; (2) consistency regularization [58], [59], [60], which enforces consistency across different transformations, such as the pi-model [61], co-training [62], [63], and mean-teacher [9], [10], [64], [65], [66], [67]; (3) other training strategies such as adversarial training [68], [69], [70], [71], [72], [73] and entropy minimization [74]. In contrast to these works, we do not explore more advanced pseudo-labelling strategies to learn spatially structured representations. Instead, we are the first to explore a novel direction: discovering distinctive and semantically consistent anatomical features without image-level or region-level labels. We expect our findings to also be relevant for other medical image segmentation frameworks.

Contrastive Learning.

CL has recently emerged as a promising paradigm for medical image segmentation by exploiting abundant unlabeled data, leading to state-of-the-art results [9], [10], [75], [76], [77], [78], [79], [80], [81], [82]. The high-level idea of CL is to pull together different augmented views of the same instance while pushing apart all other instances. Intuitively, differently augmented views of the same image are considered positives, while all other images serve as negatives. The major difference between CL-based frameworks lies in the augmentation strategies used to obtain positives and negatives. For example, [83] augments a given image with four different rotation degrees and trains the model to recognize the rotation degree of each image via a contrastive loss. In contrast, our goal is to train a model that yields segments adhering to anatomical, geometric, and equivariance constraints in an unsupervised manner. A few very recent studies [14], [18] confirm the superiority of CL in addressing imbalance issues in image classification. Moreover, existing CL frameworks [75], [77] mainly focus on instance-level discrimination (i.e., different augmented views of the same instance should have similar anatomical features or be clustered around the class weights). However, we argue that not all negative samples matter equally, and these issues have not been explored from the perspective of medical image segmentation, considering that class distributions in medical images are diverse and typically exhibit long tails [84], [85], [86]. Motivated by the above, we address these two under-explored issues in medical image segmentation.

3. Mine yOur owN Anatomy (MONA)

Overview.

MONA consists of two parts: a global-local contrastive pre-training part named GLCon (Section 3.1) and a fine-tuning part named Anatomical Contrastive Reconstruction (Section 3.2). We illustrate our contrastive learning framework (See Figure 2), which includes (1) relational semi-supervised pre-training, and (2) anatomical contrastive reconstruction fine-tuning.

3.1. GLCon

Our pre-training stage is built upon ISD [24], a competitive framework for image classification. The main differences between ISD and the pre-training part of MONA (i.e., GLCon) are that GLCon is more tailored to medical image segmentation, i.e., it accounts for the dense nature of this problem in both global and local manners, and generalizes well to long-tail scenarios. Our principles are also expected to apply to other CL frameworks (i.e., MoCo [5], SimCLR [4], BYOL [6]). More detailed empirical and theoretical analyses can be found in Section 4.7 and Section A.

Pre-training preliminary.

Let (X, Y) be our dataset, including training images x ∈ X and their corresponding 𝒞-class segmentation labels y ∈ Y, where X is composed of N labeled and M unlabeled slices. Note that, for brevity, y can be either sampled from Y or a pseudo-label. The student and teacher networks f_θ and f_ξ, parameterized by weights θ and ξ, each consist of an encoder ℰ and a decoder 𝒟 (i.e., UNet [26]). Concretely, given a sample s from our unlabeled dataset, we have two ways to generate views: (1) we formulate augmented views (i.e., x₁, x₂) through two different augmentation chains; and (2) we create k mined views (i.e., x_{r,i}) by randomly selecting samples from the unlabeled dataset, followed by additional augmentation.¹ We then feed the augmented views to both f_θ and f_ξ, and the mined views to f_ξ. Similar to [75], we adopt global and local instance discrimination strategies in the latent and output feature spaces.² Specifically, the encoders generate global features z_g = ℰ_θ(x₁), z′_g = ℰ_ξ(x₂), and z_{r,g} = ℰ_ξ(x_r), which are then fed into the nonlinear projection heads to obtain v_g = h_θ(z_g), v′_g = h_ξ(z′_g), and w_g = h_ξ(z_{r,g}). The augmented embeddings from the student network are further projected into a secondary space, i.e., u_g = h′_θ(v_g). We calculate similarities between the mined views and the augmented views from the student and teacher in both global and local manners. A softmax is then applied to the calculated similarities, which models the relationship distributions:

$$s_\theta = \log\frac{\exp(\operatorname{sim}(u, w)/\tau_\theta)}{\sum_{j=1}^{k}\exp(\operatorname{sim}(u, w_j)/\tau_\theta)},\qquad s_\xi = \log\frac{\exp(\operatorname{sim}(v, w)/\tau_\xi)}{\sum_{j=1}^{k}\exp(\operatorname{sim}(v, w_j)/\tau_\xi)}, \tag{3.1}$$

where τθ and τξ are different temperature parameters, k denotes the number of mined views and sim(·,·) denotes cosine similarity. The unsupervised instance discrimination loss (i.e., Kullback-Leibler divergence 𝒦) can be defined as:

$$\mathcal{L}_{\mathrm{inst}} = \mathrm{KL}\left(s_\theta \,\|\, s_\xi\right). \tag{3.2}$$

The parameters ξ of f_ξ are updated as an exponential moving average of the student weights, ξ ← tξ + (1 − t)θ, with momentum hyperparameter t = 0.99. In our pre-training stage, the total loss is the sum of the global and local instance discrimination losses ℒ_inst (on pseudo-labels) and the supervised segmentation loss ℒ_sup (an equal combination of Dice loss and cross-entropy loss on ground-truth labels): ℒ_inst^global + ℒ_inst^local + ℒ_sup. The GLCon loss therefore encourages the model to acquire both global and local features.
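As a concrete illustration, the relational instance-discrimination loss of Eqs. 3.1–3.2 and the momentum update can be sketched in NumPy as follows; the helper names (`relation_kl_loss`, `ema_update`) and the toy dimensions are our own, not from the paper:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relation_kl_loss(u, v, w, tau_theta=0.1, tau_xi=0.01):
    """KL(s_theta || s_xi) between the student and teacher similarity
    distributions over k mined views (Eqs. 3.1-3.2).
    u: student embedding (d,), v: teacher embedding (d,),
    w: mined-view embeddings (k, d); all assumed L2-normalized."""
    s_theta = softmax(w @ u / tau_theta)   # student relation distribution
    s_xi = softmax(w @ v / tau_xi)         # teacher relation distribution
    return float(np.sum(s_theta * (np.log(s_theta) - np.log(s_xi))))

def ema_update(xi, theta, t=0.99):
    # teacher parameters track the student via exponential moving average
    return t * xi + (1.0 - t) * theta

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
w /= np.linalg.norm(w, axis=1, keepdims=True)  # k = 8 mined-view embeddings
u = w[0]  # student and teacher agree perfectly in this toy case
loss_same = relation_kl_loss(u, u, w, tau_theta=0.1, tau_xi=0.1)
```

The loss vanishes when the two relation distributions coincide; in training, only the student branch would receive gradients while the teacher is updated via `ema_update`.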

3.2. Anatomical Contrastive Reconstruction

Principles.

The key idea of the fine-tuning part is to seek diverse yet semantically consistent views whose anatomical features are homogeneous within the same class type, while distinctive across class types. As shown in Figure 2, the principles behind MONA (the anatomical contrastive reconstruction stage) aim to ensure tailness, consistency, and diversity. Concretely, tailness actively samples more hard pixels from tail classes; consistency enforces feature invariance; and diversity encourages the model to discover more anatomical features across different images. More theoretical analysis is given in Section A.

Tailness.

Motivated by the observations in Figure 1, our primary cue is that medical images naturally exhibit an imbalanced, long-tailed class distribution, wherein many class labels are associated with only a few pixels. To generalize well in such imbalanced settings, we propose an anatomical contrastive formulation (ACF) (see Figure 3).

Fig. 3.

Fig. 3.

Illustration of the contrastive loss. Intuitively, we actively sample a set of pixel-level anchor representations, pulling them closer to the class-averaged mean of representations within this class (positive keys), and pushing away from representations from other classes (negative keys).

Here we additionally attach representation heads that fuse multi-scale features with a feature pyramid network (FPN) [87] structure and generate m-dimensional representations through consecutive convolutional layers. The high-level idea is that features should be very similar within the same class type, and very dissimilar across class types. For long-tail medical data in particular, a naïve application of this idea would require computational resources proportional to the square of the number of pixels in the dataset, and would naturally overemphasize the anatomy-rich head classes while leaving the tail classes under-learned, both of which cause performance drops.

To this end, we address this issue by actively sampling a set of pixel-level anchor representations r_q ∈ ℛ_q^c (queries), pulling them closer to the class-averaged mean representation r_k^{c,+} of class c (positive key), and pushing them away from the representations r_k ∈ ℛ_k^c of other classes (negative keys). Formally, the contrastive loss is defined as:

$$\mathcal{L}_{\mathrm{contrast}} = -\sum_{c\in\mathcal{C}}\;\sum_{r_q\sim\mathcal{R}_q^c}\log\frac{\exp(r_q\cdot r_k^{c,+}/\tau)}{\exp(r_q\cdot r_k^{c,+}/\tau)+\sum_{r_k\sim\mathcal{R}_k^c}\exp(r_q\cdot r_k/\tau)}, \tag{3.3}$$

where 𝒞 denotes the set of all available classes in each mini-batch, and τ is a temperature hyperparameter. Let 𝒜 be the collection of all pixel coordinates within x; the representation sets are then:

$$\mathcal{R}_q^c = \bigcup_{[m,n]\in\mathcal{A}} \mathbb{1}\big[y_{[m,n]} = c\big]\, r_{[m,n]},\qquad \mathcal{R}_k^c = \bigcup_{[m,n]\in\mathcal{A}} \mathbb{1}\big[y_{[m,n]} \neq c\big]\, r_{[m,n]},\qquad r_k^{c,+} = \frac{1}{|\mathcal{R}_q^c|}\sum_{r_q\in\mathcal{R}_q^c} r_q. \tag{3.4}$$

Note that in Eq. 3.3, we use the negative pairs r_k to estimate the centers of the opposite classes. The class-averaged representation r_k^{c,+} is averaged over all instances of the target class c. We also note that CL might benefit from incorporating more positive and negative pairs into the instance discrimination task. However, naïvely unrolling CL in this way is impractical, since it incurs extra memory overhead that grows proportionally with the number of instance discrimination tasks. We therefore restrict ourselves to a random set (i.e., the mini-batch) of other images. Intuitively, we would like to maximize the anatomical similarity among all representations of the query class, and analogously minimize similarity to all other class representations. To compare pairs of instances between opposite and target classes, we create a graph 𝒢 encoding the pair-wise class relationships: 𝒢[p,q] = r_k^{p,+} · r_k^{q,+} for all p, q ∈ 𝒞 with p ≠ q, where 𝒢 ∈ ℝ^{|𝒞|×|𝒞|}. Finding an accurate decision boundary can then be formulated by normalizing the pair-wise relationships among all negative class representations via the softmax operator. Specifically, in Eq. 3.3 we sample the negative keys r_k from the opposite classes adaptively: we apply a softmax to obtain the distribution exp(𝒢[c,v]) / Σ_{n∈𝒞, n≠c} exp(𝒢[c,n]), from which we adaptively sample negative keys of class v, for v ≠ c. To address the challenge of imbalanced medical image data, we split the queries into easy and hard ones based on the pseudo-label confidence and a defined threshold:

$$\mathcal{R}_q^{c,\mathrm{easy}} = \bigcup_{r_q\in\mathcal{R}_q^c} \mathbb{1}\big[\hat{y}_q > \delta_\theta\big]\, r_q,\qquad \mathcal{R}_q^{c,\mathrm{hard}} = \bigcup_{r_q\in\mathcal{R}_q^c} \mathbb{1}\big[\hat{y}_q \le \delta_\theta\big]\, r_q, \tag{3.5}$$

where ŷ_q is the c-th-class pseudo-label confidence corresponding to r_q, and δ_θ is a user-defined threshold. To further improve performance in long-tail scenarios, we construct a class-aware memory bank [5] that stores a fixed number of negative samples per class c.
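To make the formulation concrete, here is a minimal NumPy sketch of the anchor-based contrastive loss of Eqs. 3.3–3.4; the function name and the toy class layout are illustrative assumptions, and the adaptive negative sampling, easy/hard split, and memory bank are omitted for brevity:

```python
import numpy as np

def anatomical_contrastive_loss(reps, labels, tau=0.5):
    """Pixel-level contrastive loss (Eq. 3.3), averaged over queries.
    reps: (N, d) L2-normalized pixel representations,
    labels: (N,) integer class labels (ground-truth or pseudo)."""
    total, n_queries = 0.0, 0
    for c in np.unique(labels):
        queries = reps[labels == c]           # R_q^c (anchors of class c)
        negatives = reps[labels != c]         # R_k^c (pixels of other classes)
        if len(negatives) == 0:
            continue
        pos_key = queries.mean(axis=0)        # class-averaged positive key r_k^{c,+}
        pos = np.exp(queries @ pos_key / tau)             # (|R_q^c|,)
        neg = np.exp(queries @ negatives.T / tau).sum(1)  # (|R_q^c|,)
        total += -np.log(pos / (pos + neg)).sum()
        n_queries += len(queries)
    return total / max(n_queries, 1)

# toy example: two well-separated classes in 2-D
reps = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
labels = np.array([0, 0, 1, 1])
loss_sep = anatomical_contrastive_loss(reps, labels)
```

As expected, the loss is lower when the per-class representations are well separated than when all classes collapse onto the same direction.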

Consistency.

The proposed ACF is designed to address imbalance, but anatomical consistency remains weak in the long-tail medical imaging setting, since medical segmentation should be robust to different tissue types exhibiting different anatomical variations. Our goal is to train a model that yields segments adhering to anatomical, geometric, and equivariance constraints in an unsupervised manner. As shown in Figure 4, we therefore apply a random image transformation 𝒯 and define an equivariance loss on both labeled and unlabeled data by measuring the feature consistency distance between each transformed segmentation map and the segmentation map of the transformed image:

$$\mathcal{L}_{\mathrm{eqv}}(x, \mathcal{T}(x)) = \sum_{x\in X}\Big[\mathrm{KL}\big(\mathcal{T}(f_\theta(x))\,\big\|\,f_\theta(\mathcal{T}(x))\big) + \mathrm{KL}\big(f_\theta(\mathcal{T}(x))\,\big\|\,\mathcal{T}(f_\theta(x))\big)\Big]. \tag{3.6}$$
Fig. 4.

Fig. 4.

Illustration of the equivariance loss.

Here we define 𝒯 on both the input image x and the prediction f_θ(x), via random transformations (i.e., affine, intensity, and photometric augmentations), since the model should learn to be robust and invariant to these transformations.
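A minimal sketch of the symmetric KL equivariance loss in Eq. 3.6, using a horizontal flip as the transformation 𝒯 and a toy pointwise "model"; all names and the toy model are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_map(p, q, eps=1e-12):
    # KL divergence between two (C, H, W) probability maps, summed over pixels
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def equivariance_loss(model, x, transform):
    """Symmetric KL between T(model(x)) and model(T(x)) (Eq. 3.6)."""
    p = transform(model(x))   # transform the prediction
    q = model(transform(x))   # predict on the transformed image
    return kl_map(p, q) + kl_map(q, p)

def toy_model(x):
    # pointwise 3-class "segmenter": softmax over stacked intensity features
    logits = np.stack([x, -x, 0.5 * x])   # (C, H, W)
    return softmax(logits, axis=0)

hflip = lambda a: a[..., ::-1]            # horizontal flip along the last axis

x = np.random.default_rng(1).normal(size=(8, 8))
loss = equivariance_loss(toy_model, x, hflip)
```

Because the toy model is pointwise, it commutes with the flip and the loss is zero; a position-dependent model would be penalized.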

Diversity.

Oversampling too many images from the random set would create extra memory overhead; more importantly, we find that a large number of random images does not necessarily help impose additional invariances between neighboring samples, since redundant images may introduce noise during training (see Section 4.8). To counteract this, we utilize a nearest-neighbor strategy, ensuring the model benefits from its previous outputs without overly concentrating on extraneous features. We formulate this insight as an auxiliary loss that regularizes the representations, keeping the anatomical contrastive reconstruction task as the main objective. In practice, given a batch of unlabeled images, we use the teacher and student models to obtain v_g and u_g, which are normalized using the ℓ2 norm. v_g is fed into a first-in-first-out (FIFO) memory bank [5], from which we retrieve its K nearest neighbors. The nearest-neighbor loss ℒ_nn then minimizes the distance, defined as negative cosine similarity, between u_g and these K nearest neighbors, thereby exploiting inter-instance relationships.
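A small NumPy sketch of this nearest-neighbor mining step; the FIFO bank size, `K`, and the function names are illustrative assumptions:

```python
import numpy as np
from collections import deque

def l2_normalize(a):
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

class FIFOBank:
    """Fixed-size first-in-first-out memory bank of embeddings."""
    def __init__(self, size):
        self.queue = deque(maxlen=size)
    def push(self, v):
        self.queue.append(v)
    def as_array(self):
        return np.stack(self.queue)

def nn_loss(u_g, v_g, bank, K=5):
    """Negative cosine similarity between the student embedding u_g and
    the K nearest neighbors of the teacher embedding v_g in the bank."""
    mem = bank.as_array()                    # (M, d), assumed L2-normalized
    sims = mem @ v_g                         # cosine similarity to teacher view
    idx = np.argsort(-sims)[:K]              # indices of the K nearest neighbors
    neighbors = mem[idx]
    return float(-np.mean(neighbors @ u_g))  # minimizing this maximizes similarity

rng = np.random.default_rng(0)
bank = FIFOBank(size=36)
for _ in range(36):
    bank.push(l2_normalize(rng.normal(size=16)))
v_g = l2_normalize(rng.normal(size=16))
u_g = v_g.copy()   # perfectly aligned student/teacher views in this toy case
loss = nn_loss(u_g, v_g, bank, K=5)
```

If the teacher embedding itself is present in the bank, the single nearest neighbor is that embedding and the loss attains its minimum of −1 for an aligned student view.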

Setup.

The total loss ℒ_total is the sum of the contrastive loss ℒ_contrast (on both ground-truth labels and pseudo-labels), the equivariance loss ℒ_eqv (on both ground-truth labels and pseudo-labels), the nearest-neighbor loss ℒ_nn (on both ground-truth labels and pseudo-labels), the unsupervised cross-entropy loss ℒ_unsup (on pseudo-labels), and the supervised segmentation loss ℒ_sup (on ground-truth labels): ℒ_total = ℒ_sup + λ₁ℒ_contrast + λ₂ℒ_eqv + λ₃ℒ_unsup + λ₄ℒ_nn. We theoretically analyze the effectiveness of MONA in the very limited label setting (see Section A), and empirically ablate the different hyperparameters (see Section 4.8).
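For reference, the weighted combination can be sketched as a one-liner, using the λ values reported in Section 4.2 as defaults (a sketch, not the authors' training loop):

```python
def total_loss(l_sup, l_contrast, l_eqv, l_unsup, l_nn,
               lambdas=(0.01, 1.0, 1.0, 1.0)):
    """L_total = L_sup + λ1·L_contrast + λ2·L_eqv + λ3·L_unsup + λ4·L_nn,
    with (λ1, λ2, λ3, λ4) = (0.01, 1.0, 1.0, 1.0) as in Section 4.2."""
    l1, l2, l3, l4 = lambdas
    return l_sup + l1 * l_contrast + l2 * l_eqv + l3 * l_unsup + l4 * l_nn
```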

4. Experiments

In this section, we evaluate our proposed MONA on three popular medical image segmentation datasets under varying labeled ratio settings: the ACDC dataset [92], the LiTS dataset [93], and the MMWHS dataset [94].

4.1. Datasets

The ACDC dataset was hosted in the MICCAI 2017 ACDC challenge [92], and includes 200 3D cardiac cine MRI scans with expert annotations for three classes (i.e., left ventricle (LV), myocardium (Myo), and right ventricle (RV)). We use 120, 40, and 40 scans for training, validation, and testing.³ Note that the 1%, 5%, and 10% label ratios denote the ratio of patients. For pre-processing, we adopt a setting similar to [75]: we normalize the intensity of each 3D scan into [0, 1] (i.e., using min-max normalization), and re-sample all 2D slices and the corresponding segmentation maps to a fixed spatial resolution of 256 × 256 pixels.

The LiTS dataset was hosted in the MICCAI 2017 Liver Tumor Segmentation Challenge [93], and includes 131 contrast-enhanced 3D abdominal CT volumes with expert annotations for two classes (i.e., liver and tumor). We use 100 and 31 scans for training and testing, split in random order; the splitting details are in the supplementary material. For pre-processing, we adopt a setting similar to [95]: we truncate the intensity of each 3D scan to [−200, 250] HU to remove irrelevant and redundant details, normalize each 3D scan into [0, 1], and re-sample all 2D slices and the corresponding segmentation maps to a fixed spatial resolution of 256 × 256 pixels.

The MMWHS dataset was hosted in a MICCAI 2017 challenge [94], and includes 20 3D cardiac MRI scans with expert annotations for seven classes: left ventricle (LV), left atrium (LA), right ventricle (RV), right atrium (RA), myocardium (Myo), ascending aorta (AAo), and pulmonary artery (PA). We use 15 and 5 scans for training and testing, split in random order; the splitting details are in the supplementary material. For pre-processing, we normalize the intensity of each 3D scan into [0, 1] (i.e., using min-max normalization), and re-sample all 2D slices and the corresponding segmentation maps to a fixed spatial resolution of 256 × 256 pixels.

Moreover, to further validate our approach's ability to handle imbalance without supervision, we consider a more realistic and more challenging scenario in which the models only have access to extremely limited labeled data (i.e., a 1% labeled ratio) and large quantities of unlabeled data during training. For all experiments, we follow the same training and testing protocol. For ACDC, we adopt the fixed data split of [96]; for LiTS and MMWHS, we adopt random data splits with respect to patients.

4.2. Implementation Details.

We implement all evaluated models using the PyTorch library [97]. All models are trained using Stochastic Gradient Descent (SGD) (initial learning rate = 0.01, momentum = 0.9, weight decay = 0.0001) with a batch size of 6, and the learning rate is divided by 10 every 2500 iterations. All experiments are conducted on NVIDIA GeForce RTX 3090 GPUs. We first train the model for 100 epochs of pre-training, and then fine-tune it for 200 epochs. We set the temperatures τ_ξ, τ_θ, and τ to 0.01, 0.1, and 0.5, respectively. The size of the memory bank is 36. During pre-training, we follow the settings of ISD, including the global projection head and predictors with 512-dimensional output embeddings, and adopt the local projection head setting of [79]. More specifically, given the predicted logits ŷ ∈ ℝ^{𝒞×ℋ×𝒲}, we create 36 different views (i.e., random crops at the same locations) of ŷ and ŷ′ with a fixed size of 64 × 64, and project all pixels into a 512-dimensional output embedding space; the output feature dimension of h_θ is also 512. An illustration of our representation head is presented in Figure 6. We then actively sample 256 query embeddings and 512 key embeddings per mini-batch, and set the confidence threshold δ_θ to 0.97. During fine-tuning we use an equally sized pool of K = 5 candidates, as well as λ₁ = 0.01, λ₂ = 1.0, λ₃ = 1.0, and λ₄ = 1.0. For the augmentation strategies, we apply weak augmentation to the teacher's input (random rotation, random cropping, horizontal flipping) and strong augmentation to the student's input (random rotation, random cropping, horizontal flipping, random contrast, CutMix [98], brightness changes [99], and morphological changes, i.e., diffeomorphic deformations). We adopt two popular evaluation metrics for the 3D segmentation results: Dice coefficient (DSC) and Average Symmetric Surface Distance (ASD).
Of note, the projection heads, the predictor, and the representation head are only used in training, and are discarded during inference.
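As a reminder of the primary evaluation metric, here is a minimal implementation of the Dice coefficient for binary masks (the ASD computation, which requires surface extraction, is omitted); this is a generic sketch, not the paper's evaluation code:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks of equal shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# perfect overlap gives DSC = 1; disjoint masks give DSC ≈ 0
a = np.zeros((4, 4), dtype=int)
a[:2] = 1
b = a.copy()
```

In a multi-class setting such as ACDC, the per-class DSC would be averaged over the foreground classes.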

Fig. 6.

Fig. 6.

Overview of the representation head architecture.

4.3. Main Results

We demonstrate the effectiveness of our method under three label ratios (i.e., 1%, 5%, 10%), and compare MONA with various state-of-the-art SSL and fully-supervised methods on three datasets: ACDC [92], LiTS [93], and MMWHS [94]. We choose 2D UNet [26] as the backbone, and compare against SSL methods including EM [88], CCT [89], DAN [68], URPC [90], DCT [62], ICT [91], MT [64], UAMT [51], CPS [49], SimCVD [80], MMS [82], SCS [79], GCL [75], and PLC [78]. The upper- and lower-bound methods are UNet trained with full and limited supervision (UNet-F and UNet-L), respectively. We report quantitative comparisons on ACDC and LiTS in Table 1.

TABLE 1.

Comparison of segmentation performance (DSC[%]/ASD[mm]) on ACDC and LiTS under three labeled ratio settings (1%, 5%, 10%). The best results are indicated in bold.

| Method | ACDC 1% DSC ↑ | ASD ↓ | ACDC 5% DSC ↑ | ASD ↓ | ACDC 10% DSC ↑ | ASD ↓ | LiTS 1% DSC ↑ | ASD ↓ | LiTS 5% DSC ↑ | ASD ↓ | LiTS 10% DSC ↑ | ASD ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet-F [26] | 91.5 | 0.996 | 91.5 | 0.996 | 91.5 | 0.996 | 68.5 | 17.8 | 68.5 | 17.8 | 68.5 | 17.8 |
| UNet-L | 14.5 | 19.3 | 51.7 | 13.1 | 79.5 | 2.73 | 57.0 | 34.6 | 60.4 | 30.4 | 61.6 | 28.3 |
| EM [88] | 21.1 | 21.4 | 59.8 | 5.64 | 75.7 | 2.73 | 56.6 | 38.4 | 61.2 | 33.3 | 62.9 | 38.5 |
| CCT [89] | 30.9 | 28.2 | 59.1 | 10.1 | 75.9 | 3.60 | 52.4 | 52.3 | 60.6 | 48.7 | 63.8 | 31.2 |
| DAN [68] | 34.7 | 25.7 | 56.4 | 15.1 | 76.5 | 3.01 | 57.2 | 27.1 | 62.3 | 25.8 | 63.2 | 30.7 |
| URPC [90] | 32.2 | 26.9 | 58.9 | 8.14 | 73.2 | 2.68 | 55.5 | 34.6 | 62.4 | 37.8 | 63.0 | 43.1 |
| DCT [62] | 36.0 | 24.2 | 58.5 | 10.8 | 78.1 | 2.64 | 57.6 | 38.5 | 60.8 | 34.4 | 61.9 | 31.7 |
| SimCVD [80] | 32.1 | 20.3 | 76.1 | 4.14 | 79.2 | 2.21 | 56.2 | 32.7 | 60.5 | 23.6 | 61.3 | 26.0 |
| MMS [82] | 32.5 | 13.6 | 77.6 | 3.61 | 79.4 | 1.74 | 56.9 | 45.6 | 61.6 | 55.4 | 62.5 | 46.9 |
| ICT [91] | 35.8 | 21.3 | 59.0 | 4.59 | 75.1 | 0.898 | 58.3 | 32.2 | 60.1 | 39.1 | 62.5 | 32.4 |
| MT [64] | 36.8 | 19.6 | 58.3 | 11.2 | 80.1 | 2.33 | 56.7 | 34.3 | 61.9 | 40.0 | 63.3 | 26.2 |
| UAMT [51] | 35.2 | 24.3 | 61.0 | 7.03 | 77.6 | 3.15 | 57.8 | 41.9 | 61.0 | 47.0 | 62.3 | 26.0 |
| CPS [49] | 37.1 | 30.0 | 61.0 | 2.92 | 78.8 | 3.41 | 57.7 | 39.6 | 62.1 | 36.0 | 64.0 | 23.6 |
| GCL [75] | 59.7 | 14.3 | 70.6 | 2.24 | 87.0 | **0.751** | 59.3 | 29.5 | 63.3 | 20.1 | 65.0 | 37.2 |
| SCS [79] | 59.4 | 12.7 | 73.6 | 5.37 | 84.2 | 2.01 | 57.8 | 39.6 | 61.5 | 28.8 | 64.6 | 33.9 |
| PLC [78] | 58.8 | 15.1 | 70.6 | 2.67 | 87.3 | 1.34 | 56.6 | 41.6 | 62.7 | 26.1 | 68.2 | **16.9** |
| MONA (ours) | **82.6** | **2.03** | **88.8** | **0.622** | **90.7** | 0.864 | **64.1** | **20.9** | **67.3** | **16.4** | **69.3** | 18.0 |

ACDC.

We benchmark performance on ACDC with respect to different labeled ratios (1%, 5%, 10%). The following observations can be drawn. First, our proposed MONA significantly outperforms all other SSL methods under all three label ratios. In particular, with only extremely limited labeled data (e.g., 1%), our method obtains massive gains of 22.9% in Dice and 10.67 mm in ASD, dramatically improving the Dice score from 59.7% to 82.6%. Second, as shown in Figure 5, we can see the clear advantage of MONA: the anatomical boundaries of different tissues, such as the RV and Myo regions, are clearly more pronounced. Our method is thus capable of producing consistently sharp and accurate object boundaries across various challenging scenarios.

Fig. 5.

Visualization of segmentation results on ACDC with 5% label ratio. As shown, MONA consistently yields more accurate predictions and better boundary adherence compared to all other SSL methods. Different anatomical classes (RV, Myo, LV) are shown in different colors.

LiTS.

We then evaluate MONA on LiTS using 1%, 5%, and 10% labeled ratios. The results are summarized in Table 1 and Figure 7, and the conclusions are highly consistent with the ACDC case. First, at each label ratio, MONA consistently outperforms all the other SSL methods, which again demonstrates the effectiveness of learning representations for inter-class correlations and intra-class invariances under imbalanced class-distribution scenarios. In particular, our MONA trained with a 1% labeled ratio (i.e., extremely limited labels) improves the previous best average Dice score from 59.3% to 64.1% by a large margin, and even performs on par with previous SSL methods using a 10% labeled ratio. Second, as shown in Figure 7, MONA produces more accurate results compared to the previous best schemes.

Fig. 7.

Visualization of segmentation results on LiTS with 5% labeled ratio. As shown, MONA consistently produces sharp and accurate object boundaries compared to all other SSL methods. Different anatomical classes (Liver, Tumor) are shown in different colors.

MMWHS.

Lastly, we validate MONA on MMWHS under 1%, 5%, and 10% labeled ratios. The results are provided in Table 2 and Figure 8. Again, MONA consistently outperforms all other SSL methods with a significant performance margin, achieving the highest accuracy under all three labeled ratios. Notably, MONA trained at the 1% labeled ratio significantly outperforms all other methods trained at the 1% labeled ratio, and even rivals those trained at the 5% labeled ratio. Concretely, MONA outperforms the second-best method (i.e., GCL) by 12.3% and 2.8% in Dice at the 1% and 5% labeled ratios, respectively. We also observe a similar pattern at the 10% labeled ratio, where MONA performs better than or on par with all the other methods, which again demonstrates the superiority of MONA in extremely limited labeled data regimes.

TABLE 2.

Comparison of segmentation performance (DSC[%]/ASD[mm]) on MMWHS under three labeled ratio settings (1%, 5%, 10%). MONA significantly outperforms all state-of-the-art methods in all three labeled settings. The best results are in bold.

Method | 1% Labeled (DSC ↑, ASD ↓) | 5% Labeled (DSC ↑, ASD ↓) | 10% Labeled (DSC ↑, ASD ↓)

UNet-F [26] 85.8 8.01 85.8 8.01 85.8 8.01
UNet-L 58.3 33.9 77.8 24.4 82.7 13.5

EM [88] 54.5 41.1 80.6 17.3 82.1 15.1
CCT [89] 62.8 27.5 79.0 21.9 79.4 16.3
DAN [68] 52.8 48.4 79.4 22.7 80.2 15.0
URPC [90] 65.7 29.7 73.7 20.5 81.9 12.3
DCT [62] 62.7 27.5 80.8 23.0 82.8 12.4
SimCVD [80] 64.6 39.5 77.0 20.2 80.3 16.8
MMS [82] 66.2 36.9 80.6 18.4 82.1 16.7
ICT [91] 59.9 32.8 76.5 15.4 82.2 12.0
MT [64] 58.8 35.6 76.5 15.5 79.4 19.8
UAMT [51] 61.1 37.6 76.3 20.9 83.7 14.2
CPS [49] 58.8 33.6 78.3 22.5 82.0 13.1
GCL [75] 71.6 20.3 83.5 7.41 86.7 8.76
SCS [79] 71.4 19.3 81.1 11.5 82.6 9.68
PLC [78] 71.5 19.8 83.4 10.7 86.0 9.65
• MONA (ours) 83.9 9.06 86.3 8.22 87.6 6.83
Fig. 8.

Visualization of segmentation results on MMWHS with 5% labeled ratio. As shown, MONA consistently generates more accurate predictions compared to all other SSL methods. Different anatomical classes (LV, LA, RV, RA, Myo, PA) are shown in different colors.

Overall, we conclude that MONA provides robust performance on all the medical datasets we evaluated, approaching or even exceeding the fully-supervised baseline, and outperforming all other SSL methods.

4.4. Ablation Study

In this subsection, we conduct comprehensive analyses to understand the inner workings of MONA on ACDC under 5% labeled ratio.

4.5. Effects of Different Components

Our key observation is that building meaningful anatomical representations of the inter-class correlations and intra-class invariances under imbalanced class-distribution scenarios is crucial for further improving performance. Upon our choice of architecture, we first consider our CL pre-trained method (i.e., GLCon). To validate this, we experiment with the key components of MONA on ACDC: (1) tailness, (2) consistency, and (3) diversity. The results are in Table 3. As shown, each key component makes a clear difference, and leveraging all of them contributes to remarkable performance improvements. This suggests the importance of learning meaningful representations for the inter-class correlations and intra-class invariances within the entire dataset. The intuitions behind each concept are as follows: (1) only tailness: many anatomy-rich head classes would still be over-sampled; (2) only consistency: it would lead to object collapsing due to the different anatomical variations; (3) only diversity: oversampling too many negative samples often comes at the cost of performance degradation. By combining tailness, consistency, and diversity, our method confers a significant advantage in representation learning with respect to imbalanced feature similarity, semantic consistency, and anatomical diversity, which further highlights the superiority of our proposed MONA (more results in Section 4.7).

TABLE 3.

Ablation on model component: (1) tailness; (2) consistency; (3) diversity, compared to the Vanilla and our MONA.

Method | Dice[%] ↑ | ASD[mm] ↓

Vanilla 74.2 3.89

 w/tailness 83.1 0.602
 w/consistency 84.2 1.86
 w/diversity 78.2 3.07
 w/tailness + consistency 88.1 0.864
 w/consistency + diversity 80.2 2.11
 w/tailness + diversity 85.0 0.913

• MONA (ours) 88.8 0.622

4.6. Effects of Different Augmentations

In addition to further improving the quality and stability of anatomical representation learning, we claim that MONA also gains robustness through its augmentation strategies. Previous works [19], [24], [100] show that composing a weak augmentation strategy for the "pivot-to-target" model (i.e., trained with limited labeled data and a large amount of unlabeled data) is helpful for anatomical representation learning, since the standard contrastive strategy is too aggressive and intuitively leads to a "hard" task (i.e., introducing too many disturbances and yielding model collapse). Here we examine whether and how applying different data augmentations helps MONA. In this work, we apply weak augmentation (random rotation, random cropping, horizontal flipping) to the teacher's input, and strong augmentation (random rotation, random cropping, horizontal flipping, random contrast, CutMix [98], brightness changes [99], and morphological changes, i.e., diffeomorphic deformations) to the student's input. We summarize the results in Table 4 and make the following observations: (1) weak augmentation for the teacher benefits more: composing weak augmentation for the teacher model and strong augmentation for the student model significantly boosts performance across both benchmark datasets; (2) identical augmentation pairs do not yield more gains: interestingly, applying the same type of augmentation to both inputs does not lead to the best performance compared to pairing different types. We postulate that composing different augmentations can be considered a harder albeit more useful strategy for anatomical representation learning, making features more generalizable.
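To make the weak/strong split concrete, the following NumPy sketch illustrates the two pipelines under simplifying assumptions: rotations are restricted to 90° steps, the crop is fixed at 64 × 64, and the contrast/brightness ranges and CutMix-style paste are our own placeholder choices (the diffeomorphic deformations are omitted), not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def weak_augment(img):
    """Teacher-side weak augmentation: random rotation (90° steps here),
    random 64x64 crop, and random horizontal flip."""
    img = np.rot90(img, k=int(rng.integers(4)))
    y, x = rng.integers(img.shape[0] - 63), rng.integers(img.shape[1] - 63)
    img = img[y:y + 64, x:x + 64]
    if rng.random() < 0.5:
        img = img[:, ::-1]
    return img

def strong_augment(img, other):
    """Student-side strong augmentation: the weak ops plus contrast and
    brightness jitter and a CutMix-style patch pasted from another image."""
    img = weak_augment(img).astype(np.float64)
    img = img * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1)  # contrast/brightness
    ph, pw = 16, 16                      # CutMix-style: overwrite a random patch
    y, x = rng.integers(64 - ph), rng.integers(64 - pw)
    img[y:y + ph, x:x + pw] = other[:ph, :pw]
    return img

img = rng.random((128, 128))
out = strong_augment(img, rng.random((64, 64)))
print(out.shape)   # (64, 64)
```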

TABLE 4.

Ablation on augmentation strategies for MONA on the ACDC and LiTS dataset under 5% labeled ratio.

Dataset | Student Aug. | Teacher Aug. | Dice[%] ↑ | ASD[mm] ↓

ACDC Weak Weak 86.0 1.02
Strong Weak 88.8 0.622
Weak Strong 86.4 2.83
Strong Strong 88.8 2.07

LiTS Weak Weak 62.3 26.5
Strong Weak 67.3 16.4
Weak Strong 64.3 34.7
Strong Strong 66.5 21.1

4.7. Generalization across Contrastive Learning Frameworks

As discussed in Section 3.1, our motivation comes from the observation that there are only very limited labeled data and a large amount of unlabeled data in real-world clinical practice. As the fully-supervised methods generally outperform all other SSL methods by clear margins, we postulate that leveraging massive unlabeled data usually introduces additional noise during training, leading to degraded segmentation quality. To address this challenge, “contrastive learning” is a straightforward way to leverage existing unlabeled data in the learning procedure. As supported in Section 4, our findings have shown that MONA generalizes well across different benchmark datasets (i.e., ACDC, LiTS, MMWHS) with diverse labeled settings (i.e., 1%, 5%, 10%). In the following subsection, we further demonstrate that our proposed principles (i.e., tailness, consistency, diversity) are beneficial to various state-of-the-art CL-based frameworks (i.e., MoCov2 [7], kNN-MoCo [21], SimCLR [4], BYOL [6], and ISD [24]) with different label settings. More details about these three principles can be found in Section 3.2. Of note, MONA can consistently outperform the semi-supervised methods on diverse benchmark datasets with only 10% labeled ratio.

Training Details of Competing CL Methods.

We identically follow the default setting of each CL framework [4], [6], [7], [21], [24] except for the number of epochs. We train each model in the semi-supervised setting. For labeled data, we follow the same training strategy as in Section 3.1; for unlabeled data, we strictly follow the default settings of each baseline. For fair comparisons, we pre-train each CL baseline and our CL pre-trained method (i.e., GLCon) for 100 epochs in all experiments. We then fine-tune each CL model with our proposed principles using the same setting, as provided in Section 4.2. For kNN-MoCo [21], based on the ablation study below we set the number of neighbors k to 5, and compare different settings of k in the following subsection. All experiments are run with three different random seeds, and the reported results are calculated on the validation set. Of note, UNet-F is fully supervised.

Comparisons with CL-based Frameworks.

Table 5 presents the comparisons between our methods (i.e., GLCon and MONA) and various CL baselines. We can draw several consistent observations from these extensive results. First, our GLCon achieves performance gains under all labeled ratios, which not only demonstrates the effectiveness of our method but also further verifies the benefit of the "global-local" strategy [75]. The average improvement in Dice obtained by GLCon reaches up to 2.53% compared to the second-best scores at the different labeled ratios. Second, incorporating our proposed three principles significantly outperforms the CL baselines without fine-tuning, across all frameworks and labeled ratios. These findings suggest that our three principles further improve generalization across different labeled ratios. On the ACDC dataset at the 1% labeled ratio, the backbones equipped with all three principles obtain promising results, improving the performance of MoCov2, kNN-MoCo, SimCLR, BYOL, ISD, and our GLCon by 39.1%, 38.5%, 40.9%, 41.2%, 34.3%, and 33.3%, respectively. ACDC is a popular multi-class medical image segmentation dataset with many imbalanced or long-tailed class-distribution cases. Such a distribution gap can cause vanilla models to overfit to the head classes and generalize very poorly to the tail classes. By under-sampling the head classes, the tailness principle can be deemed the prominent strategy for better generalization and segmentation performance across different labeled ratios. Similar results are found under the 5% and 10% labeled ratios. Third, over a wide range of labeled ratios, MONA establishes a new state-of-the-art performance bar for semi-supervised 2D medical image segmentation.
In particular, MONA, for the first time, boosts segmentation performance with a 10% labeled ratio beyond the fully-supervised UNet (UNet-F). From Table 1 we see that on LiTS with a 10% labeled ratio, MONA outperforms UNet-F by 0.8 DSC (69.3 vs. 68.5). From Table 2, MONA outperforms UNet-F on MMWHS by 1.8 DSC (87.6 vs. 85.8). Tables 1 and 2 also show that MONA significantly outperforms all the other semi-supervised methods by a large margin. In summary, our methods (i.e., GLCon and MONA) obtain remarkable performance in all labeled settings. The results jointly verify the superiority of our proposed three principles (i.e., tailness, consistency, diversity), which make the model generalize well to different labeled settings and can be easily and seamlessly plugged into other CL frameworks [4], [6], [7], [21], [24] adopting the two-branch design, consistently yielding extra performance boosts for them all.

TABLE 5.

Ablation study of different contrastive learning frameworks on ACDC under three labeled ratio settings (1%, 5%, 10%). We compare two settings, with or without fine-tuning, on segmentation performance (DSC[%]/ASD[mm]); "without fine-tuning" denotes pre-training only. In all three labeled settings, our methods (i.e., GLCon and MONA) significantly outperform all the state-of-the-art methods. All experiments are run with three different random seeds. The best results are in bold.

Framework | Method | 1% Labeled (DSC ↑, ASD ↓) | 5% Labeled (DSC ↑, ASD ↓) | 10% Labeled (DSC ↑, ASD ↓)

only pre-training MoCov2 [7] 38.6 22.4 56.2 17.9 81.0 5.36
kNN-MoCo [21] 39.5 22.0 58.3 15.7 83.1 7.18
SimCLR [4] 34.8 24.3 51.7 19.9 80.3 4.16
BYOL [6] 35.9 7.25 65.9 9.15 85.6 2.51
ISD [24] 45.8 17.2 71.0 4.29 85.3 2.97
GLCon (ours) 49.3 7.11 74.2 3.89 86.5 1.92

w/fine-tuning MoCov2 [7] 77.7 4.78 85.4 1.52 86.7 1.74
kNN-MoCo [21] 78.0 4.28 85.9 1.51 86.9 1.61
SimCLR [4] 75.7 4.33 83.2 2.06 86.1 2.25
BYOL [6] 77.1 4.84 85.3 2.06 88.1 0.994
ISD [24] 80.1 3.00 83.8 1.95 88.6 1.20
• MONA (ours) 82.6 2.03 88.8 0.622 90.7 0.864

Generalization Across CL Frameworks.

As demonstrated in Table 6, incorporating tailness, consistency, and diversity yields clearly superior performance boosts, in line with the observations in Section 4.4. This suggests that these three principles serve as desirable properties for medical image segmentation in both supervised and unsupervised settings.

TABLE 6.

Ablation study of different principles across different contrastive learning frameworks under various labeled ratio settings (1%, 5%, 10%). Experiments are conducted on ACDC using UNet [26] as the backbone, with three independent runs. We report segmentation performance in terms of DSC[%] and ASD[mm]. In all three labeled settings, incorporating our principles (i.e., tailness, consistency, and diversity) consistently achieves superior robustness gains across different state-of-the-art CL frameworks.

Framework | Principle | 1% Labeled (DSC ↑, ASD ↓) | 5% Labeled (DSC ↑, ASD ↓) | 10% Labeled (DSC ↑, ASD ↓)

MoCov2 [7] Vanilla 38.6 22.4 56.2 17.9 81.0 5.36
tailness 65.0 3.99 81.3 1.13 84.8 1.52
consistency 70.3 6.88 79.5 3.65 81.9 3.79
diversity 47.5 10.2 72.2 5.82 83.1 5.46
tailness + consistency 75.8 5.10 83.8 1.89 85.7 2.81
consistency + diversity 73.3 6.34 75.4 5.63 82.7 4.39
tailness + diversity 75.5 5.40 82.4 3.39 85.3 2.49
tailness + consistency + diversity 77.7 4.78 85.4 1.52 86.7 1.74

kNN-MoCo [21] Vanilla 39.5 22.0 58.3 15.7 83.1 7.18
tailness 66.7 3.87 83.7 1.39 86.2 1.17
consistency 72.2 5.97 81.7 3.13 84.8 3.57
diversity 50.5 9.53 73.5 5.92 83.5 5.45
tailness + consistency 76.3 4.51 84.3 2.51 85.7 2.72
consistency + diversity 72.1 6.45 78.6 5.56 84.6 4.08
tailness + diversity 75.5 5.75 81.7 3.01 85.6 2.14
tailness + consistency + diversity 78.0 4.28 85.9 1.51 86.9 1.61

SimCLR [4] Vanilla 34.8 24.3 51.7 19.9 80.3 4.16
tailness 61.9 3.52 79.8 1.70 84.5 2.01
consistency 70.8 5.46 78.1 2.89 84.7 2.24
diversity 45.9 8.49 68.3 6.46 83.5 3.92
tailness + consistency 73.0 4.24 83.0 2.43 85.9 2.46
consistency + diversity 71.1 6.49 75.6 4.47 83.9 3.51
tailness + diversity 71.9 4.98 81.1 2.92 85.3 2.94
tailness + consistency + diversity 75.7 4.33 83.2 2.06 86.1 2.25

BYOL [6] Vanilla 35.9 7.25 65.9 9.15 85.6 2.51
tailness 64.2 4.26 81.9 1.71 86.4 0.871
consistency 71.0 5.45 80.2 3.22 87.0 2.08
diversity 47.5 6.29 70.7 5.48 85.7 2.36
tailness + consistency 73.7 4.74 83.3 2.01 87.7 1.25
consistency + diversity 70.9 6.08 76.0 4.55 86.1 1.93
tailness + diversity 72.2 5.81 82.6 3.12 86.4 1.33
tailness + consistency + diversity 77.1 4.84 85.3 2.06 88.1 0.994

ISD [24] Vanilla 45.8 17.2 71.0 4.29 85.3 2.97
tailness 71.8 2.80 79.2 1.47 87.1 1.02
consistency 78.8 3.98 80.2 2.90 87.3 1.94
diversity 54.5 8.03 77.1 6.90 86.2 2.58
tailness + consistency 79.6 2.99 83.0 1.93 88.2 1.24
consistency + diversity 75.1 4.72 77.8 3.65 86.5 2.45
tailness + diversity 74.8 7.98 82.3 2.02 87.2 1.35
tailness + consistency + diversity 80.1 3.00 83.8 1.95 88.6 1.20

MONA (ours) Vanilla 49.3 7.11 74.2 3.89 86.5 1.92
tailness 75.1 1.83 83.1 0.602 87.8 0.577
consistency 81.5 2.78 84.2 1.86 88.4 1.33
diversity 62.8 3.97 78.2 3.07 86.6 1.88
tailness + consistency 81.2 2.19 88.1 0.864 90.1 0.966
consistency + diversity 81.8 3.29 80.2 2.11 86.9 1.67
tailness + diversity 78.6 3.33 85.0 0.913 89.5 0.673
tailness + consistency + diversity 82.6 2.03 88.8 0.622 90.7 0.864

Does k-nearest neighbour in global feature space help?

Prior work suggests that the use of stronger augmentations and nearest neighbours can be very effective tools for learning additional invariances [21]; that is, both the specific number of nearest neighbours and the specific augmentation strategies matter for superior performance. In this subsection, we study the relationship between the k-nearest neighbours in the global feature space and the behavior of our GLCon on downstream medical image segmentation. We first follow the same augmentation strategies as [21] (more analysis on data augmentation can be found in Section 4.4), and then conduct ablation studies on how the choice of k influences the performance of GLCon. Specifically, we run GLCon on the ACDC dataset at the 5% labeled ratio with k ∈ {3, 5, 7, 10, 12}. Figure 9(a) shows the ablation study of the k-nearest neighbours in the global feature space on segmentation performance. As shown, GLCon with k = 5, 7, 10 has almost identical performance (k = 5 performs slightly better than the other two settings), and all three outperform the remaining settings. Compared to randomly selected samples, GLCon, through the use of nearest neighbours, is capable of finding diverse yet semantically consistent anatomical features from the entire dataset, which in turn gives better segmentation performance.
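The k-nearest-neighbour mining studied here can be sketched as a cosine-similarity lookup over a bank of global embeddings; the 36-entry bank and 512-dimensional embeddings mirror the settings in Section 4.2, while the function and variable names are our own illustrative choices:

```python
import numpy as np

def nearest_neighbours(query, bank, k=5):
    """Indices of the k bank entries closest to the query embedding
    under cosine similarity (all embeddings L2-normalized first)."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                        # cosine similarity to every bank entry
    return np.argsort(-sims)[:k]        # most similar first

rng = np.random.default_rng(0)
bank = rng.standard_normal((36, 512))   # memory bank of 36 global embeddings
query = bank[7] + 0.01 * rng.standard_normal(512)  # a slightly perturbed copy of entry 7
print(nearest_neighbours(query, bank, k=5)[0])     # entry 7 ranks first
```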

Fig. 9.

Effects of k-nearest neighbour in global feature space, mined view-set size, and mined view patch size. We report Dice and ASD of GLCon on the ACDC dataset at the 5% labeled ratio. All the experiments are run with three different random seeds.

Ablation Study of Mined View-Set Size.

We then conduct ablation studies on how the mined view-set size in GLCon influences segmentation performance. We run GLCon on the ACDC dataset at the 5% labeled ratio with the mined view-set size in {12, 18, 24, 30, 36, 42, 48}. The results are summarized in Figure 9(b). As shown, GLCon trained with view-set sizes of 36 and 42 performs similarly to or better than all other settings, and the model with a view-set size of 36 achieves the highest performance.

Ablation Study of Mined View Size.

Lastly, we study the influence of the mined view size on segmentation performance. Specifically, we run GLCon on the ACDC dataset at the 5% labeled ratio with the mined view size in {8, 16, 32, 64, 128}. Figure 9(c) shows the results. We observe that GLCon trained with mined view sizes of 32 and 64 has similar segmentation ability, and both achieve superior performance compared to the other settings; a mined view size of 64 works best for GLCon.

Conclusion.

Given the above ablation studies, we set k, the mined view-set size, and the patch size to 5, 36, and 64 × 64 in our experiments, respectively. These settings yield satisfactory segmentation performance.
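Putting the chosen values together, mining a view-set amounts to taking same-location crops from two aligned prediction maps; the sketch below is an illustrative stand-in for that step (function and variable names are ours), not the paper's implementation:

```python
import numpy as np

def same_location_crops(a, b, n_views=36, size=64, seed=0):
    """Create n_views pairs of crops taken at identical random locations
    from two aligned prediction maps (e.g. the two branches' logits)."""
    rng = np.random.default_rng(seed)
    h, w = a.shape[-2], a.shape[-1]
    views = []
    for _ in range(n_views):
        y = int(rng.integers(h - size + 1))   # same (y, x) for both maps,
        x = int(rng.integers(w - size + 1))   # so the crops stay aligned
        views.append((a[..., y:y + size, x:x + size],
                      b[..., y:y + size, x:x + size]))
    return views

teacher = np.random.rand(4, 128, 128)   # C x H x W logits
student = np.random.rand(4, 128, 128)
views = same_location_crops(teacher, student)
print(len(views), views[0][0].shape)    # 36 (4, 64, 64)
```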

4.8. Ablation Study of Anatomical Contrastive Reconstruction

In this section, we give a detailed analysis of the choice of parameters in the anatomical contrastive reconstruction fine-tuning, and take a deeper look at how they contribute to the final segmentation performance. All training hyperparameters are the same across the three benchmark datasets. All experiments are run with three different random seeds, and the reported results are calculated on the validation set.

Ablation Study of the Total Loss ℒtotal.

Proper choices of the hyperparameters in the total loss ℒtotal (see Section 3.2) play a significant role in improving overall segmentation quality. We hence conduct a fine-grained analysis of these hyperparameters. In practice, we fine-tune the models with three independent runs and use grid search to select the hyperparameters. Specifically, we run MONA on the ACDC dataset at the 5% labeled ratio with λ1 ∈ {0.005, 0.001, 0.01, 0.05, 0.1} and λ2, λ3, λ4 ∈ {0.1, 0.2, 0.5, 1.0, 2.0, 10.0}. We summarize the results in Figure 10, and take the best setting λ1=0.01, λ2=1.0, λ3=1.0, λ4=1.0.
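The searched objective reduces to a weighted sum of four terms; the component names in this sketch are hypothetical placeholders, since the actual loss terms are defined in Section 3.2 of the paper:

```python
# Hypothetical component names (l_contrast, l_sup, l_consistency, l_diversity);
# the actual loss terms are defined in Section 3.2 of the paper.
def total_loss(l_contrast, l_sup, l_consistency, l_diversity,
               lam1=0.01, lam2=1.0, lam3=1.0, lam4=1.0):
    """Weighted sum using the best setting found by grid search."""
    return (lam1 * l_contrast + lam2 * l_sup
            + lam3 * l_consistency + lam4 * l_diversity)

# 0.01*2.0 + 1.0*0.5 + 1.0*0.3 + 1.0*0.2 = 1.02
print(round(total_loss(2.0, 0.5, 0.3, 0.2), 2))
```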

Fig. 10.

Effects of hyperparameters λ1, λ2, λ3, λ4. We report Dice and ASD of MONA on the ACDC dataset at the 5% labeled ratio. All the experiments are run with three different random seeds.

Ablation Study of Confidence Threshold δθ.

We then assess the influence of δθ on segmentation performance. Specifically, we run MONA on the ACDC dataset at the 5% labeled ratio with the confidence threshold δθ ∈ {0.85, 0.88, 0.91, 0.94, 0.97, 1.0}. Figure 11(a) shows the results. As we can see, MONA with δθ = 0.97 outperforms the other settings.
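The thresholding itself amounts to masking out pixels whose maximum class probability falls below δθ; a minimal sketch, assuming a class-first probability layout (the tiny two-class example is illustrative only):

```python
import numpy as np

def confident_pseudo_labels(probs, delta=0.97):
    """Keep only pixels whose maximum softmax probability exceeds the
    confidence threshold; the rest are masked out with label -1."""
    conf = probs.max(axis=0)            # per-pixel max class probability
    labels = probs.argmax(axis=0)
    labels[conf < delta] = -1           # ignore low-confidence pixels
    return labels

probs = np.array([[[0.99, 0.60],
                   [0.50, 0.98]],
                  [[0.01, 0.40],
                   [0.50, 0.02]]])      # 2 classes, 2x2 pixels
print(confident_pseudo_labels(probs))   # pixels below 0.97 are masked with -1
```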

Fig. 11.

Effects of confidence threshold δθ, K-nearest neighbour constraint, and output embedding dimension. We report Dice and ASD of MONA on the ACDC dataset at the 5% labeled ratio. All the experiments are run with three different random seeds.

Ablation Study of K-Nearest Neighbour Constraint.

Next, we conduct ablation studies on how the choice of K in the K-nearest neighbour constraint influences segmentation performance. Specifically, we run MONA on the ACDC dataset at the 5% labeled ratio with K ∈ {3, 5, 7, 10, 12}. Figure 11(b) shows the results. As we can see, MONA with K = 5 achieves the best performance among these settings.

Ablation Study of Output Embedding Dimension.

Finally, we study the influence of the output embedding dimension on the segmentation performance of MONA. Specifically, we run MONA on the ACDC dataset at the 5% labeled ratio with the output embedding dimension in {64, 128, 256, 512, 768}. Figure 11(c) shows the results. As we can see, MONA with an output embedding dimension of 512 outperforms the other settings.

Conclusion.

Given the above ablation studies, we select λ1=0.01, λ2=1.0, λ3=1.0, λ4=1.0, δθ=0.97, K=5, and an output embedding dimension of 512 in our experiments. These settings provide the best segmentation performance across different labeled ratios.

5. Conclusion

In this paper, we have presented MONA, a semi-supervised contrastive learning framework for 2D medical image segmentation. We start from the observations that medical image data always exhibit a long-tail class distribution, and that the same anatomical objects (e.g., liver regions from two patients) are more similar to each other than different objects (e.g., liver and tumor regions). We expand upon this idea by introducing an anatomical contrastive formulation, as well as equivariance and invariance constraints. Both empirical and theoretical studies show that we can formulate a generic set of principles that allows us to learn meaningful representations across different anatomical features, which can dramatically improve segmentation quality and alleviate the training memory bottleneck. Extensive experiments on three datasets demonstrate the superiority of our proposed framework in long-tailed medical data regimes with extremely limited labels. We believe our results contribute to a better understanding of medical image segmentation and point to new avenues for handling long-tailed medical image data in realistic clinical applications.

Supplementary Material

supp1-3461321

Biographies


Chenyu You is a Ph.D. student at Yale University. He received the B.S. degree from Rensselaer Polytechnic Institute and the M.Sc. degree from Stanford University. His research interests are broadly in the area of machine learning theory and algorithms, intersecting the fields of computer and medical vision, natural language processing, and signal processing. He has published several works in NeurIPS, ICLR, CVPR, MICCAI, ACL, EMNLP, NAACL, AAAI, IJCAI, and KDD. He has been recognized as an Outstanding/Distinguished Reviewer of CVPR, MICCAI, IEEE Transactions on Medical Imaging (TMI), and Medical Physics. He serves as an area chair at MICCAI, and has co-led workshops such as the ICML workshop on foundation models for healthcare.


Weicheng Dai is a Postgraduate Associate at Yale University. He received the B.S. degree from Southeast University and the M.Sc. degree from New York University. His research interests include computer vision, medical image analysis, natural language processing, and theoretical machine learning. He has published papers at top-tier conferences such as MICCAI and IPMI. He has served as a reviewer for IEEE TMI, MICCAI, AAAI, and IJCAI.


Fenglin Liu is a Ph.D. student at the University of Oxford. His research interests include natural language processing (NLP), especially vision-and-language, machine learning, and their clinical applications, i.e., clinical NLP. He has published papers in top-tier journals and conferences, e.g., TPAMI, NeurIPS, CVPR, ACL, EMNLP, and NAACL. He has served as a senior program committee member for IJCAI and was recognized as a Distinguished/Outstanding Reviewer of CVPR, AAAI, and IJCAI.


Yifei Min is a Ph.D. student in Statistics and Data Science at Yale University. He received the B.S. degree from the University of Hong Kong and the M.A. degree from the University of Pennsylvania. He has a broad research interest in machine learning theory, with a focus on online learning, reinforcement learning, and medical imaging analysis. Several of his works have been published in NeurIPS, ICLR, ICML, UAI, and AISTATS.


Nicha C. Dvornek is an Assistant Professor of Radiology & Biomedical Imaging and Biomedical Engineering at Yale University. She received the B.S. in Biomedical Engineering at Johns Hopkins University and the M.S., M.Phil., and Ph.D. in Engineering and Applied Science at Yale University. She completed postdoctoral training in Diagnostic Radiology at Yale and as a T32 Fellow with the Yale Child Study Center. Her recent work focuses on the development and application of novel machine learning approaches to medical image processing and analysis, with applications spanning from neurodevelopmental disorders to cancer. Dr. Dvornek and her team’s work has been recognized by multiple paper and poster awards at international meetings. She is a member of the MICCAI society and on the board of the Women in MICCAI. She serves as an associate editor for Computerized Medical Imaging and Graphics and the Journal of Medical Imaging, an area chair for MICCAI and MIDL conferences, and a reviewer for multiple journals and conferences such as Medical Image Analysis, IEEE Transactions on Medical Imaging, NeurIPS, and CVPR.


Xiaoxiao Li received the Ph.D. degree from Yale University, New Haven, CT, USA, in 2020. She was a Post-Doctoral Research Fellow with the Computer Science Department, Princeton University, Princeton, NJ, USA. Since August 2021, she has been an Assistant Professor in the Department of Electrical and Computer Engineering (ECE), University of British Columbia (UBC), Vancouver, BC, Canada. In the last few years, she has published over 30 papers in leading machine learning conferences and journals, including NeurIPS, ICML, ICLR, MICCAI, IPMI, BMVC, IEEE Transactions on Medical Imaging, and Medical Image Analysis. Her research interests range across the interdisciplinary fields of deep learning and biomedical data analysis, aiming to improve the trustworthiness of artificial intelligence (AI) systems for health care. Dr. Li's work has been recognized with several best paper awards at international conferences.


David A. Clifton is the Royal Academy of Engineering Chair of Clinical Machine Learning at the University of Oxford, and OCC Fellow in AI & Machine Learning at Reuben College, Oxford. He was the first AI scientist to be appointed to an NIHR Research Professorship, which is the UK medical research community’s “flagship Chair programme”. He is a Fellow of the Alan Turing Institute, Research Fellow of the Royal Academy of Engineering, Visiting Chair in AI for Healthcare at the University of Manchester, and a Fellow of Fudan University, China. He studied Information Engineering at Oxford’s Department of Engineering Science, supervised by Prof. Lionel Tarassenko CBE, Chair of Electrical Engineering. His research focuses on the development of machine learning for tracking the health of complex systems. His previous research resulted in patented systems for jet-engine health monitoring, used with the engines of the Airbus A380, the Boeing 787 “Dreamliner”, and the Eurofighter Typhoon. Since graduating from his DPhil in 2009, he has focused mostly on the development of AI-based methods for healthcare. Patents arising from this collaborative research have been commercialised via university spin-out companies OBS Medical, Oxehealth, and Sensyne Health, in addition to collaboration with multinational industrial bodies. He was awarded a Grand Challenge award from the UK Engineering and Physical Sciences Research Council, which is an EPSRC Fellowship that provides long-term strategic support for “future leaders in healthcare”. His research has been awarded over 35 academic prizes; in 2018, he was joint winner of the inaugural “Vice-Chancellor’s Innovation Prize”, which identifies the best interdisciplinary research across the entirety of the University of Oxford.

Lawrence Staib is Professor of Radiology & Biomedical Imaging, Biomedical Engineering, and Electrical Engineering at Yale University. He is Director of Undergraduate Studies in Biomedical Engineering at Yale. He is a Fellow of the American Institute for Medical and Biological Engineering and of the Medical Image Computing and Computer-Assisted Intervention Society. He is a Distinguished Investigator of the Academy for Radiology & Biomedical Imaging Research. He received his B.A. in Physics from Cornell University and his Ph.D. in Engineering and Applied Science from Yale University. He is a member of the editorial board of Medical Image Analysis and Associate Editor of IEEE Transactions on Biomedical Engineering. His research interests are in medical image analysis and machine learning.

James S. Duncan is the Ebenezer K. Hunt Professor of Biomedical Engineering and a Professor of Radiology & Biomedical Engineering, Electrical Engineering and Statistics & Data Science at Yale University. He is currently the Chair of the Department of Biomedical Engineering. Professor Duncan received his B.S.E.E. with honors from Lafayette College (1973), his M.S. (1975) in Electrical Engineering from the University of California, Los Angeles and his Ph.D. from the University of Southern California (1982) in Electrical Engineering. Dr. Duncan’s research efforts have been in the areas of computer vision, image processing, and medical imaging, with an emphasis on biomedical image analysis. These efforts have included the segmentation of deformable structure from 3D image data, the tracking of non-rigid motion/deformation from spatiotemporal images, and the development of strategies for image-guided intervention/surgery. Most recently, his efforts have focused on the development of image analysis strategies based on the integration of data-driven machine learning and physical model-based approaches. He has published over 325 peer-reviewed articles in these areas and has been the principal investigator on a number of peer-reviewed grants from both the National Institutes of Health and the National Science Foundation over the past 40 years. Professor Duncan is a Fellow of the Institute of Electrical and Electronic Engineers (IEEE), of the American Institute for Medical and Biological Engineering (AIMBE) and of the Medical Image Computing and Computer Assisted Intervention (MICCAI) Society. In 2012, he was elected to the Council of Distinguished Investigators, Academy of Radiology Research. In 2014, he was elected to the Connecticut Academy of Science & Engineering.
He currently serves as the co-Editor-in-Chief of Medical Image Analysis, and has served on the editorial boards of the IEEE Transactions on Medical Imaging, the Journal of Mathematical Imaging and Vision and the Proceedings of the IEEE. He is a past President of the MICCAI Society and in 2017 received the MICCAI Society’s Enduring Impact Award.

Footnotes

1.

Note that the subscript i is omitted for simplicity in the following contexts.

2.

Here we omit details of the local instance discrimination strategy for simplicity, because the global and local instance discrimination experimental setups are similar.

Contributor Information

Chenyu You, Yale University, New Haven 06510, U.S.A.

Weicheng Dai, Yale University, New Haven 06510, U.S.A.

Fenglin Liu, University of Oxford, OX3 7DQ Oxford, U.K.

Yifei Min, Yale University, New Haven 06510, U.S.A.

Nicha C. Dvornek, Yale University.

Xiaoxiao Li, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

David A. Clifton, University of Oxford, OX3 7DQ Oxford, U.K.; Oxford-Suzhou Centre for Advanced Research, Suzhou, China.

Lawrence Staib, Yale University, New Haven 06510, U.S.A.

James S. Duncan, Yale University, New Haven 06510, U.S.A.


Associated Data


Supplementary Materials

supp1-3461321