Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation

Chenyu You; Ruihan Zhao; Lawrence Staib; James S Duncan

doi:10.1007/978-3-031-16440-8_61

. Author manuscript; available in PMC: 2023 Jul 18.

Published in final edited form as: Med Image Comput Comput Assist Interv. 2022 Sep 16;13434:639–652. doi: 10.1007/978-3-031-16440-8_61

Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation

Chenyu You ¹, Ruihan Zhao ², Lawrence Staib ^1,^3,⁴, James S Duncan ^1,^3,⁴

PMCID: PMC10352821 NIHMSID: NIHMS1912991 PMID: 37465615

Abstract

Contrastive learning (CL) aims to learn useful representation without relying on expert annotations in the context of medical image segmentation. Existing approaches mainly contrast a single positive vector (i.e., an augmentation of the same image) against a set of negatives within the entire remainder of the batch by simply mapping all input features into the same constant vector. Despite the impressive empirical performance, those methods have the following shortcomings: (1) it remains a formidable challenge to prevent the collapsing problems to trivial solutions; and (2) we argue that not all voxels within the same image are equally positive since there exist the dissimilar anatomical structures with the same image. In this work, we present a novel Contrastive Voxel-wise Representation Learning (CVRL) method to effectively learn low-level and high-level features by capturing 3D spatial context and rich anatomical information along both the feature and the batch dimensions. Specifically, we first introduce a novel CL strategy to ensure feature diversity promotion among the 3D representation dimensions. We train the framework through bi-level contrastive optimization (i.e., low-level and high-level) on 3D images. Experiments on two benchmark datasets and different labeled settings demonstrate the superiority of our proposed framework. More importantly, we also prove that our method inherits the benefit of hardness-aware property from the standard CL approaches. Codes will be available soon.

Keywords: Contrastive Learning, Semi-Supervised Learning, Medical Image Segmentation

1. Introduction

Learning from just a few labeled examples while leveraging a large amount of unlabeled data is a long-standing pursuit in the machine learning community, which is especially crucial for the medical imaging domain. Generating reliable manual annotations of 3D imaging data at scale is expensive, time-consuming, and may require domain-specific expertise. Due to privacy concerns, another challenge in medical imaging is relatively small training datasets.

In the medical imaging domain, substantial efforts [36, 13, 19, 35, 2, 16, 4, 1, 32, 33, 31, 30, 22] have been devoted to incorporating unlabeled data to improve network performance due to the limited 3D data and annotations. The most common training techniques are adversarial learning and consistency loss as regularization terms to encourage unsupervised mapping. Recently, contrastive learning (CL) has drawn considerable attention to learning useful representations without expert supervision and shown remarkable performance in the medical image analysis domain [3, 8, 34]. The central idea [6, 28, 18, 5, 25, 3, 14, 29, 15] is to learn powerful representations invariant to data augmentations that maximize the agreement between instance embeddings from different augmentations of the same images. The major stream of subsequent work focuses on the choice of dissimilar pairs, which determine the quality of learned representations. The loss function used to contrast is chosen from several options, such as InfoNCE [20], Triplet [27], and so on. However, while remarkable, those methods assume that the repulsive effect of negatives can avoid collapsing along all dimensions to trivial solutions by explicitly using positive and negative pairs in the loss function. However, it has been empirically observed that such design may still collapse along certain dimensions (i.e., dimensional collapse), as noted in [9, 26]. Such scenarios can happen in predefined augmentations, which usually lead to better performance due to the inter-instance constraints while usually ignoring anatomically feasibility in the transformation.

In this paper, we present CVRL, a novel end-to-end semi-supervised framework to learn high-level contexts and local-level features in both the batch and the feature directions for 3D medical image segmentation. One blessing comes from the recent finding [10] in the context of image classification. The authors note that applying strong augmentation along the feature dimension may result in the dimensional collapse in CL. In other words, the augmented images are not “standardized” well and may easily admits collapsed solutions (e.g., generating the same vector for all 3D scans), making it challenging or even infeasible in real-world clinical practice. Three key aspects distinguish from the recent success [3]. First, the standard CL encourages instance-level representation diversity within the batch. By contrast, we propose an anatomy-informed regularization among the feature dimension, as an intra-instance constraint to encourage feature diversity for its improved robustness, as illustrated in Figure 2. This design is appealing: (1) our idea is rather plug-and-play and can be easily compatible with existing inter-instance constraints; and (2) it inherits the strength of CL in learning useful representations to improve the feature spaces’ discriminative capability (See Appendix). Second, we propose to perform low-level contrast in a lower-dimensional 3D subspace, which could capture rich anatomic information; and (3) existing methods mainly perform local contrast in the image-level space, which may usually lead to sub-optimal segmentation quality due to the lack of spatial information. In contrast, if the proposed method can learn more generic representation from 3D context, it will unlock the appealing prospect of 3D nature in medical images (i.e., 3D volumetric scans). We propose an additional high-level contrast to exploit distinctive features in the embedding feature space by designing a new 3D projection head to encode 3D features. We also theoretically show our dimensional contrastive learning inherits the hardness-aware property in Appendix. The results demonstrate that our segmentation network outperforms the state-of-the-art methods on two benchmark datasets, and generates object segmentation with high-quality global shapes.

Fig. 2. — Comparison of (a) Voxel-wise Contrastive Objective (*i.e.*, batch dimension), and (b) Dimensional Contrastive Objective (*i.e.*, feature dimension). $b$ denotes batch size and $c$ is the feature dimension.

1.1. Overview

An overview of the architecture is illustrated in Figure 1. Our CVRL is based on GCL [3], and follows its most important components such as data augmentations. our goal is to learn stronger visual representations that avoid collapsing for improving the overall segmentation quality with limited annotation clinical scenarios. In the limited annotation setting, we train semi-supervised CVRL alongside two components - supervised and unsupervised learning objectives. Specifically, we propose a novel voxel-wise representation learning algorithm to learn low-level and high-level representations from 3D unlabeled data by regularizing the embedding space and exploring the geometric and spatial context of training voxels.

Fig. 1. — Overview of CVRL architecture. We learn rich dense voxel-wise representations by exploiting high-level context between voxels and volumetric regions and low-level correlations in both batch and feature dimensions in a semi-supervised manner.

In our problem setting, we consider a set of training data (3D images) including $N$ labeled data and $M$ unlabeled data, where $N ≪ M$ . For simplicity of exposition, we denote limited label data as $𝒟_{l} = {\{(x_{i}, y_{i})\}}_{i = 1}^{N}$ , and abundant unlabeled data as $𝒟_{u} = {\{(x_{i})\}}_{i = N + 1}^{N + M}$ , where $x_{i} \in R^{H \times W \times D}$ are volume inputs, and $y_{i} \in {0, 1}^{H \times W \times D}$ are ground-truth labels. Specifically, we adopt V-Net [35] as the network backbone $F (\cdot)$ , which consists of an encoder network and a decoder network. To maximize mutual information between latent representations, we design a projection head, that comprises one encoder network which share the similar architecture to the prior encoder network.

1.2. Unsupervised Contrastive Learning

A key component of CVRL is the ability to capture rich voxel-wise representations of high dimensional data by contrastive distillation. CVRL trains on the contrastive objective as an auxiliary loss during the volume batch updates. We utilize two contrastive learning objectives: (i) voxel-wise contrastive objective (ii) dimensional contrastive objective, each applied on a low-level feature and a high-level feature. The resulting combined training objective greatly improves the quality of learned representations.

First we establish some notations which will assist in explaining our approach. We denote a batch of input image $x_{1} \dots x_{b}$ , the teacher encoder network $f$ , the student encoder network $g$ and a set of data augmentation transformations $𝒯$ (e.g. random flipping, random rotation, random brightness, random contrast, random zooming, cube rearrangement and cube rotation). Here $f$ and $g$ use the same encoder architecture $e (\cdot)$ as introduced in section 1.1, but with different parameters. $z^{f} \in R^{h \times w \times d \times c} = f (t (x)))$ is a feature volume produced by the student encoder, whereas $z^{g} = g (t^{'} (x))$ is the corresponding feature volume produced by the teacher encoder, under a different random transformation $t^{'} \neq t$ .

Voxel-wise Contrastive Objective

Using standard contrastive learning, we encourage the feature extractor to produce representations that are invariant under data augmentations. On the other hand, features should still preserve locality: different voxels in a feature volume should contain their unique information. Specifically, as the learned feature volumes are divided into slices, we pull the pairs of voxels that come from two augmentations of the same image closer; voxels at different locations or from different images are pushed away. To learn a feature extractor that unlocks the desired properties, we use InfoNCE loss [20]:

ℒ_{q} = - l o g \frac{e x p (q \cdot k_{+} / τ)}{e x p (q \cdot k_{+} / τ) + \sum_{k \in 𝒦_{-}} e x p (q \cdot k / τ)}

(1)

The query $q \in R^{c}$ is a voxel in a student feature volume $z^{f}$ , and key $k \in R^{c}$ comes from the teacher feature volume $z^{g}$ . In particular, $k_{+}$ , the positive key, is the teacher feature voxel corresponding to the same location in the same image as the query $q$ . The set $𝒦_{-}$ contains all other keys in the mini-batch, from different locations and different inputs. $τ$ is a temperature hyper-parameter.

To obtain the voxel-wise contrastive loss, we take the average over the set of query voxels $𝒬_{v}$ consisting of all feature voxels in the mini-batch of student feature volumes:

ℒ_{v} = \frac{1}{|𝒬_{v}|} \sum_{q \in 𝒬_{v}} ℒ_{q}

(2)

Dimensional Contrastive Objective

Motivated by recent findings on dimensional collapse in contrastive learning [10, 9], we propose a dimensional contrastive objective to encourage different dimensions/channels in the feature voxels to contain diverse information. Given a batch of student feature volumes of shape $b \times h \times w \times d \times c$ , we group the first 4 dimensions to obtain a set of dimensional queries: $q \in 𝒬_{d} \subset R^{(b \times h \times w \times d)}$ where $|𝒬_{d}| = c$ , the number of channels in the feature volume. We define the $𝒦 = \{k_{+}\} \cup 𝒦_{-}$ in the same way, but using the corresponding batch feature volumes from the teacher encoder. In the dimensional contrastive setting, $k_{+}$ is defined as the key vector that corresponds to the same feature dimension as the query $q$ . The dimensional contrastive loss is the average over all query dimensions:

ℒ_{d} = \frac{1}{c} \sum_{q \in 𝒬_{d}} ℒ_{q}

(3)

We theoretically show our dimensional contrastive learning inherits the hardness-aware property in Appendix.

Consistency Loss

$ℒ_{c}$ Recent work [11, 24] show that using an exponential moving average (EMA) over network parameters is empirically shown to improve training stability and models’ final performance. With this insight, we introduce an EMA teacher model with parameters $θ$ as the moving-average of the parameters $θ^{'}$ from the original student network. Specifically, the architecture of EMA model follows the original model. At training step $t$ , the update rule follows $θ_{t} = m θ_{t - 1} + (1 - m) θ_{t}^{'}$ , where $m \in [0, 1)$ is momentum parameter. On the unlabeled set, we perform different perturbation operations on the unlabeled input volume sample $x^{u},$ e.g. adding noise $ϵ$ . To encourage training stability and performance improvements, we define consistency loss as:

ℒ^{con} = ℒ_{mse} (g (x^{u}; θ, ϵ), f (x^{u}; θ^{'}, ϵ^{'})),

(4)

where $ℒ_{mse}$ is the mean squared error loss.

Overall Training Objective

Our overall learning objective is to minimize a combination of supervised and unsupervised losses. On the labeled data, the supervised training objective is the linear combination of cross-entropy loss and dice loss. On the unlabeled dataset, unsupervised training objective consist of the consistency loss $ℒ_{c}$ , high-level contrastive loss $ℒ^{high}$ (i.e., the linear combination of $ℒ_{v}^{high}$ and $ℒ_{d}^{high}$ , and low-level contrastive loss $ℒ^{low}$ (i.e., the linear combination of $ℒ_{v}^{low}$ and $ℒ_{d}^{low})$ . The overall loss function is:

ℒ = ℒ^{sup} + α ℒ^{high} + β ℒ^{low} + γ ℒ^{con}

(5)

where $α$ , $β$ , $γ$ are hyperparameters that balance each term.

2. Experiments

Dataset and Pre-processing

We conduct our experiments on two benchmark datasets: the Left Atrium (LA) dataset from Atrial Segmentation Challenge¹ and the NIH pancreas CT dataset [21]. The LA dataset includes 100 3D gadolinium-enhanced MR imaging scans with annotations. The isotropic resolution of the scan is of 0.625 × 0.625 × 0.625mm³. We use 80 scans for training, and 20 scans for evaluation. For pre-processing, we crop all the scans at the heart region and normalized them into zero and unit variance with the size of sub-volumes 112 × 112 × 80mm³. The pancreas dataset consists of 82 contrast-enhanced abdominal CT scans. We use the same experimental setting [16] to randomly select 62 scans for training, and 20 scans for evaluation. For pre-processing, we first rescale the intensities of the CT images into the window [−125, 275] HU [38], and then re-sample all the data to a isotropic resolution of 1.0 × 1.0 × 1.0mm³. We crop all the scans at the pancreas region and normalized into zero and unit variance with the size of sub-volumes 96 × 96 × 96mm³. In this study, we conduct all experiments on LA and pancreas dataset under 10% and 20% labeled ratios.

Implementation Details

In our framework, we use V-Net as the network backbone for two networks. For data augmentation, we use standard data augmentation techniques [35, 23]. We empirically set the hyper-parameters $α$ , $β$ , $γ$ , $τ$ as 0.1, 0.1, 0.1, 1.0, respectively. We use SGD optimizer with a momentum 0.9 and weight decay 0.0005 to optimize the network parameters. The initial learning rate is set as 0.01 and divided by 10 every 3000 iterations. For EMA updates, we follow the experimental setting in [35], where the EMA decay rate $α$ is set to 0.999. We use the time-dependent Gaussian warming-up function $Ψ_{con} (t) = e x p (- 5 {(1 - t / t_{m a x})}^{2})$ to ramp up parameters, where $t$ and $t_{\max}$ denote the current and the maximum training step, respectively. For fairness, all evaluated methods are implemented in PyTorch, and trained for 10000 iterations on an NVIDIA 3090Ti GPU with batch size 6. In the testing stage, we adopt four metrics to evaluate the segmentation performance, including Dice coefficient (Dice), Jaccard Index (Jaccard), 95% Hausdorff Distance (95HD), and Average Symmetric Surface Distance (ASD).

Comparison with Other Semi-supervised Methods

We evaluate our CVRL with several state-of-the-art semi-supervised segmentation methods on different amounts of labeled data, including V-Net [17], DAN [36], DAP [37], UA-MT [35], LG-ER-MT [7], SASSNet [12], and Chaitanya et al. [3]. Table 1 compares our segmentation results with other methods.

Table 1.

Quantitative segmentation results on the LA dataset. The backbone network of all evaluated methods are V-Net.

Method	# scans used		Metrics
Method	Labeled	Unlabeled	Dice[%]	Jaccard[%]	ASD[voxel]	95HD[voxel]
V-Net [17]	80	0	91.14	83.82	1.52	5.75
V-Net	16	0	86.03	76.06	3.51	14.26
DAN [36]	16	64	87.52	78.29	2.42	9.01
DAP [37]	16	64	87.89	78.72	2.74	9.29
UA-MT [35]	16	64	88.88	80.21	2.26	7.32
LG-ER-MT [7]	16	64	89.56	81.22	2.06	7.29
SASSNet [12]	16	64	89.27	80.82	3.13	8.83
Chaitanya et al.[3]	16	64	89.94	81.82	2.66	7.23
CVRL(ours)	16	64	90.45	83.02	1.81	6.56
V-Net [17]	8	0	79.99	68.12	5.48	21.11
DAN [36]	8	72	80.87	70.65	3.72	15.96
DAP [37]	8	72	81.89	71.23	3.80	15.81
UA-MT [35]	8	72	84.25	73.48	3.36	13.84
LG-ER-MT [7]	8	72	85.43	74.95	3.75	15.01
SASSNet [12]	8	72	86.81	76.92	3.94	12.54
Chaitanya et al. [3]	8	72	84.95	74.77	3.70	10.68
CVRL(ours)	8	72	88.56	78.89	2.81	8.22

Open in a new tab

We first conduct experiments under 20% annotation ratios (16 labeled and 64 unlabeled). Under this setting, most above approaches achieve superior segmentation performance. CVRL gives better performance thanks to its low-level and high-level voxel-wise feature extraction. In particular, our proposed method outperforms other end-to-end semi-supervised methods in Dice (90.45%), Jaccard (83.02%), ASD (1.81), and 95HD (6.56).

To further evaluate the effectiveness of CVRL, we compare it with other methods in 10% annotation ratio (8 labeled and 72 unlabeled), as reported in Table 1. We observe consistent performance improvements over state-of-the-arts, in terms of Dice (88.56%), and Jaccard (78.89%). This evidence that i). taking voxel samples with contrastive learning yields better voxel embeddings; ii) both high-level and low-level relations are informative cues in both batch and feature dimension; iii) utilizing dimensional contrast is capable of consistently helping improve the segmentation performance. Leveraging all these aspects, it can observe consistent performance gains. As shown in Fig. 3, our method is capable of generating more accurate segmentation, considering the fact the improvement in such setting is difficult. This demonstrates i) the necessity of comprehensively considering both high-level and low-level contrast in both batch and feature dimension; and ii) efficacy of both inter-instance and intra-instance constraints. We also assess the performance of CVRL on Pancreas. We provide detailed evaluation results on Pancreas in Appendix Table A1. We find that CVRL significantly outperforms all the state-of-the-art methods by a significant margin. We noticed that our proposed CVRL could improve results by an especially significant margin, with up to 3.25 – 5.21% relative improvement in Dice.

Fig. 3. — Visual comparisons with other methods. As observed, our CVRL achieves superior performance with more accurate borders and shapes. Red and blue lines denote output predictions and ground-truth, respectively.

Ablation Study

We perform ablation experiments to validate the effectiveness of major components in our proposed method, including high-level and low-level contrastive strategy, and consistency loss. The quantitative results is reported in Table A2. We compare CVRL with its five variants under 10% annotation ratio (8 labeled and 72 unlabeled). Specially, the Baseline model refers to MT [24]. We gradually incorporate $ℒ^{high}$ , $ℒ^{low}$ , $ℒ^{con}$ , denoted as $B a s e l i n e + ℒ^{low}, B a s e l i n e + ℒ^{high}, B a s e l i n e + ℒ^{con}, B a s e l i n e + ℒ^{low} + ℒ^{high}, B a s e l i n e + ℒ^{low} + ℒ^{high} + ℒ^{con}$ (CVRL), respectively. As shown in the table, the Baseline network achieve 83.09%, 71.75%, 5.53, 19.65 in terms of Dice, Jaccard, ASD, and 95HD. With the progressive introduction of $ℒ^{high}$ , $ℒ^{low}$ , $ℒ^{con}$ , our proposed algorithm enjoys consistently improvement gains over the Baseline network, boosting Dice and Jaccard by 5.47%, 7.14%, respectively. Also, the metrics ASD and 95HD are reduced by 2.72 and 11.43, respectively. This further validates the effectiveness of each key component. We summarize the effects of hyperparameters in Appendix Figure A1.

3. Conclusion

In this work, we propose CVRL, a semi-supervised contrastive representation distillation framework by leveraging low-level and high-level cues to learn voxel-wise representations for volumetric medical image segmentation. Specifically, we propose to use voxel-wise contrastive and dimensional contrastive learning to ensure diversity promotion and exploit complex relations among training voxels. We also show the hardness-aware property is a key property for the success of our proposed dimensional contrastive learning. Experimental results demonstrate that our model yields state-of-the-art performance with generating more accurate boundaries with very limited annotations.

Supplementary Material

appendix

NIHMS1912991-supplement-appendix.pdf^{(392.2KB, pdf)}

Table 2.

Ablation study for the key component modules of CVRL on the LA dataset with 10% annotation ratio (8 labeled and 72 unlabeled).

Method	# scans used		Metrics
Method	Labeled	Unlabeled	Dice[%]	Jaccard[%]	ASD[voxel]	95HD[voxel]
Baseline	8	72	83.09	71.75	5.53	19.65
$Baseline + ℒ^{high}$	8	72	87.46	78.12	3.03	9.99
$Baseline + ℒ^{low}$	8	72	87.24	77.49	3.36	10.13
$Baseline + ℒ^{con}$	8	72	85.72	75.31	4.72	13.18
$Baseline + ℒ^{high} + ℒ^{low}$	8	72	88.14	78.38	3.02	9.58
$Baseline + ℒ^{high} + ℒ^{low} + ℒ^{con}$	8	72	88.56	78.89	2.81	8.22

Open in a new tab

Footnotes

http://atriaseg2018.cardiacatlas.org/

References

1.Bai W, Chen C, Tarroni G, Duan J, Guitton F, Petersen SE, Guo Y, Matthews PM, Rueckert D: Self-supervised learning for cardiac mr image segmentation by anatomical position prediction. In: MICCAI. pp. 541–549. Springer; (2019) [Google Scholar]
2.Bortsova G, Dubost F, Hogeweg L, Katramados I, de Bruijne M: Semi-supervised medical image segmentation via learning consistency under transformations. In: MICCAI. pp. 810–818. Springer; (2019) [Google Scholar]
3.Chaitanya K, Erdil E, Karani N, Konukoglu E: Contrastive learning of global and local features for medical image segmentation with limited annotations. In: NeurIPS (2020) [Google Scholar]
4.Chen S, Bortsova G, Juárez AGU, van Tulder G, de Bruijne M: Multitask attention-based semi-supervised learning for medical image segmentation. In: MICCAI. pp. 457–465. Springer; (2019) [Google Scholar]
5.Chen T, Kornblith S, Norouzi M, Hinton G: A simple framework for contrastive learning of visual representations. In: ICML. pp. 1597–1607. PMLR (2020) [Google Scholar]
6.Hadsell R, Chopra S, LeCun Y: Dimensionality reduction by learning an invariant mapping. In: CVPR. vol. 2, pp. 1735–1742. IEEE; (2006) [Google Scholar]
7.Hang W, Feng W, Liang S, Yu L, Wang Q, Choi KS, Qin J: Local and global structure-aware entropy regularized mean teacher model for 3d left atrium segmentation. In: MICCAI. pp. 562–571. Springer; (2020) [Google Scholar]
8.Hu X, Zeng D, Xu X, Shi Y: Semi-supervised contrastive learning for label-efficient medical image segmentation. In: MICCAI. Springer; (2021) [Google Scholar]
9.Hua T, Wang W, Xue Z, Ren S, Wang Y, Zhao H: On feature decorrelation in self-supervised learning. In: ICCV. pp. 9598–9608 (2021) [Google Scholar]
10.Jing L, Vincent P, LeCun Y, Tian Y: Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348 (2021) [Google Scholar]
11.Laine S, Aila T: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016) [Google Scholar]
12.Li S, Zhang C, He X: Shape-aware semi-supervised 3d semantic segmentation for medical images. In: MICCAI. pp. 552–561. Springer; (2020) [Google Scholar]
13.Li X, Yu L, Chen H, Fu CW, Heng PA: Semi-supervised skin lesion segmentation via transformation consistent self-ensembling model. arXiv preprint arXiv:1808.03887 (2018) [Google Scholar]
14.Liu F, Yang B, You C, Wu X, Ge S, Woicik A, Wang S: Gene ontology description generation: Task, dataset, model (2022)
15.Liu F, You C, Wu X, Ge S, Sun X, et al. : Auto-encoding knowledge graph for unsupervised medical report generation (2021)
16.Luo X, Chen J, Song T, Wang G: Semi-supervised medical image segmentation through dual-task consistency. In: AAAI (2020) [Google Scholar]
17.Milletari F, Navab N, Ahmadi SA: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3DV. pp. 565–571. IEEE (2016) [Google Scholar]
18.Misra I, Maaten L.v.d.: Self-supervised learning of pretext-invariant representations. In: CVPR. pp. 6707–6717 (2020) [Google Scholar]
19.Nie D, Gao Y, Wang L, Shen D: Asdnet: Attention based semi-supervised deep networks for medical image segmentation. In: MICCAI. pp. 370–378. Springer; (2018) [Google Scholar]
20.Oord A.v.d., Li Y, Vinyals O: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018) [Google Scholar]
21.Roth HR, Farag A, Turkbey E, Lu L, Liu J, Summers RM: Data from pancreas-ct. the cancer imaging archive (2016)
22.Sun S, Han K, Kong D, You C, Xie X: Mirnf: Medical image registration via neural fields. arXiv preprint arXiv:2206.03111 (2022) [DOI] [PubMed] [Google Scholar]
23.Taleb A, Loetzsch W, Danz N, Severin J, Gaertner T, Bergner B, Lippert C: 3d self-supervised methods for medical imaging. In: NeurIPS. pp. 18158–18172 (2020) [Google Scholar]
24.Tarvainen A, Valpola H: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: NeurIPS. pp. 1195–1204 (2017) [Google Scholar]
25.Tian Y, Krishnan D, Isola P: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019) [Google Scholar]
26.Wang F, Liu H: Understanding the behaviour of contrastive loss. In: CVPR. pp. 2495–2504 (2021) [Google Scholar]
27.Wang X, Gupta A: Unsupervised learning of visual representations using videos. In: ICCV. pp. 2794–2802 (2015) [Google Scholar]
28.Wu Z, Xiong Y, Stella XY, Lin D: Unsupervised feature learning via nonparametric instance discrimination. In: CVPR (2018) [Google Scholar]
29.You C, Chen N, Zou Y: Self-supervised contrastive cross-modality representation learning for spoken question answering. arXiv preprint arXiv:2109.03381 (2021) [Google Scholar]
30.You C, Dai W, Staib L, Duncan JS: Bootstrapping semi-supervised medical image segmentation with anatomical-aware contrastive distillation. arXiv preprint arXiv:2206.02307 (2022) [DOI] [PMC free article] [PubMed] [Google Scholar]
31.You C, Xiang J, Su K, Zhang X, Dong S, Onofrey J, Staib L, Duncan JS: Incremental learning meets transfer learning: Application to multi-site prostate mri segmentation. arXiv preprint arXiv:2206.01369 (2022) [DOI] [PMC free article] [PubMed] [Google Scholar]
32.You C, Yang J, Chapiro J, Duncan JS: Unsupervised wasserstein distance guided domain adaptation for 3d multi-domain liver segmentation. In: Interpretable and Annotation-Efficient Learning for Medical Image Computing. pp. 155–163. Springer International Publishing; (2020) [Google Scholar]
33.You C, Zhao R, Liu F, Chinchali S, Topcu U, Staib L, Duncan JS: Class-aware generative adversarial transformers for medical image segmentation. arXiv preprint arXiv:2201.10737 (2022) [PMC free article] [PubMed] [Google Scholar]
34.You C, Zhou Y, Zhao R, Staib L, Duncan JS: SimCVD: Simple contrastive voxel-wise representation distillation for semi-supervised medical image segmentation. IEEE Transactions on Medical Imaging, 2022 (2022) [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Yu L, Wang S, Li X, Fu CW, Heng PA: Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation. In: MICCAI. pp. 605–613. Springer; (2019) [Google Scholar]
36.Zhang Y, Yang L, Chen J, Fredericksen M, Hughes DP, Chen DZ: Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: MICCAI. pp. 408–416. Springer; (2017) [Google Scholar]
37.Zheng H, Lin L, Hu H, Zhang Q, Chen Q, Iwamoto Y, Han X, Chen YW, Tong R, Wu J: Semi-supervised segmentation of liver using adversarial learning with deep atlas prior. In: MICCAI. pp. 148–156. Springer; (2019) [Google Scholar]
38.Zhou Y, Li Z, Bai S, Wang C, Chen X, Han M, Fishman E, Yuille AL: Prior-aware neural network for partially-supervised multi-organ segmentation. In: ICCV. pp. 10672–10681 (2019) [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

appendix

NIHMS1912991-supplement-appendix.pdf^{(392.2KB, pdf)}

[R1] 1.Bai W, Chen C, Tarroni G, Duan J, Guitton F, Petersen SE, Guo Y, Matthews PM, Rueckert D: Self-supervised learning for cardiac mr image segmentation by anatomical position prediction. In: MICCAI. pp. 541–549. Springer; (2019) [Google Scholar]

[R2] 2.Bortsova G, Dubost F, Hogeweg L, Katramados I, de Bruijne M: Semi-supervised medical image segmentation via learning consistency under transformations. In: MICCAI. pp. 810–818. Springer; (2019) [Google Scholar]

[R3] 3.Chaitanya K, Erdil E, Karani N, Konukoglu E: Contrastive learning of global and local features for medical image segmentation with limited annotations. In: NeurIPS (2020) [Google Scholar]

[R4] 4.Chen S, Bortsova G, Juárez AGU, van Tulder G, de Bruijne M: Multitask attention-based semi-supervised learning for medical image segmentation. In: MICCAI. pp. 457–465. Springer; (2019) [Google Scholar]

[R5] 5.Chen T, Kornblith S, Norouzi M, Hinton G: A simple framework for contrastive learning of visual representations. In: ICML. pp. 1597–1607. PMLR (2020) [Google Scholar]

[R6] 6.Hadsell R, Chopra S, LeCun Y: Dimensionality reduction by learning an invariant mapping. In: CVPR. vol. 2, pp. 1735–1742. IEEE; (2006) [Google Scholar]

[R7] 7.Hang W, Feng W, Liang S, Yu L, Wang Q, Choi KS, Qin J: Local and global structure-aware entropy regularized mean teacher model for 3d left atrium segmentation. In: MICCAI. pp. 562–571. Springer; (2020) [Google Scholar]

[R8] 8.Hu X, Zeng D, Xu X, Shi Y: Semi-supervised contrastive learning for label-efficient medical image segmentation. In: MICCAI. Springer; (2021) [Google Scholar]

[R9] 9.Hua T, Wang W, Xue Z, Ren S, Wang Y, Zhao H: On feature decorrelation in self-supervised learning. In: ICCV. pp. 9598–9608 (2021) [Google Scholar]

[R10] 10.Jing L, Vincent P, LeCun Y, Tian Y: Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348 (2021) [Google Scholar]

[R11] 11.Laine S, Aila T: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016) [Google Scholar]

[R12] 12.Li S, Zhang C, He X: Shape-aware semi-supervised 3d semantic segmentation for medical images. In: MICCAI. pp. 552–561. Springer; (2020) [Google Scholar]

[R13] 13.Li X, Yu L, Chen H, Fu CW, Heng PA: Semi-supervised skin lesion segmentation via transformation consistent self-ensembling model. arXiv preprint arXiv:1808.03887 (2018) [Google Scholar]

[R14] 14.Liu F, Yang B, You C, Wu X, Ge S, Woicik A, Wang S: Gene ontology description generation: Task, dataset, model (2022)

[R15] 15.Liu F, You C, Wu X, Ge S, Sun X, et al. : Auto-encoding knowledge graph for unsupervised medical report generation (2021)

[R16] 16.Luo X, Chen J, Song T, Wang G: Semi-supervised medical image segmentation through dual-task consistency. In: AAAI (2020) [Google Scholar]

[R17] 17.Milletari F, Navab N, Ahmadi SA: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3DV. pp. 565–571. IEEE (2016) [Google Scholar]

[R18] 18.Misra I, Maaten L.v.d.: Self-supervised learning of pretext-invariant representations. In: CVPR. pp. 6707–6717 (2020) [Google Scholar]

[R19] 19.Nie D, Gao Y, Wang L, Shen D: Asdnet: Attention based semi-supervised deep networks for medical image segmentation. In: MICCAI. pp. 370–378. Springer; (2018) [Google Scholar]

[R20] 20.Oord A.v.d., Li Y, Vinyals O: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018) [Google Scholar]

[R21] 21.Roth HR, Farag A, Turkbey E, Lu L, Liu J, Summers RM: Data from pancreas-ct. the cancer imaging archive (2016)

[R22] 22.Sun S, Han K, Kong D, You C, Xie X: Mirnf: Medical image registration via neural fields. arXiv preprint arXiv:2206.03111 (2022) [DOI] [PubMed] [Google Scholar]

[R23] 23.Taleb A, Loetzsch W, Danz N, Severin J, Gaertner T, Bergner B, Lippert C: 3d self-supervised methods for medical imaging. In: NeurIPS. pp. 18158–18172 (2020) [Google Scholar]

[R24] 24.Tarvainen A, Valpola H: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: NeurIPS. pp. 1195–1204 (2017) [Google Scholar]

[R25] 25.Tian Y, Krishnan D, Isola P: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019) [Google Scholar]

[R26] 26.Wang F, Liu H: Understanding the behaviour of contrastive loss. In: CVPR. pp. 2495–2504 (2021) [Google Scholar]

[R27] 27.Wang X, Gupta A: Unsupervised learning of visual representations using videos. In: ICCV. pp. 2794–2802 (2015) [Google Scholar]

[R28] 28.Wu Z, Xiong Y, Stella XY, Lin D: Unsupervised feature learning via nonparametric instance discrimination. In: CVPR (2018) [Google Scholar]

[R29] 29.You C, Chen N, Zou Y: Self-supervised contrastive cross-modality representation learning for spoken question answering. arXiv preprint arXiv:2109.03381 (2021) [Google Scholar]

[R30] 30.You C, Dai W, Staib L, Duncan JS: Bootstrapping semi-supervised medical image segmentation with anatomical-aware contrastive distillation. arXiv preprint arXiv:2206.02307 (2022) [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] 31.You C, Xiang J, Su K, Zhang X, Dong S, Onofrey J, Staib L, Duncan JS: Incremental learning meets transfer learning: Application to multi-site prostate mri segmentation. arXiv preprint arXiv:2206.01369 (2022) [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] 32.You C, Yang J, Chapiro J, Duncan JS: Unsupervised wasserstein distance guided domain adaptation for 3d multi-domain liver segmentation. In: Interpretable and Annotation-Efficient Learning for Medical Image Computing. pp. 155–163. Springer International Publishing; (2020) [Google Scholar]

[R33] 33.You C, Zhao R, Liu F, Chinchali S, Topcu U, Staib L, Duncan JS: Class-aware generative adversarial transformers for medical image segmentation. arXiv preprint arXiv:2201.10737 (2022) [PMC free article] [PubMed] [Google Scholar]

[R34] 34.You C, Zhou Y, Zhao R, Staib L, Duncan JS: SimCVD: Simple contrastive voxel-wise representation distillation for semi-supervised medical image segmentation. IEEE Transactions on Medical Imaging, 2022 (2022) [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] 35.Yu L, Wang S, Li X, Fu CW, Heng PA: Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation. In: MICCAI. pp. 605–613. Springer; (2019) [Google Scholar]

[R36] 36.Zhang Y, Yang L, Chen J, Fredericksen M, Hughes DP, Chen DZ: Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: MICCAI. pp. 408–416. Springer; (2017) [Google Scholar]

[R37] 37.Zheng H, Lin L, Hu H, Zhang Q, Chen Q, Iwamoto Y, Han X, Chen YW, Tong R, Wu J: Semi-supervised segmentation of liver using adversarial learning with deep atlas prior. In: MICCAI. pp. 148–156. Springer; (2019) [Google Scholar]

[R38] 38.Zhou Y, Li Z, Bai S, Wang C, Chen X, Han M, Fishman E, Yuille AL: Prior-aware neural network for partially-supervised multi-organ segmentation. In: ICCV. pp. 10672–10681 (2019) [Google Scholar]

PERMALINK

Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation

Chenyu You

Ruihan Zhao

Lawrence Staib

James S Duncan

Abstract

1. Introduction

Fig. 2.

1.1. Overview

Fig. 1.

1.2. Unsupervised Contrastive Learning

Voxel-wise Contrastive Objective

Dimensional Contrastive Objective

Consistency Loss

Overall Training Objective

2. Experiments

Dataset and Pre-processing

Implementation Details

Comparison with Other Semi-supervised Methods

Table 1.

Fig. 3.

Ablation Study

3. Conclusion

Supplementary Material

Table 2.

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation

Chenyu You

Ruihan Zhao

Lawrence Staib

James S Duncan

Abstract

1. Introduction

Fig. 2.

1.1. Overview

Fig. 1.

1.2. Unsupervised Contrastive Learning

Voxel-wise Contrastive Objective

Dimensional Contrastive Objective

Consistency Loss

Overall Training Objective

2. Experiments

Dataset and Pre-processing

Implementation Details

Comparison with Other Semi-supervised Methods

Table 1.

Fig. 3.

Ablation Study

3. Conclusion

Supplementary Material

Table 2.

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases