Self-supervised 3D anatomy segmentation using self-distilled masked image transformer (SMIT)

Jue Jiang; Neelam Tyagi; Kathryn Tringale; Christopher Crane; Harini Veeraraghavan

doi:10.1007/978-3-031-16440-8_53

Author manuscript; available in PMC: 2022 Dec 1.

Published in final edited form as: Med Image Comput Comput Assist Interv. 2022 Sep 16;13434:556–566. doi: 10.1007/978-3-031-16440-8_53

PMCID: PMC9714226 NIHMSID: NIHMS1846525 PMID: 36468915

Abstract

Vision transformers efficiently model long-range context and thus have demonstrated impressive accuracy gains in several image analysis tasks including segmentation. However, such methods need large labeled datasets for training, which are hard to obtain for medical image analysis. Self-supervised learning (SSL) has demonstrated success in medical image segmentation using convolutional networks. In this work, we developed a self-distillation learning with masked image modeling method to perform SSL for vision transformers (SMIT) applied to 3D multi-organ segmentation from CT and MRI. Our contribution combines a dense pixel-wise regression pretext task performed within masked patches, called masked image prediction, with masked patch token distillation to pre-train vision transformers. Our approach is more accurate and requires fewer fine-tuning datasets than other pretext tasks. Unlike prior methods, which typically used image sets arising from disease sites and imaging modalities corresponding to the target tasks, we used 3,643 CT scans (602,708 images) arising from head and neck, lung, and kidney cancers as well as COVID-19 for pre-training, and applied the pre-trained model to abdominal organ segmentation from MRI of pancreatic cancer patients as well as to segmentation of 13 different abdominal organs from a publicly available CT dataset. Our method showed clear accuracy improvement (average DSC of 0.875 from MRI and 0.878 from CT) with reduced requirement for fine-tuning datasets over commonly used pretext tasks. Extensive comparisons against multiple current SSL methods were done. Our code is available at: https://github.com/harveerar/SMIT.git.

Keywords: Self-supervised learning, segmentation, self-distillation, masked image modeling, masked embedding transformer

1. Introduction

Vision transformers (ViT)[1] efficiently model long-range contextual information using a multi-head self-attention mechanism, making them robust to occlusions, image noise, as well as domain and image contrast differences. Hence, ViTs have been shown to produce more accurate medical image segmentation than convolutional neural networks (CNN)[2, 3]. However, ViTs require a large number of labeled training datasets that are not commonly available in medical applications. Self-supervised learning (SSL) overcomes the aforementioned requirement by extracting visual information inherent in images from large unlabeled datasets, using pre-defined, annotation-free pretext tasks as surrogate supervision signals for pre-training[4–6]. Once pre-trained, the model can be re-purposed for a variety of tasks by fine-tuning with relatively few labeled sets.

The choice of pretext tasks is crucial for SSL to successfully mine useful image information. Image denoising to recover images from their corrupted versions using CNN-based autoencoders[7, 8], pseudo labels[8–10], and contrastive learning[11–14] using CNNs have been used as pretext tasks in medical image applications. Data augmentation strategies including jigsaw puzzles[11, 15], restoration of image contrast and local texture[7], predicting image rotations[5], and prediction of masked image slices[16] have also been successfully used as pretext tasks for medical image segmentation with convolutional networks. However, CNNs are less effective than transformers in their capacity to model long-range context. Hence, we combined ViT with SSL using masked image modeling (MIM) and self-distillation of concurrently trained teacher and student networks.

MIM has been combined with transformers in natural image analysis[17–21]. Knowledge distillation with a concurrently trained teacher has also been used for medical image segmentation by leveraging different imaging modalities (CT and MRI)[22, 23]. Self-distillation differs from knowledge distillation in its use of different augmented views of the same image[24]. It has been used for medical image classification[10] by combining contrastive learning with CNN encoders. Prior works have shown the ability to achieve highly accurate natural image classification and segmentation[19, 24] by combining self-distillation of a pair of teacher and student transformer encoders using MIM. It was also shown that combining global and local patch token embeddings[19] improved accuracy compared to pretext tasks that only extracted class tokens [CLS] to model global image embedding[24]. However, these methods ignored the dense pixel dependencies, which are essential for dense prediction tasks like segmentation. Hence, we introduced a masked image prediction (MIP) pretext task to predict pixel intensities within masked patches, combined with the local and global embedding tasks, for medical image segmentation.

Our contributions include: (i) an SSL approach using MIM and self-distillation that combines masked image prediction, masked patch token distillation, and global image token distillation for CT and MRI organ segmentation using transformers; (ii) a simple linear projection layer for medical image reconstruction to speed up pre-training, which we show is more accurate than a multi-layer decoder; (iii) SSL pre-training using a large set of 3,643 3D CTs arising from a variety of disease sites including head and neck, chest, and abdomen with different cancers (lung, naso/oropharynx, kidney) and COVID-19, applied to CT and MRI segmentation; and (iv) evaluation of various pretext tasks using transformer encoders with respect to fine-tuning data size requirements and segmentation accuracy.

2. Method

Goal:

Extract a universal representation of images for dense prediction tasks, given an unlabeled dataset of Q images.

Approach:

A visual tokenizer f_s(θ_s) implemented as a transformer encoder is learned via self-distillation using MIM pretext tasks in order to convert an image x into image tokens $\{x_{i}\}_{i = 1}^{N}$, N being the sequence length. MIM pretext tasks include masked image prediction (MIP) and masked patch token distillation (MPD). Self-distillation is performed by concurrently training an online teacher tokenizer model f_t(θ_t) with the same network structure as f_s(θ_s) serving as the student model. In addition, a global image token distillation (ITD) pretext task is done to match the global tokens extracted by f_t and f_s[24].

Suppose {u, v} are two augmented views of a 3D image x. N image patches are extracted from the images to create a sequence of image tokens[1], say $u = \{u_{i}\}_{i = 1}^{N}$. The image tokens are then corrupted by randomly masking image tokens based on a binary vector $m = \{m_{i}\}_{i = 1}^{N}$, $m_{i} \in \{0, 1\}$, with a probability p and then replacing them with the mask token[20] e_[MASK], giving the corrupted sequence $\tilde{u} = m ⊙ u$ with ${\tilde{u}}_{i} = e_{[MASK]}$ at m_i = 1 and ${\tilde{u}}_{i} = u_{i}$ at m_i = 0. The second augmented view v is also corrupted, but using a different mask vector instance m′, as $\tilde{v} = m^{'} ⊙ v$.
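The token corruption step can be illustrated with a minimal, standard-library Python sketch. It assumes an independent Bernoulli draw per token with probability p, toy two-dimensional tokens, and a fixed zero vector standing in for the learnable e_[MASK]; the function name `mask_tokens` and the toy views are hypothetical and are not taken from the released code.

```python
import random

def mask_tokens(tokens, mask_token, p, rng):
    """Corrupt a token sequence: token i is replaced by mask_token when m_i = 1."""
    # Assumption: each token is masked independently with probability p.
    mask = [1 if rng.random() < p else 0 for _ in tokens]
    corrupted = [mask_token if m == 1 else t for m, t in zip(mask, tokens)]
    return mask, corrupted

rng = random.Random(0)
# Two toy "augmented views" with 8 tokens of dimension 2 (illustrative values only).
view_u = [[float(i), float(i) + 0.5] for i in range(8)]
view_v = [[t[0] + 0.1, t[1] + 0.1] for t in view_u]
mask_token = [0.0, 0.0]  # stand-in for the learnable e_[MASK]

# Each view gets its own mask instance (m for u, m' for v).
mask_u, u_tilde = mask_tokens(view_u, mask_token, p=0.7, rng=rng)
mask_v, v_tilde = mask_tokens(view_v, mask_token, p=0.7, rng=rng)
```

In an actual transformer the mask token would be a learnable embedding of the same dimension as the patch tokens rather than a constant.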

Dense pixel dependency modeling using MIP:

MIP involves recovering the original image view u from corrupted $\tilde{u}$ , as $\hat{u} = h_{s}^{Pred} (f_{s} (\tilde{u}, θ_{s}))$ , where $h_{s}^{Pred}$ decodes the visual tokens produced by a visual tokenizer f_s(θ_s) into images (see Fig.1). MIP involves dense pixel regression of image intensities within masked patches using the context of unmasked patches. The MIP loss is computed as (dotted green arrow in Fig.1):

$$L_{MIP} = \sum_{i = 1}^{N} E \left\| m_{i} \cdot \left( h_{s}^{Pred} (f_{s} ({\tilde{u}}_{i}, θ_{s})) - u_{i} \right) \right\|_{1}$$

(1)

$h_{s}^{Pred}$ is a linear projection with one layer for dense pixel regression. A symmetrized loss using v and $\tilde{v}$ is combined to compute the total loss for L_MIP.
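As a worked illustration of Equation 1, the following sketch computes an L1 reconstruction error restricted to masked patches. It is a simplified reading: patches are flattened to short lists of intensities, the expectation over training samples is omitted, and `mip_loss` is a hypothetical helper name.

```python
def mip_loss(pred_patches, target_patches, mask):
    """Sum of absolute intensity errors over masked patches only (m_i = 1)."""
    total = 0.0
    for pred, target, m in zip(pred_patches, target_patches, mask):
        if m == 1:  # unmasked patches contribute nothing to the loss
            total += sum(abs(p - t) for p, t in zip(pred, target))
    return total

# Three toy patches with two voxels each; the first patch is unmasked.
target = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pred = [[0.0, 0.0], [0.3, 0.5], [0.9, 0.6]]
mask = [0, 1, 1]

loss_u = mip_loss(pred, target, mask)  # 0.1 (second patch) + 0.4 (third patch)
```

The symmetrized total would add the same quantity computed for the second view v with its own mask m′.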

Fig. 1: — SMIT: Self-distillation with masked image modeling for transformers using SSL. Two augmented views of 3D image patches are passed to a student (with masking) and teacher (without masking) networks. Teacher regularizes the student to extract the masked patch tokens through masked patch token distillation (MPD). Masked image prediction (MIP) and global image token [CLS] prediction (ITD) are additional pretext tasks. The teacher uses exponential moving average (EMA) for parameter updates.

Masked patch token self-distillation (MPD):

MPD is accomplished by optimizing a teacher f_t(θ_t) and a student visual tokenizer f_s(θ_s) such that the student network predicts the tokens of the teacher network. The student network f_s tokenizes the corrupted version of an image $\tilde{u}$ to generate visual tokens $ϕ^{'} = \{ϕ_{i}^{'}\}_{i = 1}^{N}$. The teacher network f_t tokenizes the uncorrupted version of the same image u to generate visual tokens $ϕ = \{ϕ_{i}\}_{i = 1}^{N}$. Similar to MIP, MPD focuses on accurate prediction of the masked patch tokens. Therefore, the loss is computed from masked portions (i.e. m_i = 1) using cross-entropy of the predicted patch tokens (dotted red arrow in Fig.1):

$$L_{MPD} = - \sum_{i = 1}^{N} m_{i} \cdot P_{t}^{Patch} (u_{i}, θ_{t}) \log \left( P_{s}^{Patch} ({\tilde{u}}_{i}, θ_{s}) \right),$$

(2)

where $P_{s}^{Patch}$ and $P_{t}^{Patch}$ are the patch token distributions for the student and teacher networks. They are computed by applying softmax to the outputs of $h_{s}^{Patch}$ and $h_{t}^{Patch}$. The sharpness of the token distribution is controlled using temperature terms τ_s > 0 and τ_t > 0 for the student and teacher networks, respectively. Mathematically, such a sharpening can be expressed (using the notation for the student network parameters) as:

$$P_{s}^{Patch} (u, θ_{s}) = \frac{\exp \left( h_{s}^{Patch} (f_{s} (u_{j}, θ_{s})) / τ_{s} \right)}{\sum_{j = 1}^{K} \exp \left( h_{s}^{Patch} (f_{s} (u_{j}, θ_{s})) / τ_{s} \right)}.$$

(3)

A symmetrized cross entropy loss corresponding to the other view v and $\tilde{v}$ is also computed and averaged to compute the total loss for MPD.
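Equations 2 and 3 can be sketched together in plain Python. This is a minimal illustration under stated assumptions: each token is represented by a short list of K projection-head outputs (toy values), the teacher centering operation is omitted, and the helper names `log_softmax_t` and `mpd_loss` are hypothetical.

```python
import math

def log_softmax_t(logits, tau):
    """Log of the temperature-scaled softmax; smaller tau sharpens the distribution."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]

def mpd_loss(teacher_logits, student_logits, mask, tau_t=0.04, tau_s=0.1):
    """Cross-entropy between teacher and student token distributions on masked tokens."""
    loss = 0.0
    for t_log, s_log, m in zip(teacher_logits, student_logits, mask):
        if m == 1:
            p_teacher = [math.exp(v) for v in log_softmax_t(t_log, tau_t)]
            log_p_student = log_softmax_t(s_log, tau_s)
            loss -= sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))
    return loss

# Two toy tokens with K = 3 head outputs each; only the second token is masked.
teacher_out = [[0.20, 0.10, 0.00], [0.00, 0.30, 0.10]]
student_out = [[0.10, 0.10, 0.05], [0.05, 0.20, 0.10]]
loss_mpd = mpd_loss(teacher_out, student_out, mask=[0, 1])
```

In training, the teacher branch would carry no gradient, and the same loss computed for the other view would be averaged in.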

Global image token self-distillation (ITD):

ITD is done by matching the global image embedding or class token [CLS] distribution $P_{s}^{[CLS]}$ extracted from the corrupted view $\tilde{u}$ by the student transformer network using $h_{s}^{[CLS]} (f_{s} (\tilde{u}, θ_{s}))$ with the token distribution $P_{t}^{[CLS]}$ extracted from the uncorrupted and different view v by the teacher network using $h_{t}^{[CLS]} (f_{t} (v, θ_{t}))$ (shown by the dotted blue arrow in Fig.1) as:

$$L_{ITD} = - \sum_{i = 1}^{N} m_{i} \cdot P_{t}^{[CLS]} (v_{i}, θ_{t}) \log \left( P_{s}^{[CLS]} ({\tilde{u}}_{i}, θ_{s}) \right)$$

(4)

Sharpening transforms are applied to $P_{t}^{[CLS]}$ and $P_{s}^{[CLS]}$ similar to Equation 3. A symmetrized cross entropy loss corresponding to the corrupted view $\tilde{v}$ and the other view u is also computed and averaged to compute the total loss for L_ITD.

Online teacher network update:

Teacher network parameters were updated using an exponential moving average (EMA) with momentum update, which has been shown to be feasible for SSL[24, 19], as: θ_t = λ_m θ_t + (1 − λ_m) θ_s, where λ_m is the momentum, which was updated using a cosine schedule from 0.996 to 1 during training. The total loss was L_total = L_MIP + λ_MPD L_MPD + λ_ITD L_ITD.
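The teacher update and the loss weighting can be sketched as follows. The exact form of the cosine momentum schedule is not given in the text, so the formula in `momentum_at` is an assumption (a common half-cosine ramp from 0.996 to 1); parameters are represented as flat lists of floats, and all function names are hypothetical.

```python
import math

def momentum_at(step, total_steps, base=0.996, final=1.0):
    """Assumed half-cosine ramp of the EMA momentum from base to final."""
    progress = step / max(1, total_steps)
    return final - (final - base) * (math.cos(math.pi * progress) + 1.0) / 2.0

def ema_update(teacher, student, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element by element."""
    return [lam * t + (1.0 - lam) * s for t, s in zip(teacher, student)]

def total_loss(l_mip, l_mpd, l_itd, lam_mpd=0.1, lam_itd=0.1):
    """L_total = L_MIP + lam_MPD * L_MPD + lam_ITD * L_ITD."""
    return l_mip + lam_mpd * l_mpd + lam_itd * l_itd

# Toy parameters: the teacher drifts slowly toward the student.
teacher = [1.0, 2.0]
student = [0.0, 4.0]
teacher = ema_update(teacher, student, momentum_at(step=0, total_steps=100))
```

Because the momentum reaches 1 at the end of training, the teacher effectively stops changing in the final steps under this schedule.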

Implementation details:

All the networks were implemented using the PyTorch library and trained on 4 NVIDIA V100 GPUs. SSL optimization was done using AdamW with a cosine learning rate scheduler, trained for 400 epochs with an initial learning rate of 0.0002 and warmup for 30 epochs. λ_MPD = 0.1 and λ_ITD = 0.1 were set experimentally. A default mask ratio of 0.7 was used. Centering and sharpening operations reduced the chances of degenerate solutions[24]. τ_s was set to 0.1 and τ_t was linearly warmed up from 0.04 to 0.07 in the first 30 epochs. A SWIN-small backbone[25] with an embedding size of 768, a window size of 4 × 4 × 4, and a patch size of 2 was used. The 1-layer decoder was implemented with a linear projection layer with the same number of output channels as the input image size. The network had 28.19M parameters. Following pre-training, only the student network was retained for fine-tuning and testing.
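The two warmup schedules mentioned above can be written down as small functions. This sketch makes assumptions that the text does not settle: the learning rate is taken to ramp linearly up to 0.0002 over the 30 warmup epochs and then follow a half-cosine decay to zero at epoch 400, and both schedules are evaluated per epoch rather than per iteration. The function names are hypothetical.

```python
import math

def learning_rate(epoch, base_lr=2e-4, warmup_epochs=30, total_epochs=400):
    """Assumed schedule: linear warmup to base_lr, then cosine decay to zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def teacher_temperature(epoch, start=0.04, end=0.07, warmup_epochs=30):
    """Linear warmup of the teacher temperature, constant afterwards."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

# Tabulate both schedules at a few epochs.
schedule = [(e, learning_rate(e), teacher_temperature(e)) for e in (0, 15, 30, 200, 399)]
```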

3. Experiments and Results

Training dataset:

SSL pre-training was performed using 3,643 CT patient scans containing 602,708 images. Images were sourced from patients with head and neck (N=837) and lung cancers (N=1455) from internal and external[26] datasets, as well as those with kidney cancers[27] (N=710) and COVID-19[28] (N=650). GPU memory limitations were addressed for training, fine-tuning, and testing by image resampling (1.5×1.5×2 mm voxel size) and cropping (128×128×128) to enclose the body region. Augmented views for SSL training were produced through randomly cropped 96×96×96 volumes, which resulted in 6×6×6 image patch tokens. A sliding window strategy with half window overlap was used for testing[2, 3]. Dataset I and pre-training CT datasets were pre-processed with intensity rescaling [−175 HU to 250 HU]. Dataset II (MRI) was subjected to histogram standardization, intensity clipping [0, 2000], and intensity normalization [0, 1].
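Two of the pre-processing and inference steps described above lend themselves to a short sketch: CT intensity rescaling and sliding-window placement with half-window overlap. The mapping of the clipped HU range to [0, 1], the 96-voxel test window, and the rule of adding a final window flush with the volume edge are assumptions made for illustration; `rescale_hu` and `window_starts` are hypothetical names.

```python
def rescale_hu(value, lo=-175.0, hi=250.0):
    """Clip a CT intensity to [lo, hi] HU and map it linearly to [0, 1] (assumed range)."""
    clipped = min(max(value, lo), hi)
    return (clipped - lo) / (hi - lo)

def window_starts(length, window=96, overlap=0.5):
    """Start indices along one axis for sliding windows with fractional overlap."""
    if length <= window:
        return [0]
    stride = max(1, int(window * (1.0 - overlap)))
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # last window sits flush with the edge
    return starts

# A 128-voxel axis is covered by two 96-voxel windows under these assumptions.
starts_128 = window_starts(128)
```

For a 3D volume the same start indices would be computed per axis and combined, with predictions averaged where windows overlap.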

CT abdomen organ segmentation (Dataset I): The pre-trained networks were fine-tuned to generate volumetric segmentation of 13 different abdominal organs from contrast-enhanced CT (CECT) scans using the publicly available beyond the cranial vault (BTCV)[32] dataset. Twenty-one randomly selected images were used for training and the remaining images were used for validation. Furthermore, results of blinded testing on 20 CECTs, evaluated on the grand challenge website, are also reported.

MRI upper abdominal organs segmentation (Dataset II): The SSL network was evaluated for segmenting abdominal organs at risk for pancreatic cancer radiation treatment, which included stomach, small and large bowel, liver, and kidneys. No MRI or pancreatic cancer scans were used for SSL pre-training. Ninety-two 3D T2-weighted MRIs (TR/TE = 1300/87 ms, voxel size of 1×1×2 mm³, FOV of 400×450×250 mm³), acquired with a pneumatic compression belt to suppress breathing motion, were analyzed. Fine-tuning used five-fold cross-validation, and results from the validation folds not used in training are reported.

Experimental comparisons:

SMIT was compared against representative SSL medical image analysis methods. Results from representative published methods on the BTCV testing set[30, 2, 3] are also reported. The SSL comparison methods were chosen to evaluate the impact of the pretext task on segmentation accuracy and included (a) local texture and semantics modeling using model genesis[7], (b) jigsaw puzzles[15], (c) contrastive learning[14] with (a),(b), (c) implemented on CNN backbone, (d) self-distillation using whole image reconstruction[24], (e) masked patch reconstruction[18] without self-distillation, (f) MIM using self-distillation[19] with (d),(e), and (f) implemented in a SWIN transformer backbone. Random initialization results are shown for benchmarking purposes using both CNN and SWIN backbones. Identical training and testing sets were used with hyper-parameters adopted from their default implementation.

CT segmentation accuracy:

As shown in Table 1, SMIT outperformed representative published methods including transformer-based segmentation[31, 3, 2]. SMIT was also more accurate than all evaluated SSL methods (Table 2) for most organs. Prior-guided contrast learning (PRCL)[14] was more accurate than SMIT only for gall bladder (0.797 vs. 0.787). SMIT was more accurate than self-distillation with MIM[19] (average DSC of 0.848 vs. 0.833) as well as masked image reconstruction without distillation[18] (0.848 vs. 0.830). Fig.2 shows a representative case with multi-organ segmentations produced by the various methods. SMIT was the most accurate method, including for organs with highly variable appearance and size such as the stomach and esophagus.
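The accuracies in this section are reported as Dice similarity coefficients (DSC), that is, twice the overlap between predicted and reference voxels of an organ divided by the sum of their sizes. A minimal sketch on flattened toy label maps follows; the helper name `dice` and the choice to return NaN when an organ is absent from both maps are assumptions, not the evaluation code used by the challenge website.

```python
def dice(pred, truth, label):
    """DSC for one organ label: 2 * |P and T| / (|P| + |T|) on flattened label maps."""
    p = [x == label for x in pred]
    t = [x == label for x in truth]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    denom = sum(p) + sum(t)
    # Assumption: an organ absent from both maps is reported as NaN, not 1.0.
    return 2.0 * inter / denom if denom else float("nan")

# Toy flattened label maps: 0 = background, 1 and 2 = two organs.
truth = [0, 1, 1, 2, 2, 2]
pred = [0, 1, 2, 2, 2, 0]

dsc_organ1 = dice(pred, truth, 1)  # 2 * 1 / (1 + 2)
dsc_organ2 = dice(pred, truth, 2)  # 2 * 2 / (3 + 3)
```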

Table 1:

Accuracy on BTCV standard challenge test set. SP: spleen, RK/LK: right & left kidney, GB: gall bladder, ESO: esophagus, LV: liver, STO: stomach, AOR: aorta, IVC: inferior vena cava, SPV: portal & splenic vein, Pan: Pancreas, AG: Adrenals.

Method	SP	RK	LK	GB	ESO	LV	STO	AOR	IVC	SPV	Pan	AG	AVG
ASPP[29]	0.935	0.892	0.914	0.689	0.760	0.953	0.812	0.918	0.807	0.695	0.720	0.629	0.811
nnUnet[30]	0.942	0.894	0.910	0.704	0.723	0.948	0.824	0.877	0.782	0.720	0.680	0.616	0.802
TrsUnet[31]	0.952	0.927	0.929	0.662	0.757	0.969	0.889	0.920	0.833	0.791	0.775	0.637	0.838
CoTr[2]	0.958	0.921	0.936	0.700	0.764	0.963	0.854	0.920	0.838	0.787	0.775	0.694	0.844
UNETR[3]	0.968	0.924	0.941	0.750	0.766	0.971	0.913	0.890	0.847	0.788	0.767	0.741	0.856
SMIT(rand)	0.959	0.921	0.947	0.746	0.802	0.972	0.916	0.917	0.848	0.797	0.817	0.711	0.850
SMIT(SSL)	0.967	0.945	0.948	0.826	0.822	0.976	0.934	0.921	0.864	0.827	0.851	0.754	0.878

Table 2:

CT and MRI segmentation accuracy comparisons to SSL methods. Rand-random; LB-Large bowel, SB - Small bowel.

Mod	Organ	CNN Rand	CNN MG[7]	CNN CPC[11]	CNN Cub++[15]	CNN PRCL[14]	SWIN Rand	SWIN DINO[24]	SWIN iBOT[19]	SWIN SSIM[18]	SWIN SMIT
CT	SP	0.930	0.950	0.940	0.926	0.937	0.944	0.946	0.948	0.950	0.963
CT	RK	0.892	0.934	0.916	0.928	0.919	0.926	0.931	0.936	0.934	0.950
CT	LK	0.894	0.918	0.903	0.914	0.921	0.905	0.913	0.919	0.913	0.943
CT	GB	0.605	0.639	0.718	0.715	0.797	0.694	0.730	0.777	0.761	0.787
CT	ESO	0.744	0.739	0.756	0.768	0.759	0.732	0.752	0.760	0.772	0.772
CT	LV	0.947	0.967	0.953	0.946	0.954	0.950	0.954	0.956	0.956	0.970
CT	STO	0.862	0.879	0.896	0.881	0.877	0.861	0.891	0.900	0.898	0.903
CT	AOR	0.875	0.909	0.900	0.892	0.894	0.885	0.906	0.901	0.905	0.913
CT	IVC	0.844	0.882	0.855	0.866	0.851	0.851	0.866	0.879	0.867	0.871
CT	SPV	0.727	0.739	0.731	0.734	0.760	0.725	0.752	0.759	0.754	0.784
CT	Pan	0.719	0.706	0.726	0.731	0.693	0.688	0.763	0.755	0.764	0.810
CT	RA	0.644	0.671	0.655	0.665	0.661	0.660	0.651	0.659	0.640	0.669
CT	LA	0.648	0.640	0.655	0.675	0.680	0.590	0.680	0.681	0.678	0.687
CT	AVG.	0.795	0.813	0.816	0.819	0.823	0.801	0.826	0.833	0.830	0.848
MR	LV	0.921	0.936	0.925	0.920	0.930	0.922	0.920	0.939	0.937	0.942
MR	LB	0.786	0.824	0.824	0.813	0.823	0.818	0.804	0.833	0.835	0.855
MR	SB	0.688	0.741	0.745	0.735	0.745	0.708	0.729	0.744	0.759	0.775
MR	STO	0.702	0.745	0.769	0.783	0.793	0.732	0.750	0.783	0.775	0.812
MR	LK	0.827	0.832	0.876	0.866	0.876	0.837	0.911	0.883	0.874	0.936
MR	RK	0.866	0.886	0.863	0.861	0.871	0.845	0.896	0.906	0.871	0.930
MR	AVG.	0.798	0.827	0.834	0.830	0.840	0.810	0.835	0.848	0.842	0.875

Fig. 2: — Segmentation performance of different methods on MRI abdomen organs.

MRI segmentation accuracy:

SMIT was more accurate than all other SSL-based methods for all evaluated organs (Table.2). SMIT produced more accurate segmentations than other methods even for small bowel, a difficult organ to segment due to the presence of closely packed bowel loops. Fig.2 shows a representative MRI case with segmentations produced by the various methods.

Ablation experiments:

All ablation and design experiments (1-layer decoder vs. multi-layer or ML decoder) were performed using the BTCV dataset and used the same SWIN backbone as SMIT. The ML decoder was implemented with five transpose convolution layers for up-sampling back to the input image resolution. Fig.4 shows the accuracy comparisons of networks pre-trained with different tasks including full image reconstruction, contrastive losses, pseudo labels[33], and various combinations of the losses (L_MIP, L_MPD, L_ITD). As shown, the accuracies for all the methods were similar for large organs depicting good contrast, which include liver, spleen, and left and right kidney (Fig.4(I)). On the other hand, organs with low soft-tissue contrast and high variability (Fig.4(II)) and small organs (Fig.4(III)) show larger differences in accuracy between methods, with SMIT achieving more accurate segmentations. Major blood vessels (Fig.4(IV)) also depict segmentation accuracy differences across methods, albeit less so than for small organs and those with low soft-tissue contrast. Importantly, both full image reconstruction and multi-layer decoder based MIP (ML-MIP) were less accurate than SMIT, which uses masked image prediction with a 1-layer linear projection decoder (Fig.4(II,III,IV)). MPD was the least accurate for organs with low soft-tissue contrast and high variability (Fig.4(II)), which was improved slightly by adding global image distillation (ITD). MIP alone (using the 1-layer decoder) was similarly accurate as SMIT and more accurate than other pretext task based segmentation including ITD[24] and MPD+ITD[19]. Lower MSE loss indicates better reconstruction, as shown in Fig.4 using the 1-layer vs. multi-layer decoder.

Fig. 4: — Accuracy variations by organ types using different pretext tasks. MIM pretext tasks are MIP using 1-layer decoder, ML-MIP using multi-layer decoder, MPD, and ITD combined with MIP or MPD.

Impact of pretext tasks on sample size for fine tuning:

SMIT was more accurate than all other SSL methods irrespective of the sample size used for fine-tuning (Fig.3(a)) and achieved faster convergence (Fig.3(c)). It outperformed iBOT[19], which uses MPD and ITD, indicating the effectiveness of MIP for SSL.

Impact of mask ratio on accuracy:

Fig.3(b) shows the impact of mask ratio (percentage of masked patches) in the corrupted image on both the accuracy of masked image reconstruction (computed as mean square error [MSE]) and segmentation (computed using the DSC metric). Accuracy increased initially with the mask ratio and then stabilized. Image reconstruction error also increased slightly with mask ratio. Fig.5 shows representative CT and MRI reconstructions produced using the default and multi-layer decoders, wherein our method was more accurate even in highly textured portions of the images containing multiple organs (additional examples are shown in Supplementary Fig 1). SMIT using the 1-layer decoder was more accurate than the multi-layer decoder for both CT (N=10 cases; MSE of 0.061 vs. 0.32) and MRI (N=92 cases; MSE of 0.062 vs. 0.34).

Fig. 5: — Reconstructed images using 1-layer vs. multi-layer decoder trained with SMIT from masked images (0.7 masking ratio).

4. Discussion and conclusion

In this work, we demonstrated the potential for SSL with 3D transformers for medical image segmentation. Our approach, which leverages CT volumes arising from highly disparate body locations and diseases, produced more accurate segmentations from CT and MRI scans than current SSL-based methods, especially for hard-to-segment organs with high appearance variability and small sizes. Importantly, masked image dense prediction improved segmentation accuracy with a reduced requirement of fine-tuning dataset size. Although pre-training used CT, the network showed the ability to segment T2-weighted MRI because T2-weighted MRI also captures anatomic information like CT. Higher soft-tissue contrast on MRI and histogram standardization to harmonize MRIs, combined with the use of transformers, known to be robust to domain differences[34], aided generalization with fine-tuning.

Supplementary Material

NIHMS1846525-supplement-supplement.pdf (219.3 KB, PDF)

References

1. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N: An image is worth 16×16 words: Transformers for image recognition at scale. In: Intl Conf Learning Representations. (2021)
2. Xie Y, Zhang J, Shen C, Xia Y: CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention. (2021) 171–180
3. Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B, Roth HR, Xu D: UNETR: Transformers for 3D medical image segmentation. In: IEEE/CVF Winter Conf. Applications of Computer Vision (2022) 1748–1758
4. Noroozi M, Favaro P: Unsupervised learning of visual representations by solving jigsaw puzzles. In: Proc. European Conf. Computer Vision, Springer (2016) 69–84
5. Komodakis N, Gidaris S: Unsupervised representation learning by predicting image rotations. In: Intl Conf Learning Representations. (2018)
6. He K, Fan H, Wu Y, Xie S, Girshick R: Momentum contrast for unsupervised visual representation learning. In: Proc IEEE/CVF Conf Computer Vision and Pattern Recognition. (2020) 9729–9738
7. Zhou Z, Sodha V, Pang J, Gotway MB, Liang J: Models genesis. Medical Image Analysis 67 (2021) 101840
8. Haghighi F, Taher MRH, Zhou Z, Gotway MB, Liang J: Learning semantics-enriched representation via self-discovery, self-classification, and self-restoration. In: Medical Image Computing and Computer Assisted Intervention, Springer (2020) 137–147
9. Chen L, Bentley P, Mori K, Misawa K, Fujiwara M, Rueckert D: Self-supervised learning for medical image analysis using image context restoration. Medical Image Analysis 58 (2019) 101539
10. Sun J, Wei D, Ma K, Wang L, Zheng Y: Unsupervised representation learning meets pseudo-label supervised self-distillation: A new approach to rare disease classification. In: Medical Image Computing and Computer Assisted Intervention, Springer (2021) 519–529
11. Taleb A, Loetzsch W, Danz N, Severin J, Gaertner T, Bergner B, Lippert C: 3D self-supervised methods for medical imaging. Advances in Neural Information Processing Systems 33 (2020) 18158–18172
12. Chaitanya K, Erdil E, Karani N, Konukoglu E: Contrastive learning of global and local features for medical image segmentation with limited annotations. Adv. in Neur. Inf. Proc Sys 33 (2020) 12546–12558
13. Feng R, Zhou Z, Gotway MB, Liang J: Parts2whole: Self-supervised contrastive learning via reconstruction. In: Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning. Springer (2020) 85–95
14. Zhou HY, Lu C, Yang S, Han X, Yu Y: Preservational learning improves self-supervised medical image models by reconstructing diverse contexts. In: IEEE/CVF Intl. Conf. Computer Vision (2021) 3499–3509
15. Zhu J, Li Y, Hu Y, Ma K, Zhou SK, Zheng Y: Rubik’s cube+: A self-supervised feature learning framework for 3D medical image analysis. Medical Image Analysis 64 (2020) 101746
16. Jun E, Jeong S, Heo DW, Suk HI: Medical Transformer: Universal brain encoder for 3D MRI analysis. arXiv preprint arXiv:2104.13633 (2021)
17. Li Z, Chen Z, Yang F, Li W, Zhu Y, Zhao C, Deng R, Wu L, Zhao R, Tang M, et al.: MST: Masked self-supervised transformer for visual representation. Adv. in Neu. Inf. Proc. Sys 34 (2021) 13165–13176
18. Xie Z, Zhang Z, Cao Y, Lin Y, Bao J, Yao Z, Dai Q, Hu H: SimMIM: A simple framework for masked image modeling. In: Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (2022) 9653–9663
19. Zhou J, Wei C, Wang H, Shen W, Xie C, Yuille A, Kong T: Image BERT pre-training with online tokenizer. In: Intl Conf. Learning Representations (2022)
20. Bao H, Dong L, Wei F: BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)
21. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R: Masked autoencoders are scalable vision learners. In: Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (2022) 16000–16009
22. Li K, Yu L, Wang S, Heng PA: Towards cross-modality medical image segmentation with online mutual knowledge distillation. Proc. AAAI 34(01) (Apr. 2020) 775–783
23. Jiang J, Rimner A, Deasy JO, Veeraraghavan H: Unpaired cross-modality educed distillation (CMEDL) for medical image segmentation. IEEE Trans Med Imaging (2021)
24. Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, Joulin A: Emerging properties in self-supervised vision transformers. In: IEEE/CVF Int Conf. Computer Vision (2021) 9650–9660
25. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B: Swin transformer: Hierarchical vision transformer using shifted windows. In: IEEE Int. Conf. Computer Vision (2021) 10012–10022
26. Aerts H, E. RV, Leijenaar RT, Parmar C, Grossmann P, Carvalho S, Lambin P: Data from NSCLC-radiomics. The Cancer Imaging Archive (2015)
27. Akin O, Elnajjar P, Heller M, Jarosz R, Erickson B, Kirk S, Filippini J: Radiology data from the cancer genome atlas kidney renal clear cell carcinoma [tcga-kirc] collection. The Cancer Imaging Archive (2016)
28. Harmon SA, Sanford TH, Xu S, Turkbey EB, Roth H, Xu Z, Yang D, Myronenko A, Anderson V, Amalou A, et al.: Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets. Nature Communications 11(1) (2020) 1–7
29. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proc European Conf. Computer Vision (2018) 801–818
30. Isensee F, Jaeger PF, Kohl SA, Petersen J, Maier-Hein KH: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2) (2021) 203–211
31. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y: TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
32. Landman B, Xu Z, Iglesias J, Styner M, Langerak T, Klein A: MICCAI multi-atlas labeling beyond the cranial vault–workshop and challenge (2015)
33. Chen X, Xie S, He K: An empirical study of training self-supervised vision transformers. In: IEEE/CVF Intl Conf. Computer Vision (2021) 9640–9649
34. Naseer MM, Ranasinghe K, Khan SH, Hayat M, Shahbaz Khan F, Yang MH: Intriguing properties of vision transformers. Advances in Neural Information Processing Systems 34 (2021)

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supplement

NIHMS1846525-supplement-supplement.pdf^{(219.3KB, pdf)}
