Abstract
Vision transformer-based self-supervised learning (SSL) approaches have recently shown substantial success in learning visual representations from unannotated photographic images. However, their adoption in medical imaging remains limited because of the significant discrepancy between medical and photographic images. We therefore propose POPAR (patch order prediction and appearance recovery), a novel vision transformer-based self-supervised learning framework for chest X-ray images. POPAR leverages the benefits of vision transformers and the unique properties of medical imaging, aiming to simultaneously learn patch-wise high-level contextual features by correcting shuffled patch orders and fine-grained features by recovering patch appearance. We transfer POPAR pretrained models to diverse downstream tasks. The experimental results suggest that (1) POPAR outperforms state-of-the-art (SoTA) self-supervised models with vision transformer backbone; (2) POPAR achieves significantly better performance than all three SoTA contrastive learning methods; and (3) POPAR also outperforms fully-supervised pretrained models across architectures. In addition, our ablation study suggests that achieving better performance on medical imaging tasks requires both fine-grained and global contextual features. All code and models are available at GitHub.com/JLiangLab/POPAR.
Keywords: Vision transformer, Self-supervised learning, Medical image analysis, Transfer learning
1. Introduction
Self-supervised learning (SSL) aims to learn generalizable representations from (unannotated) images and transfer the learned representations to application-specific tasks to boost performance and reduce annotation efforts [19]. SSL has achieved state-of-the-art (SoTA) performance, sometimes even surpassing standard supervised ImageNet models in computer vision [5,6,26]. However, its popularity in medical imaging remains tepid, even in light of annotation dearth, a significant challenge facing deep learning for medical image analysis (MIA) [23]. We believe that this is due to the marked differences between medical and photographic images [12,22,27]. Photographic images, particularly those in ImageNet [8], typically have objects of considerable variations (cats, dogs, flowers, etc.), with distinctive components, centered in front of varying backgrounds (see Fig. 1). Therefore, object recognition in photographic images is based mainly on high-level features extracted from objects’ discriminative components [12,22]. By contrast, medical imaging protocols are designed for specified clinical purposes by focusing on particular body parts, generating images of remarkable similarity in anatomy across patients [14]. For example, the posteroanterior chest X-rays all look similar (Fig. 1). However, diagnostically valuable information may spread across entire images. Therefore, understanding high-level anatomical structures and their relative spatial relationships is essential for distinguishing diseases from normal anatomy [12]. Nevertheless, the fine-grained details throughout entire images are equally indispensable because identifying diseases, delineating organs, and isolating lesions may rely on subtle texture variations [12,16]. Therefore, a natural question is: how to learn integrated high-level and fine-grained features from medical images via self-supervision?
Fig. 1.

Photographic images typically have objects of considerable differences (bicycle, dog, flower, etc.) centered in front of varying backgrounds, while medical images generated from a particular imaging protocol are remarkably similar in anatomy (e.g., lung) across patients with diagnostic information spread across entire images (e.g., conditions as boxed in yellow). Analyzing medical images requires not only high-level knowledge of anatomical structures and their relationships but also fine-grained features across entire images. Our POPAR aims to meet this requirement by autodidactically learning high-level anatomical knowledge via patch order prediction and automatically gleaning fine-grained features via (patch) appearance recovery (see Fig. 2). (Color figure online)
To answer this question, we have developed a new SSL method called POPAR (Patch Order Prediction and Appearance Recovery), so named because it is equipped with two novel learning perspectives: (1) patch order prediction, which autodidactically learns high-level anatomical structures and their relative relationships, and (2) (patch) appearance recovery, which automatically gleans fine-grained features from medical images. We employ Swin Transformer as POPAR’s backbone because its hierarchical design enables multi-scale modeling, which naturally supports the two learning perspectives simultaneously.
For performance comparison and ablation studies, we have also trained three downgraded versions of POPAR: POPAR−1, POPAR−2, and POPAR−3 (see Table 1). Our extensive experiments demonstrate that (1) POPAR outperforms self-supervised ImageNet models with transformer backbone (see Table 2); (2) POPAR outperforms SoTA self-supervised pretrained models with CNN and transformer backbones (see Table 3); and (3) POPAR outperforms fully-supervised pretrained models across CNN and transformer architectures (see Table 4). This performance is attributed to our insights into the requirements of medical imaging tasks for global anatomical knowledge and fine-grained details in texture variations (see Sect. 5: Pretraining tasks).
Table 1.
We evaluate POPAR with Swin-B and ViT-B backbones using four different pretraining and finetuning image resolutions, denoted as PT and FT, respectively. POPAR, our official implementation, is the model with Swin-B backbone, pretraining and finetuning resolutions of 448 × 448, which yields the best performance on all target tasks. For performance comparison and ablation studies, we have pretrained three downgraded versions: (1) POPAR−1 with Swin-B backbone, pretraining resolution of 448 × 448, and finetuning resolution of 224 × 224; (2) POPAR−2 with Swin-B backbone and pretraining and finetuning resolutions of 224 × 224; and (3) POPAR−3 with ViT-B backbone and pretraining and finetuning resolutions of 224 × 224.
| Setup name | Backbone | Shuffled patches | PT → FT | ChestX-ray14 | CheXpert | ShenZhen | RSNA Pneumonia |
|---|---|---|---|---|---|---|---|
| POPAR−3 | ViT-B | 196 | 224² → 224² | 79.58 ± 0.13 | 87.86 ± 0.17 | 93.87 ± 0.63 | 73.17 ± 0.46 |
| POPAR−2 | Swin-B | 49 | 224² → 224² | 79.50 ± 0.20 | 87.63 ± 0.39 | 95.07 ± 1.22 | 73.07 ± 0.46 |
| POPAR−1 | Swin-B | 196 | 448² → 224² | 80.51 ± 0.15 | 88.25 ± 0.78 | 96.81 ± 0.40 | 73.58 ± 0.18 |
| POPAR | Swin-B | 196 | 448² → 448² | 81.81 ± 0.10 | 88.34 ± 0.50 | 97.33 ± 0.74 | 74.19 ± 0.37 |
Table 2.
Even POPAR−1 and POPAR−3 (two downgraded versions of POPAR) outperform SoTA self-supervised ImageNet models with transformer backbone in three target tasks. The best methods are bolded, while the second best are underlined.
| Backbone | Method | ChestX-ray14 | CheXpert | ShenZhen | RSNA Pneumonia |
|---|---|---|---|---|---|
| ViT-B | MoCoV3 | 79.20 ± 0.29 | 86.91 ± 0.77 | 85.71 ± 1.41 | 72.79 ± 0.52 |
| ViT-B | SimMIM | 79.55 ± 0.56 | 87.83 ± 0.46 | 92.74 ± 0.92 | 72.08 ± 0.47 |
| ViT-B | DINO | 78.37 ± 0.47 | 86.91 ± 0.44 | 87.83 ± 7.20 | 71.27 ± 0.45 |
| ViT-B | BEiT | 74.69 ± 0.39 | 85.81 ± 1.00 | 92.95 ± 1.25 | 72.78 ± 0.37 |
| ViT-B | MAE | 78.97 ± 0.65 | 87.12 ± 0.54 | 93.58 ± 1.18 | 72.85 ± 0.50 |
| ViT-B | POPAR−3 | 79.58 ± 0.13 | 87.86 ± 0.17 | 93.87 ± 0.63 | 73.17 ± 0.46 |
| Swin-B | SimMIM | 81.39 ± 0.18 | 87.50 ± 0.23 | 87.86 ± 4.92 | 73.15 ± 0.73 |
| Swin-B | POPAR−1 | 80.51 ± 0.15 | 88.16 ± 0.66 | 96.81 ± 0.40 | 73.58 ± 0.18 |
Table 3.
Even POPAR−1 (a downgraded version of POPAR) yields significant performance boosts (p < 0.05) in comparison with SoTA self-supervised methods that use ResNet-50 or transformer backbones. All models are pretrained on the ChestX-ray14 dataset. The best methods are bolded, while the second best are underlined.
| Backbone | Method | ChestX-ray14 | CheXpert | ShenZhen | RSNA Pneumonia |
|---|---|---|---|---|---|
| ResNet-50 | SimSiam | 79.62 ± 0.34 | 83.82 ± 0.94 | 93.13 ± 1.36 | 71.20 ± 0.60 |
| ResNet-50 | MoCoV2 | 80.36 ± 0.26 | 86.42 ± 0.42 | 92.59 ± 1.79 | 71.98 ± 0.82 |
| ResNet-50 | Barlow Twins | 80.45 ± 0.29 | 86.90 ± 0.62 | 92.17 ± 1.54 | 71.45 ± 0.82 |
| ViT-B | SimMIM | 79.20 ± 0.19 | 83.48 ± 2.43 | 93.77 ± 1.01 | 71.66 ± 0.75 |
| ViT-B | POPAR−3 | 79.58 ± 0.13 | 87.86 ± 0.17 | 93.87 ± 0.63 | 73.17 ± 0.46 |
| Swin-B | SimMIM | 79.09 ± 0.57 | 86.75 ± 0.96 | 93.03 ± 0.48 | 71.99 ± 0.55 |
| Swin-B | POPAR−1 | 80.51 ± 0.15 | 88.25 ± 0.78 | 96.81 ± 0.40 | 73.58 ± 0.18 |
Table 4.
POPAR models outperform fully-supervised pretrained models on ImageNet and ChestX-ray14 datasets in three target tasks across architectures. The best methods are bolded while the second best are underlined. Transfer learning is inapplicable when pretraining and target tasks are the same, denoted by “–”.
| Backbone | Initialization | ChestX-ray14 | CheXpert | ShenZhen | RSNA Pneumonia |
|---|---|---|---|---|---|
| ResNet-50 | Random | 80.40 ± 0.05 | 86.60 ± 0.17 | 90.49 ± 1.16 | 70.00 ± 0.50 |
| ResNet-50 | ImageNet-1K | 81.70 ± 0.15 | 87.17 ± 0.22 | 94.96 ± 1.19 | 73.04 ± 0.35 |
| ResNet-50 | ChestX-ray14 | – | 87.40 ± 0.26 | 96.32 ± 0.65 | 71.64 ± 0.37 |
| ViT-B | Random | 70.84 ± 0.19 | 80.78 ± 0.13 | 84.46 ± 1.65 | 66.59 ± 0.39 |
| ViT-B | ImageNet-21K | 77.55 ± 1.82 | 83.32 ± 0.69 | 91.85 ± 3.40 | 71.50 ± 0.52 |
| ViT-B | ChestX-ray14 | – | 84.37 ± 0.42 | 91.23 ± 0.81 | 66.96 ± 0.24 |
| ViT-B | POPAR−3 | 79.58 ± 0.13 | 87.86 ± 0.17 | 93.87 ± 0.63 | 73.17 ± 0.46 |
| Swin-B | Random | 74.29 ± 0.41 | 85.78 ± 0.01 | 85.83 ± 3.68 | 70.02 ± 0.42 |
| Swin-B | ImageNet-21K | 81.32 ± 0.19 | 87.94 ± 0.36 | 94.23 ± 0.81 | 73.15 ± 0.61 |
| Swin-B | ChestX-ray14 | – | 87.22 ± 0.22 | 91.35 ± 0.93 | 70.67 ± 0.18 |
| Swin-B | POPAR−1 | 80.51 ± 0.15 | 88.25 ± 0.78 | 96.81 ± 0.40 | 73.58 ± 0.18 |
In summary, we make the following main contributions:
- A novel vision transformer-based SSL framework that simultaneously learns global relationships of anatomical structures and fine-grained details embedded in medical images.
- A collection of pretrained models for transformer architectures (ViT-B and Swin-B) that yield SoTA performance on a set of MIA classification tasks.
- An extensive set of experiments that demonstrate POPAR’s superiority over SoTA supervised and self-supervised pretrained models across architectures.
2. Related Works and Novelties
Image Context Learning.
Image context has been shown to be a powerful source for learning visual representations via SSL. Multiple pretext tasks have been formulated to predict the context arrangement of image patches, including predicting the relative position of two image patches [10], solving Jigsaw puzzles [20], and playing Rubik’s cube [28]. These methods employ multi-Siamese CNN backbones as feature extractors, followed by additional feature aggregation layers for determining the relationships between the input patches. However, the feature aggregation layers are discarded after the pretraining step, and only the pretrained multi-Siamese CNNs are transferred to the target tasks. As a result, the learned relationships among image patches are mainly ignored in the target tasks. In contrast to these approaches, our POPAR uses the multi-head attention mechanism to capture the relationships among anatomical patterns embedded in image patches, which is fully transferable to target tasks.
Masked Image Modeling.
Inspired by masked language modeling [3,9], multiple vision transformer-based SSL methods have been developed for masked image modeling. BEiT [2] predicts the discrete tokens from masked images. SimMIM [25] and MAE [15] mask random patches from the input image and reconstruct the missing patches. While POPAR bears similarities to these methods in patch reconstruction, it distinguishes itself from them by (1) reconstructing correct image patches from misplaced patches or from transformed patches, and (2) predicting the correct positions of shuffled image patches for learning global contextual features.
Restorative Learning.
Restorative SSL methods aim to learn representations by recovering original images from their distorted versions. Multiple SSL methods have incorporated image restoration into their pretext tasks. Models Genesis [27] proposed four effective image transformations for restorative SSL in medical imaging. TransVW [13,14] introduced an SSL framework for learning semantic representations from consistent anatomical structures. CAiD [22] formulated a restoration task to boost instance discrimination SSL with context-aware representations. DiRA [12] integrates discriminative, restorative, and adversarial SSL to learn fine-grained representations via collaborative learning. However, none of these approaches learns anatomical relationships among image patches. By contrast, POPAR employs a transformer backbone to integrate restorative learning with patch order prediction, capturing not only visual details but also relationships among anatomical structures.
3. Method
Notations.
Given an image sample $x \in \mathbb{R}^{H \times W \times C}$, where $(H, W)$ is the resolution of the image and $C$ is the number of channels, we randomly select and apply one of the following distortion functions: (a) patch order distortion (the upper path in Fig. 2) or (b) patch appearance distortion (the bottom path in Fig. 2). In patch order distortion, we first divide $x$ into a sequence of $n$ non-overlapping image patches $P = (p_1, p_2, \ldots, p_n)$, where $n = \frac{H \times W}{k^2}$ and $(k, k)$ is the resolution of each patch. We use $L = (1, 2, \ldots, n)$ to denote the correct patch positions within $x$. We then apply a random permutation operator on $L$ to generate the permuted patch positions $L^{perm}$. We use $L^{perm}$ to rearrange the patch sequence $P$, resulting in the permuted patch sequence $P^{perm}$. In patch appearance distortion, we first apply an image transformation operator on $x$, resulting in an appearance-transformed image $x^{tran}$. We then divide $x^{tran}$ into a sequence of $n$ non-overlapping transformed image patches $P^{tran} = (p_1^{tran}, p_2^{tran}, \ldots, p_n^{tran})$. Following [11], we map the patches in $P^{perm}$ and $P^{tran}$ to $D$-dimensional patch embeddings using a trainable linear projection layer. Then, trainable positional embeddings are added to the patch embeddings, resulting in a sequence of embedding vectors. The embedding vectors are further processed by the transformer encoder $g_\theta(\cdot)$ to generate a set of contextual patch features $Z' = (z'_1, z'_2, \ldots, z'_n)$. We then pass $Z'$ to two distinct prediction heads $s_\theta(\cdot)$ and $k_\theta(\cdot)$ to generate predictions $\mathcal{P}^{pop} = s_\theta(Z')$ and $\mathcal{P}^{ar} = k_\theta(Z')$ for performing patch order prediction and patch appearance recovery, respectively, as described below. Following [21], we define $\overset{!}{=}$ to be “shall be (made) equal”.
Fig. 2.

POPAR aims to learn (1) contextualized high-level anatomical structures via patch order prediction, and (2) fine-grained image features via patch appearance recovery. For each image, we divide it into a sequence of non-overlapping patches, and randomly distort the patch order (upper path) or patch appearances (bottom path). We give the distorted patch sequence to a transformer network, and train the model to predict the correct position of each input patch and recover the correct patch appearance for each position as the original patch sequence.
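The patch order distortion described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the image is a single-channel nested list, and the helper names `split_into_patches` and `patch_order_distortion` are hypothetical.

```python
import random


def split_into_patches(image, k):
    """Split an H x W image (list of rows) into non-overlapping k x k patches.

    Patches are returned in row-major order, so index i corresponds to
    position i + 1 in L = (1, 2, ..., n).
    """
    h, w = len(image), len(image[0])
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    return [
        [row[left:left + k] for row in image[top:top + k]]
        for top in range(0, h, k)
        for left in range(0, w, k)
    ]


def patch_order_distortion(patches, rng):
    """Permute the patch sequence; return (P_perm, L_perm), positions 1-based."""
    l_perm = list(range(1, len(patches) + 1))
    rng.shuffle(l_perm)  # random permutation of the correct positions L
    p_perm = [patches[pos - 1] for pos in l_perm]  # rearrange P using L_perm
    return p_perm, l_perm


# Toy example (assumption): a 4 x 4 image with k = 2 gives n = 4 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, k=2)
p_perm, l_perm = patch_order_distortion(patches, random.Random(0))
```

Here the i-th entry of `l_perm` records where the i-th shuffled patch originally came from, which is the quantity the order prediction head is asked to recover.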
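Patch order prediction aims to predict the correct position of a patch based on its appearance. In particular, depending on which distortion function is selected, the expected patch order prediction $\mathcal{P}^{pop}$ is formulated as follows.

$$\mathcal{P}^{pop} \overset{!}{=} \begin{cases} L^{perm} & \text{if patch order distortion is selected} \\ L & \text{if patch appearance distortion is selected} \end{cases} \tag{1}$$

Patch appearance recovery aims to reconstruct the correct appearance for each position in the input sequence. We expect the network to predict the original appearance in $P$ regardless of which distortion function (patch order distortion or patch appearance distortion) is selected. The expected reconstruction prediction $\mathcal{P}^{ar}$ is defined as follows.

$$\mathcal{P}^{ar} \overset{!}{=} P \tag{2}$$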
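As a rough illustration of the two training targets, the sketch below builds the expected order target and the expected appearance target for a single image. The function name `expected_targets` and the string stand-ins for patches are assumptions made for this example only.

```python
def expected_targets(patches, l_perm=None):
    """Return (order_target, appearance_target) for one image.

    patches: the original patch sequence P.
    l_perm:  the permuted 1-based positions if patch order distortion was
             applied, or None if patch appearance distortion was applied.
    """
    n = len(patches)
    # Order target: the permuted positions when patches were shuffled,
    # otherwise the identity positions (1, 2, ..., n).
    order_target = list(l_perm) if l_perm is not None else list(range(1, n + 1))
    # Appearance target: always the original patches, whichever distortion ran.
    appearance_target = list(patches)
    return order_target, appearance_target


# Stand-ins for four image patches (assumption: real patches are pixel arrays).
original = ["p1", "p2", "p3", "p4"]
order_a, appearance_a = expected_targets(original, l_perm=[3, 1, 4, 2])
order_b, appearance_b = expected_targets(original)
```

In both cases the appearance target is the undistorted sequence, so the restoration head has to undo either the shuffling or the appearance transformation.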
Overall Training Scheme.
We formulate patch order prediction as an $n$-way multi-class classification task and optimize the model by minimizing the categorical cross-entropy loss $\mathcal{L}_{pop} = -\frac{1}{B} \sum_{b=1}^{B} \sum_{l=1}^{n} \sum_{c=1}^{n} \mathcal{Y}_{blc} \log \mathcal{P}^{pop}_{blc}$, where $B$ denotes the batch size, $n$ is the number of patches for each image, $\mathcal{Y}$ represents the ground truth (as defined in Eq. (1)), and $\mathcal{P}^{pop}$ represents the network’s patch order prediction. Moreover, we formulate patch appearance recovery as a reconstruction task and train the model by minimizing the L2 distance between the original patch sequence $P$ and the restored patch sequence $\mathcal{P}^{ar}$: $\mathcal{L}_{ar} = \frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{n} \lVert p_j - p_j^{ar} \rVert_2^2$, where $p_j$ and $p_j^{ar}$ represent the patch appearance from $P$ and $\mathcal{P}^{ar}$, respectively. We integrate both learning schemes and train POPAR with the overall loss function $\mathcal{L} = \lambda \mathcal{L}_{pop} + (1 - \lambda) \mathcal{L}_{ar}$, where $\lambda$ is the weight specifying the importance of each loss. The formulation of $\mathcal{L}_{pop}$ encourages the transformer model to learn high-level anatomical structures and their relative relationships. Moreover, the definition of $\mathcal{L}_{ar}$ encourages the model to capture more fine-grained features from images.
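The combined objective can be illustrated with a small, dependency-free sketch. Several choices here are assumptions rather than details confirmed by the text: both losses are summed over patches and averaged over the batch, the reconstruction term uses the squared L2 distance, the two terms are mixed as a convex combination weighted by `lam`, and all function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def patch_order_loss(order_logits, order_targets):
    """Cross-entropy with one-hot targets, summed over patches, mean over batch.

    order_logits[b][l] holds n logits; order_targets[b][l] is a 1-based position.
    """
    total = 0.0
    for logits_per_image, targets_per_image in zip(order_logits, order_targets):
        for logits, target in zip(logits_per_image, targets_per_image):
            total -= log_softmax(logits)[target - 1]
    return total / len(order_logits)


def appearance_loss(restored, original):
    """Squared L2 distance per patch, summed over patches, mean over batch."""
    total = 0.0
    for restored_image, original_image in zip(restored, original):
        for r_patch, o_patch in zip(restored_image, original_image):
            total += sum((r - o) ** 2 for r, o in zip(r_patch, o_patch))
    return total / len(original)


def popar_loss(order_logits, order_targets, restored, original, lam=0.5):
    """Weighted sum of both objectives (the weighting form is an assumption)."""
    l_pop = patch_order_loss(order_logits, order_targets)
    l_ar = appearance_loss(restored, original)
    return lam * l_pop + (1.0 - lam) * l_ar


# Toy batch: B = 1 image, n = 2 patches, each patch flattened to 2 pixels.
order_logits = [[[0.0, 0.0], [0.0, 0.0]]]  # uniform predictions
order_targets = [[2, 1]]
original = [[[0.1, 0.2], [0.3, 0.4]]]
restored = [[[0.1, 0.2], [0.3, 0.4]]]      # perfect reconstruction
loss = popar_loss(order_logits, order_targets, restored, original, lam=0.5)
```

With uniform order predictions and a perfect reconstruction, only the classification term contributes, so the toy loss reduces to half of 2 ln 2.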
4. Experiments and Results
4.1. Implementation Details
Pretraining Settings.
We pretrain POPAR with ViT-B and Swin-B as backbones using their official default configurations on the training set of ChestX-ray14 [24] dataset. Due to architecture differences (detailed in appendix), we use image size of 224×224 and 448×448 for ViT-B and Swin-B backbones, respectively. Accordingly, we divide images into 16×16 and 32×32 patches for ViT-B and Swin-B, respectively, which results in n = 196 patches in both backbones. We use two single linear layers as the prediction heads for the classification (order prediction) and restoration (appearance recovery) tasks. For all models, we use SGD optimizer with learning rate 0.1. We set λ to 0.5. We train POPAR models with ViT-B and Swin-B backbones for 1000 and 300 epochs, respectively. Image transformation function includes local pixel shuffling, non-linear transformation, and outer/inner cutouts [27]. More details are in the appendix.
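The patch counts quoted above follow directly from the image and patch sizes. The short check below, using a hypothetical helper `num_patches`, reproduces the arithmetic for the stated configurations and for a 224 × 224 input with 32 × 32 patches.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size:
        raise ValueError("image size must be divisible by patch size")
    return (image_size // patch_size) ** 2


vit_b = num_patches(224, 16)      # ViT-B: 14 x 14 grid
swin_b = num_patches(448, 32)     # Swin-B at 448 x 448: 14 x 14 grid
swin_b_low = num_patches(224, 32) # Swin-B at 224 x 224: 7 x 7 grid
```

Both pretraining configurations therefore present the model with the same number of shufflable patches, while lowering the Swin-B input to 224 × 224 reduces the grid to 7 × 7.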
Target Tasks and Finetuning Settings.
We evaluate the efficacy of POPAR models in transfer learning to four medical classification tasks in chest X-ray datasets, including ChestX-ray14, CheXpert [17], NIH Shenzhen CXR [18], and RSNA Pneumonia [1,24]. We transfer POPAR models to target tasks by removing the prediction heads and inserting randomly initialized target classification heads that include (1) a linear layer for the ViT-B backbone and (2) an average pooling and a linear layer for the Swin-B backbone. We finetune all the parameters of target models. Details of target tasks and finetuning settings are provided in the appendix.
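The two target classification heads can be sketched without any deep learning library. This is a simplified illustration under assumptions: the ViT-B head is applied to a single feature vector (which token is used is not specified here), weights are drawn from a small Gaussian, and every function name is hypothetical.

```python
import random


def init_linear(in_dim, out_dim, rng, std=0.02):
    """Randomly initialized linear layer (std is an assumed value)."""
    weights = [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, weights, bias):
    """Apply y = Wx + b to one feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def average_pool(tokens):
    """Average a sequence of token features into one feature vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def vit_style_head(feature, weights, bias):
    """Linear layer on a single feature vector."""
    return linear(feature, weights, bias)


def swin_style_head(tokens, weights, bias):
    """Average pooling over all tokens followed by a linear layer."""
    return linear(average_pool(tokens), weights, bias)


# Toy example: two 2-dimensional tokens and 14 output classes,
# matching the 14 disease labels of ChestX-ray14.
tokens = [[1.0, 2.0], [3.0, 4.0]]
weights, bias = init_linear(in_dim=2, out_dim=14, rng=random.Random(0))
logits = swin_style_head(tokens, weights, bias)
```

During finetuning, these freshly initialized heads replace the two pretraining prediction heads, and all backbone parameters are updated together with them.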
4.2. Results
(1). POPAR outperforms self-supervised ImageNet models with transformer backbone.
To demonstrate the effectiveness of pretraining transformers with in-domain medical data, we compare POPAR with SoTA transformer-based self-supervised methods that are pretrained on ImageNet. We evaluate existing self-supervised ImageNet models with ViT-B (MoCoV3 [7], SimMIM [25], DINO [4], BEiT [2], and MAE [15]) and Swin-B (SimMIM) backbones. We use officially released models for all baselines, among which BEiT model is pretrained on the ImageNet-21K dataset, while the rest of the models are pretrained on the ImageNet-1K dataset. We make the following observations from the results in Table 2. Firstly, SimMIM and MAE achieve superior performance over other baselines, demonstrating the effectiveness of masked image restoration for pretraining transformer models. Secondly, POPAR with ViT-B backbone surpasses all self-supervised ImageNet models with the same backbone. Thirdly, even POPAR−1 (a downgraded version of POPAR) outperforms SimMIM with Swin-B backbone on three out of four target tasks.
(2). POPAR outperforms self-supervised pretrained models across architectures.
To demonstrate the effectiveness of representation learning via our proposed framework, we compare POPAR with SoTA CNN-based and transformer-based SSL methods pretrained on medical images. To do so, we evaluate (1) three recent SSL methods with ResNet-50 backbone, including MoCo v2 [5], Barlow Twins [26], and SimSiam [6], and (2) SimMIM [25], which has shown superior performance over other transformer-based SSL methods in both vision [25] and medical (refer to Table 2) tasks, with ViT-B and Swin-B backbones. All models are pretrained on ChestX-ray14 dataset. As shown in Table 3, even POPAR−1, a downgraded POPAR, yields significantly better performance compared with three SSL methods with ResNet-50 backbone in all target tasks. Moreover, even POPAR−1 and POPAR−3 (two downgraded POPAR models) outperform SimMIM in all target tasks across Swin-B and ViT-B backbones. These results demonstrate that POPAR models provide more useful representations for various medical imaging tasks.
Discussion:
By integrating Tables 1, 2, and 3, and based on the reasoning detailed in the supplementary material (Section B.3), we may infer that POPAR would outperform all baseline approaches if they were pretrained with NIH ChestX-ray14.
(3). POPAR outperforms fully-supervised pretrained models across architectures.
We compare POPAR models, which are pretrained on unlabeled images of the ChestX-ray14 dataset, with fully-supervised pretrained models on ImageNet and ChestX-ray14 across three architectures: ResNet-50, ViT-B, and Swin-B. We use existing supervised ImageNet models, with CNN and transformer backbones pretrained on ImageNet-1K and ImageNet-21K datasets, respectively. As shown in Table 4, POPAR models provide superior performance over both supervised ImageNet and ChestX-ray14 models across architectures in three target tasks. In particular, the downgraded POPAR models with ViT-B and Swin-B backbones outperform the corresponding supervised baselines with the same backbone in all four and in three of the four target tasks, respectively. Moreover, POPAR−1 outperforms supervised models with ResNet-50 backbone in three target tasks. In summary, these results demonstrate that POPAR provides more generic features for various medical imaging tasks.
5. Ablation Study
Impact of Input Resolutions.
We evaluate POPAR with ViT-B and Swin-B backbones using four different pretraining and finetuning image resolutions. As shown in Table 1, comparing POPAR−1 with POPAR−2 indicates that a larger number of shufflable patches provides a larger performance gain on all target tasks. Moreover, with the same number of shufflable patches, POPAR−1 with a Swin-B backbone provides superior performance compared with POPAR−3 with a ViT-B backbone; we therefore recommend the Swin transformer as the POPAR backbone. Lastly, the model pretrained and finetuned at 448×448 resolution, denoted by POPAR in Table 1, achieves the best performance on all four target tasks. This indicates that a higher input resolution is preferred for all four MIA tasks studied in this paper, since it provides more detailed anatomical information.
Pretraining Tasks.
POPAR seamlessly combines two tasks: patch order prediction and patch appearance recovery. As shown in our supplementary material (Section C and Table 5), they can be further broken down into three individual sub-tasks: (a) patch order classification; (b) misplaced patch appearance recovery; and (c) Models Genesis [27] transformed image restoration. We evaluate the effectiveness of different POPAR pretraining sub-tasks on the ViT-B backbone. Compared with the Models Genesis [27] transformed image restoration, the patch order prediction task provides a significant performance boost on most target tasks. Furthermore, the combination of the misplaced patch appearance recovery task and the patch order classification task provides an on-par or smaller performance increment on the four target tasks. Finally, we demonstrate that POPAR pretrained with all sub-tasks provides the highest performance boost.
6. Conclusion
We propose POPAR, a novel transformer-based SSL framework for MIA tasks. POPAR integrates patch order prediction and appearance recovery, capturing not only high-level relationships among anatomical structures but also fine-grained details from medical images. In future work, we will extend POPAR to 3D imaging and to target segmentation tasks.
Acknowledgments.
This research has been supported in part by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant, and in part by the NIH under Award Number R01HL128785. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. This work has utilized the GPUs provided in part by the ASU Research Computing and in part by the Extreme Science and Engineering Discovery Environment (XSEDE) funded by the National Science Foundation (NSF) under grant numbers: ACI-1548562, ACI-1928147, and ACI-2005632. The content of this paper is covered by patents pending.
Footnotes
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-16852-9_8.
References
- 1. RSNA pneumonia detection challenge (2018). https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
- 2. Bao H, Dong L, Wei F: BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)
- 3. Brown T, et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
- 4. Caron M, et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
- 5. Chen X, Fan H, Girshick R, He K: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)
- 6. Chen X, He K: Exploring simple Siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758 (2021)
- 7. Chen X, Xie S, He K: An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9640–9649 (2021)
- 8. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
- 9. Devlin J, Chang MW, Lee K, Toutanova K: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- 10. Doersch C, Gupta A, Efros AA: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430 (2015)
- 11. Dosovitskiy A, et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- 12. Haghighi F, Hosseinzadeh Taher MR, Gotway MB, Liang J: DiRA: discriminative, restorative, and adversarial learning for self-supervised medical image analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20824–20834 (2022)
- 13. Haghighi F, Hosseinzadeh Taher MR, Zhou Z, Gotway MB, Liang J: Learning semantics-enriched representation via self-discovery, self-classification, and self-restoration. In: Martel AL, et al. (eds.) MICCAI 2020. LNCS, vol. 12261, pp. 137–147. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59710-8_14
- 14. Haghighi F, Taher MRH, Zhou Z, Gotway MB, Liang J: Transferable visual words: exploiting the semantics of anatomical patterns for self-supervised learning. IEEE Trans. Med. Imaging 40(10), 2857–2868 (2021)
- 15. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)
- 16. Hosseinzadeh Taher MR, Haghighi F, Feng R, Gotway MB, Liang J: A systematic benchmarking analysis of transfer learning for medical image analysis. In: Albarqouni S, et al. (eds.) DART/FAIR 2021. LNCS, vol. 12968, pp. 3–13. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87722-4_1
- 17. Irvin J, et al.: CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 590–597 (2019)
- 18. Jaeger S, Candemir S, Antani S, Wáng YXJ, Lu PX, Thoma G: Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg. 4(6), 475 (2014)
- 19. Jing L, Tian Y: Self-supervised visual feature learning with deep neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 43(11), 4037–4058 (2020)
- 20. Noroozi M, Favaro P: Unsupervised learning of visual representations by solving jigsaw puzzles. In: Leibe B, Matas J, Sebe N, Welling M (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5
- 21. Schwichtenberg J: Physics from Symmetry. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-66631-0
- 22. Taher MRH, Haghighi F, Gotway MB, Liang J: CAiD: context-aware instance discrimination for self-supervised learning in medical imaging. arXiv preprint arXiv:2204.07344 (2022)
- 23. Tajbakhsh N, Roth H, Terzopoulos D, Liang J: Guest editorial annotation-efficient deep learning: the holy grail of medical imaging. IEEE Trans. Med. Imaging 40(10), 2526–2533 (2021)
- 24. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM: ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106 (2017)
- 25. Xie Z, et al.: SimMIM: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886 (2021)
- 26. Zbontar J, Jing L, Misra I, LeCun Y, Deny S: Barlow twins: self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, pp. 12310–12320. PMLR (2021)
- 27. Zhou Z, Sodha V, Pang J, Gotway MB, Liang J: Models genesis. Med. Image Anal. 67, 101840 (2021)
- 28. Zhuang X, Li Y, Hu Y, Ma K, Yang Y, Zheng Y: Self-supervised feature learning for 3D medical images by playing a Rubik’s cube. In: Shen D, et al. (eds.) MICCAI 2019. LNCS, vol. 11767, pp. 420–428. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32251-9_46
