Abstract
While data-driven approaches excel at many image analysis tasks, the performance of these approaches is often limited by a shortage of annotated data available for training. Recent work in semi-supervised learning has shown that meaningful representations of images can be obtained from training with large quantities of unlabeled data, and that these representations can improve the performance of supervised tasks. Here, we demonstrate that an unsupervised jigsaw learning task, in combination with supervised training, results in up to a 9.8% improvement in correctly classifying lesions in colonoscopy images when compared to a fully-supervised baseline. We additionally benchmark improvements in domain adaptation and out-of-distribution detection, and demonstrate that semi-supervised learning outperforms supervised learning in both cases. In colonoscopy applications, these metrics are important given the skill required for endoscopic assessment of lesions, the wide variety of endoscopy systems in use, and the homogeneity that is typical of labeled datasets.
Keywords: Colonoscopy, deep learning, domain adaptation, endoscopy, jigsaw, lesion classification, out-of-distribution detection, semi-supervised, unsupervised
I. INTRODUCTION
Colorectal cancer is the second leading cause of cancer death and will cause a predicted 53,200 deaths in the United States in 2020 [1]. Optical colonoscopy is considered the gold-standard for detecting and preventing colorectal cancer with approximately 15 million procedures being performed annually [2]. Screening procedures are used to inspect the large intestine and rectum for precancerous lesions so that they may be removed prior to the onset of carcinoma. These lesions come in a variety of geometries and textures, each with an associated risk of progressing to a cancerous state [3]. Colonoscopists analyze optical images to visually classify lesions, using cues such as color, shape, and vasculature patterns in conjunction with published guidelines [4]–[6]. Improving the reliability of lesion classification from images and de-skilling this task could reduce the costs, time, and other resources associated with histopathology. Further, lesions which are benign in nature may be left in place, eliminating associated risks of polyp removal [7].
In the past decade, deep learning models have achieved astounding success in the computer vision field on tasks such as image classification and object recognition, surpassing human-level performance in some cases. In medical imaging, these models have outperformed traditional image processing techniques in a variety of fields such as radiology, histopathology, retinopathy, and mammography. Most of these models are trained in a supervised fashion, requiring large quantities of expertly annotated medical data to achieve optimal performance. In the medical imaging field, compiling annotated data is particularly time consuming, expensive, fraught with privacy concerns, and limited by the availability of expert annotators. In contrast, unsupervised methods have shown that meaningful representations can be extracted from unlabeled data, which is often plentiful. In this work, we leverage the advantages of both labeled and unlabeled data using a semi-supervised learning paradigm to improve the performance of colonoscopy lesion classification.
Semi-supervised learning (SSL) is an emerging area of research that aims to learn a supervised objective while enriching the encoded features through an unsupervised task. Recent works have shown marked improvement over purely supervised training, especially with small quantities of labeled data [8]–[10]. SSL involves simultaneously training an unsupervised proxy task and a supervised task. Many proxy tasks involve applying some type of transformation to an image, then tasking the network with predicting the transformation. In this way, the network learns to encode information to a feature space which may enhance the performance of the supervised task. One example of a proxy task is applying a known rotation to an image, then tasking the network with estimating the degree of rotation.
In this paper, we use a jigsaw puzzle as the proxy task for SSL, as was first proposed by [11]. In this task, an input image is cut into an N × N grid, and the resulting tiles are reshuffled into an order defined by a randomly selected pseudo-label. The network then learns to encode the shuffled image into a feature vector which allows it to accurately predict the tile order. The unsupervised jigsaw task ideally enriches the encoder’s resultant feature vectors, making them more discriminative for the supervised lesion classification task. Using this method, we find that a semi-supervised learning model outperforms a purely supervised model in lesion classification. While most semi-supervised learning research focuses solely on improvements in accuracy, trained models also benefit from improved robustness and generalizability. We also investigate the jigsaw method’s effect on domain adaptation and out-of-distribution detection in colonoscopy - important metrics when deploying models to real-world clinical settings. Specifically, the contributions of this study are:
To the best of our knowledge, this is the first research applying semi-supervised learning to colonoscopy lesion classification.
We demonstrate that a jigsaw-puzzle-solving task can effectively leverage unlabeled data to significantly improve the performance of lesion classification.
We show that semi-supervised learning also improves performance in analyzing domain-shifted images and detecting out-of-distribution samples at inference.
II. BACKGROUND & PREVIOUS WORK
A. LESION CLASSIFICATION
Polyp classification is a widely researched problem in the medical image analysis community [12], [13]. Previous work has used traditional methods for hand-crafted feature extraction using color, texture, and 3D features for polyp classification in videos [14]. More recent research uses deep learning models, which have shown significant improvements in classification accuracy. Most use transfer learning [15] with off-the-shelf models such as ResNet [16] and Inception [17]–[19]. Others have combined traditional methods with deep learning approaches, such as fusion of wavelets and convolutional neural network features [20]. Multi-modal fusion of pixel-level information, such as color and depth, have also been shown to improve classification accuracy [21], [22]. Still, none of these methods utilize the large quantities of available unlabeled data [23].
B. SELF-SUPERVISED & SEMI-SUPERVISED LEARNING
Self-supervised and semi-supervised learning are highly active areas of artificial intelligence research. These methods exploit unlabeled data for effective representation learning. Recent semi-supervised works have achieved performance comparable to that of conventional fully supervised networks, while only requiring a small fraction of labeled data. To learn from data without manual annotations, self-supervised methods employ proxy tasks where pseudo-labels can be generated using known transformations or data manipulations. According to [24], there are four common types of proxy tasks:
Generation-based methods: Some part of the data is deliberately removed, and the network is tasked with predicting the missing data. Examples include image inpainting [25], and video generation from single frames using generative adversarial networks (GANs) [26].
Context-based methods: The network is tasked with learning to make predictions using either spatial or temporal contextual information. Examples include image clustering [10], [27], context prediction [11], [28], [29], and predicting a geometric transformation such as rotation [30].
Free semantic label-based methods: Semantic labels are automatically generated for object segmentation [31], [32] or contour detection [33]–[35].
Cross-modal methods: Data correspondence between data modalities is learned, such as visual-audio correspondence [36], [37].
Recent works have shown that semi-supervised learning methods improve model robustness and generalizability, as well as the ability to measure uncertainty [38], [39]. Deep learning models are notorious for silently providing incorrect predictions when test samples are drawn from a distribution other than the distribution used for training. Surrogate methods have been incorporated into the inference pipeline, drawing on the network’s prediction probabilities to determine an out-of-distribution score for test samples [40]. The success of semi-supervised learning in medical imaging is dependent on deploying networks that can handle a wide distribution of samples, and have a mechanism for appropriately handling samples which the network is ill-conditioned to classify.
1). SEMI-SUPERVISED LEARNING IN MEDICAL IMAGING
Since labeled data in the medical imaging community is particularly scarce, researchers in this field have long explored unsupervised methods. Cheplygina et al. [41] present a comprehensive review of semi-supervised and self-supervised methods employed in medical imaging. Popular approaches include self-labeling and co-training, where a classifier is first trained on the available labeled data and is then used to generate pseudo-labels on unlabeled data. The classifier is then retrained using the newly generated labeled data. This method is especially popular where precise labeling is cumbersome, such as pixel-level segmentation tasks with applications in neuro [42]–[44], heart [45], and retinal [46] imaging. More recent works have employed state-of-the-art semi-supervised and self-supervised techniques across a wide range of applications, such as consistency regularization for skin lesion classification and thorax disease diagnosis [47], unsupervised anomaly detection for white matter lesion segmentation [48], and image synthesis with GANs for data augmentation in glaucoma assessment [49].
2). JIGSAW PUZZLE SOLVING
The original semi-supervised jigsaw approach proposes decomposing an image into non-overlapping patches, shuffling the patches in a random order, then individually feeding these patches to a Siamese network [11]. The network predicts a permutation index that corresponds to the particular shuffled order as a pretext task, and it is later fine-tuned on the downstream, supervised task using labeled data. Many variations of the jigsaw task have been explored, including for videos [50], three-dimensional data [51], and negative sample inclusion for increased difficulty [52]. Specifically in medical imaging, the jigsaw paradigm has been applied to imaging of the brain and pancreas [51], [53], [54].
In this work, we adapt the jigsaw proxy task to improve the performance of a supervised classifier [39]. To the best of our knowledge, this work is the first to explore semi-supervised learning for lesion classification in colonoscopy. The most similar prior art is [55], which performs medical instrument segmentation on endoscopy images using image colorization as the pretext task.
III. METHODS
Our problem statement is defined as follows: given a colonoscopy image of a lesion, we attempt to classify it into one of two classes - neoplastic/precancerous or non-neoplastic. Our dataset consists of labeled and unlabeled image sets, $\mathcal{D} = \mathcal{D}_l \cup \mathcal{D}_u$, where $\mathcal{D}_l = \{(x_i, y_i)\}_{i=1}^{N_l}$ consists of image-label pairs, and $\mathcal{D}_u = \{x_j\}_{j=1}^{N_u}$ is the set of unlabeled lesion images. A detailed description of the classes & dataset is given in Section IV-A. The goal is to leverage the unlabeled data using the jigsaw task to improve the performance of lesion classification.
A. ARCHITECTURE
As shown in Figure 1, our model consists of the standard ResNet-18 [56] as a shared feature extractor with two classifier heads - one for supervised lesion classification and a second for jigsaw classification. Our end-to-end deep model is comprised of a shared feature encoder parameterized by θe, and supervised and unsupervised classifier heads parameterized by θs and θu, respectively. The network trains in two phases - a supervised phase that minimizes the supervised loss $\mathcal{L}_s$, followed by an unsupervised phase that minimizes the jigsaw loss $\mathcal{L}_u$. The parameters of the network are learned by alternating training between the supervised and unsupervised tasks on each iteration. The following sections describe the two training phases in detail.
FIGURE 1. The proposed semi-supervised learning model uses lesion type labels for a supervised loss and jigsaw index pseudo-labels for an unsupervised loss. This model is sequentially trained in a supervised phase then an unsupervised phase for each iteration.
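To make the architecture concrete, a minimal PyTorch sketch of the shared encoder with two heads is given below. The class name is illustrative, and we assume the jigsaw head has P + 1 outputs (P scrambled permutations plus the unscrambled index 0); Section IV-B reports the head as 512 × P, so the exact node count is our reading of the text rather than the authors' code.

```python
import torch.nn as nn
from torchvision import models

class JigsawSSLModel(nn.Module):
    """Shared ResNet-18 encoder (theta_e) with a supervised lesion head
    (theta_s) and an unsupervised jigsaw head (theta_u)."""

    def __init__(self, num_jigsaw_classes=31):  # 30 permutations + index 0 (assumed)
        super().__init__()
        # torchvision >= 0.13 weights API; older versions use pretrained=True.
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Everything up to (and including) global average pooling is shared.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.supervised_head = nn.Linear(512, 2)                  # theta_s
        self.jigsaw_head = nn.Linear(512, num_jigsaw_classes)     # theta_u

    def forward(self, x):
        feats = self.encoder(x).flatten(1)  # (batch, 512) feature vectors
        return self.supervised_head(feats), self.jigsaw_head(feats)
```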
1). SUPERVISED PHASE
The main supervised objective is to classify colonoscopy lesion images into neoplastic vs. non-neoplastic classes. We aim to minimize the supervised classification loss $\mathcal{L}_s$, which is the weighted cross-entropy loss between the target label $y_i$ and the model prediction for image $x_i$. In our experiments to assess the effectiveness of semi-supervised learning, we report the performance of the network trained on various fractions of the labeled dataset. Consequently, $\mathcal{D}_l^k$ is obtained by selecting k% of the labeled data, where k varies logarithmically, i.e., k = {100, 50, 25, 12.5, 6.25}. A detailed description of how data selection is performed is discussed in IV-C. The cross-entropy loss function is weighted to account for the class imbalance in the dataset. Formally, the supervised loss function is defined as

$$\mathcal{L}_s = -\frac{1}{N_k}\sum_{i=1}^{N_k}\sum_{c} w_c\, y_{i,c}\, \log p(y_i = c \mid x_i) \tag{1}$$

where $N_k$ is the number of images in the selected labeled dataset, the weight $w_c = 1/\text{freq}(c)$ is the inverse of the frequency of class $c$ in the dataset $\mathcal{D}_l^k$, $y_i$ is the one-hot encoded target label for the $i$th image, and $p$ is the posterior probability obtained by taking the softmax of the output logits. In this phase, only the parameters of the feature encoder θe and the supervised fully connected layer θs are updated.
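As a sketch, Equation 1 maps directly onto PyTorch's weighted cross-entropy. The default class frequencies below follow the frame-wise ratio reported in Section IV-B; the label ordering (0 = non-neoplastic, 1 = neoplastic) is our assumption.

```python
import torch
import torch.nn.functional as F

def supervised_loss(lesion_logits, targets, class_freq=(0.17, 0.83)):
    """Weighted cross-entropy of Eq. (1): per-class weights are the inverse
    of the class frequencies in the selected labeled set D_l^k."""
    weights = 1.0 / torch.tensor(class_freq)
    return F.cross_entropy(lesion_logits, targets, weight=weights)
```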
2). UNSUPERVISED PHASE
Following each supervised phase, an unsupervised phase is trained using the entire dataset $\mathcal{D}$. In this phase, the objective of the network is to learn to solve the jigsaw task. As shown in Figure 2, we first decompose an image into a 3 × 3 grid of tiles. Then, a patch 0.75-0.9 times the original tile size is cropped from each tile at a random offset. The patches are then scaled back to the original tile size, reordered according to a randomly selected permutation index, and concatenated to reform a 222 × 222 input image z. This transformation prevents the network from using low-level cues such as continuity of edges, color, or texture when estimating the patch order. Instead, the network is forced to learn high-level, global primitives such as shape. With 9 grid positions, there are 9! = 362,880 possible patch permutations, creating far too many labels for the network to learn. To make the classification task achievable for the network, we select a small subset of P permutations with maximal Hamming distance from one another [11]. An index is assigned to each permutation, which then functions as a pseudo-label. The jigsaw task is then formulated as a classification problem, tasking the network to learn to correctly predict the pseudo-label of z. Here, the zero index refers to the unscrambled, original image case.
FIGURE 2. Overview of the jigsaw shuffler procedure for generating shuffled images with a pseudo-label for unsupervised learning.
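The shuffler in Figure 2 can be sketched as follows, assuming PIL images. The greedy maximal-Hamming-distance selection follows the strategy of [11]; sampling a random candidate pool instead of enumerating all 9! permutations is a practical shortcut of ours, not necessarily the authors' implementation.

```python
import random
import numpy as np
from PIL import Image

def max_hamming_permutations(num_perms, grid=9, pool=2000):
    """Greedily pick `num_perms` permutations of the 9 tile positions that are
    maximally distant in Hamming distance, following [11]."""
    candidates = [np.random.permutation(grid) for _ in range(pool)]
    chosen = [candidates.pop(0)]
    while len(chosen) < num_perms:
        # Keep the candidate whose minimum distance to the chosen set is largest.
        dists = [min((c != p).sum() for p in chosen) for c in candidates]
        chosen.append(candidates.pop(int(np.argmax(dists))))
    return chosen

def jigsaw_shuffle(img, permutation, out_size=222, crop_scale=(0.75, 0.9)):
    """Decompose `img` into a 3x3 grid, randomly crop each tile to 75-90% of
    the tile size at a random offset, rescale, and reassemble the tiles in
    the order given by `permutation` (the shuffler of Figure 2)."""
    img = img.resize((out_size, out_size))
    tile = out_size // 3  # 74 px tiles for a 222 px image
    tiles = []
    for idx in range(9):
        r, c = divmod(idx, 3)
        patch = img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        s = int(tile * random.uniform(*crop_scale))
        ox, oy = random.randint(0, tile - s), random.randint(0, tile - s)
        patch = patch.crop((ox, oy, ox + s, oy + s)).resize((tile, tile))
        tiles.append(patch)
    out = Image.new(img.mode, (tile * 3, tile * 3))
    for dst, src in enumerate(permutation):  # destination slot <- source tile
        r, c = divmod(dst, 3)
        out.paste(tiles[src], (c * tile, r * tile))
    return out
```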
We use a weighted cross-entropy loss as the unsupervised loss $\mathcal{L}_u$. When creating a mini-batch for training in the unsupervised phase, we keep the scrambled-to-unscrambled image ratio equal to s : (1 – s), where s ∈ [0, 1]. In the jigsaw shuffler, the permutation index for the scrambled images is drawn from a uniform distribution over {1, …, P}. Hence, every scrambled permutation is equally likely to occur with probability 1/P. Multiplying this probability by the ratio of scrambled images in the mini-batch gives the relative frequency of occurrence of any scrambled index as s/P. The unscrambled label (zero index, in our case) is always chosen with a frequency of (1 – s) in the mini-batch. Overall, the frequency of occurrence for the permutation indices is given as freq = ((1 – s), s/P, s/P, …, s/P), where $\text{freq}_p$ is the frequency of permutation index $p$. The inverse of the frequency is used as a scalar weight in the cross-entropy loss, $w_p = 1/\text{freq}_p$. The unsupervised loss is defined as follows:

$$\mathcal{L}_u = -\frac{1}{N}\sum_{i=1}^{N}\sum_{p=0}^{P} w_p\, y_{i,p}\, \log p(y_i = p \mid z_i) \tag{2}$$

where $N$ is the total number of images in the training dataset, $z_i$ is the $i$th recomposed image, $y_i$ is the one-hot encoded pseudo-label vector, and $p(y_i = p \mid z_i)$ is the prediction probability for permutation $p$. Minimization of the unsupervised loss involves learning only the feature encoder θe and the unsupervised head θu.
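Equation 2 likewise maps onto a weighted cross-entropy, with weights derived from the mini-batch frequencies above; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def jigsaw_loss(jigsaw_logits, pseudo_labels, P=30, s=0.6, reduction="mean"):
    """Weighted cross-entropy of Eq. (2). Index 0 (unscrambled) occurs with
    frequency (1 - s) in a mini-batch and each scrambled index with s / P;
    the weights are the inverse frequencies."""
    freq = torch.tensor([1.0 - s] + [s / P] * P)
    return F.cross_entropy(jigsaw_logits, pseudo_labels,
                           weight=1.0 / freq, reduction=reduction)
```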
The overall training loss is then:

$$\mathcal{L} = \mathcal{L}_s + \lambda\, \mathcal{L}_u \tag{3}$$
where λ is a scalar weight applied to the unsupervised loss. In the unsupervised phase, ordered and shuffled images are mixed. During the supervised phase, input images remain ordered, just as they are presented during testing. When training is complete, the unsupervised head is discarded, and only the trained feature encoder and supervised lesion classification head are used for testing.
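A minimal sketch of the alternating two-phase update, reusing the model and loss sketches above. The two data loaders (one yielding labeled image-label pairs, the other yielding shuffled images with permutation pseudo-labels) and the λ schedule are assumed rather than shown.

```python
import torch

model = JigsawSSLModel(num_jigsaw_classes=31)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-3)
lam = 1.0  # unsupervised loss weight, lambda in Eq. (3)

for (x_l, y_l), (z, y_jig) in zip(labeled_loader, unlabeled_loader):
    # Supervised phase: only theta_e and theta_s receive gradients,
    # since the jigsaw head does not contribute to this loss.
    optimizer.zero_grad()
    lesion_logits, _ = model(x_l)
    supervised_loss(lesion_logits, y_l).backward()
    optimizer.step()

    # Unsupervised phase on ordered + shuffled images: only theta_e
    # and theta_u receive gradients.
    optimizer.zero_grad()
    _, jigsaw_logits = model(z)
    (lam * jigsaw_loss(jigsaw_logits, y_jig)).backward()
    optimizer.step()
```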
B. DOMAIN ADAPTATION
This section describes experiments to assess how semi-supervised learning impacts the domain generalizability of a model. In the context of colonoscopy, domain adaptation would be useful when applying a network to new endoscope types or manufacturers, to endoscopes with imaging performance that varies over time (e.g. dirty optics), or to new imaging modes. We experimentally withhold a target domain of data from the supervised task and only include it in the unlabeled set for the unsupervised task. We then assess the domain adaptability of the network by testing on labeled samples from the target domain.
In colonoscopy, two widely used imaging modalities are White Light Imaging (WLI) and Narrow Band Imaging (NBI). For our experiment, we consider WLI as the source domain and NBI as the target domain. For training, we use labeled WLI images $\mathcal{D}_l^{\text{WLI}}$, where $\mathcal{D}_l^{\text{WLI}} \subset \mathcal{D}_l$, and unlabeled NBI images $\mathcal{D}_u^{\text{NBI}}$, where $\mathcal{D}_u^{\text{NBI}} \subset \mathcal{D}_u$. For testing the performance of the network, we use labeled NBI images $\mathcal{D}_{\text{test}}^{\text{NBI}}$, such that $\mathcal{D}_{\text{test}}^{\text{NBI}} \cap \mathcal{D}_u^{\text{NBI}} = \emptyset$.
The network training approach remains the same as described in the previous section, with the only exception being the data used in each phase. In the supervised phase, we use the labeled WLI images from $\mathcal{D}_l^{\text{WLI}}$, whereas in the unsupervised phase we use both the labeled WLI images and the unlabeled NBI images, i.e., $\mathcal{D}_l^{\text{WLI}} \cup \mathcal{D}_u^{\text{NBI}}$. In the testing phase, we use the labeled NBI images $\mathcal{D}_{\text{test}}^{\text{NBI}}$.
C. OUT-OF-DISTRIBUTION DETECTION
In out-of-distribution detection (OOD), the goal is to identify test samples which do not belong to the distribution on which the model was trained. These out-of-distribution samples should then be rejected to avoid unreliable inference. A pretrained semi-supervised learning model can act as an efficient out-of-distribution detector. In this experiment, we train a classifier using in-distribution samples on the main objective of lesion classification, and then later test its performance as an out-of-distribution detector. We consider white light images to be in-distribution samples, and NBI images are treated as out-of-distribution samples. In the supervised phase, we use labeled white light images from $\mathcal{D}_l^{\text{WLI}}$. For the unsupervised phase, we use both unlabeled and labeled white light images, i.e., $\mathcal{D}_l^{\text{WLI}} \cup \mathcal{D}_u^{\text{WLI}}$. To use the classifier as an out-of-distribution detector, we utilize the posterior probabilities p(y|x). It is shown in [40], [57] that the distribution of prediction softmax probabilities for out-of-distribution samples appears roughly uniform, whereas in-distribution samples have a more ‘peaky’ distribution with a higher maximum softmax probability $\max_c p(y = c \mid x)$. An out-of-distribution detector score κ based on the posterior probabilities and the auxiliary jigsaw loss is defined as follows:
$$\kappa = KL\left[\,\mathcal{U} \,\|\, p(y \mid x)\,\right] + \mathcal{L}_u(x) \tag{4}$$

where $KL[\mathcal{U} \| p(y \mid x)]$ is the KL-divergence between the uniform distribution and the prediction softmax probabilities, and $\mathcal{L}_u(x)$ is the unsupervised loss for image x as defined in Equation 2. KL divergence measures the difference between two probability distributions. If two probability distributions are similar, the KL divergence between them is low, whereas a high value indicates that they are starkly different. The KL divergence between distributions P(y) & Q(y) is defined as:

$$KL\left[P(y) \,\|\, Q(y)\right] = \sum_{y} P(y)\, \log\frac{P(y)}{Q(y)} \tag{5}$$
where y is the support of the distribution, i.e., y ∈ {0, 1} in this case. In the baseline experiment, the OOD score is κ = KL[U‖p(y|x)]. For the semi-supervised learning case, we also add the jigsaw cross-entropy loss. For testing, we use unseen WLI images as the negative class (label = 0) and NBI images as the positive class (label = 1). The in-distribution trained polyp classifier is used for inference on the test set to generate the OOD score κ. It is important to note that training the classifier does not have any element of OOD, and it is trained solely to classify lesions. Another advantage is that this approach does not require any OOD samples during training. Distinguishing WLI & NBI images by itself is not a clinically motivated problem, but we use it as a proxy setup to demonstrate SSL’s potential as an OOD detector.
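As a sketch, the score of Equation 4 can be computed per test image from the two heads' logits; the argument names and the per-sample reduction are our assumptions.

```python
import torch
import torch.nn.functional as F

def ood_score(lesion_logits, jigsaw_logits=None, jigsaw_labels=None, P=100, s=0.6):
    """OOD score kappa of Eq. (4): KL[U || p(y|x)] plus, for the SSL model,
    the per-sample jigsaw loss of Eq. (2)."""
    p = F.softmax(lesion_logits, dim=-1)
    u = torch.full_like(p, 1.0 / p.shape[-1])
    kappa = (u * (u.log() - p.log())).sum(dim=-1)  # KL[U || p], Eq. (5)
    if jigsaw_logits is not None:
        freq = torch.tensor([1.0 - s] + [s / P] * P)
        kappa = kappa + F.cross_entropy(jigsaw_logits, jigsaw_labels,
                                        weight=1.0 / freq, reduction="none")
    return kappa
```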
IV. EXPERIMENTS & RESULTS
A. DATASET
The colonoscopy video data used in this paper was collected at the Johns Hopkins Hospital using a protocol approved by the Johns Hopkins Institutional Review Board (#IRB00184221). Video segments were analyzed and cropped from patient procedure video data, retrospectively, to limit included frames to those containing lesions that were biopsied by the endoscopist. Tissue biopsies were collected from suspected lesions, and ground truth labels derived from histopathology analysis were later paired with the respective video segments. A total of 108 patients were enrolled in the study. A total of 132 videos with corresponding ground truth labels were collected, with each video segment featuring a unique lesion. Video annotations were recorded by two medical trainees and verified by an experienced gastroenterologist. An additional 112 videos with no ground truth classification were cropped and extracted for training the semi-supervised model.
Videos were further categorized into two classes: “neoplastic/precancerous” and “non-neoplastic”. Using the histologic labels, adenomas and serrated adenomas were assigned to the neoplastic/precancerous class (n=110), while hyperplastic polyps were assigned to the non-neoplastic class (n=22). The videos include a diverse distribution of imaging parameters, such as varied video processors, illumination modes (WLI/NBI), and scope manufacturers and models, with both high- and standard-definition resolutions. Videos were separated into training and testing sets with equal class balance between sets. Derived image frames were stored in separate containers to prevent class leakage. Repetitive image frames resulting from minimal camera motion were discarded. A frame-wise summary of the dataset is given in Table 1.
TABLE 1.
Summary of dataset.
| Total number of frames | 6,649 | | |
|---|---|---|---|
| Labeled frames | 4,095 | Unlabeled frames | 2,554 |
| Neoplastic frames | 3,369 | Non-neoplastic frames | 726 |
| WLI frames | 3,855 | NBI frames | 2,646 |
B. IMPLEMENTATION DETAILS
All experiments are implemented using the PyTorch library [58] on a server equipped with an NVIDIA RTX 2080Ti 11GB GPU, an Intel Xeon W-2123 3.6 GHz CPU, and 64 GB of RAM. We use the JiGen repository [39] as our base code for development. All experiments utilize ResNet-18 [56] as the feature encoder. The fully connected layers are 512 × 2 for the supervised branch and 512 × P for the unsupervised branch, matching the fully connected classifier in ResNet-18 and differing only in the number of output nodes.
The network weights are initialized using the pre-trained ImageNet ResNet-18 weights available in the PyTorch library. Data augmentation for whole images includes random vertical flips, random horizontal flips, random rotations drawn from {0°, 90°, 180°, 270°}, and random resized crops with scale in [0.8, 1.0] (all p = 0.5). The images are normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. The augmented images are finally resized to 222 × 222. In the unsupervised phase, the whole-image transformations are applied before the jigsaw shuffler. No color transformations are applied, as polyp color is a discriminative feature among the classes. The Adam optimizer [59] with weight decay (L2 penalty) is used for training the network. The initial learning rate is 0.0001. The frame-wise class frequency ratio is 0.83:0.17 for the neoplastic to non-neoplastic classes, the inverse of which is used as the weights in the supervised weighted cross-entropy loss in Equation 1. The scrambled-to-unscrambled image ratio s : (1 – s) used in Equation 2 is kept at 0.6:0.4.
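A sketch of this augmentation pipeline with torchvision transforms; the exact composition and ordering are our assumptions. Note the absence of color transformations, since polyp color is discriminative.

```python
import random
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    # Rotation into {0, 90, 180, 270} degrees, written as a random choice
    # of fixed rotations applied with probability 0.5 (0 degrees otherwise).
    transforms.RandomApply(
        [transforms.Lambda(lambda im: im.rotate(random.choice([90, 180, 270])))],
        p=0.5),
    transforms.RandomResizedCrop(222, scale=(0.8, 1.0)),  # crop, resize to 222
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```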
C. VARYING THE QUANTITY OF LABELED DATA
To test the efficacy of our semi-supervised learning approach, we evaluate its performance as a function of the quantity of labeled data used for training. We train the network using k% of the total labeled training data, where k varies logarithmically, k = {100, 50, 25, 12.5, 6.25}. For each k, we perform a five-fold cross validation. To split the dataset, we first select 20% of the total labeled data for validation. This validation split is performed at the video level to prevent images of the same polyp from mixing between the train and validation sets. Next, we choose k% of the remaining labeled data as our supervised training dataset $\mathcal{D}_l^k$. Thus, the validation dataset for a particular fold remains the same for all values of k. We use the selected labeled dataset for training the supervised phase, but for all values of k we use the whole training dataset (excluding the validation images) for the unsupervised phase. On average, there are 819 images in the validation set.
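A sketch of the video-level five-fold split, assuming each labeled frame carries the ID of its source video; scikit-learn's GroupKFold, which keeps all frames of one polyp video on the same side of the split, is our choice for illustration, not necessarily the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

frames = np.arange(4095)                          # labeled frame indices (Table 1)
video_ids = np.random.randint(0, 132, size=4095)  # illustrative source-video IDs

for train_idx, val_idx in GroupKFold(n_splits=5).split(frames, groups=video_ids):
    train_frames, val_frames = frames[train_idx], frames[val_idx]
    # Subsample k% of the training *videos* for the supervised set D_l^k;
    # the validation fold stays fixed across all values of k.
```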
We perform an ablation study to measure the performance of SSL when compared to a baseline model. The baseline model is also a ResNet-18, and it is architecturally the same as the SSL model (described in III-A), but without the jigsaw head. The baseline model uses the same weighted cross-entropy loss that the SSL model uses in the supervised phase (Equation 1). When comparing the performance of the SSL model and the baseline, both models use the same validation data and the same selected labeled data $\mathcal{D}_l^k$ for supervised training.
The hyperparameters which gave the best performance for each model are reported. For the baseline model, an initial learning rate of 0.0001 is used in all cases except the 100% model, which uses 0.001. The baseline weight decay is 0.005 for the 100% model, 0.05 for 50% and 25%, 0.2 for 12.5%, and 0.005 for 6.25%. For the SSL models, an initial learning rate of 0.0001 is used. The number of jigsaw classes (P) is 30 in all cases except 100%, which uses 100 classes. The SSL weight decay values are 0.005 for 100%, 0.05 for 50%, 0.07 for 25% and 12.5%, and 0.2 for 6.25%. The unsupervised loss weights λ are 1 for 100% and 50%, 2 for 25%, and 1.5 for 12.5% and 6.25%. In the low data regime, λ is also increased by a factor of 1.5 every 5 epochs to accelerate the unsupervised learning phase, matching the swift supervised learning that results from the small labeled data size.
We evaluate the classification performance with five commonly used metrics - accuracy, F1 score, sensitivity, specificity, and precision. Accuracy is the ratio of correct predictions to the total number of test samples. Since our data has an uneven class distribution, we also use the F1 score for evaluation. The F1 score is the harmonic mean of precision and recall. Sensitivity is the ratio of correctly predicted positive samples to the total number of positive samples (neoplastic/precancerous class). Similarly, specificity is the ratio of correctly classified negative samples to the total number of negative samples (non-neoplastic class). Precision is the ratio of correctly predicted positives to all predicted positives. Definitions are as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{6}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Precision} = \frac{TP}{TP + FP} \tag{7}$$
where true positive TP is the number of correct predictions for the positive class, while true negative TN is the number of correct predictions for the negative class. False negative FN is the number of samples incorrectly classified to the negative class, whereas false positive FP is the number of samples incorrectly classified to the positive class.
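All five metrics follow directly from the confusion-matrix counts, as in this sketch, with the neoplastic/precancerous class treated as positive:

```python
def classification_metrics(tp, tn, fp, fn):
    """Metrics of Eqs. (6)-(7) computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall on the neoplastic class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "f1": f1, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision}
```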
Figure 3 plots the median metrics and the standard deviation across the five-fold cross validation as a function of the percentage of labeled data. All the performance metrics are shown in Table 2. From Figure 3, we can observe that semi-supervised learning consistently achieves superior performance compared to the baseline for all cases in terms of accuracy & F1 score. The accuracy for semi-supervised learning falls by only 9.96%, as compared to 16.87% for the baseline, when moving to the low data regime. Similarly, for the F1 score we observe a drop of only 0.06 for semi-supervised learning versus a 0.13 drop for the baseline. With only 6.25% labeled data, the semi-supervised model gives an accuracy of 66.80% and an F1 score of 0.79. Using 100% of the labeled data in conjunction with unlabeled data gave the semi-supervised model a boost of 2.85% in accuracy and 0.03 in F1 score. When comparing sensitivity, the semi-supervised approach exceeds the baseline in all cases.
FIGURE 3. Results comparing semi-supervised learning against the baseline as a function of the fraction of labeled training data. Median and standard deviation for 5-fold cross validation are reported.
TABLE 2.
Descriptive statistics comparing the performance of semi-supervised learning against baseline as function of labeled data percentage. The median values across 5-fold cross-validation are reported.
| Labeled Data | Accuracy (%) | | F1 Score | | Sensitivity | | Specificity | | Precision | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Baseline | SSL | Baseline | SSL | Baseline | SSL | Baseline | SSL | Baseline | SSL |
| 6.25% | 57.04 | 66.80 | 0.69 | 0.79 | 0.55 | 0.68 | 0.70 | 0.52 | 0.94 | 0.89 |
| 12.5% | 60.53 | 69.67 | 0.73 | 0.79 | 0.63 | 0.76 | 0.38 | 0.22 | 0.87 | 0.83 |
| 25% | 68.86 | 71.96 | 0.77 | 0.81 | 0.68 | 0.80 | 0.33 | 0.19 | 0.89 | 0.82 |
| 50% | 69.54 | 75.60 | 0.80 | 0.85 | 0.73 | 0.88 | 0.54 | 0.20 | 0.85 | 0.80 |
| 100% | 73.91 | 76.76 | 0.82 | 0.85 | 0.83 | 0.87 | 0.37 | 0.24 | 0.82 | 0.82 |
The semi-supervised model’s improvement over the baseline indicates that adding a jigsaw solving auxiliary task is beneficial. This improvement could be attributed to SSL enabling the network to learn more discriminative features, such as shape, while learning the jigsaw task. Superior performance in the low data regime, and even the extra boost with 100% labeled data, indicates that the jigsaw task effectively leverages unlabeled data. It is worth noting that the baseline outperforms SSL on the specificity metric. For our use case of precancerous lesion classification, sensitivity is more important than specificity, as missing precancerous lesions may lead to delayed treatment, a worse prognosis, and ultimately a reduced survival rate.
D. DOMAIN ADAPTATION
The goal of this experiment is to test the domain generalizability of semi-supervised learning. We train the model on labeled white light images (n=2326) and unlabeled NBI images (n=961), and then test the model using labeled NBI images (n=1685). The architecture, training protocol, and testing protocol remain the same as in the previous subsection IV-C. For the ablation study, the baseline model described in IV-C is used. For training the baseline model, we use the same set of labeled white light images (n=2326) as in SSL training. The hyperparameters used are an initial learning rate of 0.0001 and a weight decay of 0.005 for both cases. For SSL, the number of jigsaw classes (P) is 100 and an unsupervised loss weight of λ = 1 is used.
The results for the domain adaptation experiment are reported in Table 3. To reduce statistical error, we report the mean values over 3 runs initiated with different random seeds. We observe that the semi-supervised model exceeds the baseline in accuracy, F1 score, and sensitivity by 1.92%, 0.02, and 0.07, respectively. This superior performance demonstrates that semi-supervised methods take advantage of unlabeled target images to learn domain-invariant feature representations. This may be enabled by the jigsaw puzzle solver learning the spatial correlation of images.
TABLE 3.
A comparison of baseline and jigsaw semi-supervised learning for domain adaptation.
| | Accuracy | F1 Score | Sensitivity | Specificity | Precision |
|---|---|---|---|---|---|
| Baseline | 77.84% | 0.86 | 0.87 | 0.35 | 0.85 |
| SSL | 79.76% | 0.88 | 0.94 | 0.14 | 0.83 |
E. OUT-OF-DISTRIBUTION DETECTION
In this experiment, we test the semi-supervised model’s performance as an out-of-distribution detector. In our problem setup, we treat white light images as in-distribution samples and NBI images as out-of-distribution samples. The SSL and baseline lesion-classifier models and training algorithms described in IV-C are used in this experiment as well. The training set for the baseline and SSL models consisted of 1,921 labeled white light images, with the SSL model additionally using 1,518 unlabeled WLI images. We used the same hyperparameters as described in IV-D for training the in-distribution models.
During inference, the out-of-distribution detector score κ for the baseline is the KL-divergence between the prediction probabilities and the uniform distribution. For SSL, we add the jigsaw loss to the KL-divergence term to compute κ, as described in Equation 4. The test set consists of 416 white light images (label = 0) and 1,685 NBI images (label = 1). The OOD κ scores and the labels are used to generate a Receiver Operating Characteristic (ROC) curve. The Area Under the Receiver Operating Characteristic (AUROC) is then used as a metric to determine the efficacy of the OOD detector. The AUROC can be interpreted as the probability that the OOD score κ for an out-of-distribution sample is greater than that of an in-distribution sample.
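A sketch of the AUROC evaluation with scikit-learn, assuming the κ scores and OOD labels (WLI = 0, NBI = 1) have been gathered over the test set:

```python
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_ood(ood_labels, kappa_scores):
    """AUROC and ROC curve for OOD detection from per-image kappa scores
    (Eq. (4)) and binary OOD labels."""
    auroc = roc_auc_score(ood_labels, kappa_scores)
    fpr, tpr, thresholds = roc_curve(ood_labels, kappa_scores)
    return auroc, fpr, tpr
```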
Figure 4 shows the results for OOD detection. The ROC curve for the model with the median AUROC among three runs is reported. The SSL model has an AUROC of 0.71, as compared to 0.53 for the baseline. This shows that the unsupervised loss combines well with the KL-divergence term. The results demonstrate that attaching an auxiliary unsupervised head, with a simple score, can drastically improve the capability of the network as an OOD detector.
FIGURE 4. ROC curves for OOD detection comparing the baseline and semi-supervised model.
V. CONCLUSION
In this paper, we explore semi-supervised learning to utilize unlabeled data for improving lesion classification in colonoscopy images. We developed a phased training model using a jigsaw solving task and observed improved performance in metrics including accuracy and F1 score when compared with a purely supervised model. These data demonstrate that the addition of a jigsaw task helps the encoder generate discriminative features. We find that a semi-supervised learning model performs significantly better than a fully supervised method, especially in the low data regime. These results suggest that unsupervised learning strongly regularizes the model.
While the focus of semi-supervised learning works has traditionally been on accuracy metrics, in this paper we also study the effect of SSL on the generalizability and uncertainty of the model. In terms of generalizability, we show SSL’s superior performance to supervised methods for domain adaptation. SSL improves performance on the target domain, using only unlabeled target distribution images. We also show that SSL models are better out-of-distribution detectors as compared to supervised models. This uncertainty measurement can simply be obtained from the prediction probabilities and jigsaw loss without requiring any architectural modifications.
We would like to emphasize that the point of this study is not to present the jigsaw based semi-supervised learning as the best-in-class model for the accuracy, domain adaptation, or OOD detection problems. Instead, we aim to establish proof-of-concept that adding an auxiliary semi-supervised task to supervised methods can significantly improve colonoscopy image analysis. In medical image analysis in general, the paucity of labeled data makes semi-supervised learning an important paradigm. Additionally, since domain generalization and out-of-distribution detection are important challenges in many practical clinical scenarios, semi-supervised learning holds significant promise to facilitate the translation of artificial intelligence techniques to real world applications.
Future work to expand and further validate this general approach includes exploring additional semi-supervised learning tasks, such as image colorization and patch prediction, or even a combination of these proxy tasks in a multitask learning setup. To understand how the supervised objective depends on the semi-supervised proxy, performance should be assessed on a variety of colonoscopy challenges, such as polyp detection and segmentation, alongside additional proxy tasks. It is possible that the jigsaw task may not be optimal for improving the performance of lesion detection, for instance. The improvement in domain adaptation from SSL may be expanded by assessing not only across imaging modalities but also across different endoscopes with varying resolutions, illumination parameters, and frame rates. A deeper analysis of out-of-distribution detection, particularly for different types of out-of-distribution samples and the ‘harder’ near-distribution anomalies, is an important future step. Lastly, it would be valuable to explore how the SSL improvements change as the sizes of both the labeled and unlabeled datasets increase.
Acknowledgments
This work was supported in part by the National Institute of Biomedical Imaging and Bioengineering (NIBIB)-National Institutes of Health (NIH) Trailblazer Award under Grant R21 EB024700.
Biographies

MAYANK GOLHAR received the B.Tech. degree from the Indian Institute of Technology Guwahati, in 2017, with a major in electronics and communication engineering and a minor in computer science. He is currently pursuing the M.S. degree in electrical and computer engineering with Johns Hopkins University. From 2017 to 2019, he was a Senior Software Engineer with the Health and Medical Equipment Division, Samsung R&D Institute India-Bangalore. His research interest includes medical image analysis.

TAYLOR L. BOBROW (Member, IEEE) received the B.S. degree in electrical engineering from Old Dominion University, in 2017. He is currently pursuing the Ph.D. degree in biomedical engineering with The Johns Hopkins University School of Medicine. His research interests include computer vision, computational biophotonics, and gastroenterology.

MIRMILAD POURMOUSAVI KHOSHKNAB received the doctorate degree in medicine from the School of Medicine, Islamic Azad University, Tabriz Branch, and graduated as a General Practitioner, in 2014. He is currently a Research Fellow with the Internal Medicine-Cardiology Department, University of Pennsylvania, Philadelphia, PA, USA. From 2014 to 2019, he was a General Practitioner, Researcher, and Postdoctoral Fellow with the Tabriz University of Medical Sciences, Iran, and with Johns Hopkins University, Baltimore, MD, USA. His research focused on internal medicine, gastrointestinal and cardiology disease, and new diagnostic methods. His awards include recognition as a top researcher at the 8th Middle East Cardiovascular Congress, Istanbul, Turkey.

SIMRAN JIT received the M.B.B.S. degree from the Sri Guru Ram Das Institute of Medical Sciences & Research (SGRD), under the Baba Farid University of Health Sciences, in 2019. She is currently working as a Postdoctoral Fellow with the Division of Gastroenterology and Hepatology, The Johns Hopkins University School of Medicine. Her research interest includes the early detection and prognostication of various gastrointestinal conditions. She also works with Endopro and Epic to precisely understand the type and class of colonic polyps along with their treatment.

SAOWANEE NGAMRUENGPHONG received the M.D. degree from Chulalongkorn University, in 2005. She completed her residency at the Texas Tech University Health Sciences Center. She performed a fellowship in gastroenterology at the Mayo School of Graduate Medical Education and a fellowship in therapeutic endoscopy at the Johns Hopkins Hospital. She is currently an Assistant Professor of Medicine with The Johns Hopkins University School of Medicine. She has a special interest in the endoscopic diagnosis and management of gastrointestinal malignancies, premalignant and malignant lesions in the gastrointestinal tract, esophageal cancer, colon polyps, and pancreatic and biliary disease. She was recognized with an ASGE Cook Medical Don Wilson Award, in 2016. She is a member of several professional organizations, including the Thai Medical Council, the American College of Gastroenterology, and the American Gastroenterological Association.

NICHOLAS J. DURR (Member, IEEE) received the B.S. degree in electrical engineering and computer science from the University of California at Berkeley, in 2003, and the M.S. and Ph.D. degrees in biomedical engineering from The University of Texas at Austin, in 2007 and 2010, respectively. He completed his postdoctoral training in molecular imaging at the Harvard Medical School, from 2010 to 2011, and was an M+Vision Fellow in medical imaging with the Massachusetts Institute of Technology, from 2011 to 2014. He is currently an Assistant Professor of Biomedical Engineering with Johns Hopkins University. His research interest includes computational biophotonics.
REFERENCES
- [1]. Siegel RL, Miller KD, Sauer AG, Fedewa SA, Butterly LF, Anderson JC, Cercek A, Smith RA, and Jemal A, “Colorectal cancer statistics, 2020,” CA Cancer J. Clinicians, vol. 70, no. 3, pp. 145–164, 2020.
- [2]. Joseph DA, Meester RGS, Zauber AG, Manninen DL, Winges L, Dong FB, Peaker B, and van Ballegooijen M, “Colorectal cancer screening: Estimated future colonoscopy need and current volume and capacity,” Cancer, vol. 122, no. 16, pp. 2479–2486, Aug. 2016.
- [3]. Colucci PM, Yale SH, and Rall CJ, “Colorectal polyps,” Clin. Med. Res., vol. 1, no. 3, pp. 261–262, 2003.
- [4]. Hayashi N, Tanaka S, Hewett DG, Kaltenbach TR, Sano Y, Ponchon T, Saunders BP, Rex DK, and Soetikno RM, “Endoscopic prediction of deep submucosal invasive carcinoma: Validation of the narrow-band imaging international colorectal endoscopic (NICE) classification,” Gastrointestinal Endoscopy, vol. 78, no. 4, pp. 625–632, Oct. 2013.
- [5]. Li M, “Kudo’s pit pattern classification for colorectal neoplasms: A meta-analysis,” World J. Gastroenterol., vol. 20, no. 35, p. 12649, 2014.
- [6]. IJspeert JEG, Bastiaansen BAJ, van Leerdam ME, Meijer GA, van Eeden S, Sanduleanu S, Schoon EJ, Bisseling TM, Spaander MC, van Lelyveld N, Bargeman M, Wang J, and Dekker E, “Development and validation of the WASP classification system for optical diagnosis of adenomas, hyperplastic polyps and sessile serrated adenomas/polyps,” Gut, vol. 65, no. 6, pp. 963–970, Jun. 2016.
- [7]. Repici A, Anderloni A, Jovani M, and Hassan C, “Advances, problems, and complications of polypectomy,” Clin. Experim. Gastroenterol., vol. 7, p. 285, Aug. 2014.
- [8]. He K, Fan H, Wu Y, Xie S, and Girshick R, “Momentum contrast for unsupervised visual representation learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9729–9738.
- [9]. Chen T, Kornblith S, Norouzi M, and Hinton G, “A simple framework for contrastive learning of visual representations,” 2020, arXiv:2002.05709. [Online]. Available: http://arxiv.org/abs/2002.05709
- [10]. Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, and Joulin A, “Unsupervised learning of visual features by contrasting cluster assignments,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 1–13.
- [11]. Noroozi M and Favaro P, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 69–84.
- [12]. Min JK, Kwak MS, and Cha JM, “Overview of deep learning in gastrointestinal endoscopy,” Gut Liver, vol. 13, no. 4, p. 388, Jan. 2019.
- [13]. Nogueira-Rodríguez A, Domínguez-Carbajales R, López-Fernández H, Iglesias Á, Cubiella J, Fdez-Riverola F, Reboiro-Jato M, and Glez-Peña D, “Deep neural networks approaches for detecting and classifying colorectal polyps,” Neurocomputing, vol. 423, pp. 721–734, Jan. 2021.
- [14]. Mesejo P, Pizarro D, Abergel A, Rouquette O, Beorchia S, Poincloux L, and Bartoli A, “Computer-aided classification of gastrointestinal lesions in regular colonoscopy,” IEEE Trans. Med. Imag., vol. 35, no. 9, pp. 2051–2063, Sep. 2016.
- [15]. Zhang R, Zheng Y, Mak TWC, Yu R, Wong SH, Lau JYW, and Poon CCY, “Automatic detection and classification of colorectal polyps by transferring low-level CNN features from nonmedical domain,” IEEE J. Biomed. Health Informat., vol. 21, no. 1, pp. 41–47, Jan. 2017.
- [16]. Lui T, Wong K, Mak L, Ko M, Tsao S, and Leung W, “Endoscopic prediction of deeply submucosal invasive carcinoma with use of artificial intelligence,” Endoscopy Int. Open, vol. 7, no. 4, pp. E514–E520, Apr. 2019.
- [17]. Byrne MF, Chapados N, Soudan F, Oertel C, Pérez ML, Kelly R, Iqbal N, Chandelier F, and Rex DK, “Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model,” Gut, vol. 68, no. 1, pp. 94–100, Jan. 2019.
- [18]. Chen P-J, Lin M-C, Lai M-J, Lin J-C, Lu HH-S, and Tseng VS, “Accurate classification of diminutive colorectal polyps using computer-aided analysis,” Gastroenterology, vol. 154, no. 3, pp. 568–575, Feb. 2018.
- [19]. Kandel P, Lalonde R, Ciofoaia V, Wallace MB, and Bagci U, “Su1741 colorectal polyp diagnosis with contemporary artificial intelligence,” Gastrointestinal Endoscopy, vol. 89, no. 6, p. AB403, Jun. 2019.
- [20]. Billah M, Waheed S, and Rahman MM, “An automatic gastrointestinal polyp detection system in video endoscopy using fusion of color wavelet and convolutional neural network features,” Int. J. Biomed. Imag., vol. 2017, pp. 1–9, Aug. 2017.
- [21]. Mahmood F, Yang Z, Ashley T, and Durr NJ, “Multimodal densenet,” 2018, arXiv:1811.07407. [Online]. Available: http://arxiv.org/abs/1811.07407
- [22]. Mahmood F, Yang Z, Chen R, Borders D, Xu W, and Durr NJ, “Polyp segmentation and classification using predicted depth from monocular endoscopy,” Proc. SPIE, Med. Imag., Comput.-Aided Diagnosis, vol. 10950, Mar. 2019, Art. no. 1095011.
- [23]. Ahmad OF, Soares AS, Mazomenos E, Brandao P, Vega R, Seward E, Stoyanov D, Chand M, and Lovat LB, “Artificial intelligence and computer-aided diagnosis in colonoscopy: Current evidence and future directions,” Lancet Gastroenterol. Hepatol., vol. 4, no. 1, pp. 71–80, Jan. 2019.
- [24]. Jing L and Tian Y, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., early access, May 4, 2020, doi: 10.1109/TPAMI.2020.2992393.
- [25]. Pathak D, Krahenbuhl P, Donahue J, Darrell T, and Efros AA, “Context encoders: Feature learning by inpainting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2536–2544.
- [26]. Vondrick C, Pirsiavash H, and Torralba A, “Generating videos with scene dynamics,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 613–621.
- [27]. Caron M, Bojanowski P, Joulin A, and Douze M, “Deep clustering for unsupervised learning of visual features,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 132–149.
- [28]. Doersch C, Gupta A, and Efros AA, “Unsupervised visual representation learning by context prediction,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1422–1430.
- [29]. van den Oord A, Li Y, and Vinyals O, “Representation learning with contrastive predictive coding,” 2018, arXiv:1807.03748. [Online]. Available: http://arxiv.org/abs/1807.03748
- [30]. Gidaris S, Singh P, and Komodakis N, “Unsupervised representation learning by predicting image rotations,” 2018, arXiv:1803.07728. [Online]. Available: http://arxiv.org/abs/1803.07728
- [31]. Pathak D, Girshick R, Dollár P, Darrell T, and Hariharan B, “Learning features by watching objects move,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2701–2710.
- [32]. Croitoru I, Bogolin S-V, and Leordeanu M, “Unsupervised learning from video to detect foreground objects in single images,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4335–4343.
- [33]. Ren Z and Lee YJ, “Cross-domain self-supervised multi-task feature learning using synthetic imagery,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 762–771.
- [34]. Li Y, Paluri M, Rehg JM, and Dollár P, “Unsupervised learning of edges,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1619–1627.
- [35]. Albuquerque I, Naik N, Li J, Keskar N, and Socher R, “Improving out-of-distribution generalization via multi-task self-supervised pretraining,” 2020, arXiv:2003.13525. [Online]. Available: http://arxiv.org/abs/2003.13525
- [36]. Korbar B, Tran D, and Torresani L, “Cooperative learning of audio and video models from self-supervised synchronization,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7763–7774.
- [37]. Arandjelovic R and Zisserman A, “Look, listen and learn,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 609–617.
- [38]. Hendrycks D, Mazeika M, Kadavath S, and Song D, “Using self-supervised learning can improve model robustness and uncertainty,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 15663–15674.
- [39]. Carlucci FM, D’Innocente A, Bucci S, Caputo B, and Tommasi T, “Domain generalization by solving jigsaw puzzles,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2229–2238.
- [40]. Hendrycks D and Gimpel K, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” 2016, arXiv:1610.02136. [Online]. Available: http://arxiv.org/abs/1610.02136
- [41]. Cheplygina V, de Bruijne M, and Pluim JPW, “Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis,” Med. Image Anal., vol. 54, pp. 280–296, May 2019.
- [42]. Iglesias JE, Liu C-Y, Thompson P, and Tu Z, “Agreement-based semi-supervised learning for skull stripping,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Berlin, Germany: Springer, 2010, pp. 147–154.
- [43]. Meier R, Bauer S, Slotboom J, Wiest R, and Reyes M, “Patient-specific semi-supervised learning for postoperative brain tumor segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2014, pp. 714–721.
- [44]. Dittrich E, Riklin Raviv T, Kasprian G, Donner R, Brugger PC, Prayer D, and Langs G, “A spatio-temporal latent atlas for semi-supervised learning of fetal brain segmentations and morphological age estimation,” Med. Image Anal., vol. 18, no. 1, pp. 9–21, Jan. 2014.
- [45]. Gu L, Zheng Y, Bise R, Sato I, Imanishi N, and Aiso S, “Semi-supervised learning for biomedical image segmentation via forest oriented super pixels (voxels),” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2017, pp. 702–710.
- [46]. Bai W, Oktay O, Sinclair M, Suzuki H, Rajchl M, Tarroni G, Glocker B, King A, Matthews PM, and Rueckert D, “Semi-supervised learning for network-based cardiac MR image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2017, pp. 253–260.
- [47]. Wang D, Zhang Y, Zhang K, and Wang L, “FocalMix: Semi-supervised learning for 3D medical image detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 3950–3959.
- [48]. Baur C, Wiestler B, Albarqouni S, and Navab N, “Fusing unsupervised and supervised deep learning for white matter lesion segmentation,” in Proc. Int. Conf. Med. Imag. Deep Learn., 2019, pp. 63–72.
- [49]. Diaz-Pinto A, Colomer A, Naranjo V, Morales S, Xu Y, and Frangi AF, “Retinal image synthesis and semi-supervised learning for glaucoma assessment,” IEEE Trans. Med. Imag., vol. 38, no. 9, pp. 2211–2218, Sep. 2019.
- [50]. Ahsan U, Madhok R, and Essa I, “Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Waikoloa Village, HI, USA, Jan. 2019, pp. 179–189.
- [51]. Zhuang X, Li Y, Hu Y, Ma K, Yang Y, and Zheng Y, “Self-supervised feature learning for 3D medical images by playing a Rubik’s cube,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2019, pp. 420–428.
- [52]. Noroozi M, Vinjimoor A, Favaro P, and Pirsiavash H, “Boosting self-supervised learning via knowledge transfer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 9359–9367.
- [53]. Taleb A, Lippert C, Klein T, and Nabi M, “Multimodal self-supervised learning for medical image analysis,” 2019, arXiv:1912.05396. [Online]. Available: http://arxiv.org/abs/1912.05396
- [54]. Tao X, Li Y, Zhou W, Ma K, and Zheng Y, “Revisiting Rubik’s cube: Self-supervised learning with volume-wise transformation for 3D medical image segmentation,” 2020, arXiv:2007.08826. [Online]. Available: http://arxiv.org/abs/2007.08826
- [55]. Ross T, Zimmerer D, Vemuri A, Isensee F, Wiesenfarth M, Bodenstedt S, Both F, Kessler P, Wagner M, Müller B, Kenngott H, Speidel S, Kopp-Schneider A, Maier-Hein K, and Maier-Hein L, “Exploiting the potential of unlabeled endoscopic video data with self-supervised learning,” Int. J. Comput. Assist. Radiol. Surgery, vol. 13, no. 6, pp. 925–933, Jun. 2018.
- [56]. He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
- [57]. Hendrycks D, Mazeika M, and Dietterich T, “Deep anomaly detection with outlier exposure,” 2018, arXiv:1812.04606. [Online]. Available: http://arxiv.org/abs/1812.04606
- [58]. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, and Desmaison A, “PyTorch: An imperative style, high-performance deep learning library,” in Proc. Adv. Neural Inf. Process. Syst., Wallach H, Larochelle H, Beygelzimer A, Alché-Buc FD, Fox E, and Garnett R, Eds. Red Hook, NY, USA: Curran Associates, 2019, pp. 8024–8035.
- [59]. Kingma DP and Ba J, “Adam: A method for stochastic optimization,” in Proc. 3rd Int. Conf. Learn. Represent., San Diego, CA, USA, 2015, pp. 1–15.
