. 2025 Apr 27;133(8):5013–5025. doi: 10.1007/s11263-025-02402-w

A Closer Look at Benchmarking Self-supervised Pre-training with Image Classification

Markus Marks 1,✉,#, Manuel Knott 1,2,3,4,✉,#, Neehar Kondapaneni 1, Elijah Cole 1,5, Thijs Defraeye 4, Fernando Perez-Cruz 2,3,6, Pietro Perona 1
PMCID: PMC12289721  PMID: 40727247

Abstract

Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels. The model is forced to learn about the data’s inherent structure or context by solving a pretext task. With SSL, models can learn from abundant and cheap unlabeled data, significantly reducing the cost of training models where labels are expensive or inaccessible. In Computer Vision, SSL is widely used as pre-training followed by a downstream task, such as supervised transfer, few-shot learning on smaller labeled data sets, and/or unsupervised clustering. Unfortunately, it is infeasible to evaluate SSL methods on all possible downstream tasks and objectively measure the quality of the learned representation. Instead, SSL methods are evaluated using in-domain evaluation protocols, such as fine-tuning, linear probing, and k-nearest neighbors (kNN). However, it is not well understood how well these evaluation protocols estimate the representation quality of a pre-trained model for different downstream tasks under different conditions, such as dataset, metric, and model architecture. In this work, we study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types. Our study includes eleven common image datasets and 26 models that were pre-trained with different SSL methods or have different model backbones. We find that in-domain linear/kNN probing protocols are, on average, the best general predictors for out-of-domain performance. We further investigate the importance of batch normalization for the various protocols and evaluate how robust correlations are for different kinds of dataset domain shifts. In addition, we challenge assumptions about the relationship between discriminative and generative self-supervised methods, finding that most of their performance differences can be explained by changes to model backbones.

Supplementary Information

The online version contains supplementary material available at 10.1007/s11263-025-02402-w.

Keywords: Computer vision, Self-supervised learning, Benchmarking, Image classification

Introduction

There has been a trend in machine learning where algorithmic improvements follow challenges posed through new datasets and evaluation metrics. How we evaluate new ML methods is therefore crucial, as the community may optimize for flawed (Locatello et al., 2019) or misleading (Musgrave et al., 2020) metrics. Self-supervised learning (SSL) is a promising path to advance machine learning using unlabeled data. It describes techniques that enable learning general image representations from abundant and cheap unlabeled data by solving pretext tasks (Balestriero et al., 2023). Because of its effectiveness, SSL in computer vision has been used in a wide array of domains, ranging from animal behavior (Sun et al., 2023), retinal disease detection (Zhou et al., 2023), computational histopathology (Chen et al., 2023) to remote sensing (Wang et al., 2022). SSL has proven more robust to data distribution shifts than supervised learning (Shi et al., 2022). Fundamentally, SSL methods aim to learn a general representation useful for any downstream task. What is “downstream task performance” and how should it be measured? We first consider the different applications of self-supervised pre-training. Figure 1 (bottom) depicts the main practical applications where self-supervised pre-training is applied:

Fig. 1

SSL application scenarios: We illustrate the following applications of self-supervised learning: a supervised learning (training and fine-tuning on the same dataset), b transfer learning (train on a large dataset and fine-tune the model on a—usually smaller—domain dataset), c semi-supervised learning (train on a large unlabeled dataset and fine-tune on a small labeled subset of it), d unsupervised tasks (train on a dataset and run inference with the resulting model on any dataset to create embeddings that can be used for downstream tasks other than classification). Arrows between protocols and applications indicate a direct relationship

  1. Supervised Learning: A model is pre-trained on dataset A via SSL and then fine-tuned on the same dataset in a supervised way. This procedure can yield higher overall accuracies than supervised training from randomly initialized model weights (He et al., 2021; Bao et al., 2021).

  2. Transfer Learning: A model is pre-trained on dataset A via SSL. The pre-trained backbone is then fine-tuned on a typically smaller labeled domain dataset B. This classical transfer learning paradigm can achieve better results with fewer data on small domain-specific data sets (Li et al., 2020). SSL is usually a better starting point for transfer learning compared to supervised pre-training (Shi et al., 2022), as the latter is prone to overfit on features that are only useful for solving the initial supervised task (Jing & Tian, 2019).

  3. Semi-supervised Learning1: A model is pre-trained on an unlabeled dataset A via SSL, followed by supervised fine-tuning on a small labeled subset of the same dataset. This is particularly useful when data is cheap but labeling is expensive (Chen et al., 2020a, b; Grill et al., 2020; Zhou et al., 2021).

  4. Unsupervised Tasks/Clustering: A model is pre-trained on dataset A via SSL. It is then used to generate embeddings at inference time. These embeddings can be used for various downstream tasks without further training the model (Pandarinath et al., 2018; Higgins et al., 2021; Sun et al., 2023).

Evaluating the performance of SSL methods is challenging since there are endless ways to evaluate their learned representations, and exploring all of them is impossible. The community has developed several evaluation protocols to compare the representations’ quality, resulting in proxy metrics for unobserved downstream tasks. Many of these protocols use the learned representation to solve classification tasks, for example, through linear probing, end-to-end fine-tuning, or by evaluating the embedding representation with a kNN classifier. The similarities and differences in the expressiveness of various protocols are understudied, which leads to an inconsistent evaluation and comparison of SSL methods (Appendix A in Supplementary Materials). In this work, we show how reliably different protocols rank SSL methods w.r.t. their performance on different downstream tasks. In detail, our contributions are as follows:

  • We survey existing papers on self-supervised learning methods for images and provide a structured summary of established evaluation protocols.

  • We correlate in-domain (ID) and out-of-domain (OOD) top-1 and top-5 classification accuracies obtained from fine-tuning, linear probing, and kNN probing on 26 SSL-pretrained models. We show that linear/kNN probing protocols yield the proxy metrics that can, on average, best predict the ranking of SSL methods on eleven OOD datasets.

  • We explore two kinds of domain shifts—categorical shift (either with coarse-grained or fine-grained features) and style shift—and find that in-domain proxy metrics vary in their predictiveness for each type of domain shift.

  • We compare generative and discriminative SSL protocols for ResNet and ViT backbones. We find that relative differences in linear probing and fine-tuning performance are more due to backbone architecture than the SSL family.

Related Work

Self-supervised Learning

Self-supervised learning plays a crucial role in the recent success of natural language processing models (Qiu et al., 2020; Devlin et al., 2019) and computer vision (Chen et al., 2020b; Zhou et al., 2021; Caron et al., 2021; He et al., 2021) and finds applications in tasks like speech recognition (Oord et al., 2018), video classification (Feichtenhofer et al., 2022), point cloud reconstruction (Yu et al., 2022) or behavioral analysis (Sun et al., 2023). SSL relies on designing pretext tasks, forcing the model to learn a functional representation of the data without providing external labels (Balestriero et al., 2023). Most SSL algorithms for images fall into one of two major categories: discriminative and generative methods (Liu et al., 2021).

Discriminative methods. Contrastive SSL methods for vision generate augmentations of samples and discriminate them from other samples in the data set (Chen et al., 2020b; He et al., 2019). These methods rely on negative samples and, therefore, require large batch sizes. A second line of work (self-distillation) solely relies on positive samples (Grill et al., 2020; Caron et al., 2020). Yet another group of clustering-based methods utilizes pseudo-labels based on k-means clustering in order to learn image representations (Caron et al., 2018; Yan et al., 2019).

Generative methods. Transformers (Vaswani et al., 2017) are the current state-of-the-art deep neural network architecture across many AI fields, bridging language and vision models. Inspired by pretext tasks for language transformer models, such as masking in BERT (Devlin et al., 2019), He et al. (2021) recently introduced masked auto-encoding for images, an effective pre-training method by which an image is split into patches and about 75 percent of the patches are masked. Based on the remaining patches, the transformer reconstructs the masked patches. MaskFeat (Wei et al., 2022) showed that the use of HOG features (Dalal & Triggs, 2005) as reconstruction targets of masked patches is an effective pretext task. Recent work combines masked image modeling with language-guided representations (Fang et al., 2022; Hou et al., 2022). Another approach focuses on pixel-level reconstruction, alleviating the problem of missing foreground information that can occur with patch-based reconstruction approaches (Liu et al., 2023).

SSL Evaluation Protocols

In general, self-supervised pre-training aims to learn useful representations across various downstream tasks. However, the quality of representations varies depending on the task. For example, some tasks may require representations invariant to certain transformations, while others may require representations preserving fine-grained details. For those reasons, designing evaluation protocols and associated metrics that capture all aspects is challenging. We conducted a literature survey on the different evaluation metrics used in SSL papers (Appendix A in Supplementary Materials). This section gives an overview of the most popular evaluation metrics. Typically, a study uses a set of a few metrics to evaluate the performance. This study mainly focuses on classification-based protocols, for which we identified four variations described in more detail below. Figure 1 illustrates their relationship to the previously mentioned use cases.

K-nearest neighbors (kNN). kNN classification is a way of probing the model, assuming that similar samples should have close Euclidean proximity in the latent space (see, e.g., Caron et al., 2020, 2021; Wu et al., 2021; J. Zhou et al., 2018). Compared to linear probing, kNN classifiers are fast and computationally light to deploy, often without an iterative learning setup (Caron et al., 2021). Since kNN requires no training, one could argue that this is the most direct and cheapest evaluation of representation learning. However, clustering in high-dimensional spaces can be challenging (Assent, 2012). Another issue is that different dimensions do not necessarily have the same scale and might need to be normalized.
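
To make the protocol concrete, here is a minimal, self-contained sketch of a kNN probe over frozen embeddings, with per-dimension z-score normalization to address the scale issue mentioned above. The array names (`train_emb`, etc.) are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def knn_probe(train_emb, train_labels, test_emb, test_labels, k=5, normalize=True):
    """Evaluate frozen embeddings with a k-nearest-neighbor classifier.

    Per-dimension z-score normalization (fit on the training split)
    addresses differing feature scales; distances are plain Euclidean.
    """
    if normalize:
        mu, sigma = train_emb.mean(0), train_emb.std(0) + 1e-8
        train_emb = (train_emb - mu) / sigma
        test_emb = (test_emb - mu) / sigma
    # Pairwise squared Euclidean distances (test x train).
    d2 = ((test_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]       # indices of the k nearest neighbors
    votes = train_labels[nn]                 # their labels
    preds = np.array([np.bincount(v).argmax() for v in votes])
    return (preds == test_labels).mean()     # top-1 accuracy
```

Because no weights are learned, the probe's cost is dominated by the nearest-neighbor search, which is why it is often the cheapest protocol to run.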

Linear probing. In most cases, the classifier is implemented as a logistic regression model via a single fully-connected layer, usually referred to as linear probing (see, e.g., Caron et al. 2021; Grill et al. 2020; Chen et al. 2020a, b; Misra and Maaten 2019; He et al. 2021; Bao et al. 2021; Dong et al. 2023; Xie et al. 2021; Zhou et al. 2021; Goyal et al. 2021). The intuition here is that the learned representation is good if the dataset classes (which the model was not trained on) are linearly separable. Besides linear and kNN probing, researchers sometimes use other shallow classifiers, e.g., Support Vector Machines (Caron et al., 2020; Wu et al., 2018; Doersch et al., 2015; Pathak et al., 2016; Zhang et al., 2016).
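
A linear probe of this kind can be sketched as multinomial logistic regression fitted on frozen embeddings. The following minimal NumPy version (full-batch gradient descent, hypothetical helper names) is an illustration of the idea, not the authors' implementation, which would typically use an SGD-trained linear layer on top of the frozen backbone.

```python
import numpy as np

def linear_probe(emb, labels, n_classes, epochs=200, lr=0.5):
    """Fit a single fully-connected layer (multinomial logistic
    regression) on frozen embeddings via full-batch gradient descent."""
    n, d = emb.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]               # one-hot targets
    for _ in range(epochs):
        logits = emb @ W + b
        logits -= logits.max(1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(1, keepdims=True)
        grad = (p - Y) / n                      # softmax cross-entropy gradient
        W -= lr * emb.T @ grad
        b -= lr * grad.sum(0)
    return W, b

def probe_accuracy(W, b, emb, labels):
    """Top-1 accuracy of the fitted linear head."""
    return ((emb @ W + b).argmax(1) == labels).mean()
```

Only `W` and `b` are trained; the backbone producing `emb` stays frozen, which is what distinguishes probing from fine-tuning.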

End-to-end fine-tuning. Like linear probing, the end-to-end fine-tuning protocol replaces the last layer of a model with a linear classifier. In this setting, all model parameters are trained, allowing latent representations to adapt to the supervised task and/or a new data set (see, e.g., Misra and Maaten 2019; Chen et al. 2020b, a; He et al. 2021; Bao et al. 2021; Chen et al. 2021; Zhou et al. 2021; Wei et al. 2022; Hou et al. 2022). Some papers use a partial fine-tuning protocol where only parts of the model are trained (He et al., 2021; Noroozi & Favaro, 2016; Yosinski et al., 2014).

Few-shot fine-tuning. The few-shot learning protocol follows the same procedure as end-to-end fine-tuning but only uses a subset of the available training labels (typically 10% or 1%) (see, e.g., Grill et al. 2020; Chen et al. 2020b; Caron et al. 2020; Zhou et al. 2021; Goyal et al. 2021), which makes evaluation significantly more efficient.
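
The label subsampling behind such a few-shot protocol can be sketched as a simple class-stratified draw; the helper below is a hypothetical illustration of how a 10% or 1% label subset might be selected, not the paper's code.

```python
import random
from collections import defaultdict

def few_shot_subset(labels, fraction=0.10, seed=0):
    """Pick a class-stratified subset of training indices, e.g. 10% or 1%
    of the labels, for a few-shot fine-tuning protocol."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    subset = []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        keep = max(1, round(fraction * len(idxs)))  # at least one sample per class
        subset.extend(idxs[:keep])
    return sorted(subset)
```

Stratifying per class keeps rare classes represented, which matters at the 1% level where a uniform draw could drop classes entirely.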

Some common protocols that do not use classification are not included in the experimental part of this study but should be mentioned at this point. Our survey found that task transfer protocols, such as object detection (Misra & Maaten, 2019; He et al., 2021; Chen et al., 2020b; Caron et al., 2020; Dong et al., 2023; Zhou et al., 2021), semantic segmentation (Grill et al., 2020; He et al., 2021; Bao et al., 2021; Misra & Maaten, 2019; Zhou et al., 2021), depth estimation (Grill et al., 2020), copy detection (Caron et al., 2021), image retrieval (Caron et al., 2021), and super-resolution (Bao et al., 2021), are frequently used to benchmark SSL methods. Less commonly, unsupervised clustering, e.g., k-means, is used in the context of SSL evaluation (see, e.g., Gansbeke, Vandenhende, Georgoulis, Proesmans, & Gool, 2021; J. Zhou et al., 2020).

Studies on SSL Evaluation Protocols

While the community focuses on improving the capacity of SSL methods, evaluation protocols are seldom challenged. However, some studies can be used as references.

Kim et al. (2022) compared self-supervised and supervised pre-training for domain transfer. They evaluated fine-tuning accuracy on four downstream datasets for models pre-trained on ImageNet either with supervision or with SSL. Comparing four SSL methods, they found that supervised pre-training consistently outperformed SSL in terms of OOD fine-tuning accuracy. As a shortcoming of their study, they mention the lack of combinations of different backbones with different SSL methods, which we address in our study.

Yang et al. (2022) define an OOD benchmark for large language models. They find that distribution shifts between ID and OOD dominate OOD generalization results for language. They also find that discriminative models show a stronger linear correlation between ID and OOD performance than generative models. They find that linear probing shows relatively low ID and OOD accuracy, differing from findings in computer vision, where Kumar et al. (2022) find that FT can do worse than LP for large distribution shifts.

Newell and Deng (2020) find that the performance of an SSL algorithm in one setting might not translate to another. Moreover, they see that LP performance does not correlate with FT performance. Linear transferability occurs when data from the same class in different domains are more related to each other than data from other classes in different domains (HaoChen et al., 2022).

Ibrahim et al. (2022) measure the robustness of SOTA vision models, including SSL models, against distribution shifts w.r.t. factors of variation such as background, pose, etc., and find that the learning objective is more impactful for robustness than architecture. Other studies have focused on understanding optimal SSL methods in the context of various metrics on ImageNet, such as fine-tuning accuracy, linear probing, and k-nearest neighbors (Ericsson et al., 2021; Gwilliam & Shrivastava, 2022).

Miller et al. (2021) ask how in-domain accuracy predicts out-of-domain accuracy under distribution shift, e.g., training on CIFAR-10 (Krizhevsky & Hinton, 2009) and testing on CIFAR-10.1 (Recht et al., 2019). They find that the linear trend between in-domain and out-of-domain performance holds across many, but not all, datasets.

Cole et al. (2022) explore challenges in generalizing contrastive self-supervised learning beyond ImageNet, finding limitations with respect to data quantity, domain transfer, robustness, and fine-grained task performance.

Recently, Goldblum et al. (2024) compared a wide range of architectural backbones and SSL setups on multiple datasets and downstream tasks. They focus on finding the backbone and method that generalizes best. In contrast, our research focuses on which metric to use when developing SSL methods. Our findings challenge the wide use of fine-tuning as a metric (He et al., 2021; Feichtenhofer et al., 2022; Wei et al., 2022), as it does not strongly predict performance across different tasks and metrics.

Liu et al. (2021) conducted a comparative study between discriminative and generative SSL methods among several domains (not limited to vision). They claim that contrastive learning methods—MoCo and SimCLR in particular—are effective if the downstream task is classification, while this is not obvious for many generation tasks.

While some work has compared discriminative and generative models’ influence on performance in vision (Bao et al., 2021; Wei et al., 2022; Yu et al., 2021), our study posits that the backbone of the model has more impact on performance than pre-training or pretext tasks. Specifically, we compare Vision Transformers (ViTs) with Residual Networks (ResNets), both indirectly and directly.

A recent study (Lee et al., 2023) presents a motivation akin to ours. However, the scope of our study is considerably larger as we compare more models (26 compared to 7), more OOD datasets (11 compared to 4), and a broader range of evaluation protocols, including three distinct fine-tuning protocols (100%, 10%, and 1%). Additionally, our study evaluates various model architectures, contrasting ResNets with Vision Transformers, and explores different types of domain shifts—categorical and style—by selecting transfer learning datasets.

Experimental Setup

Models and protocols. Our experiments are based on pre-trained models published by the original authors (if available) or replicas that achieve similar results to those reported in the original papers (see Appendix E in Supplementary Materials for sources of the pre-trained model weights). We use ResNet-50 and ViT-B/16 backbones in this study. All models were pre-trained on ImageNet-1k (Russakovsky et al., 2015). We measure the accuracies of an SSL method on its training dataset (ImageNet) using five evaluation protocols: linear probing, kNN probing, and three variations of end-to-end fine-tuning with 100%, 10%, or 1% of the available training data (see Fig. 1, top). In addition, we compute kNN, linear probing, and fine-tuning (100%) metrics on multiple OOD datasets for each SSL method.

Correlation analysis. We correlate the results of the different protocols across 26 different SSL methods. Linear and kNN probing are evaluated with and without normalizing the embedding. We found that normalization has no significant effect on some models and a large positive effect on others, especially those using masked image modeling. This aligns with findings from previous research (Lee et al., 2023). All LP and kNN results reported in the main part of this paper use the normalized version (non-normalized results are reported in the Supplementary Materials).

OOD Datasets. For domain-shift analyses, we select our datasets following insights from previous work. It has been shown that the performance of current SSL models depends on the granularity of the dataset classes (Cole et al., 2022). We, therefore, choose datasets of different granularities in our study. We choose Caltech-256 (Griffin et al., 2022), Pascal VOC 2012 (Everingham et al., 2010), and iNaturalist 2021 mini (Van Horn et al., 2021) (“Family” target) as representative datasets with coarse-grained classes. In addition, we evaluate CUB (Wah et al., 2011) and two more variations of iNaturalist 2021 mini—with “Genus” or “Species” as target classes—as fine-grained datasets (see Appendix D in Supplementary Materials for details on how the iNaturalist datasets are constructed). We also compare categorical domain shift and stylistic domain shift with respect to ImageNet. We group all previously mentioned datasets, excluding Pascal VOC, to create a group with no or few shared categories with ImageNet. This group is our categorical domain shift group. We use the ImageNet-D (Rusak et al., 2022) dataset (ImageNet vocabulary but different styles) for stylistic domain shifts. A tabular overview of the dataset assignment described in this paragraph can be found in Table S.6 (Supplementary Materials).

Hyperparameter selection. Usually, researchers sweep over a set of different hyperparameters during pre-training to find the best configuration for their method and evaluation protocols. This results in a variety of different hyperparameters for the same protocol. Therefore, it is very difficult to directly compare the reported metrics as they are confounded by the different choices of hyperparameters. We decided to standardize our protocols by finding “typical” hyperparameter configurations for each of the protocols derived from the literature and use them for all models (see Appendix E in Supplementary Materials for implementation details). Consequently, the metrics we found in our experiments may deviate from the ones reported by the original authors. However, this standardization is crucial as our goal is not to benchmark the overall performance of different SSL methods but to correlate evaluation metrics under comparable conditions.

Robustness. In order to quantify variance introduced by random seeding, we calculate means and standard deviations for one model per protocol and dataset and repeat the same experiments for this selection three times (see Appendix F in Supplementary Materials).

Results

Which In-Domain Metric Best Predicts Out-of-Domain Rankings on Average?

We begin our analysis by visualizing the rank correlations averaged across all models and datasets. In the left panel of Fig. 2, we see the ID metrics correlated against themselves. As expected, ID metrics generally correlate highly, with linear probing and 10%-fine-tuning having the highest average correlation coefficients (r=0.90 and r=0.91, respectively) and fine-tuning having the lowest (r=0.79). Notably, linear and kNN probing correlate almost perfectly (r=0.99) when features are normalized (Fig. 2). When comparing ID with OOD metrics, correlation coefficients are visibly lower, indicating that domain shifts affect both the absolute accuracy and the ranking of different SSL representations. We further investigate these effects in Sect. 4.2.
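
The rank correlations reported throughout this section are Spearman coefficients, i.e., the Pearson correlation of the rank vectors of two accuracy lists. A dependency-free sketch (with tie-aware fractional ranking; function names are ours, not the paper's) illustrates the computation:

```python
def rank(xs):
    """Ranks with ties averaged (fractional ranking), 1-based."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a group of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's r = Pearson correlation of the rank vectors."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)
```

Because only ranks enter the computation, two protocols can correlate perfectly even when their absolute accuracies differ, which is exactly the property needed to compare method rankings across protocols.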

Fig. 2

Comparing Spearman rank correlations of top-1 classification accuracies obtained by different evaluation protocols (kNN: k-nearest neighbors, LP: linear probing, FT: fine-tuning, FT-10%: 10%-fine-tuning, FT-1%: 1%-fine-tuning). In-domain (ID) refers to ImageNet-1k, which was also used for pre-training. Out-of-domain (OOD) metrics are averaged over eleven datasets as described in Sect. 3. In-domain metrics generally correlate highly (left panel), with fine-tuning having the weakest average correlation coefficient. When comparing ID with OOD protocols (right panel), correlation coefficients are visibly lower, indicating a domain-shift effect that impacts the absolute accuracy and the protocols’ rank ordering (correlation). A more verbose version of these matrices showing additional protocol variations (with and without feature normalization) is shown in Fig. S.3 (Supplementary Materials)

Next, we target the question of whether an ID metric on the training data is a good proxy for OOD use cases. We consider two use cases: (1) transfer learning, expressed by OOD fine-tuning accuracies averaged over multiple datasets, and (2) unsupervised representation learning, expressed by OOD kNN and/or linear probing accuracies averaged over multiple datasets. When comparing ID and OOD accuracies, the probing protocols (kNN and linear) correlate most with themselves and each other. Overall, linear probing is the best OOD predictor when averaged across metrics and datasets (r=0.85), followed closely by kNN (r=0.84). Interestingly, the two few-shot protocols appear to be the best predictors of model performance for OOD fine-tuning. This correlation may arise from two primary factors: First, batch normalization on the embeddings is applied during the probing protocols but not during fine-tuning, which we discuss in more detail in Sect. 4.3. Second, the transfer learning datasets in our study contain fewer training samples than ImageNet, so the number of training steps during OOD fine-tuning more closely resembles that of the few-shot ID protocols than that of fine-tuning on the full ImageNet dataset.

Summary: While probing protocols are the best OOD predictors on average, one should rely on few-shot fine-tuning (10%) to predict the transfer learning capability for OOD fine-tuning.

How Do Protocols Differ Under Different Kinds of Domain Shift?

Figure 3 assesses the average rank correlations of top-1 accuracies under the four described domain shifts. In each panel, we explore the rank correlation of each in-domain metric against a single OOD metric averaged over domain shift grouped datasets. Generally, when comparing ID metrics with OOD metrics, we observe no notable difference between fine-grained and coarse-grained datasets but a significantly lower correlation for style-shift datasets.

Fig. 3

Spearman rank correlations of top-1 classification accuracies derived from in-domain and out-of-domain protocols under certain types of domain shift. We differentiate between fine-grained and coarse-grained categorical domain shifts (left half of each panel). Further, we compare categorical with stylistic domain-shift (right half of each panel). Black rectangles highlight when the same ID and OOD evaluation protocol is used

OOD kNN. We can see that both ID probing protocols (kNN and linear) can reliably predict the ranking of OOD kNN accuracy for categorical domain shifts and less reliably for style-related domain shifts (left panel, top row). The correlation coefficient for categorical shift (kNN r=0.96, LP r=0.94) is notably higher than the equivalent for fine-tuning (r=0.58).

OOD LP. The general pattern is similar to OOD kNN probing, with ID kNN and ID LP being the strongest predictors and style shifts having a stronger impact than categorical shifts. Few-shot fine-tuning protocols (1% and 10%) correlate slightly more with OOD LP than with OOD kNN.

OOD FT. In Fig. 2, ID FT is weakly correlated with OOD FT. In Fig. 3, we see that FT rankings are less predictable with respect to shifts in both category (r=0.73) and style (r=0.61). Surprisingly, in-domain probing (ID LP, ID kNN) and few-shot fine-tuning (ID FT-10%, ID FT-1%) protocols correlate better with OOD FT across all types of domain shifts. As these protocols are significantly cheaper than full end-to-end fine-tuning (see Table S.8 in Supplementary Materials for the estimated computational costs of each ID protocol in our experiments), they can serve as a proxy for the ranking of OOD fine-tuning performance.

Summary: The ranking of SSL methods is more robust for categorical and less for stylistic domain shifts under all protocols. There is no significant difference between fine-grained and coarse-grained categorical shifts.

What is the Effect of Embedding Normalization on Different Protocols?

Previous research has pointed out the importance of embedding normalization for linear (He et al., 2021; Lee et al., 2023) and kNN (Lee et al., 2023) probing. Our experiments confirm that using batch normalization before the final classification layer in linear probing and z-score normalization for kNN probing can significantly increase accuracy (see Table S.2 in Supplementary Materials). While the effect is strong for models with unscaled embedding representations (e.g., SimSiam and MaskFeat in our case), others (e.g., DINO) are neither positively nor negatively affected by normalization.

While batch normalization is common for linear probing, it is not established for fine-tuning protocols, presumably because feature scaling is assumed to be resolved during training when model weights are not frozen. We challenge this assumption and claim that it only holds if a model is trained long enough (e.g., 100 epochs on full ImageNet); fine-tuning on smaller datasets or for fewer epochs can yield significantly higher accuracies when batch normalization is applied. Figure 4 displays fine-tuning accuracies with and without batch normalization for all the datasets included in our study, together with the total number of optimizer steps.
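
As a sketch of the normalization step in question, the following minimal BatchNorm-style module standardizes embedding features using batch statistics during training and running statistics at evaluation. It is an illustration of the mechanism, not the exact layer used in the experiments.

```python
import numpy as np

class EmbeddingBatchNorm:
    """Batch normalization over embedding features, as applied before a
    linear classification head. Running statistics are used at eval time."""

    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            m, v = x.mean(0), x.var(0)
            # exponential moving average of batch statistics
            self.mean = (1 - self.momentum) * self.mean + self.momentum * m
            self.var = (1 - self.momentum) * self.var + self.momentum * v
        else:
            m, v = self.mean, self.var
        return (x - m) / np.sqrt(v + self.eps)
```

For models with poorly scaled embeddings, inserting such a layer in front of the classifier rescales every feature to comparable magnitude, which is why it helps most when few optimizer steps are available to learn the scaling implicitly.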

Fig. 4

Fine-tuning accuracies with and without batch normalization for two exemplary models that appear to have scaled (left, DINO+ResNet-50) and unscaled (right, MaskFeat+ViT-B/16) embedding representations. The x-axes display all datasets included in this study and the number of optimizer steps derived from the dataset size, batch size, and total number of epochs. For MaskFeat, batch normalization has a significant effect when the number of optimizer steps is small and only a small effect when the number of steps is large, implying less-scaled features compared to DINO

Summary: For models with unscaled features, batch normalization is critical for linear/kNN probing but also when fine-tuning on small datasets.

How Do Different SSL Families and Architectures Perform Under the Various Protocols?

Previous work hypothesized that generative SSL methods achieve higher fine-tuning accuracies through expressive but non-linear features (He et al., 2021). Conversely, contrastive SSL methods achieve better linear probing performance through linearly separable features due to a discriminative loss function (Wei et al., 2022). Following this hypothesis, recent studies have excluded linear probing altogether and have only used the fine-tuning protocol (Feichtenhofer et al., 2022).

In Fig. 5, we plot the relation between each model’s fine-tuning and linear probing performance on ImageNet. We see that, indeed, the models with a generative loss (MaskFeat (Wei et al., 2022), MAE (He et al., 2021), BEiT v2 (Peng et al., 2022), and iBOT (Zhou et al., 2021)) have a larger performance gap between fine-tuning and linear probing performance on ImageNet than discriminative models. However, generative methods have been introduced in more recent publications, and ViT backbones are frequently used instead of CNNs. Could this cause the relative difference between fine-tuning and linear-probing performance? The figure shows that ViT backbones are all above the regression line, indicating a higher fine-tuning accuracy relative to their linear probing accuracy compared to other models. We can more directly assess this effect using two SSL methods that use both the ResNet-50 and ViT-B/16 backbones (DINO and MoCo-v3). For these models, we see that switching to a ViT backbone moves them from below the regression line to above. This suggests that the relatively higher fine-tuning accuracy is caused by a difference in backbone architectures rather than the SSL family, which contrasts with previous hypotheses (Wei et al., 2022; He et al., 2021).
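
The "above/below the regression line" reading can be made concrete by fitting fine-tuning accuracy against linear-probing accuracy across models and inspecting each model's residual. This small sketch (helper name and toy numbers are ours) illustrates the idea, not the paper's actual analysis:

```python
import numpy as np

def residuals_from_fit(lp_acc, ft_acc):
    """Fit ft = a * lp + b by least squares across models and return
    per-model residuals. A positive residual means a model's fine-tuning
    accuracy is higher than its linear-probing accuracy would predict,
    i.e., the model sits above the regression line."""
    lp, ft = np.asarray(lp_acc), np.asarray(ft_acc)
    a, b = np.polyfit(lp, ft, 1)   # slope and intercept of the fit
    return ft - (a * lp + b)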

Fig. 5.

Fig. 5

Scatter plot of the correlation of linear-probing and fine-tuning accuracies for ImageNet (in-domain). Each dot represents a model. The color codes for the model family, i.e., blue for discriminative and orange for generative models. Shapes indicate which backbones were used. The dotted line represents the equal error line; the solid line is a linear regression with a 90% confidence interval

Summary: Differences in linear probing performance between generative and discriminative models can be explained through different backbones rather than SSL methods.
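The above-the-line analysis can be reproduced with a simple least-squares fit: regress fine-tuning accuracy on linear-probing accuracy and inspect the sign of each model's residual per backbone family. The accuracies below are hypothetical placeholders for illustration, not the paper's measured numbers.

```python
import numpy as np

# Hypothetical (lp, ft) ImageNet accuracies for a handful of models,
# tagged by backbone family. Not the paper's actual results.
lp = np.array([0.70, 0.72, 0.75, 0.68, 0.76, 0.74])
ft = np.array([0.82, 0.85, 0.86, 0.81, 0.87, 0.84])
backbone = np.array(["resnet", "vit", "vit", "resnet", "vit", "resnet"])

# Fit ft ~ lp; a positive residual means the model fine-tunes better
# than its linear-probing accuracy predicts (it lies above the line).
slope, intercept = np.polyfit(lp, ft, deg=1)
residuals = ft - (slope * lp + intercept)
mean_residual = {b: residuals[backbone == b].mean() for b in np.unique(backbone)}
```

If ViT models systematically show positive mean residuals while ResNet models show negative ones, the fine-tuning advantage tracks the backbone rather than the loss family, which is the pattern reported in Fig. 5.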

How Do Rank Correlations Relate to Absolute Performance?

Thus far, we have analyzed rank correlations between ID and OOD metrics. These correlations tell us how well the ranking of representations produced by an ID metric respects their ranking on the eleven OOD datasets we consider, and we have found several interesting trends. In Fig. 6, we show how these trends manifest in absolute performance by plotting OOD accuracy against ID accuracy for each of our three protocols. Whether OOD accuracy is higher or lower than ID accuracy depends on several factors, such as the representation quality and the similarity of the target dataset to ImageNet. For certain datasets (e.g., CUB and iNat-mini), OOD performance can be significantly worse even though the method ranking is very similar. On three datasets, Pascal VOC, Caltech-256, and CUB, kNN and linear probing show a more linear relationship between ID and OOD accuracy than fine-tuning: several SSL methods reach almost the same ID fine-tuning accuracy but significantly different OOD fine-tuning accuracies.

Fig. 6.

Fig. 6

ID vs. OOD accuracy on different protocols and datasets. We compare both top-1 and top-5 classification accuracies. Correlation coefficients r are calculated using Spearman’s rank correlation. ImageNet-D accuracies are averaged across the six datasets. Individual ImageNet-D visualizations can be found in Fig S.1 (Supplementary Materials)

Summary: ID evaluation protocols can be robust proxies to estimate the ranking of SSL methods, but not their absolute performance.
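The distinction between ranking and absolute performance is exactly what Spearman's rank correlation captures: it is the Pearson correlation of the ranks, so a uniform drop in OOD accuracy leaves it unchanged. A minimal sketch with hypothetical accuracies (not the paper's numbers):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Assumes no ties; scipy.stats.spearmanr handles the general case.
    """
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical per-model accuracies for illustration.
id_acc  = [0.75, 0.71, 0.78, 0.69, 0.74]  # e.g., ImageNet linear probing
ood_acc = [0.58, 0.50, 0.60, 0.48, 0.55]  # e.g., fine-tuning on an OOD dataset

# rho = 1 here: the ID protocol perfectly preserves the model ranking
# even though absolute OOD accuracies are much lower.
rho = spearman_rho(id_acc, ood_acc)
```

This is why an ID protocol can be a robust proxy for ranking SSL methods while being a poor predictor of absolute OOD accuracy.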

Discussion

Self-supervision is a powerful way of leveraging unlabeled data for further downstream tasks. Since the performance benchmark we choose will influence algorithmic development, it is crucial to evaluate SSL methods correctly for their intended purpose. However, SSL evaluation is non-trivial since performance depends on the metric, the training dataset, the testing dataset, the computation required, and the downstream task. We systematically investigate the performance of 26 SSL models on eleven datasets to evaluate which metric(s) should be used when benchmarking SSL models.

First, we find that linear and kNN probing accuracies are highly correlated when embedding normalization is applied. They can be used mostly interchangeably and are, on average, the best predictors for OOD metrics. Remarkably, we find that 10%-fine-tuning on ImageNet is the strongest predictor for the ranking of SSL methods in OOD fine-tuning. This is particularly relevant for downstream users who want to use SSL pre-trained models for transfer-learning classification tasks.

Second, we find that linear/kNN probing is more robust to shifts in label granularity and (to some extent) style than fine-tuning protocols. When comparing ID accuracy against OOD accuracy directly, we see that several SSL methods can have equivalent ID fine-tuning results but much weaker OOD fine-tuning results (Fig. 6).

Third, it was previously assumed that discriminative and generative SSL models differ in the type of representations they learn, with generative SSL methods producing powerful but non-linear representations that require fine-tuning. Using our comprehensive benchmark, we find that these performance differences may instead be attributed to the backbones used by the different SSL methods.

Fourth, we investigate the importance of embedding normalization for several protocols. We confirm the findings of previous work (Lee et al., 2023) regarding the impact of batch normalization on probing protocols. In addition, we highlight the effect of batch normalization on end-to-end fine-tuning as a function of dataset size.

Societal Impact. As the amount of data and the number of AI applications grow, self-supervised learning plays an increasingly important role. SSL allows us to train models on unlabeled data, reducing human annotation effort and the biases it introduces. Therefore, an SSL evaluation metric that is predictive of various downstream tasks, i.e., of applications in the real world, is critical. In addition, SSL methods are more computationally expensive than supervised learning and therefore have a higher environmental impact. We must ensure that evaluation metrics accurately represent the utility of our methods so that the time and resources spent on SSL development are not wasted.

Limitations. Our study has some limitations. While we cover several datasets and evaluation protocols, many more can still be considered. For example, SSL representations are commonly evaluated for other vision tasks like semantic segmentation, object detection, or depth estimation. Evaluating all of these is very costly and, therefore, beyond the scope of this study. Another limitation is categorizing each dataset as ID or OOD in a binary way. In future work, one could try to quantify dataset dissimilarity and use this as a proxy for how far out of distribution a dataset is. Finally, future work should find theoretical grounding for our findings with respect to the interplay between SSL method, backbone, training dataset, and type of domain shift, similar to previous works (Cabannes et al., 2023).
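One possible realization of the suggested dataset-dissimilarity proxy (our illustration, not a method from the paper) is to fit a diagonal Gaussian to the embedding cloud of each dataset and compare the fits with a Fréchet-style distance; larger values would indicate a dataset that is further out of distribution.

```python
import numpy as np

def embedding_domain_gap(emb_a, emb_b):
    """Fréchet distance between diagonal-Gaussian fits of two embedding sets.

    A hypothetical proxy for how far a target dataset lies from the
    pre-training distribution; restricted to diagonal covariances so it
    needs only per-dimension means and variances.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    var_a, var_b = emb_a.var(axis=0), emb_b.var(axis=0)
    return float(((mu_a - mu_b) ** 2).sum()
                 + (var_a + var_b - 2.0 * np.sqrt(var_a * var_b)).sum())

# Identical embedding clouds have zero gap; a shifted clone has a large one.
rng = np.random.default_rng(0)
emb_id = rng.normal(size=(500, 64))          # "pre-training-like" embeddings
emb_ood = rng.normal(size=(500, 64)) + 2.0   # hypothetical shifted domain
gap = embedding_domain_gap(emb_id, emb_ood)
```

Such a continuous score could replace the binary ID/OOD categorization used in this study.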

Supplementary Information

Below is the link to the electronic supplementary material.

Acknowledgements

Manuel Knott was supported by an ETH Zurich Doc.Mobility Fellowship. Pietro Perona and Markus Marks were supported by the National Institutes of Health (NIH R01 MH123612A) and the Caltech Chen Institute (Neuroscience Research Grant Award). Pietro Perona, Neehar Kondapaneni, and Markus Marks were supported by the Simons Foundation (NC-GB-CULM-00002953-02).

Funding

Open access funding provided by Swiss Federal Institute of Technology Zurich

Data Availability

Source code for all experiments is available online at https://github.com/manuelknott/ssl_eval_protocols. Model checkpoints and datasets come from various open-source projects and are linked in the GitHub repository as well as in Table S.5 in Supplementary Materials.

Footnotes

1

The term semi-supervised learning is commonly used to describe the “self-supervised pre-train, supervised fine-tune on a subset” paradigm (see, e.g., Chen et al., 2020a, b; Misra & Maaten, 2019). We adopt this usage but acknowledge that, originally, semi-supervised learning refers to methods that utilize labeled and unlabeled data simultaneously rather than sequentially.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Markus Marks and Manuel Knott contributed equally to this work.

Contributor Information

Markus Marks, Email: marks@caltech.edu.

Manuel Knott, Email: manuel.knott@alumni.ethz.ch.

References

  1. Asano, Y. M., Rupprecht, C., & Vedaldi, A. (2020). Self-labelling via simultaneous clustering and representation learning. In International conference on learning representations.
  2. Assent, I. (2012). Clustering high dimensional data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(4), 340–350.
  3. Baevski, A., Hsu, W. N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A general framework for self-supervised learning in speech, vision and language. In International conference on machine learning (ICML).
  4. Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., & Goldblum, M. (2023). A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210.
  5. Bao, H., Dong, L., & Wei, F. (2021). BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254.
  6. Cabannes, V., Kiani, B., Balestriero, R., LeCun, Y., & Bietti, A. (2023). The SSL interplay: Augmentations, inductive bias, and generalization. In International conference on machine learning (pp. 3252–3298). PMLR.
  7. Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In European conference on computer vision (ECCV).
  8. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.
  9. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 9650–9660).
  10. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. (2020a). Generative pretraining from pixels. In Proceedings of the 37th international conference on machine learning (Vol. 119, pp. 1691–1703). PMLR.
  11. Chen, R. J., Ding, T., Lu, M. Y., Williamson, D. F. K., Jaume, G., Chen, B., & Mahmood, F. (2023). A general-purpose self-supervised model for computational pathology. arXiv preprint arXiv:2308.15474.
  12. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020b). A simple framework for contrastive learning of visual representations. In Proceedings of the 37th international conference on machine learning (Vol. 119, pp. 1597–1607). PMLR.
  13. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. E. (2020). Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33, 22243–22255.
  14. Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., & Wang, J. (2024). Context autoencoder for self-supervised representation learning. International Journal of Computer Vision, 132(1), 208–223.
  15. Chen, X., Fan, H., Girshick, R., & He, K. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
  16. Chen, X., & He, K. (2020). Exploring simple siamese representation learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 15745–15753.
  17. Chen, X., Xie, S., & He, K. (2021). An empirical study of training self-supervised vision transformers. In 2021 IEEE/CVF international conference on computer vision (ICCV) (pp. 9620–9629).
  18. Cole, E., Yang, X., Wilber, K., Mac Aodha, O., & Belongie, S. (2022). When does contrastive visual representation learning work? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 14755–14764).
  19. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (Vol. 1, pp. 886–893).
  20. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 4171–4186).
  21. Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. IEEE International Conference on Computer Vision (ICCV), 2015, 1422–1430.
  22. Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, W., Yuan, L., & Guo, B. (2023). PeCo: Perceptual codebook for BERT pre-training of vision transformers. In Proceedings of the AAAI conference on artificial intelligence (Vol. 37, pp. 552–560).
  23. Ericsson, L., Gouk, H., & Hospedales, T. M. (2021). How well do self-supervised models transfer? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5414–5423).
  24. Everingham, M., Van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338. 10.1007/s11263-009-0275-4
  25. Fang, Y., Wang, W., Xie, B., Sun, Q. S., Wu, L. Y., Wang, X., & Cao, Y. (2022). EVA: Exploring the limits of masked visual representation learning at scale. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, 19358–19369.
  26. Feichtenhofer, C., Fan, H., Li, Y., & He, K. (2022). Masked autoencoders as spatiotemporal learners. Advances in Neural Information Processing Systems, 35, 35946–35958.
  27. Gansbeke, W. V., Vandenhende, S., Georgoulis, S., Proesmans, M., & Gool, L. V. (2020). SCAN: Learning to classify images without labels. In European conference on computer vision.
  28. Gidaris, S., Singh, P., & Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. In International conference on learning representations.
  29. Goldblum, M., Souri, H., Ni, R., Shu, M., Prabhu, V., Somepalli, G., & Goldstein, T. (2024). Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks. Advances in Neural Information Processing Systems, 36.
  30. Goyal, P., Caron, M., Lefaudeux, B., Xu, M., Wang, P., Pai, V., & Bojanowski, P. (2021). Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988.
  31. Griffin, G., Holub, A., & Perona, P. (2022). Caltech 256. CaltechDATA. Retrieved 2023-06-05, from https://data.caltech.edu/records/20087
  32. Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., & Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. In Proceedings of the 34th international conference on neural information processing systems.
  33. Gwilliam, M., & Shrivastava, A. (2022). Beyond supervised vs. unsupervised: Representative benchmarking and analysis of image representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9642–9652).
  34. HaoChen, J. Z., Wei, C., Kumar, A., & Ma, T. (2022). Beyond separability: Analyzing the linear transferability of contrastive representations to related subpopulations. arXiv preprint arXiv:2204.02683.
  35. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. B. (2021). Masked autoencoders are scalable vision learners. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, 15979–15988.
  36. He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. B. (2019). Momentum contrast for unsupervised visual representation learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, 9726–9735.
  37. Higgins, I., Chang, L., Langston, V., Hassabis, D., Summerfield, C., Tsao, D., & Botvinick, M. (2021). Unsupervised deep learning identifies semantic disentanglement in single inferotemporal face patch neurons. Nature Communications, 12(1), 6456.
  38. Hou, Z., Sun, F., Chen, Y. K., Xie, Y., & Kung, S. Y. (2022). MILAN: Masked image pretraining on language assisted representation. arXiv preprint arXiv:2208.06049.
  39. Ibrahim, M., Garrido, Q., Morcos, A., & Bouchacourt, D. (2022). The robustness limits of sota vision models to natural variation. arXiv preprint arXiv:2210.13604.
  40. Jing, L., & Tian, Y. (2019). Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 4037–4058.
  41. Kim, D., Wang, K., Sclaroff, S., & Saenko, K. (2022). A broad study of pre-training for domain generalization and adaptation. In European conference on computer vision (pp. 621–638).
  42. Kolesnikov, A., Zhai, X., & Beyer, L. (2019). Revisiting self-supervised visual representation learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, 1920–1929.
  43. Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. https://www.cs.utoronto.ca/~kriz/learning-features-2009-TR.pdf
  44. Kumar, A., Raghunathan, A., Jones, R., Ma, T., & Liang, P. (2022). Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054.
  45. Lee, J. H., Yoon, D., Ji, B., Kim, K., & Hwang, S. (2023). Rethinking evaluation protocols of visual representations learned via self-supervised learning. arXiv preprint arXiv:2304.03456.
  46. Li, X., Grandvalet, Y., Davoine, F., Cheng, J., Cui, Y., Zhang, H., & Yang, M. H. (2020). Transfer learning in computer vision tasks: Remember where you come from. Image and Vision Computing, 93, 103853. 10.1016/j.imavis.2019.103853
  47. Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., & Tang, J. (2021). Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1), 857–876.
  48. Liu, Y., Zhang, S., Chen, J., Chen, K., & Lin, D. (2023). PixMIM: Rethinking pixel reconstruction in masked image modeling. arXiv preprint arXiv:2303.02416.
  49. Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., & Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th international conference on machine learning (ICML) (pp. 4114–4124). PMLR.
  50. Miller, J. P., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., & Schmidt, L. (2021). Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization. In International conference on machine learning (pp. 7721–7735). PMLR.
  51. Misra, I., & Maaten, L. V. D. (2019). Self-supervised learning of pretext-invariant representations. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, 6706–6716.
  52. MMSelfSup Contributors. (2021). MMSelfSup: OpenMMLab self-supervised learning toolbox and benchmark. https://github.com/open-mmlab/mmselfsup
  53. Musgrave, K., Belongie, S., & Lim, S. N. (2020). A metric learning reality check. In European conference on computer vision (ECCV) (pp. 681–699).
  54. Newell, A., & Deng, J. (2020). How useful is self-supervised pretraining for visual tasks? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).
  55. Noroozi, M., & Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision (ECCV).
  56. Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  57. Pandarinath, C., O’Shea, D. J., Collins, J., Jozefowicz, R., Stavisky, S. D., Kao, J. C., & Sussillo, D. (2018). Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10), 805–815.
  58. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (NeurIPS), 32, 8026–8037.
  59. Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, 2536–2544.
  60. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., & Dubourg, V. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  61. Peng, Z., Dong, L., Bao, H., Ye, Q., & Wei, F. (2022). BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366.
  62. Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., & Huang, X. (2020). Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10), 1872–1897.
  63. Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do imagenet classifiers generalize to imagenet? In International conference on machine learning (pp. 5389–5400). PMLR.
  64. Rusak, E., Schneider, S., Gehler, P. V., Bringmann, O., Brendel, W., & Bethge, M. (2022). ImageNet-D: A new challenging robustness dataset inspired by domain adaptation. ICML 2022 Shift Happens Workshop.
  65. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252. 10.1007/s11263-015-0816-y
  66. Shi, Y., Daunhawer, I., Vogt, J. E., Torr, P., & Sanyal, A. (2022). How robust are pre-trained models to distribution shift? ICML 2022: Workshop on Spurious Correlations, Invariance, and Stability.
  67. Sun, J. J., Marks, M., Ulmer, A., Chakraborty, D., Geuther, B., Hayes, E., & Kennedy, A. (2023). MABe22: A multi-species multi-task benchmark for learned representations of behavior. In International conference on machine learning (ICML).
  68. Van Horn, G., Cole, E., Beery, S., Wilber, K., Belongie, S., & Mac Aodha, O. (2021). Benchmarking representation learning for natural world image collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12884–12893).
  69. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (NeurIPS) (pp. 6000–6010).
  70. Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology. (CNS-TR-2011-001)
  71. Wang, X., Zhang, R., Shen, C., Kong, T., & Li, L. (2020). Dense contrastive learning for self-supervised visual pre-training. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 3023–3032.
  72. Wang, Y., Albrecht, C. M., Braham, N., Mou, L., & Zhu, X. X. (2022). Self-supervised learning in remote sensing: A review. IEEE Geoscience and Remote Sensing Magazine, 10(4), 213–247.
  73. Wei, C., Fan, H., Xie, S., Wu, C. Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 14648–14658). 10.1109/CVPR52688.2022.01426
  74. Wightman, R. (2019). PyTorch image models. GitHub. https://github.com/rwightman/pytorch-image-models
  75. Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 3733–3742.
  76. Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., & Hu, H. (2021). SimMIM: A simple framework for masked image modeling. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR).
  77. Yan, X., Misra, I., Gupta, A. K., Ghadiyaram, D., & Mahajan, D. K. (2019). ClusterFit: Improving generalization of visual representations. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, 6508–6517.
  78. Yang, L., Zhang, S., Qin, L., Li, Y., Wang, Y., Liu, H., & Zhang, Y. (2022). GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective. arXiv preprint arXiv:2211.08073.
  79. Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (Vol. 27).
  80. Yu, J., Li, X., Koh, J. Y., Zhang, H., Pang, R., Qin, J., & Wu, Y. (2021). Vector-quantized image modeling with improved VQGAN. arXiv preprint arXiv:2110.04627.
  81. Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., & Lu, J. (2022). Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 19313–19322).
  82. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning.
  83. Zhang, R., Isola, P., & Efros, A. A. (2016). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, 645–654.
  84. Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., & Kong, T. (2021). iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.
  85. Zhou, Y., Chia, M. A., Wagner, S. K., Ayhan, M. S., Williamson, D. J., Struyven, R. R., & Keane, P. A. (2023). A foundation model for generalizable disease detection from retinal images. Nature, 622(7981), 156–163.



Articles from International Journal of Computer Vision are provided here courtesy of Springer
