Abstract
Transferring ImageNet pre-trained weights to various remote sensing tasks has produced acceptable results and reduced the need for labeled samples. However, the domain differences between ground imagery and remote sensing images limit the performance of such transfer learning. Annotating remote sensing images is notoriously difficult, as it requires domain experts and considerable time, whereas unlabeled data is readily available. Recently, self-supervised learning, a subset of unsupervised learning, has emerged and significantly improved representation learning. Recent research has demonstrated that self-supervised learning methods capture visual features that are more discriminative and transferable than supervised ImageNet weights. Motivated by these facts, we pre-train in-domain representations of remote sensing imagery using contrastive self-supervised learning and transfer the learned features to other related remote sensing datasets. Specifically, we used the SimSiam algorithm to pre-train in-domain knowledge of remote sensing datasets and then transferred the obtained weights to other scene classification datasets, obtaining state-of-the-art results on five land cover classification datasets with varying numbers of classes and spatial resolutions. In addition, through experiments that pre-train features on datasets with different attributes, we identified the most influential factors that make a dataset a good choice for obtaining in-domain features. We transferred the features obtained by pre-training SimSiam on remote sensing datasets to various downstream tasks and used them as initial weights for fine-tuning. Moreover, we linearly evaluated the obtained representations in cases where the number of samples per class is limited. Our experiments demonstrate that using a higher-resolution dataset during the self-supervised pre-training stage results in learning more discriminative and general representations.
Keywords: Transfer learning, Deep learning, Remote sensing, Self-supervised learning, Representation learning, Scene classification
1. Introduction
Remote sensing images are acquired via imaging satellites, airplanes, etc. [1]. These platforms can monitor various aspects of the earth's surface. Unlike natural images, which are captured using digital cameras and often contain a limited number of objects, remote sensing images can cover vast geographical areas and contain numerous objects of varying dimensions and sizes. Remote sensing images, in contrast to ground images, are not object-centric. They can therefore be used for various applications, including land cover classification, road network extraction, and disaster prevention and monitoring [2,3]. This volume of data can only be processed by artificial intelligence and machine learning systems. Fortunately, recent advances in computer vision have made it easier to process and analyze visual data [4]. With the recent advancements in deep learning for computer vision, supervised learning approaches for land cover classification in remote sensing images have performed exceptionally well. However, the main drawback of supervised learning is that it needs a tremendous number of labeled samples. Providing this volume of labeled remote sensing images is very costly and time-consuming; in addition, it requires experts to annotate data points carefully. When obtaining large labeled datasets is prohibitive, the usual solution is to transfer weights learned on the ImageNet dataset to these tasks [1,5–13]. While this transfer learning has produced acceptable results for remote sensing tasks, it has the following drawbacks:
1. If there are significant domain differences between the remote sensing and ImageNet datasets, this type of transfer learning fails; as a result, it can perform poorly in some cases, such as hyperspectral and multispectral images.
2. Transferring ImageNet pre-trained weights directly to non-RGB remote sensing datasets is impossible [3,14–19].
Domain differences between natural and remote sensing images have stimulated researchers to find alternative solutions. To do so, some researchers used supervised or unsupervised methods to pre-train models on remote sensing datasets; the learned weights are then transferred to other remote sensing tasks [5,19]. However, the disadvantage of supervised pre-training is that it requires a large number of in-domain labeled samples to learn general representations from remote sensing images. Self-supervised learning has emerged to overcome all of the previously mentioned drawbacks. It aims to learn effective representations of the input data without relying on human-provided labels. Recent advances in self-supervised learning have demonstrated that its pre-trained features transfer better than those of ImageNet models. This branch of artificial intelligence is advantageous when acquiring labeled data is time-consuming and expensive, as with medical and satellite images [14]. Additionally, these methods are more practical in the real world because different sensors generate millions or even billions of data samples, and labeling them all is infeasible. Recently, contrastive self-supervised learning [20] has outperformed other feature learning methods and significantly narrowed the gap between supervised and unsupervised approaches to representation learning [21]. Currently, the most effective contrastive self-supervised algorithms [22] employ data augmentation techniques to generate positive samples. In other words, they use data augmentation techniques such as image cropping, rotation, and so on to create multiple views of the same image, and the objective function tries to bring positive samples as close to each other as possible in the feature space. In most of these methods, positive and negative pairs compete with each other [23]. Since these methods do not require labeled data, we can use a large amount of unlabeled data to learn features in an unsupervised way and then transfer the weights to other remote sensing tasks.
Selecting an appropriate dataset for visual representation learning from remote sensing images, whether supervised or self-supervised, is one of the factors influencing how generalizable the learned features are, which in turn strongly affects the performance of the final model on downstream tasks. In recent works, such as [6,17], researchers determined the factors that influence dataset choice for pre-training features in a supervised manner from satellite images. Unlike supervised learning, which uses accurate human-provided labels as supervisory signals, self-supervised learning methods extract supervisory signals from the data itself. This difference makes it necessary to carefully investigate the vital factors that make a dataset an ideal option for self-supervised pre-training in remote sensing. One of our goals is to investigate the effect of the dataset selected for pre-training visual features from satellite images using the SimSiam algorithm. To achieve this goal, in the pre-training stage, we used datasets with different characteristics in terms of the number of samples, spatial resolution, and number of classes. Our other goal is to investigate the generalizability of features learned with SimSiam for land cover classification. We examined the transferability of the in-domain pre-trained weights by conducting extensive experiments. In the SimSiam algorithm, we used ResNet50 with ImageNet weights as the backbone. In this setting, we pre-trained features in a self-supervised manner on the MLRSNet, PatternNet, and Resisc45 datasets. Finally, we fine-tuned the obtained models on the target datasets under different conditions, such as fine-tuning all layers and linear evaluation with a limited number of samples. The results demonstrate that by selecting a suitable medium-sized remote sensing dataset, we can pre-train features that produce the best results for various land cover classification tasks. Our main contributions are as follows:
1. We investigated the generalization of the SimSiam algorithm for learning visual representations in remote sensing images by conducting detailed and exhaustive experiments on six land cover classification datasets with different characteristics.
2. While pre-training in-domain representations with the SimSiam algorithm, we used ImageNet weights as initial weights to reduce the need for training data.
3. Through detailed experiments, we discuss the factors that make a dataset a good reference for self-supervised pre-training of features. The results demonstrate that the pre-training dataset should have a high spatial resolution.
The remainder of this paper is organized as follows:
Section 2 reviews the related works. Section 3 explores the SimSiam algorithm used for pre-training in-domain features from remote sensing images. Section 4 presents the statistics of the selected datasets for each step. Section 5 describes the downstream tasks and presents the results. Finally, Section 6 concludes the paper.
2. Related works
2.1. Visual representation learning in remote sensing imagery
Where large labeled datasets are unavailable, the general solution is to use pre-trained models on large-scale datasets such as ImageNet. These models can be used to extract features from new datasets or as a starting point for fine-tuning on other tasks and datasets. However, this type of transfer learning is only directly applicable to RGB remote sensing datasets [1,5–12]. For pre-training general in-domain representations for overhead imagery, either supervised or unsupervised methods can be used. Ref. [5] is an example of a supervised approach: it describes the initial steps for supervised in-domain visual representation learning from remote sensing images and evaluates the learned features by fine-tuning on land cover classification datasets. In most cases, it demonstrated that in-domain features learned from remote sensing datasets perform better than their ImageNet counterparts. Additionally, for supervised learning, it investigated the characteristics that make a dataset a good reference for learning visual representations; features learned from multi-resolution datasets demonstrated higher generalization and better performance. The researchers in Ref. [26] combined the Resisc45, PatternNet, and RSI-CB datasets, trained a model on them, and then fine-tuned it on the UC-Merced dataset; compared to the ImageNet model, this model is more accurate. In comparison, the analysis in Ref. [27] demonstrated that models pre-trained on ImageNet perform better than models pre-trained on PatternNet when transferred to the target AID [28] and UCM [29] datasets. The studies [5,6] are very similar and examined the performance of ImageNet pre-trained models against models pre-trained on in-domain datasets. The experiments in Ref. [6] used two high-resolution and three medium-resolution datasets; the results indicated that fine-tuning in-domain pre-trained weights on remote sensing datasets performs better than fine-tuning ImageNet weights. However, that work used only two high-resolution datasets for pre-training in-domain features, and the factors that influence learning highly generalizable representations from remote sensing datasets are not yet fully determined. Additionally, the mentioned works examined the effect of the pre-training dataset for supervised representation learning. In contrast, by conducting detailed experiments on the PatternNet and Resisc45 datasets, we examine the impact of the pre-training dataset for the SimSiam algorithm, a contrastive self-supervised learning method.
2.2. A brief overview of self-supervised learning methods
Self-supervised learning is a highly practical subset of unsupervised learning that aims to learn general visual features from images without the use of human labels. In general, self-supervised learning methods consist of two steps. In the first step, a pretext task is designed, and by solving this proxy task, visual representations are learned. The second step is to transfer the pre-trained features from the previous step to other downstream tasks. The resulting model from the first step can be used as a starting point for further fine-tuning or feature extraction. Such techniques will be advantageous for difficult-to-label aerial and medical images. In the following section, we have classified the self-supervised visual representation learning methods into three groups and discussed each one.
2.2.1. Representation learning by solving pretext tasks
Researchers in computer vision have defined many pretext tasks to date. We refer readers to Ref. [30] for a review of self-supervised learning methods based on pretext tasks. The authors have classified all pretext tasks into four categories:
1. Generative-based methods, such as image colorization, super-resolution, etc.
2. Context-based tasks, such as image jigsaw puzzles, geometric transformations, etc.
3. Free semantic label-based tasks, such as contour detection, depth estimation, etc.
4. Cross-modal-based methods, such as optical flow estimation, visual-audio correspondence, etc.
Moreover, the researchers in Refs. [31,32] combined several pretext tasks to capture highly generalizable features. These studies demonstrated that different pretext tasks are complementary and that combining them yields more generalizable features. Nevertheless, the visual representations learned by solving most pretext tasks show limited generalization and performance on downstream tasks compared to ImageNet pre-trained models.
2.2.2. Clustering
Clustering-based methods are another type of unsupervised method for learning visual representations. They alternate between clustering the representations and learning to predict the cluster assignments. Instead of directly comparing features as in contrastive learning, SwAV [21] clusters the data while enforcing consistency between the cluster assignments produced for different augmentations (views) of the same image. Researchers in Ref. [33] demonstrated that k-means assignments can be used as pseudo-labels to learn visual representations. In Ref. [34], the pseudo-label assignment problem is cast as an instance of the optimal transport problem. Although clustering-based methods have been very effective for learning visual representations, the need to alternate between clustering and feature learning makes them computationally demanding.
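To make the clustering-based recipe concrete, the following minimal PyTorch/scikit-learn sketch (our illustration, not code from Refs. [21,33,34]) shows how k-means assignments over frozen encoder features can serve as pseudo-labels; the feature matrix and cluster count are assumptions.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_pseudo_labels(features: torch.Tensor, n_clusters: int = 100) -> torch.Tensor:
    """Cluster encoder outputs and return cluster indices as pseudo-labels.

    features: (N, D) tensor of encoder features for the whole dataset (assumed).
    """
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        features.cpu().numpy()
    )
    return torch.as_tensor(assignments, dtype=torch.long)

# DeepCluster-style training alternates: re-extract features, re-cluster,
# then train the network to predict the refreshed pseudo-labels. This
# alternation is what makes clustering-based methods computationally costly.
```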
2.2.3. Contrastive self-supervised learning
Most contrastive self-supervised learning methods have emerged in the last two years. These methods have narrowed the gap between supervised and unsupervised feature learning. We refer readers to Refs. [35,36] for more on contrastive self-supervised learning algorithms. The main idea of contrastive learning is to bring pairs of positive samples closer together and push pairs of negative samples further apart in the feature space. In practice, the performance of contrastive learning methods depends heavily on a large number of negative samples [35], which must therefore be provided. For instance, the PIRL [37] algorithm stores negative samples in a memory bank, while the MoCo [38] algorithm maintains a queue of negative samples encoded by a momentum encoder. In contrast, the SimCLR [22] algorithm generates negative examples with large batch sizes, necessitating significant computational resources. Unlike these methods, the SimSiam algorithm needs neither a memory bank nor a large batch size and therefore requires fewer computational resources.
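For reference, the core contrastive objective can be written in a few lines. The sketch below is a generic InfoNCE-style loss in PyTorch (our illustration; not the exact formulation of SimCLR, MoCo, or PIRL), in which every other sample in the batch acts as a negative; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (B, D) embeddings of two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                   # (B, B) pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)               # pull positives, push negatives
```

The dependence on the off-diagonal (negative) terms is exactly why such methods benefit from many negatives, and hence from memory banks, queues, or large batches.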
2.3. Self-supervised learning in remote sensing
Recently, some researchers have attempted to apply self-supervised learning algorithms and concepts to remote sensing images. In Ref. [16], multi-scale spatial features of high-resolution remote sensing images are captured by multiple-layer feature-matching generative adversarial networks (MARTA GANs) and used for solving land cover classification tasks. In another work, a pretext task is defined that predicts the RGB channels from the high-frequency channels [18]. Additionally, Ref. [15] employs image colorization, relative position prediction, and instance discrimination as pretext tasks for learning in-domain representations from remote sensing images; the learned features are evaluated by transfer to other land cover classification datasets with very few labeled samples. In Ref. [39], the MoCov2 algorithm has been modified by introducing a geography-aware cost function to learn the visual features of remote sensing images. Rather than using regular data augmentation techniques to generate positive samples, that work utilized geographic information about locations that satellites frequently pass over. Researchers in Ref. [14] demonstrated that hierarchical pre-training, first on natural images and then on remote sensing images, improves accuracy on downstream tasks. In Ref. [40], the effect of different data augmentation methods on contrastive self-supervised learning over remote sensing datasets has been studied. In Ref. [41], a self-supervised learning approach for pre-training weights from remote sensing images is proposed that exploits the correspondence between remote sensing images and geo-tagged audio recordings; it pre-trains features using both the image information and the corresponding audio of each image.
3. An overview of the SimSiam algorithm
The method we used in this paper consists of two steps:
1. Self-supervised pre-training using the SimSiam algorithm.
2. Transferring the pre-trained weights to downstream tasks and evaluating the generalization of the features by fine-tuning on various land cover classification datasets.
The features learned by most contrastive learning algorithms are highly generalizable. However, the main drawback of these methods is their high computational requirements, which stem from three main items: (1) negative samples, (2) large batch sizes, and (3) a momentum encoder. Unlike other contrastive learning methods, the SimSiam algorithm requires none of these three items, making it significantly more resource-efficient. Therefore, we employ this algorithm to learn the visual features of remote sensing images [24]. The schematic of this algorithm is shown in Fig. 1:
Fig. 1.
SimSiam architecture.
After applying different data augmentation techniques to an image $x$, two distinct views, $x_1$ and $x_2$, are generated. Both views enter the two sides of the Siamese architecture and are processed by the encoder $f$, which consists of a ResNet50 backbone and an MLP projection head that extracts features from the input image. Each image is thus mapped from pixel space to a lower-dimensional feature space. The projection head in the encoder $f$ is a three-layer MLP with batch normalization applied to each fully connected layer, including the output layer. Following the encoder $f$, only the top side of the architecture has a prediction head. The prediction head $h$ transforms the encoder output of one view and matches its dimensions to the encoder output on the bottom side of the architecture. This MLP $h$ is composed of two fully connected layers, with batch normalization applied to the hidden layer. The trained weights are shared between both sides of the model. It has been demonstrated that naively copying weights across both sides of a Siamese architecture produces poor results, and a momentum update was proposed to avoid this issue [38]; the disadvantage of that solution is that it requires substantial computational resources. To avoid this requirement, SimSiam employs a stop-gradient operator on one side of the architecture, as the BYOL [23] algorithm does. When the stop-gradient operator is applied to one side of the network, the gradients on that side are not propagated back. The cost function is simple and is defined in terms of the cosine similarity of two vectors. Denoting $p_1 \triangleq h(f(x_1))$ and $z_2 \triangleq f(x_2)$, where $h(\cdot)$ is the prediction head and $f(\cdot)$ is the encoder applied to both views, the objective function is the negative cosine similarity:

$$\mathcal{D}(p_1, z_2) = -\frac{p_1}{\lVert p_1 \rVert_2} \cdot \frac{z_2}{\lVert z_2 \rVert_2}$$

where $\lVert \cdot \rVert_2$ denotes the $\ell_2$-norm. The total cost function is symmetric and is defined as follows:

$$\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, z_2) + \frac{1}{2}\mathcal{D}(p_2, z_1)$$

The cost is averaged over all images in the batch and taken as the total loss. The stop-gradient operator is the critical component that makes this algorithm work well; it is applied to the encoder features of each view, so the final cost function is:

$$\mathcal{L} = \frac{1}{2}\mathcal{D}\big(p_1, \text{stopgrad}(z_2)\big) + \frac{1}{2}\mathcal{D}\big(p_2, \text{stopgrad}(z_1)\big)$$

This formulation is perfectly symmetric, and gradients flow only through the branch whose view passes through the prediction head on the top side of the architecture [24].
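Following the pseudocode in [24], the symmetric loss with stop-gradient can be sketched in PyTorch as follows; `f` and `h` denote the encoder and prediction head defined above, and `detach()` implements the stop-gradient operator.

```python
import torch.nn.functional as F

def D(p, z):
    # Negative cosine similarity; z.detach() applies the stop-gradient,
    # so no gradient flows through the encoder on that branch.
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)   # projections of the two augmented views
    p1, p2 = h(z1), h(z2)   # predictions from the top-side head
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```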
4. Datasets
We conducted our experiments using two sets of remote sensing datasets. The first set contains the datasets selected for self-supervised pre-training with the SimSiam algorithm, while the second contains the datasets used to evaluate the learned features. During the self-supervised pre-training phase, we used all of the images in the pre-training datasets. During the transfer learning phase, when solving downstream tasks, we used 60 % of the data for training, 20 % for validation, and 20 % for testing, as in Ref. [17].
4.1. Self-supervised pre-training datasets
We used MLRSNet, NWPU-RESISC45, and PatternNet to pre-train general representations using SimSiam. These datasets have different characteristics, of which the number of classes, number of samples, and spatial resolution are the most noticeable differences. This variety of attributes helps us identify the vital factors that make a dataset a good choice for acquiring general representations.
MLRSNet: MLRSNet is a multi-label high spatial resolution remote sensing dataset for semantic scene understanding. It contains 109,161 remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1,500 to 3,000. The images have a fixed size of 256 × 256 pixels with various pixel resolutions (∼0.1m–10m). Moreover, the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label-based image classification, multi-label-based image retrieval, and image segmentation.
NWPU-RESISC45 [10]: This dataset contains 31.5k images classified into 45 classes. Its images have a high spatial resolution, ranging from 0.2m to 30m per pixel for most samples.
PatternNet [25]: PatternNet has a higher spatial resolution than Resisc45 and consists of 38 classes with 800 images per class, for a total of 30.4k samples. The images are 256 × 256 pixels, with spatial resolutions between 0.06m and 4.96m.
4.2. Downstream datasets
In addition to Resisc45 and PatternNet, we evaluated the pre-trained representations using three additional datasets with the following characteristics:
AID [28]: The dataset contains 10,000 RGB images with a resolution of 600 × 600 pixels divided into 30 classes. The spatial resolution of images is about 0.5m–8m.
EuroSAT [9]: The dataset contains 27,000 images of 64 × 64 pixels classified into ten classes. It comes in two versions: one with 13 spectral channels and one with three RGB channels. The spatial resolution of each image is about 10m–30m, i.e., relatively low. We conducted our experiments using the three-channel version.
UC_Merced [29]: The dataset has 2,100 images divided into 21 classes, with a spatial resolution of 0.3m and an image size of 256 × 256.
Table 1 summarizes the characteristics of the datasets used in this article.
5. Experiments
Our primary objective is to obtain meaningful representations from aerial imagery in an unsupervised manner and use them to tackle the domain difference issue. As a result, we use the obtained weights either as initial weights or feature extractors.
As illustrated in Fig. 2, our proposed method comprises three distinct stages. The first stage is the selection of a robust general model, specifically ResNet50 trained on the widely used ImageNet dataset. The second stage pre-trains the model, by either supervised or self-supervised learning, on the PatternNet, Resisc45, and MLRSNet datasets, with the ImageNet weights as the starting point. This round of pre-training overcomes the domain difference between the source and target datasets. Finally, the pre-trained ResNet50 models from the preceding stages are utilized as either the initial model or a feature extractor to address scene classification tasks in various scenarios.
Fig. 2.
Overview of the proposed method. By implementing two rounds of transfer learning, we have achieved state-of-the-art results across all downstream tasks.
Some of our experiments are inspired by Ref. [17]; however, that work considered a supervised approach to learning visual representations. Additionally, through extensive experiments, we examine the effect of the pre-training dataset using the SimSiam algorithm. We conducted our experiments using the PyTorch and PyTorch Lightning [42] frameworks on an Ubuntu system equipped with a Quadro P6000 GPU. We repeated each experiment five times and report the average results (see Table 1).
Table 1.
General characteristics of the selected datasets.
| Dataset | Image size | Images | Classes | Resolution (m) |
|---|---|---|---|---|
| MLRSNet | 256 × 256 | 109.16k | 46 | 0.1–10 |
| Resisc45 | 256 × 256 | 31.5k | 45 | 0.2–30 |
| PatternNet | 256 × 256 | 30.4k | 38 | 0.06–4.96 |
| AID | 600 × 600 | 10k | 30 | 0.5–8 |
| UCM | 256 × 256 | 2.1k | 21 | 0.3 |
| EuroSAT | 64 × 64 | 27k | 10 | 10–30 |
5.1. Supervised pre-training of in-domain visual representations
To obtain supervised in-domain visual representations, we independently trained the ResNet50 model on 90 % of the samples from the Resisc45, PatternNet, and MLRSNet datasets and evaluated the learned features on the remaining 10 %. The objective of this experiment is not to solve classification problems on the mentioned datasets; we use the held-out 10 % only to verify that pre-training is heading in the right direction. During training, the batch size is set to 120 and models are trained for 100 epochs. Additionally, we use OneCycleLR as the learning rate scheduler and Adam as the optimizer. We determined the remaining hyper-parameters for each dataset experimentally, so that the model achieves high accuracy on the evaluation samples. Table 2 presents the obtained results:
Table 2.
Accuracy (%) of ResNet50 model on validation set (10 % of each dataset).
| Dataset | Global Accuracy (%) |
|---|---|
| PatternNet | 99.97 |
| Resisc45 | 97.20 |
| MLRSNet | 97.85 |
This experiment produces three pre-trained models that are trained using PatternNet, Resisc45, and MLRSNet datasets in a supervised manner. These models are called Sup-PatternNet, Sup-Resisc45 and Sup-MLRSNet, respectively.
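For concreteness, a minimal sketch of this supervised pre-training setup (batch size 120, 100 epochs, Adam with OneCycleLR, as described above) is given below; the learning rates and the `train_loader` are assumptions, since the remaining hyper-parameters were tuned per dataset.

```python
import torch
from torchvision import models

def build_supervised_pretrainer(train_loader, num_classes: int = 45):
    """Set up ResNet50 + Adam + OneCycleLR as in Section 5.1 (sketch).

    train_loader covers the 90 % pre-training split (assumed to exist);
    num_classes matches the pre-training dataset (e.g. 45 for Resisc45).
    """
    # ResNet50 initialized with ImageNet weights; the head is replaced to
    # match the pre-training dataset.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed base lr
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=1e-3,  # assumed; the paper tunes this per dataset
        epochs=100,
        steps_per_epoch=len(train_loader),
    )
    return model, optimizer, scheduler
```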
In a subsequent experiment, we fine-tuned the resulting models on three downstream tasks; the results are presented in Table 3. Based on these data, fine-tuning the ResNet50 model pre-trained on the PatternNet dataset yields superior accuracy across all downstream datasets, even though PatternNet has significantly fewer samples than the MLRSNet dataset.
5.2. Self-supervised pre-training using SimSiam
In the first phase, we performed in-domain self-supervised pre-training with the SimSiam algorithm on all instances of the MLRSNet, Resisc45, and PatternNet datasets. As previously described, the SimSiam algorithm utilizes an encoder (f) that consists of a backbone and a projection head. In our experiments, we used ResNet50 as the backbone and applied slight changes to the number of neurons in the projection and prediction heads. The projection head is a three-layer MLP with 1024 and 512 neurons in its hidden layers. The predictor module (h), which follows the encoder, is a two-layer MLP with 256 neurons in its hidden layer. We trained SimSiam on MLRSNet, Resisc45, and PatternNet for 100k iterations. We used the SGD optimizer with a batch size of 128 and a base learning rate of 0.05, together with the MultiStepLR scheduler. We set the weight decay and SGD momentum to 10e-5 and 0.9, respectively. During self-supervised pre-training, we either trained from scratch or fine-tuned from ImageNet weights. According to the results of [14], using ImageNet weights during pre-training leads to higher accuracy on downstream tasks and decreases the time required for convergence. From this experiment, we obtained six distinct models pre-trained on datasets with different characteristics, named Scratch-PatternNet, Scratch-Resisc45, Scratch-MLRSNet, Sim-PatternNet, Sim-Resisc45, and Sim-MLRSNet, respectively.
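The encoder and heads described above can be sketched as follows; the projector's output dimension is not reported in the paper, so the 2048 used here (as in [24]) is our assumption.

```python
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
feat_dim = backbone.fc.in_features  # 2048 for ResNet50
backbone.fc = nn.Identity()         # the backbone outputs raw features

# Three-layer projection MLP; hidden sizes 1024 and 512 as stated above,
# with batch normalization on every layer, including the output.
projector = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
    nn.Linear(512, 2048), nn.BatchNorm1d(2048),  # output width assumed
)

# Two-layer prediction MLP with a 256-unit batch-normalized hidden layer.
predictor = nn.Sequential(
    nn.Linear(2048, 256), nn.BatchNorm1d(256), nn.ReLU(inplace=True),
    nn.Linear(256, 2048),
)

encoder = nn.Sequential(backbone, projector)  # f in the SimSiam notation
```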
5.3. Transfer learning to downstream tasks
In this experiment, we fine-tuned the resulting models on five remote sensing datasets with different characteristics and reported the global accuracy for each dataset to evaluate the quality of the pre-trained representations. We used 60 % of each dataset as a training set, 20 % as a validation set, and the remaining 20 % as a test set. We also used the Adam optimizer with a batch size of 64 and the ReduceLROnPlateau scheduler, and fine-tuned all models for 100 epochs.
Our data augmentation pipelines are as follows:
We first resize all images to 256 × 256 pixels for all downstream datasets except EuroSAT and then apply random horizontal or vertical flips. We crop 224 × 224 pixels from the center of the resulting image. Finally, each dataset is normalized using the mean and standard deviation of the pixel intensities. We repeated each experiment five times and reported the average results (see Table 3).
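Expressed with torchvision transforms, this pipeline looks like the sketch below; the normalization statistics are placeholders, since the paper computes them per dataset.

```python
from torchvision import transforms

dataset_mean = [0.5, 0.5, 0.5]  # placeholder; per-dataset statistics in practice
dataset_std = [0.5, 0.5, 0.5]   # placeholder

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),   # skipped for EuroSAT (64 x 64 images)
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=dataset_mean, std=dataset_std),
])
```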
Table 3.
Accuracy over different supervised pre-training methods. (Averaged over 5 runs). PD and DT stand for pretraining dataset and downstream task, respectively.
| PD ↓ \ DT → | AID | EuroSAT | UCM |
|---|---|---|---|
| Sup-Resisc45 | 95.20 | 97.31 | 97.83 |
| Sup-MLRSNet | 97.18 | 98.85 | 99.20 |
| Sup-PatternNet | 97.53 | 98.93 | 99.45 |
In Table 4, we compare our best results, obtained by self-supervised pre-training on the PatternNet dataset, to those reported in Ref. [17].
Table 4.
Accuracy over different training methods. Our best results were obtained by fine-tuning in-domain representations captured by the SimSiam algorithm on PatternNet dataset (Averaged over 5 runs).
In [17], the ResNet50 model is pre-trained in a supervised manner on various remote sensing datasets and then fine-tuned on the Resisc45, UCM, EuroSAT, and other datasets. As shown in Table 4, the model self-supervised pre-trained on the high-resolution PatternNet dataset outperforms the ImageNet pre-trained model and the other in-domain supervised models.
In Table 5, we have compared our best results obtained by Sim-PatternNet to some of the best available models. The results indicate that self-supervised pre-training using the SimSiam algorithm produced the best results across different land cover classification datasets.
Table 5.
Comparison of results on selected remote sensing datasets. Our best results were obtained by fine-tuning in-domain representations captured by the SimSiam algorithm on PatternNet dataset (Averaged over 5 runs).
| Dataset | Reference | Description | Acc (%) |
|---|---|---|---|
| AID | [15] | Unsupervised | 78.83 |
| | [41] | Unsupervised | 84.44 |
| | [9] | Supervised | 94.38 |
| | [19] | Supervised | 95.58 |
| | [6] | Supervised | 97.30 |
| | Sim-PatternNet | Unsupervised | 97.83 |
| EuroSAT | [15] | Unsupervised | 76.37 |
| | [5] | Supervised | 99.20 |
| | [43] | Unsupervised | 98.91 |
| | [9] | Supervised | 98.57 |
| | Sim-PatternNet | Unsupervised | 99.26 |
| UCM | [41] | Unsupervised | 89.71 |
| | [9] | Supervised | 96.42 |
| | [5] | Supervised | 99.61 |
| | [44] | Supervised | 92.40 |
| | [45] | Supervised | 97.10 |
| | [12] | Supervised | 99.41 |
| | [7] | Supervised | 98.50 |
| | Sim-PatternNet | Unsupervised | 99.90 |
| PatternNet | [25] | Supervised | 96.65 |
| | [6] | Supervised | 99.84 |
| | Sim-PatternNet | Unsupervised | 99.90 |
| Resisc45 | [41] | Unsupervised | 84.88 |
| | [43] | Unsupervised | 96.28 |
| | [5] | Supervised | 96.83 |
| | [6] | Supervised | 97.03 |
| | Sim-PatternNet | Unsupervised | 97.20 |
5.4. Choosing the appropriate dataset for self-supervised pre-training
A. Fine-tuning all layers
In this section, by conducting detailed experiments, we have examined the effect of the pre-training dataset on the final accuracy of downstream tasks to determine the effective characteristics for selecting the pre-training dataset using the SimSiam algorithm. For this purpose, we have used MLRSNet, Resisc45, and PatternNet for pre-training using the SimSiam. These datasets have different attributes in the number of samples, class diversity, and spatial resolutions. The similarity of the pre-training or source dataset to the downstream or target dataset is a critical factor affecting the accuracy of land cover classification tasks, as has been discussed for the supervised approach [17]. However, for representation learning from remote sensing images using a contrastive self-supervised learning approach, potentially influential factors must be examined through deliberate experiments.
Class similarity is a proxy for the similarity of the source and target datasets. We calculated the similarities by comparing the number of identical classes in the pre-training and downstream datasets. Table 6 indicates that the downstream datasets used in our experiments are, on average, more similar to Resisc45 than to PatternNet and MLRSNet. Another factor that promotes the learning of general features, and hence high performance on target datasets, is the class diversity of the pre-training dataset [17]: the higher the class diversity of the source dataset, the better the pre-trained features generalize to target tasks. Although Resisc45 has higher class diversity and more similarity to the target datasets, the features pre-trained on PatternNet generalize better across all downstream tasks. A vital factor that comes into view is the spatial resolution of the source datasets: 0.06m–4.96m for PatternNet, 0.1m–10m for MLRSNet, and 0.2m–30m for Resisc45. The spatial resolution of PatternNet is therefore higher than that of Resisc45 and MLRSNet. High spatial resolution makes the edges of objects in remote sensing images sharper, and because self-supervised learning methods derive their supervisory signals from the data itself, these sharper edges accentuate the differences between objects in the images. As a result, the SimSiam model can better learn to distinguish the objects in the dataset. We conclude that factors such as class similarity, class diversity, and the number of samples matter for learning general features with SimSiam chiefly once the pre-training dataset has a high spatial resolution. MLRSNet is much larger than PatternNet, yet the features pre-trained on PatternNet generalize better. In other words, although PatternNet's class diversity, similarity to the target datasets, and number of samples are lower than those of Resisc45 and MLRSNet, these factors are still adequate for PatternNet. Together, these properties make PatternNet an appropriate source for pre-training visual features using the SimSiam algorithm.
Table 6.
Class similarity of MLRSNet, Resisc45 and PatternNet to the downstream tasks. PD and DT stand for pretraining dataset and downstream task, respectively.
| PD ↓ \ DT → | AID | EuroSAT | UCM |
|---|---|---|---|
| MLRSNet | 66.6 % | 40 % | 76.1 % |
| PatternNet | 30 % | 30 % | 85.71 % |
| Resisc45 | 60 % | 40 % | 90.47 % |
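Our reading of this proxy can be expressed as a short computation: the fraction of a downstream dataset's classes that also occur in the pre-training dataset. Matching classes by name is an assumption, as the paper does not detail how identical classes were identified.

```python
def class_similarity(pretrain_classes, downstream_classes):
    """Percentage of downstream classes that also appear in the pre-training set."""
    shared = set(pretrain_classes) & set(downstream_classes)
    return 100.0 * len(shared) / len(downstream_classes)

# Example consistent with Table 6: if 18 of UCM's 21 classes occur in
# PatternNet, the similarity is 100 * 18 / 21 = 85.71 %.
```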
Table 7 compares the results of fine-tuning the weights pre-trained on the PatternNet, MLRSNet, and Resisc45 datasets. These results demonstrate that despite Resisc45's high similarity to the downstream datasets and its class diversity, the model pre-trained on PatternNet performs significantly better on land cover classification tasks.
Table 7.
Results on downstream tasks (averaged over 5 runs). The model pre-trained on PatternNet performs better than the other models. PD and DT stand for pretraining dataset and downstream task, respectively.

| PD ↓ \ DT → | AID | EuroSAT | UCM |
|---|---|---|---|
| Sim-Resisc45 | 97.62 | 97.75 | 98.24 |
| Sim-MLRSNet | 97.78 | 98.45 | 98.85 |
| Sim-PatternNet | 97.83 | 99.26 | 99.99 |
According to Table 6, the similarity of AID to MLRSNet, Resisc45, and PatternNet is 66.6 %, 60 %, and 30 %, respectively. Nevertheless, the weights obtained from the PatternNet dataset perform better than the other models when fine-tuned on AID. These results indicate that when using the SimSiam algorithm to learn visual representations from remote sensing images, choosing a higher-resolution dataset is a critical factor with a large impact on the final performance of downstream tasks. However, these conclusions are based on only three datasets, and additional experiments with diverse datasets are required for more precise generalizations.
B. Linear evaluation with a limited number of samples
In this section, we used linear evaluation to further examine the quality of the pre-trained features. Fig. 3 shows the general outline of the linear evaluation.
Fig. 3.
Schematic of a linear evaluator. Pre-trained model serves as feature extractor.
For linear evaluation of the pre-trained features, we freeze the backbone of the ResNet50 model and train only the head, i.e., the classifier. The backbone of the model therefore serves as a feature extractor. Since fine-tuning the whole model distorts all the pre-trained weights, we believe it is not an ideal way to compare different pre-trained models. In contrast, training only the last layer leaves the pre-trained weights unchanged and therefore better shows the capability of the pre-trained backbone to extract general features from remote sensing images. We conducted our experiments using a limited number of samples per class and report the global accuracy in Table 8.
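A minimal PyTorch sketch of this linear-evaluation setup is shown below; the checkpoint path and class count are placeholders, and we assume the SimSiam checkpoint has been converted to a plain backbone state dict.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_linear_evaluator(checkpoint_path: str, num_classes: int):
    """Frozen-backbone linear evaluation (sketch).

    checkpoint_path is a hypothetical file holding the pre-trained backbone's
    state dict; num_classes matches the downstream dataset (e.g. 30 for AID).
    """
    model = models.resnet50()
    model.load_state_dict(torch.load(checkpoint_path), strict=False)

    for param in model.parameters():
        param.requires_grad = False  # backbone acts purely as a feature extractor

    # The freshly created head is trainable by default.
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=4e-5)
    return model, optimizer
```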
Table 8.
Results of linear evaluation under limited number of samples (Averaged over 5 runs). The pre-trained model on the PatternNet performs better than other models. PD and DT stand for pretraining dataset and downstream task, respectively.
| DT → | AID | | | | EuroSAT | | | | UCM | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PD ↓ (images per class) | 5 | 10 | 20 | 50 | 5 | 10 | 20 | 50 | 5 | 10 | 20 | 50 |
| ImageNet | 45.45 | 52.36 | 63.14 | 70.17 | 39.36 | 46.45 | 51.22 | 59.71 | 40.43 | 50.33 | 56.72 | 63.21 |
| Scratch-Resisc45 | 43.82 | 49.61 | 62.25 | 68.42 | 32.40 | 41.05 | 47.19 | 55.65 | 31.02 | 44.81 | 52.15 | 59.01 |
| Scratch-MLRSNet | 44.32 | 51.14 | 61.50 | 68.02 | 36.33 | 44.92 | 50.87 | 57.07 | 38.92 | 48.42 | 55.89 | 62.63 |
| Scratch-PatternNet | 44.75 | 52.12 | 61.07 | 69.21 | 38.55 | 45.13 | 51.37 | 59.11 | 39.27 | 49.33 | 56.63 | 63.05 |
| Sim-Resisc45 | 72.32 | 75.44 | 81.74 | 86.56 | 77.50 | 80.12 | 85.16 | 90.93 | 77.89 | 82.11 | 87.95 | 92.15 |
| Sim-MLRSNet | 73.34 | 77.10 | 82.52 | 89.52 | 79.31 | 83.27 | 88.87 | 92.58 | 80.92 | 84.60 | 90.37 | 94.85 |
| Sim-PatternNet | 73.89 | 78.25 | 85.13 | 89.33 | 80.02 | 84.19 | 89.55 | 92.31 | 81.65 | 85.87 | 91.70 | 94.66 |
Within our study, we explore four scenarios distinguished by the number of samples per class: 5, 10, 20, and 50. Each scenario requires its own hyperparameters for good performance, so we give a concise overview of our approach for each. We chose the hyperparameters carefully to mitigate the risk of overfitting in these few-shot settings. We used the Adam optimizer with a cosine annealing learning rate scheduler. For the scenarios with very few samples per class (5 and 10), we added a dropout layer with a rate of 0.2 and used an initial learning rate of 4e-5. Throughout all experiments, we applied the best fine-tuning strategy found for each setting.
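For the limited-sample protocol itself, a sketch of drawing k images per class from the training split is given below; random selection with a fixed seed is our assumption, as the paper does not specify the sampling scheme.

```python
import random
from collections import defaultdict
from torch.utils.data import Subset

def sample_k_per_class(dataset, k: int, seed: int = 0) -> Subset:
    """Return a subset of `dataset` with exactly k samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    # For ImageFolder-style datasets, dataset.targets gives labels without
    # decoding the images.
    for idx, label in enumerate(dataset.targets):
        by_class[label].append(idx)
    indices = [i for idxs in by_class.values() for i in rng.sample(idxs, k)]
    return Subset(dataset, indices)
```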
According to Table 8, all models pre-trained in-domain with the SimSiam algorithm, whether trained from scratch or initialized with ImageNet weights, outperform the ImageNet model across all downstream tasks. Additionally, as the number of training samples decreases, the performance gap between the pre-trained models widens. A crucial observation is that self-supervised models initialized with ImageNet weights during pre-training outperform their counterparts pre-trained from scratch across all settings. For instance, Scratch-PatternNet and Sim-PatternNet were obtained with the SimSiam algorithm starting from random weights and from ImageNet weights, respectively, and Sim-PatternNet significantly outperforms Scratch-PatternNet. To close the accuracy gap between models pre-trained from scratch and those initialized with ImageNet weights, we suggest two potential solutions. The first is to increase the training data during pre-training, either through traditional data augmentation techniques or by generating similar data with generative models. The second is to provide more labeled data for the downstream dataset when fine-tuning the pre-trained model. In both cases, expanding the training data is essential, whether during pre-training or when solving the downstream problem; however, more data demands more computing resources, which are often scarce and costly. The results indicate that the method developed in this article effectively mitigates both problems.
As in the previous experiments, the PatternNet pre-trained model performs better than the other models in most linear-evaluation cases. These results lead us to prefer unsupervised models for capturing in-domain features from remote sensing images. One critical point that should be carefully considered is the characteristics of the source dataset: the results show that the source dataset should not only have high class diversity and high similarity with the target datasets but, above all, a high spatial resolution.
6. Conclusions
Recently, contrastive learning, a subset of self-supervised learning, has made significant progress in learning general visual representations of natural images. The available remote sensing datasets differ in the number of samples and channels, spatial resolution, and image size. It is therefore necessary to examine the transferability of self-supervised features pre-trained on remote sensing images and to determine the factors that make a dataset a good choice for feature pre-training. In this paper, we utilized SimSiam for in-domain general feature learning from three remote sensing datasets with different characteristics. The pre-trained weights were then evaluated by fine-tuning and linear evaluation on other land cover classification datasets, achieving state-of-the-art results. Our experiments demonstrate that for contrastive self-supervised pre-training on remote sensing images, higher-resolution datasets lead to better performance on downstream tasks.
Data availability statement (DAS)
The data that support the findings of this study are available from the corresponding author, [H.S.], upon reasonable request.
Funding statement
No funding is available for this work.
CRediT authorship contribution statement
Ali Ghanbarzadeh: Writing – original draft, Visualization, Software, Methodology, Formal analysis, Data curation, Conceptualization. Hossein Soleimani: Writing – review & editing, Validation, Supervision, Methodology, Investigation, Formal analysis, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
1. Máttyus G., Luo W., Urtasun R. DeepRoadMapper: extracting road topology from aerial images. Proceedings of the IEEE International Conference on Computer Vision. 2017.
2. Ghosh R., Jia X., Kumar V. Land cover mapping in limited labels scenario: a survey. arXiv preprint arXiv:2103.02429. 2021.
3. Singh S., et al. Self-supervised feature learning for semantic segmentation of overhead imagery. BMVC. 2018.
4. Mnih V., Hinton G.E. Learning to detect roads in high-resolution aerial images. European Conference on Computer Vision. Springer; 2010.
5. Neumann M., et al. Training general representations for remote sensing using in-domain knowledge. IGARSS 2020 IEEE International Geoscience and Remote Sensing Symposium. IEEE; 2020.
6. Risojević V., Stojnić V. The role of pre-training in high-resolution remote sensing scene classification. arXiv preprint arXiv:2111.03690. 2021.
7. Scott G.J., et al. Training deep convolutional neural networks for land–cover classification of high-resolution imagery. Geosci. Rem. Sens. Lett. IEEE. 2017;14(4):549–553.
8. Wu W., et al. PolSAR image semantic segmentation based on deep transfer learning—realizing smooth classification with small training sets. Geosci. Rem. Sens. Lett. IEEE. 2019;16(6):977–981.
9. Helber P., et al. EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Rem. Sens. 2019;12(7):2217–2226.
10. Cheng G., Han J., Lu X. Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE. 2017;105(10):1865–1883.
11. Penatti O.A., Nogueira K., Dos Santos J.A. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2015.
12. Nogueira K., Penatti O.A., Dos Santos J.A. Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recogn. 2017;61:539–556.
13. Liu Q., et al. Dense dilated convolutions merging network for land cover classification. IEEE Trans. Geosci. Rem. Sens. 2020;58(9):6309–6320.
14. Reed C.J., et al. Self-supervised pretraining improves self-supervised pretraining. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022.
15. Tao C., et al. Remote sensing image scene classification with self-supervised paradigm under limited labeled samples. Geosci. Rem. Sens. Lett. IEEE. 2020;19:1–15.
16. Lin D., et al. MARTA GANs: unsupervised representation learning for remote sensing image classification. Geosci. Rem. Sens. Lett. IEEE. 2017;14(11):2092–2096.
17. Neumann M., et al. In-domain representation learning for remote sensing. arXiv preprint arXiv:1911.06721. 2019.
18. Vincenzi S., et al. The color out of space: learning self-supervised representations for earth observation imagery. 2020 25th International Conference on Pattern Recognition (ICPR). IEEE; 2021.
19. Stojnic V., Risojevic V. Self-supervised learning of remote sensing scene representations using contrastive multiview coding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
20. Wu Z., et al. Unsupervised feature learning via non-parametric instance discrimination. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
21. Caron M., et al. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020;33:9912–9924.
22. Chen T., et al. A simple framework for contrastive learning of visual representations. International Conference on Machine Learning. PMLR; 2020.
23. Grill J., et al. Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. 2020.
24. Chen X., He K. Exploring simple Siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
25. Zhou W., et al. PatternNet: a benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS J. Photogrammetry Remote Sens. 2018;145:197–209.
26. Chen Z., et al. An improved pretraining strategy-based scene classification with deep learning. Geosci. Rem. Sens. Lett. IEEE. 2019;17(5):844–848.
27. Pires de Lima R., Marfurt K. Convolutional neural network for remote-sensing scene classification: transfer learning analysis. Rem. Sens. 2019;12(1):86.
28. Xia G.-S., et al. AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Rem. Sens. 2017;55(7):3965–3981.
29. Yang Y., Newsam S. Bag-of-visual-words and spatial extensions for land-use classification. Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. 2010.
30. Jing L., Tian Y. Self-supervised visual feature learning with deep neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020;43(11):4037–4058. doi: 10.1109/TPAMI.2020.2992393.
31. Bucci S., et al. Self-supervised learning across domains. IEEE Trans. Pattern Anal. Mach. Intell. 2021;44(9):5516–5528. doi: 10.1109/TPAMI.2021.3070791.
32. Ghiasi G., et al. Multi-task self-training for learning general representations. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
33. Caron M., et al. Deep clustering for unsupervised learning of visual features. Proceedings of the European Conference on Computer Vision (ECCV). 2018.
34. Asano Y.M., Rupprecht C., Vedaldi A. Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371. 2019.
35. Liu X., et al. Self-supervised learning: generative or contrastive. IEEE Trans. Knowl. Data Eng. 2021;35(1):857–876.
36. Jaiswal A., et al. A survey on contrastive self-supervised learning. Technologies. 2020;9(1):2.
37. Misra I., van der Maaten L. Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
38. He K., et al. Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
39. Ayush K., et al. Geography-aware self-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
40. Kang J., et al. Deep unsupervised embedding for remotely sensed images based on spatially augmented momentum contrast. IEEE Trans. Geosci. Rem. Sens. 2020;59(3):2598–2610.
41. Heidler K., et al. Self-supervised audiovisual representation learning for remote sensing data. arXiv preprint arXiv:2108.00688. 2021.
42. Falcon W., Cho K. A framework for contrastive self-supervised learning and designing a new approach. arXiv preprint arXiv:2009.00104. 2020.
43. Islam A., et al. A broad study on the transferability of visual representations with contrastive learning. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
44. Marmanis D., et al. Deep learning earth observation classification using ImageNet pretrained networks. Geosci. Rem. Sens. Lett. IEEE. 2015;13(1):105–109.
45. Castelluccio M., et al. Land use classification in remote sensing images by convolutional neural networks. arXiv preprint arXiv:1508.00092. 2015.