Abstract
Archaeology has long faced fundamental issues of sampling and scalar representation. Traditionally, local-to-regional-scale views of settlement patterns have been produced through systematic pedestrian surveys. More recently, systematic manual survey of satellite and aerial imagery has enabled continuous distributional views of archaeological phenomena at interregional scales. However, such “brute force” manual imagery survey methods are both time- and labor-intensive, as well as prone to inter-observer differences in sensitivity and specificity. The development of self-supervised learning methods (e.g., contrastive learning) offers a scalable learning scheme for locating archaeological features using unlabeled satellite and historical aerial images. However, archaeological features are generally visible in only a very small proportion of the landscape, and modern contrastive learning approaches typically yield inferior performance on such highly imbalanced datasets. In this work, we propose a framework to address this long-tail problem. As opposed to existing contrastive learning approaches, which typically treat labeled and unlabeled data separately, our proposed method reforms the learning paradigm under a semi-supervised setting in order to fully utilize the precious annotated data (<7% in our setting). Specifically, the highly imbalanced nature of the data is employed as prior knowledge to form pseudo-negative pairs by ranking the similarities between unannotated image patches and annotated anchor images. In this study, we used 95,358 unlabeled images and 5,830 labeled images to address the problem of detecting ancient buildings in a long-tailed satellite image dataset. Our semi-supervised contrastive learning model achieved a promising testing balanced accuracy of 79.0%, a 3.8% improvement over other state-of-the-art approaches.
Keywords: Machine Learning, Satellite Imagery, Semi-Supervised Learning, Contrastive Learning
1. Introduction
Archaeological structures and settlements are essential sources of information that archaeologists use to study the economic, political, and social systems of ancient civilizations. Conventional approaches to mapping and recording settlement locations at local and regional scales have relied on field-based pedestrian survey methods, which require professionals to physically examine the landscape for evidence of ancient material culture (Banning 2002; Phillips and Willey 1953; Sanders 1961; Balkansky et al. 2000; Alcock and Cherry 2016). However, the scale of field-based surveys is ultimately limited by the physical impediments of fieldwork. Moreover, the distribution of both survey and excavation zones is often unsystematic, which further complicates efforts to synthesize findings across field projects. Since the early 2000s, archaeologists have made use of high-resolution satellite imagery to understand the spatial and structural patterns of archaeological features at larger scales, including through step-wise visual identification of sites by trained specialists (Hanson and Oltean 2012; Fowler 2002; Lasaponara and Masini 2012; Bewley et al. 2016; Casana 2014; Lin et al. 2014; Parcak 2019). Such research has produced novel insights into macro- and inter-regional settlement patterns (Casana 2014; Casana and Cothren 2013; Wernke, VanValkenburgh, and Saito 2020). However, such “brute force” manual imagery survey methods are very labor-intensive, time-consuming, and prone to inter-observer differences in feature detection sensitivity and specificity (Casana 2014). In part, these issues are inherent to the nature of the data: archaeological features are generally very sparsely distributed across the landscape, making manual identification and labeling of archaeological features in satellite imagery a very low-yield endeavor. Observational fatigue and inter-observer differences in detection rates additionally pose unavoidable risks.
The resulting datasets are thus generally quite large in areal extent but come with few labels (Mnih and Hinton 2012; Casana and Cothren 2013). Developing an effective machine learning algorithm for automating information extraction on such large-scale, sparsely annotated, and unbalanced data is a long-standing challenge in remote sensing.
In recent years, the rapid development of self-supervised contrastive learning has shown promise for utilizing large-scale, sparsely annotated data. However, the proportion of images containing archaeological settlements is often relatively low (<7% in our setting). Such an unbalanced data distribution is problematic for modern contrastive learning algorithms (Chen and He 2021; Grill et al. 2020), leading them to excessively favor representations of the majority classes. By reformulating contrastive learning in a semi-supervised setting, we emphasize the critical role of the sparse but valuable annotated positive instances in both the training and fine-tuning stages.
In this work, we propose a novel self-supervised contrastive learning framework to identify ancient settlements through relict architectural feature detection in the south-central Andes. As opposed to existing self-supervised learning approaches, which typically model labeled and unlabeled data separately, we introduce a holistic, end-to-end semi-supervised learning framework that utilizes the highly unbalanced nature of the data to form pseudo-negative pairs by ranking the similarities between unannotated image patches and annotated anchor images. Specifically, pseudo-negative images are employed to calculate a supervised contrastive (SupCon) loss (Khosla et al. 2020), which is seamlessly integrated with the contrastive loss (Chen et al. 2020).
To test this approach, this project surveys an approximately 4,000 km2 region of the western cordillera of the southern Peruvian Andes (Figure 1). Utilizing images taken by the WorldView 2 and WorldView 3 satellite platforms, our dataset consists of 95,358 unlabeled images and 5,830 labeled images, where the ratio between positive and negative instances is roughly 1:100. We show that our semi-supervised contrastive learning model outperforms its self-supervised and fully-supervised counterparts, along with traditional supervised networks such as ResNet50.
Figure 1. Survey Region.

The study region encompasses approximately 4,000 km2 of the western cordillera of the southern Peruvian highlands, including portions of the modern Cusco and Arequipa districts. Sample tiles represent the diversity of land formation in the region and the sample locations are shown in black.
There have also been recent general computer vision studies of semi-supervised contrastive learning methods (Zhang et al. 2022; Yang et al. 2022) that allow for the optimal utilization of vast amounts of unlabeled data. These approaches tend to refine the quality of pseudo-labels by continuously selecting positive samples. However, the primary challenge in the archaeological domain lies in the extremely imbalanced proportion of positive and negative samples in the dataset (1:100). Such a dominant negative class can lead to ineffective training in conventional self- and semi-supervised contrastive learning. In comparison, our proposed model offers several advantages: (1) it leverages the foreground images, aligning with a class-specific few-shot learning design for the self-supervised contrastive task; and (2) its supervised contrastive task is balanced, which further ensures discriminative representations of the two classes in the latent space.
Innovation of the work
The innovation of this study is four-fold:
This study investigates a large, new survey region (approximately 4,000 km2 of the western cordillera of the southern Peruvian highlands, including portions of the modern Cusco and Arequipa districts) utilizing a cutting-edge representation learning approach.
We propose a novel contrastive learning scheme optimized for the unique challenges faced in remote sensing image analyses, namely (1) learning effectively from limited annotated data alongside large-scale unannotated data, and (2) handling a highly imbalanced data distribution (e.g., foreground objects of interest are far scarcer than background).
A new semi-supervised contrastive learning method is introduced that aggregates the advantages of previous (1) self-supervised and (2) supervised contrastive learning strategies. Compared with traditional approaches, the proposed method maximizes the utilization of large-scale unlabeled image data and small-scale labeled image data under a probabilistic learning model.
A similarity-based down-sampling approach is proposed for pseudo-label synthesis in both the latent-space learning and supervised learning components of the semi-supervised model.
2. Background and Related Research
This section provides an overview of the background and related research for contrastive representation learning and satellite remote sensing. The following brief literature survey includes a summary of recent contrastive learning methods and a discussion of the applications of machine learning in remote sensing research.
2.1. Contrastive Representation Learning
In contrast to supervised learning (Cunningham, Cord, and Delany 2008), which requires the presence of labeled inputs to predict outputs, self-supervised learning (Le 2013) refers to the identification of the hidden patterns of a dataset without the usage of any labels. Comprising what is a relatively new family of self-supervised learning methods, contrastive representation learning has recently become a key approach in solving various computer vision tasks with state-of-the-art performance (Wu et al. 2018; Noroozi and Favaro 2016; Zhuang, Zhai, and Yamins 2019; Hjelm et al. 2018; Chuang et al. 2020; Tian et al. 2020; Khosla et al. 2020; Cui et al. 2021). Designed to learn the general features of large datasets without labels, contrastive learning aims to pull similar sample pairs together while pushing dissimilar pairs apart. As a result, the model is capable of learning the high-level features of a dataset even with few or no labels available.
In recent years, various contrastive representation learning methods have been proposed with different implementations. SimCLR (Chen et al. 2020) aims to pull the representations of different views of the same image closer while repulsing the views of different images in the latent space. SwAV (Caron et al. 2020) applies online clustering on different augmentations of the same image instead of performing explicit pairwise feature comparisons. Wu et al. (2018) propose the use of an offline memory bank to store all data representations, with training data randomly selected for negative-pair minimization. Instead of utilizing an offline memory bank, MoCo (He et al. 2020) utilizes a momentum encoder to maintain a dynamic dictionary that serves as a negative sample pool, decoupling the pool size from the batch size. To further alleviate the cost of storing negative pairs, BYOL (Grill et al. 2020) incorporates an asynchronous momentum encoder into the model so that it can train using only positive pairs. Recently, SimSiam (Chen and He 2021) has been proposed to reduce GPU memory consumption by fully eliminating the momentum encoder. In addition, various efforts have been made to modify the contrastive learning approach within a fully-supervised setting, such as the SupCon loss proposed by Khosla et al. (2020).
2.2. Remote Sensing with Machine Learning
Satellite remote sensing has contributed to a variety of tasks, including climate change measurement, crop condition monitoring, natural disaster alerts, and archaeological site detection (Harris 1987). Satellites were first introduced to the field of archaeology in the late twentieth century, with Landsat and SPOT imagery being used for archaeological predictive modeling and feature detection (Leisz 2013). Since then, the use of satellite remote sensing for detecting archaeological sites has grown rapidly, leveraging all available technologies, from declassified CORONA imagery (Ur 2013) to the latest multi-spectral imagery (Abrams and Comer 2013).
Starting in the 2000s, the development of machine learning (and more particularly, representation learning) offered major breakthroughs in the analytical approaches applied to satellite images; this in turn led to seminal insights and discoveries in the field of archaeology (Lary et al. 2016; Camps-Valls 2009; Ali et al. 2015; Cooner, Shao, and Campbell 2016; Comer and Harrower 2013; Parcak 2019). Depending on the specific problem, various types of machine learning algorithms have been employed, such as support vector machines (SVM), decision trees, and random forests (Samui 2008; Azamathulla and Wu 2011; Friedl and Brodley 1997; Pal 2005). Recently, many deep learning (and more specifically, contrastive learning-based) methods have been applied to remote sensing (Hou et al. 2021; Yue et al. 2021; Wang et al. 2022; Liu et al. 2020; Hu et al. 2021).
3. Methods
The overall design of our framework is shown in Figure 2. An analysis of the backbone network and pseudo-label synthesis is presented below.
Figure 2. Overall framework.

This figure demonstrates the general structure of our semi-supervised contrastive learning framework. The upper panel shows the general flow of our framework, which was adapted from the SimSiam network. The lower panel describes the process of obtaining the supervised contrastive loss from predicted features. Detailed discussion can be found in the Methods section.
3.1. Overview of the SimSiam Framework
In this work, SimSiam is chosen as our backbone network due to its simplicity and effectiveness. Compared to other widely-used self-supervised representation learning networks, SimSiam removes all additional structures, such as negative samples (SimCLR), a momentum encoder (BYOL), or clustering (SwAV), and still learns strong representations of unlabeled datasets. The overall loss function consists of two separate losses, namely (1) a self-supervised contrastive loss (i.e., cosine similarity) and (2) a supervised contrastive loss. The rationale for employing both self-supervised and supervised losses is to form a new “semi-supervised” learning scheme for remote sensing image learning. Compared with traditional self-supervised contrastive learning (Chen and He 2021) and supervised contrastive learning (Khosla et al. 2020) approaches, the proposed method maximizes the utilization of large-scale unlabeled image data by incorporating the small-scale labeled image data. In addition, the recently introduced mixed-precision training feature is utilized to accelerate the training process.
Algorithm 1.
Pseudo-Code for generating pseudo negative pairs
| Input: An array of unlabeled class features: f_un |
| Input: An array of positive class features: f_pos |
| Output: An array of pseudo-negative class features: f_neg |
| 1: f_un = Normalize(f_un) |
| 2: f_pos = Normalize(f_pos) |
| 3: Divide f_un into groups of 16 → f_un_groups |
| 4: Randomly choose a feature from f_pos → f_positive |
| 5: f_neg = [ ] |
| 6: for f_un_group in f_un_groups do |
| 7: sim_array = Similarity_Function(f_un_group, f_positive) |
| 8: f_neg.append(f_un_group[ArgMedian(sim_array)]) |
| 9: end for |
| 10: return f_neg |
3.2. Pseudo-Label Synthesis
For the synthesis of pseudo-labels, a mixed array of unlabeled and positive class features is used as the starting point. The first step is to normalize this array. Next, the array is decoupled into X features of unlabeled images (Group 1) and Y features of positively-labeled images (Group 2). Then, we divide the X unlabeled class features into subgroups of size k (16 in our case), and we end up with X/k subgroups. Meanwhile, a single feature is randomly selected from the Y positive class features for future use.
For each subgroup, we apply the cosine similarity function in order to compute the similarity between the previously selected positive feature and the features in the subgroup. Following this, the unlabeled image with the median similarity score is assigned a pseudo label (i.e., negative class) based on the hypothesis that negative images dominate the distribution of the entire cohort. The 1:100 ratio used in this paper follows the design from a previous publication (Yang and Xu 2020). According to (Yang and Xu 2020), a higher imbalance ratio can impose an additional challenge towards the classification tasks as compared to a scenario with moderately imbalanced data.
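Assuming cosine similarity as the Similarity_Function and keeping, from each subgroup, the feature whose similarity to the anchor is the median (as the prose above states), the procedure might be sketched as follows. Function and variable names are illustrative, not from the authors' code:

```python
import numpy as np

def generate_pseudo_negatives(f_un, f_pos, group_size=16, rng=None):
    """Select one pseudo-negative feature per subgroup of unlabeled features.

    f_un:  (X, d) array of unlabeled features.
    f_pos: (Y, d) array of positively-labeled features.
    Returns an (X // group_size, d) array of pseudo-negative features.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # L2-normalize both arrays so dot products equal cosine similarities.
    f_un = f_un / np.linalg.norm(f_un, axis=1, keepdims=True)
    f_pos = f_pos / np.linalg.norm(f_pos, axis=1, keepdims=True)
    # Randomly choose a single positive anchor feature.
    anchor = f_pos[rng.integers(len(f_pos))]
    pseudo_negatives = []
    for g in range(len(f_un) // group_size):
        group = f_un[g * group_size:(g + 1) * group_size]
        sims = group @ anchor                 # cosine similarities to the anchor
        order = np.argsort(sims)
        median_idx = order[len(order) // 2]   # feature with the median similarity
        pseudo_negatives.append(group[median_idx])
    return np.array(pseudo_negatives)
```

Selecting the median-similarity image (rather than the least similar) is a conservative choice: it avoids both the most anchor-like images (possible unlabeled positives) and pathological outliers.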
The pseudo-code for generating such pseudo-labels is presented in Algorithm 1. As an example, it assumes that there are N unlabeled images. The ratio of positive images (Npos) to negative images (N − Npos) is roughly 1:100 in this study. Thus, if the size of the batch is B (B ≪ N) and the images are randomly selected, the probability of having exactly n positive image(s) in this batch is expressed as:
$$P(n) = \frac{\binom{N_{pos}}{n}\binom{N-N_{pos}}{B-n}}{\binom{N}{B}} \tag{1}$$
Following this, the probability of having one or less positive images (in other words, B or B − 1 negative images) among a randomly selected batch B, is:
$$P(n \le 1) = \frac{\binom{N-N_{pos}}{B} + N_{pos}\binom{N-N_{pos}}{B-1}}{\binom{N}{B}} \approx 1 \tag{2}$$
when N ≈ N−Npos. Therefore, every batch consists almost entirely of negative images, which makes it very likely that the pseudo-labels are indeed negative.
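As a sanity check, Equations (1) and (2) can be evaluated numerically with exact binomial coefficients. The figures below are illustrative assumptions: roughly 1% positives among the 95,358 images, and B taken as the subgroup size of 16 used for pseudo-labeling:

```python
from math import comb

def prob_at_most_one_positive(N, N_pos, B):
    """Hypergeometric probability of drawing <= 1 positive in a batch of B,
    i.e., Eq. (2): [C(N-N_pos, B) + N_pos * C(N-N_pos, B-1)] / C(N, B)."""
    p0 = comb(N - N_pos, B) / comb(N, B)
    p1 = N_pos * comb(N - N_pos, B - 1) / comb(N, B)
    return p0 + p1

# Assumed scale: ~1:100 positives among 95,358 images, subgroup size 16.
p = prob_at_most_one_positive(N=95_358, N_pos=954, B=16)
print(round(p, 3))  # close to 1, as the argument requires
```

With these numbers the probability is about 0.99, so a group of 16 almost never contains more than one positive image.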
3.3. Semi-Supervised Contrastive Learning
The key innovation of our method is our proposal of a new semi-supervised contrastive learning strategy. In short, we aggregate the standard cosine similarity loss with the supervised contrastive (SupCon) loss.
3.3.1. Self-supervised Contrastive Task
In each iteration of the training, X unlabeled images and Y positively-labeled images are utilized as inputs for our SimSiam model, and the generated symmetric loss is named loss_cosine. The formulas are defined as (Chen and He 2021):
$$\mathcal{D}(p_1, z_2) = -\frac{p_1}{\lVert p_1 \rVert_2} \cdot \frac{z_2}{\lVert z_2 \rVert_2} \tag{3}$$
$$\mathcal{L} = \frac{1}{2}\,\mathcal{D}(p_1, \mathrm{stopgrad}(z_2)) + \frac{1}{2}\,\mathcal{D}(p_2, \mathrm{stopgrad}(z_1)) \tag{4}$$
The formulas above show the SimSiam symmetrized loss for a single data point (image) x. z1 and z2 denote the encoding vectors of the two augmented views x1 and x2 (generated from x). In turn, p1 and p2 denote the projected views of the encoding vectors z1 and z2, obtained by adding an MLP head on top of the shared encoder. Equation (3) represents the negative cosine similarity. Equation (4) shows how a contrastive pair is generated from the two augmented views and used to compute the cosine similarity loss; the stop-gradient operation prevents the encoder from collapsing to a trivial solution.
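A minimal NumPy sketch of Equations (3) and (4); since no autograd is involved here, the stop-gradient is noted only in comments (in PyTorch it would be `z.detach()`):

```python
import numpy as np

def neg_cosine(p, z):
    """Negative cosine similarity, Eq. (3): D(p, z) = -(p/|p|) . (z/|z|)."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(p @ z)

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized SimSiam loss, Eq. (4). In an autograd framework,
    z1 and z2 would be wrapped in stop-gradient (e.g., z.detach())."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

The loss is bounded in [-1, 1] and reaches its minimum of -1 when each projection aligns perfectly with the other view's encoding.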
3.3.2. Supervised Contrastive Task
Following the self-supervised contrastive task, the encoding features z (mentioned in the self-supervised task) are employed to calculate the supervised loss, named loss_super. The encoding features go through the steps described in the Pseudo-Label Synthesis subsection, generating X/k pseudo-negative class features and Y positive class features. Finally, these features are combined as inputs to the SupCon (Khosla et al. 2020) loss function in order to calculate loss_super.
The formula for the SupCon loss is presented below (Khosla et al. 2020):
$$\mathcal{L}^{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \tag{5}$$
Equation (5) presents a generalized form of the supervised contrastive loss within a multi-view batch, where i ∈ I ≡ {1 … 2N} is the index of an arbitrary augmented sample (view) and τ is a scalar temperature parameter. The ”·” symbol denotes the dot product. The index i represents the anchor image, while P(i) represents all positive pairs of the anchor. A(i) ≡ I \ {i} indicates the remaining 2N − 1 views in the batch, excluding the anchor i. As shown in Equation (5), the numerator maximizes the similarity between the anchor and its positive pairs, while the denominator differentiates the anchor from negative samples.
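Equation (5) can be sketched directly in NumPy for small batches (a didactic loop-based version, not an optimized implementation; the temperature value is an assumption):

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss, Eq. (5).

    features: (M, d) array of embeddings (all views in the batch).
    labels:   (M,) array of class labels; same-label views are positives.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Normalize so dot products are cosine similarities, then scale by 1/tau.
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / temperature
    M = len(labels)
    total = 0.0
    for i in range(M):
        others = np.arange(M) != i                               # A(i) = I \ {i}
        positives = np.where((labels == labels[i]) & others)[0]  # P(i)
        if len(positives) == 0:
            continue  # anchors with no positives contribute nothing
        log_denom = np.log(np.exp(sim[i][others]).sum())
        total += -np.mean(sim[i][positives] - log_denom)
    return total
```

Well-separated classes should yield a much lower loss than mismatched labels, since the numerator then dominates the denominator for every anchor.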
3.3.3. Semi-supervised multi-task loss
In the final step, the total loss function is modeled as a multi-task loss design with different weighting parameters on the self-supervised (loss_cosine) and supervised (loss_super) losses. The optimization of the weighting parameters of the loss function is inspired by Kendall (Kendall, Gal, and Cipolla 2018), in which the weighting parameters (v1 and v2 in Equation (6)) are obtained in a data-driven manner based on training performance.
$$\mathcal{L}_{total} = \frac{1}{2 v_1^{2}}\, loss_1 + \frac{1}{2 v_2^{2}}\, loss_2 + \log v_1 + \log v_2 \tag{6}$$
In Equation (6), loss1 and loss2 represent the self-supervised contrastive loss and supervised contrastive loss, respectively. v1 and v2 denote the two weighting parameters that are automatically learned based on the training performance. Both parameters are randomly initialized and included in the per-batch loss calculation during the total loss optimization.
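One common way to implement such data-driven weighting follows Kendall et al.'s log-variance parameterization; the exact form used here is not specified in the text, so the sketch below is an assumption. With v_k = exp(s_k), each task contributes loss_k / (2 v_k^2) + log v_k, and s1, s2 would be learnable parameters updated alongside the network weights:

```python
import math

def total_loss(loss_cosine, loss_super, s1, s2):
    """Eq. (6) in log-variance form: v_k = exp(s_k), so
    loss_k / (2 * v_k**2) = 0.5 * exp(-2 * s_k) * loss_k and
    log(v_k) = s_k. The s_k terms keep the weights from collapsing to 0."""
    return (0.5 * math.exp(-2 * s1) * loss_cosine + s1
            + 0.5 * math.exp(-2 * s2) * loss_super + s2)
```

Parameterizing by s_k = log v_k keeps the weights strictly positive and makes the optimization numerically stable.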
4. Data and Experiments
The data used in the experiments are collected from WorldView 2 and WorldView 3 satellite constellations. The Data subsection below provides the descriptions of data acquisition and the preprocessing pipeline. The Experiment Design subsection describes the experimental design, including data setup, hyperparameters, and validation metrics.
4.1. Data
The satellite images used in this analysis were collected by the WorldView 2 and WorldView 3 satellite constellations and were provided by the Digital Globe Foundation following color correction and orthographic correction using a coarse digital elevation model (DEM). The data were then pan-sharpened using the Bayesian fusion algorithm from the Orfeo Toolbox (Grizonnet et al. 2017) so as to increase the spatial resolution of the multi-spectral imagery to 0.5 m for the WorldView 2 imagery and 0.3 m for the WorldView 3 imagery. In this study, all spectral bands except Red, Green, and Blue were dropped, and the imagery was re-quantized from 32 bits to 8 bits in order to reduce storage size and computational requirements. In total, the images covered approximately 12,000 km2. Finally, the study region was divided into approximately 1.6 million image tiles of 76.8×76.8 meters (256×256 pixels at 0.3 m resolution).
Due to the semi-arid environment and limited vegetation coverage of the south-central Andes, satellites are able to capture clear and unobscured images of the ground and of the archaeological features of interest. Furthermore, ancient structures in this region were primarily constructed from stone, leading to relatively good preservation and, consequently, high visibility in satellite imagery. Of the 1.6 million image tiles produced, 5,000 were randomly selected and manually coded for the presence/absence of archaeological buildings. To better balance the sample for the sparsely distributed modern and ancient settlements on the landscape, an additional set of 830 images known to contain examples of archaeological or modern structures was added to provide additional representation for those categories.
Since ancient buildings were the objects of interest, all images were labeled into two classes: ”ancient_building” (the presence of an archaeological structure, defined as a human-made structure less than 30 m in its largest dimension without evidence of modern roofing or maintenance) and ”no_ancient_building” (no presence of an archaeological structure). From the remaining unlabeled images, around 100,000 were randomly selected to train the self-supervised deep learning models. Those images were then visually examined, and defective ones (with missing data) were discarded, resulting in an unlabeled dataset of 95,358 images. Sample images are presented in Figure 3.
Figure 3. Example of annotated classes.

This figure demonstrates example classes of the annotated data. The left panel shows various types of unannotated images with a mixture of contents, including ancient/modern buildings, rock, soil, and grass. Due to the variety of potential objects, it is unrealistic to create a separate class for each unique combination of objects. Therefore, two classes were created based on the presence of ancient buildings, as shown in the right panel.
4.2. Experimental Design
(1) For semi-supervised contrastive pre-training: the training dataset consists of 95,358 unlabeled images and 258 labeled foreground images. (2) For the supervised downstream classification fine-tuning task: the 5,830 labeled images were divided into training, validation, and testing splits, ensuring that images from nearby physical space were placed into the same split in order to avoid data contamination. Additionally, in order to alleviate the unbalanced nature of our data source, the under-represented positive class (with ancient buildings) in our training dataset was up-sampled to have roughly the same size as the negative class. The details of the data split are shown in Table 1. As a last step, all labeled and unlabeled images were resized to 128×128 pixels to expedite training.
Table 1.
Datasets Setup
| Dataset | # of Ancient_Building | # of No_Ancient_Building |
|---|---|---|
| Training | 193 (Original) 3,088 (Upsampled) | 4,272 |
| Validation | 65 | 610 |
| Testing | 71 | 619 |
| Unannotated | 95,358 | |
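The up-sampling of the positive class shown in Table 1 (193 → 3,088, an exact 16× increase) is not described in detail; a simple replication-with-random-remainder sketch (a hypothetical helper, not the authors' code) could be:

```python
import random

def upsample_minority(items, target_size, seed=0):
    """Repeat the full minority-class list as many times as it fits into
    target_size, then draw the remainder at random without replacement
    (seeded for reproducibility)."""
    rng = random.Random(seed)
    reps, rem = divmod(target_size, len(items))
    return items * reps + rng.sample(items, rem)
```

Because 3,088 is exactly 16 × 193, every positive image would appear the same number of times under this scheme, keeping the training batches class-balanced without discarding any negatives.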
4.2.1. Semi-Supervised Contrastive Training
The proposed semi-supervised contrastive learning model was adapted from the SimSiam network with major modifications to the loss function. The SGD optimizer was used, initialized with a learning rate of 0.1, weight decay of 0.0001, and momentum of 0.9. The unlabeled training images used a batch size of 512, while the labeled dataset used a batch size of 16. Mixed-precision training was integrated into our network in order to accelerate training.
The model was trained for 200 epochs, taking approximately 100 hours on a workstation with an Intel Xeon Gold 5118 2.30 GHz CPU, 383 GB of memory, and two NVIDIA GeForce RTX 2080 Ti GPUs (11 GB dedicated GPU memory each).
4.2.2. Supervised Fine-Tuning and Testing
After pre-training the model using the unannotated data, an additional single linear layer was fine-tuned with the labeled data. The F1 score and the balanced accuracy (Wegier and Ksieniewicz 2020; Feng, Zhou, and Tong 2021) on the validation set were the metrics used to select the best-performing epoch as well as the optimal hyper-parameters.
4.2.3. Evaluation metrics
According to (Wegier and Ksieniewicz 2020), the F1 score aggregates the sensitivity and precision,
$$F_1 = \frac{2 \cdot precision \cdot sensitivity}{precision + sensitivity} \tag{7}$$
where the sensitivity (or recall) determines the accuracy of the minority class classification and precision indicates the probability of correct detection.
Balanced accuracy is the arithmetic mean of the sensitivity and specificity,
$$Balanced\ Accuracy = \frac{sensitivity + specificity}{2} \tag{8}$$
The specificity, in a binary case, indicates the accuracy of recognizing the negative (majority) class.
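Both metrics can be computed directly from confusion-matrix counts; a brief sketch:

```python
def f1_score(tp, fp, fn):
    """F1, Eq. (7): harmonic mean of precision and sensitivity (recall)."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return 2 * precision * sensitivity / (precision + sensitivity)

def balanced_accuracy(tp, fp, fn, tn):
    """Balanced accuracy, Eq. (8): mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2
```

On a heavily imbalanced test set, balanced accuracy avoids the inflation that plain accuracy gets from the dominant negative class, which is why it is reported here alongside F1.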
5. Results
Extensive experiments were designed to verify the effectiveness of our proposed model. A performance comparison with other state-of-the-art methods is presented, along with an ablation study.
5.1. Ablation Study
This ablation study consisted of two alternative versions of our proposed pretrained model. The first version conducted self-supervised contrastive learning using only loss_cosine, as discussed in the Methods section, following the SimSiam network (Chen and He 2021). By contrast, the second version conducted supervised contrastive learning using only loss_super. The corresponding testing results on the downstream labeled data are presented in Table 2 as SSL.
Table 2.
Quantitative results of different learning methods. SL and SSL correspond to Supervised Learning and Self-Supervised Learning, respectively. CE is short for Cross Entropy.
| Category | Model | Loss Function | Balanced Accuracy | F1 Score |
|---|---|---|---|---|
| SL | ResNet 50 (b) | Supervised CE Loss (Zhang and Sabuncu 2018) | 0.790 | 0.718 |
| SSL | SimCLR | Semi-Supervised Loss (Ours) | 0.769 | 0.613 |
| SSL | BYOL | Semi-Supervised Loss (Ours) | 0.757 | 0.696 |
| SSL | SimSiam | Semi-Supervised Loss (Ours) | 0.790 | 0.762 |
(a) ResNet 50 (from scratch). (b) ResNet 50 (ImageNet pretrained).
5.2. Comparison with Fully-Supervised Learning Benchmark
In addition to the contrastive learning frameworks discussed above, a fully supervised version of the experiment, trained from scratch using only the labeled images, was also established so as to further demonstrate the performance boost provided by the unlabeled data. Since our self-supervised framework employed ResNet50 as the backbone model, it is also used here for the canonical fully-supervised learning benchmarks. An SGD optimizer and cross-entropy loss were employed to create a standard training environment. The model was trained for 16 epochs, and the checkpoint with the best validation performance was selected for evaluation on the testing dataset. The results of the canonical fully-supervised benchmarks are presented in Table 2 as SL.
5.3. Experiments on Additional Contrastive Learning Frameworks
In order to further illustrate the effectiveness of our network design, two additional contrastive learning frameworks, BYOL and SimCLR, were utilized and modified to also incorporate the semi-supervised loss. The corresponding F1 scores and balanced accuracies are shown in Table 2. We then evaluated the accuracy on positive and negative images separately in this highly unbalanced scenario. Our semi-supervised loss mechanism yields an accuracy of 0.803 for positive images and 0.734 for negative images when using the SimCLR backbone, suggesting that it achieves balanced performance for both positive and negative cases.
6. Discussion
In this study, we developed a new, semi-supervised contrastive learning pseudo-label generation method based on the similarity matrix; in doing so, our ultimate goal was to enhance self-supervised training performance over a highly unbalanced dataset. By integrating self-supervised and supervised loss functions, we designed a new learning framework that simultaneously learns from both unannotated and annotated data.
The experiments yielded promising results. In the ablation study, the models trained using only loss_super were relatively ineffective in identifying ancient buildings, while the models trained using only loss_cosine produced competitive F1 scores and balanced accuracies; this indicated that pre-training on unlabeled data with self-supervised contrastive learning was essential for distinguishing ancient buildings from other objects. Nonetheless, our proposed semi-supervised model trained with loss_cosine + loss_super outperformed its self-supervised and fully-supervised counterparts by a clear margin. This result showed that the integration of fully-supervised and self-supervised networks was a complementary aggregation. Furthermore, when comparing fully-supervised learning benchmarks (such as ResNet50) with our model, our solution exhibited superior performance on the downstream labeled dataset.
Figure 4 offers additional insights into the performance of our framework. The upper-left and lower-right corners show correctly classified examples, while the upper-right and lower-left corners show examples of false positives and false negatives, respectively. From the ablation study, the superior performance of our model can be attributed to the dynamic combination of fully-supervised and self-supervised information. The loss generated from the positive and pseudo-negative images serves as a complement to the loss generated solely from unlabeled instances.
Figure 4. Testing Sample Results.

This figure presents representative samples from the testing results. The left panel indicates the true positive examples while the right panel indicates the true negative cases. Likewise, the upper row indicates the predicted positive ones while the lower row indicates the predicted negative ones.
We believe there are several potential improvements for our semi-supervised contrastive learning framework. First, our proposed pseudo-labeling strategy is designed specifically for highly imbalanced datasets. Second, the current model has not been extended to multi-label classification scenarios. Finally, the number of annotated training images in our study was still relatively small; to further improve performance, we would likely need more training data, especially more positive images (ancient buildings) drawn from the predicted-positive class.
7. Conclusion
In this project, we proposed a new semi-supervised contrastive learning method for identifying relict architectural features in the south-central Andes from satellite imagery. In contrast to existing solutions, we exploited the imbalanced nature of the large-scale unlabeled data to form pseudo-negative pairs. Using these pairs, we extended contrastive learning with a holistic scheme that combines a cosine similarity loss with a pseudo-supervision loss. In our experiments, the proposed framework achieved both higher balanced accuracy and a higher F1 score than its self-supervised and fully-supervised counterparts, ultimately outperforming traditional supervised networks (e.g., ResNet50) by 15% in F1 score and 8.9% in balanced accuracy.
These improved feature detection results show great promise for developing a machine-human teaming approach, in which human surveyors would not need to visually scan vast featureless areas, and instead could focus their efforts on categorizing, annotating, and enriching the attribute data on autonomously-identified features. This approach would also eliminate inter-observer differences in feature detection sensitivity and specificity, while also enabling greater transparency and reproducibility through the reporting of model parameters.
Acknowledgements
This work is supported by a Scaling Success Grant from Vanderbilt University. The imagery analyzed in this paper was provided by the Digital Globe Foundation through a generous satellite imagery grant (S. Wernke, P.I.). Computational resources were provided by the Vanderbilt University Spatial Analysis Research Laboratory (https://wernkelab.org/). This work has not been submitted for publication or presentation elsewhere.
Data Availability Statement
The data that support the findings of this study are available from the author, Steven A. Wernke, upon reasonable request.
