Expert teacher based on foundation image segmentation model for object detection in aerial images

Yinhui Yu; Xu Sun; Qing Cheng

2023 Dec 11;13:21964. doi: 10.1038/s41598-023-49448-9

PMCID: PMC10713596 PMID: 38082152

Abstract

Despite the remarkable progress of general object detection, the lack of labeled aerial images limits the robustness and generalization of detectors. Teacher–student learning is a feasible solution in the natural image domain, but few works focus on unlabeled aerial images. Inspired by the powerful generalization of foundation models in computer vision, we propose an expert teacher framework based on a foundation image segmentation model, called ET-FSM. Our approach provides performance gains for the student detector by generating high-quality pseudo-labels for unlabeled aerial images. In ET-FSM, we design a binary detector with an expert guidance mechanism to fully leverage the extra knowledge obtained from the foundation image segmentation model, which accurately detects object positions in complex backgrounds. We also present a momentum contrast classification module to distinguish easily confused object categories in aerial images. To demonstrate the effectiveness of the proposed method, we construct an unlabeled aerial image dataset covering various scenes. Experiments are conducted on diverse types of student detectors. The results show that the proposed approach achieves superior performance compared to existing methods, and allows the student detector to reach fully supervised performance with far fewer labeled aerial images. Our dataset and code are available at https://github.com/cq100/ET-FSM.

Subject terms: Computer science, Information technology, Electrical and electronic engineering, Imaging techniques, Imaging and sensing, Aerospace engineering

Introduction

Object detection for aerial images captured by UAVs (unmanned aerial vehicles) has been widely used in numerous practical applications, such as traffic surveillance, disaster relief and smart agriculture^1,2. Although general object detection has achieved notable success since the rise of deep learning, the complex working environment of UAVs and the scarcity of labeled aerial images impair the robustness and generalization of detectors, which limits the advancement and application of aerial image detection^3,4. Current works mainly focus on data augmentation and elaborate network architecture design to improve detector performance^5,6. Nevertheless, these methods ignore the potential of the widely available unlabeled aerial images⁷.

Recent semi-supervised object detection methods for natural images have obtained performance gains from large numbers of unlabeled images through teacher-student learning^8,9. Typically, this methodology adopts a complex, high-performance teacher model to generate pseudo-labels for unlabeled images, and then uses these pseudo-labels together with ground-truth labels to train a lightweight student model¹⁰. In the training process, accurate pseudo-labels are critical because they provide correct supervision information to the student model^11,12. Nevertheless, aerial images usually contain small objects and complex backgrounds, which cause teacher models to generate inaccurate predictions¹³.

Large language foundation models, such as GPT-4¹⁴, can generalize to unseen data distributions by training on abundant text corpora. Inspired by this, foundation models in the computer vision field are also developing rapidly^15,16. The segment anything model (SAM) released by Meta AI Research is the most representative in semantic segmentation tasks¹⁷. The model is trained on over one billion masks, and constructs a data collection loop to continuously enhance zero-shot and few-shot generalization. This property can help the detector resist noise disturbances in aerial image detection tasks.

To this end, we design an effective teacher framework based on foundation image segmentation model for object detection in aerial images. The proposed approach can transfer knowledge learned from unlabeled aerial images to the student detector, which makes the student detector achieve superior performance with a small number of labeled aerial images. Specifically, we propose a binary detector with expert guidance mechanism (EGD) to achieve the finer bounding box prediction by incorporating the guidance information provided from the foundation image segmentation model. Also, the momentum contrast classification (MCC) module is designed for object classification, which is able to distinguish confused object categories and boost the feature representation ability. The two key components can be used to generate accurate pseudo-labels for objects in complex aerial images. To prove the validity of the proposed method, we collect 14110 unlabeled aerial images under different scenes and conduct extensive experiments on Visdrone¹⁸ and UAVDT¹⁹ datasets. Our dataset and code are available at https://github.com/cq100/ET-FSM.

The primary contributions of our paper are as follows:

We present an expert teacher framework based on foundation image segmentation model called ET-FSM, which uses the unlabeled aerial images with high-quality pseudo-labels to enhance the robustness and generalization of the student detector. We also construct an unlabeled aerial image dataset to provide valuable resources for unlabeled data study in aerial image detection.
We design the binary detector with expert guidance mechanism (EGD) that treats the extra knowledge provided from the foundation image segmentation model as new image modality information. It is able to resist the background disturbances and accurately locate object positions.
We propose the momentum contrast classification (MCC) module. It regards object features of the same class as a cluster, and uses the cluster expectations for object classification, producing higher-quality pseudo category labels for easily confused objects in the aerial images.

Related work

Object detection in aerial images

Different from natural images, UAVs usually capture aerial images under varying illumination and uncontrolled outdoor conditions, which requires object detection models with strong robustness^20,21. Existing methods are mainly carried out in terms of model structure and labeled data²².

Designing elaborate model architectures facilitates better extraction of small object features. Nuisance disentangled feature transform²³ designed an extra nuisance prediction branch to learn robust features for each domain covering altitude, view and weather. Cascaded zoom-in detector²⁴ is a recent method that reuses detectors based on object density in the training and inference stages. This approach incurs substantial computational costs, making it difficult to deploy on embedded UAV platforms.

Using a large-scale labeled aerial image dataset to train detectors can intuitively enhance robustness, but manually labeling objects is time-consuming and labor-intensive²⁵. Data augmentation expands the dataset by providing diverse views of each sample. Uniform cropping²⁶, a popular augmentation approach, divided the aerial image into four equal-sized patches, and then merged these patches into the training set. Mask Re-sampling²⁷ generated numerous object chips from the dataset, and used masks to determine proper positions for these chips. These methods still extract features only from the labeled aerial image dataset, without utilizing the easily available unlabeled aerial images.
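
As a rough illustration of the uniform cropping idea described above, the following sketch splits an image array into four equal-sized patches on a 2 × 2 grid. The function name `uniform_crop` and the use of NumPy arrays are assumptions made for illustration; this is not the implementation from the cited work.

```python
import numpy as np

def uniform_crop(image):
    """Split an H x W x C image into four equal-sized patches (2 x 2 grid)."""
    h, w = image.shape[:2]
    half_h, half_w = h // 2, w // 2
    return [
        image[:half_h, :half_w],                      # top-left
        image[:half_h, half_w:2 * half_w],            # top-right
        image[half_h:2 * half_h, :half_w],            # bottom-left
        image[half_h:2 * half_h, half_w:2 * half_w],  # bottom-right
    ]

# Example with a blank frame at the resolution used for the collected dataset.
image = np.zeros((1080, 1920, 3), dtype=np.uint8)
patches = uniform_crop(image)
```

In practice the bounding-box annotations would also have to be shifted and clipped to each patch, which this sketch omits.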

Semi-supervised object detection

In recent years, semi-supervised learning has gradually focused on object detection tasks, which can be divided into consistency-based approach and pseudo labeling approach^28,29. The latter is a mainstream method that leverages the teacher model trained on ground-truth labels to generate pseudo-labels for unlabeled images, and retrains the student model with all data.

STAC³⁰ first followed the popular teacher-student learning framework to achieve semi-supervised object detection. Unbiased Teacher³¹ utilized a threshold to select more reliable pseudo-labels for student models. Soft Teacher³² dynamically adjusted the training loss weights for each pseudo-box, which alleviated the negative effects of incorrect pseudo-labels. Despite significant progress in natural images, these teacher models cannot be directly applied to aerial images. The bounding boxes of small objects in aerial images are particularly sensitive to noise perturbations, which leads to unreliable predictions from the teacher model. ScaleKD³³ is a recently released teacher-student learning scheme specifically for small object detection, and it designed a cross-scale assistant to reduce the adverse effect of the teacher model. ZoomInNet³⁴ distilled a standard teacher model by learning cross-scale knowledge of small objects.

Proposed method

ET-FSM overall

When detecting small and confused objects in aerial images, general teacher detectors have poor performance against complex background disturbances. To generate more accurate object bounding boxes and category soft labels, we decouple object localization and classification tasks. The overall architecture of the proposed approach based on SAM is presented in Fig. 1. Firstly, we employ the SAM to segment all objects in the aerial image. These segmented regions are categorized and mapped into expert masks. Afterwards, the expert mask serves as an image modality to guide the binary detector to predict more accurate object positions. Finally, the MCC module determines the specific category scores of these detected objects.

Fig. 1. The overall architecture of the proposed method based on SAM.

The ET-FSM is responsible for pseudo-label generation, and the student detector is optimized with these pseudo-labels and ground-truth labels. In principle, the student detector is arbitrary. We use Faster R-CNN³⁵ as a baseline example. The optimization loss L for student detector is calculated as follows:

$$L = L_{\sup} + \lambda \cdot L_{unsup}$$

where $L_{\sup}$ and $L_{unsup}$ are the loss functions computed on the labeled aerial images and unlabeled aerial images respectively, and $λ$ is a hyperparameter to balance the two losses.

$$\begin{aligned} L_{\sup} &= L_{cls} + L_{reg} \\ &= \sum_{i} \left( CE(x_{cls}^{i}, y_{cls}^{i}) + {Smooth}_{L1}(x_{reg}^{i}, y_{reg}^{i}) \right) \end{aligned}$$

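where $L_{cls}$ is the classification loss, $L_{reg}$ is the regression loss, CE is the cross-entropy loss function, ${Smooth}_{L1}$ is the smooth L1 loss function, $x^{i}$ is the prediction for the i-th sample, and $y^{i}$ is the corresponding ground-truth label; the subscripts cls and reg denote the classification and regression outputs.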

The $L_{unsup}$ loss is similar to the $L_{\sup}$ loss. The difference is that we employ the pseudo category soft labels generated by the MCC module for classification, and pseudo location labels obtained by the EGD for regression.
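
The loss structure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' training code: the helper names, the smooth L1 transition point `beta`, and the value `lam=1.0` are all hypothetical, and a real detector would compute these terms inside its own heads. The same `detection_loss` function serves both terms, with one-hot targets for labeled samples and soft category labels for unlabeled ones.

```python
import numpy as np

def cross_entropy(probs, target):
    """Cross-entropy between predicted class probabilities and a one-hot or soft target."""
    eps = 1e-12  # avoids log(0)
    return float(-np.sum(target * np.log(probs + eps)))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over box coordinates (beta is an assumed transition point)."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(np.sum(loss))

def detection_loss(cls_probs, cls_targets, box_preds, box_targets):
    """Sum over samples of classification plus regression loss."""
    return sum(
        cross_entropy(p, t) + smooth_l1(b, g)
        for p, t, b, g in zip(cls_probs, cls_targets, box_preds, box_targets)
    )

def total_loss(l_sup, l_unsup, lam=1.0):
    """L = L_sup + lambda * L_unsup (lam=1.0 is a placeholder, not a reported value)."""
    return l_sup + lam * l_unsup

# Labeled sample: one-hot class target and ground-truth box.
l_sup = detection_loss(
    [np.array([0.7, 0.2, 0.1])], [np.array([1.0, 0.0, 0.0])],
    [np.array([0.1, 0.1, 0.9, 0.9])], [np.array([0.0, 0.0, 1.0, 1.0])],
)
# Unlabeled sample: soft category label and pseudo box from the teacher.
l_unsup = detection_loss(
    [np.array([0.6, 0.3, 0.1])], [np.array([0.8, 0.15, 0.05])],
    [np.array([0.2, 0.2, 0.8, 0.8])], [np.array([0.25, 0.2, 0.8, 0.75])],
)
loss = total_loss(l_sup, l_unsup, lam=1.0)
```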

Aerial image dataset collection

To explore the performance gains of the proposed ET-FSM on the student detector, we construct an unlabeled aerial image dataset containing 14110 visible frames. This available resource is exceptionally valuable in studying unlabeled data and enhancing the detection capabilities of aerial images.

Our dataset covers multiple scenes, including traffic roads, campuses and parks. Each collected image has a resolution of 1920 $\times$ 1080. To improve adaptability to high-altitude missions, the images are captured in three height intervals: 0–30 m, 30–60 m, and 60–80 m. We provide the statistical information of the different shooting scenes and heights in Fig. 2. For hardware, we use the DJI Matrice M300 RTK UAV equipped with the Zenmuse H20 sensor to collect data, as shown in Fig. 3. The relevant equipment parameters are shown in Table 1.

Fig. 2. Distribution of images across capturing scenes and heights.

Table 1.

Relevant equipment parameters.

Devices	Parameters	Values
DJI Matrice M300 RTK	Max payload	2.7 kg
DJI Matrice M300 RTK	Hovering accuracy	Vertical: ±0.1 m; Horizontal: ±0.3 m
DJI Matrice M300 RTK	Max angular velocity	Pitch: 300°/s; Yaw: 100°/s
Zenmuse H20	Sensor	1/2.3” CMOS
Zenmuse H20	Focal length	4.5 mm
Zenmuse H20	Aperture	f/2.8

Binary detector with expert guidance mechanism

SAM is a foundation image segmentation model with powerful zero-shot generalization ability. When applied directly to aerial images, however, the segmentation performance of SAM may be unsatisfactory due to its sensitivity to environmental perturbations³⁶. Inspired by the multimodal object detection approach in aerial images³⁷, we treat the extra knowledge from SAM as new modality information to help the detector focus on the relevant object regions. Specifically, we propose a binary detector with expert guidance mechanism (EGD) to generate trustworthy bounding boxes for unlabeled aerial images.

The workflow of the designed detector is shown in Fig. 4. In the training stage, we employ SAM to segment the labeled aerial images into multiple regions, and use the MCC module to distinguish the objects and backgrounds in the segmentation results. The pixel values of object regions are set to 1 and those of the other regions are set to 0. The corresponding expert mask is then generated and stored locally. Because the stored masks are reused, segmentation does not have to be repeated in every epoch, which greatly reduces computation and time costs.
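
The mask construction and caching described above might look roughly as follows. This is a sketch under assumptions: the segmented regions are taken to be boolean arrays, the object/background decisions are passed in as a list of flags (in the paper they come from the MCC module), and the function names and the `.npy` cache file are hypothetical choices for illustration.

```python
import os
import tempfile
import numpy as np

def build_expert_mask(shape, region_masks, is_object):
    """Merge segmented regions into a binary mask: 1 on object regions, 0 elsewhere."""
    expert = np.zeros(shape, dtype=np.uint8)
    for mask, keep in zip(region_masks, is_object):
        if keep:
            expert[mask] = 1
    return expert

def cached_expert_mask(cache_path, shape, region_masks, is_object):
    """Load the mask from disk if present; otherwise build it once and store it."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    expert = build_expert_mask(shape, region_masks, is_object)
    np.save(cache_path, expert)
    return expert

# Toy example: two segmented regions, only the first judged to be an object.
regions = [np.zeros((4, 4), dtype=bool) for _ in range(2)]
regions[0][:2, :2] = True
regions[1][2:, 2:] = True
flags = [True, False]

with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "image_0001.npy")
    first = cached_expert_mask(path, (4, 4), regions, flags)   # builds and saves
    second = cached_expert_mask(path, (4, 4), regions, flags)  # loads from disk
```

The second call never touches the segmentation results, which is the saving that comes from storing the masks once instead of regenerating them each epoch.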

Fig. 4. Binary detector with expert guidance mechanism.

In the inference stage, the unlabeled aerial images are employed to generate expert masks. The expert mask concatenates with the original image along the channel dimension as the binary detector input to provide the guidance information. Moreover, our detector only performs binary classification of objects and backgrounds for accurate object location. We adopt the Faster R-CNN as the binary detector in this paper.
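
The channel-wise concatenation of the expert mask with the image can be illustrated with NumPy. The sketch assumes a channel-last H × W × 3 image and an H × W mask; the function name `make_detector_input` is hypothetical, and the detector's first layer would presumably need to accept four input channels.

```python
import numpy as np

def make_detector_input(image, expert_mask):
    """Stack an H x W x 3 image and an H x W expert mask into an H x W x 4 array."""
    if image.shape[:2] != expert_mask.shape:
        raise ValueError("image and expert mask must share height and width")
    mask_channel = expert_mask[..., np.newaxis].astype(image.dtype)
    return np.concatenate([image, mask_channel], axis=-1)

image = np.random.rand(600, 1000, 3).astype(np.float32)
expert_mask = np.zeros((600, 1000), dtype=np.uint8)
expert_mask[100:200, 300:400] = 1  # a region marked as object
detector_input = make_detector_input(image, expert_mask)
```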

Momentum contrast classification module

Since the disturbances caused by flying altitude, viewing angle, and weather conditions are more severe in aerial images, objects of different categories often have similar appearances. It is difficult for a general classifier to distinguish multiple easily confused object categories. Inspired by momentum contrast learning³⁸, we propose the momentum contrast classification (MCC) module to generate accurate pseudo category labels. Our module can be combined with most image classifiers. We use PVTv2 (pyramid vision transformer version 2)³⁹ in this paper.

The MCC module uses the classifier to encode input samples. Sample features of the same class are regarded as a cluster. The expectation vectors of the clusters are used for contrast classification. By minimizing the contrast classification loss $L_{c}$ , our module increases the similarity of object features from the same category, and dissimilarity to that of different categories.

$$L_{c} = -\log \frac{\exp (q \cdot c_{+} / \tau)}{\sum_{i = 0}^{C} \exp (q \cdot c_{i} / \tau)}$$

where q is an input encoded sample vector, $c_{+}$ is the expectation of the encoded vectors for the matched category, $c_{i}$ is that of category i, and $\tau$ is a temperature parameter. Each encoded sample vector is stored in a queue once its loss has been computed. When the queue reaches its maximum capacity Q, the expectation vector of each category is updated through momentum:

$$c_{i} \leftarrow m \, c_{i} + (1 - m) \, c_{i}^{'}$$

where m is a momentum parameter, and $c_{i}^{'}$ is the expectation of updated encoded vectors for category i.
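
A small NumPy sketch may help make the contrast classification loss and the momentum step concrete. It is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, the values `tau=0.07` and `m=0.99` are placeholders not reported in the text, and the update uses the standard exponential-moving-average form of a momentum update, m·c_i + (1 − m)·c_i′.

```python
import numpy as np

def contrast_loss(q, prototypes, positive, tau=0.07):
    """L_c = -log( exp(q.c_+ / tau) / sum_i exp(q.c_i / tau) )."""
    logits = prototypes @ q / tau
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[positive])

def class_expectations(queue_vectors, queue_labels, fallback):
    """Per-class mean of queued vectors; classes absent from the queue keep the fallback."""
    out = fallback.copy()
    for c in range(len(fallback)):
        members = queue_vectors[queue_labels == c]
        if len(members) > 0:
            out[c] = members.mean(axis=0)
    return out

def momentum_update(prototypes, new_expectations, m=0.99):
    """Exponential moving average: c_i <- m * c_i + (1 - m) * c_i'."""
    return m * prototypes + (1.0 - m) * new_expectations

# Three category expectation vectors and one encoded sample from category 0.
prototypes = np.eye(3)
q = np.array([1.0, 0.0, 0.0])
loss_match = contrast_loss(q, prototypes, positive=0)     # small: q is close to c_0
loss_mismatch = contrast_loss(q, prototypes, positive=1)  # large: q is far from c_1

# Once the queue is full, refresh the expectations with momentum.
queue_vectors = np.array([[0.9, 0.1, 0.0], [0.7, 0.3, 0.0], [0.0, 1.0, 0.0]])
queue_labels = np.array([0, 0, 1])
fresh = class_expectations(queue_vectors, queue_labels, fallback=prototypes)
prototypes = momentum_update(prototypes, fresh, m=0.99)
```

Minimizing the loss pulls a sample toward the expectation vector of its own category and away from the others, while the momentum step keeps the expectation vectors from changing abruptly between queue refreshes.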

In the training stage, we use the segmented background and object regions from the labeled aerial images. In the inference stage, the MCC module distinguishes objects and backgrounds output from the SAM on the unlabeled aerial images, and determines the specific category scores of objects output from the binary detector. Figure 5 presents encoded feature distribution. It can be observed that high-dimensional features from different categories lack clear representation of distribution boundaries in the dimensionality reduction visualization. After applying the MCC module, the feature points of the same category are more clustered. This indicates that the MCC module is conducive to distinguishing between easily confused objects from different categories.

Fig. 5. Encoded feature distribution (a) without the MCC module (b) with the MCC module.

Experiments

Implementation detail

The experiments are conducted on the Visdrone¹⁸ and UAVDT¹⁹ datasets, which provide 10209 and 38327 annotated images, respectively. We measure the performance gains of the proposed method on the student detector by adding the aerial images with pseudo-labels. To further test robustness, we apply ten corruption types to the two testing sets to simulate UAV-specific perturbations, following the same setting as the previous method⁴⁰.

Our method is trained on an NVIDIA Tesla P40 GPU platform with 24 GB memory, and the implementation is based on the MMDetection toolbox⁴¹. The input image size of student detectors is set to 1000 $\times$ 600 pixels. We set the batch size to 4, the initial learning rate to $1.0 \times 10^{- 4}$ and the number of epochs to 18. For the evaluation metric, we mainly adopt the AP (average precision), AP50, and AP75 to measure the detection performance.
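
For readers unfamiliar with these metrics: under the usual COCO-style convention, which we assume is the one followed here, AP50 and AP75 count a prediction as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5 or 0.75, respectively. The sketch below computes the IoU of two boxes; the function name `box_iou` and the `(x1, y1, x2, y2)` box format are illustrative assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A 20 x 20 pixel object whose predicted box is shifted by 2 pixels in x and y.
gt = (10, 10, 30, 30)
pred = (12, 12, 32, 32)
iou = box_iou(gt, pred)  # 324 / 476, about 0.68
```

For a small object, a two-pixel offset already drops the IoU below 0.75 while staying above 0.5, which shows how small localization errors affect AP75 more than AP50.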

Comparison experiments

We employ three types of base student detectors based on ResNet50⁴² to evaluate the performance gains of our method, including anchor-based Faster R-CNN (FRCNN)³⁵, RetinaNet⁴³ (Retina), and anchor-free FCOS⁴⁴. We also use the advanced standard-scale detector DINO⁴⁵ and the recently released UAV-specific detector CEASC⁴⁶ as student models. Table 2 shows the detection results of different student detectors. As can be seen, ET-FSM improves the performance of all detectors, validating the effectiveness of our approach. In particular, after using the proposed approach, the base detectors achieve performance competitive with the advanced standard-scale DINO and UAV-specific CEASC. The accuracy improvement on the Visdrone dataset is greater than on the UAVDT dataset. One reasonable explanation is that UAVDT contains more labeled aerial images and fewer categories, which lowers the upper bound on performance growth. Moreover, the increase in AP75 is smaller than the increase in AP50, because some position deviations exist in the generated pseudo-labels. After adding the UAV-specific perturbations, our approach increases the AP scores of the three base student detectors (FRCNN, Retina, and FCOS) by 6.6%, 7.0%, and 7.6% on the corrupted Visdrone dataset, respectively. These results indicate that the proposed method brings considerable gains in corruption robustness.

Table 2.

The detection results of different student detectors.

Method	Visdrone						UAVDT
	Clean accuracy			Corruption robustness			Clean accuracy			Corruption robustness
	AP (%)	AP50 (%)	AP75 (%)	AP (%)	AP50 (%)	AP75 (%)	AP (%)	AP50 (%)	AP75 (%)	AP (%)	AP50 (%)	AP75 (%)
FRCNN	21.4	37.4	21.5	16.6	20.9	16.3	17.1	29.2	18.6	10.4	18.9	9.2
Retina	15.7	29.8	14.6	11.1	23.6	10.2	15.8	30.2	15.3	12.8	24.8	11.7
FCOS	16.9	29.8	17.2	9.8	19.2	9.3	16.9	29.9	17.8	10.5	19.8	10.0
DINO	23.1	41.8	22.0	19.7	36.8	18.1	16.2	28.8	16.8	14.8	26.7	14.9
CEASC	19.5	32.1	20.4	14.9	25.9	15.3	16.7	27.3	19.0	14.9	25.2	16.3
FRCNN+ET-FSM	25.8	45.9	23.4	23.2	39.8	18.8	19.4	32.6	19.2	13.9	24.8	10.5
Retina+ET-FSM	21.1	39.3	17.7	18.1	29.8	14.3	17.2	32.5	15.6	15.2	24.9	11.9
FCOS+ET-FSM	20.6	34.4	18.6	17.4	28.1	12.7	18.1	32.5	18.0	12.2	24.7	10.3
DINO+ET-FSM	26.3	52.9	24.0	24.1	45.5	20.4	18.9	32.0	20.5	17.7	29.5	19.3
CEASC+ET-FSM	22.7	41.3	21.5	18.9	34.1	18.6	18.5	31.7	19.8	16.7	26.5	19.4

Table 3 provides the AP scores for each category on the clean Visdrone dataset. We compare the performance of the student model using only ground-truth labels, adding the pseudo-labels generated by the student model itself (SMI), and adding the pseudo-labels generated by the proposed ET-FSM. It can be observed that directly using the student model to generate pseudo-labels degrades the detection performance. We assume the reason for this phenomenon is that the student model predicts pseudo-labels with imprecise object bounding boxes and severe category confusion, leading to error accumulation. The ET-FSM method not only substantially boosts the detection accuracy of FRCNN on the categories with the most training instances, but also improves AP scores on most long-tail categories.

Table 3.

The AP scores for each category on the clean Visdrone dataset.

Method	AP (%)	AP50 (%)	AP75 (%)	Car	Bus	Van	Ped.	Motor	Truck	Person	Tricycle	Awn.	Bicycle
FRCNN	21.4	37.4	21.5	51.2	32.2	27.8	20.4	20.1	20.1	13.7	13.5	7.7	7.0
FRCNN+SMI	15.8	29.3	15.0	46.5	17.1	21.8	16.3	14.9	14.7	10.7	8.2	4.1	3.7
FRCNN+ET-FSM	25.8	45.9	23.4	64.8	40.1	34.0	24.8	24.0	24.0	16.0	14.8	7.5	7.7

We compare the performance of existing approaches on the clean Visdrone dataset in Table 4. For a fair comparison, we use ResNet50 as the backbone network, and implement these models under the same experimental conditions. In the inference stage, we do not perform any cropping operation. Uniform cropping²⁶ and the cascaded zoom-in detector²⁴ do increase the AP scores, but the improvement is marginal. Compared to these two methods, our approach achieves higher AP and AP50 scores. This suggests that fully leveraging unlabeled aerial images through the proposed teacher framework brings greater gains. We also compare with the advanced Soft Teacher³² from the natural image domain, and with ZoomInNet³⁴ and ScaleKD³³ from the aerial image domain, which are based on the teacher-student learning framework typically used for semi-supervised methods. ET-FSM outperforms these methods in AP and AP50, raising the AP of FRCNN and Retina by 4.4% and 5.4%, respectively, although Soft Teacher and ScaleKD obtain higher AP75 scores. This means that our method can effectively boost the small object detection ability of the student detector.

Table 4.

Performance comparison of existing approaches on the clean Visdrone dataset.

Method	AP (%)	AP50 (%)	AP75 (%)
FRCNN	21.4	37.4	21.5
Uniform cropping²⁶	22.7	40.7	22.2
Cascaded zoom-in²⁴	23.9	42.2	24.0
Soft Teacher³²	24.5	39.3	27.0
FRCNN+ET-FSM	25.8	45.9	23.4
Retina	15.7	29.8	14.6
ZoomInNet³⁴	17.3	33.3	16.3
ScaleKD³³	19.4	36.8	18.0
Retina+ET-FSM	21.1	39.3	17.7

Ablation study

Our ablation experiments are conducted on the clean Visdrone dataset. We evaluate the AP scores of the FRCNN baseline and ET-FSM under different proportions of unlabeled images in Fig. 6. It can be observed that our approach achieves greater performance gains when fewer labeled samples are available. When 75% unlabeled images are added, ET-FSM surpasses the fully supervised performance (100% labeled data) using only 25% labeled data.

Fig. 6. The AP scores comparison of FRCNN baseline and the ET-FSM under varying proportions of unlabeled images.

We further investigate the effect of the designed momentum contrast classification (MCC) module on the classification performance in Table 5. As can be seen, using the MCC module increases the macro-average score by up to 13.6% compared to the vanilla classifier, demonstrating the effectiveness of our module. We also explore the queue capacity Q. When Q is 512, the highest macro-average score of 85.4% is obtained, with a micro-average score of 93.5%; Q = 1024 gives a slightly higher micro-average score (93.8%) but a lower macro-average score (83.5%).

Table 5.

The effect of each component in the ET-FSM classification.

Vanilla classifier	MCC module	Q=128	Q=256	Q=512	Q=1024	Macro-average	Micro-average
$✓$						71.8	84.3
	$✓$	$✓$				81.3	89.8
	$✓$		$✓$			82.1	90.6
	$✓$			$✓$		85.4	93.5
	$✓$				$✓$	83.5	93.8
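
The text does not define the two averages precisely. Assuming they denote classification accuracy averaged per class (macro) and over all samples (micro), the difference can be illustrated as follows; the function name and the toy labels are hypothetical.

```python
import numpy as np

def macro_micro_accuracy(y_true, y_pred):
    """Return (macro, micro): mean of per-class accuracies, and overall accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    micro = float(np.mean(y_true == y_pred))
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    macro = float(np.mean(per_class))
    return macro, micro

# Imbalanced toy data: eight samples of class 0 and two of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]
macro, micro = macro_micro_accuracy(y_true, y_pred)  # macro = 0.75, micro = 0.9
```

Under this reading, the macro-average weights rare categories as heavily as frequent ones, which would explain why it is lower than the micro-average in every row of Table 5.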

Table 6 shows the ablation study of the binary detector with expert guidance mechanism (EGD). In ET-FSM, training the detector with the expert mask yields higher accuracy than directly inputting the original aerial image alone. For example, the AP scores increase by 14.8% and 13.7% on the Visdrone and UAVDT datasets, respectively. The results show that our detector achieves more accurate object bounding box prediction.

Table 6.

The ablation study of the EGD.

Method	Visdrone			UAVDT
	AP (%)	AP50 (%)	AP75 (%)	AP (%)	AP50 (%)	AP75 (%)
Vanilla detector	37.1	65.7	37.1	36.2	64.1	37.8
EGD	51.9	91.0	51.7	49.9	87.7	49.8

Figure 7 shows visualized detection results of our approach on the Visdrone and UAVDT datasets. It can be seen that small and occluded objects can be detected by using ET-FSM, and their categories, such as motor and people, can be clearly identified. In particular, the proposed method is able to detect easily confused objects against complex backgrounds in poorly illuminated scenes.

Conclusion

We propose an expert teacher framework, ET-FSM, based on a foundation image segmentation model to boost the robustness and generalization of student detectors in aerial images. Our approach takes full advantage of the knowledge from the powerful foundation image segmentation model to generate accurate pseudo-labels for unlabeled aerial images. Specifically, we design the binary detector with expert guidance mechanism (EGD) and the momentum contrast classification (MCC) module in ET-FSM so that the teacher predicts more accurate bounding boxes and object category scores. Moreover, we collect an unlabeled aerial image dataset in various real-world scenes, which provides abundant resources for unlabeled aerial image research. The experimental results show that the proposed method brings greater performance gains than advanced methods, and enables the student detector to outperform fully supervised training (100% labeled images) with only 25% labeled images when 75% unlabeled images are added.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 61671219.

Author contributions

Conceptualization, S.X.; data curation, S.X., C.Q. and Y.Y.; software, S.X. and C.Q.; formal analysis, S.X.; project administration, Y.Y.; supervision, Y.Y.; investigation, S.X.; writing-original draft, S.X.; writing-review and editing, S.X., C.Q. and Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Data availability

The dataset and code of the current study are available in the GitHub repository https://github.com/cq100/ET-FSM.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1.Heidari A, Navimipour NJ, Unal M, Hang G. Machine learning applications in internet-of-drones: Systematic review, recent deployments, and open issues. ACM Comput. Surv. 2023;55(12):1–45. doi: 10.1145/3571728.
2.Santhana KB, et al. Fusion of visible and thermal images improves automated detection and classification of animals for drone surveys. Sci. Rep. 2023;13:1–12. doi: 10.1038/s41598-023-37295-7.
3.Ding J, Xue N, Xia G-S, Bai X, Yang W, Yang MY, Belongie S, Luo J, Datcu M, Pelillo M. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022;44(11):7778–7796. doi: 10.1109/TPAMI.2021.3117983.
4.Wang W, Chen Y, Ghamisi P. Transferring CNN with adaptive learning for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2022;60:1–18.
5.Kumar, T., Mileo, A., Brennan, R. & Bendechache, M. Image data augmentation approaches: A comprehensive survey and future directions. Preprint at arXiv:2301.02830. (2023).
6.Deng L, Bi L, Li H, Chen H, Duan X, Lou H. Lightweight aerial image object detection algorithm based on improved yolov5s. Sci. Rep. 2022;13:1–10. doi: 10.1038/s41598-023-34892-4.
7.Li J, Sun B, Li S, Kang X. Semisupervised semantic segmentation of remote sensing images with consistency self-training. IEEE Trans. Geosci. Remote Sens. 2021;60:1–11.
8.Guo, Q., et al. Scale-equivalent distillation for semi-supervised object detection, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14502–14511. (2022)
9.Li, H., Wu, Z., Shrivastava, A. & Davis, L. S. Rethinking pseudo labels for semi-supervised object detection, in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1314–1322. (2022)
10.Mi, P., et al. Active teacher for semi-supervised object detection, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14462–14471. (2022)
11.Xu, B., Chen, M., Guan, W. & Hu, L. Efficient teacher: Semi-supervised object detection for yolov5. Preprint at arXiv:2302.07577 (2023).
12.Yu, J., et al. Pseudo-label generation and various data augmentation for semi-supervised hyperspectral object detection, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 304–311. (2022)
13.Zhang Y, Yan Z, Sun X, Diao W, Fu K, Wang L. Learning efficient and accurate detectors with dynamic knowledge distillation in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021;60:1–19.
14.Bubeck, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4. Preprint at arXiv:2303.12712 (2023).
15.Liang, F., et al. Open-vocabulary semantic segmentation with mask-adapted clip, arXiv, (2023).
16.Qin, J., et al. Freeseg: Unified, universal and open-vocabulary image segmentation. Preprint at arXiv:2303.17225 (2023).
17.Kirillov, A., et al. Segment anything. Preprint at arXiv:2304.02643 (2023).
18.Cao, Y., et al. Visdrone-det2021: The vision meets drone object detection challenge results, in Proceedings of the IEEE/CVF International conference on computer vision, pp. 2847–2854. (2021)
19.Du, D., et al. The unmanned aerial vehicle benchmark: Object detection and tracking, in Proceedings of the European Conference on Computer Vision (ECCV), pp. 375–391. (2018)
20.Deepanshi D, Barkur R, Suresh D, Lal S, Reddy CS, Diwakar PG. Rscdnet: A robust deep learning architecture for change detection from bi-temporal high resolution remote sensing images. IEEE Trans. Emerg. Top. Comput. Intell. 2023;7(2):537–551. doi: 10.1109/TETCI.2022.3230941.
21.Zhen P, Wang S, Zhang S, Yan X, Wang W, Ji Z, et al. Towards accurate oriented object detection in aerial images with adaptive multi-level feature fusion. ACM Trans. Multimed. Comput. Commun. Appl. 2023;19(1):1–12. doi: 10.1145/3513133.
22.Bai Y, Song Y, Zhao Y, Zhou Y, Wu X, He Y, et al. Occlusion and deformation handling visual tracking for UAV via attention-based mask generative network. Remote Sens. 2022;14(19):4756. doi: 10.3390/rs14194756.
23.Wu, Z., et al. Delving into robust object detection from unmanned aerial vehicles: A deep nuisance disentanglement approach, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1201–1210. (2019)
24.Meethal, A., Granger, E. & Pedersoli, M. Cascaded zoom-in detector for high resolution aerial images. Preprint at arXiv:2303.08747, (2023).
25.Hao F, Ma Z-F, Tian H-P, Wang H, Wu D. Semi-supervised label propagation for multi-source remote sensing image change detection. Comput. Geosci. 2022;170:105249. doi: 10.1016/j.cageo.2022.105249.
26.Zhang, X., Izquierdo, E. & Chandramouli, K. Dense and small object detection in UAV vision based on cascade network, in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 118–126. (2019)
27.Tang, Z., Liu, X., & Yang, B. Penet: Object detection using points estimation in high definition aerial images, in IEEE International Conference on Machine Learning and Applications, pp. 392–398. (2020)
28.Gao, M., et al. Consistency-based semi-supervised active learning: Towards minimizing labeling cost, in European Conference, pp. 510–526. (2020)
29.Liu, L., et al. Mixteacher: Mining promising labels with mixed scale teacher for semi-supervised object detection. Preprint at arXiv:2303.09061 (2023).
30.Sohn, K., Zhang, Z., Li, C.-L., Zhang, H., Lee, C.-Y. & Pfister, T. A simple semi-supervised learning framework for object detection. Preprint at arXiv:2005.04757 (2020).
31.Liu, Y.-C., et al. Unbiased teacher for semi-supervised object detection, in Int. Conf. Learn. Represent., (2021).
32.Xu, M., et al. End-to-end semi-supervised object detection with soft teacher, in Proceedings of the IEEE/CVF International Conference on Computer Vision (2021).
33.Yichen, Z., et al. Scalekd: Distilling scale-aware knowledge in small object detector, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023).
34.Liu B-Y, Chen H-X, Huang Z, Liu X, Yang Y-Z. Zoominnet: A novel small object detector in drone images with cross-scale knowledge distillation. Remote Sens. 2021;13(6):1198. doi: 10.3390/rs13061198.
35.Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017;39(6):1137–1149. doi: 10.1109/TPAMI.2016.2577031.
36.Zhang, J., et al. Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models. Preprint at arXiv:2304.10597 (2023).
37.Sun Y, Cao B, Zhu P, Hu Q. Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022;32(10):6700–6713. doi: 10.1109/TCSVT.2022.3168279.
38.He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9726–9735. (2020)
39.Wang W, Xie E, Li X, Fan D-P, Song K, Liang D, Lu T, Luo P, Shao L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media. 2022;8(3):415–424. doi: 10.1007/s41095-022-0274-8.
40.Yamada, Y. & Otani, M. Does robustness on imagenet transfer to downstream tasks? in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9205–9214. (2022)
41.Chen, K., et al. Mmdetection: Open mmlab detection toolbox and benchmark. Preprint at arXiv:1906.07155 (2019).
42.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778. (2016)
43.Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollar, P. Focal loss for dense object detection, in Proc. Eur. Conf. Comput. Vis., pp. 2999–3007. (2017)
44.Tian, Z., Shen, C., Chen, H. & He, T. Fcos: Fully convolutional one-stage object detection, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9626–9635. (2019)
45.Hao, Z., Feng, L., Shilong, L., Lei, Z., Hang, S., Jun, Z. et al. Dino: Detr with improved denoising anchor boxes for end-to-end object detection, in Proc. Int. Conf. Learn. Represent., (2022).
46.Bowei, D., Yecheng, H., Jiaxin, C. & Di, H. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023).

Data Availability Statement

The dataset and code of the current study are available in the github repository, https://github.com/cq100/ET-FSM.
