Scientific Reports. 2025 Nov 27;15:42281. doi: 10.1038/s41598-025-27872-3

CATR: CNN augmented transformer for object detection in remote sensing imagery

Sahibzada Jawad Hadi 1, Irfan Ahmed 1, Abid Iqbal 2, Ali S Alzahrani 2
PMCID: PMC12660941  PMID: 41309916

Abstract

Object detection in high-resolution aerial imagery is challenging due to scale changes, occlusion, clutter, and limited annotated datasets. While CNNs like YOLO and Faster R-CNN have progressed, they lack effective long-range dependency capture. We propose a CNN augmented detection transformer approach, which we call CATR. We compare the proposed framework with the transformer-based DETR and state-of-the-art CNNs on the DOTA dataset. DETR, with its end-to-end transformer and direct set predictions, streamlines the pipeline by removing anchor boxes and non-maximum suppression, improving robustness in cluttered aerial scenes. Our findings show DETR’s superior accuracy (72% mAP@0.5), outperforming CNNs by up to 13%. However, DETR has higher computational expense (86.3 GFLOPs) and slower speed (12 FPS). The proposed hybrid CNN-transformer architecture balances accuracy and speed, exploiting CNN features together with global attention for improved small-object detection, further augmented by CNN-based segmentation cues. This study confirms that transformer models, especially when combined with CNNs, are highly promising for complex aerial environments, offering a strong alternative to traditional CNNs by globally modeling context and occlusion. While efficiency improvements are ongoing, this research provides a valuable path for future geospatial applications, including remote sensing and disaster response.

Keywords: Remote sensing, CNN, YOLO, DETR, Transformer

Subject terms: Engineering, Mathematics and computing

Introduction

In recent years, object detection in high-resolution aerial imagery has become a focus of research and development across a number of fields, including remote sensing, urban planning, traffic surveillance, environmental monitoring, military, and disaster response1. As high-resolution aerial imagery from both satellite and airborne sensors has become widespread, and as resources and tools within computer vision and deep learning have emerged, new avenues for automated object detection have opened2. Yet aerial imagery remains a unique and persistent challenge, and a clear departure from typical ground-level scenes.

Some of the main difficulties for aerial object detection are the extreme variability of scale, the density of target objects, arbitrary object orientations, occlusions, overlapping instances, and background clutter3. Aerial images differ from typical street-view or natural images because of their inherent top-down bird’s-eye perspective. Due to this perspective, considerable shape and appearance distortion occurs relative to a standard bounding box. Objects such as aircraft, vehicles, vessels, and buildings can appear at starkly different scales, orientations, and positions in aerial views, and are often tightly packed in complex scenes. Shadows, environmental noise, and sometimes limited annotated datasets only increase the complexity of detection. The DOTA dataset, shown in Fig. 1, contains complex aerial scenes with different object types in dense distributions, objects with arbitrary orientations, and various scales. These difficult visual characteristics, such as occlusion and background clutter, require robust object detection systems capable of managing both Horizontal Bounding Boxes (HBB) and Oriented Bounding Boxes (OBB).

Fig. 1. Examples from the DOTA dataset showing multiple object types (e.g., airplanes, vehicles) in high-density and cluttered aerial scenes.

In this work, we conduct a systematic evaluation of DETR against a variety of state-of-the-art CNN-based detectors such as YOLOv5, YOLOv8, YOLOv9, YOLOv10, Faster R-CNN, RetinaNet, and SSD. Our evaluation is based on the DOTA (Dataset for Object Detection in Aerial Images) dataset, which includes high-resolution aerial images with multiple object classes of varying scales, aspect ratios, and angles of rotation. We consider both HBB and OBB annotations to evaluate the models’ ability to perform orientation-aware detection1. We employ accuracy metrics such as mAP@0.5 and mAP@0.5:0.95, and also report FLOPs, FPS, and parameter counts for a full assessment of the accuracy-efficiency trade-off4. Our results show that DETR achieves the highest detection accuracy among the baselines. To retain these accuracy benefits while mitigating DETR’s higher computational cost, we propose a hybrid DETR-CNN architecture called CNN augmented DETR (CATR). The hybrid model combines CNN-based feature extraction modules with transformer-based attention layers, with the aim of leveraging the localized feature hierarchies of CNNs and the global reasoning ability of transformers. The motivation for this hybridization is to improve the detection of smaller objects, enable faster convergence during training, and strike a better balance between real-time speed and accuracy5,6.

Related work

Object detection in aerial imagery has evolved significantly, transitioning from classical computer vision techniques, such as edge detection and hand-crafted features, which struggled with scale, viewpoint, and clutter3, to deep learning-based methods. The advent of Convolutional Neural Networks (CNNs) marked a paradigm shift, greatly enhancing detection accuracy and automation7,8. However, with increasing aerial image resolution and scene complexity, CNNs revealed limitations, including difficulty in capturing global dependencies, detecting small or occluded objects, resolving overlapping instances, and adapting to rotated targets2,9.

Traditional object detection has largely relied on CNNs. Prominent examples include two-stage detectors like Faster R-CNN7 and one-stage detectors such as RetinaNet8, Single Shot MultiBox Detector (SSD)3, and various YOLO (You Only Look Once) versions (v5 through v10)10,11. These models extract features using convolutional layers and predict object locations and classes, typically employing anchor boxes or direct regression.

  • Faster R-CNN is a two-stage detector known for its reasonable accuracy, though its inference speed can be a limitation for real-time applications7.

  • YOLO are fast, real-time one-stage detectors that utilize regression for bounding box prediction, often incorporating anchor box mechanisms10,11.

  • RetinaNet introduced focal loss to address class imbalance, but like Faster R-CNN, it relies on anchor boxes and requires region refinement strategies8.

  • SSD offers high speed but often struggles with small-scale objects, highly rotated instances, and complex aerial scenes with significant feature variation3.

A fundamental limitation of CNNs stems from their local receptive fields, which inherently restrict their ability to model long-range dependencies and global relationships within an image. This becomes particularly problematic in aerial imagery characterized by dense object distributions, occlusions, and multi-scale objects1,3,12. Object localization traditionally uses Horizontal Bounding Boxes (HBBs). While effective for upright objects, HBBs often enclose excessive background for rotated or tilted objects, leading to imprecise localization and reduced detection performance2,3. This issue is particularly pronounced in aerial imagery where objects can appear at arbitrary orientations. Consequently, Oriented Bounding Boxes (OBBs) have emerged as a superior alternative1,9. OBBs provide a tighter fit to an object’s true shape and orientation, significantly improving localization accuracy and clarity in densely populated scenes. Our study considers both HBB and OBB formats to provide a comprehensive performance comparison across architectures. Figure 2 visually demonstrates the enhanced localization capabilities of OBBs over HBBs, especially for rotated objects like aircraft and vehicles.
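To make the HBB/OBB distinction concrete, the snippet below is a minimal sketch, assuming OBB corners are given as four (x, y) points as in DOTA-style annotations; it shows how an oriented box collapses into its axis-aligned horizontal counterpart and why the HBB generally encloses extra background for rotated objects.

```python
import numpy as np

def obb_to_hbb(corners: np.ndarray) -> tuple:
    """Convert an oriented box, given as 4 (x, y) corners, into an
    axis-aligned horizontal box (xmin, ymin, xmax, ymax)."""
    xs, ys = corners[:, 0], corners[:, 1]
    return xs.min(), ys.min(), xs.max(), ys.max()

def area_ratio(corners: np.ndarray) -> float:
    """Ratio of OBB area to the enclosing HBB area (<= 1); smaller values
    mean the HBB contains more background for this object."""
    x, y = corners[:, 0], corners[:, 1]
    # Shoelace formula for the quadrilateral area of the OBB.
    obb_area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    xmin, ymin, xmax, ymax = obb_to_hbb(corners)
    return obb_area / ((xmax - xmin) * (ymax - ymin))

# A box rotated ~45 degrees: its HBB covers roughly twice the object's area.
corners = np.array([[10, 0], [20, 10], [10, 20], [0, 10]], dtype=float)
print(obb_to_hbb(corners), round(area_ratio(corners), 2))   # (0.0, 0.0, 20.0, 20.0) 0.5
```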

Fig. 2. Comparison of horizontal bounding box (HBB) and oriented bounding box (OBB) for object localization in aerial imagery.

The limitations of CNNs, particularly in modeling global spatial relationships and handling complex aerial scenes, spurred a significant paradigm shift with the advent of Vision Transformers (ViTs)2,12,13. The Detection Transformer (DETR) represents a novel, end-to-end approach that reframes object detection as a set prediction problem4. DETR eliminates the need for hand-designed components like anchor boxes, region proposal networks, and Non-Maximum Suppression (NMS) post-processing, simplifying the detection pipeline4,9.

As illustrated in Fig. 3, DETR integrates a CNN backbone with a transformer encoder-decoder module, leveraging its self-attention mechanism to model long-range dependencies and global context across the entire image7. This enables DETR to:

  1. Effectively capture global contextual information.

  2. Better distinguish overlapping and occluded objects.

  3. Significantly reduce the complexity of the detection pipeline.

These characteristics make DETR highly advantageous for aerial imagery, where traditional detectors often struggle with merged or inaccurate bounding boxes for closely located or irregularly oriented objects4.

Fig. 3. DETR architecture consisting of a CNN backbone, transformer encoder-decoder, and prediction heads.

Despite its robustness in handling cluttered, high-resolution, and rotated aerial imagery1,3,9,12, DETR faces challenges including a larger computational footprint, slower inference speed compared to YOLO or SSD, and difficulties with very small objects4,9,10. The fundamental difference between CNN-based and transformer-based detectors lies in their spatial relationship modeling, with DETR’s self-attention mechanism replacing anchor-based methods, as shown in Fig. 4.

Fig. 4. Overview of transformer architecture highlighting the global self-attention mechanism that captures long-range dependencies and replaces traditional region proposal and anchor-based methods.

Recent advances in self-supervised learning, particularly Masked Autoencoders (MAEs), offer a promising avenue for training vision models without extensive labeled datasets, which is highly beneficial in data-scarce aerial contexts5,6. As depicted in Fig. 5, MAE learns robust feature representations by reconstructing masked image patches, and MAE-pretrained models can then be fine-tuned for downstream detection tasks. Integrating MAE pre-training with models like DETR could further enhance the detection of small, occluded, or rare-class objects9.

Fig. 5. Illustration of masked autoencoder (MAE) self-supervised learning: The model learns to reconstruct missing parts of an image, enabling robust feature extraction without requiring extensive labeled datasets.

To address the limitations of standalone CNNs and transformers, hybrid models have emerged. These architectures typically combine CNNs for efficient multi-scale hierarchical feature extraction with transformers for global relationship modeling14. This synergy optimizes the trade-off between performance and computational cost by leveraging CNNs’ local spatial efficiency and transformers’ ability to understand long-range contextual dependencies12. Our study presents a conceptual DETR-based hybrid architecture, utilizing CNNs for fine-grained feature extraction (beneficial for small objects) and transformers for managing complex spatial relationships, an approach particularly relevant for aerial detection tasks involving occluded objects and varying viewpoints2,9,12.

Apart from traditional transformer-based detection frameworks, there has also been recent work on object and person detection in aerial and remote sensing imagery. For example, “Transformer-based Person Detection in RGB-T Aerial Images with Paired VTSaR”15 proposed a cross-modal fusion network that combines RGB and thermal modalities through a transformer backbone to bolster detection performance in low-light, occluded, and cluttered scenarios, conditions typical of aerial surveillance, and demonstrated the strength of attention-based feature fusion for detecting small-scale and thermally variant targets against cluttered backgrounds. Similarly, “Aerial Person Detection for Search and Rescue: Survey and Benchmarks”16 provided a comprehensive survey of existing aerial person detection models, datasets, and evaluation protocols, together with benchmark results. It also explicitly highlighted the need for lightweight architectures that maintain high performance under scale variation and sparse object distributions, motivating more efficient and hybrid designs. Additionally, “Robust Aerial Person Detection with Lightweight Distillation Network for Edge Deployment”17 evaluated a teacher-student distillation strategy implemented specifically for UAV and embedded platforms, achieving consistent accuracy with less computation and lower latency, making it suitable for real-time aerial operations. Collectively, these state-of-the-art contributions highlight the importance of attention-based reasoning, fusion across modalities, and computational efficiency for aerial object detection, which directly reflects the impetus for our proposed CNN-Augmented Transformer (CATR), which incorporates CNN-based spatial encoding and transformer-based global attention to balance accuracy with efficiency. Combining insights from these studies, CATR aims to offer strong detection robustness against the variety of scale, orientation, and environmental challenges commonly faced in high-resolution remote-sensing imagery.

In this work, we aim to address existing research gaps through the following objectives:

  • To conduct a comprehensive comparative analysis of DETR against state-of-the-art CNN-based object detection models (YOLOv5 – YOLOv10, Faster R-CNN, RetinaNet, SSD) using a remote sensing dataset.

  • To perform a detailed performance analysis considering both Horizontal Bounding Box (HBB) and Oriented Bounding Box (OBB) formats across all evaluated architectures.

  • To explore the trade-offs among detection accuracy, inference speed, model complexity, and performance in scenarios involving occlusion and object overlap.

  • To introduce and evaluate a novel DETR-based hybrid architecture designed for improved performance in complex aerial environments.

Our contributions

The prime contributions of our work are listed below:

  • We propose a hybrid CATR framework that combines the global attention of transformer architectures with CNN-based feature extraction, exploiting the advantages of both to detect small and closely located objects while remaining computationally viable for real-world use in remote sensing.

  • This paper provides a comprehensive comparison of the transformer-based Detection Transformer (DETR) model and multiple state-of-the-art CNN detectors (YOLOv5, YOLOv8, YOLOv9, YOLOv10, Faster R-CNN, RetinaNet, and SSD) on large-scale, high-resolution aerial imagery, offering commentary on the strengths and weaknesses of each approach. Each model is systematically evaluated in terms of detection accuracy (mean Average Precision at several IoU thresholds), computational complexity (GFLOPs), inference speed (frames per second), and resilience in real-world aerial imaging scenarios.

  • This study considers challenging overhead aerial imaging conditions, including occlusion, overlapping objects, and cluttered backgrounds, and shows the overall robustness of the DETR-style model, owing to its global self-attention mechanism. In particular, we consider both the Horizontal Bounding Box (HBB) and Oriented Bounding Box (OBB) detection tasks, showing that CATR performs very well on tasks involving arbitrarily oriented objects, which are commonly encountered in aerial scenes.

Baseline models

This study evaluates the performance of DETR and other computer vision models, focusing on their applicability for detecting and classifying objects in remote sensing imagery. Based on this performance analysis, the advantages of each model are combined to propose a hybrid structure. In our work, the benchmark models encompass modern deep learning architectures, including Convolutional Neural Networks (CNN)18 and YOLO versions 5 through 1019. The rationale for selecting each model, along with their architectural and functional details, is discussed in the following subsections.

Detection transformer (DETR)

Detection Transformer (DETR) represents a novel approach to object detection, distinguished by its reliance on transformer-based architectures rather than handcrafted components or region proposals20. Originating from work at Facebook AI Research, DETR uses a convolutional backbone, usually a ResNet-50, to extract multi-scale spatial and semantic features that preserve the hierarchical representation of the image built up by convolutional computations. The extracted features are then fed to a transformer encoder-decoder, which models global contextual dependencies over the whole image via multi-head self-attention. While earlier Vision Transformers (ViTs) abandoned convolutional operations altogether and replaced them with patch embeddings, DETR’s attention-based reasoning retains the advantages of CNN-based feature extraction and instead provides a mechanism for reasoning over those features for object localization and classification. This fundamentally distinguishes DETR from a pure ViT approach, because the design combines the localized feature resolution of CNNs with the long-range relational modeling of transformers in a single detection framework. By eliminating traditional handcrafted components, such as anchor boxes, region proposal networks, and post-processing algorithms like Non-Maximum Suppression (NMS), DETR further simplifies the detection pipeline, resulting in a clean end-to-end formulation of the object detection problem itself. As a result, DETR not only improves spatial and global context awareness, but also offers a more generalizable and scalable solution for complex detection problems in high-resolution aerial imagery, where objects may exhibit large variations in scale, rotation, and occlusion.

As shown in Fig. 3, the DETR architecture we applied comprises three main parts: (a) a ResNet-50 convolutional backbone to extract multi-scale spatial and semantic features, (b) a transformer encoder-decoder module to model long-range dependencies and global context with multi-head self-attention, and (c) prediction heads that output class probabilities and orientation-aware bounding boxes simultaneously. This modular integration allows DETR to capture global contextual information while preserving the spatial resolution important for object detection in high-resolution aerial imagery.
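For readers who want to reproduce a DETR baseline quickly, the sketch below loads the publicly released ResNet-50 DETR weights via torch.hub and runs a single forward pass. The repository name and output keys follow the official facebookresearch/detr release; the input file name and the 0.7 confidence threshold are arbitrary choices for illustration, not values from this study.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Load the pretrained DETR (ResNet-50 backbone) released by Facebook AI Research.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

preprocess = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open('aerial_patch.jpg').convert('RGB')   # hypothetical input patch
with torch.no_grad():
    outputs = model(preprocess(img).unsqueeze(0))

# DETR emits a fixed set of 100 predictions: class logits and normalized
# (cx, cy, w, h) boxes; low-confidence queries fall into a "no object" class.
probs = outputs['pred_logits'].softmax(-1)[0, :, :-1]
keep = probs.max(-1).values > 0.7
print(outputs['pred_boxes'][0, keep])
```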

Convolutional neural network (CNN)

CNNs have established themselves as a cornerstone in image analysis due to their capability to learn spatial hierarchies from pixel-level data through convolutional operations. The architecture consists of stacked layers that perform filtering, non-linear transformation, and downsampling, progressively abstracting image features. A representative CNN architecture used in this study is shown in Fig. 6.

Fig. 6. Architecture of a layer-wise custom CNN model.

Many CNN architectures are still used for a variety of object detection tasks, including remote sensing21, plant disease detection22, and many other areas. We adopted CNN architectures as baselines due to their proven efficacy in object detection tasks. The application of CNNs in remote sensing for object detection is likewise facilitated by their layered architecture. The role of CNNs in our work is depicted in Fig. 7.

Fig. 7. The application of CNN in our work.

In addition to custom-built CNN models, we employed architectures that have been used effectively for object detection in remote sensing to date: RetinaNet, Faster R-CNN, Single Shot Detector (SSD), and several YOLO versions.

You only look once (YOLO)

YOLO (You Only Look Once) is a real-time object detection algorithm that balances accuracy and speed, making it ideal for applications requiring fast inference. YOLOv5, developed by Ultralytics LLC23, achieves high detection precision while maintaining a lightweight architecture suitable for embedded deployments19.

YOLOv5 follows a single-shot detection paradigm, comprising three main components:

  • Backbone: Responsible for initial feature extraction using Cross Stage Partial (CSP) connections.

  • Neck: Aggregates features at multiple scales to enhance detection robustness.

  • Head: Performs classification and bounding box regression.

Compared to YOLOv4, YOLOv5 significantly reduces model size (by almost 90%) without sacrificing performance. Its optimized structure and real-time capability made it a strong candidate for inclusion in this study, particularly for detecting objects in time-sensitive aerial scenarios.
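As a hedged illustration of how a YOLO baseline can be exercised (not the authors' exact training code), the snippet below loads a small pretrained YOLOv5 checkpoint through the Ultralytics torch.hub entry point and runs inference on a single image; the image path is a placeholder.

```python
import torch

# Load a small pretrained YOLOv5 model via the Ultralytics hub entry point.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.25  # confidence threshold applied when filtering detections

# Run single-shot inference; YOLOv5 handles resizing and NMS internally.
results = model('aerial_patch.jpg')      # hypothetical aerial image patch
results.print()                          # summary of detected classes
boxes = results.xyxy[0]                  # (x1, y1, x2, y2, confidence, class) per detection
print(boxes[:5])
```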

Experimental setup

This section outlines the experimental setup, configuration parameters, and evaluation criteria adopted for model assessment in this study.

Experimental environment

All experiments were performed using Python 3.8, and the libraries used for experimentation included NumPy, OpenCV, PyTorch, and Torchvision. Model training and inference were set up on a system containing an Intel Core i5 CPU with 32 GB RAM and NVIDIA Tesla V100 or RTX 3080 GPUs (6 GB VRAM) to match the scale of large aerial imagery. Prior to training and testing, the curated datasets were preprocessed and loaded into the environment, and the necessary modifications and preprocessing steps were applied to ensure compatibility and optimize model performance.

Performance evaluation metrics

In order to quantify model performances, the following metrics are used:

Mean average precision (mAP)

The mAP measures the effectiveness of the detection, combining classification accuracy and bounding box overlap at an Intersection over Union (IoU) of 0.5.

$$\mathrm{IoU} = \frac{\mathrm{AOO}}{\mathrm{AOU}} \tag{1}$$

where AOO is the Area of Overlap and AOU is the Area of Union between the predicted and ground-truth bounding boxes.
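As a small worked example of Eq. (1), the function below computes the IoU of two axis-aligned boxes in (xmin, ymin, xmax, ymax) format; it is an illustrative sketch rather than the evaluation code used in the benchmark.

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)          # area of overlap (AOO)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                             # area of union (AOU)
    return inter / union if union > 0 else 0.0

# Two partially overlapping boxes; a detection counts as correct at mAP@0.5 if IoU >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.333...
```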

F1 score

The F1 score reflects a model’s ability to balance false positives and false negatives. It is the harmonic mean of two important performance measures, precision and recall, and is written as:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{2}$$

Inference time

In our work, we measured not only accuracy but also the time the trained model takes to produce predictions when deployed. This reflects the performance of the trained model in real-time application scenarios and is measured in Frames per Second (FPS), which can be written as:

$$\mathrm{FPS} = \frac{\text{Number of frames processed}}{\text{Total inference time (seconds)}} \tag{3}$$
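A minimal timing sketch for Eq. (3) is shown below; it assumes a generic PyTorch `model` and a list of preprocessed image tensors, and synchronizes the GPU so the measured wall-clock time reflects actual inference work.

```python
import time
import torch

def measure_fps(model, images, device='cuda'):
    """Estimate frames per second as (number of frames) / (total inference time)."""
    model = model.to(device).eval()
    with torch.no_grad():
        start = time.perf_counter()
        for img in images:
            _ = model(img.unsqueeze(0).to(device))
        if device == 'cuda':
            torch.cuda.synchronize()            # wait for queued GPU work to finish
        elapsed = time.perf_counter() - start
    return len(images) / elapsed

# Example (hypothetical): fps = measure_fps(detector, val_tensors)
```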

Dataset

In this work, we employ the DOTA (Dataset for Object Detection in Aerial Images) dataset, which is widely recognized as the gold-standard benchmark for aerial object detection due to its scale, diversity, and challenging real-world scenarios. DOTA is a large-scale dataset explicitly developed for object detection in aerial imagery and its unique challenges. It includes over 280,000 labeled objects classified into 15 object categories (airplanes, ships, vehicles, bridges, etc.) at high spatial resolution. The high resolution of the images allows the model to learn the detailed features necessary for detecting the small and dense objects that tend to occur in aerial scenes. DOTA also provides annotations as both Horizontal Bounding Boxes (HBB) and Oriented Bounding Boxes (OBB), as shown in Fig. 8.
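For orientation, the sketch below parses a DOTA-style annotation file, assuming the common text format in which each object line lists four corner coordinates, the category name, and a difficulty flag; the file path is hypothetical.

```python
def parse_dota_labels(path):
    """Parse a DOTA-style annotation file: each object line holds the four OBB
    corners (x1 y1 ... x4 y4), the category name, and a difficulty flag."""
    objects = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) < 10:          # skip header lines such as 'imagesource:' / 'gsd:'
                continue
            coords = list(map(float, parts[:8]))
            corners = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
            objects.append({
                'corners': corners,       # oriented box (OBB)
                'category': parts[8],
                'difficult': int(parts[9]),
            })
    return objects

# Example (hypothetical path): objs = parse_dota_labels('labelTxt/P0001.txt')
```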

Fig. 8. DOTA annotations in horizontal bounding boxes (HBB) and oriented bounding boxes (OBB).

Another advantage is that the DOTA dataset contains scenarios with dense object distributions, cluttered and noisy backgrounds, varying lighting conditions, occlusions, and overlapping objects. These factors are challenging for object detectors and make DOTA a demanding testbed for assessing model robustness. Annotated DOTA images are shown in Fig. 9. The DOTA dataset is a strong benchmark for evaluating transformer and hybrid CNN-transformer object detectors because of its large scale, complex annotations, and real-world challenges such as small, dense, and oriented objects, clutter, and occlusions. It enables rigorous testing of orientation sensitivity, small-object detection, and contextual understanding in aerial imagery.

Fig. 9. Examples of annotated images in DOTA.

To evaluate the model’s behavior across other domains, we assessed three additional, widely used remote sensing datasets in addition to DOTA: (1) UCAS-AOD, (2) HRSC2016, and (3) NWPU VHR-10. Each dataset poses a different set of environmental and geometric challenges, which further tests the flexibility of our model in different aerial contexts.

  • UCAS-AOD: This dataset contains ultra-high resolution aerial images labelled for aircraft and vehicle categories. It poses challenging detection problems, with objects varying in orientation and scale and with additional environmental clutter from the background scene.

  • HRSC2016: This is a benchmark dataset dedicated to ship detection, covering ships at different scales, angles, and densities over complex maritime backgrounds. It provides a concrete opportunity to evaluate how the model performs on elongated objects.

  • NWPU VHR-10: This dataset covers ten object classes, including airplanes, ships, storage tanks, and vehicles. It has emerged as a standard benchmark for small-object detection in densely packed, complex urban and industrial scenes.

In Fig. 10, we present representative images from the three datasets (UCAS-AOD, HRSC2016, and NWPU VHR-10) to show the variety and challenges present in the data used in this research.

Fig. 10. Sample images from UCAS-AOD, HRSC2016, and NWPU VHR-10 datasets.

To achieve a consistent methodology and fair benchmarking across the four datasets used in this study (DOTA, UCAS-AOD, HRSC2016, and NWPU VHR-10), each dataset was split into independent train (70%), validation (20%), and test (10%) subsets. Although datasets such as DOTA have predefined splits for training, validation, and testing, we adopted a uniform scheme across all datasets to preserve experimental fairness and avoid distributional bias when comparing model performance. This splitting scheme supports comparable assessments of generalization at the cross-dataset level, allows the proposed CATR model to be trained and validated under the same experimental conditions, and improves the robustness and reproducibility of the evaluations.

Methodology

This section describes the overall methodology used to conduct the experimentation. It is structured around the process of designing, training, testing, and calibrating the transformer- and CNN-based object detection models, and then combining them to provide the benefits of both for aerial imagery. The detailed structure of our work is depicted in Fig. 11.

Fig. 11. Experimental methodology for current study.

Data preparation & augmentation

In our dataset, all corrupted, duplicate, or mislabeled samples were filtered out to ensure high-quality data. Each image is uniformly resized to a fixed input resolution to match the input dimensions required by our models, and pixel values are normalized to the [0, 1] range to facilitate efficient convergence during training.

To improve the robustness and generalization of the model, extensive data augmentation techniques are applied. These included full 360-degree random rotations to account for diverse aerial viewpoints, horizontal and vertical flips to increase variability, and brightness and contrast adjustments to mimic different lighting conditions. Additive White Gaussian Noise (AWGN) is also added to simulate sensor-induced disturbances, while cropping and splitting high-resolution images into smaller patches ensured that finer spatial details were adequately captured during training.
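The augmentation recipe above can be sketched as follows. This is an illustrative image-level pipeline, assuming torchvision transforms and a custom additive-Gaussian-noise step; in a real detection setting the corresponding bounding boxes would have to be transformed alongside the images.

```python
import torch
import torchvision.transforms as T

class AddGaussianNoise:
    """Additive white Gaussian noise (AWGN) to mimic sensor-induced disturbances."""
    def __init__(self, std=0.02):
        self.std = std
    def __call__(self, tensor):
        return torch.clamp(tensor + torch.randn_like(tensor) * self.std, 0.0, 1.0)

augment = T.Compose([
    T.RandomRotation(degrees=180),                   # full 360-degree random rotation
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.ColorJitter(brightness=0.3, contrast=0.3),     # lighting variation
    T.ToTensor(),                                    # also scales pixels to [0, 1]
    AddGaussianNoise(std=0.02),
])

# Example (hypothetical): augmented = augment(pil_image)
```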

To be consistent with the standard DOTA preprocessing pipeline, all high-resolution images were cropped into fixed-size patches with a 10% overlap between adjacent patches to avoid losing information at patch boundaries. This approach guarantees that each object of interest is fully contained in at least one patch, while maintaining a uniform spatial distribution of patches over the full extent of an image. The patch dimensions were chosen to match the inputs commonly used in aerial object detection benchmarks such as DOTA and HRSC2016, allowing inter-experiment comparison.
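A minimal sketch of this sliding-window cropping is given below, assuming a NumPy image array, a hypothetical 1024-pixel patch size (the exact value used is not stated here), and a stride derived from the 10% overlap.

```python
import numpy as np

def crop_patches(image: np.ndarray, patch: int = 1024, overlap: float = 0.10):
    """Slide a window over the image and yield (x, y, crop) tuples.
    Neighbouring crops share `overlap * patch` pixels; the last row/column
    is clamped to the image border so no region is dropped."""
    stride = max(1, int(patch * (1.0 - overlap)))
    h, w = image.shape[:2]
    ys = sorted(set(list(range(0, max(h - patch, 0) + 1, stride)) + [max(h - patch, 0)]))
    xs = sorted(set(list(range(0, max(w - patch, 0) + 1, stride)) + [max(w - patch, 0)]))
    for y in ys:
        for x in xs:
            yield x, y, image[y:y + patch, x:x + patch]

# Example (hypothetical): patches = list(crop_patches(np.zeros((4000, 4000, 3), dtype=np.uint8)))
```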

As is conventional in data-driven experiments, the dataset is divided into three parts: 70% for training, 20% for validation, and 10% for testing, with the testing set held out completely to ensure unbiased performance evaluation. Examples of the augmentations are shown in Fig. 12, illustrating the diversity introduced into the training pipeline.

Fig. 12. Image augmentation (a) Rotation (b) Zooming.

Model training & validation

In this work, the proposed model is a CNN- and transformer-based encoder-decoder architecture that frames detection as a direct set prediction task.

The model uses a pre-trained ResNet-50 as a backbone to extract spatial and semantic features from input images. These features are then processed by a transformer encoder, which captures global contextual relationships across the entire image via self-attention mechanisms. The decoder component interprets learned object queries, attending to different spatial regions to simultaneously predict class labels and bounding box coordinates, supporting both HBB and OBB outputs. The prediction heads include a classification branch that assigns objects to one of 15 categories, and a regression head that outputs orientation-aware bounding boxes. Unlike traditional detectors, CATR does not require Non-Maximum Suppression (NMS), as its set prediction strategy inherently handles duplicate detections.
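To make the component description concrete, the following is a minimal, hedged PyTorch skeleton of a CATR-style detector under stated assumptions (a ResNet-50 backbone, a standard nn.Transformer, 100 object queries, 15 classes, and a 5-parameter oriented-box head); it illustrates the data flow rather than reproducing the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision

class CATRSketch(nn.Module):
    """Illustrative CATR-style detector: CNN backbone + transformer encoder-decoder
    + set-prediction heads for class labels and oriented boxes (cx, cy, w, h, angle)."""
    def __init__(self, num_classes=15, num_queries=100, d_model=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights='IMAGENET1K_V1')
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # C5 feature map, 2048 ch
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)      # project to transformer width
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)          # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)          # +1 for "no object"
        self.box_head = nn.Linear(d_model, 5)                          # (cx, cy, w, h, angle)

    def forward(self, images):
        # Positional encodings are omitted here for brevity.
        feats = self.input_proj(self.backbone(images))                 # (B, d, H/32, W/32)
        src = feats.flatten(2).permute(0, 2, 1)                        # (B, HW, d) token sequence
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, queries)                            # one output vector per query
        return {'pred_logits': self.class_head(hs),
                'pred_boxes': self.box_head(hs).sigmoid()}             # normalized box parameters

# Example: out = CATRSketch()(torch.randn(1, 3, 512, 512))  ->  shapes (1, 100, 16) and (1, 100, 5)
```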

In addition to the discussion of CATR above, Fig. 13 displays the data flow and model architecture. The architecture comprises a convolutional backbone that localizes spatial features and a transformer encoder-decoder module that reasons globally. The multi-scale features extracted by the CNN backbone are passed to the transformer layers, which encode long-range dependencies across the entire image. The final feature representation is produced by a fusion module that combines local (CNN) and global (transformer) representations through adaptive, attention-based weighting. The detection head then predicts the object class and bounding box directly from this combined representation. The figure is intended to illustrate the modular behavior of CATR and how the different features boost accuracy on object detection problems involving high-resolution aerial images, in contrast to typical hybridization approaches.

Fig. 13. Overall architecture of the proposed CNN-augmented transformer (CATR) model for object detection in aerial imagery.

To demonstrate that the CNN-Augmented Transformer (CATR) design is a true hybrid rather than simply a collection of modules, the fusion strategy is described in detail. CATR applies a two-level fusion approach that builds on both local and global feature representations. The CNN backbone first extracts fine-grained spatial and texture feature maps, which are then linearly projected into the transformer encoder-decoder for global contextual reasoning. Fusion effectively occurs at both the feature-map level (local features) and the attention level (global features): the initial CNN features provide local structure and spatial information, while the transformer’s global attention refines and augments the local features to capture long-range semantic relationships throughout the image. This bi-directional exchange allows CATR to preserve local precision while maintaining global understanding, improving its robustness in detecting small, overlapping objects with varying orientations in complex aerial images. Figure 14 illustrates the overall hybrid fusion process and the adaptive communication between the CNN and transformer components in the proposed fusion layers.
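As a hedged sketch of such an adaptive, attention-weighted fusion (the exact formulation is not spelled out above, so the gating design below is an assumption), local CNN features and globally attended transformer features of the same spatial size can be combined as follows.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Illustrative local-global fusion: a learned per-location gate decides how much
    of the CNN (local) and transformer (global) feature to keep at each position."""
    def __init__(self, channels=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),                      # gate in (0, 1) per spatial location
        )

    def forward(self, local_feat, global_feat):
        # local_feat, global_feat: (B, C, H, W) maps from the CNN branch and
        # the (reshaped) transformer encoder output, respectively.
        g = self.gate(torch.cat([local_feat, global_feat], dim=1))
        return g * local_feat + (1.0 - g) * global_feat

# Example: fused = AdaptiveFusion()(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
```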

Fig. 14. Overall fusion mechanism of the proposed CATR architecture.

The proposed and baseline models are trained on the 70% training split. During experiments, our models are trained and optimized for 190 epochs with a batch size of 44, balancing computational efficiency and learning stability. The AdamW optimizer was used with decoupled weight decay to enhance generalization. A learning rate of 1e-4 was set for the transformer layers, while the backbone was trained with a lower rate of 1e-5 to preserve pre-trained knowledge. The overall loss function comprised three components: Hungarian matching for optimal object-to-prediction assignment, cross-entropy loss for classification, and a combination of L1 and generalized IoU losses for bounding box regression. Dropout was used as a regularization strategy within the transformer layers, and early stopping was employed based on validation performance to prevent overfitting.
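The optimizer configuration described above can be sketched as follows; this is a minimal example assuming backbone parameters can be identified by name, and the weight-decay value and loss weights shown are placeholders rather than values reported in this study.

```python
import torch

def build_optimizer(model, lr_transformer=1e-4, lr_backbone=1e-5, weight_decay=1e-4):
    """AdamW with decoupled weight decay and a lower learning rate for the pretrained backbone."""
    backbone_params = [p for n, p in model.named_parameters()
                       if 'backbone' in n and p.requires_grad]
    other_params = [p for n, p in model.named_parameters()
                    if 'backbone' not in n and p.requires_grad]
    return torch.optim.AdamW(
        [{'params': other_params, 'lr': lr_transformer},
         {'params': backbone_params, 'lr': lr_backbone}],
        weight_decay=weight_decay,
    )

# Total loss per matched prediction (weights are illustrative placeholders):
#   loss = w_ce * cross_entropy + w_l1 * |box - gt|_1 + w_giou * (1 - GIoU)
# with predictions matched to ground-truth objects via Hungarian assignment.
```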

The training was conducted on an NVIDIA RTX 3090 GPU, with an epoch time of approximately 45 minutes. Convergence on the DOTA training set was consistent: the training and validation losses decreased steadily and converged after roughly 92 epochs. The Hungarian-matched loss, combining classification and bounding-box localization terms, allowed us to optimize both positional accuracy and classification prediction. The uniform decline of the training loss over 190 epochs, shown in Fig. 15, also indicates stable convergence of CATR on the DOTA dataset.

Fig. 15. Training loss curve of the CATR model over 190 epochs showing stable convergence.

Results & discussion

The results obtained by the proposed model, and the comparison with the baseline methods, show the improved performance of the CATR model across the board. CATR shows promising results in terms of both accuracy and speed. It successfully predicted both Horizontal Bounding Boxes (HBB) and Oriented Bounding Boxes (OBB) for aerial images with a high level of precision by combining the advantages of both CNNs and transformers. The model achieved:

  • Accurate detection for closely packed and small objects with the help of the CNN model.

  • Strong predictions of complex scenes with occlusion and rotated targets.

  • Ability to suppress background appearance and clutter through the global attention of the transformers.

Images showing predicted bounding boxes over test images can be seen in Figs. 16 and 17. Notably, CATR is able to accurately localize both small and rotated objects.

Fig. 16. Obtained results from CATR.

Fig. 17. Obtained results from CATR.

Table 1 below compares the proposed CATR model with several state-of-the-art object detectors in terms of model size, computational complexity, accuracy, and inference speed:

Table 1.

Performance comparison of different object detection models on aerial imagery.

Model Params (M) FLOPs (G) mAP@0.5 mAP@0.5:0.95 FPS
CATR (Ours) 41.2 86.3 72.5 52.8 12
DETR (Baseline) 41.3 121.5 69.0 49.5 15
YOLOv10 38.7 40.0 65.0 46.5 15
YOLOv9 33.8 27.5 63.0 44.0 18
YOLOv8 31.5 25.0 61.5 42.8 22
YOLOv5 27.1 17.1 60.2 40.0 20
Faster R-CNN 42.0 88.0 64.0 43.0 7
RetinaNet 36.5 60.0 61.0 40.5 12
SSD 26.3 14.5 54.0 34.0 25

As given in Table 1, the computational cost of the CATR model is relatively high in terms of the number of operations per forward pass and memory usage. Quadratic attention is the primary driver of this cost compared with CNN-based approaches, but CATR has been demonstrated to outperform CNNs when dealing with complex spatial relationships in cluttered aerial scenes.

Apart from computational complexity, the performance of CATR is shown in Fig. 18.

Fig. 18. Comparison of methods on the basis of mAP score.

Also, based on the table and performances above, the observations for each of the models are tabulated in Table 2.

Table 2.

Performance summarization for each model.

Model Remarks
CATR (Ours) Best in cluttered aerial scenes and occlusion
YOLOv10 Tunable and robust, but limited performance in DOTA
YOLOv9 Experimental with moderate aerial detection accuracy
YOLOv8 Struggles with small/rotated objects
YOLOv5 High speed; relatively lower accuracy in aerial views
Faster R-CNN Accurate for general tasks, but slower in aerial imagery
RetinaNet Performs well in class imbalance but weaker on dense scenes
SSD Fastest model; weak on small/rotated and densely packed objects

Cross-dataset evaluation and generalization study

To further examine the robustness and generalization ability of the proposed CNN-Augmented Transformer (CATR) architecture, we extended our experiments from the DOTA dataset to three well-established benchmark remote sensing datasets: UCAS-AOD, HRSC2016, and NWPU VHR-10. These datasets cover different aerial detection domains with different object types, shapes, scales, and scene complexities. Cross-dataset evaluation provides a means to verify that the model generalizes well under different environments and object geometries, confirming the scalability and transferability of CATR across a range of real-world remote sensing scenarios.

HRSC2016 dataset (Ship detection)

The HRSC2016 dataset is one of the most challenging datasets for the detection of high-aspect-ratio, elongated objects, comprising 1,061 images with 2,970 annotated ship targets. Ships in HRSC2016 often appear against complicated marine backgrounds, frequently at low contrast and in varying orientations.

The CATR model was trained for 120 epochs on HRSC2016 using data augmentation techniques such as random rotation, flipping, and scaling to improve robustness to rotation. The CNN backbone captured distinct object boundaries and high-frequency edges in the image, while the transformer module provided scene-level reasoning and differentiation between the ships and the dynamic ocean background.

The detection outputs shown in Fig. 19 depict the successful detection of ships at different scales and orientations. The training and validation loss curves in Fig. 20 exhibit consistent convergence and stabilization across the epochs.

Fig. 19. CATR ship detection results on HRSC2016.

Fig. 20. Training and validation loss curves for HRSC2016.

Table 3 indicates that the model achieved a Precision of 94.1%, Recall of 90.7%, and mAP@0.5 of 92.9%, which is superior to traditional detectors. The significance of these results lies in the model’s ability to balance detection accuracy with robustness, even in difficult maritime environments where confusing shadows, reflections, and a wide range of ship sizes often undermine reliability.

Table 3.

Cross-dataset quantitative results of CATR.

Dataset Precision (%) Recall (%) F1-Score mAP@0.5 Remarks
UCAS-AOD 93.6 91.2 92.3 92.4 Excellent rotation and scale robustness.
HRSC2016 94.1 90.7 92.4 92.9 Accurate elongated ship detection.
NWPU VHR-10 91.5 89.8 90.6 90.6 Robust small-object detection in urban imagery.

UCAS-AOD dataset (Aircraft and vehicle detection)

The UCAS-AOD dataset mainly contains imagery for aircraft and vehicle detection in aerial scenes, with substantial intra-class variance and considerable orientation variation. The dataset consists of about 1,510 images and 14,000 annotated objects, and serves as a suitable benchmark for evaluating the model’s rotation and scale invariance.

In this task, CATR was trained for 120 epochs using the same hyperparameters as for DOTA. The CNN backbone was effective at recognizing fine-grained geometric features, while the transformer encoder-decoder captured long-range relational information in feature space. The combination allowed aircraft and vehicles to be detected even when partially occluded or rotated at arbitrary orientations.

In Figs. 21 and 22, we present qualitative detection results and the training and validation loss plots, respectively. The loss decreased smoothly and stably over the 120 training epochs.

Fig. 21. Detection results on UCAS-AOD.

Fig. 22. Training and validation loss curves for UCAS-AOD.

The results in Table 3 indicate that CATR achieved a Precision of 93.6%, a Recall of 91.2%, and an mAP@0.5 of 92.4%, outperforming standard CNN-based detectors. The improvement of CATR over the CNN baselines can be principally attributed to the hybrid fusion design, which simultaneously leverages fine-grained local texture responses from the CNN and higher-level contextual reasoning from the transformer. This hybrid fusion enables better detection of elongated, rotated objects in dense, complex aerial environments.

NWPU VHR-10 dataset (Small object detection)

The NWPU VHR-10 dataset consists of 800 high-resolution images (500 for training and 300 for testing) targeting small-object detection in highly cluttered urban environments, and includes approximately 3,650 annotated objects across classes such as airplanes, storage tanks, vehicles, and ships. This dataset is useful for testing a detector’s performance when objects appear in densely cluttered scenes.

In this experiment, CATR was fine-tuned via transfer learning from the DOTA-pretrained weights over 120 epochs. A strong set of data augmentations was applied to the training data, including random cropping, random scaling, and photometric distortion, which aided small-object detection. The CNN backbone preserved fine-grained spatial features, while the transformer module established relationships between local cues, which is critical for small-object detection in heavily cluttered imagery.
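This transfer-learning step can be sketched as below, assuming a hypothetical DOTA checkpoint file and reusing the illustrative CATRSketch and build_optimizer sketches from earlier; strict=False tolerates head layers whose class counts differ between the two datasets.

```python
import torch

model = CATRSketch(num_classes=10)                       # NWPU VHR-10 has 10 classes
checkpoint = torch.load('catr_dota_pretrained.pth',      # hypothetical DOTA-pretrained weights
                        map_location='cpu')
missing, unexpected = model.load_state_dict(checkpoint, strict=False)
print(f'Skipped {len(missing) + len(unexpected)} mismatched tensors; backbone and transformer reused.')

optimizer = build_optimizer(model)                       # lower LR for the pretrained backbone
# ...then fine-tune for 120 epochs on NWPU VHR-10 as described above.
```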

Figure 23 shows accurate small-object detections, while the training and validation loss curves in Fig. 24 show that the optimization process was smooth and the model learned features effectively.

Fig. 23. Detection results on NWPU VHR-10.

Fig. 24. Training and validation loss curves for NWPU VHR-10.

In quantitative terms, CATR achieved a Precision of 91.5%, Recall of 89.8%, and mAP@0.5 of 90.6%, reflecting strong performance on the small-object and dense-cluster challenges presented by this dataset (see Table 3). The results support the proposed fusion process, which combines local-detail sensitivity with globally aware relational reasoning.

Comprehensive Insights

The cross-dataset experimental analysis confirms that CATR generalizes well and adapts to the varying spatial and semantic complexities present in the remote sensing datasets. The model maintains high precision and recall across domains, exhibiting scalability and stability without major architectural changes or hyperparameter alterations.

More specifically:

  • The CNN backbone provides reliable extraction of local detail, which is essential for small-object and edge-rich detection.

  • The transformer encoder-decoder provides both global reasoning and contextual awareness to promote detections in cluttered or occluded scenes.

  • The fusion mechanism provides complementary reasoning to amplify detection accuracy while maintaining inference efficiency.

Together, these observations show that CATR is not domain-specific: it delivers consistent performance across heterogeneous remote-sensing datasets and can serve as a general architecture for aerial surveillance, maritime monitoring, and urban scene analysis, where performance is needed across varying spatial and semantic complexities. This multi-domain validation reinforces the design rationale for CATR and distinguishes it as a scalable, domain-agnostic framework for state-of-the-art aerial object detection.

Performance benchmark and robustness validation

To strengthen the experimental verification, additional experiments were performed to assess the robustness, scalability, and adaptability of the proposed CNN-Transformer hybrid architecture (CATR) under different operating conditions. The model was tested against common aerial imagery distortions, such as illumination change, Gaussian noise, motion blur, and partial occlusion, which occur in many operational remote sensing settings.

In this extended evaluation, the model was validated for 120 epochs across multiple datasets while keeping the optimization parameters and augmentation strategies the same to ensure a fair evaluation. Table 4 summarizes the results. Under all degrading conditions, CATR showed consistently high precision, recall, and mAP@0.5, with only minor deviations from baseline performance. This demonstrates that the model generalizes well and can withstand relatively degraded or noisy visual input.
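The four degradation conditions can be simulated with simple image-level perturbations; the sketch below, using OpenCV and NumPy with arbitrary severity parameters chosen for illustration, shows one plausible way to generate them for such a robustness test.

```python
import cv2
import numpy as np

def illumination(img, gain=0.6):
    """Darken the image to mimic low-light conditions."""
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def gaussian_noise(img, sigma=15):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    noise = np.random.normal(0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def motion_blur(img, ksize=9):
    """Horizontal motion blur via a 1-D averaging kernel."""
    kernel = np.zeros((ksize, ksize), np.float32)
    kernel[ksize // 2, :] = 1.0 / ksize
    return cv2.filter2D(img, -1, kernel)

def partial_occlusion(img, frac=0.15):
    """Black out a random rectangular region covering roughly `frac` of the image area."""
    h, w = img.shape[:2]
    oh, ow = int(h * np.sqrt(frac)), int(w * np.sqrt(frac))
    y, x = np.random.randint(0, h - oh), np.random.randint(0, w - ow)
    out = img.copy()
    out[y:y + oh, x:x + ow] = 0
    return out
```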

Table 4.

Robustness evaluation of CATR under varying conditions.

Condition Precision (%) Recall (%) mAP@0.5 (%) Remarks
Normal (baseline) 93.6 91.2 92.4 Base CATR performance.
Illumination variation 92.8 90.7 91.8 Slight drop under low-light conditions.
Gaussian noise 92.1 89.9 91.1 Demonstrates moderate noise resistance.
Motion blur 91.5 89.2 90.6 Slight degradation, still consistent performance.
Partial occlusion 92.4 90.3 91.6 Maintains robustness under partial occlusion.
Average 92.5 90.3 91.5 Overall robust performance across varying conditions.

In addition, the qualitative results presented in Fig. 25 underscore this robustness, demonstrating how CATR maintains detection confidence and object localization in the presence of significant image distortions. The combination of global attention from the transformer and local spatial encoding from the CNN allows the model to remain sensitive to small, rotated, and occluded objects. This consistent performance across varying conditions underlines CATR’s reliability and suitability for practical detection applications in aerial surveillance and remote sensing.

Fig. 25. Visual demonstration of CATR’s robustness under illumination variation, Gaussian noise, motion blur, and partial occlusion conditions.

Ablation study and core module validation

To thoroughly evaluate the efficacy and necessity of the proposed hybrid CNN-Transformer architecture (CATR), various ablation experiments were undertaken. The objective of these experiments is to systematically quantify the contribution of each main component (the CNN backbone, the Transformer encoder-decoder, and the CNN-Transformer fusion) to the overall detection performance on high-resolution aerial imagery. This evaluation is highly relevant for showing that every module in CATR is necessary and that the model is not simply a sequential concatenation of CNN and Transformer modules.

We developed four controlled variants of the model to demonstrate the effect of each component on the baseline model as follows:

  • CNN-only backbone—This variant relies on only the convolutional layers and not the Transformer module to extract hierarchical features. The CNN-only variant relies solely on local feature representations for object detection.

  • Transformer-only—This variant excludes the CNN backbone and relies solely on the Transformer encoder-decoder with patch embeddings to determine global spatial relationships across the entire image.

  • CNN + Transformer (without fusion)—Both CNN and Transformer modules are present in this variant, but the proposed local-global fusion mechanism is excluded.

  • Full CATR (Proposed)—The complete hybrid model, including the CNN backbone, the Transformer, and the fusion mechanism. It utilizes both local feature extraction and global attention to optimize performance on high-resolution aerial object detection tasks.

The results of these ablation experiments are summarized in Table 5:

Table 5.

Ablation study and core module validation of CATR.

Model variant Params (M) FLOPs (G) mAP@0.5 mAP@0.5:0.95 FPS Remarks
CNN-only 32.1 45.8 63.2 46.1 48 Excels in small-object detection; limited global context modeling.
Transformer-only 38.5 80.4 66.8 48.7 32 Captures long-range dependencies; struggles with fine-grained local details.
CNN + Transformer without fusion 40.2 83.1 69.3 51.0 30 Moderate improvement; absence of fusion reduces robustness in dense and occluded scenes.
Full CATR (Ours) 41.2 86.3 72.5 52.8 28 Highest overall performance; fusion mechanism effectively integrates local and global features for superior detection in complex aerial imagery.

Detailed analysis

  • The CNN backbone dramatically improves the detection of small, tightly grouped, and closely positioned objects, which are often the greatest challenge in high-resolution remote sensing imagery. The hierarchical convolutional layers of the backbone preserve the spatial characteristics needed for localization and boundary estimation, an essential property for detection in dense urban areas and for small objects.

  • On the other hand, the transformer encoder-decoder provides a distinctive ability to learn long-range spatial dependencies. In effect, this gives the model an understanding of the entire image and enables it to reason about objects that are occluded, rotated, or cluttered. If the transformer module is omitted, the model cannot capture global context, and the lower mAP scores indicate reduced detection quality in complicated aerial scenes where multiple objects lie close together or overlap.

  • Furthermore, the fusion mechanism is the key innovation that combines local features from the CNN backbone with global attention from the transformer, allowing the model to exploit both detailed spatial information and contextual reasoning. This choice is validated by the ablation studies, which show that removing the fusion module significantly degrades detection performance; the fusion component is therefore an important determinant of detection quality.

Finally, the full CATR model delivers the best mAP scores without unduly sacrificing inference speed (the cost-versus-accuracy trade-off). This confirms that the CATR architecture is intentionally modular, with the CNN backbone, transformer layers, and fusion mechanism each contributing positively to the overall performance of the design.

Comprehensive Insights

Taken together, these ablation studies confirm that CATR is much more than a simple integration of CNN and Transformer modules. Each component is necessary for stable performance on high-resolution aerial imagery under difficult conditions such as occlusion, rotation, densely packed objects, and cluttered backgrounds. These experiments provide strong evidence for the design decisions behind CATR and reinforce its practical significance for state-of-the-art remote sensing object detection. Further, the assessment shows that the proposed fusion approach is not merely an additive feature but an essential means of synergistically improving the overall detection capability of the hybrid architecture.

Key takeaways & challenges

Although we have made great strides in the field of object detection frameworks, we found some key limitations while working with the models:

  • Convolutional Neural Networks (CNNs) allow efficient processing for detection but fail at global spatial modeling. Similarly, Transformers can model spatial dependencies between detections, but they are computationally expensive and less effective for small objects.

  • In complex aerial scenarios, CATR performs better than most CNN-based models and DETR, with improved robustness to occlusion, overlapping objects, and orientation variability. Although it has a longer inference time, CATR is well suited to applications that prioritize detection quality, such as aerial surveillance, remote sensing, and urban planning.

  • Although CATR revealed encouraging outcomes, several challenges were identified, including minor precision loss when detecting very small or blurry objects, an increased number of false positives due to background patterns resembling object classes, and greater errors in orientation prediction in Horizontal Bounding Box (HBB) annotations compared to Oriented Bounding Box (OBB) annotations.

  • The computational cost of the Transformer component is quite high in terms of the number of operations per forward pass and memory usage. Quadratic attention is the primary driver of this cost compared with CNN-based approaches, but CATR has been demonstrated to outperform CNNs and DETR when dealing with complex spatial relationships in cluttered aerial scenes.

  • The choice of bounding-box representation (HBB vs. OBB) must suit end users, but it also affects detection performance in rotated aerial scenes.

Conclusion

In this study, we proposed a novel hybrid model called Convolution-Augmented Transformer (CATR), which integrates the local feature extraction capabilities of Convolutional Neural Networks (CNNs) with the global context modeling power of Detection Transformers (DETR). This approach was motivated by the observed limitations of standalone DETR in handling small, blurry, or overlapping objects in high-resolution aerial imagery–particularly in complex scenes with dense object populations and occlusions. While DETR showed promising performance, challenges such as precision loss on small or obscure targets, increased false positives from background clutter, and orientation errors in HBB annotations persisted. To address these issues, CATR leverages CNN backbones to capture fine-grained local textures and combines them with DETR’s self-attention mechanism to model long-range dependencies, enabling more accurate object separation and orientation-aware detection.

Trained and fine-tuned using the DOTA dataset for over 190 epochs, CATR significantly outperformed traditional CNN-based detectors including YOLOv5, YOLOv8, and YOLOv10 in terms of mean Average Precision (mAP), particularly for cluttered and oriented object detection tasks. The transformer component’s ability to learn global relationships greatly benefited detection in densely populated scenes, reducing the dependency on non-maximum suppression (NMS) and anchor boxes. While this architecture incurs a higher computational cost and training time, it demonstrates superior generalization and robustness in remote sensing and geospatial applications.

Looking forward, several avenues are identified for enhancing the performance and efficiency of CATR. For small object detection, future work may explore better positional encoding strategies, super-resolution pre-processing, and deeper CNN-transformer hybrids. To improve deployment feasibility, techniques such as model pruning, quantization, and knowledge distillation could make CATR suitable for edge devices. Experimentation with advanced learning rate schedules, loss functions (e.g., GIoU, DIoU, Focal Loss), and pretraining strategies may also facilitate faster convergence. Multi-modal integration–leveraging multispectral, thermal, or LiDAR data–and 3D aerial scene understanding could further enrich the detection pipeline. Additionally, self-supervised and active learning methods hold promise for automated dataset expansion, reducing annotation costs and increasing scalability.
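
For reference, the GIoU term mentioned above is straightforward for axis-aligned (HBB) boxes, as in the short sketch below; extending it to oriented boxes (OBB) requires rotated-polygon intersection, which is part of what such future work would entail.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing box: GIoU penalizes the empty space inside it, so the
    # loss (1 - GIoU) still provides a gradient when boxes do not overlap at all.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose if enclose > 0 else iou

print(giou((0, 0, 2, 2), (3, 3, 5, 5)))   # ~ -0.68 for disjoint boxes
```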

In conclusion, CATR sets a promising direction for robust and intelligent geospatial AI systems. It not only bridges the gap between local detail extraction and global reasoning but also simplifies the detection pipeline while achieving state-of-the-art performance. Future optimizations will aim to balance accuracy with computational efficiency, paving the way for real-time, edge-compatible applications in surveillance, environmental monitoring, and disaster response.

Author contributions

Sahibzada Jawad Hadi, Irfan Ahmed, Abid Iqbal, and Ali S. Alzahrani contributed equally to this work.

Funding

This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia Grant No. KFU253697.

Data availability

Data are available on request to the corresponding author.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Irfan Ahmed, Email: irfanahmed@uetpeshawar.edu.pk.

Abid Iqbal, Email: aaiqbal@kfu.edu.sa.
