Abstract
Small object detection in remote sensing imagery remains challenging due to complex backgrounds, frequent occlusions, and dense distributions of objects, which often lead to suboptimal performance with existing models. To address these issues, this paper proposes a novel dynamic element-activated non-semantic sparse attention method for detecting small objects in remote sensing images. First, we introduce a non-semantic sparse attention mechanism that computes self-attention within local patches, enhancing the model’s focus on textures and edges while improving its perception of occluded small objects and local complex variations. Subsequently, a dynamic element-activated cross-layer channel attention mechanism is incorporated to adaptively strengthen cross-layer positional awareness, thereby enhancing the representational capacity of small-object features against cluttered backgrounds. Finally, a diffusion wavelet convolutional structure is employed to process multi-channel features in parallel, mitigating information loss and capturing critical features of densely distributed small objects under boundary ambiguity. Extensive ablation studies and comparisons with state-of-the-art methods on the VisDrone and AI-TODv2 datasets demonstrate the feasibility and effectiveness of our approach, showing its potential to provide technical support for practical applications in remote sensing small object detection.
Keywords: Small object detection, Attention mechanism, Remote sensing imagery, Dynamic activation
Subject terms: Engineering, Mathematics and computing
Introduction
Remote sensing small object detection constitutes a pivotal research thrust within the computer vision domain, assuming indispensable roles across military, agricultural, maritime, and urban planning applications. This task entails identifying diminutive objects in aerial imagery, yet remains substantially challenged by intricate backgrounds, pervasive inter-object occlusions, and densely packed object distributions. Conventional detection paradigms exhibit insufficient robustness and transferability, rendering them inadequate for complex operational requirements. While deep learning advancements have markedly elevated object detection performance in generic scenarios, the distinctive imaging geometry and heightened scene complexity of remote sensing data introduce far more formidable challenges than those encountered with natural imagery, compelling the development of specialized detection architectures.
Notwithstanding considerable maturation in object detection methodologies, contemporary approaches remain inadequate in confronting the compounded difficulties of cluttered environments, frequent occlusions, and high-density configurations characteristic of remote sensing small objects, yielding marginal performance improvements. Recent investigations have pursued divergent technical avenues to address these challenges. One research strand emphasizes feature enhancement and reconstruction; for instance, Li et al.1 and Wang et al.2 leveraged edge-aware guidance and multi-scale fusion mechanisms, respectively, to augment feature representations. Nevertheless, when handling occluded objects in congested scenarios, such methodologies prove vulnerable to contamination from adjacent objects, revealing fundamental robustness limitations. Alternative research directions exploit global contextual and spectral-domain information, exemplified by Wang et al.’s MLP fusion network for holistic semantic extraction3 and Zhu et al.’s frequency-domain transformations for background suppression4. While demonstrating efficacy for sparsely distributed isolated objects, these techniques encounter diminished modeling efficacy and spectral separability when confronting densely clustered and overlapping objects. Furthermore, although Shi et al. attempted to integrate contextual modeling with cross-dimensional interactions5, their framework provided limited mitigation for inherent static occlusion phenomena in monocular imagery. Collectively, the research community still lacks a unified framework capable of synergistically addressing feature ambiguity, information degradation, and mutual interference arising from complex background clutter, inter-object occlusion, and dense spatial configurations.
Consequently, while existing research has achieved progress along various technical pathways, existing designs predominantly address isolated challenges rather than the interconnected problems of complex backgrounds, object occlusion, and dense distributions. Three fundamental limitations persist: first, the absence of an attention mechanism that effectively focuses on non-semantic granular details such as the textures and edges of minuscule objects while resisting complex background interference; second, the lack of adaptive structures during feature propagation that dynamically reinforce and recover faint features from occluded or submerged objects; and third, the deficiency in models that concurrently preserve and refine critical features under conditions of extreme object density and boundary ambiguity without information aliasing. To address these gaps, we propose a sparse and dynamic attention-driven feature processing mechanism that achieves robust characterization of remote sensing small objects from local details to global contexts.
This paper aims to resolve the aforementioned challenges through an innovative approach based on dynamic element activation and non-semantic sparse attention. Our contributions are threefold:
We design a non-semantic sparse attention module that computes self-attention within localized patches, directing the model’s focus toward non-semantic details such as textures and edges. This significantly enhances the perception capability for small objects and local complex variations, effectively addressing object occlusion issues.
We introduce a dynamic element-activated cross-layer channel attention mechanism that adaptively amplifies faint small object features in feature maps. This architecture enhances cross-layer positional awareness during feature interaction, thereby specifically improving their representational capacity within complex backgrounds.
To handle object density in remote sensing data, we implement a diffusion wavelet convolutional structure that processes multi-channel features in parallel. This approach minimizes information loss while effectively capturing critical features of densely distributed small objects with blurred boundaries, enabling efficient detection of remote sensing small objects.
Comprehensive ablation studies and comparative experiments conducted on the VisDrone and AI-TODv2 benchmark datasets demonstrate the superiority, feasibility, and effectiveness of our method for remote sensing small object detection tasks, potentially providing reliable technical support for practical applications.
Related works
Deep learning-based small object detection in remote sensing
The detection of small objects (e.g., vehicles, vessels) in remote sensing imagery represents a particularly challenging research frontier. Compared to medium and large-scale objects, small objects inherently suffer from limited feature representation capacity due to their minimal pixel coverage. This fundamental constraint is further exacerbated by frequent inter-object occlusions and densely clustered distributions, which intensify feature confusion and information degradation. Consequently, these challenges manifest as persistently high rates of missed detections and false alarms in current detection paradigms.
Contemporary research primarily evolves along two pivotal trajectories: feature enhancement and representation optimization. Specifically, a series of studies have focused on architectural refinements to improve the extraction and utilization efficiency of small object features. For instance, Liu et al.6 designed heterogeneous filters to amplify saliency in object regions, thereby enhancing detection performance for small objects. Shi et al.5 jointly optimized contextual modeling of high-level features and detail refinement of low-level features, incorporating cross-dimensional interactive attention to augment perceptual capability. For sequential imagery, Li et al.7 proposed a trajectory encoding enhancement module that leverages temporal information to suppress background clutter while reinforcing object features.
In the domain of feature reconstruction, Hui et al.8,9 employed dense residual-based super-resolution modules to recover detail losses incurred by downsampling operations. Li et al.1 developed an edge-guided perception network that strengthens edge and semantic information through multi-scale progressive fusion. Wang et al.3 constructed a wide-area gated comprehensive fusion network designed to integrate more discriminative multi-scale contextual features throughout encoding and decoding paths.
Collectively, the aforementioned methods still exhibit insufficient capability in preserving and enhancing critical features under extreme scenarios such as severe occlusion and high density of targets. In response, researchers have begun exploring more direct attention-guided mechanisms, with the objective of enabling models to adaptively focus on key regions and channels relevant to small targets.
Attention mechanisms
The application of attention in vision originated from emulating human “active vision” processes. The Transformer architecture and its self-attention mechanism, introduced by Vaswani et al.10, revolutionized technical paradigms through their potent global contextual modeling capabilities. Building upon this foundation, channel attention (e.g., SENet11) and spatial attention (e.g., CBAM12) enhance feature representation by emphasizing the importance of specific feature channels or spatial locations. Coordinate Attention13 effectively captures precise spatial relationships by embedding positional information into channel attention, delivering substantial improvements for position-sensitive tasks like object detection. Concurrently, BoTNet14 achieved remarkable performance across multiple tasks by replacing spatial convolutions with global self-attention.
Recent years have witnessed further expansion of attention’s theoretical boundaries and application forms. State Space Models (SSMs), epitomized by Mamba15, have been shown to share profound theoretical duality with attention mechanisms and achieve similar “focusing” functionality through selective gating. Its vision variant, Vision Mamba16, demonstrates potential for linear-complexity image processing, offering new directions for constructing efficient large-scale vision models. Subsequently, researchers have devised more sophisticated attention architectures to address challenges including multi-scale variation, occlusion, and complex scenes. For instance, Fair-DETR17 employs adaptive multi-scale attention to balance detection capability across differently sized objects, while studies such as AIN-YOLO18 leverage attention to reinforce critical features and suppress noise in challenging environments like underwater and infrared imagery.
In summary, attention mechanisms have matured into a diverse and powerful technical arsenal. Nevertheless, direct application of generic attention modules to remote sensing small object detection remains substantially challenged by the inherent limitations of small objects—their constrained feature availability and susceptibility to information degradation in convolutional networks. Consequently, developing specialized attention mechanisms capable of delicately enhancing local small object features while resisting background interference emerges as particularly crucial.
Feature activation functions
In convolutional neural networks, feature activation functions introduce non-linear operations, thereby addressing the limitations inherent in linear transformations. In 2011, Glorot et al.19 first proposed ReLU for deep learning applications. The subsequent successful implementation of ReLU in AlexNet by Krizhevsky et al.20 in 2012 established this activation function as a standard component. ReLU offers significant advantages including mitigation of vanishing gradients, computational efficiency, and accelerated convergence. Nevertheless, it produces zero gradients for negative inputs, resulting in complete cessation of learning for such activations. To resolve this limitation, Xu et al.21 introduced Leaky ReLU in 2015, incorporating a small slope in the negative domain to ensure gradient flow for all inputs. In 2016, Clevert et al.22 proposed the Exponential Linear Unit (ELU), which smoothly asymptotes to a negative saturation value $-\alpha$ in its negative region. This formulation produces outputs closer to zero mean, potentially enabling faster convergence while maintaining gradient flow, though at the cost of increased computational overhead. The subsequent introduction of Swish by Ramachandran et al.23 in 2017 and Mish by Misra24 in 2020 demonstrated superior optimization properties and generalization capabilities. However, their enhanced performance comes with increased computational complexity, which may impact model inference speed in detection tasks.
Fundamentally, by addressing the non-linearity deficiency inherent in pure convolution, these functions—particularly those with smooth profiles—enable more precise gradient information and more stable optimization trajectories throughout the learning process.
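For reference, the activation functions discussed above can be written in a few lines of Python. This is a minimal sketch for illustration only: the negative slope of 0.01 for Leaky ReLU and the saturation parameter of 1.0 for ELU are common defaults assumed here, not values taken from the cited works.

```python
import math

def relu(x):
    # Zero output (and zero gradient) for negative inputs.
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Small assumed slope keeps a non-zero gradient for x < 0.
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    # Smoothly saturates towards -alpha as x becomes very negative.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def swish(x):
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def mish(x):
    # x * tanh(softplus(x))
    return x * math.tanh(math.log1p(math.exp(x)))

if __name__ == "__main__":
    for v in (-3.0, -0.5, 0.0, 2.0):
        print(v, relu(v), leaky_relu(v), elu(v), swish(v), mish(v))
```

The smooth functions (ELU, Swish, Mish) each require at least one exponential per element, which is the source of the additional computational cost noted above.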
Feature pyramid architectures
Feature pyramid architectures represent a cornerstone technique for addressing multi-scale challenges in object detection, particularly critical for small object detection. The seminal work by Lin et al.25 introduced the Feature Pyramid Network (FPN) in 2017, establishing a foundational paradigm through its top-down architecture with lateral connections. Building upon FPN, subsequent research has pursued refinements from diverse perspectives. Liu et al. proposed the Path Aggregation Network26, which shortens the informational pathway between low-level and high-level features through augmented bottom-up propagation, thereby strengthening localization precision. Addressing feature imbalance issues, Pang et al.27 developed the Balanced Feature Pyramid, employing strategic recombination and integration operations to equilibrate contributions from different hierarchical levels. To establish denser feature connectivity, Chen et al.28 devised a multi-path Feature Pyramid Grid (FPG) structure, while Yang et al.29 proposed the Asymptotic Feature Pyramid Network (AFPN), enabling direct interaction between non-adjacent levels through progressive fusion to mitigate substantial semantic gaps. Collectively, the evolution of feature pyramid architectures centers on optimizing the fusion efficiency of semantic and detailed information across hierarchical levels to advance small object feature representation.
Notwithstanding the enhanced detection performance for small targets achieved by feature fusion architectures epitomized by FPN, contemporary methodologies continue to demonstrate substantial limitations in extracting and preserving features of faint, occluded, and densely clustered small targets within complex backgrounds. To address these persistent challenges through more direct and effective mechanisms, this work introduces a novel detection framework operating from dual pivotal perspectives: feature enhancement and attention guidance.
Proposed method
This section elaborates on our proposed Dynamic Element-Activated Non-Semantic Sparse Attention framework for remote sensing small object detection. We provide comprehensive descriptions and implementation principles from four perspectives: the overall architecture, the Non-Semantic Sparse Attention (NSSA) mechanism, the Dynamic Element-Activated Cross-Layer Channel Attention mechanism, and the Diffusion Wavelet Convolutional feature preservation structure.
Overall architecture
The fundamental challenge in remote sensing object detection stems from diminutive object sizes and insufficient feature information, leading to suboptimal model performance. To address these limitations, we propose a novel detection framework integrating dynamic element activation with non-semantic sparse attention. Our methodology employs a three-stage enhancement strategy: First, the Non-Semantic Sparse Attention mechanism intensifies the model’s focus on textural patterns and edge features, effectively mitigating feature extraction difficulties arising from object occlusion. Subsequently, through Dynamic Tanh (DyT) nonlinear feature mapping, the Dynamic Element-Activated Cross-Layer Channel Attention mechanism strengthens positional awareness across network layers during feature interaction, thereby significantly improving feature discriminability in complex backgrounds. Finally, the Diffusion Wavelet Convolutional structure minimizes feature degradation of small objects while effectively capturing critical characteristics of densely distributed objects with ambiguous boundaries. The synergistic integration of these components substantially enhances detection performance for small objects in remote sensing imagery. The comprehensive network architecture is illustrated in Fig. 1.
Fig. 1.
Architectural overview of the proposed framework.
As illustrated in Fig. 1, the backbone network employs multiple parallel Convolution-BatchNorm-ReLU (CBR) modules to construct cross-layer feature interactions using multi-level feature maps extracted from the feature extraction network, thereby reducing computational complexity. The multi-scale feature maps generated by the CBR modules are subsequently fed into the Non-Semantic Sparse Attention (NSSA) mechanism, which enhances the model’s focus on textural and edge features while suppressing semantic information expression within sparse attention blocks, consequently improving recognition sensitivity for small-sized objects.
The multi-layer feature maps are then simultaneously processed through the Dynamic Element-Activated Cross-Layer Channel Attention mechanism with DyT nonlinear feature mapping. This structurally streamlined design ensures training stability while significantly enhancing the representational capacity for occluded small objects in feature maps. Finally, shallow features are directly routed to the detection mechanism, while the final layer features are processed through a CBR feature preservation structure incorporating Diffusion Wavelet Convolution. This integrated approach minimizes information degradation and effectively captures critical features of densely distributed small objects under boundary ambiguity, ultimately achieving efficient detection of remote sensing small objects.
Non-semantic sparse attention mechanism
To effectively extract non-semantic features of small objects, we implement a Non-Semantic Sparse Attention mechanism following multiple parallel CBR modules. This approach operates by partitioning feature maps into localized patches and computing self-attention within these patches, thereby enhancing the model’s focus on textural patterns and edge information. Concurrently, the module effectively suppresses the expression of semantic information within sparse attention blocks while adaptively extracting task-relevant non-semantic features. This dual mechanism significantly improves the model’s recognition sensitivity for small-scale objects. The architectural configuration of the Non-Semantic Sparse Attention mechanism is detailed in Fig. 2.
Fig. 2.
Schematic diagram of the non-semantic sparse attention mechanism.
As illustrated in Fig. 2, the input feature map is initially processed through two parallel branch networks. The first branch undergoes convolutional operations, average pooling, and Sigmoid activation to generate attention weights. The second branch partitions the original feature map into tensor blocks, where S represents the sparsity coefficient. Specifically, the feature map is decomposed into non-overlapping tensor blocks of size S × S. Self-attention is computed individually within these tensor blocks, where blocks marked with identical colors perform simultaneous self-attention computations.
This architecture effectively suppresses the expression of sparse semantic information in spatial semantic modules, enabling the model to concentrate more effectively on extracting non-semantic features. Subsequently, the recovered features are element-wise multiplied with corresponding elements from the prior branch and reshaped to comply with spatial dimensionality requirements. The mathematical implementation of the Non-Semantic Sparse Attention mechanism is formalized in Eqs. (1)–(4).
![]() |
1 |
![]() |
2 |
![]() |
3 |
![]() |
4 |
where AP denotes average pooling, Sparse(·) represents the sparse linear projections generating queries, keys, and values, CA-MHSA indicates the channel-wise multi-head self-attention mechanism, and Unsparse refers to the restoration of the original shape before sparsification. Because self-attention is restricted to individual tensor blocks, the Non-Semantic Sparse Attention mechanism avoids computing attention over the many non-critical key-value pairs that lie outside each block. This design achieves enhanced local representation capacity with minimal parameter overhead, enabling the model to concentrate more effectively on non-semantic information in images. Consequently, it alleviates feature extraction difficulties caused by object occlusion and significantly improves the model’s recognition sensitivity for small-scale objects.
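The two-branch computation described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: all function names are hypothetical, the tensor blocks are taken to be non-overlapping S × S windows, the attention is single-head, the convolution is reduced to a 1 × 1 channel mixing, and the average pooling is assumed to be global.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def partition(x, s):
    """(C, H, W) -> (num_blocks, s*s, C) using non-overlapping s x s blocks."""
    c, h, w = x.shape
    x = x.reshape(c, h // s, s, w // s, s)
    return x.transpose(1, 3, 2, 4, 0).reshape(-1, s * s, c)

def unpartition(p, s, h, w):
    """Inverse of partition: (num_blocks, s*s, C) -> (C, H, W)."""
    c = p.shape[-1]
    p = p.reshape(h // s, w // s, s, s, c)
    return p.transpose(4, 0, 2, 1, 3).reshape(c, h, w)

def local_patch_attention(x, s, w_q, w_k, w_v):
    """Self-attention computed independently inside every block."""
    c, h, w = x.shape
    tokens = partition(x, s)                       # (P, s*s, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))
    return unpartition(attn @ v, s, h, w)

def gate_branch(x, w_gate):
    """Assumed gate: 1x1 conv -> global average pooling -> sigmoid."""
    mixed = np.einsum("oc,chw->ohw", w_gate, x)
    pooled = mixed.mean(axis=(1, 2), keepdims=True)  # (C, 1, 1)
    return 1.0 / (1.0 + np.exp(-pooled))

def nssa_sketch(x, s, w_q, w_k, w_v, w_gate):
    """Element-wise product of the attention branch and the gate branch."""
    return local_patch_attention(x, s, w_q, w_k, w_v) * gate_branch(x, w_gate)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.normal(size=(4, 8, 8))
    mats = [0.1 * rng.normal(size=(4, 4)) for _ in range(4)]
    print(nssa_sketch(feat, 4, *mats).shape)
```

Because each block attends only to its own tokens, the attention matrices have size S² × S² per block instead of HW × HW for the whole map, which is where the saving in key-value pairs comes from.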
Dynamic element-activated cross-layer channel attention mechanism
Small object information in remote sensing imagery is susceptible to information degradation during normalization operations. To address this limitation, we build a dynamic element-activated cross-layer channel attention architecture on the DyT nonlinear feature mapping proposed by Zhu et al.30. This framework utilizes the DyT function to emulate the S-shaped mapping characteristics of LayerNorm, enabling direct element-wise computation while preserving nonlinear feature relationships. The architecture maintains structural simplicity and training stability.
However, conventional multi-head attention mechanisms treat all interaction groups indiscriminately, resulting in insufficient sensitivity to object positional information. To overcome this constraint, we introduce Cross-Layer Consistent Relative Positional Encoding (CCPE) to enhance cross-layer positional awareness during feature interaction. This targeted augmentation specifically strengthens feature representational capacity within complex backgrounds.
The DyT nonlinear feature mapping architecture is primarily constructed based on the Tanh function, with its computational formulation expressed in Eq. (5):
$$\mathrm{DyT}(x) = \gamma \cdot \tanh(\alpha x) + \beta \tag{5}$$
where $\alpha$ is a learnable dynamic scaling parameter that enables adaptive rescaling according to varying input magnitudes $x$. Both $\gamma$ and $\beta$ are learnable per-channel vector parameters, as in normalization layers, which allow the outputs to be rescaled to arbitrary magnitudes. Following the convention of normalization layers, we initialize $\gamma$ as an all-ones vector and $\beta$ as an all-zeros vector. Additionally, the scaling parameter $\alpha$ is initialized with a value of 0.5.
During the forward propagation process, DyT operates independently on each input element within the tensor without computing statistical information or other forms of aggregation. Consequently, the tanh-based dynamic element activation structure does not constitute a normalization layer. Nevertheless, it nonlinearly compresses extreme values while maintaining near-linear transformation for the central portion of inputs, thereby preserving the beneficial effects of normalization layers. In our experiments, as the initial value of the scaling parameter was gradually increased from 0, model performance improved consistently until the value reached 0.5, after which performance began to fluctuate. Therefore, the initial value is set to 0.5 in this paper. During the training phase, the initial learning rate is set to 0.002.
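The element-wise behaviour described above can be illustrated with a short NumPy sketch. It assumes the DyT form of Zhu et al.30, namely γ · tanh(αx) + β with a scalar α and per-channel γ and β; the function name and the toy input are hypothetical.

```python
import numpy as np

def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh: element-wise, no batch or layer statistics are computed.

    x     : array of shape (..., C)
    alpha : scalar scaling parameter (learnable in a real model)
    gamma : per-channel scale, shape (C,)
    beta  : per-channel shift, shape (C,)
    """
    return gamma * np.tanh(alpha * x) + beta

if __name__ == "__main__":
    channels = 4
    alpha = 0.5                  # initial value used in this paper
    gamma = np.ones(channels)    # all-ones initialization
    beta = np.zeros(channels)    # all-zeros initialization
    x = np.array([[-100.0, -1.0, 0.1, 100.0]])
    print(dyt(x, alpha, gamma, beta))
```

With this initialization, extreme inputs are squashed into the interval [-1, 1], while inputs near zero are mapped almost linearly with slope α.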
The scale-transformed features are subsequently fed into the channel attention architecture, as illustrated in Fig. 3. First, channel reconstruction is performed separately on the input multi-layer features to generate feature maps with varying channel dimensions. Subsequently, overlapping channel matching is applied to the feature maps to create channel groups. This operation extracts overlapping regions along the channel dimension from local areas, with different scale features yielding distinct channel configurations after reconstruction. The multi-channel features from different layers then undergo cross-layer fusion, producing enhanced object representations. Finally, these refined features are processed by the object detection architecture to obtain small object classification and localization information.
Fig. 3.
Architecture of the channel attention mechanism.
The channel reconstruction operation stacks feature values from the spatial dimension onto the channel dimension, maintaining consistent spatial resolution while preserving computational efficiency. The computational procedure for channel reconstruction is formalized in Eq. (6):
![]() |
6 |
where input feature map
is transformed into output
through channel reconstruction, with the detailed computational procedure governed by Eq. (6).
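Channel reconstruction as described here, moving spatial samples onto the channel axis so that all pyramid levels share one spatial resolution, resembles a space-to-depth rearrangement. The NumPy sketch below illustrates that reading; the function names, the per-level factors, and the toy shapes are assumptions and are not taken from Eq. (6).

```python
import numpy as np

def channel_reconstruct(x, r):
    """Space-to-depth: (C, H, W) -> (C*r*r, H//r, W//r)."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def spatial_reconstruct(y, r):
    """Inverse rearrangement: (C*r*r, H, W) -> (C, H*r, W*r)."""
    c2, h, w = y.shape
    c = c2 // (r * r)
    y = y.reshape(c, r, r, h, w)
    return y.transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

if __name__ == "__main__":
    # Three hypothetical pyramid levels with the same channel count.
    levels = [np.zeros((8, 32, 32)), np.zeros((8, 16, 16)), np.zeros((8, 8, 8))]
    for feat, r in zip(levels, (4, 2, 1)):
        print(feat.shape, "->", channel_reconstruct(feat, r).shape)
```

Under these assumptions every level ends up at 8 × 8 spatial resolution, and the channel counts of adjacent levels differ by a factor of 4, matching the relationship stated in the text. No values are discarded, so the operation is exactly invertible.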
Subsequently, the channel overlapping matching operation is applied to generate channel feature groups. These groups extract overlapping regions from local areas along the channel dimension, producing varying channel sizes when processing multi-scale feature inputs. Specifically, based on the multi-scale characteristics, adjacent features in
yield channel sizes differing by a factor of 4 (
). To construct overlapping adjacent feature groups, we introduce an unfolding factor
and perform channel overlapping matching on
, yielding
, which can be mathematically formulated as shown in Eq. (7).
![]() |
7 |
Taking the feature map at the i-th layer as an example, after obtaining
, we employ cross-layer consistent multi-head attention to capture global dependencies along the spatial dimension, thereby deriving the interaction outcome
. The computational procedure is formally expressed in Eqs. (8) and (9).
![]() |
8 |
![]() |
9 |
where
denotes the linear projection matrix, while K and V represent the concatenated keys and values, respectively. We employ a multi-head mechanism to capture global dependencies for each channel token. After obtaining the interaction results
from multi-scale feature maps, an inverse channel overlapping matching operation is applied to counteract the effects of channel overlapping matching, thereby restoring the original spatial resolution. Finally, spatial reconstruction (SR) is utilized to rearrange feature values from the channel dimension back to the spatial dimension, effectively reversing the channel reconstruction (CR) operation and ensuring dimensional alignment between each feature map and its corresponding input
.
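The interaction step, in which queries from one layer attend to keys and values concatenated across layers, can be sketched as follows. This is a simplified single-head illustration, not the formulation of Eqs. (8) and (9): the function names, the treatment of each channel as one token, and the shared projection matrices are all assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(tokens, i, w_q, w_k, w_v):
    """Attend from layer i to the tokens of all layers.

    tokens : list of arrays, tokens[l] has shape (C_l, D), one row per
             channel token (D = flattened spatial size, equal for all layers)
    Returns the interaction result for layer i and the attention map.
    """
    q = tokens[i] @ w_q                                      # (C_i, d)
    k = np.concatenate([t @ w_k for t in tokens], axis=0)    # (sum C_l, d)
    v = np.concatenate([t @ w_v for t in tokens], axis=0)    # (sum C_l, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))           # (C_i, sum C_l)
    return attn @ v, attn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_in, d_model = 16, 8
    tokens = [rng.normal(size=(c, d_in)) for c in (16, 4, 1)]
    w_q, w_k, w_v = (0.1 * rng.normal(size=(d_in, d_model)) for _ in range(3))
    out, attn = cross_layer_attention(tokens, 1, w_q, w_k, w_v)
    print(out.shape, attn.shape)
```

Because the keys and values span every layer, each channel token of layer i can draw on information from shallower and deeper features in a single attention step.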
To mitigate the inherent limitation of conventional multi-head attention mechanisms in capturing positional information, we introduce cross-layer consistent relative positional encoding to enhance cross-layer positional awareness. The technical implementation proceeds as follows: The set of attention maps between each spatial token group is denoted as
, where N represents the number of attention heads,
. For computational simplicity while disregarding H0 and W0, we define
,
, where
and
denote the height and width of the spatial dimensions at the i-th and j-th layers, respectively. Consequently, the collection of attention feature maps can be reformulated as
.
We define a learnable codebook C and extract relative positional information between any two tokens by computing their cross-layer consistent relative position indices within this codebook. Considering first the interaction between token groups in the spatial dimensions of the i-th and j-th layers, where
denotes the absolute coordinate matrix of the respective tokens in the form, while
similarly represents their corresponding absolute coordinate matrix. To acquire the relative positional information of
with respect to
, we first normalize the coordinates according to their respective spatial token group dimensions, yielding normalized coordinates
and
. Subsequently, by projecting their coordinates onto the maximum spatial dimension, we obtain the projected coordinate representations
and
.
Subsequently, the relative distance between
and
is computed and converted into an index for the codebook. This index is then utilized to retrieve the positional embedding matrix from codebook C, which is additively incorporated into the original attention map
to generate the output
enriched with comprehensive positional information. This mechanism adaptively amplifies faint small-object features within the feature maps, enhances cross-layer positional awareness during feature interaction, and consequently strengthens their representational capacity in complex backgrounds.
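The codebook lookup described above (normalize coordinates, project them onto the largest spatial grid, convert the relative distance into an index, and add the retrieved embedding to the attention map) can be sketched in NumPy. The function names, the use of cell centers for normalization, and the indexing scheme are assumptions made for illustration; the paper's exact formulation is not reproduced here.

```python
import numpy as np

def projected_coords(h, w, h_max, w_max):
    """Token coordinates of an h x w grid projected onto an h_max x w_max grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Normalize by the token group's own size (cell centers in [0, 1)).
    norm = np.stack([(ys.ravel() + 0.5) / h, (xs.ravel() + 0.5) / w], axis=1)
    # Project onto the maximum spatial dimension and snap to integer cells.
    return np.floor(norm * np.array([h_max, w_max])).astype(int)

def relative_index(size_i, size_j, h_max, w_max):
    """Codebook index for every (token in layer i, token in layer j) pair."""
    p_i = projected_coords(*size_i, h_max, w_max)
    p_j = projected_coords(*size_j, h_max, w_max)
    rel = p_i[:, None, :] - p_j[None, :, :]           # (N_i, N_j, 2)
    dy = rel[..., 0] + (h_max - 1)                    # shift to be >= 0
    dx = rel[..., 1] + (w_max - 1)
    return dy * (2 * w_max - 1) + dx

def add_positional_bias(attn, codebook, index):
    """attn: (heads, N_i, N_j); codebook: (heads, (2*h_max-1)*(2*w_max-1))."""
    return attn + codebook[:, index]

if __name__ == "__main__":
    idx = relative_index((4, 4), (2, 2), 4, 4)
    codebook = np.zeros((2, 7 * 7))   # learnable in a real model
    print(add_positional_bias(np.zeros((2, 16, 4)), codebook, idx).shape)
```

Since every layer is projected onto the same grid before distances are taken, tokens from different layers that cover the same image location receive the same codebook entry, which is one way to realize the cross-layer consistency the text refers to.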
Diffusion wavelet convolutional feature preservation architecture
Conventional convolution operations frequently induce information dissipation of small objects in remote sensing imagery, consequently compromising detection performance. To counteract the inherent information degradation during convolutional processing, we propose a diffusion wavelet convolutional structure. This framework leverages parallel multi-channel feature processing to minimize information loss, thereby maintaining effective characterization of critical features under conditions of object diminutiveness or boundary ambiguity in dense distributions. Ultimately, this architecture achieves substantial enhancement in detecting densely arranged small-scale objects. The structural configuration of the diffusion wavelet convolution is delineated in Fig. 4.
Fig. 4.
Architectural diagram of the diffusion wavelet convolution.
As illustrated in Figs. 1 and 4, the input image undergoes convolutional operations to generate multi-scale feature maps. To mitigate information degradation of small-scale objects during convolutional processing, we employ diffusion wavelet convolution to process feature maps of varying dimensions. This approach achieves feature map downscaling while preserving spatial details, consequently producing feature representations with expanded receptive fields. The computational procedure of diffusion wavelet convolution is formally expressed in Eq. (10).
![]() |
10 |
In this formulation,
denotes the high-resolution shallow feature maps,
represents the deep feature maps possessing expansive receptive fields, signifies the convolution operation, and indicates the dimensional concatenation operation.
corresponds to the image resolution reduced by a factor of
relative to the original input. To facilitate comprehension, we express Equation (10) using
, where i indicates the number of downsampling iterations. Consequently, Eq. (10) can be formally represented as:
![]() |
11 |
Here, F represents the input feature map,
denotes the sub-feature maps obtained through DWT filters,
corresponds to the high-frequency bands, and
signifies the low-frequency components. The features processed by the diffusion wavelet convolutional architecture demonstrate enhanced preservation of small-scale object characteristics. This methodology effectively mitigates information loss while simultaneously capturing critical features of densely distributed small objects with ambiguous boundaries, thereby achieving efficient detection of remote sensing small objects.
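The way DWT sub-bands allow downscaling without discarding detail can be illustrated with a single-level 2-D Haar transform. The Haar basis is an assumption chosen for simplicity, since the text does not state which wavelet is used, and the function names are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar DWT of x with shape (C, H, W), H and W even.

    Returns the low-frequency band and three high-frequency bands,
    each of shape (C, H/2, W/2).
    """
    a = x[:, 0::2, 0::2]
    b = x[:, 0::2, 1::2]
    c = x[:, 1::2, 0::2]
    d = x[:, 1::2, 1::2]
    low = (a + b + c + d) / 2.0
    high_1 = (a - b + c - d) / 2.0   # differences along the width
    high_2 = (a + b - c - d) / 2.0   # differences along the height
    high_3 = (a - b - c + d) / 2.0   # diagonal differences
    return low, high_1, high_2, high_3

def haar_idwt2(low, high_1, high_2, high_3):
    """Exact inverse of haar_dwt2."""
    c, h, w = low.shape
    x = np.empty((c, 2 * h, 2 * w), dtype=low.dtype)
    x[:, 0::2, 0::2] = (low + high_1 + high_2 + high_3) / 2.0
    x[:, 0::2, 1::2] = (low - high_1 + high_2 - high_3) / 2.0
    x[:, 1::2, 0::2] = (low + high_1 - high_2 - high_3) / 2.0
    x[:, 1::2, 1::2] = (low - high_1 - high_2 + high_3) / 2.0
    return x

if __name__ == "__main__":
    feat = np.random.default_rng(0).normal(size=(3, 8, 8))
    stacked = np.concatenate(haar_dwt2(feat), axis=0)
    print(feat.shape, "->", stacked.shape)
```

Stacking the four sub-bands along the channel axis halves the spatial resolution while keeping every coefficient, in contrast to strided convolution or pooling, which discard samples.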
Diffusion Wavelet Convolution employs Chebyshev polynomial approximation to avoid explicit eigendecomposition. Efficient K-order polynomial filtering in the spatial domain is achieved through a scaled Laplacian matrix and the recurrence relation of the Chebyshev polynomials. In terms of kernel design, multiple parallel branches are constructed, each corresponding to a distinct scale. The Chebyshev coefficients within each branch serve as learnable parameters, collectively forming an adaptive multi-scale convolutional kernel. This enables the network to autonomously learn frequency-response characteristics at different scales. By transforming fixed wavelet bases into data-driven adaptive filters and incorporating a multi-branch architecture for explicit multi-scale feature fusion, the proposed approach ultimately enhances the representational capacity for small or fine-grained targets.
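The Chebyshev filtering described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the graph is given by a symmetric adjacency matrix, the symmetric normalized Laplacian is used, its largest eigenvalue is replaced by the upper bound 2 so that no eigendecomposition is needed, and the names `scaled_laplacian`, `chebyshev_filter`, and `multi_branch_filter` are hypothetical. In a trained network the coefficient vectors `theta` would be learnable parameters.

```python
import numpy as np


def scaled_laplacian(adj):
    """Return L_tilde = 2 L / lambda_max - I for the normalized Laplacian L.

    lambda_max is replaced by its upper bound 2 (an assumption that avoids
    computing eigenvalues), so L_tilde has its spectrum inside [-1, 1].
    """
    n = len(adj)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lam_max = 2.0
    return 2.0 * lap / lam_max - np.eye(n)


def chebyshev_filter(l_tilde, x, theta):
    """Compute sum_k theta[k] * T_k(l_tilde) @ x with the Chebyshev recurrence.

    T_0 x = x, T_1 x = L_tilde x, T_k x = 2 L_tilde T_{k-1} x - T_{k-2} x.
    """
    t_prev = x
    out = theta[0] * t_prev
    if len(theta) > 1:
        t_curr = l_tilde @ x
        out = out + theta[1] * t_curr
        for k in range(2, len(theta)):
            t_next = 2.0 * (l_tilde @ t_curr) - t_prev
            out = out + theta[k] * t_next
            t_prev, t_curr = t_curr, t_next
    return out


def multi_branch_filter(l_tilde, x, thetas):
    """One branch per coefficient vector; outputs are concatenated on the last axis."""
    return np.concatenate([chebyshev_filter(l_tilde, x, th) for th in thetas], axis=-1)
```

Each branch only needs repeated matrix-vector products with the scaled Laplacian, so a K-order filter costs K such products and touches only K-hop neighbourhoods.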
Experiments and results
The evaluation metrics employed in our method include Average Precision ($AP$) and Average Recall ($AR$). Specifically, $mAP$ represents the mean average precision across all categories, $AP$ corresponds to the area under the Precision-Recall curve, and $AR$ is computed as twice the area under the Recall-IoU curve. The mathematical formulations are expressed in Eqs. (12)-(14) as follows:

$$mAP = \frac{1}{n}\sum_{j=1}^{n} AP_j \tag{12}$$

$$AP = \sum_{i} p_i \, \Delta r_i \tag{13}$$

$$AR = 2\int_{0.5}^{1} \mathrm{recall}(o)\, \mathrm{d}o \tag{14}$$

where $p_i$ denotes the precision value corresponding to the i-th recall level on the Precision-Recall curve, $\Delta r_i$ is the recall increment between consecutive levels, $o$ is the IoU threshold, and n represents the total number of object categories. The computational formulation of recall is expressed in Eq. (15), while precision is defined in Eq. (16) as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{15}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{16}$$
Recall quantifies the proportion of actual positive instances that are correctly identified by the model, while precision measures the fraction of correctly predicted positive instances among all positive predictions by the model. Here, True Positive ($TP$) denotes the count of accurately detected bounding boxes, False Positive ($FP$) represents the number of erroneously detected boxes, and False Negative ($FN$) indicates the count of undetected ground truth boxes. Prediction confidence refers to the probability that a predicted object category is correct. The evaluation metrics include $AP_{50}$ ($AP$ measured at an IoU threshold of 0.50) and $AP$ (the average $AP$ across IoU thresholds from 0.50 to 0.95 in 0.05 increments). Writing $AP_t$ for the $AP$ at IoU threshold $t$, the computational formulation is expressed in Eq. (17) as follows:

$$AP = \frac{1}{10}\sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} AP_t \tag{17}$$
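To make the metric definitions concrete, the following Python sketch computes IoU, derives precision and recall from a greedy confidence-ordered matching, takes AP as the area under the resulting Precision-Recall curve, and averages it over the ten IoU thresholds from 0.50 to 0.95. It is a simplified single-class illustration under stated assumptions: the function names and the `(score, box)` input layout are hypothetical, boxes are `(x1, y1, x2, y2)`, and the 101-point interpolation used by the official COCO-style evaluation tools is omitted.

```python
import numpy as np


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr):
    """AP for one class: preds is a list of (score, box), gts a list of boxes."""
    preds = sorted(preds, key=lambda p: p[0], reverse=True)
    matched, tp = set(), []
    for _, box in preds:
        # Greedily take the best still-unmatched ground truth above the threshold.
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            v = iou(box, gt)
            if j not in matched and v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
        tp.append(1.0 if best_j >= 0 else 0.0)
    tp = np.array(tp)
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = cum_tp / max(len(gts), 1)                       # TP / (TP + FN)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)  # TP / (TP + FP)
    recall_steps = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(precision * recall_steps))           # area under PR curve


def ap_over_thresholds(preds, gts):
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.50, 0.95, 10)
    return float(np.mean([average_precision(preds, gts, t) for t in thresholds]))
```

For example, a single prediction whose IoU with the only ground-truth box is 0.78 counts as a true positive at the six thresholds up to 0.75 and as a false positive at the remaining four, so its averaged AP is 0.6 even though its AP at the 0.50 threshold is 1.0.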
We conducted extensive experiments on the VisDrone dataset. As illustrated in Fig. 5, the visualization of detection results demonstrates the effectiveness of our approach. Specifically, rows (a) and (c) present the detection results obtained by the RetinaNet baseline, while rows (b) and (d) display the corresponding results generated by our proposed method.
Fig. 5.
Comparative visualization of detection results on the VisDrone dataset.
The qualitative comparisons reveal that our method, enhanced by the three proposed small object detection components, successfully identifies a greater number of small objects. This improvement is particularly evident when comparing columns 1 and 2 of rows (a) and (b). Furthermore, as shown in columns 1 and 4 of rows (c) and (d), our approach maintains accurate detection and recognition capabilities even for partially occluded objects. Additionally, our method significantly reduces false positive rates. This advantage is clearly demonstrated in column 2 of rows (c) and (d), where the baseline RetinaNet method mistakenly identifies pillars as pedestrians, while our improved approach successfully eliminates such false detections.
These visual comparisons provide compelling evidence that our proposed method achieves superior performance for remote sensing small object detection tasks.
To comprehensively validate the effectiveness of our proposed method, we primarily employ
and
as evaluation metrics. The experimental results are summarized in Table 1, where the baseline method is RetinaNet and the reported metric corresponds to
.
Table 1.
Comparative overall experimental results on the VisDrone dataset.
| Method | FLOPs | Params | FPS | AR-S (%) | AR-M (%) | AR-All (%) | AP-S (%) | AP-M (%) | AP-All (%) |
|---|---|---|---|---|---|---|---|---|---|
| RetinaNet | 428.5G | 32.27M | 13.5 | 17.5 | 45.2 | 32.3 | 8.8 | 28.5 | 18.1 |
| Ours | 428.8G | 39.37M | 30.3 | 24.4 | 53.4 | 36.4 | 11.7 | 35.0 | 22.5 |
As evidenced in Table 1, our proposed method substantially enhances detection performance for small objects. On the VisDrone dataset, the overall average recall improves by 4.1 percentage points (from 32.3% to 36.4%), with larger gains for small-scale and medium-scale objects of 6.9 and 8.2 points respectively. Furthermore, the overall average precision rises from 18.1% to 22.5%, a 4.4-point improvement. Specifically, small-scale objects gain 2.9 points in average precision, while medium-scale objects gain 6.5 points. These results collectively validate that our proposed dynamic element-activated non-semantic sparse attention framework effectively enhances detection performance for remote sensing small objects.
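The gains quoted above follow directly from Table 1. As a small check, the script below recomputes them as absolute differences in percentage points and also reports the corresponding relative improvement over the baseline. The dictionary layout and variable names are illustrative only; the numbers are copied from Table 1.

```python
# Values copied from Table 1 (VisDrone, in %).
baseline = {"AR-S": 17.5, "AR-M": 45.2, "AR-All": 32.3,
            "AP-S": 8.8, "AP-M": 28.5, "AP-All": 18.1}
ours = {"AR-S": 24.4, "AR-M": 53.4, "AR-All": 36.4,
        "AP-S": 11.7, "AP-M": 35.0, "AP-All": 22.5}

# Absolute gain in percentage points and relative gain in percent of baseline.
gains = {k: round(ours[k] - baseline[k], 1) for k in baseline}
relative = {k: round(100.0 * (ours[k] - baseline[k]) / baseline[k], 1)
            for k in baseline}

if __name__ == "__main__":
    for key in baseline:
        print(f"{key}: +{gains[key]} points ({relative[key]}% relative)")
```

Reporting both views is useful here because a 2.9-point gain on a small-object AP of 8.8% is a much larger relative change than a 4.4-point gain on an overall AP of 18.1%.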
To further verify the individual contribution of each proposed component, we conduct comprehensive ablation studies on the VisDrone dataset. The ablation experimental results presented in Table 2, measured by
, systematically demonstrate that all three proposed architectural components contribute to the improved detection performance for remote sensing small objects.
Table 2.
Ablation study results on the VisDrone dataset (%).
| NSSA | DyT | DWT | AR-S | AR-M | AR-All | AP-S | AP-M | AP-All |
|---|---|---|---|---|---|---|---|---|
| – | – | – | 17.5 | 45.2 | 32.3 | 8.8 | 28.5 | 18.1 |
| ✓ | – | – | 24.0 | 53.1 | 36.4 | 11.7 | 35.1 | 22.4 |
| – | ✓ | – | 24.2 | 52.7 | 36.4 | 11.7 | 34.5 | 22.1 |
| – | – | ✓ | 24.8 | 53.2 | 36.9 | 11.7 | 34.6 | 22.1 |
| ✓ | ✓ | ✓ | 24.4 | 53.4 | 36.4 | 11.7 | 35.0 | 22.5 |
Additionally, to substantiate the efficacy of our proposed methodology, we conduct a comparative visualization between the feature maps produced by our enhanced network architecture and those generated by the baseline model. The resultant comparative analysis is systematically presented in Fig. 6.
Fig. 6.
Feature heatmap comparison between our method and RetinaNet.
Figure 6 presents comparative feature heatmap visualizations, where rows (a) and (c) display feature representations from the RetinaNet baseline, while rows (b) and (d) show the corresponding visualizations from our proposed method. Specifically, rows (a) and (b) illustrate features from the second or third network layers, whereas rows (c) and (d) depict features from the fourth or fifth layers. Columns 1 and 3 in rows (c) and (d) show that our approach generates more salient feature representations for small objects with suppressed background interference, thereby widening the gap between foreground objects and background clutter. This enhanced feature distinctiveness provides subsequent detection modules with more discriminative feature representations.
To further demonstrate the advancement of our method, we conduct comprehensive comparisons with state-of-the-art approaches on the VisDrone dataset. Table 3 compares experimental results between our proposed algorithm, RetinaNet, and other mainstream methods, where the detection accuracy is measured in %. In Table 3, the results for RetinaNet, YOLOv10, YOLOv11, and the proposed method are obtained in our own experimental environment, while the results of the other methods are taken from previously published papers.
Table 3.
Performance comparison between our method and current mainstream approaches on VisDrone dataset.
| Method | FLOPs | Params | FPS | AP-S | AP-M | AP-L | AP50 | AP75 | AP |
|---|---|---|---|---|---|---|---|---|---|
| RetinaNet [31] | 428.5G | 32.27M | 13.5 | 8.8 | 28.5 | 38.0 | 31.1 | 18.3 | 18.1 |
| CFPT [32] | 218.5G | 37.3M | 20.4 | 11.7 | 34.5 | 42.5 | 38.0 | 22.5 | 22.1 |
| FPN [25] | 216.6G | 36.3M | – | 10.9 | 34.3 | 40.1 | 36.4 | 21.4 | 21.0 |
| PAFPN [26] | 222.7G | 38.7M | 22.5 | 10.9 | 34.6 | 41.1 | 36.5 | 21.6 | 21.2 |
| AugFPN [33] | 216.8G | 38.17M | – | 11.1 | 35.4 | 40.4 | 37.1 | 22.2 | 21.7 |
| DRFPN [34] | 228.5G | 41.1M | 17.9 | 11.0 | 35.3 | 39.5 | 36.7 | 22.0 | 21.5 |
| FPG [28] | 346.1G | 71.0M | – | 11.5 | 35.2 | 38.7 | 37.3 | 22.2 | 21.7 |
| FPT [35] | 331.8G | 56.6M | 11.8 | 9.4 | 30.0 | 38.9 | 33.3 | 19.2 | 19.3 |
| RCFPN [36] | 209.2G | 36.0M | – | 10.5 | 34.8 | 38.1 | 36.0 | 21.3 | 21.0 |
| SSFPN [37] | 274.0G | 40.8M | 21.0 | 11.5 | 35.3 | 39.8 | 37.3 | 22.2 | 21.7 |
| AFPN [29] | 250.0G | 58.0M | 13.4 | 10.7 | 33.4 | 36.9 | 36.0 | 21.2 | 20.7 |
| YOLOv10 [38] | 21.3G | 8.0M | 169 | – | – | – | 37.4 | 22.5 | 22.2 |
| YOLOv11 [39] | 21.3G | 9.4M | 147 | – | – | – | 38.3 | 22.4 | 22.3 |
| Ours | 428.8G | 39.37M | 30.3 | 11.7 | 35.0 | 41.6 | 38.5 | 22.7 | 22.5 |
As demonstrated in Table 3, our proposed method achieves the best results on the principal evaluation metrics. Particularly noteworthy are its results in AP50, AP75, and overall AP, where it attains 38.5%, 22.7%, and 22.5% respectively, surpassing all compared methods. It also matches the best AP-S (11.7%, tied with CFPT), whereas its AP-M (35.0%) and AP-L (41.6%) are slightly below the best compared results (35.4% for AugFPN and 42.5% for CFPT). These results further substantiate the effectiveness of our approach for remote sensing small object detection. Moreover, regarding computational complexity, parameter count, and inference speed, although the computational complexity of the proposed method is relatively high at 428.8 GFLOPs, its parameter count of 39.37 M remains competitive among the compared methods. At 30.3 FPS, the proposed method is faster than every compared method with a reported FPS except the lightweight YOLOv10 (169 FPS) and YOLOv11 (147 FPS), which meets the requirements for real-time detection.
To additionally validate the methodological efficacy, we conduct evaluations on the AI-TODv2 dataset. Table 4 presents comparative results between the baseline method and our proposed framework, where VT, T, S, and M denote very tiny, tiny, small, and medium-sized objects respectively. All reported metrics correspond to
measurements.
Table 4.
Comparative overall experimental results on the AI-TODv2 dataset (%).
| Method | AR-VT | AR-T | AR-S | AR-M | AR-All | AP-VT | AP-T | AP-S | AP-M | AP-All |
|---|---|---|---|---|---|---|---|---|---|---|
| RetinaNet | 2.0 | 5.4 | 6.3 | 7.6 | 13.6 | 2.0 | 5.4 | 6.3 | 7.6 | 4.7 |
| Ours | 9.2 | 24.2 | 24.6 | 35.9 | 23.4 | 1.9 | 7.9 | 12.0 | 20.0 | 7.8 |
As evidenced in Table 4, our method demonstrates consistent improvements in both average precision and recall across most object scales. The overall average recall achieves a substantial gain of 9.6%, with specific improvements of 2.7% for tiny objects, 6.0% for small objects, and 8.2% for medium-sized objects. Regarding detection accuracy, the overall average precision improves by 3.1%, while tiny, small, and medium-sized objects show respective enhancements of 2.4%, 5.7%, and 12.4%. These comprehensive improvements in both precision and recall metrics on the specialized small object dataset further validate the effectiveness of our proposed methodology.
Furthermore, we conduct comparative experiments with state-of-the-art methods on the AI-TODv2 dataset to demonstrate the advancement of our approach for remote sensing small object detection. Table 5 presents the experimental comparisons between our proposed algorithm, RetinaNet, and other mainstream methods. In Table 5, the results for RetinaNet, YOLOv10, YOLOv11, and the proposed method are obtained in our own experimental environment, while the results of the other methods are taken from previously published papers.
Table 5.
The experimental comparisons between our proposed algorithm and other methods on the AI-TODv2 dataset (%).
| Method | Backbone | AP-VT | AP-T | AP-S | AP-M | AP50 | AP75 | AP |
|---|---|---|---|---|---|---|---|---|
| RetinaNet [31] | ResNet50 | 2.0 | 5.4 | 6.3 | 7.6 | 13.6 | 2.1 | 4.7 |
| CFPT [32] | ResNet50 | 2.2 | 7.5 | 11.6 | 20.0 | 21.4 | 4.0 | 7.7 |
| SSD512 [40] | ResNet50 | 1.0 | 4.7 | 11.5 | 13.5 | 21.7 | 2.8 | 7.0 |
| TridentNet [41] | ResNet50 | 1.0 | 5.8 | 12.6 | 14.0 | 20.9 | 3.6 | 7.5 |
| FoveaBox [42] | ResNet50 | 0.9 | 5.8 | 13.4 | 15.9 | 19.8 | 5.1 | 8.1 |
| YOLOv10 [38] | CSPDarknet | – | – | – | – | 17.9 | 4.3 | 7.2 |
| YOLOv11 [39] | CSPDarknet | – | – | – | – | 18.5 | 4.5 | 7.4 |
| Ours | ResNet50 | 1.9 | 7.9 | 12.0 | 20.0 | 22.5 | 3.9 | 7.8 |
As demonstrated in Table 5, our proposed method achieves competitive or superior performance across multiple key metrics. It attains the best AP-T (7.9%) and AP50 (22.5%) among all compared methods and matches the best AP-M (20.0%, tied with CFPT), while its overall AP of 7.8% is second only to FoveaBox (8.1%). These results provide additional validation for the effectiveness of our approach in remote sensing small object detection.
Furthermore, the sparsity coefficient S in the NSSA module is the key parameter that determines the locality of the attention mechanism. We perform a sensitivity analysis on S by setting its values to 4, 8, and 16. Experiments are conducted separately on the VisDrone and AI-TODv2 datasets, and the corresponding results are presented in Tables 6 and 7.
Table 6.
Sensitivity analysis of the sparsity coefficient S on the VisDrone dataset (%).
| S | AR-S | AR-M | AR-All | AP-S | AP-M | AP-All |
|---|---|---|---|---|---|---|
| 4 | 24.4 | 53.4 | 36.4 | 11.7 | 35.0 | 22.5 |
| 8 | 24.6 | 53.4 | 36.8 | 11.8 | 34.6 | 22.3 |
| 16 | 24.6 | 53.4 | 36.8 | 11.9 | 34.5 | 22.4 |
Table 7.
Sensitivity analysis of the sparsity coefficient S on the AI-TODv2 dataset (%).
| S | AR-VT | AR-T | AR-S | AR-M | AR-All | AP-VT | AP-T | AP-S | AP-M | AP-All |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 9.2 | 24.2 | 24.6 | 35.9 | 23.4 | 1.9 | 7.9 | 12.0 | 20.0 | 7.8 |
| 8 | 8.0 | 25.3 | 24.2 | 34.6 | 23.7 | 2.0 | 7.9 | 11.5 | 20.3 | 7.9 |
| 16 | 9.7 | 24.2 | 24.6 | 33.8 | 23.3 | 2.1 | 7.4 | 11.3 | 19.4 | 7.5 |
As can be observed from Tables 6 and 7, the effect of S differs between the two datasets. Specifically, Table 6 shows that on the VisDrone dataset, the model achieves its best overall AP of 22.5% when S is set to 4. However, when S is 16, small object detection is slightly better, with an AP of 11.9%, and the best overall AR of 36.8% is also attained (as it is when S is 8). In contrast, as shown in Table 7, the proposed method exhibits a distinct pattern on the AI-TODv2 dataset. The model reaches its peak overall AP of 7.9% and best overall AR of 23.7% when S is 8. Furthermore, the AP for very tiny objects gradually improves with increasing S, peaking at 2.1% when S is 16, although this setting does not yield the best overall detection performance. In summary, the best setting of S is dataset-dependent, while the overall AP and AR change by at most 0.4 percentage points across the tested values.
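To illustrate why a parameter of this kind controls the locality of attention, the following NumPy sketch restricts self-attention to non-overlapping patches. It rests on loud assumptions: S is interpreted here as the side length of a square patch, the query, key, and value projections are taken to be the identity for brevity, and the function name `local_patch_attention` is hypothetical. The actual NSSA module may define S and its projections differently.

```python
import numpy as np


def local_patch_attention(feat, s):
    """Self-attention computed independently inside each s x s patch.

    feat: (H, W, C) array with H and W divisible by s (assumed).
    Identity q/k/v projections are an assumption made to keep the sketch short.
    """
    h, w, c = feat.shape
    # Partition into patches: (H/s, s, W/s, s, C) -> (num_patches, s*s, C).
    patches = feat.reshape(h // s, s, w // s, s, c).transpose(0, 2, 1, 3, 4)
    tokens = patches.reshape(-1, s * s, c)
    # Scaled dot-product attention among the s*s tokens of each patch.
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ tokens
    # Undo the partition to recover the (H, W, C) layout.
    out = out.reshape(h // s, w // s, s, s, c).transpose(0, 2, 1, 3, 4)
    return out.reshape(h, w, c)


if __name__ == "__main__":
    x = np.random.default_rng(0).normal(size=(16, 16, 8))
    for s in (4, 8, 16):
        print(s, local_patch_attention(x, s).shape)
```

Under this interpretation, each token attends to s² neighbours, so the attention cost grows as H·W·s² rather than (H·W)², and a larger S trades locality for a wider context, which is consistent with the dataset-dependent behaviour reported in Tables 6 and 7.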
For more intuitive demonstration of our method’s efficacy, Fig. 7 presents comparative visualizations between the baseline method and our proposed approach on the AI-TODv2 dataset. Rows (a) and (c) display detection results from the baseline method, while rows (b) and (d) showcase corresponding results from our method. Both approaches demonstrate competent detection of minuscule objects in remote sensing imagery, though our method exhibits superior performance. Specifically, in the first column’s comparison between images (a) and (b), while the baseline method detects only two objects in scenarios with extremely small objects, our approach successfully identifies four objects. This pattern is consistently observed in the first column’s comparison between images (c) and (d) as well.
Fig. 7.
Comparative visualization of detection results on the AI-TODv2 dataset.
These visual comparisons support the conclusion that our method significantly outperforms the baseline approach in detecting minuscule objects, while maintaining robust detection performance in complex environmental conditions. The experimental evidence collectively reaffirms the superior detection capability of our proposed method for remote sensing small objects.
Conclusion
This paper presents an innovative detection methodology that addresses the persistent challenges in remote sensing small object detection, including complex backgrounds, severe occlusions, and dense distributions. The proposed framework integrates dynamic element activation with non-semantic sparse attention through three synergistic components that collectively enhance feature representation and discrimination capabilities for small objects.
The architecture incorporates a non-semantic sparse attention module that computes self-attention within localized patches, effectively focusing on fundamental visual elements including textures and edges while improving perception of occluded objects and fine-grained details. Furthermore, the dynamic element-activated cross-layer channel attention mechanism enables adaptive fusion and enhancement of multi-scale features, thereby amplifying small object saliency within complex backgrounds. The integration of a diffusion wavelet convolutional structure facilitates parallel feature processing that minimizes information degradation while effectively capturing critical characteristics and contextual information of densely distributed small objects.
Comprehensive experimental evaluation on benchmark datasets demonstrates significant performance improvements, with our method achieving detection accuracy of 22.5% on VisDrone and 7.8% on AI-TODv2, outperforming various mainstream detection models across key metrics including precision and recall. Systematic ablation studies confirm the individual contribution and necessity of each architectural component.
This research establishes that the proposed method effectively addresses typical challenges in remote sensing small object detection, providing a viable technical framework for developing high-performance remote sensing image interpretation systems. Future research directions will focus on lightweight architectural design to enhance computational efficiency, while exploring applications in more complex scenarios including video sequence object tracking and three-dimensional detection.
Author contributions
S.L.L.: conceptualization, methodology, formal analysis revision and funding acquisition; Y.R.B.: methodology, validation, software, writing and editing; Y.H.: methodology, investigation, validation, software, and visualization; Y.D.: supervision, project administration, revision and funding acquisition; Z.F.L.: supervision, review, project administration, revision; J.G.W.: supervision, funding acquisition.
Funding
This work was supported by the Henan Provincial Science and Technology Research Project under Grant 242102211006, the National Natural Science Foundation of China (NSFC) Youth Project under Grant 62301623, the Natural Science Youth Foundation of Zhongyuan University of Technology under Grant K2025YB003, the Key Scientific Research Project of Colleges and Universities in Henan Province under Grant 25A620001, the Shanxi Province Basic Research Program (Free Exploration for Youth Science Research Project) under Grant 202303021212387, and the General International Scientific and Technological Cooperation Project of Henan Province (Grant No. 252102521049).
Data availability
The datasets used during the current study are publicly available. The VisDrone dataset can be accessed at: https://github.com/VisDrone/VisDrone-Dataset.git. The AI-TODv2 dataset can be accessed at: https://github.com/Chasel-Tsui/mmdet-aitod. No additional datasets were generated during this study.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Li, Q., Zhang, W., Lu, W. & Wang, Q. Multi-branch mutual-guiding learning for infrared small target detection. IEEE Trans. Geosci. Remote Sens. (2025).
- 2. Wang, Z., Wang, C., Li, X., Xia, C. & Xu, J. Mlp-net: multi-layer perceptron fusion network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. (2024).
- 3. Wang, Y. et al. Multi-scale hierarchical feature fusion for infrared small-target detection. Remote Sens. 17, 428 (2025).
- 4. Zhu, Y. et al. Towards robust infrared small target detection via frequency and spatial feature fusion. IEEE Trans. Geosci. Remote Sens. (2025).
- 5. Shi, T. et al. Adaptive feature fusion with attention-guided small target detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 61, 1–16 (2023).
- 6. Liu, C., Xie, F., Dong, X., Gao, H. & Zhang, H. Small target detection from infrared remote sensing images using local adaptive thresholding. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 15, 1941–1952 (2022).
- 7. Li, F., Rao, P., Sun, W., Su, Y. & Chen, X. A low-signal-to-noise ratio infrared small-target detection network. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. (2025).
- 8. Hui, Y., Wang, J. & Li, B. Dsaa-yolo: Uav remote sensing small target recognition algorithm for yolov7 based on dense residual super-resolution and anchor frame adaptive regression strategy. J. King Saud Univ.-Comput. Inf. Sci. 36, 101863 (2024).
- 9. Jia, G., Cheng, Y. & Chen, T. Irgraphseg: Infrared small target detection based on hierarchical GNN. IEEE Geosci. Remote Sens. Lett. 21, 1–5 (2024).
- 10. Mnih, V., Heess, N., Graves, A. & Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 27 (2014).
- 11. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141 (2018).
- 12. Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19 (2018).
- 13. Hou, Q., Zhou, D. & Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13713–13722 (2021).
- 14. Srinivas, A. et al. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16519–16529 (2021).
- 15. Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling (2024).
- 16. Zhu, L. et al. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024).
- 17. Wang, H., Liu, M., Qu, L., Li, D. & Wang, S. Fair-detr: Detection transformer with adaptive multi-scale attention and dual strong constraint-aware query selection. Image Vis. Comput. 105792 (2025).
- 18. He, X., Zhang, Y. & Zhan, Q. Ain-yolo: A lightweight yolo network with attention-based inceptionnext and knowledge distillation for underwater object detection. Adv. Eng. Inform. 66, 103504 (2025).
- 19. Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323 (JMLR Workshop and Conference Proceedings, 2011).
- 20. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012).
- 21. Xu, B., Wang, N., Chen, T. & Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853 (2015).
- 22. Clevert, D.-A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 11 (2015).
- 23. Ramachandran, P., Zoph, B. & Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017).
- 24. Misra, D. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681 (2019).
- 25. Lin, T.-Y. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2117–2125 (2017).
- 26. Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8759–8768 (2018).
- 27. Pang, J. et al. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 821–830 (2019).
- 28. Chen, K., Cao, Y., Loy, C. C., Lin, D. & Feichtenhofer, C. Feature pyramid grids. arXiv preprint arXiv:2004.03580 (2020).
- 29. Yang, G. et al. Afpn: Asymptotic feature pyramid network for object detection. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2184–2189 (IEEE, 2023).
- 30. Zhu, J., Chen, X., He, K., LeCun, Y. & Liu, Z. Transformers without normalization. In Proceedings of the Computer Vision and Pattern Recognition Conference, 14901–14911 (2025).
- 31. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988 (2017).
- 32. Du, Z., Hu, Z., Zhao, G., Jin, Y. & Ma, H. Cross-layer feature pyramid transformer for small object detection in aerial images. IEEE Trans. Geosci. Remote Sens. (2025).
- 33. Guo, C., Fan, B., Zhang, Q., Xiang, S. & Pan, C. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12595–12604 (2020).
- 34. Ma, J. & Chen, B. Dual refinement feature pyramid networks for object detection. arXiv preprint arXiv:2012.01733 (2020).
- 35. Zhang, D. et al. Feature pyramid transformer. In European Conference on Computer Vision, 323–339 (Springer, 2020).
- 36. Zong, Z., Cao, Q. & Leng, B. Rcnet: Reverse feature pyramid and cross-scale shift network for object detection. In Proceedings of the 29th ACM International Conference on Multimedia, 5637–5645 (2021).
- 37. Hong, M. et al. Sspnet: Scale selection pyramid network for tiny person detection from UAV images. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2021).
- 38. Wang, A. et al. Yolov10: Real-time end-to-end object detection. Adv. Neural. Inf. Process. Syst. 37, 107984–108011 (2024).
- 39. Khanam, R. & Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725 (2024).
- 40. Liu, W. et al. Ssd: Single shot multibox detector. In European Conference on Computer Vision, 21–37 (Springer, 2016).
- 41. Li, Y., Chen, Y., Wang, N. & Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6054–6063 (2019).
- 42. Kong, T. et al. Foveabox: Beyound anchor-based object detection. IEEE Trans. Image Process. 29, 7389–7398 (2020).


































