Abstract
The rapid evolution of drone technology has expanded its applications across collaborative control, public safety, and aerial imaging, yet reliable object detection remains a challenge due to small target sizes and complex backgrounds in drone-captured imagery. To address these limitations, this paper introduces MFA-YOLO, a high-precision network specifically optimized for small-object detection in drone imagery. The proposed approach integrates three innovations: the Local Feature Mapping (LFM) unit for enhanced fine-grained feature extraction, the Progressive Shared Atrous Pyramid (PSAP) for efficient multi-scale feature integration, and the Dynamic Decoupling Head (DDH) for improved adaptive task alignment. Through these components, MFA-YOLO enhances representational capacity while preserving real-time inference efficiency. Experimental evaluations on the VisDrone benchmark demonstrate a 3.6% increase in AP50, a 2.4% increase in AP, and a 17% reduction in model parameters compared to YOLOv8n. Additional experiments on UAVDT further indicate the model’s promising generalization across similar drone datasets. These results highlight MFA-YOLO’s potential to advance drone-based perception systems, making them more effective and efficient for safety-critical and real-time applications in resource-constrained UAV environments, such as public safety monitoring, surveillance, and autonomous aerial operations.
Subject terms: Engineering, Mathematics and computing
Introduction
In recent years, rapid advances in unmanned aerial vehicle (UAV) technology have accelerated UAV adoption in collaborative control1, public safety2, and aerial imaging3,4. These developments have garnered extensive research interest. The mobility, portability, and low-altitude operation of UAVs—along with on-board embedded computing—enable them to collect and analyze data in real time. These capabilities also make UAVs well-suited to visual perception tasks.
While generic object detection datasets such as COCO5 and VOC6 have greatly advanced the field, UAV imagery introduces distinct challenges that differ from these conventional benchmarks. Targets in drone footage are typically much smaller and more densely distributed. They are often captured from high altitudes and arbitrary viewing angles against cluttered backgrounds. These factors complicate reliable detection. Even mainstream detectors often struggle to locate tiny objects. For instance, evaluations on UAV benchmarks such as VisDrone7 and UAVDT8 have shown that high object density and small target size significantly reduce detection accuracy. Recent efforts have attempted to address these issues through specialized UAV-specific detectors9, incorporating enhanced feature extraction and attention mechanisms to improve small-object recognition in aerial images.
Deep learning has significantly advanced object recognition in challenging conditions such as nighttime or infrared imagery10,11. Most existing object detectors are based on Convolutional Neural Networks (CNNs), which typically follow either a two-stage or a single-stage architecture. In a two-stage detector, the model first generates region proposals, and then refines these proposals. This approach, as seen in models like Fast R-CNN12 and Faster R-CNN13, results in high precision and robust performance, particularly in complex or cluttered scenes. The refinement step is effective in accurately localizing small or overlapping targets, a common challenge in real-world scenarios. However, this two-step process incurs additional computational overhead, leading to increased latency. Even when optimized, two-stage detectors generally run slower than single-stage models, limiting their suitability for real-time applications such as those used in UAVs. In contrast, single-stage detectors, such as YOLO14 and SSD15, forego the proposal generation step and predict bounding boxes and class probabilities directly from full images in a single forward pass16. This unified architecture dramatically simplifies the detection pipeline and enables much faster inference, which is particularly beneficial for on-board UAV applications. However, single-stage models often struggle to detect small or densely packed objects due to lower feature resolution and the absence of a refinement stage. This results in lower recall for tiny objects compared to two-stage methods.
In UAV imagery, small-object detection faces specific difficulties, including low resolution leading to feature loss, occlusion in dense scenes, and scale variation due to varying altitudes. These challenges inspire improvements in feature extraction and processing: enhanced fine-grained localization to counter resolution issues, efficient multi-scale integration to handle scale variations, and adaptive task alignment to mitigate occlusion effects.
To improve performance on small objects, techniques such as multi-scale feature pyramids and enhanced loss functions have been introduced. For example, Gu et al.17 develop a gradient-aware multi-scale YOLO variant that boosts small-object detection in UAV images. Despite these improvements, challenges related to small-object detection in cluttered scenes remain.
More recently, transformer-based models have shown promise thanks to their ability to model long-range dependencies via self-attention. End-to-end detectors like DETR18 employ transformers to capture global context. By forgoing hand-crafted post-processing, these methods can improve small-object recognition. Nevertheless, these attention-based networks typically require high-resolution inputs and have substantial computational overhead. These factors hamper their deployment on UAV platforms, where low-resolution imagery and limited hardware are common. Even the latest real-time transformer detectors bridge some speed and accuracy gaps relative to CNNs. However, their design is primarily tuned for natural images. Hybrid CNN–Transformer approaches have been explored to adapt these models to UAV imagery19, but fully closing the domain gap remains challenging. Consequently, UAV small-object detection remains difficult. Addressing this challenge necessitates (i) more discriminative feature extraction for tiny targets amid complex backgrounds and (ii) high-precision detection under strict computational budgets.
To address these challenges, we propose MFA-YOLO, a small-object detection network specifically designed for dense, small-scale, and low-resolution targets in UAV imagery. MFA-YOLO achieves a favorable balance between accuracy and computational efficiency. As illustrated in Fig. 1, MFA-YOLO consistently surpasses existing lightweight YOLO frameworks across different model scales while maintaining a comparable parameter budget, demonstrating its superiority in UAV-based small object detection.
Fig. 1.
Comparison of model accuracy and efficiency with other mainstream detectors on the VisDrone dataset.
Beyond performance gains, Fig. 2 provides a conceptual comparison between conventional approaches and our method. Previous detectors tend to lose fine-grained information in deeper backbone layers, resulting in limited discriminative ability for small objects. In contrast, MFA-YOLO introduces adaptive feature extraction and multi-scale fusion, enabling deeper layers to retain richer semantic cues and improving the overall representation of tiny targets.
Fig. 2.
The previous method overlooked detailed information in deeper layers of the backbone network during feature extraction. Our method aims to extract more multi-scale features into deeper layers of the network to enhance the expression of semantic information.
Our contributions are threefold:
Localized feature mapping unit (LFM): enhances fine-detail localization within the receptive field, mitigating the loss of key information common in standard convolutions.
Progressive shared atrous pyramid (PSAP): aggregates global features across multiple receptive field scales using dilated convolutions with inter-layer weight sharing. This module replaces traditional spatial pyramid pooling.
Dynamic decoupling head (DDH): improves the alignment of small-object features by combining task decoupling, dynamic convolutional alignment, and spatial feature mapping.
Related work
The overview of YOLO
The You Only Look Once (YOLO) series represents a cornerstone in real-time object detection, evolving to address the demands of efficiency and accuracy in diverse applications, including unmanned aerial vehicle (UAV) imagery. Starting with YOLOv1 20, which conceptualized detection as a unified regression task on a grid, subsequent versions have progressively mitigated issues such as limited resolution for small objects. YOLOv2 21 incorporated anchor boxes and batch normalization to enhance scale invariance, while YOLOv3 22 integrated a feature pyramid network (FPN) 23 for improved multi-scale contextual fusion, facilitating better handling of varied object sizes in complex scenes.
Advancements in later iterations emphasized architectural refinements for resource-constrained environments. YOLOv4 24 employed techniques like mosaic augmentation and a CSPDarknet backbone to bolster robustness without excessive complexity. YOLOv5 25 introduced scalable architectures, enabling adaptability to edge devices, though it often struggles with fine-grained details in dense UAV captures due to rigid feature extraction. YOLOv7 26 focused on re-parameterization and efficient aggregation layers to optimize training dynamics. YOLOv8 27 shifted to an anchor-free design with task-aligned assignment, reducing sensitivity to hyperparameters and improving generalization in cluttered environments.
More recent developments, such as YOLOX 28, decoupled heads for classification and regression to refine predictions in noisy backgrounds. YOLOv10 29 incorporated programmable gradients and non-maximum suppression-free training for streamlined end-to-end processing. YOLOv11 30 and YOLOv13 31 leveraged hypergraph structures and adaptive processing to enhance perceptual capabilities. A significant extension is MHAF-YOLO 32, which introduces a Multi-Branch Auxiliary FPN (MAFPN) with Superficial Assisted Fusion (SAF) and Advanced Assisted Fusion (AAF) modules, alongside a Re-parameterized Heterogeneous Multi-Scale (RepHMS) block and Global Heterogeneous Flexible Kernel Selection (GHFKS) mechanism. This architecture promotes heterogeneous feature fusion and adaptive kernel adjustments, addressing shortcomings in prior YOLO variants like inadequate shallow feature retention and static receptive fields, particularly beneficial for small-object scenarios.
Despite these evolutions, YOLO models frequently prioritize computational speed at the expense of nuanced feature representation, leading to challenges in UAV applications where small, variable objects predominate. Our proposed MFA-YOLO overcomes these limitations by integrating a Localized Feature Mapping (LFM) unit, a Progressive Shared Atrous Pyramid (PSAP), and a Dynamic Decoupling Head (DDH), fostering superior feature localization and adaptive fusion. This design yields enhanced detection precision through richer contextual integration, albeit with a modest increase in inference complexity, a trade-off that is acceptable for safety-critical UAV tasks, where reliability in identifying subtle threats supersedes marginal efficiency gains.
Small object detection
Detecting small targets in visual data poses inherent difficulties, which are amplified in UAV imagery by limited pixel resolution, background clutter, and significant scale variations. Conventional two-stage detectors like Faster R-CNN13 rely on region proposal mechanisms that often overlook subtle details, leading to diminished recall for diminutive objects. Single-stage approaches (e.g., YOLO) offer real-time performance but can struggle with low-resolution features and densely packed scenes. To balance accuracy and efficiency, lightweight convolutional networks have emerged as alternatives, aiming to preserve feature expressiveness under tight computational constraints. However, these models often grapple with adapting to the dynamic and noisy nature of aerial scenes, where object appearance and scale can vary rapidly.
Researchers have explored specialized lightweight architectures to improve small-object recognition. For instance, Shen et al.33 devised a compact CNN for finger vein recognition using depthwise separable convolutions and a triplet-loss strategy to distill robust features from low-resolution inputs. This method excels at extracting intricate vein patterns, yet it faces challenges generalizing to unstructured outdoor environments; background interference akin to UAV scenes can degrade its performance. Similarly, Shen et al.34 proposed a lightweight model with hybrid attention for instrument reading acquisition, directing focus to salient dial features via channel and spatial attention mechanisms. While effective in controlled settings, this approach falters under the degradation and variability prevalent in aerial views (e.g., partial occlusions or environmental clutter). In the realm of cultural heritage analysis, Shen et al.35 introduced a semantic feature-oriented lightweight algorithm for ancient mural element detection. By employing adaptive data augmentation to simulate defects and integrating residual attention for context fusion, it adeptly captures fine details on degraded surfaces. However, its emphasis on static, faded imagery reveals inadequacies in real-time UAV contexts, where motion-induced distortions and rapid scale changes demand more resilient, adaptive detection strategies.
Beyond the realm of general CNN designs, recent one-stage detectors have been tailored specifically for UAV-based small object detection. For example, Fan et al.36 developed LUD-YOLO, a novel lightweight YOLO-based network optimized for unmanned aerial vehicle applications, which streamlines the detection pipeline to reduce computational load. Zhou et al.37 proposed SAD-YOLO, an improved YOLOv8 variant designed to enhance the detection of tiny objects in airport surveillance imagery through refined feature processing. Xie et al.38 introduced KL-YOLO, which incorporates an adaptive global feature enhancement module to bolster small-object detection performance in low-altitude remote sensing scenes. These approaches illustrate the ongoing efforts to boost the detection of small targets while keeping models lightweight and deployment-friendly for UAV platforms.
Despite such advancements, small-object detectors remain vulnerable to adverse conditions. Tian et al.39 demonstrated that even minor input perturbations can severely impair the detection of small objects, underscoring the fragility of current lightweight models to adversarial influences when robust safeguards are absent. This highlights that reliably detecting tiny targets in complex, real-world UAV scenarios continues to be an open challenge, demanding further improvements in feature extraction robustness and model resilience.
Multi-scale receptive field
Multi-scale receptive field design is pivotal for assimilating contextual cues across diverse object sizes—a requirement amplified in UAV imagery that ranges from fine-grained details to broad views. Standard CNNs with fixed-size kernels lack the flexibility to accommodate extreme scale variability, often resulting in missed detections or insufficient context for very small or very large objects. Contemporary approaches have thus pivoted toward dynamic and heterogeneous receptive field strategies to broaden the range of captured scales. However, these techniques frequently introduce additional complexity or computation, hindering seamless real-time integration on resource-constrained UAV platforms.
To address scale diversity, some lightweight detectors integrate multi-scale processing directly into their architectures. Shen et al.40 developed an anchor-free lightweight CNN for aerial vehicle detection that employs multi-scale convolutions and channel stacking to strengthen feature extraction in dense scenes. This method improves scale-invariant detection by capturing features at multiple resolutions, but its predefined multi-scale framework remains inflexible and may not fully adapt to the arbitrary object orientations and non-uniform scale distributions typical of UAV imagery. In a broader context of UAV system resilience, Yang et al.41 proposed a recovery mechanism for integrated aerial networks, prioritizing critical network functions based on contextual attributes. While such a strategy fortifies overall system reliability, it also underscores a gap in the perception layer: without adaptively tuned receptive fields in the vision model, scale-variant or anomalous objects can slip through undetected, leading to failures that network-level solutions alone cannot preempt.
Recent object detector designs attempt to dynamically expand and fuse receptive fields across layers. Yang et al.32 introduced MHAF-YOLO, a multi-branch heterogeneous fusion network that adaptively selects convolutional kernel sizes and integrates features across scales through a specialized pyramid architecture. This approach surpasses traditional feature pyramid networks in capturing multi-scale context and improving detection accuracy for varied object sizes. However, its intricate multi-module design (involving several parallel sub-networks and attention blocks) adds significant complexity and may impose latency, limiting its practicality for real-time UAV deployments with strict speed and power constraints. Overall, existing techniques still exhibit rigidity in receptive field adaptation or inefficiencies in feature fusion, leading to suboptimal handling of scenes with widely varying object scales.
Our proposed solution, MFA-YOLO, aims to overcome these limitations by introducing mechanisms for both receptive field adaptation and efficient multi-scale feature aggregation. Specifically, a Dynamic Decoupling Head (DDH) decouples classification and localization subtasks for small-object features, preventing task interference and allowing more focused learning, while a Progressive Shared Atrous Pyramid (PSAP) module expands the effective receptive field through multi-scale atrous (dilated) convolutions with shared parameters. Together, these components afford flexible contextual expansion without undue complexity. This design bolsters multi-scale feature representation and improves detection fidelity across scales, with only a minimal computational overhead, a trade-off deemed acceptable in UAV applications where comprehensive multi-scale perception is crucial for operational safety and effectiveness.
Method
The proposed detection framework consists of three main components: the Backbone, the Neck, and the Head. After the input image undergoes preprocessing and two successive down-sampling operations, the Backbone performs dynamic feature extraction using the Localized Feature Mapping (LFM) unit, which provides adaptive receptive fields for effective feature representation across multiple spatial scales. A multi-level feature extraction module further enhances gradient flow, while the LFM dynamically assigns weights and adapts spatially to refine feature representations. To improve multi-scale integration, the traditional Spatial Pyramid Pooling-Fast (SPPF) is replaced by a Progressive Shared Atrous Pyramid (PSAP), which enables efficient fusion of spatial information from different receptive field sizes through parameter sharing across atrous convolutions with varying dilation rates.
In the Neck, a bidirectional feature pyramid network fuses multi-level features by combining low-level C2f features with high-level PSAP features via cross-layer connections. This is followed by up-sampling, feature concatenation, and C2f-based refinement to progressively recover spatial details, resulting in three feature maps (P3, P4, and P5) at different resolutions. This process alleviates semantic dilution in higher-level feature maps and ensures that both spatial details and semantic information are preserved for subsequent tasks.
Finally, the Head introduces a Dynamic Decoupling Head (DDH), which separates feature streams for classification and regression tasks, thereby avoiding task interference and improving task-specific learning. Combined with Dynamic Deformable Convolution v2 (DyDCNv2), the DDH enables adaptive geometric alignment of features with target objects, further enhancing localization precision. By integrating dynamic feature extraction in the Backbone, structural optimization in the Neck, and task-level decoupling in the Head, the proposed pipeline maintains gradient consistency, improves small-object detection accuracy in remote sensing imagery, and ensures efficient real-time inference.
Localized feature mapping units (LFM)
As shown in Fig. 3, we incorporate Localized Feature Mapping units (LFM) in the backbone, which replace the traditional convolutional blocks (CBS components). The LFM unit is a specialized convolutional block designed to enhance feature extraction ability. Furthermore, the C2f modules are adapted into LFM-C2f variants, where the internal Bottleneck structures are replaced with Bottleneck-LFM blocks. The primary distinction is the replacement of the traditional convolutional layer in the Bottleneck-LFM with an LFM unit, thereby boosting receptive field awareness while maintaining computational efficiency.
Fig. 3.
Overall framework of MFA-YOLO.
The Localized Feature Mapping (LFM) unit is designed to enhance the network’s sensitivity to local receptive field variations, addressing the limitations of standard convolutions that rely on globally shared kernel weights. In the context of UAV-based small object detection, LFM adaptively reweights features within each local region (receptive field) of the input feature map, thereby assigning position-specific importance to better capture small object cues.
As illustrated in Fig. 4, the module operates on an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ and produces an output feature map of the same spatial dimensions. It consists of two parallel branches: a weight mapping branch and a feature extraction branch, followed by a fusion step. The weight mapping branch computes a set of weights for each $k \times k$ receptive field (with k denoting the convolution kernel size), while the feature extraction branch extracts the corresponding local features. These two outputs are then fused to yield an adaptive convolutional feature mapping with locally varying weights.
Fig. 4.
Structure of the localized feature mapping (LFM) unit, illustrating its adaptive weighting mechanism that enhances local receptive field representation for small-object features.
Receptive field feature extraction
In the feature extraction branch, a grouped $k \times k$ convolution $W_{RF}$ is applied to the input $X$ to obtain the receptive-field feature map $X_{RF}$. This operation expands each position's neighborhood into a larger vector representation, effectively collecting the features from each $k \times k$ region of X into separate channels. Formally, we have:

$$X_{RF} = W_{RF}(X) \in \mathbb{R}^{B \times Ck^2 \times H \times W} \tag{1}$$

where B is the batch size. In practice, grouping is used in $W_{RF}$ to maintain efficiency, generally setting the grouping number $g = C$, and finally $X_{RF}$ retains a total of $Ck^2$ feature channels. This is analogous to an unfold operation that extracts each local patch, but implemented via convolution for speed. The result $X_{RF}$ can be reshaped into $\mathbb{R}^{B \times C \times k^2 \times H \times W}$; for each spatial location $(h, w)$ in the feature map, $X_{RF}$ contains a $k^2$-length vector of local features.
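As a concrete illustration, the patch expansion behind Eq. (1) can be sketched with a plain NumPy unfold. The helper below is an illustrative stand-in (single batch element, zero padding), not the paper's grouped-convolution implementation:

```python
import numpy as np

def unfold_patches(x, k):
    """Expand each k-by-k neighborhood of x (C, H, W) into the channel
    dimension, mimicking the patch-collection effect of Eq. (1).
    Returns an array of shape (C, k*k, H, W); borders are zero-padded."""
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty((c, k * k, h, w), dtype=x.dtype)
    for m in range(k * k):
        dy, dx = divmod(m, k)          # position m inside the k-by-k window
        out[:, m] = xp[:, dy:dy + h, dx:dx + w]
    return out

x = np.arange(16, dtype=float).reshape(1, 4, 4)
x_rf = unfold_patches(x, 3)
print(x_rf.shape)        # (1, 9, 4, 4): each location holds a 9-length patch vector
print(x_rf[0, 4, 2, 2])  # center offset (m = 4) reproduces x[0, 2, 2] -> 10.0
```

Each spatial location now carries a $k^2$-length vector of its local neighborhood, matching the reshaped form of $X_{RF}$ described above.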
Weight mapping generation
In parallel, the LFM module computes an attention map over the receptive-field features to gauge the importance of each local position. To reduce computation, we first aggregate the information within each receptive field. In particular, we compress the channel information for each patch location by average pooling: for the m-th position in the $k \times k$ window (where $m = 1, \ldots, k^2$), we take the mean across all C feature channels in $X_{RF}$, yielding an intermediate map $A \in \mathbb{R}^{B \times k^2 \times H \times W}$ that summarizes each patch position's response. Next, a $k \times k$ grouped convolution is applied to A to produce the adaptive weight mapping Z; we compute:

$$Z = W_{Z}(A) \in \mathbb{R}^{B \times k^2 \times H' \times W'} \tag{2}$$

where $W_{Z}$ is the $k \times k$ convolution operator and $H', W'$ are the output spatial dimensions of Z. This convolution allows interaction among the $k^2$ positions of each local receptive field, thereby enabling a refined and adaptive generation of position-wise weights. Importantly, the $k \times k$ convolution is implemented with proper grouping to maintain computational efficiency while ensuring rich interaction between the kernel positions.

Then, after reshaping the dynamic weight mapping Z as $\mathbb{R}^{B \times 1 \times k^2 \times H \times W}$, a softmax function is applied across the $k^2$ positions of each receptive field to normalize the response values and compute an attention map M. Specifically, for the element $(c, m, h, w)$, we compute the normalized weight coefficient as:

$$M_{c,m,h,w} = \frac{\exp(Z_{m,h,w})}{\sum_{m'=1}^{k^2} \exp(Z_{m',h,w})} \tag{3}$$

where $M_{c,m,h,w}$ is the normalized weight coefficient for the m-th position. Additionally, the attention map M has dimensions $B \times C \times k^2 \times H \times W$, with the weights broadcast across the C channels.
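A minimal sketch of the pooling-and-softmax pipeline of Eqs. (2)-(3) follows; for brevity it omits the learned $k \times k$ grouped convolution $W_{Z}$ (the softmax is applied directly to the pooled map), so it illustrates the normalization structure rather than the trained module:

```python
import numpy as np

def position_attention(x_rf):
    """Sketch of Eqs. (2)-(3): average the C channels of the expanded
    features (C, k*k, H, W) to one response per patch position, then
    softmax-normalize across the k*k positions. The learned grouped
    convolution W_Z between the two steps is omitted here."""
    a = x_rf.mean(axis=0, keepdims=True)     # channel average -> (1, k*k, H, W)
    a = a - a.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(a)
    m = e / e.sum(axis=1, keepdims=True)     # softmax over the k*k positions
    return np.broadcast_to(m, x_rf.shape)    # broadcast weights to all channels

rng = np.random.default_rng(0)
x_rf = rng.standard_normal((8, 9, 4, 4))     # C=8, k=3
m = position_attention(x_rf)
print(m.shape)                               # (8, 9, 4, 4)
print(bool(np.allclose(m.sum(axis=1), 1.0))) # per-field weights sum to 1 -> True
```

The resulting map assigns each of the $k^2$ patch positions a non-negative weight that sums to one within its receptive field, as Eq. (3) requires.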
Reweighting and feature fusion
The learned weight mapping M is then used to reweight the receptive-field feature map $X_{RF}$. Let $X_{RF}^{(c,m,h,w)}$ denote the component of $X_{RF}$ corresponding to the m-th patch position of channel c at location $(h, w)$. We compute the weighted feature map $X_{W}$ by applying the attention coefficients to each local feature in $X_{RF}$:

$$X_{W} = M \odot X_{RF} \tag{4}$$

where $\odot$ is element-wise multiplication. In this way, features at different positions within each receptive field are adaptively scaled.
Rearrangement and convolutional projection
To recombine the weighted local features into the final output map $X_{out}$, the $k^2$ expanded features are first rearranged into their corresponding spatial positions, then passed through a convolution. We perform a reshape operation $\mathcal{R}(\cdot)$ that is essentially the inverse of the unfolding operation: it maps $X_{W} \in \mathbb{R}^{B \times C \times k^2 \times H \times W}$ to a feature map $Y \in \mathbb{R}^{B \times C \times kH \times kW}$ by tiling the $k^2$ features back into a $k \times k$ neighborhood around each original location. In other words, each position in the $k \times k$ receptive field, which was previously represented along the channel dimension, is now relocated to its corresponding spatial offset in an expanded feature map Y. After this adjustment of shape, the feature map's spatial resolution increases by a factor of k, which means that height and width become $kH$ and $kW$, while the channel dimension returns to C. Finally, we apply a standard convolution $W_{out}$ of kernel size $k \times k$ with stride k to Y to produce the output feature map:

$$X_{out} = W_{out}(\mathcal{R}(X_{W})) \tag{5}$$

where the stride-k convolution $W_{out}$ effectively aggregates each $k \times k$ block in Y into a single output location, which corresponds to a particular original receptive field, restoring the feature map to its original spatial size ($H \times W$). Notably, because the adaptive weight features in Y are arranged as non-overlapping patches, the attention weights are applied locally within each patch and are not shared across different patches; thus the combination of dynamic weighting and the final convolution yields an effective position-specific convolution kernel for each receptive field.
The output $X_{out}$ represents the feature map produced by the LFM unit. In summary, the LFM unit enhances a conventional convolution by incorporating a learned, receptive-field based reweighting mechanism. By generating a reweighting map for each local receptive field and applying it to the extracted features before the final convolution, the LFM produces adaptive local filtering that mitigates the limitations of globally shared kernel weights. This enables more effective feature extraction, particularly for small objects, as fine details are accentuated within their local contexts. As a result, subsequent detection layers are better equipped to recognize small targets with higher accuracy.
Notably, this process is achieved with minimal computational overhead and parameter increase, as the reweighting mechanism relies on simple operations such as averaging and grouped convolutions. Consequently, the LFM module provides a powerful yet computationally efficient enhancement for the MFA-YOLO architecture, enabling the network to focus on the most relevant local features and significantly improving its performance in detecting small objects within UAV imagery.
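The fold-and-project step of Eqs. (4)-(5) can likewise be sketched in NumPy. This is an illustrative stand-in: a mean over each $k \times k$ block replaces the learned stride-k convolution $W_{out}$, and shapes are simplified to a single batch element:

```python
import numpy as np

def fold_and_project(x_w, k):
    """Sketch of Eqs. (4)-(5): tile the weighted patch features
    (C, k*k, H, W) into an expanded (C, k*H, k*W) map Y, then aggregate
    each non-overlapping k-by-k block back to one output location.
    A block mean stands in for the learned stride-k convolution W_out."""
    c, kk, h, w = x_w.shape
    assert kk == k * k
    # rearrange: position index m -> spatial offset (dy, dx) inside its block
    y = (x_w.reshape(c, k, k, h, w)
            .transpose(0, 3, 1, 4, 2)       # (C, H, dy, W, dx)
            .reshape(c, k * h, k * w))      # expanded map Y
    # stride-k aggregation over each k-by-k block
    return y.reshape(c, h, k, w, k).mean(axis=(2, 4))

# each patch position m carries the constant value m; every block mean is 4.0
x_w = np.tile(np.arange(9.0).reshape(1, 9, 1, 1), (2, 1, 4, 4))
out = fold_and_project(x_w, 3)
print(out.shape)   # (2, 4, 4): original spatial size restored
```

Because the blocks in Y are non-overlapping, the aggregation acts strictly within each receptive field, mirroring the position-specific behavior described above.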
Progressive shared atrous pyramid (PSAP)
As shown in Fig. 5, the Progressive Shared Atrous Pyramid (PSAP) module is proposed as an efficient replacement for the traditional Spatial Pyramid Pooling (SPP) module, introducing a learnable multi-scale feature extraction mechanism with minimal overhead. Unlike SPPF, which relies on a sequence of fixed large-kernel max-pooling operations to approximate multi-scale context, PSAP employs dilated convolutions to capture a pyramid of receptive fields in a trainable manner. It also differs from the Atrous Spatial Pyramid Pooling (ASPP) approach, which uses multiple parallel dilated convolutional branches each with its own set of weights. Instead, PSAP reuses a single convolutional kernel across progressively increasing dilation rates, significantly reducing the number of parameters and computations while still capturing multi-scale information. This progressive, shared-kernel strategy enables PSAP to capture both local and long-range features more effectively, yielding notable improvements over SPPF’s fixed pooling approach. While PSAP incurs only a modest increase in computational overhead, it provides a substantial improvement in feature representation capability.
Fig. 5.
The structure of progressive shared atrous pyramid (PSAP).
The PSAP module processes an input feature map $X \in \mathbb{R}^{C \times H \times W}$ in three main stages:

A $1 \times 1$ pointwise convolution is applied to reduce the channel dimension from C to $C'$ (typically $C' = C/2$), yielding a compressed feature map of size $C' \times H \times W$. This step significantly lowers the computational cost of subsequent operations.
The compressed feature map is then fed through a sequence of three atrous convolutional layers, each with a $3 \times 3$ kernel. The dilation rates for these layers are predefined as $d_1 = 1$, $d_2 = 2$, and $d_3 = 3$. All these three convolutions share the same kernel weights, ensuring parameter efficiency and consistency in feature extraction across the different receptive fields. Each convolution uses a stride of 1 and appropriate padding, preserving the spatial resolution. The dilation rate d and kernel size k determine the effective receptive field RF as follows:

$$RF = (k - 1) \times d + 1 \tag{6}$$
Figure 6 illustrates that for a kernel size $k = 3$ and dilation rates $d = 1, 2, 3$, the effective receptive fields are $3 \times 3$, $5 \times 5$, and $7 \times 7$, respectively. This means the first such convolution focuses on local details, the second captures mid-range context, and the third covers long-range dependencies.
Fig. 6.
Atrous convolution with different dilation rates.
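The receptive-field relation of Eq. (6) is straightforward to verify numerically:

```python
def effective_rf(k, d):
    """Effective receptive field of a k-by-k atrous convolution with
    dilation rate d, per Eq. (6): RF = (k - 1) * d + 1."""
    return (k - 1) * d + 1

# a 3x3 kernel with dilation rates 1, 2, 3 spans 3x3, 5x5 and 7x7 fields
print([effective_rf(3, d) for d in (1, 2, 3)])   # [3, 5, 7]
```

Sharing one kernel across increasing dilation rates thus widens the context window without adding parameters.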
The output feature maps from the initial $1 \times 1$ reduction and all dilated convolutional stages each have dimensions $C' \times H \times W$. These are concatenated along the channel dimension to form a multi-scale representation of size $4C' \times H \times W$. Finally, another $1 \times 1$ convolution is applied to restore the channel dimension to C, producing the PSAP output $X_{out} \in \mathbb{R}^{C \times H \times W}$.
By confining the heavier $3 \times 3$ convolutions to the reduced-channel feature map, reusing a single kernel across multiple dilation rates, and integrating features from various receptive fields (via the concatenation of the initial and dilated features), the PSAP module learns a rich multi-scale representation with minimal computational and parameter overhead. This design effectively combines fine-grained details with broad contextual information, achieving a better trade-off between complexity and multi-scale feature representation compared to SPPF.
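The parameter saving from kernel sharing can be made concrete with a back-of-the-envelope count. The channel sizes below are illustrative assumptions, not values from the paper:

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a single k-by-k convolution (bias ignored)."""
    return c_in * c_out * k * k

c_red, k, branches = 128, 3, 3   # assumed reduced channels and three dilation rates

# ASPP-style pyramid: each dilated branch owns its own 3x3 kernel
aspp = branches * conv_params(c_red, c_red, k)
# PSAP-style pyramid: one shared 3x3 kernel reused across all dilation rates
psap = conv_params(c_red, c_red, k)
print(aspp, psap, aspp // psap)   # the shared kernel cuts the 3x3 cost by 3x
```

With three dilation rates, the shared-kernel design needs one third of the $3 \times 3$ parameters of an ASPP-style pyramid, independent of the chosen channel width.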
Dynamic decoupling head (DDH)
To overcome the feature conflict and information isolation caused by the independent design of classification and localization branches in conventional detection frameworks, we propose the Dynamic Decoupling Head (DDH). Unlike methods such as FCOS42 and TOOD43, which mainly focus on label assignment, as demonstrated in Fig. 7, DDH achieves task alignment not only in the assignment strategy but also through a dynamically adaptive head structure, enabling deep collaborative optimization between classification and localization. As a result, the proposed DDH substantially improves the detection of small objects in complex scenarios, while simultaneously enhancing localization accuracy.
Fig. 7.
The structure of dynamic decoupling head (DDH).
The DDH detection head operates on multi-scale feature maps \(X^{i}\) (\(i = 1, 2, 3\)) obtained from the feature fusion neck. Each feature map corresponds to small, medium, and large object scales, respectively.
Shared convolutional refinement
Each feature map \(X^{i}\) is first processed through a series of shared convolutional layers with Group Normalization (GN) and SiLU activation. Using the same refinement layers across all scales stabilizes the regression behavior and provides consistent feature processing. More importantly, this weight-sharing design substantially reduces the number of parameters in the head, as it avoids separate sets of convolutions for each scale. In contrast to conventional decoupled heads in YOLOv8 that allocate independent convolution layers per scale, our shared refinement ensures efficiency and uniform feature quality across scales, which is particularly beneficial for robust small-object detection.
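The weight-sharing idea can be sketched as a single conv + GroupNorm + SiLU stack applied to every scale's feature map; the stack depth and group count below are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class SharedRefine(nn.Module):
    """Sketch of the shared refinement stack: one set of layers
    refines the small-, medium- and large-scale maps alike."""

    def __init__(self, c: int, depth: int = 2, groups: int = 16):
        super().__init__()
        self.block = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1),
                nn.GroupNorm(groups, c),   # GN, as used in the head
                nn.SiLU(inplace=True),
            ) for _ in range(depth)
        ])

    def forward(self, feats):
        # The same weights process all scales -> one parameter budget.
        return [self.block(f) for f in feats]

feats = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
refined = SharedRefine(64)(feats)
print([f.shape[-1] for f in refined])  # [80, 40, 20]
```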
Task decomposition module
A key component of the proposed detection head (DDH) is the Task Decomposition (TD) module, which adaptively splits the shared features into classification- and regression-specific representations while preserving inter-task interaction. The TD module first extracts a global context vector \(W\) from the shared feature map through global average pooling, followed by two successive \(1 \times 1\) convolutions with ReLU and sigmoid activations:

\(W = \sigma\big(\mathrm{Conv}_{2}(\mathrm{ReLU}(\mathrm{Conv}_{1}(\mathrm{GAP}(X^{i}))))\big)\)  (7)
where \(\mathrm{Conv}_{1}\) and \(\mathrm{Conv}_{2}\) denote two distinct \(1 \times 1\) convolutions, ReLU is the rectified linear unit, and \(\sigma\) is the sigmoid function. Guided by \(W\), the TD module then decomposes the shared feature \(X^{i}\) at scale \(i\) into two branches:
\(X^{i}_{\mathrm{cls}} = W_{\mathrm{cls}} \odot X^{i}, \qquad X^{i}_{\mathrm{reg}} = W_{\mathrm{reg}} \odot X^{i}\)  (8)

where \(X^{i}_{\mathrm{cls}}\) and \(X^{i}_{\mathrm{reg}}\) are the task-specific features for classification and regression, respectively, and \(W_{\mathrm{cls}}\) and \(W_{\mathrm{reg}}\) are the task-specific weights derived from \(W\). Essentially, the TD module functions as a layer-attention mechanism: \(X^{i}_{\mathrm{cls}}\) emphasizes category-discriminative context, whereas \(X^{i}_{\mathrm{reg}}\) highlights fine-grained localization cues.
The TD module mitigates the inherent conflict between classification and regression by reallocating feature capacity under the guidance of W. It enhances task-specific representations while still preserving shared context, thereby reducing destructive interference and improving detection accuracy.
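A compact sketch of the TD gating path, assuming one TD instance per task (classification and regression) and a channel reduction of 4 in the bottleneck; both are illustrative choices, not confirmed by the paper.

```python
import torch
import torch.nn as nn

class TaskDecomposition(nn.Module):
    """Sketch of the TD module: GAP -> 1x1 conv -> ReLU -> 1x1 conv
    -> sigmoid produces a context vector W that gates the shared
    feature into a task-specific branch."""

    def __init__(self, c: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # global average pooling
            nn.Conv2d(c, c // reduction, 1),  # Conv1
            nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1),  # Conv2
            nn.Sigmoid(),                     # W in [0, 1]
        )

    def forward(self, x):
        w = self.gate(x)   # (N, C, 1, 1) channel-wise weights
        return x * w       # task-specific re-weighted feature

shared = torch.randn(2, 64, 20, 20)
x_cls = TaskDecomposition(64)(shared)  # classification branch
x_reg = TaskDecomposition(64)(shared)  # regression branch
print(x_cls.shape)  # torch.Size([2, 64, 20, 20])
```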
Classification alignment
After task decomposition, the classification feature \(X^{i}_{\mathrm{cls}}\) is further refined through an alignment process guided by the shared feature \(X^{i}\). Specifically, \(X^{i}\) is passed through a \(1 \times 1\) convolution followed by ReLU activation, and then through a \(3 \times 3\) convolution with a sigmoid to generate a probability map \(M^{i}\). This map acts as an adaptive modulation factor, re-weighting the classification feature in a spatially varying manner. The aligned classification representation is finally obtained by applying a \(1 \times 1\) convolution:

\(Z^{i}_{\mathrm{cls}} = \mathrm{Conv}\big(M^{i} \odot X^{i}_{\mathrm{cls}}\big)\)  (9)
This operation adaptively emphasizes discriminative regions while suppressing irrelevant responses, thereby achieving more stable and semantically consistent classification predictions.
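The alignment step can be sketched as follows; the bottleneck width and kernel sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ClsAlignment(nn.Module):
    """Sketch of classification alignment: the shared feature yields a
    spatial probability map that re-weights the classification feature
    before a final 1x1 projection."""

    def __init__(self, c: int):
        super().__init__()
        self.prob = nn.Sequential(
            nn.Conv2d(c, c // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // 4, 1, 3, padding=1), nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(c, c, 1)

    def forward(self, shared, x_cls):
        m = self.prob(shared)        # (N, 1, H, W) probability map
        return self.proj(x_cls * m)  # spatially modulated, then 1x1 conv

shared = torch.randn(1, 64, 20, 20)
x_cls = torch.randn(1, 64, 20, 20)
z = ClsAlignment(64)(shared, x_cls)
print(z.shape)  # torch.Size([1, 64, 20, 20])
```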
Dynamic deformable convolution (DyDCNv2)
To further enhance localization precision, the regression features \(X^{i}_{\mathrm{reg}}\) are fed into a deformable convolution module with dynamic offsets and modulation masks. The deformable convolution enables adaptive receptive fields, crucial for modeling object boundaries and shapes that are irregular or partially occluded. In our design, the offsets \(O\) and masks \(M\) for the deformable convolution are not learned statically by the convolutional filters themselves as in traditional DCNv244, but are predicted by an external lightweight network conditioned on the current features and global context. The regression feature update can be written as:

\(\tilde{X}^{i}_{\mathrm{reg}} = \mathrm{DCN}\big(X^{i}_{\mathrm{reg}}; O, M\big)\)  (10)

where \(O\) and \(M\) are the externally predicted offset and mask fields applied to \(X^{i}_{\mathrm{reg}}\)
. By decoupling the offset and mask prediction from the convolution module, our approach differs from conventional DCNv2, which adaptively derives offsets and masks through internal layers during feature transformation. Utilizing externally derived O and M offers two advantages. First, it reduces the parameter and computational overhead associated with learning offsets and masks within each convolutional layer. Second, it ensures that the sampling adjustment is guided by global or task-level cues, leading to more stable and semantically aligned deformation. This dynamic deformable convolution enables the model to focus on an object’s true extent, as sampling points can shift towards object corners or occluded regions, thereby improving bounding box regression, particularly for small or irregularly shaped targets.
Task interaction and scale adaptation
After task decomposition and deformable refinement, the two branches are further processed and then used to predict outputs. In the classification branch, a dynamic feature fusion strategy influenced by the global context \(W\) and potential information from the regression branch is applied to strengthen category discrimination. In the regression branch, the refined features \(\tilde{X}^{i}_{\mathrm{reg}}\) yield more accurate box predictions. Importantly, cross-branch interaction is encouraged to enable localization to benefit from semantic cues. For instance, the presence of class-specific features can implicitly influence the regression confidence or bounding box shape through the shared context \(W\) and the joint optimization objectives. Such collaborative interaction facilitates the simultaneous refinement of classification confidence and localization accuracy, thereby mitigating potential misalignments in which the classifier responds to an object that the regressor fails to localize precisely.
Moreover, DDH introduces a scale adjustment mechanism to handle objects of different sizes more effectively. For each detection scale \(i\), we include a learnable scaling factor \(\alpha_{i}\) that adjusts the magnitude of the regression features:

\(\hat{X}^{i}_{\mathrm{reg}} = \alpha_{i} \cdot \tilde{X}^{i}_{\mathrm{reg}}\)  (11)

where \(\hat{X}^{i}_{\mathrm{reg}}\) is the final adjusted regression feature for scale \(i\). These scale-specific coefficients \(\alpha_{i}\) allow the network to recalibrate the feature responses per output level, compensating for the inherent differences in object size and feature map resolution. This mechanism can enhance the response for small-object features on the finest scale or attenuate noisy high-frequency details on larger scales. By learning \(\alpha_{i}\) during training, the detector becomes more adaptive to scale variations, which is particularly beneficial for detecting tiny objects that require a higher sensitivity at the corresponding prediction scale.
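The scale adjustment of Eq. (11) reduces to one learnable scalar per detection level, initialized to 1.0 here as an assumption:

```python
import torch
import torch.nn as nn

class ScaleAdjust(nn.Module):
    """Sketch of the per-scale adjustment: a learnable scalar alpha_i
    rescales the regression features of each output level."""

    def __init__(self, num_scales: int = 3):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_scales))  # alpha_i

    def forward(self, feats):
        # feats: list of regression feature maps, one per scale i.
        return [self.alpha[i] * f for i, f in enumerate(feats)]

feats = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
adjusted = ScaleAdjust(3)(feats)
print([f.shape[-1] for f in adjusted])  # [80, 40, 20]
```

Because each \(\alpha_i\) is trained jointly with the rest of the head, the finest level can amplify small-object responses while coarser levels damp theirs.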
The proposed DDH integrates task-aware feature decomposition, dynamic deformable convolution, and multi-scale feature adaptation into a unified head architecture. Through the synergy of these mechanisms, the model effectively overcomes challenges such as complex backgrounds, scale variation, and irregular object shapes. The shared and dynamic design of DDH enhances detection performance for small targets in cluttered scenarios while maintaining high parameter efficiency. By consolidating multiple prediction heads into a single dynamic head, the architecture becomes more compact and computationally efficient. In contrast to YOLOv8, which employs independent decoupled sub-networks for each scale, our approach leverages shared layers and adaptive modules to substantially reduce parameter count, and the collaborative optimization of classification and localization yields more precise and robust detection in demanding applications (Table 1).
Table 1.
Comparative experimental results on VisDrone dataset.
| Model | AP | AP50 | AP75 | APs | APm | APl | Param(M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv5n | 18.1 | 33.9 | 17.2 | 11.5 | 31.6 | 36.8 | 2.5 | 7.1 | 209 |
| YOLOv5s | 22.3 | 39.9 | 20.5 | 15.1 | 37.8 | 38.1 | 9.1 | 23.8 | 156 |
| YOLOv5m | 24.1 | 42.3 | 22.5 | 17.3 | 40.5 | 44.2 | 25.0 | 64.0 | 103 |
| YOLOv8n | 18.4 | 34.2 | 17.1 | 11.8 | 31.3 | 37.2 | 3.0 | 8.1 | 220 |
| YOLOv8s | 22.1 | 40.1 | 20.6 | 15.2 | 37.7 | 37.8 | 11.1 | 28.5 | 192 |
| YOLOv8m | 23.9 | 42.7 | 22.4 | 17.2 | 39.9 | 44.5 | 25.8 | 78.7 | 95 |
| RT-DETR-R18 | 24.5 | 43.8 | 23.9 | 17.9 | 43.1 | 46.8 | 20.0 | 60.0 | 98 |
| RT-DETR-R34 | 25.2 | 44.5 | 24.1 | 18.1 | 42.6 | 51.2 | 31.0 | 92.0 | 84 |
| YOLOv10n29 | 18.5 | 34.6 | 17.5 | 12.3 | 31.5 | 37.4 | 2.2 | 6.5 | 197 |
| YOLOv10s | 22.7 | 40.7 | 21.4 | 15.1 | 38.0 | 38.6 | 7.2 | 21.4 | 131 |
| YOLOv10m | 24.5 | 43.8 | 23.5 | 17.6 | 41.5 | 44.9 | 15.3 | 58.9 | 94 |
| YOLOv11n | 18.2 | 34.6 | 17.0 | 11.3 | 31.7 | 37.5 | 2.6 | 6.3 | 205 |
| YOLOv11s | 22.3 | 40.6 | 20.9 | 15.6 | 38.4 | 38.9 | 9.4 | 21.3 | 141 |
| YOLOv11m | 24.3 | 43.1 | 22.2 | 16.9 | 39.5 | 44.2 | 20.0 | 67.7 | 83 |
| YOLOv13n31 | 13.7 | 34.6 | 17.8 | 12.3 | 32.2 | 37.3 | 2.5 | 6.2 | 221 |
| YOLOv13s | 22.4 | 40.5 | 21.4 | 15.9 | 37.8 | 40.1 | 9.0 | 20.1 | 159 |
| LUD-YOLO | 19.3 | 35.2 | 18.5 | 13.7 | 34.9 | 37.1 | 2.8 | 9.3 | 89 |
| SAD-YOLO | 19.1 | 34.1 | 17.9 | 12.8 | 33.7 | 36.2 | 3.4 | 10.1 | 76 |
| KL-YOLO | 20.5 | 37.5 | 19.1 | 13.5 | 35.8 | 37.8 | 3.1 | 19.6 | 23 |
| MFA-YOLOn | 20.8 | 37.8 | 19.3 | 14.1 | 36.2 | 38.4 | 2.5 | 9.1 | 125 |
| MFA-YOLOs | 23.9 | 42.6 | 22.4 | 16.8 | 40.5 | 42.9 | 9.6 | 34.0 | 85 |
| MFA-YOLOm | 25.5 | 45.0 | 24.3 | 18.3 | 43.4 | 50.5 | 24.2 | 100.6 | 49 |
Experiment and results
Design of experiments
The hardware environment for our experiments comprised an NVIDIA RTX 3090 GPU with 24 GB of memory and a 14-vCPU Intel(R) Xeon(R) Gold 6330 CPU @ 2.00 GHz, which was utilized for both training and FPS testing. The system ran Ubuntu 20.04 with Python 3.8.10, PyTorch 2.0.0+cu118, torchvision 0.15.1+cu118, and ultralytics 8.2.50 (CUDA 11.8). The input image resolution was 640 × 640 pixels, and the model was trained for 300 epochs with a batch size of 8. We used the SGD optimizer with a momentum of 0.937 and an initial learning rate of 0.01. A warm-up phase was applied during the first 3 epochs, followed by a learning-rate scheduler. Weight decay was set to 0.0005. To enhance efficiency, we employed automatic mixed precision (AMP) during training. For reproducibility, the random seed was set to 0 and deterministic mode was enabled. Additional hyperparameters included a box loss gain of 7.5, a class loss gain of 0.5, and a distribution focal loss gain of 1.5, with no data augmentation techniques such as mosaic, mixup, or copy-paste applied (set to 0.0 or disabled). The base model was initialized from YOLOv8s pretrained weights.
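The hyperparameters above can be collected into one configuration; the keys follow the Ultralytics 8.x training-argument names, and the commented launch line (model file and dataset YAML) is a placeholder, not an artifact shipped with the paper.

```python
# Training configuration mirroring the reported experimental setup.
train_cfg = {
    "imgsz": 640,            # 640 x 640 input resolution
    "epochs": 300,
    "batch": 8,
    "optimizer": "SGD",
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 3,
    "amp": True,             # automatic mixed precision
    "seed": 0,
    "deterministic": True,
    "box": 7.5,              # box loss gain
    "cls": 0.5,              # class loss gain
    "dfl": 1.5,              # distribution focal loss gain
    "mosaic": 0.0,           # augmentations disabled
    "mixup": 0.0,
    "copy_paste": 0.0,
}

# Under the Ultralytics API this would be launched roughly as:
# from ultralytics import YOLO
# YOLO("yolov8s.pt").train(data="VisDrone.yaml", **train_cfg)
print(len(train_cfg))
```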
In our experiments, we employed the VisDrone and UAVDT datasets, both of which contain aerial images captured by unmanned aerial vehicles (UAVs). These datasets are widely used in UAV-based small-object detection due to their high object density, diverse scenes, and complex imaging conditions. Representative raw images from the two datasets are presented in Fig. 8, which highlights their suitability for evaluating small-object detection algorithms. The VisDrone training set comprises 6,471 images, while the validation and test sets consist of 548 and 1,610 images, respectively. Each annotated object has an average size of approximately 51 pixels, making the dataset a widely adopted benchmark for small-object detection. Furthermore, to ensure a fair evaluation and to assess the robustness of the proposed MFA-YOLO model, we conducted additional experiments on the UAVDT dataset, which includes three primary categories: car, pedestrian, and non-motorized vehicle (such as bicycle, tricycle, etc.). For UAVDT, we divided the dataset into training, validation, and test sets in an 8:1:1 ratio, thereby validating the generalization capability of the proposed MFA-YOLO model.
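The 8:1:1 split used for UAVDT can be reproduced with a seeded shuffle; the fixed seed below matches the reproducibility setting described above, while the exact split procedure is our assumption.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle once with a fixed seed, then slice into train/val/test
    according to the given ratios (8:1:1 for UAVDT)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```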
Fig. 8.
Visualization of VisDrone and UAVDT dataset.
Comparison experiment
VisDrone dataset
The VisDrone dataset constitutes a demanding benchmark featuring a high density of small objects within aerial imagery, thereby serving as an appropriate evaluation platform for assessing small-object detection proficiency in the proposed model. On this dataset, the MFA-YOLO framework manifests significant enhancements in small-target detection relative to conventional detectors, encompassing contemporary UAV-centric approaches. Specifically, the medium-sized MFA-YOLOm variant yields an average precision for small objects (AP\(_s\)) of 18.3%, surpassing YOLOv5m (17.3%) by 1.0 percentage points and YOLOv10m (17.6%) by 0.7 percentage points. Moreover, it substantially exceeds specialized UAV detectors, including LUD-YOLO (13.7%), SAD-YOLO (12.8%), and KL-YOLO (13.5%). These sustained advancements underscore the efficacy of the multi-scale feature aggregation strategy in bolstering small-object detection, attributable to the synergistic operation of the Local Feature Mapping (LFM) unit, Progressive Shared Atrous Pyramid (PSAP) module, and Dynamic Decoupling Head (DDH), which adeptly extract fine-grained details amidst intricate aerial backgrounds.
Beyond accuracy, MFA-YOLOm exhibits commendable efficiency concerning model complexity. It sustains a parameter count of 24.2 million and a memory footprint commensurate with other medium-sized YOLO architectures, indicative of robust parameter efficiency. Although the integrated UAV-specific baselines, such as LUD-YOLO (2.8 million parameters, 9.3 GFLOPs), SAD-YOLO (3.4 million parameters, 10.1 GFLOPs), and KL-YOLO (3.1 million parameters, 19.6 GFLOPs), are engineered for low-resource environments and thus lighter in design, the superior AP\(_s\) of MFA-YOLOm illustrates that its advanced feature fusion mechanism delivers enhanced performance on small objects, albeit at an elevated computational expense (100.6 GFLOPs) and diminished frames per second (49 FPS) when contrasted with these nano-scale counterparts.
To address the speed-accuracy trade-off, note that the increased GFLOPs in MFA-YOLO stem from PSAP’s multi-scale aggregation and DDH’s dynamic operations, which enrich feature representation but elevate computational demands. For instance, compared to YOLOv8m (78.7 GFLOPs, 95 FPS), MFA-YOLOm’s 100.6 GFLOPs and 49 FPS represent a 27.8% increase in load and a 48.4% drop in speed. In practical UAV terms, this translates to inference times of approximately 20 ms per frame on high-end GPUs such as the RTX 3090, remaining viable for real-time applications requiring 30 to 60 FPS, such as public safety monitoring. However, on edge devices such as the Jetson Nano, the latency may pose challenges, potentially reducing effective FPS below 30. Nonetheless, the 1.1-point AP\(_s\) gain justifies this compromise in safety-critical scenarios, where improved small-object recall prevents costly misses (e.g., detecting pedestrians in dense traffic). Future optimizations, such as model quantization and pruning, have the potential to further reduce computational overhead while preserving detection performance. These techniques have proven effective in related YOLO variants, suggesting their applicability to our framework as well.
The augmented accuracy is attained without a disproportionate escalation in model dimensions relative to equivalent medium-scale peers, a pivotal aspect for implementation on resource-limited UAV platforms. It is acknowledged that the enriched feature fusion within MFA-YOLO engenders a moderately higher computational burden (in GFLOPs) compared to YOLOv5m or YOLOv10m; however, this compromise is substantiated by the resultant precision improvements, particularly in benchmarks against dedicated UAV methodologies that emphasize efficiency yet exhibit deficiencies in precision for scenarios involving dense, small targets. In summation, the proposed approach on VisDrone realizes preeminent small-object detection performance while upholding a lightweight configuration conducive to practical aerial deployments.
UAVDT dataset
We further evaluate MFA-YOLO’s generalization ability on the UAVDT dataset, which includes diverse aerial scenes. As shown in Table 2, the small-scale variant, MFA-YOLOs, achieves the highest overall detection accuracy with an AP of 19.5%, surpassing all other methods, including both small and nano YOLOv8 models. It also outperforms recent state-of-the-art detectors such as GFL, GLSAN, CEASC, and FBRT-YOLO. These non-YOLO baselines were carefully selected for their relevance and effectiveness in UAV-specific small-object detection tasks. Specifically, GFL (Generalized Focal Loss) enhances detection quality for small objects through advanced loss functions, making it suitable for challenging UAV imagery with scale variations. GLSAN (Global-Local Self-Adaptive Network) employs a global-local detection strategy tailored for drone-view object detection, addressing complex backgrounds and dense distributions common in aerial scenes. CEASC (Context-Enhanced Adaptive Sparse Convolution) focuses on lightweight, efficient sparse convolutions with global context enhancement, enabling faster inference on resource-constrained drone platforms while maintaining accuracy for small targets. FBRT-YOLO, although YOLO-inspired, incorporates specialized real-time optimizations for aerial image detection, providing a strong benchmark for balancing speed and precision in UAV applications. This selection ensures a comprehensive comparison against diverse architectures, validating MFA-YOLO’s superiority in both accuracy and generalization. MFA-YOLOs excels across all metrics, leading in AP50 and AP75, thereby reflecting its superior precision on the UAVDT dataset.
Table 2.
Comparative experimental results on UAVDT dataset.
Crucially, these improvements come with minimal impact on efficiency. MFA-YOLOs introduces no significant increase in parameters compared to YOLOv8s, and its inference speed remains essentially unchanged, maintaining real-time capabilities. This demonstrates that the architectural advancements, such as multi-scale feature fusion and attention integration, enhance detection performance without compromising model compactness or speed. The ultra-lightweight MFA-YOLOn variant, despite its smaller size, outperforms most non-YOLO baselines, further confirming the scalability and effectiveness of the proposed MFA-YOLO design.
When integrated with the YOLOv8s backbone, MFA-YOLOs achieves 19.5% AP, 31.5% AP50, and 21.8% AP75, surpassing the baseline YOLOv8s by +0.6 AP, +0.2 AP50, and +1.2 AP75. These consistent improvements across different IoU thresholds demonstrate the effectiveness of MFA-YOLO in enhancing detection accuracy, particularly for small and occluded objects, and in improving localization precision under stricter evaluation criteria.
Ablation experiment
Effect of key components
Each experiment in Table 3 demonstrates the effectiveness of the proposed modules in enhancing the YOLOv8n baseline for small-object detection. Starting from the baseline, the inclusion of LFM (Local Feature Mapping) brings a modest but consistent improvement, particularly benefiting small-object representation, with AP rising from 19.6% to 19.7% and AP50 from 34.6% to 36.1%. When applying PSAP (Progressive Shared Atrous Pyramid) alone, the model also shows clear gains in AP (20.0%) by effectively capturing multi-scale contextual information, though AP50 (34.7%) indicates limited improvements in localization accuracy. In contrast, the introduction of DDH (Dynamic Decoupling Head) individually yields the largest single-module enhancement, achieving AP 21.3% and AP50 35.8%, while simultaneously reducing parameters to 2.2 M, highlighting its efficiency.
Table 3.
Ablation experimental results on VisDrone dataset.
| LFM | PSAP | DDH | AP | AP50 | Params(M) |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 19.6 | 34.6 | 3.0 |
| ✓ | ✗ | ✗ | 19.7 | 36.1 | 3.1 |
| ✗ | ✓ | ✗ | 20.0 | 34.7 | 3.1 |
| ✗ | ✗ | ✓ | 21.3 | 35.8 | 2.2 |
| ✓ | ✓ | ✗ | 20.1 | 36.7 | 3.2 |
| ✓ | ✓ | ✓ | 20.8 | 37.8 | 2.5 |
Significant values are in bold.
When combining LFM and PSAP, the model benefits from complementary improvements in feature representation and multi-scale context modeling, with AP increasing to 20.1% and AP50 reaching 36.7%. Finally, the integration of all three modules—LFM, PSAP, and DDH—achieves the best overall performance, with AP improving to 20.8% and AP50 to 37.8%, while reducing parameters to 2.5 M. Compared with the baseline, this configuration improves AP by 1.2 points and AP50 by 3.2 points, alongside a reduction of 0.5 M parameters, demonstrating the synergy and efficiency of the proposed design for small object detection in UAV imagery.
Effect of multi-dilation configurations
As shown in Table 4, the dilation configuration (1, 3, 5) delivers the best recall (34.1%), precision (44.9%), AP (20.8%), and AP50 (37.8%). This progressive increase of the receptive field across successive convolutional layers enables the network to balance fine-grained details with broader contextual information, leading to overall superior detection performance compared with other dilation strategies.
Table 4.
Ablation experimental results for different dilation parameters.
| Dilation | P | R | AP | AP50 |
|---|---|---|---|---|
| (1,1,1) | 44.2 | 33.4 | 19.9 | 35.0 |
| (3,3,3) | 44.1 | 33.6 | 19.4 | 36.5 |
| (5,5,5) | 44.7 | 33.1 | 19.7 | 36.7 |
| (5,3,1) | 44.9 | 33.4 | 19.7 | 36.7 |
| (1,3,5) | 45.9 | 34.1 | 20.8 | 37.8 |
Significant values are in bold.
In contrast, the single-dilation configurations (1, 1, 1), (3, 3, 3), and (5, 5, 5) produce suboptimal results. With a uniform receptive field at every layer, these models are restricted to a fixed scale of context and cannot simultaneously capture small-scale details and broader structural patterns. This limitation explains their inferior AP scores compared to the multi-dilation approach, as the lack of multi-scale feature extraction hinders the detection of objects at varying sizes.
Notably, the reversed multi-dilation order (5, 3, 1) also performs worse than the progressive (1, 3, 5) configuration. In the reversed setting, receptive field sizes shrink in the deeper layers, which is inconsistent with the natural hierarchical feature extraction process where deeper layers are expected to capture broader context. This inconsistency likely impedes the model’s ability to integrate information across scales. In contrast, the forward progressive strategy (1, 3, 5) not only provides complementary multi-scale features but also maintains a logically consistent growth of receptive field across layers. Consequently, the (1, 3, 5) configuration delivers superior detection performance, especially for small objects in complex backgrounds.
Visualization
As shown in Fig. 9, we present comparative results of the proposed MFA-YOLO (Ours) against YOLOv8 on the VisDrone dataset, across both daytime and nighttime conditions. The figure clearly highlights the ground truth annotations as well as the detection outputs of both models. Our algorithm demonstrates a marked improvement in detection precision, outperforming YOLOv8 in both lighting scenarios.
Fig. 9.
Visualization of detection results on the VisDrone dataset across three scenes (rows) and three methods (columns). From left to right: Ground Truth (GT), YOLOv8, and MFA-YOLO (ours).
In the daytime, especially in complex environments like crowded intersections, MFA-YOLO shows enhanced ability to accurately identify and localize objects. This is critical for real-world applications that demand precise object detection in dynamic settings. In contrast, YOLOv8 tends to misidentify or fail to detect objects due to the challenges posed by high occlusion and varying object scales, a common issue in real-world surveillance and autonomous driving tasks.
During the night, MFA-YOLO continues to exhibit superior performance, even in low-visibility conditions. The night scenes in VisDrone often feature poor lighting, which complicates object detection tasks. MFA-YOLO’s robust feature extraction and enhanced detection capabilities allow it to maintain accuracy and reliability under such challenging conditions, whereas YOLOv8 struggles to maintain detection accuracy at night, especially when objects are partially obscured or in shadowed areas.
These results underscore the potential of MFA-YOLO in providing high-precision object detection across different environmental settings. The improvements achieved by our method are significant, particularly in handling complex real-world scenarios, where both day and night challenges need to be addressed. The superior performance of MFA-YOLO establishes it as a promising approach for various practical applications, such as intelligent transportation systems and surveillance.
Conclusion
In this paper, we introduced MFA-YOLO, a lightweight detection framework specifically tailored for UAV imagery with dense, small-scale, and low-resolution targets. By integrating the Local Feature Mapping (LFM) unit, Progressive Shared Atrous Pyramid (PSAP), and Dynamic Decoupling Head (DDH), the proposed model effectively addresses two central challenges in UAV-based detection: extracting discriminative semantics from cluttered backgrounds and improving the robustness of small-object localization under strict computational constraints. Experimental evaluations on the VisDrone and UAVDT benchmarks demonstrate that MFA-YOLO achieves superior detection accuracy compared to state-of-the-art lightweight detectors while maintaining a favorable balance between precision and efficiency.
Despite these promising results, there remain open challenges. In particular, the current design introduces additional computational complexity due to multi-branch feature aggregation and dynamic alignment operations, thereby increasing inference latency on edge devices. To improve inference speed and reduce model size, further optimizations are required. Future work will focus on enhancing the lightweight design using techniques such as neural architecture search, low-rank decomposition, and quantization-aware training. Additionally, exploring more efficient dynamic operators and knowledge distillation strategies may further reduce computational overhead without compromising detection accuracy. Overall, MFA-YOLO provides a solid foundation for UAV small-object detection, and with continued optimization in efficiency, it holds significant potential for deployment in UAV-based perception systems.
Author contributions
S.L. conceived and designed the study, performed data collection and statistical analysis, interpreted the results, and wrote the main manuscript text. C.C. prepared all tables and figures. Both authors reviewed and approved the final manuscript.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Zhang, R., Wang, M., Cai, L. X. & Shen, X. Learning to be proactive: Self-regulation of UAV based networks with UAV and user dynamics. IEEE Trans. Wirel. Commun.20, 4406–4419. 10.1109/TWC.2021.3058533 (2021). [Google Scholar]
- 2.Valianti, P., Kolios, P. & Ellinas, G. Energy-aware tracking and jamming rogue UAVs using a swarm of pursuer UAV agents. IEEE Syst. J.17, 1524–1535. 10.1109/JSYST.2022.3179632 (2023). [Google Scholar]
- 3.Li, J., Ye, D. H., Kolsch, M., Wachs, J. P. & Bouman, C. A. Fast and robust UAV to UAV detection and tracking from video. IEEE Trans. Emerg. Top. Comput.10, 1519–1531. 10.1109/TETC.2021.3104555 (2022). [Google Scholar]
- 4.Wang, H., Liu, C., Cai, Y., Chen, L. & Li, Y. Yolov8-qsd: An improved small object detection algorithm for autonomous vehicles based on yolov8. IEEE Trans. Instrum. Meas.73, 1–16. 10.1109/TIM.2024.3379090 (2024). [Google Scholar]
- 5.Lin, T.-Y. et al. Microsoft coco: Common objects in context. arXiv https://arxiv.org/abs/1405.0312 (2015).
- 6.Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis.88, 303–338. 10.1007/s11263-009-0275-4 (2010). [Google Scholar]
- 7.Zhu, P. et al. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell.44, 7380–7399. 10.1109/TPAMI.2021.3119563 (2022). [DOI] [PubMed] [Google Scholar]
- 8.Yu, H. et al. The unmanned aerial vehicle benchmark: Object detection, tracking and baseline. Int. J. Comput. Vis.128, 1141–1159. 10.1007/s11263-019-01266-1 (2020). [Google Scholar]
- 9.Ye, T. et al. Dense and small object detection in UAV-vision based on a global-local feature enhanced network. IEEE Trans. Instrum. Meas.71, 1–13. 10.1109/TIM.2022.3196319 (2022). [Google Scholar]
- 10.Tang, Y., Xu, T., Qin, H. & Li, J. Irstd-yolo: An improved yolo framework for infrared small target detection. IEEE Geosci. Remote Sens. Lett.22, 1–5. 10.1109/LGRS.2025.3562096 (2025). [Google Scholar]
- 11.Xi, Y. et al. Detection-driven exposure-correction network for nighttime drone-view object detection. IEEE Trans. Geosci. Remote Sens.62, 1–14. 10.1109/TGRS.2024.3351134 (2024). [Google Scholar]
- 12.Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 1440–1448. 10.1109/ICCV.2015.169 (2015).
- 13.Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.39, 1137–1149. 10.1109/TPAMI.2016.2577031 (2017). [DOI] [PubMed] [Google Scholar]
- 14.Murat, A. A. & Kiran, M. S. A comprehensive review on yolo versions for object detection. Eng. Sci. Technol. Int. J.70, 102161. 10.1016/j.jestch.2025.102161 (2025). [Google Scholar]
- 15.Liu, W. et al. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV). Vol. 9905. Lecture Notes in Computer Science. 21–37. 10.1007/978-3-319-46448-0_2 (2016).
- 16.Ni, J., Shen, K., Chen, Y. & Yang, S. X. An improved ssd-like deep network-based object detection method for indoor scenes. IEEE Trans. Instrum. Meas.72, 1–15. 10.1109/TIM.2023.3244819 (2023).37323850 [Google Scholar]
- 17.Gu, Q., Huang, H., Han, Z., Fan, Q. & Li, Y. Glfe-yolox: Global and local feature enhanced yolox for remote sensing images. IEEE Trans. Instrum. Meas.73, 1–12. 10.1109/TIM.2024.3387499 (2024). [Google Scholar]
- 18.Zhao, Y. et al. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16965–16974. 10.1109/CVPR52733.2024.01605 (2024).
- 19.Ye, T. et al. Real-time object detection network in UAV-vision based on cnn and transformer. IEEE Trans. Instrum. Meas.72, 1–13. 10.1109/TIM.2023.3241825 (2023).37323850 [Google Scholar]
- 20.Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 779–788. 10.1109/CVPR.2016.91 (2016).
- 21.Redmon, J. & Farhadi, A. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 6517–6525. 10.1109/CVPR.2017.690 (2017).
- 22.Redmon, J. & Farhadi, A. Yolov3: An incremental improvement. arXiv https://arxiv.org/abs/1804.02767 (2018).
- 23.Lin, T.-Y. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 936–944. 10.1109/CVPR.2017.106 (2017).
- 24.Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv https://arxiv.org/abs/2004.10934 (2020).
- 25.Jocher, G. et al. Ultralytics/yolov5: v7.0 – yolov5 sota realtime instance segmentation. Zenodo 10.5281/zenodo.7347926 (2022). [DOI]
- 26.Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7464–7475. 10.1109/CVPR52729.2023.00721 (2023).
- 27.Jocher, G., Chaurasia, A. & Qiu, J. Ultralytics yolov8. https://github.com/ultralytics/ultralytics (2023).
- 28.Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. Yolox: Exceeding yolo series in 2021. arXiv https://arxiv.org/abs/2107.08430 (2021).
- 29.Wang, A., Chen, H. & Liu, L. Yolov10: Real-time end-to-end object detection. arXiv https://arxiv.org/abs/2405.14458 (2024).
- 30.Jocher, G. & Qiu, J. Ultralytics yolo11. https://github.com/ultralytics/ultralytics (2024).
- 31.Lei, M., Li, S., Wu, Y. et al. Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv https://arxiv.org/abs/2506.17733 (2025).
- 32.Yang, Z. et al. Mhaf-yolo: Multi-branch heterogeneous auxiliary fusion yolo for accurate object detection. arXiv https://arxiv.org/abs/2502.04656 (2025).
- 33.Shen, J. et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network. IEEE Trans. Instrum. Meas. 71, 1–13. 10.1109/TIM.2021.3132332 (2022).
- 34.Shen, J., Liu, N., Sun, H., Li, D. & Zhang, Y. An instrument indication acquisition algorithm based on lightweight deep convolutional neural network and hybrid attention fine-grained features. IEEE Trans. Instrum. Meas. 73, 1–16. 10.1109/TIM.2023.3346488 (2024).
- 35.Shen, J. et al. An algorithm based on lightweight semantic features for ancient mural element object detection. npj Heritage Sci. 13, 70. 10.1038/s40494-025-01565-6 (2025).
- 36.Fan, Q., Li, Y., Deveci, M., Zhong, K. & Kadry, S. Lud-yolo: A novel lightweight object detection network for unmanned aerial vehicle. Inf. Sci. 686, 121366. 10.1016/j.ins.2024.121366 (2025).
- 37.Zhou, W. et al. Sad-yolo: A small object detector for airport optical sensors based on improved yolov8. IEEE Sens. J. 25, 20513–20522. 10.1109/JSEN.2025.3557999 (2025).
- 38.Xie, J. et al. Kl-yolo: A lightweight adaptive global feature enhancement network for small-object detection in low-altitude remote sensing imagery. IEEE Trans. Instrum. Meas. 74, 1–13. 10.1109/TIM.2025.3576957 (2025).
- 39.Tian, J. et al. Adversarial attacks and defenses for deep-learning-based unmanned aerial vehicles. IEEE Internet Things J. 9, 22399–22409. 10.1109/JIOT.2021.3111024 (2022).
- 40.Shen, J. et al. An anchor-free lightweight deep convolutional network for vehicle detection in aerial images. IEEE Trans. Intell. Transport. Syst. 23, 24330–24342. 10.1109/TITS.2022.3203715 (2022).
- 41.Yang, Y., Wang, B., Tian, J., Lyu, X. & Li, S. An efficient and low-delay sfc recovery method in the space–air–ground integrated aviation information network with integrated uavs. Drones 9, 440. 10.3390/drones9060440 (2025).
- 42.Tian, Z., Shen, C., Chen, H. & He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10.1109/ICCV.2019.00972 (2019).
- 43.Feng, C., Zhong, Y., Gao, Y., Scott, M. R. & Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10.1109/ICCV48922.2021.00349 (2021).
- 44.Zhu, X., Hu, H., Lin, S. & Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 9300–9308. 10.1109/CVPR.2019.00953 (2019).
- 45.Li, X. et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. arXiv https://arxiv.org/abs/2006.04388 (2020).
- 46.Deng, S. et al. A global-local self-adaptive network for drone-view object detection. IEEE Trans. Image Process. 30, 1556–1569. 10.1109/TIP.2020.3045636 (2021).
- 47.Du, B., Huang, Y., Chen, J. & Huang, D. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. arXiv https://arxiv.org/abs/2303.14488 (2023).
- 48.Xiao, Y., Xu, T., Xin, Y. & Li, J. Fbrt-yolo: Faster and better for real-time aerial image detection. Proc. AAAI Conf. Artif. Intell. 39, 8673–8681. 10.1609/aaai.v39i8.32937 (2025).
Associated Data
Data Availability Statement
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.