Abstract
Detecting small targets in remote sensing images is challenging for traditional lightweight methods due to the inherent conflict between feature representation capability and computational constraints. To address this, this paper proposes a lightweight and high-precision detection network, LMW-YOLO, built upon the YOLO11n baseline. We design a novel Context-Scale Decoupled (CSD) strategy, which tailors the feature extraction process to the distinct requirements of different network stages. Guided by this strategy, we first design the Large-Kernel Context Aggregation (LKCA) module for the shallow P3 branch. This module decomposes Large Kernel Attention (LKA) to capture the long-range dependencies and global context essential for small targets, effectively compensating for the limited receptive fields of standard convolutions. Subsequently, to handle semantic ambiguity in deeper layers, the Multi-Scale Dilated Perception (MSDP) module is introduced in the P4 branch, which expands the receptive field to capture multi-scale semantic context without sacrificing spatial resolution. Furthermore, the WIoU v3 loss function is incorporated to optimize bounding box regression. By employing a dynamic non-monotonic focusing mechanism, WIoU v3 intelligently rebalances gradient gains based on anchor box quality, which accelerates convergence and enhances localization accuracy in dense scenarios. Experimental results on the NWPU VHR-10, RS-STOD, and VisDrone2019 datasets demonstrate that LMW-YOLO achieves superior detection performance compared to state-of-the-art methods while maintaining an extremely low parameter count (2.6 M), validating its effectiveness for resource-constrained aerial applications.
Subject terms: Engineering, Mathematics and computing
Introduction
With the continuous advancement of aerospace sensor technologies, object detection in high-resolution Remote Sensing Images (RSIs) has become an important component in applications ranging from urban planning1 and environmental monitoring2 to disaster management3. While general-purpose single-stage detectors, epitomized by the YOLO series4–9, have established themselves as the de facto standard for real-time applications, their direct deployment on RSIs presents severe challenges. Unlike natural scene images, RSIs are captured from a high-altitude, top-down perspective, resulting in dense distributions of minute objects, massive scale variations, and highly complex background clutter10.
To tackle these RSI-specific challenges, recent research has largely focused on introducing attention mechanisms11–13 or deepening multi-scale feature fusion networks14. However, a critical gap remains in the preservation of fine-grained spatial information in lightweight models. Standard single-stage detectors rely heavily on strided convolutions for rapid downsampling to acquire high-level semantics. For small objects that may occupy fewer than 5 to 10 pixels, this aggressive downsampling inevitably leads to the irreversible loss of high-frequency spatial information15,16, eroding defining spatial features before the network can effectively encode them. While multi-scale information fusion offers a powerful paradigm17, standard CNN architectures and techniques like Feature Pyramid Networks (FPN)18 are not explicitly designed to counteract this initial detail loss; they fuse features only after significant spatial information has already been lost in the backbone.
Furthermore, attempts to enhance feature representation through existing attention mechanisms, such as Squeeze-and-Excitation (SE)19 or Convolutional Block Attention Module (CBAM)20, typically focus on channel-wise recalibration or global context. These approaches lack the specific mechanisms required to locally expand and preserve the high-frequency spatial details necessary for distinguishing minute objects from background clutter. Approaches relying on Transformers21 or self-attention22 capture global context but incur prohibitive computational overhead, violating the lightweight premise of edge-deployed detectors. Conversely, while lightweight models aim for efficiency, they often sacrifice the capacity for effective fusion, leading to a noticeable drop in accuracy.
More fundamentally, current state-of-the-art YOLO architectures (including YOLOv8 and YOLO11) suffer from a structural paradigm flaw: they typically deploy homogeneous feature extraction modules (e.g., standard C2f or C3k2 blocks) uniformly across all levels of the feature pyramid. We argue that this design is inherently inadequate for RSIs because different feature levels exhibit highly asymmetric representational demands. Shallow layers (e.g., the P3 branch) maintain high spatial resolution crucial for small targets but suffer from severely restricted receptive fields, making them vulnerable to local background noise. In contrast, deep layers (e.g., the P4 branch) possess large receptive fields and rich semantics but are prone to semantic aliasing and resolution degradation, struggling to capture objects with extreme scale variations. Applying a uniform structural block across these stages results in a compromise that fails to satisfy the distinct requirements of either layer. Furthermore, standard bounding box regression loss functions (e.g., CIoU) employ a static focusing mechanism that treats all samples equally. In RSI datasets containing low-quality examples (e.g., noisy or occluded targets), this approach often leads to harmful gradients from geometric outliers, hindering model convergence and generalization23.
To address these limitations, we propose LMW-YOLO, a novel and lightweight detection framework built upon the YOLO11n architecture. By formalizing the asymmetric demands of multi-scale feature representations, we introduce a Context-Scale Decoupled (CSD) strategy. Instead of uniform processing, the CSD strategy shifts from a homogeneous architecture to a task-specific structural design, ensuring that each feature pyramid level is equipped with highly specialized modules tailored to its distinct bottleneck. Specifically, we deploy the Large-Kernel Context Aggregation (LKCA) module in the shallow P3 layer to expand the limited receptive fields for minute targets; introduce the Multi-Scale Dilated Perception (MSDP) module in the intermediate P4 layer to dynamically adapt to drastic scale variations; and retain the highly efficient C3k2 module in the deepest P5 layer to consolidate high-level semantics. This targeted decoupling allows each layer to optimize its specific function without introducing redundant computational overhead. Finally, we integrate Wise-IoU v3 (WIoU v3) as the bounding box regression loss, leveraging its dynamic non-monotonic focusing mechanism to intelligently suppress harmful gradients from extreme geometric outliers.
The main contributions of this paper can be summarized as follows:
To address the limitation of standard CNNs in capturing global information due to fixed receptive fields, we propose the Large-Kernel Context Aggregation (LKCA) module as a replacement for the standard bottleneck in the shallow feature branch. By decomposing large kernels into depth-wise and depth-wise dilated convolutions, LKCA establishes long-range dependencies and captures global context for small targets without the high computational cost of Transformers. This design significantly enhances the network’s ability to distinguish minute objects from complex backgrounds.
To mitigate the feature degradation caused by drastic scale variations in remote sensing images, we introduce the Multi-Scale Dilated Perception (MSDP) module. This module incorporates a residual structure with parallel dilated convolution branches of varying dilation rates. It effectively expands the receptive field to capture semantic context at multiple granularities without sacrificing spatial resolution, thereby boosting the model’s adaptability to objects of extreme sizes.
We propose a Context-Scale Decoupled (CSD) strategy. Instead of uniformly applying a single block type across all stages, we strategically deploy the LKCA module in the shallow P3 branch to compensate for limited receptive fields, and the MSDP module in the deep P4 branch to handle semantic scale variations. This differentiated design maximizes the unique strengths of each layer, achieving an optimal balance between spatial detail preservation and semantic abstraction.
We implement a dynamic gradient optimization strategy by incorporating Wise-IoU v3 (WIoU v3) as the bounding box regression loss. In contrast to static focusing mechanisms, WIoU v3 employs a dynamic non-monotonic mechanism based on the outlier degree of anchor boxes. This approach intelligently suppresses harmful gradients from low-quality examples while amplifying the focus on high-quality anchors, resulting in faster convergence and more precise localization in noisy environments.
Related work
Before the dominance of deep learning, object detection in remote sensing imagery relied heavily on handcrafted features and template matching. Early approaches, such as mean shift segmentation24 and improved Scale-Invariant Feature Transform (SIFT) algorithms25, utilized prior knowledge of object shapes for extraction. However, these traditional methods necessitated manual feature design and demonstrated poor robustness against complex scene variations and noise disturbances, resulting in suboptimal detection performance.
With the advent of Convolutional Neural Networks (CNNs), data-driven paradigms have revolutionized this field. Generally, deep learning-based detection frameworks fall into two categories: two-stage and one-stage methods. Two-stage algorithms, such as the R-CNN family, first generate region proposals before performing classification. While methods like RSADet26 introduced deformable convolution to handle spatial variations, and contextual refinement modules27 improved positive sample generation, these architectures are often computationally intensive with high inference latency, limiting their deployment in real-time industrial applications.
Conversely, one-stage algorithms, epitomized by the YOLO series, directly regress bounding boxes and class probabilities, offering a superior balance between speed and accuracy. Consequently, numerous studies have tailored one-stage detectors to address the unique challenges of Remote Sensing Images (RSIs), specifically multi-scale feature fusion, dense small object detection, and model lightweighting.
To mitigate the feature vanishing problem of small targets, advanced fusion strategies have been widely adopted. For instance, Zhang et al. utilized multi-scale convolution branches to enhance feature perception28, while Zhao and Zhu integrated the CBAM module with Swin Transformer blocks to capture global dependencies29. Similarly, the GLFE-YOLOX framework30 integrated an improved PAFPN to enhance feature fusion, achieving significant gains on the NWPU VHR-10 dataset. Furthermore, attention mechanisms like Efficient Multiscale Attention (EMA)31 and GAM32 have been incorporated into YOLO architectures to reduce background interference and improve feature selection.
Addressing dense distributions and minute targets is another critical focus. TPH-YOLOv533 introduced a transformer prediction head to effectively handle small objects in UAV images, though it struggled with drastic scale variations. To capture extreme scale differences, Wang et al. incorporated micro-target detection layers34, and Yang et al. proposed a cluster detection network to estimate target sizes in dense regions dynamically35. Additionally, strategies like constrained regression modeling in LOCO36 have been employed to improve the robustness of predicting dense building footprints.
While accuracy is paramount, computational efficiency is crucial for edge deployment. Niu and Yan introduced diverse branch blocks to reduce parameters by 37% while boosting AP37, and methods like L-SNR-YOLO38 attempted to enhance feature saliency via transformers. However, some approaches inadvertently introduce excessive parameters, neglecting the lightweight requirement. Moreover, methods like DFPH-YOLO39, despite improving semantic combination, often fail to effectively filter out irrelevant background information during feature extraction on elongated objects.
Although the aforementioned studies provide various optimization schemes, ranging from hierarchical anchor generation40 to center probability map prediction41, they often face a trade-off between localization precision and computational burden. Specifically, existing one-stage detectors frequently neglect the suppression of background noise when extracting features from small, densely packed targets, and many suffer from parameter redundancy. Considering the requirement for real-time performance on resource-constrained equipment, this study adheres to the Horizontal Bounding Box (HBB) paradigm based on the YOLO framework. We aim to address the limitations of existing methods by optimizing the network structure to ensure robust multi-scale context capture and precise localization without incurring the heavy computational overhead typical of OBB detectors or heavy-weight transformers.
Methods
Overall architecture
In this work, we propose a novel detection framework named LMW-YOLO, built upon the advanced YOLO11n architecture, to effectively address the persistent challenges of Small Object Detection (SOD) in Remote Sensing Images (RSIs), specifically large scale variation, dense object distribution, and complex background interference. The overall architecture and core components of LMW-YOLO are illustrated in Fig. 1.
Fig. 1.

The overall architecture of LMW-YOLO. This figure illustrates the backbone, neck, and head components of the proposed model.
While standard lightweight detectors often prune deep detection heads to reduce computation, we argue that high-level semantic information is crucial for distinguishing targets from complex aerial backgrounds. Guided by our proposed Context-Scale Decoupled (CSD) strategy, LMW-YOLO adopts a hierarchical feature extraction paradigm. The shallow layers primarily suffer from insufficient global context, whereas deep layers struggle with scale variance. Consequently, we tailor the neck network by deploying two specialized modules: the Large-Kernel Context Aggregation (LKCA) module in the shallow P3 branch and the Multi-Scale Dilated Perception (MSDP) module in the deep P4 branch. This decoupled design ensures that each network stage focuses on its specific representational bottleneck.
First, to enhance the representation of small and densely distributed objects, the LKCA module is embedded into the P3 branch (shallow features). Small targets in remote sensing imagery often lack sufficient pixel information. By incorporating Large Kernel Attention (LKA) into the C3k2 block, this module enables the network to establish long-range dependencies and capture rich contextual information without discarding fine-grained local features. This design significantly improves the model’s sensitivity to tiny targets that are easily overwhelmed by noise.
Second, the MSDP module is introduced into the P4 branch (medium-level features) to handle scale transitions and background discrimination. Utilizing Dilation-wise Residual (DWR) structures, this module effectively expands the receptive field to capture multi-scale semantic contexts. This enhancement allows the network to better model the relationship between objects and their surroundings, thereby suppressing irrelevant background interference and improving discriminative capability for medium-scale targets.
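To make the decoupled deployment concrete, the following minimal Python sketch (names and structure are illustrative assumptions, not part of any released configuration) summarizes how the CSD strategy routes each pyramid level to its dedicated block; sketches of the LKCA and MSDP blocks themselves follow in the next two subsections.

```python
# Illustrative routing table for the CSD strategy: each pyramid level is
# assigned the block type matched to its representational bottleneck.
CSD_LAYOUT = {
    "P3": "LKCA",  # shallow, high-resolution branch: large-kernel context for tiny targets
    "P4": "MSDP",  # intermediate branch: multi-rate dilation for drastic scale variation
    "P5": "C3k2",  # deepest branch: the efficient YOLO11n block is kept unchanged
}

def block_for(level: str) -> str:
    """Return the block type that the CSD strategy assigns to a pyramid level."""
    return CSD_LAYOUT[level]
```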
Large-kernel context aggregation (LKCA) module
Standard convolutional layers (e.g., 3 × 3 convolutions) in YOLO models inherently suffer from limited receptive fields, making it difficult to capture long-range dependencies and global context, which are crucial for recognizing objects in high-resolution Remote Sensing Imagery. Although increasing the kernel size can expand the receptive field, it incurs a quadratic increase in computational cost and parameter count. To resolve this dilemma, we propose the LKCA module, as illustrated in Fig. 2, which integrates a decomposed Large Kernel Attention (LKA) mechanism into the C3k2 architecture. The core component of this module is the LKA_Bottleneck, designed to emulate the behavior of a super-large kernel through a decomposed multi-stage strategy. Unlike standard bottlenecks, the LKA_Bottleneck decouples the feature extraction process into three sequential steps: first, a 5 × 5 depth-wise convolution is applied to capture local spatial details and texture information; then, the output is processed by a 7 × 7 depth-wise dilated convolution with a dilation rate of 3, which creates a dilated window that significantly expands the effective receptive field to cover broader contextual information without introducing additional parameters; finally, a 1 × 1 point-wise convolution fuses the spatially aggregated features across channels.
Fig. 2.

Architecture of the LKCA module and its core component, the LKA_Bottleneck.
Mathematically, for an input tensor X, the output Y of the bottleneck is formulated as:
$$F_{\mathrm{dw}} = \mathrm{DWConv}_{5\times 5}(X) \tag{1}$$

$$F_{\mathrm{dwd}} = \mathrm{DWDConv}_{7\times 7,\, d=3}(F_{\mathrm{dw}}) \tag{2}$$

$$Y = X + \mathrm{Conv}_{1\times 1}(F_{\mathrm{dwd}}) \tag{3}$$
It is worth noting that while the original LKA design typically employs an element-wise attention mechanism42, we adopt an additive residual connection (Y = X + LKA(X)) in our LKA_Bottleneck. This modification is specifically implemented to maintain gradient stability and facilitate information flow during the training of deep networks, preventing signal degradation. Specifically, while the original LKA was designed as a multiplicative attention mechanism (Output = LKA(F) ⊗ F, where F represents the input feature map and ⊗ denotes the element-wise multiplication) for image classification, our LKCA module fundamentally modifies this behavior to better suit object detection in remote sensing imagery. By deliberately abandoning the multiplicative paradigm in favor of a pure additive residual block, our LKA_Bottleneck leverages LKA's efficient multi-stage decomposition to simulate a massive receptive field (e.g., 21 × 21) without incurring a quadratic parameter overhead. Decoupled into the previously mentioned 5 × 5 depth-wise, 7 × 7 depth-wise dilated (dilation rate 3), and 1 × 1 point-wise convolutions, it effectively captures both local spatial details and expansive global context. The LKCA module acts as the carrier for these modified bottlenecks by integrating them into the C3k2 architecture of YOLO. Adopting the Cross-Stage Partial (CSP) structure, the input feature map is split into two branches. One branch passes through two LKA_Bottleneck layers to extract rich global-local features, while the other maintains the original feature identity. The final concatenation ensures that the network benefits simultaneously from the expansive global context provided by the decomposed large kernels and the detail-preserving gradient flow of the CSP split.
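A minimal PyTorch sketch of the LKA_Bottleneck and its CSP-style wrapper is given below. The kernel sizes (5 × 5 depth-wise, 7 × 7 depth-wise dilated with dilation 3, 1 × 1 point-wise) follow the standard LKA decomposition, and all class and argument names are our own illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class LKABottleneck(nn.Module):
    """Decomposed large-kernel convolutions with an additive residual."""

    def __init__(self, channels: int):
        super().__init__()
        # 5x5 depth-wise conv: local spatial details and texture
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # 7x7 depth-wise dilated conv (dilation 3): enlarges the effective
        # receptive field (~21x21) without adding parameters per position
        self.dw_d = nn.Conv2d(channels, channels, 7, padding=9, dilation=3,
                              groups=channels)
        # 1x1 point-wise conv: fuses the aggregated features across channels
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Additive residual (Eq. 3) instead of the original multiplicative attention
        return x + self.pw(self.dw_d(self.dw(x)))

class LKCA(nn.Module):
    """CSP-style wrapper: one branch keeps the feature identity, the other
    passes through two LKA bottlenecks; both are concatenated and fused."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_half = c_in // 2
        self.split = nn.Conv2d(c_in, 2 * c_half, 1)
        self.blocks = nn.Sequential(LKABottleneck(c_half), LKABottleneck(c_half))
        self.fuse = nn.Conv2d(2 * c_half, c_out, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity, active = self.split(x).chunk(2, dim=1)
        return self.fuse(torch.cat((identity, self.blocks(active)), dim=1))
```

Applied to a feature map of shape (B, C, H, W), for example LKCA(64, 64)(torch.randn(1, 64, 80, 80)), the wrapper preserves spatial resolution while enlarging the effective receptive field of the processed branch.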
Multi-scale dilated perception (MSDP) module
The MSDP module is a Dilated Window-based Residual (DWR) enhanced module designed to overcome the limitations of standard feature fusion modules (e.g., C3k2 or C2f) in YOLO models when applied to small object detection in Remote Sensing Images (including UAV and satellite scenarios). While architectures like C3k2 improve efficiency through cross-stage feature integration, their reliance on standard Bottleneck structures with fixed, single-scale receptive fields is fundamentally inadequate for the dramatic scale variations and dense, compact object distributions typical of aerial imagery. Furthermore, naively expanding these modules with standard multi-branch dilated convolutions often leads to redundant weights and optimization difficulties, as they attempt to extract multi-scale context simultaneously from complex, entangled feature maps.
To address these representational bottlenecks, the proposed MSDP module introduces a synergistic architectural design, as illustrated in Fig. 3, by embedding a decoupled two-step feature extraction paradigm within a Cross Stage Partial (CSP) topology. Crucially, this design fundamentally departs from the original DWR architecture. While the traditional DWR module relies on an FCN-like direct encoder–decoder pipeline in which the entire feature map undergoes heavy multi-scale transformations, our MSDP module explicitly partitions the input feature channels in half (via a chunk(2, 1) operation). One half bypasses the bottleneck entirely as an identity shortcut to strictly preserve uncorrupted, high-resolution spatial gradients. The remaining half is directed into a tailored DWR_Bottleneck, where we deliberately separate context acquisition into Region Residualization (RR) and Semantic Residualization (SR).
Fig. 3.

The detailed structure of MSDP module, including DWR_Bottleneck.
First, rather than forcing dilated convolutions to process raw features directly, the input channels are processed via a standard 3 × 3 convolution to generate RR features. This crucial first step yields a concise, activated representation of local regional textures. Subsequently, these simplified RR features are passed to the SR stage, which employs parallel dilated convolutions with varying dilation rates (e.g., d = 1, 3, 5). Because the input to this stage has already been filtered by the RR step, the multi-rate convolutions act as highly efficient semantic filters. The d = 1 branch establishes a compact receptive field to accurately preserve the fine-grained local details critical for extremely small targets, while the d = 3 and d = 5 branches capture broader contextual cues for larger or sparsely distributed objects without introducing excessive background noise. Finally, these multi-scale features are concatenated with the preserved identity shortcut, restored to their original dimensions via a 1 × 1 convolution, and residually fused.
By embedding this decoupled RR-SR mechanism within a split-gradient CSP path, the MSDP module achieves an expansive receptive field without incurring significant computational overhead or gradient complexity. Furthermore, it eliminates the "dead weights" typical of traditional multi-branch architectures, dynamically capturing rich multi-scale semantics while strictly preserving the raw pixel-level features essential for the precise localization of small objects in remote sensing images.
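The PyTorch sketch below illustrates the split-identity design and the RR-SR decoupling described above; the dilation rates (1, 3, 5), the normalization and activation choices, and all names are illustrative assumptions rather than the exact released code (an even channel count is assumed).

```python
import torch
import torch.nn as nn

class DWRBottleneck(nn.Module):
    """Region Residualization (RR) followed by Semantic Residualization (SR)."""

    def __init__(self, channels: int):
        super().__init__()
        # RR: a standard 3x3 conv yields a concise, activated local-texture map
        self.rr = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        # SR: parallel depth-wise dilated convs act as multi-scale semantic filters
        self.sr = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                      groups=channels, bias=False)
            for d in (1, 3, 5)
        ])
        self.bn = nn.BatchNorm2d(3 * channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = self.rr(x)
        return self.bn(torch.cat([branch(r) for branch in self.sr], dim=1))

class MSDP(nn.Module):
    """Split-identity CSP path: half the channels bypass the bottleneck, the
    other half pass through the DWR bottleneck; a 1x1 conv restores the
    channel count before a residual fusion with the input."""

    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.dwr = DWRBottleneck(half)
        self.restore = nn.Conv2d(half + 3 * half, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity, active = x.chunk(2, dim=1)     # the chunk(2, 1) split
        fused = self.restore(torch.cat((identity, self.dwr(active)), dim=1))
        return x + fused                         # residual fusion
```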
Wise-IoU v3 (WIoU v3)
The drastic scale variations, dense distributions, and the prevalence of small objects in remote sensing images (encompassing satellite remote sensing and low-altitude UAV data) pose significant challenges to bounding box regression. Although the CIoU loss function, commonly used in architectures such as YOLO11, comprehensively considers overlap area, center distance, and aspect ratio, it employs a static focusing mechanism. This static approach treats all samples equally, which is problematic for remote sensing datasets containing low-quality samples (e.g., geometric distortions, occlusions, or annotation noise). In such scenarios, geometric outliers generate harmful gradients that destabilize the training process and hinder the model from converging to an optimal solution for complex, densely arranged targets.
To address these limitations, we introduce Wise-IoU v3 (WIoU v3), a bounding box regression loss function that incorporates a dynamic non-monotonic focusing mechanism. Unlike CIoU, WIoU v3 dynamically evaluates the quality of anchor boxes, enabling the model to focus on “ordinary quality” anchor boxes (hard samples) while reducing the impact of harmful gradients from extreme outliers.
WIoU v3 is constructed upon a distance-based attention mechanism. It first computes a distance penalty term, denoted as R_WIoU, based on the center point deviation:

$$R_{\mathrm{WIoU}} = \exp\!\left(\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{c^{2}}\right) \tag{4}$$
where (x, y) and (x_gt, y_gt) represent the center coordinates of the predicted box and the ground truth box, respectively, and c denotes the diagonal length of the smallest enclosing box. As shown in Fig. 4, this forms the basis of WIoU v1:

$$\mathcal{L}_{\mathrm{WIoU\,v1}} = R_{\mathrm{WIoU}} \cdot \mathcal{L}_{\mathrm{IoU}}, \qquad \mathcal{L}_{\mathrm{IoU}} = 1 - \mathrm{IoU} \tag{5}$$
Fig. 4.
Geometric illustration of the WIoU v1 loss function. This diagram depicts the interaction (IoU) between the predicted box and the ground truth box and explains the distance penalty term R_WIoU.
To implement the dynamic focusing mechanism of v3, we introduce the outlier degree β, which describes the quality of an anchor box relative to the current model performance:

$$\beta = \frac{\mathcal{L}_{\mathrm{IoU}}}{\overline{\mathcal{L}_{\mathrm{IoU}}}} \in [0, +\infty) \tag{6}$$
Here, L̄_IoU is the exponential moving average of the IoU loss for the current batch, serving as a dynamic normalization factor. Based on β, a non-monotonic focusing coefficient r is constructed using hyperparameters α and δ:

$$r = \frac{\beta}{\delta \cdot \alpha^{\beta - \delta}} \tag{7}$$
Consequently, the final WIoU v3 loss function is defined as:

$$\mathcal{L}_{\mathrm{WIoU\,v3}} = r \cdot \mathcal{L}_{\mathrm{WIoU\,v1}} \tag{8}$$
The hyperparameters α and δ jointly control the shape of this non-monotonic curve, allowing for flexible adaptation to different datasets. Specifically, δ serves as a threshold point: when β = δ, the gradient gain r = 1, marking the boundary where the mechanism shifts from amplifying gradients to suppressing them. Meanwhile, α controls the peak position and the decay rate of the curve; mathematically, the peak of the gradient gain occurs at β = 1/ln α. By adjusting α, we can shift the focus of the model toward specific "ordinary" samples, while ensuring a rapid decay in gradient gain for extreme outliers (β ≫ δ), thereby stabilizing the training process.
The core advantage of WIoU v3 in remote sensing scenarios lies in the dynamic behavior of r. As illustrated in Fig. 5, this mechanism operates in three distinct regions based on the outlier degree β. For high-quality anchor boxes (small β), the loss is reduced to prevent overfitting, as the model has already learned these samples well. Conversely, for low-quality outliers (large β), which often correspond to mislabeled data or extreme noise common in RSI, the gradient gain r is significantly reduced, effectively suppressing the generation of harmful gradients that could destabilize training. Crucially, for ordinary anchor boxes (medium β), the gradient gain is amplified, thereby prompting the model to focus its limited learning capacity on these "hard but valid" samples. By dynamically regulating the gradient gain, WIoU v3 accelerates convergence and significantly improves localization accuracy, demonstrating particular robustness when distinguishing densely adjacent small targets in complex aerial imagery.
Fig. 5.
The dynamic non-monotonic focusing mechanism of WIoU v3. The curve illustrates the relationship between gradient gain r and outlier degree β (plotted with α = 1.9, δ = 3.0).
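For reference, a compact sketch of the loss defined by Eqs. (4)–(8) is given below for axis-aligned boxes in (x1, y1, x2, y2) format. The interface, in particular how the moving average L̄_IoU is stored and updated, is an assumption for illustration; the hyperparameter defaults match the values plotted in Fig. 5.

```python
import torch

def wiou_v3_loss(pred, target, iou_mean, alpha=1.9, delta=3.0, momentum=0.01):
    """WIoU v3 sketch for boxes in (x1, y1, x2, y2) format.
    `iou_mean` is the running mean of the IoU loss used to normalise beta."""
    # IoU and the basic IoU loss
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # Distance attention R_WIoU (Eq. 4): squared centre deviation over the
    # squared diagonal of the smallest enclosing box (detached, as in WIoU)
    cx_p = (pred[..., 0] + pred[..., 2]) / 2
    cy_p = (pred[..., 1] + pred[..., 3]) / 2
    cx_t = (target[..., 0] + target[..., 2]) / 2
    cy_t = (target[..., 1] + target[..., 3]) / 2
    enc_wh = torch.max(pred[..., 2:], target[..., 2:]) - torch.min(pred[..., :2], target[..., :2])
    c2 = (enc_wh ** 2).sum(-1).detach() + 1e-7
    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) / c2)

    # Outlier degree (Eq. 6) and non-monotonic gradient gain (Eq. 7)
    beta = l_iou.detach() / iou_mean
    gain = beta / (delta * alpha ** (beta - delta))

    # Update the running mean used to normalise future batches
    iou_mean = (1 - momentum) * iou_mean + momentum * l_iou.mean().item()
    return (gain * r_wiou * l_iou).mean(), iou_mean   # Eqs. (5) and (8)
```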
Experiment
Dataset
In this paper, we selected three representative remote sensing object detection datasets—NWPU VHR-10, RS-STOD, and VisDrone2019—for experimentation to comprehensively evaluate the effectiveness and generalization capability of the proposed LMW-YOLO network.
NWPU VHR-10 is a high-resolution geospatial object detection dataset43. It consists of 800 images, covering 10 categories including airplane, baseball diamond, and basketball court, with a total of 3651 object instances. Specifically, the dataset comprises 650 images containing targets (Positive Samples) and 150 images containing only background (Negative Samples). This study specifically selected this dataset as the primary platform for ablation studies due to its unique sample composition and moderate scale. On one hand, the dataset contains a certain proportion of pure background negative samples, which can sensitively verify the model’s resistance to false detections in complex backgrounds, thereby precisely evaluating the robustness improvements brought by the proposed modules. On the other hand, compared to million-scale massive datasets, NWPU VHR-10 can significantly reduce the time cost of multiple iterative trainings while ensuring the statistical significance of experimental results, thus efficiently verifying the effectiveness of each component. In this experiment, the dataset was randomly partitioned into training, validation, and test sets in an 8:1:1 ratio to conduct ablation studies and baseline performance evaluations.
To verify the robustness of LMW-YOLO in detecting extremely minute targets, we employed the RS-STOD dataset44, a benchmark specifically constructed for Small Tiny Objects (STOs) detection. This dataset is sourced from platforms such as Land Information New Zealand and Amap, comprising 2354 images (sliced into
patches) and 50,854 annotated instances, with spatial resolutions ranging from 0.45 m to 8 m. It covers diverse scenarios across major Asia-Pacific cities, including airports, harbors, and dense urban parking lots, to prevent overfitting to specific scenes. RS-STOD encompasses five distinct categories: Small Vehicle (SV), Large Vehicle (LV), Airplane (Ap), Ship (Sh), and Storage Tank (ST). A distinguishing feature of RS-STOD is its extreme scale challenge; the average target size is only 13.4 pixels, which is significantly smaller than that of mainstream datasets like NWPU VHR-10. Statistical analysis indicates that 81% of the targets are smaller than 16 pixels, and 50% are under 8 pixels, strictly adhering to the definition of absolute instance size (
). To ensure a rigorous and objective evaluation of the model’s generalization ability, the dataset is randomly partitioned into training, validation, and test sets in a 7:1:2 ratio.
To evaluate the generalization capability of the model in complex low-altitude UAV scenarios, we utilized the VisDrone2019 dataset45. Collected by the AISKYEYE team at Tianjin University, this dataset constitutes a challenging benchmark specifically designed for drone-based optical remote sensing object detection. It consists of 10,209 static images captured using drone-mounted cameras under diverse environmental conditions, including varying weather, lighting, shooting angles, and scenes. While the dataset originally defines 12 categories, this study concentrates on the 10 primary object classes relevant to aerial monitoring (e.g., vehicles and pedestrians) by excluding “ignored regions” and “others”. Adhering to the official partition protocol, the dataset is divided into a training set (6471 images), a validation set (548 images), and a test set (1610 images), approximating a 7:1:2 ratio. Given its characteristics of dense object distribution and severe occlusion, VisDrone2019 serves as a rigorous testbed for assessing the robustness of LMW-YOLO against real-world drone imagery challenges.
Environment and parameter settings
The experiments in this study were conducted on a platform equipped with an Intel Xeon Platinum 8255C CPU (8 cores @ 2.50 GHz) and an NVIDIA Tesla T4 GPU, supported by 32 GB of multi-bit ECC memory. The operating system employed was Windows Server 2022 Datacenter 21H2. The programming environment consisted of Python 3.12.12 and PyTorch 2.9.1, utilizing CUDA version 12.6. The training process employed Stochastic Gradient Descent (SGD), with an initial learning rate of 0.01 and a final learning rate of 0.001. The input image size was fixed at 640 × 640 pixels, and the batch size was set to 32. The model was trained for a total of 300 epochs. To stabilize convergence and enhance the final detection accuracy, Mosaic data augmentation was disabled during the last 30 epochs.
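Assuming the model is expressed as an Ultralytics-style configuration, the schedule above maps onto a training call roughly as follows; the model and dataset file names are placeholders, and lrf is the final learning rate expressed as a fraction of lr0 (0.01 × 0.1 = 0.001).

```python
from ultralytics import YOLO

model = YOLO("lmw-yolo.yaml")          # hypothetical model definition file
model.train(
    data="nwpu-vhr10.yaml",            # hypothetical dataset config
    epochs=300,
    batch=32,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,                          # initial learning rate
    lrf=0.1,                           # final LR = lr0 * lrf = 0.001
    close_mosaic=30,                   # disable Mosaic for the last 30 epochs
)
```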
Evaluation indicators
To comprehensively quantify the experimental results and evaluate the detection accuracy and efficiency of the proposed LMW-YOLO model in small object detection for remote sensing images, this study adopts Precision (P), Recall (R), and mean Average Precision (mAP) as the primary evaluation protocols.
Precision (P) measures the proportion of true positive samples among all positive predictions made by the model, effectively reflecting the model’s ability to resist false alarms amidst complex background clutter. Recall (R) represents the proportion of actual positive samples correctly identified by the model, serving as a key indicator for measuring the missed detection rate of tiny targets. mAP@0.5 represents the mean Average Precision calculated at an Intersection over Union (IoU) threshold of 0.5, while mAP@0.5:0.95 averages the mAP values over strictly increasing IoU thresholds (from 0.5 to 0.95 with a step size of 0.05). The latter provides a more rigorous evaluation of the model’s high-precision localization capability.
The formulas for these metrics are as follows:
$$P = \frac{TP}{TP + FP} \tag{9}$$

$$R = \frac{TP}{TP + FN} \tag{10}$$

$$AP = \int_{0}^{1} P(R)\, dR \tag{11}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{12}$$
where TP (True Positive) represents the number of targets correctly detected as positive samples; FP (False Positive) represents background regions incorrectly classified as targets (false alarms); FN (False Negative) represents actual targets missed by the model (missed detections); N represents the total number of object categories; and AP (Average Precision) corresponds to the area under the Precision–Recall (P–R) curve for a specific category, used to measure the detection accuracy of that single category.
Furthermore, considering the strict hardware constraints typical of platforms such as UAVs and satellites, we introduce Parameters (M) and FLOPs (G) as supplementary metrics to evaluate the model’s spatial and computational complexity respectively, and ensure its feasibility in resource-constrained environments.
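As a reference implementation of Eqs. (11) and (12), the short sketch below computes AP as the area under a monotonized P–R curve (all-point interpolation) and averages it over categories; it is a simplified, single-IoU-threshold illustration rather than the full evaluation pipeline.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP (Eq. 11): area under the P-R curve with all-point interpolation.
    `recall` is assumed to be sorted in ascending order."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Enforce a monotonically non-increasing precision envelope
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangles where recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class) -> float:
    """mAP (Eq. 12): AP averaged over the N object categories."""
    return float(np.mean(ap_per_class))
```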
Ablation experiments
To validate the effectiveness of each proposed module in LMW-YOLO, we conducted progressive ablation experiments on the NWPU VHR-10 dataset, using the original YOLO11n as the baseline (Method A). The quantitative results are summarized in Table 1.
Table 1.
Ablation study of LMW-YOLO. This table summarizes the performance impact of incorporating LKCA, MSDP, and WIoU v3 modules on the model’s parameters and accuracy.
| Methods | LKCA | MSDP | WIoU v3 | Params (M) | GFLOPs | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|---|---|---|
| A | | | | 2.59 | 6.4 | 92.4 | 61.7 |
| B | ✓ | | | 2.59 | 6.3 | 92.7 | 61.3 |
| C | ✓ | ✓ | | 2.6 | 6.3 | 93.1 | 60.0 |
| D (ours) | ✓ | ✓ | ✓ | 2.6 | 6.3 | 94.3 | 61.9 |
| E | | ✓ | ✓ | 2.6 | 6.4 | 93.6 | 61.0 |
| F | | | ✓ | 2.6 | 6.4 | 93.3 | 60.6 |
| G | | ✓ | | 2.6 | 6.4 | 94.0 | 60.9 |
First, we integrated the LKCA module into the network (Method B). As shown in Table 1, this modification improved the mAP@0.5 from 92.4% to 92.7%, while maintaining the parameter count at 2.59 M and slightly reducing GFLOPs to 6.3. This result demonstrates that replacing the standard bottleneck with the large kernel attention mechanism enables the model to capture long-range dependencies and global information in shallow features without introducing additional computational costs.
Subsequently, with the incorporation of the MSDP module (Method C), the mAP@0.5 further rose to 93.1%. Although this addition resulted in a marginal increase in parameters (from 2.59 M to 2.60 M), the improvement in detection accuracy confirms the effectiveness of the dilated residual structure. By expanding the receptive field, MSDP enhances the network’s ability to aggregate multi-scale contextual information, which is crucial for distinguishing objects from complex backgrounds.
Next, replacing the CIoU loss function with WIoU v3 alongside the previous modules (Method D) yielded the most significant performance improvement. Unlike CIoU, which employs a static focusing mechanism, WIoU v3 utilizes a dynamic non-monotonic focusing mechanism to optimize the regression process. By intelligently allocating gradient gain, specifically suppressing harmful gradients from low-quality examples while amplifying the focus on ordinary anchors, the mAP@0.5 surged to 94.3%, and the high-precision metric mAP@0.5:0.95 increased to 61.9%. This validates that WIoU v3 effectively accelerates convergence and robustly optimizes the localization accuracy of tiny and densely packed objects, even in the presence of geometric outliers.
To further isolate and validate the individual contributions of these modules, we also evaluated alternative configurations (Methods E, F, and G). When integrating only WIoU v3 (Method F) or solely the MSDP module (Method G) into the baseline, the mAP@0.5 improved to 93.3% and 94.0%, respectively. Combining MSDP and WIoU v3 without the LKCA module (Method E) achieved a mAP@0.5 of 93.6% and an mAP@0.5:0.95 of 61.0%. While these configurations demonstrate the independent benefits of MSDP and WIoU v3, they fall short of the comprehensive performance achieved by Method D. Crucially, comparing Method E (without LKCA) to Method D (with LKCA) highlights the vital role of the large kernel attention mechanism; its inclusion not only slightly reduces computational overhead (GFLOPs from 6.4 to 6.3) but also creates a synergistic effect that pushes both mAP@0.5 and mAP@0.5:0.95 to their peak values.
Therefore, the complete LMW-YOLO architecture (Method D) achieves an optimal balance between accuracy and complexity, demonstrating its superiority in the task of Small Object Detection in Remote Sensing Images.
Comparative experiments
To evaluate the effectiveness of the proposed LMW-YOLO model, we benchmarked it against a series of state-of-the-art object detection models on the NWPU VHR-10 dataset. The comparative methods include the classic YOLOv8n (2023), the widely adopted YOLO11n (2024), the latest YOLO26n (2025), and recent representative works from 2026. Table 2 summarizes the experimental results, listing the parameter count (Params), mean Average Precision (mAP@0.5), and Average Precision (AP) for each category.
Table 2.
The results of different base models on the NWPU VHR-10 dataset. Ap., Airplane; ST, Storage Tank; BD, Baseball Diamond; TC, Tennis Court; BC, Basketball Court; GTF, Ground Track Field; Har, Harbor; Bri, Bridge; Veh, Vehicle.
| Methods | Params (M) | FLOPs (G) | Ap (%) | Ship (%) | ST (%) | BD (%) | TC (%) | BC (%) | GTF (%) | Har (%) | Bri (%) | Veh (%) | mAP@0.5 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5m (2020) | 20.9 | 48.0 | 99.5 | 67.4 | 97.1 | 97.4 | 80.6 | 85.3 | 100 | 80.5 | 80.8 | 87.4 | 87.6 |
| YOLOv7-tiny (2022) | 6.2 | 13.1 | 99.4 | 95.8 | 58.7 | 96.9 | 98.2 | 87.7 | 94.0 | 87.7 | 99.5 | 81.7 | 90.0 |
| YOLOv8n (2023) | 3.0 | 8.1 | 97.0 | 91.2 | 95.6 | 96.6 | 98.0 | 73.4 | 97.1 | 92.9 | 72.0 | 86.0 | 90.0 |
| DET-YOLO (2024)46 | 1.41 | 5.3 | – | – | – | – | – | – | – | – | – | – | 79.9 |
| PR-Deformable DETR (2024)47 | 46.7 | 151.6 | 96.8 | 90.6 | 67.7 | 94.6 | 89.3 | 94.1 | 100 | 90.1 | 80.5 | 80.0 | 88.3 |
| YOLO11n (2024) | 2.59 | 6.4 | 99.5 | 99.5 | 96.4 | 92.7 | 84.1 | 94.7 | 98.6 | 93.4 | 91.1 | 73.7 | 92.4 |
| ECHF-YOLO (2025)48 | 3.2 | 9.0 | 99.3 | 77.9 | 81.4 | 96.5 | 97.2 | 84.3 | 99.0 | 89.4 | 89.5 | 80.0 | 89.4 |
| M2FE-YOLO (2025)49 | 5.4 | 12.8 | – | – | – | – | – | – | – | – | – | – | 93.3 |
| SRM-YOLO (2025)50 | 3.2 | 15.9 | – | – | – | – | – | – | – | – | – | – | 86.6 |
| HAF-YOLO (2025)51 | 4.3 | 11.8 | 98.8 | 76.8 | 87.1 | 97.1 | 92.2 | 60.6 | 97.1 | 92.5 | 63.3 | 84.4 | 85.0 |
| YOLO26n (2025) | 2.4 | 5.2 | 99.5 | 99.5 | 95.2 | 93.1 | 85.4 | 95.3 | 99.5 | 89.7 | 90.2 | 68.8 | 91.6 |
| YOLO-SPCI (2026)52 | 3.09 | 8.2 | 95.1 | 94.1 | 99.5 | 99.6 | 86.4 | 100 | 99.5 | 95.4 | 99.5 | 87.7 | 92.0 |
| DP-CIADet (2026)53 | 2.14 | – | – | – | – | – | – | – | – | – | – | – | 86.0 |
| SCG-YOLO (2026)54 | 2.5 | 6.4 | – | – | – | – | – | – | – | – | – | – | 85.5 |
| YOLO-PICO (2026)55 | 0.79 | 5.1 | – | – | – | – | – | – | – | – | – | – | 84.6 |
| LMW-YOLO (ours) | 2.6 | 6.3 | 99.5 | 99.5 | 96.7 | 95.9 | 85.7 | 97.6 | 99.5 | 96.7 | 94.7 | 77.4 | 94.3 |
As shown in Table 2, LMW-YOLO achieved a superior overall mAP@0.5 score of 94.3%. Compared with the classic baseline YOLOv8n (90.0%) and the established YOLO11n (92.4%), our method realized significant accuracy improvements of 4.3% and 1.9%, respectively. Even when compared against the latest lightweight model, YOLO26n (91.6%), LMW-YOLO maintained a distinct lead of 2.7%. Notably, while achieving state-of-the-art accuracy, LMW-YOLO maintains an extremely compact architecture with only 2.6 M parameters. This represents a parameter reduction of approximately 13% compared to YOLOv8n (3.0 M) and 16% compared to the 2026 model YOLO-SPCI (3.09 M), effectively balancing feature representation efficiency with computational costs.
Specifically, LMW-YOLO demonstrated exceptional performance in categories characterized by complex geometric structures and background clutter. In the highly challenging harbors category, LMW-YOLO achieved an accuracy of 96.7%, significantly outperforming YOLO-SPCI (95.4%) and YOLO11n (93.4%). Similarly, for the bridges category, which often presents large aspect ratios and background interference, our model attained a detection accuracy of 94.7%, surpassing YOLO11n (91.1%) and YOLO26n (90.2%). Furthermore, the model maintained near-perfect detection precision in categories such as ships (99.5%), airplanes (99.5%), and ground track fields (99.5%). These results validate that the proposed modules (MSDP and LKCA) successfully enhanced the model’s adaptability to the diverse spatial scales and semantic conditions found in remote sensing imagery.
LMW-YOLO also establishes a commanding lead over the newest 2026 baselines. Compared to YOLO-SPCI52, it achieves a 2.3% higher mAP@0.5 (94.3% vs. 92.0%) while operating with notably fewer FLOPs (6.3 G vs. 8.2 G). When evaluated alongside contemporary ultra-lightweight models with similar parameter counts, such as SCG-YOLO54 and DP-CIADet53, LMW-YOLO delivers massive accuracy surges of 8.8% and 8.3%. Moreover, while YOLO-PICO55 pushes the limits of miniaturization (0.79 M parameters, 5.1 G FLOPs), it sacrifices significant performance, peaking at only 84.6% mAP. These comparisons confirm that LMW-YOLO provides a much more effective trade-off between computational efficiency and detection robustness.
Beyond lightweight models, LMW-YOLO proves highly competitive against much heavier network architectures. In contrast to PR-Deformable DETR47 (46.7 M, 88.3%), our method secures a 6.0% higher mAP while utilizing a mere 5.6% of its parameters. Furthermore, against M2FE-YOLO49 (5.4 M, 93.3%), LMW-YOLO yields a 1.0% accuracy improvement while cutting the parameter count by more than 50%. By successfully maximizing precision while minimizing architectural complexity, LMW-YOLO proves to be an ideal candidate for resource-constrained remote sensing applications like drone and satellite surveillance.
In addition to the NWPU VHR-10 dataset, we conducted comparative experiments on the extremely challenging RS-STOD dataset, which is specifically designed to evaluate the detection of Small Tiny Objects (STOs). As shown in Table 3, early anchor-free and anchor-based methods, such as CenterNet57 and Cascade56, performed relatively poorly, with mAP@0.5 scores of 19.3% and 37.5%, respectively. This highlights the inherent difficulty of detecting minute targets in vast aerial scenes. Meanwhile, Transformer-based methods (e.g., DINO59 and RT-DETR21) elevated performance to a moderate level of 53.5–66.9%. In contrast, recent algorithms from 2025 demonstrated strong competitiveness; for instance, FBV-Fusion44 and YOLO26n achieved mAP@0.5 scores of 70.9% and 71.0%, respectively. The robust baseline YOLO11n established a high-performance benchmark by securing an mAP@0.5 of 71.9%.
Table 3.
The results of different base models on the RS-STOD dataset.
| Methods | Params (M) | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|---|
| Cascade (2018)56 | 72.0 | 66.4 | 39.3 | 37.5 | 26.6 |
| CenterNet (2019)57 | 32.9 | 51.4 | 22.6 | 19.3 | 8.0 |
| GridRCNN (2019)58 | 60.0 | 71.4 | 35.5 | 35.1 | 25.4 |
| DINO (2022)59 | 47.0 | 57.5 | 55.5 | 53.5 | 26.8 |
| RT-DETR (2024)21 | 42.0 | 71.2 | 65.1 | 66.9 | 41.1 |
| YOLO11n (2024) | 2.59 | 75.2 | 68.2 | 71.9 | 45.1 |
| CFPT (2025)60 | 1.6 | 12.1 | 14.6 | 21.3 | 10.9 |
| ESOD (2025)61 | 2.4 | 73.4 | 68.6 | 69.3 | 43.0 |
| FBV-Fusion (2025)44 | – | 74.0 | 69.0 | 70.9 | 44.3 |
| YOLO26n (2025) | 2.38 | 74.2 | 68.2 | 71.0 | 44.1 |
| AMFC-DEIM (2026)62 | 5.27 | – | – | 70.8 | 44.5 |
| LMW-YOLO (ours) | 2.6 | 75.5 | 68.6 | 72.1 | 45.2 |
Surpassing these benchmarks, our LMW-YOLO further boosted performance to 72.1%. This result not only sets a new record for this benchmark but also outperforms the specialized small object detector FBV-Fusion (70.9%) and the latest YOLO26n (71.0%). Notably, despite the extreme scale variations inherent in RS-STOD, LMW-YOLO achieved the highest Precision (75.5%) and a highly competitive Recall (68.6%) among the comparative methods. This indicates that the proposed WIoU v3 loss and feature enhancement modules effectively mitigated false positives in complex backgrounds while recovering missed detections of minute targets.
Table 3 also reveals key insights regarding localization accuracy under strict metrics. While most methods struggle with the mAP@0.5:0.95 metric, attributed to the strict IoU requirements for targets smaller than 16 pixels, where a deviation of even 1–2 pixels leads to a significant drop in IoU, LMW-YOLO demonstrated exceptional robustness. It achieved an mAP@0.5:0.95 of 45.2%, outperforming the heavy-weight RT-DETR (41.1%) and the baseline YOLO11n (45.1%). Unlike methods such as CenterNet (8.0%) and CFPT (10.9%)60, which severely struggled to maintain high-precision localization, and even compared to strong recent competitors like ESOD (43.0%)61, our model maintained superior stability. This compellingly confirms that even with an ultra-lightweight design, LMW-YOLO retains superior feature extraction and generalization capabilities, effectively addressing the “gradient vanishing” and “feature misalignment” problems frequently encountered in small object detection.
To further evaluate the generalization capability of the proposed modules in dense and occluded scenarios, the LMW-YOLO model was assessed on the challenging VisDrone2019 benchmark. As detailed in Table 4, LMW-YOLO achieved an mAP@0.5 of 37.2%, ranking first among all comparative lightweight methods.
Table 4.
The results of different base models on the VisDrone2019 dataset.
| Methods | P (%) | R (%) | mAP@0.5 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| YOLOv5n (2020) | 36.9 | 27.7 | 25.4 | 1.8 | 7.1 |
| YOLOv6n (2022) | 35.1 | 33.0 | 25.9 | 4.3 | 11.1 |
| YOLOv7-tiny (2022) | 45.7 | 40.3 | 37.0 | 6.2 | 13.1 |
| YOLOv8n (2023) | 44.5 | 35.3 | 35.1 | 3.0 | 8.1 |
| YOLOv9t (2024) | 46.2 | 33.7 | 35.0 | 2.0 | 7.6 |
| YOLOv10n (2024) | 46.3 | 34.0 | 34.7 | 2.3 | 6.5 |
| YOLOv11n (2024) | 46.5 | 34.6 | 35.2 | 2.59 | 6.4 |
| YOLOv12n (2025) | 44.2 | 33.4 | 33.0 | 2.6 | 6.3 |
| LSOD-YOLO (2025) 63 | 48.4 | 38.2 | 37.0 | 3.8 | 33.9 |
| YOLO26n (2025) | 43.5 | 34.2 | 34.5 | 2.38 | 5.2 |
| SRTSOD-YOLO-n (2025) 64 | – | – | 36.3 | 3.5 | 7.4 |
| MCD-YOLO (2026)65 | 44.6 | 33.0 | 31.2 | 2.64 | 9.7 |
| SFEP-YOLO (2026)66 | 45.9 | 36.6 | 34.2 | – | – |
| MDI-YOLO (2026)67 | 47.6 | 35.9 | 37.0 | 3.02 | 7.7 |
| LMW-YOLO (ours) | 48.5 | 38.6 | 37.2 | 2.6 | 6.3 |
This performance significantly outperforms the classic YOLOv8n (35.1%) and the baseline YOLOv11n (35.2%), yielding improvements of 2.1% and 2.0%, respectively. Notably, compared to larger models such as YOLOv7-tiny (37.0%, 6.2 M), LMW-YOLO achieves superior detection accuracy with less than half the parameters (2.6 M), demonstrating an exceptional efficiency-accuracy trade-off.
Furthermore, when compared against the state-of-the-art models released in 2025 and 2026, LMW-YOLO exhibits remarkable robustness. Against the 2025 models, it outperforms YOLOv12n (33.0%) and YOLO26n (34.5%) by margins of 4.2% and 2.7%, respectively. Even against the specialized LSOD-YOLO (37.0%), our model maintains a lead of 0.2% in mAP@0.5, while securing the highest Precision (48.5%) and a highly competitive Recall (38.6%) within the comparison group.
Crucially, LMW-YOLO also demonstrates significant advantages over the latest 2026 models. It surpasses MDI-YOLO (37.0%) by 0.2% in mAP@0.5 while utilizing fewer parameters (2.6 M vs. 3.02 M) and lower computational cost (6.3 G vs. 7.7 G FLOPs). Moreover, it achieves substantial accuracy improvements of 6.0% and 3.0% over MCD-YOLO (31.2%) and SFEP-YOLO (34.2%), respectively. These results compellingly confirm that the proposed LMW-YOLO effectively addresses the challenges of small scale and heavy occlusion typical of drone-captured imagery, proving its versatility beyond standard remote sensing datasets.
Overall, the experimental findings indicate that the proposed improvements yield consistent enhancements in detection performance and demonstrate robust generalization across multiple benchmarks, thereby validating their efficacy and adaptability in diverse remote sensing environments.
Inference speed evaluation
To further validate the lightweight nature of the proposed LMW-YOLO and its potential for deployment across various environments, including resource-constrained edge devices, we conducted inference speed tests on strictly CPU-only environments. All FPS measurements were conducted with a batch size of 1 to reflect real-time inference latency. As shown in Table 5, the evaluation was performed on three different hardware platforms: a high-performance CPU (AMD Ryzen 9 7940HX), a resource-constrained CPU (AMD Ryzen 5 3550H), and an Intel server-grade CPU (Intel Xeon Platinum 8255C), using the NWPU VHR-10 dataset.
Table 5.
Inference speed (FPS) comparison on different CPU platforms (tested on the NWPU VHR-10 dataset).
| CPU | YOLO11n | YOLO26n | LMW-YOLO (ours) |
|---|---|---|---|
| AMD Ryzen 9 7940HX | 40.97 | 43.85 | 42.03 |
| AMD Ryzen 5 3550H | 5.59 | 8.23 | 7.33 |
| Intel Xeon Platinum 8255C | 10.18 | 12.46 | 11.92 |
On the high-performance Ryzen 9 processor, all models achieve inference speeds exceeding 30 FPS, which matches standard video capture rates and satisfies real-time detection requirements. Notably, our LMW-YOLO achieves 42.03 FPS, slightly outperforming the baseline YOLO11n (40.97 FPS). On the server-grade Intel Xeon Platinum 8255C, LMW-YOLO demonstrates a clear advantage with an inference speed of 11.92 FPS, outperforming the baseline YOLO11n (10.18 FPS) by a relative margin of approximately 17%. More importantly, on the older and resource-constrained Ryzen 5 processor, which better simulates the limited computational power of UAVs or edge computing nodes, the baseline YOLO11n experiences a severe performance drop to 5.59 FPS. In contrast, LMW-YOLO maintains an inference speed of 7.33 FPS, delivering a significant relative speedup of approximately 31% over YOLO11n. Although YOLO26n exhibits a slightly faster speed, LMW-YOLO achieves a highly competitive balance between superior detection accuracy and inference efficiency. These CPU-based evaluations demonstrate that the CSD strategy, the proposed modules (LKCA and MSDP), and the introduced WIoU v3 loss keep computational overhead low, making LMW-YOLO highly suitable for diverse deployment scenarios, particularly real-world remote sensing applications on edge devices.
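CPU throughput figures of this kind can be reproduced with a simple timing loop such as the sketch below (batch size 1, untimed warm-up runs); the exact measurement harness used for Table 5 may differ.

```python
import time
import torch

def measure_fps(model: torch.nn.Module, imgsz: int = 640,
                warmup: int = 10, iters: int = 100) -> float:
    """Average CPU frames per second at batch size 1."""
    model.eval()
    x = torch.randn(1, 3, imgsz, imgsz)
    with torch.no_grad():
        for _ in range(warmup):              # warm-up runs are not timed
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    return iters / elapsed
```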
Visualization and analysis
To intuitively demonstrate the enhanced detection capabilities of the proposed model on remote sensing imagery, we conducted visual comparisons between the baseline YOLO11n, the latest YOLO26n, and the improved LMW-YOLO on the NWPU VHR-10, RS-STOD, and VisDrone2019 datasets, as illustrated in Fig. 6.
Fig. 6.
Visual comparison of detection results. The images illustrate the detection performance on the NWPU VHR-10 (Rows 1–2), RS-STOD (Rows 3–4), and VisDrone2019 (Rows 5–6) datasets.
On the NWPU VHR-10 dataset (Rows 1–2): LMW-YOLO demonstrates superior robustness in detecting diverse small objects. While both the baseline YOLO11n and the latest YOLO26n suffer from varying degrees of false negatives (missed detections), our method successfully identifies all instances with visibly higher confidence scores. This highlights the effectiveness of the LKCA module in enhancing feature representation against complex backgrounds.
On the RS-STOD dataset (Rows 3–4): Detecting extremely minute targets remains a challenge for all evaluated models, leading to some inevitable omissions. However, LMW-YOLO exhibits significantly higher recall and precision compared to its counterparts. Specifically, in Row 3, the comparative models miss a substantial number of "Small Vehicle" instances. More importantly, in Row 4, both YOLO11n and YOLO26n struggle with scale ambiguity, frequently misclassifying "Large Vehicles" as "Small Vehicles". In contrast, our model accurately distinguishes between these fine-grained categories, validating the scale-adaptive capability of the proposed MSDP module.
On the VisDrone2019 dataset (Rows 5–6): In dense and crowded scenarios, the advantages of LMW-YOLO are most pronounced. As depicted in Row 5, the comparative models fail to detect a significant number of “car” instances due to occlusion. Furthermore, in Row 6, YOLO11n exhibits false positives by incorrectly identifying background textures as “trucks,” while YOLO26n suffers from severe missed detections for both “truck” and “car” targets. LMW-YOLO effectively suppresses these background interferences and maintains high localization accuracy, proving its superiority in handling dense aerial scenes.
To further investigate the intrinsic improvement mechanisms of LMW-YOLO, we visualized the output feature maps of the P3 and P4 detection heads using Grad-CAM heatmaps, as shown in Fig. 7. The comparative results clearly reveal the limitations of baseline models when handling multi-scale remote sensing targets: YOLO11n and YOLO26n often exhibit a distinct “scale mismatch” phenomenon. Specifically, when detecting small targets, the attention of the original models appears scattered, easily introducing excessive background noise; whereas when facing larger targets, constrained by their limited receptive fields, they fail to encompass the holistic semantic region of the objects.
Fig. 7.
Heat map visualization. The images illustrate the heat maps on the NWPU VHR-10 (Row 1), RS-STOD (Row 2), and VisDrone2019 (Row 3) datasets. Warmer colors indicate regions with higher attention weights, corresponding to the detected objects, while cooler colors represent background areas.
However, LMW-YOLO demonstrates superior scale adaptability. This improvement is attributed to the synergy between the long-range context modeling capability of the LKCA module and the multi-scale feature aggregation capability of the MSDP module. On one hand, owing to the dilated residual features extracted by MSDP, the heatmaps for larger targets can break through local limitations, completely covering the subject and capturing global semantic information. On the other hand, with the aid of LKCA, even in regions containing dense small targets (e.g., vehicles, storage tanks), the heatmaps exhibit highly concentrated highlighted responses (distinct red regions). The Large Kernel Attention mechanism enables the network to establish effective contextual dependencies in shallow features, thereby guiding attention to precisely focus on the core parts of objects and effectively suppressing background interference. This precise visual attention mechanism not only optimizes the decision-making process but also validates the robustness of LMW-YOLO in complex remote sensing scenarios from the perspective of interpretability.
Conclusion
In this paper, we proposed LMW-YOLO, a lightweight and high-precision object detector tailored for remote sensing images, aiming to address long-standing challenges such as drastic scale variations, complex background interference, and the dense distribution of small objects. Guided by the proposed CSD strategy, we optimized the feature extraction architecture to balance spatial detail and semantic abstraction. Specifically, the LKCA module is integrated into shallow layers to capture long-range dependencies for small objects, while the MSDP module is introduced in deep layers to enhance multi-scale context aggregation. Furthermore, the WIoU v3 loss function employs a dynamic non-monotonic focusing mechanism to optimize gradient gain allocation, significantly improving localization accuracy for dense targets by suppressing harmful gradients from low-quality samples.
Extensive experiments on three authoritative benchmarks demonstrate that LMW-YOLO achieves superior detection performance with exceptional parameter efficiency. On the NWPU VHR-10 dataset, LMW-YOLO achieved a remarkable 94.3% mAP@0.5, surpassing the baseline YOLO11n (92.4%) and the latest YOLO26n (91.6%) by a significant margin, as well as recent 2026 models such as YOLO-SPCI (92.0%) and SCG-YOLO (85.5%), while maintaining a compact size of only 2.6 M parameters. On the challenging RS-STOD dataset dominated by STOs, our model established a new State-of-the-Art (SOTA) with 72.1% mAP@0.5 and 45.2% mAP@0.5:0.95. It significantly outperforms specialized algorithms such as FBV-Fusion (70.9%) and heavy-weight models like RT-DETR (66.9%), alongside the newly introduced 2026 method AMFC-DEIM (70.8%), validating its robustness in detecting micro-scale targets. Similarly, on the complex VisDrone2019 benchmark, LMW-YOLO attained 37.2% mAP@0.5, ranking first among comparative methods and outperforming 2025 state-of-the-art models, including LSOD-YOLO (37.0%) and SRTSOD-YOLO-n (36.3%), in addition to the 2026 methods like MDI-YOLO (37.0%) and SFEP-YOLO (34.2%). Furthermore, LMW-YOLO delivers highly competitive inference speeds on CPU platforms. It achieves 42.03 FPS on an AMD Ryzen 9 7940HX, 7.33 FPS on an AMD Ryzen 5 3550H, and 11.92 FPS on an Intel Xeon Platinum 8255C, consistently outperforming the baseline YOLO11n (40.97 FPS, 5.59 FPS, and 10.18 FPS, respectively) on the NWPU VHR-10 dataset. LMW-YOLO achieves a significantly better trade-off between real-time processing speed and state-of-the-art accuracy.
However, certain limitations of our method merit further discussion. At present, the evaluation is confined to static, single-modal optical images, and the theoretical parameter efficiency has not yet been validated in terms of real-world inference speed on physical edge computing devices (e.g., FPGAs or Jetson modules). Furthermore, while LMW-YOLO is robust in general remote sensing scenarios, it may still struggle in certain extreme cases. For instance, extreme lighting conditions (e.g., low illumination or heavy cloud shadows) and severe occlusions in dense urban environments can degrade feature representations and lead to missed detections. Additionally, the model's performance on long-tail categories with extremely limited training samples remains suboptimal. Future work will focus on exploring few-shot learning techniques to address class imbalance and on extending the framework to multi-modal inputs (e.g., fusing optical imagery with SAR or infrared data to overcome adverse lighting and weather conditions). We also plan to adapt the method to oriented bounding box (OBB) detection to better serve diverse and complex earth observation tasks on edge devices.
Author contributions
Y.Q. conceived the experiment(s); Y.Q. conducted the experiment(s); Y.Q. and Z.L. analysed the results. All authors reviewed the manuscript.
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Data Availability
All relevant data are within the manuscript. Additionally, the datasets underlying the results of this study are available from public repositories. Specifically, the NWPU VHR-10 dataset is available at https://github.com/Gaoshuaikun/NWPU-VHR-10, the RS-STOD Dataset is available at https://github.com/lixinghua5540/FBVF-YOLO, and the VisDrone2019 dataset is available at https://github.com/VisDrone/VisDrone-Dataset. The code of the proposed LMW-YOLO model is publicly available at https://github.com/qqqqqq-ch/LMW-YOLO/. For further information or specific inquiries regarding the code, please contact the corresponding author.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Liu, Y. et al. Multiscale U-shaped CNN building instance extraction framework with edge constraint for high-spatial-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens.59, 6106–6120. 10.1109/TGRS.2020.3022410 (2021).
- 2.Laghari, A. A. et al. Unmanned aerial vehicles advances in object detection and communication security review. Cogn. Robot.4, 128–141. 10.1016/j.cogr.2024.07.002 (2024).
- 3.Pi, Y., Nath, N. D. & Behzadan, A. H. Convolutional neural networks for object detection in aerial imagery for disaster response and recovery. Adv. Eng. Inform.43, 101009. 10.1016/j.aei.2019.101009 (2020).
- 4.Redmon, J. & Farhadi, A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
- 5.Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
- 6.Li, C. et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).
- 7.Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7464–7475. 10.1109/CVPR52729.2023.00721 (2023).
- 8.Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021).
- 9.Wang, C.-Y., Yeh, I.-H. & Mark Liao, H.-Y. YOLOv9: Learning what you want to learn using programmable gradient information. In Computer Vision—ECCV 2024, 1–21. 10.1007/978-3-031-72751-1 (2024).
- 10.Huang, X., He, B., Tong, M., Wang, D. & He, C. Few-shot object detection on remote sensing images via shared attention module and balanced fine-tuning strategy. Remote Sens.13, 3816. 10.3390/rs13193816 (2021).
- 11.Qu, J., Tang, Z., Zhang, L., Zhang, Y. & Zhang, Z. Remote sensing small object detection network based on attention mechanism and multi-scale feature fusion. Remote Sens.15. 10.3390/rs15112728 (2023).
- 12.Li, Q., Yan, Y. & Lan, S. Fca-yolo: A small object detection method based on feature attention fusion for UAV remote sensing images. AIMS Math.11, 5172–5191. 10.3934/math.2026211 (2026).
- 13.Li, Y., Yang, Y., An, Y., Sun, Y. & Zhu, Z. Lars: Remote sensing small object detection network based on adaptive channel attention and large kernel adaptation. Remote Sens.16. 10.3390/rs16162906 (2024).
- 14.He, Q. et al. Multi-scale spatial fusion lightweight model for optical remote sensing image-based small object detection. Geo-spatial Inf. Sci. 1–21. 10.1080/10095020.2025.2555616 (2025).
- 15.Wang, K., Wang, Z., Wu, Q., Xu, J. & He, C. WDS-YOLO: A small object detection algorithm based on YOLOv8s. In Third International Conference on Remote Sensing, Mapping, and Geographic Information Systems (RSMG 2025), vol. 13791, 373–380. 10.1117/12.3084871 (2025).
- 16.Li, G. et al. LR-Net: Lossless feature fusion and revised SIoU for small object detection. Comput. Mater. Continua. 85, 3267–3288. 10.32604/cmc.2025.067763 (2025).
- 17.Azeem, A., Li, Z., Siddique, A., Zhang, Y. & Zhou, S. Unified multimodal fusion transformer for few shot object detection for remote sensing images. Inf. Fusion111, 102508. 10.1016/j.inffus.2024.102508 (2024).
- 18.Li, H., Zhang, R., Pan, Y., Ren, J. & Shen, F. Lr-fpn: Enhancing remote sensing object detection with location refined feature pyramid network. In 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. 10.1109/IJCNN60899.2024.10650583 (2024).
- 19.Hu, J., Shen, L., Albanie, S., Sun, G. & Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell.42, 2011–2023. 10.1109/TPAMI.2019.2913372 (2020).
- 20.Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. CBAM: Convolutional block attention module. In Computer Vision – ECCV 2018, 3–19. 10.1007/978-3-030-01234-2_1 (2018).
- 21.Zhao, Y. et al. DETRs beat YOLOs on real-time object detection. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16965–16974. 10.1109/CVPR52733.2024.01605 (2024).
- 22.Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7794–7803. 10.1109/CVPR.2018.00813 (2018).
- 23.Zheng, Z. et al. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern.52, 8574–8586. 10.1109/TCYB.2021.3095305 (2022).
- 24.T, S., R, C., J, S., Varadan, G. & Mohan, S. S. Mean-shift based object detection and clustering from high resolution remote sensing imagery. In 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 1–4. 10.1109/NCVPRIPG.2013.6776271 (2013).
- 25.Paul, S. & Pati, U. C. Remote sensing optical image registration using modified uniform robust SIFT. IEEE Geosci. Remote Sens. Lett.13, 1300–1304. 10.1109/LGRS.2016.2582528 (2016).
- 26.Yu, D. & Ji, S. A new spatial-oriented object detection framework for remote sensing images. IEEE Trans. Geosci. Remote Sens.60, 1–16. 10.1109/TGRS.2021.3127232 (2022). [Google Scholar]
- 27.Wang, Y., Xu, C., Liu, C. & Li, Z. Context information refinement for few-shot object detection in remote sensing images. Remote Sens.14, 3255. 10.3390/rs14143255 (2022).
- 28.Zhang, M., Zhang, B., Liu, M. & Xin, M. Robust object detection in aerial imagery based on multi-scale detector and soft densely connected. IEEE Access8, 92791–92801. 10.1109/ACCESS.2020.2994379 (2020).
- 29.Zhao, L. & Zhu, M. MS-YOLOv7: YOLOv7 based on multi-scale for object detection on UAV aerial photography. Drones7, 188. 10.3390/drones7030188 (2023).
- 30.Gu, Q., Huang, H., Han, Z., Fan, Q. & Li, Y. GLFE-YOLOX: Global and local feature enhanced YOLOX for remote sensing images. IEEE Trans. Instrum. Meas.73, 1–12. 10.1109/TIM.2024.3387499 (2024).
- 31.Qin, Z., Chen, D. & Wang, H. MCA-YOLOv7: An improved UAV target detection algorithm based on YOLOv7. IEEE Access12, 42642–42650. 10.1109/ACCESS.2024.3378748 (2024).
- 32.Yang, Y., Di, J., Liu, G. & Wang, J. Rice pest recognition method based on improved YOLOv8. In 2024 4th International Conference on Consumer Electronics and Computer Engineering (ICCECE), 418–422. 10.1109/ICCECE61317.2024.10504248 (2024).
- 33.Zhu, X., Lyu, S., Wang, X. & Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2778–2788. 10.1109/ICCVW54120.2021.00312 (2021).
- 34.Wang, F., Wang, H., Qin, Z. & Tang, J. UAV target detection algorithm based on improved YOLOv8. IEEE Access11, 116534–116544. 10.1109/ACCESS.2023.3325677 (2023).
- 35.Yang, F., Fan, H., Chu, P., Blasch, E. & Ling, H. Clustered object detection in aerial images. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 8311–8320. 10.1109/ICCV.2019.00840 (2019).
- 36.Xie, Y., Cai, J., Bhojwani, R., Shekhar, S. & Knight, J. A locally-constrained YOLO framework for detecting small and densely-distributed building footprints. Int. J. Geogr. Inf. Sci.34, 777–801. 10.1080/13658816.2019.1624761 (2020).
- 37.Niu, K. & Yan, Y. A small-object-detection model based on improved YOLOv8 for UAV aerial images. In 2023 2nd International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP), 57–60. 10.1109/AIIIP61647.2023.00016 (2023).
- 38.Niu, R. et al. Aircraft target detection in low signal-to-noise ratio visible remote sensing images. Remote Sens.15, 1971. 10.3390/rs15081971 (2023).
- 39.Sun, Y., Liu, W., Gao, Y., Hou, X. & Bi, F. A dense feature pyramid network for remote sensing object detection. Appl. Sci.12, 4997. 10.3390/app12104997 (2022).
- 40.Wan, X., Yu, J., Tan, H. & Wang, J. LAG: Layered objects to generate better anchors for object detection in aerial images. Sensors22, 3891. 10.3390/s22103891 (2022).
- 41.Wang, J., Yang, W., Li, H.-C., Zhang, H. & Xia, G.-S. Learning center probability map for detecting objects in aerial images. IEEE Trans. Geosci. Remote Sens.59, 4307–4323. 10.1109/TGRS.2020.3010051 (2021).
- 42.Guo, M.-H., Lu, C.-Z., Liu, Z.-N., Cheng, M.-M. & Hu, S.-M. Visual attention network. Comput. Visual Media9, 733–752. 10.1007/s41095-023-0364-2 (2023).
- 43.Cheng, G., Han, J., Zhou, P. & Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens.98, 119–132. 10.1016/j.isprsjprs.2014.10.002 (2014).
- 44.Bai, X., Li, X., Miao, J. & Shen, H. A front-back view fusion strategy and a novel dataset for super tiny object detection in remote sensing imagery. Knowl.-Based Syst.326, 114051. 10.1016/j.knosys.2025.114051 (2025).
- 45.Du, D. et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 213–226. 10.1109/ICCVW.2019.00030 (2019).
- 46.Chen, X. et al. DET-YOLO: An innovative high-performance model for detecting military aircraft in remote sensing images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens.17, 17753–17771. 10.1109/JSTARS.2024.3462745 (2024).
- 47.Chen, Y., Liu, B. & Yuan, L. PR-Deformable DETR: DETR for remote sensing object detection. IEEE Geosci. Remote Sens. Lett.21, 1–5. 10.1109/LGRS.2024.3483217 (2024).
- 48.Xie, H. et al. ECHF-YOLO: a remote sensing object detection network integrating efficient convolution and high-low frequency features. Big Earth Data. 1–34. 10.1080/20964471.2025.2575602 (2025).
- 49.Wu, Q. et al. M2FE-YOLO: Multibranch and multilevel feature enhancement network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens.63, 1–19. 10.1109/TGRS.2025.3612212 (2025).
- 50.Yao, B. et al. SRM-YOLO for small object detection in remote sensing images. Remote Sens.17, 2099. 10.3390/rs17122099 (2025).
- 51.Zhang, P., Liu, J., Zhang, J., Liu, Y. & Shi, J. HAF-YOLO: Dynamic feature aggregation network for object detection in remote-sensing images. Remote Sens.17, 2708. 10.3390/rs17152708 (2025).
- 52.Wang, X., Peng, L., Li, X., He, Y. & U, K. Enhancing remote sensing object detection via selective-perspective-class integration. Eng. Appl. Artif. Intell.165, 113416. 10.1016/j.engappai.2025.113416 (2026).
- 53.Sun, J. et al. DP-CIADet: Detail perception and critical information aggregation compact detection network for aerial optical remote sensing images. Opt. Laser Technol.193, 113519. 10.1016/j.optlastec.2025.113519 (2026).
- 54.Fan, W., Liu, Y., Wang, X. & He, H. SCG-YOLO: Shift-context network with multidirectional upsampling and geometric quality for small-object detection. IEEE Geosci. Remote Sens. Lett.23, 1–5. 10.1109/LGRS.2026.3657206 (2026).
- 55.Aghili, M. E., Ghassemian, H. & Imani, M. Yolo-pico: Lightweight object recognition in remote sensing images using expansion attention modules. Pattern Recogn.176, 113114. 10.1016/j.patcog.2026.113114 (2026).
- 56.Cai, Z. & Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6154–6162. 10.1109/CVPR.2018.00644 (2018).
- 57.Zhou, X., Wang, D. & Krähenbühl, P. Objects as points. arXiv preprint arXiv:1904.07850 (2019).
- 58.Lu, X., Li, B., Yue, Y., Li, Q. & Yan, J. Grid R-CNN. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7355–7364. 10.1109/CVPR.2019.00754 (2019).
- 59.Zhang, H. et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022).
- 60.Du, Z., Hu, Z., Zhao, G., Jin, Y. & Ma, H. Cross-layer feature pyramid transformer for small object detection in aerial images. IEEE Trans. Geosci. Remote Sens.63, 1–14. 10.1109/TGRS.2025.3572706 (2025). [Google Scholar]
- 61.Liu, K. et al. ESOD: Efficient small object detection on high-resolution images. IEEE Trans. Image Process.34, 183–195. 10.1109/TIP.2024.3501853 (2025).
- 62.Lin, X., Li, G., Xie, J. & Zhi, Z. AMFC-DEIM: Improved DEIM with adaptive matching and focal convolution for remote sensing small object detection. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens.19, 5021–5034. 10.1109/JSTARS.2026.3653626 (2026).
- 63.Jiang, H. et al. LSOD-YOLO: Lightweight small object detection algorithm for wind turbine surface damage detection. J. Nondestr. Eval.44, 112. 10.1007/s10921-025-01253-2 (2025).
- 64.Xu, Z. et al. SRTSOD-YOLO: Stronger real-time small object detection algorithm based on improved YOLO11 for UAV imageries. Remote Sens.17, 3414. 10.3390/rs17203414 (2025).
- 65.Li, F. et al. MCD-YOLO: An improved YOLOv11 framework for manhole cover detection in UAV imagery. IEEE Geosci. Remote Sens. Lett.23, 1–5. 10.1109/LGRS.2026.3663831 (2026).
- 66.Lu, Z., Zhang, X., Cao, X., Hou, J. & Yuan, X. SFEP-YOLO: A track obstacle detection model for autonomous electric locomotives in underground mine. IEEE Trans. Veh. Technol. 1–14. 10.1109/TVT.2026.3651888 (2026).
- 67.Shi, H. et al. MDI-YOLO: A lightweight transformer-CNN-based multidimensional feature fusion model for small object detection. Sci. Rep.16, 7233. 10.1038/s41598-026-38378-x (2026).