Abstract
In rice pest management, accurate pest detection is critical for intelligent agricultural systems, yet challenges such as limited dataset availability, pest occlusion, and insufficient small object detection accuracy hinder effective monitoring. To address these challenges, this study presents YOLO-PEST, a detection approach based on the YOLOv5s architecture. YOLO-PEST collects rice pest images from multiple channels and randomly crops image patches to partially occlude detection boxes, effectively simulating pest overlapping scenarios. During feature fusion, the ConvNeXt module is integrated to improve the detection accuracy for small objects via multiscale feature extraction. Additionally, the CoTAttention mechanism is incorporated to enhance the model’s robustness under complex environmental conditions. Comparative experiments show that YOLO-PEST achieves an mAP@0.5 of 97%, a 1.4-percentage-point improvement over the baseline YOLOv5s, verifying its effectiveness in rice pest management.
Keywords: YOLOv5s, Rice pests, Object detection, Convolutional neural network
Introduction
Rice serves as a staple food and a vital economic crop in China. However, pests and diseases [1] continuously pose threats to mature crops, leading to reductions in both yield and quality. Conventional methods, such as manual inspection, are labor intensive and highly subjective, frequently resulting in misjudgments. These inaccuracies can prompt the excessive and indiscriminate use of pesticides, thereby causing environmental contamination. Additionally, traditional machine learning approaches necessitate manual feature extraction, which is an inherently intricate process and is susceptible to the variability of field conditions. Therefore, this study emphasizes that the development of accurate rice pest detection techniques is of paramount importance for minimizing yield losses and enhancing farmers’ revenue.
In recent years, the image detection method based on deep learning has shown great potential in the agricultural field. This method is applied throughout the entire life cycle management of crops, including disease detection before harvest, yield prediction, real-time monitoring during harvest, and quality control after harvest, all with the goal of minimizing losses [2]. Despite its wide application in practical scenarios such as fruit tree disease detection and crop pest monitoring, the complex field environment still creates two core bottlenecks. First, object occlusion reduces the reliability of detection. Many approaches have been proposed by researchers to address this issue. For instance, Li et al. [3] utilized a field robot equipped with multi-view cameras for data collection, achieving full view coverage of blueberry plants to resolve the occlusion problem. Additionally, Zhou et al. [4] enhanced the tracking algorithm by fusing multi-view camera dynamic tracking with multiple mapping algorithms. By combining motion trajectory prediction and a customized strawberry feature extractor, they proposed a detection network resistant to occlusion for strawberry maturity, which significantly decreased missed detections caused by occlusion. Second, the feature representation of small objects is inadequate, thereby constraining the detection accuracy. For example, Upadhyay et al. [5] employed an enhanced ResNeXt convolutional neural network to predict three major fungal diseases of crops, surmounting the problem of small disease spot loss. Subsequently, Upadhyay et al. [6] put forward an end-to-end method for detecting the severity of leaf disease infection. They introduced SegLearner to accurately detect and outline the disease spot area in leaf images. Then, by using the pixel counting method, they determined the severity of leaf damage, offering a fresh approach to small object detection. Compared with the relatively static occlusion situations of blueberry, strawberry and other crops, as well as the fixed disease spot objects like leaf diseases, rice pest detection faces more challenging issues. The small size and high movement randomness of pests lead to frequent occlusions, and there are significant morphological differences of the same pest at various growth stages.
To address the above challenges, extensive research has been carried out in recent years to innovate rice pest detection models. Among these efforts, deep learning methods have attracted considerable attention for their capacity to detect crop pests and diseases with minimal human intervention. For instance, Yang et al. [7] proposed Maize-YOLO, an improved YOLOv7 architecture. The method integrates the CSPResNeXt-50 module with the VOV-GSCSP module, significantly reducing computational complexity and improving detection accuracy. On a dataset containing 4,533 images covering 13 types of corn pests, the proposed method achieves 76.3% mean Average Precision (mAP). However, despite the above advantages, the model still exhibits performance attenuation when dealing with small objects or complex backgrounds. Yin et al. [8] developed YOLO-RMD, a new algorithm based on YOLOv8. This algorithm incorporates innovative modules like the Receptive Field Attention Convolution component, Mixed Local Channel Attention, and an upgraded multi-scale detection framework equipped with Dynamic Head. As a result, YOLO-RMD notably boosts detection precision and adaptability. Nevertheless, the dataset’s insufficient coverage of certain pest species hinders the model’s ability to recognize other species. Wang et al. [9] developed a Transformer-based rice pest detection method, which co-optimizes network feature representation through the RepPconv module and the neck structure of the Gold-YOLO network. Experimental results on the dataset demonstrate that this method reduces the number of parameters while enhancing detection accuracy. Nevertheless, the issue of insufficient dataset comprehensiveness persists, leading to poor model generalization ability. Wang et al. [10] developed the RGC-YOLO model, which employs the GhostConv structure instead of traditional convolutional layers. The model adopts a reparameterized RepGhost module to replace the C2f layer and incorporates a hybrid attention mechanism. These enhancements allow for efficient multi-scale recognition of rice pests and diseases, and also make it suitable for the deployment requirements of field embedded devices, thereby supporting real-time monitoring. Wang et al. [11] developed the Insect-YOLO method, adopting a two-stage transfer learning strategy. First, a general pest recognition model was pretrained on the ImageNet dataset, followed by fine-tuning on the COCO dataset to enhance the accuracy of small-sample pest detection. Finally, the Channel-Spatial Attention Mechanism (CBAM) was integrated into the backbone network to further optimize the model’s adaptability to limited samples, significantly improving the efficiency of crop pest recognition. Du et al. [12] proposed modifying the YOLOv7 model by introducing a Progressive Spatial Adaptive Feature Pyramid (PSAFP) structure and applying a hybrid loss function to suppress negative sample interference. However, the increased structural complexity of PSAFP leads to high computational overhead, making it difficult for the model to meet the practical requirements of lightweight deployment. To address the lightweight deployment challenge, Li et al. [13] developed RDRM-YOLO, a lightweight detection model. By fusing the SPDConv module with the lightweight GSConv module in the neck network, this method significantly reduces computational complexity while maintaining feature transfer accuracy.
Nevertheless, there remains room for optimizing its adaptability to scenarios with strong light variations and severe lesion occlusions. Guan et al. [14] developed the GC-FasterRCNN method, which markedly improves the detection performance for multi-scale objects and similar categories by incorporating the GCT-CBAM hybrid attention mechanism into the backbone network. Nevertheless, the Insect25 dataset mainly consists of clear images with sparse pest distributions, which restricts the model’s generalization ability in practical dense occlusion scenarios. Huang et al. [15] proposed the lightweight YOLO-YSTs detection method, which successfully resolves the problems of missed detections and false positives for small-sized and overlapping objects, while striking a balance between detection speed and accuracy. However, the limited sample size of the yellow sticky trap dataset used in that research restricts further improvement of the model’s performance. Wang et al. [16] proposed the SMC-YOLO model based on YOLOv8. The model enriches feature information by replacing the SPPF module with the Spatial Pyramid Convolutional Pooling Module (SPCPM). Additionally, it introduces the Multi-dimensional Feature Enhancement Module (MDFEM) in the neck network to focus on pest and disease characteristics. Although this design reduces the computational load, its recognition accuracy remains somewhat limited. A detailed comparison of the latest pest detection methods mentioned above is presented in Table 1 below.
Table 1.
Comparison of different deep learning methods for pest detection
| Author | Method | Dataset | Classes | mAP@0.5 (%) |
|---|---|---|---|---|
| Yang et al. | YOLOv7, CSPResNeXt-50, VOV-GSCSP | IP102 (IP13) | 13 maize pests | 76.3 |
| Yin et al. | YOLOv8 | IP102, Kaggle | 7 rice pests | 98.2 |
| Wang et al. | Transformer, RepPConv, MPDIoU | IP102 | 7 rice pests | 76.9 |
| Wang et al. | YOLOv8, GhostConv, RepGhost | Kaggle | 1 rice pest and 3 rice diseases | 93.2 |
| Wang et al. | YOLOv8, Transfer learning, CBAM | Hangzhou Zhuo Qi Electronic Technology Co., Ltd. | 7 rice pests | 93.8 |
| Du et al. | YOLOv7, PSAFP, Varifocal Loss, Loss Rank Mining | filtered-plant-village-dataset / 2000-sample-for-all-class & rice-false-smut | Rice and corn | 84.7/93.3 |
| Li et al. | YOLOv5, Hor-BNFA, SPDConv, GSConv | Rice Leaf Ailment Visual Archive (RLAVA) | 4 rice diseases | 93.5 |
| Guan et al. | Faster RCNN, GCT-CBAM, EIoU | Self-constructed dataset (Insect25) | 25 pests | 96.3 |
| Huang et al. | YOLOv10, SPD-Conv, Inner-SIoU, iRMB | Self-constructed & yellow sticky traps | 5 pests | 86.8 |
| Wang et al. | YOLOv8, SPCPM, MDFEM, CSFLNLM | Self-constructed & IP102 | 9 maize pests | 86.7 |
Drawing from the foregoing analyses, while mainstream networks demonstrate remarkable performance in detecting standard size pests, their recognition accuracy for small objects in occluded environments remains suboptimal. To address this, this study proposes a deep learning based rice pest detection method that adopts YOLOv5 as the baseline model. In the YOLO series, YOLOv7 and YOLOv8 enhance the feature fusion capabilities through cross-stage local networks, yet their structural complexity tends to compromise inference efficiency. YOLOv9 introduces Programmable Gradient Information (PGI) to strengthen small object detection, but its additional gradient calculation module leads to parameter redundancy. Although YOLOv10, YOLOv11, and YOLOv12 improve inference efficiency via non-reparameterization designs, their accuracy decreases in complex backgrounds. By contrast, YOLOv5 outperforms other versions in balancing accuracy, speed, and lightweight design, providing an ideal foundational framework for small object detection in occluded environments. The primary contributions of this study are summarized as follows:
Images are randomly cropped to partially occlude pest object regions, simulating occlusion scenarios and enriching dataset diversity as well as the model’s generalization ability.
ConvNeXt modules are introduced to optimize the convolutional module, lowering the computational load and parameter count while improving the model’s ability to extract features of small object pests.
The CoTAttention mechanism is incorporated before the SPPF layer to suppress complex background elements in the field, improving the detection accuracy of pest objects.
The model was tested on a dataset covering ten categories of highly damaging rice pests, achieving an mAP@0.5 of 97%, which outperforms most classical algorithms, including YOLOv5s, YOLOv7-tiny, YOLOv8s, YOLOv10s, YOLOv11s, and YOLOv12s.
Materials and methods
Dataset construction
In the field of deep learning, training networks usually requires a dataset with a substantial number of images. Inadequate data can lead to issues such as overfitting, flawed recognition, and reduced reliability. The most extensive dataset for pest detection to date is the IP102 dataset, released by Wu et al. in 2019, which contains 75,000 pest samples and encompasses nearly all prevalent pest species. However, publicly available datasets focused on single-crop rice pest detection remain scarce. Consequently, this study introduces a novel rice pest dataset created by sourcing and filtering images through various methods. The dataset includes 10 pest categories with intricate backgrounds and diverse characteristics, as shown in Fig. 1. During dataset creation, rice pest images were first manually selected using existing label information, then processed through cropping, resizing, and pixel value normalization. Subsequently, the LabelImg tool was used to annotate images with rectangular bounding boxes and category labels following the object detection approach. This process yielded 2,000 original images featuring pest objects of various sizes and complex backgrounds.
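The preprocessing steps above can be illustrated with a minimal sketch, assuming OpenCV is available; the center-crop strategy and the 640 × 640 target size are illustrative assumptions, since the text only states that cropping, resizing, and pixel value normalization were applied.

```python
import cv2
import numpy as np

def preprocess(image_path, target_size=640):
    """Crop, resize, and normalize one raw pest image (illustrative sketch only)."""
    img = cv2.imread(image_path)                       # BGR image, uint8
    h, w = img.shape[:2]
    side = min(h, w)                                   # assumed: centre crop to a square
    top, left = (h - side) // 2, (w - side) // 2
    img = img[top:top + side, left:left + side]
    img = cv2.resize(img, (target_size, target_size))  # assumed network input size
    return img.astype(np.float32) / 255.0              # pixel value normalization to [0, 1]
```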
Fig. 1.
Images of ten categories of pests
Data augmentation
After preparation was finished, this study split the original images into training and validation sets at an 8:2 ratio before data augmentation, so that augmented copies of training images could not appear in the validation set. To address the issue of class imbalance, data augmentation [17] was applied to the insect species with labels 3, 4, and 9 in the initial training set, which contained comparatively few samples. Horizontal flipping, random adjustment of brightness and contrast, and random rotation were used to expand the data of these three insect species on the basis of the original training set, until the number of images was eight times the original; the expanded data were then used for the experiments in this study. As shown in Table 2, the training set contains 2084 images and 2886 labels, and the validation set contains 370 images with their corresponding labels.
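As a concrete illustration of these three augmentations, the sketch below uses the Albumentations library; the library choice, probability values, and rotation limit are assumptions for illustration rather than settings reported in this study.

```python
import albumentations as A

# Augmentation pipeline for the under-represented classes (labels 3, 4, and 9);
# probabilities and the rotation limit are illustrative assumptions.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),              # horizontal flipping
        A.RandomBrightnessContrast(p=0.5),    # random brightness and contrast adjustment
        A.Rotate(limit=30, p=0.5),            # random rotation
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: result = augment(image=img, bboxes=yolo_boxes, class_labels=labels)
# result["image"], result["bboxes"], result["class_labels"] give the augmented sample.
```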
Table 2.
Classification and quantity of labels for rice pest images
| Pest species | Number |
|---|---|
| Cletus punctiger eggs | 269 |
| Rice leaf folder | 511 |
| Tryporyza incertulas | 284 |
| Rice moth pupa | 102 |
| Sesamia inferens | 131 |
| Nezara viridula | 311 |
| Scotinophara lurida | 350 |
| Cotton mirids | 295 |
| Chilo suppressalis | 176 |
| Oxya chinensis | 457 |
As depicted in Fig. 2, the distribution of the labels for these 10 types of pests is presented. Panel A shows the number of instances in each category, Panel B exhibits the size and quantity of the bounding boxes, Panel C presents the center coordinates of the bounding boxes, and Panel D indicates the widths and heights of the bounding boxes.
Fig. 2.
Distribution of Pest Labels
The YOLOv5s module
YOLOv5 has emerged as a preferred object detection solution due to its outstanding performance and usability. It strikes an excellent balance on public datasets and is highly suitable for real-time application scenarios.
The YOLOv5 series includes five editions: YOLOv5n (Nano) is the smallest and fastest version, suitable for environments with limited resources; YOLOv5s (Small) achieves a good balance between speed and accuracy, making it the recommended choice for most application scenarios; YOLOv5m (Medium) offers better performance, ideal for tasks requiring higher precision; YOLOv5l (Large) prioritizes recognition accuracy while maintaining computational efficiency, particularly suitable for scenarios demanding precise detection; and YOLOv5x (Extra Large) is the largest version, providing the highest accuracy but with the highest requirements for computing resources. Given YOLOv5s’ excellent balance between speed and accuracy and efficient resource utilization, this study selected it as the base model to meet real-time object detection needs, with experimental analysis detailed in Sect. 3.2.
The YOLOv5s network structure mainly consists of the backbone, neck, and head. This study selects YOLOv5s version 6.0 for experiments, and its architecture is shown in Fig. 3. The backbone network, which extracts high-quality features from input images, adopts the CSPNet architecture. The neck combines and processes features from the backbone, creating a pyramid feature hierarchy using the Feature Pyramid Network (FPN) topology. As the network’s output layer, the head applies an anchor-based method, predicting the positions and sizes of predefined anchor boxes and their corresponding category probabilities. By continuously optimizing the anchor box parameters, YOLOv5s speeds up model convergence and improves detection accuracy.
Fig. 3.
YOLOv5s network architecture diagram
Occlusion training
To address pest occlusion caused by overlapping stems and leaves, insect activity, and light and shadow in the rice field environment, this study proposes a data augmentation method based on partial occlusion. The method simulates real occlusion scenes by processing the pest objects annotated in YOLO format in the original dataset. The specific process is shown in Fig. 4.
Fig. 4.
Flow chart of occluded image generation
First, the YOLO format annotation file corresponding to each input image is read to obtain the bounding box information of each object, that is, the normalized center coordinates, width, and height. Then, for each bounding box, an occlusion region is determined on each of its four sides (top, bottom, left, and right), and the height or width of the occlusion region is set to a quarter of the bounding box height or width (an adjustable parameter). To simulate real occlusion, an image block of the same size as the occlusion region is randomly cropped from the same image and pasted over the corresponding occlusion region of the object bounding box. In this way, the occluded parts come from other regions of the same image, preserving the natural appearance of the image content. Finally, one new image is generated for each of the four occlusion directions, and each newly generated image reuses the annotation file of the original image. This processing helps the model recognize objects under partial occlusion.
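The procedure above can be summarized in a minimal sketch, assuming OpenCV and YOLO-format .txt labels; the function name and file-handling details are illustrative, while the quarter-size occlusion ratio follows the description.

```python
import cv2
import random

def occlude(image_path, label_path, ratio=0.25):
    """Generate four partially occluded copies of one image (illustrative sketch)."""
    img = cv2.imread(image_path)
    ih, iw = img.shape[:2]
    boxes = []
    with open(label_path) as f:
        for line in f:
            _cls, xc, yc, bw, bh = map(float, line.split())
            # convert normalized centre/size to pixel corner coordinates
            x1 = int((xc - bw / 2) * iw); y1 = int((yc - bh / 2) * ih)
            x2 = int((xc + bw / 2) * iw); y2 = int((yc + bh / 2) * ih)
            boxes.append((x1, y1, x2, y2))

    outputs = []
    for side in ("top", "bottom", "left", "right"):
        out = img.copy()
        for x1, y1, x2, y2 in boxes:
            bw, bh = x2 - x1, y2 - y1
            if side == "top":
                rx1, ry1, rx2, ry2 = x1, y1, x2, y1 + int(bh * ratio)
            elif side == "bottom":
                rx1, ry1, rx2, ry2 = x1, y2 - int(bh * ratio), x2, y2
            elif side == "left":
                rx1, ry1, rx2, ry2 = x1, y1, x1 + int(bw * ratio), y2
            else:
                rx1, ry1, rx2, ry2 = x2 - int(bw * ratio), y1, x2, y2
            rh, rw = ry2 - ry1, rx2 - rx1
            if rh <= 0 or rw <= 0:
                continue
            # random patch of the same size cropped from elsewhere in the same image
            py = random.randint(0, ih - rh)
            px = random.randint(0, iw - rw)
            out[ry1:ry2, rx1:rx2] = img[py:py + rh, px:px + rw]
        outputs.append(out)   # each generated image reuses the original label file
    return outputs
```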
The dataset processing flow of this study is shown in Fig. 5. After occlusion, the dataset was augmented by a factor of four, yielding 8336 images in the training set and 1480 images in the validation set. Figure 6 displays some of the generated occluded images. In addition to increasing the diversity of the dataset by simulating occlusion scenarios found in real scenes, this also improves the model’s generalization ability, lowers the possibility of overfitting, and permits the model to perform reliable detection in challenging conditions, all of which contribute to improved performance.
Fig. 5.
Flow chart for data set processing
Fig. 6.
Examples of occluded images
Optimization of convolutional modules
The ConvNeXt module, proposed by Liu et al. [18] in 2022, is a pure convolutional network architecture that draws upon the Vision Transformer [19] (ViT) design and modernizes the ResNet-50 backbone step by step. The module changes the per-stage block counts of ResNet-50 [20] from (3, 4, 6, 3) to (3, 3, 9, 3), giving it FLOPs similar to Swin-T [21] (Swin Transformer). It replaces ordinary convolution with depthwise convolution (DWConv) [22], in which the number of groups equals the number of channels, and enlarges the DWConv kernel from 3 × 3 to 7 × 7. Thus, DWConv extracts spatial information, while 1 × 1 convolutions extract channel information. Furthermore, an inverted bottleneck, similar in structure to the MLP block in a Transformer block, is adopted: two 1 × 1 convolutions first expand and then restore the number of channels. Through this design, the ConvNeXt module combines the local feature extraction ability of convolutional networks with the global context modeling ability of Transformer networks, exhibiting superior performance.
To enhance the detection accuracy of small object rice pests, this study utilizes the ConvNeXt module to replace the C3 module in the Neck section of YOLOv5s. This modification aims to minimize information loss during feature extraction and reduce redundant computational operations for similar feature extraction tasks. The framework of the ConvNeXt module is illustrated in Fig. 7.
Fig. 7.
Structure Diagram of the ConvNeXt Module
Upon feeding the image into the network for processing, the convolutional layer first captures the local characteristics of the image, similar to convolutional neural networks. To stabilize training and improve the convergence rate, ConvNeXt applies layer normalization after each convolutional block, which helps maintain the consistency of feature distributions. Subsequently, an activation function (such as GELU) is applied to introduce nonlinearity, enabling the model to better represent complex features. In the feature extraction process, ConvNeXt employs depthwise separable convolution, as illustrated in Fig. 8. By breaking down the typical convolution into depthwise and pointwise convolutions, this method successfully reduces computational complexity and the number of parameters. Additionally, residual connections are incorporated into deep networks to alleviate training difficulties and ensure efficient information flow within the network, addressing the gradient vanishing problem. Finally, to preserve the global contextual information of feature maps, ConvNeXt utilizes global average pooling, and the extracted features are fed into the classification head for decision making.
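For reference, a simplified PyTorch sketch of a ConvNeXt block following this description is given below; layer scale and stochastic depth from the original paper are omitted for brevity, and the module name is illustrative.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block: 7x7 depthwise conv, LayerNorm, inverted bottleneck, residual."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # 7x7 depthwise conv
        self.norm = nn.LayerNorm(dim)            # layer normalization over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted bottleneck: expand channels
        self.act = nn.GELU()                     # GELU nonlinearity
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back to the input channel count

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, C, H, W) -> (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (N, C, H, W)
        return residual + x                      # residual connection
```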
Fig. 8.
Structure Diagram of the DWConv Module
CoTAttention mechanism
In the complex environment of rice fields, the backgrounds of pest objects to be detected are often complex and cluttered, leading to the collected dataset images having relatively low resolution. Therefore, to improve detection accuracy, this study integrated the CoTAttention [23] (Contextual Transformer Attention) module into the data processing pipeline to suppress redundant background features in the images. In the CoTAttention module, contextual information of key features is fully exploited to model inter-feature dependencies, enabling adaptive attention allocation for input images. This module integrates contextual information mining with self-attention learning and aggregation within a unified architecture, thereby enhancing the visual representational capability of images. The structural diagram of the CoTAttention module is illustrated in Fig. 9.
Fig. 9.
Structural Diagram of the CoTAttention Mechanism Module
In feature processing, the input is denoted as x, where H stands for the height, W for the width, and C for the number of channels. First, a k × k group convolution is applied to obtain the static context representation K¹, which encodes local contextual information. The query Q is then concatenated with K¹, and the concatenated result passes through two successive 1 × 1 convolutions, as shown in Eq. (1), where W_θ and W_δ denote the two convolution operations. Unlike typical self-attention, which models only the relationship between the query and the key, the attention matrix A here is derived from the interaction between local context information and query information. Equation (2) then gives the dynamic context feature map K², obtained by weighting V with the attention matrix A.
$$A = \left[K^{1}, Q\right] W_{\theta} W_{\delta} \tag{1}$$

$$K^{2} = V \circledast A \tag{2}$$
Ultimately, CoTAttention outputs the integration of the static and dynamic context representations, as shown in Eq. (3). This design effectively fuses the information from the two branches, enhances the representational ability of features, and enables the model to understand pest features more comprehensively, thereby improving detection performance in complex paddy field environments.
$$Y = K^{1} + K^{2} \tag{3}$$
YOLO-PEST
The YOLO-PEST architecture consists of three main components: the Backbone, Neck, and Head. In the Backbone, the CoTAttention mechanism, a contextual attention module that models inter-feature dependencies via self-attention computation, is integrated to enhance feature fusion and contextual reasoning. Within the Neck, the ConvNeXt module replaces standard convolutions with depthwise and pointwise convolutions, reducing computational complexity while improving the detection accuracy of small objects. This architectural modification not only enhances the model’s capacity to learn complex feature representations but also reduces the parameter count and computational overhead. The overall architecture of the proposed model is illustrated in Fig. 10.
Fig. 10.
The structural diagram of the improved YOLOv5s network
Performance evaluation methods
The three most important assessment metrics for object detection models in this work are precision, recall, and mAP@0.5. Precision measures the proportion of correct detections among all detections, reflecting the model’s ability to identify relevant objects, while recall is a key indicator of the model’s ability to find all positive instances. The metrics are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

$$AP_{i} = \int_{0}^{1} P_{i}(R_{i})\, \mathrm{d}R_{i} \tag{6}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{7}$$

$$IoU = \frac{\mathrm{Area\ of\ Intersection}}{\mathrm{Area\ of\ Union}} \tag{8}$$
Here, TP (True Positive) denotes the number of positive cases correctly detected by the model, and TN (True Negative) denotes the number of negative cases correctly predicted. FP (False Positive) refers to negative cases mistakenly labeled as positive, while FN (False Negative) refers to positive cases mistakenly classified as negative. Precision is the ratio of TP to the total number of samples the model predicted as positive, whereas recall is the proportion of correctly detected positive samples among all positive instances.
The area under the Precision-Recall (P-R) curve is the Average Precision (AP), as given in Eq. (6), where $R_{i}$ signifies the recall for the i-th pest category, $P_{i}$ denotes the precision for the same category, and N represents the total number of pest species. The mean Average Precision (mAP), computed in Eq. (7) as the average of the per-class AP values, serves as an indicator of overall detection accuracy in object detection tasks. Specifically, mAP@0.5 is the mean average precision computed with the Intersection over Union (IoU) threshold set to 0.5. Equation (8) expresses the IoU as the degree of overlap between the ground truth box and the predicted bounding box; the intersection is the area of overlap between the two regions, while the union is the total area covered by both.
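For clarity, the sketch below implements Eqs. (4), (5), and (8) directly; the function names and the corner-coordinate box format are illustrative assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts (Eqs. 4 and 5)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) (Eq. 8)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union else 0.0
```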
Experimental results and discussion
Experimental Preparation
The configuration parameters of the hardware and software for the practical training environment experiment are presented in Table 3.
Table 3.
Settings of the experimental environment
| Experimental environment | Details |
|---|---|
| Programming language | Python3.7 |
| Deeping learning framework | PyTorch1.9.0 |
| Operating system | The Linux operating system of Ubuntu 20.04 version |
| GPU | NVIDIA GeForce RTX3080 |
| CPU | Intel(R) Core(TM) i9-10940x |
| CUDA | CUDA11.0 |
To improve the model’s performance during the experiments, the Adam optimizer was applied. The default hyperparameters were an initial learning rate of 0.001, a batch size of 8, and 150 training epochs. The learning rate decay coefficient was set to 0.3 and the momentum factor to 0.9. Furthermore, the backbone parameters of all YOLO models used in this study were initialized with pretrained weights.
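In plain PyTorch, these hyperparameters correspond roughly to the sketch below; the placeholder network, the mapping of the momentum factor to Adam’s beta1, and the 50-epoch decay interval are assumptions, since the text does not state when the decay is applied.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for YOLO-PEST; only the optimizer settings matter here.
model = nn.Conv2d(3, 16, 3)

# Adam with the stated initial learning rate; beta1 = 0.9 plays the role of the momentum factor.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Learning-rate decay coefficient of 0.3; the 50-epoch step interval is an assumption.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.3)

for epoch in range(150):      # 150 epochs; the batch size of 8 would be set in the DataLoader
    optimizer.step()          # placeholder for one epoch of training
    scheduler.step()
```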
Selection of the baseline model
This study used the same dataset to train YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5n, and YOLOv5x in order to further evaluate the applicability of the YOLOv5s model. The training outcomes for each model were compiled and presented in Table 4.
Table 4.
Comparative results of different versions of YOLOv5
| Model | Layers | P | R | mAP@0.5 | Parameters | GFLOPS |
|---|---|---|---|---|---|---|
| YOLOv5x | 322 | 0.954 | 0.912 | 0.947 | 86,233,975 | 203.9 |
| YOLOv5l | 267 | 0.958 | 0.908 | 0.95 | 46,156,743 | 107.8 |
| YOLOv5m | 212 | 0.932 | 0.924 | 0.95 | 20,889,303 | 48.0 |
| YOLOv5s | 157 | 0.933 | 0.944 | 0.956 | 7,037,095 | 15.8 |
| YOLOv5n | 157 | 0.927 | 0.913 | 0.945 | 1,772,695 | 4.2 |
The experimental findings show that the network’s width and depth correlate positively with the model’s overall accuracy and its feature extraction and fusion capabilities. However, both the parameter count and GFLOPS grow with the number of layers. YOLOv5n has the lowest GFLOPS and the fewest parameters, but its mean Average Precision (mAP), Precision (P), and Recall (R) are all markedly worse than those of the other models. In contrast, YOLOv5s achieves performance metrics (P, R, and mAP) comparable to those of YOLOv5x, YOLOv5l, and YOLOv5m while maintaining a far lower parameter count and GFLOPS. Therefore, after a thorough evaluation of these factors, this study selected YOLOv5s as the baseline model.
Ablation experiment
To assess the effectiveness of the ConvNeXt module and the CoTAttention module, we integrated both modules into the base model YOLOv5s and conducted an ablation experiment. The results of this experiment are detailed in Table 5.
Table 5.
Results of the ablation experiment
| Model | P | R | mAP@0.5 | Parameters | GFLOPS |
|---|---|---|---|---|---|
| YOLOv5s | 0.933 | 0.944 | 0.956 | 7,037,095 | 15.8 |
| YOLOv5s + CoTAttention | 0.954 | 0.941 | 0.961 | 9,501,863 | 17.9 |
| YOLOv5s + ConvNeXt | 0.955 | 0.93 | 0.965 | 6,864,551 | 15.5 |
| YOLO-PEST | 0.965 | 0.959 | 0.97 | 9,352,359 | 17.7 |
In comparison with the baseline YOLOv5s model, the experimental findings show that the CoTAttention module improved the mAP@0.5 by 0.5 percentage points, while incorporating the ConvNeXt module led to a 0.9-point increase. After integrating both modules into YOLOv5s, the mAP@0.5 increased by 1.4 points over the baseline. It can therefore be concluded that the CoTAttention module, by suppressing irrelevant background elements, enables the model to capture more pest details, while the ConvNeXt module improves detection accuracy through multiscale feature fusion, reducing missed detections and false alarms for large objects and maximizing the detection effect on small objects. Together, these improvements significantly increase both average precision and recall.
Comparative analysis with other models
To verify YOLO-PEST’s detection capability, this study compared it with several models, including YOLOv5s, YOLOv7-tiny, YOLOv8s, YOLOv10s, YOLOv11s, and YOLOv12s. The results of this comparison are detailed in Table 6.
Table 6.
Comparative analysis with various models
| Model | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters | GFLOPS |
|---|---|---|---|---|---|---|
| YOLOv5s | 0.933 | 0.944 | 0.956 | 0.624 | 7,037,095 | 15.8 |
| YOLOv7-tiny | 0.797 | 0.762 | 0.811 | 0.442 | 6,156,743 | 29.2 |
| YOLOv8s | 0.926 | 0.831 | 0.908 | 0.59 | 3,007,598 | 28.5 |
| YOLOv10s | 0.915 | 0.854 | 0.908 | 0.624 | 8,042,700 | 24.5 |
| YOLOv11s | 0.931 | 0.824 | 0.921 | 0.599 | 8,321,152 | 23.1 |
| YOLOv12s | 0.944 | 0.858 | 0.928 | 0.599 | 8,460,937 | 22.3 |
| YOLO-PEST | 0.965 | 0.959 | 0.97 | 0.65 | 9,352,359 | 17.7 |
The experimental results show that YOLO-PEST achieves an mAP@0.5 of 97% and an mAP@0.5:0.95 of 65%. Compared with YOLOv5s, YOLOv7-tiny, YOLOv8s, YOLOv10s, YOLOv11s, and YOLOv12s, the mAP@0.5 of YOLO-PEST is higher by 1.4, 15.9, 6.2, 6.2, 4.9, and 4.2 percentage points, respectively, and the mAP@0.5:0.95 is higher by 2.6, 20.8, 6.0, 2.6, 5.1, and 5.1 points. These comparative results indicate that the proposed YOLO-PEST approach performs better than the other methods. After applying occlusion augmentation to the dataset and improving the backbone and neck of YOLOv5s, P, R, and mAP were significantly enhanced, while the number of parameters and GFLOPS increased only marginally. Hence, the YOLO-PEST model can be regarded as the prime choice for rice pest detection.
Analysis of training curves
Figure 11 shows the differences in the mAP training curves for the various YOLOv5 versions on the training set. The models were trained using the officially provided pretrained weights, namely yolov5n.pt, yolov5s.pt, yolov5m.pt, yolov5l.pt, and yolov5x.pt. Figure 11a displays the training curves for YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Each model reached its optimal performance within 150 training epochs, attaining rapid convergence.
Fig. 11.
Comparison of the mAP. (a) The training curves of YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. (b) The mAP curves after introducing the CoTAttention module and ConvNeXt module separately. (c) The variations of the mAP training curves on the training set for different network models
It can be observed from Fig. 11b that after the CoTAttention and ConvNeXt modules were introduced separately, the mAP curves gradually surpassed those of the original network model. In the first several epochs, the mAP@0.5 and mAP@0.5:0.95 of the proposed YOLO-PEST model were relatively lower than those of the other models; however, as the number of epochs increased, its detection performance gradually improved and eventually converged.
The variations of the mAP training curves on the training set for the different network models are depicted in Fig. 11c. It is evident from the training curves that, at the same epochs, the YOLO-PEST curve lies above those of the other models.
To evaluate model performance, this study uses a standardized Precision-Recall curve analysis to assess seven object detection models (YOLOv5s, YOLOv7-tiny, YOLOv8s, YOLOv10s, YOLOv11s, YOLOv12s, and YOLO-PEST) over 150 epochs. First, the detection results were sorted in descending order of confidence, and true positives and false positives were determined by the intersection over union (IoU ≥ 0.5 threshold). Then, precision and recall were computed cumulatively to form discrete evaluation points. Finally, a continuous curve was generated by linear interpolation, and the area under the curve (Average Precision, AP) was calculated to quantify model performance. Figure 12 shows the P-R characteristics of each model in the pest detection task, with recall on the horizontal axis and precision on the vertical axis. Visual analysis shows that the YOLO-PEST architecture maintains optimal performance across confidence thresholds, indicating that it achieves more comprehensive coverage in pest detection.
Fig. 12.
Comparison of P-R curves
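The AP computation described above can be sketched as follows, assuming the TP/FP flags have already been assigned by IoU ≥ 0.5 matching; the use of NumPy, the monotonic precision envelope, and trapezoidal integration are implementation assumptions consistent with the linear-interpolation procedure described in the text.

```python
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """AP for one class from per-detection confidences and TP/FP flags (sketch).

    is_tp[i] marks whether detection i matches a ground-truth box with IoU >= 0.5;
    num_gt is the number of ground-truth boxes of this class.
    """
    order = np.argsort(-np.asarray(confidences))             # sort by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)                          # cumulative recall
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)    # cumulative precision
    # add sentinel points and enforce a monotonically decreasing precision envelope
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.trapz(p, r))                              # area under the P-R curve
```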
Figure 13 presents the detection boxes and confidence scores for each pest. Panels (a) and (b) show example detection results on the original dataset and on the occluded dataset, respectively. It can be observed that even when pest objects are occluded, incomplete, or clustered together, the model can still accurately identify the pest species, with an average confidence of approximately 90%. This further substantiates the validity of the proposed model.
Fig. 13.
Examples of Detection Results by YOLO-PEST. (a) Example of Detection Results for the Original Dataset. (b) Example of Detection Results for the Occluded Dataset
Conclusion
Aiming at the problems of limited diversity of rice pest datasets, complex field environments with occlusion, and insufficient detection accuracy for small targets, this paper proposes a rice pest detection model, YOLO-PEST, based on YOLOv5s. Existing agricultural pest detection models often use general data augmentation techniques to alleviate the occlusion problem. For example, Zhong et al. [24] proposed a random erasing method that balances information preservation and removal using random rectangular masks, but its fixed geometry differs significantly from the random overlap of stems and leaves in the field. The YOLOF-PD model proposed by Peng et al. [25] simulates occlusion by cropping random rectangular regions, but it is difficult to match the natural morphology of overlapping rice stems and leaves. Furthermore, some researchers [3, 4] employ multi-view cameras to monitor targets and reduce occlusion at the source, yet this demands substantial resources. In contrast, YOLO-PEST introduces a local occlusion augmentation method that simulates natural occlusion by utilizing stem and leaf segments from the same image. This approach not only maintains the consistency of rice field scenes but also covers the major occlusion types in field environments, thereby significantly improving the model’s adaptability to real occlusion scenes. Concurrently, the ConvNeXt module is integrated to enhance small object feature extraction, while the CoTAttention mechanism is incorporated to strengthen feature discrimination under complex backgrounds.
Experimental results demonstrate that the improved YOLO-PEST model achieves an mAP@0.5 of 97%, representing a 1.4-percentage-point increase over the original YOLOv5s network. This validates the effectiveness of the proposed occlusion processing and architectural improvements. The achieved performance meets the practical requirements for rice pest monitoring and control, offering a novel approach for rice pest early warning and infestation assessment.
Despite the significance of this study for pest detection and control, several aspects still require further exploration. Firstly, while the dataset includes various pests, samples from extreme environments (such as water stains after heavy rain and shadow interference under strong backlighting) remain insufficient. This may compromise the model’s stability in special climatic conditions; such samples will be supplemented in future work to enhance robustness. Secondly, current occlusion simulations focus primarily on stem and leaf overlap, yet other field occlusion types, such as dew attachment and insect manure coverage, have not been incorporated and need supplementation in subsequent research. Finally, the model’s inference speed requires further improvement to adapt to UAV inspections, enabling integration of YOLO-PEST with UAVs and other intelligent agricultural technologies to support real-time pest monitoring and management.
In the future, this study will explore additional architectures and techniques to further optimize the YOLO-PEST model, enhancing its adaptability to a broader range of agricultural scenarios. This study will focus on improving detection performance for small objects and under complex backgrounds, addressing the limitations of existing methods, and contributing to advancing sustainable agriculture and global food security.
Acknowledgements
The authors are grateful to all anonymous reviewers and the editor for helpful comments and suggestions.
Author contributions
Conceptualization and methodology, Jun Qiang; original draft, software and preparation, Li Zhao; data analysis and curation, supervision, review, and editing, Jun Qiang and Hongming Wang; preparation, and project administration, Tianqi Xu, Qihang Jia and Lixiang Sun; All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported in part by the Excellent Top-notch Talent Cultivation Funding Project of Colleges and Universities in Anhui Province (No.gxyqZD2021123), National Innovation and Entrepreneurship Training Program for College Students (S202210363311).
Data availability
The code and datasets in the present study may be available from the corresponding author upon request.
Declarations
Ethics approval and consent to participate
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Consent to publishment
All authors agreed to publish this manuscript.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Yu X, Zheng J. A review on the application of deep learning methods in detection and identification of rice diseases and pests[J]. Comput Mater Continua. 2024;78(1).
- 2. Upadhyay N, Bhargava A. Artificial intelligence in agriculture: applications, approaches, and adversities across pre-harvesting, harvesting, and post-harvesting phases[J]. Iran J Comput Sci. 2025;8:749–772.
- 3. Li Z, Xu R, Li C, et al. In-field blueberry fruit phenotyping with a MARS-PhenoBot and customized BerryNet[J]. Comput Electron Agric. 2025;232:110057.
- 4. Zhou X, Zhang Y, Jiang X, et al. Advancing tracking-by-detection with MultiMap: towards occlusion-resilient online multiclass strawberry counting[J]. Expert Syst Appl. 2024;255:124587.
- 5. Upadhyay N, Gupta N. Detecting fungi-affected multi-crop disease on heterogeneous region dataset using modified ResNeXt approach[J]. Environ Monit Assess. 2024;196(7):610.
- 6. Upadhyay N, Gupta N. SegLearner: a segmentation based approach for predicting disease severity in infected leaves[J]. Multimedia Tools Appl. 2025;84:1–24.
- 7. Yang S, Xing Z, Wang H, Dong X, Gao X, Liu Z, Zhang X, Li S, Zhao Y. Maize-YOLO: a new high-precision and real-time method for maize pest detection[J]. Insects. 2023;14:278.
- 8. Yin J, et al. An intelligent field monitoring system based on enhanced YOLO-RMD architecture for real-time rice pest detection and management[J]. Agriculture. 2025;15(8):798.
- 9. Wang J, Wang T, Xu Q, et al. RP-DETR: end-to-end rice pests detection using a transformer[J]. Plant Methods. 2025;21(1):1–17.
- 10. Wang J, Ma S, Wang Z, et al. Improved lightweight YOLOv8 model for rice disease detection in multi-scale scenarios[J]. Agronomy. 2025;15(2):445.
- 11. Wang N, Fu S, Rao Q, et al. Insect-YOLO: a new method of crop insect detection[J]. Comput Electron Agric. 2025;232:110085.
- 12. Du L, Zhu J, Liu M, et al. YOLOv7-PSAFP: crop pest and disease detection based on improved YOLOv7[J]. IET Image Proc. 2025;19(1):e13304.
- 13. Li P, Zhou J, Sun H, et al. RDRM-YOLO: a high-accuracy and lightweight rice disease detection model for complex field environments based on improved YOLOv5[J]. Agriculture. 2025;15(5):479.
- 14. Guan B, Wu Y, Zhu J, et al. GC-Faster RCNN: the object detection algorithm for agricultural pests based on improved hybrid attention mechanism[J]. Plants. 2025;14(7):1106.
- 15. Huang Y, Liu Z, Zhao H, et al. YOLO-YSTs: an improved YOLOv10n-based method for real-time field pest detection[J]. Agronomy. 2025;15(3):575.
- 16. Wang Q, Liu Y, Zheng Q, et al. SMC-YOLO: a high-precision maize insect pest-detection method[J]. Agronomy. 2025;15(1):195.
- 17. Naveed H, Anwar S, Hayat M, et al. Survey: image mixing and deleting for data augmentation[J]. Eng Appl Artif Intell. 2024;131:107791.
- 18. Liu Z, Mao H, Wu CY, et al. A ConvNet for the 2020s[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11976–11986.
- 19. Han K, Wang Y, Chen H, et al. A survey on vision transformer[J]. IEEE Trans Pattern Anal Mach Intell. 2022;45(1):87–110.
- 20. Theckedath D, Sedamkar RR. Detecting affect states using VGG16, ResNet50 and SE-ResNet50 networks[J]. SN Comput Sci. 2020;1(2):79.
- 21. Liu Z, Lin Y, Cao Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 10012–10022.
- 22. Chollet F. Xception: deep learning with depthwise separable convolutions[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1251–1258.
- 23. Yang Q, et al. A real-time object detection method for underwater complex environments based on FasterNet-YOLOv7[J]. J Real-Time Image Proc. 2024;21(1):8.
- 24. Zhong Z, Zheng L, Kang G, et al. Random erasing data augmentation[C]. Proceedings of the AAAI Conference on Artificial Intelligence. 2020;34(07):13001–13008.
- 25. Peng H, et al. Insect pest detection of field crops based on improved YOLOF model[J]. Trans Chin Soc Agricultural Mach. 2023;54(4):285–294,303.