MethodsX. 2025 Jul 24;15:103533. doi: 10.1016/j.mex.2025.103533

GradCAM-PestDetNet: A deep learning-based hybrid model with explainable AI for pest detection and classification

Ramitha Vimala a, Saharsh Mehrotra a, Satish Kumar a,b, Pooja Kamat a, Arunkumar Bongale a, Ketan Kotecha a,b
PMCID: PMC12340394  PMID: 40799831

Abstract

Pest detection is crucial for both agriculture and ecology. The growing global population demands an efficient pest detection system to ensure food security. Pests threaten agricultural productivity, sustainability and economic development; they also damage machinery, equipment and soil, making effective detection essential for commercial benefit. Traditional pest detection methods are often slow, less accurate and reliant on expert knowledge. With advancements in computer vision and AI, deep transfer learning models (DTLMs) have emerged as powerful solutions. The GradCAM-PestDetNet methodology utilizes object detection models such as YOLOv8m, YOLOv8s and YOLOv8n, alongside transfer learning architectures such as VGG16, ResNet50, EfficientNetB0, MobileNetV2, InceptionV3 and DenseNet121 for feature extraction. Additionally, Vision Transformers (ViT) and Swin Transformers were explored for their ability to process complex data patterns. To enhance model interpretability, GradCAM-PestDetNet integrates Gradient-weighted Class Activation Mapping (Grad-CAM), allowing better visualization of model predictions.

  • Uses YOLOv8 models (YOLOv8n offers the fastest inference at 1.86 ms/img) and transfer learning for pest detection, ensuring the system is viable for low-resource environments.

  • Employs an ensemble model (ResNet50, DenseNet, MobileNet) that achieved 67.07 % accuracy, 66.3 % F1-score and 68.1 % recall, an improvement over the baseline CNN's 21.5 % accuracy. This yields a more generalized and robust model that is not biased towards the majority class.

  • Integrates Grad-CAM for improved interpretability in pest detection.

Keywords: Pest detection, Transfer learning, Ensemble model, Attention mechanism, Convolution neural network, Explainable AI

Graphical abstract



Specifications table

Subject area Computer Science
More specific subject area Computer Vision
Name of your method GradCAM-PestDetNet: A Deep Learning-Based Hybrid Model with Explainable AI for Pest Detection and Classification.
Name and reference of original method Zhang, Lijuan, Cuixing Zhao, Yuncong Feng, and Dongming Li. "Pests identification of ip102 by yolov5 embedded with the novel lightweight module." Agronomy 13, no. 6 (2023): 1583.
Resource availability The URLs for the datasets are given below:
(i) IP102 dataset [10]: https://github.com/xpwu95/IP102
(ii) PlantVillage dataset [15]:
https://www.kaggle.com/datasets/mohitsingh1804/plantvillage
(iii) EuroSAT dataset [16]: https://github.com/phelber/EuroSAT

Background

One of the most significant challenges facing modern agriculture is ensuring a sufficient food supply to meet growing demand while maintaining sustainable agricultural practices. As Davis et al. [1] emphasize, transformative approaches are needed to achieve food security while mitigating the factors that reduce productivity. Pest infestations are among the most critical of these challenges: they reduce agricultural yields and damage soil health, farming equipment and infrastructure. Because this affects the economic stability of the entire agricultural community, there is a pressing need for effective and sustainable pest detection solutions [2].

Traditional pest management practices often rely on manual, labour-intensive processes that are confined to resource-limited regions and require expert knowledge. As Abate et al. [3] note, these methods are inaccurate and inefficient. They also frequently involve chemical pesticides, which are effective in the short term but environmentally unsustainable in the long run. Integrated Pest Management (IPM) has been adopted as a sustainable alternative because it combines strategies that optimize pest control while minimizing environmental damage [4].

Large-scale datasets such as IP102 have formed the foundation for training pest detection models [5]. Recent developments in computer vision and Artificial Intelligence (AI) have been transformative for pest detection. Deep learning and transfer learning architectures have shown significant potential to overcome the limitations of traditional methods [6] because they enable fast as well as accurate detection. Models such as CNNs, Vision Transformers, Swin Transformers and transfer learning models effectively capture patterns in images, thereby improving pest detection. Grad-CAM is used to improve model interpretability by highlighting the key regions that influence predictions.

The detection of pests and their identification has gained increasing attention due to their crucial role in agriculture. Various methodologies, datasets and key findings from notable studies have been explored, analyzing their strengths and limitations. This comprehensive review provides valuable insights in guiding and selecting the most effective, scalable approaches to achieve accurate results in pest detection.

In [7], Zhang et al. explored lightweight object detection models based on YOLOv5. Their approach embedded YOLOv5 with novel optimizations whose computational efficiency enabled real-time pest detection. The study in [8] presents an approach to pest detection that integrates complementary features, capturing images of pests from multiple angles to enable diverse feature extraction.

In [9], the proposed methodology introduces an ensemble of VGG16, VGG19 and ResNet50. It employs a voting classifier and achieved an accuracy of 82.5 % on the IP102 dataset. In [10], Wu et al. released the IP102 dataset, which is considered a comprehensive benchmark for pest detection. The IP102 dataset consists of 75,000 samples across 102 classes, providing diversity for real-world applications.

In [11], Shafique et al. studied pest detection and classification in field crops using machine learning techniques, illustrating the effectiveness of combining hand-crafted features with advanced deep learning algorithms. In [12], a novel deep learning model, 'DeepPestNet', is introduced, designed to enhance both the efficiency and the accuracy of pest detection in crops using advanced neural network architectures.

Method details

The framework, as shown in Fig. 1, consists of four primary steps - preprocessing, object detection, classification and explainability.

Fig. 1.

Fig. 1:

Overview of the GradCAM-PestDetNet framework.

The algorithm steps used in GradCAM-PestDetNet architecture pipeline are summarized as follows in Algorithm 1.

Algorithm 1.

GradCAM-PestDetNet pipeline.

START
INPUT: Image I
OUTPUT: Pest class label C, Grad-CAM heatmap H
FOR each Image I in the dataset:
1. Preprocess image I (augmentation: rotate, flip, scale, sharpen, blur)
2. Detect pests in I using YOLOv8 (n variant)
3. If pest is detected:
 a. Crop the region of interest from the image
 b. Resize the image to the required input size for classification
 c. Classify whether the pest is rare or not rare using a CNN architecture with SE blocks
 d. Apply an ensemble of ResNet50, DenseNet and MobileNet to classify the type of not-rare pest
 e. Aggregate predictions using weighted voting - label C
 f. Generate Grad-CAM heatmap H for explainability and interpretability
4. Return the predicted class label C and the Grad-CAM heatmap H.
END
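The pseudocode above can be sketched as a plain Python control flow. The stubs below (`detect_pest`, `classify_rarity`, `ensemble_classify`, `grad_cam_heatmap`) are hypothetical placeholders for the trained models described in the text, not the authors' released code:

```python
import numpy as np

# Hypothetical stubs standing in for the trained models in the pipeline.
def detect_pest(img):                  # YOLOv8n: return a bounding box or None
    return (10, 10, 100, 100)

def classify_rarity(roi):              # SE-attention CNN: "rare" vs "not rare"
    return "not rare"

def ensemble_classify(roi):            # ResNet50 + DenseNet + MobileNet vote
    return "rice leaf roller"

def grad_cam_heatmap(roi):             # Grad-CAM over the classifier
    return np.zeros(roi.shape[:2])

def pipeline(img, size=224):
    box = detect_pest(img)
    if box is None:
        return None, None
    x1, y1, x2, y2 = box
    roi = img[y1:y2, x1:x2]                              # crop the ROI
    ys = np.linspace(0, roi.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, roi.shape[1] - 1, size).astype(int)
    roi = roi[ys][:, xs]                                 # nearest-neighbour resize
    if classify_rarity(roi) == "rare":
        return "rare", grad_cam_heatmap(roi)
    return ensemble_classify(roi), grad_cam_heatmap(roi)

label, heatmap = pipeline(np.random.default_rng(0).random((640, 640, 3)))
```

Swapping the stubs for real model calls preserves the same branch structure as Algorithm 1.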

Dataset

In GradCAM-PestDetNet, three datasets have primarily been utilized - IP102 [10], PlantVillage [15] and EuroSAT [16]. The IP102 dataset is a large-scale benchmark [10] designed particularly to support research in pest detection and identification. It is tailored for real-world applications as it addresses issues such as class imbalance and high intra-class variability.

IP102 dataset [10]

The IP102 dataset comprises 75,000 images across 102 diverse and distinct classes, currently making it one of the most comprehensive datasets in pest detection and identification. It includes species such as Asian Corn Borer, Green Rice Leafhopper, Cotton Bollworm, Rice Stem Borer and Mango Fruit Fly, covering a wide range of pests that potentially affect crops [13]. As shown in Fig. 2, the IP102 dataset contains diverse pest species.

Fig. 2.

Fig. 2:

Diverse species of pests in IP102 dataset.

Each class comprises a unique pest species across various stages of the life cycle such as egg, larva, pupa and adult, reflecting the diversity needed to develop a robust pest detection model. The IP102 dataset has a hierarchical taxonomy with two levels: subclasses and superclasses. The subclasses represent individual pest species, while the superclasses group the pests by the crops they affect, such as field crops (e.g., rice, wheat) and economic crops (e.g., vitis, citrus). Table 1 below summarizes this hierarchy.

Table 1.

Subclass and superclass of IP102 dataset.

Superclass  Subclass  Class  Train   Val   Test    IR
FC          Rice      14     5043    843   2531    6.4
FC          Corn      13     8404    1399  4212    27.9
FC          Wheat     9      2048    340   1030    5.2
FC          Beet      8      2649    441   1330    15.4
FC          Alfalfa   13     6230    1037  3123    10.7
EC          Vitis     16     10,525  1752  5274    74.8
EC          Citrus    19     4356    725   2192    17.6
EC          Mango     10     5840    971   2927    61.7
IP102       FC        57     24,602  4098  12,341  39.4
IP102       EC        45     20,721  3448  10,393  80.8
IP102       All       102    45,095  7508  22,619  80.8

As shown in Fig. 3, the IP102 dataset has been divided into training, validation and testing sets in a 6:1:3 ratio, which ensures a balanced setup. Specifically, the training set comprises 45,095 images, the validation set 7508 images, and the test set 22,619 images.

Fig. 3.

Fig. 3:

Train-validation-test split.
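A 6:1:3 split of this kind can be reproduced with a simple shuffle-and-slice; this is a generic sketch, not the authors' partitioning script:

```python
import random

def split_6_1_3(items, seed=42):
    """Shuffle and split items into train/val/test in a 6:1:3 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * 6 // 10          # 60 % train
    n_val = n // 10                # 10 % validation; the remainder is test
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 75,222 items mirrors the IP102 total (45,095 + 7508 + 22,619).
train, val, test = split_6_1_3(range(75222))
print(len(train), len(val), len(test))  # 45133 7522 22567
```

The integer arithmetic keeps the split deterministic; the tiny deviation from the paper's exact counts comes from rounding, not from the method.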

The IP102 dataset is split into two parts depending on the task - object detection or classification. The object detection subset consists of bounding-box annotations, with 15,178 images for training and 3798 for testing. On average, each class contains 737 samples; however, several classes are severely under-represented.

PlantVillage [15]

The PlantVillage dataset [15] is a widely used source for research in crop health. It comprises over 54,000 labelled images of crops such as tomatoes, potatoes, maize, apples and wheat. As Fig. 4 illustrates, it includes both healthy and pest-affected samples of these diverse crops.

Fig. 4.

Fig. 4:

PlantVillage dataset.

Each image is annotated with labels that indicate the crop type as well as the condition which makes the dataset suitable for both object detection and classification. The PlantVillage and EuroSAT datasets complement the IP102 dataset by allowing the model to differentiate between pest and no-pest classes.

EuroSAT [16]

The EuroSAT dataset [16] is a publicly available benchmark for land use and land cover classification based on Sentinel-2 satellite imagery. Each image in the dataset is 64 × 64 pixels with a ground sampling distance of 10 m. As shown in Fig. 5, the images cover various types of land such as AnnualCrop, Forest, Highway, Industrial and others.

Fig. 5.

Fig. 5:

EuroSAT dataset.

The EuroSAT dataset complements the IP102 and PlantVillage datasets by adding non-pest samples, minimizing false positives and increasing the robustness of the model.

Table 2 below summarizes the datasets in terms of the number of images, classes, annotation type as well as the train, test and validation splits.

Table 2.

Overview of the datasets.

Dataset       Total Images  Train   Validation  Test    Classes  Annotation Type
IP102         75,000        45,095  7508        22,619  102      Bounding Boxes, Labels
PlantVillage  54,305        38,014  5430        10,861  14       Labels
EuroSAT       27,000        18,900  2700        5400    10       Labels

To evaluate the contribution of each module within the GradCAM-PestDetNet framework, an ablation study was conducted. The study reveals how removing a specific component, such as the attention mechanism, ensemble learning, data augmentation or the explainability module, affects performance.

When compared to models with attention modules, a basic CNN lacking attention blocks struggled to manage the class imbalance between rare and common classes, resulting in performance heavily biased towards the majority (not-rare) class. A comparative study evaluated different attention mechanisms, specifically Squeeze-and-Excitation (SE) and the Convolutional Block Attention Module (CBAM). Based on the evaluation metrics discussed in the results section, SE outperformed CBAM and was therefore integrated into the baseline CNN model.

A baseline CNN model was used to further classify the classes within the rare and not-rare groups. This model performed poorly, with all predictions belonging to the single majority class, yielding an accuracy of 21.5 %. Therefore, an ensemble of transfer learning models was used, as discussed in the Classification section below. This ensemble yielded an accuracy of 67.07 %, balanced across all classes.

Object detection

The first step in GradCAM-PestDetNet is object detection, which primarily focuses on identifying whether a pest is present in the sample. If a pest is detected, GradCAM-PestDetNet further classifies the specific type of pest using advanced deep learning and transfer learning models.

To reduce false positives in predictions, the PlantVillage and EuroSAT datasets have been used in addition to the IP102 dataset to provide no-pest samples, as IP102 contains only pest-class samples. This increases the model's proficiency and robustness in distinguishing the two classes (pest and no pest).

The data flow in the GradCAM-PestDetNet pipeline begins with an input image processed by the YOLOv8n model for object detection. If a pest is detected, the Region of Interest (ROI) is extracted and passed to a CNN classifier with Squeeze-and-Excitation (SE) attention, which classifies the pest as rare or not-rare. If the pest is not-rare, the extracted ROI is passed to an ensemble of transfer learning models (ResNet50, DenseNet, MobileNet) to classify it into a particular class. This step-wise design yields an efficient, explainable and interpretable pest detection pipeline.

Preprocessing

The IP102 dataset exhibits a significant class imbalance: classes such as Chestnut Weevil (2975 images), Cabbage Looper (1268 images) and Corn Leaf Aphid (1013 images) are severely overrepresented, while classes such as Bean Leaf Beetle (5 images), Cicadella Viridis (2 images) and Sitka Spruce Weevil (8 images) are severely underrepresented. This imbalance biases the model towards the over-represented classes and makes learning difficult, so addressing it is crucial to prevent bias when classifying pest types. Fig. 6 below illustrates the class imbalance of the IP102 dataset.

Fig. 6.

Fig. 6:

Class distribution of IP102 dataset.

Data augmentation techniques were employed to address the severe class imbalance and to enhance the diversity of the dataset. For the under-represented classes, techniques such as rotation, cropping, scaling, flipping and brightness adjustment were applied to increase the number of samples per class. Similar augmentation was also applied to over-represented classes to preserve the overall diversity of the dataset. Beyond the standard techniques, random cropping and slight distortions were used to simulate real-world scenarios. The data augmentation techniques applied are shown in Fig. 7 below.

Fig. 7.

Fig. 7:

Data augmentation used in GradCAM-PestDetNet.

The PlantVillage and EuroSAT datasets were combined which resulted in 68,276 images for the no-pest class. To maintain balance between the pest and no-pest classes, a downsampling technique was applied to reduce the number of no-pest samples, bringing them closer to the number of pest images in IP102.
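Such downsampling can be as simple as random sampling without replacement; the target count of 45,095 used below (the IP102 training size) is illustrative, since the paper only states the classes were brought "closer" in size:

```python
import random

def downsample(samples, target_count, seed=0):
    """Randomly keep at most target_count samples from an over-represented class."""
    if len(samples) <= target_count:
        return list(samples)
    return random.Random(seed).sample(samples, target_count)

# 68,276 combined no-pest images, reduced towards the IP102 pest training count.
no_pest = [f"img_{i}.jpg" for i in range(68276)]
balanced = downsample(no_pest, 45095)
print(len(balanced))  # 45095
```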

Models like YOLO employ a specialized augmentation technique called mosaic augmentation to enhance the dataset for object detection tasks; this was also experimented with in the study. Table 3 below presents the data augmentation techniques applied, along with their probabilities and limits.

Table 3.

Augmentations used in GradCAM-PestDetNet.

Method           Limit                                     Probability
Mosaic           Combines 4 images for diverse context     0.6
Horizontal Flip  —                                         0.5
Brightness       Range: (−0.2, 0.2)                        0.4
Contrast         Range: (−0.2, 0.2)                        0.4
Gaussian Blur    Kernel Size: (3, 5)                       0.3
Rotation         Angle: (−30°, 30°)                        0.5
Scaling          Scale Factor: (0.8, 1.2)                  0.3
Shift            Horizontal/Vertical: (−0.3, 0.3)          0.4
Sharpen          Alpha: (0.1, 0.3), Lightness: (0.6, 1.0)  0.2
Cutout           Patch Size: (10, 30), Random Position     0.3
Vertical Flip    —                                         0.3
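A few rows of Table 3 can be sketched with plain NumPy; the paper does not name its augmentation library, so this is an illustrative re-implementation rather than the authors' code:

```python
import random
import numpy as np

def augment(img, rng):
    """Apply a subset of the Table 3 augmentations with their listed probabilities.
    img is a float array in [0, 1] of shape (H, W, 3)."""
    if rng.random() < 0.5:                          # horizontal flip, p = 0.5
        img = img[:, ::-1]
    if rng.random() < 0.3:                          # vertical flip, p = 0.3
        img = img[::-1, :]
    if rng.random() < 0.4:                          # brightness in (-0.2, 0.2), p = 0.4
        img = np.clip(img + rng.uniform(-0.2, 0.2), 0.0, 1.0)
    if rng.random() < 0.3:                          # cutout: 10-30 px square, p = 0.3
        s = rng.randint(10, 30)
        y = rng.randrange(img.shape[0] - s)
        x = rng.randrange(img.shape[1] - s)
        img = img.copy()
        img[y:y + s, x:x + s] = 0.0
    return img

rng = random.Random(0)
out = augment(np.random.default_rng(0).random((224, 224, 3)), rng)
print(out.shape)  # (224, 224, 3)
```

In practice a library such as Albumentations expresses the same table declaratively, with each transform carrying its own probability `p`.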

Mosaic augmentation combines four different images into a single image, allowing the model to learn and generalize from diverse spatial relationships. This helps address variations in background, position and object size. Fig. 8 below illustrates mosaic augmentation for the object detection models.

Fig. 8.

Fig. 8:

Mosaic augmentation for YOLOv8.
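The quadrant tiling can be sketched as follows; real YOLO mosaic picks a random centre point and remaps bounding boxes, which this simplified version omits:

```python
import numpy as np

def resize_nearest(img, h, w):
    """Nearest-neighbour resize via index selection (no interpolation library needed)."""
    ys = (np.arange(h) * img.shape[0]) // h
    xs = (np.arange(w) * img.shape[1]) // w
    return img[ys][:, xs]

def mosaic(imgs, size=640):
    """Tile four images into the quadrants of one size x size canvas.
    Simplified sketch: equal quadrants instead of YOLO's random centre."""
    h = w = size // 2
    canvas = np.zeros((size, size, 3))
    for k, img in enumerate(imgs):
        r, c = divmod(k, 2)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = resize_nearest(img, h, w)
    return canvas

rng = np.random.default_rng(0)
m = mosaic([rng.random((480, 640, 3)) for _ in range(4)])
print(m.shape)  # (640, 640, 3)
```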

You-only-look-once (YOLO)

YOLO architectures [14] use Region of Interest (ROI) mechanisms to focus solely on the relevant parts of an image, such as pests, while ignoring irrelevant regions. YOLO has several variants, each developed to improve speed, accuracy or adaptability. In this methodology, the YOLO variants YOLOv8n, YOLOv8m and YOLOv8s are compared on metrics such as Precision, Recall, mAP@0.5, mAP@0.5:0.95, Fitness, inference speed, Intersection over Union (IoU), and preprocessing and postprocessing times. Fig. 9 illustrates the YOLOv8 architecture.

Fig. 9.

Fig. 9:

YOLOv8 architecture.
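Among the comparison metrics listed above, IoU has a short closed form for axis-aligned boxes:

```python
def iou(a, b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (20, 20, 30, 30))) # 0.0 (disjoint boxes)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 50 / 150 ≈ 0.333
```

The CIoU loss used during YOLOv8 training extends this ratio with centre-distance and aspect-ratio penalty terms.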

YOLOv8n (Nano) is the smallest YOLO variant, mainly used in resource-constrained environments such as drones and mobile devices; it prioritizes speed over accuracy. YOLOv8m (Medium) balances speed and accuracy, making it suitable for detecting smaller or more complex objects in moderately resource-intensive environments. YOLOv8s (Small) generally provides the highest accuracy among the lightweight variants and is therefore well suited to tasks requiring very detailed object detection.

In Fig. 10, the YOLO models are compared on inference speed and accuracy to determine which performs best at detecting pest vs no-pest. The trade-off between model size and performance was also analysed to find the most suitable model for real-world applications. Fig. 11 below presents the YOLOv8 model's object detection results.

Fig. 10.

Fig. 10:

Train and validation loss across epochs for YOLO variants.

Fig. 11.

Fig. 11:

YOLOv8n model detections for GradCAM-PestDetNet.

All models were trained and evaluated on the Kaggle cloud platform using a Tesla P100 GPU environment. During training, the GPU memory usage peaked at 12 GB out of the available 16 GB of VRAM. In addition, CPU RAM usage reached 15 GB out of 29 GB of available system memory. Approximately 16 GB of disk storage was utilized for datasets, checkpoints, and model outputs, within the allocated 60 GB of storage provided by the Kaggle environment.

Classification

After detecting the pest, the next step is to classify it as either rare or not-rare, based on a threshold defined by the number of samples in each class. Because of the significant class imbalance, the rare and not-rare classes require different approaches. For the not-rare classes, a CNN-based model with attention was used, along with an ensemble of transfer learning models. Fig. 12 highlights the need for a different approach to handle the significant class imbalance between rare and not-rare pest classes.

Fig. 12.

Fig. 12:

Comparison between samples of rare vs not rare class.

The first step involves balancing the dataset to address the severe class imbalance between the rare and not-rare classes. Initially, a baseline CNN model was used, which was then improved by adding Squeeze-and-Excitation (SE) blocks for attention. Fig. 13 below illustrates the Squeeze-and-Excitation architecture. The model also includes convolutional layers and dropout layers for regularization. After initial training, hard example mining is applied to focus on misclassified rare pest samples; these samples are then augmented and used to fine-tune the model.

Fig. 13.

Fig. 13:

Squeeze-excitation attention architecture.

The proposed CNN architecture processes input images of shape (358, 361, 3). The network begins with Conv2D layers (32, 64 and 128 filters, 3 × 3 kernels), each followed by Batch Normalization, Max Pooling and Dropout. The extracted features are flattened and passed through Dense layers (128 and 64 units, ReLU). Finally, a softmax output layer predicts the classes.
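The SE recalibration applied to these feature maps reduces to a few array operations; the weights below are random stand-ins for learned parameters (reduction ratio r = 16 is the common default, not a value stated in the paper):

```python
import numpy as np

def se_block(fmap, w1, b1, w2, b2):
    """Squeeze-and-Excitation over a (H, W, C) feature map:
    squeeze = global average pool, excitation = FC-ReLU-FC-sigmoid,
    then channel-wise rescaling of the input."""
    z = fmap.mean(axis=(0, 1))                    # squeeze -> (C,)
    h = np.maximum(z @ w1 + b1, 0.0)              # FC1 + ReLU -> (C // r,)
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))      # FC2 + sigmoid -> (C,)
    return fmap * s                               # reweight each channel

rng = np.random.default_rng(0)
C, r = 128, 16
fmap = rng.random((14, 14, C))
out = se_block(fmap,
               rng.standard_normal((C, C // r)), np.zeros(C // r),
               rng.standard_normal((C // r, C)), np.zeros(C))
print(out.shape)  # (14, 14, 128)
```

Because the gate `s` depends only on channel-wise averages, the block adds O(C²/r) parameters, far cheaper than pixel-level self-attention.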

Transfer learning uses pre-trained models which are trained on large datasets and then fine-tuned to solve a specific task which uses a much smaller dataset. This reduces training time and computational resource requirements, as the lower layers of the model learn and reuse basic features like edges and textures, while the higher layers are fine-tuned for the new tasks. Several transfer learning models, including VGG16, ResNet50, EfficientNetB0, MobileNetV2, InceptionV3 and DenseNet121 were utilized for experimentation.

The ensemble of these models combines the strengths of the individual transfer learning models to enhance overall performance. By aggregating predictions from multiple models, the approach improves both accuracy and generalization. The model leverages ResNet's residual learning through skip connections, DenseNet's feature reuse through dense connections between layers, and MobileNet's lightweight architecture built on depthwise separable convolutions. A weighted voting mechanism was employed, in which the class probabilities predicted by each model are multiplied by a predefined weight and then aggregated, so that better-performing models have a greater influence on the ensemble decision. Additionally, Vision Transformers (ViT) and Swin Transformers were evaluated against the transfer learning and CNN-based models.
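The weighted voting step can be illustrated in a few lines; the weights and probability vectors below are toy values, not the ones tuned in the paper:

```python
import numpy as np

def weighted_vote(prob_list, weights):
    """Weighted soft voting: sum each model's class probabilities scaled by
    its weight, then take the argmax of the combined distribution."""
    combined = sum(w * p for w, p in zip(weights, prob_list))
    return int(np.argmax(combined))

# Toy example: three models, four classes; the two heavier-weighted models
# agree on class 2, so the ensemble follows them despite MobileNet's vote.
resnet    = np.array([0.1, 0.2, 0.6, 0.1])
densenet  = np.array([0.1, 0.1, 0.7, 0.1])
mobilenet = np.array([0.6, 0.2, 0.1, 0.1])
print(weighted_vote([resnet, densenet, mobilenet], [0.4, 0.4, 0.2]))  # 2
```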

Various hyperparameters were used for both object detection and classification. Hyperparameters are a set of parameters initialized before training begins that govern how a model learns. Those used here include batch size, learning rate, epochs, image size, optimizer and early-stopping parameters such as patience.

Table 4 provides a summary of the hyperparameters used for training the models employed in object detection and classification.

Table 4.

Hyperparameters used for object detection and classification.

Hyperparameter Object Detection Classification
Batch Size 16 32
Epochs 10 20 (initial), 20 (fine-tuning)
Image Size 640 × 640 224 × 224
Patience 5 5
Learning Rate 0.01 0.5
Optimizer Adam Adam
Weight Decay 0.0005 0.0005
Momentum 0.937 0.937
Scheduler Default Cosine Scheduler
Loss Function CIoU (Complete IoU) Loss and Binary Cross Entropy Sparse categorical cross entropy

Grad-CAM

Grad-CAM (Gradient-weighted Class Activation Mapping) is a part of Explainable AI developed to handle the hurdle of the black box nature of deep learning models. Grad-CAM is a popular explainability technique that shows the visual interpretations of a model's decisions by highlighting the important regions in the image.

Grad-CAM generates heatmaps by calculating the gradients of the output class score with respect to the feature maps of the last convolution layer in the architecture. The heatmap is then superimposed on the original image to visualize the parts of the image that influence the predictions of the model.

The importance of each feature map is determined by computing the global average of its gradients. The resulting weight coefficient determines how much each feature map contributes to the final prediction. After this global averaging, a ReLU activation is applied to retain only the positive influences.
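Given the feature maps and their gradients, the computation just described is only a few lines; the arrays below are random stand-ins for a real forward/backward pass through a trained network:

```python
import numpy as np

def grad_cam(fmaps, grads):
    """Grad-CAM: weight each feature map by the global average of its
    gradients, sum over channels, apply ReLU, and normalize to [0, 1]."""
    alphas = grads.mean(axis=(0, 1))                       # (C,) importance weights
    cam = np.maximum((fmaps * alphas).sum(axis=-1), 0.0)   # weighted sum + ReLU
    return cam / (cam.max() + 1e-8)                        # normalize for display

rng = np.random.default_rng(0)
heat = grad_cam(rng.random((7, 7, 512)),        # last-conv feature maps
                rng.standard_normal((7, 7, 512)))  # d(score)/d(feature maps)
print(heat.shape)  # (7, 7)
```

The resulting low-resolution map is then upsampled to the input size and overlaid on the image as a heatmap.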

Fig. 14 depicts Grad-CAM visualizations for the proposed ensemble model applied to various insect types. In Grad-CAM, the color intensity depicts the importance of each feature map: warmer shades (red, orange, yellow) indicate areas of higher importance, while cooler shades (blue, green) indicate areas of lower importance. From the visualizations, it can be observed that the proposed ensemble model focuses primarily on distinctive insect features such as legs, body structure and antennae rather than less important regions such as the background.

Fig. 14.

Fig. 14:

Grad - CAM visualization of GradCAM-PestDetNet.

The Grad-CAM visualizations also provide insights for diagnosing failure cases in the model's predictions. This ensures transparency as well as better classification performance.

Method validation

The IP102 dataset was chosen among the available options for its diverse composition, comprising 75,000 images spanning 102 distinct classes. It also contains images across various stages of the life cycle such as egg, larva, pupa and adult. Furthermore, the dataset is structured in such a way that it includes superclass and subclass thus allowing a more structured classification. The PlantVillage and the EuroSAT datasets were included to introduce “no-pest” samples to the dataset to improve the generalizability of the model.

However, the dataset also presents challenges such as class imbalance, where certain pest classes are over-represented while others are under-represented. To mitigate this, data augmentation techniques such as rotation, cropping, scaling and brightness adjustment were applied to replicate real-world conditions. Further, random downsampling was applied to reduce the number of samples in the over-represented classes, preventing the dataset from being biased towards any particular class. Data augmentation was also applied to the over-represented classes to maintain the overall diversity of the dataset.

In addition to the aforementioned techniques for addressing class imbalance, class weight adjustment, focal loss and hard example mining were also implemented. In class weight adjustment, higher weights are dynamically assigned to the under-represented classes, ensuring the model stays unbiased. A focal loss function was introduced to handle class imbalance by focusing on hard-to-classify examples: it reduces the weight of well-classified examples while increasing that of misclassified ones. In conjunction with focal loss, hard example mining was used to identify misclassified samples and reintroduce them into the training process.
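The focal-loss behaviour described here is easy to verify numerically; this is a minimal sketch with toy probabilities, not the training loss code:

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Focal loss: cross-entropy down-weighted by (1 - p_t)^gamma, so
    well-classified samples contribute little and hard ones dominate."""
    pt = probs[np.arange(len(labels)), labels]   # probability of the true class
    return float((-((1.0 - pt) ** gamma) * np.log(pt)).mean())

# A well-classified sample (p_t = 0.9) contributes far less than a hard one (p_t = 0.1).
easy = focal_loss(np.array([[0.1, 0.9]]), np.array([1]))
hard = focal_loss(np.array([[0.9, 0.1]]), np.array([1]))
print(easy < hard)  # True
```

Setting `gamma = 0` recovers plain cross-entropy; larger values push the optimizer harder towards the misclassified, typically rare-class, samples.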

YOLO (You Only Look Once) was selected over other object detection models for its single-shot detection architecture, which makes it more accurate and efficient than region-based models such as Faster R-CNN. Since pests tend to be small and scattered across images, YOLO's ability to detect multiple objects of varying sizes is quite effective. Newer versions of YOLO also incorporate mosaic augmentation, improving the overall generalizability of the model. Of the variants compared, YOLOv8n was chosen for its balance between inference speed and accuracy; its lightweight architecture makes it highly effective for deployment in resource-constrained environments, such as mobile applications.

The next step is the classification of detected pests into their respective species. A baseline CNN model was initially used, with further improvements made by incorporating Squeeze-and-Excitation (SE) attention to aid feature extraction. Traditional CNN architectures give the same level of attention to all feature maps, which leads to suboptimal learning for under-represented classes. In contrast, SE attention operates channel-wise, ensuring that the most relevant features, such as legs, body shape or wings, are dynamically given more importance. The SE mechanism recalibrates the feature maps globally instead of computing spatial relationships between all pixels, which would be computationally expensive; it is therefore lightweight compared to other attention mechanisms such as the self-attention used in transformers.

To further improve classification performance, transfer learning was introduced using pre-trained models such as ResNet50, VGG16, EfficientNetB0, MobileNetV2, InceptionV3 and DenseNet121. These models are trained on large-scale datasets and hence converge quickly and extract effective features. Because each pre-trained model has its own strengths, an ensemble is effective at combining them while mitigating individual weaknesses; aggregating predictions from multiple models improves accuracy. To ensure robustness, fine-tuning was performed only on the upper layers of each architecture while the lower layers were kept frozen.

Finally, Grad-CAM was integrated to enhance interpretability by generating heatmaps that highlight key regions influencing predictions. This helps debug misclassifications and ensures the model focuses on relevant insect features rather than background noise.

The code for GradCAM-PestDetNet is currently under refinement and will be made publicly available via GitHub upon publication. In the meantime, detailed pseudocode is provided in Algorithm 1 to aid reproducibility and transparency.

Results and discussion

Object detection

As illustrated in Table 5 below, YOLOv8n, YOLOv8m and YOLOv8s are evaluated on various metrics that expose the trade-off between speed and accuracy. In terms of accuracy, YOLOv8s has the highest precision (0.954) and mAP@0.5 (0.962), while YOLOv8n performs best on speed, with the fastest inference (1.86 ms/img) and preprocessing times, making it suitable for real-time applications. YOLOv8m provides balanced performance overall, achieving the best average IoU (0.6203), which reflects its ability to perform well on both speed and accuracy.

Table 5. Comparison of YOLO variants.

Metric                      YOLOv8s   YOLOv8m   YOLOv8n
Precision                   0.954     0.953     0.953
Recall                      0.939     0.934     0.938
mAP@0.5                     0.962     0.961     0.962
mAP@0.5:0.95                0.781     0.778     0.780
Fitness                     0.796     0.796     0.798
Inference speed (ms/img)    4.2       9.13      1.86
Preprocess speed (ms)       0.2       0.154     0.158
Postprocess speed (ms)      0.8       0.727     0.875

Overall, YOLOv8n offers the best speed-accuracy balance: it has the fastest inference time (1.86 ms/img) while remaining competitive on the accuracy metrics, which makes it well suited for real-time applications.

As shown in Fig. 15, YOLOv8n achieves an average IoU of 0.5785, lower than YOLOv8m (0.6203) but higher than YOLOv8s (0.5586). This indicates that although YOLOv8n is optimized for speed, it still aligns reasonably well with the ground-truth boxes.

Fig. 15. Average IoU of various YOLO variants.
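For reference, the IoU metric reported here is the ratio of the overlap between a predicted and a ground-truth box to the area of their union. A minimal sketch with axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    # clamp to zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```

The reported average IoU is this value averaged over all matched prediction/ground-truth pairs on the test set.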

Classification

Initially, various CNN architectures were experimented with for classifying the "Rare" vs. "Not Rare" classes, and the architecture was then enhanced with Squeeze-and-Excitation (SE) attention. The CNN with attention improves recall on the rare class from 0.52 to 0.69, but its precision decreases from 0.94 to 0.78 because of a higher number of false positives. For the not-rare class, precision improves from 0.84 to 0.88 while recall drops slightly from 0.99 to 0.92. The overall F1-score for the rare class nevertheless increases from 0.67 to 0.73, so the model is more effective at identifying rare-class samples. Fig. 16 compares the CNN architectures with and without attention on F1-score, recall, and precision.

Fig. 16. Comparison of CNN with and without attention.

The confusion matrices show that the number of correctly identified rare-class samples improves from 153 to 204 (i.e., fewer false negatives), even though false positives increase from 9 to 57. This confirms that the CNN with attention is better at identifying the rare class while largely maintaining its performance on the not-rare class. Fig. 17 shows the confusion matrices of the CNN with and without attention.

Fig. 17. Confusion matrix of CNN with and without attention.
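The per-class metrics can be recomputed directly from the confusion-matrix counts. In the sketch below the false-negative counts (141 and 90) are an assumption, inferred from the reported recalls under a rare-class support of roughly 294 samples:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 for one class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# "Rare"-class counts read off the two confusion matrices;
# fn values are inferred (assumption), not taken from the paper
baseline = prf(tp=153, fp=9, fn=141)     # CNN without attention
with_att = prf(tp=204, fp=57, fn=90)     # CNN with SE attention

print([round(v, 2) for v in baseline])   # close to the reported 0.94 / 0.52 / 0.67
print([round(v, 2) for v in with_att])   # close to the reported 0.78 / 0.69 / 0.73
```

This makes the trade-off explicit: attention converts false negatives into true positives at the cost of extra false positives, which raises recall and F1 while lowering precision.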

Fig. 18 below compares the performance of various transfer learning architectures: ResNet50, VGG16, EfficientNetB0, DenseNet, InceptionV3, MobileNet, and VGG19, evaluated on accuracy, precision, recall, and F1-score. DenseNet performs best overall, with an accuracy of 64.56 %, precision of 67.92 %, and F1-score of 64.99 %, and it performs consistently across both the macro and weighted averages. VGG19 performs worst, with an accuracy of 11.45 %, precision of 17.42 %, recall of 11.45 %, and F1-score of 9.72 %.

Fig. 18. Comparison of various transfer learning architectures.

Further experimentation combined the individual architectures into ensembles to improve overall performance. This approach averages the predictions of the different models and thereby leverages their individual strengths. The best-performing ensemble, ResNet50 + DenseNet + MobileNet, achieved an overall accuracy of 67.07 %, outperforming each individual model.
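The prediction-averaging step (soft voting) can be sketched as follows; the per-model probability arrays below are toy values, not the actual model outputs:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class-probability outputs from several models (soft voting)."""
    return np.mean(np.stack(prob_list), axis=0)

# toy softmax outputs from three hypothetical models over 2 samples x 3 classes
p_resnet    = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
p_densenet  = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_mobilenet = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]])

avg = ensemble_predict([p_resnet, p_densenet, p_mobilenet])
print(avg.argmax(axis=1))  # [0 1] — predicted class per sample
```

Averaging probabilities rather than hard labels lets a confident model outvote two uncertain ones, which is part of why the ensemble is less biased toward the majority class than any single network.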

Conclusion

In conclusion, this study proposes an efficient approach to pest detection and identification, using YOLO for detection and an attention-augmented CNN together with transfer learning models for classification. YOLOv8n provided the best inference speed (1.86 ms/img) for real-time application without compromising performance. Integrating attention into the CNN architecture enhanced the detection of rare pests, and the best ensemble raised accuracy to 67.07 %, a substantial improvement over the baseline CNN. Future work could include integrating IoT devices for real-time processing and extending the explainable-AI component with techniques such as LIME (Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive exPlanations).

Limitations

The proposed methodology, while effective, has certain limitations that could be addressed in future work. Despite techniques such as augmentation, dynamic weight adjustment, and focal loss, class imbalance remains a persistent challenge. The model's performance is strongly influenced by environmental factors such as lighting and background variation in the images. Predictions are restricted to the pest species present in the dataset, so the model may struggle to recognize unseen species. Grad-CAM adds interpretability to the model's predictions, but it does not fully explain the overall decision-making process.

Discussion

The proposed GradCAM-PestDetNet framework effectively combines object detection and classification. It integrates several modules: attention mechanisms, transfer learning, ensemble models, and Grad-CAM. Adding SE attention blocks to the baseline CNN improved its ability to classify rare pest classes, thereby addressing class imbalance. Grad-CAM further enhanced the explainability and interpretability of the model by visually highlighting the key features. Together, these modules contribute to a robust, real-world pest monitoring system.

Ethics statements

Not applicable

CRediT author statement

Ramitha Vimala: Conceptualization, Methodology, Formal Analysis, Investigation, Writing - Original draft, Review & Editing. Saharsh Mehrotra: Conceptualization, Formal Analysis, Investigation, Writing - Review & Editing. Pooja Kamat: Formal Analysis, Investigation, Supervision, Conceptualization, Review. Satish Kumar: Formal Analysis, Investigation, Supervision, Conceptualization, Review. Arunkumar Bongale: Supervision, Conceptualization, Review. Ketan Kotecha: Supervision, Conceptualization, Review

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Funding: This work was supported by the Research Support Fund (RSF) of Symbiosis International (Deemed University), Pune, India.

Footnotes

Related research article: None.

Data availability

A link to the online public dataset repository is shared.



