Abstract
This paper addresses the challenge of low detection accuracy of grape clusters caused by scale differences, illumination changes, and occlusion in realistic and complex scenes. We propose a multi-scale feature fusion and augmentation YOLOv7 network to enhance the detection accuracy of grape clusters across variable environments. First, we design a Multi-Scale Feature Extraction Module (MSFEM) to enhance feature extraction for small-scale targets. Second, we propose the Receptive Field Augmentation Module (RFAM), which uses dilated convolution to expand the receptive field and enhance the detection accuracy for objects of various scales. Third, we present the Spatial Pyramid Pooling Cross Stage Partial Concatenation Faster (SPPCSPCF) module to fuse multi-scale features, improving accuracy and speeding up model training. Finally, we integrate the Residual Global Attention Mechanism (ResGAM) into the network to better focus on crucial regions and features. Experimental results show that our proposed method achieves a mAP of 93.29% on the GrappoliV2 dataset, an improvement of 5.39% over YOLOv7. Additionally, our method increases Precision, Recall, and F1 score by 2.83%, 3.49%, and 0.07, respectively. Compared to state-of-the-art detection methods, our approach demonstrates superior detection performance and adaptability to various environments for detecting grape clusters.
Keywords: Grape clusters detection, Multi-scale, Receptive field, Feature fusion, Feature augmentation
Subject terms: Computational science, Computer science
Introduction
Grapes are a significant global fruit crop, and the scale of their cultivation has steadily increased over the past 40 years. However, current operational methods still rely heavily on traditional manual labor, which is time-consuming and labor-intensive. Advancing grape detection technology is crucial for automating the grape industry and achieving intelligent agricultural production. Therefore, developing rapid and accurate grape detection techniques is imperative.
With ongoing advancements in artificial intelligence, significant progress has been achieved in object detection. Numerous researchers have developed and refined various methods to enhance the accuracy of fruit detection. For instance, Hamim et al.1 used convolutional neural networks (CNN), KNN, and SVM to detect rotten fruits and vegetables. Wang et al.2 proposed the NVW-YOLOv8s network, an improved version of YOLOv8s for tomato detection, enhancing detection performance by augmenting the original and balanced training datasets. Yi et al.3 proposed a method for detecting fruit recognition targets based on an improved YOLOv4 algorithm. Wei et al.4 developed an accurate method for kiwifruit recognition and visual localization using an improved YOLOv5s. Luo et al.5 proposed a new detection model, Soft-CBAM-YOLOV5, for detecting and counting fruits and vegetables. Zhang et al.6 constructed a YOLOX target detection network and utilized new sample enhancement and composition strategies to train the network for detecting and counting holly fruits. Du et al.7 developed the DSW-YOLO network, based on YOLOv7, for detecting ripe strawberries.
Furthermore, numerous studies have focused on grape detection. For example, Shahzad et al.8 developed a new grape dataset from a local farm and used a convolutional neural network to detect grape clusters. Kavithamani et al.9 explored a method based on grape leaf image classification and constructed a plant disease detection model using a deep convolutional network. Ariza-Sentís et al.10 proposed the GrapeMOTS method, which improves grape detection by recording multi-object tracking annotations of grape bunches from various perspectives. Liu et al.11 proposed YOLOX-RA, an advanced grape detection model based on YOLOX-Tiny, designed to quickly and accurately identify densely growing and occluded grape bunches. Marani et al.12 proposed a deep-learning framework that employs an on-vehicle RGB-D camera system for grape detection. Chen et al.13 proposed the YOLOv8-GP network, based on YOLOv8n-Pose, for detecting grape clusters and picking points. These studies have significantly enhanced the accuracy of grape detection through various innovative approaches.
Although current detection technology has advanced significantly, detecting grapes accurately remains challenging due to scale variations and obstructive branches. The YOLOv7 network has shown great improvements in speed and accuracy, surpassing all object detectors at the time of its development and being widely utilized in various fields. Therefore, this paper selects the YOLOv7 network for grape cluster detection. However, the network’s performance is limited when detecting grape targets of varying scales, particularly smaller ones, and when faced with obstructive branches. To solve the above problems, this paper proposes a multi-scale feature fusion and augmentation YOLOv7 network. The main contributions are as follows:
Design the MSFEM to enhance the accuracy in detecting small-scale targets.
Propose the RFAM to expand the receptive field of features to enhance the capability of extracting features at different scales.
Propose the SPPCSPCF module, which utilizes 3×3 convolutions for improved speed and accuracy.
Develop the ResGAM to emphasize target regions in the image and improve the contrast between the target object and the background.
Related work
The issue of scale variation in object detection poses a significant challenge. To address this problem, Szegedy et al.14 introduced the Inception structure in 2014. This structure not only expands the receptive field but also enhances the robustness of the neural network by integrating various convolution kernels. Ding et al.15 combined the concept of a multi-branch and multi-scale Inception network with over-parameterization, proposing the Diverse Branch Block, which significantly improved the model’s performance. Subsequently, many methods have adopted multi-branch approaches for feature extraction at different scales. To further enhance feature extraction at various scales, this paper designs MSFEM for multi-scale feature extraction.
The scale of the receptive field plays a crucial role in object detection, as it determines the scale of features and contextual information that the network can capture. A larger receptive field enables the network to capture larger-scale objects and richer contextual information, but it may not accurately capture the boundary information of the target. Conversely, a smaller receptive field can capture detailed information but may lack contextual information. Therefore, combining receptive fields of different scales is necessary for detecting objects of varying sizes. Zheng et al.16 designed a Fast Spatial Pyramid Pooling Cross Stage Partial Convolution module based on SPPCSPC, which achieves higher detection efficiency while ensuring accuracy. Yu et al.17 designed a receptive field enhancement module for multi-scale face detection, which learns feature maps with different receptive fields and enhances the representation of the feature pyramid. Inspired by this module, we propose RFAM for grape detection at various scales.
He et al.18 proposed the Spatial Pyramid Pooling (SPP) method to address the limitation of fixed input data size in convolutional neural networks, enabling them to adapt to data of any size and scale. The SPPCSPC19 module, an improvement upon SPP, consists of SPP and Cross Stage Partial Convolution (CSPC) modules, primarily utilized for multi-scale feature extraction and cross-stage feature fusion. The SPPF module, based on SPP, replaces the 9×9 and 13×13 kernels in the original SPP with cascaded 5×5 kernels, significantly improving detection speed. Building upon this idea, this paper enhances the SPPCSPC module and designs the SPPCSPCF module to achieve faster and more accurate grape detection.
To address the issue of decreased accuracy in object detection when the object is occluded, researchers have explored the use of attention mechanisms to extract features. By incorporating the attention mechanism, the neural network can autonomously learn and selectively focus on crucial information within the input data while disregarding irrelevant details. This enhances the model’s generalization capability. Liu et al.20 conducted a study on attention mechanisms across spatial and channel dimensions, proposing a global attention mechanism that boosts the performance of deep neural networks by reducing information loss and amplifying global interactive representations. Building upon this, ResGAM is introduced in this paper to improve the detection performance of the model.
Methodologies
This paper proposes a multi-scale feature fusion and augmentation YOLOv7 network. The network structure, shown in Fig. 1, consists of four components: Input, Backbone, Feature Augmentation, and Multi-Scale Detection. It incorporates four improved modules: MSFEM, RFAM, SPPCSPCF, and ResGAM.
Fig. 1.
Network structure of this paper.
MSFEM
We enhanced feature extraction for small-scale targets by adding a multi-scale feature extraction branch to YOLOv7. This branch is positioned after the first ELAN1 and processes low-level, high-resolution feature maps. Although this increases the number of parameters, it significantly improves detection accuracy. Unlike the four-detection-head variants of YOLOv7 (such as YOLOv7-d6 and YOLOv7-e6), we add the additional branch while preserving the overall YOLOv7 framework, introducing a set of smaller anchors for the small-target detection branch. The four-detection-head structure mitigates the negative impact of target scale variations and enhances detection for targets of different scales, especially small targets. We also observed that the four-detection-head structure is more stable than the three-detection-head structure.
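For illustration, the four detection scales can be expressed as four anchor groups in a YOLOv7-style configuration; a minimal sketch follows. The three larger groups are the default YOLOv7 anchors, while the added small-object group is a hypothetical placeholder, not the anchors tuned in our experiments.

```python
# Hypothetical sketch: extending YOLOv7's three-scale anchor set with an extra
# group of smaller anchors for a fourth (small-object) detection head.
anchors = {
    "P2/4":  [(8, 10), (13, 19), (23, 29)],         # added small-object branch (placeholder values)
    "P3/8":  [(12, 16), (19, 36), (40, 28)],        # default YOLOv7 small anchors
    "P4/16": [(36, 75), (76, 55), (72, 146)],       # default YOLOv7 medium anchors
    "P5/32": [(142, 110), (192, 243), (459, 401)],  # default YOLOv7 large anchors
}

# Each detection head predicts 3 anchors per grid cell; with 4 heads the model
# outputs predictions at 4 scales instead of YOLOv7's default 3.
assert all(len(group) == 3 for group in anchors.values())
print(f"{len(anchors)} detection scales, {sum(map(len, anchors.values()))} anchors in total")
```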
We conducted a comparative analysis of different branch structures within YOLOv7 to validate the efficacy of the selected four-branch structure. The diagrams of different branch structures are shown in Fig. 2. Figure 2c depicts the three-branch structure of YOLOv7. By pruning YOLOv7 while keeping the overall structure unchanged, we obtained the two-branch structure shown in Fig. 2d. The four-branch structure is depicted in Fig. 2b, and similarly, the five-branch structure resembles the four-branch by adding two branches while maintaining the original framework, as shown in Fig. 2a. Table 1 presents the detection performance of the model with different numbers of branches.
Fig. 2.
Different branching structures of YOLOv7: (a) five-branch structure, (b) four-branch structure, (c) three-branch structure of YOLOv7, (d) dual-branch structure.
Table 1.
Comparison results of multi-branch structures.
| Branching structure | mAP (%) | Precision (%) | Recall (%) | F1 score | Parameters (M) |
|---|---|---|---|---|---|
| Dual branch | 87.66 | 90.30 | 79.89 | 0.85 | 22.2 |
| Three-branch | 87.90 | 85.00 | 79.15 | 0.82 | 37.2 |
| Four-branch | 90.27 | 90.39 | 86.1 | 0.85 | 37.8 |
| Five-branch | 75.72 | 74.72 | 69.79 | 0.72 | 143.6 |
Significant values are in bold.
The experimental results in Table 1 indicate that the four-branch structure demonstrates superior mAP, Precision, Recall, and F1 score compared to other configurations. Fewer branches cannot fully extract features at different scales, while too many branches increase network layers, leading to issues such as gradient vanishing or explosion and network degradation. Based on these considerations, a four-branch structure has been chosen for further investigation.
In this section, we compare different structures of YOLOv7, specifically focusing on the three-branch and four-branch configurations. We apply the corresponding pre-trained weights for each architecture during training. The performance comparison results between our method and the various YOLOv7 structures are shown in Table 2. Experimental results demonstrate that our method achieves the highest mAP. Precision is slightly lower than YOLOv7-e6. Recall is slightly lower than YOLOv7-d6. However, the F1 score, which measures both Precision and Recall, still shows the superiority of our method.
Table 2.
Comparison results of different structures of YOLOv7.
| Methods | Number of branches | mAP (%) | Precision (%) | Recall (%) | F1 score | Parameters (M) |
|---|---|---|---|---|---|---|
| YOLOv7 | 3 | 87.90 | 85.00 | 79.15 | 0.82 | 37.2 |
| YOLOv7x | 3 | 90.60 | 84.02 | 85.05 | 0.85 | 70.8 |
| YOLOv7-d6 | 4 | 92.02 | 84.54 | 87.59 | 0.86 | 152.9 |
| YOLOv7-e6 | 4 | 92.06 | 90.05 | 84.31 | 0.87 | 110.4 |
| YOLOv7-w6 | 4 | 91.63 | 87.53 | 86.03 | 0.87 | 81.0 |
| Ours | 4 | 93.29 | 89.95 | 87.53 | 0.89 | 43.5 |
Significant values are in bold.
Table 2 shows that, generally, a four-branch structure outperforms a three-branch structure, although it comes with more parameters. Our proposed method has fewer model parameters while achieving higher accuracy compared to the four-branch structure. It also achieves higher overall accuracy with only a slight increase in model parameters compared to the three-branch structure. Therefore, our proposed method demonstrates superior results across all aspects compared to other structures within YOLOv7.
RFAM
Using convolution at a single scale may restrict the receptive field of the feature layer, leading to difficulties in detecting objects of different scales. Drawing inspiration from the RFE17 approach, this paper proposes RFAM, as depicted in Fig. 3. The Rblock module comprises two components: multi-branch dilated convolution21 and an acquisition weighting layer. The multi-branch section employs three convolutional layers with different dilation rates (1, 3, and 5) to capture features at different scales; each branch uses a 3×3 kernel with a distinct receptive field size. The acquisition weighting layer gathers information from the branches and assigns weights to their respective features. Additionally, residual connections22 are incorporated into the module to ensure stability, since increasing the number of layers can cause vanishing or exploding gradients and network degradation. With these optimizations, RFAM effectively enlarges the receptive field to enhance the model's feature extraction capability.
Fig. 3.
Structure of RFAM.
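A minimal PyTorch sketch of the Rblock idea is given below; the layer names, channel handling, and the form of the acquisition weighting layer (here a 1×1 fusion convolution) are our own assumptions for illustration. Note that a 3×3 convolution with dilation rate d covers an effective window of 3 + 2(d − 1), i.e. 3×3, 7×7, and 11×11 for d = 1, 3, 5.

```python
import torch
import torch.nn as nn

class RBlock(nn.Module):
    """Sketch of a multi-branch dilated-convolution block with a learned
    branch-weighting (fusion) layer and a residual connection (details assumed)."""

    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        # Three 3x3 branches with dilation rates 1, 3 and 5; padding = dilation
        # keeps the spatial size unchanged.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        )
        # Acquisition weighting layer (assumed form): a 1x1 convolution fuses the
        # concatenated branch outputs and implicitly weights their contributions.
        self.fuse = nn.Conv2d(len(dilations) * channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        out = self.act(self.bn(self.fuse(feats)))
        return out + x  # residual connection for training stability


if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(RBlock(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```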
To evaluate the impact of the dilation rate of the dilated convolutions in RFAM on the model's performance, we incorporated RFAM modules with different dilation rates into the YOLOv7 network for comparison. The findings in Table 3 show that the module using dilation rates of 1, 3, and 5 exhibited superior performance. Since different dilation rates are suitable for different types and scales of targets, it is essential to carefully select appropriate dilation rates when designing a target detection network.
Table 3.
Comparison results of RFAM with different dilation rates.
| Dilation rates | mAP (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| 1,1,1 | 90.54 | 87.47 | 85.54 |
| 2,2,2 | 87.34 | 85.65 | 83.58 |
| 3,3,3 | 90.09 | 87.12 | 86.24 |
| 5,5,5 | 90.80 | 89.10 | 84.31 |
| 1,2,3 | 90.10 | 87.90 | 87.25 |
| 1,3,5 | 90.97 | 87.47 | 87.25 |
Significant values are in bold.
SPPCSPCF
The SPPCSPCF is an enhancement of SPPCSPC, as illustrated in Fig. 4. This paper introduces two improvements. The first, similar to SPPF, uses only three cascaded 5×5 kernels to reproduce the effect of the 9×9 and 13×13 operations, as depicted in Fig. 4b. This approach is faster and more accurate than the original module. The second replaces the 5×5, 9×9, and 13×13 kernels with cascades of 3×3 kernels, which is more effective; the specific structure is shown in Fig. 4c.
Fig. 4.
Structure of different SPPCSPC: (a) structure of SPPCSPC, (b) structure of SPPCSPCF with 5×5 convolution, (c) structure of SPPCSPCF with 3×3 convolution.
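The sketch below illustrates the pooling part of the 3×3 variant, assuming (as in the original YOLOv7 SPPCSPC and in SPPF) stride-1 max pooling: n cascaded 3×3 poolings cover a (2n+1)×(2n+1) window, so taking the outputs after two, four, and six poolings reproduces the 5×5, 9×9, and 13×13 receptive fields of SPP at lower cost. The module and variable names are our own.

```python
import torch
import torch.nn as nn

class CascadedPooling(nn.Module):
    """Sketch of the pooling part of an SPPCSPCF-style module: repeated
    stride-1 3x3 max pooling replaces the parallel 5x5 / 9x9 / 13x13 pooling
    of SPPCSPC, since n cascaded 3x3 poolings cover a (2n+1) x (2n+1) window."""

    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        outs = [x]
        y = x
        for step in range(1, 7):
            y = self.pool(y)
            # After 2, 4 and 6 poolings the effective window matches 5x5, 9x9, 13x13.
            if step in (2, 4, 6):
                outs.append(y)
        return torch.cat(outs, dim=1)  # channels: 4x the input, as in SPP


if __name__ == "__main__":
    x = torch.randn(1, 128, 40, 40)
    print(CascadedPooling()(x).shape)  # torch.Size([1, 512, 40, 40])
```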
Table 4 presents the performance comparison of this module with different convolutional kernels. Because the variants differ only in the pooling part, the speed comparison considers this module in isolation. The experimental results show that both the 5×5 and 3×3 replacements improve performance, with the 3×3 kernel showing the greater enhancement. Given its efficiency, we use the 3×3 kernel in this paper.
Table 4.
Comparison results of different convolutional kernels for SPPCSPCF.
| Convolution kernel | mAP (%) | Time (s) |
|---|---|---|
| Original convolution | 87.90 | 0.49 |
| 5×5 | 88.76 | 0.37 |
| 3×3 | 89.73 | 0.21 |
Significant values are in bold.
ResGAM
To improve focus on relevant information and minimize the impact of irrelevant information on the model, this paper proposes the ResGAM module. This module builds on the GAM20 by incorporating a residual connection. It integrates input features with those passing through channel and spatial attention sub-modules. This retains original features and reduces feature loss.
Figure 5 illustrates the structure of ResGAM, which adopts the sequential channel-spatial attention arrangement of CBAM and reconfigures its submodules. For an input feature map F1, the intermediate state F2 and the output state F3 are defined as follows: F2 = Mc(F1) ⊗ F1, F3 = Ms(F2) ⊗ F2. Here, Mc and Ms denote the channel and spatial attention maps, respectively, while ⊗ represents element-wise multiplication. The channel attention submodule employs 3D permutation to retain information in all three dimensions. Then, a two-layer MLP (multilayer perceptron) enhances channel-spatial correlations across dimensions. The MLP features an encoder-decoder structure with the same reduction ratio r as BAM.
Fig. 5.
Structure of ResGAM.
In the spatial attention sub-module, two convolutional layers fuse spatial information and focus on spatial details. The reduction ratio r used in the channel attention submodule is consistent with BAM. The pooling process is eliminated to preserve the feature maps better.
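A minimal sketch of the ResGAM computation described above is shown below; the 7×7 spatial kernels follow GAM, while the exact placement of the residual connection and other implementation details are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ResGAM(nn.Module):
    """Sketch of ResGAM: GAM-style channel and spatial attention with an added
    residual connection. Kernel sizes and residual placement are assumptions."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # Channel attention: 3D permutation + two-layer MLP (encoder-decoder, ratio r).
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        # Spatial attention: two convolutions, no pooling, same reduction ratio r.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # F2 = Mc(F1) * F1: permute to (B, H, W, C) so the MLP mixes channels.
        mc = self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2).sigmoid()
        f2 = mc * x
        # F3 = Ms(F2) * F2
        f3 = self.spatial(f2).sigmoid() * f2
        return f3 + x  # residual connection preserves the original features


if __name__ == "__main__":
    x = torch.randn(1, 64, 40, 40)
    print(ResGAM(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```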
Experiment settings
Datasets
This paper utilizes two open-access datasets, GrappoliV2 dataset23 and Lvtn dataset24, for the experiments. The GrappoliV2 dataset, specifically designed for grape detection, comprises 375 images in a single category. After removing low-quality images, 335 remained. The dataset was then divided into training, validation, and test sets in a ratio of 15:2:2. Specifically, the training set consists of 265 images, the validation set 35 images, and the test set also 35 images.
As an additional performance validation dataset, the Lvtn dataset comprises 915 images from two classes, specifically designed for grape detection. The dataset is split into training, validation, and test sets in a ratio of 22:5:2. Examples of the GrappoliV2 and Lvtn datasets are shown in Fig. 6.
Fig. 6.
Example of datasets: (a) GrappoliV2 dataset, (b) Lvtn dataset.
Experimental environment
During the experiments, the model was trained for 200 epochs with a batch size of 2. The initial learning rate was set to 0.1 and gradually decreased to 0.01 during training. Training was initialized with YOLOv7 pre-trained weights. The specific settings of the experimental environment are presented in Table 5.
Table 5.
Experimental environment settings.
| Configuration | Parameter |
|---|---|
| CPU | Intel Core i5-9300H |
| GPU | GTX 1650 |
| Operating system | Windows 11 |
| Accelerate environment | CUDA 11.6, Python 3.10, PyTorch 2.0.1 |
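For reference, the training settings above can be collected into a small configuration sketch; the optimizer and the exact decay schedule are not specified in the text, so the linear decay shown here is an assumption, and the weight-file name is a placeholder.

```python
# Hypothetical training-configuration sketch reflecting the reported settings;
# only the numbers (200 epochs, batch size 2, learning rate 0.1 -> 0.01) come
# from the text, the rest is assumed for illustration.
config = {
    "epochs": 200,
    "batch_size": 2,
    "lr_initial": 0.1,
    "lr_final": 0.01,
    "pretrained_weights": "yolov7.pt",  # placeholder name for YOLOv7 pre-trained weights
    "device": "cuda",                   # GTX 1650, CUDA 11.6, PyTorch 2.0.1
}

def lr_at_epoch(epoch: int, cfg=config) -> float:
    """Linearly decay the learning rate from lr_initial to lr_final (assumed schedule)."""
    t = epoch / max(cfg["epochs"] - 1, 1)
    return cfg["lr_initial"] + t * (cfg["lr_final"] - cfg["lr_initial"])

print(lr_at_epoch(0), lr_at_epoch(199))  # 0.1 0.01
```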
Evaluation metrics
The evaluation metrics for target detection methods typically include Precision, Recall, F1 score, and Average Precision. Precision measures the proportion of detected targets that are correct, while Recall measures the proportion of ground-truth targets that are detected. The F1 score is the harmonic mean of Precision and Recall; higher values indicate better performance, with a range between 0 and 1. Average Precision (AP) evaluates the model's overall performance across different thresholds, balancing Precision and Recall, and is specific to a single category, while mAP (mean Average Precision) is the average AP over all categories. Unless otherwise noted, mAP in this paper is computed at an IoU threshold of 0.5.
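For reference, a minimal sketch of Precision, Recall, and F1 computed from true-positive, false-positive, and false-negative counts is shown below (the integration of the Precision-Recall curve for AP/mAP is omitted); the example counts are hypothetical.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall and F1 from detection counts.

    Precision = TP / (TP + FP): fraction of detections that are correct.
    Recall    = TP / (TP + FN): fraction of ground-truth objects detected.
    F1        = harmonic mean of Precision and Recall.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts chosen so that Precision and Recall are close to the
# values reported on GrappoliV2 (89.95% and 87.53%), giving F1 of about 0.89.
p, r, f1 = detection_metrics(tp=700, fp=78, fn=100)
print(round(p, 3), round(r, 3), round(f1, 3))
```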
Experiment results
Ablation experiments
In this section, we perform ablation experiments on MSFEM, RFAM, SPPCSPCF, and ResGAM to evaluate the impact of each module on the model’s performance. The experimental results are presented in Table 6.
Table 6.
Ablation experiments.
| MSFEM | SPPCSPCF | RFAM | ResGAM | mAP (%) | Precision (%) | Recall (%) | F1 score |
|---|---|---|---|---|---|---|---|
| – | – | – | – | 87.90 | 85.00 | 79.15 | 0.82 |
| ✓ | – | – | – | 90.21 | 84.13 | 86.00 | 0.85 |
| – | ✓ | – | – | 89.73 | 88.31 | 87.01 | 0.84 |
| – | – | ✓ | – | 90.97 | 88.44 | 86.27 | 0.87 |
| – | – | – | ✓ | 92.04 | 91.89 | 86.99 | 0.86 |
| ✓ | ✓ | – | – | 90.79 | 84.48 | 85.11 | 0.85 |
| ✓ | – | ✓ | – | 91.17 | 89.87 | 82.59 | 0.86 |
| ✓ | – | – | ✓ | 92.38 | 88.10 | 88.89 | 0.88 |
| – | ✓ | ✓ | – | 92.71 | 83.37 | 89.69 | 0.86 |
| – | ✓ | – | ✓ | 90.95 | 86.83 | 90.89 | 0.89 |
| – | – | ✓ | ✓ | 91.99 | 88.80 | 87.48 | 0.88 |
| ✓ | ✓ | ✓ | – | 90.55 | 83.97 | 86.01 | 0.85 |
| ✓ | ✓ | – | ✓ | 92.76 | 91.62 | 85.06 | 0.88 |
| ✓ | – | ✓ | ✓ | 92.42 | 92.45 | 84.31 | 0.88 |
| – | ✓ | ✓ | ✓ | 91.63 | 86.67 | 86.03 | 0.87 |
| ✓ | ✓ | ✓ | ✓ | 93.29 | 89.95 | 87.53 | 0.89 |
Significant values are in bold.
Adding MSFEM, SPPCSPCF, RFAM, and ResGAM individually improves mAP by 2.31, 1.83, 3.07, and 4.14%, respectively. This indicates a significant enhancement in model performance, particularly when RFAM or ResGAM is added individually.
When two modules are combined, pairing MSFEM with SPPCSPCF, RFAM, and ResGAM increases mAP by 0.58, 0.96, and 2.17%, respectively, over MSFEM alone. Pairing SPPCSPCF with MSFEM, RFAM, and ResGAM increases mAP by 1.06, 2.98, and 1.22%, respectively, over SPPCSPCF alone. Pairing RFAM with MSFEM, SPPCSPCF, and ResGAM increases mAP by 0.2, 1.74, and 1.02%, respectively, over RFAM alone. Integrating ResGAM with MSFEM yields a 0.34% increase in mAP over ResGAM alone; conversely, combining ResGAM with SPPCSPCF or RFAM slightly reduces mAP relative to ResGAM alone, although it remains higher than the base model. This suggests that ResGAM combines less effectively with SPPCSPCF and RFAM.
When three modules are combined, the combinations that include both MSFEM and ResGAM achieve a higher mAP than the MSFEM-ResGAM pair alone, whereas adding a third module to the SPPCSPCF-RFAM pair reduces mAP, although it remains above the base model. This indicates that SPPCSPCF and RFAM do not synergize well with MSFEM and ResGAM.
In conclusion, combining modules can reveal synergies that enhance model performance, although some combinations interact less favorably and yield suboptimal gains. Nevertheless, the full network with all four modules achieves the best overall performance, with the highest mAP and F1 score.
Performance comparison of different attention mechanisms
To validate the efficacy of the proposed attention mechanism, we conducted a comparative analysis with SE25, ECA26, CA27, CBAM28, and GAM20. Specifically, we integrated each attention mechanism into the YOLOv7 network at the same position and compared their effects. The performance results for the various attention mechanisms are presented in Table 7. Our proposed attention mechanism has a lower Recall than CBAM and a lower F1 score than ECA. However, it achieves the highest mAP and Precision, and it ties GAM for the fewest parameters.
Table 7.
Comparison results of attention mechanisms.
| Attention mechanism | mAP (%) | Precision (%) | Recall (%) | F1 score | Parameter (M) |
|---|---|---|---|---|---|
| SE25 | 91.20 | 88.48 | 88.48 | 0.88 | 52.9 |
| ECA26 | 91.54 | 91.64 | 86.01 | 0.89 | 52.9 |
| CA27 | 90.33 | 90.86 | 85.29 | 0.88 | 52.9 |
| CBAM28 | 90.53 | 87.47 | 88.97 | 0.88 | 54.5 |
| GAM20 | 91.61 | 89.34 | 87.53 | 0.88 | 39.5 |
| ResGAM | 92.04 | 91.89 | 86.99 | 0.86 | 39.5 |
Significant values are in bold.
In conclusion, the proposed attention mechanism uses both channel and spatial information. This makes it more effective than mechanisms that use only one type of information. It also shows superior performance in detecting grape clusters.
Performance comparison with mainstream methods
To validate the effectiveness of the proposed method, we compare it with current mainstream target detection methods on the GrappoliV2 dataset. The compared methods include Faster R-CNN29, SSD30, YOLOv331, YOLOv432, YOLOv5, YOLOX33, YOLOv719, YOLOv8, YOLOv934, and YOLOv1035.
Table 8 presents a performance comparison between our method and mainstream methods on the GrappoliV2 dataset. Our method achieves the highest mAP@0.5 and F1 score, surpassing the other methods. Although its mAP@0.5:0.95 is lower than that of YOLOv5 and YOLOv9, its F1 score is 0.01 and 0.03 higher, respectively. Its Recall is about 1% lower than Faster R-CNN, but its mAP@0.5 is more than 10% higher. While our method's Precision and Recall are not consistently higher than those of all other methods, it excels in F1 score. Compared with YOLOv7, our method improves mAP@0.5 by 5.39%, Precision by 2.83%, Recall by 3.49%, and F1 score by 0.07 on this dataset. Overall, our method demonstrates superior detection performance.
Table 8.
Comparison results of mainstream methods on the GrappoliV2 dataset.
| Methods | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Precision (%) | Recall (%) | F1 score |
|---|---|---|---|---|---|
| Faster R-CNN29 | 82.48 | 41.92 | 50.34 | 88.54 | 0.64 |
| SSD30 | 64.02 | 47.75 | 82.93 | 48.69 | 0.61 |
| YOLOv331 | 82.18 | 33.74 | 85.45 | 65.87 | 0.74 |
| YOLOv432 | 59.62 | 28.06 | 86.32 | 19.57 | 0.32 |
| YOLOv5 | 91.33 | 60.56 | 90.59 | 86.00 | 0.88 |
| YOLOX33 | 86.68 | 48.02 | 87.14 | 79.24 | 0.83 |
| YOLOv719 | 87.95 | 46.11 | 87.12 | 84.04 | 0.82 |
| YOLOv8 | 89.32 | 58.15 | 87.18 | 80.09 | 0.84 |
| YOLOv934 | 92.87 | 69.55 | 84.42 | 87.91 | 0.86 |
| YOLOv1035 | 83.37 | 52.03 | 76.11 | 76.44 | 0.77 |
| Ours | 93.29 | 59.16 | 89.95 | 87.53 | 0.89 |
Significant values are in bold.
Figure 7 shows the Precision-Recall curves of each method on the GrappoliV2 dataset, comparing our proposed method with mainstream approaches. The area under each curve corresponds to AP, and our curve (shown in red) lies outside the others, indicating that our method achieves the highest mAP and superior performance.
Fig. 7.

Precision–recall comparison of methods on the GrappoliV2 dataset.
Figure 8 is an F1-mAP scatter plot illustrating the relationship between the F1 score and mAP for various methods. The results indicate that our method achieves the highest F1 score and mAP, outperforming YOLOv9, YOLOv5, YOLOv8, and other methods.
Fig. 8.

F1-mAP comparison of methods on the GrappoliV2 dataset.
To comprehensively verify the effectiveness of the proposed method, we also conducted performance comparison experiments with mainstream methods on the Lvtn dataset. The results are shown in Table 9. Our method achieves the best mAP@0.5 on this dataset. Although its Precision is lower than that of YOLOv7, its mAP@0.5, Recall, and F1 score are higher by 6.12%, 8.99%, and 0.02, respectively. While its mAP@0.5:0.95 and Recall are lower than those of YOLOv9, our method achieves a 3.45% higher mAP@0.5 and a 0.04 higher F1 score. Overall, our method shows superior detection performance on this dataset.
Table 9.
Comparison results of mainstream methods on the Lvtn dataset.
| Methods | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Precision (%) | Recall (%) | F1 score |
|---|---|---|---|---|---|
| Faster R-CNN29 | 50.45 | 17.76 | 37.05 | 49.83 | 0.42 |
| SSD30 | 34.37 | 16.96 | 77.28 | 47.21 | 0.60 |
| YOLOv331 | 46.62 | 11.93 | 33.34 | 65.00 | 0.65 |
| YOLOv432 | 25.20 | 9.56 | 54.52 | 19.83 | 0.32 |
| YOLOv5 | 71.73 | 44.98 | 78.15 | 65.37 | 0.71 |
| YOLOX33 | 51.17 | 26.42 | 45.84 | 58.21 | 0.52 |
| YOLOv719 | 69.73 | 41.60 | 81.91 | 62.83 | 0.71 |
| YOLOv8 | 70.39 | 37.21 | 66.86 | 68.07 | 0.68 |
| YOLOv934 | 72.40 | 48.12 | 66.61 | 72.87 | 0.69 |
| YOLOv1035 | 68.38 | 37.19 | 52.26 | 66.92 | 0.63 |
| Ours | 75.85 | 45.48 | 74.19 | 71.82 | 0.73 |
Significant values are in bold.
Figure 9 shows the Precision-Recall curves of each method on this dataset. The curve of our proposed method (shown in red) lies outside the other curves, demonstrating superior performance compared to YOLOv9, YOLOv5, YOLOv8, and the remaining methods.
Fig. 9.

Precision-recall comparison of methods on the Lvtn dataset.
Figure 10 presents the F1 score and mAP for each method on the Lvtn dataset. Our proposed method achieves the highest F1 score and mAP, outperforming YOLOv5, YOLOv9, YOLOv7, and other methods.
Fig. 10.

F1-mAP comparison of methods on the Lvtn dataset.
Visualization
To fully demonstrate the impact of the proposed attention mechanism, we present a heat-map analysis, as shown in Fig. 11. The first row displays the original images, the second row shows the heat maps of YOLOv7, and the third row presents the heat maps after incorporating ResGAM into YOLOv7. In these heat maps, the target area is marked in dark red and becomes more concentrated with the introduction of ResGAM. This indicates that ResGAM effectively directs the model's focus to specific regions of interest in the image. As a result, integrating this attention mechanism enhances the model's ability to concentrate on relevant areas, leading to more accurate detection results.
Fig. 11.
Heat map: (a) original image, (b) heat map of YOLOv7, (c) heat map of YOLOv7+ResGAM.
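As a rough illustration of how such heat maps can be generated (the paper does not specify the exact visualization tool), the sketch below averages a chosen intermediate feature map over channels, normalizes it, and overlays it on the input image; function and variable names are our own.

```python
import cv2
import numpy as np
import torch

def activation_heatmap(feature: torch.Tensor, image_bgr: np.ndarray) -> np.ndarray:
    """Overlay a channel-averaged activation map on the input image.

    feature:   (1, C, h, w) feature map taken from an intermediate layer.
    image_bgr: (H, W, 3) uint8 image as loaded by cv2.imread.
    """
    amap = feature.squeeze(0).mean(dim=0)                       # (h, w) channel average
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-6)
    amap = (amap.detach().cpu().numpy() * 255).astype(np.uint8)
    amap = cv2.resize(amap, (image_bgr.shape[1], image_bgr.shape[0]))
    heat = cv2.applyColorMap(amap, cv2.COLORMAP_JET)            # dark red = strong response
    return cv2.addWeighted(image_bgr, 0.5, heat, 0.5, 0)
```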
This section presents the detection results of our proposed method compared to mainstream methods. Figure 12 demonstrates the effectiveness of these methods in detecting targets at different scales, under various illumination conditions, and with occlusions. The figure is organized into four columns: the first column displays large targets, the second and third columns show medium targets with occlusion and varying illumination, and the fourth column highlights small targets with occlusion.
Fig. 12.
Grape clusters detection results for different methods: (a) YOLOv3, (b) YOLOv5, (c) YOLOX, (d) YOLOv7, (e) YOLOv8, (f) YOLOv9, (g) YOLOv10, (h) ours. Note that all the models have been retrained on the GrappoliV2 dataset.
The figure illustrates that our method effectively detects grape cluster targets in various environments with high confidence. It accurately identifies all large grape cluster targets in the first column, demonstrating higher confidence than other methods. In contrast, YOLOv3, YOLOX, YOLOv7, and YOLOv10 detect incomplete grape clusters as complete clusters. For medium-sized targets with occlusion, shown in the second and third columns, and under different lighting conditions, our method maintains high confidence in detecting all grape cluster targets. On the other hand, YOLOv8 and YOLOv10 miss some detections, while YOLOv3, YOLOX, YOLOv7, and YOLOv9 incorrectly identify incomplete grape clusters. YOLOv5 and YOLOv9 also misidentify the trunk as a grape cluster. For small targets with occlusion in the fourth column, YOLOv8 fails to detect all grape cluster targets, whereas other methods succeed. Notably, our method consistently demonstrates higher confidence in these scenarios.
In summary, YOLOv8 and YOLOv10 exhibit reasonable Precision but suffer from low Recall, as they fail to detect all grape cluster targets. In contrast, YOLOv3, YOLOX, YOLOv5, YOLOv7, and YOLOv9 can detect grape cluster targets, but their accuracy is limited. Our method stands out with the best detection performance, offering superior Precision and Recall compared to other methods. It successfully detects all grape cluster targets and demonstrates higher confidence, particularly in identifying small targets.
Conclusion
We propose a multi-scale feature fusion and augmentation YOLOv7 network, and introduce MSFEM, RFAM, SPPCSPCF and ResGAM modules to improve the accuracy of grape cluster detection in complex environments. The experimental results demonstrate that the proposed method achieved a 5.39% increase in mAP on the GrappoliV2 dataset and a 6.12% increase in mAP on the Lvtn dataset. Our method significantly outperforms current mainstream object detection methods in accuracy. Furthermore, tests in real grape field environments confirm that our method effectively detects grape cluster targets at various scales, even in challenging conditions such as varying lighting and occlusions.
However, it is important to note that the modifications introduced to improve detection accuracy increase the number of model parameters. Future work will concentrate on reducing parameters, improving detection speed, and enhancing generalization ability while maintaining or even improving network accuracy.
Acknowledgements
This work was supported by the Natural Science Foundation of China (62462001), the Natural Science Foundation of Ningxia Province (2023AAC03264), the Innovation Team of Image and Intelligent Information Processing of National Ethnic Affairs Commission. We would like to thank the journal’s editors and anonymous reviewers for their critical comments and valuable suggestions to improve the quality of this article.
Author contributions
Jinlin Ma and Silong Xu conceived the idea. Silong Xu, Hong Fu, and Baobao Lin carried out the simulations, analyzed the data, conducted the experiments, and drafted the manuscript. Ziping Ma participated in data analysis and made revisions. All authors discussed the results and approved the final manuscript.
Data availability
The datasets used and analyzed in this study were all publicly available datasets.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Jinlin Ma, Email: majinlin@nmu.edu.cn.
Ziping Ma, Email: 2206041@num.edu.cn.
References
- 1.Hamim, M. A., Tahseen, J., Hossain, K. M. I., Akter, N. & Asha, U. F. T. Bangladeshi fresh-rotten fruit & vegetable detection using deep learning deployment in effective application. In 2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI). 233–238 (IEEE, 2023).
- 2.Wang, A. et al. Nvw-yolov8s: An improved yolov8s network for real-time detection and segmentation of tomato fruits at different ripeness stages. Comput. Electron. Agric. 219, 108833 (2024).
- 3.Yi, C., Wu, W., Yang, L. & Jia, R. Research on fruit recognition method based on improved yolov4 algorithm. In 2023 IEEE 2nd International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA). 1892–1901 (IEEE, 2023).
- 4.Wei, S., Jun, S. Y., Chen, L. Z. & Jing, G. Accurate recognition of kiwifruit based on improved yolov5. In 2023 5th International Conference on Natural Language Processing (ICNLP). 103–107 (IEEE, 2023).
- 5.Luo, Q., Zhang, Z., Yang, C. & Lin, J. An improved soft-cbam-yolov5 algorithm for fruits and vegetables detection and counting. In 2023 IEEE 6th International Conference on Pattern Recognition and Artificial Intelligence (PRAI). 187–192 (IEEE, 2023).
- 6.Zhang, Y. et al. Complete and accurate holly fruits counting using yolox object detection. Comput. Electron. Agric. 198, 107062 (2022).
- 7.Du, X. et al. Dsw-yolo: A detection method for ground-planted strawberry fruits under different occlusion levels. Comput. Electron. Agric. 214, 108304 (2023).
- 8.Shahzad, M. O., Aqeel, A. B. & Qureshi, W. S. Detection of grape clusters in images using convolutional neural network. In 2023 International Conference on Robotics and Automation in Industry (ICRAI). 1–6 (IEEE, 2023).
- 9.Kavithamani, V. et al. Advanced grape leaf disease detection using neural network. In 2023 Second International Conference on Electronics and Renewable Systems (ICEARS). 949–954 (IEEE, 2023).
- 10.Ariza-Sentís, M., Wang, K., Cao, Z., Vélez, S. & Valente, J. Grapemots: UAV vineyard dataset with mots grape bunch annotations recorded from multiple perspectives for enhanced object detection and tracking. Data Brief 54, 110432 (2024).
- 11.Liu, B. et al. An improved lightweight network based on deep learning for grape recognition in unstructured environments. Inf. Process. Agric. 11, 202–216 (2024).
- 12.Marani, R., Milella, A., Petitti, A. & Reina, G. Deep neural networks for grape bunch segmentation in natural images from a consumer-grade camera. Precis. Agric. 22, 387–413 (2021).
- 13.Chen, J. et al. Efficient and lightweight grape and picking point synchronous detection model based on key point detection. Comput. Electron. Agric. 217, 108612 (2024).
- 14.Szegedy, C. et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9 (2015).
- 15.Ding, X., Zhang, X., Han, J. & Ding, G. Diverse branch block: Building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10886–10895 (2021).
- 16.Zheng, Z., Hu, Y., Li, X. & Huang, Y. Autonomous navigation method of jujube catch-and-shake harvesting robot based on convolutional neural networks. Comput. Electron. Agric. 215, 108469 (2023).
- 17.Yu, Z. et al. Yolo-facev2: A scale and occlusion aware face detector. Pattern Recognit. 155, 110714 (2024).
- 18.He, K., Zhang, X., Ren, S. & Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1904–1916 (2015).
- 19.Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7464–7475 (2023).
- 20.Liu, Y., Shao, Z. & Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv preprint arXiv:2112.05561 (2021).
- 21.Yu, F. & Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
- 22.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778 (2016).
- 23.Davide. Grappoliv2 Dataset. https://universe.roboflow.com/davide/grappoliv2 (2021). Accessed 30 July 2024.
- 24.HoangNhatVu. Lvtn Dataset. https://universe.roboflow.com/hoangnhatvu/lvtn-oqur4 (2023). Accessed 30 July 2024.
- 25.Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7132–7141 (2018).
- 26.Wang, Q. et al. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11534–11542 (2020).
- 27.Hou, Q., Zhou, D. & Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13713–13722 (2021).
- 28.Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV). 3–19 (2018).
- 29.Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2016).
- 30.Liu, W. et al. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. 21–37 (Springer, 2016).
- 31.Redmon, J. & Farhadi, A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
- 32.Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
- 33.Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).
- 34.Wang, C.-Y., Yeh, I.-H. & Liao, H.-Y. M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616 (2024).
- 35.Wang, A. et al. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458 (2024).