Abstract
Wheat is significantly impacted by fungal diseases, which result in severe economic losses. These diseases arise from pathogenic spores invading wheat. Rapid and accurate detection of these spores is essential for post-harvest contamination risk assessment and early warning. Traditional detection methods are time-consuming and labour-intensive and struggle to detect small target spores in complex environments. Therefore, a YOLO-ASF-MobileViT detection algorithm is proposed to detect pathogenic wheat spores with varying sizes, shapes, and textures. Four types of common pathogenic wheat spores are used as the study objects: Fusarium graminearum, Aspergillus flavus, Tilletia foetida (sporidium maturum), and Tilletia foetida (sporidium immaturum). Attentional Scale Sequence Fusion (ASF) is integrated into the original YOLOv5s to enhance the capture of small details in spore images and to fuse the multi-scale feature information of spores. Additionally, the Mobile Vision Transformer (MobileViT) attention mechanism is incorporated to enhance both local and global feature extraction for small spores. Experimental results show that the proposed YOLO-ASF-MobileViT model achieves an overall mAP@0.5 of 97.0%, outperforming advanced detectors such as TPH-YOLO (95.6%) and MG-YOLO (95.5%). Compared with the baseline YOLOv5s model, it improves the average detection accuracy by 1.6%, with a notable 4.3% increase in detecting small Aspergillus flavus spores (reaching 90.8%). The model maintains high robustness in challenging scenarios such as spore adhesion, occlusion, blur, and noise. This approach enables efficient and accurate detection of wheat fungal spores, supporting early contamination warning in post-harvest management.
Keywords: Wheat, Fungal spore detection, Feature fusion, Attention mechanism, YOLO
Introduction
Wheat is an important food crop that provides an abundance of carbohydrates, essential amino acids, and minerals, making it a crucial source of energy and nutrients for humans [1, 2]. Wheat is widely grown worldwide, and its healthy growth is crucial for food security [3]. The growth process of wheat is susceptible to a variety of fungal stresses that can lead to disease [4]. The three diseases explored in this paper—Fusarium head blight (FHB), common bunt, and Aspergillus flavus (A. flavus) infection—not only significantly reduce the yield and quality of wheat but also pose severe health risks and have widespread, lasting impacts due to spore transmission. FHB, caused by Fusarium graminearum (F. graminearum), manifests as pink-orange sporodochia on infected spikelets and may cause shriveled kernels and weight loss [5]. Historical data from 1919 to the mid-1990s revealed yield losses of 15–29% in India, 30–40% in the Yangtze River Basin of China, and even 70% in parts of Romania [6, 7]. Wheat infected with FHB also produces mycotoxins, especially deoxynivalenol (DON), which seriously threaten the health of animals and humans [8]. Common wheat bunt is caused by Tilletia foetida (T. foetida). It is characterized by bunt balls—infected kernels filled with dark brown to black teliospores that emit a fetid odour at harvest [9, 10]. Common bunt also reduces the weight, germination rate and vigour of wheat seeds [11]. In addition, wheat can be infected by A. flavus, which is characterized by yellow mould on the surfaces of wheat kernels. A. flavus produces aflatoxins B1 and B2 (AFB1 and AFB2, respectively), which form a stable class of carcinogenic mycotoxins that pose a deadly threat to human health [12]. Fungal spores are a major source of inoculum in wheat fungal diseases. Other transmission pathways include infected seeds, crop residues, and rain-splash dispersal [13, 14]. This study focuses on the post-harvest stage, where rapid spore detection supports early warning and contamination risk assessment during storage and procurement.
Manual visual rating of wheat ears and grains is labour-intensive, low-throughput, and prone to subjective bias, with inter-rater inconsistencies and frequent underestimation or overestimation of disease severity [15], making it inefficient [16, 17]. To solve this problem, methods that use image processing and deep learning to assess the severity of wheat diseases have attracted growing attention in recent years [18]. A spore, as a propagule in the life cycle of a fungus, can provide reliable early warning of potential contamination during post-harvest handling [19]. Traditional methods such as PCR and ELISA offer high specificity and sensitivity [20, 21]. However, their reliance on multi-step preparation, specialised equipment, and trained personnel, along with turnaround times of several hours, limits their use for rapid, large-scale screening in post-harvest settings [22]. In contrast, image-based spore detection offers a faster and more accessible alternative for on-site early warning [17].
Deep learning algorithms have been widely used to classify spore microscopy images. Qamar et al. conducted a classification study on bacterial spore images acquired via transmission electron microscopy (TEM) [23]. They used a convolutional neural network (CNN) for feature extraction combined with a random forest (RF) for classification, ultimately achieving 73% accuracy, and thereby proposed an innovative approach that pairs deep learning-based feature extraction with machine learning-based classification for spores. Crespo-Michel et al. presented novel deep learning algorithms for classifying four species of grapevine wood fungi [24]. They used four different neural network architectures, including ResNet-50, VGG-16, MobileNet, and InceptionV3, to classify spore microscopy images with accuracies of up to 97.4%. The above methods have made some progress in spore image classification. However, classification alone is insufficient for large-scale spore detection and counting tasks [25]. In contrast, a target detection approach based on deep learning can achieve multi-target classification and counting, enabling the completion of large-scale, diverse spore detection tasks [17].
In recent years, target detection algorithms have made significant progress in fungal spore microscopy image detection, but most of them have focused on detecting a single type of pathogenic spore. For example, Zhang et al. proposed an improved You Only Look Once version 5 (YOLOv5) model for detecting F. graminearum spores [26]. They incorporated efficient channel attention (ECA) and adaptive spatial feature fusion (ASFF) mechanisms into their model to effectively address the small sizes and limited features of spores. The method achieved remarkable results in detecting F. graminearum spores, with an average accuracy of 98.57%. Li et al. proposed a high-precision, lightweight detection method based on YOLOv5 for detecting downy mildew spores in complex backgrounds [27]. They introduced a normalization-based attention module (NAM) and receptive field blocks (RFB-s) with dilated convolution to improve the detection accuracy attained for small target spores, and their approach ultimately achieved a mean average precision (mAP) of 95.6%. Wei et al. introduced the Faster region-based CNN (R-CNN), which used ResNet-50 and two small-scale region proposal networks (RPNs) to extract features for detecting microscopic hyperspectral images of thick-wall spores, yielding an average accuracy of 94.68% [28]. The authors also used the model to detect small spores in complex backgrounds. The above studies show that existing methods have achieved significant improvements in detecting a single type of pathogenic spore, especially small spores, in complex contexts. Wheat, however, is often exposed to various fungal propagules, including airborne spores of different shapes and sizes, which are the main targets of image-based detection in post-harvest scenarios [29]. Larger spores are relatively easy to detect, whereas smaller spores are easily overlooked. Therefore, a method is needed for detecting multiple spore types in complex environments while improving the accuracy of small spore detection, thereby providing technical support for early warning of multiple fungal wheat diseases.
Recent studies have attained improved small spore detection performance by enhancing the feature extraction capabilities of models [30]. However, owing to the differences among the sizes, morphologies and textures of diverse spores, the performance of existing methods in detecting diverse small fungal spores still needs to be improved [31]. In this paper, we propose a novel YOLO-ASF-MobileViT spore detection method to solve these problems and realize rapid detection of small pathogenic wheat spores. The main objectives are as follows:
We propose an improved neck that introduces Attentional Scale Sequence Fusion (ASF). The ASF significantly enhances the capture of tiny details in spore images and fuses the multi-scale feature information of spores.
The Mobile Vision Transformer (MobileViT) attention mechanism is inserted before the P3 detection head to capture both the local and global features of small spores, reducing missed detections of small fungal spores.
A novel detection method called YOLO-ASF-MobileViT is proposed for detecting diverse small pathogenic wheat spores; extensive experiments show that YOLO-ASF-MobileViT maintains high robustness even when faced with spore adhesion, overlap, blurred contours, and impurity interference.
The method proposed in this paper not only provides a novel technique for detecting diverse pathogenic wheat spores but also effectively detects small spores, offering a new approach for the early detection of fungal wheat diseases.
Materials and methods
Data acquisition
Wheat is susceptible to fungal contamination throughout the crop cycle, including the sowing, vegetative and reproductive growth stages, as well as harvest, transportation, and storage [32, 33]. To simulate the actual detection process, we collected microscopy images of teliospores from T. foetida, macroconidia from F. graminearum, and conidia from A. flavus. For T. foetida, sori were dissected from diseased wheat samples, soaked in sterile distilled water, and shaken overnight. The suspension was filtered through double-layer gauze, washed twice with 0.25% sodium hypochlorite, and rinsed repeatedly with 0.01% sterile water. F. graminearum and A. flavus were cultured on potato dextrose agar (PDA) at 25–28 °C for 7–10 days. Spore suspensions were obtained by washing the sporulated cultures with 0.01% sterile water and filtering through gauze. Spore concentration was determined with a hemocytometer and calculated following standard protocols. The final concentration of each spore type was adjusted to 1 × 10⁶ spores/mL, and equal volumes of the three spore suspensions were mixed to prepare the dataset. The mixed spore solution is shown in Fig. 1 (A). The fungal samples used in this study were obtained from the Key Laboratory of Biotoxin Analysis & Assessment for State Market Regulation. An inverted microscopy system was used to capture microscopy images of the fungal spores. As shown in Fig. 1 (B), the system consisted of a CKX53 inverted fluorescence microscope (OLYMPUS, Japan), an EP50 microscopic digital camera (OLYMPUS, Japan), and a host computer. All microscopic images were captured at a fixed 40× magnification. The average sizes of the spores are as follows: teliospores of T. foetida, 16–25 μm [34]; macroconidia of F. graminearum, 25–50 μm × 3–5 μm [35]; and conidia of A. flavus, 3–6 μm [36].
Fig. 1.
Image acquisition process. (A) Spore suspension to be tested. (B) The inverted microscopy system. (C) Microscopic images of spores
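For reference, the standard hemocytometer arithmetic behind the 1 × 10⁶ spores/mL target can be sketched as below; the per-square counts and dilution factor are illustrative assumptions rather than values reported in this study.

```python
# Hedged sketch of standard Neubauer hemocytometer arithmetic: each large
# square covers 1 mm^2 at 0.1 mm chamber depth = 1e-4 mL, so the mean count
# per square x 1e4 x dilution factor estimates spores/mL.
def spores_per_ml(counts_per_square, dilution_factor=1.0):
    mean_count = sum(counts_per_square) / len(counts_per_square)
    return mean_count * 1e4 * dilution_factor

# Illustrative counts from four large squares of an undiluted suspension
print(spores_per_ml([98, 105, 101, 96]))  # 1.0e6 spores/mL, the target above
```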
Dataset preparation
Image labelling
To satisfy the model training requirements, we used Labelme software to label each image and distinguish the diverse spores. We manually classified the spores in the images into four categories: T. foetida (sporidium immaturum) spores were labelled “UNGXHF”; T. foetida (sporidium maturum) spores were labelled “GXHF”; F. graminearum spores were labelled “HGLD”; and A. flavus spores were labelled “HQM”. In this study, a total of 1015 microscopic images were acquired at 40× magnification using an inverted microscope, each with a resolution of 5540 × 3648 pixels. Although the visual distribution may vary among individual microscopy images owing to differences in spore morphology and settling, all spore types were mixed at equal concentrations (1 × 10⁶ spores/mL) and homogenized before sample preparation.
Data cleaning and augmentation
During the image acquisition process, improper slide preparation or imaging may have resulted in foreign objects within the slides, severe spore adhesion, image defocusing or ghosting. After discarding such images, 959 microscopy images of fungal wheat spores were retained. Given the relatively small number of spore images, data augmentation was used to expand the dataset. We applied several augmentation methods: flipping (horizontal and vertical), luminance transformation, and contrast transformation [37]. These methods expanded the dataset to 2330 images, increasing the complexity and diversity of the data samples, helping prevent model overfitting, and improving the generalization performance of the model.
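A minimal offline-augmentation sketch of the operations named above might look as follows; the file names and enhancement ranges are illustrative assumptions, and in a detection setting the bounding-box labels must be flipped together with the images.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def augment(img):
    """Return the four augmented variants named in the text."""
    return [
        ImageOps.mirror(img),                                            # horizontal flip
        ImageOps.flip(img),                                              # vertical flip
        ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3)),  # luminance
        ImageEnhance.Contrast(img).enhance(random.uniform(0.7, 1.3)),    # contrast
    ]

img = Image.open("spore_0001.jpg")  # hypothetical file name
for i, aug in enumerate(augment(img)):
    aug.save(f"spore_0001_aug{i}.jpg")
```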
Dataset partitioning
The enhanced wheat spore images were divided into training, validation and test sets at a 6:2:2 ratio. The dataset comprised 1,398 training images, 466 validation images, and 466 test images. This division scheme ensured that sufficient images were available for efficiently constructing the model. The validation set was used to adjust the hyperparameters of the model and improve its generalizability, whereas the test set was used to evaluate the robustness and accuracy of the model.
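A minimal sketch of this 6:2:2 split (the shuffle seed and file names are assumptions) reproduces the stated subset sizes:

```python
import random

images = [f"img_{i:04d}.jpg" for i in range(2330)]  # placeholder file names
random.Random(42).shuffle(images)                   # fixed seed, an assumption
n_train, n_val = int(0.6 * len(images)), int(0.2 * len(images))
train = images[:n_train]                            # 1398 images
val = images[n_train:n_train + n_val]               # 466 images
test = images[n_train + n_val:]                     # 466 images
print(len(train), len(val), len(test))              # 1398 466 466
```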
The proposed YOLO-ASF-MobileViT architecture for spore detection
In this study, we proposed a novel YOLO model to detect pathogenic fungal wheat spores, especially small spores such as those of A. flavus, more efficiently and accurately. The ASF mechanism replaced the original FPN in YOLOv5s, allowing the model to capture local detail features while also attending to global features, which improved the detection accuracy achieved for spores, especially small ones. The MobileViT attention mechanism was added in front of the P3 detection head, which receives shallow-layer feature information from the TFE and SSFF modules and therefore carries rich local detail features; placing MobileViT before this shallow detection head provides a final optimization of these features. The improved YOLO-ASF-MobileViT is shown in Fig. 2. By combining these two methods, the detection accuracy attained for A. flavus spores was significantly improved, and the overall pathogenic wheat spore detection performance also improved.
Fig. 2.
The YOLO-ASF-MobileViT architecture
Attentional scale sequence fusion mechanism
In this study, an ASF mechanism was introduced to fuse multiscale spatial features [38]. This mechanism captured spore features of different sizes. The feature fusion mechanism comprised two key modules: a triple feature encoding (TFE) module designed to capture the features of dense, small spores, and a scale sequence feature fusion (SSFF) module used to integrate multiscale semantic information. This approach effectively improved the feature extraction performance achieved for spore images and provided more comprehensive and accurate feature information for the subsequent detection step. To verify the effectiveness of the introduced ASF mechanism, we compared the performance of the original YOLOv5s model with that of YOLO-ASF on the same test set.
Triple feature encoding module
The traditional feature pyramid network (FPN) only up-sampled small-sized feature maps and then added them to the previous layer, thus ignoring the detailed information of large-sized feature maps. Therefore, the TFE module was introduced to capture both the detailed and overall features of spore images for accurately localizing and regressing the target bounding boxes. The TFE module extracted the P3, P4, and P5 feature maps from the backbone of YOLOv5s. Among these maps, the P3 feature map had the largest scale and smallest receptive field. It primarily captured the local detailed features of small spore images. The P4 feature map had a moderate scale and a medium-sized receptive field. It focused on the local detail features of large spore images. The P5 feature map had the smallest scale and the largest receptive field. It mainly represented the global semantic features of spore images.
The specific structure of the TFE module is shown in Fig. 3. After features were extracted from the P3 and P5 feature maps via 2D convolution, the feature maps were normalized in scale through downsampling and upsampling, respectively. A fusion operation combining max pooling and average pooling was employed to downsample the P3 feature map, which preserved the diversity of the high-resolution detail features. The nearest-neighbour interpolation method was used to upsample the P5 feature map, which mitigated the loss of information from the low-resolution features. The normalized P3 and P5 feature maps were consistent in scale with the P4 feature map. Convolution was used to extract information from the normalized P3 and P5 feature maps, and concatenation fused them with the P4 feature map. The fused feature map had three times as many channels as each of the original feature maps (P3, P4, and P5).
Fig. 3.
Triple feature encoding module
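The following PyTorch sketch illustrates the TFE idea described above. Channel widths and pooling details are simplified assumptions based on this description and the ASF-YOLO design [39], not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFE(nn.Module):
    """Normalize P3/P4/P5 to the P4 scale and concatenate them channel-wise."""
    def __init__(self, c3, c4, c5):
        super().__init__()
        self.cv3 = nn.Conv2d(c3, c4, 1)  # 1x1 convs align channel widths to P4
        self.cv4 = nn.Conv2d(c4, c4, 1)
        self.cv5 = nn.Conv2d(c5, c4, 1)

    def forward(self, p3, p4, p5):
        # P3: fuse max- and average-pooled halves to keep detail diversity
        p3 = self.cv3(p3)
        p3 = F.max_pool2d(p3, 2) + F.avg_pool2d(p3, 2)
        # P5: nearest-neighbour upsampling to the P4 scale
        p5 = F.interpolate(self.cv5(p5), scale_factor=2, mode="nearest")
        # Concatenation triples the channel count relative to P4
        return torch.cat([p3, self.cv4(p4), p5], dim=1)

p3, p4, p5 = torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)
print(TFE(128, 256, 512)(p3, p4, p5).shape)  # torch.Size([1, 768, 40, 40])
```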
Scale sequence feature fusion
The wheat spore images were characterized by their sizes, shapes, and colours. The A. flavus spores were smaller than the T. foetida spores, whereas the F. graminearum spores were more elongated. The tiny features of the A. flavus spores could be clearly recognized in the high-resolution feature maps. However, as the network depth increased, the feature map resolution decreased, which led to the loss of these tiny features. The traditional FPN fusion method failed to effectively utilize the relationships between different levels of the pyramid feature maps, which resulted in a lack of communication between the local and global information. Therefore, we introduced the SSFF module, which was inspired by video frame processing methods [39]. It performed interactive fusion on multiscale information by stitching multiscale feature maps and utilized 3D convolution for feature extraction. The specific structure of this module is shown in Fig. 4. First, 1 × 1 2D convolutions were used to adjust the numbers of channels in the P4 and P5 feature maps, ensuring that both maps had 256 channels. Next, the nearest-neighbour up-sampling method was used to normalize the sizes of the P3, P4, and P5 feature maps to match the feature map with the highest resolution (P3). Then, the 3D [height, width, channel] tensors were stacked into 4D [depth, height, width, channel] tensors. 3D convolution, batch normalization, and the leaky rectified linear unit (LeakyReLU) activation function were used to extract features from the 4D tensors. Finally, maximum pooling was used to down-sample the feature maps, and the down-sampled outputs were fused with the output of the TFE module.
Fig. 4.
Scale sequence feature fusion
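A simplified PyTorch sketch of this SSFF pipeline is given below; the 256-channel width and nearest-neighbour upsampling follow the text, while the 3D kernel size and the final pooling step are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSFF(nn.Module):
    """Stack P3/P4/P5 along a new scale axis and fuse them with 3D convolution."""
    def __init__(self, c4, c5, c=256):
        super().__init__()
        self.cv4 = nn.Conv2d(c4, c, 1)   # 1x1 convs set P4/P5 to 256 channels
        self.cv5 = nn.Conv2d(c5, c, 1)
        self.conv3d = nn.Conv3d(c, c, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(c)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, p3, p4, p5):       # p3 assumed to already have 256 channels
        h, w = p3.shape[-2:]
        p4 = F.interpolate(self.cv4(p4), size=(h, w), mode="nearest")
        p5 = F.interpolate(self.cv5(p5), size=(h, w), mode="nearest")
        x = torch.stack([p3, p4, p5], dim=2)        # 4D tensor: (B, C, depth, H, W)
        x = self.act(self.bn(self.conv3d(x)))       # 3D conv across the scale axis
        return F.max_pool3d(x, (3, 1, 1)).squeeze(2)  # pool the scale axis away

p3 = torch.randn(1, 256, 80, 80)
p4, p5 = torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)
print(SSFF(256, 512)(p3, p4, p5).shape)             # torch.Size([1, 256, 80, 80])
```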
MobileViT attention mechanism
The traditional attention mechanisms focused on the channel and spatial aspects of images, whereas the ASF mechanism fused multichannel and multiscale information. Therefore, we shifted the focus of the attention mechanism to global semantic information and local detail information. We introduced the MobileViT attention mechanism, which is based on the ViT [41]. Compared with the traditional ViT, the MobileViT used a transformer to extract global features and a CNN to extract local features, endowing the model with spatial inductive bias and global representation capabilities [42]. The structure of the MobileViT is shown in Fig. 5. It included local feature representation, global feature representation, and feature fusion components. In the local feature representation module, an n × n convolution was used to extract local features from the input feature map, and a 1 × 1 convolution was employed to adjust the number of channels. In the global feature representation module, the feature maps were divided into multiple fixed-size blocks, and each block was flattened into a one-dimensional vector. Through linear transformation, these one-dimensional vectors were mapped to higher-dimensional feature vectors. These feature vectors, carrying positional encoding information, were subsequently concatenated and fed into the transformer encoder. The transformer encoder incorporated a multihead attention mechanism and a position-wise feedforward neural network: the self-attention mechanism captured the dependencies between different positions in the input image, and the feedforward network, with multiple hidden layers, performed nonlinear mapping and feature extraction on the feature information at each position. Finally, the feature map processed by the ViT module was fused with the original feature map. Through the MobileViT attention mechanism, the local features were better preserved, and global features were obtained, enabling the model to fully utilize both global semantic information and local details [42]. This approach improved the accuracy of wheat spore detection, particularly for the smaller A. flavus spores.
Fig. 5.
MobileViT attention mechanism
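A compact sketch of a MobileViT-style block along these lines is shown below; the patch size, embedding width, depth, and head count are illustrative, and positional encoding is omitted for brevity (see [42] for the reference design).

```python
import torch
import torch.nn as nn

class MobileViTBlock(nn.Module):
    """Local conv representation + transformer over unfolded patches + fusion."""
    def __init__(self, c, dim=96, patch=2, depth=2, heads=4):
        super().__init__()
        self.local = nn.Sequential(          # local representation: n x n then 1 x 1
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c, dim, 1))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.global_rep = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Conv2d(dim, c, 1)
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)  # fuse with the input map
        self.patch = patch

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch
        y = self.local(x)
        # unfold into (B * patch_area, num_patches, dim) token sequences
        y = y.reshape(b, -1, h // p, p, w // p, p).permute(0, 3, 5, 2, 4, 1)
        y = y.reshape(b * p * p, (h // p) * (w // p), -1)
        y = self.global_rep(y)               # self-attention over patch tokens
        # fold the tokens back into a feature map
        y = y.reshape(b, p, p, h // p, w // p, -1).permute(0, 5, 3, 1, 4, 2)
        y = y.reshape(b, -1, h, w)
        return self.fuse(torch.cat([x, self.proj(y)], dim=1))

print(MobileViTBlock(128)(torch.randn(1, 128, 80, 80)).shape)  # (1, 128, 80, 80)
```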
In this study, we conducted ablation experiments to validate the performance of the introduced MobileViT attention mechanism in spore detection, particularly its ability to detect small spores. We compared it with the SE (Squeeze-and-Excitation) network [43], SA (Shuffle Attention) [44], ECA (Efficient Channel Attention) [45] and the CBAM (Convolutional Block Attention Module) [46].
Model evaluation
In this study, we compared the improved algorithm (YOLO-ASF-MobileViT) with the original YOLOv5s algorithm. First, we compared the mAP_0.5 iteration curves of these models to evaluate their convergence during training. Second, to verify the robustness of the proposed model, we evaluated its ability to detect spores under different extreme conditions, including adhesive spores, overlapping spores, blurry spores, and spores in environments with impurity interference. Finally, we verified the effectiveness of the improved algorithm for spore detection through a comparison with nine detection algorithms: SSD [47], YOLOv5s [48], CenterNet [49], the Faster R-CNN [50], TPH-YOLO [51], MG-YOLO [52], SPD-YOLO [53], YOLO-ECA-ASFF [26], and YOLO-CG-HS [37]. Each model was trained, validated, and tested under the same conditions. Several metrics were utilized to comprehensively evaluate the performance of the tested models, including precision, recall, average precision (AP), and mAP. Precision measures the classification accuracy achieved for positive categories, and recall measures the ability of each model to detect positive samples. AP is obtained from the precision–recall curve for a single category, reflecting the precision of the corresponding model at different recall levels, and mAP is the average of the AP values across multiple categories. The precision, recall, AP, and mAP metrics are calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$AP = \int_{0}^{1} P(r)\,\mathrm{d}r \tag{3}$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i} \tag{4}$$
where TP represents the number of true-positive samples and FP represents the number of false-positive samples. Similarly, TN represents the number of true-negative samples, and FN represents the number of false-negative samples. Additionally, r denotes the recall, and P(r) represents the precision corresponding to r. N denotes the number of data categories detected in this study.
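A minimal numerical sketch of Eqs. (1)–(4) follows; the per-class AP values are taken from the proposed model's row in Table 3, and the trapezoidal integration is one common approximation of Eq. (3).

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Eqs. (1)-(2): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Eq. (3): trapezoidal integration of P(r) over the recall axis."""
    order = np.argsort(recalls)
    r, p = np.asarray(recalls, float)[order], np.asarray(precisions, float)[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))

# Eq. (4): mAP as the mean per-class AP; values below are the per-class
# AP_0.5 scores of the proposed model from Table 3
ap = {"GXHF": 0.991, "HGLD": 0.994, "HQM": 0.908, "UNGXHF": 0.988}
print(sum(ap.values()) / len(ap))   # 0.97025 -> the reported 97.0% mAP_0.5
```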
Experimental platform
In this study, all the models were trained and tested in the following experimental environment: the deep learning framework was PyTorch 1.12.1, and the operating system was Ubuntu 16.04. The computer was equipped with an Intel(R) Core(TM) i7-9700K 8-core processor at 3.60 GHz and 16 GB of RAM. To increase the training speed, we utilized an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of video memory, CUDA version 11.3, and cuDNN version 8.2.1.
Results and discussion
Comparison between the original and improved feature fusion mechanisms
The results of the comparison between the original and improved feature fusion mechanisms are shown in Table 1. Compared with the original model, the overall precision of YOLO-ASF increased by 3%, and its overall recall increased by 0.6%. For the detection of F. graminearum, A. flavus, and T. foetida (sporidium immaturum), the mAP_0.5 values increased by 2%, 2.7%, and 0.3%, respectively. For T. foetida (sporidium maturum), however, the mAP_0.5 of the improved model did not increase. Overall, the mAP_0.5 increased by 1.1%. These results show that the YOLO-ASF model detected spores more accurately than the original model did. Notably, for small and dense A. flavus spores, the introduced ASF mechanism increased the mAP_0.5 by 2.7%, significantly enhancing the model's performance in detecting small target spores. This improvement was attributed to the TFE and SSFF modules contained in the ASF mechanism. In the TFE module, three feature maps with large, medium, and small sizes were each convolved once and spliced in the channel dimension, which prevented the loss of small spore features. Moreover, the global semantic information contained in the deep feature maps was combined with the local detail information of the SSFF module, which enabled the model to understand the image content more comprehensively. The ASF mechanism significantly improved the ability of the model to extract target features, which in turn improved its detection performance. For T. foetida (sporidium maturum) spores, however, the original model already achieved a detection accuracy of 99.1% because of their relatively large size and distinctive colour and texture features; the detection performance was already close to optimal and was difficult to improve further under the current conditions.
Table 1.
Comparison of feature fusion mechanisms before and after improvement
| Models | P (%) | R (%) | mAP_0.5 (%) | | | | |
|---|---|---|---|---|---|---|---|
| | | | GXHF | HGLD | HQM | UNGXHF | Total |
| Original YOLOv5s | 91.8 | 94.2 | 99.1 | 97.5 | 86.5 | 98.3 | 95.4 |
| YOLO-ASF | 94.8 | 94.8 | 99.1 | 99.5 | 89.2 | 98.6 | 96.5 |
We applied gradient-weighted class activation mapping (Grad-CAM) to further analyse the performance of YOLO-ASF. Grad-CAM is a visualization technique used to interpret neural network predictions [54]; it uses gradient information to highlight the regions of an image that most strongly influence the model's output. The detection weights yielded by the original YOLOv5s model and the YOLO-ASF model for the four spore classes in the same image are visualized in Fig. 6. The four left panels of Fig. 6 visualize the weights of the original YOLOv5s model, whereas the four right panels visualize the weights of the YOLO-ASF model. The first and third rows of Fig. 6 show that the original YOLOv5s model did not sufficiently focus on spores, and its weights covered a wide and imprecise area, even including areas without spores. Where the spore distribution was dense, this model also ignored certain spores. Similarly, the two images in the fourth row of Fig. 6 show that the original YOLOv5s model did not accurately focus on small objects, and many small impurities were incorrectly detected as spores. In contrast, the YOLO-ASF model could focus on small spores and selectively filter out impurities. The TFE module integrated small spore features derived from three feature maps of different resolutions, enriching the spore features. The SSFF module combined the detailed information and global information acquired from the feature maps, allowing for the precise localization of small spores. In summary, the ASF mechanism comprehensively improved the detection performance of the proposed model, especially for small spores.
Fig. 6.
Grad-CAM visualization comparison. (A) and (B) show the Grad-CAM visualizations of the original YOLOv5s and YOLO-ASF models, respectively
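For readers who wish to reproduce this kind of visualization, a self-contained Grad-CAM sketch using forward/backward hooks is given below, following [54]; the ResNet-18 stand-in backbone, target layer, and random input are assumptions, since the paper's own visualizations hook into the YOLO backbone.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(weights=None).eval()  # stand-in backbone
target_layer = model.layer4[-1]
acts, grads = {}, {}

# Capture the target layer's activations and the gradients flowing back into it
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # placeholder input image
score = model(x)[0].max()            # score of the top-scoring class
score.backward()

w = grads["v"].mean(dim=(2, 3), keepdim=True)         # channel weights: GAP over gradients
cam = F.relu((w * acts["v"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
print(cam.shape)                     # (1, 1, 224, 224) heat map over the input
```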
Comparison among different attention mechanisms
A comparison among different attention mechanisms is shown in Table 2. All performance metrics reported in Table 2 were evaluated on the independent test set. Compared with the YOLO-ASF model, only YOLO-ASF-CBAM and YOLO-ASF-MobileViT improved A. flavus spore detection, with their mAP_0.5 values increasing by 0.7% and 1.6%, respectively. This occurred because attention mechanisms such as SE, SA and ECA focused on the dependencies between channels but ignored spatial information. In contrast, the CBAM combined a spatial attention mechanism with a channel attention mechanism, which enabled the model to obtain more effective features from both the channel and spatial dimensions. Both the CBAM and MobileViT attended to spatial and channel information, which significantly improved the resulting small target detection performance. Notably, the precision and recall of the YOLO-ASF-CBAM model were superior to those of the YOLO-ASF-MobileViT model, but the mAP_0.5 of the former was lower than that of the latter. The CBAM could correctly detect the targets, but it failed to precisely regress the bounding boxes because of its inaccurate localization. The MobileViT did not involve pooling operations; therefore, it better preserved the global semantic information, which allowed the target boxes to be predicted more accurately. In object detection tasks, we paid more attention to mAP_0.5, as this metric considers the precision and recall values achieved at different confidence levels and thus better reflects overall model performance. The YOLO-ASF-MobileViT model performed best, with an overall detection mAP_0.5 of 97.0%, which was 1.6% better than that of the original YOLOv5s model. The MobileViT attention mechanism combined the local feature extraction ability of a CNN with the global feature extraction ability of a ViT, making it more suitable for detecting various spores with significant size differences.
Table 2.
Comparison of the different attention mechanisms
| Models | P (%) | R (%) | mAP_0.5 (%) | | | | |
|---|---|---|---|---|---|---|---|
| | | | GXHF | HGLD | HQM | UNGXHF | Total |
| YOLO-ASF | 94.8 | 95.4 | 99.0 | 99.5 | 89.2 | 98.6 | 96.5 |
| YOLO-ASF-SE | 94.5 | 95.4 | 98.8 | 99.2 | 88.4 | 98.4 | 96.2 |
| YOLO-ASF-SA | 94.8 | 95.1 | 98.7 | 99.4 | 89.1 | 98.7 | 96.5 |
| YOLO-ASF-ECA | 95.0 | 94.8 | 98.5 | 99.3 | 88.9 | 98.1 | 96.2 |
| YOLO-ASF-CBAM | 95.1 | 95.5 | 98.8 | 99.4 | 89.9 | 98.3 | 96.6 |
| YOLO-ASF-MobileViT | 94.8 | 95.4 | 99.1 | 99.4 | 90.8 | 98.8 | 97.0 |
Comparison between the performances achieved before and after improving the model
Model training and convergence performance comparison
A comparison of the training iteration curves produced before and after improving the model is shown in Fig. 7. Both models exhibited significant fluctuations in their mAP_0.5 values during the first 150 iterations. Because pretrained weights were not used in this experiment, the model parameters were randomly initialized, resulting in a poor initial data-fitting ability; consequently, the early training process was unstable. During the last 50 iterations, the mAP_0.5 values of both models gradually stabilized. Notably, after stabilization, the mAP_0.5 of the improved model consistently remained higher than that of the original YOLOv5s model. Additionally, the improved model exhibited smaller fluctuations in its mAP_0.5 values and converged more quickly to a relatively stable state. Figure 7 shows the training and validation performance curves, reflecting intermediate results on the validation set. In contrast, all the metrics in Table 2 were computed on the independent test set. Slight differences are expected, as the test set evaluates the fully converged model on unseen data, demonstrating its generalization.
Fig. 7.

The comparison of training iteration curves before and after model improvement
Model robustness performance comparison
Figures 8, 9, 10 and 11 show the detection performance of the model before and after the improvements under four different extreme conditions. First, spore adhesion is shown in Fig. 8. In sample 1 of Fig. 8, T. foetida spores were adhered to each other, and the original model misidentified the two adhered spores as three separate spores. In sample 2 of Fig. 8, T. foetida spores were adhered to F. graminearum spores, and the original model misidentified a single F. graminearum spore as two separate spores. The improved model significantly reduced the false detection rate attained for adherent spores. The original model was overly concerned with the local features of the spores and ignored their overall features, thus misidentifying local fragments as new spores. A case with overlapping spores is shown in Fig. 9. In sample 1 of Fig. 9, two F. graminearum spores overlapped; the original model could not accurately determine their target boxes and attempted to detect the targets with multiple target boxes. In sample 2 of Fig. 9, three T. foetida spores overlapped, with the middle spore partially covering the other two, causing them to form a single cluster. It was difficult for the original model to accurately locate their positions, which resulted in false detections. This occurred because the original model identified multiple local features within the overlapping area and attempted to label these features with multiple target boxes. These target boxes had different sizes, and the overlap between them was small; the intersection over union (IoU) between adjacent target boxes did not reach the filtering threshold of the non-maximum suppression (NMS) algorithm, which led to false detection results. Blurry spore images are shown in Fig. 10. In sample 1 of Fig. 10, the upper-right corner of the image shows a typical F. graminearum spore outline; however, owing to the influence of background lighting, its phenotypic features were not distinct. The original model therefore missed this target spore, whereas the improved model accurately detected the blurry spore. Similarly, in sample 2 of Fig. 10, the upper-right corner shows the outlines of two F. graminearum spores. The original model detected the two spores as a single entity, whereas the improved model correctly detected both. During model training, the size and resolution of the feature maps decreased as the model depth and the stride of the convolution operations increased. Because the blurred spores lacked clear edges and texture features, the deep feature maps of the original model easily lost their detail information. The impurity interference case is shown in Fig. 11. In sample 1 of Fig. 11, impurities appeared in the lower-right corner, and the original model misclassified them as F. graminearum. In sample 2 of Fig. 11, a spore-like impurity resembling A. flavus appeared in the lower-right corner, and the original model misclassified it as both A. flavus and T. foetida. In the impurity interference case, the backbone layers of the original model described the shape and texture features of the spores poorly, leading to misdetection.
Fig. 8.
The results of detecting adhesive spores
Fig. 9.
The results of detecting overlapping spores
Fig. 10.
The results of detecting blurry spores
Fig. 11.
The results of detecting spores in an environment with impurity interference
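The IoU/NMS failure mode described above can be reproduced with a minimal sketch; the box coordinates, scores, and 0.45 threshold are illustrative assumptions.

```python
import torch
from torchvision.ops import box_iou, nms

# Two boxes over the same spore cluster, as in the overlapping-spore case
boxes = torch.tensor([[10., 10., 60., 60.],
                      [45., 45., 95., 95.]])
scores = torch.tensor([0.9, 0.8])
print(box_iou(boxes[:1], boxes[1:]))           # tensor([[0.0471]]): low IoU
keep = nms(boxes, scores, iou_threshold=0.45)  # IoU < threshold, so both survive
print(keep)                                    # tensor([0, 1]) -> duplicate boxes
```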
The ASF mechanism and MobileViT attention mechanism introduced in this paper effectively solved the above problems. The MobileViT block was added before the P3 detection head. The self-attention mechanisms contained in the MobileViT were employed to capture the global relationships within features. Unlike traditional CNN algorithms with fixed receptive fields, this approach enabled the model to dynamically learn long-range dependencies between different regions. The MobileViT also incorporated CNNs as local feature extractors, effectively integrating local and global features. This improved the feature representation ability of the model and addressed the misdetection issues caused by spore adhesion, intersection, and overlap. In the ASF mechanism, the TFE module spliced features of three different sizes in the spatial dimension to capture detailed small spore information. The SSFF module effectively fused the multichannel feature maps of P3, P4, and P5, capturing the different spatial scales spanned by spores of various sizes and shapes. This method mitigated the loss of detail information. Compared with the original model, the improved model exhibited greater robustness and better detection capabilities under extreme conditions such as spore adhesion, overlap, impurity interference, and blur. However, in real-world samples, interference may arise from fungal spores of other genera commonly found in cereals, such as Alternaria, Bipolaris, and Penicillium [55]. This may affect the model’s specificity. Thus, expanding the spore classes used for training will be essential to further enhance detection performance under complex field conditions.
Performance comparisons with state-of-the-art detection methods
The quantitative results of the state-of-the-art detection methods on the test set are shown in Table 3, which contains the precision, recall and mAP_0.5 values produced by each model. Among the traditional one-stage detection networks, the mAP_0.5 of the improved model was 1.6%, 24.9%, and 5.1% higher than those of YOLOv5s, the SSD, and CenterNet, respectively. With respect to the traditional two-stage detection network, the mAP_0.5 of the improved model was 23.3% higher than that of the Faster R-CNN. The SSD performed poorly in detecting A. flavus spores, with a recall of only 60.1% and an AP_0.5 of just 18.1%, because the SSD algorithm lacks a feature resampling step, which easily leads to missed detections [47]. CenterNet significantly surpassed the SSD in overall detection performance, but its ability to detect A. flavus spores was relatively poor, with an AP_0.5 of only 76.4%. In two-stage detection networks such as the Faster R-CNN, a base neural network is utilized for feature extraction, and an RPN is employed on deeper feature maps to generate target boxes. The Faster R-CNN had the lowest AP_0.5 for A. flavus spores, at only 17.2%, because small spore features were lost from the deep feature maps.
Table 3.
Performance comparisons with state-of-the-art detection methods
| Models | P (%) | R (%) | AP_0.5 (%) | | | | | Inference time (ms) | Params (M) |
|---|---|---|---|---|---|---|---|---|---|
| | | | GXHF | HGLD | HQM | UNGXHF | Mean | | |
| YOLOv5s | 91.8 | 94.2 | 99.1 | 97.5 | 86.5 | 98.3 | 95.4 | 5.8 | 7.0 |
| CenterNet | 93.3 | 86.1 | 99.2 | 97.3 | 76.4 | 94.9 | 91.9 | 24.8 | 54.6 |
| Faster R-CNN | 58.5 | 83.3 | 97.3 | 93.0 | 17.2 | 87.4 | 73.7 | 38.5 | 112.2 |
| SSD | 84.9 | 60.1 | 97.5 | 88.2 | 18.1 | 84.5 | 72.1 | 11.2 | 25.1 |
| TPH-YOLO | 96.3 | 95.3 | 99.2 | 94.8 | 89.1 | 99.1 | 95.6 | 5.6 | 7.3 |
| MG-YOLO | 93.7 | 94.8 | 99.1 | 95.3 | 89.3 | 98.3 | 95.5 | 5.8 | 6.5 |
| SPD-YOLO | 93.5 | 94.8 | 98.7 | 97.6 | 85.7 | 97.0 | 94.8 | 8.0 | 7.4 |
| YOLO-ECA-ASFF | 95.0 | 93.2 | 99.3 | 94.8 | 88.0 | 98.2 | 95.0 | 8.8 | 7.6 |
| YOLO-CG-HS | 91.2 | 90.7 | 99.3 | 95.4 | 89.8 | 98.8 | 95.8 | 9.3 | 3.8 |
| Ours | 94.8 | 95.4 | 99.1 | 99.4 | 90.8 | 98.8 | 97.0 | 10.8 | 8.4 |
Furthermore, the mAP values of TPH-YOLO, MG-YOLO, SPD-YOLO, YOLO-ECA-ASFF, and YOLO-CG-HS in detecting the fungal spores overall were 95.6%, 95.5%, 94.8%, 95.0%, and 95.8%, respectively, and the mAP of the proposed model was at least 1.2% higher. Notably, these cutting-edge spore detection methods had relatively low accuracy in detecting A. flavus spores (AP_0.5 < 90%). In terms of mAP, precision, and recall, YOLO-ASF-MobileViT showed higher accuracy, especially in detecting small spores. This was because the ASF module effectively fused the multi-scale feature information of the spores, enhancing fine details, while the MobileViT better captured the local and global features of small spores, which significantly reduced the missed detection rate. Our model has more parameters (8.4 M) and a longer inference time (10.8 ms) owing to its global attention design, yet it achieves the highest mAP (97.0%) among all the models. Lighter models such as TPH-YOLO and MG-YOLO run faster (< 7 ms) but perform worse, especially in detecting small A. flavus spores. Overall, our model balances accuracy and speed well for post-harvest early warning.
Conclusion
This study introduced the YOLO-ASF-MobileViT network, which is an improved detection method that combines an ASF mechanism and the MobileViT attention mechanism. It effectively identifies morphologically diverse wheat fungal spores, with particular improvements in detecting A. flavus spores. The model achieved a mAP@0.5 of 97.0%, outperforming advanced detectors such as TPH-YOLO (95.6%) and MG-YOLO (95.5%). Although it introduces more parameters (8.4 M) and a longer inference time (10.8 ms), the inclusion of global attention significantly boosts detection robustness. The proposed model maintained high accuracy even under challenging conditions such as adhesion, occlusion, and noise. Overall, this method aims to provide early warning of potential fungal contamination in the post-harvest phase. It can assist in the risk assessment of seed-borne or storage-related fungal threats, thereby supporting timely intervention and quality control measures in wheat management.
In future research, we will focus on collecting more microscopic images to expand the model’s application scenarios. We also plan to incorporate additional cereal-associated fungal spore types into the training dataset to improve the model’s generalizability and reduce false positives in mixed-contamination environments.
Author contributions
Zhizhou Ren made substantial contributions to the creation of new software; the acquisition, analysis, and interpretation of data; and drafted the work; Kun Liang made substantial contributions to the conception and design of the work; methodology and validation; and substantively revised the manuscript; Yingqi Zhang and Jinpeng Song made substantial contributions to the acquisition, analysis, and interpretation of data; Xiaoxiao Wu, Chi Zhang, Xiuming Mei, Yi Zhang and Xin Liu made substantial contributions to the acquisition and curation of data; All authors have read and approved the final manuscript and agree to be accountable for all aspects of the work.
Funding
This work was supported by Major Program of Jiangsu Provincial Administration for Market Regulation (Grant No. KJ2025016), Postgraduate Research & Practice Innovation Program of Jiangsu Province (Grant No. SJCX25_0286), Natural Science Foundation of Jiangsu Province (Grant No. BK20221518), National Undergraduate Innovation and Entrepreneurship Training Program (Grant No. 202410307100Z), and Jiangsu Agriculture Science and Technology Innovation Fund (Grant No. CX (23)1002).
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
8/29/2025
The original online version of this article was revised: Funding has been updated.
References
- 1.El Houssni I, Zahidi A, Khedid K, Hassikou R. Nutrient and anti-nutrient composition of durum, soft and red wheat landraces: implications for nutrition and mineral bioavailability. J Agric Food Res. 2024;15:101078. [Google Scholar]
- 2.Kaur N, Singh B, Kaur A, Yadav MP, Singh N, Ahlawat AK, et al. Effect of growing conditions on proximate, mineral, amino acid, phenolic composition and antioxidant properties of wheatgrass from different wheat (Triticum aestivum L.) varieties. Food Chem. 2021;341:128201. [DOI] [PubMed] [Google Scholar]
- 3.Kettlewell P, Byrne R, Jeffery S. Wheat area expansion into Northern higher latitudes and global food security. Agric Ecosyst Environ. 2023;351:108499. [Google Scholar]
- 4.Chai Y, Senay S, Horvath D, Pardey P. Multi-peril pathogen risks to global wheat production: a probabilistic loss and investment assessment. Front Plant Sci. 2022;13:1034600. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Mielniczuk E, Skwaryło-Bednarz B. Fusarium head blight, Mycotoxins and strategies for their reduction. Agronomy. 2020;10:509. [Google Scholar]
- 6.Almoujahed MB, Rangarajan AK, Whetton RL, Vincke D, Eylenbosch D, Vermeulen P, et al. Detection of fusarium head blight in wheat under field conditions using a hyperspectral camera and machine learning. Comput Electron Agric. 2022;203:107456. [Google Scholar]
- 7.Scott P, Strange R, Korsten L, Gullino ML, editors. Plant diseases and food security in the 21st century. Cham: Springer International Publishing; 2021. [Google Scholar]
- 8.Abaya A, Xue A, Hsiang T. Selection and screening of fungal endophytes against wheat pathogens. Biol Control. 2021;154:104511. [Google Scholar]
- 9.Mathre DE. Bunts and Smuts revisited: has the air been cleared? Plant Health Prog. 2000;1:0622–02. [Google Scholar]
- 10.Sharma P, Chauhan R, Pande V, Basu T, Rajesh, Kumar A. Rapid sensing of Tilletia indica teliospore in wheat extract by a piezoelectric label-free immunosensor. Bioelectrochemistry. 2022;147:108175. [DOI] [PubMed] [Google Scholar]
- 11.Kumar S, Singroha G, Singh GP, Sharma P. Karnal bunt of wheat: etiology, breeding and integrated management. Crop Prot. 2021;139:105376. [Google Scholar]
- 12.Noroozi R, Kobarfard F, Rezaei M, Ayatollahi SA, Paimard G, Eslamizad S, et al. Occurrence and exposure assessment of aflatoxin B1 in Iranian breads and wheat-based products considering effects of traditional processing. Food Control. 2022;138:108985. [Google Scholar]
- 13.Hoffmann A, Funk R, Müller MEH. Blowin’ in the wind: wind dispersal ability of phytopathogenic fusarium in a wind tunnel experiment. Atmosphere. 2021;12:1653. [Google Scholar]
- 14.Hernandez Nopsa JF, Daglish GJ, Hagstrum DW, Leslie JF, Phillips TW, Scoglio C, et al. Ecological networks in stored grain: key post-harvest nodes for emerging pests, pathogens, and Mycotoxins. Bioscience. 2015;65:985–1002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Gwinn KD, Leung MCK, Stephens AB, Punja ZK. Fungal and Mycotoxin contaminants in cannabis and hemp flowers: implications for consumer health and directions for further research. Front Microbiol. 2023;14:1278189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Anwar H, Khan SU, Ghaffar MM, Fayyaz M, Khan MJ, Weis C, et al. The NWRD dataset: an open-source annotated segmentation dataset of diseased wheat crop. Sensors. 2023;23:6942. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Paul N, Sunil GC, Horvath D, Sun X. Deep learning for plant stress detection: a comprehensive review of technologies, challenges, and future directions. Comput Electron Agric. 2025;229:109734. [Google Scholar]
- 18.Catal Reis H, Turk V. Integrated deep learning and ensemble learning model for deep feature-based wheat disease detection. Microchem J. 2024;197:109790. [Google Scholar]
- 19.Zhu Y, Yao Y, Xi J, Tang C, Wu L. Modelling the effect of pH and H2S on the germination of F. graminearum spores under different temperature conditions. LWT. 2023;177:114530. [Google Scholar]
- 20.Wakeham A, Kennedy R, McCartney A. The collection and retention of a range of common airborne spore types trapped directly into microtiter wells for enzyme-linked immunosorbent analysis. J Aerosol Sci. 2004;35:835–50. [Google Scholar]
- 21.Pilo P, Tiley AMM, Lawless C, Karki SJ, Burke J, Feechan A. A rapid fungal DNA extraction method suitable for PCR screening fungal mutants, infected plant tissue and spore trap samples. Physiol Mol Plant Pathol. 2022;117:101758. [Google Scholar]
- 22.Harpaz D, Duanis-Assaf D, Alkan N, Eltzov E. Detection of post-harvest pathogenic fungi by RNA markers in a high-throughput platform on Streptavidin plate. Postharvest Biol Technol. 2022;183:111728. [Google Scholar]
- 23.Qamar S, Öberg R, Malyshev D, Andersson M. A hybrid CNN–random forest algorithm for bacterial spore segmentation and classification in TEM images. Sci Rep. 2023;13:18758. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Crespo-Michel A, Alonso-Arévalo MA, Hernández-Martínez R. Developing a microscope image dataset for fungal spore classification in grapevine using deep learning. J Agric Food Res. 2023;14:100805. [Google Scholar]
- 25.Zhou H, Lai Q, Huang Q, Cai D, Huang D, Wu B. Automatic detection of rice blast fungus spores by deep learning-based object detection: models, benchmarks and quantitative analysis. Agriculture. 2024;14:290. [Google Scholar]
- 26.Zhang D-Y, Zhang W, Cheng T, Zhou X-G, Yan Z, Wu Y, et al. Detection of wheat scab fungus spores utilizing the YOLOv5-ECA-ASFF network structure. Comput Electron Agric. 2023;210:107953. [Google Scholar]
- 27.Li K, Qiao C, Zhu X, Song Y, Zhang L, Gao W, et al. Lightweight fungal spore detection based on improved YOLOv5 in natural scenes. Int J Mach Learn Cybern. 2023;14:02026–x. [Google Scholar]
- 28.Wei X, Liu Y, Song Q, Zou J, Wen Z, Li J, et al. Microscopic hyperspectral imaging and an improved detection model based on detection of mycogone perniciosa chlamydospore in soil. Eur J Agron. 2024;152:127007. [Google Scholar]
- 29.Giovenzana V, Beghi R, Buratti S, Civelli R, Guidetti R. Monitoring of fresh-cut Valerianella locusta shelf life by electronic nose and VIS–NIR spectroscopy. Talanta. 2014;120:368–75. [DOI] [PubMed] [Google Scholar]
- 30.Cao S, Wang T, Li T, Mao Z. UAV small-target detection algorithm based on an improved YOLOv5s model. J Vis Commun Image Represent. 2023;97:103936. [Google Scholar]
- 31.Shi Y, Wang B, Yin C, Li Z, Yu Y. Performance improvement: a lightweight gas information classification method combined with an electronic nose system. Sens Actuators B Chem. 2023;396:134551. [Google Scholar]
- 32.Weaver MA, Park LC, Brewer MJ, Grodowitz MJ, Abbas HK. Detection, quantification, and characterization of airborne Aspergillus flavus within the corn canopy. Mycotoxin Res. 2025;41:267–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Ma P, Li C, Rahaman MM, Yao Y, Zhang J, Zou S, et al. A state-of-the-art survey of object detection techniques in microorganism image analysis: from classical methods to deep learning approaches. Artif Intell Rev. 2023;56:1627–98. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Nguyen HDT, Sultana T, Kesanakurti P, Hambleton S. Genome sequencing and comparison of five Tilletia species to identify candidate genes for the detection of regulated species infecting wheat. IMA Fungus. 2019;10:0011–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Mansour MB, Goh YK, Vujanovic V. Rapid macroconidia production in fusarium graminearum 3- and 15-acetyldeoxynivalenol chemotypes using sucrose-water medium. Ann Microbiol. 2012;62:965–71. [Google Scholar]
- 36.Son Y-E, Park H-S. SscA is required for fungal development, aflatoxin production, and pathogenicity in Aspergillus flavus. Int J Food Microbiol. 2024;413:110607. [DOI] [PubMed] [Google Scholar]
- 37.Cheng T, Zhang D, Gu C, Zhou X-G, Qiao H, Guo W, et al. YOLO-CG-HS: a lightweight spore detection method for wheat airborne fungal pathogens. Comput Electron Agric. 2024;227:109544. [Google Scholar]
- 38.Li M, Liu J, Yao T, Gao Z, Gong J. Deep-learning based in-situ micrograph analysis of high-density crystallization slurry using image and data enhancement strategy. Powder Technol. 2024;437:119582. [Google Scholar]
- 39.Kang M, Ting C-M, Ting FF, Phan R-CW, et al. ASF-YOLO: a novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis Comput. 2024;147:105057. [Google Scholar]
- 40.Dündar N, Keçeli AS, Kaya A, Sever H. A shallow 3D convolutional neural network for violence detection in videos. Egypt Inf J. 2024;26:100455. [Google Scholar]
- 41.Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T et al. An image is worth 16×16 words: transformers for image recognition at scale. ArXiv [Preprint]. 2020; arXiv:2010.11929.
- 42.Mehta S, Rastegari M. MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. ArXiv [Preprint]. 2022; arXiv:2110.02178.
- 43.Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 7132–7141.
- 44.Zhang QL, Yang YB. SA-Net: shuffle attention for deep convolutional neural networks. In: ICASSP 2021 – IEEE International Conference on Acoustics, Speech and Signal Processing; 2021. pp. 2235–2239.
- 45.Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q et al. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. pp. 11534–11542.
- 46.Woo S, Park J, Lee J-Y, Kweon IS. CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018. pp. 3–19.
- 47.Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y et al. SSD: single shot multibox detector. In: Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I; Springer International Publishing; 2016. pp. 21–37.
- 48.Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 779–788.
- 49.Zhou X, Wang D, Krähenbühl P. Objects as points. ArXiv [Preprint]. 2019; arXiv:1904.07850.
- 50.Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. [DOI] [PubMed] [Google Scholar]
- 51.Zhu X, Lyu S, Wang X, Zhao Q. TPH-YOLOv5: improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In: 2021 IEEE/CVF International Conference on Computer Vision Workshops; 2021. pp. 2778–88.
- 52.Li K, Zhu X, Qiao C, Zhang L, Gao W, Wang Y. The Gray mold spore detection of cucumber based on microscopic image and deep learning. Plant Phenomics. 2023;5:0011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Sunkara R, Luo T. No more strided convolutions or pooling: a new CNN building block for low-resolution images and small objects. In: Amini M-R, Canu S, Fischer A, Guns T, Kralj Novak P, Tsoumakas G, editors. Machine Learning and Knowledge Discovery in Databases; 2023. pp. 443–59.
- 54.Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2020;128:336–59. [Google Scholar]
- 55.Pilo P, Lawless C, Tiley AMM, Karki SJ, Burke JI, Feechan A. Comparison of microscopic and metagenomic approaches to identify cereal pathogens and track fungal spore release in the field. Front Plant Sci. 2022;13:1039090. [DOI] [PMC free article] [PubMed] [Google Scholar]















