Scientific Reports. 2024 Nov 13;14:27776. doi: 10.1038/s41598-024-77333-6

RJ-TinyViT: an efficient vision transformer for red jujube defect classification

Chengyu Hu 1,2, Jianxin Guo 1,2, Hanfei Xie 1,2, Qing Zhu 3, Baoxi Yuan 1,2, Yujie Gao 1, Xiangyang Ma 1, Jialu Chen 1
PMCID: PMC11560971  PMID: 39537645

Abstract

Compared to the surface defect detection of industrial products produced according to specified processes, the detection of surface defects in naturally grown red jujubes poses unique and significant challenges for researchers. The high diversity of surface defects, subtle distinctions from the background, low contrast, varying scales, and the presence of high levels of noise in images are among the factors that greatly amplify the complexity of defect detection tasks. Existing methods show some deficiencies in addressing these issues, mainly due to insufficient feature extraction capabilities and overly complex network structures, leading to limitations in model efficiency and practical application performance. To tackle the challenges associated with red jujube surface defect detection, this study proposes an optimized Tiny Vision Transformer (TinyViT) network structure, named RJ-TinyViT. This method refines the TinyViT-5M network structure to reduce network burden and introduces an improved Multi-Kernel Block (MK Block) and an improved Mobile Inverted Bottleneck Convolution Block (MBConv Block) to enhance feature extraction capabilities. Additionally, we have integrated the Coordinate Attention (CA) module to enhance the model’s capacity for recognizing and focusing on features of surface defects on red jujubes. Experimental results show that RJ-TinyViT achieved a classification accuracy of 93.38%, marking an improvement of 1.84% over the original TinyViT network. At the same time, its Floating-point Operations (FLOPs) and Parameters (Params) were reduced to 58.97% and 39.84% of the original TinyViT network, respectively. These results not only demonstrate that RJ-TinyViT achieves model lightweighting while maintaining high accuracy but also highlight its value in practical industrial applications.

Keywords: Surface defect detection, Red Jujube, Deep learning, Coordinate attention, Vision Transformer

Subject terms: Agroecology, Computer science

Introduction

In the era of Industry 4.0, Surface Defect Detection (SDD) has become a key area of research1. It involves the identification and detection of undesirable features on the surface of materials or products, such as scratches, cracks, rust spots, wear, and pores. This detection is crucial for ensuring the quality and reliability of products. SDD technologies utilize machine vision equipment to capture images, which are then analyzed to determine the presence of defects.

However, unlike other computer vision fields, SDD does not have large and unified datasets akin to ImageNet2, PASCAL-VOC3, and COCO4. The research on SDD predominantly concentrates on diverse detection targets and specific application scenarios. When conducting SDD in intricate industrial environments, a series of challenges are usually encountered, such as subtle imaging distinctions between defects and the background, low contrast, significant variations in the size and type of defects, high noise levels within defect images, and potential interference from natural environmental factors on defect imaging. Consequently, SDD encounters more pronounced challenges in practical applications.

Agriculture 4.0, grounded in internet technology and modern information technology, has evolved into a contemporary agricultural industry chain model that integrates intelligence, informatization, and automation. As a product of the digital agricultural revolution5,6, Agriculture 4.0 is distinguished by the fusion of cutting-edge technologies such as Precision Agriculture (PA), Remote Sensing (RS), Machine Learning (ML), Artificial Intelligence (AI), Big Data, and the Internet of Things (IOT)7,8. It establishes a comprehensive system that includes remote sensors, Wireless Sensor Networks (WSN), virtualization systems, cloud computing, and end-user applications9. The ultimate goal of Agriculture 4.0 is to minimize agricultural production inputs (such as water resources, fertilizers, pesticides, etc.), tackle the challenges posed by climate change, and enhance farmers’ profitability, thereby fostering more efficient agricultural production modes.

Jujube is a highly nutritious food that not only occupies a vital position in daily life as a nourishing and medicinal ingredient but also serves as an indispensable raw material in the food industry. China is not merely the birthplace of jujubes but also the largest producer of red jujubes globally. With the gradual improvement of national living standards, the demand for jujubes as agricultural products has shown an upward trend year by year. According to data from the National Bureau of Statistics, from 1995 to 2015, over 20 years, the national production of red jujubes soared by 10.3 times, with annual production exceeding 8 million tons. By 2015, more than 20 provinces in China were cultivating red jujubes, with Xinjiang, Hebei, Shandong, Shanxi, and Shaanxi becoming the main jujube-producing provinces10. Particularly in Xinjiang, due to its unique geographical and environmental conditions, such as its high latitude location, ample sunlight, and significant day-night temperature differences, the jujubes grown there are not only sweet and fleshy but also larger in size11. Xinjiang leads the country in terms of jujube cultivation area and production, accounting for 30% of the total national cultivation area and 50% of the total production.

As the jujube is an important agricultural product, its surface defect detection not only affects the market competitiveness of the product but also directly relates to production efficiency and the income of farmers. Efficient and accurate jujube defect detection technology is of great significance for achieving automation and standardization in the jujube industry. However, existing methods for jujube defect detection still have shortcomings when facing complex industrial environments, and they cannot meet the needs of large-scale production. By improving the TinyViT model, this study aims to provide a more effective solution for jujube defect detection, and at the same time, promote the intelligent upgrade of the jujube industry. The results of this research are not only applicable to jujube defect detection but also provide a reference for the quality detection of other similar agricultural products.

Recent advancements in machine vision technology have facilitated its application in assessing the quality of agricultural products. Nevertheless, numerous facilities still rely on manual inspection when evaluating red jujubes12, a process that is both inefficient and cost-intensive. Traditional manual sorting is inherently prone to human error, resulting in frequent inaccuracies and overlooked defects, thereby compromising the consistency of jujube product quality13. Consequently, there is an urgent necessity for the adoption of advanced technology to modernize and supersede outdated manual sorting methods, thereby augmenting the quality of jujube products through automation.

Conventional machine vision-based SDD techniques commonly rely on traditional image processing methodologies or manually designed features. These features are subsequently input into classifiers for sorting purposes. The imaging techniques are often tailored to the specific characteristics of the defects present on the surfaces of the objects inspected. Properly designed imaging strategies enable the capture of uniformly lit images, which clearly depict the defects on object surfaces. For instance, Jing et al.14 utilized a composite white light to photograph diverse defects on colored fabric surfaces. Similarly, imaging techniques can vary based on the reflective properties of the inspected objects, encompassing methods like bright field, dark field, and mixed imaging. For example, Chen et al.15 used two concentric ring-shaped bright field light sources to highlight central and peripheral areas at the bottoms of metal cans to detect any irregularities. In a specific application for red jujubes, Wu et al.16 combined a Support Vector Machine (SVM) with hyperspectral imaging and other machine vision techniques to categorize the quality of red jujubes.

In practical production environments, complex SDD tasks confront numerous challenges, such as low contrast between defects and background, large variations in defect size and type, noise and interference, among others. In these cases, traditional methods often struggle to achieve satisfactory detection results. Although traditional technologies have partially addressed the issue of automating red jujube classification, these methods often necessitate exacting environmental conditions for detection and suffer from issues of low accuracy and suboptimal real-time performance, rendering them unsuitable for widespread application17.

Deep Learning (DL), a pivotal area of ML, has made substantial advances recently, notably through the development of Convolutional Neural Networks (CNN). Owing to their excellent capability in feature extraction and nonlinear representation, CNNs have been extensively employed in diverse image recognition tasks. Furthermore, a range of DL-based defect detection technologies have begun to gain widespread acceptance and application in various industrial scenarios.

TinyViT is an innovative DL architecture proposed by Wu et al. in 202218. It utilizes a technique known as “pre-training distillation,” which transfers knowledge from large pre-trained models to smaller ones, allowing for use on devices that require fewer computational resources. This method optimizes the training process by pre-computing and storing the outputs (logits) of the teacher model, significantly reducing memory and computational costs. TinyViT exhibits performance akin to larger models on benchmark datasets, while possessing a significantly smaller number of Params, rendering it ideally suitable for resource-constrained environments, such as mobile devices and the Internet of Things (IoT).
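The storage idea behind pre-training distillation can be illustrated with a minimal NumPy sketch: instead of re-running the teacher during student training, only the k largest teacher logits per sample are saved, and a soft-label distribution is rebuilt from them at training time. The function names and the zero-fill reconstruction below are simplifications for illustration, not the authors' exact scheme:

```python
import numpy as np

def sparsify_logits(logits, k=3):
    """Keep only the k largest teacher logits per sample, the storage trick
    that lets distillation skip costly teacher forward passes."""
    idx = np.argsort(logits)[-k:]        # indices of the k largest logits
    return idx, logits[idx]

def soft_targets(idx, vals, num_classes, temperature=1.0):
    """Rebuild a dense soft-label distribution from the stored sparse logits;
    classes outside the stored top-k receive zero probability here."""
    probs = np.exp(vals / temperature)
    probs /= probs.sum()
    dense = np.zeros(num_classes)
    dense[idx] = probs
    return dense

# toy example with 6 defect classes; the teacher strongly favours class 3
teacher_logits = np.array([0.1, 0.2, 0.3, 4.0, 0.5, 1.0])
idx, vals = sparsify_logits(teacher_logits, k=3)
targets = soft_targets(idx, vals, num_classes=6)
```

The student would then be trained against `targets` with a distillation loss (e.g. KL divergence) rather than against the hard labels alone.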

However, despite the TinyViT network’s excellent performance in terms of computational resource consumption and adaptability, there are still areas that could be improved in the specific classification task of red jujube surface defects. First, TinyViT has limited multi-scale feature extraction capabilities when dealing with the diversity and scale variations of red jujube surface defects, leading to poor performance when facing defects with significant size variations. Second, the TinyViT model lacks sufficient feature focusing ability when dealing with subtle differences between the red jujube surface and the background, making it difficult to accurately identify certain low-contrast defects. Additionally, the complexity of the original TinyViT network structure poses challenges to real-time applications. The RJ-TinyViT proposed in this paper achieves a classification accuracy of 93.38%, with FLOPs of 0.69G and Params of 2.02M. Compared to TinyViT, the overall classification accuracy has increased by 1.84%, while Params and FLOPs have been reduced by 60.16% and 41.03%, respectively; RJ-TinyViT thus not only improves accuracy but also significantly reduces computational load.

To tackle the challenges of SDD for red jujubes, this study introduced a series of refinements to the TinyViT model, proposing the RJ-TinyViT. We streamlined the TinyViT model and introduced improved MK Blocks and enhancements to the MBConv Block. These improvements not only alleviate the computational load of the network but also bolster the model’s generalization ability, resulting in more exhaustive feature information extraction. By integrating CA into TinyViT, our model can more accurately capture the different feature categories of red jujube defects. In comparison to other prevalent networks, RJ-TinyViT utilizes pre-training distillation technology to effectively transfer the knowledge of large pre-trained models to smaller ones, rendering it suitable for resource-constrained devices. This approach not only accelerates the training process, diminishes memory and computational expenditures but also addresses the shortcomings of other networks in resource-constrained environments. RJ-TinyViT demonstrates considerable advantages in terms of resource consumption and adaptability compared to traditional CNNs and other large Transformer models. Through efficient knowledge transfer and model optimization, RJ-TinyViT demonstrates enhanced adaptability and stability in real-world production environments, particularly in the surface defect detection of red jujubes, facilitating efficient and automated quality sorting, thereby ensuring its reliability and practicality across diverse application scenarios.
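The CA module integrated above factorizes attention into two direction-aware poolings, one along the height axis and one along the width axis, so that positional information is preserved when the feature map is reweighted. The following NumPy sketch shows only that pooling-and-reweighting pattern; the shared 1×1 convolutions are stood in by random projections for illustration and are not the trained weights:

```python
import numpy as np

def coordinate_attention(x, reduction=8, seed=0):
    """Sketch of the CA pattern: pool along H and W separately, transform,
    then reweight the feature map with direction-aware attention maps."""
    c, h, w = x.shape
    pool_h = x.mean(axis=2)                      # (C, H): one value per row
    pool_w = x.mean(axis=1)                      # (C, W): one value per column
    mid = max(1, c // reduction)
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((mid, c)) * 0.1     # stands in for the shared 1x1 conv
    w2_h = rng.standard_normal((c, mid)) * 0.1
    w2_w = rng.standard_normal((c, mid)) * 0.1
    y = np.maximum(w1 @ np.concatenate([pool_h, pool_w], axis=1), 0)  # (mid, H+W)
    a_h = 1.0 / (1.0 + np.exp(-(w2_h @ y[:, :h])))   # (C, H) row attention
    a_w = 1.0 / (1.0 + np.exp(-(w2_w @ y[:, h:])))   # (C, W) column attention
    return x * a_h[:, :, None] * a_w[:, None, :]

feat = np.arange(2 * 4 * 5, dtype=float).reshape(2, 4, 5)
out = coordinate_attention(feat)                 # same shape as the input
```

Because the sigmoid gates lie in (0, 1), the output is an attenuated copy of the input in which rows and columns judged informative are suppressed least.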

These improvements are not only tailored to the specific needs of red jujube defect detection but also take into account potential applications in the classification of other agricultural products and industrial vision inspection. By enhancing the model’s feature extraction capabilities and computational efficiency, this study provides technical support for the realization of more intelligent and automated industrial production, and lays the foundation for future research.

The primary contributions of this paper are as follows:

  1. The TinyViT model has been streamlined, resulting in a 60.16% reduction in parameters and a 41.03% reduction in Floating-Point Operations (FLOPs), which significantly improves the model’s computational efficiency and performance in resource-constrained environments;

  2. The MK Block has been improved and integrated into the streamlined TinyViT model to enhance its multi-scale feature extraction capabilities, increasing classification accuracy on red jujube defects with scale variations by 1.29%;

  3. The Coordinate Attention (CA) module has been integrated into the streamlined TinyViT model to enhance the model’s ability to focus on features in specific directions, thereby increasing the model’s classification accuracy by 1.05%;

  4. The MBConv Block within the TinyViT model has undergone enhancements to further refine its feature extraction capabilities, which has led to an additional 0.53% increase in the model’s overall classification accuracy.

By improving the TinyViT model, this study not only significantly enhances the accuracy and efficiency of red jujube surface defect detection but also provides new ideas for the further development of industrial vision detection technology. These improvements have broad application potential, especially in the context of Agriculture 4.0, where they can promote the automated quality detection and sorting of more agricultural products, contributing to the realization of more efficient agricultural production models.

The subsequent sections of this paper are organized as follows: the Related Work section introduces current DL-based algorithms for surface defect classification of red jujubes, encompassing CNNs, Transformer models, and their significant advancements in improving classification accuracy and industrial applications. The Methods section elaborates on the dataset used and the proposed improvements, introducing the RJ-TinyViT model. The Experimental and Discussion section compares and analyzes the experimental results of RJ-TinyViT on the ‘Jujube2000’ dataset against mainstream networks. Lastly, the Conclusion section summarizes the key accomplishments of this paper and outlines potential avenues for future research.

Related work

Currently, machine vision-based surface defect detection equipment is widely used across numerous industrial fields, including but not limited to automotive manufacturing, home appliances, machinery manufacturing, semiconductors and electronics, chemical, pharmaceutical, aerospace, and light industries19, effectively supplanting traditional manual visual inspection methods.

For example, Yousef et al.20 extracted features from images of metal casting defects and used SVM and KNN models to detect defects related to uneven dimensions and surface discontinuities in metal castings, further improving the accuracy of these models by enhancing environmental conditions and image resolution, although this method has a low level of intelligence.

Peng et al.21 proposed a Multiscale Information Focus (MIFE) method for industrial surface defect detection in their study. This method utilizes an improved loss function and MIFE-GANomaly technology, aiming to detect defects on various industrial product surfaces more effectively. Their method emphasizes concentrated processing and analysis of multiscale information to enhance the accuracy and reliability of defect detection. This study signifies a significant advancement in applying sophisticated computational methods for industrial defect detection.

Li et al.22 introduced a model for detecting defects on aluminum alloy surfaces, utilizing a streamlined version of the YOLOv4 network, termed M2-BL-YOLOv4. This model first modifies the complex CSPDarkNet53 backbone network in the YOLOv4 model using an inverted residual structure, significantly reducing the number of model Params and improving detection speed. Secondly, a new feature fusion network, BiFPN-Lite, was designed to improve the network’s fusion capability, further enhancing its detection accuracy. Experimental results show that the improved lightweight YOLOv4 algorithm achieved a mean Average Precision (mAP) of 93.5% on the aluminum alloy surface defect test set, reduced the model parameter count to 60% of the original, and increased the Frames Per Second (FPS) to 52.99, improving the detection speed by 30%, achieving efficient detection of aluminum alloy surface defects.

Zhou et al.23 proposed a metal surface defect detection method based on an improved YOLOv5 model. This method enhances the model’s ability to detect small targets and improve overall detection accuracy by introducing the CSPlayer module and the Global Attention Enhancement Mechanism (GAMAttention) into the YOLOv5 model. The CSPlayer module is introduced in place of the C3 module within the YOLOv5 architecture to enhance the model’s flexibility and adaptability. GAMAttention combines channel attention and spatial attention to enhance the model’s accuracy and speed in detecting small targets. Experimental results show that this improved model performs better on the GC10-DET enhanced dataset than the original YOLOv5s model, with increases of 5.3%, 1.4%, and 1.7% in mAP@0.5 and mAP@0.5:0.95, respectively, and also has a higher inference speed. Thus, this improved YOLOv5-based method excels in metal surface defect detection but has certain requirements for the form of defect features.

Liu et al.24 devised a real-time, efficient method for detecting surface defects, employing a hybrid version of the YOLOv4-LITE model. This method combines the characteristics of the YOLOv4 algorithm to create a lightweight hybrid YOLOv4-LITE series, aiming to achieve the optimal balance between detection performance and model size. Their method emphasizes real-time performance, accuracy, and cost-effectiveness, making it particularly suitable for engineering application scenarios. This study focuses on improving both the efficiency and effectiveness of surface defect detection systems.

Wang et al.25 introduced ATT-YOLO, a precise yet lightweight detector designed for identifying surface defects in electronic products. This model incorporates a single self-attention module and utilizes multi-scale feature extraction and integration both in its backbone network and in the feature pyramid. It also features enhancements to the automatic anchoring technique. On a dataset comprising 14,478 instances of laptop surface defects in a binary detection setting, ATT-YOLO achieved a mean Average Precision (mAP) of 92.8% at an IoU threshold of 0.5. The model also underwent validation against the COCO benchmark dataset, where it not only outperformed several leading lightweight YOLO-based detectors but also achieved a 44.9% mAP and required only 21.8 GFLOPs of computational effort, surpassing competitors like YOLOv8-small and YOLOv7-tiny-SiLU. The experimental results demonstrate that this method achieves better detection accuracy in the field of electronic manufacturing surface defect detection, but it also requires specific forms of defect features.

With the rapid development of DL technology, an increasing number of studies are focusing on how to utilize these technologies to enhance the performance of SDD. DL technology, particularly methods centered around CNN, has achieved significant breakthroughs in the field of computer vision. The remarkable success of AlexNet in the 2012 ImageNet competition marked the beginning of the DL technology era. For instance, in image classification tasks, networks such as GoogLeNet26, ResNet27, and DenseNet28 have achieved excellent results in traditional vision tasks29. Recently, Transformer-based models, such as Vision Transformer (ViT), have demonstrated strong capabilities in handling complex image tasks by segmenting images into multiple small patches and utilizing self-attention mechanisms30. Swin Transformer further optimized this approach by using a hierarchical Transformer, making the model more suitable for handling different-sized structures in images31. MobileViT is an optimized version for mobile devices, combining the global attention of ViT with the local feature extraction advantages of CNNs, aiming to improve efficiency and performance on edge devices32.
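The patch-splitting step that ViT-style models perform before self-attention can be sketched in NumPy as follows; the patch size of 16 is the common ViT default, assumed here for illustration:

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an (H, W, C) image into non-overlapping, flattened patches,
    the tokenization step a ViT performs before applying self-attention."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H//p, p, W//p, p, C) -> (H//p, W//p, p, p, C) -> (N, p*p*C)
    x = img.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3))
patches = image_to_patches(img)   # a 14 x 14 grid gives 196 patch tokens
```

Each flattened patch is then linearly projected to the model's embedding dimension before entering the Transformer blocks.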

With the relentless advancement of computer vision technology, the recently introduced Transformer architecture has demonstrated remarkable proficiency in feature extraction and relationship modeling. These breakthroughs not only elevate the performance of the algorithms but also present novel strategies for addressing some of the pivotal challenges encountered in traditional methodologies.

Nie et al.33 proposed a novel binocular vision tracking method, SiamLA, to address the inconsistency between classification and regression tasks in traditional binocular vision tracking. They introduced the Localization-Aware Dynamic Label (LADL) loss and Localization-Aware Label Smoothing (LALS) strategy to achieve collaborative optimization of classification and regression. Additionally, they designed a separate localization branch and integrated a Localization-Aware Feature Aggregation (LAFA) module to further optimize the model. Extensive experiments on six challenging benchmarks demonstrated the superior accuracy and efficiency of SiamLA, as well as its good stability, indicating its potential for practical applications.

Nie et al.34 proposed a new 3D single-object tracking method, GLT-T, which achieves tracking through global-local transformer voting in point clouds. They developed a Global-Local Transformer (GLT) module that integrates object and local perception priors into the seed point features to form a robust geometric position feature representation, providing more accurate and robust clues for offset learning. Moreover, they designed a simple yet effective training strategy and developed an importance prediction branch to learn the potential importance of seed points, using the output weight vector as a training constraint. Extensive experiments on challenging KITTI and NuScenes benchmarks demonstrated that GLT-T achieved state-of-the-art performance in 3D single-object tracking tasks. Further ablation studies showed the advantages of the proposed GLT voting scheme over the original VoteNet.

Nie et al.35 proposed an innovative tracking method in the field of single-object tracking, utilizing fine-grained priors to improve tracking accuracy. They developed a new box estimation network, including a Transformer Encoder Target Pyramid Guide (TPG) and a Transformer Decoder Target Pyramid Spread (TPS). This design employs self-attention mechanisms to propagate priors with high affinity, thereby improving the accuracy of Intersection over Union (IoU) scores for candidate boxes. Extensive experiments on multiple tracking benchmarks, such as OTB100, UAV123, NFS, VOT2020, VOT2021, LaSOT, LaSOT_ext, TrackingNet, and GOT-10k, validated that this new box estimation network outperformed most existing methods in single-object tracking tasks. Additionally, their tracker demonstrated competitive performance among existing top trackers.

Nie et al.36 proposed a novel method in single-object tracking, enhancing tracking accuracy through GLT voting. They developed a GLT module that integrates object and local perception priors into seed point features to form robust geometric position feature representations. They also designed a simple yet effective training strategy and developed an importance prediction branch to learn the potential importance of seed points, using the output weight vector as a training constraint. Extensive experiments on challenging KITTI and NuScenes benchmarks demonstrated that GLT-T achieved state-of-the-art performance in 3D single-object tracking tasks.

These DL technologies have attained remarkable accomplishments not only within the realms of computer vision and object tracking but also across a diverse array of application domains, underlining their immense potential.

With the continuous flourish of smart agriculture and intelligent manufacturing, these DL methods are being increasingly applied across various domains37. Currently, scholars are applying DL technology to fruit defect classification, including the defect classification of red jujubes38. This section delves into the current state-of-the-art DL-based methods for red jujube defect classification.

Xu et al.39 proposed a network model focused on feature association (FAR-Net) to address the issue of multiple defects on red jujubes. This model can detect multiple defects on a single jujube and surpassed the Inceptionv3 network model in terms of detection accuracy for multiple defects.

Fang et al.40 optimized the AlexNet CNN model by enhancing its width and depth to identify defects such as yellow skin, broken tips, mold, and normal jujubes in dried Xinjiang jujubes, increasing the defect recognition rate to 97.14%.

Yang et al.41 proposed a novel model utilizing the Network-in-Network CNN architecture. This model replaces traditional fully connected layers with global average pooling, thereby decreasing the number of Params in the model while simultaneously improving recognition accuracy. This model was used to detect six types of defects in Xinjiang Jujube, including black spots, peeling, yellow skin, cracks, and other defects, and performed better than traditional SVM detection methods and AlexNet-based classification methods.

Wen et al.42 improved the deep residual network to achieve red jujube texture recognition and surface defect detection. They replaced the ReLU activation function and Softmax Loss with SELU and Center Loss, respectively, and incorporated Dropout layers during the network training process to prevent overfitting and gradient explosion, achieving a red jujube defect classification detection efficiency of about 120/min with an accuracy of 96.11%.

Zhang et al.43 improved upon the CNN by adopting a dual-branch network structure, where one branch used transfer learning for pre-training, and the other branch used a lightweight network to extract features from red jujube images, improving the convergence speed and increasing accuracy from 96.02 to 99.50%.

Yu44 introduced a multi-attention hybrid approach for categorizing jujubes. Utilizing DenseNet121 as the core network, Yu developed a hybrid multi-attention module that merges spatial, channel, and channel-spatial attention branches, averaging their outputs to produce the final result. This technique demonstrated a classification accuracy of 95.7%.

Overall, DL-based image classification techniques exhibit effectiveness in the realm of red jujube defect classification. However, the majority of current research relies predominantly on traditional classic network models, which lag behind in terms of accuracy and structural sophistication compared to the latest network technologies. To enhance industrial production efficiency, it is imperative to broaden the application of the latest academic research and practical technologies in actual production scenarios. Furthermore, these studies often overlook in-depth analysis of misidentified defects. Conducting a comprehensive and precise analysis could facilitate targeted optimization of the model, thereby improving classification accuracy and mitigating the misidentification rate. This study endeavors to devise a lightweight yet high-accuracy network tailored for red jujube surface defect classification. Addressing the deficiencies of existing research, this study streamlines the novel TinyViT model, incorporates improved MK Blocks, introduces CA module, and enhances the MBConv Block, resulting in our novel RJ-TinyViT. Extensive experiments have underscored the superiority of the RJ-TinyViT model. Lastly, this paper provides a meticulous analysis of misidentification cases, offering valuable insights for future model optimization.

Methods

Dataset

The datasets from references 39–44 do not provide download links. The dataset used in this paper comes from the “Jujube2000” dataset, collected and created by Jiang et al.45. The “Jujube2000” dataset comprises six different classifications of red jujubes: Deformed, Wrinkled, Cracked, Moldy, Bird-pecked, and Normal. Each category contains 2000 images, totaling 12,000 images. The dataset is partitioned into training, testing, and validation subsets in the ratio of 70%, 20%, and 10% respectively. Figure 1 showcases some exemplary images from the dataset.
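The 70/20/10 partition described above can be reproduced with a simple shuffled split; the file names and random seed below are illustrative, not taken from the dataset release:

```python
import random

def split_dataset(paths, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle and split a list of image paths into train/test/validation
    subsets using the paper's 70/20/10 ratio (seed and names illustrative)."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * ratios[0])
    n_test = int(len(paths) * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_test],
            paths[n_train + n_test:])

# 2000 images per class, as in Jujube2000
files = [f"jujube_{i:04d}.jpg" for i in range(2000)]
train, test, val = split_dataset(files)
```

Splitting per class, as done here, keeps the class balance identical across the three subsets.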

During the model training process, it is commonplace to group several images into a single batch for processing by the model. It is essential for all images in the batch to be of uniform dimensions. As a result, standardizing image sizes prior to model training becomes necessary. Given that images in large datasets vary in size, resizing all images to the same dimensions is considered the best option for training speed and accuracy, although this may cause distortion of objects, potentially affecting the accuracy of the model. Many research studies46 suggest that configuring input image dimensions to 224 × 224 pixels offers an optimal balance between model accuracy and computational efficiency. For instance, with a typical overall downsampling factor of 32, a 224 × 224 input yields a standard 7 × 7 feature map; halving the input to 112 × 112 could lead to a substantial loss of image details, whereas enlarging it to 448 × 448 would greatly increase the computational burden.

Fig. 1.

Fig. 1

Six Types of Defect Samples. (a) Deformed (b) Wrinkled (c) Cracked (d) Moldy (e) Bird-Pecked (f) Normal.

Taking these factors into account, this study standardized the dimensions of jujube images to 224 × 224 pixels to facilitate model training. Furthermore, in efforts to enhance sample quality and bolster the model’s capacity for generalization, this study implemented data augmentation methods on the image dataset. The following are the data augmentation strategies used for classifying surface defects in jujubes:

  1. Random Cropping: This method not only augments the data by randomly selecting and cropping different areas of the image but also contributes to improving model stability and effectively mitigates model overfitting. In this study, random cropping is achieved by randomly selecting a portion of the image and resizing it to 224 × 224 pixels, thereby not only enhancing the diversity of images but also enabling the model to learn local features of the image;

  2. Flipping: By randomly flipping images horizontally or vertically, the diversity of images is increased without altering the original content. In this study, horizontal and vertical flips are each performed with a 50% probability, enabling the model to adapt to variations in image orientation and consequently enhancing its robustness against such changes;

  3. Random Enhancement: This comprehensive approach encompasses multiple enhancement strategies aimed at introducing a wider array of image variations to bolster the model’s generalization capability. The employed strategies encompass:

  • Random Erasing: Randomly generates a rectangular area of a specific size on the image and replaces the pixels within that area with preset values to simulate partial occlusion of the image, enhancing the model’s ability to recognize occluded images;

  • Rotation: Randomly rotates the image by a small angle, this slight rotational change allows the model to adapt to changes in image angles, increasing the model’s robustness to image rotation;

  • Color Jitter: Randomly adjusts the image’s brightness, contrast, and saturation to simulate images under different lighting and color conditions, thereby improving the model’s recognition capability under various environmental conditions.

Upon completion of these image enhancement processes, the final red jujube dataset comprises a total of 48,000 images: 33,600 for training the model, 9,600 for testing the model, and 4,800 for validating the effectiveness of the model.

TinyViT

TinyViT is an innovative DL architecture introduced by Wu et al. in 2022, specifically designed for miniaturized ViTs. The key innovation of this architecture lies in its rapid pre-training distillation framework, which allows for effective knowledge transfer from large pre-trained models to more compact ones. Unlike traditional ViTs, which often require a substantial number of parameters, TinyViT reduces the model size, allowing it to function efficiently on devices with limited resources while maintaining or even enhancing model performance. The TinyViT model significantly reduces memory costs and computational overhead by sparsifying the output (logits) of large models and pre-saving them on disk, thus avoiding costly forward propagation during the training of smaller models. Furthermore, through automatic scaling technology, TinyViT can automatically generate a series of miniaturized ViT models based on computational and parameter constraints. TinyViT has demonstrated excellent performance in computer vision tasks such as image classification and object detection, particularly in terms of parameter efficiency and transferability. The introduction of TinyViT represents an important step in optimizing the balance between model size and performance, offering an effective solution for deploying efficient visual processing models in resource-limited environments.

The detailed structure of TinyViT-5 m is shown in Fig. 2. The TinyViT-5 m primarily consists of three modules: Patch Embedding (PE), Stage, and Patch Merging (PM). The following introduces these three modules:

Fig. 2. Detailed structure of TinyViT-5 m.

Patch embedding

In the TinyViT-5 m architecture, the Patch Embedding (PE) module converts the input image into a series of embedded vectors, which are then used for deeper network processing. It does so through a series of convolutional operations that preprocess the input image, capture its preliminary features, segment it into smaller patches, and embed each patch, reducing the computational load on subsequent network layers. Through this process, the PE module reduces the spatial dimensions of the image while retaining key visual information, providing a series of embedded vectors that contain the image’s crucial features and laying the foundation for deeper analysis and understanding by the network.

As shown in Fig. 2, the PE module specifically includes a 3 × 3 convolutional layer for downsampling the image by a factor of 2 to capture preliminary features of the image. This is immediately followed by a Gaussian Error Linear Unit (GELU)47 activation function layer to increase non-linear processing capability. Subsequently, another 3 × 3 convolutional layer is used for further downsampling and increasing the embedded dimensions. Through these consecutive operations, the entire PE module effectively transforms the original image into a series of high-dimensional vector representations, which are then fed into the subsequent parts of the model for deeper processing.
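A minimal PyTorch sketch of the PE module as described: two stride-2 3 × 3 convolutions with a GELU in between, giving a 4× spatial downsampling. The output width of 48 matches the simplified Stage 1 channel count used later in this paper; the intermediate width is an assumption:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of the PE module: 3x3 stride-2 conv, GELU, 3x3 stride-2 conv.
    Channel widths are illustrative assumptions."""
    def __init__(self, in_chans=3, embed_dim=48):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, 3, stride=2, padding=1),   # 224 -> 112
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, 3, stride=2, padding=1),  # 112 -> 56
        )
    def forward(self, x):
        return self.proj(x)

x = torch.randn(1, 3, 224, 224)
y = PatchEmbed()(x)  # -> (1, 48, 56, 56)
```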

Stage

In TinyViT-5 m, the architecture is divided into four Stages, where Stage 1 consists of the MBConv Block, and Stages 2 to 4 consist of the TinyViT Block. Here is an overview of each block:

  1. MBConv Block.

The MBConv Block, a crucial component in the TinyViT architecture, was originally introduced in the MobileNetV2 network48, utilizing the Mobile Inverted Residual Bottleneck structure. This structure leverages depth-wise convolution (DWConv) to effectively enhance the network’s ability to capture details and complex patterns while significantly reducing computational costs. The design of the MBConv Block not only facilitates the capture of finer-grained feature information but also optimizes information flow through the introduction of shortcut connections, effectively mitigating the issue of gradient vanishing in deep networks.

As shown in Fig. 2, each MBConv Block structure consists of three main components: a 1 × 1 convolutional layer for expanding the number of input feature channels, a depth-wise convolutional layer for spatial feature extraction, and another 1 × 1 convolutional layer for compressing the feature channels to ensure that the number of output feature channels remains consistent with the input. In this sequence of operations, the depth-wise convolutional layer applies filters to each input channel individually and then uses a 1 × 1 convolutional layer to recombine these features, significantly reducing the number of parameters and computational load. Additionally, the MBConv Block includes a GELU non-linear activation function to further enhance the model’s expressive capability.
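The three-part structure above can be sketched as follows; the expansion ratio and BatchNorm placement follow the usual MobileNetV2-style convention and are assumptions, not the paper’s exact configuration:

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Sketch of the MBConv Block: 1x1 expansion, 3x3 depth-wise conv,
    1x1 projection back to the input width, GELU activations, and a
    shortcut connection to mitigate vanishing gradients."""
    def __init__(self, channels, expand_ratio=4):
        super().__init__()
        hidden = channels * expand_ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),   # expand channels
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # DWConv
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1, bias=False),   # compress channels
            nn.BatchNorm2d(channels),
        )
        self.act = nn.GELU()
    def forward(self, x):
        return self.act(x + self.block(x))                # shortcut connection
```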

  2. TinyViT Block.

The TinyViT Block module is a pivotal element of the TinyViT architecture, drawing inspiration from the Self-Attention (SA) mechanism and Multi-Layer Perceptron (MLP) module in the ViT. This Block significantly improves the model’s capacity to capture image details and comprehend extensive contextual information by integrating SA mechanisms and Local Convolution (LConv) operations. This module includes Shortcut Connections, a SA module, LConv layers, and a GELU activation function, where the SA module corresponds to LeAttention (LA) as shown in Fig. 2.

As depicted in Fig. 2, each TinyViT Block first captures the global dependencies between features through the SA module, followed by enhancing the model’s ability to capture local features using the LConv layer. This combination of global and local processing strategies enables the TinyViT Block to consider both global contextual information and local detail information when processing images. Additionally, the introduction of shortcut connections helps improve the flow of information and mitigate the problem of gradient vanishing. The use of Batch Normalization (BN) and the GELU activation function further enhances the model’s non-linear expressive ability and training stability.

The design of the TinyViT Block aims to enhance the network’s depth and representational capacity by promoting information exchange among channels and providing additional nonlinear transformations, thereby achieving higher performance in various visual tasks. This innovative module, which combines SA mechanisms with convolutional operations, is one of the key factors enabling TinyViT to achieve excellent performance with a smaller model size.
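A high-level sketch of the TinyViT Block’s forward path as described above, with `nn.MultiheadAttention` standing in for LeAttention (the actual block computes windowed attention with attention biases):

```python
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    """Sketch of the TinyViT Block: self-attention for global
    dependencies, a 3x3 depth-wise local convolution for local detail,
    and an MLP, each wrapped in a shortcut connection."""
    def __init__(self, dim, num_heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # LConv
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
    def forward(self, x, h, w):
        b, n, c = x.shape                                   # tokens: n = h * w
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]   # global dependencies
        y = x.transpose(1, 2).reshape(b, c, h, w)
        x = x + self.lconv(y).flatten(2).transpose(1, 2)    # local features
        return x + self.mlp(self.norm2(x))
```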

The structure of the LA and MLP within the TinyViT Block is shown in Figs. 3 and 4.

Fig. 3. Structure of LA.

Fig. 4. Structure of MLP.

In Fig. 3, the Split operation divides the tensor along the specified dimension, which in this case is the last dimension. Scale is a scaling factor that stabilizes the gradients in the attention mechanism by scaling the dot products of the queries and keys, preventing the gradients from vanishing due to excessively large values. Attention Bias is a position-based bias term designed to enhance the model’s understanding of positional relationships in the input data, which is particularly beneficial for position-sensitive tasks like image processing. The Linear layers in Figs. 3 and 4 represent fully connected layers used for linear transformation of the input data. LN refers to Layer Normalization, which accelerates the training process and improves the model’s stability by normalizing activation values within the layer.

LA, a part of the TinyViT architecture, is inspired by LeViT49. This module calculates attention within local windows, aiming to enhance the model’s ability to capture image details and understand contextual information. Unlike traditional global SA, LA restricts the attention scope to local regions, effectively reducing computational complexity while maintaining the ability to capture critical visual patterns. LA leverages the multi-head SA architecture, processing multiple feature subspaces in parallel and thereby enhancing the model’s multi-perspective understanding of image content. The design of the LA module also considers its combination with LConv operations, further strengthening the capture of local details. This enables TinyViT to achieve efficient and accurate image processing while keeping the network lightweight. Through this self-attention mechanism, the TinyViT architecture by Wu et al. demonstrates outstanding performance across various vision tasks, especially in resource-constrained applications.

Patch merging

In TinyViT-5 m, the main function of the Patch Merging (PM) module is to achieve feature map downsampling and channel number enhancement between successive stages of the model. This module effectively halves the spatial dimensions of the feature map while doubling the number of channels through the combined use of Convolution (Conv) and BN, thereby enhancing the model’s representational capacity for image content. As shown in Fig. 2, the PM module first uses a 1 × 1 convolutional layer to expand the number of channels. This is followed by a 3 × 3 depth-wise convolutional layer to compress the spatial dimensions. Finally, another 1 × 1 convolutional layer adjusts the number of channels. In Fig. 2, the Flatten operation in the PM module flattens all dimensions of the input tensor after the specified dimension (i.e., it flattens into a one-dimensional tensor), specifically flattening the third and fourth dimensions. This process not only helps reduce computational load but also ensures that while the feature map resolution is reduced, the richness of the feature representation is maintained. The PM module plays a crucial role in the TinyViT-5 m architecture, enabling efficient transitions between different stages of the network, thereby significantly enhancing the model’s performance in computer vision tasks.
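A sketch of the PM module following the description above: 1 × 1 channel expansion, stride-2 depth-wise 3 × 3 convolution halving the spatial size, 1 × 1 channel adjustment, and flattening of the spatial dimensions into a token axis. The normalization and activation placement is an assumption:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of the PM module: downsample spatially, widen channels,
    then flatten (B, C, H, W) into token form (B, H*W, C)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_dim, out_dim, 1, bias=False)   # expand channels
        self.bn1 = nn.BatchNorm2d(out_dim)
        self.dw = nn.Conv2d(out_dim, out_dim, 3, stride=2, padding=1,
                            groups=out_dim, bias=False)          # halve spatial dims
        self.bn2 = nn.BatchNorm2d(out_dim)
        self.conv2 = nn.Conv2d(out_dim, out_dim, 1, bias=False)  # adjust channels
        self.bn3 = nn.BatchNorm2d(out_dim)
        self.act = nn.GELU()
    def forward(self, x):
        x = self.act(self.bn1(self.conv1(x)))
        x = self.act(self.bn2(self.dw(x)))
        x = self.bn3(self.conv2(x))
        return x.flatten(2).transpose(1, 2)  # flatten spatial dims into tokens
```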

Improvement strategies for TinyViT-5 m

This study adopted a comprehensive methodology, gradually optimizing the TinyViT-5 m model through structural analysis, experimental validation, and performance evaluation. First, we conducted a detailed analysis of the performance bottlenecks of existing models to identify key modules that needed improvement. Next, for these modules, we designed and implemented multiple experiments to test the effects of different improvement measures. These improvements included simplifying the model structure, optimizing the feature extraction modules, and enhancing the model’s attention mechanisms. Each step of the improvement was verified through ablation experiments and comparative experiments to ensure that while reducing computational resource consumption, the performance of the model was maximized.

Following this, improvements were made to the TinyViT-5 m, including simplifications to the TinyViT-5 m model, the addition and enhancement of the MK Block, the introduction of the CA mechanism, and improvements to the MBConv Block.

Simplification of the TinyViT-5 m model

Although TinyViT has significantly reduced computational costs through pre-training distillation technology and has performed well in resource-constrained environments, further simplifying the model can still help improve its real-time performance and resource utilization efficiency, especially in specific red jujube defect classification tasks. Particularly when facing large-scale industrial application scenarios, further reducing the number of Params and FLOPs can make the model perform better in low-resource environments. Therefore, this paper has structurally simplified the TinyViT model to further optimize its efficiency in practical applications. Here, the TinyViT-5 m network model is simplified. The simplified structure of the TinyViT-5 m network is shown in Fig. 5.

Fig. 5. Simplified TinyViT-5 m network structure.

From Fig. 5, it can be concluded that this simplification mainly focuses on the following two aspects:

  1. The number of MBConv Blocks and TinyViT Blocks stacked in Stage1, Stage3, and Stage4 has been reduced. In Stage1, the number of MBConv Blocks has been decreased from two to one, and in Stage3 and Stage4, the number of TinyViT Blocks has been reduced from six to four and from two to one, respectively;

  2. The Channels have been reduced from [64, 120, 160, 320] to [48, 96, 120, 260]. Here, ‘Channels’ refers to the number of data channels when data is input to each Stage.
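The two simplifications can be summarized as a stage configuration. The Stage2 depth is assumed unchanged, since the text lists reductions only for Stage1, Stage3, and Stage4:

```python
# Original vs. simplified TinyViT-5m stage configuration, as described above.
# "depths" = blocks stacked per Stage, "channels" = input width per Stage.
# The Stage2 depth (2) is assumed unchanged.
original   = {"depths": [2, 2, 6, 2], "channels": [64, 120, 160, 320]}
simplified = {"depths": [1, 2, 4, 1], "channels": [48, 96, 120, 260]}
```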

By reducing the number of stacked blocks and adjusting the number of channels, this study not only significantly reduced the network’s parameter count and FLOPs but also maintained the model’s effectiveness in feature extraction. This simplification is of great significance for practical applications, especially in tasks with high real-time and computational resource requirements, such as red jujube defect detection. This improvement makes the model more suitable for deployment in resource-constrained environments, such as mobile devices or embedded systems, thereby expanding its potential for application in industrial automation. Furthermore, this approach provides an optimization strategy for similar tasks in other fields, allowing the model to maintain efficient computational capabilities without significantly sacrificing performance.

In implementing these simplification measures, we adopted an experimentally validated methodology. Initially, for each Stage of the network, we conducted step-by-step experiments to test the adjustments in the number of stacked blocks and channels. By comparing the impact of different simplification schemes on model performance through experiments, we selected a design that reduces the number of MBConv Blocks and TinyViT Blocks in Stage1, Stage3, and Stage4, as well as the number of channels in all Stages. These adjustments were validated through multiple experiments and ultimately resulted in a significant reduction in the number of parameters and FLOPs while maintaining model performance. The experimental results show that the simplified TinyViT-5 m model reduced the number of parameters by 61.93% and FLOPs by 54.7% in the task of red jujube defect classification. Despite the simplification, the model’s classification accuracy only decreased by 0.21%, proving the effectiveness of this simplification strategy. Our experimental results indicate that this performance-assessment-based simplification strategy not only performs well in red jujube defect detection but also provides an effective optimization approach for other resource-constrained industrial applications.

Coordinate attention

CA is an efficient spatial relationship attention mechanism that was introduced by Hou et al. in 202150. It is specifically designed to better capture the spatial coordinate information of feature maps within DL models.

CA not only captures cross-channel information but also directional and positional awareness, enabling the model to focus more on key features by enhancing the representation of information. The structure of CA is shown in Fig. 6.

Fig. 6. CA structure diagram.

In the figure, Hard-Swish refers to the Hard Swish activation function51, and the Split operation divides the tensor along the third dimension.

CA encodes channel relationships and long-range dependencies using precise positional information, primarily including two processes: information embedding and generation.

To enable the attention module to capture long-range spatial interactions with more accurate positional information, global pooling is factorized into a pair of one-dimensional feature encodings. Given an input $X \in \mathbb{R}^{C \times H \times W}$, pooling kernels of size $(H, 1)$ and $(1, W)$ encode each channel along the horizontal and vertical orientations of the coordinate system, respectively. The output of the $c$-th channel at height $h$ is shown in Eq. (1):

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i) \qquad (1)$$

The output of the $c$-th channel at width $w$ is shown in Eq. (2):

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \qquad (2)$$

The generation step of CA is designed to better utilize the global receptive field obtained from the above transformations, encoding accurate positional information into the resulting representations. The two feature maps are concatenated and transformed by a shared 1 × 1 convolutional layer, producing an intermediate feature map $f \in \mathbb{R}^{C/r \times (H+W)}$ that encodes both the horizontal and vertical directions. The reduction ratio $r$, similar to that used in the SE module, controls the size of the module. The feature map $f$ is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, which are transformed back to the same number of channels as the input $X$ by two 1 × 1 convolutional layers $F_h$ and $F_w$52. This results in Eq. (3):

$$g^h = \sigma\left(F_h(f^h)\right), \qquad g^w = \sigma\left(F_w(f^w)\right) \qquad (3)$$

In the equation, $\sigma$ represents the sigmoid function. To reduce the complexity of the network, an appropriate reduction ratio $r$ is used to decrease the number of channels. The values $g^h$ and $g^w$ are used as attention weights. The output expression of the CA block is shown in Eq. (4):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \qquad (4)$$

CA addresses the issue of lost spatial positional information found in Squeeze-and-Excitation (SE)53 and Convolutional Block Attention Module (CBAM)54, further enhancing the accuracy of the model in image classification tasks within neural networks.
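The mechanism of Eqs. (1)–(4) can be sketched directly in PyTorch. The reduction ratio default, BatchNorm, and Hard-Swish placement follow the published CA design, but minor details here are assumptions:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CA block: directional average pooling (Eqs. 1-2),
    a shared 1x1 transform, a spatial split, and two sigmoid-gated 1x1
    transforms F_h and F_w (Eqs. 3-4)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)  # F_w
    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                   # Eq. (1): (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).transpose(2, 3)   # Eq. (2): (B, C, W, 1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = f.split([h, w], dim=2)                   # split along spatial dim
        g_h = self.conv_h(f_h).sigmoid()                    # Eq. (3)
        g_w = self.conv_w(f_w.transpose(2, 3)).sigmoid()
        return x * g_h * g_w                                # Eq. (4)
```

Because both gates lie in (0, 1), the block reweights features without changing the tensor shape, so it can be dropped between existing layers.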

In the task of red jujube surface defect classification, due to the subtle differences between the jujube surface and the background, traditional attention mechanisms tend to lose spatial location information when dealing with such issues, which affects the classification accuracy. To overcome this problem, this paper embeds the CA module into the TinyViT-5 m model. By effectively capturing spatial location information, CA allows the model to better focus on the key features of red jujube surface defects, thereby improving classification performance.

This study investigates the integration points of CA within the model, detailing three distinct structures as depicted in Fig. 7.

Fig. 7. Embedding Positions of CA. (a) Introduce CA after the PE module. (b) Introduce CA before the last GELU activation function in the MBConv Block. (c) Introduce CA at the end of the TinyViT Block.

(a) represents the use of CA after the PE module; (b) represents the use of CA before the last GELU activation function in the MBConv Block; (c) represents the use of CA at the end of the TinyViT Block. According to the ‘Experiment on Embedding Positions of CA’ in the Experimental and Discussion section, embedding CA in both (a) and (c) shows the best results. Therefore, our model will adopt the design of embedding CA in both (a) and (c).

By embedding the CA module after the PE (Patch Embedding) module and within the TinyViT Block, the model is able to more effectively capture the spatial location features of red jujube surface defects, thereby enhancing classification accuracy. Experimental results show that after integrating the CA module, the model’s classification accuracy increased from 91.33% to 92.38%, an improvement of 1.05%. Especially when dealing with defects against complex backgrounds or with low contrast, the CA module significantly enhanced the model’s ability to focus on key features, making the detection of red jujube defects more precise. This improvement not only demonstrates clear effects in the task of red jujube defect detection but also provides a feasible solution for other computer vision tasks requiring high-precision feature extraction. Therefore, this design further expands the application range of the model, especially in automated detection tasks in complex industrial environments, offering significant practical value.

To determine the optimal embedding position for the CA module, we adopted an experiment-driven approach. First, by analyzing the distribution of features in red jujube surface defects, we identified the key areas where spatial location information capture needed to be enhanced. Next, we embedded the CA module at different positions in the PE module and within the TinyViT Block, and conducted multiple ablation experiments to compare the impact of different embedding positions on the model’s classification accuracy. Ultimately, the experimental results showed that embedding the CA module after the PE module and at the end of the TinyViT Block significantly improved the model’s classification performance. This choice not only enhanced the accuracy of red jujube surface defect detection but also provided strong support for feature extraction in other complex visual tasks.

Improved MK block

In the task of classifying surface defects on red jujubes, the scale variation of jujube defects is quite large, and traditional CNNs show certain limitations in multi-scale feature extraction. Especially when dealing with complex multi-scale defects, existing models often struggle to capture both detailed and global features simultaneously, leading to a decrease in classification accuracy. To enhance the capability of the TinyViT-5 m model in multi-scale feature extraction, this paper introduces an improved MK Block.

The MK Block is an innovative DL architecture component proposed by Ganj et al. in 2023, specifically designed for handling low-resolution images55. The key innovation of this component lies in its combination of the Inception module and residual connection design principles, effectively learning both low-level and global features from blurred and noisy low-resolution images. Unlike traditional CNNs that rely on numerous layers and parameters, the MK Block simplifies the model architecture, allowing it to maintain or even improve performance under limited computational resources. The MK Block employs kernels of different sizes (such as 3 × 3, 5 × 5, and 7 × 7) for feature extraction, exploiting multi-scale features, and combines them using a 1 × 1 kernel in the final step, significantly enhancing the model’s ability to learn details and local features.
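Following the description above, a sketch of the MK Block with parallel 3 × 3, 5 × 5, and 7 × 7 branches, a 1 × 1 fusion convolution, and an internal residual connection; branch widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MKBlock(nn.Module):
    """Sketch of the MK Block: multi-kernel branches extract features at
    several scales, a 1x1 convolution fuses them, and an internal
    residual connection eases training."""
    def __init__(self, channels):
        super().__init__()
        self.b3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.b5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.b7 = nn.Conv2d(channels, channels, 7, padding=3)
        self.fuse = nn.Conv2d(3 * channels, channels, 1)  # combine scales
        self.act = nn.GELU()
    def forward(self, x):
        multi = torch.cat([self.b3(x), self.b5(x), self.b7(x)], dim=1)
        return self.act(x + self.fuse(multi))             # residual connection
```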

When dealing with low-resolution images, utilizing multi-scale features is crucial for improving the model’s recognition capability. The MK Block achieves this goal through its multi-scale kernel design, and similar approaches have been applied to different types of CNNs in other studies. By comprehensively utilizing multi-scale features, the network can better capture information about the target at different scales, thus enhancing its ability to understand and analyze images and signals. This extraction and fusion of multi-scale features not only has significant effects in the field of image processing but also performs well in other application areas.

Although introducing multi-scale features may increase the computational complexity of the model, with reasonable design, it is possible to enhance the model’s recognition performance without significantly increasing the computational cost56. The concept of multi-scale features has been extensively validated in biological visual systems. Human and animal visual systems typically process visual information at different scales simultaneously, allowing for better recognition and understanding of complex visual scenes. Therefore, incorporating the concept of multi-scale features into AI models has biological rationality and inspiration. In summary, the rationale behind multi-scale features lies not only in their ability to provide richer and more robust feature information but also in their potential to achieve a balance between complexity and performance through effective design, resulting in excellent performance across various tasks.

Specifically, the MK Block achieves multi-scale feature extraction while using internal residual connections to make the model more effectively trainable, addressing the gradient vanishing problem in deep networks. In low-resolution image classification tasks, the MK Block has demonstrated outstanding performance, particularly in terms of parameter efficiency and the ability to handle noisy images. The introduction of the MK Block represents a significant step towards balancing enhanced low-resolution image processing performance with model simplicity, providing an effective solution for deploying efficient image classification models in resource-constrained environments.

The structure of MK Block is shown in Fig. 8.

Fig. 8. Structure of MK Block.

However, directly using MK Block would lead to a significant increase in the network’s FLOPs and Params. Therefore, this paper introduces improvements to MK Block. The structure of the improved MK Block is shown in Fig. 9.

Fig. 9. Structure of the Improved MK Block.

In the improved MK Block, Depth-wise Separable Convolution (DWSConv) is used to replace the original Conv in the MK Block. DWSConv consists of a DWConv and a Point-wise Convolution (PWConv). This improvement significantly reduces the model Params and FLOPs, enhancing computational efficiency while mitigating the risk of overfitting. This is particularly useful for resource-constrained environments or applications requiring rapid execution.

This improvement not only enhances the performance of the MK Block but also provides insights for optimizing other DL models. The advantages of DWSConv lie in its efficient computation and parameter savings, making it an ideal choice for resource-constrained environments. In many application scenarios, fast and accurate processing capabilities are crucial. Therefore, optimizing model architectures to reduce computational complexity and increase efficiency is of significant importance.

The scientific basis for using DWSConv to reduce model complexity lies in its decomposed structure. DWSConv decomposes standard Conv into two parts: DWConv and PWConv. DWConv performs convolution operations independently on each channel, while PWConv is used for linear combinations between channels. Compared with the dense channel mixing of standard convolution, DWSConv significantly reduces the computational load57. This separation not only decreases the computational burden but also reduces the number of model Params, effectively mitigating the risk of overfitting and improving generalization. As a result, it performs well under limited sample conditions. These characteristics collectively enhance the model’s performance and effectiveness across various tasks.
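The parameter saving is easy to verify numerically. For a 5 × 5 convolution on 96 channels (an illustrative width), the depth-wise separable decomposition uses roughly 20× fewer parameters than the standard convolution:

```python
import torch.nn as nn

def param_count(m):
    return sum(p.numel() for p in m.parameters())

# A 5x5 standard convolution vs. its depth-wise separable replacement
# (DWConv followed by 1x1 PWConv), as used in the improved MK Block.
c_in, c_out, k = 96, 96, 5
standard = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
dws = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),  # DWConv
    nn.Conv2d(c_in, c_out, 1, bias=False),                              # PWConv
)
# standard: 96*96*25 = 230,400 params; DWSConv: 96*25 + 96*96 = 11,616
```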

This paper also explores the embedding positions of the improved MK Block within the model, implementing three design schemes as shown in Fig. 10.

Fig. 10. Embedding Positions of the Improved MK Block. (a) Introduce the improved MK Block before the CA in PE. (b) Introduce the improved MK Block after the CA in PE. (c) Introduce the improved MK Block at the end of the MBConv Block.

(a) indicates introducing the improved MK Block before the CA in the PE module; (b) indicates introducing the improved MK Block after the CA in the PE module; (c) indicates introducing the improved MK Block at the end of the MBConv Block. According to the ‘Experiment on Embedding Positions of the Improved MK Block’ in the Experimental and Discussion section, embedding the improved MK Block in (c) shows the best results. Therefore, our model will adopt the improved MK Block design with the embedding position in (c).

By embedding the improved MK Block at the end of the MBConv Block, the model is able to more fully utilize multi-scale features for defect detection. Experimental results show that after embedding the improved MK Block, the model’s classification accuracy increased by 0.59%, and more efficient feature extraction was achieved without significantly increasing computational resources. Especially in the classification task of red jujube surface defects, multi-scale features are crucial for capturing defects of different sizes and types. The improved MK Block not only enhances the model’s feature extraction capability when dealing with complex defects but also improves overall classification accuracy without significantly increasing computational resources. The successful application of this design demonstrates the great potential of the MK Block in industrial defect detection, especially in tasks that require a balance between performance and efficiency, its application prospects are broad. This provides strong technical support for other similar industrial vision tasks.

To ensure that the improved MK Block performs optimally in the task of red jujube surface defect detection, we adopted a methodology based on experimental validation and performance evaluation. First, we analyzed existing multi-scale feature extraction methods and identified their shortcomings in handling defects of different scales. Then, we designed various improvement schemes, including replacing standard convolutions with DWSConv to reduce computational complexity. Subsequently, we conducted a series of experiments to compare the impact of different embedding positions on model performance, ultimately choosing to embed the improved MK Block at the end of the MBConv Block. The experimental results from this process showed that the improved MK Block can significantly enhance classification accuracy without significantly increasing computational resources, validating the effectiveness and feasibility of our design.

Improved MBConv block

In the task of classifying surface defects on red jujubes, the original MBConv Block in the TinyViT-5 m model shows certain limitations in feature extraction, especially when dealing with complex inter-channel feature relationships, the results are not ideal. To enhance the model’s capabilities in feature extraction and feature reorganization, this paper has made improvements to the MBConv Block in TinyViT-5 m. The structure of the improved MBConv Block is shown in Fig. 11.

Fig. 11. Structure of the improved MBConv block.

The improved MBConv Block introduces channel splitting and recombination, the application of DWConv on partial channels, and the inclusion of an SE block, compared to the original MBConv Block. These enhancements provide the model with stronger feature extraction and adjustment capabilities, enhancing the network’s responsiveness to important features. Consequently, this significantly boosts performance while maintaining computational efficiency.

The strategy of channel splitting and recombination increases the network’s flexibility in handling different feature information, enabling the model to more effectively distinguish and recombine features, thereby enhancing its ability to integrate information. This channel splitting operation is achieved through the Chunk operation shown in Fig. 11, which divides the input data into two parts along the channel dimension, inspired by CSPNet58. DWConv is applied only to a portion of the channels, preserving the lightweight characteristic while enhancing the ability to capture spatial features. The introduction of the SE block, which learns the relationships between channels, optimizes the network’s evaluation of feature importance. This allows the model to focus more on useful information, thereby improving recognition efficiency and accuracy.
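The split-and-recombine idea can be sketched in PyTorch. This is a minimal illustration under our own naming (SEBlock and SplitDWBlock are hypothetical names, not the authors' exact block): the input channels are chunked in two, a depthwise 3 × 3 convolution is applied to one half, the halves are re-concatenated, and an SE block reweights the channels.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling + two 1x1 convs gate each channel."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.gate(x)

class SplitDWBlock(nn.Module):
    """Hypothetical sketch: chunk the channels, run a depthwise 3x3 conv on
    one half only, re-concatenate, then reweight channels with an SE block."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.dw = nn.Conv2d(half, half, 3, padding=1, groups=half)
        self.se = SEBlock(channels)
    def forward(self, x):
        a, b = x.chunk(2, dim=1)                   # channel splitting (Chunk)
        a = self.dw(a)                             # DWConv on part of the channels
        return self.se(torch.cat([a, b], dim=1))   # recombination + SE gating

x = torch.randn(2, 32, 16, 16)
y = SplitDWBlock(32)(x)
assert y.shape == x.shape
```

Because the depthwise convolution touches only half the channels, the extra cost over a plain identity path is small, which is consistent with the lightweight goal described above.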

These design improvements collectively enhance the model’s performance and efficiency, ensuring that the improved MBConv Block not only maintains computational efficiency in handling complex visual tasks but also significantly boosts performance through smarter feature processing, especially in feature extraction and representation capabilities, enabling it to perform better across various visual tasks.

In the task of classifying surface defects on red jujubes, these improvements are particularly important for capturing complex defect features. The types of defects on the surface of red jujubes are diverse and often have subtle differences with the background. Therefore, by employing strategies of channel splitting and reorganization, combined with the use of DWConv and SE blocks, the improved MBConv Block can more effectively identify and classify these complex features. This enhancement not only improves the model’s performance in red jujube defect detection but also reduces the reliance on computational resources, making the model more suitable for deployment in resource-constrained industrial environments. These results indicate that the improved MBConv Block has broad application prospects in industrial vision tasks, especially in automated detection scenarios that require handling complex features, where its technological advantages are significant.

To achieve these improvements, we adopted an experiment-driven methodology. Initially, addressing the shortcomings of the original MBConv Block in handling inter-channel feature relationships, we designed a series of experiments to verify the effectiveness of different improvement schemes. Specifically, we introduced channel splitting and reorganization strategies, applied DWConv to partial channels, and combined SE blocks, assessing the impact of these improvements on model performance through step-by-step experiments. By comparing and analyzing the performance of different design schemes in the task of red jujube defect classification, we ultimately selected the optimal combination of improvements. The experimental results showed that the improved MBConv Block increased classification accuracy by 0.53%, with almost no change in FLOPs and the number of parameters. Moreover, the model’s feature extraction capability was significantly enhanced, demonstrating higher precision when dealing with complex defects. These results validated the effectiveness and feasibility of our design, proving that the improved MBConv Block has broad application potential in resource-constrained industrial environments.

RJ-TinyViT

This paper first simplifies the TinyViT network model, then introduces the improved MK Block and CA into TinyViT, and finally improves the MBConv Block, resulting in an enhanced TinyViT network model named RJ-TinyViT. The complete structure of the network is depicted in Fig. 12.

Fig. 12. RJ-TinyViT network model.

Operational environment and parameter settings

All experiments in this study were conducted on a single computer with the following specifications: an Intel(R) Core(TM) i9-12900H CPU running at 2.50 GHz; an NVIDIA GeForce RTX 3070 Ti Laptop GPU with 8 GB of dedicated memory; and 32 GB of system memory. The DL environment consisted of Python 3.6.13, CUDA 11.3.1, PyTorch 1.10.2, and torchvision 0.11.3, running on Windows 11.

The experiments maintained consistent training parameters across all models, setting the training image size to 224 × 224 pixels and the batch size to 16. The models were trained using the cross-entropy loss function and the AdamW59 optimizer with an initial learning rate of 3.125 × 10⁻⁵. A cosine annealing schedule was applied to decay the learning rate from the outset of training. It is important to emphasize that, to ensure a fair comparison of network model modifications, this study did not employ transfer learning for model training. Each model was trained for 300 epochs.
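The stated configuration maps onto standard PyTorch APIs as follows. The placeholder model and the elided loop body are illustrative only; this is a sketch of the settings, not the authors' training script.

```python
import torch
import torch.nn as nn

# Hypothetical placeholder standing in for RJ-TinyViT (4 defect classes).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 4))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3.125e-5)
# Cosine annealing decays the learning rate over the full 300-epoch run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

lrs = []
for epoch in range(300):
    # ... one pass over the 224x224 training batches (batch size 16),
    # computing criterion(model(images), labels) and stepping the optimizer ...
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()

# The schedule starts at the initial rate and anneals toward zero.
assert lrs[0] == 3.125e-5 and lrs[-1] < lrs[0]
```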

Evaluation metrics

In the realm of DL, critical metrics for evaluating the performance of classification models encompass accuracy, recall, precision, and F1-score60. Accuracy represents the proportion of correctly predicted samples to the total number of samples; recall refers to the proportion of positive samples correctly identified by the model out of the total positive samples; precision is the proportion of true positive samples among those predicted as positive by the model; the F1-score is a metric calculated based on the balance between precision and recall, providing a more comprehensive representation of the model’s overall performance. The formulas for these performance metrics are shown in Eq. (5) through (8), where TP represents the number of true positives, FP represents the number of false positives, FN represents the number of false negatives, and TN represents the number of true negatives.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$
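Equations (5) through (8) can be computed directly from the four confusion-matrix counts; a minimal sketch (the function name is ours):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int):
    """Eqs. (5)-(8): accuracy, recall, precision, and F1-score from the
    confusion-matrix counts of a binary (one-vs-rest) classifier."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Example: 90 TP, 10 FP, 5 FN, 95 TN out of 200 samples.
acc, rec, prec, f1 = classification_metrics(90, 10, 5, 95)
print(acc)   # 0.925
print(prec)  # 0.9
```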

Experiments and discussion

Experiment on embedding positions of CA

This paper compares the effects of the CA mechanism at three different embedding positions. The results are shown in Table 1.

Table 1. Performance comparison of different CA embedding position schemes.

Method Accuracy(%) Precision(%) Recall(%) F1-Score(%)
TinyViT-5 m 91.54 91.25 91.27 91.25
Simplified TinyViT-5 m 91.33 91.04 91.06 91.04
Simplified TinyViT-5 m with (a) 92.00 91.83 91.72 91.77
Simplified TinyViT-5 m with (b) 91.79 91.60 91.34 91.47
Simplified TinyViT-5 m with (c) 91.79 91.48 91.50 91.49
Simplified TinyViT-5 m with (a) and (b) 91.13 90.98 90.58 90.76
Simplified TinyViT-5 m with (a) and (c) 92.38 92.03 92.47 92.24
Simplified TinyViT-5 m with (b) and (c) 91.75 91.43 91.69 91.55
Simplified TinyViT-5 m with (a), (b) and (c) 90.96 90.81 90.41 90.59

Significant values are in bold.

Table 1 details the impact of embedding the CA module at different locations on the performance of the simplified TinyViT-5 m network. Specifically, (a) indicates the use of CA after the PE module, (b) indicates the use of CA before the last GELU activation function in the MBConv Block, and (c) indicates the use of CA at the end of the TinyViT Block, with the exact embedding positions shown in Fig. 7. The data in the table shows that embedding the CA module at positions (a) and (c) resulted in the most significant performance improvement for the simplified TinyViT-5 m network: a 1.05% increase in classification accuracy over the simplified network and a 0.84% increase over the original network. When the CA module is embedded at both position (a) and position (c), the network not only performs better in handling complex backgrounds and subtle defects but also maintains high efficiency in terms of resource consumption.

Furthermore, the data in Table 1 also shows that when the CA module is embedded only at position (b), although it also improves classification accuracy, the effect is relatively limited compared to other embedding positions, indicating that embedding the CA module at the end of the MBConv Block has a smaller impact on overall performance. This further validates the effectiveness of the design choice to embed the CA module after the PE module and at the end of the TinyViT Block. Therefore, this study chooses to embed the CA module at positions (a) and (c) to maximize the model’s classification performance and computational efficiency.

Experiment on embedding positions of the improved MK block

This paper compares the effects of the improved MK Block at three different embedding positions. It is important to note that this experiment was conducted on the TinyViT-5 m model, which had been simplified and then had CA embedded at positions (a) and (c). In the table, TinyViT-5 m* represents the TinyViT model after these improvements. The results are shown in Table 2.

Table 2. Performance comparison of different improved MK block embedding position schemes.

Method Accuracy(%) Precision(%) Recall(%) F1-Score(%)
TinyViT-5 m 91.54 91.25 91.27 91.25
TinyViT-5 m* with (a) 92.54 92.39 92.16 92.27
TinyViT-5 m* with (b) 92.79 92.53 92.56 92.67
TinyViT-5 m* with (c) 92.83 92.80 92.84 92.68

Significant values are in bold.

Table 2 details the impact of embedding the improved MK Block at different positions on the performance of the TinyViT-5 m network. Specifically, (a) indicates the introduction of the improved MK Block before CA in the PE module, (b) indicates the introduction of the improved MK Block after CA in the PE module, and (c) indicates the introduction of the improved MK Block at the end of the MBConv Block, with the specific embedding positions shown in Fig. 10. The experimental results show that embedding the MK Block at position (c) resulted in the most significant performance improvement for the TinyViT-5 m network, with a 1.29% increase in classification accuracy.

Further analysis of the data in Table 2 reveals that while embedding the improved MK Block at positions (a) and (b) also led to performance improvements, the effects were slightly inferior to that of position (c). Specifically, the embedding position at (c) allows the model to better utilize multi-scale features, especially when dealing with complex red jujube surface defects, demonstrating higher classification accuracy and stability.

Ablation experiments

To assess the performance of the enhanced MK Block, CA, and the revised MBConv Block in the RJ-TinyViT model, this study carried out ablation experiments on the ‘Jujube2000’ test set. The findings are presented in Table 3.

Table 3. Comparison of ablation studies on the “Jujube2000” test set.

Improved MK Block  CA  Improved MBConv Block  Accuracy(%)  Precision(%)  Recall(%)  F1-Score(%)  FLOPs(G)  Params(M)
-  -  -  91.33  91.04  91.06  91.04  0.51  1.93
✓  -  -  91.92  91.27  91.50  91.36  0.69  1.99
-  ✓  -  92.38  92.03  92.47  92.24  0.51  1.96
-  -  ✓  91.86  91.21  91.44  91.30  0.50  1.93
✓  ✓  -  92.83  92.53  92.84  92.67  0.69  2.01
✓  -  ✓  92.42  92.34  92.43  92.45  0.69  1.99
-  ✓  ✓  92.78  92.45  92.73  92.54  0.50  1.96
✓  ✓  ✓  93.38  92.96  93.16  93.06  0.69  2.02

Significant values are in bold.

It should be noted that the ablation experiments in this study were conducted based on the simplified TinyViT-5 m model. The experimental results show that the improved MK Block increased the classification accuracy by 0.59% without significantly increasing the model’s parameter count (Params) and the number of floating-point operations (FLOPs). This improvement primarily enhanced the model’s performance in handling complex red jujube surface defects by strengthening the extraction capability of multi-scale features. After introducing the CA module, the model’s classification accuracy was further increased by 1.05%, indicating that the CA module played a key role in capturing spatial location information and enhancing feature focusing, also with minimal increase in Params and FLOPs. Additionally, the improved MBConv Block increased the model’s classification accuracy by 0.53%, reflecting the effectiveness of the channel splitting and reorganization strategy, DWConv, and the combination of SE blocks in optimizing feature extraction and channel relationships, with negligible impact on FLOPs and Params.

Ultimately, with the integration of these improvements, the classification accuracy of TinyViT-5 m reached 93.38%, with FLOPs at 0.69G and Params at 2.02 M. Compared to the initial model, the overall classification accuracy increased by 1.84%, while Params and FLOPs were reduced by 60.16% and 41.03%, respectively. These results clearly demonstrate that the combination of the improved MK Block, CA module, and MBConv Block significantly enhanced the model’s recognition performance in the task of classifying surface defects on red jujubes, while maintaining low computational costs and resource consumption. The successful application of these improvement measures demonstrates their potential in other resource-constrained industrial applications.
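The reported reductions follow directly from the FLOPs and parameter figures for the original TinyViT-5 m (1.17 G, 5.07 M) and RJ-TinyViT (0.69 G, 2.02 M); a quick arithmetic check:

```python
# Reduction relative to the original TinyViT-5 m (Table 4 figures):
flops_reduction = 1 - 0.69 / 1.17   # GFLOPs: 1.17 -> 0.69
params_reduction = 1 - 2.02 / 5.07  # M params: 5.07 -> 2.02

print(f"{flops_reduction:.2%}")   # 41.03%
print(f"{params_reduction:.2%}")  # 60.16%
```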

Model comparative analysis

In the experimental platform set up for this study, the performance of RJ-TinyViT was compared with current mainstream classification models on the ‘Jujube2000’ dataset. These mainstream networks include ResNet1827, ResNet3427, ResNet5027, DenseNet12128, Swin-Transformer-Tiny31, Swin-TransformerV2-Tiny61, ConvNeXt-Tiny46, and the latest classification networks RepViT-M0.962 and TransNeXt-Micro63. Figure 13 shows the trend of each model’s validation accuracy and training loss with the number of training epochs. In Fig. 13(a), the horizontal axis represents the number of training epochs, and the vertical axis represents the model’s validation accuracy. Similarly, in Fig. 13(b), the horizontal axis represents the number of training epochs, and the vertical axis represents the model’s training loss.

Fig. 13. Training and Validation for Each Model on the “Jujube2000” Dataset. (a) Model Validation Accuracy Curves with Epochs. (b) Trend of Model Training Loss Curves with Epochs.

As demonstrated in Fig. 13 (a), the RJ-TinyViT, original TinyViT-5 m, DenseNet121, RepViT-M0.9, and TransNeXt-Micro models began to converge at 50 epochs and gradually stabilized, achieving full convergence by 300 epochs. This indicates that these models were able to quickly adapt to the dataset in the early stages of training and maintained stable performance in subsequent stages. In contrast, the ResNet18, ResNet34, and ResNet50 models had a slightly later convergence time, starting to converge at 70 epochs and reaching full convergence by 300 epochs. The Swin-Transformer-Tiny, Swin-TransformerV2-Tiny, and ConvNeXt-Tiny models had the latest convergence time, beginning to converge at 170 epochs, which may be related to their more complex model structures, but they also achieved full convergence by 300 epochs.

Among all these models, RJ-TinyViT achieved the highest validation accuracy of 93.38%, significantly outperforming other models, especially the original TinyViT-5 m at 91.54%. This suggests that through the combination of the improved MK Block, CA module, and MBConv Block, RJ-TinyViT is better able to capture and utilize key features in the data. In comparison, the Swin-Transformer-Tiny model had the lowest validation accuracy of only 80.91%, indicating its relatively poor performance in handling the task of classifying surface defects on red jujubes.

According to the data in Fig. 13 (b), the training loss curves of the RJ-TinyViT, original TinyViT-5 m, DenseNet121, RepViT-M0.9, and TransNeXt-Micro models were relatively stable and low, indicating that these models effectively reduced errors during training and maintained good generalization. RJ-TinyViT in particular tracked closely behind DenseNet121 and the original TinyViT-5 m in training loss, demonstrating its strength in balancing accuracy and training efficiency. In contrast, the training losses of ResNet18, ResNet34, ResNet50, and ConvNeXt-Tiny were slightly higher; although still competitive, their performance in the red jujube defect classification task was slightly inferior to RJ-TinyViT.

Among all models, Swin-Transformer-Tiny and Swin-TransformerV2-Tiny had higher training losses, confirming the challenges they faced during the training process, possibly due to their complex architecture making them more difficult to optimize and adapt to the dataset. As a result, both models performed poorly in terms of validation accuracy and training loss, ranking at the bottom. These results further validate the superiority of RJ-TinyViT in the task of classifying surface defects on red jujubes; its improved architecture not only enhances the model’s accuracy but also maintains low training loss, making it a powerful tool for handling similar tasks.

All the models trained were evaluated on the ‘Jujube2000’ test set using their respective performance metrics, and the detailed experimental outcomes are presented in Table 4.

Table 4. Performance comparison of various models on the “Jujube2000” test set.

Model Accuracy(%) Precision(%) Recall(%) F1-Score(%) FLOPs(G) Params(M) FPS
ResNet18 86.33 86.30 86.33 86.29 1.82 11.18 30.62
ResNet34 86.46 86.43 86.46 86.42 3.68 21.29 23.58
ResNet50 85.66 85.63 85.66 85.62 4.12 23.52 12.59
DenseNet121 90.01 89.98 90.01 89.97 2.88 6.96 11.91
Swin-Transformer-Tiny 80.91 80.88 80.91 80.87 4.36 27.52 9.29
Swin-TransformerV2-Tiny 83.26 83.23 83.26 83.22 3.42 27.58 6.79
ConvNeXt-Tiny 85.09 85.06 85.09 85.05 4.46 27.82 15.82
RepViT-M0.9 89.84 89.24 89.22 89.20 0.80 5.10 24.69
TransNeXt-Micro 90.30 89.57 89.56 89.55 2.70 12.80 20.53
TinyViT-5 m 91.54 91.25 91.27 91.25 1.17 5.07 35.31
RJ-TinyViT(Ours) 93.38 92.96 93.16 93.06 0.69 2.02 40.57

Significant values are in bold.

The results indicate that RJ-TinyViT demonstrates clear superiority in predictive accuracy and other performance metrics. Based on multiple improvements to the TinyViT-5 m model, this study successfully enhanced the model’s accuracy while significantly reducing FLOPs and Params. Ultimately, the RJ-TinyViT model achieved a predictive accuracy of 93.38%, with Params reduced to 2.02 M, representing a 1.84% increase in accuracy and a 60.16% reduction in Params compared to the original TinyViT-5 m model’s 91.54% accuracy and 5.07 M Params.

In terms of FLOPs, RJ-TinyViT also achieved significant optimization, with a reduction of 41.03%, further indicating a substantial improvement in the model’s computational efficiency. Compared to other benchmarked network models, such as the ResNet and Swin-Transformer series, RJ-TinyViT not only excels in accuracy and resource consumption but also reaches the highest value in frames per second (FPS), meaning it has stronger real-time processing capabilities for industrial production.

Especially in the red jujube processing industry, fast and accurate sorting is key to ensuring production efficiency and product quality. If the sorting speed is too slow, it will not meet the demands of the production line, leading to low production efficiency. RJ-TinyViT, with the highest accuracy and lowest resource consumption, achieves the fastest runtime, effectively supporting large-scale industrial production processing scenarios and significantly enhancing the automation and efficiency of red jujube surface defect classification tasks.

The data in Table 4 further validates the comprehensive advantages of RJ-TinyViT: among all the compared network models, RJ-TinyViT has the lowest Params and FLOPs, the highest predictive accuracy, and also the highest frames per second (FPS). These advantages make RJ-TinyViT not only surpass the original model in accuracy but also excel in computational efficiency and resource utilization, demonstrating its great potential in practical industrial applications.

Confusion matrix

A confusion matrix is a useful visual aid for assessing the performance of classification models, illustrating the correspondence between predicted and actual data. It also aids in a more detailed examination of a model’s strengths and weaknesses. As depicted in Fig. 14, the confusion matrix for each model discussed earlier is visualized. The horizontal axis of the matrix denotes the true labels of the images, and the vertical axis shows the predictions made by the model. Entries on the matrix’s diagonal represent correct predictions, while off-diagonal elements indicate discrepancies between predictions and actual outcomes.
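The matrix described here can be accumulated in a few lines; a minimal sketch using the axis convention of Fig. 14 (rows index predictions, columns index true labels; the function name is ours):

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i][j] counts samples with true class j predicted as class i."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[p][t] += 1
    return cm

# Toy 3-class example: diagonal entries are correct predictions.
cm = confusion_matrix([0, 1, 2, 2, 1], [0, 1, 2, 1, 1], num_classes=3)
correct = sum(cm[i][i] for i in range(3))
total = sum(sum(row) for row in cm)
print(correct, total)   # 4 correct out of 5
print(correct / total)  # accuracy = 0.8
```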

Fig. 14. Confusion Matrix for Each Model. (a) ResNet18, (b) ResNet34, (c) ResNet50, (d) DenseNet121, (e) Swin-Transformer-Tiny, (f) Swin-TransformerV2-Tiny, (g) ConvNeXt-Tiny, (h) RepViT-M0.9, (i) TransNeXt-Micro, (j) TinyViT-5 m, (k) RJ-TinyViT.

The analysis results show that when identifying mold defects on red jujubes, the diversity of their characteristics makes these defects the most likely to be incorrectly classified as other types. This is reflected across the network models, but TinyViT-5 m and its improved version RJ-TinyViT demonstrate stronger discrimination capabilities than the others. Deformation defects, whose features are less pronounced, are also easily misclassified, with especially high misclassification rates in models such as ResNet18, ResNet34, and ResNet50. RJ-TinyViT, through its improved multi-scale feature extraction and enhanced attention mechanisms, better captures these subtle features and thereby effectively reduces misclassifications.

Crack defects, due to their visual similarity to mold spots, are often incorrectly identified as mold defects. In this regard, RJ-TinyViT and TinyViT-5 m continue to outperform other models, especially with the introduction of the CA module, which allows the model to more accurately capture the subtle features of cracks and reduces confusion with mold defects.

Among many mainstream networks, ConvNeXt-Tiny, TinyViT-5 m, RepViT-M0.9, and TransNeXt-Micro perform relatively well in classification performance due to their newer network structures and more rational designs. However, TinyViT-5 m, through its optimized design, and especially after simplification as RJ-TinyViT, further improves in classification accuracy. By simplifying the structure of TinyViT-5 m and introducing the CA module, the improved MK Block, and enhancements to the MBConv Block, RJ-TinyViT demonstrates even more outstanding performance in classification, especially when dealing with complex red jujube surface defects, showing a clear advantage. These improvements not only increase the model’s classification accuracy but also enhance its ability to handle subtle difference features, making it more advantageous in practical applications in red jujube processing.

Red jujubes are crops that grow naturally, hence even defects of the same type may vary, and the differences between different types of defects might be relatively small. This makes defect detection in jujubes more complex compared to industrially manufactured goods produced under a fixed process. Although in most experiments, RJ-TinyViT displays advanced classification performance, in certain specific instances, the model more frequently misidentifies types of jujubes with mold, deformation, and cracks. Figure 15 illustrates this issue, where the areas marked with a green dashed box highlight the defect regions that require the model’s attention. This paper seeks to investigate and discuss the potential causes for errors in the model’s classification of anomalous images.

Fig. 15. Examples of Failure Cases. (a) A mold defect mistakenly identified as a crack defect. (b) A deformation defect mistakenly identified as a normal jujube. (c) A crack defect mistakenly identified as a mold defect.

As shown in Fig. 15 (a), the RJ-TinyViT model mistakenly identified a mold defect as a crack defect. This misjudgment may stem from the small and inconspicuous mold area, causing its characteristics to resemble those of a crack defect, thereby confusing the model’s judgment. Even the human eye may struggle to accurately distinguish between these two types of defects under such circumstances, further validating the classification challenge in this scenario. This suggests that the model may require more refined feature extraction and focusing mechanisms when dealing with subtle and inconspicuous features to improve classification accuracy.

As demonstrated in Fig. 15 (b), RJ-TinyViT erroneously recognized a deformation defect on a red jujube as a normal one. Such misjudgment may be due to the influence of shooting angles and lighting conditions, causing the defect area to appear as the characteristics of a normal jujube in the image, thus masking the actual deformation. This indicates that the model may need further robustness optimization when processing images under different environmental conditions to address the challenges posed by changes in lighting and angles.

In Fig. 15 (c), the RJ-TinyViT model once again mistakenly identified a crack defect as a mold defect. This may be because the crack area exhibits characteristics similar to mold in certain situations, making it difficult for the model to make the correct judgment. This misjudgment further illustrates the high similarity between crack and mold features in some cases, suggesting that we need to pay more attention to the differential extraction of these features in future model optimizations.

These misjudgment cases demonstrate that even the improved RJ-TinyViT model may face challenges in extremely complex classification tasks. However, identifying the causes behind these misjudgments provides important directions for further model optimization. For example, more advanced attention mechanisms or multimodal fusion techniques could be considered to enhance the model’s performance in complex scenarios. These improvements will not only help increase the accuracy of red jujube surface defect classification but also provide valuable references for other similar industrial vision tasks.

Through the analysis of these misclassification cases, this study will carry out targeted optimizations on the RJ-TinyViT model to reduce misjudgments in future research.

Visual interpretation of the model

This study uses Gradient-weighted Class Activation Mapping (Grad-CAM)64 technology to enhance understanding and analysis of how surface defects on red jujubes are classified by the model. Grad-CAM produces heatmaps that highlight the importance of certain areas within an image by assigning and aggregating weights from specific feature maps within the model. In these heatmaps, areas assigned higher weights appear red, indicating their critical role in helping the model distinguish between different categories. Conversely, areas with lower weights are more blue, indicating they contribute less to the model’s classification. Figure 16 displays the effects of heatmaps generated using Grad-CAM for different models in the classification of various categories of red jujube surface defects.
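The Grad-CAM procedure described above can be reproduced with PyTorch forward and backward hooks. The sketch below is a generic minimal implementation under our own naming, not the authors' code, demonstrated on a toy CNN: feature maps from the target layer are weighted by the spatially averaged gradients of the class score, summed, and passed through a ReLU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model: nn.Module, target_layer: nn.Module,
             image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Minimal Grad-CAM sketch: weight the target layer's feature maps by the
    spatially averaged gradients of the class score, then ReLU the sum."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    try:
        score = model(image)[0, class_idx]
        model.zero_grad()
        score.backward()
    finally:
        h1.remove(); h2.remove()
    fmap, grad = feats[0], grads[0]                # (1, C, H, W)
    weights = grad.mean(dim=(2, 3), keepdim=True)  # global-average-pooled grads
    cam = F.relu((weights * fmap).sum(dim=1))      # (1, H, W), non-negative
    return cam / (cam.max() + 1e-8)                # normalize toward [0, 1]

# Toy CNN standing in for a real classifier; the conv layer is the target.
conv = nn.Conv2d(3, 8, 3, padding=1)
model = nn.Sequential(conv, nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 4))
cam = grad_cam(model, conv, torch.randn(1, 3, 32, 32), class_idx=0)
assert cam.shape == (1, 32, 32) and float(cam.min()) >= 0.0
```

The normalized map is then upsampled to the input resolution and overlaid on the image to produce heatmaps like those in Fig. 16.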

Fig. 16. Comparison of heatmaps generated by each model using Grad-CAM.

Through these heatmaps, we can clearly identify the areas of defects that the model focuses on. The heatmaps generated by ResNet18, ResNet34, ResNet50, and DenseNet121 models are relatively similar, and these models can accurately locate the defect areas, demonstrating good capturing ability for the target regions. However, these models still have certain limitations when dealing with complex red jujube defects, especially when facing low-contrast or complex backgrounds, where the localization accuracy may be insufficient.

Due to the transformer architecture of Swin-Transformer-Tiny and Swin-TransformerV2-Tiny, they inherently possess the ability to handle long-range information. As a result, the heatmaps of these models show a broader focus range, but this breadth also leads to a less precise focus on the defect areas. In complex backgrounds, the transformer architecture may disperse attention, resulting in a lack of focus on key features.

In comparison, ConvNeXt-Tiny, RepViT-M0.9, and TransNeXt-Micro models, while also able to focus on defect areas with relatively good accuracy, do not match the precision of the focus points of the ResNet series and DenseNet121 models. Although these models perform well in handling most red jujube defects, there is still room for improvement in capturing detailed features.

TinyViT-5 m performs excellently, being able to more accurately focus on the defect areas of red jujubes, thereby achieving better classification results. The heatmaps show that TinyViT-5 m can effectively identify key areas when dealing with subtle features and complex backgrounds, reducing the occurrence of misclassification. In RJ-TinyViT, through the simplification of the original TinyViT-5 m model, the introduction of the CA module and improved MK Block, and the improvement of the MBConv Block, the model’s focusing ability on defect areas has been further optimized. Specifically, the heatmaps show that RJ-TinyViT can more precisely locate the most critical feature areas when dealing with red jujube surface defects, thereby enhancing overall classification performance.

These improvements not only increase the model’s classification accuracy but also significantly enhance the model’s adaptability and robustness when dealing with complex defect scenarios. Through the analysis of these heatmaps, we can clearly see the advantages of RJ-TinyViT in the task of classifying red jujube defects and its potential value in industrial applications.

Conclusion

Agriculture 4.0, which emphasizes intelligence and precision, aims to achieve efficient agricultural production through advanced intensification and coordination. However, the expansion of red jujube planting areas contrasts with the lag in post-harvest processing technology, and the traditional jujube industry remains heavily dependent on manual labor. By introducing machine vision technology to raise the automation level of red jujube production, Agriculture 4.0 can be combined with Industry 4.0. Furthermore, employing deep processing techniques for substandard jujubes not only optimizes resource utilization but also alleviates environmental strain.

This study has made several improvements to the TinyViT-5 m network model to enhance the classification of red jujube defects. First, the TinyViT-5 m network structure was simplified, and the CA module was introduced in the PE and TinyViT Blocks, while the MBConv Block was improved and the MK Block was introduced, ultimately forming the RJ-TinyViT model. Ablation experiments were conducted to assess the contributions of the improved MK Block, CA, and MBConv Block to the model’s performance. Comparative experimental results show that RJ-TinyViT has significant performance advantages, with a 60.16% reduction in the number of parameters, a 41.03% reduction in FLOPs, and a 1.84% increase in classification accuracy. RJ-TinyViT, with its high accuracy and low parameter count, achieves faster runtime under the same hardware conditions, making it highly suitable for the industrial production of red jujubes.

These improvements not only demonstrate significant performance enhancements in red jujube defect detection but also provide an effective solution for other similar tasks in agricultural product quality inspection. Especially against the backdrop of Agriculture 4.0 and Industry 4.0, this lightweight and efficient model design can provide technical support for automated inspection in large-scale industrial production. This not only helps to increase agricultural production efficiency and reduce resource waste but also promotes the development of smart agriculture and industrial automation, demonstrating broad application prospects and socio-economic value.

Future research will concentrate on further optimizing the model to bolster its robustness in complex environments and its capacity to recognize various types of defects. By examining the model’s interpretability, we can conduct a thorough analysis of classification errors, particularly for defects that are challenging to identify, thereby enhancing classification accuracy.

Additionally, we will explore techniques such as attention mechanisms and feature visualization to delve into the model’s interpretability. This will not only deepen our comprehension of the model’s decision-making processes but also aid in pinpointing and rectifying potential biases or errors.

Another avenue of research is the application of ensemble learning methods to synthesize the predictions from multiple models effectively. This strategy can capitalize on the individual strengths of various models while compensating for their respective shortcomings, thus elevating the collective performance of defect detection systems.

Pursuing these research directions will not only further enhance the performance of the RJ-TinyViT model but also promote the broader application of defect detection technology in other fields. Especially as technologies such as deep learning and edge computing mature, the outcomes of this study can lay the foundation for more efficient and intelligent industrial vision inspection systems. This will not only raise the level of automation in industrial production but also help improve global agricultural product quality management standards, fostering technological innovation and economic growth in the agricultural and industrial sectors.

Acknowledgements

We are very grateful to all members of the research team led by J.-X. Guo and B.-X. Yuan for their valuable suggestions and dataset. This research was funded by the Shaanxi Province Key R&D Program (Grant 2021GY-341), the Shaanxi Key Laboratory of Integrated and Intelligent Navigation (SKLIIN-20190102), the Natural Science Foundation of Shaanxi Province (2021JM-537, 2019JQ-936), and the Research Foundation for Talented Scholars of Xijing University (XJ20B01, XJ19B01, XJ17B06).

Author contributions

C.H., J.G., and B.Y. designed the experiments. C.H. conducted the experiments and contributed to analysis and manuscript preparation. J.G. obtained funding and supervised the entire project. H.X., Q.Z., Y.G., X.M., and J.C. prepared Figs. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 and Tables 1, 2, 3 and 4. All authors reviewed the manuscript.

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding authors upon reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Penumuru, D. P., Muthuswamy, S. & Karumbu, P. Identification and classification of materials using machine vision and machine learning in the context of Industry 4.0. Journal of Intelligent Manufacturing 31, 1229–1241 (2020).
  2. Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255 (IEEE, 2009).
  3. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 303–338 (2010).
  4. Lin, T.-Y. et al. Microsoft COCO: Common Objects in Context. arXiv preprint arXiv:1405.0312 (2014).
  5. Ozdogan, B., Gacar, A. & Aktaş, H. Digital agriculture practices in the context of Agriculture 4.0. Journal of Economics, Finance and Accounting 4, 184–191 (2017).
  6. Bertoglio, R., Corbo, C., Renga, F. M. & Matteucci, M. The digital agricultural revolution: A bibliometric analysis literature review. IEEE Access 9, 134762–134782 (2021).
  7. Zhao, C. Prospects of agricultural robots. China Rural Technology 5, 20–21 (2019).
  8. Li, D. & Yang, H. Research progress and development trends in agricultural Internet of Things technology. Chinese Agricultural Digest - Agricultural Engineering 30, 3–12 (2018).
  9. Raj, M. et al. A survey on the role of Internet of Things for adopting and promoting Agriculture 4.0. Journal of Network and Computer Applications 187, 103107 (2021).
  10. Li, C. & Liao, X. Analysis of the development of the Chinese jujube industry. Agricultural Products Market Weekly 49, 34–37 (2016).
  11. Cao, L. Current development status and recommendations for the jujube industry in Chaoyang County. Northern Fruit Trees 6, 44–45+47 (2017).
  12. Bhargava, A., Bansal, A. & Goyal, V. Machine learning–based detection and sorting of multiple vegetables and fruits. Food Analytical Methods 15, 228–242 (2022).
  13. Dong, Y.-Y., Huang, Y.-S., Xu, B.-L., Li, B.-C. & Guo, B. Bruise detection and classification in jujube using thermal imaging and DenseNet. Journal of Food Process Engineering 45, e13981 (2022).
  14. Premier, A. Façade cool coatings: an experiment on colour and surface quality. Intelligent Buildings International 13, 293–310 (2021).
  15. Chen, T., Wang, Y., Xiao, C. & Wu, Q. M. J. A machine vision apparatus and method for can-end inspection. IEEE Transactions on Instrumentation and Measurement 65, 2055–2066 (2016).
  16. Wu, L., He, J., Liu, G., Wang, S. & He, X. Detection of common defects on jujube using Vis-NIR and NIR hyperspectral imaging. Postharvest Biology and Technology 112, 134–142 (2016).
  17. Li, Y., Ma, B., Hu, Y., Yu, G. & Zhang, Y. Detecting starch-head and mildewed fruit in dried Hami jujubes using visible/near-infrared spectroscopy combined with MRSA-SVM and oversampling. Foods 11, 2431 (2022).
  18. Wu, K. et al. TinyViT: Fast pretraining distillation for small vision transformers. arXiv preprint arXiv:2207.10666 (2022).
  19. Liu, Y., Ma, X., Shu, L., Hancke, G. P. & Abu-Mahfouz, A. M. From Industry 4.0 to Agriculture 4.0: Current status, enabling technologies, and research challenges. IEEE Transactions on Industrial Informatics 17, 4322–4334 (2021).
  20. Yousef, N., Parmar, C. & Sata, A. Intelligent inspection of surface defects in metal castings using machine learning. Materials Today: Proceedings 67, 517–522 (2022).
  21. Peng, J., Shao, H., Xiao, Y., Cai, B. & Liu, B. Industrial surface defect detection and localization using multi-scale information focusing and enhancement GANomaly. Expert Systems with Applications 238, 122361 (2024).
  22. Li, S. et al. Aluminum surface defect detection method based on a lightweight YOLOv4 network. Scientific Reports 13, 11077 (2023).
  23. Zhou, C. et al. Metal surface defect detection based on improved YOLOv5. Scientific Reports 13, 20803 (2023).
  24. Liu, J., Cui, G. & Xiao, C. A real-time and efficient surface defect detection method based on YOLOv4. Journal of Real-Time Image Processing 20, 77 (2023).
  25. Wang, J. et al. Toward surface defect detection in electronics manufacturing by an accurate and lightweight YOLO-style object detector. Scientific Reports 13, 7062 (2023).
  26. Szegedy, C. et al. Going deeper with convolutions. arXiv preprint arXiv:1409.4842 (2015).
  27. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015).
  28. Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016).
  29. Yang, S. et al. Automatic identification of landslides based on deep learning. Applied Sciences 12, 8153 (2022).
  30. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  31. Liu, Z. et al. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021).
  32. Mehta, S. & Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178 (2021).
  33. Nie, J., He, Z., Yang, Y., Gao, M. & Dong, Z. Learning localization-aware target confidence for Siamese visual tracking. IEEE Transactions on Multimedia 25, 6194–6206, doi:10.1109/TMM.2022.3206668 (2023).
  34. Nie, J., He, Z., Yang, Y., Gao, M. & Zhang, J. GLT-T: Global-local transformer voting for 3D single object tracking in point clouds. arXiv preprint arXiv:2211.10927 (2022).
  35. Nie, J., Wu, H., He, Z., Gao, M. & Dong, Z. Spreading fine-grained prior knowledge for accurate tracking. IEEE Transactions on Circuits and Systems for Video Technology 32, 6186–6199, doi:10.1109/TCSVT.2022.3162599 (2022).
  36. Nie, J. et al. OSP2B: One-stage point-to-box network for 3D Siamese tracking. arXiv preprint arXiv:2304.11584 (2023).
  37. Zhu, N. et al. Deep learning for smart agriculture: Concepts, tools, applications, and opportunities. IJABE 11, 32–44 (2018).
  38. Altalak, M., Uddin, M. A., Alajmi, A. & Rizg, A. A hybrid approach for the detection and classification of tomato leaf diseases. Applied Sciences 12, 8182 (2022).
  39. Xu, X., Zheng, H., You, C., Guo, Z. & Wu, X. FAR-Net: Feature-wise attention-based relation network for multilabel jujube defect classification. Sensors 21, 392 (2021).
  40. Fang, S., Zhao, F., Chu, S. & Wu, Z. Defect detection method for jujubes based on multi-scale convolutional neural networks. Food and Machinery 37, 158–163+168 (2021).
  41. Yang, Z., Zheng, H., Guo, Z. & Xu, X. Jujube defect detection based on network in network convolutional neural networks. Food and Machinery 36, 140–145+181 (2020).
  42. Wen, H., Wang, J. & Han, F. Research on jujube defect detection and classification method based on improved residual networks. Food and Machinery 36, 161–165 (2020).
  43. Zhang, Z., Xue, H. & Fan, G. Jujube defect recognition based on improved convolutional neural networks. Food and Machinery 37, 158–162+192 (2021).
  44. Yu, Y. Research on Jujube Grading Method Based on Visual Attention Mechanism. Master’s Thesis, Tarim University (2023).
  45. Jiang, L., Yuan, B., Ma, W. & Wang, Y. JujubeNet: A high-precision lightweight jujube surface defect classification network with an attention mechanism. Frontiers in Plant Science 13, 1108437 (2023).
  46. Liu, Z. et al. A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545 (2022).
  47. Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
  48. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L. C. MobileNetV2: Inverted residuals and linear bottlenecks. arXiv preprint arXiv:1801.04381 (2018).
  49. Graham, B. et al. LeViT: A vision transformer in ConvNet’s clothing for faster inference. arXiv preprint arXiv:2104.01136 (2021).
  50. Hou, Q., Zhou, D. & Feng, J. Coordinate attention for efficient mobile network design. arXiv preprint arXiv:2103.02907 (2021).
  51. Howard, A. G. et al. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244 (2019).
  52. Zhou, C., Xia, X., Zhou, D. & Zhou, Z. Pedestrian re-identification using merged grid masks and residual coordinate attention. Microelectronics and Computer 39, 30–38 (2022).
  53. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017).
  54. Woo, S., Park, J., Lee, J.-Y. & Kweon, I.-S. CBAM: Convolutional block attention module. arXiv preprint arXiv:1807.06521 (2018).
  55. Ganj, A., Ebadpour, M., Darvish, M. & Bahador, H. LR-Net: A block-based convolutional neural network for low-resolution image classification. Iranian Journal of Science and Technology, Transactions of Electrical Engineering 47, 1561–1568 (2023).
  56. Xiang, Q. et al. Multi-scale group-fusion convolutional neural network for high-resolution range profile target recognition. IET Radar, Sonar & Navigation 16, 1997–2016 (2022).
  57. Xiang, Q. et al. Quadruplet depth-wise separable fusion convolution neural network for ballistic target recognition with limited samples. Expert Systems with Applications 235, 121182 (2024).
  58. Wang, C. Y. et al. CSPNet: A new backbone that can enhance learning capability of CNN. arXiv preprint arXiv:1911.11929 (2019).
  59. Loshchilov, I. & Hutter, F. Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101 (2017).
  60. Khasawneh, N., Faouri, E. & Fraiwan, M. Automatic detection of tomato diseases using deep transfer learning. Applied Sciences 12, 8467 (2022).
  61. Liu, Z. et al. Swin Transformer V2: Scaling up capacity and resolution. arXiv preprint arXiv:2111.09883 (2021).
  62. Wang, A., Chen, H., Lin, Z., Pu, H. & Ding, G. RepViT: Revisiting mobile CNN from ViT perspective. arXiv preprint arXiv:2307.09283 (2023).
  63. Shi, D. TransNeXt: Robust foveal visual perception for vision transformers. arXiv preprint arXiv:2311.17132 (2023).
  64. Li, X. & Li, S. Transformer help CNN see better: A lightweight hybrid apple disease identification model based on transformers. Agriculture 12, 884 (2022).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
