Abstract
In the study of small datasets, obtaining enough training samples per category is a major challenge in fine-grained image classification. Motivated by the fact that fine-grained concepts can be learned from very few samples, only a small number of labeled samples (for example, three) are used for each category. Because the subtle differences between fine-grained images are difficult to distinguish under such conditions, this paper proposes a method based on spatial-frequency information feature fusion for small-dataset fine-grained image classification (SDFGIC). The method not only considers the differences between the spatial-domain and frequency-domain features of images but also incorporates an image preprocessing stage in which the images are rotated multiple times. Since a convolutional kernel extracts features of a given image region through translation and convolution, multiple rotations yield feature representations of the image in different directions. Finally, learnable parameters are set to fuse the spatial- and frequency-domain features, and classification is performed through a fully connected layer. Under the small-sample experimental settings, the results show that the proposed method outperforms other advanced algorithms on six small datasets.
Subject terms: Computer science, Software
Introduction
Image classification is one of the key problems in machine vision. In particular, accurately classifying fine-grained images (FGIC) is a huge challenge for computers. With the development of neural networks, the ability to capture subtle differences between objects has increased; for example, the methods in 1–13 can accurately distinguish objects with high similarity. For FGIC, multiple benchmark datasets (e.g., birds 14, cars 15, and aircraft 16) have also been collected with extensive labels to test the performance of different methods.
However, the performance of the aforementioned methods degrades significantly when the number of samples in the dataset becomes small 17,18. Meanwhile, it is expensive to collect a huge amount of labeled data for FGIC, and it is hard to collect training examples for some rare categories (e.g., endangered species). On the other hand, humans are capable of learning stable feature representations from small image datasets when dealing with FGIC tasks 19. Inspired by this ability, in this work we study how to make the network learn the discriminative information in each input image as accurately as possible, so that it can perform FGIC tasks on small datasets.
Visual tasks 20–22 largely depend on extracting local information from input images, usually using techniques such as first-order and second-order derivatives 23–25. In images, information on different attributes is useful for classification. For example, frequency-domain information can provide features such as image texture, while spatial-domain information contains the spatial positional relationships within images. At present, few methods consider the local feature information of images from the spatial and frequency domains simultaneously. This paper therefore extracts the local features of images from both the frequency and spatial domains.
In addition, image data augmentation methods such as lighting changes 26,27, colorization 28, rotations 29,30, flips 31,32, and affine transformations 33,34 are employed to increase data diversity and help the network learn discriminative features from images. However, existing augmentation techniques focus on increasing image diversity and cannot effectively guide the network to accurately extract local structure information from the spatial and frequency domains in different fine-grained visual classification tasks. Neglecting careful extraction and description of local structure information in fine-grained visual classification methods 8–13,35,36 may lead to suboptimal performance, which is especially true for small datasets with only a few samples per category.
In this study, we propose a new spatial-frequency information feature fusion (SFIF) method for small-dataset fine-grained image classification (SDFGIC). We not only extract significant local features in multiple directions in the spatial domain but also obtain enhanced local features in multiple directions in the frequency domain after filtering. By obtaining local feature information in different attribute spaces, our method can more accurately locate the salient regions with classification ability in an image. Moreover, since noise suppression is performed in the frequency domain, the method can effectively suppress noise interference and greatly improve classification accuracy. The main contributions of this work are as follows:
We introduce frequency-domain information and design an adaptive threshold filtering method to filter out noise interference. The fusion of the spatial and frequency domains increases the diversity of images, so the network can obtain more diverse feature representations for classification.
Considering the fixed orientation of the convolutional kernel, we design an image preprocessing procedure with multi-directional rotation, enabling the fixed convolutional kernel to obtain more accurate feature representations of the image from different directions.
We propose a novel spatial-frequency domain fusion network for fine-grained image classification on small datasets.
We set up experiments on various aspects, such as the number of rotations, rotation directions, and the spatial and frequency domains, select the scheme most conducive to classification, and achieve better performance compared with other advanced algorithms.
Related work
In this section, we summarize existing FGIC and SDFGIC methods that are relevant to this study.
Fine-grained visual classification
The key problem of FGIC is to enable the network to learn the local information in the image with classification ability. Current FGIC methods can be divided into the following two categories based on significant region learning. The first group of methods 1,5–7,37,38 utilizes various additional annotation information, such as bounding boxes, metadata, or partial annotation mechanisms to locate significant parts. Then, classification is carried out according to the feature information of the selected region.
The second group of methods 8–13 updates network parameters so that the network can learn discriminative information. Zhou et al. 8 proposed an FGIC framework that incorporates multiple discriminative parts and multi-level features; by focusing on and masking the most discriminative parts, the model is encouraged to discover other discriminative parts, and multiple layers of features are combined to enrich local details. Xu et al. 9 propose a new internal ensemble learning transformer for FGIC that treats all heads of each ViT layer as weak learners and uses attention and spatial relationships to characterize tokens of discriminative regions between layers; to extract more accurate cross-layer features and reduce noise, a cross-layer refinement module is introduced. Cui and Hui 10 propose a dual-dependency attention transformer designed to address complex global dependency modeling in visual tasks; to improve the quality of semantic modeling and reduce the computational cost, the model decomposes global token interaction into two paths, namely the location-dependent and semantic-dependent attention paths. An et al. 11 proposed a multi-scale network with progressive multi-granularity attention to address the challenges of FGIC, especially the subtle differences between classes; the model uses progressive training and multi-granularity focus modules to locate discriminative local areas and designs multi-scale modules to extract multi-layer features, avoiding the recognition confusion caused by subtle inter-class differences. Among them, 8 and 11 also use multi-level feature information to enrich the local feature representation of the image. Shen et al. 12 adopt an adaptive label correction strategy to improve the robustness of the model and ensure effective learning under limited data. Pu et al. 13 introduce semantic information and data augmentation, translating image features along semantically meaningful directions to generate diversified samples and thereby enhance the model's generalization ability.
Visual classification based on spatial frequency domain
Some existing SDFGIC methods capture image features in the frequency domain. In 39, research has shown that different networks pay special attention to different frequencies in the frequency domain, which affects how a network learns certain features; therefore, introducing frequency-domain information may enhance a neural network's generalization capability. In 40, the fast Fourier transform (FFT) is introduced to effectively blend information in the frequency domain. The non-parametric nature and fast computation of the FFT allow interactions among features in the frequency domain to be learned efficiently, achieving good results even under limited sample conditions.
In addition, methods based on frequency-domain selection are proposed in 41 and 42, which take frequency-domain information as input and extract features using a CNN, thereby transforming image features into the frequency domain. Compared with traditional spatial-domain approaches, frequency-domain feature learning can not only reduce the memory occupied by the model but also achieve relatively good accuracy. Research on combining spatial- and frequency-domain features, as seen in 43 and 44, utilizes multi-domain cross or complementary fusion methods. In the context of few-shot classification, attention is given to both the spatial and frequency domains, allowing local information to be extracted in both. Studies on fusion methods for spatial- and frequency-domain features indicate that effective frequency-domain information has a positive impact on classification. This also highlights the complementary nature of frequency information and spatial representation: effective fusion of spatial- and frequency-domain features leads to better image feature representations.
Small dataset fine-grained visual classification
The existing SDFGIC methods 45–51 can be roughly divided into two categories 52: meta-learning-based methods 45–47 and metric-learning-based methods 48–51. Meta-learning-based methods aim to train classifiers with a limited number of training samples in each category. Tang et al. 45 present a dual-pyramid structure to boost the internal representation of features by constructing and aggregating attention pyramids with multiple scales and levels. Zhang et al. 46 proposed a Transformer-based double-helix model. The model first improves cross-image interaction in semantically relevant local object areas and then strengthens the extraction of features from the semantically relevant local regions found in each branch, thus enhancing the model's capability to distinguish fine-grained feature differences. Tsutsui et al. 47 proposed a meta-image enhancement network that combines generated images with the original images, thereby enabling the generated "hybrid" training images to improve one-shot learning.
Metric-learning-based methods learn a set of functions that transform the test sample into an embedding space; the test sample can then be classified according to a given metric (for example, nearest neighbor 53 or a deep nonlinear metric 54). Zhang et al. 48 used the Earth Mover's Distance as an indicator to calculate the structural distance between dense image representations and determine image correlation. Wertheimer et al. 49 reformulated small-sample classification as a reconstruction problem in latent space and proposed a feature reconstruction network, designing a new classification mechanism that directly reconstructs query-set samples from support-set samples, which can greatly increase inter-class differences and improve classification performance. To alleviate small inter-class variability and large intra-class variability, Wu et al. 50 proposed a bidirectional feature reconstruction network that introduces a self-attention mechanism to reconstruct images, using the support set to reconstruct the query set to increase inter-class differences and, in turn, the query set to reconstruct the support set to reduce intra-class differences. However, a large portion of existing few-shot image classification methods only model global image characteristics or local image patches and overlook global-local interaction. Sun et al. 51 propose GL-ViT, which designs a feature extraction module to obtain global and local feature representations and fuses them to better classify small-sample images.
Proposed method
In this section, we first clarify how to correctly extract the spatial-frequency information features to be fused (SFIF). Then, a novel local information learning method for SDFGIC is proposed. As shown in Fig. 1, the framework consists of three modules: local information preprocessing, the backbone network, and the classification network. We then explain why the proposed algorithm is effective and how the proposed framework enables the backbone network to obtain accurate spatial-frequency local information from each input image in order to capture discriminative local information.
Figure 1.
The overall pipeline of our proposed SFIF learning framework. (1) Information preprocessing: filtering noise in the frequency domain of input images and rotating them. (2) Backbone classification network: extracting the basic feature maps. (3) Classification network: classifying images into fine-grained categories.
Low-pass filtering methods
The SFIF-based method for SDFGIC proposed in this paper not only takes into account the features of the image in the spatial domain but also starts from the frequency domain, so that spatial- and frequency-domain features can be obtained simultaneously from the input image for classification. This paper uses the discrete cosine transform (DCT) to transform the image from the spatial domain to the frequency domain. Moreover, to further reduce the noise contained in the high-frequency information of the image, a low-pass filter is used to retain the low-frequency information that is beneficial for classification. During low-pass filtering (LPF), an adaptive threshold is designed to determine the frequency components to be filtered out, as shown below:
$$T = \mu + m\,\sigma^2 \tag{1}$$

$$\hat{F}(u,v) = \begin{cases} F(u,v), & |F(u,v)| \ge T \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

In the above formulas, $\mu$ is the mean of all frequency components obtained after the discrete cosine transform (DCT) of the input image, $\sigma^2$ is the variance of all frequency components obtained after the DCT, and $m$ is a learnable parameter with an initial value of 1. $F(u,v)$ denotes the frequency component at each position after the DCT of the input image, and $\hat{F}(u,v)$ denotes the frequency component retained at each position after the low-pass filter. In this way, adaptive thresholding filters the image in the frequency domain, retaining the frequency information with classification ability while filtering out, to a certain extent, the interference caused by high-frequency information and reducing noise. Finally, the inverse discrete cosine transform (IDCT) converts the filtered image from the frequency domain back to the spatial domain; this can be regarded as an attention operation. It not only provides information different from the original input image but also enables the network to reduce noise interference as much as possible, thereby extracting features from and classifying the input image more accurately.
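To make the filtering step concrete, the following is a minimal NumPy/SciPy sketch of the adaptive low-pass filter, assuming a single-channel image and interpreting the threshold of Eqs. (1)–(2) as acting on component magnitudes; the function name and this exact thresholding form are illustrative assumptions rather than the authors' released code.

```python
# Illustrative sketch of DCT-domain adaptive low-pass filtering.
# Assumptions: single-channel float image; the learnable parameter m of
# Eq. (1) is treated as a plain float; thresholding acts on |F(u, v)|.
import numpy as np
from scipy.fft import dctn, idctn

def adaptive_lowpass(img: np.ndarray, m: float = 1.0) -> np.ndarray:
    F = dctn(img, norm="ortho")               # spatial -> frequency (DCT)
    T = F.mean() + m * F.var()                # Eq. (1): adaptive threshold
    F_hat = np.where(np.abs(F) >= T, F, 0.0)  # Eq. (2): drop weak components
    return idctn(F_hat, norm="ortho")         # back to the spatial domain (IDCT)
```

In this reading, components whose magnitude falls below the adaptive threshold are treated as noise and zeroed out before the inverse transform.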
Information preprocessing
In small dataset tasks, the total number of image samples is relatively small, which makes the number of images participating in each training batch much less compared to large-scale datasets. When training sample data is scarce, it is difficult for the model to obtain enough features from these limited samples to support the iteration and optimization process, which often leads to overfitting of the network. Images have different manifestations in the spatial domain and the frequency domain. In the spatial domain, it is possible to intuitively distinguish which local information belongs to the main subject and which belongs to the background. However, the spatial domain makes it difficult to directly distinguish the information valuable for classification and the noise information that interferes with classification. In contrast, in the frequency domain, the distribution of energy in the image can be known through frequencies, and then it can be inferred which frequencies carry important information. In addition, the frequency domain space can clearly separate different frequencies. For example, high-frequency information usually contains more noise interference that is not conducive to classification, and filtering methods can be used to filter out the noise information in the image in the frequency domain space. This shows that the network can obtain different feature representations from the spatial domain and the frequency domain, and these representations are of great significance for image classification tasks. Therefore, in order to solve the problem of insufficient sample data, this paper introduces the combination of the spatial domain and the frequency domain in the image to obtain more diverse feature representations of the image for classification. In the image information processing process, first, the input image
$I$ is transformed by the discrete cosine transform (DCT) to obtain the frequency-domain representation $F$ of the image. Then, the frequency-domain image is filtered with a low-pass filter to eliminate high-frequency noise interference. Finally, the filtered image is converted back to the spatial domain through the inverse discrete cosine transform (IDCT) to obtain a frequency-enhanced image $\hat{I}$. The formulas are as follows:
$$F = \mathrm{DCT}(I) \tag{3}$$

$$\hat{F} = \mathrm{LPF}(F) \tag{4}$$

$$\hat{I} = \mathrm{IDCT}(\hat{F}) \tag{5}$$
Subsequently, the original image $I$ and the frequency-enhanced spatial-domain image $\hat{I}$ are rotated at a fixed angular interval within the range $[0^\circ, 360^\circ)$. The rotated images $I_i$ and $\hat{I}_i$ are generated from $I$ and $\hat{I}$ and are then input into the backbone network for training. Taking the backbone network as an example, as depicted in Fig. 2, the gray-level variation information of the input image is extracted in all directions. Figure 2a shows the gray-level variation information extracted by an existing method in one direction. With this approach, the backbone network can only learn part of the local information of the image from each batch, which may not contain the local regions with classification ability. In contrast, the method in this paper enables the network to extract the local features of images along multiple directions from different attribute spaces in each epoch, ensuring that significant local features with classification ability can be obtained in every epoch, as shown by "in every epoch" in Fig. 2b. Through this method, the backbone network can accurately obtain the local features of the image in both the spatial and frequency domains, making it easier to learn the salient information with classification ability.
Figure 2.
Examples of local information extraction. (a) Local information extraction of existing image data augmentation techniques. (b) Local information extraction of our proposed information preprocessing.
It is worth noting that a backbone network can learn different information from each input image through data augmentation (e.g., randomly rotating the image each time). One might therefore argue that, as long as the number of epochs is increased during training, the network in Fig. 2a can obtain more local feature information through data augmentation and thus achieve similar or even better classification accuracy than the network in Fig. 2b. However, subsequent experiments show that, under small-sample conditions, even if the number of training epochs is increased significantly (for example, setting the number of training epochs in Fig. 2a to 10 times that in Fig. 2b), the classification accuracy of the network in Fig. 2a does not improve. This is because the network in Fig. 2a does not sufficiently learn the local structural features of the image and cannot accurately describe the local features of different parts of the image. According to 55,56, key points in an image should be accurately detected in different directions. To address the difficulty of obtaining sufficient local structural features, and considering that the convolutional kernel does not change direction during conventional convolution, this paper performs multi-directional rotation transformations on the input image. In this way, the same convolutional kernel can obtain richer feature representations of a fixed region from different directions during convolution, as shown in Fig. 2b, further yielding more accurate image feature representations. In addition, experimental comparisons show that the proposed method outperforms other algorithms when the number of training images is small.
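As a concrete illustration of the preprocessing stage, the sketch below generates the two rotated image sets under the six-direction setting (a 60° interval over [0°, 360°)) selected later in the experiments; applying the earlier `adaptive_lowpass` sketch channel-wise is an assumption made for illustration.

```python
# Illustrative sketch of the multi-directional rotation preprocessing.
import numpy as np
import torch
import torchvision.transforms.functional as TF

def frequency_enhance(img: torch.Tensor) -> torch.Tensor:
    # Apply the DCT-domain filter sketched earlier to each channel (assumed).
    channels = [adaptive_lowpass(c.numpy()) for c in img]
    return torch.from_numpy(np.stack(channels)).float()

def preprocess(img: torch.Tensor, angles=tuple(range(0, 360, 60))):
    # img: (C, H, W) tensor; returns the rotated spatial and
    # frequency-enhanced image sets (R images each).
    img_f = frequency_enhance(img)  # DCT -> adaptive LPF -> IDCT
    spatial = torch.stack([TF.rotate(img, float(a)) for a in angles])
    frequency = torch.stack([TF.rotate(img_f, float(a)) for a in angles])
    return spatial, frequency
```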
Classification network
Image classification aims to classify images using the similarities between different images of the same category. In this method, the input image $I$ is processed to generate the rotated spatial-domain images and the rotated frequency-enhanced images, denoted as $I_i^{s}$ and $I_i^{f}$ ($i = 1, \dots, R$). Then, these rotated images, together with their corresponding one-vs-all fine-grained category labels $y$, are integrated into the training sets $D^{s}$ and $D^{f}$. The two sets of images are sent to the backbone network for feature extraction, the corresponding feature representations $f^{s}$ and $f^{f}$ are obtained, and adaptive feature fusion is performed to obtain $f^{fuse}$. Then, the classification probability is obtained after the fused feature map is processed by the fully connected layer in the classification network. Thus, the loss function $\mathcal{L}$ is described as:
$$p_i = \mathrm{softmax}\left(\mathrm{FC}\left(f_i^{fuse}\right)\right) \tag{6}$$

$$\mathcal{L} = -\frac{1}{C}\sum_{i=1}^{C} y_i \log p_i \tag{7}$$
where C is the number of images used during training.
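As a concrete sketch of the fusion and classification step, the module below fuses the two feature vectors with a single learnable weight before the fully connected classifier; the paper states only that learnable parameters fuse the two domains, so this scalar-weight form is an assumption.

```python
# Illustrative fusion-and-classification head (scalar fusion weight assumed).
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight
        self.fc = nn.Linear(feat_dim, num_classes)    # fully connected classifier

    def forward(self, f_s: torch.Tensor, f_f: torch.Tensor) -> torch.Tensor:
        f_fuse = self.alpha * f_s + (1.0 - self.alpha) * f_f  # adaptive fusion
        return self.fc(f_fuse)                                # class logits

# Cross-entropy over the C training images, as in Eq. (7), e.g. with
# ResNet-50 features (dim 2048) and the 80 cotton categories:
# loss = nn.CrossEntropyLoss()(FusionClassifier(2048, 80)(f_s, f_f), labels)
```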
In the framework proposed in this paper, the network model is trained end to end; after multiple epochs of learning, the network can accurately learn the local information with classification ability in the image. The entire SFIF learning framework is illustrated in Fig. 1. The information preprocessing module assists the backbone network in accurately learning the local information of different types of objects in both the spatial and frequency domains from each input image, and the classification network helps the network learn the local features of objects from different attributes. In this way, SFIF learning can accurately learn the attributes of different objects in images, thereby achieving better performance in SDFGIC.
Experiments
This section first describes the SDFGIC datasets and the detailed experimental setup used during training. Second, the effect of image preprocessing on the performance of the proposed method is shown, and the impact of the number of epochs on classification accuracy is illustrated. Third, the proposed SFIF learning method is compared with many state-of-the-art methods. Fourth, other algorithms are embedded into our framework and their corresponding classification accuracies are given.
Experiment setting
In this experiment, three training samples and three test samples per category are randomly selected from six datasets (i.e., Cotton 57, Oxford Flower (FLO) 58, CUB-200-2011 (CUB) 14, Stanford Cars (CAR) 15, Aircraft (AIR) 16, and Plant Disease (PD) 59) to evaluate the classification accuracy of the proposed method against the state-of-the-art benchmark methods (VGG-16 60, ResNet50 61, NTS-Net 62, fast-MPN 63, DCL 32, Cross-X 64, MOMN 65, ACNet 31, fingerprints vitality detection (FVD) 66, ABC-Norm 67, FRe-Net 68, mutually exclusive part representations (MEPR) 69, learnable semantic data augmentation (LSDA) 13, and SRGN 70). No additional annotations are used in training other than the category labels in the datasets. The method was trained in the PyTorch framework using two NVIDIA RTX 3090 graphics cards with 24 GB of memory each.
The cotton dataset 57 contains 80 cotton leaf categories. The FLO dataset 58 contains 102 classes of flowers. The CUB dataset 14 contains 200 classes. The CAR dataset 15 contains 196 classes. The AIR dataset 16 contains 100 classes. The PD dataset 59 contains 38 plant disease categories. Each category contains 6 images. For these six images, we follow the setting of [a][b] for training and testing; that is, for each category, the number of training images is a and the number of test images is b. In this experiment, a and b are set to 3 and 3, respectively.
In this method, ResNet50 61 is used as the backbone network. Before training, each image is first padded into a square and scaled to a fixed resolution, and then randomly rotated and cropped to the network input size before being fed into the network. During training, the model parameters are updated using stochastic gradient descent; 180 epochs are trained in total, and the input batch size is 16. The initial learning rate is set to 0.001 and is decayed by a factor of 10 every 60 epochs. The results of the other methods are obtained with the code provided in the corresponding articles, following their best settings.
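For reference, the stated optimization schedule maps directly onto standard PyTorch components, as in the sketch below; momentum and weight decay are not reported in the text, so they are omitted here, and a plain ResNet-50 stands in for the full SFIF network.

```python
# Illustrative training schedule: SGD, lr 0.001, decayed 10x every 60 epochs,
# 180 epochs in total, batch size 16 (batching code omitted).
import torch
import torchvision

model = torchvision.models.resnet50()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(180):
    # ... one pass over the training batches of size 16 ...
    scheduler.step()
```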
Parameter settings
In this section, we first test the accuracy of the proposed method. From this experiment, we find that the number of rotation directions of images in the spatial and frequency domains has a great impact on performance, as shown in Fig. 3. The proposed method achieves the best performance on the cotton dataset 57 under the six-direction rotation setting, while the two-direction setting yields the lowest classification accuracy. The reason is that input images with only two rotation directions in the spatial and frequency domains cannot enable the network to learn sufficient local self-similarity. Moreover, the classification accuracy with images generated by six rotation angles in the spatial and frequency domains is more stable and relatively better, because these images provide more local information in both the spatial domain and the frequency domain. Based on this, the proposed method uses the images obtained by six rotations in the spatial domain and the frequency domain, respectively, as the input images.
Figure 3.

The impact of different input image sets on SDFGIC performance.
We then test the accuracy of the proposed method on the six small datasets using an input set consisting of 12 images generated by rotations in the spatial and frequency domains. Table 1 shows the classification accuracy on the cotton, aircraft, flower, bird, car, and plant disease datasets under different combinations of six-direction image rotations. It can be seen from the table that the accuracy of our algorithm is stable across input images with different rotation directions in the spatial and frequency domains. The best accuracy of the proposed method on each dataset is then selected for the performance comparison with other algorithms.
Table 1.
The accuracy (%) of the proposed method under four combinations of six image rotation directions.
| Input images | Cotton | CUB | CAR | AIR | FLO | PD |
|---|---|---|---|---|---|---|
| Rotation combination 1 | 63.79 | 86.63 | 92.77 | 91.07 | 95.49 | 97.46 |
| Rotation combination 2 | 62.13 | 86.63 | 92.32 | 91.23 | 96.16 | 96.58 |
| Rotation combination 3 | 63.17 | 85.29 | 92.89 | 90.79 | 95.49 | 98.72 |
| Rotation combination 4 | 64.22 | 85.53 | 92.36 | 90.67 | 97.18 | 96.46 |
Significant values are in bold.
Next, we discuss the effect of the number of epochs on the classification accuracy of SDFGIC. Taking the cotton dataset as an example, Fig. 4 shows the classification accuracies of four methods (VGG-16 60, ResNet50 61, DCL 32, and the proposed method with two image rotation directions) trained for 1800 epochs, together with the proposed method with six image rotation directions trained for 180 epochs. It can be seen from Fig. 4 that, at the beginning, the accuracies of all algorithms increase with the number of epochs. When the number of epochs reaches about 120, the accuracies of the five algorithms on the cotton dataset become stable. It can also be found from Fig. 4 that, even when the number of epochs is increased to 1800, the accuracies of the four methods remain lower than that of the proposed method with six image rotation directions.
Figure 4.

The impact of the number of epochs for the accuracy of SDFGIC.
Through the experiments on accuracy with multiple rotation directions of the SFIF method, the six-direction rotation setting is selected for data preprocessing in the subsequent experiments. Next, comparisons are made with other advanced methods under two [a][b] experimental settings; the first is the default setting of a = 3 and b = 3.
It can be observed from Tables 2 and 3 that the SFIF method proposed in this paper is superior to the other advanced algorithms under both experimental settings.
Table 2.
Comparison with the state-of-the-art methods (accuracy, %) on six different small datasets with a = 3 and b = 3.
| Method | Base model | Cotton | CUB | CAR | AIR | FLO | PD |
|---|---|---|---|---|---|---|---|
| ResNet-50 | Resnet-50 | 48.24 | 84.20 | 90.92 | 89.74 | 95.35 | 96.33 |
| VGG-16 | VGG-16 | 40.19 | 82.18 | 87.55 | 90.32 | 94.37 | 95.17 |
| NTS-Net | ResNet-50 | 52.50 | 84.23 | 90.32 | 88.15 | 95.42 | 96.00 |
| fast-MPN | ResNet-50 | 50.73 | 85.12 | 88.61 | 90.26 | 96.33 | 95.78 |
| DCL | ResNet-50 | 60.08 | 85.47 | 92.18 | 90.58 | 96.49 | 96.19 |
| Cross-X | ResNet-50 | 52.71 | 85.22 | 92.18 | 89.84 | 96.12 | 93.63 |
| MOMN | ResNet-50 | 40.00 | 81.79 | 86.25 | 85.33 | 97.15 | 98.26 |
| ACNet | ResNet-50 | 43.62 | 85.31 | 88.29 | 83.51 | 96.88 | 96.68 |
| FVD | ResNet-50 | 48.48 | 84.20 | 91.06 | 83.67 | 96.62 | 95.43 |
| ABC-Norm | ResNet-50 | 56.32 | 82.33 | 90.32 | 88.42 | 96.13 | 96.22 |
| FRe-Net | ResNet-50 | 61.39 | 84.12 | 91.49 | 88.47 | 96.27 | 96.10 |
| MEPR | ResNet-50 | 62.47 | 84.74 | 91.85 | 89.22 | 96.93 | 96.37 |
| LSDA | ResNet-50 | 63.12 | 85.14 | 90.95 | 88.38 | 97.12 | 96.43 |
| SRGN | ResNet-50 | 63.97 | 85.67 | 91.93 | 90.28 | 96.78 | 96.21 |
| Ours | ResNet-50 | 64.22 | 85.53 | 92.36 | 90.67 | 97.18 | 96.46 |
Significant values are in bold.
Table 3.
Comparison with the state-of-the-art methods (accuracy, %) on six different small datasets under the second [a][b] setting.
| Method | Base model | Cotton | CUB | CAR | AIR | FLO | PD |
|---|---|---|---|---|---|---|---|
| ResNet-50 | Resnet-50 | 57.92 | 88.23 | 95.02 | 94.03 | 96.13 | 97.14 |
| VGG-16 | VGG-16 | 48.94 | 85.36 | 90.25 | 95.13 | 95.33 | 96.23 |
| NTS-Net | ResNet-50 | 61.25 | 88.42 | 94.73 | 93.31 | 96.22 | 96.74 |
| fast-MPN-Cov | ResNet-50 | 59.72 | 89.01 | 91.33 | 94.41 | 96.88 | 96.71 |
| DCL | ResNet-50 | 69.91 | 89.92 | 96.17 | 94.53 | 97.09 | 96.97 |
| Cross-X | ResNet-50 | 61.77 | 89.22 | 96.14 | 94.10 | 96.73 | 94.88 |
| MOMN | ResNet-50 | 49.49 | 86.02 | 89.85 | 89.49 | 97.35 | 98.47 |
| ACNet | ResNet-50 | 61.49 | 89.13 | 96.22 | 93.47 | 97.10 | 97.27 |
| FVD | ResNet-50 | 65.91 | 88.95 | 95.36 | 93.77 | 97.01 | 96.53 |
| ABC-Norm | ResNet-50 | 68.32 | 87.83 | 94.25 | 93.42 | 96.67 | 97.33 |
| FRe-Net | ResNet-50 | 70.31 | 89.40 | 95.26 | 93.92 | 97.07 | 97.56 |
| MEPR | ResNet-50 | 72.66 | 90.26 | 95.58 | 94.47 | 97.43 | 97.77 |
| LSDA | ResNet-50 | 72.63 | 91.01 | 94.32 | 92.77 | 97.34 | 98.21 |
| SRGN | ResNet-50 | 73.28 | 91.41 | 95.83 | 84.81 | 97.38 | 98.57 |
| Ours | ResNet-50 | 74.15 | 90.05 | 96.77 | 95.53 | 97.49 | 98.88 |
Significant values are in bold.
For CUB images, the feature maps of the last convolutional layer of our method and of the other benchmark methods are shown in Fig. 5. It can be seen from Fig. 5 that the feature maps of the methods differ significantly. Compared with the other benchmark methods, our method focuses more on learning the overall structural information of the objects and better suppresses the interference of the surrounding environment on SDFGIC.
Figure 5.
Feature map visualization of our method and other methods based on the last convolution layer of ResNet-50 backbone.
Experiment results
As can be observed from Tables 2 and 3, the proposed SFIF method achieves better performance than the other advanced methods on the six small datasets constructed from the Cotton, CUB-200-2011, Stanford Cars, Aircraft, Oxford Flower, and Plant Disease datasets. The reason is that the SFIF method can extract local information in multiple attribute spaces, enabling the network to learn more and richer local information, among which regions with classification ability are more likely to appear. Therefore, the proposed SFIF can describe the feature information in images from multiple attributes. In addition, it can be found that the local information in the image is critical to the SDFGIC task.
Keeping ResNet-50 as the backbone network and under the same [a][b] setting as in Table 3, we compared the SFIF method with other image augmentation techniques (i.e., lighting changes 26, colorizing images 28, image rotations 29, image flips 31,32, and image affine transformations 33) on the six small datasets. The results are shown in Table 4. Compared with the other data augmentation techniques, the SFIF method achieves the highest classification accuracy.
Table 4.
Comparison with the existing data augmentation techniques (accuracy, %) on six different small datasets under the same [a][b] setting as Table 3.
| Method | Base model | Cotton | CUB | CAR | AIR | FLO | PD |
|---|---|---|---|---|---|---|---|
| Lighting changes | ResNet-50 | 44.35 | 88.34 | 90.14 | 91.92 | 94.18 | 92.11 |
| Colorizing images | ResNet-50 | 43.52 | 86.32 | 89.36 | 90.11 | 93.92 | 92.16 |
| Image rotations | ResNet-50 | 44.24 | 85.66 | 90.12 | 91.37 | 93.15 | 92.21 |
| Image flips | ResNet-50 | 43.17 | 84.39 | 90.19 | 90.38 | 92.17 | 91.32 |
| Image affine transformations | ResNet-50 | 46.78 | 89.91 | 94.36 | 92.18 | 93.96 | 94.49 |
| Ours | ResNet-50 | 74.15 | 90.05 | 96.77 | 95.53 | 97.49 | 98.88 |
Significant values are in bold.
In addition, the fusion of image feature information from the spatial and frequency domains needs further study. Similarly, with ResNet-50 as the backbone network and under the same [a][b] setting as in Table 3, input images generated by six-direction rotation in the spatial domain and in the frequency domain, respectively, are used as the training sets. The results are as follows:
As shown in Table 5, the method proposed in this paper for multiple rotations of the input image exhibits different performances in the spatial domain and the frequency domain respectively on six small datasets. The results show that, after fusing the spatial features and the frequency domain features, compared with the methods that only use the spatial domain or the frequency domain, the SFIF method achieves the optimal classification accuracy on these six small datasets.
Table 5.
Comparison of spatial-domain, frequency-domain, and fused spatial-frequency features (accuracy, %) on six different small datasets under the same [a][b] setting as Table 3.
| Method | Cotton | CUB | CAR | AIR | FLO | PD |
|---|---|---|---|---|---|---|
| Spatial | 73.78 | 88.73 | 96.32 | 95.01 | 97.11 | 97.26 |
| Frequency | 73.22 | 89.18 | 95.75 | 94.95 | 96.32 | 98.04 |
| Spatial + frequency | 74.15 | 90.05 | 96.77 | 95.53 | 97.49 | 98.88 |
Significant values are in bold.
Furthermore, embedding three algorithms (NTS-Net 62, fast-MPN 63, and DCL 32) into our proposed framework with six image rotation directions, their corresponding accuracies on two small datasets (Cotton and CUB) are summarized in Table 6. It can be found from Table 6 that the accuracies of the three methods improve significantly.
Table 6.
The classification accuracy (%) of three algorithms embedded in our proposed framework with a = 3 and b = 3. The performance improvement is relative to the accuracy of the original method.
| Method | Cotton (%) | CUB (%) | Improvement on Cotton (%) | Improvement on CUB (%) |
|---|---|---|---|---|
| Original NTS-Net | 52.50 | 84.23 | | |
| NTS-Net in our framework | 58.65 | 91.36 | 11.71 | 8.45 |
| Original fast-MPN-Cov | 50.73 | 85.12 | | |
| fast-MPN-Cov in our framework | 57.82 | 93.11 | 13.97 | 9.38 |
| Original DCL | 60.08 | 85.47 | | |
| DCL in our framework | 67.42 | 92.34 | 12.21 | 8.14 |
Application
In this subsection, the proposed network is applied to change detection on SAR images. The SAR images used in this section are sourced from the geographic image information of Zhongchi Town, Shiquan County, Ankang City, Shaanxi Province, China, at two time points in 2021. The model block diagram of this part is shown in Fig. 6.
Figure 6.
Using the SAR geographical information image of Zhongchi Town in 2021 for permanent farmland change detection, with the GIS image as the mask image of the standard farmland in 2017. The 2021 image of Zhongchi Town is used as the detected image; p1 and p2 are the information of each regular geographical patch segmented from the 2021 Zhongchi Town image through the mask image.
Patch data and preprocessing
Due to factors such as the patch information being collected in 2017 and considerations of SAR shooting angles, it is necessary to correct and match the positions of the 2021 SAR image with the patches to ensure accurate correspondence. The irregular polygon edge coordinates are found in the mask image, and new mask regions are created using these coordinates. Finally, this region is overlaid with the 2021 SAR image of Zhongchi Town to segment the farmland patch areas, resulting in 1773 patch areas.
Considering the varying size and shape of irregular patches, direct classification is difficult. Therefore, while segmenting irregular patches, their minimum bounding rectangles are also segmented to obtain two types of datasets: regular patch data and irregular patch data. Subsequently, combining manual judgment of whether the patches are farmland areas, corresponding labels are generated for inference to calculate accuracy.
Data training
To conduct model training and evaluation, it is necessary to construct training, validation, and testing datasets. These datasets are obtained by segmenting the original SAR images into fixed-size image blocks. After manual judgment, the images are classified into two categories: farmland and non-farmland.
Specifically, we constructed a training dataset consisting of 2148 images, as well as separate validation and testing datasets, each containing 460 images. In the training dataset, training samples are used for model learning and parameter adjustment. Validation data are used to assess the model's performance and to conduct hyperparameter selection and adjustment. Testing data are then used to finally evaluate the model's generalization ability and accuracy. Through such dataset partitioning and utilization, the proposed model's performance in farmland change detection tasks can be comprehensively evaluated, providing reliable metrics for its performance.
Inference and change detection
The random cropping method refers to randomly cropping a region from each patch area during change detection. Because the patches are irregular in shape and the rectangular images are padded with white, their shapes vary and cannot be used directly for detection. Therefore, when segmenting the patches from the original image, the minimum inscribed rectangle of each irregular patch needs to be found to segment out regular images. When predicting the patches, both regular and irregular patch loaders are created. Non-white areas of the irregular patches are identified, and the center point of the cropping box is chosen using random numbers. Since the regular and irregular patches have the same size, the center point is mapped into the irregular image and checked against the boundary. If the center point is valid and an image block of the required size can be cropped, the block is predicted using the SFIF method, and the occurrence of change is determined based on the predicted category.
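The center-selection logic described above can be sketched as follows; the white-fill test, the retry limit, and the crop size parameter are illustrative assumptions rather than details given in the text.

```python
# Illustrative random-crop selection for an (H, W, 3) uint8 patch image
# in which non-patch pixels are filled with white (255).
import numpy as np

def random_valid_crop(patch: np.ndarray, crop_size: int,
                      rng: np.random.Generator | None = None):
    rng = rng or np.random.default_rng()
    h, w = patch.shape[:2]
    half = crop_size // 2
    ys, xs = np.nonzero(patch.min(axis=-1) < 255)  # non-white (valid) pixels
    if len(ys) == 0:
        return None
    for _ in range(100):  # retry until the crop box lies inside the image
        i = rng.integers(len(ys))
        cy, cx = int(ys[i]), int(xs[i])
        if half <= cy <= h - half and half <= cx <= w - half:
            return patch[cy - half:cy + half, cx - half:cx + half]
    return None  # no valid crop found for this patch
```

Each cropped block is then classified by the trained network, and the occurrence of change is determined from the predicted category.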
Change detection result
Using the 2021 SAR image of Zhongchi Town and the 2017 ArcGIS vector map information of permanent farmland in Zhongchi Town, all farmland patches were segmented. By combining the designed random cropping method with the SFIF method, rapid and accurate change detection of farmland was achieved. Accounting for the slight displacement of the patches in 2021, the types, quantities, and prediction results of the patches are given in Table 7.
Table 7.
The comparison of change detection accuracy in Zhongchi Town in 2021 with existing methods, with a fixed image block size.
| Method | Base model | Correct prediction | Agricultural | Non-agricultural | Accuracy (%) |
|---|---|---|---|---|---|
| NTS-Net | ResNet-50 | 618 | 37 | 581 | 34.85 |
| fast-MPN-Cov | ResNet-50 | 479 | 16 | 463 | 27.01 |
| DCL | ResNet-50 | 1317 | 116 | 1201 | 74.28 |
| Cross-X | ResNet-50 | 1139 | 97 | 1042 | 64.24 |
| MOMN | ResNet-50 | 1263 | 132 | 1131 | 71.23 |
| Ours | ResNet-50 | 1443 | 189 | 1254 | 80.14 |
Significant values are in bold.
According to Table 7, there were 1773 farmland patches in Zhongchi Town in 2021. From the detection results obtained by segmenting and classifying the image patches across multiple models, it can be seen that the SFIF method proposed in this paper can quickly and accurately classify the patch information for change detection of permanent farmland in Zhongchi Town.
Conclusion
This paper addresses the challenge of small-dataset fine-grained classification by exploring SFIF from a few labeled examples. Firstly, we elucidate how to accurately extract local features from each input image, enabling the network to accurately describe the properties of different features in images across multiple directions and attribute spaces. Secondly, we analyze in detail the limitations of existing data augmentation techniques. Thirdly, we propose a novel SFIF learning framework for SDFGIC. Fourthly, the proposed method demonstrates superior performance on six small datasets. Fifthly, by integrating other algorithms into our proposed framework, their classification accuracy can be significantly improved.
Moreover, we applied the proposed SFIF method to change detection tasks on SAR images of Zhongchi Town in 2021 with limited samples. By transforming change detection into a binary classification problem, the SFIF method can detect and classify SAR images more accurately than existing image classification methods.
Author contributions
L.B. conceived the method and reviewed the manuscript, Y.G. wrote the original draft and edited it, W.Z. conceived and conducted the experiments, Y.G. and W.Z. analysed the results, and D.W. developed the software and visualized the results. All authors have read and agreed to the published version of the manuscript.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Jonathan, K., Jin, H., Yang, J. & Fei-Fei, L. Fine-grained recognition without part annotations. In IEEE Conference on Computer Vision and Pattern Recognition 5546–5555 (2015).
- 2.Wang, J., Lu, J., Yang, J., Wang, M. & Zhang, W. An unbiased feature estimation network for few-shot fine-grained image classification. Sensors 24, 66 (2024).
- 3.Pan, Z., Yu, X., Zhang, W. & Gao, Y. Overcoming learning bias via prototypical feature compensation for source-free domain adaptation. Pattern Recognit. 158, 111025 (2025).
- 4.Ren, J. et al. Adaptive feature selection-based feature reconstruction network for few-shot learning. Available at SSRN 4962361.
- 5.Ye, S., Wang, Y., Peng, Q., You, X. & Chen, C. P. The image data and backbone in weakly supervised fine-grained visual categorization: A revisit and further thinking. IEEE Trans. Circuits Syst. Video Technol. 66, 2–16 (2023).
- 6.Wang, H. et al. Knowledge mining with scene text for fine-grained recognition. In IEEE Conference on Computer Vision and Pattern Recognition 4624–4633 (2022).
- 7.Diao, Q., Jiang, Y., Wen, B., Sun, J. & Yuan, Z. Metaformer: A unified meta framework for fine-grained recognition. arXiv preprint arXiv:2203.02751 (2022).
- 8.Zhou, P., Pang, C., Lan, R., Wu, G. & Zhang, Y. Multi-discriminative parts mining for fine-grained visual classification. In Asian Conference on Pattern Recognition 279–292 (2023).
- 9.Xu, Q., Wang, J., Jiang, B. & Luo, B. Fine-grained visual classification via internal ensemble learning transformer. IEEE Trans. Multimed. 66, 9015–9028 (2023).
- 10.Cui, S. & Hui, B. Dual-dependency attention transformer for fine-grained visual classification. Sensors 24, 2337 (2024).
- 11.An, C., Wang, X., Wei, Z., Zhang, K. & Huang, L. Multi-scale network via progressive multi-granularity attention for fine-grained visual classification. Appl. Soft Comput. 146, 110588 (2023).
- 12.Shen, J. et al. ProtoSimi: Label correction for fine-grained visual categorization. Mach. Learn. 113, 1903–1920 (2024).
- 13.Pu, Y. et al. Fine-grained recognition with learnable semantic data augmentation. IEEE Trans. Image Process. 66, 3130–3144 (2024).
- 14.Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset (California Institute of Technology, 2011).
- 15.Krause, J., Stark, M., Deng, J. & Fei-Fei, L. 3D object representations for fine-grained categorization. In IEEE International Conference on Computer Vision Workshops 554–561 (2013).
- 16.Maji, S., Rahtu, E., Kannala, J., Blaschko, M. & Vedaldi, A. Fine-grained visual classification of aircraft. ArXiv Preprint arXiv:1306.5151 (2013).
- 17.Wang, Y., Yao, Q., Kwok, J. T. & Ni, L. M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. 53, 1–34 (2020).
- 18.Zhang, W., Zhao, Y., Gao, Y. & Sun, C. Re-abstraction and perturbing support pair network for few-shot fine-grained image classification. Pattern Recognit. 148, 110158 (2024).
- 19.Schmidt, L. A. Meaning and Compositionality as Statistical Induction of Categories and Constraints. Ph.D. thesis (Massachusetts Institute of Technology, 2009).
- 20.Shui, P.-L. & Zhang, W.-C. Corner detection and classification using anisotropic directional derivative representations. IEEE Trans. Image Process. 22, 3204–3218 (2013).
- 21.Zhang, W., Zhao, Y., Breckon, T. P. & Chen, L. Noise robust image edge detection based upon the automatic anisotropic Gaussian kernels. Pattern Recognit. 63, 193–205 (2017).
- 22.Zhang, W. & Sun, C. Corner detection using multi-directional structure tensor with multiple scales. Int. J. Comput. Vis. 128, 438–459 (2020).
- 23.Li, P., Lu, X. & Wang, Q. From dictionary of visual words to subspaces: Locality-constrained affine subspace coding. In IEEE Conference on Computer Vision and Pattern Recognition 2348–2357 (2015).
- 24.Dai, X., Ng, J. Y. & Davis, L. S. Fason: First and second order information fusion network for texture recognition. In IEEE Conference on Computer Vision and Pattern Recognition 6100–6108 (2017).
- 25.Ren, J., Yu, W., Guo, J., Zhang, W. & Sun, C. Second-order anisotropic Gaussian directional derivative filters for blob detection. arXiv preprint arXiv:2305.00435 (2023).
- 26.Huang, S.-W. et al. AugGAN: Cross domain adaptation with GAN-based data augmentation. In European Conference on Computer Vision 718–731 (2018).
- 27.Ren, J., Zhao, Y., Zhang, W. & Sun, C. A novel spatial-frequency domain network for zero-shot incremental learning. arXiv preprint arXiv:2402.07216 (2024).
- 28.Yoo, S. et al. Coloring with limited data: Few-shot colorization via memory augmented networks. In IEEE Conference on Computer Vision and Pattern Recognition 11283–11292 (2019).
- 29.Feng, Z., Xu, C. & Tao, D. Self-supervised representation learning by rotation feature decoupling. In IEEE Conference on Computer Vision and Pattern Recognition 10364–10374 (2019).
- 30.Lei, T. et al. Semi-supervised 3D medical image segmentation using multi-consistency learning with fuzzy perception-guided target selection. IEEE Trans. Radiat. Plasma Med. Sci. 66, 1 (2024).
- 31.Ji, R. et al. Attention convolutional binary neural tree for fine-grained visual categorization. In IEEE Conference on Computer Vision and Pattern Recognition 10468–10477 (2020).
- 32.Chen, Y., Bai, Y., Zhang, W. & Mei, T. Destruction and construction learning for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 5157–5166 (2019).
- 33.Luo, C., Zhu, Y., Jin, L. & Wang, Y. Learn to augment: Joint data augmentation and network optimization for text recognition. In Conference on Computer Vision and Pattern Recognition 13746–13755 (2020).
- 34.Zhang, W.-C. & Shui, P.-L. Contour-based corner detection via angle difference of principal directions of anisotropic Gaussian directional derivatives. Pattern Recognit. 48, 2785–2797 (2015).
- 35.Zhang, W., Sun, C. & Gao, Y. Image intensity variation information for interest point detection. IEEE Trans. Pattern Anal. Mach. Intell. 45, 9883–9894 (2023).
- 36.Zhang, W., Liu, X., Xue, Z., Gao, Y. & Sun, C. NDPNet: A novel non-linear data projection network for few-shot fine-grained image classification. arXiv preprint arXiv:2106.06988 (2021).
- 37.Huang, S., Xu, Z., Tao, D. & Zhang, Y. Part-stacked cnn for fine-grained visual categorization. In Conference on Computer Vision and Pattern Recognition 1173–1182 (2016).
- 38.Berg, T. et al. Birdsnap: Large-scale fine-grained visual categorization of birds. In IEEE Conference on Computer Vision and Pattern Recognition 2011–2018 (2014).
- 39.Lin, S. et al. Deep frequency filtering for domain generalization. In IEEE Conference on Computer Vision and Pattern Recognition 11797–11807 (2023).
- 40.Shi, H. et al. F3Net: Fast Fourier filter network for hyperspectral image classification. IEEE Trans. Instrum. Meas. 6, 66 (2023).
- 41.Xu, K. et al. Learning in the frequency domain. In IEEE Conference on Computer Vision and Pattern Recognition 1740–1749 (2020).
- 42.Lin, H., Tse, R., Tang, S.-K., Qiang, Z. & Pau, G. Few-shot learning for plant-disease recognition in the frequency domain. Plants 11, 2814 (2022).
- 43.Zhu, H., Gao, Z., Wang, J., Zhou, Y. & Li, C. Few-shot fine-grained image classification via multi-frequency neighborhood and double-cross modulation. arXiv preprint arXiv:2207.08547 (2022).
- 44.Chen, X. & Wang, G. Few-shot learning by integrating spatial and frequency representation. In 2021 18th Conference on Robots and Vision (CRV) 49–56 (2021).
- 45.Tang, H., Yuan, C., Li, Z. & Tang, J. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognit. 130, 108792 (2022).
- 46.Zhang, B. et al. Learning cross-image object semantic relation in transformer for few-shot fine-grained image classification. In 30th ACM International Conference on Multimedia 2135–2144 (2022).
- 47.Tsutsui, S., Fu, Y. & Crandall, D. Reinforcing generated images via meta-learning for one-shot fine-grained visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 6, 66 (2022).
- 48.Zhang, C., Cai, Y., Lin, G. & Shen, C. Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In IEEE conference on computer vision and pattern recognition 12203–12213 (2020).
- 49.Wertheimer, D., Tang, L. & Hariharan, B. Few-shot classification with feature map reconstruction networks. In IEEE Conference on Computer Vision and Pattern Recognition 8012–8021 (2021).
- 50.Wu, J. et al. Bi-directional feature reconstruction network for fine-grained few-shot image classification. In AAAI Conference on Artificial Intelligence vol. 37 2821–2829 (2023).
- 51.Sun, M., Ma, W. & Liu, Y. Global and local feature interaction with vision transformer for few-shot image classification. In 31st ACM International Conference on Information & Knowledge Management 4530–4534 (2022).
- 52.Ren, J., Li, C., An, Y., Zhang, W. & Sun, C. Few-shot fine-grained image classification: A comprehensive review. AI 5, 405–425 (2024).
- 53.Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 4077–4087 (2017).
- 54.Sung, F. et al. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition 1199–1208 (2018).
- 55.Jing, J., Gao, T., Zhang, W., Gao, Y. & Sun, C. Image feature information extraction for interest point detection: A comprehensive review. IEEE Trans. Pattern Anal. Mach. Intell. 45, 4694–4712 (2022).
- 56.Zhang, W., Sun, C., Breckon, T. & Alshammari, N. Discrete curvature representations for noise robust image corner detection. IEEE Trans. Image Process. 28, 4444–4459 (2019).
- 57.Yu, X., Zhao, Y., Gao, Y., Xiong, S. & Yuan, X. Patchy image structure classification using multi-orientation region transform. In Association for the Advancement of Artificial Intelligence 12741–12748 (2020).
- 58.Nilsback, M. & Zisserman, A. Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics Image Processing 722–729 (2008).
- 59.Mohanty, S. P., Hughes, D. P. & Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. 7, 1419 (2016).
- 60.Simonyan, K. & Andrew, Z. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations 770–784 (2015).
- 61.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
- 62.Yang, Z. et al. Learning to navigate for fine-grained classification. In European Conference on Computer Vision 420–435 (2018).
- 63.Li, P., Xie, J., Wang, Q. & Gao, Z. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In Conference on Computer Vision and Pattern Recognition 947–955 (2018).
- 64.Luo, W. et al. Cross-x learning for fine-grained visual categorization. In Proceedings of the IEEE International Conference on Computer Vision 8242–8251 (2019).
- 65.Min, S., Yao, H., Xie, H., Zha, Z.-J. & Zhang, Y. Multi-objective matrix normalization for fine-grained visual recognition. IEEE Trans. Image Process. 29, 4996–5009 (2020).
- 66.Impedovo, D., Dentamaro, V., Abbattista, G., Gattulli, V. & Pirlo, G. A comparative study of shallow learning and deep transfer learning techniques for accurate fingerprints vitality detection. Pattern Recognit. Lett. 151, 11–18 (2021).
- 67.Hsu, Y.-C., Hong, C.-Y., Lee, M.-S., Geiger, D. & Liu, T.-L. ABC-Norm regularization for fine-grained and long-tailed image classification. IEEE Trans. Image Process. 6, 66 (2023).
- 68.Zhao, P., Li, Y., Tang, B., Liu, H. & Yao, S. Feature relocation network for fine-grained image classification. Neural Netw. 161, 306–317 (2023).
- 69.Wang, C., Fu, H. & Ma, H. Learning mutually exclusive part representations for fine-grained image classification. IEEE Trans. Multimed. 6, 66 (2023).
- 70.Wang, S. et al. Accurate fine-grained object recognition with structure-driven relation graph networks. Int. J. Comput. Vis. 132, 137–160 (2024).