Scientific Reports. 2024 Oct 12;14:23879. doi: 10.1038/s41598-024-74835-1

GroupFormer for hyperspectral image classification through group attention

Rahim Khan 1, Tahir Arshad 2, Xuefei Ma 1, Haifeng Zhu 1, Chen Wang 1, Javed Khan 3, Zahid Ullah Khan 1, Sajid Ullah Khan 4
PMCID: PMC11470927  PMID: 39396096

Abstract

Hyperspectral image (HSI) data contains a wide range of valuable spectral information for numerous tasks, but it also poses challenges such as scarce training samples and redundant information. Researchers have introduced various works to address these challenges. Convolutional neural networks (CNNs) have achieved significant success in HSI classification; however, CNNs primarily extract low-level features and have a limited ability to model long-range dependencies because of their confined filter size. In contrast, vision transformers perform well in HSI classification because their attention mechanisms learn long-range dependencies, but they require sufficient labeled training data. To address this challenge, we propose a spectral-spatial feature extractor group attention transformer that consists of a multiscale feature extractor for low-level (shallow) features and a group attention mechanism for high-level semantic features. The proposed model is evaluated on four publicly available HSI datasets: Indian Pines, Pavia University, Salinas, and KSC. It achieves the best classification results in terms of overall accuracy (OA), average accuracy (AA), and the Kappa coefficient while using only 5%, 1%, 1%, and 10% of the samples from these four datasets for training.

Keywords: Attention Module, Convolutional neural network, Hyperspectral image classification, Vision Transformer

Subject terms: Environmental sciences, Planetary science, Engineering

Introduction

The classification of hyperspectral image (HSI) data is an essential component of earth observation1, since HSI data is made up of many narrow bands that each carry a significant amount of information2. It is used extensively in the environmental sciences, mineralogy, the military3, and agriculture4. HSI data poses three major challenges because it captures spectral information from multiple adjacent spectral bands of surface objects5. First, HSI data contains hundreds of spectral bands, each carrying its own information6, and the overlap among these bands significantly increases computational complexity5. Second, classification is complicated by the frequent mixing of pixels in HSIs, where a single pixel often covers more than one category7,8. Finally, the number of labeled samples is limited because manually labeling HSI samples is costly and time-consuming9. Over the last decade, multiple solutions have been developed to address these issues, with early approaches employing typical machine learning techniques such as logistic regression, support vector machines, k-nearest neighbors, and Bayesian estimation.

Similarly, at an early stage, traditional methods such as principal component analysis (PCA)6 and linear discriminant analysis (LDA)7,8 were used as dimensionality reduction approaches to extract spectral information. However, both algorithms overlook the spatial correlation between pixels, which is essential for effective spatial feature extraction9,10. To overcome this, researchers created mathematical operators such as extended morphological attribute profiles; however, these techniques fail to adequately extract spectral-spatial features, and misclassification remains a common outcome of these conventional models3,11.

In recent years, deep learning models have proven adept at segmentation, target identification, and image classification. These models use a hierarchical architecture to extract abstract features from raw data and enable a non-linear mapping from feature space to label space12,13. This improves identification and classification accuracy while drastically lowering the need for material and human resources. Convolutional neural networks (CNNs) have shown promise in effectively gathering and using spatial-spectral data14,15. Researchers have used numerous CNN models, including one-dimensional, two-dimensional, and more sophisticated designs such as three-dimensional CNNs that combine spectral and spatial data16,17. Despite these improvements, problems with computational efficiency and model complexity remain in both CNNs and transformer-based models18,19. Attention-based models such as spectral-spatial transformers have been introduced to capture global dependencies in HSI data20,21. Yet these models are mostly based on multi-head self-attention modules, which require a large amount of training data and can be computationally expensive22,23.

The motivation behind this study is to address several significant issues in HSI classification. Effective semantic feature extraction is a basic difficulty in HSI classification because of the data's high dimensionality and complicated spectral properties. Due to their small receptive fields, traditional CNNs often struggle with these tasks because they cannot capture higher-level semantic information and long-range relationships. This drawback is exacerbated by the convolution operation's innate tendency to favor local patterns and to ignore the wider spatial-spectral correlations that are essential for precise classification. In this study, we present GroupFormer, a multi-head network architecture that incorporates an Auxiliary Feature Enhancement (AFE) module to improve feature representation and maximize parameter efficiency, overcoming these restrictions. By using a multi-head attention mechanism, GroupFormer captures a variety of subtle feature representations concurrently, allowing the model to concentrate on multiple facets of the input data and develop a more thorough understanding of the underlying patterns. The AFE module is intended to maximize the efficiency of the model by lowering the number of parameters without sacrificing performance; it improves the feature extraction process while keeping the model size small by using auxiliary features and parameter-sharing methods. Reduced training and testing durations across multiple datasets indicate the considerable gains in computational efficiency of our proposed strategy, enabled by the simplified multi-head network design and the efficient AFE integration. Additionally, GroupFormer consistently achieves superior accuracy and F1 scores at lower computational cost compared to classic CNNs and simple transformer encoders on a variety of benchmark datasets, namely Indian Pines, Pavia University, Salinas, and Kennedy Space Center (KSC). We also adopt a transformer-based strategy that uses the self-attention mechanism to overcome these problems and improve the acquisition of both local and global aspects of HSI data. Specifically, we present the AFE module, which is intended to integrate spectral and spatial information better than traditional CNNs: it uses larger receptive fields and adaptive methods to aggregate data across various scales, improving the model's capacity to extract complex semantic information. The suggested approaches are not without difficulties; transformers, although powerful, can be computationally demanding and need a large quantity of training data to operate at their best. This research therefore optimizes the model architecture and uses careful training methodologies to increase efficiency and performance.

Moreover, we performed a comprehensive ablation study to compare GroupFormer's performance with a basic transformer encoder and verify its efficacy. The results underscore the enhanced effectiveness and efficiency of our approach, highlighting its practical importance. GroupFormer integrates a multi-head network architecture with the AFE module to provide a fresh approach to HSI classification. By addressing significant issues of computational efficiency and model complexity, the proposed approach provides a reliable and effective technique for HSI analysis.

The main contributions of this paper are as follows:

  • The proposed framework is designed to classify hyperspectral images. The model integrates a CNN and an attention mechanism to effectively capture both the spectral and spatial features of hyperspectral images, improving classification results.

  • Multiscale convolutional neural networks are utilized to augment the spectral and spatial features of HSI data. This captures multiscale information, which enhances the ability to recognize complex patterns within the hyperspectral data.

  • We propose a group attention mechanism that enables the model to focus on the most informative pixels to enhance the accuracy and efficacy of the classification process.

The remainder of this paper is structured as follows: the Related Work section reviews existing approaches to HSI classification. The Proposed Methodology section describes the suggested model architecture, including the group attention module and the AFE module, and how they improve feature extraction and classification. The Dataset and Experimental Evaluation section covers the properties of the HSI datasets, the experimental setup, evaluation metrics, and the baseline models used for comparison. The Result and Discussion section compares the performance of our model with other state-of-the-art models and provides in-depth insights from ablation studies to emphasize the individual contributions of each component. The Conclusion summarizes the main findings, highlights the value of our proposed model, and suggests possible directions for future research.

Related work

HSI classification has developed significantly in the last few years, mainly due to the availability of high-quality datasets and the development of complex algorithms. In addition to emphasizing the advances brought about by different approaches, this section covers the major contributions of HSI classification. A vision transformer model24 has recently demonstrated superior performance in the field of computer vision. The Transformer model employs a self-attention technique to capture global dependencies. HSI classification commonly employs attention mechanisms. In21,23, researchers developed a spectral-spatial attention network to extract distinctive features from an HSI cube. Researchers have extensively utilized the vision transformer concept for HSI classification. He, X. et al.25 introduced a method known as spectral-spatial transformers (SST), which utilizes a VGG-Net model to extract spectral-spatial features and establish a connection with a density transformer. In26, the researchers introduced a model known as the Spectral Former. This model can acquire and utilize spectral information group-wise while incorporating a cross-layer transformer encoder. A positive feedback spatial-spectral correlation network based on spectral interclass slicing (PFSSC-SICS) technique for HSI classification is presented in27. It uses a spectral interclass slicing technique to enhance spectral signatures, addressing issues such as limited label samples and spectral similarity across classes. This is combined with a spatial-spectral correlation module and a positive feedback mechanism to improve feature extraction. The experimental findings show that PFSSC-SICS is a reliable solution for HSI Classification, with significant performance gains over current approaches.

Sun, L. et al.28 introduced a spectral-spatial feature tokenization transformer for HSI classification. This method utilizes both 3D and 2D CNN models to extract multiscale spatial-spectral features. Additionally, it incorporates a Gaussian-weighted tokenizer. As shown in29, Zhang, J. et al. created a convolutional network called CT Mixer using a transformer model. This network includes a unique local-global multi-head self-attention mechanism. The study in30 presents a convolutional fusion network solution for HSI classification that uses a multi-hop graph to rectify attention and spectral overlap grouping. To overcome issues with small sample sizes, a multi-hop graph rectifies attention for graph convolution, and a spectral inter-group feature extraction module is incorporated for spectral feature extraction. The effective fusing of CNN and graph convolutional network features is made possible by the Gaussian weighted fusion module, which exhibits excellent classification performance on a variety of datasets. In31,32, the authors proposed backbone networks for extracting multiscale hyperspectral features due to the complexity and computational cost of the self-attention modules in vision transformers (ViT). Li, B. et al.33 proposed a multi-granularity vision transformer via a semantic tokens transformer to learn multi-granularity features and improve accuracy; the authors used the LFE module to extract local features. In34, the authors proposed a hierarchical attention transformer for HSI classification to extract complex features using a hierarchical attention mechanism. For HSI classification, the authors of35 present a double-branch CNN and enhanced graph attention network (CEGAT) fusion network. With modules such as a spatial-spectral correlation attention module for spatial-spectral feature extraction and linear discrimination of spectral inter-class slices for spectral redundancy reduction, it handles limited labeled data. By combining graph attention network and CNN branches with a key sample selection approach, CEGAT outperforms other algorithms in HSI classification tasks by improving classification accuracy. The authors of36 present an approach for HSI classification that uses a transformer-enhanced two-stream complementary CNN. The technique consists of a spectral feature extraction stream that incorporates a hybrid convolution block and an attention mechanism using a transformer encoder, as well as a spatial feature extraction stream. The complementary spectral-spatial weight feature module effectively utilizes features from both streams, resulting in higher classification performance.

In37, the authors proposed a hybrid former network in which a CNN extracts shallow features and a spectral-spatial attention (SSA)-based transformer encoder extracts semantic features. Nevertheless, transformer-based approaches do have certain restrictions: the above models rely on a multi-head self-attention module to handle long-range dependencies, and they also depend on sufficient training samples. In comparison, it is possible to combine a convolutional model with a transformer for low- and high-level feature extraction and then apply a classifier. Using an attention-based approach to multiscale spectral and spatial feature extraction significantly improves network performance, and a multiscale 3D-CNN with a group attention layer can extract features by combining spatial and spectral information. From a spatial-information perspective, an uneven distribution of samples further aggravates the small-sample problem. For HSI classification, the authors of38 suggest a feature-complementary attention network based on adaptive knowledge filtering. They develop a dual pyramid spectrum-spatial attention module and a nonlocal band regrouping approach to collect spectral properties and remove duplicated information; on three difficult datasets, the suggested approach outperforms state-of-the-art methods. The authors of39 present a morphological transformer technique called MorphFormer, which combines a spatial morphological network with a trainable spectral network. This technique utilizes spatial and spectral morphological convolutions with an attention mechanism to combine shape and structural information. The experimental results demonstrate that this technique performs better than existing methods.

In40, a dual-branch network for HSI classification is introduced that integrates multi-scale dual aggregated attention with cross-channel dense connections (CDC-MDAA). These models show significant improvement in HSI classification, but previous work usually focuses on designing deep and complicated networks. CNN models remain indispensable for capturing spectral-spatial features through convolution kernels, while transformer models learn long-range dependencies through self-attention mechanisms; however, transformers suffer from high computational cost, slow inference, and high memory usage. Integrating the strengths of both, a CNN can extract low-level features while a transformer learns high-level features over the spectral-spatial dimensions with multiple attributes and scales. We therefore propose a GroupFormer model for HSI classification that uses a group attention mechanism together with the AFE module. The proposed model has fewer FLOPs than other state-of-the-art models and achieves satisfactory classification results while using small training samples.

Proposed methodology

Figure 1 depicts the proposed model. This section explains the model's structure and its functioning. In this study, we aim to address the issue of limited training samples through the network structure. First, PCA is applied to the raw HSI data for dimensionality reduction. Let the HSI data be represented by $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ indicate the height and width of the data and $C$ represents the spectral dimension. After applying PCA along the spectral dimension, the hyperspectral data can be represented as $X_{pca} \in \mathbb{R}^{H \times W \times B}$, where $B$ is the number of retained components. Suppose that $X_{pca}$ contains $N$ labeled pixels $\{x_i\}_{i=1}^{N}$ with corresponding one-hot labels $\{y_i\}_{i=1}^{N}$, where $K$ represents the number of classes. A spectral-spatial vector is defined as the $s \times s$ spatial neighborhood around each center pixel. The cross-entropy loss function estimates the difference between the predicted class label and the ground-truth label.
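
A minimal sketch of this preprocessing step is shown below, assuming scikit-learn for PCA; the helper names, the reflect padding, and the treatment of label 0 as background are illustrative choices rather than the authors' exact implementation.

```python
# Minimal preprocessing sketch: PCA along the spectral dimension, then
# extraction of s x s spectral-spatial patches centred on each labeled pixel.
# Helper names, reflect padding, and treating label 0 as background are
# illustrative choices, not the authors' exact implementation.
import numpy as np
from sklearn.decomposition import PCA

def apply_pca(cube, num_components=30):
    """Reduce an (H, W, C) cube to (H, W, B) with B = num_components."""
    h, w, c = cube.shape
    reduced = PCA(n_components=num_components, whiten=True).fit_transform(
        cube.reshape(-1, c))
    return reduced.reshape(h, w, num_components)

def extract_patches(cube, labels, patch_size=13):
    """Return one patch per labeled pixel together with its (zero-based) label."""
    margin = patch_size // 2
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)), mode="reflect")
    patches, targets = [], []
    for r, c in zip(*np.nonzero(labels)):
        patches.append(padded[r:r + patch_size, c:c + patch_size, :])
        targets.append(labels[r, c] - 1)
    return np.stack(patches), np.array(targets)
```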

Figure 1.


Illustration of the proposed model for HSI classification. [The satellite imagery in Fig. 1 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Fig. 1 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM.]

Multiscale spectral spatial feature extractor

We use different filters and kernel sizes to extract multiscale spectral-spatial features from the PCA-reduced data. The convolution operation in a 3D convolution model applies a 3D convolutional kernel that acts on the spatial dimensions and captures the correlation between multiple spectral bands. Given an input cube and $N$ 3D convolution kernels of a given size, the convolution produces output feature maps that form a 4D tensor (two spatial dimensions, one spectral dimension, and the number of kernels). The number of bands along the spectral dimension is fixed, so the final result is a 4D tensor. The 3D kernel better utilizes the spectral features; hence, the 3D CNN approach is well suited for HSI classification, which involves abundant spectral information. During the computation of the 3D convolution, the activation value at position $(x, y, z)$ on the $j$th feature map in the $i$th layer is obtained. The mathematical formula for the 3D convolution is given in Eq. (1):

$$v_{i,j}^{x,y,z} = \phi\Bigl(b_{i,j} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{i,j,m}^{p,q,r}\, v_{i-1,m}^{x+p,\,y+q,\,z+r}\Bigr) \qquad (1)$$

where $v_{i,j}^{x,y,z}$ is the output value at position $(x, y, z)$, $\phi(\cdot)$ is the activation function, $b_{i,j}$ is the bias, $w_{i,j,m}^{p,q,r}$ are the kernel weights, $m$ indexes the feature maps of the previous layer, and $P_i$, $Q_i$, and $R_i$ are the height, width, and spectral depth of the kernel. The proposed multiscale 3D-CNN model extracts spectrally and spatially discriminative features using three-dimensional convolutional layers with various kernel sizes. For the multiscale spatial branch, smaller kernels are effective at capturing fine-grained, local features, while larger kernels with 64 filters capture more contextual information. By combining these kernel sizes, the model captures a comprehensive set of spatial features spanning both local details and broader patterns. The varying kernel sizes increase the effective receptive field of the convolutional layers, allowing the network to integrate information from larger spatial regions, which is crucial for accurately distinguishing between classes in the HSI that may have spatial dependencies at different scales. We then concatenate the enhanced spatial features. We employ three 3D CNN blocks for multiscale spectral feature augmentation in the same manner, each with an output filter size of 64; the chosen kernels allow the model to capture spectral dependencies over varying ranges of bands, from immediate neighbors to slightly broader contexts. Using 1 × 1 convolutions in the spectral domain helps reduce the spectral data's dimensionality while preserving essential information. Each 3D CNN block consists of a 3D convolutional layer, regardless of the branch, followed by a batch normalization layer with Gaussian error linear units (GELUs) as the activation function. Next, we use a 2D convolutional layer with 64 filters to fuse the spectral-spatial features from the two branches so that information at different scales is combined. Figure 2 illustrates the multiscale feature extractor.
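
A hedged PyTorch sketch of this extractor is given below. Because the exact kernel sizes are not legible in this version of the text, the values used here (e.g., 3 × 3 spatial kernels versus 5 × 5, and spectral kernels spanning 3 or 7 bands) are illustrative assumptions; only the overall structure (parallel 3D branches with batch normalization and GELU, concatenation, and a 64-filter 2D fusion layer) follows the description above.

```python
# Sketch of the multiscale spectral-spatial extractor: parallel 3D convolution
# branches with different (assumed) kernel sizes, concatenated and fused by a
# 64-filter 2D convolution, each block using BatchNorm + GELU as in the text.
import torch
import torch.nn as nn

class MultiscaleExtractor(nn.Module):
    def __init__(self, in_bands=30, filters=64):
        super().__init__()
        def block(kernel, pad):
            return nn.Sequential(
                nn.Conv3d(1, filters, kernel_size=kernel, padding=pad),
                nn.BatchNorm3d(filters),
                nn.GELU(),
            )
        # spatial branch: smaller vs. larger spatial kernels (illustrative sizes)
        self.spatial_small = block((3, 3, 3), (1, 1, 1))
        self.spatial_large = block((3, 5, 5), (1, 2, 2))
        # spectral branch: kernels spanning different numbers of bands
        self.spectral_short = block((3, 1, 1), (1, 0, 0))
        self.spectral_long = block((7, 1, 1), (3, 0, 0))
        # 2D fusion layer applied after folding the spectral axis into channels
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * filters * in_bands, filters, kernel_size=3, padding=1),
            nn.BatchNorm2d(filters),
            nn.GELU(),
        )

    def forward(self, x):                       # x: (batch, 1, bands, s, s)
        y = torch.cat([self.spatial_small(x), self.spatial_large(x),
                       self.spectral_short(x), self.spectral_long(x)], dim=1)
        n, c, b, h, w = y.shape
        y = y.reshape(n, c * b, h, w)           # merge channel and band axes
        return self.fuse(y)                     # (batch, filters, s, s)
```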

Figure 2.


Illustration of the multiscale spectral spatial feature extraction.

Group attention module

Figure 1 shows an illustration of the group attention module. This structure consists of three stages: feature grouping, group propagation, and feature ungrouping. The first step involves classifying the features of the image. Next, in the second phase, global knowledge spreads amongst the grouped features. Ultimately, the final stage reintegrates this comprehensive information into the image features.

Feature grouping

After the feature extractor module, the input to the group attention layer consists of the image features $X \in \mathbb{R}^{N \times C}$, where $N$ is the total number of image features and $C$ is the dimensionality of each feature vector. A matrix $G \in \mathbb{R}^{M \times C}$ stores $M$ learnable group tokens. A simple multi-head attention operation initiates the grouping process and produces the grouped features, denoted $\tilde{G}$:

$$A_h = \mathrm{Softmax}\bigl((G W_h^{Q})(X W_h^{K})^{\top} / \sqrt{d}\bigr) \qquad (2)$$
$$\tilde{G} = \mathrm{Concat}\bigl(A_1 X W_1^{V}, \ldots, A_H X W_H^{V}\bigr) \qquad (3)$$

In the attention operation, we denote the channel number per head as $d$, the head index as $h$ (with $H$ heads in total), and the projection matrices for the query, key, and values as $W_h^{Q}$, $W_h^{K}$, and $W_h^{V}$. After the concatenation, we eliminate the feature projection layer by setting it to the identity matrix. The grouped features can therefore be described as the weighted summation of the image features at each head, where the weights are computed by the attention operation.
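
The following PyTorch sketch illustrates the feature-grouping step described by Eqs. (2) and (3); the dimensions, head count, and number of group tokens are illustrative assumptions, not the exact published configuration.

```python
# Sketch of the feature-grouping step (Eqs. 2-3): M learnable group tokens
# query the N image features with multi-head attention; per the text, the
# output projection after head concatenation is the identity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGrouping(nn.Module):
    def __init__(self, dim=64, num_groups=8, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim))  # G
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                       # x: (batch, N, C) image features
        b, n, c = x.shape
        d = c // self.num_heads
        g = self.group_tokens.unsqueeze(0).expand(b, -1, -1)            # (batch, M, C)
        q = self.q_proj(g).view(b, -1, self.num_heads, d).transpose(1, 2)
        k = self.k_proj(x).view(b, n, self.num_heads, d).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.num_heads, d).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)    # Eq. (2)
        grouped = (attn @ v).transpose(1, 2).reshape(b, -1, c)          # Eq. (3)
        return grouped                          # (batch, M, C) grouped features
```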

Group propagation

Once the grouped features have been obtained, global information can be updated and propagated among them. The multilayer perceptron (MLP) Mixer requires a fixed input size corresponding to the selected number of groups. Our MLP-Mixer comprises a set of sequential MLPs. $\tilde{G}$ contains the grouped features from the feature grouping step and is updated to $\hat{G}$ by the MLP-Mixer layer:

$$\tilde{G}' = \tilde{G} + \mathrm{MLP}_1\bigl(\mathrm{LN}(\tilde{G})^{\top}\bigr)^{\top} \qquad (4)$$
$$\hat{G} = \tilde{G}' + \mathrm{MLP}_2\bigl(\mathrm{LN}(\tilde{G}')\bigr) \qquad (5)$$

The first MLP mixes information across groups (token mixing), whereas the second MLP mixes information across channels; $\mathrm{LN}(\cdot)$ denotes layer normalization.
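
A compact sketch of this propagation step is given below, following the standard MLP-Mixer pattern (group/token mixing, then channel mixing); the layer normalization placement and hidden width are assumptions.

```python
# Sketch of group propagation (Eqs. 4-5): an MLP-Mixer block over the grouped
# features, mixing across groups first and then across channels.
import torch.nn as nn

class GroupPropagation(nn.Module):
    def __init__(self, num_groups=8, dim=64, hidden=128):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.group_mlp = nn.Sequential(nn.Linear(num_groups, hidden), nn.GELU(),
                                       nn.Linear(hidden, num_groups))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                         nn.Linear(hidden, dim))

    def forward(self, g):                       # g: (batch, M, C) grouped features
        # inter-group (token) mixing, Eq. (4)
        g = g + self.group_mlp(self.norm1(g).transpose(1, 2)).transpose(1, 2)
        # channel mixing, Eq. (5)
        g = g + self.channel_mlp(self.norm2(g))
        return g
```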

Feature ungrouping

Once the grouped features have been updated, the global information can be returned to the image features through a feature ungrouping procedure. The features are ungrouped with a transformer decoder layer in which the image features query the grouped features:

$$Z = \mathrm{Concat}\bigl(X,\ \mathrm{Softmax}\bigl((X W^{Q})(\hat{G} W^{K})^{\top} / \sqrt{d}\bigr)\, \hat{G} W^{V}\bigr) \qquad (6)$$
$$X_{\mathrm{out}} = Z\, W^{P} \qquad (7)$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ represent the projection matrices of the attention, while $W^{P}$ represents the linear matrix used to project the concatenated features $Z$ back to the same dimension as the image features $X$. The original transformer decoder layer is modified by replacing the initial residual connection with a concatenation operation; the feature projection layer is then moved after the concatenation to transform the features back to their initial dimensions.
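
The sketch below illustrates this ungrouping step under the same assumptions as the grouping sketch; the concatenation-plus-projection follows the description above, while the head count and dimensions are illustrative.

```python
# Sketch of feature ungrouping (Eqs. 6-7): image features query the updated
# group tokens, the attention output is concatenated with the image features
# (replacing the residual connection) and projected back to C channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureUngrouping(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(2 * dim, dim)   # maps concat(Z) back to C dims

    def forward(self, x, g):                      # x: (batch, N, C), g: (batch, M, C)
        b, n, c = x.shape
        d = c // self.num_heads
        q = self.q_proj(x).view(b, n, self.num_heads, d).transpose(1, 2)
        k = self.k_proj(g).view(b, -1, self.num_heads, d).transpose(1, 2)
        v = self.v_proj(g).view(b, -1, self.num_heads, d).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        z = torch.cat([x, out], dim=-1)           # concatenation, Eq. (6)
        return self.out_proj(z)                   # projection, Eq. (7)
```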

Auxiliary feature enhancement module

We propose an auxiliary feature enhancement (AFE) module that further enhances the spectral-spatial features while decreasing the model's parameters. This module consists of 2D convolutional layers, each followed by a batch normalization layer and an activation function, with 32 and 64 filters, respectively. Even when many discriminative features have been obtained, the effectiveness of direct classification is limited if the input size remains high after the previous two modules; the AFE module addresses this. The AFE module is shown in Fig. 3.

Figure 3.


Illustration of auxiliary feature enhancement module.

Moreover, after PCA, the reduced cube is fed into the multiscale feature extractor module consisting of 3D and 2D layers. After this low-level feature extraction, the group attention layer, which operates on the interleaved group tokens, extracts high-level semantic information. Then, the auxiliary feature enhancement module refines these features.

Moreover, the AFE module is intended to increase our model’s efficiency by cutting down on parameters without sacrificing performance. To extract more pertinent information and enable the model to operate with fewer parameters, the AFE module makes use of auxiliary features that enhance the model’s core characteristics. Auxiliary feature extraction, feature fusion, and parameter sharing are the three primary parts of the AFE module architecture. The auxiliary feature extraction component obtains additional contextual information by enhancing the core features. These auxiliary characteristics are integrated with the major features through the feature fusion process in a way that maximizes information flow without appreciably raising the complexity of the model. The number of unique parameters that the model requires is further decreased by parameter sharing across the major and auxiliary features. We performed a comparison examination of models with and without the AFE module to support our assertions.
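
A hedged PyTorch sketch of such a refinement stack is given below; the 3 × 3 kernel size, the pooling-based classification head, and the channel counts beyond the stated 32 and 64 filters are assumptions rather than the exact published configuration.

```python
# Sketch of an AFE-style refinement stack: 2D convolutions with 32 and 64
# filters (as stated), each followed by BatchNorm and GELU; the 3 x 3 kernel
# and the pooling-based classification head are assumptions.
import torch.nn as nn

class AFEModule(nn.Module):
    def __init__(self, in_channels=64, num_classes=16):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.GELU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.GELU(),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, num_classes))

    def forward(self, x):                         # x: (batch, C, s, s) refined features
        return self.head(self.refine(x))
```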

Ablation research

In the ablation study, the performance of a basic transformer encoder and the proposed multi-head network was assessed on the given datasets. Table 1 below provides an overview of the findings:

Table 1.

Ablation comparison of a basic transformer encoder and the proposed multi-head network on the four datasets.

Datasets Model Accuracy (%) F1 score Training time (h) Testing time (s)
Indian Pines 5% Simple transformer encoder 86.9 84.1 4.6 54
Indian Pines 5% Proposed multi-head network 89.2 87.3 4.3 44
Pavia University 1% Simple transformer encoder 88.2 85.3 6.2 69
Pavia University 1% Proposed multi-head network 92.3 88.4 4.9 64
Salinas 1% Simple transformer encoder 87.5 84.5 6.6 64
Salinas 1% Proposed multi-head network 90.7 87.1 5.9 54
KSC 10% Simple transformer encoder 84.3 81.2 5.7 59
KSC 10% Proposed multi-head network 93.2 90.1 6.4 59

The ablation findings demonstrate the efficacy of the proposed multi-head network design. Specifically, in terms of accuracy and F1 score, the proposed multi-head network consistently beats the simple transformer encoder on all datasets. On most datasets, it also exhibits faster training and testing times than the simple transformer encoder. These gains are explained by the multi-head network's capacity to capture a wider range of complex and varied feature representations, which improves the model's overall performance. The multi-head attention mechanism lets the model attend to many parts of the input data at the same time, which helps it comprehend the underlying patterns more thoroughly and robustly.

Dataset and experimental evaluation

Four classical HSI datasets were selected to verify the performance of the proposed methodology: the Pavia University, Salinas, Indian Pines, and KSC datasets. The Pavia University dataset was collected by the reflective optics system imaging spectrometer (ROSIS) sensor over Pavia, northern Italy. The spatial resolution of this dataset is 1.3 m, and the spatial size is 610 × 340 pixels. The collected bands range from 0.4 to 0.86 micrometers, and 103 bands were retained after removing the noisy bands. The Pavia University dataset has a total of 42,776 labeled samples. We randomly selected 1% of the samples for training, and the remaining samples are used for testing. Table 2 lists the Pavia University labeled samples.

Table 2.

Pavia University labeled samples [The satellite imagery in Table 2 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Table 2 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].


The Salinas dataset was captured by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor over the Salinas Valley, California; the image size is 512 × 217 pixels with a spatial resolution of 3.7 m. The spectrum contains 224 bands; after omitting the bands absorbed by atmospheric water, 204 bands are retained. The Salinas dataset has a total of 54,129 samples. We randomly selected 1% of the samples for training the model, and the remaining samples are used for testing. Table 3 lists the Salinas dataset labeled samples.

Table 3.

Salinas dataset labeled samples [The satellite imagery in Table 3 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Table 3 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].


The Indian Pines dataset was acquired by the AVIRIS sensor over Indiana, USA, in 1992. The original image consists of 220 spectral bands with wavelengths ranging from 0.4 to 2.5 micrometers. After excluding the water absorption bands, 200 bands were retained for the investigation. The image has a spatial size of 145 × 145 pixels representing 16 distinct land cover categories. The Indian Pines dataset has a total of 10,249 samples. We selected 5% of the samples at random for training the model, and the remaining samples were used for testing. Table 4 shows the Indian Pines dataset labeled samples.

Table 4.

Indian Pines dataset labeled samples [The satellite imagery in Table 4 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Table 4 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].


The KSC dataset used in this study was acquired by the National Aeronautics and Space Administration (NASA) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) instrument over the Kennedy Space Center (KSC), Florida, USA, on March 23, 1996. The dataset comprises 224 spectral bands over a 512 × 614 pixel scene, with a spatial resolution of 18 m per pixel. The spectral range spans 400 to 2500 nm with a spectral resolution of 10 nm. The classification task is challenging due to the presence of 13 distinct land cover types, including water and mixed classes. To optimize the data for classification, water absorption and low signal-to-noise ratio (SNR) bands were removed, leaving 176 bands. Table 5 details the number of training and testing samples for each class. The dataset's diverse spectral and spatial characteristics and the abundance of labeled pixels make it well suited for evaluating and refining HSI classification models. The KSC dataset has a total of 5211 samples. We randomly selected 10% of the samples for training the model, and the remaining samples are used for testing.

Table 5.

KSC dataset labeled samples [The satellite imagery in Table 5 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Table 5 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].


Experimental setup

For this experiment, we utilized the Adam optimizer and selected categorical cross-entropy as the loss function to train the proposed model. The learning rate is set to 0.0001 and the weight decay to 0.00001. We use 30 PCA components and a 13 × 13 input patch size for all four datasets. The batch size and number of epochs are set to 64 and 100, respectively.
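
A minimal training-loop sketch matching this setup is shown below; the `model` and `train_loader` objects (e.g., the GroupFormer network and a dataset of 13 × 13 patches batched at 64) are assumed to be defined elsewhere.

```python
# Minimal training-loop sketch for the stated setup: Adam, lr 1e-4, weight
# decay 1e-5, categorical cross-entropy, 100 epochs (batch size 64 is set in
# the DataLoader). `model` and `train_loader` are assumed to exist elsewhere.
import torch
import torch.nn as nn

def train(model, train_loader, epochs=100, device="cuda"):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    for _ in range(epochs):
        model.train()
        for patches, labels in train_loader:      # patches: (64, 1, 30, 13, 13)
            patches, labels = patches.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(patches), labels)
            loss.backward()
            optimizer.step()
    return model
```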

Model parameter selection

This section examines the factors that influence classification accuracy, such as the number of PCA components and the patch (window) size.

Selection of dimensionality reduction method

HSI data contains a significant amount of redundant and noisy information. Effective feature preprocessing can minimize unnecessary information and alleviate the computational load. Comparative experiments were conducted on the four datasets using the PCA, LDA, and factor analysis (FA) preprocessing methods; the results are given in Tables 6, 7, 8 and 9. The experiments show that the results obtained with PCA surpass those achieved with FA and LDA. Therefore, PCA was chosen to extract the spectral features.
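
For reference, the three compared preprocessing methods can be applied with scikit-learn as sketched below; the function name and component counts are illustrative, and LDA is supervised, so it is capped at (number of classes − 1) components.

```python
# Sketch of the compared preprocessing methods using scikit-learn; the input is
# a (num_pixels, num_bands) matrix of spectra. LDA is supervised and is capped
# at (number of classes - 1) components; component counts are illustrative.
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_spectra(flat_pixels, labels=None, method="pca", n_components=30):
    if method == "pca":
        return PCA(n_components=n_components).fit_transform(flat_pixels)
    if method == "fa":
        return FactorAnalysis(n_components=n_components).fit_transform(flat_pixels)
    if method == "lda":
        k = min(n_components, len(set(labels)) - 1)
        lda = LinearDiscriminantAnalysis(n_components=k)
        return lda.fit(flat_pixels, labels).transform(flat_pixels)
    raise ValueError(f"unknown method: {method}")
```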

Table 6.

Impact of different dimensionality reduction techniques on the Indian pines dataset.

Methods OA (%) AA (%) Kappa (×100)
FA 93.54 92.65 93.8
LDA 94.01 92.89 93.69
PCA 97.60 94.65 97.2
Table 7.

Impact of different dimensionality reduction techniques on the Salinas Dataset.

Methods OA (%) AA (%) Kappa (×100)
FA 94.30 94.05 94.0
LDA 93.79 91.80 92.7
PCA 99.74 99.76 99.7
Table 8.

Impact of different dimensionality reduction techniques on the Pavia University dataset.

Methods OA (%) AA (%) Kappa (×100)
FA 96.57 94.67 95.44
LDA 97.02 95.05 95.26
PCA 98.47 97.23 97.9
Table 9.

Impact of different dimensionality reduction techniques on KSC dataset.

Methods OA (%) AA (%) Kappa (×100)
FA 97.58 97.26 97.71
LDA 98.09 98.19 98.89
PCA 99.53 99.24 99.4

Result and discussion

To verify the effectiveness of the proposed model, we present experimental results for state-of-the-art models: HybridSN41, SSRN42, BS2T43, SpectralFormer26, MorphFormer39, FouriorFormer44, GSC-ViT45, and GAHT46. For a fair comparison, we use the same training samples and the same patch size for all comparison models. Tables 2, 3, 4 and 5 list the details of the four benchmark datasets on which we conducted the experiments. CNN techniques generally obtain good accuracy on all four datasets because they better capture contextual, especially low-level, features. In contrast, transformer-based models such as SpectralFormer and MorphFormer exhibit weaker performance despite their ability to capture significant spectral-spatial features. The basic reason is that SpectralFormer does not fully exploit the three-dimensional nature of the data, and it is challenging to implicitly represent the spatial relationships with a limited number of training samples. Among the comparison algorithms, FouriorFormer, GSC-ViT, and GAHT are particularly well suited for HSI data and perform better because they incorporate convolutional layers. Nevertheless, the proposed model improves OA by 4.26%, 2.72%, 3.51%, 8.22%, 5.67%, 1.27%, 2.01%, and 2.04% compared with the HybridSN, SSRN, BS2T, SpectralFormer, MorphFormer, FouriorFormer, GSC-ViT, and GAHT models, respectively. The classification results for Pavia University are shown in Table 10. Hybrid CNN-transformer approaches outperform pure transformer approaches, implying that the combination of the two architectures can be advantageous but requires further refinement. The proposed model successfully extracts both local and high-level features by combining the strengths of the CNN and the transformer-based attention module; as a result, it outperforms other transformer-based classification models, providing a significant advantage. Comparison models may perform poorly when HSI classification uses a limited number of samples. The proposed model achieves the best classification results in most classes, for example "Trees", and performs competitively on "Self-blocking bricks". Compared to the second-best FouriorFormer model, the proposed model improves the classification result by 1.27% and 3.24% in terms of OA and AA. Figure 4 shows the classification maps on the Pavia University dataset. As one can see, GSC-ViT, GAHT, and FouriorFormer obtain excellent classification results with less noise and good intra-class smoothness. In addition, the proposed model captures multiscale spectral-spatial features and high-level semantic features from the attention layer, so it obtains a better classification map with more detailed information.

Table 10.

Classification result (%) on Pavia University Dataset.

Class HybridSN SSRN BS2T SF MorphFormer FouriorFormer GSC-ViT GAHT Proposed
C1 94.18 97.54 96.13 95.11 90.73 98.81 96.02 95.87 98.21
C2 99.84 99.98 99.81 99.97 99.85 100 99.26 99.75 99.90
C3 87.15 94.89 53.99 66.98 69.58 83.01 86.42 84.88 93.53
C4 84.70 93.40 92.15 82.45 81.70 82.65 90.20 83.18 93.73
C5 100 99.77 99.84 97.37 98.87 99.39 99.62 99.92 98.93
C6 97.24 95.84 99.15 92.56 99.23 100 99.73 100 100
C7 91.41 86.94 68.18 53.37 89.97 96.96 99.54 99.16 99.76
C8 71.98 84.66 96.62 64.41 80.24 98.49 89.60 97.61 96.86
C9 96.15 58.48 93.27 68.51 65.84 86.65 87.51 71.29 94.18
OA 94.21 95.74 94.96 90.19 92.80 97.20 96.46 96.43 98.47
AA 91.41 90.17 88.79 80.08 86.22 93.99 94.21 92.41 97.23
Kappa 0.923 0.943 0.933 0.901 0.903 0.962 0.953 0.952 0.979
Params 795.9 (K) 198.52 (K) 1.35 (M) 194.1(K) 141.1 (K) 713.67 (K) 0.77 (M) 1.06 (M) 0.9 (M)
MACs 31.88 (M) 46.15 (M) 16.78 (M) 12.47(M) 5.97 (M) 110.32 (M) 49.64 (M) 177.62 (M) 0.63 (M)
TR Time (s) 54.89 84.65 412.66 63.29 135.89 190.13 150.96 152.66 101.27
TS Time (s) 8.60 13.69 80.79 22.82 16.58 40.65 16.05 23.43 27.16

Figure 4.


Classification maps on the Pavia University dataset. [The classification maps in Fig. 4 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].

Table 11 shows the classification results for the Salinas dataset. The results reveal that the proposed model exhibits a consistent performance gain across the classes. The OA achieved by the SpectralFormer and MorphFormer models is slightly lower. Among the comparison models, GSC-ViT and GAHT achieve the highest overall accuracy. The proposed model obtains better classification results for "Fallow-smooth", "Lettuce-romaine-6wk", and "Vineyard-untrained", where the group attention layer helps capture features better. Compared to the second-best model, the proposed model achieves improvements of 0.59%, 0.58%, and 0.007 in terms of OA, AA, and the Kappa coefficient. Figure 5 shows the classification maps of the different methods on the Salinas dataset; as can be seen, the GAHT model and the proposed model have less noise, and our classification map is very close to the ground truth. As the noise level increases, the accuracy of the classification maps tends to decrease.

Table 11.

Classification result (%) on Salinas Dataset.

Class HybridSN SSRN BS2T SF MorphFormer FouriorFormer GSC-ViT GAHT Proposed
C1 100 99.94 100 99.79 89.99 100 99.89 100 100
C2 100 100 100 100 99.89 100 100 100 100
C3 99.23 100 100 100 92.94 100 100 100 100
C4 98.55 99.92 99.49 91.95 88.98 82.24 99.34 99.56 99.92
C5 99.35 98.49 96.64 98.60 91.13 100 98.41 98.94 99.77
C6 99.94 100 100 100 98.13 100 99.97 99.97 100
C7 100 99.88 100 99.18 97.34 99.97 99.18 99.94 99.94
C8 95.42 99.48 99.97 93.07 90.63 99.34 98.44 99.17 99.92
C9 100 100 100 100 98.53 100 99.98 100 100
C10 97.68 99.19 99.07 98.02 87.11 99.13 99.47 99.32 99.43
C11 96.68 99.05 100 93.37 88.55 99.90 99.05 100 100
C12 99.73 100 100 100 88.36 100 100 99.89 100
C13 98.01 39.47 90.62 91.06 90.40 92.28 89.41 99.44 100
C14 97.92 97.92 98.77 98.58 90.55 98.77 97.45 93.95 97.90
C15 92.03 98.17 84.47 91.64 86.08 96.74 97.87 96.76 99.97
C16 99.38 99.49 99.94 99.94 88.31 99.88 99.32 100 100
OA 97.56 98.41 97.49 96.66 92.18 98.75 98.93 99.15 99.74
AA 98.37 95.69 98.06 97.20 91.68 98.01 98.61 99.18 99.76
Kappa 0.972 0.982 0.972 0.962 0.912 0.986 0.988 0.990 0.997
Params 796.8 (K) 139.69 (K) 1.36 (M) 205.59 (K) 141.5 (K) 715.9 (K) 1.04 (M) 1.07 (M) 0.98 (M)
MACs 31.96 (M) 47.10 (M) 14.71 (M) 13.69 (M) 5.98 (M) 110.34 (M) 118.93 (M) 177.81 (M) 1.25 (M)
TR Time (s) 51.93 90.70 391.18 65.91 153.86 206.13 157.63 189.37 168.93
TS Time (s) 10.98 20.28 103.65 25.37 26.09 51.23 19.87 22.18 54.80

Figure 5.


Classification maps obtained by different models on the Salinas dataset. [The classification maps in Fig. 5 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].

Table 12 shows the Indian Pines classification results. The proposed model achieves the best classification results with only small training samples: specifically, in the classes "Grass-trees", "Oats", "Soybean-notill", "Soybean-mintill", "Soybean-clean", "Buildings-grass-trees" and "Stone-steel-towers", the proposed model achieves the highest accuracy due to the use of a 3D CNN, which extracts more discriminative spectral-spatial features. The proposed model obtained 97.60%, 94.65%, and 0.972 in terms of OA, AA, and Kappa. Figure 6 shows the classification maps of different models on the Indian Pines dataset; as one can see, FouriorFormer, GSC-ViT, GAHT, and our proposed model produce competitive classification results in terms of OA, and their maps contain less noise. The classification results on the KSC dataset are shown in Table 13. The proposed model significantly improves OA, AA, and Kappa and outperforms the other comparison models; by using CNNs with a group attention module, more features are extracted. The proposed model obtained 99.53%, 99.24%, and 0.994 in terms of OA, AA, and Kappa. The classification maps derived from the comparison models on the KSC dataset show a small number of misclassified points with little noise, especially in specific small regions. The classification map generated by the proposed model on the KSC dataset, shown in Fig. 7, demonstrates reduced noise and fewer classification errors. The utilization of a multiscale feature extractor combined with a group attention module enables the establishment of dependencies at both low and high levels, which results in a notable improvement in classification accuracy even when the number of training examples is limited.

Table 12.

Classification result (%) on Indian pines dataset.

Class HybridSN SSRN BS2T SF MorphFormer FouriorFormer GSC-ViT GAHT Proposed
C1 6.66 9.60 4.54 4.54 56.81 31.81 20.45 11.11 73.08
C2 74.58 93.80 93.94 74.50 93.14 94.54 93.73 88.80 93.30
C3 79.37 99.61 77.50 64.38 92.77 97.97 96.07 94.78 97.18
C4 56.08 75.55 55.70 16 90.22 87.55 92.0 84.78 92.95
C5 96.80 97.38 95.25 92.37 100 97.38 97.60 98.93 98.16
C6 98.87 98.84 99.71 95.23 97.69 96.96 98.26 98.87 99.84
C7 70.37 20.41 21.08 3.70 66.66 22.03 96.42 94.18 97.26
C8 100 99.11 100 97.35 99.33 100 97.13 100 100
C9 36.84 10.87 11.66 5.26 36.84 20.98 21.05 78.94 80.05
C10 86.00 93.49 75.88 68.25 93.39 97.29 96.20 94.16 97.82
C11 83.16 97.94 88.16 83.49 95.24 98.92 98.11 96.93 99.09
C12 69.21 90.76 78.73 35.70 83.83 93.07 88.80 88.17 96.62
C13 93.46 86.66 99.49 93.33 98.46 84.61 91.28 98.49 98.37
C14 92.99 99.08 99.91 95.67 99.75 98.16 100 99.91 99.91
C15 76.47 95.09 80.86 54.22 92.37 90.73 91.00 95.98 99.13
C16 74.44 76.13 82.02 73.86 69.31 53.40 55.68 61.11 91.66
OA 83.66 94.994 87.91 76.60 94.30 95.51 95.09 94.36 97.60
AA 74.70 77.77 72.7 59.86 85.36 79.09 83.36 86.57 94.65
Kappa 0.814 0.938 0.861 0.729 0.934 0.948 0.943 0.935 0.972
Params 1.17 (M) 140.60 (K) 1.35 (M) 121.59 (K) 141.6 (K) 705.84 (K) 1.03 (M) 1.06 (M) 0.97 (M)
MACs 65.35 (M) 47.09 (M) 14.70 (M) 3.42 (M) 6.00 (M) 78.42 (M) 117.36 (M) 127.17 (M) 1.24 (M)
TR Time (s) 56.50 90.70 117.88 54.40 123.7 110.81 135.21 105.98 120.19
TS Time (s) 9.89 20.28 5.39 5.60 10.61 3.74 15.33 8.80 15.43

Figure 6.


Classification maps obtained on the Indian Pines dataset. [The classification maps in Fig. 6 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].

Table 13.

Classification result (%) on KSC Dataset.

Class HybridSN SSRN BS2T SF MorphFormer FouriorFormer GSC-ViT GAHT Proposed
C1 98.68 99.29 100 99.71 100 99.86 99.44 100 100
C2 96.34 99.11 100 97.78 100 100 96.96 100 100
C3 96.52 97.89 100 94.11 99.17 98.76 99.17 100 100
C4 88.98 99.14 99.14 89.31 77.82 98.32 78.66 96.23 92.46
C5 94.48 76.66 80.66 73.33 77.12 56.20 45.09 79.08 100
C6 83.49 99.06 99.53 89.67 85.77 100 94.49 99.54 97.70
C7 100 100 100 98.96 100 81.0 84.0 99.0 100
C8 100 100 99.75 99.75 99.75 100 100 100 100
C9 100 100 100 100 100 100 99.79 100 100
C10 100 100 100 100 100 100 100 100 100
C11 100 100 99.48 99.74 100 100 100 100 100
C12 100 100 100 99.57 91.42 100 99.16 100 100
C13 100 100 100 100 100 100 100 100 100
OA 98.03 98.94 99.27 97.66 96.70 98.10 96.34 99.13 99.53
AA 96.80 97.78 98.35 95.53 94.69 94.93 92.06 97.98 99.24
Kappa 0.978 0.988 0.991 0.974 0.963 0.978 0.959 0.990 0.994
Params 534.27 (K) 141.21 (K) 1.36 (M) 121.18 (K) 139.9 (K) 701.75 (K) 9.09 (M) 1.05 (M) 0.95 (M)
MACs 16.11 (M) 32.77 (M) 14.71 (M) 3.42 (M) 5.98 (M) 78.40 (M) 116.01 (M) 125.39 (M) 1.23 (M)
TR Time (s) 36.51 50.10 90.68 47.19 60.5 63.8 100.10 72.95 92.58
TS Time (s) 3.9 4.87 5.14 3.87 4.60 5.01 6.90 4.05 7.1

Figure 7.


Classification maps obtained on the KSC dataset. [The classification maps in Fig. 7 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].

Moreover, we have discussed the effect of different modules on four datasets in the ablation experiment section. Finally, we conducted experiments on training and testing times, as well as training sample ratios.

Ablation experiments

To comprehensively highlight the contribution of each module, we evaluated various combinations of the modules on the Salinas, Pavia University, Indian Pines, and KSC datasets using a patch size of 13 × 13. Our analysis covers the CNN, the group attention (GA) module, the AFE module, and their combinations, assessed through the OA, AA, and Kappa coefficient metrics.
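
For reference, the three evaluation metrics reported throughout the tables can be computed from a confusion matrix as in the short sketch below (an illustrative helper, not the authors' code).

```python
# Helper computing the three reported metrics from predictions: overall
# accuracy (OA), average per-class accuracy (AA), and the Kappa coefficient.
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                              # confusion matrix
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    aa = per_class.mean()                          # average accuracy
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1 - expected)       # Cohen's kappa
    return oa, aa, kappa
```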

The outcomes of these experiments are presented in Tables 14, 15, 16 and 17. The full proposed network achieves the best performance, which indicates that, within our network, the AFE module is useful for refining complex features and the group attention block helps capture more complex features and improves classification performance. Table 18 reports experiments with a multi-head self-attention (MHSA) module on the Salinas dataset to verify the effectiveness of the proposed group attention: as can be seen, the MHSA module does not capture the high-level semantic information from the HSI data as well.

Table 14.

Results of the ablation study with different combinations of modules on the Salinas dataset.

Metrics CNN CNN + GA GA + AFE CNN + AFE Proposed
OA (%) 97.32 97.59 96.38 97.89 99.74
AA (%) 96.68 97.14 95.91 97.25 99.76
Kappa 0.970 0.967 0.961 0.958 0.997

Table 15.

Results of the ablation study with different combinations of modules on the Pavia University dataset.

Metrics CNN CNN + GA GA + AFE CNN + AFE Proposed
OA (%) 96.25 96.78 94.18 96.89 98.47
AA (%) 95.47 95.33 93.09 95.96 97.23
Kappa 0.961 0.951 0.940 0.959 0.979

Table 16.

Results of the ablation study with different combinations of modules on the Indian Pines dataset.

Metrics CNN CNN + GA GA + AFE CNN + AFE Proposed
OA (%) 95.10 95.69 92.54 95.55 97.60
AA (%) 90.03 90.81 87.71 91.67 94.65
Kappa 0.941 0.945 0.919 0.948 0.972

Table 17.

Results of the ablation study with different combinations of modules on the KSC dataset.

Metrics CNN CNN + GA GA + AFE CNN + AFE Proposed
OA (%) 98.14 98.68 97.52 98.28 99.53
AA (%) 97.58 98.04 96.92 98.33 99.24
Kappa 0.976 0.988 0.973 0.980 0.994

Table 18.

Ablation experiments with MHSA and other modules on the Salinas dataset.

Metrics CNN CNN + MHSA MHSA + AFE
OA (%) 98.05 95.63 95.81
AA (%) 96.99 94.70 93.43
Kappa 0.970 0.948 0.948

Impact of training ratio

To evaluate the stability and robustness of the proposed model, we randomly select training samples from the four datasets at various ratios. Starting from a small number of training samples and gradually increasing it, the result of each model under different training ratios is shown in Fig. 8. The OA of the models increases progressively until it reaches a stable zone. The proposed model maintains the best classification accuracy, especially when few training samples are available.

Figure 8.


OAs at different numbers of samples on four datasets.

Discussion on training and testing time

The training time, testing time, and parameters of the HybridSN, SSRN, BS2T, SpectralFormer, MorphFormer, FouriorFormer, GSC-ViT, GAHT, and proposed models on the four datasets are listed in Tables 10, 11, 12 and 13. The BS2T model shows the slowest training speed. The training time of the proposed model is higher than that of several comparison models on the four datasets. In terms of parameter numbers, the proposed model has a relatively large number of parameters due to the use of attention layers, but its computational cost (MACs) is much lower than that of the other models.

Conclusion

In this paper, we proposed the new GroupFormer model, which is intended to improve classification outcomes by efficiently extracting spatial features together with deep spectral properties. CNNs, group attention blocks, and the AFE module are the three main integrated parts of the model, and every element is essential to the model's capacity to analyze and improve spectral-spatial data. The first part is a multiscale 3D CNN module used for low-level feature extraction in the spectral-spatial domain. This module is especially helpful for applications that call for a thorough analysis of spectral data, since it is skilled at collecting fine-grained features across several spectral bands. The CNN module guarantees that both spatial and spectral dimensions are considered by processing data in three dimensions, which results in more thorough feature extraction. The model uses a group attention module to extract high-level semantic features. This element improves the model's capacity to concentrate on the most relevant portions of the incoming data, successfully identifying significant correlations and patterns that are essential for precise categorization. The group attention mechanism's ability to discriminate between important features and noise yields more robust performance. An AFE module is included in GroupFormer to further enrich the features retrieved by the high-level attention module and the low-level CNN. As the last refinement stage, this module improves the quality of the extracted features and ensures they contribute to the classification objective. The final classification uses the cohesive representation created by the AFE module, which unifies the various characteristics. The experimental findings validate the importance and efficiency of GroupFormer in extracting deep spectral-spatial characteristics. The model demonstrates applicability in a range of scenarios by delivering significant increases in classification accuracy. However, the research also highlights areas for future exploration. One of the primary directions is the development of lightweight networks that can perform multiscale spectral-spatial feature extraction. Such advancements would improve accuracy and make the model more practical for real-world applications with limited computational resources. By focusing on these aspects, future research can build upon the strong foundation of GroupFormer, pushing the boundaries of spectral-spatial feature extraction and classification.

Acknowledgements

This work was supported by the National Key R&D Program of China Under Grant Number: 2022YFE0136800; Marine Defense Technology Innovation Fund of China Shipbuilding Research and Design Center Under Grant Number: JJ-2022-719-03; Open Topic of Microsystem Technology National Defense Science and Technology Key Laboratory Under Grant Number: 6142804230106; Open Fund of Marine Environmental Detection Technology and Application Key Laboratory, Ministry of Natural Resources Under Grant Number: MESTA-2022-A006; The new round of “ Double First Class” discipline collaborative innovation achievement project in Heilongjiang Province in 2023 Under Grant Number: LJGXCG2023-066; Collaborative Detection Technology Based on Multi-Base Passive Sonar Array and Xi’an Science and Technology Plan Project Under Grant Number: 2022FWQY16; Industrialisation and demonstration of intelligent cross-water and air medium communication machine Under Grant number: CXRC20231113756; Nanhai High-level Science and Technology Innovation Guidance Special Program of Nanhai Institute of Harbin Engineering University: Cross-domain information transmission and networking technology in deep ocean based on low-orbit satellites.

Author contributions

Each author in this article contributed their distinct expertise and responsibilities in a collaborative way. R.K. and T.A. were primarily responsible for formulating the study’s concepts and determining the research direction. R.K. provided methodological frameworks so that the study’s approach was rigorous and coherent. W.C. assumed responsibility for the software implementation, which is critical for data analysis and interpretation. T.A., Z.H., and W.C. collaborated to validate the findings, ensuring the strength and reliability of the conclusions reached. J.K. supervised formal analytic techniques, while R.K. and Z.K. carried out essential studies for data collection and interpretation. X.M. supervised the use of resources, while S.K. thoroughly examined the collected data. S.K. handled further modifications and editing after R.K. initially wrote the manuscript. Z.K. used data visualization to improve the presentation of significant findings. X.M. and Z.H. provided supervision throughout the study process, ensuring complete conformity to scholarly guidelines. X.M. and J.K. collaborated on project administration tasks, with X.M. leading the funding acquisition task. All authors provided input during the manuscript drafting stage.

Data availability

The Hyperspectral Image datasets (Indian Pines, Pavia University, and Salinas) used in the current study are available in a GitHub repository, https://github.com/gokriznastic/HybridSN/tree/master/data.

Declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Lupu, D., Garrett, J. L., Johansen, T. A., Orlandic, M. & Necoara, I. Quick unsupervised hyperspectral dimensionality reduction for earth observation: a comparison. arXiv preprint arXiv:2402.16566 (2024).
2. Wang, D. et al. Sliding dual-window-inspired reconstruction network for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. (2024).
3. Kumar, V., Singh, R. S. & Dua, Y. Morphologically dilated convolutional neural network for hyperspectral image classification. Sig. Process. Image Commun. 101, 116549 (2022).
4. Murphy, R. J., Whelan, B., Chlingaryan, A. & Sukkarieh, S. Quantifying leaf-scale variations in water absorption in lettuce from hyperspectral imagery: a laboratory study with implications for measuring leaf water content in the context of precision agriculture. Precision Agric. 20, 767–787 (2019).
5. Gu, Y., Hu, Z., Zhao, Y., Liao, J. & Zhang, W. MFGTN: a multi-modal fast gated transformer for identifying single trawl marine fishing vessel. Ocean Eng. 303, 117711 (2024).
6. Villa, A., Benediktsson, J. A., Chanussot, J. & Jutten, C. Hyperspectral image classification with independent component discriminant analysis. IEEE Trans. Geosci. Remote Sens. 49, 4865–4876 (2011).
7. Fauvel, M., Benediktsson, J. A., Chanussot, J. & Sveinsson, J. R. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE Trans. Geosci. Remote Sens. 46, 3804–3814 (2008).
8. Wang, D. et al. Blind-block reconstruction network with a guard window for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 61, 1–16 (2023).
9. Dalla Mura, M., Villa, A., Benediktsson, J. A., Chanussot, J. & Bruzzone, L. Classification of hyperspectral images by using extended morphological attribute profiles and independent component analysis. IEEE Geosci. Remote Sens. Lett. 8, 542–546 (2010).
10. Zhang, Y. et al. Topological structure and semantic information transfer network for cross-scene hyperspectral image classification. IEEE Trans. Neural Netw. Learn. Syst. (2021).
11. Zhao, W. & Du, S. Spectral–spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach. IEEE Trans. Geosci. Remote Sens. 54, 4544–4554 (2016).
12. Szegedy, C. et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9 (2015).
13. Anand, R., Samiappan, S. & Kavitha, K. Flower pollination optimization based hyperspectral band selection using modified wavelet Gabor deep filter neural network. Infrared Phys. Technol. 138, 105215 (2024).
14. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).
15. Zheng, W., Lu, S., Yang, Y., Yin, Z. & Yin, L. Lightweight transformer image feature extraction network. PeerJ Comput. Sci. 10, e1755 (2024).
16. Bordes, A., Glorot, X., Weston, J. & Bengio, Y. Joint learning of words and meaning representations for open-text semantic parsing. In Proceedings of Artificial Intelligence and Statistics, 127–135 (2012).
17. Wang, D., Gao, L., Qu, Y., Sun, X. & Liao, W. Frequency-to-spectrum mapping GAN for semisupervised hyperspectral anomaly detection. CAAI Trans. Intell. Technol. 8, 1258–1273 (2023).
18. Qiao, M. et al. HyperSOR: context-aware graph hypernetwork for salient object ranking. IEEE Trans. Pattern Anal. Mach. Intell. 46, 5873–5889 (2024).
19. Dosovitskiy, A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
20. Xu, H., Li, Q. & Chen, J. Highlight removal from a single grayscale image using attentive GAN. Appl. Artif. Intell. 36, 1988441 (2022).
21. Mei, X. et al. Spectral-spatial attention networks for hyperspectral image classification. Remote Sens. 11, 963 (2019).
22. Yin, L. et al. Convolution-transformer for image feature extraction. CMES Comput. Model. Eng. Sci. 141 (2024).
23. Qing, Y. & Liu, W. Hyperspectral image classification based on multi-scale residual network with attention mechanism. Remote Sens. 13, 335 (2021).
24. Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
25. He, X., Chen, Y. & Lin, Z. Spatial-spectral transformer for hyperspectral image classification. Remote Sens. 13, 498 (2021).
26. Hong, D. et al. SpectralFormer: rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 60, 1–15 (2021).
27. Shi, C., Wu, H. & Wang, L. A positive feedback spatial-spectral correlation network based on spectral slice for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–17 (2023).
28. Sun, L., Zhao, G., Zheng, Y. & Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2022).
29. Zhang, J., Meng, Z., Zhao, F., Liu, H. & Chang, Z. Convolution transformer mixer for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022).
30. Shi, C., Yue, S., Wu, H., Zhu, F. & Wang, L. A multi-hop graph rectify attention and spectral overlap grouping convolutional fusion network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. (2024).
31. Shi, C., Sun, J., Wang, T. & Wang, L. Hyperspectral image classification based on a 3D octave convolution and 3D multiscale spatial attention network. Remote Sens. 15, 257 (2023).
32. Cui, X. et al. Multiscale spatial-spectral convolutional network with image-based framework for hyperspectral imagery classification. Remote Sens. 11, 2220 (2019).
33. Li, B. et al. Multi-granularity vision transformer via semantic token for hyperspectral image classification. Int. J. Remote Sens. 43, 6538–6560 (2022).
34. Arshad, T. & Zhang, J. Hierarchical attention transformer for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. (2024).
35. Shi, C., Wu, H. & Wang, L. CEGAT: a CNN and enhanced-GAT based on key sample selection strategy for hyperspectral image classification. Neural Netw. 168, 105–122 (2023).
36. Pan, H., Yan, H., Ge, H., Liu, M. & Shi, C. Transformer-enhanced two-stream complementary convolutional neural network for hyperspectral image classification. J. Franklin Inst. 361, 106973 (2024).
37. Ouyang, E. et al. When multigranularity meets spatial–spectral attention: a hybrid transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–18 (2023).
38. Shi, C., Wu, H. & Wang, L. A feature complementary attention network based on adaptive knowledge filtering for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. (2023).
39. Roy, S. K. et al. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–15 (2023).
40. Wu, H., Shi, C., Wang, L. & Jin, Z. A cross-channel dense connection and multi-scale dual aggregated attention network for hyperspectral image classification. Remote Sens. 15, 2367 (2023).
41. Roy, S. K., Krishna, G., Dubey, S. R. & Chaudhuri, B. B. HybridSN: exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 17, 277–281 (2019).
42. Zhong, Z., Li, J., Luo, Z. & Chapman, M. Spectral–spatial residual network for hyperspectral image classification: a 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 56, 847–858 (2017).
43. Song, R., Feng, Y., Cheng, W., Mu, Z. & Wang, X. BS2T: bottleneck spatial–spectral transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 60, 1–17 (2022).
44. Shi, H., Zhang, Y., Cao, G. & Yang, D. MHCFormer: multiscale hierarchical conv-aided fourierformer for hyperspectral image classification. IEEE Trans. Instrum. Meas. 73, 1–15 (2023).
45. Zhao, Z., Xu, X., Li, S. & Plaza, A. Hyperspectral image classification using groupwise separable convolutional vision transformer network. IEEE Trans. Geosci. Remote Sens. (2024).
46. Mei, S., Song, C., Ma, M. & Xu, F. Hyperspectral image classification using group-aware hierarchical transformer. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2022).
