Scientific Reports. 2024 Oct 12;14:23879. doi: 10.1038/s41598-024-74835-1

GroupFormer for hyperspectral image classification through group attention

Rahim Khan 1, Tahir Arshad 2, Xuefei Ma 1, Haifeng Zhu 1, Chen Wang 1, Javed Khan 3, Zahid Ullah Khan 1, Sajid Ullah Khan 4
PMCID: PMC11470927  PMID: 39396096

Abstract

Hyperspectral image (HSI) data contains a wide range of valuable spectral information for numerous tasks, but it also poses challenges such as scarce training samples and redundant information. Researchers have introduced various works to address these challenges. Convolutional neural networks (CNNs) have achieved significant success in HSI classification; however, CNNs primarily extract low-level features and have a limited ability to model long-range dependencies because of their confined filter size. In contrast, vision transformers perform well in HSI classification because their attention mechanisms learn long-range dependencies, but they require sufficient labeled training data. To address this challenge, we propose a spectral-spatial feature extractor group attention transformer that consists of a multiscale feature extractor for low-level (shallow) features and a group attention mechanism for high-level semantic features. The proposed model is evaluated on four publicly available HSI datasets: Indian Pines, Pavia University, Salinas, and KSC. It achieves the best classification results in terms of overall accuracy (OA), average accuracy (AA), and the Kappa coefficient while using only 5%, 1%, 1%, and 10% of the samples from these four datasets for training.

Keywords: Attention Module, Convolutional neural network, Hyperspectral image classification, Vision Transformer

Subject terms: Environmental sciences, Planetary science, Engineering

Introduction

The classification of hyperspectral image (HSI) data is an essential component of earth observation1, since HSI data is made up of many narrow bands that each carry a significant amount of information2. It is used extensively in the environmental sciences, mineralogy, the military3, and agriculture4. HSI data poses three major challenges because it captures spectral information from multiple adjacent spectral bands of surface objects5. First, HSI data contains hundreds of spectral bands, each carrying its own information6, and the overlap among these bands significantly increases computational complexity5. Second, classification is complicated by the frequent mixing of pixels in HSIs, where a single pixel often covers more than one category7,8. Finally, the number of labeled samples is limited because manually labeling HSI samples is costly and time-consuming9. Over the last decade, multiple solutions have been developed to address these issues, with early approaches employing typical machine learning techniques such as logistic regression, support vector machines, k-nearest neighbors, and Bayesian estimation.

Similarly, at an early stage, traditional methods such as principal component analysis (PCA)6 and linear discriminant analysis (LDA)7,8 were used as dimensionality reduction approaches to extract spectral information. However, both algorithms overlook the spatial correlation between pixels, which is essential for effective spatial feature extraction9,10. To overcome this, researchers created mathematical operators such as extended morphological attribute profiles; however, these techniques fail to adequately extract spectral-spatial features, and misclassification remains a common outcome of these conventional models3,11.

In recent years, deep learning models have proven adept at segmentation, target identification, and image classification. These models use a hierarchical architecture to extract abstract features from raw data and enable a non-linear mapping from feature space to label space12,13. This improves identification and classification accuracy while drastically lowering the need for material and human resources. Convolutional neural networks (CNNs) have shown promise in effectively gathering and using spatial-spectral data14,15. Researchers have used numerous CNN models, including one-dimensional, two-dimensional, and more sophisticated designs such as three-dimensional CNNs that combine spectral and spatial data16,17. Despite these improvements, problems with computational efficiency and model complexity remain in both CNNs and transformer-based models18,19. Attention-based models such as spectral-spatial transformers have been introduced to capture global dependencies in HSI data20,21. Yet these models are mostly based on multi-head self-attention modules, which require a large amount of training data and can be computationally expensive22,23.

The motivation behind this study is to address several significant issues in HSI classification. Effective semantic feature extraction is a basic difficulty in HSI classification because of the data's high dimensionality and complicated spectral properties. Due to their small receptive fields, traditional CNNs often struggle with these tasks because they cannot capture higher-level semantic information and long-range relationships. This drawback is exacerbated by the convolution operation's innate tendency to favor local patterns and to ignore the wider spatial-spectral correlations that are essential for precise classification. In this study, we present GroupFormer, a multi-head network architecture that incorporates an Auxiliary Feature Enhancement (AFE) module to improve feature representation and maximize parameter efficiency, overcoming these restrictions. By using a multi-head attention mechanism, GroupFormer captures a variety of subtle feature representations concurrently, allowing the model to concentrate on multiple facets of the input data and develop a more thorough understanding of the underlying patterns. The AFE module is intended to maximize the efficiency of the model by lowering the number of parameters without sacrificing performance; it improves the feature extraction process while keeping the model size small by using auxiliary features and parameter-sharing methods. Reduced training and testing durations across multiple datasets indicate the considerable gains in computational efficiency of our proposed strategy, enabled by the simplified multi-head network design and the efficient AFE integration. Additionally, GroupFormer consistently achieves superior accuracy and F1 scores at lower computational cost compared to classic CNNs and simple transformer encoders on a variety of benchmark datasets, namely Indian Pines, Pavia University, Salinas, and Kennedy Space Center (KSC). We also adopt a transformer-based strategy that uses the self-attention mechanism to overcome these problems and improve the acquisition of both local and global aspects of HSI data. Specifically, we present the AFE module, which is intended to integrate spectral and spatial information better than traditional CNNs: it uses larger receptive fields and adaptive methods to aggregate data across various scales, improving the model's capacity to extract complex semantic information. The suggested approaches are not without difficulties; transformers, although powerful, can be computationally demanding and need a large quantity of training data to operate at their best. This research therefore optimizes the model architecture and uses careful training methodologies to increase efficiency and performance.

Moreover, we performed a comprehensive ablation study to compare GroupFormer's performance with a basic transformer encoder and verify its efficacy. The results underscore the enhanced effectiveness and efficiency of our approach, highlighting its practical importance. GroupFormer integrates a multi-head network architecture with the AFE module to provide a fresh approach to HSI classification. By addressing significant issues of computational efficiency and model complexity, the proposed approach provides a reliable and effective technique for HSI analysis.

The main contributions of this paper are as follows:

  • The proposed framework is designed to classify hyperspectral images. The model integrates a CNN and an attention mechanism to effectively capture both the spectral and spatial features of hyperspectral images, improving classification results.

  • Multiscale convolutional neural networks are utilized to augment the spectral and spatial features of HSI data. This captures multiscale information, which enhances the ability to recognize complex patterns within the hyperspectral data.

  • We propose a group attention mechanism that enables the model to focus on the most informative pixels to enhance the accuracy and efficacy of the classification process.

The remainder of this paper is structured as follows: the Related Work section reviews existing approaches to HSI classification. The Proposed Methodology section describes the suggested model architecture, including the group attention module and the AFE module, and how they improve feature extraction and classification. The Dataset and Experimental Evaluation section covers the properties of the HSI datasets, the experimental setup, evaluation metrics, and the baseline models used for comparison. The Result and Discussion section compares the performance of our model with other state-of-the-art models and provides in-depth insights from ablation studies to emphasize the individual contributions of each component. The Conclusion summarizes the main findings, highlights the value of our proposed model, and suggests possible directions for future research.

Related work

HSI classification has developed significantly in the last few years, mainly due to the availability of high-quality datasets and the development of complex algorithms. In addition to emphasizing the advances brought about by different approaches, this section covers the major contributions of HSI classification. A vision transformer model24 has recently demonstrated superior performance in the field of computer vision. The Transformer model employs a self-attention technique to capture global dependencies. HSI classification commonly employs attention mechanisms. In21,23, researchers developed a spectral-spatial attention network to extract distinctive features from an HSI cube. Researchers have extensively utilized the vision transformer concept for HSI classification. He, X. et al.25 introduced a method known as spectral-spatial transformers (SST), which utilizes a VGG-Net model to extract spectral-spatial features and establish a connection with a density transformer. In26, the researchers introduced a model known as the Spectral Former. This model can acquire and utilize spectral information group-wise while incorporating a cross-layer transformer encoder. A positive feedback spatial-spectral correlation network based on spectral interclass slicing (PFSSC-SICS) technique for HSI classification is presented in27. It uses a spectral interclass slicing technique to enhance spectral signatures, addressing issues such as limited label samples and spectral similarity across classes. This is combined with a spatial-spectral correlation module and a positive feedback mechanism to improve feature extraction. The experimental findings show that PFSSC-SICS is a reliable solution for HSI Classification, with significant performance gains over current approaches.

Sun, L. et al.28 introduced a spectral-spatial feature tokenization transformer for HSI classification. This method utilizes both 3D and 2D CNN models to extract multiscale spatial-spectral features. Additionally, it incorporates a Gaussian-weighted tokenizer. As shown in29, Zhang, J. et al. created a convolutional network called CT Mixer using a transformer model. This network includes a unique local-global multi-head self-attention mechanism. The study in30 presents a convolutional fusion network solution for HSI classification that uses a multi-hop graph to rectify attention and spectral overlap grouping. To overcome issues with small sample sizes, a multi-hop graph rectifies attention for graph convolution, and a spectral inter-group feature extraction module is incorporated for spectral feature extraction. The effective fusing of CNN and graph convolutional network features is made possible by the Gaussian weighted fusion module, which exhibits excellent classification performance on a variety of datasets. In31,32, the authors proposed backbone networks for extracting multiscale hyperspectral features due to the complexity and computational cost of the self-attention modules in vision transformers (ViT). Li, B. et al.33 proposed a multi-granularity vision transformer via a semantic tokens transformer to learn multi-granularity features and improve accuracy; the authors used the LFE module to extract local features. In34, the authors proposed a hierarchical attention transformer for HSI classification to extract complex features using a hierarchical attention mechanism. For HSI classification, the authors of35 present a double-branch CNN and enhanced graph attention network (CEGAT) fusion network. With modules such as a spatial-spectral correlation attention module for spatial-spectral feature extraction and linear discrimination of spectral inter-class slices for spectral redundancy reduction, it handles limited labeled data. By combining graph attention network and CNN branches with a key sample selection approach, CEGAT outperforms other algorithms in HSI classification tasks by improving classification accuracy. The authors of36 present an approach for HSI classification that uses a transformer-enhanced two-stream complementary CNN. The technique consists of a spectral feature extraction stream that incorporates a hybrid convolution block and an attention mechanism using a transformer encoder, as well as a spatial feature extraction stream. The complementary spectral-spatial weight feature module effectively utilizes features from both streams, resulting in higher classification performance.

In37, the authors proposed a hybrid former network in which a CNN extracts shallow features and a spectral-spatial attention (SSA)-based transformer encoder extracts semantic features. Nevertheless, transformer-based approaches do have certain restrictions: the above models rely on a multi-head self-attention module to handle long-range dependencies, and they also depend on sufficient training samples. In comparison, it is possible to combine a convolutional model with a transformer for low- and high-level feature extraction and then apply a classifier. Using an attention-based approach to multiscale spectral and spatial feature extraction significantly improves network performance, and a multiscale 3D-CNN with a group attention layer can extract features by combining spatial and spectral information. From a spatial-information perspective, an uneven distribution of samples further aggravates the small-sample problem. For HSI classification, the authors of38 suggest a feature-complementary attention network based on adaptive knowledge filtering. They develop a dual pyramid spectrum-spatial attention module and a nonlocal band regrouping approach to collect spectral properties and remove duplicated information; on three difficult datasets, the suggested approach outperforms state-of-the-art methods. The authors of39 present a morphological transformer technique called MorphFormer, which combines a spatial morphological network with a trainable spectral network. This technique utilizes spatial and spectral morphological convolutions with an attention mechanism to combine shape and structural information. The experimental results demonstrate that this technique performs better than existing methods.

In40, a dual-branch network for HSI classification is introduced that integrates multi-scale dual aggregated attention with cross-channel dense connections (CDC-MDAA). These models show significant improvement in HSI classification, but previous work usually focuses on designing deep and complicated networks. CNN models remain indispensable for capturing spectral-spatial features through convolution kernels, while transformer models learn long-range dependencies through self-attention mechanisms; however, transformers suffer from high computational cost, slow inference, and high memory usage. Integrating the strengths of both, a CNN can extract low-level features while a transformer learns high-level features over the spectral-spatial dimensions with multiple attributes and scales. We therefore propose a GroupFormer model for HSI classification that uses a group attention mechanism together with the AFE module. The proposed model has fewer FLOPs than other state-of-the-art models and achieves satisfactory classification results while using small training samples.

Proposed methodology

Figure 1 depicts the proposed model. This section explains the model's structure and its functioning. In this study, we aim to address the issue of limited training samples through the network structure. First, PCA is applied to the raw HSI data for dimensionality reduction. Let the HSI data be represented by $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ indicate the height and width of the data and $C$ represents the spectral dimension. After applying PCA along the spectral dimension, the hyperspectral data can be represented as $X_{pca} \in \mathbb{R}^{H \times W \times B}$, where $B$ is the number of retained components. Suppose that $X_{pca}$ contains $N$ labeled pixels $\{x_i\}_{i=1}^{N}$ with corresponding one-hot labels $\{y_i\}_{i=1}^{N}$, where $K$ represents the number of classes. A spectral-spatial vector is defined as the $s \times s$ spatial neighborhood around each center pixel. The cross-entropy loss function estimates the difference between the predicted class label and the ground-truth label.
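
A minimal sketch of this preprocessing step is shown below, assuming scikit-learn for PCA; the helper names, the reflect padding, and the treatment of label 0 as background are illustrative choices rather than the authors' exact implementation.

```python
# Minimal preprocessing sketch: PCA along the spectral dimension, then
# extraction of s x s spectral-spatial patches centred on each labeled pixel.
# Helper names, reflect padding, and treating label 0 as background are
# illustrative choices, not the authors' exact implementation.
import numpy as np
from sklearn.decomposition import PCA

def apply_pca(cube, num_components=30):
    """Reduce an (H, W, C) cube to (H, W, B) with B = num_components."""
    h, w, c = cube.shape
    reduced = PCA(n_components=num_components, whiten=True).fit_transform(
        cube.reshape(-1, c))
    return reduced.reshape(h, w, num_components)

def extract_patches(cube, labels, patch_size=13):
    """Return one patch per labeled pixel together with its (zero-based) label."""
    margin = patch_size // 2
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)), mode="reflect")
    patches, targets = [], []
    for r, c in zip(*np.nonzero(labels)):
        patches.append(padded[r:r + patch_size, c:c + patch_size, :])
        targets.append(labels[r, c] - 1)
    return np.stack(patches), np.array(targets)
```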

Figure 1.


Illustration of the proposed model for HSI classification. [The satellite imagery in Fig. 1 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Fig. 1 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM.]

Multiscale spectral spatial feature extractor

We use different filters and kernel sizes to extract multiscale spectral-spatial features from the PCA-reduced data. The convolution operation in a 3D convolution model applies a 3D convolutional kernel that acts on the spatial dimensions and captures the correlation between multiple spectral bands. Given an input cube and $N$ 3D convolution kernels of a given size, the convolution produces output feature maps that form a 4D tensor (two spatial dimensions, one spectral dimension, and the number of kernels). The number of bands along the spectral dimension is fixed, so the final result is a 4D tensor. The 3D kernel better utilizes the spectral features; hence, the 3D CNN approach is well suited for HSI classification, which involves abundant spectral information. During the computation of the 3D convolution, the activation value at position $(x, y, z)$ on the $j$th feature map in the $i$th layer is obtained. The mathematical formula for the 3D convolution is given in Eq. (1):

$$v_{i,j}^{x,y,z} = \phi\Bigl(b_{i,j} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{i,j,m}^{p,q,r}\, v_{i-1,m}^{x+p,\,y+q,\,z+r}\Bigr) \qquad (1)$$

where $v_{i,j}^{x,y,z}$ is the output value at position $(x, y, z)$, $\phi(\cdot)$ is the activation function, $b_{i,j}$ is the bias, $w_{i,j,m}^{p,q,r}$ are the kernel weights, $m$ indexes the feature maps of the previous layer, and $P_i$, $Q_i$, and $R_i$ are the height, width, and spectral depth of the kernel. The proposed multiscale 3D-CNN model extracts spectrally and spatially discriminative features using three-dimensional convolutional layers with various kernel sizes. For the multiscale spatial branch, smaller kernels are effective at capturing fine-grained, local features, while larger kernels with 64 filters capture more contextual information. By combining these kernel sizes, the model captures a comprehensive set of spatial features spanning both local details and broader patterns. The varying kernel sizes increase the effective receptive field of the convolutional layers, allowing the network to integrate information from larger spatial regions, which is crucial for accurately distinguishing between classes in the HSI that may have spatial dependencies at different scales. We then concatenate the enhanced spatial features. We employ three 3D CNN blocks for multiscale spectral feature augmentation in the same manner, each with an output filter size of 64; the chosen kernels allow the model to capture spectral dependencies over varying ranges of bands, from immediate neighbors to slightly broader contexts. Using 1 × 1 convolutions in the spectral domain helps reduce the spectral data's dimensionality while preserving essential information. Each 3D CNN block consists of a 3D convolutional layer, regardless of the branch, followed by a batch normalization layer with Gaussian error linear units (GELUs) as the activation function. Next, we use a 2D convolutional layer with 64 filters to fuse the spectral-spatial features from the two branches so that information at different scales is combined. Figure 2 illustrates the multiscale feature extractor.
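
A hedged PyTorch sketch of this extractor is given below. Because the exact kernel sizes are not legible in this version of the text, the values used here (e.g., 3 × 3 spatial kernels versus 5 × 5, and spectral kernels spanning 3 or 7 bands) are illustrative assumptions; only the overall structure (parallel 3D branches with batch normalization and GELU, concatenation, and a 64-filter 2D fusion layer) follows the description above.

```python
# Sketch of the multiscale spectral-spatial extractor: parallel 3D convolution
# branches with different (assumed) kernel sizes, concatenated and fused by a
# 64-filter 2D convolution, each block using BatchNorm + GELU as in the text.
import torch
import torch.nn as nn

class MultiscaleExtractor(nn.Module):
    def __init__(self, in_bands=30, filters=64):
        super().__init__()
        def block(kernel, pad):
            return nn.Sequential(
                nn.Conv3d(1, filters, kernel_size=kernel, padding=pad),
                nn.BatchNorm3d(filters),
                nn.GELU(),
            )
        # spatial branch: smaller vs. larger spatial kernels (illustrative sizes)
        self.spatial_small = block((3, 3, 3), (1, 1, 1))
        self.spatial_large = block((3, 5, 5), (1, 2, 2))
        # spectral branch: kernels spanning different numbers of bands
        self.spectral_short = block((3, 1, 1), (1, 0, 0))
        self.spectral_long = block((7, 1, 1), (3, 0, 0))
        # 2D fusion layer applied after folding the spectral axis into channels
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * filters * in_bands, filters, kernel_size=3, padding=1),
            nn.BatchNorm2d(filters),
            nn.GELU(),
        )

    def forward(self, x):                       # x: (batch, 1, bands, s, s)
        y = torch.cat([self.spatial_small(x), self.spatial_large(x),
                       self.spectral_short(x), self.spectral_long(x)], dim=1)
        n, c, b, h, w = y.shape
        y = y.reshape(n, c * b, h, w)           # merge channel and band axes
        return self.fuse(y)                     # (batch, filters, s, s)
```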

Figure 2.


Illustration of the multiscale spectral spatial feature extraction.

Group attention module

Figure 1 shows an illustration of the group attention module. This structure consists of three stages: feature grouping, group propagation, and feature ungrouping. The first step involves classifying the features of the image. Next, in the second phase, global knowledge spreads amongst the grouped features. Ultimately, the final stage reintegrates this comprehensive information into the image features.

Feature grouping

After the feature extractor module, the input to the group attention layer consists of the image features $X \in \mathbb{R}^{N \times C}$, where $N$ is the total number of image features and $C$ is the dimensionality of each feature vector. A matrix $G \in \mathbb{R}^{M \times C}$ stores $M$ learnable group tokens. A simple multi-head attention operation initiates the grouping process and produces the grouped features, denoted $\tilde{G}$:

$$A_h = \mathrm{Softmax}\bigl((G W_h^{Q})(X W_h^{K})^{\top} / \sqrt{d}\bigr) \qquad (2)$$
$$\tilde{G} = \mathrm{Concat}\bigl(A_1 X W_1^{V}, \ldots, A_H X W_H^{V}\bigr) \qquad (3)$$

In the attention operation, we denote the channel number per head as $d$, the head index as $h$ (with $H$ heads in total), and the projection matrices for the query, key, and values as $W_h^{Q}$, $W_h^{K}$, and $W_h^{V}$. After the concatenation, we eliminate the feature projection layer by setting it to the identity matrix. The grouped features can therefore be described as the weighted summation of the image features at each head, where the weights are computed by the attention operation.
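
The following PyTorch sketch illustrates the feature-grouping step described by Eqs. (2) and (3); the dimensions, head count, and number of group tokens are illustrative assumptions, not the exact published configuration.

```python
# Sketch of the feature-grouping step (Eqs. 2-3): M learnable group tokens
# query the N image features with multi-head attention; per the text, the
# output projection after head concatenation is the identity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGrouping(nn.Module):
    def __init__(self, dim=64, num_groups=8, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim))  # G
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                       # x: (batch, N, C) image features
        b, n, c = x.shape
        d = c // self.num_heads
        g = self.group_tokens.unsqueeze(0).expand(b, -1, -1)            # (batch, M, C)
        q = self.q_proj(g).view(b, -1, self.num_heads, d).transpose(1, 2)
        k = self.k_proj(x).view(b, n, self.num_heads, d).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.num_heads, d).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)    # Eq. (2)
        grouped = (attn @ v).transpose(1, 2).reshape(b, -1, c)          # Eq. (3)
        return grouped                          # (batch, M, C) grouped features
```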

Group propagation

Once the grouped features have been obtained, global information can be updated and propagated among them. The multilayer perceptron (MLP) Mixer requires a fixed input size corresponding to the selected number of groups. Our MLP-Mixer comprises a set of sequential MLPs. $\tilde{G}$ contains the grouped features from the feature grouping step and is updated to $\hat{G}$ by the MLP-Mixer layer:

$$\tilde{G}' = \tilde{G} + \mathrm{MLP}_1\bigl(\mathrm{LN}(\tilde{G})^{\top}\bigr)^{\top} \qquad (4)$$
$$\hat{G} = \tilde{G}' + \mathrm{MLP}_2\bigl(\mathrm{LN}(\tilde{G}')\bigr) \qquad (5)$$

The first MLP mixes information across groups (token mixing), whereas the second MLP mixes information across channels; $\mathrm{LN}(\cdot)$ denotes layer normalization.
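
A compact sketch of this propagation step is given below, following the standard MLP-Mixer pattern (group/token mixing, then channel mixing); the layer normalization placement and hidden width are assumptions.

```python
# Sketch of group propagation (Eqs. 4-5): an MLP-Mixer block over the grouped
# features, mixing across groups first and then across channels.
import torch.nn as nn

class GroupPropagation(nn.Module):
    def __init__(self, num_groups=8, dim=64, hidden=128):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.group_mlp = nn.Sequential(nn.Linear(num_groups, hidden), nn.GELU(),
                                       nn.Linear(hidden, num_groups))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                         nn.Linear(hidden, dim))

    def forward(self, g):                       # g: (batch, M, C) grouped features
        # inter-group (token) mixing, Eq. (4)
        g = g + self.group_mlp(self.norm1(g).transpose(1, 2)).transpose(1, 2)
        # channel mixing, Eq. (5)
        g = g + self.channel_mlp(self.norm2(g))
        return g
```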

Feature ungrouping

Once the grouped features have been updated, the global information can be returned to the image features through a feature ungrouping procedure. The features are ungrouped with a transformer decoder layer in which the image features query the grouped features:

$$Z = \mathrm{Concat}\bigl(X,\ \mathrm{Softmax}\bigl((X W^{Q})(\hat{G} W^{K})^{\top} / \sqrt{d}\bigr)\, \hat{G} W^{V}\bigr) \qquad (6)$$
$$X_{\mathrm{out}} = Z\, W^{P} \qquad (7)$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ represent the projection matrices of the attention, while $W^{P}$ represents the linear matrix used to project the concatenated features $Z$ back to the same dimension as the image features $X$. The original transformer decoder layer is modified by replacing the initial residual connection with a concatenation operation; the feature projection layer is then moved after the concatenation to transform the features back to their initial dimensions.
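
The sketch below illustrates this ungrouping step under the same assumptions as the grouping sketch; the concatenation-plus-projection follows the description above, while the head count and dimensions are illustrative.

```python
# Sketch of feature ungrouping (Eqs. 6-7): image features query the updated
# group tokens, the attention output is concatenated with the image features
# (replacing the residual connection) and projected back to C channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureUngrouping(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(2 * dim, dim)   # maps concat(Z) back to C dims

    def forward(self, x, g):                      # x: (batch, N, C), g: (batch, M, C)
        b, n, c = x.shape
        d = c // self.num_heads
        q = self.q_proj(x).view(b, n, self.num_heads, d).transpose(1, 2)
        k = self.k_proj(g).view(b, -1, self.num_heads, d).transpose(1, 2)
        v = self.v_proj(g).view(b, -1, self.num_heads, d).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        z = torch.cat([x, out], dim=-1)           # concatenation, Eq. (6)
        return self.out_proj(z)                   # projection, Eq. (7)
```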

Auxiliary feature enhancement module

We propose an auxiliary feature enhancement (AFE) module that further enhances the spectral-spatial features while decreasing the model's parameters. This module consists of 2D convolutional layers, each followed by a batch normalization layer and an activation function, with 32 and 64 filters, respectively. Even when many discriminative features have been obtained, the effectiveness of direct classification is limited if the input size remains high after the previous two modules; the AFE module addresses this. The AFE module is shown in Fig. 3.

Figure 3.


Illustration of auxiliary feature enhancement module.

Moreover, after PCA, the reduced cube is fed into the multiscale feature extractor module consisting of 3D and 2D layers. After this low-level feature extraction, the group attention layer, which operates on the interleaved group tokens, extracts high-level semantic information. Then, the auxiliary feature enhancement module refines these features.

Moreover, the AFE module is intended to increase our model’s efficiency by cutting down on parameters without sacrificing performance. To extract more pertinent information and enable the model to operate with fewer parameters, the AFE module makes use of auxiliary features that enhance the model’s core characteristics. Auxiliary feature extraction, feature fusion, and parameter sharing are the three primary parts of the AFE module architecture. The auxiliary feature extraction component obtains additional contextual information by enhancing the core features. These auxiliary characteristics are integrated with the major features through the feature fusion process in a way that maximizes information flow without appreciably raising the complexity of the model. The number of unique parameters that the model requires is further decreased by parameter sharing across the major and auxiliary features. We performed a comparison examination of models with and without the AFE module to support our assertions.
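
A hedged PyTorch sketch of such a refinement stack is given below; the 3 × 3 kernel size, the pooling-based classification head, and the channel counts beyond the stated 32 and 64 filters are assumptions rather than the exact published configuration.

```python
# Sketch of an AFE-style refinement stack: 2D convolutions with 32 and 64
# filters (as stated), each followed by BatchNorm and GELU; the 3 x 3 kernel
# and the pooling-based classification head are assumptions.
import torch.nn as nn

class AFEModule(nn.Module):
    def __init__(self, in_channels=64, num_classes=16):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.GELU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.GELU(),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, num_classes))

    def forward(self, x):                         # x: (batch, C, s, s) refined features
        return self.head(self.refine(x))
```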

Ablation research

In the ablation study, the performance of a basic transformer encoder and the proposed multi-head network was assessed on the given datasets. Table 1 below provides an overview of the findings:

Table 1.

Ablation comparison of a basic transformer encoder and the proposed multi-head network on the four datasets.

Datasets Model Accuracy (%) F1 score Training time (h) Testing time (s)
Indian Pines 5% Simple transformer encoder 86.9 84.1 4.6 54
Indian Pines 5% Proposed multi-head network 89.2 87.3 4.3 44
Pavia University 1% Simple transformer encoder 88.2 85.3 6.2 69
Pavia University 1% Proposed multi-head network 92.3 88.4 4.9 64
Salinas 1% Simple transformer encoder 87.5 84.5 6.6 64
Salinas 1% Proposed multi-head network 90.7 87.1 5.9 54
KSC 10% Simple transformer encoder 84.3 81.2 5.7 59
KSC 10% Proposed multi-head network 93.2 90.1 6.4 59

The ablation findings demonstrate the efficacy of the proposed multi-head network design. Specifically, in terms of accuracy and F1 score, the proposed multi-head network consistently beats the simple transformer encoder on all datasets. On most datasets, it also exhibits faster training and testing times than the simple transformer encoder. These gains are explained by the multi-head network's capacity to capture a wider range of complex and varied feature representations, which improves the model's overall performance. The multi-head attention mechanism lets the model attend to many parts of the input data at the same time, which helps it comprehend the underlying patterns more thoroughly and robustly.

Dataset and experimental evaluation

Four classical HSI datasets were selected to verify the performance of the proposed methodology: the Pavia University, Salinas, Indian Pines, and KSC datasets. The Pavia University dataset was collected by the reflective optics system imaging spectrometer (ROSIS) sensor over Pavia, northern Italy. The spatial resolution of this dataset is 1.3 m, and the spatial size is 610 × 340 pixels. The collected bands range from 0.4 to 0.86 micrometers, and 103 bands were retained after removing the noisy bands. The Pavia University dataset has a total of 42,776 labeled samples. We randomly selected 1% of the samples for training, and the remaining samples are used for testing. Table 2 lists the Pavia University labeled samples.

Table 2.

Pavia University labeled samples [The satellite imagery in Table 2 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Table 2 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].


The Salinas dataset was captured by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor over the Salinas Valley, California; the image size is 512 × 217 pixels with a spatial resolution of 3.7 m. The spectrum contains 224 bands; after omitting the bands absorbed by atmospheric water, 204 bands are retained. The Salinas dataset has a total of 54,129 samples. We randomly selected 1% of the samples for training the model, and the remaining samples are used for testing. Table 3 lists the Salinas dataset labeled samples.

Table 3.

Salinas dataset labeled samples [The satellite imagery in Table 3 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Table 3 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].


The Indian Pines dataset was acquired by the AVIRIS sensor over Indiana, USA, in 1992. The original image consists of 220 spectral bands with wavelengths ranging from 0.4 to 2.5 micrometers. After excluding the water absorption bands, 200 bands were retained for the investigation. The image has a spatial size of 145 × 145 pixels representing 16 distinct land cover categories. The Indian Pines dataset has a total of 10,249 samples. We selected 5% of the samples at random for training the model, and the remaining samples were used for testing. Table 4 shows the Indian Pines dataset labeled samples.

Table 4.

Indian Pines dataset labeled samples [The satellite imagery in Table 4 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Table 4 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].


The KSC dataset used in this study was acquired by the National Aeronautics and Space Administration (NASA) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) instrument over the Kennedy Space Center (KSC), Florida, USA, on March 23, 1996. The dataset comprises 224 spectral bands over a 512 × 614 pixel scene, with a spatial resolution of 18 m per pixel. The spectral range spans 400 to 2500 nm with a spectral resolution of 10 nm. The classification task is challenging due to the presence of 13 distinct land cover types, including water and mixed classes. To optimize the data for classification, water absorption and low signal-to-noise ratio (SNR) bands were removed, leaving 176 bands. Table 5 details the number of training and testing samples for each class. The dataset's diverse spectral and spatial characteristics and the abundance of labeled pixels make it well suited for evaluating and refining HSI classification models. The KSC dataset has a total of 5211 samples. We randomly selected 10% of the samples for training the model, and the remaining samples are used for testing.

Table 5.

KSC dataset labeled samples [The satellite imagery in Table 5 was drawn from the raw data of the publicly available dataset in the GitHub repository linked in the data availability section (https://github.com/gokriznastic/HybridSN/tree/master/data). The classification maps in Table 5 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].


Experimental setup

For this experiment, we utilized the Adam optimizer and selected categorical cross-entropy as the loss function to train the proposed model. The learning rate is set to 0.0001 and the weight decay to 0.00001. We use 30 PCA components and a 13 × 13 input patch size for all four datasets. The batch size and number of epochs are set to 64 and 100, respectively.
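
A minimal training-loop sketch matching this setup is shown below; the `model` and `train_loader` objects (e.g., the GroupFormer network and a dataset of 13 × 13 patches batched at 64) are assumed to be defined elsewhere.

```python
# Minimal training-loop sketch for the stated setup: Adam, lr 1e-4, weight
# decay 1e-5, categorical cross-entropy, 100 epochs (batch size 64 is set in
# the DataLoader). `model` and `train_loader` are assumed to exist elsewhere.
import torch
import torch.nn as nn

def train(model, train_loader, epochs=100, device="cuda"):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    for _ in range(epochs):
        model.train()
        for patches, labels in train_loader:      # patches: (64, 1, 30, 13, 13)
            patches, labels = patches.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(patches), labels)
            loss.backward()
            optimizer.step()
    return model
```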

Model parameter selection

This section examines the factors that influence classification accuracy, such as the number of PCA components and the patch (window) size.

Selection of dimensionality reduction method

HSI data contains a significant amount of redundant and noisy information. Effective feature preprocessing can minimize unnecessary information and alleviate the computational load. Comparative experiments were conducted on the four datasets using the PCA, LDA, and factor analysis (FA) preprocessing methods; the results are given in Tables 6, 7, 8 and 9. The experiments show that the results obtained with PCA surpass those achieved with FA and LDA. Therefore, PCA was chosen to extract the spectral features.
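
For reference, the three compared preprocessing methods can be applied with scikit-learn as sketched below; the function name and component counts are illustrative, and LDA is supervised, so it is capped at (number of classes − 1) components.

```python
# Sketch of the compared preprocessing methods using scikit-learn; the input is
# a (num_pixels, num_bands) matrix of spectra. LDA is supervised and is capped
# at (number of classes - 1) components; component counts are illustrative.
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_spectra(flat_pixels, labels=None, method="pca", n_components=30):
    if method == "pca":
        return PCA(n_components=n_components).fit_transform(flat_pixels)
    if method == "fa":
        return FactorAnalysis(n_components=n_components).fit_transform(flat_pixels)
    if method == "lda":
        k = min(n_components, len(set(labels)) - 1)
        lda = LinearDiscriminantAnalysis(n_components=k)
        return lda.fit(flat_pixels, labels).transform(flat_pixels)
    raise ValueError(f"unknown method: {method}")
```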

Table 6.

Impact of different dimensionality reduction techniques on the Indian pines dataset.

Methods OA (%) AA (%) Kappa (×100)
FA 93.54 92.65 93.8
LDA 94.01 92.89 93.69
PCA 97.60 94.65 97.2
Table 7.

Impact of different dimensionality reduction techniques on the Salinas Dataset.

Methods OA (%) AA (%) Kappa (×100)
FA 94.30 94.05 94.0
LDA 93.79 91.80 92.7
PCA 99.74 99.76 99.7
Table 8.

Impact of different dimensionality reduction techniques on the Pavia University dataset.

Methods OA (%) AA (%) Kappa (×100)
FA 96.57 94.67 95.44
LDA 97.02 95.05 95.26
PCA 98.47 97.23 97.9
Table 9.

Impact of different dimensionality reduction techniques on KSC dataset.

Methods OA (%) AA (%) Kappa (×100)
FA 97.58 97.26 97.71
LDA 98.09 98.19 98.89
PCA 99.53 99.24 99.4

Result and discussion

To verify the effectiveness of the proposed model, we present experimental results for state-of-the-art models: HybridSN41, SSRN42, BS2T43, SpectralFormer26, MorphFormer39, FouriorFormer44, GSC-ViT45, and GAHT46. For a fair comparison, we use the same training samples and the same patch size for all comparison models. Tables 2, 3, 4 and 5 list the details of the four benchmark datasets on which we conducted the experiments. CNN techniques generally obtain good accuracy on all four datasets because they better capture contextual, especially low-level, features. In contrast, transformer-based models such as SpectralFormer and MorphFormer exhibit weaker performance despite their ability to capture significant spectral-spatial features. The basic reason is that SpectralFormer does not fully exploit the three-dimensional nature of the data, and it is challenging to implicitly represent the spatial relationships with a limited number of training samples. Among the comparison algorithms, FouriorFormer, GSC-ViT, and GAHT are particularly well suited for HSI data and perform better because they incorporate convolutional layers. Nevertheless, the proposed model improves OA by 4.26%, 2.72%, 3.51%, 8.22%, 5.67%, 1.27%, 2.01%, and 2.04% compared with the HybridSN, SSRN, BS2T, SpectralFormer, MorphFormer, FouriorFormer, GSC-ViT, and GAHT models, respectively. The classification results for Pavia University are shown in Table 10. Hybrid CNN-transformer approaches outperform pure transformer approaches, implying that the combination of the two architectures can be advantageous but requires further refinement. The proposed model successfully extracts both local and high-level features by combining the strengths of the CNN and the transformer-based attention module; as a result, it outperforms other transformer-based classification models, providing a significant advantage. Comparison models may perform poorly when HSI classification uses a limited number of samples. The proposed model achieves the best classification results in most classes, for example "Trees", and performs competitively on "Self-blocking bricks". Compared to the second-best FouriorFormer model, the proposed model improves the classification result by 1.27% and 3.24% in terms of OA and AA. Figure 4 shows the classification maps on the Pavia University dataset. As one can see, GSC-ViT, GAHT, and FouriorFormer obtain excellent classification results with less noise and good intra-class smoothness. In addition, the proposed model captures multiscale spectral-spatial features and high-level semantic features from the attention layer, so it obtains a better classification map with more detailed information.

Table 10.

Classification result (%) on Pavia University Dataset.

Class HybridSN SSRN BS2T SF MorphFormer FouriorFormer GSC-ViT GAHT Proposed
C1 94.18 97.54 96.13 95.11 90.73 98.81 96.02 95.87 98.21
C2 99.84 99.98 99.81 99.97 99.85 100 99.26 99.75 99.90
C3 87.15 94.89 53.99 66.98 69.58 83.01 86.42 84.88 93.53
C4 84.70 93.40 92.15 82.45 81.70 82.65 90.20 83.18 93.73
C5 100 99.77 99.84 97.37 98.87 99.39 99.62 99.92 98.93
C6 97.24 95.84 99.15 92.56 99.23 100 99.73 100 100
C7 91.41 86.94 68.18 53.37 89.97 96.96 99.54 99.16 99.76
C8 71.98 84.66 96.62 64.41 80.24 98.49 89.60 97.61 96.86
C9 96.15 58.48 93.27 68.51 65.84 86.65 87.51 71.29 94.18
OA 94.21 95.74 94.96 90.19 92.80 97.20 96.46 96.43 98.47
AA 91.41 90.17 88.79 80.08 86.22 93.99 94.21 92.41 97.23
Kappa 0.923 0.943 0.933 0.901 0.903 0.962 0.953 0.952 0.979
Params 795.9 (K) 198.52 (K) 1.35 (M) 194.1(K) 141.1 (K) 713.67 (K) 0.77 (M) 1.06 (M) 0.9 (M)
MACs 31.88 (M) 46.15 (M) 16.78 (M) 12.47(M) 5.97 (M) 110.32 (M) 49.64 (M) 177.62 (M) 0.63 (M)
TR Time (s) 54.89 84.65 412.66 63.29 135.89 190.13 150.96 152.66 101.27
TS Time (s) 8.60 13.69 80.79 22.82 16.58 40.65 16.05 23.43 27.16

Figure 4.


Classification maps on the Pavia University dataset. [The classification maps in Fig. 4 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].

Table 11 shows the classification results for the Salinas dataset. The results reveal that the proposed model exhibits a consistent performance gain across the classes. The OA achieved by the SpectralFormer and MorphFormer models is slightly lower. Among the comparison models, GSC-ViT and GAHT achieve the highest overall accuracy. The proposed model obtains better classification results for "Fallow-smooth", "Lettuce-romaine-6wk", and "Vineyard-untrained", where the group attention layer helps capture features better. Compared to the second-best model, the proposed model achieves improvements of 0.59%, 0.58%, and 0.007 in terms of OA, AA, and the Kappa coefficient. Figure 5 shows the classification maps of the different methods on the Salinas dataset; as can be seen, the GAHT model and the proposed model have less noise, and our classification map is very close to the ground truth. As the noise level increases, the accuracy of the classification maps tends to decrease.

Table 11.

Classification result (%) on Salinas Dataset.

Class HybridSN SSRN BS2T SF MorphFormer FouriorFormer GSC-ViT GAHT Proposed
C1 100 99.94 100 99.79 89.99 100 99.89 100 100
C2 100 100 100 100 99.89 100 100 100 100
C3 99.23 100 100 100 92.94 100 100 100 100
C4 98.55 99.92 99.49 91.95 88.98 82.24 99.34 99.56 99.92
C5 99.35 98.49 96.64 98.60 91.13 100 98.41 98.94 99.77
C6 99.94 100 100 100 98.13 100 99.97 99.97 100
C7 100 99.88 100 99.18 97.34 99.97 99.18 99.94 99.94
C8 95.42 99.48 99.97 93.07 90.63 99.34 98.44 99.17 99.92
C9 100 100 100 100 98.53 100 99.98 100 100
C10 97.68 99.19 99.07 98.02 87.11 99.13 99.47 99.32 99.43
C11 96.68 99.05 100 93.37 88.55 99.90 99.05 100 100
C12 99.73 100 100 100 88.36 100 100 99.89 100
C13 98.01 39.47 90.62 91.06 90.40 92.28 89.41 99.44 100
C14 97.92 97.92 98.77 98.58 90.55 98.77 97.45 93.95 97.90
C15 92.03 98.17 84.47 91.64 86.08 96.74 97.87 96.76 99.97
C16 99.38 99.49 99.94 99.94 88.31 99.88 99.32 100 100
OA 97.56 98.41 97.49 96.66 92.18 98.75 98.93 99.15 99.74
AA 98.37 95.69 98.06 97.20 91.68 98.01 98.61 99.18 99.76
Kappa 0.972 0.982 0.972 0.962 0.912 0.986 0.988 0.990 0.997
Params 796.8 (K) 139.69 (K) 1.36 (M) 205.59 (K) 141.5 (K) 715.9 (K) 1.04 (M) 1.07 (M) 0.98 (M)
MACs 31.96 (M) 47.10 (M) 14.71 (M) 13.69 (M) 5.98 (M) 110.34 (M) 118.93 (M) 177.81 (M) 1.25 (M)
TR Time (s) 51.93 90.70 391.18 65.91 153.86 206.13 157.63 189.37 168.93
TS Time (s) 10.98 20.28 103.65 25.37 26.09 51.23 19.87 22.18 54.80

Figure 5.


Classification maps obtained by different models on the Salinas dataset. [The classification maps in Fig. 5 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].

Table 12 shows the Indian Pines classification results. The proposed model achieves the best classification results with only small training samples: specifically, in the classes "Grass-trees", "Oats", "Soybean-notill", "Soybean-mintill", "Soybean-clean", "Buildings-grass-trees" and "Stone-steel-towers", the proposed model achieves the highest accuracy due to the use of a 3D CNN, which extracts more discriminative spectral-spatial features. The proposed model obtained 97.60%, 94.65%, and 0.972 in terms of OA, AA, and Kappa. Figure 6 shows the classification maps of different models on the Indian Pines dataset; as one can see, FouriorFormer, GSC-ViT, GAHT, and our proposed model produce competitive classification results in terms of OA, and their maps contain less noise. The classification results on the KSC dataset are shown in Table 13. The proposed model significantly improves OA, AA, and Kappa and outperforms the other comparison models; by using CNNs with a group attention module, more features are extracted. The proposed model obtained 99.53%, 99.24%, and 0.994 in terms of OA, AA, and Kappa. The classification maps derived from the comparison models on the KSC dataset show a small number of misclassified points with little noise, especially in specific small regions. The classification map generated by the proposed model on the KSC dataset, shown in Fig. 7, demonstrates reduced noise and fewer classification errors. The utilization of a multiscale feature extractor combined with a group attention module enables the establishment of dependencies at both low and high levels, which results in a notable improvement in classification accuracy even when the number of training examples is limited.

Table 12.

Classification result (%) on Indian pines dataset.

Class HybridSN SSRN BS2T SF MorphFormer FouriorFormer GSC-ViT GAHT Proposed
C1 6.66 9.60 4.54 4.54 56.81 31.81 20.45 11.11 73.08
C2 74.58 93.80 93.94 74.50 93.14 94.54 93.73 88.80 93.30
C3 79.37 99.61 77.50 64.38 92.77 97.97 96.07 94.78 97.18
C4 56.08 75.55 55.70 16 90.22 87.55 92.0 84.78 92.95
C5 96.80 97.38 95.25 92.37 100 97.38 97.60 98.93 98.16
C6 98.87 98.84 99.71 95.23 97.69 96.96 98.26 98.87 99.84
C7 70.37 20.41 21.08 3.70 66.66 22.03 96.42 94.18 97.26
C8 100 99.11 100 97.35 99.33 100 97.13 100 100
C9 36.84 10.87 11.66 5.26 36.84 20.98 21.05 78.94 80.05
C10 86.00 93.49 75.88 68.25 93.39 97.29 96.20 94.16 97.82
C11 83.16 97.94 88.16 83.49 95.24 98.92 98.11 96.93 99.09
C12 69.21 90.76 78.73 35.70 83.83 93.07 88.80 88.17 96.62
C13 93.46 86.66 99.49 93.33 98.46 84.61 91.28 98.49 98.37
C14 92.99 99.08 99.91 95.67 99.75 98.16 100 99.91 99.91
C15 76.47 95.09 80.86 54.22 92.37 90.73 91.00 95.98 99.13
C16 74.44 76.13 82.02 73.86 69.31 53.40 55.68 61.11 91.66
OA 83.66 94.994 87.91 76.60 94.30 95.51 95.09 94.36 97.60
AA 74.70 77.77 72.7 59.86 85.36 79.09 83.36 86.57 94.65
Kappa 0.814 0.938 0.861 0.729 0.934 0.948 0.943 0.935 0.972
Params 1.17 (M) 140.60 (K) 1.35 (M) 121.59 (K) 141.6 (K) 705.84 (K) 1.03 (M) 1.06 (M) 0.97 (M)
MACs 65.35 (M) 47.09 (M) 14.70 (M) 3.42 (M) 6.00 (M) 78.42 (M) 117.36 (M) 127.17 (M) 1.24 (M)
TR Time (s) 56.50 90.70 117.88 54.40 123.7 110.81 135.21 105.98 120.19
TS Time (s) 9.89 20.28 5.39 5.60 10.61 3.74 15.33 8.80 15.43

Figure 6.


Classification maps obtained on the Indian Pines dataset. [The classification maps in Fig. 6 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].

Table 13.

Classification result (%) on KSC Dataset.

Class HybridSN SSRN BS2T SF MorphFormer FouriorFormer GSC-ViT GAHT Proposed
C1 98.68 99.29 100 99.71 100 99.86 99.44 100 100
C2 96.34 99.11 100 97.78 100 100 96.96 100 100
C3 96.52 97.89 100 94.11 99.17 98.76 99.17 100 100
C4 88.98 99.14 99.14 89.31 77.82 98.32 78.66 96.23 92.46
C5 94.48 76.66 80.66 73.33 77.12 56.20 45.09 79.08 100
C6 83.49 99.06 99.53 89.67 85.77 100 94.49 99.54 97.70
C7 100 100 100 98.96 100 81.0 84.0 99.0 100
C8 100 100 99.75 99.75 99.75 100 100 100 100
C9 100 100 100 100 100 100 99.79 100 100
C10 100 100 100 100 100 100 100 100 100
C11 100 100 99.48 99.74 100 100 100 100 100
C12 100 100 100 99.57 91.42 100 99.16 100 100
C13 100 100 100 100 100 100 100 100 100
OA 98.03 98.94 99.27 97.66 96.70 98.10 96.34 99.13 99.53
AA 96.80 97.78 98.35 95.53 94.69 94.93 92.06 97.98 99.24
Kappa 0.978 0.988 0.991 0.974 0.963 0.978 0.959 0.990 0.994
Params 534.27 (K) 141.21 (K) 1.36 (M) 121.18 (K) 139.9 (K) 701.75 (K) 9.09 (M) 1.05 (M) 0.95 (M)
MACs 16.11 (M) 32.77 (M) 14.71 (M) 3.42 (M) 5.98 (M) 78.40 (M) 116.01 (M) 125.39 (M) 1.23 (M)
TR Time (s) 36.51 50.10 90.68 47.19 60.5 63.8 100.10 72.95 92.58
TS Time (s) 3.9 4.87 5.14 3.87 4.60 5.01 6.90 4.05 7.1

Figure 7.


Classification maps obtained on the KSC dataset. [The classification maps in Fig. 7 were generated with Python 3.8 and the publicly available PyTorch library (https://pytorch.org/get-started/locally/), running on an NVIDIA RTX 3060 GPU with 64 GB RAM].

Moreover, we have discussed the effect of different modules on four datasets in the ablation experiment section. Finally, we conducted experiments on training and testing times, as well as training sample ratios.

Ablation experiments

To comprehensively highlight the contribution of each module, we evaluated various combinations of the modules on the Salinas, Pavia University, Indian Pines, and KSC datasets using a patch size of 13 × 13. Our analysis covers the CNN, the group attention (GA) module, the AFE module, and their combinations, assessed through the OA, AA, and Kappa coefficient metrics.
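
For reference, the three evaluation metrics reported throughout the tables can be computed from a confusion matrix as in the short sketch below (an illustrative helper, not the authors' code).

```python
# Helper computing the three reported metrics from predictions: overall
# accuracy (OA), average per-class accuracy (AA), and the Kappa coefficient.
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                              # confusion matrix
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    aa = per_class.mean()                          # average accuracy
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1 - expected)       # Cohen's kappa
    return oa, aa, kappa
```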

The outcomes of these experiments are presented in Tables 14, 15, 16 and 17. The full proposed network achieves the best performance, which indicates that, within our network, the AFE module is useful for refining complex features and the group attention block helps capture more complex features and improves classification performance. Table 18 reports experiments with a multi-head self-attention (MHSA) module on the Salinas dataset to verify the effectiveness of the proposed group attention: as can be seen, the MHSA module does not capture the high-level semantic information from the HSI data as well.

Table 14.

Results of the ablation study with different combinations of modules on the Salinas dataset.

Metrics CNN CNN + GA GA + AFE CNN + AFE Proposed
OA (%) 97.32 97.59 96.38 97.89 99.74
AA (%) 96.68 97.14 95.91 97.25 99.76
Kappa 0.970 0.967 0.961 0.958 0.997

Table 15.

Results of the ablation study with different combinations of modules on the Pavia University dataset.

Metrics CNN CNN + GA GA + AFE CNN + AFE Proposed
OA (%) 96.25 96.78 94.18 96.89 98.47
AA (%) 95.47 95.33 93.09 95.96 97.23
Kappa 0.961 0.951 0.940 0.959 0.979

Table 16.

Results of the ablation study with different combinations of modules on the Indian Pines dataset.

Metrics CNN CNN + GA GA + AFE CNN + AFE Proposed
OA (%) 95.10 95.69 92.54 95.55 97.60
AA (%) 90.03 90.81 87.71 91.67 94.65
Kappa 0.941 0.945 0.919 0.948 0.972

Table 17.

Results of the ablation study with different combinations of modules on the KSC dataset.

Metrics CNN CNN + GA GA + AFE CNN + AFE Proposed
OA (%) 98.14 98.68 97.52 98.28 99.53
AA (%) 97.58 98.04 96.92 98.33 99.24
Kappa 0.976 0.988 0.973 0.980 0.994

Table 18.

Ablation experiments with MHSA and other modules on the Salinas dataset.

Metrics CNN CNN + MHSA MHSA + AFE
OA (%) 98.05 95.63 95.81
AA (%) 96.99 94.70 93.43
Kappa 0.970 0.948 0.948

Impact of training ratio

To evaluate the stability and robustness of the proposed model, we randomly select training samples from the four datasets at various ratios. Starting from a small number of training samples and gradually increasing it, the result of each model under different training ratios is shown in Fig. 8. The OA of the models increases progressively until it reaches a stable zone. The proposed model maintains the best classification accuracy, especially when few training samples are available.

Figure 8.


OAs at different numbers of samples on four datasets.

Discussion on training and testing time

The training time, testing time, and parameters of the HybridSN, SSRN, BS2T, SpectralFormer, MorphFormer, FouriorFormer, GSC-ViT, GAHT, and proposed models on the four datasets are listed in Tables 10, 11, 12 and 13. The BS2T model shows the slowest training speed. The training time of the proposed model is higher than that of several comparison models on the four datasets. In terms of parameter numbers, the proposed model has a relatively large number of parameters due to the use of attention layers, but its computational cost (MACs) is much lower than that of the other models.

Conclusion

In this paper, we proposed the new GroupFormer model, which is intended to improve classification outcomes by efficiently extracting spatial features together with deep spectral properties. CNNs, group attention blocks, and the AFE module are the three main integrated parts of the model, and every element is essential to the model's capacity to analyze and improve spectral-spatial data. The first part is a multiscale 3D CNN module used for low-level feature extraction in the spectral-spatial domain. This module is especially helpful for applications that call for a thorough analysis of spectral data, since it is skilled at collecting fine-grained features across several spectral bands. The CNN module guarantees that both spatial and spectral dimensions are considered by processing data in three dimensions, which results in more thorough feature extraction. The model uses a group attention module to extract high-level semantic features. This element improves the model's capacity to concentrate on the most relevant portions of the incoming data, successfully identifying significant correlations and patterns that are essential for precise categorization. The group attention mechanism's ability to discriminate between important features and noise yields more robust performance. An AFE module is included in GroupFormer to further enrich the features retrieved by the high-level attention module and the low-level CNN. As the last refinement stage, this module improves the quality of the extracted features and ensures they contribute to the classification objective. The final classification uses the cohesive representation created by the AFE module, which unifies the various characteristics. The experimental findings validate the importance and efficiency of GroupFormer in extracting deep spectral-spatial characteristics. The model demonstrates applicability in a range of scenarios by delivering significant increases in classification accuracy. However, the research also highlights areas for future exploration. One of the primary directions is the development of lightweight networks that can perform multiscale spectral-spatial feature extraction. Such advancements would improve accuracy and make the model more practical for real-world applications with limited computational resources. By focusing on these aspects, future research can build upon the strong foundation of GroupFormer, pushing the boundaries of spectral-spatial feature extraction and classification.

Acknowledgements

This work was supported by the National Key R&D Program of China Under Grant Number: 2022YFE0136800; Marine Defense Technology Innovation Fund of China Shipbuilding Research and Design Center Under Grant Number: JJ-2022-719-03; Open Topic of Microsystem Technology National Defense Science and Technology Key Laboratory Under Grant Number: 6142804230106; Open Fund of Marine Environmental Detection Technology and Application Key Laboratory, Ministry of Natural Resources Under Grant Number: MESTA-2022-A006; The new round of “ Double First Class” discipline collaborative innovation achievement project in Heilongjiang Province in 2023 Under Grant Number: LJGXCG2023-066; Collaborative Detection Technology Based on Multi-Base Passive Sonar Array and Xi’an Science and Technology Plan Project Under Grant Number: 2022FWQY16; Industrialisation and demonstration of intelligent cross-water and air medium communication machine Under Grant number: CXRC20231113756; Nanhai High-level Science and Technology Innovation Guidance Special Program of Nanhai Institute of Harbin Engineering University: Cross-domain information transmission and networking technology in deep ocean based on low-orbit satellites.

Author contributions

Each author in this article contributed their distinct expertise and responsibilities in a collaborative way. R.K. and T.A. were primarily responsible for formulating the study’s concepts and determining the research direction. R.K. provided methodological frameworks so that the study’s approach was rigorous and coherent. W.C. assumed responsibility for the software implementation, which is critical for data analysis and interpretation. T.A., Z.H., and W.C. collaborated to validate the findings, ensuring the strength and reliability of the conclusions reached. J.K. supervised formal analytic techniques, while R.K. and Z.K. carried out essential studies for data collection and interpretation. X.M. supervised the use of resources, while S.K. thoroughly examined the collected data. S.K. handled further modifications and editing after R.K. initially wrote the manuscript. Z.K. used data visualization to improve the presentation of significant findings. X.M. and Z.H. provided supervision throughout the study process, ensuring complete conformity to scholarly guidelines. X.M. and J.K. collaborated on project administration tasks, with X.M. leading the funding acquisition task. All authors provided input during the manuscript drafting stage.

Data availability

The Hyperspectral Image datasets (Indian Pines, Pavia University, and Salinas) used in the current study are available in a GitHub repository, https://github.com/gokriznastic/HybridSN/tree/master/data.

Declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Lupu, D., Garrett, J. L., Johansen, T. A., Orlandic, M. & Necoara, I. Quick unsupervised hyperspectral dimensionality reduction for earth observation: a comparison. arXiv preprint arXiv:2402.16566 (2024).
2. Wang, D. et al. Sliding dual-window-inspired reconstruction network for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. (2024).
3. Kumar, V., Singh, R. S. & Dua, Y. Morphologically dilated convolutional neural network for hyperspectral image classification. Sig. Process. Image Commun. 101, 116549 (2022).
4. Murphy, R. J., Whelan, B., Chlingaryan, A. & Sukkarieh, S. Quantifying leaf-scale variations in water absorption in lettuce from hyperspectral imagery: a laboratory study with implications for measuring leaf water content in the context of precision agriculture. Precision Agric. 20, 767–787 (2019).
5. Gu, Y., Hu, Z., Zhao, Y., Liao, J. & Zhang, W. MFGTN: a multi-modal fast gated transformer for identifying single trawl marine fishing vessel. Ocean Eng. 303, 117711 (2024).
6. Villa, A., Benediktsson, J. A., Chanussot, J. & Jutten, C. Hyperspectral image classification with independent component discriminant analysis. IEEE Trans. Geosci. Remote Sens. 49, 4865–4876 (2011).
7. Fauvel, M., Benediktsson, J. A., Chanussot, J. & Sveinsson, J. R. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE Trans. Geosci. Remote Sens. 46, 3804–3814 (2008).
8. Wang, D. et al. Blind-block reconstruction network with a guard window for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 61, 1–16 (2023).
9. Dalla Mura, M., Villa, A., Benediktsson, J. A., Chanussot, J. & Bruzzone, L. Classification of hyperspectral images by using extended morphological attribute profiles and independent component analysis. IEEE Geosci. Remote Sens. Lett. 8, 542–546 (2010).
10. Zhang, Y. et al. Topological structure and semantic information transfer network for cross-scene hyperspectral image classification. IEEE Trans. Neural Netw. Learn. Syst. (2021).
11. Zhao, W. & Du, S. Spectral–spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach. IEEE Trans. Geosci. Remote Sens. 54, 4544–4554 (2016).
12. Szegedy, C. et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9 (2015).
13. Anand, R., Samiappan, S. & Kavitha, K. Flower pollination optimization based hyperspectral band selection using modified wavelet Gabor deep filter neural network. Infrared Phys. Technol. 138, 105215 (2024).
14. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).
15. Zheng, W., Lu, S., Yang, Y., Yin, Z. & Yin, L. Lightweight transformer image feature extraction network. PeerJ Comput. Sci. 10, e1755 (2024).
16. Bordes, A., Glorot, X., Weston, J. & Bengio, Y. Joint learning of words and meaning representations for open-text semantic parsing. In Proceedings of Artificial Intelligence and Statistics, 127–135 (2012).
17. Wang, D., Gao, L., Qu, Y., Sun, X. & Liao, W. Frequency-to-spectrum mapping GAN for semisupervised hyperspectral anomaly detection. CAAI Trans. Intell. Technol. 8, 1258–1273 (2023).
18. Qiao, M. et al. HyperSOR: context-aware graph hypernetwork for salient object ranking. IEEE Trans. Pattern Anal. Mach. Intell. 46, 5873–5889 (2024).
19. Dosovitskiy, A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
20. Xu, H., Li, Q. & Chen, J. Highlight removal from a single grayscale image using attentive GAN. Appl. Artif. Intell. 36, 1988441 (2022).
21. Mei, X. et al. Spectral-spatial attention networks for hyperspectral image classification. Remote Sens. 11, 963 (2019).
22. Yin, L. et al. Convolution-transformer for image feature extraction. CMES Comput. Model. Eng. Sci. 141 (2024).
23. Qing, Y. & Liu, W. Hyperspectral image classification based on multi-scale residual network with attention mechanism. Remote Sens. 13, 335 (2021).
24. Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
25. He, X., Chen, Y. & Lin, Z. Spatial-spectral transformer for hyperspectral image classification. Remote Sens. 13, 498 (2021).
26. Hong, D. et al. SpectralFormer: rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 60, 1–15 (2021).
27. Shi, C., Wu, H. & Wang, L. A positive feedback spatial-spectral correlation network based on spectral slice for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–17 (2023).
28. Sun, L., Zhao, G., Zheng, Y. & Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2022).
29. Zhang, J., Meng, Z., Zhao, F., Liu, H. & Chang, Z. Convolution transformer mixer for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022).
30. Shi, C., Yue, S., Wu, H., Zhu, F. & Wang, L. A multi-hop graph rectify attention and spectral overlap grouping convolutional fusion network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. (2024).
31. Shi, C., Sun, J., Wang, T. & Wang, L. Hyperspectral image classification based on a 3D octave convolution and 3D multiscale spatial attention network. Remote Sens. 15, 257 (2023).
32. Cui, X. et al. Multiscale spatial-spectral convolutional network with image-based framework for hyperspectral imagery classification. Remote Sens. 11, 2220 (2019).
33. Li, B. et al. Multi-granularity vision transformer via semantic token for hyperspectral image classification. Int. J. Remote Sens. 43, 6538–6560 (2022).
34. Arshad, T. & Zhang, J. Hierarchical attention transformer for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. (2024).
35. Shi, C., Wu, H. & Wang, L. CEGAT: a CNN and enhanced-GAT based on key sample selection strategy for hyperspectral image classification. Neural Netw. 168, 105–122 (2023).
36. Pan, H., Yan, H., Ge, H., Liu, M. & Shi, C. Transformer-enhanced two-stream complementary convolutional neural network for hyperspectral image classification. J. Franklin Inst. 361, 106973 (2024).
37. Ouyang, E. et al. When multigranularity meets spatial–spectral attention: a hybrid transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–18 (2023).
38. Shi, C., Wu, H. & Wang, L. A feature complementary attention network based on adaptive knowledge filtering for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. (2023).
39. Roy, S. K. et al. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–15 (2023).
40. Wu, H., Shi, C., Wang, L. & Jin, Z. A cross-channel dense connection and multi-scale dual aggregated attention network for hyperspectral image classification. Remote Sens. 15, 2367 (2023).
41. Roy, S. K., Krishna, G., Dubey, S. R. & Chaudhuri, B. B. HybridSN: exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 17, 277–281 (2019).
42. Zhong, Z., Li, J., Luo, Z. & Chapman, M. Spectral–spatial residual network for hyperspectral image classification: a 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 56, 847–858 (2017).
43. Song, R., Feng, Y., Cheng, W., Mu, Z. & Wang, X. BS2T: bottleneck spatial–spectral transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 60, 1–17 (2022).
44. Shi, H., Zhang, Y., Cao, G. & Yang, D. MHCFormer: multiscale hierarchical conv-aided fourierformer for hyperspectral image classification. IEEE Trans. Instrum. Meas. 73, 1–15 (2023).
45. Zhao, Z., Xu, X., Li, S. & Plaza, A. Hyperspectral image classification using groupwise separable convolutional vision transformer network. IEEE Trans. Geosci. Remote Sens. (2024).
46. Mei, S., Song, C., Ma, M. & Xu, F. Hyperspectral image classification using group-aware hierarchical transformer. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2022).
