Scientific Reports. 2025 Apr 3;15:11488. doi: 10.1038/s41598-025-95492-y

Adaptive feature interaction enhancement network for text classification

Rui Su 1,2, Shangbing Gao 1,2, Kefan Zhao 1,2, Junqiang Zhang 1,2
PMCID: PMC11969002  PMID: 40181025

Abstract

Text classification aims to distinguish between texts, which is difficult when both global text semantics and local details must be captured. To address this issue, we propose an Adaptive Feature Interactive Enhancement Network (AFIENet). Specifically, AFIENet uses two branches to model the text globally and locally. The adaptive segmentation module in the local network can dynamically split the text and capture key phrases, while the global network grasps the overall central semantics. After obtaining the results from the two branches, an interaction gate is designed to evaluate the confidence of the global features and selectively fuse them with the local features. Finally, the interactively enhanced features are re-input into the classifier to improve text classification performance. Experimental results show that our proposed method can effectively enhance the performance of backbone networks such as TextCNN, RNN, and Transformer with fewer parameters. AFIENet achieved average improvements of 3.82% in accuracy and 3.88% in F1-score across the three datasets when using TextCNN as the backbone network. The results obtained with static word vectors are comparable to MacBERT, which also reflects the applicability of the proposed method.

Keywords: Natural language processing, Text classification, Adaptive feature enhancement, Interaction gate, Pre-training

Subject terms: Computational science, Computer science, Scientific data, Information technology

Introduction

Text classification1 is a key natural language processing task widely used in areas such as news classification2, sentiment analysis3, topic labeling4, etc. Traditional machine learning methods rely on shallow feature extraction and lack a sufficient understanding of the underlying semantics, structure, sequence, and context within the text. Their limited representational abilities result in relatively low classification performance5. The advent of deep learning solves issues like the inferior ability of shallow architectures to represent complex functions, gradually replacing traditional machine learning as the mainstream approach in text classification6. However, deep learning models demonstrate inconsistent results when handling varying text lengths, and controlling model parameter size remains difficult.

With the wide application of deep learning in natural language processing, numerous text classification models incorporating deep learning have been continuously proposed. Early convolutional neural networks (CNNs), originally applied in computer vision for their visual cortex-like structure, were soon widely adopted in text classification. Kim7 first applied CNNs by using convolutional kernels of varying scales to capture local sentence information and improve model interpretability; however, this also led to vanishing gradients. To address this issue, Wang et al.8 proposed the Deep Pyramid Convolutional Neural Network (DPCNN) with a pyramidal structure and layer skipping to mitigate gradient decay. Joulin et al.9 put forth the FastText model, projecting input text into an embedding space for quick, shallow convolutional learning and improved classification with lower computational costs. Convolutional networks are limited in modeling longer sequences due to their small convolutional windows. Recurrent Neural Networks (RNNs) avoid this issue by treating sentences as complete sequences. Thus, sequential models like RNN10 and Long Short-Term Memory (LSTM)11 better capture overall text semantics. The Gated Recurrent Unit (GRU)12 was introduced to solve the exploding/vanishing gradient problems of RNNs to some extent through learned gating of information flow and propagation.

Most existing classification methods using CNNs and RNNs derive text representations through max-pooling of hidden layers or final states, potentially overlooking decisive keywords. Attention mechanisms can effectively learn global semantic information in text. Transformer13 employed multi-head self-attention to capture richer features, with attention weights measuring each part’s influence on the whole. Additionally, Bi-directional Encoder Representations from Transformer (BERT)14, a pre-trained bi-directional 12-layer Transformer model, better captures deeper contextual information. However, BERT’s14 stacked architecture and numerous attention mechanisms lead to exponential growth in parameters. Current models still need a better balance between modeling text length variability and parameter efficiency.

To address the limitations of previous methods, we propose an Adaptive Feature Interactive Enhancement Network (AFIENet) for text classification. AFIENet contains two network branches, i.e., a Global Feature Extraction Network (GE-Net) and a Local Adaptive Feature Extraction Network (LA-Net), along with an Interactive Enhancement Gate (IE-Gate). Specifically, GE-Net employs a feature extraction model to learn global features from the input text and obtain a holistic representation. Meanwhile, LA-Net first performs adaptive segmentation of the text before extracting local features. Then, the IE-Gate is proposed to select local features based on confidence levels and interactively fuse them with the global feature, obtaining an enhanced semantic expression. Overall, the cooperation of the global and local networks enables more comprehensive and focused feature representations, thus improving text classification performance.

Our contributions are as follows:

  • We propose an Adaptive Feature Interactive Enhancement Network for text classification, which forms a dual-branch architecture and achieves complementary semantic representations to improve classification performance.

  • In the local network branch, an adaptive text segmentation module is designed to perform the local feature extraction of text with different lengths.

  • The interactive gate uses a gating mechanism for selective feature fusion, which can effectively filter the noise in local features; it specifically uses the confidence of the global features to guide the enhancement, avoiding the influence of arbitrarily mixed local noise.

Related work

Variable-length text processing

Many modeling approaches have been proposed for text classification tasks involving variable-length texts with different semantic complexity. To address the complexity and diversity of text, Zhang et al.15 proposed incorporating multi-granularity dynamic semantic representations for product text data in the chemical industry, where a Chinese pre-trained language model is used upstream of the classification task to obtain dynamic word-vector representations of the sentences, and the idea of adversarial training is introduced to increase the robustness of the text representations. Aiming at the problems of short-text feature sparsity and model dimension explosion, Yang et al.16 proposed a short text classification model using the attention mechanism and the word co-occurrence of labeled graphs, which addressed short text sparsity and topic text heterogeneity. Li et al.17 fused knowledge perception and a dual attention mechanism, which enabled the classification model to improve the efficiency of information extraction while acquiring semantic features of short text.

For long text classification, directly processing the long sequence ignores the hierarchical information contained in the document. To solve this problem, Sun et al.18 perform multi-layer modeling according to the hierarchy of the text and use a hybrid attention mechanism to obtain salient category features at each level, which enables the model to better capture the important information of the document and further improves the accuracy of long-document classification. Zhang et al.19 proposed a multi-feature fusion method for long text classification and constructed a classification model that fuses contextual, local, and averaged features, which effectively improves accuracy on long text classification tasks. Targeting text datasets of varying lengths, Zheng et al.20 proposed a text classification method based on LDA, combining the LDA model to enrich the word-vector information of the text and adding a fusion layer inside the original convolutional neural network, which increases the richness of the feature representation and addresses the low classification accuracy caused by insufficient feature representation.

Since Vaswani et al.13 proposed the Transformer model, it has significantly improved the performance of various natural language processing tasks, including machine translation, text generation, text summarization, etc. Many Transformer-based works14,21,22 have also made significant progress in variable-length text processing; for example, Microsoft Research23 introduced relative position embeddings and block masking to improve the model's ability to handle long sequences. In addition, the explosive emergence of large models24,25 marks a new era in text processing.

While these methods have advanced text classification, most exhibit instability when handling varying text lengths and semantic complexity. A key reason is that different models focus on either global or local features and fail to balance semantic representation across short and long texts, which results in insufficient generalization to variable lengths. To enable adaptive processing, we propose an adaptive text segmentation layer for segmented text modeling. This module divides the original text into segments of adaptive length, dynamically adjusting the segmentation granularity, and feature extraction is then performed separately on each segment. This mechanism can more effectively focus on key local features that may be overlooked when modeling entire sentences holistically.

Gating units

In recurrent neural networks such as the LSTM, a large number of gating structures are used to control the flow and transfer of information, including the input gate, forget gate, and output gate. This mechanism can effectively avoid the vanishing and exploding gradient problems and enhances the ability to model long-term dependencies in sequence data.

Similar gating designs have also been widely used in other network structures. For example, Ma et al.26 implemented feature screening through gating units and used gates to control the strength of information flow to the lower layers, thus mitigating gradient dispersion and improving classification accuracy. Li et al.27 used gating components to map statistical features into a shared information space to make the two latent representations compatible with each other, thus assisting text classification and optimizing the classification effect. Xu et al.28 proposed a new method for sentiment analysis and classification based on a bidirectional gated recurrent unit model. Yang et al.29 distinguished sequences and improved performance by integrating gated residual attention units and channel embedding techniques. In addition, gating units are often used in other network architectures to improve performance, such as recurrent-unit-driven multi-dimensional dynamic graph neural networks30.

Attention mechanisms and pre-trained models

In recent years, attention mechanisms and pre-trained language models have demonstrated powerful modeling capabilities in natural language processing tasks. Attention mechanisms can effectively focus on the semantic core of the text, while pre-trained models provide semantically rich word vector representations. Combining the two organically can further improve text comprehension and analysis. Yang et al.31 proposed the HAN (Hierarchical Attention Network) model with attention as a component of text classification, assigning different attention scores at the sentence and word levels and selecting sentences and words helpful for classification to improve performance. Subsequently, Cui et al.32 proposed a pre-trained language model MacBERT (MLM as correction BERT) for Chinese text, which greatly improved text classification performance. Zhan et al.33 utilized a cross-attention mechanism to align text features with aspect category features, so that the model focuses on the features most relevant to the given aspect. In addition, there are also research methods in other fields such as medical analysis34,35 that combine pre-trained models with attention mechanisms. For example, Xiao et al.36 proposed an adaptive attention model that integrates domain knowledge and optimizes its ability to understand medical terminology and handle complex contexts.

Inspired by the above methods, our proposed framework also absorbs the ideas of attention and pre-training. We use dynamic word vectors obtained from pre-training to further enhance the semantic richness of the text representation. Experimental results show that modeling with dynamic word vectors generated from the pre-trained model greatly improves the model's effectiveness on the text classification task.

Methodology

In this article, we propose an Adaptive Feature Interactive Enhancement Network (AFIENet), whose structure is shown in Fig. 1. AFIENet contains three main components:

  • Global feature extraction network, i.e., GE-Net, which is used to learn global semantic information from the whole text;

  • Adaptive local feature extraction network, i.e., LA-Net, which can dynamically adjust the extraction range of local features according to the text length;

  • Interaction Enhancement Module, i.e., IE-Gate, which enhances the interaction between global and local features by evaluating the confidence level to filter out the noise.

Specifically, GE-Net and LA-Net model the text at the global and local levels respectively, using the feature extractors $f_g$ and $f_l$ to obtain the feature representations $F_g$ and $F_l$. GE-Net provides a high degree of flexibility by learning global semantic information from the entire input text, while LA-Net adapts to the text length to capture local features dynamically. To obtain more adequate semantic information, IE-Gate first evaluates the confidence of the global features and then performs a weighted fusion with the local features. Below, we provide a detailed description of each component of AFIENet.

Fig. 1. Structure of the adaptive feature interaction enhancement network (AFIENet). We used different colors to distinguish the features obtained by the two networks, i.e., GE-Net and LA-Net, to clearly illustrate the data flow of AFIENet.

Feature Extractor. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the text with input length $Seq$ and $y_i$ is the label corresponding to the text $x_i$. It should be noted that GE-Net and LA-Net have the same input $x_i$ and feature extractors, but they do not share network parameters.

Unless otherwise specified, TextCNN is leveraged as the default feature extractor. To evaluate the weight of each word in the sentence semantics, the color of a word in Fig. 2 indicates its contribution to the final classification result: red indicates a contribution toward the positive class and blue toward the negative class. As Fig. 2 shows, with global modeling (TextCNN) the semantic weight of each word is distributed relatively uniformly across the sentence. With adaptive segmentation, in contrast, attention is focused on the respective local text within a sliding window, so keywords become more prominent and lexical noise that is not useful for semantic understanding is filtered out. Therefore, by adjusting the segmentation window to focus locally, the adaptive text segmentation layer can discover and emphasize semantic keywords that may be overlooked when modeling the whole sentence, while also filtering out some noisy words. This provides a more powerful semantic representation for processing and understanding variable-length text.

Fig. 2. Comparison of global and local modeling. The example given is a positive sample, "The room was exceptionally large and comfortable, the restaurant food was delicious, and overall it was okay", from the Tansombo Hotel Review dataset. The contribution of each word is visualized using a blue-to-red heatmap, and the difference between global and local modeling is represented by different windows.

Specifically, when Pos-Embed receives the text $x$ from the input layer, the dimension of the word vector is $Embed$, and the embedding matrix $X \in \mathbb{R}^{Seq \times Embed}$ of $x$ is obtained after word-vector encoding. The feature extractor $f(\cdot)$ extracts the features $c$ from $X$ in the form of n-grams, and its formula is as follows:

$c = [c_1, c_2, \ldots, c_{Seq-n+1}]$  (1)

where $c_i$ denotes the convolution result of TextCNN for $X_{i:i+n-1}$, $n$ is the convolution kernel width, the convolution kernel is $W \in \mathbb{R}^{n \times Embed}$, and the convolution bias is $b$. $c_i$ can be expressed as:

$c_i = \mathrm{Func}(W \cdot X_{i:i+n-1} + b)$  (2)

where Func is a nonlinear activation function such as ReLU or GeLU. A max-pooling operation is then performed to obtain $\hat{c}$. The features extracted by $f_g$ and $f_l$ are referred to as $F_g$ and $F_l$, respectively:

$F_g = f_g(X)$  (3)
$F_l = f_l(X)$  (4)

The features are then augmented and interacted with using the IE-Gate to obtain the final interaction semantic feature $F_{gl}$:

$F_{gl} = \mathrm{IE\text{-}Gate}(F_g, F_l)$  (5)
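
To make the n-gram extraction above concrete, the following is a minimal PyTorch sketch of a TextCNN-style extractor corresponding to Eqs. (1)-(4); the class name, the ReLU choice, and the default filter settings are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NgramExtractor(nn.Module):
    """Minimal TextCNN-style feature extractor: n-gram convolutions followed by max pooling."""
    def __init__(self, embed_dim=300, num_filters=250, kernel_sizes=(2, 3, 4)):
        super().__init__()
        # One Conv1d per n-gram width n; each kernel spans n consecutive word vectors (Eq. 2).
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )

    def forward(self, x):              # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)          # Conv1d expects (batch, embed_dim, seq_len)
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x))                 # Func(W . X_{i:i+n-1} + b) over all positions (Eqs. 1-2)
            c = F.max_pool1d(c, c.size(2))      # max pooling over positions
            feats.append(c.squeeze(2))          # (batch, num_filters)
        return torch.cat(feats, dim=1)          # concatenated n-gram features

# GE-Net and LA-Net would each hold their own instance, since parameters are not shared:
# f_g, f_l = NgramExtractor(), NgramExtractor()
```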

GE-Net

The global feature extraction network GE-Net aims to learn the global semantic information of the input text to provide overall semantic features with a high degree of freedom and flexibility. GE-Net consists of an input layer Pos-Embed and a feature extractor $f_g$. The input of GE-Net is the text $x$ of length $Seq$. For non-BERT models, Pos-Embed first maps each word in the text to a $k$-dimensional continuous space to obtain the encoded word embedding matrix $X$. Then the feature extractor $f_g$ (TextCNN, LSTM, Transformer, etc.) is applied to the embedding matrix to obtain the semantic features $F_g$.

$F_g = f_g(\mathrm{Pos\text{-}Embed}(x))$  (6)

To capture text features at different scales, we introduce a multi-scale feature fusion mechanism:

$F_g = \sum_{i} \alpha_i F_g^{(i)}$  (7)

where $F_g^{(i)}$ is the convolution feature at the $i$-th scale, and $\alpha_i$ is the weight coefficient learned through an attention mechanism.

GE-Net provides a customizable framework for learning global semantic information using different feature extractors and representation models.
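
As an illustration of the multi-scale fusion in Eq. (7), the sketch below scores each scale's feature with a small attention head and takes the weighted sum; the single-linear scoring head is an assumption, since the exact attention form is not specified in the text.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Weighted sum of per-scale features with weights learned by a small attention head (Eq. 7)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)    # scores each scale's feature vector

    def forward(self, scale_feats):            # list of (batch, feat_dim), one entry per scale
        stacked = torch.stack(scale_feats, dim=1)           # (batch, num_scales, feat_dim)
        alpha = torch.softmax(self.score(stacked), dim=1)   # attention weights alpha_i
        return (alpha * stacked).sum(dim=1)                 # fused global feature F_g
```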

LA-Net

The local feature extraction network LA-Net, with adaptive text length, aims to learn the local semantic information of the text. By cutting the text according to an adaptive length, the model focuses on its local features. LA-Net consists of three parts: the input layer Pos-Embed, the adaptive text segmentation layer Split, and the feature extractor $f_l$. The text length is first judged and the text is segmented; features are then extracted from the adaptive-length segments. Specifically, the adaptive text segmentation layer Split is designed to segment the text, defined as follows:

$\mathrm{Split}(X) = \{X_1, X_2, \ldots, X_{Num}\}$  (8)

LA-Net then learns semantic information from the segmented text; the realization principle is shown in Fig. 3.

Fig. 3. LA-Net local feature extraction network. We used different degrees of yellow color to represent the different weights of the individual words in the sentence.

For the input text, the embedding matrix $X$ is obtained after encoding. A threshold $L$ is set for judging the length of the text, and $X$ is partitioned into $Num$ matrices with a window of length $L$. The principle is as follows:

$Num = \lceil Seq / L \rceil$  (9)
$X_j = X_{(j-1)L+1:\,jL}, \quad j = 1, 2, \ldots, Num$  (10)

The partitioned matrices are denoted $X_j$, where $j = 1, 2, \ldots, Num$; the dimension of $X_j$ is:

$X_j \in \mathbb{R}^{L \times Embed}$  (11)

The feature extractor $f_l$ is used to compute the feature of each part after segmentation. In the computation, the feature $F_1$ of $X_1$ is obtained first; when computing the feature $F_j$ of $X_j$, $X_j$ is first spliced with $F_{j-1}$ and then processed, and the feature of $X_j$ is fused with $F_{j-1}$ with the fusion ratio $\alpha$ to finally obtain the local semantic feature $F_l$.

$F_1 = f_l(X_1)$  (12)
$F_j = f_l([F_{j-1}; X_j]), \quad j = 2, \ldots, Num$  (13)
$F_l = \alpha F_{Num-1} + (1 - \alpha) F_{Num}$  (14)
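
The sketch below illustrates one reading of Eqs. (9)-(14), assuming the embedded text is cut into fixed windows of length L and the per-segment features are fused sequentially with ratio α; the splicing in Eq. (13) is simplified to fusing feature vectors, and `extractor` stands for any hypothetical f_l (e.g., the NgramExtractor above) applied to segments at least as long as its largest kernel.

```python
import torch

def adaptive_split(x, window=50):
    """Split the embedded text (seq_len, embed_dim) into Num windows of length <= window (Eqs. 9-11)."""
    return list(torch.split(x, window, dim=0))      # the last window may be shorter

def local_features(x, extractor, window=50, alpha=0.5):
    """Sequentially fuse per-segment features with ratio alpha (a simplified reading of Eqs. 12-14)."""
    segments = adaptive_split(x, window)
    f = extractor(segments[0].unsqueeze(0))         # F_1 from the first segment
    for seg in segments[1:]:
        f_j = extractor(seg.unsqueeze(0))           # feature of the next segment
        f = alpha * f + (1 - alpha) * f_j           # fuse with the running local feature
    return f                                        # final local semantic feature F_l
```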

To better capture the semantic structure of the text, we introduce a dynamic segmentation method based on semantic similarity. It ensures that segments are contextually meaningful, facilitating the extraction of fine-grained local features. Defining a semantic similarity function $\mathrm{sim}(\cdot, \cdot)$, the set of segmentation points $P$ can be represented as:

$P = \{p_1, p_2, \ldots, p_{k-1}\}$  (15)

where $1 \le p_1 < p_2 < \cdots < p_{k-1} < Seq$, and $k$ is the number of segments dynamically determined based on the text length. The adaptive segmentation method allows LA-Net to generalize well to various text lengths and domains. This adaptability minimizes information loss for shorter texts while preventing feature oversaturation for longer texts. For instance, in tasks with short texts (e.g., sentiment classification), smaller window sizes improve local feature granularity, leading to better performance. Conversely, for longer documents (e.g., topic classification), larger window sizes and dynamic segmentation ensure that global coherence is preserved.
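
For the similarity-driven variant around Eq. (15), a rough sketch is to place a split point wherever adjacent word vectors disagree; the cosine measure and the threshold value are our assumptions, as the paper does not fix them.

```python
import torch
import torch.nn.functional as F

def similarity_split_points(x, threshold=0.5):
    """Place a split point wherever neighbouring word vectors are dissimilar (our reading of Eq. 15)."""
    # x: (seq_len, embed_dim) embedded text
    sim = F.cosine_similarity(x[:-1], x[1:], dim=1)           # similarity of adjacent tokens
    return (sim < threshold).nonzero(as_tuple=True)[0] + 1    # indices where a new segment starts
```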

IE-Gate

In this article, we propose a gating mechanism for the fusion of global and local features. Specifically, we use the confidence of global features to evaluate the filtering, and only select the global semantics with higher confidence as the fusion input. This design can serve to filter noise and avoid the interference of irrelevant and invalid information in local features on the final semantics. At the same time, we use the confidence of global features as the weight of fusion to dynamically regulate the proportion of different semantic representations and realize the adaptive processing of long and short texts. In this way, we play the role of the global features’ orientation and also retain the effective supplementation of local information.

IE-Gate is designed to selectively implement local feature enhancement by finding, through confidence evaluation, the part of the features that needs to be enhanced more accurately and rationally. IE-Gate contains a gate component and a classifier $C$: it first performs confidence evaluation to obtain the distribution of confidence intervals of the global features, and then uses the local features to enhance the global features within the target confidence intervals, thus helping classification. GE-Net extracts the initial global feature $F_g$ of the text, and LA-Net, with adaptive text length, extracts the local feature $F_l$. To utilize the two flexibly, the gate component is designed to fuse $F_g$ and $F_l$ using the VALUE function to obtain the interactive semantic feature $F_{gl}$.

$g = \sigma(F_g)$  (16)
$F_{gl} = \mathrm{VALUE}(F_g, F_l) = \lambda F_g + (1 - \lambda)\, g \odot F_l$  (17)

where $\sigma$ is a Sigmoid function used to evaluate the confidence of the global feature $F_g$, and the parameter $\lambda$ functions as a memory gate that selectively integrates local feature enhancements based on the evaluated confidence of the global features. For instance, when $\lambda$ is close to 0, all enhanced information is retained, whereas a higher $\lambda$ filters out potential noise, relying more on the original global features. This selective enhancement, controlled by the $\lambda$-adjusted value function, ensures that noise-prone features are filtered out, improving the robustness and efficiency of the interaction. Attention is used in the classifier $C$ to combine the interaction semantic features $F_{gl}$ and the global features $F_g$ for classification into the final label $\hat{y}$.

$\hat{y} = C(\mathrm{Attention}(F_{gl}, F_g))$  (18)

If $\lambda = 1$, all enhanced information is discarded and self-attention is computed at this point. After passing through the softmax layer, the feature vector is mapped to the label classes to compute the loss. An optimizer is used to minimize the cross-entropy loss $L_{ce}$.

$L_{ce} = -\sum_{i} y_i \log \hat{y}_i$  (19)

To enhance the model’s discriminative ability, we introduce the concept of contrastive learning and design a new loss function:

$L = L_{ce} + L_{con}$  (20)

where $L_{ce}$ is the cross-entropy loss and $L_{con}$ is the contrastive loss. To achieve more fine-grained feature interaction, we introduce an interaction enhancement module based on multi-head attention:

$\tilde{F}_g = \mathrm{MHA}(F_g, F_l, F_l)$  (21)
$\tilde{F}_l = \mathrm{MHA}(F_l, F_g, F_g)$  (22)
$F_{gl} = W [\tilde{F}_g; \tilde{F}_l] + b$  (23)

where MHA represents the multi-head attention function, $F_g$ and $F_l$ are the global and local features respectively, and $W$ and $b$ are learnable parameters.

Unlike conventional attention mechanisms that assign weights based on direct feature similarities, the IE-Gate introduces a confidence evaluation attention mechanism, which dynamically evaluates the reliability of the global features ($F_g$) and selectively enhances them using the local features ($F_l$). The incorporation of the confidence parameter $\lambda$ enables the model to adaptively balance augmentation and noise suppression.
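
A minimal sketch of the confidence-gated fusion, following our reading of Eqs. (16)-(17), is given below; the exact form of the VALUE function is an assumption, chosen so that λ close to 0 keeps all enhanced information and λ = 1 falls back to the raw global feature, as described above.

```python
import torch

def ie_gate(f_g, f_l, lam=0.5):
    """Confidence-gated fusion of global and local features (one reading of Eqs. 16-17)."""
    g = torch.sigmoid(f_g)            # element-wise confidence of the global feature
    enhanced = g * f_l                # keep local information where the global feature is confident
    # lam close to 0 retains all enhanced information; lam = 1 relies only on the raw global feature
    return lam * f_g + (1 - lam) * enhanced
```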

To facilitate understanding of the above process, we summarize the entire process of AFIE-Net as Algorithm 1, which includes the correspondence between each step and the previous description.

Algorithm 1. Adaptive Feature Enhancement Network (AFIENet)
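
The algorithm figure itself is not reproduced in this text version; as a hypothetical rendering of the described pipeline, the skeleton below strings together the earlier sketches (`NgramExtractor`-style extractors, `local_features`, and `ie_gate`); all names and signatures are ours.

```python
import torch
import torch.nn as nn

class AFIENet(nn.Module):
    """High-level skeleton of the described pipeline; component names follow the earlier sketches."""
    def __init__(self, embed, f_g, f_l, gate, classifier, window=50, alpha=0.5):
        super().__init__()
        self.embed = embed              # Pos-Embed: token ids -> (batch, seq_len, embed_dim)
        self.f_g, self.f_l = f_g, f_l   # two independent feature extractors (no shared weights)
        self.gate = gate                # e.g. the ie_gate function sketched above
        self.classifier = classifier    # hypothetical attention-based classification head
        self.window, self.alpha = window, alpha

    def forward(self, tokens):
        x = self.embed(tokens)                                # embedding matrix X
        f_g = self.f_g(x)                                     # GE-Net: global feature F_g
        f_l = torch.stack([                                   # LA-Net: adaptive local feature F_l
            local_features(sample, self.f_l, self.window, self.alpha).squeeze(0)
            for sample in x
        ])
        f_gl = self.gate(f_g, f_l)                            # IE-Gate: interactive feature F_gl
        return self.classifier(f_gl, f_g)                     # logits over the label classes
```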

Experimental procedure and analysis

Dataset

Our experimental dataset consists of two public datasets and one self-constructed dataset, the details of the dataset are shown below:

THUCNews: a dataset for news classification, covering finance and economics, real estate, stocks, education, science and technology, society, current affairs, sports, games, and entertainment, i.e., 10 categories in total with 20,000 samples per category and 200,000 samples overall. The average text length is 22.34 words, making it a fine-grained short text dataset with a balanced category distribution.

ChnSentiCorp: a hotel review dataset divided into 2 categories, positive and negative reviews, totaling 7,765 samples, of which 5,322 are positive and 2,443 are negative. The average text length is 128.52 words, making it a coarse-grained long text dataset with an unbalanced category distribution.

ChemicalPro: To better reflect the effectiveness of the model on specialized domain text data, the Chinese domain text we used is a self-constructed chemical-domain text dataset. Chemical-domain text is characterized by high specialization and sparse association between words and meanings, which better reflects the characteristics of domain text data. The data were collected from chemical product information on several major domestic chemical trading platforms, such as China Chemical Manufacturing Website, Chemical Products Website, and Gaide Chemical Website. The dataset has a total of 128,053 samples belonging to 17 categories, such as organic chemicals, inorganic chemicals, chemical intermediates, and biochemicals. The average text length is 533.77 words, and the number of samples per category is unevenly distributed.

Implementation details

To validate the effectiveness of our proposed method, experimental analyses are carried out using the following models as feature extractors.

TextCNN7: a convolutional neural network model proposed by Kim et al. in 2014. It extracts the local features of text through convolutional kernels of different sizes and fuses them to get the text representation. TextCNN is a typical CNN-based text feature extractor.

DPCNN8: a convolutional neural network-based text classification model proposed by Johnson in 2017, which constructs text representations through region-level supervision.

LSTM11: a long short-term memory network based text classification model proposed by Liu et al. in 2016, designed so that long-range sequential information can be preserved in text modeling.

Transformer13: an encoding model of the full attention mechanism proposed by Vaswani et al. in 2017, with the advantages of parallel computing and long-range dependency modeling.

MacBERT32: a Chinese pre-trained language model proposed by Cui et al. in 2020. It pre-trains and encodes semantic knowledge on large-scale Chinese corpus. MacBERT is one of the most effective Chinese text feature extraction models.

The specific configuration of the above feature extractors is as follows:

  1. DPCNN: 250 convolutional filters are used with filter sizes [3, 4, 5].

  2. TextCNN: we use filter sizes of [2, 3, 4] with the number of hidden units set to 250.

  3. RNN: a 3-layer LSTM is used as the feature extractor with the hidden size set to 256 for each layer.

  4. Transformer: 1 layer of the Transformer encoder is used as the feature extractor.

  5. MacBERT: the released model file with its own pre-trained parameters is used.

In the LA-Net network, the text length threshold $L$ is set to 50, and the feature fusion ratio $\alpha$ after text segmentation is set to 0.5. In IE-Gate, the Gate function sets the memory parameter $\lambda$ for the fusion of $F_g$ and $F_l$ to 0.5.
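
For reference, these settings can be collected into a single configuration; the key names below are ours, not from the released code.

```python
# Reported hyperparameters restated as a configuration dict (key names are illustrative).
AFIE_CONFIG = {
    "la_net_window": 50,              # text length threshold L / segment window in LA-Net
    "fusion_ratio_alpha": 0.5,        # segment feature fusion ratio in LA-Net
    "gate_lambda": 0.5,               # memory parameter of the IE-Gate VALUE function
    "textcnn_filters": 250, "textcnn_kernels": [2, 3, 4],
    "dpcnn_filters": 250, "dpcnn_kernels": [3, 4, 5],
    "lstm_layers": 3, "lstm_hidden": 256,
    "transformer_encoder_layers": 1,
}
```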

Evaluation metrics

In the experiments, Precision (P), Accuracy (A), and F1-score are used as evaluation metrics to assess the classification effect of the model. Precision is the proportion of correctly predicted samples out of all samples predicted to be positive; a higher Precision indicates that the model makes fewer false positive predictions, which is particularly critical in scenarios where the cost of false positives is high. The value of Precision reflects the model's ability to accurately identify relevant instances. The formula is calculated as:

$P = \frac{TP}{TP + FP}$  (24)

The Accuracy refers to the ratio of the number of correct samples predicted by the model to the total number of samples, which provides an overall measure of the model’s correctness across all predictions. The formula is calculated as:

$A = \frac{TP + TN}{TP + FP + TN + FN}$  (25)

Recall refers to the proportion of samples that are correctly predicted out of all samples that are truly positive; a higher Recall demonstrates the model's capacity to identify more relevant instances from the dataset, reducing false negatives, which is important in applications where missing relevant instances is costly. The formula is calculated as:

$R = \frac{TP}{TP + FN}$  (26)

The F1-score value is used to combine precision and recall for a comprehensive evaluation of model effectiveness, which balances Precision and Recall, providing a comprehensive view of the model’s performance, particularly when there is an imbalance between false positives and false negatives. The formula is calculated as:

$F1 = \frac{2 \times P \times R}{P + R}$  (27)

In the formulas for the above evaluation metrics, TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives in the predictions.
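
As a small illustrative helper, the four metrics of Eqs. (24)-(27) can be computed directly from the confusion-matrix counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, Accuracy, Recall, and F1-score from confusion-matrix counts (Eqs. 24-27)."""
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, accuracy, recall, f1

# Hypothetical counts: 90 TP, 10 FP, 85 TN, 15 FN
print(classification_metrics(90, 10, 85, 15))   # roughly (0.90, 0.875, 0.857, 0.878)
```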

Analysis of experimental results

Model validity assessment

Considering the generality, validity, and accuracy of the proposed method for different datasets and different classification models, accuracy and F1-score are used as evaluation metrics in the experiments. The experimental results are shown in Table 1. The experiments compare the classification effectiveness of the TextCNN, DPCNN, LSTM, and Transformer models with and without the AFIENet network. The experiments use static word vectors obtained from pre-training with GloVe37 as word embeddings.

Table 1.

Performance comparison of AFIENet combined with different backbone network on THUCNews, ChnSentiCorp, and ChemicalPro datasets (%).

Method THUCNews ChnSentiCorp ChemicalPro Average
Accuracy F1 Accuracy F1 Accuracy F1 Accuracy F1
TextCNN 91.02 91.03 85.51 85.30 86.45 86.44 87.66 87.59
AFIE-TextCNN 97.55 97.55 88.64 88.65 88.24 88.20 91.48 91.47
DPCNN 91.15 91.12 84.93 84.83 85.72 85.63 87.27 87.19
AFIE-DPCNN 93.62 93.59 86.05 86.09 87.86 87.82 89.18 89.17
LSTM 90.99 91.00 83.52 83.38 81.88 82.01 85.46 85.46
AFIE-LSTM 91.60 91.56 85.45 85.12 84.52 84.42 87.19 87.03
Transformer 89.73 89.74 82.42 81.90 78.59 78.49 83.58 83.38
AFIE-Transformer 90.01 90.01 83.13 82.87 81.37 81.29 84.84 84.72

AFIE-* indicates that both GE-Net and LA-Net in AFIE-Net use * as the feature extractor.

Models * and AFIE-* form a comparison group. Bold marks the group with better experimental performance in each comparison.

  1. The experimental results show that the average accuracy of TextCNN, DPCNN, LSTM, and Transformer using the AFIENet network on the three datasets is improved by 3.82, 1.63, 1.73, and 1.26 percentage points, respectively, compared with the original models, which validates the effectiveness of the proposed method. The analysis reveals that GE-Net can extract the global semantic features of the text, while LA-Net learns segmented features according to the length of the text, so models using AFIENet can focus on more local feature information and thus improve accuracy. For models with strong local feature extraction, AFIENet enhances their global modeling capabilities; for models with strong global feature extraction, the segmented feature interaction mechanism in AFIENet enhances their local feature learning. Therefore, AFIENet can significantly improve the effectiveness of different feature extractors on multiple datasets. In particular, the performance improvement is more significant when TextCNN, which is strong in local feature learning, is used as the feature extractor.

  2. On the THUCNews dataset, the four models improved their accuracy by 2.47 percentage points on average. The average improvement on the ChnSentiCorp dataset was 1.72 percentage points, and the average improvement on the ChemicalPro dataset was 2.35 percentage points. THUCNews and ChemicalPro are fine-grained datasets and improve even more with the AFIENet network. This is due to the large differences between fine-grained text categories, where LA-Net's segmented feature extraction can find effective features. ChnSentiCorp is a coarse-grained dataset with a small number of texts and little variation; although LA-Net can also focus on local information, the enhancement is not obvious.

  3. Although AFIE-TextCNN uses only the combined structure of TextCNN and AFIENet, it achieves good results on the THUCNews dataset. Analyzing the reason, TextCNN extracts features using convolutional kernels of different sizes and focuses more on local information compared to other models. THUCNews is a short text dataset with dense information, and the local modeling capability of TextCNN is fully exploited. Both GE-Net and LA-Net in AFIENet use TextCNN to enhance global and local feature learning. So the fusion of TextCNN with AFIENet can improve the performance on short text datasets such as THUCNews. In addition, AFIE-TextCNN achieves good results on the ChemicalPro dataset.

We compared the performance of the MacBERT-only model and the AFIE-* models built on top of MacBERT pre-training on multiple datasets. The variation trends of training and validation loss, classification accuracy, and F1-score during the training process are shown in Fig. 4. The results of the experiments are shown in Table 2. Five independent experiments were repeated for each AFIE-* model, and the mean and standard deviation of the metrics were obtained.

Fig. 4. Visualization of the training process on the THUCNews, ChnSentiCorp, and ChemicalPro datasets, respectively.

Table 2.

Performance comparison of AFIENet pre-trained on MacBERT (%).

Model name THUCNews ChnSentiCorp ChemicalPro
Accuracy F1 Accuracy F1 Accuracy F1
MacBERT 94.56 94.56 90.60 90.60 88.57 88.49
AFIE-TextCNN 97.55 97.55 88.64 88.65 88.24 88.20
MacBERT + AFIE-TextCNN 97.91 (±0.3) 97.95 (±0.3) 89.97 (±0.2) 89.95 (±0.2) 88.68 (±0.1) 88.68 (±0.1)
AFIE-DPCNN 93.62 93.59 86.05 86.09 87.66 87.62
MacBERT + AFIE-DPCNN 94.32 (±0.2) 94.33 (±0.2) 87.41 (±0.1) 87.45 (±0.1) 88.56 (±0.3) 88.56 (±0.3)
AFIE-LSTM 91.60 91.56 85.45 85.12 84.52 84.42
MacBERT + AFIE-LSTM 92.12 (±0.2) 92.11 (±0.2) 85.97 (±0.1) 85.99 (±0.1) 85.37 (±0.2) 85.36 (±0.2)

AFIE-* indicates that both GE-Net and LA-Net in AFIE-Net use * as the feature extractor.

Models AFIE-* and MacBERT + AFIE-* serve as a comparison group. The result of the group with better experimental performance is highlighted in bold.

Table 2 shows that the MacBERT model has an average accuracy of 91.24% on the three datasets THUCNews, ChnSentiCorp, and ChemicalPro. After fusing the AFIE module on top of MacBERT pre-training, the average accuracy improves to 92.68%, 90.49%, and 88.20%, respectively. In particular, when TextCNN is used as the feature extractor, MacBERT + AFIE-TextCNN shows varying degrees of accuracy improvement over MacBERT, with an average improvement of 1.44 percentage points. This indicates that the AFIENet we proposed can effectively enhance the text modeling capability of the model and can incorporate the advantages of pre-training. The model did not improve much on the ChnSentiCorp dataset; the reason is that the ChnSentiCorp training set has a small sample size and the model is not adequately trained.

Model generalization assessment

We tested the model inference speed under the same hardware conditions; the GPU is an "NVIDIA Quadro RTX4000" and the CPU is an "Intel(R) Xeon(R) Silver 4210R". TextCNN, DPCNN, LSTM, Transformer, the corresponding models using the AFIENet network, and the MacBERT model were selected for the experiments to compare their effects on the THUCNews dataset. The experiments also measured the number of parameters and the inference speed of each method. AFIE-* denotes the proposed AFIENet network in which both GE-Net and LA-Net use * as the feature extractor. The number of parameters and the inference speed of each model, compared with the corresponding model using the AFIENet network, are given in Table 3, in which M stands for million and ms/it stands for the inference time per data item.

Table 3.

Comparison of number of parameters and inference speed of different models.

Model name Number of parameters THUCNews
Inference speed (ms/it) F1 (%)
TextCNN 2.13M (1×) 5.52 91.03
AFIE-TextCNN 2.14M (1.01×) 9.10 97.55
DPCNN 2.21M (1×) 5.17 91.15
AFIE-DPCNN 2.25M (1.02×) 8.65 93.62
LSTM 5.04M (1×) 2.27 91.00
AFIE-LSTM 5.24M (1.04×) 6.53 91.56
Transformer 3.48M (1×) 3.68 89.74
AFIE-Transformer 3.49M (1.01×) 7.32 90.01
MacBERT 102.9M 42.25 94.56

Models * and AFIE-* form a comparison group. Bold marks the group with better experimental performance in each comparison.

As we can see in Table 3, the number of parameters in AFIENet increased by at most 4% compared to the original model, but the accuracy improved by up to 6.52 percentage points. The reason is that AFIENet does not change the structure of the original model and only enhances the representation through adaptive-length local feature extraction and global feature interaction, so performance can be effectively improved with little impact on the original model. In terms of inference time, the AFIENet network increases it by about 100 percent on average; however, since the inference time of the original models is short to begin with, the absolute inference speed of the AFIE models remains fast and still meets the needs of practical applications. AFIE-TextCNN has only 2.1% of the number of parameters of MacBERT, but its inference speed is 4.64 times faster than MacBERT, and its accuracy is also higher by 2.99 percentage points. The main reason for this is the fast raw inference time of the network, which maintains its speed advantage after using AFIENet.

Comparison with state-of-the-art methods

To demonstrate the advancement of the method proposed in this article, we compare it with state-of-the-art (SOTA) text classification methods from the past few years. Specifically, for backbone networks such as TextCNN, DPCNN, LSTM, and Transformer, the results when using them alone are presented. In addition, the performance of SOTA methods including RBTL3, ELECTRA, BERT-CNN, and ERNIE is reported for comparison. The comparison results are shown in Table 4, which shows that our proposed AFIE-Net has the best performance when using TextCNN as the backbone network and is also very competitive with other backbone networks.

Table 4.

Performance comparison results with state-of-the-art methods.

Model name THUCNews ChnSentiCorp ChemicalPro
Accuracy F1 Accuracy F1 Accuracy F1
TextCNN 91.02 91.03 85.51 85.30 86.45 86.44
DPCNN 91.15 91.12 84.93 84.83 85.72 85.63
LSTM 90.99 91.00 83.52 83.38 81.88 82.01
Transformer 89.73 89.74 82.42 81.90 78.59 78.49
RBTL3 93.30 93.29 86.50 86.52 85.32 85.33
ELECTRA 93.68 93.68 88.53 88.51 84.79 84.78
BERT-CNN 91.28 91.26 89.17 89.15 85.91 85.88
ERNIE 94.07 94.08 89.37 89.37 86.09 86.11
AFIE-TextCNN 97.55 97.55 88.64 88.65 88.24 88.20
AFIE-DPCNN 93.62 93.59 86.05 86.09 87.86 87.82
AFIE-LSTM 91.60 91.56 85.45 85.12 84.52 84.42
AFIE-Transformer 90.01 90.01 83.13 82.87 81.37 81.29

The bold marks the group with the optimal experimental performance among all models, aiming to highlight the best experimental result.

Ablation study

To illustrate the role of each part of the feature enhancement network AFIENet more effectively, we conducted ablation experiments on the THUCNews dataset, using F1-score as the evaluation metric. The experimental results are shown in Table 5, where the feature extractors $f_g$ and $f_l$ in GE-Net and LA-Net use TextCNN, DPCNN, a 3-layer LSTM, and Transformer, respectively. The experiments compare three scenarios: using GE-Net only; using GE-Net and LA-Net but not IE-Gate; and using GE-Net, LA-Net, and IE-Gate.

Table 5.

Ablation study of the proposed GE-Net, LA-Net, and IE Gate (%).

Model name Group GE FLA Gate F1
TextCNN Group A ✓ ✓ ✓ 97.55
A1 ✓ – – 91.03
A2 ✓ ✓ – 90.81
DPCNN Group B ✓ ✓ ✓ 93.62
B1 ✓ – – 91.12
B2 ✓ ✓ – 90.97
LSTM (3) Group C ✓ ✓ ✓ 91.56
C1 ✓ – – 91.00
C2 ✓ ✓ – 90.69
Transformer Group D ✓ ✓ ✓ 90.01
D1 ✓ – – 89.74
D2 ✓ ✓ – 89.80

In the ablation experiment with "Model name" as a group, the result with the optimal performance in this group is emphasized in bold.

As can be seen from the experimental results in Table 5, relative to the control groups A, B, C, and D, the ablation settings of groups A1, B1, C1, and D1 decreased the F1-score by 6.52, 2.50, 0.56, and 0.27 percentage points, respectively. This is mainly due to ignoring the local features of LA-Net, which leads to the loss of keyword information and thus poorer experimental results. Meanwhile, the ablation settings of groups A2, B2, C2, and D2 decreased the F1-score by 6.74, 2.65, 0.87, and 0.21 percentage points, respectively, which is also a large decrease. The analysis shows that this is because IE-Gate selects the enhanced local features based on the global features of GE-Net; without this mechanism, the two are simply spliced together, which introduces unnecessary perturbations.

To further validate the adaptive feature extraction ability of each model in the local feature extraction network LA-Net, we conducted experiments on the THUCNews dataset using different models as feature extractors in LA-Net. The results are shown in Table 6, where accuracy and F1-score were used as evaluation metrics. LA-* denotes the use of * as the feature extractor in LA-Net; TextCNN, DPCNN, LSTM with 2 and 3 layers, and Transformer were selected.

As we can see from Table 6, the best accuracy of 91.54% was achieved using TextCNN as the feature extractor for LA-Net. In contrast, the results of the remaining models are slightly lower. The analysis reveals that this is mainly due to the fact that THUCNews belongs to a short text dataset with a concentrated feature distribution. In this scenario, local feature extraction with TextCNN after adaptive segmentation can achieve a better representation.

Table 6.

Performance comparison of LA-Net using different feature-enhanced networks (%).

Model name THUCNews
Accuracy F1
LA_TextCNN 91.54 91.54
LA_DPCNN 91.43 91.42
LA_RNN (2) 90.41 90.38
LA_RNN (3) 90.13 90.12
LA_Transformer 90.19 90.16

The experimental result of the group with the optimal performance is marked in bold.

Conclusion

In this article, we propose an Adaptive Feature Interaction Enhancement Network (AFIENet) for text classification. The model utilizes an internal gating mechanism to incorporate both global and local key information for feature enhancement. Specifically, AFIENet contains two subnetworks for global and local feature learning. For local feature extraction, an adaptive text length-based network is proposed to focus on local contexts. Then, through the designed interaction enhancement gate, local features are selectively fused with global features based on the global feature confidence distribution generated by the gating mechanism. This achieves effective enhancement of global features.

Currently, research on local feature fusion through interaction enhancement gates is still rare. We validated the effectiveness of this fusion mechanism through comparative experiments. It should be noted that, due to length limitations, only several typical models were selected for experimental verification. In future work, richer types of feature extraction networks could be explored, or AFIENet could be extended to other architectures such as graph neural networks (e.g., graph convolutional networks) and generative adversarial networks, to further enrich the application scenarios of this fusion mechanism.

Acknowledgements

The author would like to thank his supervisor Professor Gao Shangbing for his patient guidance and constant encouragement. The author also wishes to express his gratitude to his colleague Zhang Junqiang for his kind help and useful suggestions. This work was partially inspired by the previous study on text classification using multi-granularity dynamic semantic representations by Zhang Junqiang et al. Funding: This work was supported by the General Program of the National Natural Science Foundation of China (62076107); the National Key R&D Plan (2018YFB1004904); the Six Talent Peak Funding Projects in Jiangsu Province (XYDXXJS-011); and the Huaiyin Institute of Technology Graduate Science and Technology Innovation Program (HGYK202314).

Author contributions

We declare that this manuscript is original, has not been published before and is not currently being considered for publication elsewhere. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We wish to confirm the following contributions to the paper:

1. GAO Shangbing (Corresponding Author): conceived and designed the study; supervised the research project; provided critical revision of the manuscript for important intellectual content; approved the final version to be published; responsible for the overall direction and planning.

2. SU Rui: developed the AFIENet architecture; implemented the model and conducted experiments; analyzed and interpreted the data; drafted the initial manuscript.

3. ZHAO Kefan: assisted in the design of the adaptive text segmentation module; contributed to the data analysis and interpretation; helped in the preparation and review of the manuscript.

4. ZHANG Junqiang: contributed to the development of the interactive gate mechanism; assisted in the experimental design and data collection; participated in the critical revision of the manuscript.

All authors contributed to the conceptual development of the AFIENet model. They were all involved in the discussion of results, drawing conclusions, and revising the manuscript. Each author has sufficiently participated in the work to take public responsibility for appropriate portions of the content. We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs.

Data availability

The datasets used and analyzed in the current study are available from the corresponding author upon reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Zhou, Y. A review of text classification based on deep learning. In Proceedings of the 2020 3rd International Conference on Geoinformatics and Data Analysis, 132–136 (2020).
  • 2. Hu, Y., Zhao, Q., Chen, L. & Yongjie, N. A fusion network for Chinese news text classification. J. Chin. Inf. Process. 35, 107–114 (2021).
  • 3. Zhao, X. & Shi, S. Review of cross-domain sentiment analysis. J. Beijing Inf. Sci. Technol. Univ. 34, 73–79 (2020).
  • 4. Huang, P., Huang, P., Ding, J., Ai, W. & Zhang, J. Latent topic-enriched co-attention networks for domain classification. J. Chin. Inf. Process. 34, 73–79 (2020).
  • 5. Wan, J. & Wu, Y. Review of text classification research based on deep learning. J. Tianjin Polytech. Univ. 27, 41–47 (2021).
  • 6. Zhang, Z. Review on the application of deep transfer learning in text classification. Information Technology and Information, 121–124 (2023).
  • 7. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751, 10.3115/v1/D14-1181 (Association for Computational Linguistics, Doha, Qatar, 2014).
  • 8. Wang, P. et al. Semantic clustering and convolutional neural network for short text categorization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 352–357 (2015).
  • 9. Joulin, A., Grave, E., Bojanowski, P. & Mikolov, T. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers (Association for Computational Linguistics, 2017).
  • 10. Lipton, Z. C. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019 (2015).
  • 11. Liu, P., Qiu, X. & Huang, X. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2873–2879 (2016).
  • 12. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014 (2014).
  • 13. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (2017).
  • 14. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186, 10.18653/v1/N19-1423 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).
  • 15. Zhang, J., Gao, S., Su, R. & Li, W. Text classification model incorporating multi-granularity dynamic semantic representation. Nanjing Xinxi Gongcheng Daxue Xuebao 15, 148–159 (2023).
  • 16. Yang, Y., Liu, E., Gu, C. & Pei, S. Research on short text classification model combined with word vector for sparse data. J. Comput. Appl. 39, 750 (2022).
  • 17. Li, B. et al. Short text classification model combining knowledge aware and dual attention. J. Softw. 33, 3565–3581 (2022).
  • 18. Sun, X., Tang, Z., Zhao, Y. & Zhang, Y. Hierarchical networks with mixed attention for text classification. J. Chin. Inf. Process. 35, 69–77 (2021).
  • 19. Zhang, Y., Jia, Y. & Liu, J. A multi-feature fusion method for long text classification. J. China Acad. Electr. Inf. Technol. 15, 910–916 (2020).
  • 20. Fei, Z., Dehao, W. & Sheng, H. Text classification method based on LDA and deep learning. Comput. Eng. Des. 41, 2184–2189 (2020).
  • 21. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
  • 22. Greco, C. M., Simeri, A., Tagarelli, A. & Zumpano, E. Transformer-based language models for mental health issues: A survey. Pattern Recogn. Lett. 167, 204–211 (2023).
  • 23. Sun, Y. et al. A length-extrapolatable transformer. In The 61st Annual Meeting of the Association for Computational Linguistics (2023).
  • 24. Achiam, J. et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • 25. Wang, X. et al. Large-scale multi-modal pre-trained models: A comprehensive survey. Mach. Intell. Res. 20, 447–482 (2023).
  • 26. Ma, J. et al. Text classification method based on LDA and deep learning. Mini-Micro Syst. 42, 516–521 (2021).
  • 27. Li, X., Li, Z., Xie, H. & Li, Q. Merging statistical feature via adaptive gate for improved text classification. Proc. AAAI Conf. Artif. Intell. 35, 13288–13296 (2021).
  • 28. Xu, W., Chen, J., Ding, Z. & Wang, J. Text sentiment analysis and classification based on bidirectional gated recurrent units (GRUs) model. arXiv preprint arXiv:2404.17123 (2024).
  • 29. Yang, C., Wang, Y., Yang, B. & Chen, J. Graformer: A gated residual attention transformer for multivariate time series forecasting. Neurocomputing 581, 127466 (2024).
  • 30. Guo, W. & Wang, Y. Convolutional gated recurrent unit-driven multidimensional dynamic graph neural network for subject-independent emotion recognition. Expert Syst. Appl. 238, 121889 (2024).
  • 31. Yang, Z. et al. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489 (2016).
  • 32. Cui, Y. et al. Revisiting pre-trained models for Chinese natural language processing. arXiv preprint arXiv:2004.13922 (2020).
  • 33. Zhan, X., Shi, C., Li, L., Xu, K. & Zheng, H. Aspect category sentiment analysis based on multiple attention mechanisms and pre-trained models. Appl. Comput. Eng. 71, 21–26 (2024).
  • 34. Gangrade, S., Sharma, P. C., Sharma, A. K. & Singh, Y. P. Modified DeepLabV3+ with multi-level context attention mechanism for colonoscopy polyp segmentation. Comput. Biol. Med. 170, 108096 (2024).
  • 35. Khan, S. U. R. & Asif, S. Oral cancer detection using feature-level fusion and novel self-attention mechanisms. Biomed. Signal Process. Control 95, 106437 (2024).
  • 36. Xiao, L. et al. Exploration of attention mechanism-enhanced deep learning models in the mining of medical textual data. arXiv preprint arXiv:2406.00016 (2024).
  • 37. Pennington, J., Socher, R. & Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543 (2014).
