Abstract
In existing multimodal sentiment analysis methods, only the last layer output of BERT is typically used for feature extraction, neglecting abundant information from intermediate layers. This paper proposes an Aspect-level Multimodal Sentiment Analysis Model with Multi-scale Feature Extraction (AMSAM-MFE). The model conducts sentiment analysis on both text and images. For text feature extraction, it incorporates a Multi-scale Layer module based on BERT and utilizes aspect terms to supervise text feature extraction, enhancing text processing performance. For image feature extraction, the model employs a pre-trained Resnest269 model with a specially designed Supervision Layer to improve effectiveness. For feature fusion, the Tensor Fusion Network method is adopted to achieve comprehensive interaction between visual and textual features. Experimental comparisons with other multimodal sentiment analysis models on Twitter2015 and Twitter2017 datasets demonstrated that the proposed multi-scale feature extraction model achieved improved accuracy and F1 scores in aspect-level multimodal sentiment analysis tasks, showing superior classification effectiveness compared to traditional multimodal sentiment analysis models.
Keywords: Aspect-level multimodal sentiment analysis, Multi-scale feature extraction, Aspect terms, Tensor fusion network
Subject terms: Mathematics and computing, Computer science, Information technology
Introduction
With the rapid development of information technology, channels for acquiring information have become increasingly diverse, and multimodal data such as text, images, audio, and video have emerged as crucial carriers for information transmission. Traditional sentiment analysis primarily focuses on textual information and has become a common solution in many industries, including film and television prediction, financial forecasting, and election outcome prediction. However, with the widespread adoption of multimedia platforms like Twitter, Facebook, and Instagram, images and text frequently co-occur to express sentiments, rendering single-modality sentiment analysis insufficient to meet practical demands. Over the past decade, this issue has garnered increasing attention from both academia and industry1.
In previous literature2–5, researchers have proposed numerous methods for sentiment analysis of target entities. These methods rely on many hand-crafted features, which are then fed into linear classifiers. As deep learning has become increasingly prevalent in natural language processing, various neural network structures have emerged for entity-level sentiment classification, including Recursive Neural Networks6, Convolutional Neural Networks7, and Recurrent Neural Networks8. In recent years, the application of BERT9 in sentiment analysis has achieved impressive results.
Traditional sentiment analysis typically operates at the sentence or document level10–12, aiming to determine the general sentiment of the entire text. This approach relies on the premise that the text conveys a unified sentiment about a single subject, which might not accurately reflect real-world scenarios. Recognizing this limitation, aspect-level sentiment analysis, which focuses on more detailed aspect-specific opinions and sentiments, has gained significant interest over the last decade1,13,14.
Aspect-level sentiment analysis emerged as an advancement over traditional sentiment analysis techniques. Initially, sentiment analysis was conducted at the sentence level and discourse level, primarily concentrating on the overall emotional tone of the text while overlooking its internal structure and nuanced details. To achieve a more precise capture of emotional nuances, researchers shifted their focus to aspect-level sentiment analysis. It takes the opinion target in the sentence as the research object and identifies its sentiment polarity (e.g., positive, negative, or neutral)15. This approach surpasses traditional sentence and discourse level analyses by providing a finer-grained revelation of users’ sentiments towards specific aspects, doing so with greater accuracy and effectiveness, and thus holding higher practical value. Consider the sentence, “The food at this restaurant is very delicious, but the service attitude needs improvement.” Here, we can discern two distinct aspects: the user’s positive appraisal of the food (“the food is very delicious”) and their negative assessment of the service (“the service attitude needs improvement”). Aspect-level sentiment analysis provides a more precise sentiment evaluation of this sentence compared to earlier methods. Through this illustration, it becomes evident that aspect-level sentiment classification offers a more detailed analysis of users’ emotional inclinations across various aspects, thereby delivering more refined insights.
In recent years, researchers have achieved significant results in aspect-level multimodal sentiment analysis. Several prominent models exhibit distinct characteristics. Tsai et al.16 proposed the MulT model, which uses a directional pairwise cross-modal attention mechanism to enable interaction between multimodal sequences with different time steps, potentially transferring information across modalities. However, its attention mechanism might not sufficiently bridge the semantic gap inherent in raw, heterogeneous modalities (e.g., text, audio, visual), and it incurs substantial computational overhead. Khan et al.17 employed a pre-trained Transformer model to translate images into descriptive statements, effectively transforming the multimodal task into a unimodal one. While this approach simplifies the process, the image-to-text translation inevitably discards crucial visual details and affective cues, essentially circumventing the core challenge of genuine multimodal fusion. Xu et al.18 designed the MIMN model to better leverage aspect terms for supervision, introducing two interactive memory networks that use aspect terms to guide the generation of the corresponding text and image features, achieving a degree of interaction. Nevertheless, precisely how aspect terms guide feature generation may be underexplored, and the interaction paradigm appears relatively rigid. Yu et al.19 proposed the ESAFN model, which emphasizes entity location information. It employs gating mechanisms to supervise image features and performs sentiment analysis via bilinear interaction fusion. This reliance on entity location makes its performance sensitive to the accuracy of entity detection, and its fusion mechanism may have limitations in capturing complex, fine-grained interdependencies between cross-modal features.
The multimodal approaches described above achieve better results than single-modality sentiment analysis. However, they still lack comprehensive consideration of feature extraction and information exchange across modalities. Existing approaches in sentiment analysis predominantly rely on the final-layer outputs of BERT for feature representation, while neglecting the potentially valuable semantic information encoded in intermediate layers. More importantly, current aspect-level multimodal sentiment analysis frameworks have not adequately explored the integration of multi-scale feature extraction mechanisms, despite their demonstrated effectiveness in related domains. Our empirical observations reveal that the performance of existing image-text based aspect-level multimodal sentiment analysis systems remains suboptimal, with considerable room for improvement in both prediction accuracy and model robustness. To address these limitations, we propose a novel Aspect-level Multimodal Sentiment Analysis Model with Multi-scale Feature Extraction (AMSAM-MFE), which systematically incorporates multi-scale representations and cross-modal interactions.
The Multi-scale Layer structure proposes a novel text feature extraction approach. Specifically, it first concatenates outputs from all BERT layers with the final layer’s output, forming a composite representation. This representation is then fed in parallel into an attention layer (Att) and an LSTM layer. The outputs of these two layers are integrated to collectively construct inherent associations and hierarchical order information across BERT layers. Critically, this architecture mitigates intermediate layer information loss inherent in standard BERT, while simultaneously enabling comprehensive preservation of original information and extraction of enhanced contextual understanding.
For visual processing, we employed ResNeSt26920, a parallel multi-branch network architecture derived from ResNeSt. This architecture enhances ResNet through a Split Attention module and cross-feature-map attention mechanisms, achieving effective multi-scale image feature extraction. The structure preserves both shallow and deep features through residual connections while capturing features from varied receptive fields. Additionally, we designed an Aspect-guided Supervision Layer to align image features with aspect terms, ensuring aspect-aware visual representation.
In the field of multimodal feature fusion research, the Tensor Fusion Network21 serves as a foundational framework due to its ability to explicitly model high-order modal interactions. By augmenting feature vectors with a unity dimension and computing their Cartesian product, TFN simultaneously preserves unimodal features and multimodal combinatorial features, thereby comprehensively capturing fine-grained interactions essential for sentiment analysis. In contrast, simple concatenation fusion, while computationally efficient, merely stacks feature dimensions while ignoring dynamic inter-modal relationships, leading to sentiment inference biases. The Low-Rank Multimodal Fusion22 method reduces computational complexity from exponential to linear through tensor decomposition, but its low-rank approximation sacrifices high-order interaction information, weakening the model’s capacity to capture complex affective cues.
Recent studies attempt to address these limitations but face new challenges. For instance, cross-modal alignment approaches (e.g., MocolNet23) rely on contrastive learning to bridge modal semantic gaps, yet their performance depends heavily on negative sample quality, and they fail to explicitly model tri-modal joint interactions. Text-centric fusion frameworks (e.g., Vanessa24) align multimodal temporal information via attention mechanisms but overly prioritize textual dominance, diminishing contributions from non-semantic cues like visual aesthetics. Aesthetic-enhanced models25 incorporate visual connotations to enrich sentiment dimensions, yet require predefined aesthetic rules or additional annotations, limiting generalizability.
Through comparative experiments with existing multimodal sentiment analysis models on Twitter2015 and Twitter2017 datasets, our model demonstrated improvements in both accuracy and F1-score for aspect-level sentiment analysis. The results indicated that our multi-scale feature extraction framework achieved superior classification performance compared to traditional multimodal approaches.
Aspect-level multimodal sentiment analysis model
This model mainly consists of three modules: image feature extraction module, text feature extraction module, and feature fusion module. The network structure of the model is shown in Fig. 1.
Fig. 1.
Overall network structure.
Text feature extraction module
Word embedding techniques such as Word2vec26 and GloVe27 were often used for text feature extraction in the past. These methods offer good language comprehension, fast training, and stable results. However, compared to the BERT model, they cannot fully consider the contextual relationships between words. Given that human understanding of words relies heavily on context, using these methods directly in complex contexts may not be effective. BERT is first pre-trained on a large amount of text to learn universal language features and then fine-tuned on specific tasks. This transfer learning approach enables BERT to adapt quickly to various downstream tasks such as sentiment analysis. Because it is trained with a bidirectional self-attention mechanism, it can capture and use contextual information in depth, thereby achieving more accurate text feature extraction. Therefore, to better handle the complexity of human emotional expression, this study employs the BERT model for text feature extraction.
Multi-scale Layer
In sentiment analysis tasks, traditional methods use only the last-layer output of BERT as features and ignore the rich information in the intermediate layers, which yields less diverse features and underuses the model. Extensive research in computer vision has shown that multi-scale feature extraction can exploit intermediate-layer features, expand the receptive field, and provide more comprehensive and fine-grained representations, thereby improving overall performance. For example, the Feature Pyramid Network28 applies a series of operations to feature maps at different scales so that the network can capture targets of different sizes. This suggests that text processing can also benefit from the multi-scale strategies used in the image field by fully exploiting intermediate-layer features. Therefore, this article designs a multi-scale feature extraction module, the Multi-scale Layer, on top of the original BERT. It uses a serial skip-layer connection similar to image multi-scale feature extraction to make full use of BERT’s shallow, intermediate, and deep features and obtain text features at different scales.
As shown in Fig. 2, the Multi-scale Layer structure is a novel text feature extraction method that concatenates the output A of all layers in the BERT model with the output B of the last layer. After processing by the attention layer Att and LSTM layer, the association and hierarchical order information between each layer are constructed to fully preserve the original information and obtain more contextual information.
Fig. 2.
Multi-scale layer structure diagram.
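The data flow of the Multi-scale Layer can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the additive attention form, the single-layer LSTM run over the layer axis, the concatenation of the two outputs, and all names (`attention_pool`, `lstm_last_hidden`, `multi_scale_layer`, the `params` keys) are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_pool(seq, w, b, u):
    """Additive attention over the layer axis (assumed form): one weight per layer vector."""
    scores = np.tanh(seq @ w + b) @ u          # shape (steps,)
    alpha = softmax(scores)                    # weights sum to 1
    return alpha @ seq, alpha                  # pooled vector (d,), weights (steps,)

def lstm_last_hidden(seq, w_x, w_h, bias):
    """Single-layer LSTM run from shallow to deep layers; returns the final hidden state."""
    hidden = w_h.shape[0]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in seq:
        i, f, g, o = np.split(x @ w_x + h @ w_h + bias, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def multi_scale_layer(cls_per_layer, params):
    # Assumed reading of "output A of all layers concatenated with output B of the last layer":
    # the last layer's [CLS] vector is appended to the stack of all per-layer [CLS] vectors.
    seq = np.concatenate([cls_per_layer, cls_per_layer[-1:]], axis=0)
    att_out, alpha = attention_pool(seq, params["w_att"], params["b_att"], params["u_att"])
    lstm_out = lstm_last_hidden(seq, params["w_x"], params["w_h"], params["b_lstm"])
    # The two branch outputs are integrated by concatenation (an assumption).
    return np.concatenate([att_out, lstm_out]), alpha

rng = np.random.default_rng(0)
p, d, hidden = 12, 16, 8                       # toy sizes: 12 layers, 16-dim [CLS] vectors
params = {
    "w_att": rng.normal(scale=0.1, size=(d, d)),
    "b_att": np.zeros(d),
    "u_att": rng.normal(scale=0.1, size=d),
    "w_x": rng.normal(scale=0.1, size=(d, 4 * hidden)),
    "w_h": rng.normal(scale=0.1, size=(hidden, 4 * hidden)),
    "b_lstm": np.zeros(4 * hidden),
}
cls_per_layer = rng.normal(size=(p, d))        # stand-in for BERT's per-layer [CLS] outputs
text_feature, layer_weights = multi_scale_layer(cls_per_layer, params)
```

In this sketch the attention branch weighs the layers against each other, while the LSTM branch reads them in order from shallow to deep, which is one way to encode the hierarchical order information described above.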
Aspect terms participate in text feature extraction
The distinguishing element of aspect-level multimodal sentiment analysis is the aspect term. To exploit aspect terms for supervising text features, this study relies on the sentence-pair input of the BERT model. Specifically, the text and the aspect term are treated as two independent sentences and concatenated into a single BERT input, which both extracts text features efficiently and strengthens the supervision that the aspect term exerts on them. This design makes fuller use of the aspect term information and captures the sentiment expressed toward a specific aspect more accurately. This study uses the form “[CLS] + text + [SEP] + aspect + [SEP]” to join the sentence and the aspect term as the input to BERT, giving a token sequence of length m. The calculation steps are given in Eqs. (1) to (4).

[Eqs. (1)–(4): equation images not available]

In Eqs. (1) to (4), the per-layer quantity is the vector obtained from the [CLS] token at each layer of BERT, and p is the number of BERT layers. The weight and bias terms are learnable parameters. The two intermediate quantities are the outputs of the Att and LSTM layers, respectively, and the final quantity is the extracted text feature.
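The “[CLS] + text + [SEP] + aspect + [SEP]” input can be sketched in a few lines of Python. The function name `build_bert_input` and the whitespace-split tokens are illustrative assumptions (a real BERT pipeline uses WordPiece tokenization); the segment ids follow BERT's usual sentence-pair convention of 0 for the first segment and 1 for the second.

```python
def build_bert_input(text_tokens, aspect_tokens):
    """Join a tokenized sentence and its aspect term as a BERT sentence pair."""
    tokens = ["[CLS]"] + text_tokens + ["[SEP]"] + aspect_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the text, and the first [SEP];
    # segment 1 covers the aspect term and the final [SEP].
    segment_ids = [0] * (len(text_tokens) + 2) + [1] * (len(aspect_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_bert_input(["the", "food", "is", "delicious"], ["food"])
m = len(tokens)  # length of the concatenated input sequence
```

Because the aspect term sits in the same input as the sentence, every self-attention layer can relate sentence tokens to the aspect tokens, which is what lets the aspect term supervise text feature extraction.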
Image feature extraction module
The image feature extraction module consists of two parts: the ResNeSt269 model and the Supervision Layer feature supervision module.
ResNeSt269 model
Residual neural networks play a key role in deep network training, effectively solving the degradation problem of deep network training, overcoming the common gradient vanishing problem in deep neural network training, and making it possible to train networks with higher depths. The structural depth of ResNet enables the network to continuously learn new features with increasing layers and demonstrate advantages in various tasks. However, due to its single branch structure, standard ResNet has width limitations and faces challenges such as limited receptive fields and insufficient channel numbers.
To address these challenges, we replace the standard ResNet with ResNeSt269, a deep variant pre-trained on large-scale image classification tasks. This leverages its substantial depth and rich feature representation capabilities. Our selection of ResNeSt269 as the image feature extractor is fundamentally driven by its intrinsic architectural alignment with the fine-grained cross-modal correlation requirements of multimodal sentiment analysis. Compared to ViT-B/1629, whose global self-attention suffers from significant background distraction, ResNeSt269’s split-attention mechanism delivers substantially higher region localization precision on OpenImages-V6 benchmarks, ensuring precise activation of aspect-relevant regions. Furthermore, relative to EfficientNet-B7’s30 compound scaling approach, ResNeSt269 achieves superior spatial sensitivity with significantly fewer parameters and enhanced noise robustness. These characteristics prove particularly valuable for processing social media imagery from the Twitter2015 and Twitter2017 datasets, where image resolutions typically vary across a wide spectrum and partial occlusion occurs frequently. ResNeSt269’s consistent performance on variable-resolution benchmarks, combined with its background interference suppression (as validated in CVPR 2022), establishes it as an optimal solution for aspect-level sentiment tasks in social media contexts.
Therefore, we employ ResNeSt269 as our foundational feature extractor due to these advantages, specifically its abundant channel capacity and strong representational power, which facilitate superior image feature processing for our task.
Supervision layer
In this model, aspect terms contribute to image feature extraction. We design a Supervision Layer inspired by multimodal alignment attention mechanisms31,32, where multi-scale processing is achieved by progressively unifying originally heterogeneous aspect and image features into dimensionally consistent representations through layered linear projections. Specifically, the module first maps the aspect features and image features into a shared subspace through linear transformations, which bridges the modality gap. Then, the Softmax function computes association weights, which theoretically corresponds to a lightweight attention mechanism that selectively enhances image regions relevant to the aspect terms. Finally, feature modulation and concatenation are applied to generate the aligned visual representation. This approach not only provides supervision for image features but also facilitates interaction between modalities, thereby helping the model to identify and utilize key information more effectively and maximizing the utility of aspect terms.
A given aspect term is represented as a token sequence of length v and used as input to the BERT model. The Multi-scale Layer is likewise applied to obtain a more comprehensive feature representation of the aspect term. The calculation steps are given in Eqs. (5) to (8).

[Eqs. (5)–(8): equation images not available]

In Eqs. (5) to (8), the per-layer quantity is the vector obtained from the [CLS] token at each layer of BERT, and p is the number of BERT layers. The weight and bias terms are learnable parameters. The two intermediate quantities are the outputs of the Att and LSTM layers, respectively, and the final quantity is the extracted aspect term feature. The visual information of the image is represented as a set of n images, where n is the number of images in the model, and fed into ResNeSt269 to extract preliminary visual features, as shown in Eq. (9).

[Eq. (9): equation image not available]
The Supervision Layer further supervises and refines the preliminary visual features; its structure is shown in Fig. 3. This module operates on principles derived from multimodal attention mechanisms to bridge the modality gap and perform attention-driven region selection. First, the aspect term features and the visual features are projected into a shared semantic space using linear transformations, each with its own learnable weights and biases, followed by an activation function. Subsequently, a Softmax function, with its own learnable weights and bias, computes attention weights over the visual regions, effectively implementing an attention-like selection mechanism that emphasizes areas pertinent to the aspect terms. These weights are then used to modulate the original visual features, generating weight-supervised visual features through element-wise multiplication and a further linear transformation with learnable weights and bias. Finally, the aspect term features and the modulated visual features are concatenated to produce the final, optimized visual representation output by the Supervision Layer. As illustrated in Fig. 3, this design provides direct cross-modal supervision and facilitates key information interaction through attention principles, maximizing the utility of aspect terms in refining the visual representation. The calculation steps are given in Eqs. (10) to (13).

[Eqs. (10)–(13): equation images not available]
Fig. 3.

Structure of supervision.
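The four steps of the Supervision Layer (projection, Softmax weighting, modulation, concatenation) can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the `tanh` activation, the way aspect and region projections are combined into scores, the pooling of weighted regions by summation, and every name in `supervision_layer` and its parameter dictionary are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def supervision_layer(aspect, regions, p):
    """aspect: (d_a,) aspect term feature; regions: (r, d_v) visual region features."""
    # 1) Project both modalities into a shared k-dimensional space (tanh is assumed).
    a_proj = np.tanh(aspect @ p["w_a"] + p["b_a"])       # (k,)
    v_proj = np.tanh(regions @ p["w_v"] + p["b_v"])      # (r, k)
    # 2) Score each region against the aspect and normalize with Softmax.
    scores = (v_proj * a_proj) @ p["w_s"] + p["b_s"]     # (r,)
    alpha = softmax(scores)
    # 3) Modulate the original visual features element-wise, pool, and apply a linear map.
    weighted = alpha[:, None] * regions                  # (r, d_v)
    v_sup = weighted.sum(axis=0) @ p["w_o"] + p["b_o"]   # (k,)
    # 4) Concatenate aspect features with the supervised visual features.
    return np.concatenate([aspect, v_sup]), alpha

rng = np.random.default_rng(0)
d_a, d_v, r, k = 6, 10, 5, 4                             # toy dimensions
params = {
    "w_a": rng.normal(scale=0.1, size=(d_a, k)), "b_a": np.zeros(k),
    "w_v": rng.normal(scale=0.1, size=(d_v, k)), "b_v": np.zeros(k),
    "w_s": rng.normal(scale=0.1, size=k),        "b_s": 0.0,
    "w_o": rng.normal(scale=0.1, size=(d_v, k)), "b_o": np.zeros(k),
}
aspect_feat = rng.normal(size=d_a)
region_feats = rng.normal(size=(r, d_v))
visual_repr, region_weights = supervision_layer(aspect_feat, region_feats, params)
```

The returned `region_weights` make the aspect-guided selection inspectable: regions with larger weights contribute more to the supervised visual feature.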
Feature fusion module
Standard feature-level fusion fails to model cross-modal dynamics, while recent advanced methods impose impractical requirements like meticulous negative sampling or extra aesthetic annotations as discussed in Sect. 1. For Twitter2015/2017 datasets where images and text exhibit weak alignment and high noise, we adopt the Tensor Fusion Network based on its inherent capacity for high-order multiplicative interactions. This enables robust fusion of multi-scale noisy features while preserving original modality information — critical for capturing sparse aspect-sentiment correlations in social media multimodal data.
Specifically, we first augment both the text features and the image features by appending a unity dimension. The fused representation is generated through an outer product between the augmented features, capturing multiplicative interactions while preserving original modality information. The resulting high-dimensional tensor is flattened and processed by a linear layer whose weights are initialized from a Xavier uniform distribution. This enables the comprehensive cross-modal interaction that aspect-based sentiment analysis requires. The TFN operation is given in Eq. (14).

[Eq. (14): equation image not available]
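The fusion step described above can be sketched in NumPy as follows. The function names, the toy feature values, and the three-class output head are assumptions for illustration; only the append-one, outer-product, flatten, and Xavier-initialized linear layer sequence comes from the text.

```python
import numpy as np

def xavier_uniform(rng, fan_in, fan_out):
    """Xavier (Glorot) uniform initialization for a (fan_in, fan_out) weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def tensor_fusion(text_feat, image_feat, w_out, b_out):
    t = np.append(text_feat, 1.0)      # augment text features with a unity dimension
    v = np.append(image_feat, 1.0)     # augment image features with a unity dimension
    fused = np.outer(t, v)             # (d_t + 1, d_v + 1) interaction tensor
    logits = fused.reshape(-1) @ w_out + b_out
    return logits, fused

rng = np.random.default_rng(0)
text_feat = np.array([0.5, -1.0])          # toy text feature (d_t = 2)
image_feat = np.array([2.0, 0.0, 1.0])     # toy image feature (d_v = 3)
n_classes = 3                              # hypothetical: negative / neutral / positive
w_out = xavier_uniform(rng, (2 + 1) * (3 + 1), n_classes)
b_out = np.zeros(n_classes)
logits, fused = tensor_fusion(text_feat, image_feat, w_out, b_out)
```

The appended ones are what preserve the unimodal information: the last column of `fused` equals the augmented text vector, the last row equals the augmented image vector, and the remaining block holds every pairwise text-image product.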
Experimental results and analysis
Dataset and evaluation indicators
This study utilizes two benchmark multimodal datasets, Twitter2015 and Twitter2017, for experimental validation. The Twitter2015 dataset comprises 5338 image-text pairs in total, with the training, validation, and test sets containing 3179, 1122, and 1037 instances respectively. The Twitter2017 dataset expands to a total of 5972 image-text pairs, with its training, validation, and test sets configured as 3562, 1176, and 1234 samples. A defining characteristic of both datasets is the annotation of a specific aspect term for each image-text pair, serving as critical supervisory signals for fine-grained sentiment analysis that precisely localizes target entities or attributes requiring sentiment judgment within texts and images.
To evaluate model performance, this article uses accuracy ($acc$) and the F1 score ($F1$), both frequently used in aspect-level multimodal sentiment analysis. The evaluation indicators are calculated as shown in Eqs. (15) and (16).

$$acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

$$F1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{16}$$

Here, $F1$ represents the F1 score, $acc$ represents accuracy, $TP$ represents the number of samples correctly assigned to a given class, $TN$ represents the number of samples correctly assigned to other classes, $FP$ represents the number of samples incorrectly assigned to that class, and $FN$ represents the number of samples incorrectly assigned to other classes.
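As a worked illustration of these indicators, the following Python sketch computes accuracy and a per-class F1 score from predicted labels. Averaging the per-class F1 scores (macro averaging) is an assumption here, since the text does not state how classes are combined; the labels and the helper names are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_per_class(y_true, y_pred, cls):
    """F1 for one class from its true positives, false positives, and false negatives."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy labels: 0 = negative, 1 = neutral, 2 = positive (hypothetical encoding)
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]
acc = accuracy(y_true, y_pred)                 # 5 of 6 correct
f1 = macro_f1(y_true, y_pred, classes=[0, 1, 2])
```

In this toy case the per-class F1 scores are 1.0, 0.8, and 2/3, so the macro F1 is lower than the accuracy of 5/6, which shows how F1 penalizes the class that is partly missed.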
Environment configuration and parameters
The hyperparameter settings for aspect-level multimodal sentiment analysis based on multi-scale feature extraction are shown in Table 1.
Table 1.
Hyperparameter settings.
| Hyperparameter description | Value |
|---|---|
| Batch size | 10 |
| Learning rate | 0.00002 |
| Optimizer | Adam |
| Multi-scale module dimension | 512 |
| Cross modal attention dimension | 512 |
| Extract feature dimensions | 512 |
Training was conducted on an NVIDIA Quadro P5000 GPU (16GB VRAM) with Intel Xeon Silver 4116 CPU using Adam optimizer (batch_size = 10, lr = 2e-5). With modified early stopping (halted when validation F1 fluctuated < 0.005 for 8 consecutive epochs), average training time was 52 ± 3 h on Twitter2015 and 68 ± 4 h on Twitter2017.
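The stopping rule can be read in more than one way. The sketch below assumes that "fluctuated < 0.005" means the absolute change in validation F1 between consecutive epochs stayed below 0.005 for 8 epochs in a row; the function name and the example history are hypothetical.

```python
def should_stop(f1_history, tol=0.005, patience=8):
    """Return True once the last `patience` epoch-to-epoch F1 changes are all below `tol`."""
    if len(f1_history) < patience + 1:
        return False
    recent = f1_history[-(patience + 1):]
    return all(abs(curr - prev) < tol for prev, curr in zip(recent, recent[1:]))

# Hypothetical validation F1 curve: fast early gains, then a plateau
warmup = [0.60, 0.65, 0.68]
plateau = [0.690, 0.691, 0.692, 0.690, 0.691, 0.693, 0.692, 0.691, 0.692]
stop_early = should_stop(warmup)               # too few epochs to decide
stop_late = should_stop(warmup + plateau)      # 8 consecutive small changes
```

Unlike a rule that waits for the best score to stop improving, this criterion also halts when the score oscillates within a narrow band.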
Ablation experiment
In order to demonstrate the advantages of the AMSAM-MFE model proposed in this article in feature extraction compared to other models, we designed the following ablation experiments. Our model was used for single modal sentiment analysis, and the analysis results were compared with traditional single modal sentiment analysis methods for the corresponding modality. The ablation experiments were conducted on the same dataset to verify the effectiveness of each part.
The models we compared are as follows:
(1) AE-LSTM33 uses attention to supervise the generation and capture of important features for a given aspect term in the text, and combines a long short-term memory recurrent neural network with attention to perform aspect-level sentiment analysis of the text.
(2) MemNet34 uses a deep memory network for aspect-level sentiment classification. Applying attention to a given aspect allows it to capture the important features of each context effectively.
(3) RAM35 uses multiple attention mechanisms to associate long-distance features, sends the results to a recurrent neural network for feature fusion, and uses a weighted memory mechanism to tailor the memory to different aspect terms.
(4) MGAN36 is a multi-granularity attention network that uses fine-grained attention to obtain fused features of aspect terms and text, reducing the feature loss that coarse-grained methods may cause and improving the model’s ability to extract targeted features.
(5) ESTR is the text-only implementation of Yu et al.'s ESAFN model. It splits the input text into left, aspect, and right segments, which are sent to a feature extraction network. The extracted features are fused using attention mechanisms to obtain the final aspect-level sentiment classification.
The results of the ablation experiment are shown in Table 2, where T represents Aspect-level sentiment analysis using only text features, V represents sentiment analysis using only image features.
Table 2.
Results of ablation experiment.
| Modality | Model | Twitter2015 Acc | Twitter2015 F1 | Twitter2017 Acc | Twitter2017 F1 |
|---|---|---|---|---|---|
| V | Resnest269 | 58.05 | 31.25 | 45.71 | 31.99 |
| V | Our only V | 59.79* | 32.32* | 58.91* | 54.47* |
| T | AE-LSTM | 70.30 | 63.43 | 61.67 | 57.97 |
| T | MemNet | 70.11 | 61.76 | 64.18 | 60.90 |
| T | RAM | 70.68 | 63.05 | 64.42 | 61.01 |
| T | MGAN | 71.17 | 64.21 | 64.75 | 61.46 |
| T | ESTR | 71.36 | 64.28 | 65.80 | 62.00 |
| T | BERT | 74.35 | 68.45 | 66.94 | 64.83 |
| T | Our only T | 75.22* | 69.53* | 68.56* | 66.50* |
Note: An asterisk(*) indicates a statistically significant improvement (p < 0.05) over the specified baseline model: Resnest269 for Our only V, BERT for Our only T, and Our only T for the full Our model.
From the above table, it can be seen that the model based on multi-scale feature extraction proposed in this paper achieves a clear improvement in sentiment analysis using the image modality alone. Compared with the Resnest269 model, its accuracy on the Twitter2015 and Twitter2017 datasets is 1.74 and 13.20 percentage points higher, respectively, and its F1 score is 1.07 and 22.48 points higher, respectively. This is mainly attributed to its supervision module, the Supervision Layer. This module exploits aspect term features to supervise the multi-scale features generated from images, which helps to focus on image features related to the aspect terms while removing irrelevant ones. In addition, the module integrates aspect term features into the image features, enhancing the analytical ability of image features in aspect-level sentiment analysis tasks. This demonstrates the important role of aspect terms in image feature extraction.
In addition, the data in the table show that AMSAM-MFE, using only the text modality, outperforms classic models such as AE-LSTM, MemNet, RAM, MGAN, and ESTR in aspect-level sentiment analysis. On the Twitter2015 dataset, its accuracy is 4.92, 5.11, 4.54, 4.05, and 3.86 percentage points higher, and its F1 score is 6.10, 7.77, 6.48, 5.32, and 5.25 points higher, respectively. On the Twitter2017 dataset, its accuracy is 6.89, 4.38, 4.14, 3.81, and 2.76 percentage points higher, and its F1 score is 8.53, 5.60, 5.49, 5.04, and 4.50 points higher, respectively. The text feature extraction module thus captures text features related to the given aspect terms effectively and improves sentiment classification. Compared with BERT, the text-only model also improves on both datasets: accuracy is 0.87 and 1.62 percentage points higher on Twitter2015 and Twitter2017, and the F1 score is 1.08 and 1.67 points higher, respectively. This improvement is mainly due to the Multi-scale Layer feature extraction module, which combines the shallow, middle, and deep features of the BERT model and enhances the analytical ability of text features in aspect-level sentiment analysis tasks.
Comparison experiment of fusion module
In order to demonstrate the advantages of AMSAM-MFE feature fusion method compared to other fusion methods, this paper designs the following comparative experiments, using different feature fusion strategies while ensuring the same feature extraction, to illustrate the advantages of our model’s Tensor Fusion Network method in fusing extracted features for multimodal sentiment analysis.
The fusion methods compared are as follows: (1) SumFusion uses linear layers to produce a prediction from the features of each modality and sums the predictions as the final sentiment analysis result. (2) ConcatFusion concatenates the features extracted from the modalities and then performs sentiment analysis. (3) FiLM37 multiplies the features of one modality point by point with those of the other, adds the result to them, and then performs sentiment analysis. (4) GatedFusion38 feeds the features extracted from one modality into a Sigmoid function, uses the output as weights that multiply the other modality's features point by point, and then performs sentiment analysis. The results of the fusion module comparison experiment are shown in Table 3.
Table 3.
Comparison experimental results of fusion modules.
| Fusion method | Twitter2015 Acc | Twitter2015 F1 | Twitter2017 Acc | Twitter2017 F1 |
|---|---|---|---|---|
| SumFusion | 76.28 | 71.30 | 69.61 | 68.11 |
| ConcatFusion | 75.51 | 70.63 | 68.72 | 67.27 |
| FiLM | 75.31 | 69.63 | 68.23 | 66.77 |
| GatedFusion | 74.45 | 68.43 | 68.15 | 66.35 |
| Ours | 76.76* | 71.99* | 70.10* | 68.52* |
Note: An asterisk (*) denotes a statistically significant improvement (p < 0.05) over the best-performing alternative fusion method (SumFusion).
Table 3 shows that the fusion method used in this paper achieves higher accuracy and F1 scores than SumFusion, ConcatFusion, FiLM, and GatedFusion. On the Twitter2015 dataset, the accuracy was 0.48, 1.25, 1.45, and 2.31 percentage points higher, and the F1 score was 0.69, 1.36, 2.36, and 3.56 percentage points higher, respectively. On the Twitter2017 dataset, the accuracy was 0.49, 1.38, 1.87, and 1.95 percentage points higher, and the F1 score was 0.41, 1.25, 1.75, and 2.17 percentage points higher, respectively. This indicates that the Tensor Fusion Network method used in this paper achieves thorough interaction between image and text features by fusing the multi-scale features extracted from images and text, thereby improving the performance of the model.
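The core operation of tensor fusion for two modalities can be illustrated with a short sketch: each feature vector is extended with a constant 1 and the outer product of the two extended vectors is flattened. The function name `tensor_fusion` and the toy inputs are assumptions for illustration; the sketch omits the classifier that would follow the fused vector.

```python
import numpy as np

def tensor_fusion(text, image):
    """Two-modality tensor fusion: outer product of 1-augmented vectors."""
    ones = np.ones((text.shape[0], 1))
    t = np.concatenate([text, ones], axis=1)    # (batch, dt + 1)
    v = np.concatenate([image, ones], axis=1)   # (batch, dv + 1)
    fused = np.einsum("bi,bj->bij", t, v)       # (batch, dt + 1, dv + 1)
    return fused.reshape(text.shape[0], -1)     # flatten for a classifier

text = np.array([[1.0, 2.0]])         # toy text features, dt = 2
image = np.array([[3.0, 4.0, 5.0]])   # toy image features, dv = 3
z = tensor_fusion(text, image)
print(z.shape)  # (1, 12)
```

Because of the appended 1, the fused vector contains the original text features, the original image features, and every pairwise product between them, so unimodal and bimodal interactions are both available to the classifier. The cost is that the fused dimension grows as (dt + 1) × (dv + 1).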
Model overall comparison experiment
To further validate the performance of our model in aspect-level multimodal sentiment analysis, we compared it with the following models:
Res-RAM, Res-MGAN, and Res-ESTR: These models max-pool the visual features extracted by Resnet152 and concatenate them with the representations produced by RAM, MGAN, and ESTR (described in the ablation experiment section), respectively, for aspect-level sentiment prediction.
Res-RAM-TFN and Res-MGAN-TFN: These models max-pool the visual features extracted by Resnet152 and fuse them with the representations produced by RAM and MGAN (described in the ablation experiment section), respectively, using the Tensor Fusion Network method for aspect-level sentiment prediction.
MIMN: This model uses two interactive memory networks, one for the interaction between text features and aspect word features and one for the interaction between image features and aspect word features, to perform aspect-level sentiment analysis.
ESAFN: This model generates aspect-aware text features through an attention mechanism, divides the input text into three parts that are processed separately, uses a gating mechanism to supervise image features with aspect terms, and finally fuses the features through bilinear interaction to perform sentiment analysis.
The results of the comparative experiments between this model and different models are shown in Table 4.
Table 4.
Overall model comparison experimental results.
| Model | Twitter2015 Acc | Twitter2015 F1 | Twitter2017 Acc | Twitter2017 F1 |
|---|---|---|---|---|
| Res-RAM | 71.55 | 64.68 | 65.40 | 62.23 |
| Res-RAM-TFN | 69.91 | 61.49 | 63.45 | 58.92 |
| Res-MGAN | 71.65 | 63.88 | 66.37 | 63.04 |
| Res-MGAN-TFN | 70.30 | 64.14 | 64.10 | 59.13 |
| MIMN | 71.84 | 65.69 | 65.88 | 62.99 |
| Res-ESTR | 72.03 | 63.98 | 66.13 | 63.63 |
| ESAFN | 73.38 | 67.37 | 67.83 | 64.22 |
| Ours | 76.76* | 71.99* | 70.10* | 68.52* |
Note: An asterisk (*) signifies a statistically significant improvement (p < 0.05) over the strongest baseline model (ESAFN).
According to Table 4, the proposed aspect-level multimodal sentiment analysis model based on multi-scale feature extraction achieves higher accuracy and F1 scores on the Twitter2015 and Twitter2017 datasets than Res-RAM, Res-RAM-TFN, Res-MGAN, Res-MGAN-TFN, MIMN, Res-ESTR, and ESAFN. On the Twitter2015 dataset, the accuracy was 5.21, 6.85, 5.11, 6.46, 4.92, 4.73, and 3.38 percentage points higher, and the F1 score was 7.31, 10.50, 8.11, 7.85, 6.30, 8.01, and 4.62 percentage points higher, respectively. On the Twitter2017 dataset, the accuracy was 4.70, 6.65, 3.73, 6.00, 4.22, 3.97, and 2.27 percentage points higher, and the F1 score was 6.29, 9.60, 5.48, 9.39, 5.53, 4.89, and 4.30 percentage points higher, respectively. These results indicate that AMSAM-MFE can effectively extract features from each modality for feature fusion, achieves sufficient interaction of information between modalities, and delivers better sentiment analysis performance than the other models.
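The margins over each baseline can be recomputed directly from the values in Table 4. The short script below is only a checking aid; the names `TABLE4`, `OURS`, and `margins` are introduced here for illustration.

```python
# (Twitter2015 Acc, Twitter2015 F1, Twitter2017 Acc, Twitter2017 F1) from Table 4
TABLE4 = {
    "Res-RAM":      (71.55, 64.68, 65.40, 62.23),
    "Res-RAM-TFN":  (69.91, 61.49, 63.45, 58.92),
    "Res-MGAN":     (71.65, 63.88, 66.37, 63.04),
    "Res-MGAN-TFN": (70.30, 64.14, 64.10, 59.13),
    "MIMN":         (71.84, 65.69, 65.88, 62.99),
    "Res-ESTR":     (72.03, 63.98, 66.13, 63.63),
    "ESAFN":        (73.38, 67.37, 67.83, 64.22),
}
OURS = (76.76, 71.99, 70.10, 68.52)

def margins(ours, baselines):
    """Return the gain of `ours` over each baseline, in percentage points."""
    return {
        name: tuple(round(o - b, 2) for o, b in zip(ours, row))
        for name, row in baselines.items()
    }

result = margins(OURS, TABLE4)
for name, row in result.items():
    print(f"{name:13s}", *[f"{m:5.2f}" for m in row])
```

The smallest margins in every column are those over ESAFN, the strongest baseline, which is the comparison the significance note under Table 4 refers to.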
Conclusion
This article studies aspect-level multimodal sentiment analysis models and proposes the AMSAM-MFE model. For text feature extraction, we introduced the Multi-scale Layer module based on the BERT model to extract text features at multiple scales, and incorporated aspect terms into the text feature extraction, improving its performance. For image feature extraction, based on the pre-trained Resnest269 model, we designed the Supervision Layer feature supervision module, which follows the way aspect terms are used in text feature extraction and adds aspect terms to image feature extraction, improving its performance. In the fusion module, the model uses the tensor fusion method to strengthen the interaction between the features of the modalities, enabling the model to better understand the emotional connections between images and text.
Although the proposed model improves aspect-level multimodal sentiment analysis, both its F1 score and its accuracy could be improved further in the future. We use three elements for sentiment analysis: image, text, and aspect terms. In practice, the available information is not always limited to these three elements. In future research, we will focus on other modalities such as video and image.
Acknowledgements
The authors are thankful to researchers in Beijing Institute of Graphic Communication for the helpful discussion.
Author contributions
Study conception and design: C.X., B.M.; data collection: B.M.; analysis and interpretation: C.X., B.M., C.X.; draft manuscript preparation: C.X., B.M. All authors reviewed the results and approved the final version of the manuscript.
Funding
This work was supported by the Graduate Education Reform and Quality Enhancement Initiative (Curriculum Development for the Artificial Intelligence Experimental Class) (21090325006).
Data availability
Data sets generated during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Schouten, K. & Frasincar, F. Survey on aspect-level sentiment analysis. IEEE Trans. Knowl. Data Eng. 28 (3), 813–830 (2016).
- 2.Jiang, L. et al. Target-dependent Twitter sentiment classification. In Proc. Annu. Meeting Assoc. Comput. Linguist. 151–160 (2011).
- 3.Yu, J. et al. Aspect ranking: identifying important product aspects from online consumer reviews. In Proc. 49th Annu. Meeting Assoc. Comput. Linguist. 1496–1505 (2011).
- 4.Vo, D. T. & Zhang, Y. Target-dependent Twitter sentiment classification with rich automatic features. In Proc. Int. Conf. Artif. Intell. 1347–1353 (2015).
- 5.Deng, L. & Wiebe, J. Joint prediction for entity/event-level sentiment analysis using probabilistic soft logic models. In Proc. Conf. Empir. Methods Nat. Lang. Process. 179–189 (2015).
- 6.Dong, L. et al. Adaptive recursive neural network for target-dependent Twitter sentiment classification. In Proc. Annu. Meeting Assoc. Comput. Linguist. 49–54 (2014).
- 7.Xue, W. & Li, T. Aspect based sentiment analysis with gated convolutional networks. In Proc. Annu. Meeting Assoc. Comput. Linguist. 2514–2523 (2018).
- 8.Tang, D. et al. Effective LSTMs for target dependent sentiment classification. In Proc. Int. Conf. Comput. Linguist. 3298–3307 (2016).
- 9.Devlin, J. et al. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. Conf. North Amer. Chap. Assoc. Comput. Linguist. 4171–4186 (2019).
- 10.Turney, P. D. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. ACL. 417–424 (2002).
- 11.Pang, B. et al. Thumbs up? Sentiment classification using machine learning techniques. EMNLP 79–86 (2002).
- 12.Yu, H. & Hatzivassiloglou, V. Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences. EMNLP. 129–136 (2003).
- 13.Nazir, A. et al. Issues and challenges of aspect-based sentiment analysis: a comprehensive survey. IEEE Trans. Affect. Comput. 13 (1), 845–863 (2020).
- 14.Zhang, W. et al. A survey on aspect-based sentiment analysis: tasks, methods, and challenges. IEEE Trans. Knowl. Data Eng. 35 (12), 11019–11038 (2023).
- 15.Zhang, M. et al. Aspect-level sentiment analysis based on deep learning. Comput. Mater. Contin. 78 (3), 3743–3762 (2024).
- 16.Tsai, Y. et al. Multimodal transformer for unaligned multimodal language sequences. In Proc. Conf. Assoc. Comput. Linguist. 6558–6569 (2019).
- 17.Khan, Z. & Fu, Y. Exploiting BERT for multimodal target sentiment classification through input space translation. In Proc. Assoc. Comput. Mach. Conf. Multim. 3034–3042 (2021).
- 18.Xu, N. et al. Multi-interactive memory network for aspect based multimodal sentiment analysis. In Proc. AAAI Conf. Artif. Intell. 371–378 (2019).
- 19.Yu, J. et al. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 429–439 (2019).
- 20.Zhang, H. et al. ResNeSt: split-attention networks. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 2736–2746 (2022).
- 21.Zadeh, A. et al. Tensor fusion network for multimodal sentiment analysis. In Proc. Conf. Empir. Methods Nat. Lang. Process. 1103–1114 (2017).
- 22.Liu, Z. et al. Efficient low-rank multimodal fusion with modality-specific factors. In Proc. Annu. Meeting Assoc. Comput. Linguist. 2247–2256 (2018).
- 23.Mu, J. et al. MOCOLNet: a momentum contrastive learning network for multimodal aspect-level sentiment analysis. IEEE Trans. Knowl. Data Eng. 36 (12), 8787–8800 (2024).
- 24.Chen, T. et al. Vanessa: visual connotation-aware network for multimodal aspect-based sentiment analysis. In Proc. Annu. Meeting Assoc. Comput. Linguist. 8921–8935 (2023).
- 25.Xiao, L. et al. Atlantis: aesthetic-oriented multiple granularities fusion network for joint multimodal aspect-based sentiment analysis. Inf. Fusion. 106, 102304 (2024).
- 26.Mikolov, T. et al. Distributed representations of words and phrases and their compositionality. NeurIPS 3111–3119 (2013).
- 27.Pennington, J. et al. GloVe: global vectors for word representation. EMNLP 1532–1543 (2014).
- 28.Lin, Y. et al. Feature pyramid networks for object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2117–2125 (2017).
- 29.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. 9th Int. Conf. Learn. Represent. (2021).
- 30.Tan, M. et al. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proc. 36th Int. Conf. Mach. Learn. 6105–6114 (2019).
- 31.Baltrušaitis, T. et al. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41 (2), 423–443 (2019).
- 32.Wang, H. et al. Linear attention mechanisms for efficient multimodal fusion. IEEE Trans. Pattern Anal. Mach. Intell. 45 (6), 7128–7140 (2023).
- 33.Wang, Y. et al. Attention-based LSTM for aspect-level sentiment classification. In Proc. Conf. Empir. Methods Nat. Lang. Process. 606–615 (2016).
- 34.Tang, D. et al. Aspect level sentiment classification with deep memory network. In Proc. Conf. Empir. Methods Nat. Lang. Process. 214–224 (2016).
- 35.Chen, P. et al. Recurrent attention network on memory for aspect sentiment analysis. In Proc. Conf. Empir. Methods Nat. Lang. Process. 452–461 (2017).
- 36.Fan, F. et al. Multi-grained attention network for aspect-level sentiment classification. In Proc. Conf. Empir. Methods Nat. Lang. Process. 3433–3442 (2018).
- 37.Perez, E. et al. FiLM: visual reasoning with a general conditioning layer. In Proc. AAAI Conf. Artif. Intell. 3942–3951 (2018).
- 38.Kiela, D. et al. Efficient large-scale multi-modal classification. In Proc. AAAI Conf. Artif. Intell. 5198–5204 (2018).