PLoS One. 2022 Nov 23;17(11):e0277463. doi: 10.1371/journal.pone.0277463

Multimodal false information detection method based on Text-CNN and SE module

Yi Liang 1,2, Turdi Tohti 1,2,*, Askar Hamdulla 1,2
Editor: T Ganesh Kumar
PMCID: PMC9683577  PMID: 36417421

Abstract

False information detection can identify false information on social media and reduce its negative impact on society. With the development of multimedia, false information increasingly contains multimodal content, so it is important to use multimodal features to detect it. This paper mainly uses information from two modalities, text and image. In previous work, the features extracted by the backbone network were not further processed, and the problems of noise and information loss during multimodal feature fusion were ignored. This paper proposes a false information detection method based on Text-CNN and the SE module. We use Text-CNN to process the text and image features extracted by BERT and Swin-Transformer to enhance the quality of the features. In addition, we use a modified SE module to fuse text and image features and reduce the noise introduced during fusion. Meanwhile, we draw on the idea of residual networks and concatenate the original features with the fused features to reduce information loss during fusion. Compared with attention based multimodal factorized bilinear pooling, our model improves accuracy by 6.5% on the Weibo dataset and 2.0% on the Twitter dataset. The comparative experimental results show that the proposed model can improve the accuracy of false information detection, and the ablation experiments further demonstrate the effectiveness of each module in our model.

Introduction

With the development of information technology, social media has become the main way for people to obtain information; especially during the epidemic, people's lives have become even more closely connected with social media. The rapid development of social media not only brings convenience to people but also facilitates the spread of false and misleading news. False information is defined as unsubstantiated stories and statements [1]. The spread of false information can mislead the public and have a negative impact on society. For example, the 2012 doomsday rumor claimed that the Earth would experience a major catastrophe, or "three consecutive days of darkness", on December 21, 2012. This rumor caused panic around the world, leading people to spend large amounts of money hoarding supplies and even building a "Noah's Ark".

Fig 1 presents several multimodal false information posts from the Twitter dataset [2]. Each post contains a paragraph of text and an associated image. In the first post, the image has been altered, so both the image and the text are false. In the second post, the image is real, but it actually shows the Sicily air disaster and does not match the text. In the final post, the image was edited to include a shark that did not exist during Hurricane Sandy. The dissemination of such false information seriously disrupts the normal operation of society, so it is very important to detect false information and stop it from spreading. False information containing text and images can be divided into three categories: in the first, the text is false but the image is real; in the second, the text is real but the image is false; and in the third, both the text and the image are false.

Fig 1. Example of fake news in the Twitter dataset.

In recent years, deep learning models have been used for false information detection. Early approaches focused on detection using text features [3, 4]; for example, the model proposed by Pérez et al. [5] uses textual content to detect false information. However, such models can only detect the first and third types of false information and cannot correctly detect the second type. If text and image information are used simultaneously, all three types of false information can be detected [6-8], which reflects the importance of multimodal false information detection.

There are two challenging problems in existing research work. First, how to extract higher-quality text features and image features. Second, how to better fuse text features and image features to obtain more valuable fused features. Previous works have used RNN (Recurrent Neural Network) [9] or Transformer-based models [10] to extract text features and CNN (Convolutional Neural Network)-based models [11] to extract image features, and have finally fused text and image features through simple concatenation, factorized bilinear pooling or attention mechanisms. However, these methods directly fuse the features extracted from the backbone network and fail to process the extracted features further to compensate for the shortcomings of the backbone network. Furthermore, these works do not consider the problems of noise and information loss in the feature fusion process. This paper proposes a false information detection model based on Text-CNN [12] and SE (Squeeze-and-Excitation Networks) [13] modules, which addresses the above problems. Our model uses Text-CNN at three scales to compensate for the slight deficiency of Transformer-based models in extracting local features. At the same time, we adopt a modified SE module to reduce the influence of noise during fusion, and we reduce information loss by concatenating the original features with the fused features. Therefore, our model can better detect false information.

The main contributions of this paper are as follows:

  1. We use Text-CNN at 3 different scales to process the text features extracted by the pre-trained model BERT [14] and the image features extracted by the pre-trained model SWTR (Swin Transformer) [15] to obtain more valuable features.

  2. We modify the SE module so that it can fuse text features and image features. We utilize the channel attention mechanism in the SE module to mitigate the effect of noise during fusion to obtain better-represented fused features.

  3. We draw on the idea of residual networks and concatenate the original features with the fused features to reduce information loss during fusion.

  4. The accuracy and F1 values of our model on the Twitter dataset [2] and the Weibo dataset [16] outperform those of the baseline model AMFB.

Related work

Traditional false information detection models are mostly text-based. In earlier studies, text features were mainly extracted manually. Qazvinian et al. [17] exploited n-gram features such as bigrams extracted from text to detect rumors. Pérez et al. [5] extracted five linguistic features from the text to detect false information. With the development of technology, researchers found that manually extracted features are limited by the dataset and therefore lack generality [18]. Subsequent researchers used deep learning techniques to let computers automatically extract features from text. Liu et al. [19] proposed a model that uses CNN to extract text features and detect false information; the CNN can mine deeper text features that humans cannot discover. Ma et al. [20] used RNN to extract text features and capture content related to the textual context. Nasir et al. [21] utilized both CNN and RNN to extract text features, combining the advantages of the two architectures.

In recent years, with the increasing variety of forms of information expression, how to use different forms of information simultaneously to detect false information has attracted the attention of many researchers. Text and images are the two forms most often used in existing work. Singhal et al. [22] concatenate the text and image features extracted by BERT and VGG19 and feed them into a classifier to obtain detection results. Kumari et al. [23] proposed a multimodal fusion model based on multimodal factorized bilinear pooling: the model first uses a combination of BiLSTM and attention to extract text features, then a combination of CNN, BiGRU and attention to extract image features, and finally fuses the two through multimodal factorized bilinear pooling before feeding them into the detector. Song et al. [24] proposed a multimodal false information detection model based on cross-modal attention residuals and multi-channel CNN; the model can extract information related to the target modality from the remaining modalities without losing the information of the target modality, while the multi-channel CNN reduces the influence of noise when fusing information from different modalities. Dhawan et al. [25] proposed a multimodal detection model based on a graph neural network, which allows fine-grained interactions within and between modalities to further improve detection accuracy. Wu et al. [26] proposed a novel multimodal co-attention network to better fuse text and image features for false information detection. With the rise of pre-trained models, researchers have also studied fusion algorithms for text and image features. Xu et al. [27] divided the existing Transformer-based pre-trained fusion models into six categories: (1) Early Summation [28, 29], which weights and sums text and image features and feeds them into a Transformer layer; this method does not increase computational complexity but requires manually set weights. (2) Early Concatenation [30-33], which concatenates text and image features before inputting them into a Transformer layer; this approach increases computational complexity. (3) Multi-stream to One-stream [34], which processes text and image features in two separate Transformer layers and then concatenates them through another Transformer layer. (4) One-stream to Multi-stream [35], which concatenates text and image features, fuses them in a Transformer layer, and then splits the fused features into two parts that are fed into two different Transformer layers. (5) Cross-Attention [36, 37], which uses two Transformer layers to process text and image features and exchanges the two Q (Query) matrices to complete the fusion. (6) Cross-Attention to Concatenation [38, 39], in which the text and image features processed by Cross-Attention are concatenated and input into another Transformer layer for processing.

In addition to text and image information, other forms of information can also be used to detect false information. Wang et al. [40] found that existing research ignores the role of strong visual emotion in rumor content and proposed a multimodal rumor detection model that combines visual emotion and textual emotion. Azri et al. [41] proposed an end-to-end model that simultaneously utilizes text, image and emotion features. Kirchknopf et al. [42] proposed a multimodal detection model that supports the fusion of different levels and types of information and can simultaneously utilize text, images, user comments and metadata.

Multimodal false information detection method based on Text-CNN and SE module

Problem Definition: Let P = {p1, p2, ⋯, pm} be a set of posts, where each post contains both text and image data and pi denotes the ith post. Tset = {t1, t2, ⋯, tm} is the text set, where ti is the text content of the ith post; Vset = {v1, v2, ⋯, vm} is the image set, where vi is the image contained in the ith post; L = {l1, l2, ⋯, lm} is the label set, where li is the label of the ith post; thus pi = {ti, vi, li}. The main purpose of false information detection is to find a function f(T, V) = Y that identifies the authenticity of a post from its text and image information, where Y = {y1, y2, ⋯, ym} and yi is the predicted label of the ith post.
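To make the notation concrete, the following minimal Python sketch represents a post pi = {ti, vi, li} and the detector f(T, V) = Y. The field names and the callable f are illustrative placeholders for this setting, not part of the original implementation.

```python
# A minimal sketch of the problem setup; field names are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Post:
    text: str        # t_i: the text content of the i-th post
    image_path: str  # v_i: the image attached to the i-th post
    label: int       # l_i: 1 = false information, 0 = true information

def detect(posts: List[Post], f: Callable[[str, str], int]) -> List[int]:
    """Apply a detector f(t, v) -> y to every post, mirroring f(T, V) = Y."""
    return [f(p.text, p.image_path) for p in posts]
```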

The model in this paper mainly consists of four parts: text feature extraction, image feature extraction, image and text feature fusion and classifier. Fig 2 shows our proposed multimodal false information detection method based on Text-CNN and SE module.

Fig 2. Multimodal false information detection method based on Text-CNN and SE module.

The model first uses BERT to extract the features of each token from the text and concatenates them as the text features, whose dimension is (33/95, 768), where 33 and 95 are the text lengths used for the Twitter and Weibo datasets, respectively. SWTR is used to extract image features of dimension (49, 768). The modified SE module then fuses the text and image features to obtain fused features, while the text and image features are also processed by Text-CNN with widths of 1, 2 and 3 to improve feature quality. Finally, the fused features and the locally enhanced text and image features are concatenated and fed into a classifier.

Text feature extraction

Usually, when people post on social media, they express their thoughts in the form of text. The text contains the main meaning that the publisher wants to express. Therefore, how to process the text and extract high-quality text features has a significant impact on the detection accuracy of the model. This paper extracts text features through the combination of BERT and Text-CNN models.

BERT is a Transformer-based pre-training model. It is first trained on a large unsupervised dataset to learn general knowledge, which is then transferred to a specific task. Owing to its structure, BERT can achieve good results while reducing the consumption of training resources. However, BERT's extraction of local features is slightly inadequate, so this paper uses Text-CNN to process the text features extracted by BERT so that they contain more local information. Text-CNN is a CNN model applied to text: the length of each convolution kernel is kept consistent with the dimension of the word feature, and only the width of the kernel is adjusted, so the model can extract features similar to n-grams. The text features extracted by BERT are calculated as follows:

PE = Position_Embeddings(t)   (1)
SE = Segment_Embeddings(t)   (2)
TE = Token_Embeddings(t)   (3)
T_all = PE + SE + TE   (4)

Position_Embeddings() encodes the position of the text, Segment_Embeddings() encodes the paragraphs of the text, and Token_Embeddings() transforms each word in the text into a word vector, where t is a piece of text. T_all = {T[cls], T} is the text feature extracted by BERT, where T ∈ R^(n×d_i), T[cls] ∈ R^(d_i), n is the number of words in a sentence, and d_i is the dimension of each token feature vector. In this paper, we process T with Text-CNN at three scales, calculated as follows:

T1 = φ(conv1(W1, T) + b1), k = 1   (5)
T2 = φ(conv1(W2, T) + b2), k = 2   (6)
T3 = φ(conv1(W3, T) + b3), k = 3   (7)

φ is the activation function, conv1() denotes the one-dimensional convolution operation, W1, W2, W3 are learnable convolution kernels, b1, b2, b3 are learnable biases, and k is the width of the convolution kernel. T1, T2, T3 ∈ R^64 are the text features obtained after processing by Text-CNN of the three widths.
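The PyTorch sketch below illustrates this text branch under our assumptions: BERT token features are passed through three Conv1d branches with widths 1, 2 and 3 (64 kernels each), and each branch is reduced to a 64-dimensional vector by max-pooling over positions. The pooling choice, the ReLU activation for φ and the "bert-base-chinese" checkpoint are our assumptions; the paper only states that T1, T2, T3 ∈ R^64.

```python
# Hedged sketch of the text branch: BERT token features + Text-CNN at widths 1, 2, 3.
import torch
import torch.nn as nn
from transformers import BertModel

class TextBranch(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", d_in=768, n_kernels=64):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # English data would use an English BERT
        # Conv1d expects (batch, channels, seq_len); kernel widths k = 1, 2, 3.
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_in, n_kernels, kernel_size=k) for k in (1, 2, 3)]
        )

    def forward(self, input_ids, attention_mask):
        T = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state  # (B, n, 768)
        x = T.transpose(1, 2)                                                      # (B, 768, n)
        # T_k = ReLU(conv_k(T)), then max-pool over positions -> (B, 64) per width.
        return T, [torch.relu(c(x)).max(dim=-1).values for c in self.convs]
```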

Image feature extraction

Images are more believable than text content, so an accurate image feature extraction module plays an important role in false information detection models. We use SWTR to extract image features and further process the extracted image features through Text-CNN with three widths.

SWTR is a successful application of the Transformer to computer vision. Compared with CNN-based models, SWTR can extract both local and global features through its windowing mechanism, and this mechanism also reduces the computational cost compared with other Transformer-based models; SWTR achieves SOTA performance on multiple tasks. Since SWTR, like BERT, is a Transformer-based model, it shares some shortcomings in the treatment of local features, and its windowing mechanism cannot set the kernel size as flexibly as a CNN. Therefore, this paper inputs the image features extracted by SWTR into Text-CNN of three different widths for further processing. The specific calculation process is as follows:

V = SWTR(v)   (8)
V1 = φ(conv1(W4, V) + b4), k = 1   (9)
V2 = φ(conv1(W5, V) + b5), k = 2   (10)
V3 = φ(conv1(W6, V) + b6), k = 3   (11)

SWTR() denotes extracting image features with the Swin Transformer, φ is the activation function, conv1() denotes the one-dimensional convolution operation, W4, W5, W6 are learnable convolution kernels, b4, b5, b6 are learnable biases, k is the width of the convolution kernel, v is an image in the post, V ∈ R^(z×d_v) is the image feature, z is the number of extracted features, and d_v is the dimension of each feature vector. V1, V2, V3 ∈ R^64 are the image features after processing by Text-CNN of the three widths.
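A corresponding sketch of the image branch is shown below. Obtaining the (49, 768) patch features from the Swin Transformer is assumed here to go through timm's forward_features (the backbone variant and the returned layout depend on the timm version, so the code reshapes defensively); the three Conv1d branches mirror the text branch above.

```python
# Hedged sketch of the image branch: Swin Transformer patch features + Text-CNN at widths 1, 2, 3.
import timm
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    def __init__(self, backbone="swin_tiny_patch4_window7_224", d_in=768, n_kernels=64):
        super().__init__()
        self.swin = timm.create_model(backbone, pretrained=True, num_classes=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_in, n_kernels, kernel_size=k) for k in (1, 2, 3)]
        )

    def forward(self, images):                      # images: (B, 3, 224, 224)
        V = self.swin.forward_features(images)      # (B, 7, 7, 768) or (B, 49, 768), version-dependent
        V = V.reshape(V.shape[0], -1, V.shape[-1])  # (B, 49, 768)
        x = V.transpose(1, 2)                       # (B, 768, 49)
        # V_k = ReLU(conv_k(V)), max-pooled over the 49 patch positions -> (B, 64) per width.
        return V, [torch.relu(c(x)).max(dim=-1).values for c in self.convs]
```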

Feature fusion

So far we have obtained the text features T_all extracted by BERT, the image features V extracted by SWTR, the text features T1, T2, T3 and the image features V1, V2, V3. In computer vision tasks, the SE module is mostly used for channel-wise enhancement of input feature maps: for a feature map A of dimension (H, W, C), the SE module feeds A into two fully connected layers to obtain attention scores and then multiplies A by these scores along the channel dimension to obtain the output. We modify the SE module so that it can fuse text features and image features into multimodal fused features; the resulting MSE module is shown in Fig 3. The SE module assigns a weight to each channel and automatically suppresses low-weight noise, and with only a small increase in the number of parameters it can greatly improve performance on related tasks. The calculation process of the MSE module is as follows:

Score_text = Linear(V)   (12)
T′ = Bmm(Score_text, T)   (13)
T* = avgpooling(β(T′))   (14)
Score_image = Linear(T)   (15)
V′ = Bmm(Score_image, V)   (16)
V* = avgpooling(β(V′))   (17)

Fig 3. Multimodal fusion of SE modules (MSE).

The attention scores between the two modalities are obtained through fully connected layers. Score_image ∈ R^(n×z) and Score_text ∈ R^(z×n) are the attention scores of the text features on the image features and of the image features on the text features, respectively. Bmm() denotes batch matrix multiplication, T* ∈ R^n and V* ∈ R^z are the text features and image features fused by the MSE module, and β is the activation function. We then concatenate the text and image features processed by Text-CNN with the fused features produced by the MSE module. The calculation process is as follows:

F = concatenate(T1, T2, T3, V1, V2, V3, T*, V*)   (18)

Concatenate() denotes the concatenation operation, and F ∈ R^(64×6+n+z) is the final fused feature that is input into the classifier.
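The sketch below gives one consistent reading of the MSE fusion in Eqs (12)-(17): fully connected layers produce the cross-modal attention scores, Bmm re-weights each modality with the other modality's scores, and average pooling over the hidden dimension yields T* of length n and V* of length z. The operand ordering of the Bmm and the choice of sigmoid for β are our assumptions, not a verbatim reproduction of the authors' code.

```python
# Hedged sketch of the modified SE (MSE) fusion module.
import torch
import torch.nn as nn

class MSEFusion(nn.Module):
    def __init__(self, d=768, n_text=95, n_img=49):
        super().__init__()
        self.to_text_scores = nn.Linear(d, n_text)  # image features -> Score_text
        self.to_img_scores = nn.Linear(d, n_img)    # text features  -> Score_image

    def forward(self, T, V):                        # T: (B, n, d), V: (B, z, d)
        score_text = self.to_text_scores(V)         # (B, z, n), cf. Eq. (12)
        score_img = self.to_img_scores(T)           # (B, n, z), cf. Eq. (15)
        # Re-weight each modality by the other's attention scores, activate,
        # then average over the hidden dimension (avgpooling in Eqs. (14) and (17)).
        T_star = torch.sigmoid(torch.bmm(score_text.transpose(1, 2), V)).mean(dim=-1)  # (B, n)
        V_star = torch.sigmoid(torch.bmm(score_img.transpose(1, 2), T)).mean(dim=-1)   # (B, z)
        return T_star, V_star
```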

False information detection

We feed the fused features into a fully connected layer and a Softmax layer to obtain the detection results.

F* = Linear(F)   (19)
pi = Softmax(F*)   (20)
yi = argmax(pi)   (21)

We use the cross-entropy loss function to calculate the loss value:

L = -∑_{i=1}^{m} [li·log(pi) + (1 - li)·log(1 - pi)]   (22)

pi is the probability that the ith post is false, argmax selects the predicted label, yi is the label predicted by the model for the ith post, m is the number of posts, and li ∈ {0, 1} is the true label, where 1 represents false information and 0 represents true information.
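A minimal sketch of the detection head follows, assuming the concatenated feature F of dimension 64×6+n+z from Eq (18). PyTorch's cross_entropy on the logits is used as a numerically stable equivalent of the binary cross-entropy in Eq (22); the layer sizes are illustrative.

```python
# Minimal sketch of the classifier and loss (Weibo setting: n = 95, z = 49).
import torch
import torch.nn as nn

n, z = 95, 49
classifier = nn.Linear(64 * 6 + n + z, 2)  # Eq. (19): fully connected layer over F

def detection_step(F, labels):
    """Softmax over two classes + cross-entropy, matching Eqs. (20)-(22)."""
    logits = classifier(F)                              # (B, 2)
    probs = torch.softmax(logits, dim=-1)               # Eq. (20): p_i
    preds = probs.argmax(dim=-1)                        # Eq. (21): predicted labels y_i
    loss = nn.functional.cross_entropy(logits, labels)  # equivalent to the binary CE in Eq. (22)
    return loss, preds
```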

Experiment and analysis

Dataset and experimental settings

Machine configuration and environment for this experiment: CPU: Intel Xeon E5-2630L v3 (8 cores), 62 GB memory; GPU: NVIDIA GeForce RTX 3090; PyTorch 1.7.1, Python 3.8, CUDA 10.2. To compare with previous work, we use the Twitter and Weibo datasets, two publicly available, high-quality datasets for multimodal false information detection.

The Twitter dataset was published by Boididou et al. [2]. It contains a training set and a test set. The training set contains three types of posts: false, real and humorous, while the test set contains only real and false posts, so we remove the humorous posts from the training set. The Weibo dataset is a multimodal Chinese dataset that contains only real and false posts. We split posts containing multiple images in both datasets into multiple posts containing a single image, and we deleted posts that contain only an image or only text as well as posts whose images are GIFs or black-and-white images. Table 1 shows the data distribution of the Weibo and Twitter datasets.

Table 1. Data distribution of the Weibo and Twitter datasets.

Dataset | Train (False) | Train (Real) | Test (False) | Test (Real) | Images
Twitter | 6827 | 4993 | 717 | 1215 | 410
Weibo | 6476 | 4096 | 1136 | 1215 | 13274

Both datasets are publicly available and widely used in false information detection studies. We are only interested in the text, images and labels, so the remaining information is removed. We take the text and images as input to the model and the labels as ground truth. We first preprocess the text and images: for the text, we remove punctuation, URLs and emoticons; for the images, we resize all images to (224, 224, 3). The training set is used to train the model, and the test set is used to evaluate its performance.
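A hedged sketch of this preprocessing step is given below; the regular expressions for URLs, emoticons and punctuation are illustrative, not the authors' originals.

```python
# Illustrative preprocessing: clean the text and resize every image to (224, 224, 3).
import re
from PIL import Image

URL_RE = re.compile(r"https?://\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoticon range
PUNCT_RE = re.compile(r"[^\w\s]")

def clean_text(text: str) -> str:
    text = URL_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return PUNCT_RE.sub("", text).strip()

def load_image(path: str) -> Image.Image:
    return Image.open(path).convert("RGB").resize((224, 224))
```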

Table 2 lists all the hyperparameters used to train the model.

Table 2. Hyperparameters used in model training.

Parameter | Twitter | Weibo
Text length | 33 | 95
Image size | (224, 224, 3) | (224, 224, 3)
Batch size | 70 | 70
Optimizer | Adam (lr = 0.0001) | Adam (lr = 0.00005)
Epochs | 100 | 100
Dropout | 0.2 | 0.6
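The runnable sketch below wires the Weibo column of Table 2 into a standard PyTorch training loop; the dummy model and data stand in for the full multimodal network and DataLoader, and the feature dimension 528 = 64×6 + 95 + 49 follows Eq (18).

```python
# Training configuration from Table 2 (Weibo column), with dummy stand-ins so it runs.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Dropout(p=0.6), nn.Linear(528, 2))   # dropout 0.6 (Weibo), 0.2 (Twitter)
loader = [(torch.randn(70, 528), torch.randint(0, 2, (70,)))  # batch size 70
          for _ in range(3)]
optimizer = torch.optim.Adam(model.parameters(), lr=0.00005)  # lr = 0.0001 for Twitter

for epoch in range(100):                                      # 100 epochs
    for features, labels in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(features), labels)
        loss.backward()
        optimizer.step()
```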

Comparative experiment

We implement several uni-modal and multimodal models to verify the effectiveness of our model.

Uni-modal based models:

  • BERT: We use T[cls] extracted from the fine-tuned BERT-Base as text features. The text features T[cls] are fed into the classifier to detect the authenticity of posts.

  • SWTR: We feed the image features V extracted by the SWTR model into an average pooling layer to obtain the image features Va, and then input Va into the classifier to detect the authenticity of posts.

Multimodal based models:

  • att-RNN [16]: att-RNN is an RNN with attention mechanism that fuses text and image features for false information detection.

  • EANN [43]: EANN (Event Adversarial Neural Network) is an end-to-end event adversarial network that uses an event discriminator to remove the impact of event information on detection results and improve the generality of the model.

  • MVAE [44]: MVAE (Multimodal Variational Autoencoder, MVAE) is used to learn the correlation between modalities and then combined with a classifier to detect false information.

  • AMFB [23]: AMFB (Attention based Multimodal Factorized Bilinear pooling) uses BiLSTM and VGG19 to extract text and image features and fuses them with multimodal factorized bilinear pooling.

In order to verify the effectiveness of our proposed model, we compare the above baseline models with our model on both the Weibo and Twitter datasets. We run each model 5 times under the same experimental conditions and report the average of the 5 runs to reduce the influence of experimental error. The results are shown in Table 3. In terms of accuracy and F1 value, our model outperforms the existing baseline models. It can also be observed that the uni-modal models perform worse than the multimodal models on both datasets, which suggests that using both text and image information is more effective for detecting false information.

Table 3. Comparative results for the Weibo and Twitter datasets.

Dataset | Model | Accuracy | F1 value of False News | F1 value of Real News
Twitter | BERT | 0.831 | 0.852 | 0.802
Twitter | SWTR | 0.834 | 0.871 | 0.766
Twitter | att-RNN | 0.664 | 0.676 | 0.651
Twitter | EANN | 0.741 | 0.610 | 0.810
Twitter | MVAE | 0.745 | 0.758 | 0.730
Twitter | AMFB | 0.883 | 0.920 | 0.810
Twitter | OUR | 0.903 | 0.924 | 0.866
Weibo | BERT | 0.870 | 0.880 | 0.858
Weibo | SWTR | 0.713 | 0.702 | 0.724
Weibo | att-RNN | 0.772 | 0.692 | 0.754
Weibo | EANN | 0.791 | 0.780 | 0.800
Weibo | MVAE | 0.824 | 0.809 | 0.837
Weibo | AMFB | 0.832 | 0.840 | 0.830
Weibo | OUR | 0.897 | 0.902 | 0.890

Figs 4 and 5 show the accuracy and loss curves of our model during training on the Twitter and Weibo datasets ('iter' is short for 'iterations'). As can be seen from the figures, the loss gradually decreases to an equilibrium value and then fluctuates slightly around it, which indicates that the model is learning properly. The model is fully trained on both datasets, although convergence is harder to reach on the Weibo dataset because it contains more images and its posts come from many different events, whereas most posts in the Twitter dataset come from the same event.

Fig 4. Accuracy and loss curves of the model when trained on the Twitter dataset.

Fig 5. Accuracy and loss curves of the model when trained on the Weibo dataset.

Ablation experiment

We set up 4 ablation experiments to demonstrate the effectiveness of our model.

  • Ablation Experiment 1: We compare our model with the original model after removing different modules to demonstrate the validity of each module.

  • Ablation Experiment 2: We compare the original BERT and SWTR models with our improved BERT and SWTR models to demonstrate the validity of our improvements.

  • Ablation Experiment 3: We set up a series of experiments to demonstrate that the model is most effective in processing text and image features using three different scales of Text-CNNs simultaneously.

  • Ablation Experiment 4: We compare several different fusion methods with the fusion method we use to demonstrate its effectiveness.

For the above 4 groups of ablation experiments, in order to eliminate errors, we performed 5 experiments for each model and took the average value.

Ablation experiment one

To demonstrate the effectiveness of each module in our proposed model, we conduct ablation experiments and the results are shown in Table 4:

  • OUR: The complete model presented in this paper.

  • -SE: We simply concatenate the text features T1, T2, T3 extracted by BERTcnn and the image features V1, V2, V3 extracted by SWTRcnn to detect the authenticity of the post; the MSE module is removed.

  • -Text-CNN: We use the MSE module to fuse the text features T and the image features V, excluding the text and image features processed by Text-CNN.

  • -SE-Text-CNN: We simply concatenate the text feature T[cls] extracted by BERT-Base and the image feature Va processed by the average pooling layer and input them into the classifier to detect the authenticity of the post; both the MSE module and the Text-CNN module are removed.

Table 4. Results of ablation experiment 1 on the Weibo dataset.
Model | Accuracy | F1 value of False News | F1 value of Real News
-SE-Text-CNN | 0.887 | 0.894 | 0.879
-SE | 0.893 | 0.900 | 0.885
-Text-CNN | 0.889 | 0.896 | 0.881
OUR | 0.897 | 0.902 | 0.890

From Table 4, the complete model achieves the best results, demonstrating the effectiveness of each module. Removing any module reduces accuracy compared with OUR: -SE drops by 0.4%, -Text-CNN by 0.8%, and -SE-Text-CNN by 1.0%. The MSE module alleviates the noise introduced in the fusion process, allowing the model to fuse text and image features better, and concatenating the Text-CNN-processed text and image features with the fused features reduces information loss during fusion.

Ablation experiment two

To verify that our improvements to BERT and SWTR are effective, we compare the original BERT and SWTR with our improved versions; the results are shown in Fig 6.

  • BERT: We input the text features T[cls] extracted by BERT into the classifier to get the detection result of the post.

  • SWTR: We input the image features Va obtained from the average pooling layer into the classifier to obtain the classification results of the post.

  • BERTcnn: We obtain features T1, T2 and T3 by processing the text features T extracted by BERT with three different scales of Text-CNN. Subsequently, T1, T2 and T3 are concatenated and fed into the classifier to obtain detection results.

  • SWTRcnn: We use three different scales of Text-CNN to process the image features V to obtain features V1, V2 and V3, which are subsequently concatenated and fed into the classifier to obtain detection results.

Fig 6. Experimental results of ablation experiment 2.

As can be seen in Fig 6, our improvements to BERT and SWTR are effective. BERTcnn improves the accuracy on the Weibo dataset by 1.1% compared with BERT, and SWTRcnn improves the accuracy on the Weibo dataset by 1.6% compared with SWTR. The experimental results confirm our conjecture that the text and image features extracted by Transformer-based pre-trained models can be further improved by Text-CNN processing.

Ablation experiment three

In order to demonstrate the effectiveness of using three different scales of Text-CNN to process the features extracted by BERT and SWTR, we compare with the following models, and the results are shown in Figs 7 and 8.

  • BERTcnn1: We use Text-CNN with 64 convolution kernels of size (1,768) to process the text feature T extracted by BERT-Base, and input the processed text features into the classifier to obtain classification results.

  • BERTcnn2: We use Text-CNN with 64 convolution kernels of size (1,768) and 64 convolution kernels of size (2,768) to process the text features T; the outputs are concatenated and input into the classifier to obtain detection results.

  • BERTcnn4: We use Text-CNN with 64 convolution kernels of size (1,768), 64 of size (2,768), 64 of size (3,768) and 64 of size (4,768) to process the text features T; the outputs are concatenated and input into the classifier to obtain the results.

  • BERTcnn: The model used in this paper for detecting the authenticity of posts after processing the text features T using Text-CNN at three different scales.

  • SWTRcnn1: Similar to the model BERTcnn1, except that the processed text features T are replaced with image features V.

  • SWTRcnn2: Similar to the model BERTcnn2, except that the processed text features T are replaced with the image features V.

  • SWTRcnn4: Similar to the model BERTcnn4, except that the processed text features T are replaced with the image features V.

  • SWTRcnn: The image features V1, V2 and V3 are concatenated and fed into the classifier to obtain the detection results.

Fig 7. Results of ablation experiment 3 on the Weibo dataset (part 1).

Fig 8. Results of ablation experiment 3 on the Weibo dataset (part 2).

As can be seen in Figs 7 and 8, the detection accuracy of BERTcnn1, BERTcnn2, BERTcnn and of SWTRcnn1, SWTRcnn2, SWTRcnn gradually improves as the number of Text-CNN scales increases. However, when the number of kernel scales exceeds three, performance gradually decreases. Comparing BERTcnn and SWTRcnn with BERTcnn4 and SWTRcnn4, the accuracy increases by 0.3% and 0.8%, and the F1 value increases by 0.3% and 0.6%. The experimental results show that processing text and image features with Text-CNN at the three scales (1,768), (2,768) and (3,768) is the most effective.

We analyze the dataset to further validate this conclusion. We used the jieba word segmenter to segment the test set of the Weibo dataset and counted the number of tokens of each length. The distribution is shown in Fig 9.

Fig 9. Frequency of words with different lengths.

From Fig 9, we can see that token lengths within a sentence are not uniform, so when we use Text-CNN at different scales to extract n-gram-like local features, we extract some valid features but also some invalid ones. Fig 9 shows that 97% of the tokens in the dataset have a length of less than 4 and only 3% are longer than 3. Combined with the experimental results in Fig 7, we conclude that when Text-CNN with widths of 1, 2 and 3 is used, more valid features than invalid features are extracted, which improves the performance of the model; when Text-CNN with larger widths is used, more invalid features than valid features are extracted, which degrades performance.
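A small sketch of this token-length statistic is given below; the example sentences are placeholders for the cleaned Weibo test texts.

```python
# Count how often tokens of each length occur after jieba segmentation (cf. Fig 9).
from collections import Counter
import jieba

test_texts = ["这是一个例子", "多模态虚假信息检测"]  # placeholders for the Weibo test set
lengths = Counter(len(tok) for text in test_texts for tok in jieba.cut(text))
total = sum(lengths.values())
for length, count in sorted(lengths.items()):
    print(f"token length {length}: {count / total:.1%}")
```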

Ablation experiment four

To demonstrate the effectiveness of our fusion method, we set up several different fusion models to compare with our fusion model.

  • E-Sum (Early Summation): The features of the different modalities are weighted and summed by position and fed into the Transformer for processing.

  • E-Con (Early Concatenation): The features of the different modalities are concatenated and fed into the Transformer for processing.

  • M-to-O (multi-stream to one-stream): Two Transformer layers first process the text features and image features separately; the outputs are then concatenated and input into another Transformer layer for processing.

  • O-to-M (one-stream to multi-stream): The text features and image features are first concatenated and input into a Transformer layer for processing, then split and input into two Transformer layers for processing.

  • Cross-A (Cross-Attention): Two Transformer layers process the text features and image features, and the two Q (Query) matrices are exchanged to complete the fusion of text and image features (a minimal sketch follows this list).

  • Cross-A-C (Cross-Attention to Concatenation): The text features and image features processed by Cross-A are concatenated and input to another Transformer layer for processing.
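As referenced in the Cross-A item above, the sketch below illustrates the query-exchange idea with two standard attention blocks; the use of nn.MultiheadAttention and the layer sizes are our assumptions rather than the exact baseline implementations evaluated here.

```python
# Hedged sketch of Cross-Attention fusion: the queries of the two modalities are exchanged.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d=768, n_heads=8):
        super().__init__()
        self.text_attends_image = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.image_attends_text = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, T, V):          # T: (B, n, d), V: (B, z, d)
        # Text queries attend over image keys/values, and vice versa.
        T_out, _ = self.text_attends_image(query=T, key=V, value=V)
        V_out, _ = self.image_attends_text(query=V, key=T, value=T)
        return T_out, V_out
```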

As can be seen in Fig 10, our model has the best performance, which demonstrates the effectiveness of the feature fusion method we use; moreover, our fusion method has fewer parameters than the other fusion methods.

Fig 10. Experimental results of ablation experiment 4.

As shown in Table 5, the number of parameters of the fusion method we use is much smaller than that of the other fusion methods.

Table 5. Number of parameters of the different fusion methods.
Model | Number of parameters
OUR | 0.70M
E-Sum | 11.03M
E-Con | 11.03M
M-to-O | 33.08M
O-to-M | 33.08M
Cross-A | 11.02M
Cross-A-C | 22.04M

Conclusion

This paper proposes a multimodal false information detection method based on Text-CNN and the SE module. The model first uses multi-scale Text-CNN to process the text and image features and uses the MSE module to fuse the multimodal features. Finally, the Text-CNN-processed text and image features and the fused features are concatenated as the final representation used to detect false information. The comparative experiments demonstrate that our model achieves better results on the Weibo and Twitter datasets than the other models, and the ablation experiments validate the effectiveness of each of our improvements.

In future work, we will mainly study the following issues: (1) how to reduce the size of the model so that it can be deployed on small devices while maintaining detection accuracy; (2) how to extract higher-quality features from text and images; and (3) how to fuse text and image features more fully.

Data Availability

The Twitter dataset is available from https://github.com/MKLab-ITI/image-verification-corpus. The Weibo dataset is available from https://doi.org/10.1145/3123266.3123454.

Funding Statement

National Natural Science Foundation of China (62166042). National Natural Science Foundation of China (U2003207). Natural Science Foundation of Xinjiang, China (2021D01C076). Strengthening Plan of National Defense Science and Technology Foundation of China (2021-JCJQ-JJ-0059). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Gupta M, Zhao P, Han J. Evaluating event credibility on Twitter. Proceedings of the 2012 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2012: 153-164.
  • 2. Boididou C, Andreadou K, Papadopoulos S, et al. Verifying multimedia use at MediaEval. MediaEval, 2015, 3(3): 7.
  • 3. Rashkin H, Choi E, Jang J Y, et al. Truth of varying shades: Analyzing language in fake news and political fact-checking. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017: 2931-2937.
  • 4. Popat K, Mukherjee S, Strötgen J, et al. Credibility assessment of textual claims on the web. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 2016: 2173-2178.
  • 5. Pérez-Rosas V, Kleinberg B, Lefevre A, et al. Automatic detection of fake news. arXiv preprint arXiv:1708.07104, 2017.
  • 6. Alonso-Bartolome S, Segura-Bedmar I. Multimodal fake news detection. arXiv preprint arXiv:2112.04831, 2021.
  • 7. Peng X, Xintong B. An effective strategy for multi-modal fake news detection. Multimedia Tools and Applications, 2022, 81(10): 13799-13822. doi: 10.1007/s11042-022-12290-8
  • 8. Choi H, Ko Y. Effective fake news video detection using domain knowledge and multimodal data fusion on YouTube. Pattern Recognition Letters, 2022, 154: 44-52. doi: 10.1016/j.patrec.2022.01.007
  • 9. Graves A. Long short-term memory. Supervised Sequence Labelling with Recurrent Neural Networks, 2012: 37-45. doi: 10.1007/978-3-642-24797-2_4
  • 10. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30.
  • 11. Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60(6): 84-90. doi: 10.1145/3065386
  • 12. Chen Y. Convolutional neural network for sentence classification. University of Waterloo, 2015.
  • 13. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7132-7141.
  • 14. Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • 15. Liu Z, Lin Y, Cao Y, et al. Swin Transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 10012-10022.
  • 16. Jin Z, Cao J, Guo H, et al. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. Proceedings of the 25th ACM International Conference on Multimedia. 2017: 795-816.
  • 17. Qazvinian V, Rosengren E, Radev D, et al. Rumor has it: Identifying misinformation in microblogs. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. 2011: 1589-1599.
  • 18. Meng J, Wang L, Yang Y, et al. Multi-modal deep fusion for false information detection. Journal of Computer Applications, 2022, 42(2): 419.
  • 19. Liu Z, Wei Z, Zhang R. Rumor detection based on convolutional neural network. Journal of Computer Applications, 2017, 37(11): 3053.
  • 20. Ma J, Gao W, Mitra P, et al. Detecting rumors from microblogs with recurrent neural networks. 2016.
  • 21. Nasir J A, Khan O S, Varlamis I. Fake news detection: A hybrid CNN-RNN based deep learning approach. International Journal of Information Management Data Insights, 2021, 1(1): 100007. doi: 10.1016/j.jjimei.2020.100007
  • 22. Singhal S, Shah R R, Chakraborty T, et al. SpotFake: A multi-modal framework for fake news detection. 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). IEEE, 2019: 39-47.
  • 23. Kumari R, Ekbal A. AMFB: Attention based multimodal factorized bilinear pooling for multimodal fake news detection. Expert Systems with Applications, 2021, 184: 115412. doi: 10.1016/j.eswa.2021.115412
  • 24. Song C, Ning N, Zhang Y, et al. A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks. Information Processing & Management, 2021, 58(1): 102437. doi: 10.1016/j.ipm.2020.102437
  • 25. Dhawan M, Sharma S, Kadam A, et al. GAME-ON: Graph attention network based multimodal fusion for fake news detection. arXiv preprint arXiv:2202.12478, 2022.
  • 26. Wu Y, Zhan P, Zhang Y, et al. Multimodal fusion with co-attention networks for fake news detection. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021: 2560-2569.
  • 27. Xu P, Zhu X, Clifton D A. Multimodal learning with Transformers: A survey. arXiv preprint arXiv:2206.06488, 2022.
  • 28. Gavrilyuk K, Sanford R, Javan M, et al. Actor-transformers for group activity recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 839-848.
  • 29. Xu P, Zhu X. DeepChange: A large long-term person re-identification benchmark with clothes change. arXiv preprint arXiv:2105.14685, 2021.
  • 30. Sun C, Myers A, Vondrick C, et al. VideoBERT: A joint model for video and language representation learning. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 7464-7473.
  • 31. Guo D, Ren S, Lu S, et al. GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366, 2020.
  • 32. Shi B, Hsu W N, Lakhotia K, et al. Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv preprint arXiv:2201.02184, 2022.
  • 33. Zheng R, Chen J, Ma M, et al. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. International Conference on Machine Learning. PMLR, 2021: 12736-12746.
  • 34. Li R, Yang S, Ross D A, et al. AI Choreographer: Music conditioned 3D dance generation with AIST++. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 13401-13412.
  • 35. Lin J, Yang A, Zhang Y, et al. InterBERT: Vision-and-language interaction for multi-modal pretraining. arXiv preprint arXiv:2003.13198, 2020.
  • 36. Murahari V, Batra D, Parikh D, et al. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. European Conference on Computer Vision. Springer, Cham, 2020: 336-352.
  • 37. Lu J, Batra D, Parikh D, et al. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 2019, 32.
  • 38. Zhan X, Wu Y, Dong X, et al. Product1M: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 11782-11791.
  • 39. Tsai Y H H, Bai S, Liang P P, et al. Multimodal Transformer for unaligned multimodal language sequences. Proceedings of the Association for Computational Linguistics Meeting. 2019: 6558.
  • 40. Wang G, Tan L, Shang Z, et al. Multimodal dual emotion with fusion of visual sentiment for rumor detection. arXiv preprint arXiv:2204.11515, 2022.
  • 41. Azri A, Favre C, Harbi N, et al. Calling to CNN-LSTM for rumor detection: A deep multi-channel model for message veracity classification in microblogs. Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Cham, 2021: 497-513.
  • 42. Kirchknopf A, Slijepcevic D, Zeppelzauer M. Multimodal detection of information disorder from social media. arXiv preprint arXiv:2105.15165, 2021.
  • 43. Wang Y, Ma F, Jin Z, et al. EANN: Event adversarial neural networks for multi-modal fake news detection. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 849-857.
  • 44. Khattar D, Goud J S, Gupta M, et al. MVAE: Multimodal variational autoencoder for fake news detection. The World Wide Web Conference. 2019: 2915-2921.

Decision Letter 0

T Ganesh Kumar

12 Oct 2022

PONE-D-22-26288: Multimodal false information detection method based on Text-CNN and SE module. PLOS ONE

Dear Dr. Liang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: I assigned this research paper to two reviewers and received comments with minor revision from both reviewers. Authors has to revise the manuscript as per the reviewer suggestions. 

==============================

Please submit your revised manuscript by Nov 26 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

T. Ganesh Kumar, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

"Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

4. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Authors are requested to update reviewer suggestion in your paper, then resubmit it.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Comments 1:

This is an excellent report dealing with significant technical matters. I find no fault whatsoever with the methods, data analysis, or conclusions. The work, as with all work coming from this particular domain, is fundamentally sound. My comments here are concerned solely with the organization of the manuscript. Consideration of these points will, I believe, lead to an improved report that better illustrates the key concepts and conclusions.

Comments 2:

Correction:

Abbreviate the figure 5. 'Iter' in full form

Comments 3:

Suggestion:

For data availability, use Plos one referenced data store from the site due to the shared Github link not working. Only Google Drive works for downloading.

Comments 4:

In line number 369 mentioned, we used the jieba word splitter to segment the test dataset, but in the fig.2 or any of the earlier studies, did not mention this process.

Comments 5:

In table 2 it mentioned image resize into 224,224,3 but in the above paragraph line number 237 stated 'and for the image part, we keep all images the same size'. Repharse the term.

Reviewer #2: Dear Author,

1. Kindly elaborate about your proposed work along with Deep Learning techniques / algorithms in detail.

2. In the dataset the text information contains Chinese language, how you will identify the false news?

3. Refer to the existing model technique, Kindly compare with your proposed work. Detail explanation is required about your implementation work. (Tools, language, flow chart, etc)

4. Twitter dataset analyses and implementation details required. (Webio dataset implementation adequate details provided in manuscript)

5. Add more details about Squeeze-and-Excitation Networks.

6. Kindly mention the total dataset used in each model in table no.3 (Include one more column)

7. Overall the concept is good.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Janarthanan Sekar

Reviewer #2: Yes: ANANDHAN K

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Nov 23;17(11):e0277463. doi: 10.1371/journal.pone.0277463.r002

Author response to Decision Letter 0


17 Oct 2022

Response to Reviewer#1

Point 1: Abbreviate the figure 5. 'Iter' in full form?

Response 1: Thank you for your suggestion. The full form of 'Iter' is 'Iterations'.

Point 2: For data availability, please use the data repository referenced on the PLOS ONE site, since the shared GitHub link is not working; only the Google Drive link works for downloading.

Response 2: Thank you for your suggestion; this was our oversight. We have updated the GitHub link: https://github.com/MKLab-ITI/image-verification-corpus

Point 3: Line 369 mentions that the jieba word splitter was used to segment the test dataset, but this process is not mentioned in Fig 2 or in any of the earlier sections.

Response 3: Thank you for your suggestion.

We used the jieba word splitter only to compute the percentage of words of different lengths in the Weibo dataset, in order to verify the rationality of processing text features with convolutional kernels of three different scales. We did not use the jieba word splitter to process the data in any of our other experiments.
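For illustration, a minimal sketch of how such a word-length statistic can be computed with jieba (the variable `texts` and the helper name are hypothetical and not part of our released code):

```python
from collections import Counter

import jieba

def word_length_distribution(texts):
    """Percentage of segmented words of each character length."""
    counts = Counter()
    for text in texts:                      # texts: list of Weibo post strings
        for word in jieba.lcut(text):       # jieba word segmentation
            counts[len(word)] += 1
    total = sum(counts.values())
    # e.g. {1: 42.1, 2: 45.3, 3: 8.9, ...} as percentages
    return {length: 100.0 * n / total for length, n in sorted(counts.items())}
```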

Point 4: Table 2 states that images are resized to (224, 224, 3), but the paragraph above (line 237) states 'and for the image part, we keep all images the same size'. Please rephrase the term.

Response 4: Thank you very much for your valuable comments.

We will change "and for the image part, we keep all images the same size" to "and for the image part, we resize all the images to (224, 224, 3)".

Response to Reviewer#2

Point 1: Kindly elaborate on your proposed work along with the deep learning techniques/algorithms in detail.

Response 1: Thank you for your suggestion.

First, we describe our data pre-processing. We resize all images to (224, 224, 3) and fix the text length (33 tokens for English data and 95 for Chinese data), processing each text in the form '[CLS] + Text + [SEP]'.
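For illustration, a minimal sketch of this pre-processing, assuming torchvision transforms and the Hugging Face BERT tokenizer (the exact tooling and checkpoint name are assumptions of this sketch, not a statement of our released code):

```python
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

# Resize every image to 224 x 224 with 3 channels.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                      # tensor of shape (3, 224, 224)
])

MAX_LEN = {"en": 33, "zh": 95}                  # fixed text lengths per language
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def preprocess(image_path, text, lang="zh"):
    image = image_transform(Image.open(image_path).convert("RGB"))
    # add_special_tokens (on by default) yields the '[CLS] + Text + [SEP]' form.
    encoded = tokenizer(text, max_length=MAX_LEN[lang],
                        padding="max_length", truncation=True,
                        return_tensors="pt")
    return image, encoded["input_ids"], encoded["attention_mask"]
```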

We use the pre-processed data to fine-tune the two pre-trained models, Swin-Transformer and BERT, so that they perform better on our task.

Finally, we introduce our proposed model in detail. The model first feeds the (224, 224, 3) image into the fine-tuned Swin-Transformer to obtain an image feature of size (49, 768), and feeds the pre-processed text into the fine-tuned BERT to obtain a text feature of size (33/95, 768). The image and text features are then processed by (1, 768), (2, 768) and (3, 768) convolution kernels, yielding three (1, 64) image features and three (1, 64) text features. In parallel, the (49, 768) image features and the (33/95, 768) text features are fused by our proposed MSE module into a (1, 49) image feature and a (1, 33/95) text feature. We then concatenate the three (1, 64) and one (1, 49) image features with the three (1, 64) and one (1, 33/95) text features to obtain a (1, 466/528) joint feature. Finally, the joint feature is fed into a fully connected layer with a softmax activation to obtain the final classification result.
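The following condensed PyTorch sketch mirrors the feature sizes described above. The module names are hypothetical, the fine-tuned encoders are assumed to already provide the (49, 768) image and (33/95, 768) text features, and the MSE fusion is reduced to a shape-preserving placeholder rather than the actual module described in the 'Feature fusion' section:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNBranch(nn.Module):
    """Apply (1, 768), (2, 768), (3, 768) kernels and max-pool each map to a (1, 64) vector."""
    def __init__(self, emb_dim=768, out_channels=64):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, out_channels, (k, emb_dim)) for k in (1, 2, 3)]
        )

    def forward(self, x):                            # x: (batch, seq_len, 768)
        x = x.unsqueeze(1)                           # (batch, 1, seq_len, 768)
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x)).squeeze(3)           # (batch, 64, seq_len - k + 1)
            feats.append(F.max_pool1d(c, c.size(2)).squeeze(2))  # (batch, 64)
        return feats                                 # three (batch, 64) vectors

class MSEFusionStub(nn.Module):
    """Placeholder for the modified SE (MSE) fusion module; it only reproduces the
    stated output shapes, reducing each (seq_len, 768) map to a (seq_len,) vector."""
    def __init__(self, emb_dim=768):
        super().__init__()
        self.img_proj = nn.Linear(emb_dim, 1)
        self.txt_proj = nn.Linear(emb_dim, 1)

    def forward(self, image_feat, text_feat):
        return self.img_proj(image_feat).squeeze(2), self.txt_proj(text_feat).squeeze(2)

class FusionClassifier(nn.Module):
    def __init__(self, text_len=95, img_len=49, num_classes=2):
        super().__init__()
        self.text_branch = TextCNNBranch()
        self.image_branch = TextCNNBranch()
        self.fusion = MSEFusionStub()
        # 6 x 64 Text-CNN features + 49 fused image dims + 95 (or 33) fused text dims
        self.classifier = nn.Linear(6 * 64 + img_len + text_len, num_classes)

    def forward(self, text_feat, image_feat):        # (batch, L, 768), (batch, 49, 768)
        t_feats = self.text_branch(text_feat)        # three (batch, 64) text features
        v_feats = self.image_branch(image_feat)      # three (batch, 64) image features
        fused_img, fused_txt = self.fusion(image_feat, text_feat)
        joint = torch.cat(v_feats + [fused_img] + t_feats + [fused_txt], dim=1)
        return F.softmax(self.classifier(joint), dim=1)   # joint: (batch, 466/528)
```

On Weibo-sized inputs, `FusionClassifier()(torch.randn(8, 95, 768), torch.randn(8, 49, 768))` returns an (8, 2) tensor of class probabilities.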

Point 2: The text information in the dataset contains Chinese; how will you identify the false news?

Response 2: Thank you for your suggestion.

BERT is a powerful pre-trained text model that can handle not only English text but also Chinese text. We therefore use the same processing method for Chinese text as for English text to discriminate false information; the only difference is the text length. For English text we keep the length at 33 tokens, while for Chinese text we keep it at 95.

Point 3: Refer to the existing model technique and compare it with your proposed work. A detailed explanation of your implementation is required (tools, language, flow chart, etc.).

Response 3: Thank you for your suggestion. This was an oversight on our part.

Machine configuration and environment for this experiment: CPU: Intel Xeon E5-2630L v3 (8 cores), 62 GB memory; GPU: NVIDIA GeForce RTX 3090; PyTorch 1.7.1, Python 3.8, CUDA 10.2.

Point 4: Analyses and implementation details for the Twitter dataset are required. (Adequate implementation details for the Weibo dataset are provided in the manuscript.)

Response 4: Thank you very much for your valuable comments.

We describe how the Twitter dataset is processed in "Dataset and experimental settings". As to why we do not analyze the Twitter dataset separately, there are several reasons:

1. For the Twitter dataset, the processing is similar to that of the Weibo dataset, except for the different text length.

2. The Weibo dataset is more representative than the Twitter dataset. In terms of data volume, the Weibo dataset contains more than 30 times as much image data as the Twitter dataset. In terms of composition, the Weibo dataset is closer to real life, because most of the posts in the Weibo dataset come from different events, while most of the posts in the Twitter dataset come from the same event. Therefore, we chose the more representative Weibo dataset for the ablation experiments.

3. Repeating our ablation experiments on the Twitter dataset is not necessary. We set up multiple groups of ablation experiments to validate the effectiveness of our proposed model, and to ensure the objectivity of the results we performed all of them on the more representative Weibo dataset. Since the effectiveness of our model has already been validated on the more complex Weibo dataset, we do not consider it necessary to repeat these ablation experiments on the simpler Twitter dataset.

Point 5: Add more details about Squeeze-and-Excitation Networks.

Response 5: Thank you very much for your valuable comments. We have added more details about Squeeze-and-Excitation Networks in the 'Feature fusion' section. The added content is shown below.

The SE module is mostly used for channel-wise feature enhancement of input feature maps in computer vision tasks. For example, given a feature map A with dimensions (H, W, C), the SE module squeezes A and passes the result through two fully connected layers to obtain channel attention scores, and then multiplies A by these scores along the channel dimension to produce the final output.
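For reference, a minimal sketch of a standard Squeeze-and-Excitation block following the original SE-Net design (the reduction ratio of 16 and the channels-first (B, C, H, W) layout are assumptions of this sketch):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze each channel to a scalar, pass the result through two fully
    connected layers, and re-scale the channels with the resulting scores."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)             # (B, C, H, W) -> (B, C, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                  # per-channel attention scores
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        scores = self.excite(self.squeeze(x).view(b, c))   # (B, C)
        return x * scores.view(b, c, 1, 1)                 # channel-wise re-weighting
```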

Point 6: Kindly mention the total dataset used in each model in Table 3 (include one more column).

Response 6: Thank you very much for your valuable comments.

We already mention the dataset used by each model in Table 3: the first column shows the dataset, and the Twitter dataset and the Weibo dataset are each followed by the experimental results of the seven models on that dataset.

Attachment

Submitted filename: Response to Reviewer #2.pdf

Decision Letter 1

T Ganesh Kumar

28 Oct 2022

Multimodal false information detection method based on Text-CNN and SE module

PONE-D-22-26288R1

Dear Dr. Liang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

T. Ganesh Kumar, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Dear Authors,

You have addressed the reviewers' comments.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

**********

Acceptance letter

T Ganesh Kumar

14 Nov 2022

PONE-D-22-26288R1

Multimodal false information detection method based on Text-CNN and SE module

Dear Dr. Liang:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. T. Ganesh Kumar

Academic Editor

PLOS ONE

