Sustainable Cities and Society. 2020 Dec 14;66:102652. doi: 10.1016/j.scs.2020.102652

Fake news detection for epidemic emergencies via deep correlations between text and images

Jiangfeng Zeng a, Yin Zhang b, Xiao Ma c,*
PMCID: PMC9760342  PMID: 36570569

Abstract

In recent years, major emergencies have occurred frequently all over the world. When a major global public health emergency such as COVID-19 breaks out, an increasing amount of fake news on social media networks is exposed to the public. Automatically detecting the veracity of a news article helps ensure that people receive truthful information, which benefits epidemic prevention and control. However, most existing fake news detection methods focus on inferring clues from text-only content and ignore the semantic correlations across modalities. In this work, we propose a novel approach for Fake News Detection by comprehensively mining the Semantic Correlations between Text content and Images attached (FND-SCTI). First, we learn image representations via the pretrained VGG model and use them to enhance the learning of text representations via a hierarchical attention mechanism. Second, a multimodal variational autoencoder is exploited to learn a fused representation of textual and visual content. Third, the image-enhanced text representation and the multimodal fusion eigenvector are combined to train the fake news detector. Experimental results on two real-world fake news datasets, Twitter and Weibo, demonstrate that our model outperforms seven competitive approaches and is able to capture the semantic correlations among multimodal contents.

Keywords: Epidemic diseases, Fake news detection, Semantic correlation, Multimodal fusion, Social networks

1. Introduction

The outbreak of major emergencies such as epidemic diseases always breeds a large quantity of disinformation, which greatly threatens the harmonious and sustainable development of society (Al-Turjman & Deebak, 2020; Arafatur Rahman et al., 2020; Jiawei, Wang, & Liu, 2020; Kolhar, Al-Turjman, Alameen, & Abualhaj, 2020; Masoud & Mirmahaleh, 2020; Srivastava, Srivastava, Chaudhary, & Al-Turjman, 2020). As the proverb goes, by the time the truth has put on its shoes, the lie has already spread all over the city. This well-known saying not only illustrates that disinformation spreads as quickly and widely as a virus, but also reveals that it causes great harm to human society. As reported in Hunt and Gentzkow (2017), the average American Internet user was exposed to one to three pieces of fake news within the final month of the 2016 US presidential election. Foreign media have also reported that online fake news would seriously influence the 2020 US presidential election. In the Internet era, fake news spreading widely on social media networks may also increase the strain on the Internet by consuming a large amount of traffic. For instance, while COVID-19 (Corona Virus Disease 2019) rages all over the world, social media networks in different countries are flooded with all kinds of fake news, which wastes network resources, so that people staying at home suffer from the poor quality of Internet applications such as Netflix, YouTube, Tencent Meeting and so on. The widespread dissemination of fake news in recent years has demonstrated our inability to defend against disinformation, and the gap between AI-driven counterfeiting techniques and detection techniques keeps widening in the Internet era (Lazer et al., 2018). Consequently, it is urgent to rebuild an information ecosystem that ensures authenticity and to reduce the strain on the Internet by curbing the propagation of fake news as soon as possible.

In theory, fake news should be worthless rubbish. In fact, however, fake news is a mixture of truth and falsehood that is intentionally created to deceive or mislead readers and is therefore easily confused with genuine news. For example, plenty of fake news about COVID-19 has been observed to attract more attention than genuine news, and surprisingly many people are willing to believe the fake stories. As a result, checking the veracity of a news article poses tough challenges.

Automatically detecting fake news has been garnering increasing research interest from both academia and industry. Traditional approaches (Castillo, Mendoza, & Poblete, 2011; Feng, Banerjee, & Choi, 2012; Zhu et al., 2012) extract handcrafted features from the news textual content relying on expert knowledge, and then apply traditional machine learning algorithms to train a fake news classifier. These handcrafted-feature-based methods are simple but lack comprehensiveness and flexibility (Kolhar, Al-Turjman, Alameen, & Abu-Alhaj, 2020; Wang et al., 2019). Substantial research has indicated that the design of handcrafted features is critical to various natural language processing tasks (Ma, Zeng, Peng, Fortino, & Zhang, 2019; Zhou, Zeng, Liu, & Zou, 2018). In the era of big data, learning highly distinguishable feature representations has become a new performance bottleneck of AI applications. The past decade has witnessed that deep neural networks, built from multiple layers of nonlinear transformations, are able to automatically learn more accurate and effective features from raw data and can be efficiently optimized via layer-wise gradient descent (Bahdanau, Cho, & Bengio, 2014; Jiang, Zeng, Zhou, Huang, & Yang, 2019; Shui-Hua & Zhang, 2020; Wang, Muhammad, Hong, Sangaiah, & Zhang, 2020). Deep learning techniques bring new insights into fake news detection and have achieved state-of-the-art performance by automatically extracting features from the news textual content (Chen, Li, Yin, & Zhang, 2018; Guo, Cao, Zhang, Guo, & Li, 2018a, 2018b; Jin, Cao, Zhang, & Luo, 2016; Li, Zhang, & Si, 2019; Lu & Li, 2020).

However, user-generated content published on social media networks is typically large-scale, noisy and multimodal. Multimodal machine learning has been gaining increasing attention in tasks such as multimodal sentiment classification (Huang, Wei, Weng, & Li, 2020; Jiang, Wang, Liu, & Ling, 2020; Truong & Lauw, 2019), image captioning (Vinyals, Toshev, Bengio, & Erhan, 2015), visual question answering (Antol et al., 2015), multimedia analysis (Atrey, Hossain, El Saddik, & Kankanhalli, 2010) and so on. With the rapid development of social media networks, fake news has evolved from classical text-only articles to articles with attached images or videos, which carry richer information but pose more challenges for the detection task. It is apparent that multimodal news articles engage more readers than text-only articles. Some deep-neural-network-based methods have been investigated for multimodal fake news detection (Dhruv, Singh, Manish, & Vasudeva, 2019; Jin, Cao, Guo, Zhang, & Luo, 2017; Wang et al., 2018; Zhang, Fang, Qian, & Xu, 2019). Although these approaches have proven successful and effective in detecting fake news, many difficulties remain in capturing the complex correlations between the news textual content and the news visual content.

In this work, we are dedicated to detecting newly emerged fake news by thoroughly analyzing multimodal news content that is fabricated and can be verified as false. In order to discover the complex correlations between the textual and visual modalities, we learn a multimodal fusion representation and an image-augmented text representation in a multi-task learning setting. The work most closely related to ours is MVAE (Dhruv et al., 2019), which learns a shared representation for a multimodal news article by using a VAE (Variational Autoencoder). However, there are two main differences between our proposed method and MVAE. (1) Given one text article with several attached images, MVAE regards each text-image pair as a separate instance, which is unreasonable and impractical. By contrast, our proposed method treats the text article together with all of its attached images as one instance. To be specific, the VAE takes each text-image pair as input and outputs a unified representation of the textual and visual modalities; since simply treating all images as equally important and merging them uniformly into a fusion vector results in sub-optimal performance, we then leverage an attention mechanism to generate a multimodal fusion eigenvector. (2) To further mine the correlations between the textual and visual modalities, a hierarchical attention based approach is designed to learn image-augmented text representations, which is able to differentiate each word of the text document and select the images of great importance to the prediction.

The key advantages of our proposed method are the following:

  • In order to comprehensively mine the semantic correlations between the textual and visual modalities, we put forward a novel end-to-end deep-neural-network-based method, dubbed FND-SCTI, which learns a multimodal fusion representation and an image-augmented text representation in a multi-task learning setting.

  • Extracting multimodal correlations is the core of checking whether a piece of multimodal news is fake. On the one hand, we exploit images as supervisors to highlight the important parts of the news text. On the other hand, a shared multimodal representation is obtained by using a variational autoencoder. The fake news detector then takes as input the multimodal fusion representation and the image-reinforced text representation to predict whether the news is fake or not.

  • Quantitative experiments on two real-world datasets are conducted to compare our approach with several competitive methods, and the experimental results validate the efficacy and superiority of our proposed approach for multimodal fake news detection.

  • We demonstrate that our proposed FND-SCTI is able to reveal the correlations between the textual and visual modalities by visualizing the attention weights of text-image pairs.

The remainder of this paper is organized as follows. In Section 2, the related work on fake news detection is briefly introduced. Afterwards, we illustrate the proposed FND-SCTI in detail in Section 3. In Section 4, we conduct extensive experiments to demonstrate the superiority of our proposed method. Finally, we draw conclusions and envision future work in Section 5.

2. The related work

The emergence of social media networks not only makes it convenient for users to access, create and share messages, but also breeds and spreads misinformation such as rumors and fake news, which have a more negative impact on human society than ever before. Over the past few decades, major emergencies have occurred frequently and resulted in a flood of disinformation such as fake news. Fake news detection has therefore received more and more attention recently. In this section, we provide a brief background on fake news detection and explore feasible insights for tackling fake news detection by modeling multimodal news content.

2.1. Unimodal fake news detection

The majority of existing approaches focus on detecting fake news by analyzing the text-only content. Earlier methods manually extracted text features and used traditional machine learning algorithms as the fake news detector. As a representative, Castillo et al. (2011) devised handcrafted features such as linguistic features, topic features and propagation features, which were fed into an SVM-based classifier for final prediction. Wu, Yang, and Zhu (2015) argued that ignoring the propagation structure of the posts results in sub-optimal detection performance and attempted to capture high-order propagation patterns, which were first combined with content semantic features and then fed into a graph-kernel-based SVM classifier. However, manually extracting features from text is complicated, time-consuming and costly, making it difficult to take advantage of big data.

Deep learning techniques have been fully investigated for fake news detection due to their powerful representation learning ability. Ma et al. (2016) were the first to devise an RNN-based method for rumor detection on microblogging platforms. In the work of Yu, Liu, Wu, Wang, and Tan (2017), Yu et al. attempted to identify misinformation by using a CNN-based method. Recently, Shu, Wang, and Liu (2019) explored the role of social context for textual fake news detection and put forward a tri-relationship embedding framework, TriFN, to represent the relationships among publishers, users and news articles, which proved the effectiveness of social context in improving the detection performance. Jin et al. (2016) tried to identify rumors by discovering conflicting user comments. Guo, Cao, Zhang, Shu, and Liu (2019) pointed out that not only did the sentiment features extracted from user comments contribute to recognizing misinformation, but the sentiment features extracted from news articles also played an important role in detecting fake news. Ma, Gao, and Wong (2018a) designed two tree-structured recursive neural networks, bottom-up and top-down respectively, and extracted content semantic features by following the propagation trees. Shu, Cui, Wang, Lee, and Liu (2019) used a co-attention mechanism to find the top K important sentences of a news article and the top K important user reviews as clues for final classification. Ma, Gao, and Wong (2018b) and Li et al. (2019) detected rumors and user stance jointly in a multi-task learning setting. Similar works, such as Cui, Wang, and Lee (2019), Liu and Wu (2018) and Zellers et al. (2019), have also been proposed. Wang, Yang, et al. (2020) proposed a reinforced weakly supervised method that leverages users' reports as weak supervision to enlarge the amount of training data. Lu (2020) developed GCAN, which predicts whether a piece of news is fake or not given the short-text tweet and the corresponding sequence of retweet users without comments. In our work, we aim to address the early detection of disinformation by analyzing only the news content, since the social context is generated only after the news has propagated widely.

The development of image editing software like Photoshop and the big success achieved by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Zeng, Ma, & Zhou, 2019) in image synthesis are lowering the technical barrier to image forgery. Fake image detection has thus also been investigated recently. In the work of Zhou, Han, Morariu, and Davis (2018), Zhou et al. proposed a Faster R-CNN based method to extract tampering features from an RGB stream and a noise stream for image manipulation detection. Qi, Cao, Yang, Guo, and Li (2019) designed a framework including a frequency domain module, a pixel domain module and a fusion module to learn visual representations for fake image detection.

2.2. Multimodal fake news detection

With the rapid development of mobile intelligent devices and information technology, an increasing number of posts with attached images are being published on social media networks. Unimodal information is insufficient and easily affected by various external factors. Moreover, due to the complementarity among different modalities, multimodal feature fusion is receiving more and more attention (Baltrušaitis, Ahuja, & Morency, 2018). It is widely believed that fusing textual features and visual features plays a very important role in better understanding the semantics of raw data.

Recently, more and more researchers have started to address the problem of multimodal fake news detection. Jin et al. (2017) proposed att-RNN, which exploits neural attention on top of an LSTM to fuse text features, social context and image features. Wang et al. (2018) devised a generative adversarial network (GAN) based model whose core idea is to use a discriminator to remove any event-specific features; the learned event-invariant features are then exploited to train the fake news detector. In the work of Dhruv et al. (2019), Khattar et al. leveraged a variational autoencoder (VAE) to learn a shared multimodal representation and achieved the current state-of-the-art results for multimodal fake news detection. The biggest difference between our work and MVAE is that our work not only learns the multimodal fusion eigenvectors but also enhances the representation of the text content using visual aspect attention.

Though the approaches mentioned above have proven successful and effective in detecting fake news, the problem of predicting fake news is far from being solved. In particular, multimodal fake news detection methods still leave much room for improvement.

3. The proposed approach

Fig. 1 gives a high-level illustration of the proposed approach for multimodal fake news detection. Specifically, we first formulate the problem of multimodal fake news detection in Section 3.1. Then an overview of our proposed model is presented in Section 3.2. Afterward, the four components of our proposed model are illustrated in detail in Sections 3.3–3.6, respectively. Finally, we present the training objectives used to jointly train the fake news detector and the multimodal variational autoencoder in Section 3.7. To clarify the notation used in this section, Table 1 explains some important symbols.

Fig. 1.

Overview of the proposed FND-SCTI for multimodal fake news detection. FND-SCTI is composed of four modules, i.e., word embeddings and VGG features, image-enhanced text representation learning, multimodal representation learning and the fake news detector. The first module prepares the inputs for the second and third modules. FND-SCTI is trained in a multi-task learning setting.

Table 1.

Description of some important symbols.

Symbol | Description
$p_m$ | The visual features extracted from the $m$-th image through the pretrained VGG-19
$h_{i,t}$ | The output of time step $t$ of the word encoder in stage 2
$h_t$ | The output of time step $t$ of the sentence encoder in stage 2
$d_j$ | The output of the hierarchical attention network
$R_{txt}$ | The output of the textual encoder in stage 3
$R_{vis}^{i}$ | The output of the visual encoder in stage 3
$z_i$ | The shared multimodal representation of the text document and the $i$-th image attached to it

3.1. Problem statement

We are given a set of news articles with manually annotated labels $D=\{x_i, y_i\}_{i=1}^{n}$, where $X=\{x_i\}_{i=1}^{n}$ denotes the news pieces and $Y=\{y_i\}_{i=1}^{n} \in \{0,1\}^{n}$ denotes the corresponding labels, with 0 for true and 1 for fake. Suppose a news article consists of one text document $doc$ and the images attached to the post, $imgs=\{img_1, img_2, \ldots, img_m\}$, where $m$ is the number of images. The text document $doc$ contains a sequence of $L$ sentences, each of which has $l$ words, i.e., $s_i=\{w_{i,1}, w_{i,2}, \ldots, w_{i,l}\}$, $i \in [1, L]$. The task of multimodal fake news detection in our work is to predict whether a news article is fake or not depending on both the text document $doc$ and the attached images $imgs$. For example, in the first module of Fig. 1, we are given one piece of news containing one text document and three attached images, and we aim to build a machine learning based model to predict whether this piece of news is fake or not. Since only the news content of the textual and visual modalities is utilized to infer the veracity, our work is capable of tackling the early detection of fake news without social context such as user comments, user profiles, propagation networks and so on.

3.2. Overview

The basic idea behind our work is that thoroughly mining the deep correlations between the textual and visual modalities plays a key role in understanding the content semantics and discovering the clues for inferring whether a news article is fake or not. Fig. 1 presents the detailed architecture of the proposed FND-SCTI. It can be observed that FND-SCTI is composed of four modules, i.e., word embeddings and VGG features, image-enhanced text representation learning, multimodal representation learning and the fake news detector. The first module digitizes the raw data and learns word embeddings and VGG features for the textual and visual data, respectively. Both the second and third modules are designed to discover the correlations between the textual and visual modalities from two different perspectives. The fake news detector takes as input the outputs of the second and third modules and is optimized in a multi-task learning setting. We elaborate on the four modules in the following.

3.3. Word embeddings and VGG features

This module aims to prepare the inputs for the second and third modules. It is well known that every NLP application starts with transforming word tokens into word embeddings, and the quality of word embeddings has a great influence on how the model performs. For this purpose, we learn word embeddings for all the vocabulary collected from the textual data by using Word2Vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013), which ensures that the more semantically similar two words are, the closer their word vectors are. For simplicity, we denote the word embeddings of the sentence $s_i$ as $s_i=\{w_{i,1}, w_{i,2}, \ldots, w_{i,l}\}$.
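As an illustration of this step, the sketch below learns 32-dimensional embeddings with gensim's Word2Vec (version 4.x API); the toy corpus `tokenized_sentences` is a placeholder and not part of the original work.

```python
# A minimal sketch of the word-embedding step, assuming the news text has already
# been tokenized into lists of words; `tokenized_sentences` is a placeholder corpus.
from gensim.models import Word2Vec

tokenized_sentences = [
    ["officials", "confirm", "new", "cases"],
    ["photo", "shows", "empty", "streets", "during", "lockdown"],
]

# 32-dimensional skip-gram embeddings (the paper reports 32-dimensional vectors).
w2v = Word2Vec(sentences=tokenized_sentences, vector_size=32, sg=1,
               window=5, min_count=1, workers=4)

# Look up the embedding of a word; semantically similar words end up close together.
vec = w2v.wv["lockdown"]
print(vec.shape)  # (32,)
```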

Due to the advancements achieved with convolutional neural networks (CNNs) in computer vision, image descriptors trained with CNNs over a large number of images have proven efficient and effective for many downstream applications. To this end, we train a VGG-19 network (Simonyan & Zisserman, 2014) on the ImageNet dataset and use the pre-trained VGG-19 as the image feature extractor in a transfer learning setting. Specifically, we use the output of the last fully connected layer (FC7) before the classification layer to encode the images $imgs$ as:

$$p=\{p_1, p_2, \ldots, p_m\}=\{\mathrm{VGG}(img_1), \mathrm{VGG}(img_2), \ldots, \mathrm{VGG}(img_m)\}$$ (1)

where $p_m$ denotes the visual features extracted from the $m$-th image through the pretrained VGG-19.
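The following is a minimal sketch of such a feature extractor using TensorFlow/Keras and its ImageNet-pretrained VGG-19, reading out the last fully connected layer before the classifier ("fc2", 4096-dimensional); the helper name `vgg_features` and the image paths are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of the visual feature extractor, assuming TensorFlow/Keras and
# an ImageNet-pretrained VGG-19; image file paths are placeholders.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.preprocessing import image

# Keep the fully connected layers and read out the last FC layer before the
# classifier ("fc2", 4096-dimensional), which plays the role of FC7 here.
base = VGG19(weights="imagenet", include_top=True)
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("fc2").output)
extractor.trainable = False  # the VGG weights are frozen during training

def vgg_features(image_paths):
    """Return the 4096-d visual features p_1..p_m for the attached images."""
    imgs = []
    for path in image_paths:
        img = image.load_img(path, target_size=(224, 224))
        imgs.append(image.img_to_array(img))
    batch = preprocess_input(np.stack(imgs))
    return extractor.predict(batch)  # shape: (m, 4096)
```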

3.4. Image-enhanced text representation learning

The vast majority of fake news detection methods are text-only oriented, and their achievements have demonstrated that understanding text content semantics is crucial to predicting whether a piece of news is fake or not. However, social media networks of all kinds have accumulated a large quantity of multimodal news, and cross-modal semantic enhancement is bound to improve the understanding of unimodal information. Here, we exploit the images attached to the text to assist text representation learning and devise a hierarchical attention mechanism (Yang et al., 2016; Zeng, Yang, et al., 2019) to differentiate the words of the text and the images. As shown in Fig. 1, considering the hierarchical document structure, we model the document hierarchically, i.e., at the word level (from words to sentence) and the sentence level (from sentences to document).

3.4.1. Word encoder with soft attention

At the word level, a Bi-directional Long Short-Term Memory (BiLSTM) network is devised to process the words of each sentence, followed by a soft attention mechanism to represent the sentence. Compared with a single LSTM cell, a BiLSTM cell is able to process the sentence $s_i=\{w_{i,1}, w_{i,2}, \ldots, w_{i,l}\}$ in both the forward and backward directions, producing a forward hidden sequence $\{\overrightarrow{h}_{i,1}, \overrightarrow{h}_{i,2}, \ldots, \overrightarrow{h}_{i,l}\}$ and a backward hidden sequence $\{\overleftarrow{h}_{i,1}, \overleftarrow{h}_{i,2}, \ldots, \overleftarrow{h}_{i,l}\}$. For example, the forward hidden state $\overrightarrow{h}_{i,t}$ at time step $t$ can be formally denoted as:

$$\begin{aligned}
I_{i,t} &= \sigma(W_{Iw}\cdot w_{i,t} + W_{Ih}\cdot \overrightarrow{h}_{i,t-1} + b_I)\\
F_{i,t} &= \sigma(W_{Fw}\cdot w_{i,t} + W_{Fh}\cdot \overrightarrow{h}_{i,t-1} + b_F)\\
O_{i,t} &= \sigma(W_{Ow}\cdot w_{i,t} + W_{Oh}\cdot \overrightarrow{h}_{i,t-1} + b_O)\\
c_{i,t} &= F_{i,t}\odot c_{i,t-1} + I_{i,t}\odot \tanh(W_{cw}\cdot w_{i,t} + W_{ch}\cdot \overrightarrow{h}_{i,t-1} + b_c)\\
\overrightarrow{h}_{i,t} &= O_{i,t}\odot \tanh(c_{i,t})
\end{aligned}$$ (2)

where $I_{i,t}$, $F_{i,t}$ and $O_{i,t}$ represent the input gate, forget gate and output gate respectively, $\sigma$ is the sigmoid function controlling the flow of information in and out of the cell, $\odot$ denotes the element-wise product, and the remaining symbols are model parameters to be learned during the training phase.

The backward hidden state $\overleftarrow{h}_{i,t}$ at time step $t$ can be computed similarly. The forward hidden state $\overrightarrow{h}_{i,t}$ and the backward hidden state $\overleftarrow{h}_{i,t}$ are then combined into $h_{i,t}=[\overrightarrow{h}_{i,t}, \overleftarrow{h}_{i,t}]$, which is regarded as the output of the BiLSTM at time step $t$. We formally denote it as:

$$h_{i,t}=\mathrm{BiLSTM}(w_{i,t})$$ (3)

Since the words are not equally important to represent the sentence, a soft attention is used as follows:

$$\begin{aligned}
u_{i,t} &= U^{T}\tanh(W_{w} h_{i,t} + b_{w})\\
\alpha_{i,t} &= \frac{\exp(u_{i,t})}{\sum_{t}\exp(u_{i,t})}\\
s_i &= \sum_{t}\alpha_{i,t} h_{i,t}
\end{aligned}$$ (4)

where $U$ is a randomly initialized context vector that is learned during the training phase. We first project the hidden vector $h_{i,t}$ into an attention space through a non-linear layer with the activation function tanh. Then $U$ is used as a supervisor to compute the scalar $u_{i,t}$, which measures the importance of each word. Afterwards, softmax is exploited to normalize the attention weights $\{\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,l}\}$. At last, we obtain the sentence representation $s_i$ via a weighted summation over all its word representations. Note that this soft attention is marked in yellow in the second component of Fig. 1.
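A minimal numpy sketch of Eq. (4) may help make the attention computation concrete; the dimensions and random parameter values are illustrative only and do not reflect trained weights.

```python
# A minimal numpy sketch of the word-level soft attention in Eq. (4); shapes and
# the random initialization are purely illustrative.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

l, dim = 6, 64                 # l words, BiLSTM output size (forward + backward)
H = np.random.randn(l, dim)    # h_{i,1..l}: word-level BiLSTM outputs for sentence i

W_w = np.random.randn(dim, dim)
b_w = np.zeros(dim)
U = np.random.randn(dim)       # randomly initialized context vector, learned in training

u = np.tanh(H @ W_w + b_w) @ U       # u_{i,t}: importance score per word
alpha = softmax(u)                   # normalized attention weights
s_i = alpha @ H                      # sentence vector: weighted sum of word states
print(alpha.round(3), s_i.shape)     # (6,) weights, (64,) sentence representation
```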

3.4.2. Sentence encoder with visual aspect attention

At the sentence level, another Bi-directional Long Short-Term Memory (BiLSTM) network is used to process the sentences of the document, followed by an attention mechanism supervised by the images attached to the document (visual aspect attention) to represent the document. For simplicity, we formally denote the BiLSTM cell operation as:

$$h_t=\mathrm{BiLSTM}(s_t)$$ (5)

where $h_t$ denotes the hidden vector at time step $t$, which is composed of the forward and backward hidden vectors, i.e., $h_t=[\overrightarrow{h}_t, \overleftarrow{h}_t]$.

Similarly, the sentences are not equally important for representing the news document. Each image attached to the post can be used as a supervisor to measure the importance of each sentence. We formally denote the visual aspect attention as follows:

$$\begin{aligned}
g_j &= \tanh(W_g p_j + b_g)\\
q_i &= \tanh(W_q h_i + b_q)\\
v_{j,i} &= V^{T}(g_j \odot q_i + q_i)\\
\beta_{j,i} &= \frac{\exp(v_{j,i})}{\sum_{i}\exp(v_{j,i})}\\
d_j &= \sum_{i}\beta_{j,i} h_i
\end{aligned}$$ (6)

where $V$ is a randomly initialized context vector whose effect resembles that of $U$ at the word level. Specifically, we first project each image representation $p_j$ and each sentence representation $h_i$ into an attention space through a non-linear layer with the activation function tanh. Then $V$ is treated as a supervisor to compute the scalar $v_{j,i}$, which reveals the importance of each sentence. Afterwards, softmax is exploited to normalize the attention weights $\{\beta_{j,1}, \beta_{j,2}, \ldots, \beta_{j,L}\}$. At last, for the image $p_j$ we obtain the document representation $d_j$ via a weighted summation over all its sentence representations. Note that the visual aspect attention is marked in pink in the second component of Fig. 1.

In order to differentiate the value of each image, we utilize a soft attention mechanism to generate the final representation of the document at the document level. We formally denote this soft attention as follows:

$$\begin{aligned}
k_j &= K^{T}\tanh(W_d d_j + b_d)\\
\gamma_j &= \frac{\exp(k_j)}{\sum_{j}\exp(k_j)}\\
d &= \sum_{j}\gamma_j d_j
\end{aligned}$$ (7)

where $K$ is a randomly initialized context vector whose effect resembles that of $U$ and $V$. Specifically, we first project each image-enhanced text representation $d_j$ into an attention space through a non-linear layer with the activation function tanh. Then $K$ is treated as a supervisor to compute the scalar $k_j$, which reveals the importance of each image-enhanced text representation. Afterwards, softmax is exploited to normalize the attention weights $\{\gamma_1, \gamma_2, \ldots, \gamma_m\}$. At last, we obtain the final document representation $d$ via a weighted summation over all the image-enhanced text representations. Note that this soft attention is marked in blue in the second component of Fig. 1.
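To make the whole attention cascade of Eqs. (6) and (7) concrete, the following numpy sketch computes one image-specific document vector per attached image and then fuses them. Biases are omitted for brevity, the element-wise-product form of $v_{j,i}$ is an assumption, and the image features are assumed to be already projected to the same dimensionality as the sentence states.

```python
# A minimal numpy sketch of Eqs. (6)-(7): image-guided sentence attention followed
# by soft attention over the m image-specific document vectors; shapes are illustrative.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

L_sent, m, dim = 4, 3, 64
H = np.random.randn(L_sent, dim)     # h_1..h_L: sentence-level BiLSTM outputs
P = np.random.randn(m, dim)          # p_1..p_m: (projected) VGG image features
W_g, W_q, W_d = (np.random.randn(dim, dim) for _ in range(3))
V, K = np.random.randn(dim), np.random.randn(dim)

docs = []
for p_j in P:                              # one document vector per attached image
    g_j = np.tanh(W_g @ p_j)               # image query g_j
    Q = np.tanh(H @ W_q.T)                 # projected sentences q_i
    v = (g_j * Q + Q) @ V                  # v_{j,i}: image-guided sentence scores
    beta = softmax(v)                      # attention over sentences
    docs.append(beta @ H)                  # d_j: image-specific document vector
D = np.stack(docs)                         # (m, dim)

k = np.tanh(D @ W_d.T) @ K                 # k_j: importance of each d_j
gamma = softmax(k)
d = gamma @ D                              # final image-enhanced text representation
print(d.shape)                             # (64,)
```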

3.5. Multimodal representation learning

In order to further capture the correlations between the textual and visual modalities, we devise a novel VAE-based multimodal feature fusion method which consists of one encoder and one decoder. As can be observed from Fig. 1, the encoder can be divided into two sub-components, a textual encoder and a visual encoder, and correspondingly the decoder can also be divided into two sub-components, a textual decoder and a visual decoder. The four sub-components are marked in orange, grey, green and blue in the third component of Fig. 1, respectively. We elaborate on them in the following.

3.5.1. Textual encoder

Given a text document $doc$, the textual encoder treats it as a long sequence of $T$ words, $doc=\{w_1, w_2, \ldots, w_T\}$, where each $w_t$ is a word vector learned by the first module. In order to extract textual semantics, we utilize a stacked Bi-directional Long Short-Term Memory (BiLSTM) network to model the text document. Specifically, the stacked BiLSTM consists of two BiLSTM layers and one fully connected layer, and is denoted as:

$$\begin{aligned}
H_t &= \mathrm{BiLSTM}(w_t)\\
\hat{H}_t &= \mathrm{BiLSTM}(H_t)\\
R_{txt} &= \tanh(W_{txt}\hat{H} + b_{txt})
\end{aligned}$$ (8)

where $\hat{H}=\{\hat{H}_1, \hat{H}_2, \ldots, \hat{H}_T\}$ is the output of the second BiLSTM, and $R_{txt}$ denotes the text feature extracted through a fully connected layer.

3.5.2. Visual encoder

The images $imgs$ attached to a text document $doc$ are vectorized as $p=\{p_1, p_2, \ldots, p_m\}$ and then passed through two fully connected layers. We formally denote the visual encoder as:

$$\begin{aligned}
\hat{R}_{vis}^{i} &= \tanh(\hat{W}_{vis}\, p_i + \hat{b}_{vis})\\
R_{vis}^{i} &= \tanh(W_{vis}\hat{R}_{vis}^{i} + b_{vis})
\end{aligned}$$ (9)

where $R_{vis}^{i}$ denotes the visual feature of the image $img_i$ extracted through the two fully connected layers.

After the text feature $R_{txt}$ and each image feature $R_{vis}^{i}$ are produced, we first concatenate them and project the result into a latent space through a fully connected layer as follows.

$$R_{lat}^{i}=\tanh(W_{lat}[R_{txt}, R_{vis}^{i}] + b_{lat})$$ (10)

Then we compute the mean $\mu$ and the standard deviation $\sigma$ from the latent representation and sample the shared multimodal representation as follows.

$$z_i = \mu + \sigma \odot \epsilon$$ (11)

where $z_i$ denotes the shared multimodal representation of the text document and the $i$-th image attached to it, and $\epsilon$ is a random variable sampled from a Gaussian distribution.

To sum up, aiming at mapping the multimodal data into a latent space, the encoder takes as input the text document and images attached to it, and outputs shared multimodal representations. For simplicity, we formally denote it as:

$$z_i = \mathrm{Encoder}(doc, img_i)$$ (12)
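A minimal tf.keras sketch of the encoder and the sampling step is given below; the layer sizes follow Section 4.1, but the exact wiring and the log-variance parameterization of $\sigma$ are assumptions made for illustration.

```python
# A minimal tf.keras sketch of the multimodal encoder and reparameterization trick
# (Eqs. (8)-(12)); the log-variance parameterization is an assumption, not the
# authors' exact formulation.
import tensorflow as tf
from tensorflow.keras import layers

def build_multimodal_encoder(emb_dim=32, latent_dim=32, max_words=100):
    words = layers.Input(shape=(max_words, emb_dim))   # word embeddings of doc
    img = layers.Input(shape=(4096,))                  # VGG-19 feature of one image

    # Textual encoder: stacked BiLSTM + fully connected layer (Eq. (8))
    h = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(words)
    h = layers.Bidirectional(layers.LSTM(32))(h)
    r_txt = layers.Dense(32, activation="tanh")(h)

    # Visual encoder: two fully connected layers (Eq. (9))
    r_vis = layers.Dense(1024, activation="tanh")(img)
    r_vis = layers.Dense(32, activation="tanh")(r_vis)

    # Latent projection, then mean / log-variance and sampling (Eqs. (10)-(11))
    r_lat = layers.Dense(64, activation="tanh")(layers.Concatenate()([r_txt, r_vis]))
    mu = layers.Dense(latent_dim)(r_lat)
    log_var = layers.Dense(latent_dim)(r_lat)

    def sample(args):
        mu, log_var = args
        eps = tf.random.normal(tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * eps        # z_i = mu + sigma * eps

    z = layers.Lambda(sample)([mu, log_var])
    return tf.keras.Model([words, img], [z, mu, log_var])
```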

3.5.3. Textual decoder

We can clearly see from the architecture of our proposed method that the decoder is structurally similar to the encoder. In other words, the encoder and the decoder are inverted from the perspective of network structure. The textual decoder aims to reconstruct the words in the text according to the multimodal representations. To this end, we first use a fully connected layer to process the multimodal representations. Then a stacked Bi-LSTM is leveraged to generate words during all time steps.

3.5.4. Visual decoder

The visual decoder aims to reconstruct the visual features extracted by the pretrained VGG-19 model according to the multimodal representations. To this end, we simply devise two fully connected layers just like the encoder.

To sum up, the decoder aims to reconstruct the textual and visual modalities from the shared multimodal representations. For simplicity, we formally denote it as:

$$doc, img_i = \mathrm{Decoder}(z_i)$$ (13)
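A matching decoder sketch, under the same assumptions as the encoder sketch above, could look as follows; the vocabulary size and maximum sequence length are placeholders, and the exact layer arrangement is an illustrative guess consistent with Sections 3.5.3 and 3.5.4.

```python
# A minimal tf.keras sketch of the decoder (Eq. (13)): a textual branch that emits
# a distribution over the vocabulary at every time step and a visual branch that
# reconstructs the 4096-d VGG features; sizes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

def build_decoder(latent_dim=32, max_words=100, vocab_size=5000):
    z = layers.Input(shape=(latent_dim,))

    # Textual decoder: fully connected layer, then a stacked BiLSTM over all time steps
    h = layers.Dense(32, activation="tanh")(z)
    h = layers.RepeatVector(max_words)(h)
    h = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(h)
    h = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(h)
    word_logits = layers.TimeDistributed(layers.Dense(vocab_size))(h)

    # Visual decoder: two fully connected layers back to the VGG feature space
    v = layers.Dense(1024, activation="tanh")(z)
    vgg_rec = layers.Dense(4096, activation="tanh")(v)

    return tf.keras.Model(z, [word_logits, vgg_rec])
```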

3.6. Fake news detector

In this work, we detect fake news by learning the multimodal fusion eigenvector and the image-enhanced text representation in a multi-task learning setting. In order to obtain the multimodal fusion eigenvector, we exploit a soft attention mechanism to differentiate the importance of each image. We formally denote it as follows:

$$\begin{aligned}
k_i &= Q^{T}\tanh(W_z z_i + b_z)\\
\chi_i &= \frac{\exp(k_i)}{\sum_{i}\exp(k_i)}\\
z &= \sum_{i}\chi_i z_i
\end{aligned}$$ (14)

where Q is a randomly initialized context vector whose effect resembles U, V and K.

Then the multimodal fusion eigenvector and the image-enhanced text representation are concatenated and processed through two fully connected layers. Finally, a softmax-based classifier is used to compute the probability that the news is fake.

$$\begin{aligned}
F &= \tanh(W_{fnd}^{1}[d, z] + b_{fnd}^{1})\\
\rho &= \tanh(W_{fnd}^{2}F + b_{fnd}^{2})\\
g_c &= \frac{\exp(\rho_c)}{\sum_{c}\exp(\rho_c)}
\end{aligned}$$ (15)

where gc is the predicted probability distribution.
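The detector head can be illustrated with a small numpy sketch of Eqs. (14) and (15); all weights are random placeholders rather than trained parameters, and the dimensions are illustrative.

```python
# A minimal numpy sketch of the fake news detector: soft attention over the m shared
# representations z_i, concatenation with the image-enhanced text vector d, two dense
# layers and a softmax.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

m, dim = 3, 32
Z = np.random.randn(m, dim)               # z_1..z_m from the multimodal encoder
d = np.random.randn(64)                   # image-enhanced text representation
W_z, Q = np.random.randn(dim, dim), np.random.randn(dim)

chi = softmax(np.tanh(Z @ W_z.T) @ Q)     # attention weights over the images
z = chi @ Z                               # multimodal fusion eigenvector

W1 = np.random.randn(32, 64 + dim)
W2 = np.random.randn(2, 32)
F = np.tanh(W1 @ np.concatenate([d, z]))
rho = np.tanh(W2 @ F)
g = softmax(rho)                          # predicted probability of [real, fake]
print(g)
```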

3.7. Model training

In this section, we optimize all the parameters, denoted as $\Theta=\{\Theta_1, \Theta_2, \Theta_3\}$, where $\Theta_1$, $\Theta_2$ and $\Theta_3$ represent the parameters of the image-enhanced text representation learning module, the multimodal representation learning module and the fake news detector module, respectively. First, a VAE loss is designed for the multimodal representation learning module. The VAE loss includes three parts: the text reconstruction loss, the VGG-19 feature reconstruction loss, and the KL divergence loss that pushes the latent distribution towards a Normal distribution. We formally denote the VAE loss as follows:

$$\mathcal{L}_{rec}^{txt} = -\sum_{x\in D}\sum_{i=1}^{T}\sum_{j=1}^{n_w} \mathbb{1}_{\{j=tx_i\}}\,\log \hat{tx}_i$$ (16)
$$\mathcal{L}_{rec}^{vgg} = \frac{1}{n_v}\sum_{x\in D}\sum_{i=1}^{n_v}(\hat{p}_i - p_i)^2$$ (17)
$$\mathcal{L}_{kl} = \frac{1}{2}\sum_{i=1}^{n_m}\left(\mu_i^{2} + \sigma_i^{2} - \log(\sigma_i^{2}) - 1\right)$$ (18)

where $T$ is the number of words in one text document, $n_w$ is the vocabulary size, $n_v$ is the dimensionality of the VGG-19 features, and $n_m$ is the dimensionality of the multimodal features. $tx_i$ denotes the $i$-th word in the text document and $\hat{tx}_i$ is the reconstructed one; $p_i$ is the $i$-th component of the visual features extracted by the pretrained VGG-19 model and $\hat{p}_i$ is the reconstructed one.

Second, cross entropy with L2 regularization is defined as the loss function of the image-enhanced text representation learning module and the fake news detector during training:

$$\mathcal{L}_{fnd} = -\sum_{x\in D}\sum_{c=1}^{C} y_c(x)\cdot\log(g_c(x)) + \lambda L_2(\Theta_1, \Theta_3)$$ (19)

where $C$ is the number of categories, $y_c(x)$ is the ground truth, $g_c(x)$ is the predicted probability, and $\lambda$ is the coefficient for L2 regularization.

The VAE loss and the cross entropy loss are optimized using SGD to deeply mine the correlations between textual modality and visual modality for fake news detection.
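A hedged sketch of how these objectives might be written in TensorFlow is shown below; the tensor shapes, the λ value and the use of a log-variance output are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal TensorFlow sketch of the training objectives in Eqs. (16)-(19).
import tensorflow as tf

def vae_loss(word_ids, word_logits, p_true, p_rec, mu, log_var):
    # Text reconstruction: cross entropy over the vocabulary at every time step (Eq. (16))
    rec_txt = tf.reduce_sum(
        tf.keras.losses.sparse_categorical_crossentropy(
            word_ids, word_logits, from_logits=True), axis=-1)
    # VGG-feature reconstruction: mean squared error (Eq. (17))
    rec_vgg = tf.reduce_mean(tf.square(p_rec - p_true), axis=-1)
    # KL divergence towards a standard Normal prior (Eq. (18))
    kl = 0.5 * tf.reduce_sum(
        tf.exp(log_var) + tf.square(mu) - 1.0 - log_var, axis=-1)
    return tf.reduce_mean(rec_txt + rec_vgg + kl)

def detection_loss(y_true, y_prob, weights, lam=0.05):
    # Cross entropy plus L2 regularization over the given weight tensors (Eq. (19))
    ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_prob)
    l2 = tf.add_n([tf.nn.l2_loss(w) for w in weights])
    return tf.reduce_mean(ce) + lam * l2
```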

4. Experiments

In this section, we first describe the experimental settings in brief (Section 4.1). Then, four evaluation metrics are introduced in Section 4.2. Afterwards, several baselines are listed in Section 4.3. At last, in order to evaluate the proposed FND-SCTI, comparison experiments are conducted and empirical results are analyzed in Section 4.4.

4.1. Experimental settings

We evaluate the proposed FND-SCTI on two real-world multimodal datasets, Twitter (Boididou et al., 2015) and Weibo (Jin et al., 2017). The statistics of the two datasets are summarized in Table 2. The Twitter dataset contains 7898 fake news posts, 6026 real news posts and 514 images. The Weibo dataset contains 4749 fake news posts, 4779 real news posts and 9528 images. We preprocess the Twitter dataset by filtering out tweets attached with videos, as done in Dhruv et al. (2019). As for the Weibo dataset, which was released by an authoritative Chinese news agency and verified by Weibo's official rumor debunking system, we preprocess it by removing duplicate images, as done in Dhruv et al. (2019). Each dataset is split into a training set and a testing set.

Table 2.

Statistics of the two real-world datasets.

Dataset | Fake news | Real news | Images
Twitter | 7898 | 6026 | 514
Weibo | 4749 | 4779 | 9528

In our experiments, we pre-train the Word2Vec model (Mikolov et al., 2013) on the dataset in an unsupervised manner and generate 32-dimensional word embeddings for all the vocabulary. As for the visual feature extractor, we pre-train a VGG-19 model (Simonyan & Zisserman, 2014) on the ImageNet dataset and use the output of the second-to-last layer as 4096-dimensional visual features. The parameters of the pre-trained VGG-19 model are frozen during training. All LSTM cells have a hidden dimension of 32. The fully connected layer of the textual encoder has a dimension of 32. The two fully connected layers of the visual encoder have dimensions of 1024 and 32, respectively. The textual decoder and the visual decoder share the same dimensions as the textual encoder and the visual encoder. The context vectors $U$, $V$, $K$ and $Q$ are all set to be 64-dimensional. Our model is trained for 300 epochs with a batch size of 128 instances, an L2-regularization weight of 0.05 and an initial learning rate of 0.00001 for Adam. We implement FND-SCTI using Keras.

4.2. Evaluation metrics

Here, we use four evaluation metrics, i.e., classification accuracy, precision, recall and F1-score, which are widely used for fake news detection. Accuracy measures how the method performs in classification, and is formalized as:

$$\mathrm{Accuracy}=\frac{N_{correct}}{N}$$ (20)

where $N_{correct}$ is the number of samples correctly predicted and $N$ is the total number of tested news items.

Precision measures the probability of true positive samples over all the predicted positive samples, and is formalized as:

$$\mathrm{Precision}=\frac{N_{tp}}{N_{tp}+N_{fp}}$$ (21)

where $N_{tp}$ represents the number of positive samples which are predicted to be positive, and $N_{fp}$ represents the number of negative samples which are predicted to be positive.

Recall measures the probability of true positive samples over all the original positive samples, and is formalized as:

$$\mathrm{Recall}=\frac{N_{tp}}{N_{tp}+N_{fn}}$$ (22)

where $N_{fn}$ represents the number of positive samples which are incorrectly predicted to be negative.

F1-score is the harmonic mean of the precision and the recall, and can be formulated as:

$$F_1=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$ (23)
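For completeness, the four metrics can be computed directly with scikit-learn, as in the following sketch with illustrative labels (1 = fake, 0 = real).

```python
# A minimal sketch of the four evaluation metrics using scikit-learn; the arrays
# are toy examples, not experimental outputs.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label=1))
print("Recall   :", recall_score(y_true, y_pred, pos_label=1))
print("F1-score :", f1_score(y_true, y_pred, pos_label=1))
```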

4.3. Baseline approaches

In order to comprehensively validate our proposed FND-SCTI, we select several competitive fake news detection algorithms for a fair comparison. Two unimodal approaches (Textual and Visual) and five multimodal approaches (VQA, Neural Talk, att-RNN, EANN and MVAE) are listed as follows:

  • Textual (Dhruv et al., 2019) aims to use a Bi-LSTM to model the text-only content of the news, followed by a 32-dimensional fully connected layer. Then a softmax layer is leveraged for final classification.

  • Visual (Dhruv et al., 2019) aims to model the images attached to news posts with a pre-trained VGG-19, followed by two fully connected layers with dimension of 1024 and 32 respectively.

  • VQA (Antol et al., 2015) is designed to answer questions by discovering clues from images. The original VQA model is adapted to multimodal fake news detection by replacing the multi-class classification with a binary classification and setting the dimension of the LSTM cell to 32.

  • Neural Talk (Vinyals et al., 2015) is proposed for image captioning, producing sentences that describe images. To adapt Neural Talk to fake news detection, we average the LSTM hidden states as the multimodal features, which are fed into a fully connected layer.

  • att-RNN (Jin et al., 2017) is designed for rumor detection, which exploits a neural attention on top of the LSTM to fuse text features, social context and image features. We adapt it by removing social context to ensure a fair comparison.

  • EANN (Event Adversarial Neural Network) (Wang et al., 2018) is proposed to extract event-agnostic features for multimodal fake news detection. The core of EANN is a discriminator based on adversarial learning that is devised to remove any event-specific features.

  • MVAE (Multimodal Variational AutoEncoder) (Dhruv et al., 2019) is the current state-of-the-art approach for multimodal fake news detection, and explores VAE to generate multimodal features which are fed into the fake news detector for final classification.

4.4. Experimental results and analysis

In order to demonstrate that our proposed FND-SCTI performs better for multimodal fake news detection by thoroughly mining the semantic correlations between the textual and visual modalities, we make comparisons with two unimodal methods and five multimodal methods. The quantitative comparison results are shown in Table 3. Note that the results of the competitors on the two datasets stem from Dhruv et al. (2019), where one text document attached with multiple images is regarded as multiple instances. Therefore, we conduct our experiments under the same setting for a fair comparison.

Table 3.

Performance of our proposed FND-SCTI versus other competitive algorithms on two real-world datasets. Note that we treat one text document attached with multiple images as multiple instances in this experiment setting for fair comparisons.

Dataset | Method | Accuracy | Fake news (Precision / Recall / F1) | Real news (Precision / Recall / F1)
Twitter | Textual | 0.526 | 0.586 / 0.553 / 0.569 | 0.469 / 0.526 / 0.496
Twitter | Visual | 0.596 | 0.695 / 0.518 / 0.593 | 0.524 / 0.700 / 0.599
Twitter | VQA | 0.631 | 0.765 / 0.509 / 0.611 | 0.550 / 0.794 / 0.650
Twitter | Neural Talk | 0.610 | 0.728 / 0.504 / 0.595 | 0.534 / 0.752 / 0.625
Twitter | att-RNN | 0.664 | 0.749 / 0.615 / 0.676 | 0.589 / 0.728 / 0.651
Twitter | EANN | 0.648 | 0.810 / 0.498 / 0.617 | 0.584 / 0.759 / 0.660
Twitter | MVAE | 0.745 | 0.801 / 0.719 / 0.758 | 0.689 / 0.777 / 0.730
Twitter | FND-SCTI | 0.758 | 0.808 / 0.730 / 0.771 | 0.705 / 0.789 / 0.741
Weibo | Textual | 0.643 | 0.662 / 0.578 / 0.617 | 0.609 / 0.685 / 0.647
Weibo | Visual | 0.608 | 0.610 / 0.605 / 0.607 | 0.607 / 0.611 / 0.609
Weibo | VQA | 0.736 | 0.797 / 0.634 / 0.706 | 0.695 / 0.838 / 0.760
Weibo | Neural Talk | 0.726 | 0.794 / 0.713 / 0.692 | 0.684 / 0.840 / 0.754
Weibo | att-RNN | 0.772 | 0.854 / 0.656 / 0.742 | 0.720 / 0.889 / 0.795
Weibo | EANN | 0.782 | 0.827 / 0.697 / 0.756 | 0.752 / 0.863 / 0.804
Weibo | MVAE | 0.824 | 0.854 / 0.769 / 0.809 | 0.802 / 0.875 / 0.837
Weibo | FND-SCTI | 0.834 | 0.863 / 0.780 / 0.824 | 0.815 / 0.892 / 0.835

First, it can be clearly observed from the results that the multimodal methods perform much better than the unimodal methods, which shows the necessity of utilizing multiple modalities. Second, comparing att-RNN with VQA and Neural Talk, we find that visual aspect attention is of great importance for improving the performance, which validates the necessity of exploiting the attached images to enhance the text representation. Third, comparing MVAE with VQA, Neural Talk, att-RNN and EANN, we can conclude that learning shared multimodal representations of the textual and visual modalities contributes largely to detecting fake news, which validates the necessity of fusing multimodal features. Finally, the reason why our proposed FND-SCTI outperforms the current state-of-the-art approach MVAE is that FND-SCTI learns the multimodal fusion representation and the image-augmented text representation in a multi-task learning setting. In other words, the deeper the semantic correlations between the textual and visual modalities are mined, the better the method performs. To be specific, in terms of detection accuracy, FND-SCTI outperforms MVAE with absolute increments of 1.3% and 1.0% on the Twitter dataset and the Weibo dataset, respectively.

We further conduct an ablation analysis to investigate the respective contributions of the components of FND-SCTI's architecture. Note that here we regard one text document attached with multiple images as one instance. There are two reasons for this experimental setting: (1) it is much more practical and reasonable than treating one text document attached with multiple images as multiple instances; (2) the visual aspect attention does not work when one text document attached with multiple images is treated as multiple instances, so this setting allows the visual aspect attention to make its full contribution.

Table 4 displays the results of the ablation analysis. FND-SCTI-han removes the multimodal representation learning module, while FND-SCTI-visatt removes the image-enhanced text representation learning module. We can clearly see from the results that: (1) FND-SCTI outperforms FND-SCTI-han with absolute increments of 8.9% and 6.5% on average on the Twitter dataset and the Weibo dataset respectively, which directly validates the efficacy of our devised multimodal representation learning module; (2) FND-SCTI outperforms FND-SCTI-visatt with absolute increments of 1.4% and 1.0% on average on the Twitter dataset and the Weibo dataset respectively, which directly validates the efficacy of our devised image-enhanced text representation learning module.

Table 4.

Architecture ablation analysis of FND-SCTI. Note that we treat one text document attached with multiple images as one instance in this experiment setting.

Dataset | Method | Accuracy | Fake news (Precision / Recall / F1) | Real news (Precision / Recall / F1)
Twitter | FND-SCTI-han | 0.675 | 0.756 / 0.624 / 0.687 | 0.593 / 0.727 / 0.662
Twitter | FND-SCTI-visatt | 0.751 | 0.805 / 0.724 / 0.760 | 0.693 / 0.784 / 0.738
Twitter | FND-SCTI | 0.772 | 0.813 / 0.741 / 0.775 | 0.707 / 0.796 / 0.746
Weibo | FND-SCTI-han | 0.778 | 0.802 / 0.661 / 0.749 | 0.730 / 0.894 / 0.799
Weibo | FND-SCTI-visatt | 0.832 | 0.859 / 0.774 / 0.815 | 0.807 / 0.877 / 0.838
Weibo | FND-SCTI | 0.839 | 0.867 / 0.784 / 0.828 | 0.819 / 0.893 / 0.841

All these experimental results show that our proposed FND-SCTI is effective in detecting fake news on social media networks and achieves new state-of-the-art performance.

5. Conclusion and future work

In this paper, we target detecting whether a piece of multimodal news is fake or not by analyzing both textual and visual content. We present a novel deep-neural-network-based method, called FND-SCTI, which not only learns a shared multimodal fusion representation, but also enhances the textual representation by devising a hierarchical attention mechanism that highlights the important parts of the news document. The fake news detector takes as input the two representations learned above and is trained in a multi-task learning setting. Extensive experiments conducted on the Twitter and Weibo datasets have demonstrated that our proposed FND-SCTI is able to effectively capture the semantic correlations across modalities, and achieves state-of-the-art performance.

Although it is empirically validated that our proposal shows great potential for multimodal fake news detection, current deep learning based methods still leave room for improvement. Social networks at home and abroad are flooded with bot accounts that are mainly used to propagate advertisements or false information. In the future, we plan to perform source checking by modeling user characteristics. Moreover, with the development of image editing software like Photoshop and deep generative networks like GANs, image editing and generation are becoming increasingly simple to operate, which makes it more and more difficult to distinguish fake images from real ones. Therefore, forged image detection is another plan of ours.

Declaration of Competing Interest

The authors report no declarations of interest.

Acknowledgements

We are grateful to the anonymous reviewers for their helpful and thought-provoking suggestions and feedback. This work was supported in part by the National Natural Science Foundation of China under grant No. 61802440, the Natural Science Foundation of Hubei Province under grant No. 2020CFB492 and the Basic Scientific Research of China University under grant No. 30106200278.

References

  1. Al-Turjman F., Deebak B.D. Privacy-aware energy-efficient framework using the internet of medical things for covid-19. IEEE Internet of Things Magazine. 2020;3(3):64–68.
  2. Antol S., Agrawal A., Lu J., Mitchell M., Batra D., Zitnick C.L., et al. Vqa: Visual question answering. Proceedings of the IEEE international conference on computer vision, CVPR 2015. 2015:2425–2433.
  3. Arafatur Rahman M., Zaman N., Taufiq Asyhari A., Al-Turjman F., Zakirul Alam Bhuiyan M., Zolkipli M.F. Data-driven dynamic clustering framework for mitigating the adverse economic impact of covid-19 lockdown practices. Sustainable Cities and Society. 2020;62. doi: 10.1016/j.scs.2020.102372.
  4. Atrey P.K., Hossain M.A., El Saddik A., Kankanhalli M.S. Multimodal fusion for multimedia analysis: A survey. Multimedia Systems. 2010;16(6):345–379.
  5. Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate. 2014. CoRR abs/1409.0473.
  6. Baltrušaitis T., Ahuja C., Morency L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018;41(2):423–443. doi: 10.1109/TPAMI.2018.2798607.
  7. Boididou C., Andreadou K., Papadopoulos S., Dang-Nguyen D.-T., Boato G., Riegler M., et al. Verifying multimedia use at mediaeval 2015. MediaEval. 2015;3(3):7.
  8. Castillo C., Mendoza M., Poblete B. Information credibility on twitter. Proceedings of the 20th international conference on world wide web, WWW 2011. 2011:675–684.
  9. Chen T., Li X., Yin H., Zhang J. Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection. Pacific-Asia conference on knowledge discovery and data mining. 2018:40–52.
  10. Cui L., Wang S., Lee D. Same: Sentiment-aware multi-modal embedding for detecting fake news. Proceedings of the 2019 IEEE/ACM international conference on advances in social networks analysis and mining. 2019:41–48.
  11. Dhruv K., Singh G.J., Manish G., Vasudeva V. Mvae: Multimodal variational autoencoder for fake news detection. The world wide web conference. 2019:2915–2921.
  12. Feng S., Banerjee R., Choi Y. Syntactic stylometry for deception detection. Proceedings of the 50th annual meeting of the association for computational linguistics, ACL 2012. 2012:171–175.
  13. Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., et al. Generative adversarial nets. Advances in Neural Information Processing Systems. 2014:2672–2680.
  14. Guo H., Cao J., Zhang Y., Guo J., Li J. Rumor detection with hierarchical social attention network. Proceedings of the 27th ACM international conference on information and knowledge management. 2018:943–951.
  15. Guo H., Cao J., Zhang Y., Guo J., Li J. Rumor detection with hierarchical social attention network. Proceedings of the 27th ACM international conference on information and knowledge management. 2018:943–951.
  16. Guo C., Cao J., Zhang X., Shu K., Liu H. Dean: Learning dual emotion for fake news detection on social media. 2019. arXiv:1903.01728 (arXiv preprint).
  17. Huang F., Wei K., Weng J., Li Z. Attention-based modality-gated networks for image-text sentiment analysis. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). 2020;16(3):1–19.
  18. Hunt A., Gentzkow M. Social media and fake news in the 2016 election. Journal of Economic Perspectives. 2017;31(2):211–236.
  19. Jiang T., Zeng J., Zhou K., Huang P., Yang T. Lifelong disk failure prediction via gan-based anomaly detection. 37th IEEE international conference on computer design, ICCD 2019; Abu Dhabi, United Arab Emirates, November 17–20, 2019, IEEE; 2019. pp. 199–207.
  20. Jiang T., Wang J., Liu Z., Ling Y. Fusion-extraction network for multimodal sentiment analysis. Pacific-Asia conference on knowledge discovery and data mining. 2020:785–797.
  21. Jiawei L., Wang Q., Liu K. Sustainable design of courtyard environment: From the perspectives of airborne diseases control and human health. Sustainable Cities and Society. 2020;62. doi: 10.1016/j.scs.2020.102405.
  22. Jin Z., Cao J., Zhang Y., Luo J. News verification by exploiting conflicting social viewpoints in microblogs. Thirtieth AAAI conference on artificial intelligence. 2016:2972–2978.
  23. Jin Z., Cao J., Guo H., Zhang Y., Luo J. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. Proceedings of the 2017 ACM on multimedia conference, ACMMM 2017. 2017:795–816.
  24. Kolhar M., Al-Turjman F., Alameen A., Abualhaj M.M. A three layered decentralized iot biometric architecture for city lockdown during covid-19 outbreak. IEEE Access. 2020;8:163608–163617. doi: 10.1109/ACCESS.2020.3021983.
  25. Kolhar M.S., Al-Turjman F., Alameen A., Abu-Alhaj M.M. A three layered decentralized iot biometric architecture for city lockdown during COVID-19 outbreak. IEEE Access. 2020;8:163608–163617. doi: 10.1109/ACCESS.2020.3021983.
  26. Lazer D.M.J., Baum M.A., Benkler Y., Berinsky A.J., Greenhill K.M., Menczer F., et al. The science of fake news. Science. 2018;359(6380):1094–1096. doi: 10.1126/science.aao2998.
  27. Li Q., Zhang Q., Si L. Rumor detection by exploiting user credibility information, attention and multi-task learning. Proceedings of the 57th annual meeting of the association for computational linguistics. 2019:1173–1179.
  28. Liu Y., Wu Y.-F.B. Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. Thirty-second AAAI conference on artificial intelligence. 2018.
  29. Lu Y.-J., Li C.-T. Gcan: Graph-aware co-attention networks for explainable fake news detection on social media. 2020. arXiv:2004.11648 (arXiv preprint).
  30. Lu Y. GCAN: Graph-aware co-attention networks for explainable fake news detection on social media. In: Li C., Jurafsky D., Chai J., Schluter N., Tetreault J.R., editors. Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020; Online, July 5–10, 2020, Association for Computational Linguistics; 2020. pp. 505–514.
  31. Ma J., Gao W., Mitra P., Kwon S., Jansen B.J., Wong K.-F., et al. Detecting rumors from microblogs with recurrent neural networks. 2016. pp. 3818–3824.
  32. Ma J., Gao W., Wong K.-F. Rumor detection on twitter with tree-structured recursive neural networks. Association for Computational Linguistics. 2018.
  33. Ma J., Gao W., Wong K.-F. Detect rumor and stance jointly by neural multi-task learning. Companion proceedings of The Web Conference 2018. 2018:585–593.
  34. Ma X., Zeng J., Peng L., Fortino G., Zhang Y. Modeling multi-aspects within one opinionated sentence simultaneously for aspect-level sentiment analysis. Future Generation Computing Systems. 2019;93:304–311.
  35. Masoud R.A., Mirmahaleh S.Y.H. Coronavirus disease (covid-19) prevention and treatment methods and effective parameters: A systematic literature review. Sustainable Cities and Society. 2020. doi: 10.1016/j.scs.2020.102568.
  36. Mikolov T., Sutskever I., Chen K., Corrado G.S., Dean J. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems 26: 27th annual conference on neural information processing systems 2013. 2013:3111–3119.
  37. Qi P., Cao J., Yang T., Guo J., Li J. Exploiting multi-domain visual information for fake news detection. 2019 IEEE international conference on data mining (ICDM); IEEE; 2019. pp. 518–527.
  38. Shu K., Wang S., Liu H. Beyond news contents: The role of social context for fake news detection. Proceedings of the twelfth ACM international conference on web search and data mining. 2019:312–320.
  39. Shu K., Cui L., Wang S., Lee D., Liu H. defend: Explainable fake news detection. Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2019:395–405.
  40. Shui-Hua W., Zhang Y.-D. Densenet-201-based deep neural network with composite learning factor and precomputation for multiple sclerosis classification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). 2020;16:1–19.
  41. Simonyan K., Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. arXiv:1409.1556 (arXiv preprint).
  42. Srivastava V., Srivastava S., Chaudhary G., Al-Turjman F. A systematic approach for covid-19 predictions and parameter estimation. Personal and Ubiquitous Computing. 2020:1–13. doi: 10.1007/s00779-020-01462-8.
  43. Truong Q.-T., Lauw H.W. Vistanet: Visual aspect attention network for multimodal sentiment analysis. Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 2019:305–312.
  44. Vinyals O., Toshev A., Bengio S., Erhan D. Show and tell: A neural image caption generator. Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR 2015. 2015:3156–3164.
  45. Wang Y., Ma F., Jin Z., Yuan Y., Xun G., Jha K., et al. Eann: Event adversarial neural networks for multi-modal fake news detection. Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018:849–857.
  46. Wang S., Zhang Y., Yang M., Liu B., Ramírez J., Górriz J.M. Unilateral sensorineural hearing loss identification based on double-density dual-tree complex wavelet transform and multinomial logistic regression. Integrated Computer-Aided Engineering. 2019;26(4):411–426.
  47. Wang S., Muhammad K., Hong J., Sangaiah A.K., Zhang Y. Alcoholism identification via convolutional neural network based on parametric relu, dropout, and batch normalization. Neural Computing and Applications. 2020;32(3):665–680.
  48. Wang Y., Yang W., Ma F., Xu J., Zhong B., Deng Q., et al. Weak supervision for fake news detection via reinforcement learning. The thirty-fourth AAAI conference on artificial intelligence, AAAI 2020; New York, NY, USA, February 7–12, 2020, AAAI Press; 2020. pp. 516–523.
  49. Wu K., Yang S., Zhu K.Q. False rumors detection on sina weibo by propagation structures. 2015 IEEE 31st international conference on data engineering; IEEE; 2015. pp. 651–662.
  50. Yang Z., Yang D., Dyer C., He X., Smola A.J., Hovy E.H. Hierarchical attention networks for document classification. NAACL HLT 2016, the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies. 2016:1480–1489.
  51. Yu F., Liu Q., Wu S., Wang L., Tan T., et al. A convolutional approach for misinformation identification. 2017. pp. 3091–3097.
  52. Zellers R., Holtzman A., Rashkin H., Bisk Y., Farhadi A., Roesner F., et al. Defending against neural fake news. Advances in neural information processing systems. 2019. pp. 9054–9065.
  53. Zeng J., Ma X., Zhou K. Photo-realistic face age progression/regression using a single generative adversarial network. Neurocomputing. 2019;366:295–304.
  54. Zeng J., Yang M., Zhou K., Ma X., Wang Y., Xu X., et al. Improved review sentiment analysis with a syntax-aware encoder. In: Shao J., Yiu M.L., Toyoda M., Zhang D., Wang W., Cui B., editors. Web and big data – Third international joint conference, APWeb-WAIM 2019, Chengdu, China, August 1–3, 2019, Proceedings, Part II. Vol. 11642 of lecture notes in computer science. Springer; 2019. pp. 73–87.
  55. Zhang H., Fang Q., Qian S., Xu C. Multi-modal knowledge-aware event memory network for social media rumor detection. Proceedings of the 27th ACM international conference on multimedia. 2019:1942–1951.
  56. Zhou K., Zeng J., Liu Y., Zou F. Deep sentiment hashing for text retrieval in social ciot. Future Generation Computing Systems. 2018;86:362–371.
  57. Zhou P., Han X., Morariu V.I., Davis L.S. Learning rich features for image manipulation detection. Proceedings of the IEEE conference on computer vision and pattern recognition. 2018:1053–1061.
  58. Zhu Y., Wang X., Zhong E., Liu N.N., Li H., Yang Q. Discovering spammers in social networks. AAAI 2012. 2012:171–177.
