An automated multi-web platform voting framework to predict misleading information proliferated during COVID-19 outbreak using ensemble method

Deepika Varshney; Dinesh Kumar Vishwakarma

doi:10.1016/j.datak.2022.102103

2022 Nov 11;143:102103.

PMCID: PMC9650682 PMID: 36406205

Abstract

The spread of misleading information on social web platforms has fuelled massive panic and confusion among the public regarding coronavirus disease, and detecting it is of paramount importance. Previous studies mainly relied on a single web platform to collect evidence for detecting fake content. Our analysis finds that retrieving clues from two or more sources/web platforms gives a more reliable and confident prediction about a specific claim. This study proposes a novel multi-web platform voting framework that incorporates four sets of novel features: content, linguistic, similarity, and sentiment. The features are gathered from each web platform to validate the news. To validate a fact/claim, a unique source platform is designed to collect relevant clues/headlines from two web platforms (YouTube, Google) based on specific queries, and features are extracted for each clue/headline. The idea is to provide a unique platform that assists researchers in gathering relevant and vital evidence from diverse web platforms. Evaluation and validation show that the built model gives promising results and effectively predicts misleading information. The model correctly detected about 98% of the COVID misinformation on the CONSTRAINT COVID-19 fake news dataset. Furthermore, gathering clues from multiple web platforms is observed to give more reliable predictions when validating news. The suggested work has numerous practical applications for health policy-makers and practitioners and could be useful in safeguarding society and raising awareness against the dissemination of misleading information during this pandemic.

Keywords: COVID-19, Misleading information, Machine learning, Fake news, Multi-web platforms, YouTube, Google

1. Introduction

A new coronavirus strain has spread around the world. The disease emerged rapidly as a respiratory infection that raised alarming concerns for public health. According to the initial reports, the infection was airborne and was transmitted from animals to humans. Later, the pattern of transmission shifted to human-to-human spread via inhalation of micro-droplets during close-proximity contact, which created widespread hysteria, with approximately 6,359,182 confirmed cases and 380,663 deaths,¹ and the growth rate remains high, which has alarmed global health authorities [1]. Misleading content has become a growing problem for societies: it spreads virally and causes harm across social networks. An infodemic severely affects society through the spread and circulation of misleading information, and detecting, verifying, and validating shared information is quite challenging. The problem of fake news is even more troubling in the healthcare context. The Zika virus outbreak is one example of a disease infodemic: a great deal of misleading content was shared during that period, which affected and hampered public emotions and mindset [1]. Social media are the most important platforms for risk communication during public health crises [2], [3]. They have become an important source of information sharing, as people become more reliant on social media for emergency information. Governments and public authorities are increasingly utilizing these technologies to communicate with the citizenry during times of crisis [4].

Twitter is one of the most prominent and productive social media networks for sharing information and has been a popular means of disseminating news and updates in emergent situations, such as public health crises. It has been used increasingly during public health crises to communicate information efficiently and effectively [5].

Social media platforms such as Twitter can help identify fake news about the Zika virus. The authors of [1] investigated fake news sharing on Twitter in the context of the Zika virus.

COVID-19 is another pandemic. It has affected the whole world badly, and malicious individuals have taken this disaster as an opportunity to spread false information for profit. Much of the misleading information posted concerns COVID-19 cures and related topics; it creates confused opinions and misconceptions about the disease, because people are eager for any new announcement that might help against COVID-19. Since the novel virus is deadly, people are anxious to learn of a cure and are in a rush to find a treatment for the new coronavirus disease. Some fake cures posted on social media are harmful and give bad health advice. Recent examples of fake cures are shown in Fig. 1. Fig. 1(a) shows a false claim that went viral: that gargling with hot water and salt or vinegar, as well as drinking more water, gets rid of coronavirus. No significant evidence has been reported in support of this claim. Another cure, in Fig. 1(b), claimed that a silver solution could destroy coronavirus within 12 h. The proliferation of such misleading information creates many misconceptions about coronavirus disease. Some users spread it without verification and have fuelled huge panic among people regarding COVID-19. In the same way, misleading information about the pandemic has been propagated to promote specific products for enormous profit, which fuelled panic buying of groceries, sanitizers, masks, and paper products; this led to shortages that disrupted the supply chain and exacerbated demand–supply gaps and food insecurity. According to [6], misleading information can be defined as “any post that shares content, does not faithfully represent the event that it refers to”.
We follow this definition in our work and define misleading information as “content that does not faithfully represent the event that it refers to and has no significant evidence to validate the claim”. Recent research has observed that a great deal of misleading content about the coronavirus is circulating, and it is becoming difficult to differentiate fake news from real news [7]. The propagation of misleading content about the virus could also be deleterious to mankind. This motivates us to address the problem and to develop a system that can differentiate fake from real. Many previous studies have reported methods for detecting fake news in online social media across various applications. Most previous research has countered the fake news problem mainly with image-based and text-based algorithms. Text-based approaches mostly extract text patterns and match them against patterns already known to be associated with fake news; they are sometimes referred to as linguistic approaches. Along with this, many researchers have turned to credibility detection of posts/tweets using text-based features [8]. Image-based approaches have also been studied. Researchers have explored image-based algorithms for analysing fake images, or images attached to false claims, mainly of two kinds: text-additive images and manipulated images. A manipulated image is one in which a piece, part, or region has been altered with respect to its visual context. Many previous studies have analysed and employed image-based features for classification. The authors of [9] propose 5 visual features and 7 statistical features for the verification of news events.
Along with manipulated images, some researchers have also considered text-additive images to analyse misleading content. Text-additive images are embedded with false claims rather than being manipulated in their visual content. The authors of [10] incorporated text-additive images and applied a rule-based algorithm to detect misleading information. From recent research, we observed that very few works have reported an analysis of fake news prediction for news propagated during the coronavirus pandemic. Many people share fake cures for coronavirus disease without any verification and create many misconceptions. Governments and officials have also urged people to check the authenticity of a post before sharing it [7]. This also motivates us to build an intelligent system for predicting fake news spread during this pandemic. We therefore developed a generalized multi-web platform framework for detecting misleading content on social media platforms, taking COVID-19, a huge pandemic, as the application case study in this work. However, our model is generalized and works for other applications as well. COVID-19 is an emerging issue, and very little research has been reported yet in this context, which motivates us to build an efficient framework to predict misleading content spread during the COVID outbreak. The key contributions of the work are highlighted in the following points.

• The proposed work provides a novel, generalized Automated Multi-Web Platform Voting Framework for collecting and validating misleading content in online social networks, taking COVID-19 (fake news spread during the corona outbreak) as a special case study from the application perspective.
• To the best of our knowledge, we are the first to build a unique platform (Facts collector) that collects crucial facts and knowledge about a claim from two prominently used social media and web search platforms (YouTube and Google) in order to validate the claim. Along with this, we provide a different mechanism for building the search query so that the results are efficient and relevant.
• Four sets of novel features, based on content, linguistics/semantic cues, similarity, and sentiments, are extracted from the web platforms and fed into an ensemble-based machine learning model that classifies the news as misleading or real. In addition, confidence/support is gathered from the different web platforms.
• COVID-19 is an emerging issue, and very few studies have been reported on predicting the fake news propagated during this phase. Our analysis is therefore a major contribution that can help researchers in further study.
• We evaluate the performance of the model with different classifiers, and a comparative study reveals that the proposed technique outperforms other state-of-the-art approaches on the same dataset.

The remainder of the paper is organized as follows. Section 2 gives a detailed explanation of the research objectives. Section 3 describes earlier studies in this field, and Section 4 elaborates the proposed architecture/method that we employ for misleading information detection. A detailed description of the experimental results is given in Section 5. Lastly, the paper is concluded with some suggestions for future work.

2. Research objective

This research’s primary aim is to provide a framework to determine the veracity of any claim posted on social media, that is, to validate whether a given claim is fake or real. To achieve this aim, we propose a multi-web platform framework to gather crucial clues/facts, from which features are then extracted. Four sets of features (content-based, linguistic-based, similarity-based, and sentiment-based) are employed for this purpose. In this paper, we investigate the following research questions:

RQ1: Is incorporating a multi-web platform effective and more reliable than a single web platform for predicting false information?

RQ2: Does the incorporation of effective clues from one platform help improve the model’s performance when another platform cannot return the relevant facts?

RQ3: Is every social web platform effective in collecting crucial facts concerning a claim?

RQ4: Which of the features is most effective in discriminating misleading information?

We investigate all these research questions in the subsequent sections.

3. Related work

Deliberately spreading misleading information is one of the crucial problems nowadays. It is difficult for online users to distinguish fake news from real news, so an intelligent system is required. Many state-of-the-art methods have been developed to detect malicious information. Detection is treated as a classification problem that associates a label, true or false, with a particular claim/post. From our analysis of earlier studies, the classification approaches divide into techniques based on machine learning (ML) and deep learning (DL). A detailed description is given below.

3.1. Machine learning

ML algorithms have proven extremely useful for a wide range of problems in information engineering. Many of the ML techniques implemented for misleading/fake information prediction use a supervised learning strategy. Support Vector Machines (SVMs) are among the most prominently used machine learning methods for classification. The authors of [11] proposed a graph-kernel-based SVM method to detect rumours using propagation structure and content-based features, which achieved an accuracy of 0.91 on the Sina Weibo dataset. In [12], the author presented a set of features to discriminate between fake news, real news, and caricatures. SVMs are also widely employed for clickbait detection [13]. Like SVM, the random forest (RF) algorithm has been used in numerous fake news and rumour detection works. In previous studies, random forest was found to be the best performer among machine learning algorithms [14], [15], [16], [17], [18]. In [15], temporal, structural, and linguistic features were employed to classify rumours in a tweet graph using a random forest, with an accuracy of 0.90. RF is also widely used for stance detection, as shown in [14], [19]. Comparative studies of different approaches to rumour and fake news detection have shown competitive performance for logistic regression (LR) [16], [20], [21], [22]. The authors of [23] employed logistic regression for stance classification of news articles or headlines and claims. Another widely studied family of algorithms proposed for misleading content detection is the decision tree [24]. The effectiveness of decision tree algorithms such as J48 relative to other machine learning paradigms, including SVMs, has been reported in [8], [16].
The authors of [8] used content- and context-based features to evaluate the credibility of tweets, and the model performed well, with an accuracy of 0.86. In [25], a decision tree method was employed to evaluate the truthfulness of users in social media; the author incorporated a series of user trust metrics and reported an accuracy of 0.75.

3.2. Deep learning

Deep learning is one of the most prominent and widely explored research topics in machine learning. The main advantage of DL over traditional ML approaches is that it does not depend entirely on manually crafted features, which reduces feature extraction time. In addition, a deep learning framework can learn hidden representations from simpler inputs, in both context and content variation [26]. The two most widely used paradigms in modern artificial neural networks are recurrent neural networks (RNNs) and convolutional neural networks (CNNs). In [26], the authors applied RNN architectures (tanh-RNN, LSTM, and Gated Recurrent Unit (GRU)) to the detection of rumours. The results show that GRU achieved the best accuracy, 0.88 and 0.91 on the two datasets, respectively. A multi-task learning method is proposed in [27], in which an LSTM layer shared among all tasks addresses the rumour classification problem. CNNs have been widely studied for image recognition and many other fields of computer vision, and they are now gaining popularity in the NLP field [28]. The authors of [19] explored a CNN with single and multi-word embeddings to address both stance and veracity classification of tweets, reporting an accuracy of 0.70 for stance classification and 0.53 for veracity classification. Paragraph embedding has been explored to learn the representation of a small group of posts in a specific event, which is used as input to a CNN model [29]; it achieved an accuracy of 0.93 on Sina Weibo and 0.77 on Twitter. The combination of RNN and CNN has been frequently explored in recent works, as discussed in [30], [31], [32].
The authors of [32] proposed an architecture, applied to the LIAR dataset, that encodes the text via a CNN and metadata about the author of the text via an LSTM layer. The hybrid model was found to outperform all other baselines, along with a bi-LSTM architecture, with an accuracy of 0.27 on the test set. In [30], the author proposed an approach based on repost sequence patterns for the detection of false rumours.

A new coronavirus disease has spread around the world. During the pandemic, the public has watched closely for any news announcement about how to get rid of corona, lockdowns, and other official government notifications, but some malicious users have used COVID as an opportunity to spread fake news, misleading people for monetary benefit or as part of some propaganda. To address this issue, some researchers have turned to this field, treated it as an application in the health domain, and reported methods to predict misinformation [33], [34], [35], [36]. Some of the recent studies are summarized in Table 1.

Table 1.

Summary of research on Covid-misinformation detection.

Ref.	Features	Method
[33]	TF–IDF (term frequency–inverse document frequency)	Several machine learning methods were employed, including DT, RF, LR, Gradient Boost, and Support Vector Machine (SVM).

[34]	Tweet-based features, username handles, and URL domains	An ensemble-based method and a heuristic method are added to their original framework, on the intuition that username handles and URL domains play an important role in gathering crucial information about the truthfulness of tweets. RoBERTa, XLM-RoBERTa, XLNet, and DeBERTa were observed to perform best for their proposed approach.

[35]	Tweet-based features (count of hashtags, count of favourites, count of retweets, count of URLs, etc.), user-based features (user verified status, favourite count, external knowledge, etc.)	A cross-stitch-based semi-supervised end-to-end neural attention model for the prediction of misinformation, which achieves a 0.95 F1 score on CTF.

[36]	Linguistic features (n-grams, readability, emotional tone, and punctuation)	The proposed architecture employed a linear SVM, which obtained a weighted average F1 score of 95.19% on test data.

[37]	–	Combines topical distributions from Latent Dirichlet Allocation with contextualized representations from XLNet, achieving an F1 score of 0.967 with XLNet + Topic Distributions.

[38]	–	Proposed an explainable natural language processing model based on DistilBERT and SHAP (Shapley Additive exPlanations) to address misinformation proliferated during the COVID outbreak, achieving a best-run accuracy of 0.97 on their proposed dataset.

In summary, there are two potential gaps in previous research that we attempt to address in this study. Firstly, existing work largely focused on detecting misleading information from different application perspectives; very few studies identify COVID-19-related misinformation spread on social media in the health domain. Secondly, earlier studies mostly rely on a single platform, most prominently Twitter, to fetch clues for validating news, and little is known about whether integrating multiple platforms to collect clues is effective in distinguishing fake information from real. It has been observed that, instead of relying on a social media web platform, it is more efficient to extract information from news media sources/articles. Our study uses a multi-web platform framework to gather effective clues for predicting false information. The clues can be retrieved from multiple web platforms (here, Google web search and YouTube) to obtain strong support. One platform may fail to return effective clues for some queries while another succeeds. Following this idea, instead of relying on a single platform, our proposed model incorporates multiple web platforms for retrieving crucial evidence for a specific query. To the best of our knowledge, no previous study has explored this concept or included the idea of a unique platform that collects information from various social media and web sources for validating a news item/claim. We discuss all these points in detail in later sections.
This leads to the concept of a unique web platform for retrieving evidence, which has not been explored in state-of-the-art methods: first, titles/headings are retrieved from two platforms (YouTube and Google) for a specific claim/query; second, these headlines/titles are used for feature extraction and to obtain support for the claim, in order to predict whether it is fake or not.

4. Proposed methodology

This section discusses our idea of using an automated Multi-Web Platform Voting Framework to predict misleading information proliferated during the COVID-19 pandemic. Before discussing the proposed architecture, let us first state the problem and the objective of this paper. We adopt a binary classification formulation. The source claims comprise $c = \{c_1, c_2, c_3, \dots, c_k\}$, which are divided into two classes $Class = \{M, R\}$: (1) Real (R), namely the posted claim correctly represents the event/situation that it refers to; (2) Misleading (M), namely the posted claim does not correctly represent the event that it refers to. The claims considered in this work are related to the COVID-19 pandemic. The post “chlorine and alcohol products cannot kill viruses within the body” is an example belonging to the true class, while the post “coronavirus is caused by 5G technology”, which later turned out to be a false rumour, is an example of misleading information. These news items and posts are also acknowledged by WHO.² The objective of this work is to train a classifier from the sample dataset, that is, $f : X^{k} \mapsto Y^{k}$, where $Y^{k}$ consists of two target classes: {R, M}. The model takes an input feature set $X^{k}$ and returns an output indicating whether the posted claim $C^{k}$ is true or misleading. We built an automated Multi-Web Platform Voting Framework to predict misleading information proliferated during the COVID-19 outbreak to address the given problem. The detailed flow diagram of our proposed model is shown in Fig. 2. The flow diagram has two phases. Phase 1 covers data collection and information filtering, whereas Phase 2 covers feature extraction, model building, and classification. These phases are discussed in the following subsections.

Fig. 2 — The flow diagram of the proposed approach.

4.1. Data collection and information-filtering: Phase 1

The first phase of the proposed architecture comprises data collection, query building, fact collection, and information filtering. The data used for the analysis are COVID-19 fake news samples with binary class labels, fake and real. The dataset is from the CONSTRAINT 2021 shared task and has 5600 real and 5100 fake samples. The complete description of the dataset is given in Section 5.1. Each sample from the dataset is treated as an input query that needs to be validated. The input query passes through a text processing phase, in which each input sample is cleaned into a format suitable for further processing. This includes removal of stopwords, removal of duplicates, handling of missing data, stemming, punctuation removal, translation to English (Google translation API), and removal of URLs, symbols, emoji, etc. After text processing, the cleaned data is passed to the next module, called “Fact Collection”. Here, we build a query by appending the “fake news” keyword, separated by a space. The built query is then passed to the multi-web platform to retrieve relevant facts. Query building is one of the important aspects of gathering efficient and relevant titles: what query should be passed to get more relevant responses? We defined three ways to build a query and adopted case 1, the one found most effective in retrieving relevant information, as the others have limitations, discussed in Table 2. Table 2 describes the three cases we considered for building the query. In the first case, after text processing (removing stopwords and so on), the query is concatenated with a space and the “fake news” keyword. In our analysis, this way of generating the query was found to work well and is the one used in this study.
The other cases use the N_grams of the input query concatenated with the keyword “fake news”, and part-of-speech (POS) tagging, in which we find all the proper nouns in the input query and concatenate each proper noun with the “fake news” keyword to build a query. However, these cases have certain limitations: sometimes the context of the query does not come through properly and relevance is lost. The built query is then passed to the two prominent web platforms used to retrieve the facts (the YouTube platform and the Google web search platform). Each collected fact/headline then goes through the text pre-processing described earlier for cleaning, and is later used to extract crucial features.
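
The text processing step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword list is a tiny hypothetical stand-in, the function names `clean_text` and `deduplicate` are invented here, and stemming and translation are omitted because the paper does not specify which tools were used beyond the Google translation API.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; the paper does not name its list).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "that", "by", "with"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Crude symbol/emoji filter: drops every non-ASCII character.
# Acceptable here only because the input is assumed to be English already.
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")


def clean_text(text: str) -> str:
    """Strip URLs, non-ASCII symbols, punctuation, and stopwords; lowercase the rest."""
    text = URL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)


def deduplicate(samples):
    """Clean each sample, dropping missing/empty entries and duplicates (order kept)."""
    seen, out = set(), []
    for sample in samples:
        cleaned = clean_text(sample) if sample else ""
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            out.append(cleaned)
    return out
```

For example, `clean_text("Coronavirus is caused by 5G technology! https://example.com")` yields a lowercased string with the URL, punctuation, and stopwords removed.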

Table 2.

Possible build cases.

Case No.	Possible cases	Limitation
1.	Text_preprocessing(Input_query) + " " + "fake news"	–

2.	N_grams(Input_query) + " " + "fake news"	1. Sometimes the context of the query does not come through properly and is missed.

3.	Pos(Input_query) + " " + "fake news"	1. Sometimes the context of the query does not come through properly. 2. In some cases it returns too many irrelevant facts and goes out of context.
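
The three build cases in Table 2 can be illustrated with a short sketch. The function names are hypothetical, the n-gram size of 2 is an assumption (the paper does not state it), and case 3 takes the proper nouns as an argument because the POS tagger itself is outside the scope of this example.

```python
FAKE_KEYWORD = "fake news"


def build_query_case1(cleaned_query: str) -> str:
    """Case 1: the whole pre-processed claim, a space, then the keyword."""
    return cleaned_query + " " + FAKE_KEYWORD


def build_queries_case2(cleaned_query: str, n: int = 2) -> list:
    """Case 2: one query per word n-gram of the pre-processed claim."""
    tokens = cleaned_query.split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [gram + " " + FAKE_KEYWORD for gram in grams]


def build_queries_case3(proper_nouns) -> list:
    """Case 3: one query per proper noun found by an external POS tagger."""
    return [noun + " " + FAKE_KEYWORD for noun in proper_nouns]
```

Cases 2 and 3 split one claim into several short queries, which is consistent with the limitation noted in Table 2: each fragment carries less of the claim's context than the single case 1 query.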

Finally, the top 10 title headings are scraped automatically from both platforms using Selenium and used for further analysis. The fact collection procedure is shown in Algorithm 1, which describes how facts are collected from the multi-web platform. In this study, we incorporated two social media and web search platforms for retrieving efficient facts/title headings for a query, which are then used in feature engineering to validate the claim as fake/real. Other platforms such as Twitter were also explored for fact collection, but Twitter supports keyword-based search and is not suited to long query-based search, which is a major issue for collecting relevant facts. With Google web search and YouTube, in contrast, we can fetch efficient responses to a query.
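
A rough sketch of the fact collection loop is given below. It is not Algorithm 1 itself, which is not reproduced in this text. The browser automation is deliberately abstracted away: each platform is represented by a hypothetical `Searcher` callable that maps a query to a ranked list of titles, so a Selenium-based scraper for Google or YouTube could be plugged in without changing the loop. Returning an empty list when a platform fails is also an assumption of this sketch.

```python
from typing import Callable, Dict, List

# A searcher maps a query string to a ranked list of result titles.
# In practice this would wrap a scraper; here it is an abstract callable.
Searcher = Callable[[str], List[str]]


def collect_facts(query: str, searchers: Dict[str, Searcher], top_k: int = 10) -> Dict[str, List[str]]:
    """Return up to top_k non-empty titles per platform for one built query."""
    facts = {}
    for platform, search in searchers.items():
        try:
            titles = search(query)
        except Exception:
            # A failing platform contributes no clues instead of aborting the run.
            titles = []
        facts[platform] = [t.strip() for t in titles if t and t.strip()][:top_k]
    return facts
```

Keeping the platforms in a dictionary means that when one platform returns nothing for a query, the clues from the other are still available for feature extraction.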

We can also address the conspiracy statements about the COVID-19 virus. Conspiracy theories can be defined as “attempts to explain the ultimate causes of significant social and political events and circumstances with claims of secret plots by two or more powerful actors” [39]. Conspiracy beliefs may also influence the course of a crisis that initially favoured their appearance. Indeed, conspiracy beliefs have consequences, notably in the health domain [40]. For example, exposure to anti-vaccine conspiracy theories decreases vaccination intention [41].

Conspiracy statements about the COVID-19 virus are addressed by incorporating the responses from the Google search engine and the YouTube platform for the specific query statement. The derived features are analysed on the retrieved responses to assess the credibility of the claim. For example, one conspiracy belief among the public is to support hydroxychloroquine and reject vaccination. Retrieving information/responses from different sources gives significant clues for reliably predicting conspiracy statements about the COVID-19 virus.

4.2. Feature extraction, model building and classification: Phase 2

The second phase is the clue extraction and classification step. This module takes the facts collected in the previous step and uses them to obtain clues for predicting/classifying the claim as fake/real. Four sets of features are employed, based on content, linguistics/semantic cues, similarity, and sentiments. Each of these feature categories is discussed in detail in the following subsections and in Table 3.

Table 3.

Detailed description of proposed features.

Features category	Features	Feature description
Content-based	1. Question mark count 2. Fake word count	1. Number of question marks in a title heading. 2. Number of fake words encountered in a title heading.

Linguistics/Semantic cues-based	1. NLTK POS tagging semantic similarity	1. NLTK WordNet synsets are used to measure the semantic similarity between the user query and title headings.

Similarity-based	1. Cosine similarity $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$	1. Cosine similarity is used to measure the similarity between the user query and title headings.

Sentiment-based	1. Query sentiments 2. Clue sentiments 3. Sentiment match count	1. Returns the sentiment of the user query: positive, negative, or neutral. 2. Returns the sentiment of the title heading: positive, negative, or neutral. 3. Returns the number of times the query sentiment matches the title heading sentiment.
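
The two content-based features in Table 3 can be sketched as below. The phrase list is a small illustrative subset of the paper's corpus, and whole-word, case-insensitive matching that counts every occurrence is an assumption of this sketch; the paper does not state the matching rule.

```python
import re

# Small illustrative subset of the paper's false-phrase corpus.
FALSE_PHRASES = ["false", "misleading", "rumour", "fake news", "hoax",
                 "no evidence", "not true", "myth"]


def question_mark_count(title: str) -> int:
    """Number of question marks in a retrieved title heading."""
    return title.count("?")


def fake_word_count(title: str, phrases=FALSE_PHRASES) -> int:
    """Total occurrences of corpus phrases in the title (whole words, case-insensitive)."""
    text = title.lower()
    total = 0
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        total += len(re.findall(pattern, text))
    return total


def content_features(title: str) -> dict:
    """Bundle both content-based features for one title heading."""
    return {
        "question_mark_count": question_mark_count(title),
        "fake_word_count": fake_word_count(title),
    }
```

Word-boundary matching keeps "falsehood" from counting as a hit for "false"; plain substring matching would behave differently, so the choice affects the feature values.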


Susceptibility to misinformation about COVID-19 is identified using four sets of novel features: (1) content-based, (2) linguistic-based, (3) similarity-based, and (4) sentiment-based. The content-based features are the question mark count and the fake word count, whereas the linguistic-based feature is the NLTK POS tagging semantic similarity. Cosine similarity measures the similarity between two sentences, and the sentiment-based features return the polarity (positive, negative, or neutral) of the user query and of the retrieved titles. All of these serve as predictors of susceptibility to misinformation about COVID-19.

4.2.1. Content-based features

Content-based features have been widely explored in numerous data mining research fields. In this paper, we incorporate two content-based features for the prediction of misleading information: question mark count and fake word count. The question mark count gives clues about the confidence reflected in a sentence. A sentence that shows uncertainty indicates that the claim is not sure about the event, so the question mark count plays a major role in finding the uncertainty in a given sentence. If any question mark is encountered in a title/heading retrieved from the web platform while searching a specific query, it returns true. Similarly, fake_word_count is an important feature for discriminating fake from real. It relies on a false_phrase_corpus, a list of keywords that are prominently used to represent fake news. The keyword corpus that we created includes the following phrases: {‘false’, ‘misleading’, ‘inaccurate’, ‘rumour’, ‘rumour’, ‘not correct’, ‘fake news’, ‘incorrect’, ‘wrong’, ‘confounding’, ‘deceiving’, ‘deluding’, ‘wont’, ‘did’, ‘Did’, ‘funny’, ‘memes’, ‘catchy’, ‘bogus’, ‘counterfeit’, ‘fabricated’, ‘fictitious’, ‘forged’, ‘fraudulent’, ‘mock’, ‘phony’, ‘affected’, ‘artificial’, ‘erroneous’, ‘fake’, ‘fanciful’, ‘faulty’, ‘improper’, ‘invalid’, ‘mistaken’, ‘unfounded’, ‘unreal’, ‘untrue’, ‘untruthful’, ‘casuistic’, ‘fishy’, ‘illusive’, ‘imaginary’, ‘inexact’, ‘lying’, ‘misrepresentative’, ‘falsity’, ‘misreport’, ‘misstatement’, ‘deception’, ‘falsification’, ‘artificial’, ‘fabrication’, ‘falsehood’, ‘hoax’, ‘?’, ‘Not Died’, ‘misinformation’, ‘not committed’, ‘not dead’, ‘death rumour’, ‘is it true’, ‘not known’, ‘no proof’, ‘no known’, ‘no scientific evidence’, ‘no evidence’, ‘not verified’, ‘clickbait’, ‘not proven’, ‘denied’, ‘deny’, ‘unverified’, ‘falsely’, ‘myth’, ‘ridiculous’, ‘not true’}. If any of these words is encountered in the responses retrieved for a query, the fake count is incremented by 1. The feature helps identify fake news because a title containing these phrases is more likely to present the news as fake.
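The two content-based features can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the corpus is an abbreviated subset of the list above, and it assumes the fake count is incremented once per distinct corpus phrase found in a title, which the text does not state explicitly. The ‘?’ entry of the corpus is left to the separate question mark count here.

```python
import re

# Abbreviated, illustrative subset of the keyword corpus listed in the text.
FALSE_PHRASE_CORPUS = {
    "false", "misleading", "fake news", "hoax",
    "no evidence", "not true", "myth",
}

def question_mark_count(title: str) -> int:
    """Number of question marks in a retrieved title/heading."""
    return title.count("?")

def fake_word_count(title: str, corpus=FALSE_PHRASE_CORPUS) -> int:
    """Number of distinct corpus phrases that occur in the title.

    Assumption: one increment per distinct phrase, matched
    case-insensitively on word boundaries.
    """
    text = title.lower()
    count = 0
    for phrase in corpus:
        # \b keeps 'myth' from matching inside 'mythology'.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            count += 1
    return count
```

For a title such as "Is it a hoax?", both functions return 1 under these assumptions.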

4.2.2. Linguistics/semantic cues-based features

Processing raw text intelligently is challenging because the same word can mean something completely different depending on its context and position. Linguistic knowledge makes it possible to understand the semantics of a sentence and the context in which each word is used, which is essential for interpreting a given claim. The NLTK function nltk.pos_tag serves this purpose: given tokenized text, it returns each token annotated with a tag. A statistical model predicts which tag or label most likely applies in the given context, a process known as part-of-speech (POS) tagging. The goal is to mark each word in a text with a particular part of speech, based on both its definition and its context.

POS tagging also describes the characteristics of lexical terms within a sentence or text, which can then be used to make predictions about semantics. To compute the semantic text similarity between two sentences, we use POS-based text similarity. Each word in a sentence can receive one of several POS tags, such as NNS (noun, plural), NNP (proper noun, singular), and NN (noun, singular). The NLTK POS tagger assigns this grammatical information to each word of the sentence. The feature is useful for computing the semantic text similarity between the user query and the clues retrieved from the web platforms. The tags generated by nltk.pos_tag are converted to the tags used by wordnet.synsets, and NLTK WordNet synsets are then used to measure the similarity.
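The tag conversion and the sentence-level aggregation can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: the input is assumed to be (token, Penn Treebank tag) pairs such as those produced by nltk.pos_tag, the aggregation (average of the best match per query word) is one common choice that the paper does not specify, and `word_sim` is a hypothetical pluggable function. In practice it could wrap a WordNet synset measure such as path similarity; here a toy exact-match function stands in so the example runs without any corpus data.

```python
from typing import Callable, List, Optional, Tuple

Tagged = List[Tuple[str, str]]   # (token, Penn Treebank tag) pairs
WordPos = Tuple[str, str]        # (lowercased token, WordNet POS letter)

# First two letters of a Penn Treebank tag -> WordNet POS letter
# ('n' noun, 'v' verb, 'a' adjective, 'r' adverb).
_PENN_PREFIX_TO_WORDNET = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}

def penn_to_wordnet(tag: str) -> Optional[str]:
    """Convert a Penn Treebank tag to a WordNet POS letter, or None."""
    return _PENN_PREFIX_TO_WORDNET.get(tag[:2])

def content_words(tagged: Tagged) -> List[WordPos]:
    """Keep only tokens whose tag maps to a WordNet part of speech."""
    words = []
    for token, tag in tagged:
        pos = penn_to_wordnet(tag)
        if pos is not None:
            words.append((token.lower(), pos))
    return words

def sentence_similarity(query: Tagged, title: Tagged,
                        word_sim: Callable[[WordPos, WordPos], float]) -> float:
    """Average, over query words, of the best word-level match in the title."""
    q, t = content_words(query), content_words(title)
    if not q or not t:
        return 0.0
    best = [max(word_sim(a, b) for b in t) for a in q]
    return sum(best) / len(best)

def exact_match(a: WordPos, b: WordPos) -> float:
    """Toy stand-in for a WordNet-based word similarity."""
    return 1.0 if a == b else 0.0
```

Keeping the word-level measure as a parameter separates the tag conversion, which the text describes, from the choice of synset similarity, which it leaves open.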

4.2.3. Similarity-based feature

Similarity is the third category of features used in this work. This feature helps separate relevant titles/headings from the full set of responses, as not all responses are useful for validation. To keep the model efficient, irrelevant titles are removed from the analysis, and only those that cross the threshold value are retained. The widely used cosine similarity measure is employed to compute the similarity between two sentences irrespective of their size. The sentences are represented as two vectors, and the cosine similarity is determined by the angle $θ$ between them. If the angle between the two sentence vectors is 0, they are similar, and if $θ = 90 °$ they are dissimilar. The similarity between two sentences $x$ and $y$ is calculated as:

$COS (x, y) = \frac{x \cdot y}{\|x\| \cdot \|y\|}$	(1)
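As a worked illustration of Eq. (1), the sketch below computes cosine similarity over simple bag-of-words count vectors and filters titles by a threshold. The vectorisation scheme, the function names, and the threshold value of 0.3 are assumptions made for the example; the paper does not state which representation or threshold was used.

```python
import math
from collections import Counter
from typing import List

def bag_of_words(sentence: str) -> Counter:
    """Lowercased whitespace tokens with their counts."""
    return Counter(sentence.lower().split())

def cosine_similarity(x: str, y: str) -> float:
    """COS(x, y) = (x . y) / (||x|| * ||y||) on word-count vectors."""
    a, b = bag_of_words(x), bag_of_words(y)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def filter_relevant(query: str, titles: List[str],
                    threshold: float = 0.3) -> List[str]:
    """Keep titles whose similarity to the query reaches the threshold."""
    return [t for t in titles if cosine_similarity(query, t) >= threshold]
```

Two sentences with no words in common score 0 (the $θ = 90 °$ case), and identical sentences score 1 (the $θ = 0$ case).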

4.2.4. Sentiment based features

Sentiment-based features are the fourth set of features employed for the prediction of fake news. Sentiments play an important role in identifying the polarity of the sentence, whether it is showing positive, negative, or neutral sentiments. Here, we have considered 3 features under this category.

(1) Query sentiment: the sentiment of the input query given by the user.
(2) Title/heading sentiment: the sentiment of each response (title/heading) received as a search result for a specific query.
(3) Sentiment match count: out of the 10 responses retrieved from the web platforms, the number of times the sentiment of the query matches the sentiment of a title. A match indicates that the query and the heading express the same sentiment and have the same polarity.
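The three sentiment features can be sketched as below. The paper does not name the sentiment analyser it uses, so the tiny word lists and the `polarity` function are purely hypothetical stand-ins for whatever tool assigns positive, negative, or neutral labels; only the matching logic mirrors the description above.

```python
from typing import Dict, List

# Toy polarity lexicon, for illustration only (not from the paper).
POSITIVE = {"safe", "cure", "cures", "effective", "good", "protects"}
NEGATIVE = {"dangerous", "kills", "deadly", "bad", "harmful", "fake"}

def polarity(text: str) -> str:
    """Label text as 'positive', 'negative', or 'neutral'."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def sentiment_features(query: str, titles: List[str]) -> Dict[str, object]:
    """Query sentiment, per-title sentiments, and the number of matches."""
    query_sentiment = polarity(query)
    title_sentiments = [polarity(t) for t in titles]
    return {
        "query_sentiment": query_sentiment,
        "title_sentiments": title_sentiments,
        "sentiment_match_count": sum(s == query_sentiment for s in title_sentiments),
    }
```

With 10 retrieved titles per query, `sentiment_match_count` ranges from 0 to 10.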

All of the features discussed above are summarized in Table 3, and Algorithm 2 describes the complete fact-validation process, including the functions that evaluate the four sets of features. These features are later fed to an ensemble-based classifier to analyse the performance of the model.

5. Experimental analysis and results

The experimental analysis is performed on publicly available datasets, and several performance measures (Precision (pre), Recall (rec), F1-score, Accuracy (acc), etc.) are adopted to measure the effectiveness of the proposed method. Finally, we present results showing the performance of the proposed model together with a comparative analysis against other state-of-the-art methods. The following subsections cover each of these points.

5.1. Constraint-2021 COVID-19 fake news detection dataset: Dataset description

In this paper, we use the dataset from the CONSTRAINT-2021 shared task on detecting COVID-19 fake news in English.³ The CONSTRAINT-2021 shared task on hostile post detection comprises two tasks, one in English and one in Hindi. The dataset is used in this paper to evaluate our proposed model. It was collected from various social media platforms such as Twitter, Facebook, and Instagram. The main objective of the task is to classify a given social media post as fake or real. The dataset contains 10,700 manually annotated social media posts and articles of fake and real news on COVID-19 [33]. Examples of fake and real samples from the Constraint-2021 fake news dataset are shown in Table 4. The dataset is split into training, validation, and test sets in the ratio 3:1:1, as shown in Table 5.

Table 4.

Example of fake and real sample in the dataset.

Label	Text
Fake	Chlorine dioxide generated sodium chloride natural zeolite can cure corona or any virus
Real	Breathlessness excessive fatigue and muscle aches from COVID can last for months.
Fake	Have ginger boiled in water for 30 min and haldi powder with milk no corona-can touch you.
Fake	Try using uv light it kills corona virus and keep you safe

Real	Wash your hands often with soap and water for at least 20 s especially after being in a public place or after blowing your nose coughing or sneezing

Real	Avoid touching your eyes nose and mouth with unwashed hands.

Real	Coronavirus is a respiratory virus which spreads primarily through droplets generated when an infected person coughs or sneezes or through droplets of saliva or discharge from the nose


Table 5.

The Constraint-2021 task dataset description.

Split category	Real	Misleading	Total
Training set	3360	3060	6420
Validation set	1120	1020	2140
Test set	1120	1020	2140

Total	5600	5100	10 700


5.2. Evaluation measures

To evaluate the performance of the models, we employ four metrics: accuracy, precision, recall, and F1-score. These metrics are widely used evaluation measures in classification tasks. Each of them is explained in Table 6.

Table 6.

Performance measures.

Measure	Definition	Computation formula
Accuracy	The proportion of correctly predicted samples to the total number of samples.	$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
Precision	The proportion of truly positive samples among the samples identified as positive.	$Precision = \frac{TP}{TP + FP}$
Recall	The proportion of correctly identified samples among the truly positive samples.	$Recall = \frac{TP}{TP + FN}$
F1-Score	The F1-score combines precision and recall to evaluate performance.	$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$
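The formulas in Table 6 translate directly into code. The sketch below computes all four measures from confusion-matrix counts; the function name and the zero-division convention (returning 0.0) are choices made for this example, and any counts passed in are hypothetical rather than taken from the paper's experiments.

```python
from typing import Dict

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with 90 true positives, 80 true negatives, 10 false positives, and 20 false negatives, accuracy is 170/200 = 0.85 and precision is 90/100 = 0.90.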


5.3. Classification methods and results

5.3.1. Method description

The experimental analysis employs several machine learning algorithms: Logistic Regression (LR), a discriminative model; Support Vector Machine (SVM); Random Forest (RF), an ensemble-based model; other ensemble variants, namely a Voting Classifier (RF, LR, KNN (K-Nearest Neighbour)), a Voting Classifier (LR, LSVM, CART (Classification and Regression Trees)), and a Bagging Classifier (Decision Tree); and Naive Bayes. These methods are prominently used in misinformation detection [42], [43], [44]. The classification models are explained as follows. LR is a widely employed classification model that uses the maximum likelihood function and the gradient descent method to classify data. SVM is a classification algorithm that performs well and is widely employed for classification and regression problems in machine learning. The main aim of SVM is to create a decision boundary, or best-fit line, that separates the classes so that a new data point can be placed in the correct category. In this paper, we consider the linear SVM (LSVM). RF is an ensemble-based classification and regression method that incorporates multiple single classifiers; the final classification output is obtained by combining the results of these classifiers to improve the performance of the model.

Similarly, we incorporate variants of ensemble learners for effective classification, as combining multiple classifiers reduces variance, noise, and bias, especially in the case of unstable classifiers, and may give better and more reliable classification results than a single classifier.
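The combination step of a voting classifier can be shown with a small, library-free sketch. It assumes hard (majority) voting over the label predictions of already-trained base classifiers; the paper does not state whether hard or soft voting was used, and the function name is hypothetical. Libraries such as scikit-learn provide a ready-made voting ensemble, but plain Python is used here to keep the example self-contained.

```python
from collections import Counter
from typing import List, Sequence

def hard_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Majority label per sample.

    `predictions` holds one list of labels per base classifier,
    all of the same length (one label per sample).
    """
    if not predictions:
        return []
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same samples")
    voted = []
    for sample_labels in zip(*predictions):
        # With an odd number of classifiers and two classes, ties cannot occur.
        voted.append(Counter(sample_labels).most_common(1)[0][0])
    return voted
```

With three base classifiers (for instance LR, LSVM, and CART) and the two classes fake and real, a sample is labelled fake whenever at least two of the three classifiers predict fake.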

5.3.2. Classification results

The analysis is performed on the Constraint-2021 COVID-19 fake news dataset discussed in Section 5.1 to measure the performance of our model in terms of precision, recall, F1-score, and accuracy. Here, we address RQ1, RQ2, and RQ3. Because we incorporate multiple web platforms, the analysis is performed on each web platform separately as well as on the hybrid (Google + YouTube) of both. The test set of the employed dataset does not contain labels for the posts/claims, so the validation set is used for analysing and validating the performance of the model. The results of this experiment on the validation set are shown in Tables 7, 8, and 9, respectively.

The first study incorporates the clues retrieved from the Google platform for a specific query. Performance is analysed on four measures (precision, recall, F1-score, and accuracy) using Random Forest, Support Vector Machine, Logistic Regression, and ensemble-based classification algorithms. The ensemble-based Voting Classifiers (RF, LR, KNN) and (LR, LSVM, CART) perform better than the other classifiers, with an F1-score of 0.989, and the second-best run is given by SVM with an F1-score of 0.987.

Another set of experiments is performed on the clues retrieved from the YouTube platform, where LR performs best with an F1-score of 0.869. The clues retrieved from YouTube alone are clearly less effective than those from the Google web search platform. This motivates the multi-web platform concept, in which one platform supports the decision on whether a claim is false or true when another platform cannot provide sufficient clues for the prediction. The hybrid model improves on the YouTube-only performance by incorporating clues from both web platforms. Its best F1-score, 0.980, is achieved by SVM and the ensemble-based classifier (LR, LSVM, CART), as shown in Table 9. Some earlier studies have also worked on this problem and reported results on the Constraint Task 2021 COVID fake news dataset.

Table 7.

Performance of the model incorporating Google web platform on validation set.

Model	Precision	Recall	F1-Score	Accuracy
Random Forest	0.982	0.981	0.981	0.980
SVM	0.987	0.987	0.987	0.987
LR	0.986	0.986	0.986	0.986

Ensemble learners
Random Forest	0.986	0.986	0.986	0.985
Voting classifier (RF, LR, KNN)	0.989	0.989	0.989	0.989
Voting Classifier (LR, LSVM, CART)	0.989	0.989	0.989	0.987
Bagging Classifier (Decision-Tree)	0.980	0.979	0.979	0.978


Table 8.

Performance of the model incorporating YouTube web platform on the validation set.

Model	Precision	Recall	F1-Score	Accuracy
Random Forest	0.866	0.852	0.850	0.851
SVM	0.860	0.860	0.860	0.860
LR	0.870	0.869	0.869	0.869

Ensemble learners
Random Forest	0.866	0.852	0.850	0.851
Voting classifier (RF, LR, KNN)	0.863	0.863	0.863	0.862
Voting Classifier (LR, LSVM, CART)	0.865	0.865	0.865	0.864
Bagging Classifier (Decision-Tree)	0.853	0.795	0.785	0.795


Table 9.

Performance of the model incorporating both the web platform (Google + YouTube) on validation set.

Model	Precision	Recall	F1-Score	Accuracy
Random Forest	0.974	0.973	0.973	0.973
SVM	0.980	0.980	0.980	0.979
LR	0.978	0.978	0.978	0.976
SGD	0.980	0.980	0.980	0.975

Ensemble learners
Random Forest	0.974	0.973	0.973	0.973
Voting classifier (RF, LR, KNN)	0.979	0.979	0.979	0.979
Voting Classifier (LR, LSVM, CART)	0.980	0.980	0.980	0.980
Bagging Classifier (Decision-Tree)	0.972	0.971	0.970	0.975


5.4. Comparative analysis with the existing methods

The comparative study with other state-of-the-art methods on the validation set is shown in Table 10. The authors of [33] (Model 1) and [34] (Model 2) proposed methods to predict misleading information proliferated during the COVID-19 outbreak, reporting best-run F1-scores of 93.46 using SVM and 98.32 using an ensemble-based model, respectively, as shown in Figs. 3 and 4. Our model outperforms both, with its best run obtained from the ensemble-based model incorporating LR, LSVM, and CART at an F1-score of 98.88. The authors of [45] (Model 3) worked on the same task using various machine learning classifiers. The best run of Model 3 uses SVM with an F1-score of 95.70, whereas our proposed approach with SVM gives an F1-score of 98.70, an improvement of 3 percentage points, as shown in Fig. 5.

Table 10.

Comparative study with the other state-of-the-art method on the validation set incorporating Google web platform.

Method	Model	Precision	Recall	F1-Score	Accuracy
[33] (Model 1)	DT	85.31	85.23	85.25	85.23
[33] (Model 1)	LR	92.76	92.79	92.79	92.75
[33] (Model 1)	SVM	93.46	93.48	93.46	93.46
[34] (Model 2)	Ensemble Model + Heuristic Post-Processing	98.32	98.32	98.32	98.32
[45] (Model 3)	SVM	95.71	95.70	95.70	95.70
[45] (Model 3)	LR	95.43	95.42	95.42	95.42
[45] (Model 3)	RF	90.98	90.79	90.80	90.79
[45] (Model 3)	NB	93.33	93.32	93.31	93.32
[45] (Model 3)	MLP	93.62	93.60	93.59	93.60
Our proposed model	Ensemble voting classifier(LR,CART,LSVM)	98.88	98.88	98.88	98.79
Our proposed model	Random Forest	98.20	98.10	98.10	98.09
Our proposed model	LSVM	98.70	98.70	98.70	98.70
Our proposed model	Logistic regression	98.60	98.60	98.60	98.55
Our proposed model	NB	95.55	95.53	95.54	95.34


Fig. 3. Comparative analysis of our model with Model 1 on F1 score.

Fig. 4. Comparative analysis of our model with Model 2 on F1 score.

Fig. 5. Comparative analysis of our model with Model 3 on F1 score.

5.5. Result implications and constraints of the study

The result analysis shows that incorporating multiple web platforms reduces uncertainty in several ways. First, it helps when one platform cannot provide sufficient clues to predict whether a claim is true. Second, supporting clues from different platforms are a reliable source for validating the claim. Third, incorporating a multi-web platform improves the performance of the model; here, we considered two web platforms, YouTube and Google web search. As the results in Sections 5.3.2 and 5.4 show, retrieving clues from Google web search independently gives promising results, whereas YouTube does not perform well independently. If we depended only on YouTube to retrieve facts, the model’s performance would decrease greatly, as shown in Table 8. To address this, instead of relying on the YouTube platform alone, we also combine the facts from the other platform (Google web search) to validate the facts and improve the model’s performance, as shown in Table 9. The results also show that Google web search is an effective platform for retrieving crucial information regarding the query. These observations address RQ1, RQ2, and RQ3. They validate that incorporating multiple web platforms is more effective and reliable than a single web platform for predicting false information, and that effective clues from one platform help improve the model’s performance when the other platform is unable to return relevant facts. When YouTube alone does not perform well, the other platform (Google Search) provides support for predicting the veracity of the claim.

RQ4 asks whether every social web platform effectively collects crucial facts concerning a claim. The observations and experimental analysis reveal that this is not the case: not every platform performs well for a given query. Some platforms may be unable to process the query, and there are other platform-specific constraints. The main constraint we found in this study arises when the web search platform cannot process and understand the query effectively and therefore does not return relevant facts. In those cases, prediction is difficult. We considered two platforms in the analysis to address this issue. We also explored Twitter, but it supports only keyword-based search, not sentence or long-query search, which is why Twitter is not incorporated in this study.

5.6. Feature evaluation

The four sets of features are evaluated, both individually and in combination, to identify how crucial they are in predicting fake news. Table 11 lists the feature combinations and their corresponding results in terms of precision, recall, F1-score, and accuracy. The features are evaluated on the best-run model in our study, the ensemble voting classifier (LR, CART, and LSVM). Table 11 shows that the (Content + Similarity) and (Content + Linguistic + Sentiment + Similarity) feature sets outperform all other feature combinations in terms of F1-score, with a value of 0.988. With respect to accuracy, the best run is provided by the (Content + Similarity) features, with a value of 98.83. These observations answer RQ4: which of the features is more effective in discriminating misleading information? The (Content + Similarity) and (Content + Linguistic + Sentiment + Similarity) feature sets perform best and are the most effective in discriminating misleading information.

Table 11.

Feature evaluation.

Feature	Accuracy	Precision	Recall	F1-Score	Classifier
Content	98.73	0.987	0.987	0.987	Ensemble voting classifier (LR, CART, LSVM)
Linguistic	98.46	0.985	0.985	0.985
Sentiment	98.51	0.985	0.985	0.985
Similarity	98.66	0.987	0.987	0.987
Content $+$ Linguistic	98.51	0.985	0.985	0.985
Content $+$ Sentiment	98.68	0.987	0.987	0.987
Content $+$ Similarity	98.83	0.988	0.988	0.988
Linguistic $+$ Sentiment	98.5	0.985	0.985	0.985
Linguistic $+$ Similarity	98.57	0.986	0.986	0.986
Sentiment $+$ Similarity	98.51	0.985	0.985	0.985
Content $+$ Linguistic $+$ Sentiment	98.70	0.987	0.987	0.987
Content $+$ Sentiment $+$ Similarity	98.70	0.987	0.987	0.987
Content $+$ linguistic $+$ Sentiment $+$ Similarity	98.79	0.988	0.988	0.988
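The feature-combination study can be organised as a loop over subsets of the four feature categories. The sketch below only enumerates the subsets and selects the best one from a score table; the function names are hypothetical, and training and scoring the classifier for each subset are left out. Four categories give 15 non-empty subsets, of which Table 11 reports 13.

```python
from itertools import combinations
from typing import Dict, List, Tuple

FEATURE_SETS = ("Content", "Linguistic", "Sentiment", "Similarity")

def feature_combinations(names: Tuple[str, ...] = FEATURE_SETS) -> List[Tuple[str, ...]]:
    """All non-empty subsets of the feature categories, smallest first."""
    return [combo
            for size in range(1, len(names) + 1)
            for combo in combinations(names, size)]

def best_combination(scores: Dict[Tuple[str, ...], float]) -> Tuple[str, ...]:
    """Subset with the highest score (e.g. accuracy or F1-score)."""
    return max(scores, key=scores.get)

# Three accuracy values copied from Table 11, used only as sample input.
table11_accuracy = {
    ("Content",): 98.73,
    ("Content", "Similarity"): 98.83,
    ("Content", "Linguistic", "Sentiment", "Similarity"): 98.79,
}
```

Applied to the three sample rows, `best_combination` selects (Content, Similarity), in line with the accuracy column of Table 11.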


6. Conclusion and future work

In this work, we presented a novel and intelligent methodology for detecting evidence of news and posts that can mislead public opinion and disrupt social order with respect to a particular event or situation. Fake news propagated during the pandemic is considered as a special case study and analysed thoroughly. We proposed an automated Multi-Web Platform Voting Framework with YouTube and Google as the major sources for retrieving clues. Four sets of novel features based on content, linguistic/semantic cues, similarity, and sentiment are gathered from these platforms and fed into an ensemble-based machine learning model to classify the news as misleading or real. Voting is applied to validate the news and to check the confidence/support given by the different web platforms. The Google web platform by itself performs well in retrieving crucial knowledge, giving the best F1-score of 98.88 with an ensemble-based model incorporating LR, LSVM, and CART, whose vote gives the final decision. YouTube alone, however, achieves an F1-score of only 86.90 with LR, which is quite low. YouTube alone therefore cannot retrieve effective clues to predict the news, but the multi-web platform scheme improves the model’s performance by taking support from other platforms to validate the veracity of news when such clues are not available. Retrieving clues from multiple web platforms improves the model’s performance and, with an ensemble-based classification model, outperforms other state-of-the-art techniques on the same dataset.

In the future, other platforms (Instagram, WhatsApp, etc.) can be explored to validate the news, and the work can be expanded to include different data modalities (images, videos, etc.). The work can also be extended by developing a real-time plugin or extension around this model to help society identify misleading content. In addition, the web scraping process can be enhanced to eliminate ‘irrelevant’ data from the collected ground-truth data, for example by removing contact-us information, the organization’s location, and descriptions associated with images. Finally, the proposed framework can be used to detect misleading information shared or retweeted on Twitter in near real time.

CRediT authorship contribution statement

Deepika Varshney: Software, Validation, Investigation, Data curation, Writing – original draft, Visualization. Dinesh Kumar Vishwakarma: Conceptualization, Methodology, Formal analysis, Resources, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Ethical approval and consent to participate

Yes

Consent for publication

Yes

Biographies


Miss Deepika Varshney is a Research Scholar in the Department of Information Technology, Delhi Technological University India. She has done M.Tech from IGDTUW, Delhi 2014. Her area of interest is Online Social Media Privacy and Security, Machine Learning, Data Science.


Dinesh Kumar Vishwakarma (M’16, SM’19) received Doctor of Philosophy (Ph.D.) degree in the field of Computer Vision from Delhi Technological University (Formerly Delhi College of Engineering), New Delhi, India, in 2016. He is currently an Associate Professor with the Department of Information Technology, Delhi Technological University, New Delhi. His current research interests include Computer Vision, Deep Learning, Sentiment Analysis, Fake News Detection, Crowd Analysis, Human Action and Activity Recognition. He is a reviewer of various Journals/Transactions of IEEE, Elsevier, and Springer. He has been awarded with “Premium Research Award” by Delhi Technological University, Delhi, India. He is also an Associate Editor of IEEE Transactions on Circuits Systems for Video Technology.

Footnotes

https://in.search.yahoo.com/search?fr=mcafee&type=E211IN885G0&p=coronavirus.

https://www.indiatoday.in/world/story/drinking-alcohol-will-not-protect-you-from-covid-19-says-who-1653555-2020-03-08.

https://constraint-shared-task-2021.github.io/.

References

1.Valecha R., Srinivasan S.K., Volety T., Kwon K.H., Agrawal M., Rao H.R. Fake news sharing: An investigation of threat and coping cues in the context of the Zika virus. Digit. Threat. Res. Pract. 2021;2(2):1–16. [Google Scholar]
2.Valecha R., Oh O., Rao R. 2013. An exploration of collaboration over time in collective crisis response during the Haiti 2010 earthquake. [Google Scholar]
3.Valecha R., Sultania A., Chandola V., Agrawal M., Rao H.R. Proceedings of the Workshop on E-Business (WEB’14) Auckland; New Zealand: 2014. A big data approach to rumor mitigation in Twitter microblog: A case of Boston bombings. [Google Scholar]
4.Li J., Vishwanath A., Rao H.R. Retweeting the Fukushima nuclear radiation disaster. Commun. ACM. 2014;57(1):78–85. [Google Scholar]
5.Rao H.R., Vemprala N., Akello P., Valecha R. Retweets of officials’ alarming vs reassuring messages during the COVID-19 pandemic: Implications for crisis management. Int. J. Inf. Manage. 2020;55 doi: 10.1016/j.ijinfomgt.2020.102187. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Boididou C., Papadopoulos S., Zampoglou M., Apostolidis L., Papadopoulou O., Kompatsiaris Y. Detection and visualization of misleading content on Twitter. Int. J. Multimed. Inf. Retr. 2018;7(1):71–86. [Google Scholar]
7.Zhou W., Wang A., Xia F., Xiao Y., Tang S. 2020. Effects of media reporting on mitigating spread of COVID-19 in the early phase of the outbreak. [DOI] [PubMed] [Google Scholar]
8.C. Castillo, M. Mendoza, B. Poblete, Information credibility on twitter, in: Proceedings of the 20th International Conference on World Wide Web, 2011, pp. 675–684.
9.Jin Z., Cao J., Zhang Y., Zhou J., Tian Q. Novel visual and statistical image features for microblogs news verification. Trans. Multi. 2017;19(3):598–608. [Google Scholar]
10.Vishwakarma D.K., Varshney D., Yadav A. Detection and veracity analysis of fake news via scrapping and authenticating the web search. Cogn. Syst. Res. 2019;58:217–229. [Google Scholar]
11.Wu K., Yang S., Zhu K.Q. 2015 IEEE 31st Int. Conf. Data Eng. 2015. False rumors detection on Sina Weibo by propagation structures; pp. 651–662. [Google Scholar]
12.Horne B.D., Adali S. Eleventh International AAAI Conference on Web and Social Media. 2017. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. [Google Scholar]
13.Chakraborty A., Paranjape B., Kakarla S., Ganguly N. Stop clickbait: Detecting and preventing clickbaits in online news media. 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining; ASONAM; 2016. pp. 9–16. [Google Scholar]
14.Aker A., Derczynski L., Bontcheva K. 2017. Simple open stance classification for rumour analysis. arXiv Prepr. arXiv:1708.05286. [Google Scholar]
15.Briscoe E.J., Appling D.S., Hayes H. 2014 47th Hawaii International Conference on System Sciences. 2014. Cues to deception in social media communications; pp. 1435–1443. [Google Scholar]
16.Giasemidis others G. 2016. Determining the veracity of rumours on Twitter. CoRR, abs/1611.0. [Google Scholar]
17.Kwon S., Cha M., Jung K., Chen W., Wang Y. 2013 IEEE 13th Int. Conf. Data Min. 2013. Prominent features of rumor propagation in online social media; pp. 1103–1108. [Google Scholar]
18.Zeng L., Starbird K., Spiro E.S. Tenth International AAAI Conference on Web and Social Media. 2016. # Unconfirmed: Classifying rumor stance in crisis-related social media messages. [Google Scholar]
19.Derczynski L., Bontcheva K., Liakata M., Procter R., Hoi G.W.S., Zubiaga A. 2017. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. CoRR, abs/1704.0. [Google Scholar]
20.Tacchini E., Ballarin G., Della Vedova M.L., Moret S., de Alfaro L. 2017. Some like it hoax: Automated fake news detection in social networks. arXiv Prepr. arXiv:1704.07506. [Google Scholar]
21.Zhou L., Twitchell D.P., Qin T., Burgoon J.K., Nunamaker J.F. 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the. 2003. An exploratory study into deception detection in text-based computer-mediated communication; p. 10. [Google Scholar]
22.W. Ferreira, A. Vlachos, Emergent: a novel data-set for stance classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1163–1168.
23.Breiman L. Routledge; 2017. Classification and Regression Trees. [Google Scholar]
24.Loh W.-Y. Classification and regression trees. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2011;1(1):14–23. doi: 10.1002/widm.14. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Bodnar T., Tucker C., Hopkinson K., Bilén S.G. 2014 IEEE International Conference on Big Data (Big Data) 2014. Increasing the veracity of event detection on social media networks through user trust modeling; pp. 636–643. [Google Scholar]
26.F. Yu, et al., Detecting Rumors from Microblogs with Recurrent Neural Networks, CoRR, 8, (1) (2016) 1435–1443.
27.E. Kochkina, M. Liakata, A. Zubiaga, All-in-one: Multi-task Learning for Rumour Verification, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 3402–3413.
28.Jacovi A., Shalom O.S., Goldberg Y. 2018. Understanding convolutional neural networks for text classification. arXiv Prepr. arXiv:1809.08037. [Google Scholar]
29.Yu F., Liu Q., Wu S., Wang L., Tan T. Attention-based convolutional approach for misinformation identification from massive and noisy microblog posts. Comput. Secur. 2019;83:106–121. [Google Scholar]
30.Song C., Yang C., Chen H., Tu C., Liu Z., Sun M. CED: Credible early detection of social media rumors. IEEE Trans. Knowl. Data Eng. 2019 [Google Scholar]
31.O. Ajao, D. Bhowmik, S. Zargari, Fake news identification on twitter with hybrid cnn and rnn models, in: Proceedings of the 9th International Conference on Social Media and Society, 2018, pp. 226–230.
32.Wang W. 2017. ‘Liar, liar pants on fire’: A new benchmark dataset for fake news detection. [Google Scholar]
33.Patwa P., et al. 2020. Fighting an infodemic: Covid-19 fake news dataset. arXiv Prepr. arXiv:2011.03327. [Google Scholar]
34.Das S.D., Basak A., Dutta S. 2021. A heuristic-driven ensemble framework for COVID-19 fake news detection. arXiv Prepr. arXiv:2101.03545. [Google Scholar]
35.Paka W.S., Bansal R., Kaushik A., Sengupta S., Chakraborty T. Cross-SEAN: A cross-stitch semi-supervised neural attention model for COVID-19 fake news detection. Appl. Soft Comput. 2021;107 doi: 10.1016/j.asoc.2021.107393. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Felber T. 2021. Constraint 2021: Machine learning models for COVID-19 fake news detection shared task. arXiv Prepr. arXiv:2101.03717. [Google Scholar]
37.Paka W.S., et al. Fake news detection system using XLNet model with topic distributions: CONSTRAINT@AAAI2021 shared task. Appl. Soft Comput. 2021;107 doi: 10.1016/j.asoc.2021.107393. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Ayoub J., Yang X.J., Zhou F. Combat COVID-19 infodemic using explainable natural language processing models. Inf. Process. Manag. 2021;58(4) doi: 10.1016/j.ipm.2021.102569. [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Douglas K.M., et al. Understanding conspiracy theories. Polit. Psychol. 2019;40:3–35. [Google Scholar]
40.van Prooijen J.-W., Douglas K.M. Belief in conspiracy theories: Basic principles of an emerging research domain. Eur. J. Soc. Psychol. 2018;48(7):897–908. doi: 10.1002/ejsp.2530. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Jolley D., Douglas K.M. The effects of anti-vaccine conspiracy theories on vaccination intentions. PLoS One. 2014;9(2) doi: 10.1371/journal.pone.0089177. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Varshney D., Vishwakarma D.K. A review on rumour prediction and veracity assessment in online social network. Expert Syst. Appl. 2020 [Google Scholar]
43.Ghenai A., Mejova Y. 2017. Catching Zika fever: Application of crowdsourcing and machine learning for tracking health misinformation on Twitter. arXiv Prepr. arXiv:1707.03778. [Google Scholar]
44.Ahmad I., Yousaf M., Yousaf S., Ahmad M.O. Fake news detection using machine learning ensemble methods. Complexity. 2020;2020 [Google Scholar]
45.Felber T., et al. 2021. Constraint 2021: Machine learning models for COVID-19 fake news detection shared task. arXiv Prepr. arXiv:2101.03717. [Google Scholar]

