Computers in Biology and Medicine. 2021 Oct 12;138:104920. doi: 10.1016/j.compbiomed.2021.104920

PAN-LDA: A latent Dirichlet allocation based novel feature extraction model for COVID-19 data using machine learning

Aakansha Gupta 1, Rahul Katarya 1
PMCID: PMC8505021  PMID: 34655902

Abstract

The recent outbreak of novel coronavirus disease, or COVID-19, was declared a pandemic by the World Health Organization (WHO). The availability of social media platforms has played a vital role in providing and obtaining information about any ongoing event. However, consuming a vast amount of online textual data to predict an event's trends can be troublesome. To our knowledge, no prior study jointly analyzes online news articles and the case data of coronavirus disease. Therefore, we propose an LDA-based topic model, called PAN-LDA (Pandemic-Latent Dirichlet Allocation), that incorporates COVID-19 case data and news articles into common LDA to obtain a new set of features. The generated features are introduced as additional features to machine learning (ML) algorithms to improve the forecasting of time series data. Furthermore, we employ collapsed Gibbs sampling (CGS) as the underlying technique for parameter inference. The results from experiments suggest that the features obtained from PAN-LDA generate more identifiable topics and empirically add value to the outcome.

Keywords: COVID-19, Latent Dirichlet allocation, Collapsed Gibbs sampling, Data mining, Feature extraction, Backpropagation

1. Introduction

The Coronavirus Disease 2019 (COVID-19) outbreak originated in Wuhan, China, and rapidly spread worldwide. On March 11, 2020, the WHO declared it a pandemic. The consequent increase in confirmed cases and deaths worldwide due to coronavirus has instantly led to more information on social media platforms. Moreover, the availability of a massive amount of everyday data on online platforms has built a relationship between ongoing events and online data. Therefore, reducing the online textual data to topic distributions increases the value of this relationship. The numerical data can also be used for technical analysis to extract more valuable information about the event over time.

In recent years, the rapid growth of powerful text mining techniques has brought a significant change to research on information extraction and prediction. These techniques enhance ongoing research efforts and improve the efficiency and speed of existing approaches. Since the emergence of text mining approaches, research on extracting information from unstructured textual data has been taken up more often.

When addressing unstructured textual data, one approach is to develop specialized search engines. For example, several researchers developed search engines to find data of particular interest across the COVID-19-related publications of various scientific disciplines [[1], [2], [3]]. However, such engines are limited to the numerical data, leaving the textual data aside. Other studies focused on the textual information while leaving behind the numerical data. For example, A. Khadjeh Nassirtoussi, S. Aghabozorgi, T. Ying Wah et al. [4] used only news article headlines for text mining. Another study mined only news articles to forecast the movement of the Argentine and Brazilian currency markets, employing topic clustering, sentiment analysis, and regression analysis [5].

Another typical approach to text mining is to use statistical models, such as topic models, to condense enormous amounts of textual information into the topic distributions of documents [6,7]. With the emergence of text mining methods, research on topic modeling has usually focused on textual data alone. When working with unstructured textual data, there are numerous instances where topic models, especially the LDA model, were utilized or improved to discover topics from online text [[8], [9], [10]]. All these studies incorporate textual data for the evolution of topics in different ways. Although the traditional latent Dirichlet allocation (LDA) model has been studied well, these methods do not consider numerical data. Recent studies have employed topic modeling methodologies to anticipate prices using unstructured data such as broadcast news and social media data [11]. In addition to online text mining for time-series predictions, researchers have widely applied sentiment analysis for processing unstructured text. For example, X. Li, W. Shang, and S. Wang considered news sentiment text features and grouped them by topic using topic modeling to increase prediction accuracy [12,13]. However, none of these works define a topic model that combines statistical data and textual data for prediction. Therefore, depending on which aspect is used, there is scope for improvement. Our approach is to incorporate time series numerical data along with online textual data into the standard LDA to extract better topics. Accordingly, we introduce PAN-LDA, a modified LDA that takes COVID-19 case numbers into account to improve feature extraction. The topic features are then presented to machine learning algorithms, which benefit from our PAN-LDA model.

Various statistical methods, including ML techniques and time series methods, have been developed to track and predict the evolution of events over time. Despite the ongoing resurgence and prevalence of conventional ML algorithms, boosting strategies remain particularly valuable for medium-sized datasets, as their training time is generally very short and their parameters do not take long to tune. In recent years, gradient-boosting-based ML methods such as Extreme Gradient Boosting (XGBoost) [14] and Light Gradient Boosting Machines (LightGBM) [15] have been applied by researchers for robust prediction of future events in different research fields [[16], [17], [18], [19]].

Generally, in a data mining task, the main components in achieving the outcome are data collection, data preparation, modeling, and evaluation. In the data preparation process, feature extraction models like LDA provide new lower-dimensional features. Accordingly, PAN-LDA is a topic model for extracting features during the data preparation stage. Therefore, our focus is on describing the flow of this model through the various phases of data mining and performing experiments to understand the benefits of the extracted features. In the modeling phase, the PAN-LDA model's performance is evaluated using ML algorithms. We adopt advanced ML algorithms, such as XGBoost and LightGBM, as they are fast and expressive enough to improve prediction performance. In summary, the significant contributions of our study are as follows:

  • We proposed a latent Dirichlet allocation (LDA) based model, PAN-LDA, to create a new set of features from integrating texts from news articles and COVID-19 case data.

  • The features from our model serve as additional features to machine learning algorithms for outbreak case prediction.

  • The developed model, PAN-LDA, employed collapsed Gibbs sampling (CGS) as the underlying algorithm for inference in topic modeling.

  • We provided framework details for applying our model, PAN-LDA, in text mining, even though our model focused on extracting features in the data preparation phase.

  • We used ML algorithms to justify the benefits of the features extracted from the proposed model.

  • Our proposed model delivered superior results for all ML algorithms and generated more identifiable topics than other baseline methods.

To the best of our knowledge, this is the first effort to define a topic model based on the integration of news articles and data on daily new coronavirus cases. We collect the COVID-19 dataset of global news articles archived by ‘Aylien’ [20] and the corona case data published by ‘Our World in Data’ [21]. Next, we develop a new topic model to generate structured information and produce better latent topics from the collected data. We show that our model 1) uncovers a significant number of more identifiable topics than LDA, 2) produces features that empirically add an advantage for prediction, and 3) performs significantly better than other baseline approaches.

The remaining paper is arranged as follows: Section 2 covers the related work. Section 3 introduces the PAN-LDA model and describes the procedure for topic inference for new documents, along with the framework to apply our model in text mining. Section 4 presents the experiments and results. Section 5 is the discussion section, where we have discussed and analyzed the outcomes of the results. Section 6 ends the paper with a conclusion remark and suggestions for future work.

2. Related work

2.1. Data mining

Usually, there are six phases in any data mining project [22], as shown in Fig. 1. These six phases can be carried out in sequence; however, the process often involves backtracking to previous steps. Our research aims to improve the result of the data preparation phase by using our PAN-LDA model.

Fig. 1. Essential steps of the data mining process, based on the CRoss-Industry Standard Process for Data Mining [23].

Though the particular concern of this paper is refining the data preparation task using PAN-LDA, in the modeling phase we also aim to compare the performance of the different feature sets obtained from our model and from conventional approaches.

Recently, ML algorithms including XGBoost and LightGBM have been used in many studies and have proved useful for predicting time series. For example, H. Qiu, L. Luo, Z. Su et al. applied six machine learning algorithms, including XGBoost and LightGBM, to build predictive models with a unique feature set [17]. They showed that the LightGBM model outperforms logistic regression (LR), Support Vector Machine (SVM), and Artificial Neural Network (ANN) with the highest AUC (0.940, 95% CI: 0.900–0.980), but its performance did not differ much from that of Random Forest (RF) and XGBoost. Another paper [24] extracted time-dependent characteristics from time series and fed them into three models, RF, XGBoost, and LightGBM, to predict sepsis. LightGBM has shown great potential in predicting market price movement in finance and economics [18]. Y. Tounsi, L. Hassouni, and H. Anoun introduced a new model, CSMAS, for prediction problems in credit scoring data mining using state-of-the-art gradient boosting methods (XGBoost, CatBoost, and LightGBM) [25]. Sunghyeon Choi forecasted solar energy output by employing RF, XGBoost, and LightGBM models [26]. Moreover, J. Cordeiro, O. Postolache, and J. Ferreira used the XGBoost model and the LightGBM model to predict the height of children [27]. Accordingly, in the modeling phase, we adopted XGBoost and LightGBM to exploit the features of our PAN-LDA model. Table 1 summarizes the various machine learning techniques used to solve prediction tasks in multiple domains.

Table 1.

A summary of ML models used for various time series prediction.

Reference | Year | Base model | Time series prediction | Data set
[28] | 2021 | SVM, LR, Multi-layer perceptron, RF | COVID-19 pandemic cumulative case forecasting | COVID-19 data between January 20, 2020 and September 18, 2020 for the USA, Germany, and the world, obtained from the World Health Organization website
[17] | 2020 | LR, SVM, ANN, RF, XGBoost, LightGBM | Prediction of peak demand days of cardiovascular disease (CVD) admissions | Daily CVD hospital admissions (Health Information Center of Sichuan Province, China); meteorological data (Chengdu Meteorological Monitoring Database); air pollutant data (China National Environmental Monitoring Centre)
[25] | 2020 | XGBoost, CatBoost, LightGBM | Prediction problems in credit scoring data mining | Home Credit Default Risk from the Kaggle challenge
[26] | 2020 | RF, XGBoost, LightGBM | Photovoltaic forecasting | Data of a photovoltaic plant in South Korea
[18] | 2020 | LightGBM | Cryptocurrency price trend | Daily trading data from https://www.investing.com/
[29] | 2020 | XGBoost, ARIMA | Hemorrhagic fever with renal syndrome | Monthly hemorrhagic fever with renal syndrome incidence data from 2004 to 2018 from the official website of the National Health Commission of the People's Republic of China
[27] | 2019 | XGBoost, LightGBM | Child's target height prediction | Dataset based on Francis Galton's famous 1885 study, which included trait observations of 928 children and their parents (205 pairs)
[30] | 2018 | Recurrent Neural Network, XGBoost | Taxi demand prediction | A real-world dataset generated by taxis
[31] | 2018 | Recurrent Neural Network, Gated Recurrent Unit | Stock price prediction | S&P 500 historical time series data

*ARIMA: Autoregressive Integrated Moving Average.

XGBoost is an ensemble learning algorithm based on gradient boosting [14]. It is a widely applicable ML technique for both classification and regression, which has been demonstrated in various real applications such as malware classification, text classification, sales prediction, customer behavior prediction, and risk prediction [14]. XGBoost optimizes the gradient boosting algorithm by combining a linear model with a boosting tree model.

Suppose a dataset D consists of n samples and m features, D = {(x_i, y_i)} with |D| = n, where the x_i are the independent variables, each with m features, x_i ∈ R^m, and y_i ∈ R is the dependent variable corresponding to x_i. For a given (x_i, y_i), the objective function in XGBoost is given by:

obj(θ) = Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_{t=1}^{T} W(f_t)   (1)

where L is a loss function, W is a regularization term penalizing model complexity, and f_t is the t-th tree.
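The additive objective above can be made concrete with a toy gradient-boosting sketch (plain Python; depth-one "stumps" stand in for XGBoost's regularized trees, and the learning rate and exhaustive split search are illustrative simplifications, not XGBoost's actual algorithm):

```python
# Toy gradient boosting for squared-error regression (illustrative only).
# Each round fits a depth-1 "stump" to the residuals (the negative gradient
# of the squared loss), so the final model is an additive sum of T trees,
# mirroring the objective in Eq. (1).

def fit_stump(xs, residuals):
    """Exhaustively pick the single-feature split minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, rounds=20, lr=0.3):
    """Fit `rounds` stumps, each on the residuals of the current ensemble."""
    pred = [0.0] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Toy 1-D regression: a step function of x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
model = boost(xs, ys)
```

After 20 rounds the ensemble approximates the step function well; XGBoost additionally regularizes each tree through the W(f_t) term, which this sketch omits.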

LightGBM, proposed by G. Ke et al., is an improved framework based on the Gradient Boosting Decision Tree algorithm [15]. It is widely used for both classification and regression; in our model, we use LightGBM for regression. The algorithm is mainly characterized by two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) [32].

For a given training set D = {(x_i, y_i)} with |D| = n, the objective function in LightGBM is defined as:

f̂ = argmin_f (1/n) Σ_{i=1}^{n} L(y_i, f(x_i))   (2)

where

  • L is the loss function,

  • the x_i are the independent variables and y_i is the dependent variable corresponding to x_i.

LightGBM integrates several regression trees to approximate the final model:

f_T(X) = Σ_{t=1}^{T} f_t(X)   (3)

While XGBoost grows trees level-wise, LightGBM grows them leaf-wise. The leaf-wise growth strategy has several advantages over a level-wise strategy, such as reducing large errors, handling extensive data, higher accuracy, and faster training.

Since, as mentioned above, the data preparation task is done by PAN-LDA, an LDA-based model, we first describe the baseline LDA model.

2.2. Latent Dirichlet allocation (LDA)

LDA is a generative probabilistic model proposed by Blei et al. [33] to compute the latent topics of a collection of text documents. It is an unsupervised model that takes as input the primary units of the data, i.e., the words in the text documents. Fig. 2 represents the graphical model of smoothed LDA. The index w_{d,n} of word w in document d is the input of the model. The output of the model is a set of K latent topics, where K is predefined. Each topic k, k ∈ {1, ..., K}, is represented by a discrete probability distribution φ_k over the vocabulary V, generated from a Dirichlet distribution, φ_k ~ Dir(β). Additionally, every document d, d ∈ {1, ..., D}, is associated with a topic distribution θ_d drawn from a Dirichlet distribution, θ_d ~ Dir(α). From θ_d we draw z_{d,n}, the per-word topic assignment in document d; β and α are the Dirichlet hyperparameters.
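The generative story described above can be sketched in a few lines (a toy simulation with made-up sizes D, N, K, and V; the Dirichlet draws are built from Gamma samples, since the Python standard library has no direct Dirichlet sampler):

```python
import random

random.seed(1)

def dirichlet(alphas):
    """Sample from Dir(alphas) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(D=3, N=8, K=2, V=5, alpha=0.5, beta=0.1):
    """LDA's generative story: draw topics phi_k ~ Dir(beta); then, per
    document, theta_d ~ Dir(alpha); then, per word, a topic z ~ theta_d
    followed by a word w ~ phi_z."""
    phi = [dirichlet([beta] * V) for _ in range(K)]
    corpus = []
    for _ in range(D):
        theta = dirichlet([alpha] * K)
        doc = []
        for _ in range(N):
            z = random.choices(range(K), weights=theta)[0]
            w = random.choices(range(V), weights=phi[z])[0]
            doc.append(w)
        corpus.append(doc)
    return corpus

corpus = generate_corpus()
```

Inference inverts this process: given only the word indices in `corpus`, it recovers the hidden θ_d and φ_k.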

Fig. 2. Graphical model representation of LDA.

For a set of observed words w = {w_{d,n}}, inference methods aim to determine the posterior distribution over the unknown parameters. There are several approximate inference algorithms, such as variational Bayes, Markov random fields, expectation propagation, and Markov chain Monte Carlo. Collapsed Gibbs sampling is an efficient inference procedure for learning the model from data [34]: only the latent variable z is sampled, while the random variables φ and θ are marginalized out. Once the latent variables have been sampled, the random variables φ and θ can be estimated.
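A minimal collapsed Gibbs sampler for standard LDA illustrates the procedure (a toy sketch on a four-document corpus; the corpus, hyperparameters, and iteration count are illustrative choices, not the paper's settings):

```python
import random

random.seed(0)

# Toy corpus: documents as lists of word indices; words {0,1} and {2,3}
# come from two clearly separated themes.
docs = [[0, 0, 1, 1], [0, 1, 1, 0], [2, 3, 3, 2], [3, 2, 2, 3]]
V, K = 4, 2
alpha, beta = 0.5, 0.1

# Count matrices: N_dk (document-topic), N_kw (topic-word), N_k (topic totals).
N_dk = [[0] * K for _ in docs]
N_kw = [[0] * V for _ in range(K)]
N_k = [0] * K
z = []  # per-token topic assignments

# Random initialization of topic assignments and counts.
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = random.randrange(K)
        z[d].append(k)
        N_dk[d][k] += 1; N_kw[k][w] += 1; N_k[k] += 1

# Collapsed Gibbs sweeps: remove a token from the counts, then resample its
# topic from the conditional with theta and phi marginalized out.
for _ in range(200):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            N_dk[d][k] -= 1; N_kw[k][w] -= 1; N_k[k] -= 1
            weights = [(N_dk[d][j] + alpha)
                       * (N_kw[j][w] + beta) / (N_k[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][n] = k
            N_dk[d][k] += 1; N_kw[k][w] += 1; N_k[k] += 1

# Point estimates of the per-topic word distributions phi.
phi = [[(N_kw[k][w] + beta) / (N_k[k] + V * beta) for w in range(V)]
       for k in range(K)]
```

On this separable corpus the sampler typically recovers one topic concentrated on words {0, 1} and another on {2, 3}.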

3. Proposed model

This section describes our model PAN-LDA, followed by parameter inference. This model incorporates the changes in the number of daily new COVID-19 confirmed cases on a one-day interval along with the globally published news articles. Our model can discover hidden topics discussed in news articles related to COVID-19 in various countries. We also discuss the place of our model in text mining tasks for outbreak activity prediction.

3.1. Model description

We note that, as the severity of the pandemic increases, there is an association between the topics generated and the trend to be predicted. Inspired by this, we developed PAN-LDA, a modification of LDA that incorporates the statistics of daily new coronavirus cases along with news articles for feature extraction. The overall framework of the PAN-LDA approach is depicted in Fig. 3.

Fig. 3. Graphical model representation of PAN-LDA.

The graphical model of PAN-LDA is shown in Fig. 3. Our model incorporates nc_d, the change in the number of reported coronavirus cases after a news article d is published. Furthermore, δ_k, the per-topic distribution of changes in reported cases, is added to the figure and connected to the other distributions via nc_d. The past data on infected corona cases are processed to find nc_d, the change in the number of newly reported cases after document d is published; the time lag we considered for data collection is one day. In the training process, data on daily new corona-infected cases were available. The presence of corona case data in PAN-LDA affects the distribution of changes in reported cases, the per-document topic distribution θ_d, and the per-topic word distribution φ_k. After parameter estimation, the latent topics of a document can be obtained. For a new document, the latent topic distribution is obtained by applying the inference method to the document using the word distributions estimated during training. The resulting topic features can then be introduced as input features to any ML method. In summary, the proposed model uses the combination of news articles and changes in daily new corona-infected cases to refine parameter estimation and topic distribution inference on previously unseen documents. The obtained topic distribution can serve as input features for ML models to predict the time series.

Consider a collection of D documents over a fixed vocabulary V, with N_d word tokens (w_{d,1}, ..., w_{d,N_d}) in document d, each having an index in the vocabulary, w_{d,n} ∈ {1, ..., V}. The number of latent topics, K, is assumed to be predetermined.

In LDA, the per-word topic assignment z_{d,n} is drawn from a probability distribution over the K topics and depends on the previously drawn topic proportions θ_{d,k} of document d. A word instance w_{d,n} is drawn from a probability distribution over the V words and depends on the word distribution φ_k and the topic index z_{d,n} of the word. Like LDA, PAN-LDA is a probabilistic generative model. Accordingly, in PAN-LDA, nc_d, the change in the number of newly confirmed coronavirus cases, is drawn from a probability distribution over C categories and depends on the case-change distribution δ_{z_d}.

ALGORITHM 1 describes the generative process of the PAN-LDA model:

ALGORITHM 1

PAN-LDA

Image 1

When processing a single document, the proposed PAN-LDA algorithm distributes random accesses across an O(DK) document-topic count matrix and an O(KV) topic-word count matrix, where K, V, and D represent the total number of latent topics, the vocabulary size, and the number of documents, respectively.

The variables are distributed via probability distribution:

p(z_{d,n} = k | θ_d) = (θ_d)_k   (4)

p(w_{d,n} = v | z_{d,n}, φ_1, ..., φ_K) = (φ_{z_{d,n}})_v   (5)

p(nc_d = c | z_d, δ_1, ..., δ_K) = (δ_{z_d})_c   (6)

The joint distribution of latent variables and observed data is then:

p(w, z, nc, θ, φ, δ | α, β, γ) = Π_k p(φ_k | β) · Π_k p(δ_k | γ) · Π_d [ p(θ_d | α) [ Π_n p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, φ) ] p(nc_d | z_d, δ) ]   (7)

Inference on the posterior parameters typically yields topics in which the probability mass of each topic is assigned to frequently co-occurring, semantically strongly related words.

3.2. Topic inference

The central issue in topic modeling is posterior inference, i.e., learning the posterior distribution of the latent variables θ, φ, δ, and z given the observed data: the words in the documents, w, and the changes in the number of reported corona-infected cases after the documents were published, nc. In PAN-LDA, the posterior probability of the latent variables can be calculated as:

p(φ, θ, δ, z | w, nc, α, β, γ) = p(φ, θ, δ, z, w, nc | α, β, γ) / p(w, nc | α, β, γ)   (8)

Unfortunately, the computation of this posterior distribution is intractable; in particular, the normalization factor p(w, nc | α, β, γ) cannot be computed exactly. Among the various inference methods, the CGS proposed by T. Griffiths and M. Steyvers [34] is known for model estimation with high accuracy.

Using CGS, we are interested in the probability that a topic z_i is allocated to word w_i, given the topic assignments of all other words:

p(z_i | z_{−i}, w, nc, α, β, γ)   (9)

where z_{−i} denotes all topic allocations excluding z_i.

We must therefore appeal to approximate inference in which some of the parameters are marginalized out. Accordingly, we applied collapsed Gibbs sampling, which marginalizes out the parameters φ, θ, and δ and, on each iteration, resamples the topic z_{d,n} of a word token w from a distribution conditioned on the current values of the remaining variables. By sampling in the collapsed space rather than the space of all variables, collapsed Gibbs sampling usually converges much faster than an ordinary Gibbs sampler. A naive implementation of Eq. (10) has complexity O(K) per token, since it evaluates all K topic assignments. The posterior distribution for sampling the latent variable z in PAN-LDA, after the random variables are marginalized out, is:

p(z_{d,n} = k | z^{¬(d,n)}, w, nc, α, β, γ) ∝ (N_{d,k}^{¬(d,n)} + α) · (N_{k,w_{d,n}}^{¬(d,n)} + β) / (N_k^{¬(d,n)} + Vβ) · (N_{k,nc_d}^{¬(d,n)} + γ) / (N_k^{¬(d,n)} + Cγ)   (10)

Here the superscript ¬(d,n) indicates that the current token (d, n) is excluded from the counts.

where,

  • N_{d,k} denotes the number of words in document d allocated to topic k.

  • N_{k,w_{d,n}} is the number of words of type w_{d,n} allocated to topic k.

  • N_k is the total number of words allocated to topic k.

  • N_{k,nc_d} is the number of words belonging to category nc_d and assigned to topic k.

For topic modeling, we estimate the per-document topic distribution θ_d, the per-topic word distribution φ_k, and the per-word topic assignments z_{d,n}. Using the topic allocations, φ, θ, and δ can be computed as:

θ_{d,k} = (N_{d,k} + α) / (N_d + Kα)   (11)

φ_{k,w_{d,n}} = (N_{k,w_{d,n}} + β) / (N_k + Vβ)   (12)

δ_{k,nc_d} = (N_{k,nc_d} + γ) / (N_k + Cγ)   (13)

Equations (11), (12), (13) estimate, respectively, the probability φ_{k,w_{d,n}} that word w belongs to topic k, the probability θ_{d,k} that topic k is generated in document d, and the probability δ_{k,nc_d} that a document with category nc_d is assigned to topic k. The CGS procedure is outlined in ALGORITHM 2.
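The sampling weights of Eq. (10) and the estimators (11)–(13) reduce to simple count arithmetic. A hedged sketch of these computations (function and matrix names are our own; in the full sampler, the counts would exclude the token currently being resampled):

```python
def sampling_weights(d, w, nc, N_dk, N_kw, N_knc, N_k,
                     alpha, beta, gamma, V, C):
    """Unnormalized p(z_{d,n} = k | ...) for each topic k, as in Eq. (10)."""
    K = len(N_k)
    return [(N_dk[d][k] + alpha)
            * (N_kw[k][w] + beta) / (N_k[k] + V * beta)
            * (N_knc[k][nc] + gamma) / (N_k[k] + C * gamma)
            for k in range(K)]

def estimate_theta(N_dk, alpha, K):
    """Eq. (11): per-document topic distributions."""
    return [[(row[k] + alpha) / (sum(row) + K * alpha) for k in range(K)]
            for row in N_dk]

def estimate_phi(N_kw, N_k, beta, V):
    """Eq. (12): per-topic word distributions."""
    return [[(N_kw[k][w] + beta) / (N_k[k] + V * beta) for w in range(V)]
            for k in range(len(N_k))]

def estimate_delta(N_knc, N_k, gamma, C):
    """Eq. (13): per-topic distributions over case-change categories."""
    return [[(N_knc[k][c] + gamma) / (N_k[k] + C * gamma) for c in range(C)]
            for k in range(len(N_k))]

# Tiny example counts: 1 document, K = 2 topics, V = 3 words, C = 3 categories.
N_dk = [[3, 1]]
N_kw = [[2, 1, 0], [0, 1, 3]]
N_knc = [[2, 1, 0], [1, 1, 2]]
N_k = [3, 4]
weights = sampling_weights(0, 0, 2, N_dk, N_kw, N_knc, N_k,
                           0.1, 0.1, 0.1, V=3, C=3)
```

Compared with plain LDA, the only change in the per-token weight is the extra factor involving N_{k,nc} and γ, which ties topics to case-change categories.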

ALGORITHM 2

collapsed Gibbs sampling

Image 2

where N_{d,k}, N_{k,w}, N_{k,nc}, and N_k denote the count matrices.

As shown in Fig. 3, if corona case data are unavailable, the other parameters of the model are not affected by δ and γ. Therefore, inference on a previously unseen document without corona case data is the same as inference under LDA.

The pseudo-code in ALGORITHM 3 shows the topic inference for a new document, in which θ, φ, and δ are not re-estimated.

ALGORITHM 3

Topic Inference from a Previously Unseen Document

Image 3

3.3. PAN-LDA in text and data mining

Here, we discuss how the calculations flow when our model is used for COVID-19 case prediction. Fig. 4 shows a flowchart of our proposed approach.

Fig. 4. Flowchart of the proposed model, PAN-LDA.

This paper applies our PAN-LDA model to forecasting coronavirus cases over time. A usual first requirement of the data preparation phase is preprocessing the raw data; in our experiments, we preprocessed the collected text by noise removal, case folding, tokenization, stemming, lemmatization, and stopword removal. Next, still in the data preparation phase, we focused on feature extraction through topic modeling with PAN-LDA.

In the data preparation phase, our model needs the word vectors and the statistics of the COVID-19 cases. The word vectors for PAN-LDA are provided by the ‘bag-of-words’ model, while the case statistics need to be calculated and categorized for PAN-LDA (see Fig. 4). After training on the training set, our model generates three probability distributions: the topic distribution per document, the word distribution per topic, and the distribution of changes in corona-infected cases per topic. The obtained topic distribution serves as an additional feature for training an ML algorithm. The word distribution becomes the input parameter for inferring the topics of a previously unseen document, which in turn generates a topic distribution for the new document; this serves as an additional feature for the ML algorithm when predicting outcomes in the next phase of the time series.

In the modeling phase, we trained machine learning algorithms to compute regressions. In this paper, we chose two machine learning algorithms, XGBoost and LightGBM, to perform our experiments. The topic distributions from the training data, along with the other features, were used to train the selected ML algorithms. Ultimately, the model can be utilized to forecast the selected index.

4. Experiments and results

This section discusses the experimental setup and the comparative results. As discussed, we experimented with four different models and explored four different feature sets:

  • COVID-19 case statistics as base features, FS1

  • COVID-19 case statistics with topic distributions from LDA, FS2

  • COVID-19 case statistics with topic distributions from LDA and sentiment scores for the latent topics, FS3

  • COVID-19 case statistics with topic distributions from PAN-LDA, FS4

In FS1, only historical COVID-19 daily case data are used as base features in the prediction models. For FS2, in addition to the historical data, the topic distributions from LDA are integrated into the prediction model. In FS3, we use the COVID-19 case data along with the computed topic distributions of the news articles and sentiment scores specific to the extracted latent topics. Because the topics extracted from news articles do not have any associated sentiments, sentiment analysis of the articles is also performed using VADER to compute relative sentiment scores with respect to the topics. The historical data, extracted topics, and their sentiments are used as input features to a machine learning prediction model. Finally, FS4 denotes the feature set obtained from the topic distributions of PAN-LDA together with the COVID-19 historical data.
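Assembling a feature set such as FS4 then amounts to concatenating each day's base case features with the corresponding topic-distribution vector (a sketch; the feature values and the two base features shown are made up for illustration):

```python
def build_feature_rows(case_features, topic_distributions):
    """Concatenate per-day base features with per-day topic features (FS4-style)."""
    assert len(case_features) == len(topic_distributions)
    return [base + topics
            for base, topics in zip(case_features, topic_distributions)]

# Toy data: two days, base features = [daily new cases, 3-day mean],
# and a K = 3 topic distribution per day.
rows = build_feature_rows(
    [[120.0, 110.0], [140.0, 125.0]],
    [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]],
)
```

The resulting rows are what the XGBoost and LightGBM regressors consume; FS1–FS3 differ only in which columns are appended.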

Once the models were trained, we used a backtesting approach for time series forecasting. For backtesting, we used the walk-forward testing routine [35], which accounts for model performance over different time windows.

4.1. Data selection and gathering

  • COVID-19 data: The data on the number of confirmed coronavirus-infected cases used in our experiments are the official data published by ‘Our World in Data’ [21], which provides global and reliable statistics for studying the COVID-19 pandemic. The dataset is updated daily from the World Health Organization (WHO) situation reports [36]. We use the available data on daily new cases of coronavirus-infected people from January 2020 to May 2020.

  • News articles: The news articles dataset used in this paper was gathered from Aylien [20], which has aggregated and published a COVID-19 dataset that can be used to analyze global news during the outbreak; Aylien transformed the dataset into structured and actionable data using NLP and ML. The data analyzed in this study cover the period from January 2020 to May 2020. In total, 1,147,454 articles were considered for the experiment.

4.2. Data preparation

We trained our model on this collection of more than 1 million news articles, which constitute the text documents of the experiment. We preprocessed the data by tokenization, stemming, and removal of stop words. The text of the news articles is represented as vectors using the ‘bag-of-words’ model in Gensim [37].
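The preprocessing and bag-of-words steps can be sketched in plain Python (a simplified stand-in for the Gensim pipeline actually used; the stopword list and the crude plural-stripping "stemmer" are illustrative toys, not the real tools):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "in", "and", "to", "for"}  # toy list

def preprocess(text):
    """Tokenize, case-fold, drop stopwords, and crudely strip plural 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_bow(docs):
    """Map each document to a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({t for doc in docs for t in preprocess(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(preprocess(doc))
        vectors.append([counts.get(t, 0) for t in vocab])
    return vocab, vectors

vocab, vectors = build_bow([
    "New cases of the virus reported in the city.",
    "The city hospitals report rising virus cases.",
])
```

Gensim's `corpora.Dictionary` and `doc2bow` play the role of `build_bow` in the actual pipeline.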

We collected the one-day-level data on newly infected corona cases and classified the changes in the number of infected cases into three categories. The threshold was chosen based on the average one-day change of the collected data, which is 0.1285% in both directions. In this paper, we experimented with a 0.10%–0.15% change as the threshold. As a result, the data are divided into three categories:

  • category 1, if the change in the number of new coronavirus cases lies above the threshold,

  • category −1 if the change in the number of coronavirus cases is below the threshold, and

  • category 0 if it lies within the threshold

So, the value of nc_d in PAN-LDA takes three levels, as given in Equation (14):

nc_d = 1, if ((new cases_{t+1} − new cases_t) / infected_t) × 100 > 0.15
nc_d = −1, if ((new cases_{t+1} − new cases_t) / infected_t) × 100 < −0.10
nc_d = 0, otherwise   (14)

We found that 767,345 articles fell into category −1 (nc_d = −1), 145,773 into category 0 (nc_d = 0), and 234,336 into category 1 (nc_d = 1).
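The categorization of Eq. (14) is a simple threshold rule; a sketch follows (variable names are ours, and the lower threshold of −0.10% is our reading of the 0.10%–0.15% band described above):

```python
def case_change_category(new_cases_t, new_cases_t1, infected_t,
                         upper=0.15, lower=-0.10):
    """Return nc_d in {1, 0, -1} from the one-day percentage change in
    daily new cases, relative to the current number of infected people."""
    pct_change = (new_cases_t1 - new_cases_t) / infected_t * 100
    if pct_change > upper:
        return 1
    if pct_change < lower:
        return -1
    return 0
```

Each news article published on day t is then tagged with the category of the day that follows it.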

After all of the data had been preprocessed, we divided it into training and test sets using walk-forward validation [35], a variant of cross-validation. This method splits the data into a series of overlapping training–testing sets, each moved forward through the time series. In this paper, the whole dataset was split into 11 overlapping datasets with a 19-day window, and each of these datasets was divided into 75% training and 25% testing.
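The walk-forward split can be sketched as follows (the helper and its step computation are our own construction; the paper's setting of 11 overlapping 19-day windows with a 75%/25% train/test split is used as defaults):

```python
def walk_forward_splits(series, window=19, n_splits=11, train_frac=0.75):
    """Return (train, test) slices of overlapping windows that move
    forward through the time series."""
    step = max(1, (len(series) - window) // (n_splits - 1))
    splits = []
    for i in range(n_splits):
        start = min(i * step, len(series) - window)
        chunk = series[start:start + window]
        cut = round(window * train_frac)
        splits.append((chunk[:cut], chunk[cut:]))
    return splits

series = list(range(120))  # stand-in for roughly four months of daily values
splits = walk_forward_splits(series)
```

Unlike ordinary cross-validation, every test slice here lies strictly after its training slice in time, so the model is never evaluated on past data.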

The input dataset for PAN-LDA consists of all the word tokens w_{d,n} in the D documents and the category values nc_d. Next, we set the hyperparameters of the Dirichlet distributions, i.e., α, γ, and β. Without any particular basis, different research papers have chosen different values for these hyperparameters: e.g., Refs. [38,39] used β = 0.1, α = 50/K; [40,41] used β = 0.1, α = 0.1; and [42] used β = 1/K, α = 1/K. In order to have few words with high probability per topic and numerous latent topics with high probability per document, the Dirichlet parameters were set to 1/K, i.e., α = β = γ = 1/K, for all the topic models.

Selecting the optimum number of topics (K) in topic modeling is also a significant problem. In order to estimate the optimum values of K and the number of iterations (iter), we ran 200 iterations, iter = 200, and noted the computed value every 20 iterations. We evaluated the effect of the iteration count, for different numbers of topics, on the log-likelihoods at these saving points. The results are presented in Fig. 5, which depicts the stability of the results when iter > 140.

Fig. 5. The log-likelihood for PAN-LDA with collapsed Gibbs sampling.

Also, for a better understanding of the parameter values, we computed the log-likelihoods for PAN-LDA in Fig. 6. The graphs in Fig. 6 suggest that the optimum value is achieved at K = 10. As the graph shows stable results when iter > 140, we set iter = 160 for our experiment, and we used the same values for the LDA model.

Fig. 6.

Fig. 6

The log-likelihood against the number of topics.

After setting the parameter values, we extracted topics from the LDA, the news-text-sentiment feature grouping using LDA, and the PAN-LDA models. Tables 2a and 2b show 6 of the 10 topics discovered by LDA and PAN-LDA, respectively, with their top 15 words. The remaining topics are shown in Tables S1 and S2 in the supplementary information.

Table 2a.

Examples of topics generated by LDA.

Sports Finance Business Entertainment Country Politics
League Market business Time australia Trump
Football Economy company People Reuters president
Season Global ceo Lockdown new_zealand donald_trump
Players China officer social_media government white_house
Club Markets financial Instagram australian president_trump
premier_league Oil year Family bank house
Team Economic bank Star editing virus
Sports Year industry Quarantine reporting washington
England Energy companies Year reuters_reuters bill
Training Reuters airline Video sydney congress
United Virus insurance Food european year
Sport Stock group life germany trump_administration
Games Bank chief_executive social france fox_news
Clubs Prices cash facebook prime_minister senate
Game Demand businesses times french federal

Table 2b.

Examples of topics generated by PAN-LDA.

Sports Finance Business Entertainment Country Health
League Bank company instagram new_york virus
Football Economy business time city hospital
Season Market year star county patients
Team Economic ceo video california infection
Players Global officer social_media governor disease
Sports Markets industry live york health
Game Financial companies twitter florida vaccine
Games Oil sales family texas testing
Club Reuters market music los_angeles symptoms
Events Stimulus stock film virus test
Time Energy group story department tests
Year Prices production years order people
premier_league unemployment supply_chain life mayor fever
United Money quarter series chicago medical
Event Rate nasdaq netflix people medicine

Tables 2a and 2b suggest that some topics from both models share the same words or words with similar implicit meanings, although their ranking order, which indicates their importance, differs. Moreover, some topics from the two models are entirely different.

Based on the extracted words, we interpreted the meaning of each topic and assigned it a label, i.e., 'Sports', 'Finance', 'Business', 'Entertainment', 'Country', 'Health', and 'Politics'. We assigned the same color to words belonging to the same topic. Topic 1, 'Sports', colored in orange, has similar sets of words for both models, though the words appear in a different order, indicating their importance. Topic 2, 'Finance', also contains similar words for both models; yet the absence of an important word, "stock", can be noted in the vocabulary words of PAN-LDA.

Similarly, the word 'unemployment' is absent from the top vocabulary words of LDA. In topic 3, 'Business', the words generated by the two models are quite different but appear similar in their implicit meanings. In the next topic, 'Entertainment', PAN-LDA generated more meaningful words related to the topic than LDA, whose words are only vaguely related and do not contribute much to extracting a single topic. The words from LDA in topic 5 seem to be a combination of two topics. PAN-LDA isolates more coherent topics, such as health and social anxiety, compared to LDA, in which the remaining words talk about politics, lockdown, etc. We noted that the rest of the words generated by LDA do not contribute much to forming new identifiable topics, and the remaining topics generated by PAN-LDA are only vaguely present in LDA. Some topics in LDA appear to be combinations of topics from PAN-LDA. Overall, PAN-LDA produced more identifiable topics.

As the topics obtained from the models are mainly used for inferring topic distributions from the text, the interpretation and meaning of the topics are not of much concern in this experiment. The parameter estimation process yielded the topic distributions for all training-set documents, along with the estimated word distributions used for inference. The obtained topic distributions then served as input features for the ML algorithms without any need to interpret their meaning. Notably, the two models generated different topics. This suggests that adding a new feature, i.e., the changes in the number of new corona cases, successfully influenced the per-topic word distribution during parameter estimation, which is evaluated in the next step.

Following that, using the estimated φ values and all the words wd,n in the test set, the topic distributions of the test-set documents were inferred. The resulting topic distributions were fed into the machine learning models, i.e., XGBoost and LightGBM, for testing in the next phase.
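One common way to fold a held-out document into a fitted topic model is to fix the estimated word distributions φ and iterate updates on the document's topic mixture. The EM-style sketch below is an assumption for illustration (the paper itself uses collapsed Gibbs sampling for inference):

```python
import numpy as np

def infer_theta(doc_words, phi, n_iter=50):
    """Infer a topic distribution for one document, holding phi fixed.

    doc_words : list of word ids w_{d,n} for the held-out document
    phi       : (K, V) per-topic word distributions from training
    """
    K = phi.shape[0]
    theta = np.full(K, 1.0 / K)            # start from a uniform mixture
    for _ in range(n_iter):
        # responsibility of each topic for each word, shape (K, N)
        r = theta[:, None] * phi[:, doc_words]
        r /= r.sum(axis=0, keepdims=True)
        theta = r.mean(axis=1)             # new mixture = mean responsibility
    return theta
```

The returned θ vector is exactly the kind of per-document feature that is passed on to the ML models in the next phase.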

The prepared data were then arranged into four feature sets, FS1, FS2, FS3, and FS4, for both the training and test sets.
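As a hedged illustration of how such feature sets could be assembled, the snippet below nests progressively richer feature matrices. The exact columns of FS1–FS4 are assumptions for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, K = 50, 10

# Placeholder inputs; in the paper these would come from the case data,
# the news sentiment scores, and the (PAN-)LDA topic distributions.
lagged_cases = rng.random((n_docs, 3))   # e.g. lagged daily case counts
sentiment    = rng.random((n_docs, 1))   # news sentiment score
theta_lda    = rng.random((n_docs, K))   # LDA topic distributions
theta_pan    = rng.random((n_docs, K))   # PAN-LDA topic distributions

# Hypothetical nesting of the four feature sets:
FS1 = lagged_cases
FS2 = np.hstack([lagged_cases, sentiment])
FS3 = np.hstack([lagged_cases, sentiment, theta_lda])
FS4 = np.hstack([lagged_cases, sentiment, theta_pan])
```

Arranged this way, comparing a model trained on FS3 against FS4 isolates the contribution of the PAN-LDA topics over the plain LDA topics.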

4.3. Evaluation indicators

We used four widely accepted statistical indicators, i.e., the coefficient of determination R2 (R-Square), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Deviation (MAD), for the performance comparison of the ML algorithms trained with the above feature sets. Each metric is computed as shown in Table 3, where N is the total number of observations in the test data, ci and ĉi are the i-th observed and forecasted numbers of new COVID-19 cases in the testing period, and c̄ denotes the mean value of c.

Table 3.

Performance metrics and their calculations.

Metrics Calculation
R2: $R^2 = 1 - \dfrac{\sum_{i=1}^{N} (c_i - \hat{c}_i)^2}{\sum_{i=1}^{N} (c_i - \bar{c})^2}$
RMSE: $\mathrm{RMSE} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N} (c_i - \hat{c}_i)^2}$
MAE: $\mathrm{MAE} = \dfrac{1}{N}\sum_{i=1}^{N} \lvert c_i - \hat{c}_i \rvert$
MAD: $\mathrm{MAD} = \dfrac{1}{N}\sum_{i=1}^{N} \lvert c_i - \bar{c} \rvert$

It should also be noted that lower values of the RMSE, MAE and MAD indicate a better fit.
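The four indicators in Table 3 can be computed directly from the observed and forecasted series; this sketch mirrors those formulas (the scikit-learn and pandas functions cited in Section 4.4 provide equivalent implementations):

```python
import numpy as np

def evaluation_metrics(c, c_hat):
    """R2, RMSE, MAE, and MAD as defined in Table 3."""
    c, c_hat = np.asarray(c, float), np.asarray(c_hat, float)
    resid = c - c_hat
    r2   = 1.0 - np.sum(resid ** 2) / np.sum((c - c.mean()) ** 2)
    rmse = np.sqrt(np.mean(resid ** 2))
    mae  = np.mean(np.abs(resid))
    mad  = np.mean(np.abs(c - c.mean()))   # deviation from the observed mean
    return r2, rmse, mae, mad
```

Note that MAD, as defined here, depends only on the observed series, so it measures the variability of the target rather than the forecast error itself.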

4.4. Results

Though the focus of our study is on extracting better features in the data preparation stage, we use two ML algorithms, XGBoost and LightGBM, to validate the performance of PAN-LDA. Next, we evaluated the models using the four statistical metrics: R2, RMSE, and MAE as provided by scikit-learn [[43], [44], [45]], and MAD as provided by the mad function of the pandas Series class [46].

4.4.1. Results of XGBoost

Initially, since we were interested in a fair comparison among the results of the different feature sets, we trained XGBoost with its default parameter values [47] in the modeling phase. The results obtained for the different feature sets are given in the supplementary information (Tables S3–S6).

To better understand the XGBoost results, Fig. 7(a)–(d) show the distributions of the four evaluation metrics mentioned in this paper. In each figure, we compare the values (vertical axis) of the evaluation indicators for all four feature sets at each step of the walk-forward testing.

Fig. 7.

Fig. 7

The presentation of (a) R2 (b) RMSE, (c) MAE, and (d) MAD between the actual and the predicted number of new confirmed cases for FS1, FS2, FS3, and FS4 by XGBoost.

4.4.2. Results of LightGBM

The results for all the feature sets from LightGBM, using the 11 overlapping training–test sets from the walk-forward validation, are presented in the supplementary information (Tables S7–S10). Fig. 8(a)–(d) illustrate the performance of LightGBM with the different sets of features, i.e., FS1–FS4.

Fig. 8.

Fig. 8

The presentation of (a) R2 (b) RMSE, (c) MAE, and (d) MAD between the actual and the predicted number of new confirmed cases for FS1, FS2, FS3, and FS4 by LightGBM.

5. Discussion

After text mining became popular and viable for extracting information from text, public health research often incorporated unstructured textual data. This paper presented a feature extraction model based on changes in daily COVID-19 case data and news articles. To derive the inputs for the ML prediction models, we used the PAN-LDA model to extract relevant features, and its topic distributions are then used in a machine learning model. We evaluated the performance of our approach in predicting COVID-19 cases one day after the news articles were released. An experiment comparing the final results of the four feature sets, FS1, FS2, FS3, and FS4, was conducted to demonstrate the benefit of integrating the features generated by PAN-LDA. Furthermore, we compared the results from XGBoost and LightGBM for all four feature sets using the 11 overlapping training–test sets. The proposed model's prediction errors were lower than those of the other techniques. When the features from the proposed model are employed in XGBoost and LightGBM, the results reveal that they empirically add value to the prediction.

The results for R2, RMSE, MAE, and MAD for XGBoost are shown in Fig. 7(a)–(d). We observed that the best results are obtained with FS4 for 7 of the 11 overlapping datasets on all statistical measures. Comparing the results with FS1, FS2, and FS3, as shown in Fig. 7(a)–(d), FS1, FS2, and FS3 have larger average RMSE, MAE, and MAD but smaller average R2 than FS4. Compared to the baseline methods, the proposed model improves the average RMSE by 24.3% and the MAE by 22.7%. The MAD in Fig. 7(d) reveals that XGBoost with FS2 performed better on some datasets, but the best average MAD is achieved with FS4, followed by FS2, FS3, and FS1. We conclude that FS4 provides better input features than FS1, FS2, and FS3; this result is consistent across all the evaluation indicators. XGBoost also gave its best average performance when used with the feature set FS4.

The results of LightGBM are shown in Fig. 8(a)–(d). These figures show that FS4 has smaller RMSE, MAE, and MAD than FS3, FS2, and FS1, but larger R2, in 8 of the 11 steps of backtesting. This implies that LightGBM performed better with the features from PAN-LDA than with the other sets of features. The average performance over the 11 overlapping datasets indicates that LightGBM performs worst with FS1, followed by FS2 and FS3, and best with the FS4 features. The R2 for FS4 improved by 3.84% over FS1. The average RMSE with PAN-LDA, 20.6934, is significantly better than with FS1 (30.5913), FS2 (24.0235), and FS3 (27.1294). The average MAE and MAD show the same pattern. Fig. 8(d) shows that LightGBM, when used with the features from our model, outperformed FS1 by 45.78%, FS2 by 33.39%, and FS3 by 27.06%. Additionally, Fig. 8(a)–(d) compare the performance of all the feature sets across all the evaluation metrics, clearly showing the benefit of PAN-LDA: the performance with FS4 was much better than with FS1, FS2, and FS3 for all of R2, RMSE, MAE, and MAD.

For both XGBoost and LightGBM, the results from FS4 are, on average, better than those from FS3, FS2, and FS1. Fig. 7(a)–(d) show that the PAN-LDA model outperforms the baseline models in terms of MAE but less so in terms of RMSE, as RMSE penalizes larger prediction errors, whereas MAE measures the absolute difference between the observed and predicted values. Therefore, Table 4 compares the two machine learning algorithms using FS4. It can be noted from Table 4 that the highest correlations were achieved by LightGBM, with an average R2 of 0.7583. LightGBM also has smaller average RMSE, MAE, and MAD values than XGBoost.

Table 4.

Comparison of results of XGBoost and LightGBM with FS4.

Step | R2 (XGBoost / LightGBM) | RMSE (XGBoost / LightGBM) | MAE (XGBoost / LightGBM) | MAD (XGBoost / LightGBM)
1 | 0.7592 / 0.7670 | 23.6395 / 15.7655 | 20.6263 / 11.5822 | 19.9906 / 9.2664
2 | 0.7812 / 0.7812 | 29.8801 / 16.7653 | 25.6069 / 13.3801 | 23.7570 / 10.6228
3 | 0.7112 / 0.8112 | 19.3234 / 11.6423 | 17.1817 / 9.29156 | 10.5774 / 7.6562
4 | 0.5531 / 0.7676 | 9.34649 / 7.9876 | 9.21931 / 6.37479 | 8.18504 / 6.6132
5 | 0.6766 / 0.7317 | 82.2345 / 78.2345 | 67.3902 / 30.4953 | 62.5700 / 27.6917
6 | 0.7175 / 0.7768 | 54.0757 / 28.8765 | 44.9171 / 19.0459 | 41.9074 / 16.7886
7 | 0.7752 / 0.8075 | 18.7480 / 13.6653 | 16.7225 / 10.9061 | 15.5794 / 9.0570
8 | 0.6766 / 0.7226 | 19.0169 / 14.8265 | 16.9371 / 11.8328 | 11.2149 / 10.5660
9 | 0.7697 / 0.7615 | 20.3450 / 13.8643 | 17.9971 / 11.0649 | 16.7289 / 8.6037
10 | 0.7755 / 0.7079 | 17.8170 / 13.2345 | 15.9795 / 10.5622 | 14.7570 / 9.6441
11 | 0.7673 / 0.7070 | 18.3331 / 12.7654 | 16.3914 / 10.1879 | 12.1495 / 9.6112
avg | 0.7239 / 0.7583 | 28.4327 / 20.6934 | 24.4517 / 13.1567 | 21.5834 / 11.4655
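As a quick sanity check (ours, not the paper's), the reported average R2 values in the last row of Table 4 can be reproduced from the 11 per-step values:

```python
# Per-step R2 values transcribed from Table 4.
lightgbm_r2 = [0.7670, 0.7812, 0.8112, 0.7676, 0.7317, 0.7768,
               0.8075, 0.7226, 0.7615, 0.7079, 0.7070]
xgboost_r2  = [0.7592, 0.7812, 0.7112, 0.5531, 0.6766, 0.7175,
               0.7752, 0.6766, 0.7697, 0.7755, 0.7673]

avg_lgbm = sum(lightgbm_r2) / len(lightgbm_r2)   # matches the reported 0.7583
avg_xgb  = sum(xgboost_r2) / len(xgboost_r2)     # matches the reported 0.7239
```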

The experimental results and the comparison of the proposed PAN-LDA model's performance with the baseline models clearly show that supplementary/side information, such as news article content, is a valuable and expressive source of information for improving the predictions of ML algorithms. Because historical data only captures the general perception of the target item, it cannot be used on its own to generate precise forecasts. The features extracted from the LDA model do not seem to give much advantage for forecasting data over time. Moreover, incorporating sentiment scores as an additional feature in the prediction model improved performance, with lower prediction errors such as MAE and RMSE. Besides, the results from FS1 are the worst for both XGBoost and LightGBM under walk-forward testing. It can be concluded that incorporating the infectious disease data along with the news articles in PAN-LDA gave better performance than LDA, which incorporates news articles only; this demonstrates the benefit of the additional features in PAN-LDA. However, XGBoost yielded only small changes when used with the PAN-LDA model. Overall, LightGBM can forecast the actual values of COVID-19 cases more closely than XGBoost. Also, taking the changes in the number of COVID-19 cases into account in PAN-LDA improves prediction with time series, especially with LightGBM.

In this study, the overall time period of the research is short because of the limited availability of reliable news article data [20]. We will improve our model with more data in the future.

6. Conclusions and future directions

In this work, we proposed an LDA-based mathematical model, PAN-LDA, which integrates news articles and data of confirmed COVID-19 cases for better feature extraction. The resultant features can be input as additional features to any ML algorithm to forecast trends with time series. In our paper, we introduced the extracted features from the PAN-LDA model to two gradient boosting-based ML algorithms, i.e., XGBoost and LightGBM, to validate the feasibility of applying PAN-LDA compared to baseline methods. The features from PAN-LDA significantly added value to the goal output when used in ML algorithms. Moreover, LightGBM gave a considerably better performance than XGBoost.

In summary, the features from PAN-LDA generated more identifiable topics and empirically added value to the prediction when they were used in LightGBM.

In the future, we will focus on incorporating other sophisticated features, e.g., daily death cases, the number of recovered cases, etc., as well as on hyperparameter tuning. In the present paper, we used the default hyperparameter values for both ML algorithms, XGBoost and LightGBM, so it cannot be guaranteed that the hyperparameter values used are optimal. Choosing the optimal hyperparameter values for the ML algorithms will be investigated in future work.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.compbiomed.2021.104920.

The following is the supplementary data to this article:

Multimedia component 1
mmc1.docx (36KB, docx)

References

1. Esteva A., Kale A., Paulus R., Hashimoto K., Yin W., Radev D., Socher R. COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization. Npj Digit. Med. 2021;4. doi: 10.1038/s41746-021-00437-0.
2. Zhang E., Gupta N., Tang R., Han X., Pradeep R., Lu K., Zhang Y., Nogueira R., Cho K., Fang H., Lin J. Covidex: neural ranking models and keyword search infrastructure for the COVID-19 Open Research Dataset. 2020. pp. 31–41.
3. Köksal A., Dönmez H., Özçelik R., Ozkirimli E., Özgür A. Vapur: a search engine to find related protein–compound pairs in COVID-19 literature. Proc. 1st Work. NLP COVID-19 (Part 2), EMNLP 2020. Association for Computational Linguistics; Stroudsburg, PA, USA: 2020.
4. Khadjeh Nassirtoussi A., Aghabozorgi S., Ying Wah T., Ngo D.C.L. Text mining of news-headlines for FOREX market prediction: a Multi-layer Dimension Reduction Algorithm with semantics and sentiment. Expert Syst. Appl. 2015;42:306–324. doi: 10.1016/j.eswa.2014.08.004.
5. Jin F., Self N., Saraf P., Butler P., Wang W., Ramakrishnan N. Forex-foreteller: currency trend modeling using news articles. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 2013; pp. 1470–1473.
6. Tissaoui A., Sassi S., Chbeir R. Probabilistic topic models for enriching ontology from texts. SN Comput. Sci. 2020;1. doi: 10.1007/s42979-020-00349-y.
7. Li X., Lei L. A bibliometric analysis of topic modelling studies (2000–2017). J. Inf. Sci. 2021;47:161–175. doi: 10.1177/0165551519877049.
8. Zhu B., Zheng X., Liu H., Li J., Wang P. Analysis of spatiotemporal characteristics of big data on social media sentiment with COVID-19 epidemic topics. Chaos, Solit. Fractals. 2020;140. doi: 10.1016/j.chaos.2020.110123.
9. Ordun C., Purushotham S., Raff E. Exploratory analysis of covid-19 tweets using topic modeling, UMAP, and DiGraphs. ArXiv. 2020.
10. Rortais A., Barrucci F., Ercolano V., Linge J., Christodoulidou A., Cravedi J.P., Garcia-Matas R., Saegerman C., Svečnjak L. A topic model approach to identify and track emerging risks from beeswax adulteration in the media. Food Control. 2021;119. doi: 10.1016/j.foodcont.2020.107435.
11. Chuluunsaikhan T., Ryu G.A., Yoo K.H., Rah H., Nasridinov A. Incorporating deep learning and news topic modeling for forecasting pork prices: the case of South Korea. Agric. For. 2020;10:1–22. doi: 10.3390/agriculture10110513.
12. Li X., Shang W., Wang S. Text-based crude oil price forecasting: a deep learning approach. Int. J. Forecast. 2019;35:1548–1560. doi: 10.1016/j.ijforecast.2018.07.006.
13. Mahadevan A., Arock M. Integrated topic modeling and sentiment analysis: a review rating prediction approach for recommender systems. Turk. J. Electr. Eng. Comput. Sci. 2020;28:107–123. doi: 10.3906/elk-1905-114.
14. Chen T., Guestrin C. XGBoost: a scalable tree boosting system. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 2016; pp. 785–794.
15. Ke G., Meng Q., Finley T., Wang T., Chen W., Ma W., Ye Q., Liu T.Y. LightGBM: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017; pp. 3147–3155. https://github.com/Microsoft/LightGBM
16. Wang J.C., Hastie T. Boosted varying-coefficient regression models for product demand prediction. J. Comput. Graph Stat. 2014;23:361–382. doi: 10.1080/10618600.2013.778777.
17. Qiu H., Luo L., Su Z., Zhou L., Wang L., Chen Y. Machine learning approaches to predict peak demand days of cardiovascular admissions considering environmental exposure. doi: 10.1186/s12911-020-1101-8.
18. Sun X., Liu M., Sima Z. A novel cryptocurrency price trend forecasting model based on LightGBM. Finance Res. Lett. 2020;32. doi: 10.1016/j.frl.2018.12.032.
19. Liang Y., Wu J., Wang W., Cao Y., Zhong B., Chen Z., Li Z. Product marketing prediction based on XGboost and LightGBM algorithm. ACM Int. Conf. Proceeding Ser. 2019; pp. 150–153.
20. Free coronavirus news dataset – updated – AYLIEN. https://blog.aylien.com/free-coronavirus-news-dataset/
21. Coronavirus Pandemic (COVID-19) Statistics and research – Our World in Data. https://ourworldindata.org/coronavirus
22. Sharma V., Stranieri A., Ugon J., Vamplew P., Martin L. An agile group aware process beyond CRISP-DM: a hospital data mining case study. ACM Int. Conf. Proceeding Ser., Association for Computing Machinery; 2017. pp. 109–113.
23. Chapman P., Clinton J., Kerber R., Khabaza T., Reinartz T., Shearer C., Wirth R. CRISP-DM 1.0: step-by-step data mining guide. SPSS Inc. 2000;78:1–78. http://www.crisp-dm.org/CRISPWP-0800.pdf
24. Yu Q., Huang X., Li W., Wang C., Chen Y., Ge Y. Using features extracted from vital time series for early prediction of sepsis. 2019 Comput. Cardiol. Conf. 2019.
25. Tounsi Y., Anoun H., Hassouni L. CSMAS: improving multi-agent credit scoring system by integrating big data and the new generation of gradient boosting algorithms. ACM Int. Conf. Proceeding Ser. 2020.
26. Choi S., Hur J. An ensemble learner-based bagging model using past output data for photovoltaic forecasting. Energies. 2020;13. doi: 10.3390/en13061438.
27. Cordeiro J.R., Postolache O., Ferreira J.C. Child's target height prediction evolution. Appl. Sci. 2019;9:5447. doi: 10.3390/app9245447.
28. Ballı S. Data analysis of Covid-19 pandemic and short-term cumulative case forecasting using machine learning time series methods. Chaos, Solit. Fractals. 2021;142. doi: 10.1016/j.chaos.2020.110512.
29. Lv C.-X., An S.-Y., Qiao B.-J., Wu W. Time series analysis of hemorrhagic fever with renal syndrome in mainland China by using XGBoost forecasting model.
30. Vanichrujee U., Horanont T., Theeramunkong T., Pattara-Atikom W., Shinozaki T. Taxi demand prediction using ensemble model based on RNNs and XGBOOST. 2018 Int. Conf. Embed. Syst. Intell. Technol. (ICESIT-ICICTES 2018), IEEE; 2018. pp. 1–6.
31. Hossain M.A., Karim R., Thulasiram R., Bruce N.D.B., Wang Y. Hybrid deep learning model for stock price prediction. Proc. 2018 IEEE Symp. Ser. Comput. Intell. (SSCI 2018), IEEE; 2019. pp. 1837–1844.
32. Ke G., Meng Q., Finley T., Wang T., Chen W., Ma W., Ye Q., Liu T.Y. LightGBM: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017; pp. 3147–3155. https://github.com/Microsoft/LightGBM
33. Blei D.M., Ng A.Y., Jordan M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003;3:993–1022.
34. Griffiths T.L., Steyvers M. Finding scientific topics. Proc. Natl. Acad. Sci. U.S.A. 2004;101:5228–5235. doi: 10.1073/pnas.0307752101.
35. Cao L.J., Tay F.E.H. Support vector machine with adaptive parameters in financial time series forecasting. IEEE Trans. Neural Network. 2003;14:1506–1518. doi: 10.1109/TNN.2003.820556.
36. WHO. Coronavirus disease (COVID-19) situation reports. World Health Organization. 2020. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports
37. Rehurek R. gensim: topic modelling for humans. 2014. https://radimrehurek.com/gensim/index.html
38. Surian D., Chawla S. Mining outlier participants: insights using directional distributions in latent models. Lect. Notes Comput. Sci. 2013; pp. 337–352.
39. Wang L., Zhang Y., Zhang Y., Xu X., Cao S. Prescription function prediction using topic model and multilabel classifiers. Evid. Based Complement. Altern. Med. 2017. doi: 10.1155/2017/8279109.
40. Panichella A. A systematic comparison of search-based approaches for LDA hyperparameter tuning. Inf. Software Technol. 2021;130. doi: 10.1016/j.infsof.2020.106411.
41. Yoshida T., Hisano R., Ohnishi T. Gaussian hierarchical latent Dirichlet allocation: bringing polysemy back. 2020. http://arxiv.org/abs/2002.10855
42. Vosecky J., Jiang D., Leung K.W.T., Ng W. Dynamic multi-faceted topic discovery in Twitter. Int. Conf. Inf. Knowl. Manag. Proc. 2013; pp. 879–884.
43. sklearn.metrics.r2_score — scikit-learn 0.23.2 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
44. sklearn.metrics.mean_absolute_error — scikit-learn 0.23.2 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html
45. sklearn.metrics.mean_squared_error — scikit-learn 0.23.2 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
46. pandas.Series.mad — pandas 1.2.4 documentation. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mad.html
47. Python Package Introduction — xgboost 1.4.0-SNAPSHOT documentation. 2020. https://xgboost.readthedocs.io/en/latest/python/python_intro.html


Articles from Computers in Biology and Medicine are provided here courtesy of Elsevier
