Table 1.
| Related work | Topic modeling method | Evaluation method | Outcome |
|---|---|---|---|
| Chakkarwar and Tamane (2020) | Latent Dirichlet allocation (LDA) with bag of words (BoW) | Visual overview of extracted topics | - Aimed to discover current trends, topics, or patterns in research documents to give an overview of different research trends.<br>- The results show that LDA is an effective topic modeling method for capturing the context of a document collection. |
| Ray et al. (2019) | Latent semantic indexing (LSI)<br>LDA<br>Non-negative matrix factorization (NMF) | Perplexity<br>Topic coherence | - Aimed to introduce topic modeling methods and tools to the Hindi language.<br>- Discussed many techniques and tools used for topic modeling.<br>- The coherence result of the NMF model was slightly better than that of the LDA model.<br>- The perplexity of the LDA model on the Hindi dataset was better than that of the other evaluated topic modeling methods. |
| Xu et al. (2019) | LDA | Perplexity | - Aimed to help Chinese movie creators understand the psychological needs of movie viewers and provide suggestions to improve the quality of Chinese movies.<br>- Used a word cloud as a visual display of high-frequency keywords in a text, which gives a basic understanding of the core ideas of the text data.<br>- The LDA model provides topics that deliver a good analysis of the Douban online reviews.<br>- Used perplexity to determine the best number of extracted topics; as a result, the number of topics was set to 20. |
| Alghamdi and Alfalqi (2015) | Latent semantic analysis (LSA)<br>Probabilistic latent semantic analysis (PLSA)<br>LDA<br>Correlated topic model (CTM) | | - Reviewed many topic modeling methods in terms of characteristics, limitations, and theoretical background.<br>- Reviewed many topic modeling application areas and evaluation methods. |
| Chen et al. (2017) | NMF<br>Principal component analysis (PCA)<br>LDA<br>K-Competitive Autoencoder for Text (KATE) | t-Distributed stochastic neighbor embedding (t-SNE) dimensionality-reduction method | - Aimed to compare and evaluate many topic modeling approaches in analyzing a large set of US Securities and Exchange Commission (SEC) filings made by US public banks.<br>- Both the NMF and LDA methods provide very good document representations, while KATE delivered more meaningful document representations and higher-accuracy topics.<br>- LDA provided the best result regarding the classification of topic representations. |
| Mazarura and de Waal (2016) | LDA<br>Dirichlet multinomial mixture model (GSDMM) | Topic stability<br>Topic coherence | - Tested many numbers of topics (10, 20, 30, 40, 50, and 100).<br>- Topic coherence decreases for both LDA and GSDMM as the number of topics increases on long text, indicating an overall decline in the quality of the topics uncovered by both models.<br>- LDA's coherence values are slightly better than GSDMM's.<br>- GSDMM is more stable than LDA.<br>- GSDMM is a viable option for short text, as it displays the potential to produce better results than LDA. |
| Sisodia et al. (2020) | BoW<br>Term frequency–inverse document frequency (TF-IDF)<br>Naive Bayes<br>Support vector machine (SVM)<br>Decision trees<br>Nu-support vector classification (Nu-SVC) | Accuracy<br>Precision<br>Recall<br>F-measure | - The Nu-SVC classifier outperforms all the other individual classifiers.<br>- The random forest classifier outperforms all the other ensemble classifiers.<br>- The SVM classifier outperforms all the other individual classifiers.<br>- The random forest classifier outperforms the remaining classifiers.<br>- Only two datasets were considered; datasets of different sizes need to be studied for better results. |
| Shi et al. (2017) | Vector space model (VSM)<br>LSI<br>PLSA<br>LDA | | - Reviewed all of the following methods: VSM, LSI, PLSA, and LDA.<br>- Reviewed the essential concepts of topic modeling using a bag-of-words approach.<br>- Discussed the basic ideas of topic modeling, including the bag-of-words approach, model training, and output.<br>- Discussed topic modeling applications, features, limitations, and tools such as Gensim, the Stanford Topic Modeling Toolbox, Machine Learning for Language Toolkit (MALLET), and BigARTM. |
| Nugroho et al. (2020) | LDA<br>NMF<br>Task-driven NMF<br>Plink-LDA<br>Non-negative matrix inter-joint factorization (NMijF) | Purity<br>Normalized mutual information (NMI)<br>Pairwise F-measure | - Focused on reviewing the approaches and discussing the features that are exploited to deal with the extreme sparsity and dynamics of the online social network (OSN) environment.<br>- Ran the algorithms over both datasets 30 times and reported the average value of each evaluation metric for comparison.<br>- Most methods can achieve a high purity value.<br>- NMF and NMijF have the best performance among the compared methods.<br>- Pairwise F-measure results were good and similar across all methods.<br>- NMijF provides the best results according to all the evaluation metrics.<br>- Both LDA and NMF exploit only the simple content of social media posts rather than the main features (content, social interactions, and temporal information). |
| Ahmed Taloba et al. (2018) | PCA model<br>Standard SVM<br>J-48 decision tree<br>KNN methods | Precision<br>Accuracy<br>Sensitivity<br>F-measure | - Aimed to compare the performance of these methods before and after using PCA.<br>- Random forest (RF) gives acceptable and higher accuracy compared to the rest of the classifiers.<br>- The RF algorithm gives higher performance, and its performance is improved after using PCA. |
| Chen et al. (2019) | LDA<br>NMF<br>Knowledge-guided NMF (KGNMF) | Pointwise mutual information (PMI) score<br>Human judgments | - Tested many numbers of topics (20, 40, 60, 80, and 100).<br>- NMF has overwhelming advantages over LDA.<br>- The KGNMF model performs better than NMF and LDA.<br>- NMF provides better topics than LDA with topic numbers ranging from 20 to 100. |
| Anantharaman et al. (2019) | LDA<br>LSA<br>NMF | Precision<br>Recall<br>F-measure<br>Accuracy<br>Cohen's kappa score<br>Matthews correlation coefficient<br>Time taken | - Evaluated all topic modeling algorithms with both BoW and TF-IDF representations.<br>- Used the Naïve Bayes classifier for the 20-newsgroup dataset and the random forest classifier for the BBC news and PubMed datasets.<br>- On the 20-newsgroup dataset, LDA with BoW outperforms the other topic modeling algorithms.<br>- The LDA model does not perform well with TF-IDF compared to BoW.<br>- LDA takes much more time than the LSA and NMF models. |
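
Several of the studies summarized in Table 1 pair LDA or NMF with perplexity and topic coherence to choose and judge the number of topics. The sketch below illustrates that common pipeline with Gensim, one of the toolkits cited by Shi et al. (2017); it is a minimal illustration, not the setup of any surveyed paper, and the toy documents, parameter values, and the u_mass coherence variant are assumptions made here.

```python
# Minimal sketch of the LDA + perplexity + topic coherence workflow.
# Documents, num_topics, and other parameters are illustrative only,
# not taken from any of the studies in Table 1.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy pre-tokenized documents standing in for a real corpus (BoW input).
docs = [
    ["movie", "review", "viewer", "plot", "actor"],
    ["bank", "filing", "risk", "market", "report"],
    ["topic", "model", "coherence", "perplexity", "corpus"],
]

dictionary = Dictionary(docs)                    # token -> integer id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=0)

# Perplexity: Gensim returns the per-word log-likelihood bound (base 2),
# so perplexity = 2 ** (-bound); lower is better.
perplexity = 2 ** (-lda.log_perplexity(corpus))

# Topic coherence (u_mass variant, computed directly from the corpus);
# higher is better. The c_v variant used in some studies works the same
# way but takes texts=docs instead of corpus=corpus.
coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                           coherence="u_mass").get_coherence()

print(f"perplexity: {perplexity:.2f}, u_mass coherence: {coherence:.2f}")
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```

In practice, the loop over candidate topic counts (e.g., the 10 to 100 range tested by Mazarura and de Waal, 2016) simply repeats this fit-and-score step for each value and keeps the count with the best coherence or perplexity.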