Front Sociol. 2022 May 6;7:886498. doi: 10.3389/fsoc.2022.886498

Table 5. Comparison of topic models.

LDA

Advantages:
• Prior domain knowledge is not necessarily required
• Finds coherent topics when hyperparameters are tuned correctly
• Can deal with sparse input
• The number of topics is generally smaller than in word-embedding-based approaches, which makes the topics easier to interpret
• One document can contain several different topics (mixed-membership extraction)
• Produces a full generative model with a multinomial distribution over topics
• Shows both adjectives and nouns within topics

Disadvantages:
• Detailed modeling assumptions are required
• Hyperparameters need to be tuned carefully
• Results can easily produce overlapping topics, as topics are soft clusters
• Objective evaluation metrics are widely missing
• The number of topics needs to be defined by the user
• Since the results are not deterministic, reliability and validity are not automatically ensured
• Assumes that topics are independent of each other; hence, only the frequency of word co-occurrence is used
• Word correlations are ignored, so no relationships between topics can be modeled

A minimal usage sketch follows this list.
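This sketch assumes the gensim implementation of LDA (the table itself is library-agnostic); the toy corpus and all parameter values are illustrative. As noted above, the number of topics must be chosen by the user.

```python
# Minimal LDA sketch with gensim; corpus and parameters are illustrative.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [doc.lower().split() for doc in [
    "topic models find latent themes in text",
    "lda is a generative probabilistic model",
    "hyperparameters such as alpha need careful tuning",
]]

dictionary = Dictionary(docs)                    # token -> integer id
corpus = [dictionary.doc2bow(d) for d in docs]   # sparse bag-of-words input

# num_topics must be set by the user; alpha="auto" learns an asymmetric prior.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               alpha="auto", passes=10, random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=5):
    print(topic_id, words)

# Mixed membership: each document receives a distribution over topics.
print(lda.get_document_topics(corpus[0]))
```

Because results are not deterministic, random_state is fixed here so that runs can be reproduced.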
NMF

Advantages:
• Prior domain knowledge is not required
• Supports mixed-membership models; thus, one document can contain several topics
• In contrast to LDA, which uses raw word frequencies, the term-document matrix can be weighted with TF-IDF
• Computationally efficient and highly scalable
• Easy to implement

Disadvantages:
• Frequently delivers incoherent topics
• The number of topics to be extracted must be defined by the user in advance
• The probabilistic generative model is only implicitly specified

A minimal usage sketch follows this list.
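This sketch assumes the scikit-learn implementation of NMF and shows the TF-IDF weighting mentioned above, in contrast to LDA's raw counts; the documents and parameters are illustrative.

```python
# Minimal NMF sketch with scikit-learn on a TF-IDF weighted matrix.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "topic models find latent themes in text",
    "nmf factorizes a tf-idf weighted term-document matrix",
    "the number of topics must be set in advance",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)       # TF-IDF weighted term-document matrix

# n_components is the number of topics and must be fixed in advance.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)                 # document-topic weights (mixed membership)
H = nmf.components_                      # topic-term weights

terms = vectorizer.get_feature_names_out()
for topic_idx, row in enumerate(H):
    top_terms = [terms[i] for i in row.argsort()[::-1][:5]]
    print(topic_idx, top_terms)
```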
Top2Vec

Advantages:
• Supports hierarchical topic reduction
• Allows for multilingual analysis
• Automatically finds the number of topics
• Creates jointly embedded word, document, and topic vectors
• Contains built-in search functions (easy to go from topics to documents, search topics, etc.)
• Can work on very large datasets
• Uses embeddings, so no preprocessing of the original data is needed

Disadvantages:
• The embedding approach might result in too many topics, requiring labor-intensive inspection of each topic
• Generates many outliers
• Not very suitable for small datasets (fewer than roughly 1,000 documents)
• Each document is assigned to only one topic (no mixed membership)
• Objective evaluation metrics are missing

A minimal usage sketch follows this list.
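This sketch assumes the top2vec package; load_documents() is a hypothetical placeholder for any list of raw strings, since no preprocessing is needed. As noted above, the method is poorly suited to corpora of fewer than roughly 1,000 documents.

```python
# Minimal Top2Vec sketch; load_documents() is a hypothetical placeholder.
from top2vec import Top2Vec

docs = load_documents()  # any list of raw document strings; no preprocessing needed

model = Top2Vec(docs, speed="learn", workers=4)  # number of topics found automatically

num_topics = model.get_num_topics()
topic_words, word_scores, topic_nums = model.get_topics(num_topics)

# Built-in search: retrieve the documents closest to a topic vector.
documents, scores, doc_ids = model.search_documents_by_topic(topic_num=0, num_docs=5)

# Hierarchical topic reduction down to a user-chosen number of topics.
model.hierarchical_topic_reduction(num_topics=5)
```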
BERTopic

Advantages:
• High versatility and stability across domains
• Allows for multilingual analysis
• Supports topic modeling variations (guided, dynamic, or class-based topic modeling)
• Uses embeddings, so no preprocessing of the original data is needed
• Automatically finds the number of topics
• Supports hierarchical topic reduction
• Contains built-in search functions (easy to go from topics to documents, search topics, etc.)
• Broader support of embedding models than Top2Vec

Disadvantages:
• The embedding approach might result in too many topics, requiring labor-intensive inspection of each topic
• Generates many outliers
• No topic distribution is generated within a single document; rather, each document is assigned to a single topic
• Objective evaluation metrics are missing

A minimal usage sketch follows this list.
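This sketch assumes the bertopic package with its default sentence-transformers embedding backend; load_documents() is again a hypothetical placeholder, and the query term is illustrative.

```python
# Minimal BERTopic sketch; load_documents() is a hypothetical placeholder.
from bertopic import BERTopic

docs = load_documents()  # any list of raw document strings; no preprocessing needed

topic_model = BERTopic(language="multilingual")  # multilingual embedding model
topics, probs = topic_model.fit_transform(docs)  # number of topics found automatically

print(topic_model.get_topic_info())  # one row per topic; outliers land in topic -1
print(topic_model.get_topic(0))      # top words of topic 0

# Hierarchical topic reduction to a smaller, user-chosen number of topics.
topic_model.reduce_topics(docs, nr_topics=10)

# Built-in search: find topics related to a query term.
similar_topics, similarity = topic_model.find_topics("economy", top_n=3)
```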