2024 Oct 9;14:23516. doi: 10.1038/s41598-024-74022-2

Table 2.

Percentage of terms shared between each pair of topic modeling methods across their 20 extracted topics in the CAMDA dataset. The diagonal values give the number of distinct terms in the 20 topics extracted by each method.

Method      NMF   LSI   FLSA  PLSA  LDA   CTM   LDA2VEC  TOP2VEC  BERTopic  CombinedTM  ETM
NMF         278   56%   50%   57%   55%   62%   37%      43%      49%       57%         56%
LSI         90%   172   62%   65%   70%   62%   44%      50%      62%       67%         70%
FLSA        37%   28%   379   46%   31%   39%   28%      29%      28%       39%         33%
PLSA        42%   29%   60%   380   39%   39%   34%      36%      31%       43%         37%
LDA         69%   54%   52%   68%   222   58%   41%      45%      55%       63%         64%
CTM         69%   43%   58%   59%   52%   250   34%      40%      42%       59%         65%
LDA2VEC     42%   30%   43%   53%   37%   35%   246      52%      37%       47%         32%
TOP2VEC     46%   33%   43%   53%   38%   38%   50%      259      42%       51%         41%
BERTopic    69%   54%   53%   58%   62%   52%   45%      54%      199       63%         57%
CombinedTM  50%   37%   47%   51%   44%   46%   36%      42%      39%       317         41%
ETM         68%   53%   55%   62%   62%   71%   35%      50%      50%       57%         228
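
The caption does not spell out how each percentage is normalized. The following is a minimal sketch of one plausible reading, assuming each off-diagonal cell is the share of the row method's distinct terms that also appear in the column method's topics. That reading would account for the asymmetry of the table: for example, 90% of LSI's 172 terms and 56% of NMF's 278 terms both come to roughly 155 shared terms. The function name `term_overlap_table` and the toy topics are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def term_overlap_table(
    topics_by_method: Dict[str, List[List[str]]]
) -> Dict[str, Dict[str, int]]:
    """Build a table of term overlaps between topic modeling methods.

    Diagonal cells hold the number of distinct terms a method produced
    across all of its topics. Off-diagonal cells hold the percentage of
    the ROW method's distinct terms that also occur in the COLUMN method
    (an assumed normalization, not one confirmed by the source).
    """
    # Collapse each method's topics into one set of distinct terms.
    vocab = {
        method: {term for topic in topics for term in topic}
        for method, topics in topics_by_method.items()
    }

    table: Dict[str, Dict[str, int]] = {}
    for row, row_terms in vocab.items():
        table[row] = {}
        for col, col_terms in vocab.items():
            if row == col:
                table[row][col] = len(row_terms)
            elif row_terms:
                shared = len(row_terms & col_terms)
                table[row][col] = round(100 * shared / len(row_terms))
            else:
                table[row][col] = 0
    return table


# Toy input: two hypothetical methods with two topics each.
toy = {
    "A": [["gene", "cell", "liver"], ["drug", "cell", "dose"]],
    "B": [["gene", "drug"], ["dose", "rat"]],
}
table = term_overlap_table(toy)
for method, row in table.items():
    print(method, row)
```

In the toy input, method A has 5 distinct terms and method B has 4, with 3 in common, so the two off-diagonal cells differ even though they describe the same shared set. The same effect appears in the table above whenever two methods have different distinct-term counts.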