2021 Aug 3;11:15747. doi: 10.1038/s41598-021-94897-9

Table 3.

Classifier comparison.

Classifier	Accuracy	Average precision	Brier loss	F1	Log loss	Precision	Recall	AUC	Time (s)
ETC	0.95	0.93	0.05	0.95	1.71	0.95	0.95	0.95	1.35
GPC	0.88	0.85	0.12	0.87	4.25	0.89	0.88	0.88	6.12
KNC	0.86	0.84	0.14	0.86	4.74	0.89	0.86	0.86	2.22
LOG	0.93	0.91	0.07	0.93	2.36	0.94	0.93	0.93	0.54
MLP	0.92	0.89	0.08	0.92	2.85	0.91	0.92	0.92	1.27
RDC	0.86	0.83	0.14	0.86	4.74	0.87	0.86	0.86	0.22
RFC	0.95	0.93	0.05	0.95	1.81	0.95	0.95	0.95	1.26
SVC	0.94	0.92	0.06	0.94	2.14	0.94	0.93	0.94	1.96

Performance metrics for the 8 classifiers (Extra Trees Classifier, ETC; Gaussian Process Classifier, GPC; K-Nearest Neighbours Classifier, KNC; Logistic Regression, LOG; MultiLayer Perceptron classifier, MLP; Ridge Classifier, RDC; Random Forest Classifier, RFC; and Support Vector Machine classifier, SVC; listed alphabetically by abbreviation) used for the disambiguation in “Topic detection”. The metrics were obtained by averaging the validation-set results over the folds of a threefold cross-validation, and then averaging over a random sample of 2000 genes. The logistic regression classifier (bold) was the second fastest model (0.54 s, behind only the ridge classifier at 0.22 s) and came within 0.02 of the best accuracy (0.93 versus 0.95 for ETC and RFC). It was therefore selected as the default model to run the disambiguation on the remaining 17,082 human protein-coding genes. The high validation scores indicate that there was no over-fitting under the threefold cross-validation.
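
The fold-averaging described in the caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `binary_metrics` and `average_over_folds`, the 0.5 decision threshold, and the toy labels and probabilities are all assumptions made for illustration, and precision, recall and F1 are computed for the positive class only, which may differ from the averaging scheme behind the table.

```python
import math
from statistics import mean


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute several of the table's metrics for one validation fold.

    y_true: list of 0/1 labels; y_prob: predicted probability of class 1.
    The 0.5 threshold is an assumption, not taken from the paper.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Clip probabilities so that log(0) never occurs.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "brier_loss": mean((p - t) ** 2 for t, p in zip(y_true, y_prob)),
        "log_loss": -mean(
            t * math.log(p) + (1 - t) * math.log(1 - p)
            for t, p in zip(y_true, clipped)
        ),
    }


def average_over_folds(fold_metrics):
    """Average each metric over the per-fold metric dictionaries."""
    return {key: mean(m[key] for m in fold_metrics) for key in fold_metrics[0]}


# Hypothetical validation-set outputs for three folds (illustrative only).
folds = [
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]),
    ([1, 1, 0, 0], [0.8, 0.4, 0.3, 0.1]),
    ([0, 1, 0, 1], [0.6, 0.7, 0.2, 0.9]),
]

fold_metrics = [binary_metrics(y_true, y_prob) for y_true, y_prob in folds]
averaged = average_over_folds(fold_metrics)

for name, value in averaged.items():
    print(f"{name}: {value:.2f}")
```

Applying `average_over_folds` a second time to the per-gene results would give the second level of averaging mentioned in the caption, across the sampled genes.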