2021 Aug 3;11:15747. doi: 10.1038/s41598-021-94897-9

Table 3.

Classifier comparison.

Classifier  Accuracy  Average precision  Brier loss  F1    Log loss  Precision  Recall  AUC   Time (s)
ETC         0.95      0.93               0.05        0.95  1.71      0.95       0.95    0.95  1.35
GPC         0.88      0.85               0.12        0.87  4.25      0.89       0.88    0.88  6.12
KNC         0.86      0.84               0.14        0.86  4.74      0.89       0.86    0.86  2.22
LOG         0.93      0.91               0.07        0.93  2.36      0.94       0.93    0.93  0.54
MLP         0.92      0.89               0.08        0.92  2.85      0.91       0.92    0.92  1.27
RDC         0.86      0.83               0.14        0.86  4.74      0.87       0.86    0.86  0.22
RFC         0.95      0.93               0.05        0.95  1.81      0.95       0.95    0.95  1.26
SVC         0.94      0.92               0.06        0.94  2.14      0.94       0.93    0.94  1.96

Performance metrics for the eight classifiers used for the disambiguation in “Topic detection”, listed alphabetically by abbreviation: Extra Trees Classifier, ETC; Gaussian Process Classifier, GPC; K-Nearest Neighbours Classifier, KNC; Logistic Regression, LOG; Multilayer Perceptron Classifier, MLP; Ridge Classifier, RDC; Random Forest Classifier, RFC; and Support Vector Machine Classifier, SVC. For each gene, the metrics were averaged over the validation sets of a threefold cross-validation; these per-gene values were then averaged over a random sample of 2000 genes. The logistic regression classifier (LOG) was among the fastest models (0.54 s; only RDC was faster) and among the most accurate (accuracy 0.93, against 0.95 for ETC and RFC), and it was therefore selected as the default model to run the disambiguation on the remaining 17,082 human protein-coding genes. The high validation scores suggest that the model did not over-fit under threefold cross-validation.
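The two-stage averaging and the speed/accuracy comparison described in the legend can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the gene labels and the example fold scores are hypothetical, and only the accuracy and time values are copied from Table 3.

```python
from statistics import mean

# Accuracy and time (s) copied from Table 3; the other columns are omitted.
TABLE_3 = {
    "ETC": {"accuracy": 0.95, "time_s": 1.35},
    "GPC": {"accuracy": 0.88, "time_s": 6.12},
    "KNC": {"accuracy": 0.86, "time_s": 2.22},
    "LOG": {"accuracy": 0.93, "time_s": 0.54},
    "MLP": {"accuracy": 0.92, "time_s": 1.27},
    "RDC": {"accuracy": 0.86, "time_s": 0.22},
    "RFC": {"accuracy": 0.95, "time_s": 1.26},
    "SVC": {"accuracy": 0.94, "time_s": 1.96},
}


def average_over_folds_then_genes(fold_scores_by_gene):
    """Average each gene's fold scores, then average those per-gene means."""
    per_gene_means = [mean(folds) for folds in fold_scores_by_gene.values()]
    return mean(per_gene_means)


def rank(table, column, reverse=False):
    """Return classifier names sorted by one column of the table."""
    return sorted(table, key=lambda name: table[name][column], reverse=reverse)


# Hypothetical validation accuracies: two genes, three folds each.
example_folds = {
    "GENE_A": [0.90, 0.94, 0.92],
    "GENE_B": [0.96, 0.94, 0.95],
}
overall = average_over_folds_then_genes(example_folds)

fastest_first = rank(TABLE_3, "time_s")
most_accurate_first = rank(TABLE_3, "accuracy", reverse=True)

print(f"overall accuracy over genes: {overall:.3f}")
print("fastest first:", fastest_first)
print("most accurate first:", most_accurate_first)
```

With the Table 3 values, `fastest_first` begins with RDC followed by LOG, and `most_accurate_first` begins with ETC and RFC (tied at 0.95), which makes the trade-off behind choosing a fast model with near-top accuracy easy to inspect.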