2022 Jul 15;6(3):344–374. doi: 10.1007/s41666-022-00118-x

Table 2.

Performance of different models for the triggering task (reported values are all in percentages)

| Method | Accuracy | Precision (Infeasible) | Precision (Feasible) | Recall (Infeasible) | Recall (Feasible) | F1-score (Infeasible) | F1-score (Feasible) | AUC-ROC |
|---|---|---|---|---|---|---|---|---|
| BERT | 87.82 ± 0.6 | 80.51 ± 2.8 | 89.59 ± 1.8 | 62.70 ± 7.7 | 95.32 ± 1.5 | 70.08 ± 3.6 | 92.34 ± 0.3 | 93.93 ± 0.2 |
| BiLSTM | 87.37 ± 0.4 | 72.36 ± 2.9 | 92.14 ± 1.4 | 73.72 ± 5.5 | 91.44 ± 1.9 | 72.81 ± 1.4 | 91.76 ± 0.4 | 91.71 ± 0.5 |
| XGBoost†* | 86.03 ± 0.3 | 73.78 ± 1.7 | 88.93 ± 0.2 | 61.10 ± 0.5 | 93.49 ± 0.6 | 66.83 ± 0.5 | 91.15 ± 0.2 | 92.38 ± 0.3 |
| SVM†* | 85.68 ± 0.3 | 74.93 ± 2.2 | 87.99 ± 0.3 | 56.98 ± 1.4 | 94.27 ± 0.8 | 64.69 ± 0.5 | 91.02 ± 0.2 | 92.21 ± 0.5 |
| Weighted TF-IDF | 75.36 ± 0.8 | 47.54 ± 1.3 | 88.68 ± 0.2 | 66.73 ± 0.6 | 77.94 ± 1.1 | 55.51 ± 0.7 | 82.96 ± 0.6 | 72.34 ± 0.4 |
| TF-IDF | 75.43 ± 0.6 | 47.39 ± 1.1 | 87.04 ± 0.1 | 60.16 ± 0.7 | 80.00 ± 1.0 | 53.00 ± 0.4 | 83.37 ± 0.5 | 70.08 ± 0.2 |
| Frequency | 64.74 ± 0.3 | 23.21 ± 0.9 | 77.03 ± 0.3 | 23.03 ± 1.2 | 77.21 ± 0.6 | 23.11 ± 1.0 | 77.12 ± 0.2 | 49.99 ± 0.7 |

Bold entries are the best performance values given each metric

PubMedBERT embedding; Wikipedia-PubMed embedding; *TF-IDF values as weight
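
As a reading aid for the per-class columns in Table 2, the following is a minimal sketch of how precision, recall, and F1-score could be computed separately for the Infeasible and Feasible classes. It is not the authors' evaluation code. The function name `per_class_metrics`, the label strings, and the toy predictions are all hypothetical, and the values come out as fractions rather than the mean ± standard deviation percentages reported in the table.

```python
from typing import Dict, List


def per_class_metrics(y_true: List[str], y_pred: List[str], label: str) -> Dict[str, float]:
    """Precision, recall, and F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as `label`
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as `label`
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed instances of `label`

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data (not from the paper): 6 feasible and 4 infeasible examples.
y_true = ["feasible"] * 6 + ["infeasible"] * 4
y_pred = ["feasible"] * 5 + ["infeasible"] + ["infeasible"] * 2 + ["feasible"] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
feasible = per_class_metrics(y_true, y_pred, "feasible")
infeasible = per_class_metrics(y_true, y_pred, "infeasible")

print(f"accuracy:   {accuracy:.2%}")
for name, m in (("feasible", feasible), ("infeasible", infeasible)):
    print(f"{name:<11} P={m['precision']:.2%} R={m['recall']:.2%} F1={m['f1']:.2%}")
```

In this toy data the minority (infeasible) class scores lower on recall and F1 than the majority class even though overall accuracy looks reasonable, which is the same pattern the per-class columns make visible in the table.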