2022 Jul 15;6(3):344–374. doi: 10.1007/s41666-022-00118-x

Table 2.

Performance of different models for the triggering task (reported values are all in percentages)

Method	Accuracy	Precision (Infeasible)	Precision (Feasible)	Recall (Infeasible)	Recall (Feasible)	F1-score (Infeasible)	F1-score (Feasible)	AUC-ROC
BERT^‡	87.82 ± 0.6	80.51 ± 2.8	89.59 ± 1.8	62.70 ± 7.7	95.32 ± 1.5	70.08 ± 3.6	92.34 ± 0.3	93.93 ± 0.2
BiLSTM^†	87.37 ± 0.4	72.36 ± 2.9	92.14 ± 1.4	73.72 ± 5.5	91.44 ± 1.9	72.81 ± 1.4	91.76 ± 0.4	91.71 ± 0.5
XGBoost^†*	86.03 ± 0.3	73.78 ± 1.7	88.93 ± 0.2	61.10 ± 0.5	93.49 ± 0.6	66.83 ± 0.5	91.15 ± 0.2	92.38 ± 0.3
SVM^†*	85.68 ± 0.3	74.93 ± 2.2	87.99 ± 0.3	56.98 ± 1.4	94.27 ± 0.8	64.69 ± 0.5	91.02 ± 0.2	92.21 ± 0.5
Weighted TF-IDF^†	75.36 ± 0.8	47.54 ± 1.3	88.68 ± 0.2	66.73 ± 0.6	77.94 ± 1.1	55.51 ± 0.7	82.96 ± 0.6	72.34 ± 0.4
TF-IDF	75.43 ± 0.6	47.39 ± 1.1	87.04 ± 0.1	60.16 ± 0.7	80.00 ± 1.0	53.00 ± 0.4	83.37 ± 0.5	70.08 ± 0.2
Frequency	64.74 ± 0.3	23.21 ± 0.9	77.03 ± 0.3	23.03 ± 1.2	77.21 ± 0.6	23.11 ± 1.0	77.12 ± 0.2	49.99 ± 0.7

Bold entries are the best performance values given each metric

^‡PubMedBERT embedding; ^†Wikipedia-PubMed embedding; ^*TF-IDF values as weight
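
For readers relating the per-class columns to one another, the following is a minimal sketch of how an F1-score follows from precision and recall. It assumes the standard harmonic-mean definition; the helper name `f1_score` is hypothetical and is not taken from the paper. The inputs are the mean Infeasible-class precision and recall copied from Table 2.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (output is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean Infeasible-class precision and recall from Table 2, in percent.
bert_infeasible_f1 = f1_score(80.51, 62.70)
bilstm_infeasible_f1 = f1_score(72.36, 73.72)

print(f"BERT (Infeasible):   {bert_infeasible_f1:.2f}")
print(f"BiLSTM (Infeasible): {bilstm_infeasible_f1:.2f}")
```

The values computed this way differ slightly from the tabulated F1-scores (70.08 for BERT and 72.81 for BiLSTM). One plausible reason, assuming each table entry is a mean over repeated runs, is that the F1 of averaged precision and recall is generally not equal to the average of per-run F1-scores.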