2021 Oct 20;26:100467. doi: 10.1016/j.invent.2021.100467

Table 5.

Macro-averaged model performance under 5-fold cross-validation on our training set (80% of our annotated data), and performance of models trained on the full training set and evaluated on our held-out test set (20% of our annotated data).

Cross validation
Features	Precision	Recall	Macro-F1	Accuracy	AUC
N-Gram L + D	0.82 (±0.11)	0.81 (±0.09)	0.81 (±0.09)	0.81 (±0.09)	0.91 (±0.12)
N-Gram L + D	0.82 (±0.10)	0.81 (±0.09)	0.81 (±0.08)	0.81 (±0.09)	0.89 (±0.12)
TFIDF L + D	0.84 (±0.04)	0.83 (±0.04)	0.82 (±0.05)	0.83 (±0.04)	0.92 (±0.06)
BERT	0.82 (±0.05)	0.78 (±0.06)	0.78 (±0.06)	0.81 (±0.05)	0.87 (±0.04)
Baseline	0.75 (±0.08)	0.75 (±0.09)	0.71 (±0.10)	0.72 (±0.10)	0.86 (±0.09)

Test set
Features	Precision	Recall	Macro-F1	Accuracy	AUC
N-Gram L + D	0.73	0.76	0.74	0.76	0.81
N-Gram L + D	0.71	0.73	0.72	0.74	0.78
TFIDF L + D	0.72	0.74	0.73	0.76	0.79
BERT	0.76	0.79	0.76	0.77	0.83
Baseline	0.64	0.66	0.61	0.62	0.66

LR: logistic regression; SVM: linear support vector machine; RF: random forest; LSTM NN: long short-term memory neural network.

L + D: lemmatized and debiased (see Supplement section “Debiasing” for more information about debiasing).
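
For readers unfamiliar with macro-averaging, the sketch below illustrates one common way such scores are computed: precision, recall, and F1 are calculated per class and then averaged with equal weight per class. It is a minimal illustration, not the authors' evaluation code. The function name `macro_scores` and the toy labels are hypothetical and are not taken from the study, and it assumes macro-F1 is defined as the mean of the per-class F1 scores (the table does not state which definition was used).

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (macro_precision, macro_recall, macro_f1, accuracy).

    Assumption: macro-F1 is the unweighted mean of per-class F1 scores.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        prec = tp[c] / predicted if predicted else 0.0
        rec = tp[c] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n, accuracy


# Toy labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]
precision, recall, f1, accuracy = macro_scores(y_true, y_pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} Acc={accuracy:.2f}")
```

Under this definition the macro-F1 can be lower than both the macro-precision and the macro-recall, as it is in the toy example, because the harmonic mean is taken within each class before averaging. This is consistent with rows such as Baseline on the test set, although the table alone does not confirm which definition the authors applied.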