Table 5.
Macro-averaged model performance under 5-fold cross validation on our training set (80% of our annotated data), and performance of models trained on the full training set and evaluated on our test set (20% of our annotated data).

Cross validation

Features | Precision | Recall | Macro-F1 | Accuracy | AUC |
---|---|---|---|---|---|
N-Gram L + D | 0.82 (±0.11) | 0.81 (±0.09) | 0.81 (±0.09) | 0.81 (±0.09) | 0.91 (±0.12) |
N-Gram L + D | 0.82 (±0.10) | 0.81 (±0.09) | 0.81 (±0.08) | 0.81 (±0.09) | 0.89 (±0.12) |
TFIDF L + D | 0.84 (±0.04) | 0.83 (±0.04) | 0.82 (±0.05) | 0.83 (±0.04) | 0.92 (±0.06) |
BERT | 0.82 (±0.05) | 0.78 (±0.06) | 0.78 (±0.06) | 0.81 (±0.05) | 0.87 (±0.04) |
Baseline | 0.75 (±0.08) | 0.75 (±0.09) | 0.71 (±0.10) | 0.72 (±0.10) | 0.86 (±0.09) |

Test set

Features | Precision | Recall | Macro-F1 | Accuracy | AUC |
---|---|---|---|---|---|
N-Gram L + D | 0.73 | 0.76 | 0.74 | 0.76 | 0.81 |
N-Gram L + D | 0.71 | 0.73 | 0.72 | 0.74 | 0.78 |
TFIDF L + D | 0.72 | 0.74 | 0.73 | 0.76 | 0.79 |
BERT | 0.76 | 0.79 | 0.76 | 0.77 | 0.83 |
Baseline | 0.64 | 0.66 | 0.61 | 0.62 | 0.66 |

LR: logistic regression; SVM: linear support vector machine; RF: random forest; LSTM NN: long short-term memory neural network.
L + D: lemmatized and debiased (see Supplement section “Debiasing” for more information about debiasing).
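
For readers who want to reproduce the kind of figures reported in Table 5, the following is a minimal sketch of macro-averaging and of summarizing a metric across folds. It is not the authors' code. It assumes that macro-F1 is the unweighted mean of per-class F1 scores and that the values in parentheses are standard deviations across the five folds; neither is stated in the table. The function names `macro_scores` and `summarize_folds` are hypothetical.

```python
from statistics import mean, stdev


def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged with equal weight per class.

    Assumption: macro-F1 is the mean of per-class F1 scores, not the F1 of
    the macro-averaged precision and recall.
    """
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (empty denominators) are scored as 0.0 here.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)


def summarize_folds(fold_values):
    """Return (mean, sample standard deviation) of one metric across folds.

    Assumption: the spread shown in parentheses is a standard deviation.
    """
    return mean(fold_values), stdev(fold_values)


if __name__ == "__main__":
    # Toy binary labels, purely illustrative (not data from the study).
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1]
    precision, recall, f1 = macro_scores(y_true, y_pred)
    print(f"precision={precision:.2f} recall={recall:.2f} macro-F1={f1:.2f}")
```

In a cross-validation setting, `macro_scores` would be called once per fold and each metric's five values passed to `summarize_folds`; the test-set rows correspond to a single call on the held-out 20%.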