Table 4. Performance scores.
a. Sentence task

| | Precision | Recall | F1 score |
| --- | --- | --- | --- |
| LSTM | | | |
| Original | 0.28 | 0.20 | 0.23 |
| Balanced | 0.10 | 0.96 | 0.19 |
| Under-sampling | 0.41 | 0.33 | 0.37 |
| Bi-LSTM | | | |
| Original | 0.35 | 0.33 | 0.34 |
| Balanced | 0.15 | 0.86 | 0.26 |
| Under-sampling | 0.33 | 0.46 | 0.38 |
| BERT | | | |
| Original | 0.43 | 0.23 | 0.30 |
| Balanced | 0.03 | 0.56 | 0.07 |
| Under-sampling | 0.45 | 0.66 | 0.54 |
b. User task

| | Precision | Recall | F1 score |
| --- | --- | --- | --- |
| LSTM | | | |
| Original | 0.66 | 0.52 | 0.58 |
| Balanced | 0.23 | 1.00 | 0.37 |
| Under-sampling | 0.57 | 0.57 | 0.57 |
| Bi-LSTM | | | |
| Original | 0.65 | 0.68 | 0.66 |
| Balanced | 0.30 | 1.00 | 0.46 |
| Under-sampling | 0.50 | 0.68 | 0.57 |
| BERT | | | |
| Original | 0.53 | 0.36 | 0.43 |
| Balanced | 0.13 | 0.93 | 0.23 |
| Under-sampling | 0.63 | 0.82 | 0.71 |
Precision, recall and F1 scores are shown for the sentence task (a) and the user task (b). The NLP deep-learning models used in this study are LSTM, Bi-LSTM and BERT. The percentage of positive data in the training dataset is approximately 2.7% for “Original” (the same ratio as in the original population), 50% for “Balanced” and 5% for “Under-sampling”.
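As a minimal sketch of how these two pieces fit together, the snippet below under-samples the negative class to a chosen positive ratio (e.g. 0.05 for the “Under-sampling” rows, 0.50 for “Balanced”) and scores predictions with precision, recall and F1 for the positive class only, as reported in Table 4. This is not the authors' pipeline; it assumes scikit-learn and NumPy, and the array names and the `under_sample` helper are illustrative placeholders.

```python
# Illustrative sketch (not the authors' code): resample the training set to a
# target positive ratio, then report positive-class precision/recall/F1.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def under_sample(X, y, target_positive_ratio=0.05, seed=0):
    """Keep all positives; randomly drop negatives until positives make up
    roughly `target_positive_ratio` of the training set."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # If positives should be a fraction r of the total, keep P*(1-r)/r negatives.
    n_neg_keep = int(len(pos_idx) * (1 - target_positive_ratio) / target_positive_ratio)
    neg_keep = rng.choice(neg_idx, size=min(n_neg_keep, len(neg_idx)), replace=False)
    keep = np.concatenate([pos_idx, neg_keep])
    rng.shuffle(keep)
    return X[keep], y[keep]

def positive_class_scores(y_true, y_pred):
    """Precision, recall and F1 for the positive class on an untouched test set."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"precision": p, "recall": r, "f1": f1}
```

The evaluation set is left at its original class ratio; only the training data is resampled, which is why the “Balanced” rows trade very high recall for low precision while “Under-sampling” gives the more even F1 scores seen above.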