Table 4. Performance scores.

**a. Sentence task**

| Model | Training data | Precision | Recall | F1 score |
|---|---|---|---|---|
| LSTM | Original | 0.28 | 0.20 | 0.23 |
| LSTM | Balanced | 0.10 | 0.96 | 0.19 |
| LSTM | Under-sampling | 0.41 | 0.33 | 0.37 |
| Bi-LSTM | Original | 0.35 | 0.33 | 0.34 |
| Bi-LSTM | Balanced | 0.15 | 0.86 | 0.26 |
| Bi-LSTM | Under-sampling | 0.33 | 0.46 | 0.38 |
| BERT | Original | 0.43 | 0.23 | 0.30 |
| BERT | Balanced | 0.03 | 0.56 | 0.07 |
| BERT | Under-sampling | 0.45 | 0.66 | 0.54 |
**b. User task**

| Model | Training data | Precision | Recall | F1 score |
|---|---|---|---|---|
| LSTM | Original | 0.66 | 0.52 | 0.58 |
| LSTM | Balanced | 0.23 | 1.00 | 0.37 |
| LSTM | Under-sampling | 0.57 | 0.57 | 0.57 |
| Bi-LSTM | Original | 0.65 | 0.68 | 0.66 |
| Bi-LSTM | Balanced | 0.30 | 1.00 | 0.46 |
| Bi-LSTM | Under-sampling | 0.50 | 0.68 | 0.57 |
| BERT | Original | 0.53 | 0.36 | 0.43 |
| BERT | Balanced | 0.13 | 0.93 | 0.23 |
| BERT | Under-sampling | 0.63 | 0.82 | 0.71 |
Precision, recall and F1 scores are shown for the sentence task (a) and the user task (b). The deep-learning NLP models used in this study are LSTM, Bi-LSTM and BERT. The percentage of positive data in the training dataset is approximately 2.7% for “Original” (the same ratio as in the original population), 50% for “Balanced” and 5% for “Under-sampling”.
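The table caption summarizes how the training sets were rebalanced and how the metrics are reported; the paper does not include code for this step, so the following is only a minimal sketch, assuming binary labels in {0, 1} with 1 as the positive class. The helper names `undersample` and `report` are illustrative, not from the original study. It shows how negatives could be randomly dropped to reach a target positive ratio (e.g. 0.05 for “Under-sampling”) and how precision, recall and F1 for the positive class would be computed on an untouched test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def undersample(X, y, positive_ratio, seed=0):
    """Randomly drop negatives until positives make up `positive_ratio`
    of the training set (e.g. 0.027 "Original", 0.5 "Balanced", 0.05 "Under-sampling")."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # number of negatives so that pos / (pos + neg) == positive_ratio
    n_neg = int(len(pos_idx) * (1 - positive_ratio) / positive_ratio)
    keep_neg = rng.choice(neg_idx, size=min(n_neg, len(neg_idx)), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(keep)
    return X[keep], y[keep]

def report(y_true, y_pred):
    """Precision, recall and F1 for the positive class, as reported in Table 4."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"precision": p, "recall": r, "f1": f1}
```

Note that only the training set is resampled in this sketch; the test set keeps the original class ratio so that the reported scores reflect performance on the natural distribution.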