Table 3.
Paired classification performance (all instances) on the three data sets. ADR F-scores, non-ADR F-scores, accuracies, and 95% confidence intervals (CIs) for accuracy are shown for each train-test combination.
| Test data | Training data | ADR F-score | Non-ADR F-score | Accuracy (%) | 95% CI (%) |
|---|---|---|---|---|---|
| ADE | ADE | 0.812 | 0.914 | 88.2 | 87.3 – 89.1 |
| | ADE+DSALL | 0.789 | 0.904 | 86.9 | 85.9 – 87.8 |
| | ADE+TWALL | 0.800 | 0.912 | 87.7 | 86.8 – 88.7 |
| TW | TW | 0.538 | 0.919 | 86.2 | 84.7 – 87.6 |
| | TW+ADEALL | 0.545 | 0.941 | 88.6 | 87.2 – 89.7 |
| | TW+DSALL | 0.597* | 0.943 | 90.1 | 88.7 – 91.3 |
| DS | DS | 0.678 | 0.890 | 83.8 | 82.2 – 85.0 |
| | DS+ADEALL | 0.674 | 0.891 | 83.5 | 81.6 – 84.8 |
| | DS+TWALL | 0.704* | 0.899 | 85.0 | 83.3 – 86.5 |
Scores marked with * indicate a statistically significant improvement over the highest score achieved in the binary classification task.
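The table reports a separate F-score for each class (ADR and non-ADR) alongside overall accuracy and a 95% CI. The caption does not say how the intervals were computed. The following minimal sketch assumes a binary confusion matrix and a normal-approximation (Wald) interval for accuracy; the authors may have used a different method, such as bootstrapping. The function names `f_score` and `accuracy_ci` are hypothetical and used only for illustration.

```python
import math


def f_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one class: harmonic mean of that class's precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Accuracy and a normal-approximation (Wald) interval, all in percent.

    Assumption: z = 1.96 gives an approximate 95% interval; this is not
    necessarily the interval method used for the table above.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * (p - half_width), 100 * (p + half_width)


if __name__ == "__main__":
    # Hypothetical confusion matrix (not taken from the paper):
    # 80 ADR correctly labelled, 20 ADR missed, 20 non-ADR flagged as ADR, 380 non-ADR correct.
    tp, fn, fp, tn = 80, 20, 20, 380
    adr_f = f_score(tp, fp, fn)          # ADR treated as the positive class
    non_adr_f = f_score(tn, fn, fp)      # non-ADR treated as the positive class
    acc, lo, hi = accuracy_ci(tp + tn, tp + tn + fp + fn)
    print(f"ADR F={adr_f:.3f}  non-ADR F={non_adr_f:.3f}  acc={acc:.1f}% ({lo:.1f} - {hi:.1f})")
```

Computing each class's F-score by swapping which class counts as "positive" explains why the non-ADR scores can be high while the ADR scores stay low. When ADR instances are the minority, as in the TW data, accuracy is dominated by the majority class.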