Table 7. Evaluation results of Task 2 on the test dataset, reported as averages with their calculated confidence intervals.
The best results are presented in bold.
| Model | Metric | Vocab size 8,000 | Vocab size 16,000 | Vocab size 32,000 |
|---|---|---|---|---|
| LSTM | BLEU | 0.560 ± 0.009 | 0.574 ± 0.006 | 0.592 ± 0.006 |
| LSTM | ROUGE | 0.596 ± 0.011 | 0.602 ± 0.005 | 0.623 ± 0.007 |
| BiLSTM | BLEU | 0.612 ± 0.009 | 0.649 ± 0.003 | 0.671 ± 0.011 |
| BiLSTM | ROUGE | 0.635 ± 0.015 | 0.673 ± 0.008 | 0.695 ± 0.005 |
| Transformer | BLEU | 0.912 ± 0.011 | **0.946 ± 0.006** | 0.927 ± 0.011 |
| Transformer | ROUGE | 0.932 ± 0.014 | **0.950 ± 0.010** | 0.949 ± 0.004 |
| T5 small | BLEU | – | – | 0.906 ± 0.005 |
| T5 small | ROUGE | – | – | 0.928 ± 0.009 |
| T5 base | BLEU | – | – | 0.902 ± 0.004 |
| T5 base | ROUGE | – | – | 0.937 ± 0.008 |