Table 4. Optimal hyper-parameters for the transformer neural network architecture for Task 1.
| Model | Vocabulary size | Embedding dimension | Batch size | Number of heads | Attention layers | Importance | Optimizer |
|---|---|---|---|---|---|---|---|
| Transformer | 2,500 | 16 | 32 | 2 | 1 | 1.245 | RMSprop |
| Transformer | 5,000 | 8 | 128 | 4 | 2 | 1.928 | RMSprop |
| Transformer | 10,000 | 8 | 64 | 8 | 4 | 1.594 | Adam |
| Transformer | 16,000 | 8 | 32 | 2 | 2 | 1.117 | Adamax |
| Transformer | 32,000 | 8 | 64 | 8 | 4 | 2.226 | RMSprop |
| Transformer | 64,000 | 8 | 64 | 4 | 4 | 1.764 | RMSprop |
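As a minimal sketch, the tuned settings in Table 4 can be captured as plain configuration records and looked up by vocabulary size; the field names below (`vocab_size`, `embed_dim`, etc.) are illustrative choices, not identifiers from the original study:

```python
# Hypothetical encoding of Table 4: one record per tuned Transformer
# configuration for Task 1. Field names are illustrative, not from the paper.
CONFIGS = [
    {"vocab_size": 2_500,  "embed_dim": 16, "batch_size": 32,  "num_heads": 2, "attn_layers": 1, "importance": 1.245, "optimizer": "RMSprop"},
    {"vocab_size": 5_000,  "embed_dim": 8,  "batch_size": 128, "num_heads": 4, "attn_layers": 2, "importance": 1.928, "optimizer": "RMSprop"},
    {"vocab_size": 10_000, "embed_dim": 8,  "batch_size": 64,  "num_heads": 8, "attn_layers": 4, "importance": 1.594, "optimizer": "Adam"},
    {"vocab_size": 16_000, "embed_dim": 8,  "batch_size": 32,  "num_heads": 2, "attn_layers": 2, "importance": 1.117, "optimizer": "Adamax"},
    {"vocab_size": 32_000, "embed_dim": 8,  "batch_size": 64,  "num_heads": 8, "attn_layers": 4, "importance": 2.226, "optimizer": "RMSprop"},
    {"vocab_size": 64_000, "embed_dim": 8,  "batch_size": 64,  "num_heads": 4, "attn_layers": 4, "importance": 1.764, "optimizer": "RMSprop"},
]

def config_for_vocab(vocab_size):
    """Return the tuned hyper-parameter record for a given vocabulary size."""
    return next(c for c in CONFIGS if c["vocab_size"] == vocab_size)
```

Representing the search results this way makes it straightforward to rebuild any of the six tuned models programmatically, e.g. `config_for_vocab(10_000)` selects the Adam-optimized configuration.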