| Hyperparameter | Description | Search range | Selected value |
| --- | --- | --- | --- |
| Number of layers | Number of recurrent layers stacked on top of each other. | [1; 3] | 1 |
| Hidden size | Size of the hidden state vector. | [10; 500] | 300 |
| Learning rate | Rate at which network weights are updated during training. | [10⁻⁶; 1] | 0.0023 |
| L2 regularization | Strength of the L2 weight regularization. | [0; 10] | 0.0052 |
| Gradient clipping | Whether gradient clipping (Pascanu et al., 2013), which limits the gradient magnitude to a specified maximum value, is applied. | [yes; no] | Yes |
| Max. gradient | Value at which the gradients are clipped. | [0.1; 2] | 1 |
| Dropout | Fraction of units set to 0 during training for regularization purposes (Srivastava et al., 2014). | [0; 0.2] | 0 |
| Residual connection | Whether the input is fed directly to the linear decoder, bypassing the RNN's computation. | [yes; no] | No |
| Batch size | Number of training trials fed into the network before each weight update. | [3; 20] | 12 |
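
To make the table concrete, the sketch below wires the selected values into a small PyTorch training setup. It is a minimal illustration, not the authors' implementation: the RNN cell type, the optimizer (Adam), the loss function, and the input/output dimensions are assumptions for illustration; only the hyperparameter values themselves come from the table. Here the L2 strength is mapped onto the optimizer's `weight_decay` term, which is one common way to realize an L2 penalty.

```python
# Minimal sketch: selected hyperparameters from the table in a PyTorch setup.
# Task dimensions, cell type, optimizer, and loss are assumptions.
import torch
import torch.nn as nn

INPUT_SIZE, OUTPUT_SIZE = 64, 2  # hypothetical task dimensions


class RecurrentModel(nn.Module):
    def __init__(self, num_layers=1, hidden_size=300, dropout=0.0,
                 residual=False):
        super().__init__()
        self.residual = residual
        # Recurrent layers are stacked via num_layers; PyTorch applies
        # dropout between stacked layers, so it is only active if > 1.
        self.rnn = nn.RNN(INPUT_SIZE, hidden_size, num_layers=num_layers,
                          dropout=dropout if num_layers > 1 else 0.0,
                          batch_first=True)
        # With a residual connection, the input bypasses the RNN and is
        # concatenated with its output before the linear decoder.
        decoder_in = hidden_size + (INPUT_SIZE if residual else 0)
        self.decoder = nn.Linear(decoder_in, OUTPUT_SIZE)

    def forward(self, x):
        out, _ = self.rnn(x)
        if self.residual:
            out = torch.cat([out, x], dim=-1)
        return self.decoder(out)


# Selected values: 1 layer, hidden size 300, no dropout, no residual.
model = RecurrentModel(num_layers=1, hidden_size=300, dropout=0.0,
                       residual=False)
# L2 strength 0.0052 as weight decay; Adam itself is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0023,
                             weight_decay=0.0052)
criterion = nn.MSELoss()  # placeholder loss


def train_step(batch_x, batch_y):
    """One weight update on a batch of training trials."""
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    loss.backward()
    # Gradient clipping (Pascanu et al., 2013) at the selected maximum of 1.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()


# Example update with random data at the selected batch size of 12.
x = torch.randn(12, 50, INPUT_SIZE)   # (batch, time steps, features)
y = torch.randn(12, 50, OUTPUT_SIZE)
train_step(x, y)
```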