Table 1.
Model hyperparameters.
Hyperparameter | Value |
---|---|
Number of layers | 24 |
Number of heads | 16 |
Hidden size | 1024 |
Optimizer | Adam with = 0.99, = 0.999, |
Learning rate | |
Learning rate scheduler | linear with a warm-up ratio of 0.1 |
Batch size | 16 |
Gradient accumulation steps | 4 |