Table 2. Model Configuration and Hyperparameters.
hyperparameter | value |
---|---|
batch size per GPU (11 GB) | 8 |
gradient accumulation steps | 32 |
effective batch size | 4096 |
peak learning rate | 0.0006 |
hidden size | 768 |
intermediate size | 3072 |
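
The table does not list the number of GPUs, but it can be inferred from the other rows if the effective batch size is assumed to be the product of the per-GPU batch size, the gradient accumulation steps, and the number of data-parallel GPUs. The sketch below is a minimal consistency check under that assumption. The names `TrainConfig` and `implied_num_gpus` are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 2; the field names are illustrative."""
    batch_size_per_gpu: int = 8
    gradient_accumulation_steps: int = 32
    effective_batch_size: int = 4096
    peak_learning_rate: float = 6e-4
    hidden_size: int = 768
    intermediate_size: int = 3072


def implied_num_gpus(cfg: TrainConfig) -> int:
    """Infer the data-parallel GPU count, ASSUMING
    effective = per-GPU batch * accumulation steps * number of GPUs."""
    samples_per_gpu_per_update = (
        cfg.batch_size_per_gpu * cfg.gradient_accumulation_steps
    )
    if cfg.effective_batch_size % samples_per_gpu_per_update != 0:
        raise ValueError("effective batch size is not a multiple of the per-GPU total")
    return cfg.effective_batch_size // samples_per_gpu_per_update


cfg = TrainConfig()
print(implied_num_gpus(cfg))                      # 16
print(cfg.intermediate_size // cfg.hidden_size)   # 4
```

Under this reading, each GPU contributes 8 × 32 = 256 samples per optimizer update, so an effective batch size of 4096 corresponds to 16 GPUs. The intermediate size is four times the hidden size.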