| Model | BART Large MNLI |
| --- | --- |
| Vocabulary size | 50,265 (number of distinct tokens) |
| Hidden dimensionality | 1024 (dimensionality of the layers and the pooling layer) |
| No. of encoder layers | 12 |
| No. of decoder layers | 12 |
| Attention heads | 16 per attention layer in the encoder; 16 per attention layer in the decoder |
| Feed-forward layer dimensionality | 4096 in the encoder; 4096 in the decoder |
| Activation function | GELU (nonlinear activation in the encoder and pooler) |
| Max position embeddings | 1024 |
| Number of labels | 2 |
| Batch size | 8 (tested 8, 16, 32) |
| Epochs | 10 |
| Sequence length | 700 (512, 1024, and 2048 were also tested, but 700 best suits the current settings) |
| Learning rate | 1e−4 (tested 1e−2, 1e−3, 1e−4) |
| Dropout | 0.1 (dropout probability for all fully connected layers; tested over {0.0, 0.1, 0.2, …, 0.9}) |
| Warm-up steps | 500 (tested 0, 100, 300, 500, 1000) |
| Optimizer | Adam |
| Loss function | Cross-entropy |
| Output layer | Softmax |
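A minimal sketch of how these settings could be wired up with the Hugging Face transformers library, assuming the `facebook/bart-large-mnli` checkpoint; the inputs (`texts`, `labels`) are placeholders, and the dataset, warm-up scheduler, and full training loop are omitted.

```python
# Sketch only: applies the table's batch/sequence/learning-rate/dropout settings
# to facebook/bart-large-mnli with a 2-label head, Adam, and cross-entropy loss.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "facebook/bart-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# The MNLI checkpoint ships a 3-label classification head;
# re-initialise it for the 2-label setup in the table.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    dropout=0.1,                  # dropout for fully connected layers
    ignore_mismatched_sizes=True,
)

# Placeholder inputs; replace with the actual dataset.
texts = ["example document one", "example document two"]
labels = torch.tensor([0, 1])

batch = tokenizer(
    texts,
    max_length=700,               # sequence length from the table
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# The model computes cross-entropy internally when labels are supplied;
# predictions are the softmax/argmax over the two output logits.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```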