
Table 3. Hyperparameters used for the proposed model

| Hyperparameter | Value |
|---|---|
| Model | BART Large MNLI |
| Vocabulary size | 50,265 (number of distinct tokens) |
| Hidden dimensionality | 1024 (dimensionality of the encoder/decoder layers and the pooler layer) |
| No. of encoder layers | 12 |
| No. of decoder layers | 12 |
| Attention heads | 16 per attention layer in the encoder; 16 per attention layer in the decoder |
| Feed-forward layer dimensionality | 4096 in the encoder; 4096 in the decoder |
| Activation function | GELU (nonlinear activation in the encoder and pooler) |
| Max. position embeddings | 1024 |
| Number of labels | 2 |
| Batch size | 8 (tested: 8, 16, 32) |
| Epochs | 10 |
| Sequence length | 700 (also tested: 512, 1024, 2048; 700 best suits the current settings) |
| Learning rate | 1e−4 (tested: 1e−2, 1e−3, 1e−4) |
| Dropout | 0.1 (dropout probability for all fully connected layers; tested in {0.0, 0.1, 0.2, …, 0.9}) |
| Warm-up steps | 500 (tested: 0, 100, 300, 500, 1000) |
| Optimizer | Adam |
| Loss function | Cross-entropy |
| Output layer | Softmax |
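For concreteness, the sketch below shows how a setup matching Table 3 could be assembled. The paper does not specify the implementation, so several details here are assumptions rather than facts from the source: the Hugging Face `facebook/bart-large-mnli` checkpoint as the pretrained model, a linear warm-up/decay schedule (the table gives only the warm-up step count), and the placeholder inputs, labels, and training-step count.

```python
import torch
from transformers import (
    BartForSequenceClassification,
    BartTokenizer,
    get_linear_schedule_with_warmup,
)

# Assumed checkpoint for "BART Large MNLI" in Table 3. num_labels=2 replaces
# the original 3-way MNLI head with a binary one, so ignore_mismatched_sizes
# is needed to skip loading the old head's weights.
model = BartForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-mnli")

# The architectural rows of Table 3 are fixed properties of this checkpoint.
cfg = model.config
assert cfg.vocab_size == 50265              # vocabulary size
assert cfg.d_model == 1024                  # hidden dimensionality
assert cfg.encoder_layers == 12
assert cfg.decoder_layers == 12
assert cfg.encoder_attention_heads == 16
assert cfg.decoder_attention_heads == 16
assert cfg.encoder_ffn_dim == 4096          # feed-forward dimensionality
assert cfg.decoder_ffn_dim == 4096
assert cfg.activation_function == "gelu"
assert cfg.max_position_embeddings == 1024

# Tokenization with the sequence length selected in Table 3.
batch = tokenizer(
    ["example input text"],                 # hypothetical input
    max_length=700,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# Training hyperparameters: Adam at 1e-4 with 500 warm-up steps. The shape of
# the schedule after warm-up is not stated; linear decay is assumed here.
num_training_steps = 1000                   # placeholder: (dataset_size / 8) * 10 epochs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=num_training_steps
)

# Passing labels makes the model return the cross-entropy loss directly;
# softmax over the two logits gives class probabilities at inference.
labels = torch.tensor([1])                  # hypothetical binary label
outputs = model(**batch, labels=labels)
loss = outputs.loss
probs = torch.softmax(outputs.logits, dim=-1)
```

Note that only the lower rows of the table (batch size, epochs, sequence length, learning rate, dropout, warm-up steps) are tunable training choices; the upper rows describe the pretrained BART-large architecture and come with the checkpoint itself.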