Table 2.
Description of the studied pre-trained lightweight LLMs. On-disk model size is largely determined by the number of trainable parameters, which in turn grows with the number of layers and attention heads.
Model | Layers | Att. heads | Parameters | Size |
---|---|---|---|---|
BERT Base | 12 | 12 | 110M | 440 MB |
BERT Large | 24 | 16 | 340M | 1.2 GB |
ALBERT Base | 12 | 12 | 11M | 63 MB |
ALBERT Large | 24 | 16 | 17M | 87 MB |
RoBERTa Base | 12 | 12 | 82M | 499 MB |
RoBERTa Large | 24 | 16 | 355M | 1.6 GB |
XLNet Base | 12 | 12 | 110M | 565 MB |
XLNet Large | 24 | 16 | 340M | 1.57 GB |
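
The parameter counts in Table 2 can be checked by loading each model and summing its trainable parameters. The sketch below uses the Hugging Face `transformers` library; the checkpoint identifiers (e.g. `bert-base-uncased`) are assumptions, since the table does not name the exact checkpoints, and counts may differ slightly depending on which head or pooling layers a given checkpoint includes.

```python
# Minimal sketch: count trainable parameters of the base checkpoints.
# Checkpoint names are illustrative assumptions, not taken from the paper.
from transformers import AutoModel

CHECKPOINTS = {
    "BERT Base": "bert-base-uncased",
    "ALBERT Base": "albert-base-v2",
    "RoBERTa Base": "roberta-base",
    "XLNet Base": "xlnet-base-cased",
}

for name, ckpt in CHECKPOINTS.items():
    model = AutoModel.from_pretrained(ckpt)
    # Sum only parameters that require gradients, matching the
    # "Parameters" column (trainable parameters) in Table 2.
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{name}: {n_params / 1e6:.0f}M trainable parameters")
```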