
Table 4. Details of the models’ architecture.

The tokenization used by each Transformer model is shown along with the number of layers (L), the hidden size of the model (Hm), the dimension of the feed-forward layer (Hff), the number of attention heads (A), the size of the vocabulary (V), and the number of parameters (Params).

Model        Tokenization     L    A    Hm    Hff    V     Params
BETO         WordPiece        12   12   768   3,072  31 K  110 M
ALBETO       SentencePiece    12   12   768   3,072  31 K  12 M
DistilBETO   WordPiece        6    12   768   3,072  31 K  67 M
MarIA        Byte-level BPE   12   12   768   3,072  50 K  125 M
BERTIN       Byte-level BPE   12   12   768   3,072  50 K  125 M
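
The architectural figures above can be checked against the public checkpoints on the Hugging Face Hub. The following is a minimal sketch, assuming the Hub identifiers listed in the script (these names are our assumption and may differ from the exact checkpoints evaluated here); it reads L, A, Hm, Hff, and V from each model's configuration and reports the parameter count.

# Sketch: inspect architecture details of the Spanish Transformer checkpoints.
# The Hub identifiers below are assumptions; substitute the exact checkpoints if they differ.
from transformers import AutoConfig, AutoModel

CHECKPOINTS = {
    "BETO": "dccuchile/bert-base-spanish-wwm-cased",
    "ALBETO": "dccuchile/albert-base-spanish",
    "DistilBETO": "dccuchile/distilbert-base-spanish-uncased",
    "MarIA": "PlanTL-GOB-ES/roberta-base-bne",
    "BERTIN": "bertin-project/bertin-roberta-base-spanish",
}

for name, ckpt in CHECKPOINTS.items():
    config = AutoConfig.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)
    # DistilBERT configs use n_layers/n_heads/dim/hidden_dim instead of the
    # BERT-style attribute names, hence the getattr fallbacks below.
    print(
        f"{name}: "
        f"L={getattr(config, 'num_hidden_layers', getattr(config, 'n_layers', '?'))}, "
        f"A={getattr(config, 'num_attention_heads', getattr(config, 'n_heads', '?'))}, "
        f"Hm={getattr(config, 'hidden_size', getattr(config, 'dim', '?'))}, "
        f"Hff={getattr(config, 'intermediate_size', getattr(config, 'hidden_dim', '?'))}, "
        f"V={config.vocab_size}, "
        f"Params={model.num_parameters() / 1e6:.0f} M"
    )

Note that ALBETO reports only 12 M parameters despite sharing Hm = 768 with the other 12-layer models, because its ALBERT-style architecture factorizes the embedding matrix and shares parameters across layers.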