
Table 4. Details of the models’ architecture.

The tokenization used by each Transformer model is shown along with the number of layers (L), the hidden size of the model (Hm), the dimension of the feed-forward layer (Hff), the number of attention heads (A), the size of the vocabulary (V), and the number of parameters (Params).

Model        Tokenization     L    A    Hm    Hff    V     Params
BETO         WordPiece        12   12   768   3,072  31 K  110 M
ALBETO       SentencePiece    12   12   768   3,072  31 K  12 M
DistilBETO   WordPiece        6    12   768   3,072  31 K  67 M
MarIA        Byte-level BPE   12   12   768   3,072  50 K  125 M
BERTIN       Byte-level BPE   12   12   768   3,072  50 K  125 M
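
The architectural figures above can be checked against the public checkpoints on the Hugging Face Hub. The following is a minimal sketch, assuming the Hub identifiers listed in the script (these names are our assumption and may differ from the exact checkpoints evaluated here); it reads L, A, Hm, Hff, and V from each model's configuration and reports the parameter count.

# Sketch: inspect architecture details of the Spanish Transformer checkpoints.
# The Hub identifiers below are assumptions; substitute the exact checkpoints if they differ.
from transformers import AutoConfig, AutoModel

CHECKPOINTS = {
    "BETO": "dccuchile/bert-base-spanish-wwm-cased",
    "ALBETO": "dccuchile/albert-base-spanish",
    "DistilBETO": "dccuchile/distilbert-base-spanish-uncased",
    "MarIA": "PlanTL-GOB-ES/roberta-base-bne",
    "BERTIN": "bertin-project/bertin-roberta-base-spanish",
}

for name, ckpt in CHECKPOINTS.items():
    config = AutoConfig.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)
    # DistilBERT configs use n_layers/n_heads/dim/hidden_dim instead of the
    # BERT-style attribute names, hence the getattr fallbacks below.
    print(
        f"{name}: "
        f"L={getattr(config, 'num_hidden_layers', getattr(config, 'n_layers', '?'))}, "
        f"A={getattr(config, 'num_attention_heads', getattr(config, 'n_heads', '?'))}, "
        f"Hm={getattr(config, 'hidden_size', getattr(config, 'dim', '?'))}, "
        f"Hff={getattr(config, 'intermediate_size', getattr(config, 'hidden_dim', '?'))}, "
        f"V={config.vocab_size}, "
        f"Params={model.num_parameters() / 1e6:.0f} M"
    )

Note that ALBETO reports only 12 M parameters despite sharing Hm = 768 with the other 12-layer models, because its ALBERT-style architecture factorizes the embedding matrix and shares parameters across layers.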