2023 Jan 27;23(3):1423. doi: 10.3390/s23031423

Table 5.

Statistics of the corpora used to pretrain the GPT-2 model in Spanish, French and Norwegian. For Norwegian, values in parentheses refer to the data before the OSCAR corpus was added.

	Spanish	French	Norwegian
Amount of raw text	10 GB	7 GB	5 GB (1 GB)
Number of sentences	230 M	121 M	30 M (14 M)
Running words	1.7 B	1.3 B	750 M (150 M)