2023 Jan 27;23(3):1423. doi: 10.3390/s23031423

Table 5.

Statistics of the corpora used to pretrain the GPT-2 model in Spanish, French and Norwegian. In Norwegian, values in brackets refer to the data prior to the addition of the OSCAR corpus.

                      Spanish   French   Norwegian
Amount of raw text    10 GB     7 GB     5 GB (1 GB)
Number of sentences   230 M     121 M    30 M (14 M)
Running words         1.7 B     1.3 B    750 M (150 M)