Table 5.
Statistics of the corpora used to pretrain the GPT-2 model in Spanish, French and Norwegian. For Norwegian, values in parentheses refer to the data before the OSCAR corpus was added.
| | Spanish | French | Norwegian |
|---|---|---|---|
| Amount of raw text | 10 GB | 7 GB | 5 GB (1 GB) |
| Number of sentences | 230 M | 121 M | 30 M (14 M) |
| Running words | 1.7 B | 1.3 B | 750 M (150 M) |