2018 Mar 13;13(3):e0194317. doi: 10.1371/journal.pone.0194317

Table 1. Sentiment label distribution of Twitter datasets in 13 languages.

The last column is a qualitative assessment of the annotation quality, based on the levels of the self- and inter-annotator agreement.

| Language | Code | Negative | Neutral | Positive | Total | Quality |
|---|---|---|---|---|---|---|
| Albanian | alb | 7,062 | 15,066 | 23,630 | 45,758 | poor |
| Bulgarian | bul | 14,374 | 28,961 | 19,932 | 63,267 | fair |
| English | eng | 23,250 | 38,457 | 25,721 | 87,428 | very good |
| German | ger | 19,039 | 52,166 | 26,743 | 97,948 | fair |
| Hungarian | hun | 9,062 | 17,833 | 30,410 | 57,305 | good |
| Polish | pol | 59,027 | 48,658 | 84,245 | 191,930 | good |
| Portuguese | por | 56,008 | 53,026 | 43,009 | 152,043 | fair |
| Russian | rus | 30,249 | 37,401 | 25,671 | 93,321 | good |
| Ser/Cro/Bos | scb | 58,796 | 61,265 | 73,766 | 193,827 | fair |
| Slovak | slk | 15,060 | 13,112 | 30,598 | 58,770 | good |
| Slovenian | slv | 34,164 | 48,458 | 30,210 | 112,832 | good |
| Spanish | spa | 27,675 | 88,481 | 117,048 | 233,204 | poor |
| Swedish | swe | 22,381 | 15,387 | 13,630 | 51,398 | good |
| Total | | 376,147 | 518,271 | 544,613 | 1,439,031 | |
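
The counts in Table 1 can be turned into per-language label shares with a few lines of Python. The sketch below is illustrative only and is not part of the original study: the names `COUNTS`, `label_shares`, and `column_totals` are hypothetical, and the numbers are copied by hand from the table.

```python
# Label counts copied from Table 1, in the order (negative, neutral, positive).
# The dictionary name and the helper functions are hypothetical illustrations.
COUNTS = {
    "alb": (7062, 15066, 23630),
    "bul": (14374, 28961, 19932),
    "eng": (23250, 38457, 25721),
    "ger": (19039, 52166, 26743),
    "hun": (9062, 17833, 30410),
    "pol": (59027, 48658, 84245),
    "por": (56008, 53026, 43009),
    "rus": (30249, 37401, 25671),
    "scb": (58796, 61265, 73766),
    "slk": (15060, 13112, 30598),
    "slv": (34164, 48458, 30210),
    "spa": (27675, 88481, 117048),
    "swe": (22381, 15387, 13630),
}


def label_shares(counts):
    """Return the (negative, neutral, positive) shares of one dataset in percent."""
    total = sum(counts)
    return tuple(round(100 * c / total, 1) for c in counts)


def column_totals(table):
    """Sum each label column over all languages."""
    return tuple(sum(column) for column in zip(*table.values()))


if __name__ == "__main__":
    for code, counts in COUNTS.items():
        print(code, sum(counts), label_shares(counts))
    totals = column_totals(COUNTS)
    print("total", totals, sum(totals))
```

Run as a script, this prints each dataset's size and label shares, and the column sums it computes match the Total row of the table. The shares make the class imbalance easier to compare across languages than the raw counts do.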