2023 Apr 30:1–25. Online ahead of print. doi: 10.1007/s41701-023-00143-0

Table 1.

Basic statistics of English corpus and samples

| | 0.1% Sample | 0.5% Sample | 1% Sample | Full corpus |
|---|---|---|---|---|
| N tweets (actual) | 923,550 | 3,940,969 | 7,245,394 | 352,556,633 |
| N tweets (rep.) | 1,104,964 | 5,526,188 | 11,052,737 | 1,117,379,746 |
| N tokens (actual) | 28,754,912 | 109,303,100 | 199,369,396 | 9,134,879,457 |
| N tokens (rep.) | 31,236,676 | 156,348,389 | 312,746,780 | 31,292,640,403 |
| Space saving [a] | 16.42% | 28.69% | 34.45% | 68.45% |

[a] Space saving is calculated with the equation $1 - \frac{\text{CompressedSize}}{\text{UncompressedSize}}$, expressed as a percentage and using the number of tweets.
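As a quick check of the footnote's formula, the sketch below recomputes the space-saving row from the tweet counts in Table 1. It assumes that the "actual" count plays the role of the compressed size and the "rep." count the role of the uncompressed size; the footnote does not spell out this mapping, but it is the one that reproduces the reported percentages. The function and variable names are illustrative only.

```python
def space_saving(compressed: int, uncompressed: int) -> float:
    """Return space saving as a percentage: (1 - compressed / uncompressed) * 100."""
    if uncompressed <= 0:
        raise ValueError("uncompressed size must be positive")
    return (1 - compressed / uncompressed) * 100


# Tweet counts copied from Table 1 as (actual, rep.) pairs.
# Assumption: actual = compressed size, rep. = uncompressed size.
TWEET_COUNTS = {
    "0.1% Sample": (923_550, 1_104_964),
    "0.5% Sample": (3_940_969, 5_526_188),
    "1% Sample": (7_245_394, 11_052_737),
    "Full corpus": (352_556_633, 1_117_379_746),
}

# Round to two decimals to match the precision used in the table.
savings = {
    name: round(space_saving(actual, rep), 2)
    for name, (actual, rep) in TWEET_COUNTS.items()
}

for name, pct in savings.items():
    print(f"{name}: {pct:.2f}%")
```

Under that assumption the computed values agree with the table's last row (16.42%, 28.69%, 34.45%, 68.45%).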