Table 1.
| | 0.1% Sample | 0.5% Sample | 1% Sample | Full corpus |
---|---|---|---|---|
N tweets (actual) | 923,550 | 3,940,969 | 7,245,394 | 352,556,633 |
N tweets (rep.) | 1,104,964 | 5,526,188 | 11,052,737 | 1,117,379,746 |
N tokens (actual) | 28,754,912 | 109,303,100 | 199,369,396 | 9,134,879,457 |
N tokens (rep.) | 31,236,676 | 156,348,389 | 312,746,780 | 31,292,640,403 |
Space saving<sup>a</sup> | 16.42% | 28.69% | 34.45% | 68.45% |

<sup>a</sup> Space saving is calculated with the equation $1 - \frac{N_{\text{actual}}}{N_{\text{rep.}}}$, expressed as a percentage and using the number of tweets.
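
As a quick check on the footnote, the sketch below recomputes the space-saving row from the tweet counts in Table 1. It assumes the usual definition of space saving, one minus the ratio of actual to rep. tweets. That assumption is not spelled out in the text as extracted, but it does reproduce the reported percentages. The function and variable names are illustrative only.

```python
def space_saving(actual: int, rep: int) -> float:
    """Space saving as a percentage, assuming the form (1 - actual / rep) * 100."""
    return (1 - actual / rep) * 100


# (N tweets actual, N tweets rep.) copied from Table 1
tweet_counts = {
    "0.1% Sample": (923_550, 1_104_964),
    "0.5% Sample": (3_940_969, 5_526_188),
    "1% Sample": (7_245_394, 11_052_737),
    "Full corpus": (352_556_633, 1_117_379_746),
}

# Round to two decimals to match the precision used in the table
savings = {
    name: round(space_saving(actual, rep), 2)
    for name, (actual, rep) in tweet_counts.items()
}

for name, value in savings.items():
    print(f"{name}: {value:.2f}%")
```

Under this assumption the results match the table's last row, and they show space saving growing with sample size.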