Table 5. Content analysis of n-grams in the location and text fields. For each category, we show the fraction of total weight in all location estimates that comes from n-grams of that category; e.g., 49% of all estimate weight in the good estimates came from n-grams in the city category. Weights do not add up to 100% because the time zone and language fields are not included. Weights that are significantly greater in good estimates than in bad ones (or vice versa) are marked with a significance code (o = 0.1, * = 0.05, ** = 0.01, *** = 0.001), determined using a Mann-Whitney U test with Bonferroni correction. The null hypothesis is that the mean weight assigned to a category over all n-grams in the good set equals the mean weight for the same category in the bad set. Categories with less than 1.5% weight in both classes are rolled up into other. We also show the top-weighted examples in each category.
Category | Good | Bad | Examples
---|---|---|---
location | *** 0.83 | 0.19 |
city | *** 0.49 | 0.09 | edinburgh, roma, leicester, houston tx
country | ** 0.10 | 0.03 | singapore, the netherlands, nederland, janeiro brasil
generic | 0.01 | 0.02 | de mar, puerta de, beach, rd singapore
state | *** 0.14 | 0.02 | maryland, houston tx, puebla, connecticut
other lo | *** 0.09 | 0.02 | essex, south yorkshire, yorkshire, gloucestershire
not-location | 0.07 | 0.57 *** |
dutch word | *** 0.02 | 0.00 | zien, bij de, uur, vrij
english word | 0.01 | 0.37 *** | st new, i, pages, check my
letter | 0.01 | 0.04 | μ, w, α, s
slang | 0.00 | 0.08 *** | bitch, lad, ass, cuz
spanish word | 0.00 | 0.07 *** | mucha, niña, los, suerte
swedish word | 0.00 | 0.02 | rätt, jävla, på, kul
turkish word | 0.02 | 0.00 | kar, restoran, biraz, daha
untranslated | 0.02 | 0.00 | cewe, gading, ung, suria
technical | ** 0.03 | 0.02 |
foursquare | *** 0.03 | 0.00 | paulo http, istanbul http, miami http, brasília http
url | 0.00 | 0.02 | co, http, http t, co h
other | 0.03 | 0.04 |
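
The procedure described in the caption of Table 5 (per-category weight fractions, a Mann-Whitney U test, Bonferroni correction, and significance codes) might be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the input format of `(category, weight)` pairs, the two-sided normal approximation to the U test, and the strict `<` thresholds are all assumptions made here. The fractions are also normalized over the supplied n-grams only, whereas the table normalizes over all fields.

```python
import math
from collections import Counter, defaultdict


def weight_fractions(ngrams):
    """Map each category to its share of the total weight.

    `ngrams` is an iterable of (category, weight) pairs (hypothetical format).
    """
    totals = defaultdict(float)
    for category, weight in ngrams:
        totals[category] += weight
    grand_total = sum(totals.values())
    return {c: w / grand_total for c, w in totals.items()}


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_p(a, b):
    """Two-sided p-value via the normal approximation (tie-corrected,
    no continuity correction). An assumption: the source does not say
    which variant of the test was used."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def significance_code(p_value, n_tests):
    """Apply a Bonferroni correction, then map to the caption's codes."""
    adjusted = min(1.0, p_value * n_tests)
    for threshold, code in ((0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, "o")):
        if adjusted < threshold:
            return code
    return ""
```

Under these assumptions, one would call `mann_whitney_p` once per category on the per-n-gram weights from the good and bad sets, then pass each p-value to `significance_code` with `n_tests` set to the number of categories compared.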