Author manuscript; available in PMC: 2014 May 2.

Published in final edited form as: CSCW Conf Comput Support Coop Work. 2014:1523–1536. doi: 10.1145/2531602.2531607

Table 5. Content analysis of n-grams in the location and text fields. For each category, we show the fraction of total weight in all location estimates from n-grams of that category; e.g., 49% of all estimate weight in the good estimates was from n-grams with category city (weights do not add up to 100% because time zone and language fields are not included). Weights that are significantly greater in good estimates than bad (or vice versa) are indicated with a significance code (o = 0.1, * = 0.05, ** = 0.01, *** = 0.001) determined using a Mann-Whitney U test with Bonferroni correction, the null hypothesis being that the mean weight assigned to a category over all n-grams in the good set is equal to the mean weight for the same category in the bad set. Categories with less than 1.5% weight in both classes are rolled up into other. We also show the top-weighted examples in each category.

Category		Good	Bad	Examples

location		*** 0.83	0.19
	city	*** 0.49	0.09	edinburgh, roma, leicester, houston tx
	country	** 0.10	0.03	singapore, the netherlands, nederland, janeiro brasil
	generic	0.01	0.02	de mar, puerta de, beach, rd singapore
	state	*** 0.14	0.02	maryland, houston tx, puebla, connecticut
	other lo	*** 0.09	0.02	essex, south yorkshire, yorkshire, gloucestershire
not-location		0.07	0.57 ***
	dutch word	*** 0.02	0.00	zien, bij de, uur, vrij
	english word	0.01	0.37 ***	st new, i, pages, check my
	letter	0.01	0.04	μ, w, α, s
	slang	0.00	0.08 ***	bitch, lad, ass, cuz
	spanish word	0.00	0.07 ***	mucha, niña, los, suerte
	swedish word	0.00	0.02	rätt, jävla, på, kul
	turkish word	0.02	0.00	kar, restoran, biraz, daha
	untranslated	0.02	0.00	cewe, gading, ung, suria
technical		** 0.03	0.02
	foursquare	*** 0.03	0.00	paulo http, istanbul http, miami http, brasília http
	url	0.00	0.02	co, http, http t, co h
other		0.03	0.04
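
The procedure described in the caption of Table 5 (per-category weight fractions, a Mann-Whitney U test per category, Bonferroni correction, and the significance codes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the test uses the tie-corrected normal approximation without continuity correction and relies only on the Python standard library, and every function name and all toy weights below are hypothetical.

```python
import math
from collections import defaultdict


def category_weight_fractions(ngrams):
    """Fraction of total weight per category, given (category, weight) pairs."""
    totals = defaultdict(float)
    for category, weight in ngrams:
        totals[category] += weight
    grand_total = sum(totals.values())
    return {c: w / grand_total for c, w in totals.items()}


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, p-value). Assumption: no continuity
    correction, so p-values differ slightly from exact small-sample tests.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                          # size of this group of tied values
        avg_rank = (i + 1 + j) / 2.0       # mean of ranks i+1 .. j
        in_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += avg_rank * in_x
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                         # all values identical
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return {k: min(1.0, p * m) for k, p in p_values.items()}


def significance_code(p):
    """Map a corrected p-value to the codes used in the table legend."""
    for threshold, code in ((0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, "o")):
        if p < threshold:
            return code
    return ""


# Hypothetical per-n-gram weights for two categories in each set.
good = {"city": [0.9, 0.8, 0.7, 0.85, 0.6], "slang": [0.0, 0.01, 0.0, 0.02, 0.0]}
bad = {"city": [0.1, 0.0, 0.2, 0.05, 0.1], "slang": [0.2, 0.1, 0.3, 0.15, 0.25]}

raw_p = {cat: mann_whitney_u(good[cat], bad[cat])[1] for cat in good}
for cat, p in bonferroni(raw_p).items():
    print(cat, round(p, 4), significance_code(p))
```

With many n-grams per category the normal approximation is usually adequate; for small samples an exact test from a statistics library would be the safer choice.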