2017 Aug 29;12(8):e0183537. doi: 10.1371/journal.pone.0183537

Table 4. Top predictive features for each age group in tweet language use and Twitter handle metadata models.

Predictive Features Youth (Aged 13 to 17) Young Adults (Aged 18 to 24) Adults (Aged 25 or Older)
Metadata Features Cohen’s d Direction of Association Cohen’s d Direction of Association Cohen’s d Direction of Association
Age of Twitter Account 0.336 0.193 +
Linguistic Features
Count of the term “school” 0.210 + 0.194
Count of WWBP words positively correlated with 23–29 age category, in tweet 0.222
Count of the stems of “ili” (e.g. “I like”) 0.186
Count of the term “college” 0.236 0.232 +
Percent of WWBP words negatively correlated with 19–22 age category, in tweet [a] 0.171 + 0.331 -
Count of stems of 18 [b] 0.210 +
Count of stems of 21 0.209 +
Count of the term “drunkard” 0.194 +
Count of the term “semester” 0.179 +
Count of kissyheart emoji 0.162 +
Count of smiley emoji 0.170 -
Count of stems of “via” 0.172 +
Mean absolute deviation of count of URLs in tweet [a] 0.174 +

[a] To capture the distributional properties of a user’s tweeting behavior, we created tweet-level features and then calculated descriptive statistics of those features across a user’s tweets. For example, for the “Average Percent Characters in Tweet that are Emoji” feature, we calculated the percentage of characters that are emoji for each tweet and then took the average across all the user’s collected tweets.

[b] To group common categories of words together, terms were stemmed, a process that reduces words to their base form. For example, a stemming algorithm would reduce the words “hunting,” “hunter,” “hunts,” and “hunters” to the stem “hunt.”
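
The two footnotes above describe a two-step feature construction (a per-tweet measurement, then a summary statistic across a user's tweets) and a stemming step. The following is a minimal Python sketch of both ideas, not the authors' implementation. Several pieces are assumptions made only for illustration: the emoji check uses two rough Unicode ranges, the URL pattern is a simple regular expression, the mean absolute deviation is taken around the mean, and `toy_stem` is a hypothetical suffix stripper that handles the "hunt" example rather than any real stemming algorithm (the source does not name the stemmer used).

```python
import re
from statistics import mean

# Assumption: a simple pattern for http/https links, not a full URL parser.
URL_PATTERN = re.compile(r"https?://\S+")


def is_emoji(ch):
    """Rough emoji check (assumption): two common Unicode ranges only."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def percent_emoji(tweet):
    """Tweet-level feature: percentage of characters that are emoji."""
    if not tweet:
        return 0.0
    return 100.0 * sum(is_emoji(ch) for ch in tweet) / len(tweet)


def url_count(tweet):
    """Tweet-level feature: number of URLs in the tweet."""
    return len(URL_PATTERN.findall(tweet))


def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean (assumption)."""
    center = mean(values)
    return mean(abs(v - center) for v in values)


def toy_stem(word):
    """Hypothetical suffix stripper; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ers", "ing", "er", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Made-up tweets for one user (illustrative data, not from the study).
tweets = [
    "good morning \U0001F600",
    "read this http://example.com and https://example.org",
    "no links here",
]

# User-level features: summarize each tweet-level feature across tweets.
avg_percent_emoji = mean(percent_emoji(t) for t in tweets)
mad_url_count = mean_absolute_deviation([url_count(t) for t in tweets])
stems = [toy_stem(w) for w in ["hunting", "hunter", "hunts", "hunters"]]

print(avg_percent_emoji, mad_url_count, stems)
```

Keeping the tweet-level function separate from the aggregation step means the same per-tweet measurement can be summarized in more than one way (for example, as a mean or as a mean absolute deviation), which matches the footnote's description of computing descriptive statistics over tweet-level features.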