Table 2.
Variables for modeling age of Reddit users in the training set derived from the comment, post, and user data collected for users whose ages were confirmed by manual labeling.
| Variable group | Metadata used | Example | Variables (N=1523), n |
| Summary statistics | All | Median post score | 189 |
| Subreddit frequencies | Posts and comments | Frequency user posted to “Teenagers” subreddit | 624 |
| Emoji frequencies | Comments | Frequency of “ ” out of emojis used by user | 101 |
| Literary characteristics | Posts and comments | Average Flesch Reading Ease score | 28 |
| Patterns in posting | Posts and comments | Percentage of user’s posts that were videos | 42 |
| Term usage | Comments | TF-IDFa score for the term “school” | 539b |
aTF-IDF: term frequency–inverse document frequency.
bThe number of TF-IDF terms varied across the cross-validation folds based on the comments and submissions vocabulary present in the training portion of each fold. The value presented here is the number of TF-IDF features when calculated on text from the full training set.
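Footnote b can be illustrated with a minimal sketch, assuming a simple whitespace tokenizer and a smoothed IDF formula; the source states neither the tokenizer, the TF-IDF variant, nor the library used. The helper names (`fit_idf`, `tfidf_features`, `vocab_size_per_fold`) and the toy comments are hypothetical. The vocabulary is learned only from the training portion of each fold, so the number of TF-IDF variables differs from fold to fold and from the count obtained on the full training set.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: an assumed variant, not necessarily the one used in the study.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_features(doc, idf):
    """Score one document against a fitted vocabulary; unseen terms are dropped."""
    counts = Counter(t for t in doc.lower().split() if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def vocab_size_per_fold(docs, n_folds):
    """Return the number of TF-IDF terms learned from each fold's training portion."""
    sizes = []
    for fold in range(n_folds):
        # Documents whose index maps to this fold are held out.
        train = [d for i, d in enumerate(docs) if i % n_folds != fold]
        sizes.append(len(fit_idf(train)))
    return sizes


# Toy comments, invented for illustration only.
comments = [
    "school starts tomorrow",
    "my school has a new teacher",
    "work meeting ran late",
    "mortgage rates and work",
]

fold_sizes = vocab_size_per_fold(comments, n_folds=2)
full_size = len(fit_idf(comments))
school_score = tfidf_features("school starts late", fit_idf(comments)).get("school")
```

Here `fold_sizes` holds a different term count for each fold, while `full_size` corresponds to the kind of figure reported in the table: the count obtained when the vocabulary is built from all training text.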