2021 Mar 16;7(3):e25807. doi: 10.2196/25807

Table 2.

Variables used to model the age of Reddit users in the training set, derived from the comment, post, and user data collected for users whose ages were confirmed by manual labeling.

| Variable group | Metadata used | Example | Variables (N=1523), n |
| --- | --- | --- | --- |
| Summary statistics | All | Median post score | 189 |
| Subreddit frequencies | Posts and comments | Frequency user posted to “Teenagers” subreddit | 624 |
| Emoji frequencies | Comments | Frequency of “[inline graphic]” out of emojis used by user | 101 |
| Literary characteristics | Posts and comments | Average Flesch Reading Ease score | 28 |
| Patterns in posting | Posts and comments | Percentage of user’s posts that were videos | 42 |
| Term usage | Comments | TF-IDF^a score for the term “school” | 539^b |

^a TF-IDF: term frequency–inverse document frequency.

^b The number of TF-IDF terms varied across the cross-validation folds based on the comments and submissions vocabulary present in the training portion of each fold. The value presented here is the number of TF-IDF features when calculated on text from the full training set.
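
The fold-dependent feature count described in the footnote can be illustrated with a minimal sketch. It assumes whitespace tokenization, a smoothed IDF variant, and a simple round-robin fold split. None of these choices is taken from the paper, and the helper names (`fit_tfidf`, `transform`, `vocab_sizes_per_fold`) and the toy comments are hypothetical.

```python
import math
from collections import Counter


def fit_tfidf(train_docs):
    """Learn the vocabulary and IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: one common variant, assumed here for illustration.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def transform(doc, idf):
    """Score one document; terms outside the fitted vocabulary are dropped."""
    counts = Counter(t for t in doc.lower().split() if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def vocab_sizes_per_fold(docs, k):
    """Fit TF-IDF on the training portion of each fold and record its size."""
    sizes = []
    for fold in range(k):
        # Round-robin split: document i is held out when i % k == fold.
        train = [d for i, d in enumerate(docs) if i % k != fold]
        sizes.append(len(fit_tfidf(train)))
    return sizes


# Toy comments, invented for this example.
comments = [
    "school starts tomorrow",
    "my mortgage payment is due",
    "homework for school again",
    "the kids have school today",
]

fold_sizes = vocab_sizes_per_fold(comments, k=2)
full_size = len(fit_tfidf(comments))
print(fold_sizes, full_size)
```

Because the vocabulary is fitted only on the training portion, each fold yields a different number of term features, and the full training set yields yet another count. Fitting inside each fold also keeps held-out text from influencing the features used to score it.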