Table 2. Comparison of different data sources for prediction in OHSNs.

| Data source | Survey data (high-level data) | User-generated data (mid-level data) | User log data (low-level data) |
| --- | --- | --- | --- |
| Effort required to collect data | Design questionnaires; conduct surveys | Perform text mining; apply natural language processing to text | Extract data from server database |
| Data generation rate | Slow: a new survey must be conducted to get recent data | Fast: hundreds of posts written by users every day | Instantaneous: new data generated with every user action (eg, access time, search history) |
| Interpretability | Very easy to understand: questions directly suited to user’s intentions | Relatively easy to understand: requires data processing to extract features from long texts | Difficult to derive meaning from raw data: requires insight into what features to obtain from given data |
| Data types | Numerical data (eg, scale of 1 to 10); demographic information (eg, age, sex, region); text data for open-ended questions | Text data (eg, title, user posts, comments); demographic information (eg, user profile information) | Periodical data (eg, access time); hypertext data (eg, accessed links); text data (eg, keywords typed in for search) |
| Obtainable characteristics | A user’s (dis)agreement toward a particular characteristic; open-ended answers to a question | Words that represent a user’s main interests or concerns; response to a particular article | Visiting frequency; reading preference; search preference |
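
As a rough illustration of the user log data column, the sketch below shows one way that visiting frequency, reading preference, and search preference might be derived from raw log records. The record layout, the field order, and the `log_features` helper are assumptions made for this example only; they are not taken from the paper or from any particular OHSN platform.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records (assumed layout, not from the source):
# (user_id, ISO 8601 access time, accessed link, search keyword or None)
LOG = [
    ("u1", "2024-01-01T09:00:00", "/forum/diabetes", "insulin"),
    ("u1", "2024-01-01T21:30:00", "/forum/diabetes", None),
    ("u1", "2024-01-03T10:15:00", "/forum/diet", "insulin"),
    ("u2", "2024-01-02T08:00:00", "/forum/sleep", "insomnia"),
]


def log_features(records, user_id):
    """Summarize one user's raw log records into three simple features."""
    rows = [r for r in records if r[0] == user_id]

    # Visiting frequency: number of distinct calendar days with any access.
    days = {datetime.fromisoformat(r[1]).date() for r in rows}

    # Reading preference: the most frequently accessed link.
    links = Counter(r[2] for r in rows)

    # Search preference: the most frequently typed search keyword.
    searches = Counter(r[3] for r in rows if r[3])

    return {
        "visit_days": len(days),
        "top_link": links.most_common(1)[0][0] if links else None,
        "top_search": searches.most_common(1)[0][0] if searches else None,
    }


features = log_features(LOG, "u1")
```

Counting distinct days and taking the most common link or keyword are deliberately crude choices. They are meant only to show that log data needs an explicit decision about which features to extract before it becomes interpretable.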