Table 1.
List of all sources used, with their number of posts and the available demographic attributes.
Dataset | No. of posts | Gender^a | Age^a | Ethnicity^a | Location^a | Writing level |
TwitterHealth [23] | 11,637,888 | Gender classifier | NO | Ethnicity classifier | YES | Writing level classifier |
Google+Health [24] | 186,666 | YES | YES | Ethnicity classifier | YES | Writing level classifier |
Drugs.com [25] | 74,461 | Gender classifier | NO | NO | NO | Writing level classifier |
DailyStrength/Treatments [26] | 1,055,603 | YES | YES | NO | YES | Writing level classifier |
WebMD/Drugs [27] | 122,040 | YES | YES | NO | NO | Writing level classifier |
Drugs.com/Answers [28] | 320,118 | Gender classifier | NO | NO | NO | Writing level classifier |
DailyStrength/Forums [29] | 5,948,877 | YES | YES | NO | YES | Writing level classifier |
WebMD [30] | 1,128,629 | Gender classifier | NO | NO | NO | Writing level classifier |
^a NO indicates that the demographic attribute is not provided by the source and no classifier is used due to low accuracy. YES indicates that the demographic attribute is provided by the source. More details on the demographic classifiers are available in the paper by Sadah et al [21].