Table 2.
Nature of the Data
Feature | Survey data | Social media content |
---|---|---|
Temporal properties | Responses depend on retrieval from memory, which worsens over time and is prone to error. Time period of events is specified by survey designers, so is same for all respondents. Measurement occurs at discrete moments. | Content often concerns recent events or users’ current states, for which forgetting is unlikely. Time period of reported events is chosen by user, and varies across users and posts. Posting occurs continuously, around the clock. |
Population coverage | Full population of interest can, in principle (if not always in practice), be represented in a “sampling frame.” Researchers’ goal is to approach full coverage, which allows generalizability to population. | Users’ characteristics (e.g., demographics) cannot be assumed to match population’s characteristics, and there is no reason that they should; many members of general (e.g., national) populations do not use social media, and providers’ goal is not to support population estimates nor to make claims about population coverage in data they release. |
Topic coverage | Assumption is that population’s attitudes and behaviors relevant to the topic will be accurately characterized to the extent that the population is covered (i.e., the sampling frame accurately corresponds to the population). | Analyses of posts may capture population-wide distribution of attitudes and behaviors relevant to the topic, even if the characteristics of the user base do not reflect the characteristics of the full population. How this can work is not yet well understood. |
Sampled units | Sampled units are individuals or households/organizations. | Sampled units are posts (e.g., tweets, Facebook updates), either treated individually or (less often) aggregated by user accounts (i.e., one post per account). Accounts do not necessarily represent individual users; individuals can have multiple user accounts, and multiple people can post to an account. |
Sampling frame | Population of interest, as represented by a list of phone numbers, household addresses, email addresses, etc. | Set of posts available to researchers. Not an exhaustive enumeration of any population external to the social media site. Users self-select as posters. From survey perspective, a nonprobability sample frame (like an opt-in web panel). |
Sampling procedure | Every unit in the population of interest (the sampling frame) has a known chance of being chosen (i.e., probability sampling). Demographic subgroups within the population can be sampled. | Probability of posting within sample frame is not known; there is wide variation in frequency of posting, with a small number of users potentially creating disproportionately large amounts of content. Unknown how this might bias inferences from data. Subgroups of posts can be sampled based on content, but not as easily on demographics of users. |
Sample size | Typically, smallest data set that can allow statistical inference. Restricted by cost. | Typically, much larger number of observations (posts, users) than in surveys. Restricted by corporate policies about access and computational resources. |
Relevance to research topic | Data are answers to survey questions on topics directly queried by researcher, even if respondent has not previously thought about the issue. | Data are user-generated content. Some will be directly relevant to researcher’s topic of interest and some may be relevant in non-obvious ways (ways researcher has not thought to ask about), but much content will not necessarily be germane to topic in which researcher is interested. |
Granularity of possible analyses | Analyses can be focused on particular subgroups by using other demographic information, e.g., age, citizenship, employment status, etc., which can be collected in survey or may be contained in frame. | Analyses can be focused on subgroups of users only if posters happen to have provided relevant characteristics, or if site makes characteristics automatically available (e.g., geolocation, time of post). Temporal properties of data can allow more temporally fine-grained analyses than surveys usually do (e.g., changes in opinions over course of a day). |
Data structure | Data are usually represented in a rectangular array. A data point comes from an individual respondent answering one question. Because most questions are closed form, answers map directly onto array. Open-ended answers can be coded to categorical responses and mapped to array because they are elicited by a question relevant to the research topic. | The structure of the data set depends on content of the posts and how they are analyzed. Textual traces must be transformed to be mapped to data array, requiring researcher’s judgment (e.g., choice of text analysis tool and associated algorithm) of what is relevant. The exact set of variables in the data set may not be determined ahead of time. Potentially relevant traces may take more forms than are usually analyzed with survey responses: text, geolocation, network structure, time and location of log-ins/posts, sites visited, number of friends, inbound and outbound links. |
Automatically generated auxiliary information | Auxiliary data (paradata) that can be (but are not always) collected include operational measures (e.g., number of calls to reach a household); respondent measures (e.g., response latency, keystrokes); and interviewer measures (e.g., behavior during interview, keystrokes). | Auxiliary data can include geolocation, profile information, system activity, interaction with others, etc., though these may also be treated as primary data. Auxiliary data may be missing when they are user provided. Analytics companies may extrapolate this information from content. |
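
The "Sampled units" and "Sampling procedure" rows note that posts can be aggregated to one post per account, and that a few users may produce a disproportionate share of content. A minimal sketch of both ideas follows, using only the Python standard library. The toy `posts` list and the helper names `share_from_top_accounts` and `one_post_per_account` are hypothetical illustrations, not part of the source or of any platform's API.

```python
from collections import Counter

# Hypothetical toy data: (account_id, post_text) pairs, invented for illustration.
posts = [
    ("a", "post 1"), ("a", "post 2"), ("a", "post 3"), ("a", "post 4"),
    ("b", "post 5"),
    ("c", "post 6"),
]

def share_from_top_accounts(posts, top_n=1):
    """Fraction of all posts written by the top_n most active accounts."""
    if not posts:
        return 0.0
    counts = Counter(account for account, _ in posts)
    top_total = sum(n for _, n in counts.most_common(top_n))
    return top_total / len(posts)

def one_post_per_account(posts):
    """Keep the first post seen for each account (aggregation by account)."""
    first_seen = {}
    for account, text in posts:
        first_seen.setdefault(account, text)
    return list(first_seen.items())

print(share_from_top_accounts(posts))
print(one_post_per_account(posts))
```

In this toy example one account writes four of the six posts, so post-level analyses would weight that account far more heavily than account-level analyses would. Keeping the first post per account is only one possible aggregation rule, chosen here for simplicity.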