Public Opinion Quarterly, 2016 Jan 13;80(1):180–211. doi: 10.1093/poq/nfv048

Social Media Analyses for Social Measurement

Michael F. Schober, Josh Pasek, Lauren Guggenheim, Cliff Lampe, and Frederick G. Conrad

Abstract

Demonstrations that analyses of social media content can align with measurement from sample surveys have raised the question of whether survey research can be supplemented or even replaced with less costly and burdensome data mining of already-existing or “found” social media content. But just how trustworthy such measurement can be—say, to replace official statistics—is unknown. Survey researchers and data scientists approach key questions from starting assumptions and analytic traditions that differ on, for example, the need for representative samples drawn from frames that fully cover the population. New conversations between these scholarly communities are needed to understand the potential points of alignment and non-alignment. Across these approaches, there are major differences in (a) how participants (survey respondents and social media posters) understand the activity they are engaged in; (b) the nature of the data produced by survey responses and social media posts, and the inferences that are legitimate given the data; and (c) practical and ethical considerations surrounding the use of the data. Estimates are likely to align to differing degrees depending on the research topic and the populations under consideration, the particular features of the surveys and social media sites involved, and the analytic techniques for extracting opinions and experiences from social media. Traditional population coverage may not be required for social media content to effectively predict social phenomena to the extent that social media content distills or summarizes broader conversations that are also measured by surveys.


Measures of public opinion and behavior are vitally important for shaping policy and improving our understanding of the social world, from healthcare policies and election results to the price of gas and the attitudes and habits of consumers. Traditionally, the most predictive and accurate method for social measurement has been sample surveys that ask carefully crafted questions to scientifically constructed samples of the population. But high-quality survey data come at a substantial cost: large investments of time, effort, and money for researchers who design the surveys; interviewers who collect the data; and respondents who voluntarily provide answers. In the current and likely future financial climate, levels of funding for surveys that produce official statistics and social indicators are unlikely to grow and may well decline (Keeter 2012; Massey and Tourangeau 2013). Moreover, members of the public are increasingly unwilling to participate in surveys (at least in the United States; see Brick and Williams [2013]), even if compensated (e.g., Groves 2006, 2011; Keeter et al. 2007), and it is unclear whether current modes of data collection will remain viable as people’s communicative habits change (e.g., Schober et al. 2015).

Recently an alternative has emerged with the potential for augmenting or even replacing established survey methods while reducing costs for researchers and eliminating effort for respondents. In several highly visible demonstrations, researchers from various fields have mined content from social media (what Taylor [2013] calls “found” data, arising independently of researchers’ data-collection efforts) to study phenomena traditionally measured using surveys (“made” or “elicited” data). In one study of consumer sentiment, the content of tweets was shown to correlate highly with answers to survey questions that contribute to Gallup’s Economic Confidence Index (O’Connor et al. 2010). In another study of election forecasting (Tumasjan et al. 2010), the number of tweets mentioning each major political party preceding the 2009 German federal election predicted the election outcome as accurately as some pre-election polls. Other successful examples include Sang and Bos (2012), Fu and Chan (2013), Jensen and Anstead (2013), and Ceron et al. (2014).

Based on such demonstrations, proponents have come close to arguing that analyses of social media (along with analyses of other “found” data, such as online search strings) will render traditional sample surveys obsolete (Savage and Burrows 2007; Mayer-Schönberger and Cukier 2013). The opportunity to mine the vast amount of social media data now potentially available—what Prewitt (2013) calls a “data tsunami”—is certainly unprecedented, not only from Twitter and Facebook streams (which have been the basis of most of the demonstrations to date) but from other social media sites as well. 1 Though the extent to which the data are available for secondary analyses varies, the potential advantages are extraordinary: the cost to researchers of collecting each additional message is marginal, archives of earlier social media postings allow exploration of a vast store of past posts, and new social media data are being continuously created. This all enables researchers to collect longitudinal data with unprecedented ease, and to measure phenomena more frequently than is realistic with surveys, as in Golder and Macy’s (2011) analyses of how mood changed over the course of the day in different cultures using data from Twitter messages sent by 2.4 million people.

Skeptics, on the other hand, have questioned whether enthusiasts’ claims are overly optimistic (Couper 2013; Langer Research Associates 2013; Smith 2013), and more broadly whether any form of “nonprobability sampling”—which includes social media analysis—is simply too risky to endorse (Baker et al. 2013). Others have noted that social media data may introduce new kinds of bias and measurement error (Biemer 2014; Tufekci 2014), which raises the question of whether they are sufficiently reliable to be used for official statistics and scientific social measurement; given how quickly people’s adoption and use of different platforms changes, it is unclear whether current platforms are stable enough to be used for high-impact indices (Diaz et al. 2014). Empirically, analyses of social media content do not always track survey measures or accurately predict social phenomena (Gayo-Avello, Metaxas, and Mustafaraj 2011; Gayo-Avello 2012). Some attempts at prediction—even in domains that have shown alignment before—have not succeeded (Kim et al. 2014), and one study that has been cited as a major success (O’Connor et al. 2010) can also be used to argue for the method’s limitations (Smith 2013). Seemingly minor differences in analytic choices (e.g., whether or not to include tweets that mention a small political party) can lead to quite different outcomes (e.g., a different predicted election winner) (Jungherr, Jürgens, and Schoen 2012).

Our purpose in this research synthesis is to systematically characterize the ways in which social media analyses and the survey enterprise currently align and diverge. The role social media analyses can play in social research is an open question (Groves 2011; Murphy et al. 2014), not least because data scientists’ starting assumptions and analytic techniques are so different from those of survey researchers. We see significant potential for researchers who use either approach to misunderstand one another and not to recognize when they are using the same technical terms to mean different things. We also see a need for greater clarity among both survey researchers and data scientists that use social media about conceptually distinct aspects of their approaches that could be unbundled, even though they often go together. For example, whether the data to be analyzed are “made” or “found” (that is, “designed” by researchers as opposed to pre-existing or “organic”; see Groves [2011]) is conceptually distinct from researchers’ analytic approach: Do they define a priori what the relevant variables for analysis are, or do they use a data-driven approach that makes no assumptions about which variables are likely to be relevant?

Our synthesis is a preliminary step toward future empirical research into when social media and survey data can be expected to yield similar or different conclusions. A meta-analytic review of the available direct comparisons of social media analyses with survey data as of the end of 2013 (Guggenheim et al. 2014) makes it clear that the number of comparisons carried out to date is not yet sufficient for quantitative meta-analysis, nor for an assessment of larger trends across topic domains, social media sources, data-mining algorithms and techniques, or statistical methods. The needed systematic comparisons across topic domains and methods, and replications over multiple rounds of comparison, have simply not yet been carried out. Further, given publication bias toward positive (non-null) results (Rosenthal 1979; Gayo-Avello 2012; Couper 2013) and the possibility that some reported correlations are coincidental, currently published research may overstate the frequency with which social media data yield accurate conclusions about society (Huberty 2013).

What we can do now is characterize the research to date, and enumerate features of survey data and social media content we see as relevant to when alignment is more or less likely to occur. The starting assumptions and the principles governing inference-making from each method are more different than they might at first appear; clear thinking about the dimensions on which the approaches differ will be important for understanding when the results align. Our synthesis is based on claims made in the empirical comparisons in Guggenheim et al. (2014), on distinctions listed in critical discussions and reviews thus far (Ampofo et al. 2015; boyd and Crawford 2012; Baker et al. 2013; Couper 2013; Smith 2013; Hill and Dever 2014; Murphy, Hill, and Dean 2014; Murphy et al. 2014; Tufekci 2014), and on our own view of additional relevant dimensions. At the center of our synthesis is a comparison, in three tables, of features that define and distinguish surveys and analyses of social media. We present our synthesis this way in order to facilitate conversation about when both approaches are more and less likely to produce the same results, and when one approach might be preferable to the other.

General Characterizations of Research to Date

Three general points are evident from the research to date:

1. When social media and survey results align with one another, they do so through radically different mechanisms.

The most central difference between analyses of these two types of data is in how population-level estimates are generated. Survey researchers and data scientists who analyze social media make different (explicit or tacit) assumptions about the need for population coverage: the extent to which all members of the population are potentially able to be sampled (i.e., included in the “sampling frame”). The accuracy of survey estimates depends on adequate population coverage. A frequently raised concern about generalizing from social media analyses is that not everyone uses social media, and even those who do use social media post at different frequencies; social media users who post content may be atypical in their demographics, and their opinions and behaviors may not represent the full population (boyd and Crawford 2012; Baker et al. 2013; Couper 2013; Smith 2013). In other words, inferences from social media data are inherently of unknown quality because they come from nonprobability samples that are not designed to cover the population.

From the perspective of data scientists, representativeness is not an issue when one has access to “all of the data”; statistical sampling is necessary only when massive and complete data sets are unavailable (as Mayer-Schönberger and Cukier [2013, 31] put it, “Reaching for a random sample in the age of big data is like clutching at a horse whip in the era of the motor car”). Others have noted that although the social media users whose content they have analyzed do not cover the population they are characterizing, their analyses predict survey estimates reasonably well (O’Connor et al. 2010; Tumasjan et al. 2010). One way of characterizing this is to say that the properties of social media data—not just their sheer magnitude—can sometimes override their nonprobability origin: social media data can end up adequately covering the research topics under study, and thus represent the population accurately, even though the individuals who contributed to the social media corpus are not sampled in a representative way.

In our view, this distinction between population and topic coverage is at the heart of why conclusions from survey data and social media analyses might align. Understanding when topic coverage is achieved (whether through population coverage or not) is the central scientific problem. For surveys of probability samples, accurate inferences stem from the correspondence between the population as a whole and the subset of people who participate. That is, in surveys, topic coverage follows naturally from—is a logical consequence of—population coverage, because the survey designers control the research topic (the construct or relationship in the world that is measured with survey questions) and ask every sample member about it. As long as there is population coverage and the survey questionnaire addresses the topic appropriately, topic coverage will be achieved. 2

For social media analyses, topic coverage can, in principle, be achieved without population coverage. That is, other mechanisms of information propagation that are particular to the dynamics of social media may lead a corpus of social media posts to reflect the broader population’s collective opinion and experience, through a range of (not yet fully understood) possible mechanisms. A collection of posts may accurately distill larger conversations in the full population despite the lack of population coverage among posters, perhaps because those who post may have particular access to—or are—opinion-formers or elite communicators (Ampofo, Anstead, and O’Loughlin 2011). We see at least three plausible (and not mutually exclusive) mechanisms: (1) audience design: social media postings may reflect users’ judgments of what their audience (friends and followers and unknown others) is interested in hearing about right now, and the collection of audience interests across the many networks within a massive social media site may reflect—even represent—the broader public’s interests; (2) propagation: ideas that resonate more broadly (both within and beyond a user’s social media network) may be more likely to survive and flourish or “cascade”—be “liked,” retweeted, replied to, and lead to new followers or friends; 3 and (3) media reflection: social media postings reflect the issues disseminated by multiple media outlets and thereby reflect the current public agenda (cf. Neuman et al. 2014; Jang and Pasek 2015). Within a social media site, the networks that users develop and interact with vary in their structures (Smith et al. 2014; Wu et al. 2011, for analyses in Twitter), and thus presumably in how they connect with the social world beyond the social media site. But it is through that connection, we propose, that nonrepresentative samples of social media users can post content that potentially represents the larger population’s opinions and experiences.

The fact that topic coverage can be achieved in social media analyses without population coverage suggests that a one-to-one comparison of all sources of error as conceived of in surveys (e.g., error due to sampling, incomplete coverage, nonresponse, and inaccurate measurement) rests on a false equivalency. A tweet or Facebook post simply is not a survey response, and the collection of all tweets or Facebook posts by a single user does not necessarily represent how that user would have answered a survey question. We do not mean to suggest that the “total survey error” approach (Biemer and Lyberg 2003; Weisberg 2005; Groves et al. 2009; Groves and Lyberg 2010) that has dominated the field of survey research has no relevance to social media analyses; to the contrary, this is an important framework that can guide evaluation of the quality of estimates from new data sources. It may, however, need to be adapted and extended so that it can apply to social media analyses (Biemer 2014).

2. Current methods of transforming textual social media content into quantities that can be compared to survey data are predominantly lexical analyses.

The majority of predictions of survey data from social media content use dictionary-based automated lexical analyses of the textual content of individual posts. The most straightforward method (used in 36 percent of the research articles in Guggenheim et al. [2014]) is assessing, for each post in a corpus, the frequency of keywords (and variants on those keywords) from a particular semantic category (Tumasjan et al. 2010; Shi et al. 2012; Sanders and Van Den Bosch 2013).
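To make the mechanics concrete, here is a minimal sketch of such a keyword-frequency count (the keyword list, corpus format, and tokenization are hypothetical simplifications, not taken from any of the cited studies):

```python
from collections import Counter
from datetime import date

# Hypothetical keyword set for one semantic category (e.g., "jobs"),
# including simple variants; published studies use larger curated lists.
KEYWORDS = {"job", "jobs", "employment", "unemployed", "hiring"}

def keyword_frequency(posts):
    """Count, per day, how many posts mention at least one keyword.

    `posts` is an iterable of (date, text) pairs.
    """
    daily_counts = Counter()
    for day, text in posts:
        tokens = {token.strip(".,!?:;\"'").lower() for token in text.split()}
        if tokens & KEYWORDS:
            daily_counts[day] += 1
    return daily_counts

# Toy corpus illustrating the expected input format.
corpus = [
    (date(2011, 3, 1), "Still looking for a job, third interview this week"),
    (date(2011, 3, 1), "Great weather today"),
    (date(2011, 3, 2), "Company announced it is hiring 200 people"),
]
print(keyword_frequency(corpus))  # one matching post per day in this toy corpus
```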

A more sophisticated dictionary-based lexical method (used in 24 percent of the research articles in Guggenheim et al. [2014]) is sentiment analysis, which calculates the ratio of positive to negative words in posts on a particular topic like “jobs” (O’Connor et al. 2010). Although details of the method can vary, the general strategy is to tag social media posts that contain a topical keyword or set of keywords according to whether they also contain words classified as positive or negative. 4 The sentiment ratio is calculated by dividing the number of these messages that include positive words, created over a particular time period (a day, an hour, a week), by the number of these messages that include negative words.
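As a sketch of the general strategy just described, the positive-to-negative ratio for one time period’s topical messages might be computed as follows (the tiny word lists stand in for the subjectivity lexicons used in such studies and are purely illustrative):

```python
# Illustrative stand-ins for a subjectivity lexicon; real analyses use
# lexicons with thousands of entries.
POSITIVE = {"good", "great", "better", "hopeful"}
NEGATIVE = {"bad", "worse", "losing", "worried"}

def sentiment_ratio(topical_messages):
    """Ratio of messages containing positive words to messages containing
    negative words, for messages that already (a) mention the topical
    keyword(s) and (b) fall within one time period (a day, hour, or week)."""
    n_pos = n_neg = 0
    for text in topical_messages:
        tokens = {t.strip(".,!?").lower() for t in text.split()}
        if tokens & POSITIVE:
            n_pos += 1
        if tokens & NEGATIVE:
            n_neg += 1
    return n_pos / n_neg if n_neg else float("nan")

print(sentiment_ratio([
    "jobs report looks great",
    "worried about losing my job",
    "bad week for jobs",
]))  # 1 positive message / 2 negative messages = 0.5
```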

An alternative data-driven or bottom-up lexical approach is the n-gram approach, in which groupings of words (bigrams, trigrams, etc.) are automatically tagged in a corpus of textual data; clusters of the most frequent groupings are discovered without relying on pre-specified dictionaries or semantic classifications of words. Data-driven methods such as this have the potential to uncover patterns that researchers have not pre-identified, and even if there is often overlap with what dictionary-based approaches uncover (see Oberlander and Gill [2006] for a systematic comparison in a corpus of e-mails), the findings are not identical. In the Guggenheim et al. (2014) collection of research articles, data-driven approaches were used in a minority of the analyses, either alone or in combination with dictionary-based methods.
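A minimal sketch of this kind of bottom-up n-gram counting follows (tokenization is simplified and the toy posts are placeholders):

```python
from collections import Counter

def top_ngrams(texts, n=2, k=10):
    """Return the k most frequent n-grams (bigrams for n=2, trigrams for n=3)
    in a corpus, with no pre-specified dictionary or semantic categories."""
    counts = Counter()
    for text in texts:
        tokens = [t.strip(".,!?:;\"'").lower() for t in text.split()]
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Toy corpus; real analyses run over millions of posts.
posts = ["lost my job today", "new job new city", "my job interview went well"]
print(top_ngrams(posts, n=2, k=3))  # ('my', 'job') surfaces as the top bigram
```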

Demonstrations to date have not relied on the semantic and computational linguistics techniques for analyzing content and form in large corpora of text that have become prevalent in other fields, for example Latent Semantic Analysis and its variants (Landauer, Foltz, and Laham 1998) or the computational linguistics NLP (Natural Language Processing) measures available in systems like Coh-Metrix (McNamara et al. 2014). For example, lexically based sentiment analysis tools can have difficulty with more complex posts like “Perhaps it is a great phone, but I fail to see why” or “In theory, the phone should have worked even underwater,” but discourse-oriented approaches that embody pragmatic knowledge can do better (Wiegand et al. 2010). More complex analytic techniques that include additional user characteristics (prior posting history, grammatical features of posts, author’s embeddedness in network, etc.) may eventually allow automated detection of non-literal usages like sarcasm (Rajadesingan, Zafarani, and Liu 2015).

It is plausible that more complex and sophisticated textual analysis techniques, applied to social media content, might lead to improved prediction, much as they have been effective in improving automated essay scoring (e.g., Shermis and Burstein 2013), text summarization (e.g., Moreno 2014), analysis of online customer reviews (e.g., Tirunillai and Tellis 2014), and dialogue-based intelligent tutoring systems (e.g., Niraula and Rus 2014).

3. Cases where social media analyses and survey results do not align do not yet allow definitive explanations for failures of alignment.

As is always the case with null findings, it is unclear whether studies that do not demonstrate or replicate alignment between a particular social media data set and a particular survey might have demonstrated alignment with alternative methods: by choosing a different procedure for sampling social media content (e.g., should researchers analyze all posts from all users—the dominant strategy—or attempt to account for disproportionate posting by a small subset of users [e.g., Kim et al. 2014]?), choosing some other algorithm for quantifying textual content (selecting different keywords and their variants, or a different method for sentiment analysis), or analyzing content posted during a different temporal window. Or perhaps alignment might have been demonstrated with a different data set that had population coverage more appropriate for making inferences to that particular population (Huberty 2013). The literature to date includes very few direct attempts at replication using exactly the same methods (same extraction algorithms, time windows, and statistical tools) on the same social media content predicting the same survey in a different time period. 5 The devil may well be in the seemingly innocuous details of analysis choices—such as whether to sample one or multiple tweets per user.
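For illustration, the “one versus all posts per user” sampling choice mentioned above can be expressed as two small functions (the (user, text) pair format is a hypothetical simplification, not the procedure of any cited study):

```python
import random

def all_posts(posts):
    """Dominant strategy: keep every post, so prolific users contribute
    proportionally more content to the corpus."""
    return [text for _user, text in posts]

def one_post_per_user(posts, seed=0):
    """Alternative strategy: sample a single post per user to dampen the
    influence of heavy posters (cf. Kim et al. 2014)."""
    by_user = {}
    for user, text in posts:
        by_user.setdefault(user, []).append(text)
    rng = random.Random(seed)
    return [rng.choice(texts) for texts in by_user.values()]
```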

Many such differences across the studies in the literature to date make it hard to definitively assess the true potential of using social media content for social measurement. 6 At present, there simply is not enough longitudinal evidence to know which social media sources, sampled and quantified in which ways, can provide social measurement as reliable as the measures produced with survey data.

Synthesis: Side-by-Side Comparison of Defining Features

As articulated in tables 1–3, major differences exist in (1) how participants understand the activity they are engaged in (responding to a survey versus posting in social media); (2) the nature of the data to be analyzed; and (3) practical and ethical considerations surrounding use of the data. For simplicity and clarity, our comparison focuses on the features of prototypical and widely trusted “gold standard” probability sample surveys and the features of social media like Facebook and Twitter that have large bases of active users posting content across a broad and varied range of topics. But of course surveys vary widely in content, implementation, and quality and can differ in important ways from the prototypical survey assumed here. Social media sites also vary widely in their content and their dynamics of use, and the analytic methods for extracting information from social media sites also vary substantially (Wilson, Gosling, and Graham 2012).

Table 1. How Participants Understand the Activity (of Responding versus Posting)

Initiative
Surveys: Participation is actively solicited by researchers, but is unplanned by respondents.
Social media: Users generate content for their own audience on their own schedule.

Informed consent
Surveys: Before respondents participate, they give informed consent for their anonymized data to be analyzed. Respondents are told about research sponsor and how to obtain more information about project.
Social media: Consent for research is bundled with terms of service for the site. Social media statements, as a type of public performance, may not require users’ consent to be used for research. University ethics boards (such as Institutional Review Boards [IRBs]) and government oversight bodies (such as the US Office of Management and Budget) have not ruled on using social media content to produce social or official statistics.

Ability to opt out
Surveys: Respondents are asked to opt in to data collection, and may rescind their agreement at any moment, generally without loss of benefits. They can choose not to answer particular questions without incurring harm.
Social media: Users often do not know their content is being studied. The only way to opt out is by not posting or by closing their accounts. Research is often conducted a posteriori, and users may not know secondary analyses are possible.

Prior consideration
Surveys: Respondents may not have thought about a topic before their answers are solicited.
Social media: Users probably have previously thought about the content they post, although perhaps not deeply.

Identity of user
Surveys: Identity of respondent will be kept anonymous.
Social media: Identity of user is explicit persona created by user—potentially fictional. Could be multiple users on a single account or multiple accounts for some users.

Perceived audience
Surveys: Interviewer (or interviewing agent in automated interviewing system) and/or perceived question author/researcher (e.g., in a self-administered mail survey). Potentially third parties (e.g., family members) who can hear or see the exchange.
Social media: Group(s) of friends or acquaintances, or the wider public (may differ by site).

Size of perceived audience
Surveys: One-to-one communication with an interviewer (a stranger) or an automated system, or with researcher(s) in a self-administered questionnaire. Respondent may also believe that more researchers or third parties will see their particular responses.
Social media: One-to-many communication with audience on social media site.

Social desirability pressures
Surveys: Respondents may answer in ways intended to make the interviewer or researchers like them or evaluate them positively.
Social media: Users perform for their audiences and try to manage the impression they make, but being liked is not necessarily their agenda. No performing for researchers, and thus researchers’ concerns are unlikely to affect content of posts. Social desirability pressures may be reduced if user is unidentifiable.

Potential for manipulation
Surveys: Respondents can give untrue answers in hopes of affecting the overall outcome, but any instance of doing this is unlikely to affect substantive survey conclusions.
Social media: Users can create fake accounts or broadcast false information in hopes of changing the “landscape.” Manipulations by some users (e.g., those with many friends or followers) can have particularly large impact.

Time pressure/synchrony
Surveys: Pressure to respond quickly: in voice interviews there is conversational pressure to answer quickly, which may not allow for long reflection or consulting records; there is less time pressure in self-administered surveys. Respondents are often motivated to invest minimal time in survey task.
Social media: Differs greatly by site and users’ motivations. In some cases relevance of posts depends on timeliness in response to external events or other posts, and in other cases users take time to thoughtfully craft posts.

Burden
Surveys: May require considerable time and mental effort, and may disrupt respondent’s daily tasks.
Social media: Because users choose to post content, there is no extra effort required to provide data for research beyond the original effort to share the content.

Table 3. Practical and Ethical Considerations

Costs to researchers
Surveys: Data collection is expensive, especially for face-to-face interviews: salaries for interviewers, survey operation costs, programming costs, compensation to respondents. Analysis requires salaries for skilled personnel (statisticians, data managers). Data collection is time intensive, from survey design to interviewing time to analysis.
Social media analyses: Data retrieval is low cost for many researchers (especially if their organizations bear the cost of a subscription to a data source), but can vary across sites. Analysis requires salaries for skilled personnel (data scientists, computational linguists, etc.). Data retrieval is speedy and can be carried out immediately or very soon after an event, or even continuously. Analysis is more time intensive, and data storage costs may be substantial.

Research communities
Surveys: Decades of practice and professionalization. Deep and extensive scrutiny of methods inherent in the discipline.
Social media analyses: Practice and research are newer, and methodological research is still developing. There is a wider range in scientific orientations and what are seen as necessary skills.

Ethics of consent for use of data
Surveys: Surveys involve explicit consent from respondents, who grant permission to use their anonymized data.
Social media analyses: Social media posters may not be aware that their data are being used for research, even if they have consented in a user agreement.

Ethics review of research protocol
Surveys: Before deployment, surveys are (ideally) subject to consideration by an IRB or government ethics board.
Social media analyses: Use of data is not consistently regulated by IRBs or ethics boards. Often treated as secondary data and unregulated.

Identifiability of respondents/users
Surveys: Identities of respondents are unavailable from aggregate reports, though potentially recoverable using covariates of survey responses in data set (e.g., location, ethnicity, income). If microdata are made available to the public, survey organization must remove potentially identifying information in order to honor agreement with respondents.
Social media analyses: Identities of users are potentially recoverable from wording and content of posts (e.g., photographs, Twitter handles). Ethical practices for concealing users’ identities in research analyses are not yet agreed upon; social media posts can be intended to be public.

Analytic approach
Surveys: Survey researchers define a priori what the relevant variables for analysis will be. In a well-designed survey questionnaire, questions are included because they reflect a hypothesis held by the researchers.
Social media analyses: Social media analyses (particularly machine learning) tend to follow a bottom-up data-driven approach that makes no assumptions about which variables are likely to be relevant. Social media data can be used to test hypotheses if researchers have them.

Potential for researcher bias
Surveys: When making predictions, the only variables used are those explicitly included by the researcher. What is tested and what can be found may therefore depend on the researcher’s preconceived notions.
Social media analyses: When researchers pre-define relevant variables, the potential for bias is exactly the same as for surveys. When all variables collected are used in generating a predictive model, there is a greater chance to identify covariates that were not imagined by the researcher, but also potential to be misled by spurious relationships.

Evaluating model quality
Surveys: Most models used to analyze surveys can be evaluated with significance testing and by comparing regression coefficients. Coefficients can be compared across models to choose best fit.
Social media analyses: As with surveys, models can be assessed in terms of how well they fit the data and how parsimonious they are. But social media analysis models can be too large for evaluation with p-values (which will often be significant with such large samples) and can be easily saturated through the inclusion of too many parameters (a large number of predictors can assure a good model fit).

Adjustments for nonrepresentativeness of data
Surveys: Surveys can be matched to benchmarks (e.g., national census data) and any one response can be adjusted accordingly (e.g., given less weight if produced by a member of a group overrepresented in the sample). Strategies exist to adjust for response likelihoods of households and individuals.
Social media analyses: Possible to dampen the influence of frequent posters or those with many followers and amplify the influence of others, through strategic selection of content. Some researchers would argue that such adjustments are not needed or even appropriate, and that concern about nonrepresentativeness reflects misunderstanding about how prediction from social media analyses works.

Stability of data source
Surveys: Large infrastructure and ongoing survey programs, but depends on continued funding, mostly from government and nonprofit sources. Increasing refusal to participate may affect long-term stability of data source.
Social media analyses: Not stable; may never be. Driven by social media companies’ business models, user base, technology changes, and revenues. No guarantee that organizations will be in business or continue to generate or release the same data stream in future. Posters’ concerns about surveillance and data privacy may affect long-term stability of data source.

Ownership of data
Surveys: Respondents explicitly consent to researchers owning data.
Social media analyses: Users and the social media company own the data, though ownership is disputed.

Perception of research enterprise
Surveys: Some members of the public find the survey enterprise intrusive and do not understand how participating may benefit them. Respondents can be unclear on the legitimacy of a survey invitation, and may not perceive a difference between scientific versus marketing and sales surveys.
Social media analyses: Although posters probably do not think about researchers mining social media data when they are posting, ongoing debates about the ethics of surveillance have raised concerns about the data-mining enterprise.

Data users and impact
Surveys: Policy makers, business owners, and individuals rely on official survey data for decision-making.
Social media analyses: Primary data users so far are advertisers and market researchers.

HOW PARTICIPANTS UNDERSTAND THE ACTIVITY OF RESPONDING OR POSTING (TABLE 1)

Survey respondents and social media participants understand themselves to be engaging in fundamentally different activities: answering questions as part of participating in research versus posting on a social media site because they want to. As table 1 details, there are substantial differences in who initiates the communication from which data (survey responses or social media posts) can be extracted—where the motivation for the survey respondent or social media user to provide data (content) originates. There are also differences in what participants have consented to, how and when they can opt out of participating, whether they have intended to communicate about the topic before the current moment, how their identity is represented, and who they see as their audience. Survey responding differs from social media posting in the pressures it puts on respondents to present themselves in a positive light (and for whom—researchers versus their audience, which may be a limited network of friends or a potentially unlimited audience of unknown readers if their post goes viral), the extent to which they can “fake” the content they present, and the extent to which they produce content under time pressure. On balance, this likely leads to a substantially greater burden for survey respondents, who must disrupt their daily tasks in order to participate in a survey, than for social media posters, for whom there is no extra effort in providing data beyond their original effort (which of course may be substantial) in sharing the content they are posting.

The different data sources naturally facilitate asking different research questions: questions about the thinking of those who chose to post versus those who were asked and chose to answer. For example, when tracking movie box office receipts, the opinions of film enthusiasts (identified by the fact that they have chosen to post about a film) may be particularly predictive; the frequency and passion of posts may reflect the range of engagement in the population of filmgoers—and thus predict the box office—with a richness that is harder to capture in a survey (Asur and Huberman 2010). Alternatively, because a survey can collect data from non-enthusiasts, a broader and more specific range of research topics can be addressed: for example, why are some people unlikely to pay to see a particular film? How do people feel about a film they are not talking about?

The fact that audiences differ fundamentally for survey respondents (interviewer and researcher) and social media users (friends, family, the world) suggests that alignment between the two data sources should differ depending on social desirability pressures surrounding the research topic. For example, people rarely post “face-threatening” information (information that, if distributed, could engender social risks) about their health circumstances in social media channels. Newman et al. (2011, 346) explored users’ reasoning about how and why they post as they do, for example “There are some people I wouldn’t care about if they saw [posts I might make about health on Facebook] but I got people, you know, from my high school that I am friends with that I haven’t talked to in 25 years. And I have no desire for them to know about my weight issues or weight status.” People seem more willing to post their personal health and mental health information online in more private forums, such as online health communities (Newman et al. 2011) and when using anonymous accounts (Pavalanathan and De Choudhury 2015). Social media users may be willing to reveal health information that they do not consider stigmatizing; perhaps this is why tweets about migraine episodes have, in one study, produced estimates of prevalence comparable to those of population-based epidemiological studies (Nascimento et al. 2014). But survey respondents may be even more willing to reveal health-related information, not only because they have been asked to provide this information but also because they have typically been assured that their identities will not be linked to their answers.

In contrast, discussions about illegal behaviors like drug use might feel risky in a survey, even with promises of anonymity, but in some—though not all—online communities, talking (or even boasting) about drug use may be socially sanctioned. Furthermore, because the nature of prediction can be so different with social media data than with survey data, in principle, social media postings may be able to predict survey findings on sensitive topics without depending on posters’ ever posting on that topic. For example, stock market fluctuations can be predicted using positive or negative emotion words mentioned in LiveJournal posts, even when none of those posts explicitly talk about financial issues (e.g., Gilbert and Karahalios 2010), and so users’ self-presentation about financial issues is unlikely to be a relevant concern. One could easily imagine similar predictive mechanisms for other sensitive topics.

Note that even within a single social media site posters vary in style and motivation, with some impulsively posting their thoughts and others meticulously crafting messages for specific audiences (Marwick and boyd 2011a), perhaps even with professional help from a team that vets every post to manage a public impression (Marwick and boyd 2011b; Gallicano, Brett, and Hopp 2013). This is analogous to the range of care that survey respondents exhibit in answering survey questions (Castro 2013; see also Schwartz et al.’s [2002] distinction between “maximizers” versus “satisficers”), but the range in social media postings may be even greater, in that they are not constrained by an external (research) design process that establishes the range of acceptable commentary.

NATURE OF THE DATA (TABLE 2)

Table 2. Nature of the Data

Temporal properties
Survey data: Responses depend on retrieval from memory, which worsens over time and is prone to error. Time period of events is specified by survey designers, so is same for all respondents. Measurement occurs at discrete moments.
Social media content: Content often concerns recent events or users’ current states, for which forgetting is unlikely. Time period of reported events is chosen by user, and varies across users and posts. Posting occurs continuously, around the clock.

Population coverage
Survey data: Full population of interest can, in principle (if not always in practice), be represented in a “sampling frame.” Researchers’ goal is to approach full coverage, which allows generalizability to population.
Social media content: Users’ characteristics (e.g., demographics) cannot be assumed to match population’s characteristics, and there is no reason that they should; many members of general (e.g., national) populations do not use social media, and providers’ goal is not to support population estimates nor to make claims about population coverage in data they release.

Topic coverage
Survey data: Assumption is that population’s attitudes and behaviors relevant to the topic will be accurately characterized to the extent that the population is covered (i.e., the sampling frame accurately corresponds to the population).
Social media content: Analyses of posts may capture population-wide distribution of attitudes and behaviors relevant to the topic, even if the characteristics of the user base do not reflect the characteristics of the full population. How this can work is not yet well understood.

Sampled units
Survey data: Sampled units are individuals or households/organizations.
Social media content: Sampled units are posts (e.g., tweets, Facebook updates), either treated individually or (less often) aggregated by user accounts (i.e., one post per account). Accounts do not necessarily represent individual users; individuals can have multiple user accounts, and multiple people can post to an account.

Sampling frame
Survey data: Population of interest, as represented by a list of phone numbers, household addresses, email addresses, etc.
Social media content: Set of posts available to researchers. Not an exhaustive enumeration of any population external to the social media site. Users self-select as posters. From survey perspective, a nonprobability sample frame (like an opt-in web panel).

Sampling procedure
Survey data: Every unit in the population of interest (the sampling frame) has a known chance of being chosen (i.e., probability sampling). Demographic subgroups within the population can be sampled.
Social media content: Probability of posting within sample frame is not known; there is wide variation in frequency of posting, with a small number of users potentially creating disproportionately large amounts of content. Unknown how this might bias inferences from data. Subgroups of posts can be sampled based on content, but not as easily on demographics of users.

Sample size
Survey data: Typically, smallest data set that can allow statistical inference. Restricted by cost.
Social media content: Typically, much larger number of observations (posts, users) than in surveys. Restricted by corporate policies about access and computational resources.

Relevance to research topic
Survey data: Data are answers to survey questions on topics directly queried by researcher, even if respondent has not previously thought about the issue.
Social media content: Data are user-generated content. Some will be directly relevant to researcher’s topic of interest and some may be relevant in non-obvious ways (ways researcher has not thought to ask about), but much content will not necessarily be germane to topic in which researcher is interested.

Granularity of possible analyses
Survey data: Analyses can be focused on particular subgroups by using other demographic information, e.g., age, citizenship, employment status, etc., which can be collected in survey or may be contained in frame.
Social media content: Analyses can be focused on subgroups of users only if posters happen to have provided relevant characteristics, or if site makes characteristics automatically available (e.g., geolocation, time of post). Temporal properties of data can allow more temporally fine-grained analyses than surveys usually do (e.g., changes in opinions over course of a day).

Data structure
Survey data: Data are usually represented in a rectangular array. A data point comes from an individual respondent answering one question. Because most questions are closed form, answers map directly onto array. Open-ended answers can be coded to categorical responses and mapped to array because they are elicited by a question relevant to the research topic.
Social media content: The structure of the data set depends on content of the posts and how they are analyzed. Textual traces must be transformed to be mapped to data array, requiring researcher’s judgment (e.g., choice of text analysis tool and associated algorithm) of what is relevant. The exact set of variables in the data set may not be determined ahead of time. Potentially relevant traces may take more forms than are usually analyzed with survey responses: text, geolocation, network structure, time and location of log-ins/posts, sites visited, number of friends, inbound and outbound links.

Automatically generated auxiliary information
Survey data: Auxiliary data (paradata) that can be (but are not always) collected include operational measures (e.g., number of calls to reach a household), respondent measures (e.g., response latency, keystrokes), and interviewer measures (behavior during interview, keystrokes).
Social media content: Auxiliary data can include geolocation, profile information, system activity, interaction with others, etc., though these may also be treated as primary data. Auxiliary data may be missing when they are user provided. Analytics companies may extrapolate this information from content.

Data collected via surveys differ in significant ways from data mined from social media sites. As table 2 outlines, one important difference is in their temporal properties. Social media content is posted continuously and often contemporaneously with what is being discussed—what has been called nowcasting (Lampos and Cristianini 2012)—whereas survey responses often involve recall of past events and experiences, and are elicited at discrete moments chosen by researchers.

As table 2 further details, survey and social media data differ in a number of additional ways relevant to the larger question of population versus topic coverage: in the sampled units (individuals versus posts), the sampling frame (population versus corpus), the sampling procedure (probability versus non-probability), the sample size (typically, smaller versus much larger), and the relevance to the research topic of the survey responses or social media posts (directly relevant by definition versus not necessarily relevant). There are also differences in the granularity of possible analyses (i.e., the opportunity to focus analyses on subgroups of people); to adapt Prewitt’s terms in characterizing administrative data (Peytchev 2008), social media data are “case-rich but variable-poor,” meaning that variables that allow measurement of subgroups are rarely contained in posts (even if they can sometimes be imputed). The differences in the nature of the data (“elicited” versus “found”) lead to further differences in how the data are structured, since the social media analyses must be carried out on transformations of textual data. There are also differences in what auxiliary information (paradata, e.g., Couper 2008) is automatically generated.

PRACTICAL AND ETHICAL CONSIDERATIONS (TABLE 3)

Beyond differences in how survey data and social media analyses are generated (table 1) and structured (table 2), there are substantial differences in the costs, practices, and communities surrounding them (table 3). As table 3 details, survey data collection is extremely costly both monetarily and in terms of time (Groves 2011; Presser and McCulloch 2011). Social media data are often claimed to be relatively low in cost to retrieve and analyze (e.g., Murphy et al. 2014; Japec et al. 2015)—although they are certainly not cost free (see Jacobs 2009). 7 Costs for analysis and storage can vary depending on the nature of the access and the amount of data retrieved (De Choudhury et al. 2010); the logistical challenges can be nontrivial (Desouza and Smith 2014). The survey research community has a longer history and has reached greater consensus on how best to collect and analyze data, conceive of error, and address ethical issues, compared to the various data science communities, whose identities and practices are evolving. (This is not to suggest that the survey paradigm is free of challenges or controversies—quite the contrary; see Baker et al. [2013] on nonprobability sampling, and Link et al. [2014] on new challenges for surveys using mobile devices).

Regarding ethical practices, survey researchers largely agree (see AAPOR 2010; Groves et al. 2009, chapter 11) that respondents need to provide explicit consent to participate, that survey studies (at least those receiving US federal funding) need ethical review and regulation before they are deployed, and that identities of respondents must be anonymized in any presentation of findings. There is less clarity about the ethics that need to be considered in obtaining and analyzing social media data. 8 Social media posters are likely often unaware that they have agreed (in their terms of service) that their data can be used for research purposes (Tuunainen, Pitkänen, and Hovi 2009), if the terms of service address research uses of site content at all; social media content can be treated as secondary data that do not need to undergo ethical review (after all, the posts were in many cases intended for an unknown broadcast audience) (Solberg 2010; Zimmer 2010; Grimmelman 2013), and there is no clear consensus on best practices for concealing the identities of social media posters in research presentations (Nissenbaum 2009; Tufekci 2014).

The analytic practices in survey and social media research differ in several ways. The basic analytic approach differs from the start, in that survey researchers define a priori what the relevant variables for analysis will be; in a well-designed survey questionnaire, questions are included because they reflect hypotheses held by the researchers. In contrast, social media analyses often follow a data-driven (bottom-up) approach, making no assumptions about which variables will prove relevant (Aggarwal 2011), although social media data can be used to test hypotheses if researchers have them. This difference in starting assumptions can also affect the potential for researcher bias—scientific or ideological. In surveys, because the variables included in a study are explicitly designed by researchers, what is tested may be limited by a researcher’s preconceived notions; in social media analyses that include variables that were not researcher-designed, unimagined important relationships may be uncovered, at the risk that some findings may prove spurious (e.g., strong prediction of the US stock market S&P 500 index within one time period by butter production in Bangladesh [Leinweber 2007]; see Couper [2013] for discussion of this and other examples).

Statistics for analyzing surveys and comparing models for best fit are relatively well understood, and a number of strategies are used to adjust and weight nonrepresentative data (Rivers 2006; Lee and Valliant 2009), although those adjustments may not succeed in replicating results from representative samples (Loosveldt and Sonck 2008; Erens et al. 2013; Pasek forthcoming). The statistical models for analyzing social media data sets can also be evaluated for goodness of fit and parsimony, but they may be too large for evaluation with p-values, as these are always significant with large enough samples (Murphy, Myors, and Wolach 2009); they can also be easily saturated through the inclusion of many parameters, as a large number of predictors will guarantee a good model fit (at the limit, a perfect fit will be achieved if there are as many predictors as observations; see Fahrmeir et al. [2013]). There does not seem to be consensus on whether adjusting social media data for the nonrepresentativeness of those who post (see Diaz et al. 2014), or for heavy versus light posters, or sampling one versus all posts by a single user (see Kim et al. 2014), is appropriate or relevant; most analyses do not carry out such adjustments. 9
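The two statistical cautions in this paragraph are easy to demonstrate in a short simulation (a sketch using synthetic noise data, purely to illustrate the general point rather than to reproduce any published analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Saturation: with as many predictors as observations, even pure-noise
# predictors fit the outcome perfectly.
n_obs = 50
X = rng.normal(size=(n_obs, n_obs))            # 50 noise predictors, 50 observations
y = rng.normal(size=n_obs)                     # outcome unrelated to X
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r_squared = 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r_squared, 6))                     # ~1.0: a "perfect" but meaningless fit

# (2) Very large samples: a trivially small correlation still produces a test
# statistic far beyond conventional significance cutoffs (|t| ~ 1.96).
n = 2_000_000
x = rng.normal(size=n)
z = 0.01 * x + rng.normal(size=n)              # true correlation of roughly 0.01
r = np.corrcoef(x, z)[0, 1]
t = r * np.sqrt((n - 2) / (1 - r ** 2))
print(round(r, 4), round(t, 1))                # tiny r, yet a very large t
```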

Social media and survey data differ in other important ways relevant to the generation and stability of the ongoing research enterprises. The underlying business models are radically different, with surveys often depending on funding streams from government, nonprofit, or corporate sources. Social media businesses may not persist as user preferences and technologies change; no regulations guarantee the availability of data from social media streams in the future, as social media companies own the data. 10 Both kinds of data face challenges in public perceptions of their value and legitimacy for research. For surveys, falling response rates and survey fatigue, along with a proliferation of surveys that are not conducted to the highest standards, raise the question of how long the public can be relied on to support surveys and to provide anonymous data (Brick and Williams 2013; Kreuter 2013; Massey and Tourangeau 2013). For social media analysis, growing concerns about surveillance—whether by government or corporations—and the resulting loss of privacy (Ellison et al. [2011] and other papers in Trepte and Reinecke [2011]; papers in Fuchs et al. [2011]; Turow, Hennessy, and Draper [2015]) are also likely to change the climate: what people are willing to post, whether new restrictions will prevent research access, and whether posting behavior is stable enough for official measurement. It is unclear whether members of the public would prefer to provide anonymized data that they have consented to provide or to provide access to their social media posts because there is no additional burden.

Applying These Distinctions to Concrete Research Questions

In order to apply the distinctions in tables 1–3, one must consider the specific features of the survey and the social media site in question and how they interact with the topic under investigation. First, surveys vary substantially in their scope and quality; not all surveys meet the gold standard for probability sample surveys described in the tables. Nonprobability panels or convenience samples that do not achieve good population coverage, in particular, may operate on different principles (Baker et al. 2013), and it is unknown how similar the mechanisms by which they achieve topic coverage (when they do) are to the mechanisms leading to topic coverage from social media analyses. There is also substantial variation in the business models underlying surveys—surveys are not guaranteed to persist in perpetuity—as well as in the extent to which they are carefully designed, ethically regulated, and perceived as trustworthy by respondents or the public; a survey may well have less longevity, generalizability, or solid ethical grounding than a social media analysis. Certainly the motivation to participate in a survey and provide accurate answers can vary substantially; respondents who volunteer to participate in an opt-in survey because they are interested in the topic or because they are seeking compensation (e.g., through a crowdsourcing site like Amazon Mechanical Turk; see Antoun et al. [2015]) may be more similar to social media posters who are interested in posting about a topic than respondents who had no intention to participate in a survey prior to being contacted.

Second, there is considerable variation in social practices, participants’ motivations, and communicative dynamics within and across social media sites. On Facebook, for example, the default privacy setting is that only articulated connections (friends) can see a user’s posts; on Twitter, the default privacy setting gives the public access to a user’s posts. 11 Twitter allows for pseudonymous user identities, whereas Facebook highly encourages “real” name use, which alters how people create accounts (individuals are more likely to have multiple accounts on Twitter) and how organizations use these sites (including for uncivil or manipulative “trolling”). And Facebook users have to agree to become Friends (symmetric linking), whereas Twitter users can follow a user without reciprocity (asymmetric linking); this seemingly subtle difference has been associated with major differences in social practices of posting on those sites (Ellison and boyd 2013). Other social media sites have other affordances that must be considered when identifying the motivations of participants and the nature of the data being created in those sites (e.g., Lampe et al. 2010).

Third, the topic under investigation is likely to matter substantially on all the fronts considered in these tables: how participants (survey respondents and social media posters) conceive of their participation, the nature of the data, and the various additional practical and ethical considerations. For example, consider measuring a stigmatized behavior like drug use through a survey versus social media analyses. Denying drug use when directly questioned in a survey is an act of deception, if it has occurred. 12 In contrast, not posting about one’s own drug use on social media is not an act of deception at all. Many posts about the topic may have nothing to do with the poster’s use: a celebrity’s publicized mention of drug use could lead to a spike in posts, but might not reflect changes in the poster’s behavior (Couper 2013; Murphy et al. 2011). The dynamics of posting about stigmatized behaviors—particularly ones that vary in their legality and level of stigma in different subgroups—are little understood.

Conclusion

Our synthesis shows how far we are from understanding the principles for when social media analyses are likely to align with (i.e., come to the same conclusions as) findings from surveys that form the bases for important policy decisions and social understandings. Clearly, social media analyses have an enormous potential to measure, speedily and affordably, not only what surveys now measure but also social phenomena that have never previously been assessed. 13 Just as clearly, there are many questions about whether social media analyses that align with survey findings will be replicable over time, particularly as social media use continues to evolve. There simply is not yet a large enough body of research, across topic domains, analytic techniques, and social media types, to propose a general theory or clear assessment of when social media analyses will converge with survey findings.

Delving into the communicative dynamics underlying not only posting and information propagation on social media sites but also survey responding will be particularly informative for understanding when and why social media analyses and survey measurements converge. For example, we (Conrad et al. 2015) have hypothesized that tweets are more similar to responses to some survey questions than others, in that tweets not only reflect individuals’ own thoughts but also express judgments about the larger social world; this seems similar to survey questions that elicit judgments about the group rather than the self, as when voter expectations of who will win predict election outcomes better than voters’ reports of whom they themselves intend to vote for (Rothschild and Wolfers 2013; Graefe 2014). Consistent with this “collective-versus-self” hypothesis, we (Conrad et al. 2015) have found that sentiment in tweets from 2008 to 2011 containing the word “jobs” aligns more closely (r = .84) with answers to a survey question over that period about business conditions in the country as a whole (collective) than with a question about financial conditions for “you and your family” (self; r = .39). Whether this finding replicates across topic domains and different social media platforms remains to be seen, but we see testing hypotheses about which survey findings might converge with which social media analyses as a promising avenue for further exploration.
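To make the kind of comparison involved concrete, the following minimal sketch (hypothetical values only, not the Conrad et al. [2015] data; the series and function names are placeholders) correlates a monthly tweet-sentiment series with a “collective” and a “self” survey series.

    # Minimal sketch: compare a monthly tweet-sentiment series against two
    # hypothetical survey series to see which aligns more closely.
    from math import sqrt
    from statistics import mean

    def pearson_r(x, y):
        """Pearson correlation between two equal-length sequences."""
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    # Hypothetical monthly aggregates (e.g., share of positive "jobs" tweets
    # and mean survey responses on comparable scales).
    tweet_sentiment   = [0.42, 0.45, 0.40, 0.48, 0.52, 0.50]
    survey_collective = [3.1, 3.3, 3.0, 3.5, 3.8, 3.7]  # "business conditions in the country"
    survey_self       = [3.4, 3.5, 3.6, 3.4, 3.5, 3.6]  # "financial conditions for you and your family"

    print("r with collective item:", round(pearson_r(tweet_sentiment, survey_collective), 2))
    print("r with self item:      ", round(pearson_r(tweet_sentiment, survey_self), 2))

The substantive work, of course, lies in how the sentiment series is constructed and which survey items it is compared against; the correlation itself is the simple part.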

Our review leads us to conclude that understanding the alignment between survey findings and social media analyses will require new conversations in which survey researchers and data scientists step beyond their starting assumptions about social measurement, as well as new empirical work. A major challenge arises from the diversity of backgrounds of the relevant researchers: computer scientists and engineers who use a wide range of methods to analyze social media data (machine learning, natural language processing, lexical analysis), social researchers from different disciplines, communication and media researchers, market and consumer researchers, and so forth. Each relevant discipline requires substantial technical expertise and specialized vocabulary, and researchers from these different areas do not now naturally meet in the same venues (conferences, journals).

To advance the conversation, survey researchers will need to think more broadly about coverage, remaining open to the possibility that the aims of coverage as conceived in survey research may be achievable through mechanisms that do not, on the surface, look like they provide coverage. Social media analysts will need to become more concerned about the volatility and reliability of their predictors, given the speed of change in social media companies’ terms of use, users’ shifting preferences and sensitivities around data privacy, and so forth. They will also need to document their analytic procedures in even greater detail so as to allow clear replication tests, as called for by Jungherr, Jürgens, and Schoen (2012), among others.

We suggest that it is premature to fully endorse or reject the idea that social media analyses could replace surveys in producing official statistics or could reliably provide accurate insights independent of probability-based surveys. As we see it, replacing surveys with social media data could be plausible only when there is an external gold standard that is not a survey (e.g., unemployment insurance claims, as in Antenucci et al.’s [2015] analyses). If the gold standard is a survey, then there would have to be at least occasional surveys in order to calibrate the social media trends. 14 One could imagine, for example, collecting survey data bimonthly instead of monthly and interpolating the estimates in off-months based on social media data—but only if over time the social media trends prove to track sufficiently well with the survey findings.
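A minimal sketch of what such a calibration scheme might look like, using entirely hypothetical monthly data (the names and procedure are illustrative assumptions, not any official practice): the social media index is mapped onto the survey scale using the months where both are observed, and that mapping fills in the off-months.

    # Minimal sketch: calibrate a monthly social media index to a bimonthly
    # survey, then impute the off-months from the calibrated index.
    def fit_line(x, y):
        """Ordinary least squares slope and intercept for y ~ a + b*x."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
        a = my - b * mx
        return a, b

    # Hypothetical series: the survey is fielded only in alternate months.
    social_index = [0.40, 0.44, 0.47, 0.45, 0.50, 0.53]   # months 0..5
    survey       = [55.0, None, 58.0, None, 60.0, None]   # None = not fielded

    observed = [(s, v) for s, v in zip(social_index, survey) if v is not None]
    a, b = fit_line([s for s, _ in observed], [v for _, v in observed])

    filled = [v if v is not None else round(a + b * s, 1)
              for s, v in zip(social_index, survey)]
    print(filled)  # survey series with off-months imputed from the index

Such a scheme would only be defensible if the mapping were re-estimated regularly and repeatedly checked against the periodic survey anchors, which is precisely the tracking condition noted above.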

Footnotes

1. As of 2013, 72 percent of Internet-using adults in the United States used social media of one sort or another (Brenner and Smith 2013), and the amount of data is indeed tsunami-like: for example, as of August 2013, over 500 million tweets were being produced each day (Krikorian 2013).

2. With poorly designed questions, a survey might not have the desired topic coverage even with good population coverage; the survey will end up with coverage of the topic the researcher actually asked about, rather than what the researcher intended to ask about. And of course topic coverage may not be achieved if the distribution of opinions and experiences among sample members who do not respond (nonrespondents) differs from the distribution for those who do (Groves et al. 2009, chapter 6, among many others).

3. Propagation in social media may well function similarly to “cascades of influence” for public opinion more generally (Watts and Dodds 2007). The fact that some social media may function as “echo chambers,” with participants choosing to be exposed only to content to which they are already sympathetic (see Bakshy, Messing, and Adamic [2015] for discussion), might actually contribute to how social media could distill or sum up broader conversations.

4. See Wilson, Wiebe, and Hoffmann’s (2005) 2,800-word OpinionFinder dictionary, and more complex later efforts (Velikovich et al. 2010). Sentiment dictionaries can vary (for example, in their focus on positive and negative words) in ways that potentially affect the extent to which social media analyses align with survey results; see González-Bailón and Pantoglou (2015) for a useful comparison of automated content analysis tools.
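For readers unfamiliar with dictionary-based scoring, the following toy sketch shows the basic mechanism such comparisons rest on; the tiny word lists are made-up stand-ins, not the OpinionFinder lexicon.

    # Toy sketch of dictionary-based sentiment scoring (illustrative word lists).
    POSITIVE = {"good", "great", "gain", "hopeful", "hiring"}
    NEGATIVE = {"bad", "worse", "loss", "worried", "layoffs"}

    def sentiment_score(post):
        """Return (#positive - #negative) / #tokens for one post; 0 if empty."""
        tokens = post.lower().split()
        if not tokens:
            return 0.0
        pos = sum(t in POSITIVE for t in tokens)
        neg = sum(t in NEGATIVE for t in tokens)
        return (pos - neg) / len(tokens)

    posts = ["hiring is up and prospects look good",
             "more layoffs and everyone is worried"]
    print([round(sentiment_score(p), 2) for p in posts])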

5. Most attempts at replication have focused on different countries (e.g., Sang and Bos [2012] report a replication in the Netherlands of Tumasjan et al. [2010], which was conducted in Germany), time frames (e.g., Shi et al. [2012] replicate both O’Connor et al. [2010] and Tumasjan et al. [2010] during a different period, focusing on a US primary election), or keywords, but they do not necessarily hold all details constant in ways that allow a systematic understanding of which variables matter.

6. Note that the well-publicized replication failure in Google Flu (Butler 2013) is a case based on Google searches for flu symptoms, rather than social media posts. The reasons proposed for this failure may be related to replication failures involving social media content: difficulty discriminating between news coverage interest and actual flu (Butler 2013), inadequate updates to account for changes in search behavior (Cook et al. 2011) or the search algorithm (Lazer et al. 2014), or problems with search term selection stemming from overfitting among the models Google researchers were using. But the communicative dynamics involved in searching are quite different from those involved in social media posting.

7. Not every social media site allows researchers access to the data stream, and access costs can vary across sites or third-party distributors of social media data. Researchers may be unaware of the actual costs if their organization (university department or company) buys a subscription to a third-party service. One subscription may allow a large number of research efforts on the same data stream, which can distribute costs across many researchers with different research questions.

8. Note that analyzing already-existing social media posts, as discussed here, is a quite different enterprise than experimentally manipulating the feeds provided to potentially unsuspecting social media posters and tracking their subsequent behavior (see, e.g., the controversies and discussions about consent and regulatory review following the publication of Kramer, Guillory, and Hancock [2014]). Although the ethical questions in both practices may be related, they are distinct.

9. The range of levels and kinds of posting within a single social media site can be enormous, with a small percentage of users accounting for a large percentage of the posting and reposting (Wu et al. 2011).

10. Availability could increase over time through public repositories; see the US Library of Congress’s projects to archive social media for research purposes (http://www.digitalpreservation.gov/, which proposes to archive all Twitter feeds: https://blog.twitter.com/2010/tweet-preservation).

11. Although both Facebook and Twitter allow users to change these settings, the majority of users of software systems in general and of social media in particular tend to follow the defaults (Mackay 1991; Lampe, Johnston, and Resnick 2007).

12. It is, of course, unclear how often underreporting of illegal and potentially embarrassing behaviors in surveys (Tourangeau and Smith 1996; Turner et al. 1998; Tourangeau and Yan 2007; Kreuter, Presser, and Tourangeau 2008; Couper, Tourangeau, and Marvin 2009) is consciously intended to be deceptive (Schaeffer 2000; Schober and Glick 2011).

13. See, for example, Wang et al.’s (2012) application for tracking public reactions to the nine Republican candidates during the 2012 US election campaign in sliding five-minute intervals.

14. For any particular comparison between social media and survey data, it will be important for researchers to be clear about whether there is an external standard to which both kinds of data can be compared, or whether the survey data are the only available gold standard.

References

  1. AAPOR 2010. The Code of Professional Ethics and Practices (May). Available at http://www.aapor.org/AAPORKentico/AAPOR_Main/media/MainSiteFiles/RevisedCode(5_2010)_withLogo.pdf. [Google Scholar]
  2. Aggarwal Charu C, ed. 2011. An Introduction to Social Network Data Analytics. New York: Springer US. [Google Scholar]
  3. Ampofo Lawrence, Anstead Nick, O’Loughlin Ben. 2011. “Trust, Confidence, and Credibility: Citizen Responses on Twitter to Opinion Polls during the 2010 UK General Election.” Information, Communication & Society 14:850–71. [Google Scholar]
  4. Ampofo, Lawrence, Simon Collister, Ben O’Loughlin, and Andrew Chadwick. 2015. “Text Mining and Social Media: When Quantitative Meets Qualitative and Software Meets People.” In Innovations in Digital Research Methods, edited by Peter Halfpenny and Rob Procter, 161–91. London: Sage. [Google Scholar]
  5. Antenucci Dolan, Cafarella Michael, Levenstein Margaret C., Ré Christopher, Shapiro Matthew D. 2015. “Using Social Media to Measure Labor Market Flows.” Unpublished manuscript.
  6. Antoun, Christopher, Chan Zhang, Frederick G. Conrad, and Michael F. Schober. 2015. “Comparisons of Online Recruitment Strategies for Convenience Samples: Craigslist, Google AdWords, Facebook and Amazon’s Mechanical Turk.” Field Methods. doi:10.1177/1525822X15603149 [Google Scholar]
  7. Asur Sitaram, Huberman Bernardo A. 2010. “Predicting the Future with Social Media.” In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference, IEEE; 1:492–99. [Google Scholar]
  8. Baker Reg, Brick J. Michael, Bates Nancy A., Battaglia Mike, Couper Mick P., Dever Jill A., Gile Krista J., Tourangeau Roger. 2013. “Summary Report of the AAPOR Task Force on Non-Probability Sampling.” Journal of Survey Statistics and Methodology 1:90–143. [Google Scholar]
  9. Bakshy Eytan, Messing Solomon, Adamic Lada. 2015. “Exposure to Ideologically Diverse News and Opinion on Facebook.” Science:aaa1160. [DOI] [PubMed] [Google Scholar]
  10. Biemer Paul. 2014. “Toward a Total Error Framework for Big Data.” Paper presented at the Annual Conference of the American Association for Public Opinion Research, Anaheim, CA, USA. [Google Scholar]
  11. Biemer Paul P., Lyberg Lars E. 2003. Introduction to Survey Quality. Hoboken, NJ: John Wiley & Sons. [Google Scholar]
  12. boyd danah, Crawford Kate. 2012. “Critical Questions for Big Data.” Information, Communication & Society 15:662–79. [Google Scholar]
  13. Brenner Joanna, Smith Aaron. 2013. “72% of Online Adults Are Social Networking Site Users.” Pew Research Center Internet & American Life Project (August 5). Available at http://pewinternet.org/Reports/2013/social-networking-sites.aspx.
  14. Brick J. Michael, Williams Douglas. 2013. “Explaining Rising Nonresponse Rates in Cross-Sectional Surveys.” ANNALS of the American Academy of Political and Social Science 645:36–59. [Google Scholar]
  15. Butler D. 2013. “When Google Got Flu Wrong.” Nature 494:155–56. [DOI] [PubMed] [Google Scholar]
  16. Castro Rubén. 2013. “Inconsistent Respondents and Sensitive Questions.” Field Methods 25:283–98. [Google Scholar]
  17. Ceron Andrea, Curini Luigi, Iacus Stefano M., Porro Giuseppe. 2014. “Every Tweet Counts? How Sentiment Analysis of Social Media Can Improve Our Knowledge of Citizens’ Political Preferences with an Application to Italy and France.” New Media & Society 16:340–58. [Google Scholar]
  18. Conrad Frederick G., Schober Michael F., Pasek Josh, Guggenheim Lauren, Lampe Cliff, Hou Elizabeth. 2015. “A ‘Collective-vs.-Self’ Hypothesis for When Twitter and Survey Data Tell the Same Story.” Paper presented at the Annual Conference of the American Association for Public Opinion Research, Hollywood, FL, USA. [Google Scholar]
  19. Cook Samantha, Conrad Corrie, Fowlkes Ashley L., Mohebbi Matthew H. 2011. “Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic.” PLOS ONE 6(8):e23610. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Couper Mick P. 2008. Designing Effective Web Surveys. New York: Cambridge University Press. [Google Scholar]
  21. ———. 2013. “Is the Sky Falling? New Technology, Changing Media, and the Future of Surveys.” Survey Research Methods 7(3):145–56. [Google Scholar]
  22. Couper Mick P., Tourangeau Roger, Marvin Theresa. 2009. “Taking the Audio Out of Audio-CASI.” Public Opinion Quarterly 73:281–303. [Google Scholar]
  23. De Choudhury Munmun, Lin Yu-Ru, Sundaram Hari, Candan K. Selcuk, Xie Lexing, Kelliher Aisling. 2010. “How Does the Data Sampling Strategy Impact the Discovery of Information Diffusion in Social Media?” ICWSM 10:34–41. [Google Scholar]
  24. Desouza Kevin C., Smith Kendra L. 2014. “Big Data for Social Innovation.” Stanford Social Innovation Review 13:39–43. [Google Scholar]
  25. Diaz Fernando, Gamon Michael, Hofman Jake, Kiciman Emre, Rothschild David. 2014. “Online and Social Media Data as a Flawed Continuous Panel Survey.” Available at http://research.microsoft.com/en-us/projects/flawedsurvey. [DOI] [PMC free article] [PubMed]
  26. Ellison Nicole B., boyd danah m. 2013. “Sociality through Social Network Sites.” In The Oxford Handbook of Internet Studies, edited by Dutton William H., 151–72. Oxford: Oxford University Press. [Google Scholar]
  27. Ellison Nicole B., Vitak Jessica, Steinfield Charles, Gray Rebecca, Lampe Cliff. 2011. “Negotiating Privacy Concerns and Social Capital Needs in a Social Media Environment.” In Privacy Online: Perspectives on Privacy and Self-Disclosure in the Social Web, edited by Trepte Sabine and Reinecke Leonard, 19–32. Berlin, Heidelberg: Springer Verlag. [Google Scholar]
  28. Erens Bob, Burkill Sarah, Copas Andrew, Couper Mick, Conrad Fred. 2013. “How Well Do Volunteer Web Panel Surveys Measure Sensitive Behaviours in the General Population, and Can They Be Improved? A Comparison with the Third British National Survey of Sexual Attitudes & Lifestyles (Natsal-3).” The Lancet 382:S34. [Google Scholar]
  29. Fahrmeir Ludwig, Kneib Thomas, Lang Stefan, Marx Brian. 2013. Regression: Models, Methods, and Applications. New York: Springer Science & Business Media. [Google Scholar]
  30. Fu King-wa, Chan Chee-hon. 2013. “Analyzing Online Sentiment to Predict Telephone Poll Results.” Cyberpsychology, Behavior, and Social Networking 16:702–7. [DOI] [PubMed] [Google Scholar]
  31. Fuchs Christian, Boersma Kees, Albrechtslund Anders, Sandoval Marisol. 2011. Internet and Surveillance: The Challenges of Web 2.0 and Social Media. New York: Routledge. [Google Scholar]
  32. Gallicano Tiffany Derville, Brett Kevin, Hopp Toby. 2013. “Is Ghost Blogging Like Speechwriting? A Survey of Practitioners about the Ethics of Ghost Blogging.” Public Relations Journal 7:1–41. [Google Scholar]
  33. Gayo-Avello Daniel. 2012. “No, You Cannot Predict Elections with Twitter.” IEEE Internet Computing 16:91–94. [Google Scholar]
  34. Gayo-Avello Daniel, Metaxas Panagiotis Takis, Mustafaraj Eni. 2011. “Limits of Electoral Predictions Using Twitter.” Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, 490–93. [Google Scholar]
  35. Gilbert Eric, Karahalios Karrie. 2010. “Widespread Worry and the Stock Market.” Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM), 10:59–65. [Google Scholar]
  36. Golder Scott A., Macy Michael W. 2011. “Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures.” Science 333:1878–1881. [DOI] [PubMed] [Google Scholar]
  37. González-Bailón Sandra, Pantoglou Georgios. 2015. “Public Opinion in Online Communication: A Comparison of Methods and Data Sources.” ANNALS of the American Academy of Political and Social Science 659:95–107. [Google Scholar]
  38. Graefe Andreas. 2014. “Accuracy of Vote Expectation Surveys in Forecasting Elections.” Public Opinion Quarterly 78:204–32. [Google Scholar]
  39. Grimmelman James. 2013. “The Law and Ethics of Experiments on Social Media Users.” Colorado Technology Law Journal 13:219–71. [Google Scholar]
  40. Groves Robert M. 2006. “Nonresponse Rates and Nonresponse Bias in Household Surveys.” Public Opinion Quarterly 70:646–75. [Google Scholar]
  41. ———. 2011. “Three Eras of Survey Research.” Public Opinion Quarterly 75:861–71. [Google Scholar]
  42. Groves Robert M., Fowler Floyd J., Jr., Couper Mick P., Lepkowski James M., Singer Eleanor, Tourangeau Roger. 2009. Survey Methodology, 2nd ed. Hoboken, NJ: John Wiley & Sons. [Google Scholar]
  43. Groves Robert M., Lyberg Lars. 2010. “Total Survey Error: Past, Present, and Future.” Public Opinion Quarterly 74:849–79. [Google Scholar]
  44. Guggenheim Lauren, Pasek Josh, Lampe Cliff, Schober Michael F., Conrad Frederick G., Wagner Ellen, Brown Lindsay K. 2014. “Can Social Media Data Predict Survey Data? A Meta-Analytic Review of the Literature.” Paper presented at the Annual Conference of the American Association for Public Opinion Research, Anaheim, CA, USA. [Google Scholar]
  45. Hill Craig A., Dever Jill. 2014. “The Future of Social Media, Sociality, and Survey Research.” In Social Media, Sociality, and Survey Research, edited by Hill Craig A., Dean Elizabeth, Murphy Joe, 295–317. Hoboken, NJ: John Wiley & Sons. [Google Scholar]
  46. Huberty Mark E. 2013. “Multi-Cycle Forecasting of Congressional Elections with Social Media.” Proceedings of the 2nd Workshop on Politics, Elections, and Data, 23–30. [Google Scholar]
  47. Jacobs Adam. 2009. “The Pathologies of Big Data.” Communications of the ACM 52:36–44. [Google Scholar]
  48. Jang S. Mo, Pasek Josh. 2015. “Assessing the Carrying Capacity of Twitter and Online News.” Mass Communication and Society 18:577–598. [Google Scholar]
  49. Japec Lilli, Kreuter Frauke, Berg Marcus, Biemer Paul, Decker Paul, Lampe Cliff, Lane Julia, O’Neil Cathy, Usher Abe. 2015. “Big Data in Survey Research.” Public Opinion Quarterly 79(4):839–880. [Google Scholar]
  50. Jensen Michael J., Anstead Nick. 2013. “Psephological Investigations: Tweets, Votes, and Unknown Unknowns in the Republican Nomination Process.” Policy & Internet 5:161–82. [Google Scholar]
  51. Jungherr Andreas, Jürgens Pascal, Schoen Harald. 2012. “Why the Pirate Party Won the German Election of 2009 or the Trouble with Predictions: A Response to Tumasjan, A., Sprenger, T. O., Sander, P. G., & Welpe, I. M. ‘Predicting Elections with Twitter: What 140 Characters Reveal About Political Sentiment.’” Social Science Computer Review 30:229–34. [Google Scholar]
  52. Keeter Scott. 2012. “Presidential Address: Survey Research, Its New Frontiers, and Democracy.” Public Opinion Quarterly 76:600–608. [Google Scholar]
  53. Keeter Scott, Kennedy Courtney, Dimock Michael, Best Jonathan, Craighill Peyton. 2007. “Gauging the Impact of Growing Nonresponse on Estimates from a National RDD Telephone Survey.” Public Opinion Quarterly 70:759–79. [Google Scholar]
  54. Kim Annice, Murphy Joe, Richards Ashley, Hansen Heather, Howell Rebecca, Haney Carol. 2014. “Can Tweets Replace Polls? A US Health-Care Reform Case Study.” In Social Media, Sociality, and Survey Research, edited by Hill Craig A., Dean Elizabeth, Murphy Joe, 61–86. Hoboken, NJ: John Wiley & Sons. [Google Scholar]
  55. Kramer Adam D. I., Guillory Jamie E., Hancock Jeffrey T. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences 111:8788–8790. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Kreuter Frauke. 2013. “Facing the Nonresponse Challenge.” ANNALS of the American Academy of Political and Social Science 645:23–35. [Google Scholar]
  57. Kreuter Frauke, Presser Stanley, Tourangeau Roger. 2008. “Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity.” Public Opinion Quarterly 72:847–65. [Google Scholar]
  58. Krikorian Raffi. 2013. “New Tweets Per Second Record, and How! The Twitter Engineering Blog: Information from Twitter’s Engineering Team about our Technology, Tools, and Events.” Available at https://blog.twitter.com/2013/new-Tweets-per-second-record-and-how.
  59. Lampe Cliff A. C., Johnston Erik, Resnick Paul. 2007. “Follow the Reader: Filtering Comments on Slashdot.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1253–1262. [Google Scholar]
  60. Lampe Cliff, Wash Rick, Velasquez Alcides, Ozkaya Elif. 2010. “Motivations to Participate in Online Communities.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1927–1936. [Google Scholar]
  61. Lampos Vasileios, Cristianini Nello. 2012. “Nowcasting Events from the Social Web with Statistical Learning.” ACM Transactions on Intelligent Systems and Technology (TIST) 3:72–93. [Google Scholar]
  62. Landauer Thomas, Foltz Peter W., Laham Darrell. 1998. “Introduction to Latent Semantic Analysis.” Discourse Processes 25:259–84. [Google Scholar]
  63. Langer Research Associates 2013. “Briefing Paper: Social Media and Public Opinion.” Available at http://www.langerresearch.com/uploads/Langer_Research_Briefing_Paper-Social_Media_and_Public_Opinion.pdf.
  64. Lazer David M., Kennedy Ryan, King Gary, Vespignani Alessandro. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343:1203–1205. [DOI] [PubMed] [Google Scholar]
  65. Lee Sunghee, Valliant Richard. 2009. “Estimation for Volunteer Panel Web Surveys Using Propensity Score Adjustment and Calibration Adjustment.” Sociological Methods & Research 37:319–43. [Google Scholar]
  66. Leinweber David J. 2007. “Stupid Data Miner Tricks: Overfitting the S&P 500.” Journal of Investing 16:15–22. [Google Scholar]
  67. Link Michael W., Murphy Joe, Schober Michael F., Buskirk Trent D., Childs Jennifer Hunter, Tesfaye Casey Langer. 2014. “Mobile Technologies for Conducting, Augmenting and Potentially Replacing Surveys: Executive Summary of the AAPOR Task Force on Emerging Technologies in Public Opinion Research.” Public Opinion Quarterly 78:779–87. [Google Scholar]
  68. Loosveldt Geert, Sonck Nathalie. 2008. “An Evaluation of the Weighting Procedures for an Online Access Panel Survey.” Survey Research Methods 2:93–105. [Google Scholar]
  69. Mackay Wendy E. 1991. “Triggers and Barriers to Customizing Software.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 153–60. [Google Scholar]
  70. Marwick Alice E., boyd danah. 2011a. “I Tweet Honestly, I Tweet Passionately: Twitter Users, Context Collapse, and the Imagined Audience.” New Media & Society 13:114–33. [Google Scholar]
  71. ———. 2011b. “To See and Be Seen: Celebrity Practice on Twitter.” Convergence: The International Journal of Research into New Media Technologies 17:139–58. [Google Scholar]
  72. Massey Douglas S., Tourangeau Roger. 2013. “Introduction: New Challenges to Social Measurement.” ANNALS of the American Academy of Political and Social Science 645:6–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Mayer-Schönberger Viktor, Cukier Kenneth. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. New York: Eamon Dolan/Houghton Mifflin Harcourt. [Google Scholar]
  74. McNamara Danielle S., Graesser Arthur C., McCarthy Philip M., Cai Zhiqiang. 2014. Automated Evaluation of Text and Discourse with Coh-Metrix. New York: Cambridge University Press. [Google Scholar]
  75. Moreno Juan Manuel Torres. 2014. Automatic Text Summarization. Hoboken, NJ: John Wiley & Sons. [Google Scholar]
  76. Murphy Joe, Hill Craig A., Dean Elizabeth. 2014. “Social Media, Sociality, and Survey Research.” In Social Media, Sociality, and Survey Research, edited by Hill Craig A., Dean Elizabeth, Murphy Joe, 1–33. Hoboken, NJ: John Wiley & Sons. [Google Scholar]
  77. Murphy Joe, Kim Annice, Hagood Heather, Richards Ashley, Augustine Cynthia, Kroutil Larry, Sage Adam. 2011. “Twitter Feeds and Google Search Query Surveillance: Can They Supplement Survey Data Collection?” In Shifting the Boundaries of Research: Proceedings of the Sixth ASC International Conference, Bristol, UK, edited by Birks David. et al. , 228–45. Association for Survey Computing. [Google Scholar]
  78. Murphy Joe, Link Michael W., Childs Jennifer Hunter, Tesfaye Casey Langer, Dean Elizabeth, Stern Michael, Pasek Josh, Cohen Jon, Callegaro Mario, Harwood Paul. 2014. “Social Media in Public Opinion Research: Executive Summary of the AAPOR Task Force on Emerging Technologies in Public Opinion Research.” Public Opinion Quarterly 78:788–94. [Google Scholar]
  79. Murphy Kevin R., Myors Brett, Wolach Allen. 2009. Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 3rd ed. New York: Routledge. [Google Scholar]
  80. Nascimento Thiago D., et al. 2014. “Real-Time Sharing and Expression of Migraine Headache Suffering on Twitter: A Cross-Sectional Infodemiology Study.” Journal of Medical Internet Research 16(4):e96. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Neuman W. Russell, Guggenheim Lauren, Jang S. Mo, Bae Soo Young. 2014. “The Dynamics of Public Attention: Agenda-Setting Theory Meets Big Data.” Journal of Communication 64:193–214. [Google Scholar]
  82. Newman Mark W., Lauterbach Debra, Munson Sean A., Resnick Paul, Morris Margaret E. 2011. “‘It’s Not That I Don’t Have Problems, I’m Just Not Putting Them on Facebook’: Challenges and Opportunities in Using Online Social Networks for Health.” Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, 341–50. [Google Scholar]
  83. Nissenbaum Helen. 2009. Privacy in Context: Technology, Policy, and the Integrity of Social Life. Palo Alto, CA: Stanford University Press. [Google Scholar]
  84. Niraula Nobal B., Rus Vasile. 2014. “A Machine Learning Approach to Pronominal Anaphora Resolution in Dialogue Based Intelligent Tutoring Systems.” In Computational Linguistics and Intelligent Text Processing, 307–18. Berlin, Heidelberg: Springer. [Google Scholar]
  85. Oberlander Jon, Gill Alastair J. 2006. “Language with Character: A Stratified Corpus Comparison of Individual Differences in E-Mail Communication.” Discourse Processes 42:239–70. [Google Scholar]
  86. O’Connor Brendan, Balasubramanyan Ramnath, Routledge Bryan R., Smith Noah A. 2010. “From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series.” Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM) 11:122–29. [Google Scholar]
  87. Pasek Josh. In press. “When Will Nonprobability Surveys Mirror Probability Surveys? Considering Types of Inference and Weighting Strategies as Conditions for Correspondence.” International Journal of Public Opinion Research. doi:10.1093/ijpor/edv016 [Google Scholar]
  88. Pavalanathan Umashanti, De Choudhury Munmun. 2015. “Identity Management and Mental Health Discourse in Social Media.” WWW 2015 Companion, Florence, Italy. ACM 978-1-4503-3473-0/15/05. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. Peytchev Andy. 2008. “An Interview with Kenneth Prewitt.” Survey Practice 1(1). ISSN: 2168-0094. [Google Scholar]
  90. Presser Stanley, McCulloch Susan. 2011. “The Growth of Survey Research in the United States: Government-Sponsored Surveys, 1984–2004.” Social Science Research 40:1019–1024. [Google Scholar]
  91. Prewitt Kenneth. 2013. “The 2012 Morris Hansen Lecture: Thank You Morris, et al., for Westat, et al.” Journal of Official Statistics 29:223–31. [Google Scholar]
  92. Rajadesingan Ashwin, Zafarani Reza, Liu Huan. 2015. “Sarcasm Detection on Twitter: A Behavioral Modeling Approach.” Proceedings of WSDM (Web Search and Data Mining) 2015, February 2–6, Shanghai, China. [Google Scholar]
  93. Rivers Douglas. 2006. “Sample Matching: Representative Sampling from Internet Panels.” Polimetrix White Paper Series. Palo Alto, CA: YouGovPolimetrix. [Google Scholar]
  94. Rosenthal Robert. 1979. “The File Drawer Problem and Tolerance for Null Results.” Psychological Bulletin 86:638–41. [Google Scholar]
  95. Rothschild David, Wolfers Justin. 2013. “Forecasting Elections: Voter Intentions versus Expectations.” Available at http://nber.org/~jwolfers/research.php#voterexpectations.
  96. Sanders Eric, Van Den Bosch Antal. 2013. “Relating Political Party Mentions on Twitter with Polls and Election Results.” Proceedings of the 13th Dutch-Belgian Workshop on Information Retrieval, 68–71. [Google Scholar]
  97. Sang Erik Tjong Kim, Bos Johan. 2012. “Predicting the 2011 Dutch Senate Election Results with Twitter.” Proceedings of the Workshop on Semantic Analysis in Social Media, 53–60. Association for Computational Linguistics. [Google Scholar]
  98. Savage Mike, Burrows Roger. 2007. “The Coming Crisis of Empirical Sociology.” Sociology 41:885–99. [Google Scholar]
  99. Schaeffer Nora Cate. 2000. “Asking Questions about Sensitive Topics: A Selective Overview.” In The Science of Self-Report: Implications for Research and Practice, edited by Stone Arthur A., Bachrach Christine A., Jobe Jared B., Kurtzman Howard S., Cain Virginia S., Turkkan Jaylan, 105–21. New York: Taylor & Francis. [Google Scholar]
  100. Schober Michael F., Conrad Frederick G., Antoun Christopher, Ehlen Patrick, Fail Stefanie, Hupp Andrew L., Johnston Michael, Vickers Lucas, Yan H. Yanna, Zhang Chan. 2015. “Precision and Disclosure in Text and Voice Interviews on Smartphones.” PLOS ONE 10(6):e0128337. Available at http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0128337. [DOI] [PMC free article] [PubMed] [Google Scholar]
  101. Schober Michael F., Glick Peter J. 2011. “Self-Deceptive Speech: A Psycholinguistic View.” In Personality and Psychopathology: Critical Dialogues with David Shapiro, edited by Piers Craig, 183–200. New York: Springer. [Google Scholar]
  102. Schwartz Barry, Ward Andrew, Monterosso John, Lyubomirsky Sonja, White Katherine, Lehman Darrin R. 2002. “Maximizing Versus Satisficing: Happiness Is a Matter of Choice.” Journal of Personality and Social Psychology 83:1178–1197. [DOI] [PubMed] [Google Scholar]
  103. Shermis Mark D., Burstein Jill. 2013. Handbook of Automated Essay Evaluation: Current Applications and New Directions. New York: Routledge. [Google Scholar]
  104. Shi Lei, Agarwal Neeraj, Agrawal Ankur, Garg Rahul, Spoelstra Jacob. 2012. “Predicting US Primary Elections with Twitter.” Proceedings of Social Network and Social Media Analysis: Methods, Models, and Applications (NIPS Workshop), Lake Tahoe, NV, USA, December 7. Available at http://snap.stanford.edu/social2012/papers/shi.pdf.
  105. Smith Marc A., Rainie Lee, Shneiderman Ben, Himelboim Itai. 2014. “Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters.” Pew Research Center. Available at http://www.pewinternet.org/2014/02/20/mapping-twitter-topic-networks-from-polarized-crowds-to-community-clusters/.
  106. Smith Tom W. 2013. “Survey-Research Paradigms Old and New.” International Journal of Public Opinion Research 25:218–29. [Google Scholar]
  107. Solberg Lauren B. 2010. “Data Mining on Facebook: A Free Space for Researchers or an IRB Nightmare?” (November 28). Journal of Law, Technology and Policy, No. 2. Available at http://ssrn.com/abstract=2182169.
  108. Taylor Sean J. 2013. “Real Scientists Make Their Own Data.” Available at http://seanjtaylor.com/post/41463778912/real-scientists-make-their-own-data.
  109. Tirunillai Seshadri, Tellis Gerard J. 2014. “Mining Marketing Meaning from Online Chatter: Strategic Brand Analysis of Big Data Using Latent Dirichlet Allocation.” Journal of Marketing Research 51:463–79. [Google Scholar]
  110. Tourangeau Roger, Smith Tom W. 1996. “Asking Sensitive Questions: The Impact of Data Collection Mode, Question Format, and Question Context.” Public Opinion Quarterly 60:275–304. [Google Scholar]
  111. Tourangeau Roger, Yan Ting. 2007. “Sensitive Questions in Surveys.” Psychological Bulletin 133:859–83. [DOI] [PubMed] [Google Scholar]
  112. Trepte Sabine, Reinecke Leonard. 2011. Privacy Online: Perspectives on Privacy and Self-Disclosure in the Social Web. Berlin, Heidelberg: Springer Verlag. [Google Scholar]
  113. Tufekci Zeynep. 2014. “Big Questions for Social Media Big Data: Representativeness, Validity, and Other Methodological Pitfalls.” ICWSM ‘14: Proceedings of the International AAAI Conference on Weblogs and Social Media. arXiv preprint arXiv:1403.7400.
  114. Tumasjan Andranik, Sprenger Timm O., Sandner Philipp G., Welpe Isabell M. 2010. “Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment.” ICWSM ‘10: Proceedings of the International AAAI Conference on Weblogs and Social Media, 178–85.
  115. Turner Charles F., Ku Leighton, Rogers Susan M., Lindberg Laura D., Pleck Joseph H., Sonenstein Freya L. 1998. “Adolescent Sexual Behavior, Drug Use and Violence: Increased Reporting with Computer Survey Technology.” Science 280:867–73. [DOI] [PubMed] [Google Scholar]
  116. Turow Joseph, Hennessy Michael, Draper Nora. 2015. “The Tradeoff Fallacy: How Marketers Are Misrepresenting American Consumers and Opening Them Up to Exploitation.” Report from the Annenberg School for Communication, University of Pennsylvania. Available at https://www.asc.upenn.edu/sites/default/files/TradeoffFallacy_0.pdf.
  117. Tuunainen Virpi Kristiina, Pitkänen Olli, Hovi Marjaana. 2009. “Users’ Awareness of Privacy on Online Social Networking Sites—Case Facebook.” BLED 2009 Proceedings, Paper 42. Available at http://aisel.aisnet.org/bled2009/42.
  118. Velikovich Leonid, Blair-Goldensohn Sasha, Hannan Kerry, McDonald Ryan. 2010. “The Viability of Web-Derived Polarity Lexicons.” Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 777–85. [Google Scholar]
  119. Wang Hao, Dogan Can, Kazemzadeh Abe, Bar François, Narayanan Shrikanth. 2012. “A System for Real-Time Twitter Sentiment Analysis of 2012 US Presidential Election Cycle.” Proceedings of the ACL 2012 System Demonstrations, ACL ‘12, 115–20. [Google Scholar]
  120. Watts Duncan J., Dodds Peter Sheridan. 2007. “Influentials, Networks, and Public Opinion Formation.” Journal of Consumer Research 34:441–58. [Google Scholar]
  121. Weisberg Herbert F. 2005. The Total Survey Error Approach: A Guide to the New Science of Survey Research. Chicago: University of Chicago Press. [Google Scholar]
  122. Wiegand Michael, Balahur Alexandra, Roth Benjamin, Klakow Dietrich, Montoyo Andrés. 2010. “A Survey on the Role of Negation in Sentiment Analysis.” Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, 60–68. [Google Scholar]
  123. Wilson Robert E., Gosling Samuel D., Graham Lindsay T. 2012. “A Review of Facebook Research in the Social Sciences.” Perspectives on Psychological Science 7:203–20. [DOI] [PubMed] [Google Scholar]
  124. Wilson Theresa, Wiebe Janyce, Hoffmann Paul. 2005. “Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis.” Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 347–54. [Google Scholar]
  125. Wu Shaomei, Hofman Jake M., Mason Winter A., Watts Duncan J. 2011. “Who Says What to Whom on Twitter.” Proceedings of the 20th International Conference on World Wide Web, 705–14. [Google Scholar]
  126. Zimmer Michael. 2010. “‘But the Data Is Already Public’: On the Ethics of Research in Facebook.” Ethics and Information Technology 12:313–25. [Google Scholar]
