Confusion over the definition of “snowball sampling” reflects a phenomenon in the sociology of science: multi-disciplinary fields tend to produce a plethora of inconsistent terminology. Often the meaning of a term evolves over time, or different terms are used for the same concept. More confusing is the use of the same term for different concepts. The term “snowball sampling” suffers from this treatment.
The term “snowball sampling” has likely been in informal use for a long time, and it certainly pre-dates Coleman (1958) and Trow (1957). The earliest systematic work dates to the 1940s from the Columbia Bureau of Applied Social Research, led by Paul Lazarsfeld. The Bureau became interested in the empirical study of personal influence via media (Barton, 2001). This led to the consideration of interpersonal environments and to the identification of opinion leaders and followers. However, standard sampling of individuals was regarded as ineffective for studying the relations between opinion leaders and followers, as pairs related in this way were seldom both selected in the sample (Lazarsfeld, Berelson, and Gaudet, 1944, pp. 49–50). To address this, Robert Merton asked individuals in an initial diverse sample to name the people who influenced them. From these, a second wave of influential people was interviewed as a “snowball sample” (Merton, 1949). This approach was expanded in a panel survey of women in a Midwestern town in 1945 (Katz and Lazarsfeld, 1955). Barton (2001) provides a history of the work of the Bureau that is still relevant to today’s study of social media.
Trow’s objective was to understand the support for anti-democratic popular movements. To do this, he conducted an empirical study of the political orientations and behaviors of men in Bennington, Vermont, in 1954, with particular focus on their support for Senator McCarthy. Trow conducted a snowball sample over the friendship networks of the men, starting from “arbitrarily chosen lists of employees and occupational groups” (Trow, 1957, p. 297). He is very clear that this does not produce a representative sample, and goes on to provide a discussion of the issues with network sampling that is still relevant today (Trow, 1957, pp. 290–295). He surmises: “The resulting sample, while not meant to be representative of any specific population, nevertheless includes representatives of all the important occupational groups, …”
Following on from these foundations, Coleman, Katz, and Menzel (1957) used the approach to collect information on influence patterns among physicians. Coleman (1958) is now the primary reference for the meaning of snowball sampling. He defines it as follows: “Snowball sampling: One method of interviewing a man’s immediate social environment is to use the sociometric questions in the interview for sampling purposes.” He describes Trow’s work as the example.
Acknowledging Coleman (1958), Goodman (1961) introduced “s stage k name snowball sampling”, a specific form of snowball sampling. Goodman’s formulation requires an initial sample drawn using a probability method on a known sampling frame. It also fixes parameters of the sampling process: the number of links followed from each participant (k) and the number of waves of the sample (s). In this work, Goodman develops a rigorous statistical approach to estimating certain relational features (number of mutual ties, triangles, etc.) based on the resulting sample. Just as Lazarsfeld et al. (1944) followed links because they were interested in studying, and therefore sampling, relationships rather than individuals, Goodman’s use of link-tracing is motivated by improvements in efficiency allowed by over-sampling relations most likely involved in the structures he is studying.
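The mechanics of an s stage k name design can be illustrated with a short sketch. This is only an illustration under stated assumptions, not Goodman's own procedure or estimator: the function name `snowball_sample`, the `nominations` dictionary, and the rule that each participant names up to k contacts chosen uniformly at random are all hypothetical choices made here for concreteness.

```python
import random


def snowball_sample(nominations, frame, n_seeds, s, k, rng=None):
    """Illustrative s-stage k-name snowball sample (hypothetical helper).

    nominations: dict mapping each person to the list of people they could name
    frame:       list of all population members (the known sampling frame)
    n_seeds:     size of the initial probability sample
    s:           number of waves traced after the initial sample
    k:           maximum number of names taken from each participant
    Returns a list of s + 1 waves; wave 0 is the initial sample.
    """
    rng = rng or random.Random()
    # Stage 0: simple random sample without replacement from the frame.
    seeds = rng.sample(frame, n_seeds)
    waves = [seeds]
    sampled = set(seeds)
    for _ in range(s):
        next_wave = []
        for person in waves[-1]:
            named = nominations.get(person, [])
            # Assumption: each participant names up to k contacts at random.
            chosen = rng.sample(named, min(k, len(named)))
            for contact in chosen:
                # Only people not reached in an earlier stage join the new wave.
                if contact not in sampled:
                    sampled.add(contact)
                    next_wave.append(contact)
        waves.append(next_wave)
    return waves
```

Fixing s and k in advance, and starting from a probability sample of the frame, is what separates this design from the open-ended chain referral described next; the sketch makes no attempt at the estimation step.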
More recently, the term “snowball sampling” has been taken to refer to a convenience sampling mechanism with motivation more like that of Trow: collecting a sample from a population in which a standard sampling approach is either impossible or prohibitively expensive, for the purpose of studying characteristics of individuals in the population (e.g., Biernacki and Waldorf, 1981). Such settings often involve hard-to-reach populations, characterized by the lack of a serviceable sampling frame. In such cases, an initial probability sample is either impossible or impractical, so the initial sample is drawn by a convenience mechanism, dooming the full sample to non-probability sample status. In many such hard-to-reach populations, link-tracing sampling is an effective means of collecting data on population members. For this reason, this latter non-probabilistic usage of “snowball sampling” is most common in practice, although less common in the statistical literature, which favors the probabilistic formulations. Note that it is possible for the seeds in respondent-driven sampling (RDS) to be chosen randomly even in applications to hard-to-reach populations. For example, they could be selected based on a spatial sampling frame.
The tension between these two uses of snowball sampling is highlighted in Thompson (2002, p. 183), a definitive textbook: “The term ‘snowball sampling’ has been applied to two types of procedures related to network sampling. In one type …, a few identified members of a rare population are asked to identify other members of the population, those so identified are asked to identify others, and so on, for the purpose of obtaining a nonprobability sample or for constructing a frame from which to sample. In the other type (Goodman 1961), individuals in the sample are asked to identify other individuals, for a fixed number of stages, for the purpose of estimating the number of ‘mutual relationships’ or ‘social circles’ in the population.” Other definitions of “snowball sampling” are consistent with this duality in usage (Snijders, 1992, p. 59).
Respondent-driven sampling (RDS, introduced by Heckathorn and colleagues, e.g., Heckathorn, 1997) is a newer variant of link-tracing network sampling, which brings to a head the tension between these two usages. This is because RDS is a practical sampling method for hard-to-reach populations that begins with a convenience sample but aims to approximate a probability sample over time.
RDS is not a variant of either usage of snowball sampling, nor is the reverse true. Because of the confusion surrounding this term, in Gile and Handcock (2010) we prefer, and use throughout that paper, the more precise broad category “link-tracing sampling” while paying homage to the intellectual descent of the methods from snowball sampling.
It is precisely the tension between the two usages of snowball sampling that makes RDS a fruitful area for ongoing research. RDS pairs the practical implementation of a convenience sample with the hope of recovering “something like” a probability sample. Gile (2008) and Gile and Handcock (2010) are the first works to systematically evaluate the statistical properties of current estimators based on RDS data. Gile (2011) proposes a new estimator that adjusts for the bias introduced by the with-replacement assumption of these estimators. It is also sometimes possible to adjust for a convenience sample of seeds. For example, Gile and Handcock (2011) extend the estimator of Gile (2011) to correct for the bias introduced by seed selection in the presence of homophily.
The issue here, then, is to recognize the different uses of the term “snowball sampling”. A good solution is for scientists to be as clear as possible in defining the meaning of terms upon first use in each manuscript. There is enough confusion in the various literatures to make this good practice.
References
- Barton, Allen. Paul Lazarsfeld as institutional inventor. International Journal of Public Opinion Research, 13:245–269, 2001.
- Biernacki, Patrick and Waldorf, Dan. Snowball sampling: problem and techniques of chain referral sampling. Sociological Methods and Research, 10:141–163, 1981.
- Coleman, James S. Relational analysis: The study of social organizations with survey methods. Human Organization, 17:28–36, 1958.
- Coleman, James S., Katz, Elihu, and Menzel, Herbert. The diffusion of an innovation among physicians. Sociometry, 20:253–270, 1957.
- Gile, Krista J. Inference from Partially-Observed Network Data. PhD in Statistics, University of Washington, 2008.
- Gile, Krista J. Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. Journal of the American Statistical Association, 106(493):135–146, 2011. doi: 10.1198/jasa.2011.ap09475.
- Gile, Krista J. and Handcock, Mark S. Respondent-driven sampling: An assessment of current methodology. Sociological Methodology, 40:285–327, 2010. URL http://arxiv.org/abs/0904.1855v1.
- Gile, Krista J. and Handcock, Mark S. Network model-assisted inference from respondent-driven sampling data. ArXiv Preprint, 2011. URL http://arxiv.org/abs/XXX.
- Goodman, Leo A. Snowball sampling. Annals of Mathematical Statistics, 32:148–170, 1961.
- Heckathorn, Douglas D. Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems, 44:174–199, 1997.
- Katz, Elihu and Lazarsfeld, Paul F. Personal Influence. Free Press, 1955.
- Lazarsfeld, Paul F., Berelson, Bernard, and Gaudet, Hazel. The People’s Choice: How the Voter Makes Up His Mind in a Presidential Campaign. Duell, Sloan and Pearce, New York, 1944.
- Merton, Robert K. Patterns of influence: A study of interpersonal influence and communications behavior in a local community. In Lazarsfeld, Paul F. and Stanton, Frank, editors, Communications Research, 1948–49, pages 180–219. Harper and Brothers, New York, 1949.
- Snijders, Thomas A. B. Estimation on the basis of snowball samples: how to weight. Bulletin Methodologie Sociologique, 36:59–70, 1992.
- Thompson, Steven K. Sampling. Wiley, Second edition, 2002.
- Trow, Martin. Right-Wing Radicalism and Political Intolerance. Arno Press, New York, 1957. Reprinted 1980.