Author manuscript; available in PMC: 2016 Nov 1.
Published in final edited form as: Stat Probab Lett. 2015 Nov 1;106:100–102. doi: 10.1016/j.spl.2015.07.003

Nonparametric Identification for Respondent-Driven Sampling

Peter M Aronow 1,2, Forrest W Crawford 2
PMCID: PMC4552244  NIHMSID: NIHMS709307  PMID: 26327739

Abstract

We detail nonparametric identification results for respondent-driven sampling when sampling probabilities are assumed to be functions of network degree known to scale. We show that the conditions for consistency of the Volz-Heckathorn estimator are weaker than previously assumed.

Keywords: Horvitz-Thompson estimator, network degree, respondent-driven sampling

1 Introduction

Respondent-driven sampling (RDS) is a method for surveying hidden or hard-to-reach populations such as sex workers or injection drug users (Heckathorn, 1997; Broadhead et al., 1998). Starting with a group of initial subjects called “seeds,” respondents recruit others who are also members of the study population by giving them “coupons” to present to the researcher. These new subjects are interviewed, given coupons, and the process repeats. Many researchers have approximated RDS as a sampling design in which the sampling probability for subject i is proportional to their network degree d_i (Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008; Gile and Handcock, 2010; Gile, 2011). In particular, Salganik and Heckathorn (2004) and Volz and Heckathorn (2008) justify this choice by modeling the recruitment process as a with-replacement random walk on a connected population network, where only one coupon is given to each subject, recruitment is uniformly at random from network neighbors, and each subject can be recruited infinitely many times. For an RDS sample of size n, Volz and Heckathorn (2008) (hereafter VH) give the estimator

$$\hat{\mu}_{VH} = \frac{\sum_{i=1}^{n} y_i d_i^{-1}}{\sum_{i=1}^{n} d_i^{-1}} \tag{1}$$

where y_i is the outcome of interest and d_i is the degree of subject i.
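As a concrete reading of estimator (1), the following minimal Python sketch computes the inverse-degree weighted mean. The function name `vh_estimate` and the toy data are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def vh_estimate(y: Sequence[float], d: Sequence[int]) -> float:
    """Inverse-degree weighted mean of y, following equation (1)."""
    if len(y) != len(d) or len(y) == 0:
        raise ValueError("y and d must be non-empty and of equal length")
    if any(di <= 0 for di in d):
        raise ValueError("every reported degree must be positive")
    weights = [1.0 / di for di in d]  # d_i^{-1}
    numerator = sum(yi * wi for yi, wi in zip(y, weights))
    return numerator / sum(weights)


# Toy data (made up): binary outcome and reported degree for five respondents.
y_sample = [1, 0, 1, 0, 0]
d_sample = [2, 10, 4, 5, 1]
print(vh_estimate(y_sample, d_sample))
```

When every respondent reports the same degree, the weights cancel and the estimate reduces to the unweighted sample mean.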

Several authors have expressed skepticism about RDS survey methodology in general and the VH estimator in particular (Heimer, 2005; Johnston et al., 2008; Goel and Salganik, 2010; Gile and Handcock, 2010; Salganik, 2012; White et al., 2012). Many alternative characterizations of the recruitment process exist (Goel and Salganik, 2009; Gile and Handcock, 2010; Gile, 2011; Berchenko et al., 2013; Crawford, 2014). Empirical studies have also cast doubt on the performance of the VH estimator in real-world RDS datasets (Wejnert, 2009; McCreesh et al., 2012; Rudolph et al., 2013).

A recent paper by Gile et al. (2015) presents diagnostics whose purpose is to help researchers determine whether the assumptions often invoked to motivate the VH estimator (1) are met in empirical RDS data. The diagnostics presented by Gile et al. (2015) address a particular class of motivating assumptions about the structure of a hypothesized social network and the process by which new subjects are sampled. These assumptions, characterized by Gile et al. (2015, pg. 3) as “required by the [VH] estimator,” are summarized in Table 1, reproduced from the original paper.

Table 1.

Assumptions listed by Gile et al. (2015) as requirements for the VH estimator. Reproduced from their Table 1.

| | Network structure assumptions | Sampling assumptions |
|---|---|---|
| Random-walk model | Network size large (N ≫ n) | With-replacement sampling, single non-branching chain |
| Remove seed dependence | Homophily sufficiently weak, bottlenecks limited, connected graph | Enough sample waves |
| Respondent behaviour | All ties reciprocated | Degree accurately measured, random referral |

In this short note, we give an alternative, nonparametric set of conditions under which the VH estimator is consistent, and note identification conditions for a generalization of the VH estimator. The conditions we articulate for consistency are restrictive and untestable, but they are nevertheless less stringent than the traditional model used to justify the VH estimator. Consistency of the VH estimator does not require random sampling or even the existence of a network connecting the members of the study population. Our results clarify the inferential challenges posed by RDS data, challenges beyond those of other non-probability samples. Importantly, however, our results suggest conditions that can be more generally implied by other generative models that may justify the VH estimator or variants thereof.

2 Results

Formally, consider a sequence of populations and samples converging weakly to a joint limit distribution on the outcome, (reported) degree, and sample, denoted (Y, D, S). Let the E [·] and Pr[·] operators refer to features of this limiting distribution. In RDS, we observe the empirical joint distribution of the outcome Y and degree D conditional on the sampling indicator S = 1. Without loss of generality, suppose that Y has bounded support and that D has support in the set {1, …, K}.

Condition 1

(Ignorability). For all k such that Pr[D = k] > 0, E[Y|S = 1, D = k] = E[Y|D = k] and Pr[S = 1|D = k] > 0.

Condition 2

(Knowledge of the Conditional Probability of Sampling). Pr[S = 1|D = k] = f(k), where f(·) is known up to an unknown scale parameter c.

Proposition 1

Given Conditions 1 and 2, the population mean is identified, with

$$E[Y] = \frac{\sum_{k=1}^{K} E[Y \mid S = 1, D = k] \, \frac{\Pr[D = k \mid S = 1]}{f(k)}}{\sum_{k=1}^{K} \frac{\Pr[D = k \mid S = 1]}{f(k)}} \tag{2}$$

Proof

We can identify E[Y|D = k] for each degree k from Condition 1 since E[Y|D = k] = E[Y|S = 1, D = k]. We can identify each Pr[D = k] to scale directly from Condition 2 as

$$\Pr[D = k] = \frac{\Pr[D = k \mid S = 1] \, \Pr[S = 1]}{\Pr[S = 1 \mid D = k]} = \frac{\Pr[S = 1]}{c} \, \frac{\Pr[D = k \mid S = 1]}{f(k)}. \tag{3}$$

Then by the law of total expectation,

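$$E[Y] = \frac{\sum_{k=1}^{K} E[Y \mid D = k] \, \frac{\Pr[S = 1]}{c} \, \frac{\Pr[D = k \mid S = 1]}{f(k)}}{\sum_{k=1}^{K} \frac{\Pr[S = 1]}{c} \, \frac{\Pr[D = k \mid S = 1]}{f(k)}} = \frac{\sum_{k=1}^{K} E[Y \mid D = k] \, \frac{\Pr[D = k \mid S = 1]}{f(k)}}{\sum_{k=1}^{K} \frac{\Pr[D = k \mid S = 1]}{f(k)}}. \tag{4}$$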

Given Proposition 1, consistency of the VH estimator directly follows from convergence of sample analogues to population quantities.
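The "sample analogue" of equation (2) can be sketched in Python as follows, under the assumption that each population quantity is replaced by its within-sample counterpart: the stratum mean of y for E[Y|S = 1, D = k] and the stratum share for Pr[D = k|S = 1]. The function name `plug_in_mean`, the toy data, and the rescaling factor 0.03 are illustrative choices, not part of the paper.

```python
from collections import defaultdict
from typing import Callable, Sequence


def plug_in_mean(y: Sequence[float], d: Sequence[int],
                 f: Callable[[int], float]) -> float:
    """Sample analogue of equation (2) for a user-supplied f."""
    n = len(y)
    if n == 0 or n != len(d):
        raise ValueError("y and d must be non-empty and of equal length")
    totals = defaultdict(float)  # sum of y within each degree stratum
    counts = defaultdict(int)    # number of respondents in each stratum
    for yi, di in zip(y, d):
        totals[di] += yi
        counts[di] += 1

    numerator = 0.0
    denominator = 0.0
    for k, n_k in counts.items():
        mean_y_k = totals[k] / n_k  # estimates E[Y | S = 1, D = k]
        share_k = n_k / n           # estimates Pr[D = k | S = 1]
        numerator += mean_y_k * share_k / f(k)
        denominator += share_k / f(k)
    return numerator / denominator


# Toy data (made up).
y_sample = [1, 0, 1, 0, 0, 1]
d_sample = [2, 10, 4, 5, 1, 2]

est = plug_in_mean(y_sample, d_sample, lambda k: k)
est_rescaled = plug_in_mean(y_sample, d_sample, lambda k: 0.03 * k)
print(est, est_rescaled)
```

Multiplying f by any positive constant changes the numerator and denominator by the same factor, so the two estimates agree up to floating-point error. This is the sense in which f needs to be known only up to the scale parameter c.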

Corollary 1

Given Conditions 1 and 2, the VH estimator is consistent for E[Y] if f(k) ∝ k.
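To make the link between equation (2) and estimator (1) explicit, one possible way to write out the step is the following. The notation is introduced here for illustration and is not the authors': n_k denotes the number of sampled subjects with degree k, and \bar{y}_k denotes their mean outcome. Substituting these sample analogues and f(k) = k into (2) gives

$$\frac{\sum_{k=1}^{K} \bar{y}_k \, \frac{n_k}{n} \, k^{-1}}{\sum_{k=1}^{K} \frac{n_k}{n} \, k^{-1}} = \frac{\sum_{k=1}^{K} k^{-1} \sum_{i : d_i = k} y_i}{\sum_{k=1}^{K} k^{-1} n_k} = \frac{\sum_{i=1}^{n} y_i d_i^{-1}}{\sum_{i=1}^{n} d_i^{-1}} = \hat{\mu}_{VH},$$

where the first equality uses n_k \bar{y}_k = \sum_{i : d_i = k} y_i and cancels the common factor 1/n, and strata with n_k = 0 contribute nothing to either sum. Any constant of proportionality in f(k) ∝ k cancels between numerator and denominator.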

3 Discussion

A variant of Condition 1 is usually assumed implicitly in statistical arguments in favor of the VH estimator (Salganik and Heckathorn, 2004; Salganik, 2006; Volz and Heckathorn, 2008). Ignorability is not empirically testable from RDS data alone, since researchers never observe E[Y|S = 0, D = k] for any k (but see, e.g., Lunagomez and Airoldi, 2014, for an alternative definition of ignorability that can be tested in some circumstances). While ignorability is a strong but common assumption imposed for inference from non-probability samples, Condition 2 highlights the additional challenges posed by RDS data. The researcher does not generally have knowledge of the population distribution of degree, and thus ignorability with respect to degree is not sufficient to identify the population mean. Specification of the conditional sampling probability in Condition 2 provides an alternative means for identification, and has typically been the focus of researchers’ efforts to justify the VH estimator. The random-walk argument articulated by Salganik and Heckathorn (2004) and Volz and Heckathorn (2008) serves to motivate the choice of f(k) ∝ k in the VH estimator, but is not strictly necessary for its consistency. Condition 2 holds under any model in which subjects with higher reported degrees are more likely to be sampled and f(k) ∝ k characterizes this relationship. Notably, neither Condition 1 nor Condition 2 requires that any of the “Network structure assumptions” given in Table 1 hold. Finally, we note that our results suggest that the VH estimator and variants thereof may be appropriate even when diagnostics predicated on a more restrictive model, such as those proposed by Gile et al. (2015), fail.
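The point that no network is needed can be illustrated with a small simulation. The sketch below is an assumption-laden toy model, not the authors' construction: degrees are drawn uniformly from {1, …, 20}, the outcome depends on degree only, and each subject is selected independently with probability c·d, so Conditions 1 and 2 hold with f(k) ∝ k although no graph or random walk is involved. All numerical values are arbitrary.

```python
import random


def simulate(population_size: int = 200_000, c: float = 0.02, seed: int = 0):
    """Return (population mean, naive sample mean, VH estimate)."""
    rng = random.Random(seed)
    population_total = 0.0
    y_sample, d_sample = [], []
    for _ in range(population_size):
        d = rng.randint(1, 20)  # reported degree, K = 20
        # Outcome depends on degree only: Pr[Y = 1 | D = d] = 0.1 + 0.03 d.
        y = 1 if rng.random() < 0.1 + 0.03 * d else 0
        population_total += y
        # Selection depends on degree only: Pr[S = 1 | D = d] = c * d.
        if rng.random() < c * d:
            y_sample.append(y)
            d_sample.append(d)
    population_mean = population_total / population_size
    naive = sum(y_sample) / len(y_sample)
    weights = [1.0 / d for d in d_sample]
    vh = sum(y * w for y, w in zip(y_sample, weights)) / sum(weights)
    return population_mean, naive, vh


population_mean, naive, vh = simulate()
print(f"population mean {population_mean:.3f}")
print(f"naive sample mean {naive:.3f}")
print(f"VH estimate {vh:.3f}")
```

Under this toy model the unweighted sample mean overstates the population mean because high-degree subjects are over-represented, while the inverse-degree weighting should bring the estimate close to the population value. Changing the selection rule so that it is no longer proportional to degree would violate Condition 2 for f(k) ∝ k, and the weighted estimate would then be expected to drift as well.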

Without knowledge of the characteristics of the unsampled subjects, neither Condition 1 nor Condition 2 has directly testable implications, and thus the value of any diagnostics must depend on further assumptions about the generative process. Under further parametric assumptions, some of the conditions listed in Table 1 might be sufficient to imply consistency of the VH estimator. A formalization of these assumptions as part of a generative model for the recruitment process would allow researchers to evaluate the statistical properties of diagnostics like those proposed by Gile et al. (2015).

Acknowledgements

FWC was supported by NIH grant KL2 TR000140. We are grateful to Krista Gile, Donald P. Green, Edward M. Kaplan, Molly Offer-Westort, Cyrus Samii, Matthew Salganik, and Alexei Zelenev for helpful comments.


References

  1. Berchenko Y, Rosenblatt J, Frost SD. Modeling and analysing respondent driven sampling as a counting process. 2013. arXiv preprint arXiv:1304.3505. doi: 10.1111/biom.12678.
  2. Broadhead RS, Heckathorn DD, Weakliem DL, Anthony DL, Madray H, Mills RJ, Hughes J. Harnessing peer networks as an instrument for AIDS prevention: results from a peer-driven intervention. Public Health Reports. 1998;113(Suppl 1):42.
  3. Crawford FW. The graphical structure of respondent-driven sampling. 2014. arXiv preprint, http://arxiv.org/pdf/1406.0721. doi: 10.1177/0081175016641713.
  4. Gile KJ. Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. Journal of the American Statistical Association. 2011;106(493):135–146.
  5. Gile KJ, Handcock MS. Respondent-driven sampling: an assessment of current methodology. Sociological Methodology. 2010;40(1):285–327. doi: 10.1111/j.1467-9531.2010.01223.x.
  6. Gile KJ, Johnston LG, Salganik MJ. Diagnostics for respondent-driven sampling. Journal of the Royal Statistical Society, Series A. 2015;178(1):241–269. doi: 10.1111/rssa.12059.
  7. Goel S, Salganik MJ. Respondent-driven sampling as Markov chain Monte Carlo. Statistics in Medicine. 2009;28(17):2202–2229. doi: 10.1002/sim.3613.
  8. Goel S, Salganik MJ. Assessing respondent-driven sampling. Proceedings of the National Academy of Sciences. 2010;107(15):6743–6747. doi: 10.1073/pnas.1000261107.
  9. Heckathorn DD. Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems. 1997;44(2):174–199.
  10. Heimer R. Critical issues and further questions about respondent-driven sampling: comment on Ramirez-Valles, et al. (2005). AIDS and Behavior. 2005;9(4):403–408. doi: 10.1007/s10461-005-9030-1.
  11. Johnston LG, Malekinejad M, Kendall C, Iuppa IM, Rutherford GW. Implementation challenges to using respondent-driven sampling methodology for HIV biological and behavioral surveillance: field experiences in international settings. AIDS and Behavior. 2008;12(1):131–141. doi: 10.1007/s10461-008-9413-1.
  12. Lunagomez S, Airoldi EM. Valid inference from non-ignorable network sampling designs. 2014. arXiv preprint arXiv:1401.4718.
  13. McCreesh N, Frost S, Seeley J, Katongole J, Tarsh MN, Ndunguse R, Jichi F, Lunel NL, Maher D, Johnston LG, et al. Evaluation of respondent-driven sampling. Epidemiology. 2012;23(1):138. doi: 10.1097/EDE.0b013e31823ac17c.
  14. Rudolph AE, Fuller CM, Latkin C. The importance of measuring and accounting for potential biases in respondent-driven samples. AIDS and Behavior. 2013;17(6):2244–2252. doi: 10.1007/s10461-013-0451-y.
  15. Salganik MJ. Variance estimation, design effects, and sample size calculations for respondent-driven sampling. Journal of Urban Health. 2006;83(1):98–112. doi: 10.1007/s11524-006-9106-x.
  16. Salganik MJ. Commentary: respondent-driven sampling in the real world. Epidemiology. 2012;23(1):148–150. doi: 10.1097/EDE.0b013e31823b6979.
  17. Salganik MJ, Heckathorn DD. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology. 2004;34(1):193–240.
  18. Volz E, Heckathorn DD. Probability based estimation theory for respondent driven sampling. Journal of Official Statistics. 2008;24(1):79–97.
  19. Wejnert C. An empirical test of respondent-driven sampling: point estimates, variance, degree measures, and out-of-equilibrium data. Sociological Methodology. 2009;39(1):73–116. doi: 10.1111/j.1467-9531.2009.01216.x.
  20. White RG, Lansky A, Goel S, Wilson D, Hladik W, Hakim A, Frost SD. Respondent driven sampling–where we are and where should we be going? Sexually Transmitted Infections. 2012;88(6):397–399. doi: 10.1136/sextrans-2012-050703.
