RESPONDENT-DRIVEN SAMPLING AS MARKOV CHAIN MONTE CARLO

SHARAD GOEL; MATTHEW J SALGANIK

doi:10.1002/sim.3613

Author manuscript; available in PMC: 2013 Jun 17.

Published in final edited form as: Stat Med. 2009 Jul 30;28(17):2202–2229. doi: 10.1002/sim.3613


PMCID: PMC3684629 NIHMSID: NIHMS467237 PMID: 19572381

Abstract

Respondent-driven sampling (RDS) is a recently introduced, and now widely used, technique for estimating disease prevalence in hidden populations. RDS data are collected through a snowball mechanism, in which current sample members recruit future sample members. In this paper we present respondent-driven sampling as Markov chain Monte Carlo (MCMC) importance sampling, and we examine the effects of community structure and the recruitment procedure on the variance of RDS estimates. Past work has assumed that the variance of RDS estimates is primarily affected by segregation between healthy and infected individuals. We examine an illustrative model to show that this is not necessarily the case, and that bottlenecks anywhere in the networks can substantially affect estimates. We also show that variance is inflated by a common design feature in which sample members are encouraged to recruit multiple future sample members. The paper concludes with suggestions for implementing and evaluating respondent-driven sampling studies.

Key words and phrases: hard-to-reach populations, hidden populations, HIV surveillance, importance sampling, Markov chain Monte Carlo, respondent-driven sampling, social networks, spectral gap

1. Introduction

The Joint United Nations Program on HIV/AIDS (UNAIDS) estimates that there are between 30 and 35 million people living with HIV/AIDS worldwide, and that between 2 and 4 million people were newly infected in 2007. In most countries outside of sub-Saharan Africa, these infections are concentrated in three subpopulations: men who have sex with men, injection drug users, and sex workers and their sexual partners [1]. Consequently, there is general consensus among epidemiologists that better data about disease prevalence and risk behaviors within these key subpopulations are critical for understanding and controlling the spread of the disease [2–5].

Unfortunately, because these subpopulations lack appropriate sampling frames, are relatively small, and their members often desire to remain anonymous, they are difficult to study with standard sampling methods. For this reason they are often called “hidden” or “hard-to-reach.” A variety of sampling approaches have been tried to study these hidden populations, but in many cases they produce estimates of unknown bias and variance [5; 6]. The resulting uncertainty about key subpopulations has complicated public health efforts to evaluate prevention programs and allocate resources effectively.

Respondent-driven sampling (RDS) is a new approach for sampling from hidden populations that is rapidly gaining in popularity: A recent review identified more than 120 RDS studies worldwide [7], including populations as diverse as men who have sex with men in Uganda [8], sex workers in Vietnam [9], and injection drug users in the former Soviet Union [10]. Furthermore, the U.S. Centers for Disease Control and Prevention (CDC) recently selected RDS for a 25-city study of injection drug users that is part of the National HIV Behavioral Surveillance System [11]. Because CDC decisions often influence global public health standards, RDS is likely to become increasingly common in the study of hidden populations.

RDS data are collected through a snowball mechanism, in which current sample members recruit future sample members.¹ An RDS study begins by recruiting a small number of people in the target population to serve as seeds. After participating, the seeds are asked—and often provided financial incentive—to recruit other people that they know in the target population. The sampling continues in this way with current sample members recruiting the next wave of sample members until the desired sample size is reached.² The process results in recruitment networks like the one in Figure 1 from a study of drug users in New York City; the sample began with eight seeds and grew to include 618 people in 13 weeks [27]. Under certain strong assumptions described later in the paper, these RDS data can then be used to produce asymptotically unbiased estimates about the hidden population (e.g., estimates of the proportion of drug users in New York who have HIV).

Figure 1. Recruitment networks from a study of drug users in New York City. The eight seeds are larger than the other nodes, and all nodes are color coded by race/ethnicity. This figure was originally published in [27].

Despite the widespread use of RDS and its potential to address important public health questions, the statistical foundations of RDS remain poorly understood. This paper presents respondent-driven sampling as Markov chain Monte Carlo importance sampling, and analyzes how RDS estimates are affected by both the community structure of the hidden population and the recruitment procedure. We show that the variance of the RDS estimator is increased by: (1) “bottlenecks” between different groups in the hidden population, and (2) a study design in which participants recruit multiple individuals.

Our paper is organized as follows. In Section 2 we show that RDS sampling and estimation can be viewed as Markov chain Monte Carlo (MCMC) importance sampling. While MCMC algorithms are typically computer-driven, a novel feature of RDS is that state transitions consist of individuals physically recruiting others in the hidden population. In Section 3 we analyze a particular, illustrative network model in detail. This example shows that the structure of the hidden population’s social network can significantly impact both the bias and variance of RDS estimates, a phenomenon that is well understood in the statistical MCMC community but that has been overlooked in the RDS literature. Importantly, bottlenecks in any part of the network may affect RDS estimates of quantities that are not directly related to the source of the bottleneck. For example, a bottleneck between racial groups may degrade RDS estimates of gender composition. This suggests that the variance of RDS estimates is likely larger than previously believed. In Section 4 we explore the effect of multiple recruitment on variance, an issue that is important for RDS, but has not been considered previously and typically does not arise in MCMC applications. We show that “thick,” as opposed to “thin,” recruitment chains increase the statistical dependence between samples, and consequently worsen RDS estimates. Section 5 summarizes the results and concludes with recommendations for users of RDS. We have relegated most proofs and technical details to Appendices A and B. Appendix C reviews conductance, a formal measure of bottlenecks in networks.

2. Respondent-Driven Sampling as MCMC

Respondent-driven sampling [28–30] is a form of snowball sampling often used to estimate the proportion of a population with a specific characteristic. Although in this paper we talk about estimating the proportion p of infected individuals, we could more generally be estimating the occurrence of any characteristic or behavior. Here we review Markov chain Monte Carlo importance sampling, and make the connection to RDS precise.

Markov Chain Monte Carlo

Markov chain Monte Carlo was popularized by the introduction of the Metropolis algorithm [31], and has been applied extensively in a variety of fields, including physics, chemistry, biology and statistics. MCMC has also been the subject of several book-length treatments [32–35].

Behind all MCMC methods is a Markov chain on a state space V. In the context of RDS, V is the population from which we sample (e.g., drug injectors in New York City). We confine ourselves to the case where V is a finite population of size N, and so identify the chain with a kernel K(v_i, v_j) that gives the probability of transition from state v_i to state v_j:

K (v_{i}, v_{j}) \geq 0, \qquad \sum_{v_{j} \in V} K (v_{i}, v_{j}) = 1.

In terms of RDS, K(v_i, v_j) is the probability that any individual v_i recruits an individual v_j. The chain is irreducible if for every pair of points v_i, v_j there is positive probability of eventually reaching v_j starting from v_i. Under this assumption, there is a unique distribution π: V → ℝ—called the stationary distribution—satisfying

\sum_{v_{i} \in V} π (v_{i}) K (v_{i}, v_{j}) = π (v_{j}) .

That is, if X₀, X₁, X₂, … is a realization of the chain with X₀ ~ π, then X_i ~ π for i ≥ 0. Consequently, by starting the chain in equilibrium, the walk can be used to generate dependent samples from the distribution π.

Importance Sampling

As shown above, a chain-referral sampling method can be used to draw dependent samples from the population V with distribution π:

P (X_{i} = v_{j}) = π (v_{j}) .

That is, on each draw individual v_j has probability π(v_j) of being chosen. Then for any function f : V → ℝ, the sample mean

\frac{1}{n} \sum_{i = 0}^{n - 1} f (X_{i})

(2.1)

gives an unbiased estimate not of the population mean, but of $E_{π} f = \sum_{j = 1}^{N} f (v_{j}) π (v_{j})$. That is, because units are selected with unequal probability, the sample mean is not a consistent estimator of the population mean. As is common in the survey sampling literature [36], the idea behind importance sampling [37] is that the weighted sample mean

\frac{1}{n} \sum_{i = 0}^{n - 1} \frac{f (X_{i})}{N \cdot π (X_{i})}

(2.2)

produces an unbiased estimate of the population mean μ_f of f since

\begin{aligned} E_{π} \left( \frac{f (X_{i})}{N \cdot π (X_{i})} \right) &= \sum_{j = 1}^{N} \frac{f (v_{j})}{N \cdot π (v_{j})} π (v_{j}) \\ &= \frac{1}{N} \sum_{j = 1}^{N} f (v_{j}) . \end{aligned}

In particular, if D ⊆ V is the subset of infected individuals, then (2.2) can be used to estimate the disease prevalence p = |D|/N = μ_f by setting f(v_i) = 1 if v_i ∈ D and f(v_i) = 0 otherwise.

It is often necessary to replace (2.2) by the asymptotically unbiased importance sampling estimator

{\hat{μ}}_{IS} = \frac{1}{\sum_{i = 0}^{n - 1} 1 / π (X_{i})} \sum_{i = 0}^{n - 1} \frac{f (X_{i})}{π (X_{i})} .

(2.3)

The considerable advantage of (2.3) over (2.2) is that the importance weights 1/π(X_i) only need to be evaluated up to a multiplicative constant (e.g., one does not need to know N). In many applications, including RDS, this simplification is essential.

Respondent-Driven Sampling

Importance sampling allows estimation of p given samples X₀, X₁, … from any fixed distribution π. RDS generates such samples via a recruitment process akin to Markov chain Monte Carlo. The link between respondent-driven sampling and Markov chain Monte Carlo has been noted previously [29; 38; 39]; here we make that connection explicit.

Consider a social network G = (V, E) where nodes x ∈ V represent individuals in the population that are either infected or healthy, and e ∈ E represent edges in the network. We assume symmetric weighted edges (i.e., symmetric relationships) and we write W(x, y) = W(y, x) for the weight of the edge between nodes x and y.³ Further, we assume that the network is connected (i.e., that there exists a path between every pair of individuals in the population).

For a subset of individuals A ⊆ V we use the notation

W_{A} = \sum_{x \in A} \sum_{y \in V} W (x, y)

to denote the weight of A. For singleton sets, we write W_x instead of W_{\{x\}}.

We model the RDS sampling procedure as a random walk on the weighted graph G defined by the kernel K(x, y) = W(x, y)/W_x, where K(x, y) is the probability that individual x recruits individual y.⁴ Assuming the network is connected (i.e., the chain is irreducible), the walk has a unique stationary distribution

π (x) = \frac{W_{x}}{W_{V}} .
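The stationary distribution above can be checked numerically. The following is a minimal sketch, assuming a small hypothetical four-person weight matrix (not data from any study) and illustrative function names: it builds the kernel K(x, y) = W(x, y)/W_x and confirms that π(x) = W_x/W_V is left unchanged by one step of the walk.

```python
def random_walk_kernel(W):
    """Return K with K[x][y] = W[x][y] / W_x for a symmetric weight matrix W."""
    return [[w / sum(row) for w in row] for row in W]


def stationary_distribution(W):
    """Return pi with pi[x] = W_x / W_V, where W_x is the total weight at node x."""
    node_weights = [sum(row) for row in W]
    total = sum(node_weights)
    return [w / total for w in node_weights]


def step(dist, K):
    """Apply one step of the chain to a distribution over nodes: (dist K)[y]."""
    n = len(K)
    return [sum(dist[x] * K[x][y] for x in range(n)) for y in range(n)]


# Hypothetical symmetric edge weights on four individuals (assumed for illustration).
W = [
    [0, 2, 1, 0],
    [2, 0, 1, 1],
    [1, 1, 0, 3],
    [0, 1, 3, 0],
]
K = random_walk_kernel(W)
pi = stationary_distribution(W)
pi_next = step(pi, K)

print(pi)       # node weights 3, 4, 5, 4 divided by W_V = 16
print(pi_next)  # agrees with pi up to floating-point rounding
```

The check relies only on the symmetry of W, which is why the same formula for π holds for any connected weighted network.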

Consequently, for X₀, X₁, X₂, … a realization of the chain with X₀ ~ π,⁵ and f : V → ℝ any function, the importance sampling estimator (2.3) of the population mean μ_f reduces to

{\hat{μ}}_{f} = \frac{1}{\sum_{i = 0}^{n - 1} 1 / W_{X_{i}}} \sum_{i = 0}^{n - 1} \frac{f (X_{i})}{W_{X_{i}}}

(2.4)

The RDS estimator (2.4) was recently introduced in [38], and will likely supplant the RDS estimator introduced in [29]. In the case of estimating disease prevalence, by setting f(v_i) = 1 if v_i is infected and f(v_i) = 0 otherwise, (2.4) simplifies to

\hat{p} = \frac{1}{\sum_{i = 0}^{n - 1} 1 / W_{X_{i}}} \sum_{X_{i} \text{ infected}} \frac{1}{W_{X_{i}}} .

(2.5)

To evaluate the RDS estimators (2.4) and (2.5) one still needs to know the weights W_{X_i}. Typically, researchers set uniform edge weights, W(x, y) = 1, corresponding to the assumption that participants recruit their contacts uniformly at random and that all contacts approached agree to participate. Throughout the paper we refer to this as the uniform recruitment assumption.⁶ In this case, W_x equals the degree of node x (i.e., her number of contacts).⁷
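To make the estimators (2.4) and (2.5) concrete, here is a minimal sketch under the uniform recruitment assumption, where each weight W_x is the respondent's reported degree. The six-person sample and the function name `rds_estimate` are hypothetical and serve only as an illustration.

```python
def rds_estimate(values, weights):
    """Estimator (2.4): mean of f(X_i) with each sample weighted by 1 / W_{X_i}."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and of equal length")
    inverse = [1.0 / w for w in weights]
    return sum(f * u for f, u in zip(values, inverse)) / sum(inverse)


# Hypothetical sample: reported degree and infection indicator for six respondents.
degrees = [2, 10, 5, 1, 20, 4]
infected = [1, 0, 1, 1, 0, 0]

p_hat = rds_estimate(infected, degrees)    # estimator (2.5), since f is an indicator
naive = sum(infected) / len(infected)      # unweighted sample mean (2.1)
print(p_hat, naive)
```

In this made-up sample the infected respondents have low degree, so they are down-sampled by the walk and the weighted estimate exceeds the naive sample mean. Multiplying every degree by the same constant leaves `p_hat` unchanged, which is the practical advantage of (2.3) over (2.2).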

In contrast to naive estimates from snowball sampling based on the sample mean (2.1),⁸ RDS estimates weight samples proportional to their assumed probability of selection. In the case where all nodes have the same degree, the sample mean estimate is equivalent to the RDS estimate (2.4) given the uniform recruitment assumption.

In the above, we start the walk in stationarity: X₀ ~ π (i.e., the initial seed is drawn according to the stationary distribution). However, if the walk is aperiodic (i.e., if the network is not bipartite), then the RDS estimator μ̂ is asymptotically unbiased regardless of the starting distribution. Moreover, there is a central limit theorem for μ̂:

\sqrt{n} ({\hat{μ}}_{n} - μ) \to N (0, σ_{f}^{2})

(2.6)

for any initial distribution on X₀.⁹ The variance $σ_{f}^{2}$ depends on the variance of f and the autocorrelation structure of the chain, and can be difficult to estimate in practice [34].

We hasten to point out that these results regarding the asymptotic behavior of RDS estimates hinge critically on the validity of the modeling assumptions. In particular, these results require that participants recruit a single individual¹⁰ chosen uniformly at random from their network of contacts, and that participants can be recruited into the sample multiple times (i.e., sampling with replacement). Furthermore, even if all of the appropriate conditions are met, the asymptotic theory says little regarding the performance of RDS in small samples (n ≈ 500). As we show, in the case of small samples, the social network structure is of central importance.

3. Effects of Community Structure

It is well understood that the bias and variance of MCMC estimates are critically affected by the structure of the network underlying the random walk. However, past work on RDS has focused on only one structural feature: bottlenecks between infected and uninfected individuals (Figure 2) [28; 29; 44].¹¹ That is, it was previously believed that as long as there were sufficient connections between infected and uninfected individuals, the RDS estimates would be reasonably precise. While this structural feature is certainly a concern, taken in isolation it underestimates the effect of network structure on the variance of RDS estimates. Even when infected and uninfected individuals are relatively well connected, bottlenecks in other parts of the network can lead to large variance.

Figure 2. Hypothetical network with an edge between every pair of nodes, where within-group edges have higher weight than between-group edges. Here the two groups are defined by infection status, and a bottleneck exists between healthy and infected individuals. This is the only type of bottleneck that had been considered in the previous RDS literature.

To illustrate this point, we analyze RDS on two network models in detail. Our examples, while motivated by the qualitative features of real social networks, are not intended to be accurate models of any specific social network. Rather, they provide insight by allowing for exact and interpretable results.

3.1. Two Network Models

3.1.1. A Two-Group Model

Consider a population V consisting of two groups, A and B, of equal size N/2. Edges exist between every pair of individuals; however, within-group edges have weight 1 − c while between-group edges have weight c, where 0 < c < 1/2 (see Figure 3(a)).¹² That is, within-group relationships are stronger than between-group relationships. In this model, c parameterizes homophily based on group membership, the well-observed social tendency for people to form ties to others who are similar [45]; as c increases, the tendency for within-group ties decreases.¹³

Figure 3. Hypothetical networks with an edge between every pair of nodes, where within-group edges have higher weight than between-group edges. In the two-group model (a), the population is divided into two equally sized groups that differ in disease prevalence. In the multi-group model (b), the population is divided into many smaller equally sized subgroups that also differ in disease prevalence. In terms of RDS, these two models are equivalent.

Let p_A and p_B denote the proportion of infected individuals within the two groups, and let D ⊆ V be the subset of infected people. Since we are assuming |A| = |B|, the proportion of infected individuals in the entire population is p = (p_A + p_B)/2. If the two groups have different infection rates, p_A ≠ p_B, then, as we show, the network bottleneck between the two groups affects the RDS estimate, even though infected and uninfected individuals are well-connected.

We can imagine this more concretely by considering the case of street-based and agency-based sex workers in Belgrade, two groups that have been found to have little contact [46]. If these two groups had different HIV prevalence, then the weak connections between the groups could lead to high variance for the RDS estimated HIV prevalence for sex workers as a whole because sometimes the sampling would get stuck in one group and sometimes it would get stuck in the other. Further, if the seeds are not selected from the stationary distribution, the bottleneck between groups can lead to biased estimates.

3.1.2. A Multi-Group Extension

Although one could plausibly detect and potentially compensate for the simple network bottleneck in the two-group model, more subtle—and hence harder to diagnose—structural features can also lead to high variance of RDS estimates.

Consider the multi-group network depicted in Figure 3(b). This more general model aims to capture the fact that many real social networks partition into relatively homogeneous subgroups where there are stronger ties within subgroups than between subgroups, a feature that sociologists call “cohesive subgroups” [47] and physicists call “community structure” [48]. In the multi-group model, N nodes are divided into subgroups of m nodes; all nodes are connected, but within-subgroup edges have weight 1, and between-subgroup edges have weight b < 1.¹⁴ The subgroups themselves come in two varieties, A and B, with exactly half the subgroups of type A and the other half type B. Type A subgroups have a proportion p_A of their nodes infected, and type B subgroups have a proportion p_B infected. In the case of contagious diseases, such clumping of cases within subgroups is particularly likely [49].

Despite their apparent differences, the multi-group model is in fact equivalent to the two-group model: For every value of c < 1/2 in the two-group model, there is a corresponding value of b in the multi-group model such that the RDS estimator p̂ has the same distribution under both network models (details are provided in Appendix A).

3.2. Analyzing the Models

Here we consider the bias and variance properties of RDS on the network models discussed above. Without loss of generality, we consider only the two-group model.

In the two-group model, RDS is based on the following Markov chain:

K (x, y) = \begin{cases} 2 (1 - c) / N & x, y \in A \text{ or } x, y \in B \\ 2 c / N & x \in A, y \in B \text{ or } x \in B, y \in A \end{cases}

(3.1)

Written as a matrix,

K = \begin{bmatrix} 2 (1 - c) / N & 2 c / N \\ 2 c / N & 2 (1 - c) / N \end{bmatrix}

where the matrix is partitioned into blocks of size N/2 × N/2.

K has stationary distribution π(x) = W_x/W_V = 1/N that is uniform over V since

W_{x} = \sum_{y} W (x, y) = (1 - c) N / 2 + c N / 2 = N / 2

independent of x (i.e., each unit has equal probability of selection). Furthermore, since the weight of each node is the same, the RDS estimator (2.5) simplifies to

\hat{p} = \frac{1}{\sum_{i = 0}^{n - 1} 1 / W_{X_{i}}} \sum_{X_{i} \in D} \frac{1}{W_{X_{i}}} = \frac{\# \{ i : X_{i} \in D \}}{n}

which is the usual estimator for simple random samples. Unlike simple random samples, however, the samples X_i are not independent, and the social network structure of the population affects RDS estimates.

To analyze p̂, we derive an explicit expression for the distribution K_l of the state of the chain after l steps.

Lemma 3.1

For 0 < c < 1/2, the l-step distribution of the walk defined in (3.1) is

K_{l} (x, y) = \begin{cases} (1 + β_{1}^{l}) / N & x, y \in A \text{ or } x, y \in B \\ (1 - β_{1}^{l}) / N & x \in A, y \in B \text{ or } x \in B, y \in A \end{cases}

where β₁ is the second largest eigenvalue of the transition matrix K, which in this case is equal to 1 − 2c.

Although the equilibrium distribution π(x) = 1/N is uniform over V, after any finite number of steps the chain is more likely to be in the group from which the initial sample was chosen due to preferential within-group recruitment. For example, if the initial seed is chosen from A, then for c = 0.1, after 5 steps the chain is still about twice as likely to be in A as in B.
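The "twice as likely" figure can be reproduced with a short calculation. The sketch below assumes only the two-group kernel (3.1), collapsed to the group level: from either group the walk stays with probability 1 − c and switches with probability c. The function names are illustrative. It compares direct iteration with Lemma 3.1 summed over the N/2 members of each group.

```python
def group_distribution(c, steps, start=(1.0, 0.0)):
    """Probability of being in groups (A, B) after `steps` recruitments.

    Uses the group-level chain implied by (3.1): stay with probability 1 - c,
    switch with probability c. The default start places the seed in group A.
    """
    in_a, in_b = start
    for _ in range(steps):
        in_a, in_b = in_a * (1 - c) + in_b * c, in_a * c + in_b * (1 - c)
    return in_a, in_b


def group_distribution_closed_form(c, steps):
    """Lemma 3.1 summed over each group, for a seed in group A."""
    beta1 = 1 - 2 * c
    return (1 + beta1 ** steps) / 2, (1 - beta1 ** steps) / 2


in_a, in_b = group_distribution(c=0.1, steps=5)
print(in_a, in_b, in_a / in_b)  # roughly 0.664, 0.336, and a ratio close to 2
```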

Here the second largest eigenvalue β₁ of the transition matrix is seen to control the rate of convergence of the chain to its equilibrium distribution. This phenomenon is true for general chains [50], and as we show below, β₁ also affects both the bias and variance of the RDS estimate.

In our example, the RDS estimator p̂ is unbiased if the initial sample X₀ is chosen from the stationary distribution π. If instead X₀ is chosen uniformly from group A, then p̂ is biased (although still asymptotically unbiased), and moreover, the bias depends on the bottleneck that is induced by the homophily parameter c and the length of the recruitment chains. This illustrates the fact that the network location of the seed becomes increasingly important in populations with bottlenecks between groups.

Lemma 3.2

Consider the walk defined in (3.1). For an initial sample X₀ chosen uniformly from group A, and a referral chain of size n

E \hat{p} = p + (p_{A} - p_{B}) \frac{1 - β_{1}^{n}}{4 n c}

where β₁ = 1 − 2c.

From Lemma 3.2 we know that the estimator p̂ has bias

bias (\hat{p}) = (p_{A} - p_{B}) \frac{1 - {(1 - 2 c)}^{n}}{4 n c} \approx \frac{p_{A} - p_{B}}{4 n c} = \frac{p_{A} - p_{B}}{2 n (1 - β_{1})}

that depends on the homophily c, the length of the chain n, and the difference in infection proportions between the two groups. The spectral gap 1 − β₁ = 2c captures the effect of network structure. Note that this also shows that even though RDS estimates are asymptotically unbiased, as is often claimed in the literature, there can be substantial bias when the seeds are not selected from the stationary distribution and the sample size is small.

In a population with c = 0.1, a referral chain of length 10 that has initial seed chosen uniformly from group A has bias approximately (p_A − p_B)/5. As c → 0 (i.e., as the two populations become completely disconnected), bias(p̂) → (p_A − p_B)/2. In this extreme case, RDS erroneously estimates only p_A instead of p = (p_A + p_B)/2.
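As a sanity check on Lemma 3.2, the following sketch computes E p̂ exactly by propagating the probability of being in group A along the chain, and compares the resulting bias with the closed form. The prevalences p_A = 0.3 and p_B = 0.1 are borrowed from the paper's later numerical example and are an assumption here; the function names are illustrative.

```python
def expected_p_hat(p_a, p_b, c, n):
    """Exact E[p_hat] for a chain of n samples whose seed is uniform on group A."""
    in_a, total = 1.0, 0.0
    for _ in range(n):
        total += in_a * p_a + (1 - in_a) * p_b    # E f_D(X_i) at this step
        in_a = in_a * (1 - c) + (1 - in_a) * c    # one recruitment step
    return total / n


def bias_closed_form(p_a, p_b, c, n):
    """Bias from Lemma 3.2: (p_A - p_B) (1 - beta1^n) / (4 n c)."""
    beta1 = 1 - 2 * c
    return (p_a - p_b) * (1 - beta1 ** n) / (4 * n * c)


p_a, p_b, c, n = 0.3, 0.1, 0.1, 10
p = (p_a + p_b) / 2
exact_bias = expected_p_hat(p_a, p_b, c, n) - p
formula_bias = bias_closed_form(p_a, p_b, c, n)
print(exact_bias, formula_bias)         # the two values agree
print(formula_bias / (p_a - p_b))       # about 0.22, i.e. roughly 1/5
```

Evaluating `bias_closed_form` with a very small c gives a ratio close to 1/2, matching the disconnected limit described above.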

In the above we estimated the bias of p̂ given that the initial seed was chosen uniformly from group A. Now we assume that the seed is chosen uniformly from the entire population (so that p̂ is unbiased) and analyze its variance.

In populations with community structure, it is more likely that individuals refer people who are in their same social subgroup. Intuitively, in this situation we gain less information from each recruit than if that recruit were chosen randomly from the entire population. The result of this dependence is an effective reduction in sample size. That is, the variance of RDS estimates is larger than the variance of estimates based on a simple random sample of the same nominal size.

The dependence between samples is quantified by their covariance.

Lemma 3.3

Consider the walk defined in (3.1). Suppose $X_{0}^{1}, X_{1}^{1}, \ldots$ and $X_{0}^{2}, X_{1}^{2}, \ldots$ are two independent realizations of the walk with $X_{0}^{1} = X_{0}^{2} \sim π$. That is, both chains begin at the same vertex v, which is drawn from the stationary distribution π. Then for i, j ≥ 0

Cov (f_{D} (X_{i}^{1}), f_{D} (X_{j}^{2})) = {(\frac{p_{A} - p_{B}}{2})}^{2} β_{1}^{i + j}

where β₁ = 1 − 2c and

f_{D} (v_{i}) = \begin{cases} 1 & v_{i} \text{ infected} \\ 0 & \text{otherwise} \end{cases}

Corollary 3.1

Consider the walk X₀, X₁, … defined in (3.1). If X₀ ~ π, then the variance of p̂ satisfies

Var (\hat{p}) = \frac{p - p^{2}}{n} + \frac{{(p_{A} - p_{B})}^{2} β_{1}}{2 n (1 - β_{1})} - \frac{{(p_{A} - p_{B})}^{2} (β_{1} - β_{1}^{n + 1})}{2 n^{2} {(1 - β_{1})}^{2}}

where β₁ = 1 − 2c and n is the sample size.

Again we see the spectral gap 1 − β₁ affects RDS estimates. A naive estimate of the variance (i.e., the variance under simple random sampling) assumes samples are uncorrelated, yielding only the first term (p − p²)/n. In particular, it does not take into account possible community structure in the hidden population. For example, for c = 0.1, p_A = 0.3 and p_B = 0.1, with n = 500, Var(p̂) is approximately 1.5 times the variance of estimates from a simple random sample. Accordingly, confidence intervals determined by the true variance are $\sqrt{1.5} \approx 1.2$ times wider. Put another way, community structure in this example effectively reduces sample size by a third: RDS estimates based on a sample of 500 individuals have the same variance as estimates based on a simple random sample of 335.
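The numbers in this example follow directly from Corollary 3.1. The sketch below, with illustrative function names, evaluates the closed form for c = 0.1, p_A = 0.3, p_B = 0.1 and n = 500, and cross-checks it against a direct sum of the pairwise covariances ((p_A − p_B)/2)² β₁^|i − j| along a single chain.

```python
def rds_variance_single_chain(p_a, p_b, c, n):
    """Var(p_hat) from Corollary 3.1 for one chain of length n started at pi."""
    beta1 = 1 - 2 * c
    gap = 1 - beta1
    p = (p_a + p_b) / 2
    d2 = (p_a - p_b) ** 2
    return ((p - p * p) / n
            + d2 * beta1 / (2 * n * gap)
            - d2 * (beta1 - beta1 ** (n + 1)) / (2 * n * n * gap * gap))


def variance_by_direct_sum(p_a, p_b, c, n):
    """Same variance from Var(f_D) plus all pairwise covariances along the chain."""
    beta1 = 1 - 2 * c
    p = (p_a + p_b) / 2
    # There are n - k pairs of samples exactly k steps apart in a single chain.
    cov_sum = sum((n - k) * beta1 ** k for k in range(1, n))
    return (p - p * p) / n + 2 * ((p_a - p_b) / 2) ** 2 * cov_sum / n ** 2


n = 500
var_rds = rds_variance_single_chain(0.3, 0.1, 0.1, n)
var_srs = (0.2 - 0.2 ** 2) / n          # simple random sampling with p = 0.2
design_effect = var_rds / var_srs       # about 1.5
effective_n = n / design_effect         # between 334 and 335
print(var_rds, design_effect, effective_n)
```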

4. Effects of Multiple Recruitment

Above, we have been assuming that RDS estimates are based on a single, long run of the chain. In practice, this approach is difficult to implement since some sample members do not recruit others, causing the chains to terminate. Instead, in order to ensure that the chains continue, each respondent is encouraged to recruit multiple individuals, as seen in recruitment chains from the Abdul-Quader et al. study of drug users in New York City (Figure 1) [27]. In this study, as with almost all RDS studies, participants were encouraged to recruit up to three others [7]. For a given sample size, however, chain lengths are shorter under multiple recruitment than single recruitment. Consequently, multiple recruitment increases the dependence between participants, and in turn increases the variance of RDS estimates—a concern that was previously overlooked. Observe that multiple recruitment is a different source of dependence than that which directly arises from network community structure; but, as we show, the two interact: community structure amplifies the problems caused by multiple recruitment.¹⁵

We examine the effects of multiple recruitment on the two-group and multi-group network models described in Section 3. As before, we assume that the initial sample X₀ is drawn from the stationary distribution. To compute the covariance between f_D(X₁) and f_D(X₂) in Figure 4, observe that A_x₁x₂ is the most recent common ancestor of X₁ and X₂. Consequently, X₁ and X₂ result from independent runs of the chain started at A_x₁x₂, and so we are in the situation of Lemma 3.3. That is,

Cov (f_{D} (X_{1}), f_{D} (X_{2})) = {(\frac{p_{A} - p_{B}}{2})}^{2} β_{1}^{2 + 3} .

Figure 4. A_x₁x₂ is the most recent common ancestor of X₁ and X₂.

In general, for two samples X_i and X_j, this argument shows that

Cov (f_{D} (X_{i}), f_{D} (X_{j})) = {(\frac{p_{A} - p_{B}}{2})}^{2} β_{1}^{l (i, j)}

where l(i, j) is the length of the unique path between X_i and X_j in the recruitment tree.

Lemma 4.1

Consider the walk defined in (3.1). Suppose a recruitment tree is chosen according to a probability distribution λ on the set of n-node trees, and RDS samples are correspondingly collected. Then the variance of p̂ satisfies

Var (\hat{p}) = \frac{p - p^{2}}{n} + \frac{{(p_{A} - p_{B})}^{2}}{2 n^{2}} \sum_{k = 1}^{n - 1} β_{1}^{k} E_{λ} L (k)

where L(k) is the number of pairs of samples distance k apart.

Lemma 4.1 shows that, for a given social network structure, the further apart the sample units, the lower the variance. That is, “thin,” as opposed to “thick,” recruitment chains lead to improved estimates. Furthermore, observe the key role again played by the second largest eigenvalue β₁. As β₁ increases (i.e., as the spectral gap 1 − β₁ decreases), the variance of RDS increases. In other words, community structure amplifies the effects of multiple recruitment.
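The contrast between thin and thick recruitment can be illustrated by evaluating Lemma 4.1 on two fixed trees (that is, taking λ to be a point mass). This is a sketch under assumptions: the trees are encoded as hypothetical parent-index lists, the sample size n = 127 is chosen only so that the binary tree is complete, and the parameter values are those of the paper's running example.

```python
def tree_distance(parent, i, j):
    """Path length between nodes i and j; requires parent[k] < k for every k > 0."""
    steps = 0
    while i != j:
        # The larger index cannot be an ancestor of the smaller one, so move it up.
        if i > j:
            i = parent[i]
        else:
            j = parent[j]
        steps += 1
    return steps


def rds_variance_on_tree(parent, p_a, p_b, c):
    """Var(p_hat) from Lemma 4.1 for one fixed recruitment tree."""
    n = len(parent)
    beta1 = 1 - 2 * c
    p = (p_a + p_b) / 2
    pair_sum = sum(beta1 ** tree_distance(parent, i, j)
                   for i in range(n) for j in range(i + 1, n))
    return (p - p * p) / n + (p_a - p_b) ** 2 * pair_sum / (2 * n * n)


n = 127
chain = [-1] + list(range(n - 1))                      # each person recruits one other
binary = [-1] + [(k - 1) // 2 for k in range(1, n)]    # each person recruits two others

thin = rds_variance_on_tree(chain, 0.3, 0.1, 0.1)
thick = rds_variance_on_tree(binary, 0.3, 0.1, 0.1)
print(thin, thick)  # the binary tree yields the larger variance
```

For the chain, the result coincides with Corollary 3.1, since a single long chain is the special case of a tree in which every node has one child.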

With single recruitment, the Markov chain central limit theorem (2.6) shows that the variance of p̂ decreases as 1/n, where n is the sample size. With multiple recruitment, however, the variance may decrease as 1/n^δ for δ < 1. That is, multiple recruitment may lead to asymptotically slower decay of error in the RDS estimator. To see this effect, we analyze the variance of the RDS estimator on the two-group model in the case where each sample member recruits exactly two other sample members.

In order to apply Lemma 4.1, we first estimate the path length distribution.

Theorem 4.1

Suppose T_H is a complete binary tree of height H ≥ 1 (i.e., T_H has 2^{H+1} − 1 nodes, and each non-leaf node has exactly two children). Let L_H(k) be the number of pairs of nodes that are distance k apart. Then for 1 ≤ k ≤ 2H

\frac{1}{4} 2^{H + k / 2} \leq L_{H} (k) \leq 2 k 2^{H + k / 2} .
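For small trees the bounds in Theorem 4.1 can be checked by brute force. The sketch below is illustrative only: it enumerates all pairs of nodes in a complete binary tree stored with heap-style indexing (an implementation choice, not part of the theorem), counts how many pairs lie at each distance, and tests the two inequalities.

```python
from collections import Counter


def path_length_counts(height):
    """Count pairs of nodes at each distance in a complete binary tree."""
    n = 2 ** (height + 1) - 1
    parent = [-1] + [(k - 1) // 2 for k in range(1, n)]   # heap-style indexing
    counts = Counter()
    for i in range(n):
        for j in range(i + 1, n):
            a, b, steps = i, j, 0
            while a != b:            # climb from the larger index until the paths meet
                if a > b:
                    a = parent[a]
                else:
                    b = parent[b]
                steps += 1
            counts[steps] += 1
    return counts


def bounds_hold(height):
    """Check (1/4) 2^(H + k/2) <= L_H(k) <= 2k 2^(H + k/2) for 1 <= k <= 2H."""
    counts = path_length_counts(height)
    return all(
        0.25 * 2 ** (height + k / 2) <= counts[k] <= 2 * k * 2 ** (height + k / 2)
        for k in range(1, 2 * height + 1)
    )


print(dict(path_length_counts(2)))                  # {1: 6, 2: 7, 3: 4, 4: 4}
print(all(bounds_hold(h) for h in range(1, 7)))     # True
```

The lower bound is attained at k = 2H, where the only pairs are leaves on opposite sides of the root.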

Since 2^H ≈ n, Theorem 4.1 shows that, ignoring log factors, L_H(k) ≈ n 2^{k/2}. Now we estimate the variance of p̂.

Theorem 4.2

Consider the walk defined in (3.1). Suppose the recruitment tree is a complete binary tree of height H ≥ 1, in which case n = 2^{H+1} − 1. If $β_{1} > \sqrt{2} / 2$, then the variance of p̂ satisfies

\frac{p - p^{2}}{n} + \left[ \frac{1}{n^{\log_{2} (1 / β_{1}^{2})}} - \frac{2}{n} \right] \frac{β_{1} \sqrt{2} \, {(p_{A} - p_{B})}^{2}}{32 (β_{1} \sqrt{2} - 1)} \leq Var (\hat{p}) \leq \frac{p - p^{2}}{n} + \frac{4 \log_{2} n}{n^{\log_{2} (1 / β_{1}^{2})}} {\left( \frac{p_{A} - p_{B}}{1 - β_{1} \sqrt{2}} \right)}^{2} .

Ignoring log factors, Theorem 4.2 shows that for $β_{1} > \sqrt{2} / 2$, $Var (\hat{p}) \approx 1 / n^{\log_{2} (1 / β_{1}^{2})}$. In particular, for $\sqrt{2} / 2 < β_{1} < 1$, we have $\log_{2} (1 / β_{1}^{2}) < 1$. Furthermore, this exponent decreases (i.e., decay is slower) as β₁ increases.

Above we analyze a deterministic recruitment tree; now we consider a more realistic stochastic recruitment procedure that is modeled as a branching process¹⁶ with offspring distribution based on data from the Frost et al. study of injection drug users in Tijuana and Ciudad Juarez [51]. In that study, three coupons were provided to each participant and approximately one-third of the participants recruited no other participants, one-sixth recruited one other, another one-sixth recruited two others, and the remaining one-third recruited three other participants (Table 1). In this case, while it seems difficult to find an analytic expression for E_λ L(k), the path length distribution can be estimated by simulation. Combining these simulation results that estimate L(k) with our common network parameter values (c = 0.1, p_A = 0.1 and p_B = 0.3) and a sample size of n = 500, we find that Var(p̂) is approximately 3.7 times the variance under simple random sampling. In other words, community structure and multiple recruitment substantially reduce the effective RDS sample size. In this example, a sample size of 500 collected via RDS with multiple recruitment corresponds to a sample size of just 136 people collected via simple random sampling.

Table 1.

Multiple recruitment offspring distribution based on a study of injection drug users in Tijuana and Ciudad Juarez [51].

Number of recruits	0	1	2	3
Probability	1/3	1/6	1/6	1/3
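
As a rough illustration of the stochastic recruitment procedure, the sketch below draws a recruitment tree from a branching process with the offspring distribution in Table 1 and discards trees that die out before reaching n participants. The breadth-first order in which participants recruit, the truncation once n is reached, and all function names are assumptions made for this example; they are not specified in the text.

```python
import random
from collections import deque

# Offspring distribution from Table 1: number of recruits -> probability.
OFFSPRING_PROBS = {0: 1 / 3, 1: 1 / 6, 2: 1 / 6, 3: 1 / 3}


def simulate_recruitment_tree(n, rng):
    """Grow one tree; return its parent list, or None if it dies out before size n.

    Node 0 is the seed and parent[0] == -1. Participants recruit in
    breadth-first order (an assumption), and growth stops at n nodes.
    """
    counts = list(OFFSPRING_PROBS)
    weights = list(OFFSPRING_PROBS.values())
    parent = [-1]
    queue = deque([0])
    while queue and len(parent) < n:
        recruiter = queue.popleft()
        k = rng.choices(counts, weights=weights)[0]
        for _ in range(k):
            if len(parent) == n:
                break
            parent.append(recruiter)
            queue.append(len(parent) - 1)
    return parent if len(parent) == n else None


def sample_tree_of_size(n, rng):
    """Rejection-sample until a tree survives to exactly n participants."""
    while True:
        tree = simulate_recruitment_tree(n, rng)
        if tree is not None:
            return tree


def wave_of_each_node(parent):
    """Wave (depth) of each participant; relies on parents preceding children."""
    wave = [0] * len(parent)
    for v in range(1, len(parent)):
        wave[v] = wave[parent[v]] + 1
    return wave


rng = random.Random(0)
tree = sample_tree_of_size(500, rng)
print("number of waves:", max(wave_of_each_node(tree)) + 1)
```

Because the mean number of recruits is 1.5, a positive fraction of trees survive, so the rejection loop terminates with probability one.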


5. Conclusion

5.1. Summary

Our network models illustrate the effects of both the social network and multiple recruitment on RDS estimates in a stylized setting which attempts to mimic situations in which RDS may be used. To summarize our findings for the two-group model (and equivalently, the multi-group model), we compare three sampling situations: simple random sampling, RDS with single recruitment, and RDS with multiple recruitment. Figure 5 shows the distribution of p̂ in these three cases.

Simple random sampling

Since p = 0.2, the variance of p̂ is (p − p²)/500 = 0.00032 and its standard deviation is approximately 0.0179. Consequently, the 95% confidence interval for the estimate is approximately p̂ ± 3.5%. The variance, in this case, is independent of the network structure.

RDS – Single Recruitment

For c = 0.1 the second largest eigenvalue satisfies β₁ = 1 − 2c = 0.8. If the samples are the result of a single, long chain (without multiple recruitment) starting at the stationary distribution, then Var(p̂) is given by Corollary 3.1, which yields a standard deviation of approximately 0.0219. The 95% confidence interval is then p̂ ± 4.3%, the same interval one would get from a simple random sample of size 335.

RDS – Multiple Recruitment

Assume multiple recruitment that follows a branching process with offspring distribution based on the recruitment data from the Frost et al. study of injection drug users in two Mexican cities (see Table 1) [51]. Simulation shows that

\frac{1}{n} \sum_{k = 1}^{n - 1} β_{1}^{k} E L (k) \approx 21.3

where the expectation is conditional on the recruitment tree having size n = 500 (i.e., we disregard trees that go extinct prematurely). Lemma 4.1 then shows that the standard deviation of the estimate is approximately 0.0343, yielding a confidence interval p̂ ± 6.7%. This level of variance corresponds to a simple random sample of size 136, or an RDS sample with single recruitment of size 204.
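
The comparison across the three sampling situations can be reproduced approximately from the reported standard deviations. The sketch below converts a standard deviation into a 95% confidence half-width and into the size of a simple random sample with the same variance. The helper names are assumptions, and because the inputs are rounded, the results may differ from the quoted sample sizes by a few units.

```python
import math


def effective_sample_size(p, sd):
    """Size of a simple random sample whose variance p(1 - p)/n equals sd**2."""
    return p * (1 - p) / sd ** 2


def ci_half_width(sd, z=1.96):
    """Half-width of a normal-approximation 95% confidence interval."""
    return z * sd


p, n = 0.2, 500
reported_sd = {
    "simple random sampling": math.sqrt(p * (1 - p) / n),
    "RDS, single recruitment": 0.0219,    # rounded value quoted in the text
    "RDS, multiple recruitment": 0.0343,  # rounded value quoted in the text
}

for design, sd in reported_sd.items():
    ess = effective_sample_size(p, sd)
    print(f"{design}: +/- {100 * ci_half_width(sd):.1f}%, "
          f"effective n = {ess:.0f}, design effect = {n / ess:.2f}")
```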

5.2. Implications and Directions for Further Research

We conclude by describing some of the specific implications of these findings for the practice of RDS.

Community structure

Past RDS work focused on the bottleneck between infected and healthy individuals. Bottlenecks anywhere in the network, however, impact the quality of RDS estimates. While preexisting knowledge may alert researchers to some bottlenecks (e.g., those between street-based and brothel-based sex workers), we suspect that it is difficult in practice to detect and to adjust for bottlenecks that exist in networks with complex community structure, such as the multi-group network in Figure 3(b). We hope future theoretical and empirical work continues to explore the bias and variance of the RDS estimators as a function of network structure, particularly taking care to develop procedures which require only limited information about the underlying social network.

Multiple recruitment

The multiple recruitment feature of RDS was developed to help ensure that sampling chains survive even when some subjects do not recruit. However, this design feature may diminish the accuracy of RDS estimates by increasing the dependence between sample units. Since the specific structure of recruitment chains impacts RDS estimates, larger sample sizes do not always produce more accurate estimates than smaller sample sizes, contrary to intuition from simple random sampling. While it is currently common practice to provide subjects with three recruitment coupons each, RDS would benefit from techniques that make it practical to reduce that number.

Assumptions

The properties of the RDS estimator rest on a number of assumptions, many of which may not be met in practice. For example, the sampling with replacement assumption is violated by design in virtually all RDS studies. Also problematic is the uniform recruitment assumption (i.e., that participants recruit their contacts uniformly at random and that all contacts approached agree to participate). For example, de Mello et al. found evidence of non-random recruitment in their study of men who have sex with men in Campinas, Brazil [53], and similar results have been reported elsewhere in the literature [24; 26; 54]. We hope that future work develops diagnostics to detect violations of these assumptions, and explores the effects of such violations on RDS estimates.

High variance

Much of the excitement around RDS in the public health community has focused on the fact that, under certain strong assumptions, the estimates are asymptotically unbiased. This paper highlights the variance of these estimates. In some cases, the variance of RDS estimates may be so large that the estimates themselves are of little value. Prior work suggested that RDS researchers should assume a design effect of 2; that is, that RDS samples should be twice as large as would be needed under simple random sampling [44]. The results in this paper suggest that this rule of thumb should probably be revised upward. We hope that future work will provide further guidance to researchers about the sample sizes needed for their studies.

Beyond the specific results from this paper, clarifying the connection between RDS and MCMC allows future researchers to harvest ideas from the vast MCMC literature. For example, RDS researchers could consider discarding data from early sample waves, just as researchers using computer-driven MCMC often discard a portion of their draws during the so-called “burn-in” phase [55]. This possibility is especially important because RDS seeds are almost certainly not chosen from the stationary distribution. Another potential avenue for future work would be to modify existing MCMC convergence diagnostics so as to monitor the convergence of RDS estimates. For example, the use of multiple seeds, a common feature of RDS studies, creates parallel chains that could lead to natural convergence measures [56–58]. One nice feature of this approach would be that researchers could run these diagnostics while the study is underway, and thereby potentially detect problems while there is still time to correct them. These suggestions represent just a few of the possible improvements to RDS, improvements that may ultimately allow researchers to better study hidden populations.

Acknowledgments

This work was partially supported by the Department of Mathematics at the University of Southern California, the Institute for Social and Economic Research and Policy (ISERP) at Columbia University, and the Applied Statistics Center at Columbia University. The authors thank Edo Airoldi, Andrew Gelman, Doug Heckathorn, Erik Volz, and the anonymous reviewers for helpful comments.

Appendix A. Equivalence of the two-group and multi-group Models

Random walks on the two-group and multi-group network models of Section 3 are equivalent in the following sense. Let f indicate infection status: f(v_i) = 1 if v_i is infected and f(v_i) = 0 otherwise. Suppose N is even and divisible by m, and that the between-subgroup edge weight c in the two-group model satisfies c < 1/2. Set the between-subgroup edge weight b in the multi-group model to:

b = \frac{c m}{N (1 / 2 - c) + c m} .

(A.1)

Finally, let X₀, X₁, … and X̃₀, X̃₁, … denote RDS samples from the two networks, respectively, with either X₀ and X̃₀ chosen uniformly from a type-A subgroup or X₀ and X̃₀ chosen uniformly from a type-B subgroup. That is, X₀, X₁, … and X̃₀, X̃₁, … are RDS samples on the two networks conditioned to start in the same subgroup type. Then, f(X₀), f(X₁), … has the same distribution as f(X̃₀), f(X̃₁), ….

To prove this equivalence, it is sufficient to show that with b satisfying (A.1), the probability of transition from any node in an A subgroup to any node in a B subgroup is the same in both models. In the two-group model, this between-group transition probability is

\frac{c N / 2}{c N / 2 + (1 - c) N / 2} = c

and in the multi-group model, the between-group transition probability is

\frac{b N / 2}{b (N - m) + m} .

Consequently, the two models are equivalent for b such that

\frac{b N / 2}{b (N - m) + m} = c .

(A.2)

Solving for b in (A.2), we have

b = \frac{c m}{N / 2 - c (N - m)} = \frac{c m}{N (1 / 2 - c) + c m}

establishing the equivalence.

To build intuition for this equivalence, we examine the limiting behavior of the two models, although, as shown above, the equivalence already holds for finite networks. Observe that the probability of transitioning out of one of the small m-node subgroups is

\frac{b (N - m)}{b (N - m) + m} = \frac{1}{1 + \frac{m}{b (N - m)}} = \frac{1}{1 + \frac{N (1 / 2 - c) + c m}{c (N - m)}} \to 2 c \quad \text{as } N \to \infty .

In the limit, the two models are equivalent when transition out of a subgroup in the multi-group model occurs with probability 2c. Since the number of subgroups N/m → ∞, the probability of transitioning to a type-A subgroup, given that one has transitioned out of one’s current subgroup, is 1/2. Consequently, transition between subgroup-types (i.e., type-A or type-B subgroups) occurs with probability c.
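
The algebra in this appendix can be checked numerically. The sketch below uses exact rational arithmetic to confirm that b from (A.1) satisfies (A.2), and then evaluates the probability of leaving a subgroup for increasing N to show it approaching 2c. The function names and the particular values of N and m are assumptions chosen for illustration.

```python
from fractions import Fraction


def multi_group_b(c, N, m):
    """Between-subgroup edge weight b from (A.1)."""
    return c * m / (N * (Fraction(1, 2) - c) + c * m)


def between_type_prob(b, N, m):
    """Multi-group probability of moving between subgroup types: left side of (A.2)."""
    return (b * N / 2) / (b * (N - m) + m)


def leave_subgroup_prob(b, N, m):
    """Multi-group probability of leaving the current m-node subgroup."""
    return b * (N - m) / (b * (N - m) + m)


c, m = Fraction(1, 10), 10
for N in (100, 1_000, 100_000):
    b = multi_group_b(c, N, m)
    # (A.2) holds exactly, so this equality involves no rounding.
    assert between_type_prob(b, N, m) == c
    print(f"N = {N}: P(leave subgroup) = {float(leave_subgroup_prob(b, N, m)):.5f}")
```

With c = 0.1, the printed probabilities should approach 2c = 0.2 as N grows, matching the limit derived above.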

Appendix B. Further Technical Details and Proofs

Suppose K is the kernel of an irreducible finite Markov chain, and π its stationary distribution. We think of (K, π) as an operator on L²(π)—the space of functions f : V → ℝ with inner product

〈 f, g 〉 = \sum_{x \in V} f (x) g (x) π (x)

and corresponding norm

{‖ f ‖}_{2}^{2} = \sum_{x \in V} f^{2} (x) π (x) .

Then for f ∈ L²(π)

K f (x) = \sum_{y \in V} K (x, y) f (y) .

We call ψ ∈ L²(π) an eigenfunction for K with eigenvalue λ if Kψ = λψ.

Random walks on weighted graphs are reversible, i.e., they satisfy the detailed balance equation

π (x) K (x, y) = π (y) K (y, x) .

Reversibility is equivalent to K : L²(π) → L²(π) being self-adjoint. Consequently, reversible walks are diagonalizable in an orthonormal basis of real eigenfunctions. That is, there exist eigenfunctions $ψ_{0}, ψ_{1}, \dots, ψ_{N - 1}$ with corresponding real eigenvalues

1 = β_{0} \geq β_{1} \geq \dots \geq β_{N - 1} \geq - 1

such that 〈ψ_i, ψ_j〉 = δ_ij. For details of the above functional analytic view, see Saloff-Coste [50].
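
The spectral facts above can be illustrated numerically. The following minimal sketch builds a random walk on a small two-group weighted graph, checks detailed balance, and recovers real eigenvalues with eigenfunctions that are orthonormal in L²(π). It assumes NumPy is available and assumes within-group weight 1 − c and between-group weight c with self-loops included, which matches the transition probability computed in Appendix A; the function names are hypothetical.

```python
import numpy as np


def two_group_kernel(N, c):
    """Random-walk kernel K and stationary distribution pi on a two-group graph.

    Assumed weights: 1 - c within a group (self-loops included), c between groups.
    """
    half = N // 2
    W = np.full((N, N), 1.0 - c)
    W[:half, half:] = c
    W[half:, :half] = c
    degree = W.sum(axis=1)
    return W / degree[:, None], degree / degree.sum()


def spectral_decomposition(K, pi):
    """Eigenvalues (descending) and eigenfunctions orthonormal in L^2(pi)."""
    s = np.sqrt(pi)
    # S = D^{1/2} K D^{-1/2} is symmetric exactly when detailed balance holds.
    S = (s[:, None] * K) / s[None, :]
    if not np.allclose(S, S.T):
        raise ValueError("kernel is not reversible with respect to pi")
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    # psi = D^{-1/2} u turns eigenvectors of S into eigenfunctions of K.
    return vals[order], vecs[:, order] / s[:, None]


K, pi = two_group_kernel(N=20, c=0.1)
flow = pi[:, None] * K
print("detailed balance holds:", np.allclose(flow, flow.T))

beta, psi = spectral_decomposition(K, pi)
gram = psi.T @ (pi[:, None] * psi)  # matrix of inner products <psi_i, psi_j>
print("orthonormal in L^2(pi):", np.allclose(gram, np.eye(len(pi))))
print("two largest eigenvalues:", beta[:2])
```

For this two-group graph the second eigenvalue should be close to 1 − 2c, in line with the value β₁ = 0.8 used for c = 0.1 in Section 5.1.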