Proceedings of the National Academy of Sciences of the United States of America. 2018 Sep 25;115(41):10299–10304. doi: 10.1073/pnas.1706699115

Generalized least squares can overcome the critical threshold in respondent-driven sampling

Sebastien Roch a,1, Karl Rohe b,1,2
PMCID: PMC6187121  PMID: 30254152

Significance

Respondent-driven sampling (RDS) is a popular technique to sample marginalized or hard-to-reach populations, where participants can refer multiple contacts into the sample. Using the sampled participants, we wish to estimate properties of the population, often the proportion of individuals that are HIV+. Because contacts often share the same HIV status, adjacent samples are dependent. As a result, RDS can lead to highly variable estimates of HIV prevalence. This paper studies an estimation technique for HIV prevalence that is based upon the classical idea of generalized least squares.

Keywords: snowball sampling, link-tracing sampling, spectral gap

Abstract

To sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like $O(n^{-1})$, where $n$ is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is $O(n^{-1})$. We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from a random walk sample of the nodes. These theoretical results point the way to entirely different classes of estimators that account for the network structure beyond node degree. Diagnostic plots help to identify situations where feasible GLS estimators are more appropriate. The computational experiments show the potential benefits and also indicate that there is room to further develop these estimators in practical settings.


Respondent-driven sampling (RDS) is a popular network-based approach to sample marginalized and/or hard-to-reach populations (1). RDS has become particularly popular in HIV research because the populations most at risk for HIV (e.g., people who inject drugs, female sex workers, and men who have sex with men) cannot be sampled using conventional techniques. Several domestic and international institutions use RDS to quantify the prevalence of HIV in at-risk populations, including the Centers for Disease Control (CDC), the World Health Organization (WHO), and the Joint United Nations Program on HIV/AIDS (UNAIDS) (2). The most recent review of the literature in 2015 counted over 460 different RDS studies, in 69 different countries (3).

Because RDS collects samples from link-tracing the relationships in a social network, adjacent samples are dependent. In a simulation study, ref. 4 showed how this can lead to highly variable estimates. Under independent sampling, the variance of standard estimators decays like $O(n^{-1})$. This implies that a sample of size $4n$ will have a 50% smaller SE than a sample of size $n$. However, this does not necessarily hold for RDS. Under a Markov model, ref. 5 showed how the dependence induced by RDS can drastically inflate the variance of traditional estimators, making it decay at a rate slower than $O(n^{-1})$. This implies that reducing the sampling error by 50% can require far more than 4 times as many samples, so confidence intervals are much wider than under independent sampling. Using the covariance function derived in ref. 5, this paper studies the generalized least squares (GLS) estimator for RDS. Our theoretical analysis establishes that the variance of the GLS estimator is $O(n^{-1})$. We then derive a feasible GLS (fGLS) estimator based upon the Degree Corrected Stochastic Blockmodel (DC-SBM). Two alternative estimators are derived. These estimators first construct estimates of the spectral properties of the population social graph, which might be of independent interest. Our fGLS estimators easily accommodate any preliminary reweighting of the data to adjust for the sampling biases that occur in RDS (e.g., refs. 6 and 7). We study these estimators with simulations and propose a simple diagnostic plot to compare the different fGLS estimators.

A Simple Motivating Example

Fig. 1 uses a model studied in ref. 8. In this example, the population that we wish to sample is equally divided into two groups: HIV+ and HIV–. The seed participant is selected uniformly at random. Starting with the seed participant, every participant refers two additional participants (as in a complete binary tree). The participant refers a person that matches his or her own HIV status with probability $p$ and refers a person with the opposite status with probability $1-p$. Each referral is independent. Using this sample, we wish to estimate the proportion of the population that is HIV+ (i.e., 50%). Fig. 1 compares two estimators: (i) the sample proportion and (ii) the GLS estimator proposed in this paper.

Fig. 1.

In this experiment, GLS provides dramatic improvements when the sample is large and the correlation between samples (i.e., p) is high. Both axes are on the log scale.

Under this sampling with replacement model, the variances of both the sample proportion and the GLS estimator have closed form solutions (see ref. 8 and the proof of Theorem 2 in SI Appendix, Section S4). Fig. 1 gives the ratio of these formulas as a function of the sample size $n$. There are three lines, corresponding to $p=.6$, $p=.75$, and $p=.9$. In all cases, the lines are less than 1, indicating that the GLS estimator has a smaller variance than the sample proportion. Under this simulation model, if $p>.86$, then the variance of the sample proportion decays slower than $O(n^{-1})$ (5, 8). As Theorems 1 and 2 below show, the variance of the GLS estimator converges to 0 like $O(n^{-1})$. So, as $n$ increases, the bottom line converges to 0. The other two lines, on the other hand, do not converge to 0.
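The comparison in Fig. 1 can be explored numerically. The following is a minimal sketch, not the authors' code. It assumes the two-group referral chain has transition matrix [[p, 1-p], [1-p, p]], so that two samples at tree distance d have covariance (1/4)(2p-1)^d, and it uses the standard GLS variance $(\mathbf{1}^T\Sigma^{-1}\mathbf{1})^{-1}$. The function names and the heap-order indexing of the binary tree are illustrative choices.

```python
import numpy as np

def binary_tree_distances(n):
    """Pairwise graph distances in a binary tree whose nodes 0..n-1 are in heap order."""
    D = np.zeros((n, n), dtype=int)
    for s in range(1, n):
        parent = (s - 1) // 2
        # Any node t < s is not a descendant of s, so the path s -> t passes through the parent.
        D[s, :s] = D[parent, :s] + 1
        D[:s, s] = D[s, :s]
    return D

def variance_ratio(p, n):
    """Var(GLS) / Var(sample proportion) for the two-group model on a binary tree."""
    lam = 2.0 * p - 1.0                               # second eigenvalue of [[p, 1-p], [1-p, p]]
    Sigma = 0.25 * lam ** binary_tree_distances(n)    # Var(Y) = 1/4 in a 50/50 population
    ones = np.ones(n)
    var_gls = 1.0 / (ones @ np.linalg.solve(Sigma, ones))
    var_mean = ones @ Sigma @ ones / n**2
    return var_gls / var_mean

for p in (0.6, 0.75, 0.9):
    print(p, [round(variance_ratio(p, n), 3) for n in (7, 63, 511)])
```

Because the GLS weights minimize the variance among all weights that sum to 1, every printed ratio is at most 1.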

Preliminaries

The Markov model used in this paper is a straightforward combination of the Markov models developed in the RDS literature (e.g., refs. 1, 6, and 8).

The social network, $G=(V,E)$, consists of the node set $V=\{1,\dots,N\}$ and the edge set $E=\{(i,j): i \text{ and } j \text{ can refer each other}\}$. To simplify the notation, $i\in G$ is used synonymously with $i\in V$. Unless otherwise noted, everything below also applies to weighted graphs. Let $w_{ij}$ be the weight of edge $(i,j)\in E$, which models preferential recruitment as described in SI Appendix, Section S1. If $(i,j)\notin E$, define $w_{ij}=0$. If the graph is unweighted, then let $w_{ij}=1$ for all $(i,j)\in E$. Throughout this paper, the graph is undirected; that is, $w_{ij}=w_{ji}$ for all pairs $i,j$. Define the degree of node $i$ as $\deg(i)=\sum_j w_{ij}$. For each node $i\in G$, let $y(i)\in\mathbb{R}$ denote some characteristic of this node (e.g., the indicator of HIV status). We wish to estimate the population average:

$$\mu_{\mathrm{true}}=\frac{1}{N}\sum_{i\in G}y(i). \tag{1}$$

We assume that the nodes are sampled with a Markov process that is indexed by a rooted tree $T$ (i.e., a connected graph with $n$ nodes, no cycles, and a vertex 0). The seed participant is vertex 0 in $T$. To simplify the notation, $\sigma\in T$ is used synonymously with $\sigma$ belonging to the vertex set of $T$. For any node in the tree $\sigma\in T$, denote $\sigma'\in T$ as the parent of $\sigma$ (the node one step closer to the root). Define the matrix $P\in\mathbb{R}^{N\times N}$ as

$$P_{ij}=\frac{w_{ij}}{\deg(i)}. \tag{2}$$

Because the graph is undirected, $P$ is a reversible Markov transition matrix with a stationary distribution $\pi: G\to\mathbb{R}$. Our sample is the set of random nodes

$$\{X_\sigma\in G:\sigma\in T\},$$

where $X_0$ is initialized with $\pi$ and each transition $X_{\sigma'}\to X_\sigma$ is independent with

$$\mathbb{P}(X_\sigma=j\mid X_{\sigma'}=i)=P_{ij},\quad\text{for } i,j\in G.$$

Observe that $T$ and $G$ are distinct graphs: The nodes in $T$ index the Markov process, while the nodes in $G$ are its state space. Following ref. 9, we refer to this stochastic process as a $(T,P)$-walk on $G$.
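To make the sampling model concrete, the following sketch simulates a tree-indexed walk on a small weighted graph. It is an illustration under stated assumptions rather than the authors' simulation code: the tree is a binary tree stored in heap order, and the names `heap_parents`, `simulate_tree_walk`, and the example weight matrix `W` are hypothetical.

```python
import numpy as np

def heap_parents(n):
    """Parent index of each node in a binary tree stored in heap order (root 0 has parent -1)."""
    return np.array([-1] + [(s - 1) // 2 for s in range(1, n)])

def simulate_tree_walk(W, parents, rng):
    """Draw X_sigma for every tree node; parents[s] < s is assumed for all s >= 1."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)                   # deg(i) = sum_j w_ij
    P = W / deg[:, None]                  # P_ij = w_ij / deg(i)
    pi = deg / deg.sum()                  # stationary distribution of the reversible chain
    X = np.empty(len(parents), dtype=int)
    X[0] = rng.choice(len(pi), p=pi)      # the seed is drawn from pi
    for s in range(1, len(parents)):
        # Each child is one step of the chain started from its parent's state.
        X[s] = rng.choice(len(pi), p=P[X[parents[s]]])
    return X, P, pi

# Example: a 4-node undirected graph and a 15-node referral tree.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
parents = heap_parents(15)
X, P, pi = simulate_tree_walk(W, parents, np.random.default_rng(0))
```

The array `X` holds states in $G$, while its positions index nodes of $T$, which mirrors the distinction between the two graphs.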

When the $X_\tau$s sample the target population in $G$, we observe

$$Y_\tau=y(X_\tau)\quad\text{for }\tau\in T.$$

Under the stationary $(T,P)$-walk on $G$, the sample average of the $Y_\tau$s is an estimate of

$$\mu=E(Y_0)=\sum_i y(i)\pi_i.$$

In general, $\mu\ne\mu_{\mathrm{true}}$ (where $\mu_{\mathrm{true}}$ was defined in Eq. 1), and the sample average must be adjusted with sampling weights to obtain an unbiased estimator of $\mu_{\mathrm{true}}$. Define

$$y^\pi(i)=\frac{y(i)}{\pi_i N},\qquad Y_\tau^\pi=y^\pi(X_\tau).$$

The sample average of the $Y_\tau^\pi$s is the inverse probability weighted (IPW) estimator; it is an unbiased estimator of $\mu_{\mathrm{true}}$ (10). However, the weights $\pi_i N$ are unknown and must be estimated with additional information, as we describe next.

Under the $(T,P)$-walk on $G$,

$$N\pi_i=N\frac{\deg(i)}{\sum_j\deg(j)}=N\frac{\deg(i)}{N\bar d}=\frac{\deg(i)}{\bar d},$$

where $\bar d=N^{-1}\sum_j\deg(j)$. The popular Volz–Heckathorn (VH) estimator replaces $\bar d$ with the harmonic mean of the sampled degrees (6). Recall that $T$ has $n$ nodes and define

$$H_{\deg}^{-1}=\frac{1}{n}\sum_{\tau\in T}\frac{1}{\deg(X_\tau)},\qquad \hat\pi_i=H_{\deg}^{-1}\deg(i),\qquad y^{\hat\pi}(i)=\frac{y(i)}{\hat\pi_i},$$

and $Y_\tau^{\hat\pi}=y^{\hat\pi}(X_\tau)$. The VH estimator is the sample average of the $Y_\tau^{\hat\pi}$s, and it is an asymptotically unbiased estimator of $\mu_{\mathrm{true}}$ under the $(T,P)$-walk on $G$. [In practice, $\deg(i)$ is estimated by asking participants how many contacts they have. Recall that $\deg(i)=\sum_j w_{ij}$. If the graph is weighted, then the $(T,P)$-walk on $G$ exhibits preferential recruitment (as discussed in SI Appendix, Section S1) and the number of contacts will not necessarily align with $\deg$, making the estimator biased.]
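The VH formulas above translate directly into a few lines of code. This is a minimal sketch assuming the reported degrees equal $\deg(X_\tau)$; the function name `vh_estimate` is illustrative.

```python
import numpy as np

def vh_estimate(y_sample, deg_sample):
    """Volz-Heckathorn estimate: the sample average of y(X_tau) / pi_hat(X_tau)."""
    y = np.asarray(y_sample, dtype=float)
    d = np.asarray(deg_sample, dtype=float)
    inv_h = np.mean(1.0 / d)       # H_deg^{-1}: inverse of the harmonic mean of sampled degrees
    pi_hat = inv_h * d             # pi_hat_i = H_deg^{-1} * deg(i)
    return np.mean(y / pi_hat)

# Four sampled participants: outcome indicator and reported degree.
est = vh_estimate([1, 0, 1, 0], [1, 2, 4, 4])
print(est)
```

Algebraically the estimate equals $\sum_\tau (Y_\tau/\deg(X_\tau)) / \sum_\tau (1/\deg(X_\tau))$, so low-degree participants, who are under-sampled by the walk, receive more weight.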

Remark

The next section will drop the superscripts $\pi$ and $\hat\pi$ in $Y_\tau^\pi$ and $Y_\tau^{\hat\pi}$. Using the $Y_\tau$s to construct the GLS estimator will lead to an unbiased estimator of $\mu$. In practice, before doing any of the GLS computations, one could replace the $Y_\tau$s with $Y_\tau^\pi$ or $Y_\tau^{\hat\pi}$ to estimate $\mu_{\mathrm{true}}$. The simulations in this paper use a reweighting that is similar to $\hat\pi$ but replaces $H_{\deg}^{-1}$ with a GLS estimate of $E(1/\deg(X_\tau))$. In ref. 7, sampling weights are estimated under an alternative, non-Markovian model. These weights could also be used before doing GLS computations.

GLS for RDS

The GLS estimator is the weighted average of the $Y_\tau$s with smallest variance (11); that is, it is the solution $g^*$ to

$$\min_g \mathrm{Var}\Big(\sum_{\tau\in T}g_\tau Y_\tau\Big)\quad\text{such that}\quad\sum_{\tau\in T}g_\tau=1. \tag{3}$$

Because of the constraint that the weights $g_\tau$ sum to 1, the linearity of expectation, and the fact that the $(T,P)$-walk on $G$ is stationary, the resulting estimator is an unbiased estimate of $E(Y_\tau)$. Define the covariance matrix $\Sigma\in\mathbb{R}^{n\times n}$ as

$$\Sigma_{\sigma,\tau}=\mathrm{Cov}_{\mathrm{RDS}}(Y_\sigma,Y_\tau), \tag{4}$$

which is assumed to be nonsingular. It can be seen that the solution to Eq. 3 depends upon solving a system of equations involving the covariance matrix; namely, $g^*=(x^T\mathbf{1})^{-1}x^T$, where $\Sigma x=\mathbf{1}$. (Throughout, we use the notation $\mathbf{1}_M$ for the all-one vector of length $M$. We drop the length when clear from context.) If $Y\in\mathbb{R}^n$ is the vector of $Y_\tau$s, then the GLS estimator can be expressed as

$$\hat\mu_{\mathrm{GLS}}=(\mathbf{1}^T\Sigma^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Sigma^{-1}Y. \tag{5}$$

The rest of this section contains our main theoretical results, which study how

$$\mathrm{Var}_{\mathrm{RDS}}(\hat\mu_{\mathrm{GLS}})=(\mathbf{1}^T\Sigma^{-1}\mathbf{1})^{-1} \tag{6}$$

decays with the sample size.
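When $\Sigma$ is known, the GLS estimator and its variance require only one linear solve. The sketch below assumes a nonsingular covariance matrix is supplied as a NumPy array; `gls_estimate` and the small example matrix are illustrative.

```python
import numpy as np

def gls_estimate(Sigma, Y):
    """Return the GLS estimate, its variance, and the weights g* that sum to 1."""
    Sigma = np.asarray(Sigma, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x = np.linalg.solve(Sigma, np.ones(len(Y)))   # solves Sigma x = 1
    total = x.sum()                               # 1^T Sigma^{-1} 1
    g = x / total                                 # g* = x / (x^T 1)
    return g @ Y, 1.0 / total, g

# Covariance for a root (index 0) and its two children under gamma(d) = 0.8**d.
Sigma = np.array([[1.00, 0.80, 0.80],
                  [0.80, 1.00, 0.64],
                  [0.80, 0.64, 1.00]])
mu_hat, var_gls, g = gls_estimate(Sigma, [1.0, 0.0, 1.0])
print(mu_hat, var_gls, g)
```

In this example the root receives weight 1/11 and each child 5/11: the root is correlated with both children, so it carries less independent information than either child.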

Main Result.

In our main result, we assume that T is a complete binary tree with n nodes, but we expect the result to hold for more general tree topologies.

Theorem 1 (Main Result).

Let $\{X_\tau:\tau\in T\}$ be sampled from the $(T,P)$-walk on $G$ for a fixed $N\times N$ transition matrix $P$ that is irreducible and reversible with respect to a stationary distribution $\pi$. If $T$ is a complete binary tree with $n$ nodes, then the variance of the GLS estimator defined in Eq. 5 decays like $O(n^{-1})$ as $n\to\infty$.

The proof, which is contained in SI Appendix, Section S4, does not directly compute the variance of the GLS estimator. Instead, it proceeds by constructing an explicit linear estimator and relies on the variational characterization (Eq. 3) of μ^GLS. We emphasize that computing μ^GLS requires the covariance matrix Σ, which is typically unknown. The next section proposes a technique to estimate Σ that is based upon the SBM. We also point out that the result in Theorem 1 is asymptotic and, as such, is only meaningful for n large enough.

Before moving on to practical estimators, we give a more precise result on the constant in the $O(n^{-1})$ by making further assumptions on the spectral properties of $P$ or of the features $y$. The eigenvectors of the reversible transition matrix $P$, denoted $f_1,\dots,f_N: V\to\mathbb{R}$, are real-valued functions of the nodes $i\in G$ that are orthonormal with respect to the inner product

$$\langle f_a,f_b\rangle_\pi=\sum_{i\in G}f_a(i)f_b(i)\pi_i. \tag{7}$$

(See, e.g., lemma 12.2 of ref. 12.) We take the eigenfunction $f_1$ corresponding to the eigenvalue 1 to be the constant vector $\mathbf{1}$. Define $\beta_\ell=\langle y,f_\ell\rangle_\pi$ for $\ell=1,\dots,N$ and note that $\mu=\beta_1=\sum_i y(i)\pi_i$. Let $\lambda_1,\dots,\lambda_N$ be the eigenvalues of $P$ corresponding to $f_1,\dots,f_N$. For each node $i\in G$, $y$ decomposes as follows:

$$y(i)=\mu+\sum_{\ell=2}^N\langle y,f_\ell\rangle_\pi f_\ell(i)=\sum_{\ell=1}^N\beta_\ell f_\ell(i). \tag{8}$$

Under the $(T,P)$-walk on $G$, the covariance is stationary with autocovariance function

$$\gamma(d)=\sum_{\ell=2}^N\beta_\ell^2\lambda_\ell^d. \tag{9}$$

That is, the covariance matrix has the form $\Sigma_{\sigma,\tau}=\gamma(d(\sigma,\tau))$, where $d(\sigma,\tau)$ is the graph distance (i.e., minimum path length) between $\sigma$ and $\tau$ in $T$ (5).
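The autocovariance function in Eq. 9 can be computed from the eigendecomposition of a reversible chain. The sketch below is one way to do this, assuming $P$ is irreducible and reversible with respect to `pi`; it symmetrizes $P$ so that a symmetric eigensolver can be used. The function name and the example graph are hypothetical.

```python
import numpy as np

def autocovariance(P, pi, y, max_d):
    """gamma(d) = sum over l >= 2 of beta_l^2 * lambda_l^d, for d = 0..max_d."""
    s = np.sqrt(pi)
    S = (s[:, None] * P) / s[None, :]         # D^{1/2} P D^{-1/2}, symmetric if P is reversible
    lam, V = np.linalg.eigh((S + S.T) / 2)    # averaging removes rounding asymmetry
    F = V / s[:, None]                        # columns are f_l, orthonormal under <.,.>_pi
    beta = F.T @ (pi * y)                     # beta_l = <y, f_l>_pi
    keep = np.arange(len(lam)) != np.argmax(lam)   # drop the eigenvalue 1 (constant f_1)
    return np.array([np.sum(beta[keep] ** 2 * lam[keep] ** d) for d in range(max_d + 1)])

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = W.sum(axis=1)
P, pi = W / deg[:, None], deg / deg.sum()
y = np.array([1.0, 0.0, 1.0, 0.0])
gamma = autocovariance(P, pi, y, max_d=5)
print(gamma)
```

As a sanity check, `gamma[0]` equals the variance of $y$ under $\pi$, and `gamma[1]` equals the covariance between a parent and its child.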

When the autocovariance further simplifies to

$$\gamma(d)=\beta^2\lambda^d, \tag{10}$$

for some $\lambda,\beta\in\mathbb{R}$, then we call the $(T,P)$-walk on $G$ with feature $y$ a rank-two model. For instance, if $\mathrm{rank}(P)=2$, then as the name suggests, we have a rank-two model. In particular, all of the results in ref. 8 are for such transition matrices. Fig. 1 also studies such a rank-two model on two groups of people. There are other sufficient conditions for Eq. 10. For instance, if $y(i)=\mu+\beta_\ell f_\ell(i)$ for all nodes $i\in G$, then we have a rank-two model because $\beta_j=\langle y,f_j\rangle_\pi=0$ for $j\notin\{1,\ell\}$.

Theorem 2.

Under a rank-two model,

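$$n\,\mathrm{Var}(\hat\mu_{\mathrm{GLS}})\to\frac{1+\lambda}{1-\lambda}\beta^2\quad\text{as } n\to\infty. \tag{11}$$

This proof follows from the fact that under a rank-two model, $\Sigma^{-1}$ has a closed form expression (see SI Appendix, Eq. S6).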
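The limit in Theorem 2 can be checked numerically for a finite tree. The following sketch builds the rank-two covariance $\beta^2\lambda^{d(\sigma,\tau)}$ on a complete binary tree stored in heap order (an indexing assumption made here for convenience) and compares $n\,\mathrm{Var}(\hat\mu_{\mathrm{GLS}})$ with the stated limit.

```python
import numpy as np

def binary_tree_distances(n):
    """Pairwise graph distances in a binary tree whose nodes 0..n-1 are in heap order."""
    D = np.zeros((n, n), dtype=int)
    for s in range(1, n):
        parent = (s - 1) // 2
        D[s, :s] = D[parent, :s] + 1    # nodes t < s are reached through the parent of s
        D[:s, s] = D[s, :s]
    return D

def scaled_gls_variance(n, lam, beta):
    """n * Var(mu_hat_GLS) under the rank-two autocovariance beta^2 * lam^d."""
    Sigma = beta**2 * lam ** binary_tree_distances(n)
    ones = np.ones(n)
    return n / (ones @ np.linalg.solve(Sigma, ones))

lam, beta = 0.8, 1.0
limit = (1 + lam) / (1 - lam) * beta**2
for n in (7, 63, 511):
    print(n, scaled_gls_variance(n, lam, beta), limit)
```

With $\lambda=0.8$, which lies above the critical threshold for the sample mean on a binary tree, the scaled GLS variance still approaches the finite limit as $n$ grows.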

Using RDS to Estimate the Spectral Properties of the Graph for fGLS

The fGLS estimator depends upon an estimated covariance matrix $\hat\Sigma$ (e.g., see ref. 13):

$$\hat\mu_{\mathrm{GLS}}(\hat\Sigma)=(\mathbf{1}^T\hat\Sigma^{-1}\mathbf{1})^{-1}\mathbf{1}^T\hat\Sigma^{-1}Y. \tag{12}$$

With this notation, observe that $\hat\mu_{\mathrm{GLS}}=\hat\mu_{\mathrm{GLS}}(\Sigma)$.

In our setting, estimating $\Sigma$ is equivalent to estimating $\gamma(\cdot)$. We propose and compare several estimators for $\gamma$. An estimator based upon the DC-SBM is derived in this section. Two additional estimators based upon the rank-two assumption are derived in SI Appendix, Section S3. The first rank-two estimator, $\hat\mu_{\mathrm{auto}}$, relies upon a plug-in estimator for the correlation between $Y_{\sigma'}$ and $Y_\sigma$ (i.e., the autocorrelation at lag 1). The second rank-two estimator, $\hat\mu_\Delta$, relies upon plug-in estimators for the first and second differences, $E(Y_{\sigma'}-Y_\sigma)^2$ and $E(Y_{(\sigma')'}-Y_\sigma)^2$.

Estimating the Spectral Properties of a SBM from an RDS.

The DC-SBM is a generalization of the SBM (14, 15). Both are models for a random network with community structure. As the name suggests, the degree-corrected model allows for degree heterogeneity within the blocks.

Definition (DC-SBM).

Partition the $N$ nodes into $K$ blocks with $z:\{1,2,\dots,N\}\to\{1,2,\dots,K\}$, and assign each node $i$ a value $\theta_i>0$ such that the $\theta$s sum to 1 within each block; that is,

$$\sum_{i:z(i)=u}\theta_i=1,\quad\text{for all } u\in\{1,\dots,K\}. \tag{13}$$

The block membership of node $i$ is $z(i)$, and the parameter $\theta_i$ controls the degree heterogeneity within each block. Let $B$ be a symmetric $K\times K$ matrix such that $B_{ab}\ge 0$ for all $a,b\in\{1,\dots,K\}$. Under the DC-SBM,

$$\mathbb{P}(\{i,j\}\in E)=\theta_i\theta_j B_{z(i),z(j)}$$

for all pairs $i,j=1,2,\dots,N$ and each possible edge is independent.

In much of the previous literature on the DC-SBM, the full network is observed, and we wish to estimate the partition $z$. In this paper, we presume that $z$ is observed on the sampled nodes in the $(T,P)$-walk on $G$, and we wish to estimate the spectral properties of $P$. This is reasonable in RDS because each participant takes a survey that records several salient demographic variables (e.g., gender, race, neighborhood, etc.). In practice, the block labels should be chosen such that they are highly autocorrelated from one referral to the next. Many RDS papers already report such statistics. For example, the original RDS paper (1) presents four empirical transition matrices on four different demographic partitions (i.e., race, gender, drug preference, and location).

The derivations below condition on the block labels $z$; only the graph $G$ is random. Let $A\in\{0,1\}^{N\times N}$ be the (random) adjacency matrix; $A_{ij}=1$ if and only if $(i,j)\in E$. Define $\mathcal{A}\in[0,1]^{N\times N}$ such that $\mathcal{A}_{ij}=E(A_{ij})=\mathbb{P}((i,j)\in E)$. Define $\mathcal{D}\in\mathbb{R}^{N\times N}$ as a diagonal matrix with $(i,i)$-th element $\sum_j\mathcal{A}_{ij}$. Define $\mathcal{P}=\mathcal{D}^{-1}\mathcal{A}$ as a population version of $P$.

The inspiration for the following estimators is based on a population version of the chain and relies on three results. Define the matrix $\hat Q\in\mathbb{R}^{K\times K}$ such that for any two blocks $u,v$,

$$\hat Q_{uv}=\frac{1}{n}\times(\text{number of referrals from block } u \text{ to block } v). \tag{14}$$

Proposition 1 below shows that $\hat Q$ is an estimator of $B$ under a $(T,\mathcal{P})$-walk on $G$. Then, Proposition 2 shows that a normalized version of $B$ has spectral properties that match the spectral properties of $\mathcal{P}$. Finally, under the DC-SBM, if the smallest expected degree is growing fast enough, then $P$ converges to $\mathcal{P}$ in spectral norm (e.g., see ref. 16). So estimates of the spectral properties of $\mathcal{P}$ are similar to the spectral properties of $P$. With these facts in mind, we propose estimating the spectral properties of $P$ with the spectral properties of a normalized version of $\hat Q$. We let $Z\in\{0,1\}^{N\times K}$ be such that $Z_{ij}=1$ if and only if $z(i)=j$.

Proposition 1.

If $\mathcal{P}$ is constructed from the DC-SBM and if $\hat Q$ is computed via a sample from the $(T,\mathcal{P})$-walk on $G$, then

$$E(\hat Q)=B/m,$$

where $m=\mathbf{1}^T B\mathbf{1}$.
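The identity behind Proposition 1 can be checked at the population level: under the stationary chain on the expected adjacency matrix, the probability that a referral goes from block $u$ to block $v$ is $B_{uv}/m$. The sketch below uses made-up block sizes, $\theta$ values, and $B$; it keeps the diagonal terms $i=j$ of the expected adjacency matrix, which is an assumption of this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 5, 6]                        # hypothetical block sizes
K = len(sizes)
z = np.repeat(np.arange(K), sizes)       # block label z(i) of each node
theta = rng.uniform(0.5, 1.5, size=z.size)
for u in range(K):
    theta[z == u] /= theta[z == u].sum() # thetas sum to 1 within each block (Eq. 13)
B = np.array([[3.0, 0.5, 0.2],
              [0.5, 2.0, 0.4],
              [0.2, 0.4, 2.5]])

A_pop = np.outer(theta, theta) * B[np.ix_(z, z)]   # expected adjacency matrix
deg = A_pop.sum(axis=1)
P_pop = A_pop / deg[:, None]                       # population transition matrix
pi_star = deg / deg.sum()                          # its stationary distribution

Z = np.eye(K)[z]                                   # N x K block membership matrix
joint = pi_star[:, None] * P_pop                   # P(parent = i, child = j) at stationarity
Q_pop = Z.T @ joint @ Z                            # block-to-block referral probabilities
m = B.sum()                                        # m = 1^T B 1
print(np.allclose(Q_pop, B / m))
```

The check relies only on the normalization in Eq. 13, which makes the block-aggregated expected adjacency matrix equal to $B$.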

Proposition 2.

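Define $D_B\in\mathbb{R}^{K\times K}$ to be a diagonal matrix that contains the row sums of $B\in\mathbb{R}^{K\times K}$ (that is, $D_B=\mathrm{diag}(B\mathbf{1}_K)$) and define $B_L=D_B^{-1/2}BD_B^{-1/2}$. Define $U$ and $\Lambda$ via the eigendecomposition, $B_L=U\Lambda U^T$. Define $\beta_\ell^*=\langle y,f_\ell^*\rangle_{\pi^*}$, where $\pi^*$ is the stationary distribution of $\mathcal{P}$. Then, (i) the nonzero eigenvalues of $B_L$ are identical to the nonzero eigenvalues of $\mathcal{P}$; (ii) the columns of

$$f^*=\sqrt{m}\,ZD_B^{-1/2}U \tag{15}$$

are eigenvectors of $\mathcal{P}$; and (iii) if $X$ is sampled from $\pi^*$, then

$$\beta_\ell^*=E(y(X)f_\ell^*(X)),\quad\text{for }\ell\le K. \tag{16}$$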
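Parts (i) and (ii) of Proposition 2 can be verified numerically on a small example. The sketch below uses hypothetical block sizes, $\theta$ values, and a positive definite $B$ (so that all $K$ eigenvalues of $B_L$ are nonzero), and it also checks that the columns of $f^*$ are orthonormal with respect to $\pi^*$.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [4, 5, 6]                                  # hypothetical block sizes
K = len(sizes)
z = np.repeat(np.arange(K), sizes)
theta = rng.uniform(0.5, 1.5, size=z.size)
for u in range(K):
    theta[z == u] /= theta[z == u].sum()
B = np.array([[3.0, 0.5, 0.2],
              [0.5, 2.0, 0.4],
              [0.2, 0.4, 2.5]])

A_pop = np.outer(theta, theta) * B[np.ix_(z, z)]   # expected adjacency matrix
deg = A_pop.sum(axis=1)
P_pop = A_pop / deg[:, None]
pi_star = deg / deg.sum()
Z = np.eye(K)[z]
m = B.sum()

d_B = B.sum(axis=1)
B_L = B / np.sqrt(np.outer(d_B, d_B))              # D_B^{-1/2} B D_B^{-1/2}
Lam, U = np.linalg.eigh(B_L)
f_star = np.sqrt(m) * Z @ (U / np.sqrt(d_B)[:, None])   # Eq. 15

# (i) P_pop is similar to the symmetric matrix D^{-1/2} A D^{-1/2}; compare its K largest
# eigenvalues in absolute value (the remaining N - K are numerically zero) with those of B_L.
ev = np.linalg.eigvalsh(A_pop / np.sqrt(np.outer(deg, deg)))
top = np.sort(ev[np.argsort(-np.abs(ev))[:K]])
print(np.allclose(top, Lam))
# (ii) each column of f_star is an eigenvector of P_pop with the matching eigenvalue.
print(np.allclose(P_pop @ f_star, f_star * Lam))
# Orthonormality in the inner product weighted by pi_star.
print(np.allclose(f_star.T @ (pi_star[:, None] * f_star), np.eye(K)))
```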

The proofs of the propositions are given in SI Appendix, Section S5. We now introduce our estimator of Σ and μ.

SBM-fGLS.

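Using $\tilde z: T\to\{1,\dots,K\}$ as an observed partition of the nodes (e.g., by demographic characteristics), the SBM estimator of $\Sigma$ is computed with the following steps. Each step uses a plug-in estimator using the previously derived formulas. After the statement of the algorithm, the steps are matched to the motivating equation.

For notational convenience, denote $Y_\tau$, $\tilde z(\tau)$, and $\deg(\tau)$ as $y(X_\tau)$, $\tilde z(X_\tau)$, and $\deg(X_\tau)$ for each sampled individual $X_\tau$. Moreover, suppose a one-to-one mapping between the node set of $T$ and $\{1,\dots,n\}$:

  • i) Compute $\hat Q$ via Eq. 14 using the block memberships $\tilde z(\tau)$. Define $\hat Q^{(S)}=(\hat Q+\hat Q^T)/2$. This symmetrization ensures the eigenvalues are real-valued.

  • ii) Row and column normalize $\hat Q^{(S)}$, as $\hat Q_L=D_{\hat Q}^{-1/2}\hat Q^{(S)}D_{\hat Q}^{-1/2}$, where $D_{\hat Q}=\mathrm{diag}(\hat Q\mathbf{1}_K)\in\mathbb{R}^{K\times K}$.

  • iii) Take an eigendecomposition of

$$\hat Q_L=\hat U\hat\Lambda\hat U^T. \tag{17}$$

  • iv) Compute $\hat f=\hat ZD_{\hat Q}^{-1/2}\hat U$, where $\hat Z\in\{0,1\}^{n\times K}$ contains $\hat Z_{ij}=1$ if $\tilde z(i)=j$.

  • v) For $\ell=1,\dots,K$, compute $\hat\beta_\ell=\frac{1}{n}\sum_\tau Y_\tau\hat f_\ell(\tau)$, where $\hat f_\ell(\tau)$ is the $(\ell,\tau)$ element of $\hat f$.

  • vi) Compute an estimate of the autocovariance function as

$$\hat\gamma_{\mathrm{SBM}}(d)=\sum_{\ell=1}^K\hat\beta_\ell^2\hat\Lambda_\ell^d.$$

  • vii) Define $\hat s^2$ to be the sample variance of the $Y_\tau$. For $\sigma,\tau\in T$,

$$\hat\Sigma_{\sigma,\tau}^{\mathrm{SBM}}=\begin{cases}\hat\gamma_{\mathrm{SBM}}(d(\sigma,\tau)) & \text{if }\sigma\ne\tau,\\ \hat\gamma_{\mathrm{SBM}}(0)+\hat s^2 & \text{if }\sigma=\tau,\end{cases}$$

    where $\hat s^2$ provides for Tikhonov regularization in $(\hat\Sigma^{\mathrm{SBM}})^{-1}$.

  • viii) Define $\hat g\in\mathbb{R}^n$ to solve the system of equations $\hat\Sigma^{\mathrm{SBM}}\hat g=\mathbf{1}$.

  • ix) Estimate $E(Y_\tau)$ with $\sum_{\tau\in T}\hat g_\tau Y_\tau/\sum_{\tau\in T}\hat g_\tau$.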

Step i comes from Proposition 1. Steps iv and v come from Eqs. 15 and 16 in Proposition 2. Step vi comes from Eq. 9. In all of the plug-in formulas, it is unnecessary to estimate $m$ because we must only specify $\hat\Sigma$ up to a constant of proportionality; this constant appears in both the numerator and denominator of $\hat\mu_{\mathrm{fGLS}}$ in step ix.
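Steps i through ix can be sketched as a single function. This is an illustrative reading of the algorithm, not the authors' implementation. It assumes that tree nodes are numbered so that every parent index is smaller than its child's (root at index 0), that block labels are integers 0..K-1, and that every block has made at least one referral. Clipping the estimated eigenvalues to [-1, 1] is a safeguard added here and is not part of the stated steps.

```python
import numpy as np

def tree_distances(parents):
    """Pairwise graph distances in a rooted tree with parents[s] < s for all s >= 1."""
    n = len(parents)
    D = np.zeros((n, n), dtype=int)
    for s in range(1, n):
        D[s, :s] = D[parents[s], :s] + 1
        D[:s, s] = D[s, :s]
    return D

def sbm_fgls(Y, blocks, parents, K):
    Y = np.asarray(Y, dtype=float)
    blocks = np.asarray(blocks)
    n = len(Y)
    # i) block-to-block referral frequencies (Eq. 14), then symmetrize
    Q = np.zeros((K, K))
    for s in range(1, n):
        Q[blocks[parents[s]], blocks[s]] += 1.0 / n
    Q_S = (Q + Q.T) / 2
    # ii) row and column normalization with D_Q = diag(Q 1_K)
    d_inv_sqrt = 1.0 / np.sqrt(Q.sum(axis=1))
    Q_L = d_inv_sqrt[:, None] * Q_S * d_inv_sqrt[None, :]
    # iii) eigendecomposition (Eq. 17)
    lam, U = np.linalg.eigh(Q_L)
    lam = np.clip(lam, -1.0, 1.0)        # safeguard added in this sketch, not in the source
    # iv) estimated eigenfunctions evaluated at each sampled node (n x K)
    Z = np.eye(K)[blocks]
    f_hat = Z @ (d_inv_sqrt[:, None] * U)
    # v) plug-in coefficients
    beta = f_hat.T @ Y / n
    # vi) estimated autocovariance at every distance that occurs in the tree
    D = tree_distances(parents)
    gamma = np.array([np.sum(beta**2 * lam**d) for d in range(D.max() + 1)])
    # vii) covariance matrix with the sample variance added on the diagonal
    Sigma = gamma[D] + np.var(Y, ddof=1) * np.eye(n)
    # viii) and ix) solve for the weights and form the normalized weighted average
    g = np.linalg.solve(Sigma, np.ones(n))
    return (g @ Y) / g.sum(), Sigma

# A hand-built 7-node referral tree with two blocks; the outcome is the block indicator.
parents = np.array([-1, 0, 0, 1, 1, 2, 2])
blocks = np.array([0, 0, 1, 0, 1, 1, 0])
estimate, Sigma_hat = sbm_fgls(blocks.astype(float), blocks, parents, K=2)
print(estimate)
```

Because the weights are normalized in step ix, rescaling `Sigma_hat` by any positive constant leaves the estimate unchanged, which is why $m$ never needs to be estimated.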

Simulations

This section compares the SBM-fGLS estimator to the VH estimator via simulation. Each simulated sample is collected by tracing contacts in social graphs collected in the National Longitudinal Study of Adolescent Health (Add Health). In the 1994–95 school year, the Add Health study collected a nationally representative sample of adolescents in grades 7–12. The sample covers 84 pairs of middle and high schools in which students nominated up to five male and five female friends in their middle/high school network (17). In this analysis, all graphs are restricted to the largest connected component. SI Appendix, Section S1 performs a similar simulation on the Colorado Springs Project 90 network (18). These networks were previously studied in refs. 4 and 19. The simulation was performed without replacement on the directed edges; both of these settings are different from the model used in the theoretical results. Details of the simulation settings are given in SI Appendix.

Fig. 2 shows the RMSE for fGLS and VH estimators; $\mathrm{RMSE}=\sqrt{E(\mu-\hat\mu)^2}$. Overall, the SBM-fGLS estimators have a smaller RMSE. Each panel in Fig. 2 has one line with an asterisk. These lines correspond to the same school, which has both (i) a referral bottleneck between the white and black populations and (ii) a referral bottleneck between the high school and middle school. None of the fGLS estimators model both bottlenecks, yet they perform well.

Fig. 2.

Reduction in RMSE for SBM-fGLS vs. VH Estimators. These figures present the root mean squared error (RMSE) for the SBM-fGLS estimator and the VH estimator. Each panel corresponds to a different outcome y. In each panel, the horizontal axis corresponds to RMSE, and the vertical axis corresponds to different schools, ordered by RMSE of the VH estimator. Each line connects the RMSE for the SBM-fGLS to the RMSE of the VH estimator. If the line is red, then SBM-fGLS has a smaller RMSE.

The most difficult quantity to estimate, high school, also has one of the largest absolute reductions in RMSE. This is consistent with the broader pattern of the experiments and the theory: The VH estimator can have excessive error on quantities that are aligned with the community structures in the network that create referral bottlenecks, and our results suggest that fGLS can reduce the error in such cases. However, we must be careful when extrapolating either the frequency or magnitude of the fGLS improvement. These are highly empirical quantities. Moreover, the social networks that are available to perform simulation experiments (Add Health and P90) are not necessarily representative of the typical RDS population. More discussion of the idiosyncrasies of these networks is given in SI Appendix, Section S1.

Fig. 3 presents a diagnostic plot to evaluate the fGLS estimators using only data that are observed in a single sample. This diagnostic plot was created from the first simulated sample taken on the school that has the asterisk in Fig. 2.

Fig. 3.

Diagnostic plots. Each of these diagnostic plots is created from a single sample on the school with the asterisk in Fig. 2. We should prefer the fGLS estimators that have a smaller ratio of standard errors (RSE) as defined in the text and displayed on the vertical axis. The y corresponds to the SBM-fGLS estimator that constructs the blocks from the outcome variable of interest. For the race and ethnicity outcomes, z corresponds to the SBM-fGLS estimator that constructs the blocks with all races and ethnicities observed in the sample. In each plot, there are $(K-1)$-many zs because SBM-fGLS estimates $K-1$ eigenvalues; each of these $K-1$ points has the same value on the vertical axis. For completeness, this plot includes the rank-two estimators $\hat\mu_{\mathrm{auto}}$ and $\hat\mu_\Delta$ that are developed in SI Appendix, Section S4. Under the rank-two model, the RSE is completely determined by the estimated eigenvalue; this is the gray line.

The horizontal axis in Fig. 3 gives eigenvalue(s) of P estimated by the fGLS technique. The vertical axis gives the plug-in estimate for the RSE:

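$$\mathrm{RSE}(\hat\Sigma)=\sqrt{\frac{\widehat{\mathrm{Var}}(\hat\mu_{\mathrm{GLS}}(\hat\Sigma))}{\widehat{\mathrm{Var}}(\hat\mu)}}=\sqrt{\frac{(\mathbf{1}^T\hat\Sigma^{-1}\mathbf{1})^{-1}}{n^{-2}\,\mathbf{1}^T\hat\Sigma\mathbf{1}}}.$$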

We should prefer the fGLS estimators that have a smaller ratio. As is justified in more detail in SI Appendix, Section S6, estimators with smaller RSE make reductions in the variance by taking advantage of the dependencies. Notice how the fGLS estimators have smaller ratios for the outcomes of black, white, and high school. For these outcomes, fGLS significantly reduces the RMSE in Fig. 2. The diagnostic plot fails to identify the reduction in RMSE for the outcome male. For Asian, Hispanic, and male, the ratio of SEs is closer to 1.
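The plug-in RSE can be computed from any estimated covariance matrix. The sketch below assumes the variance of the unweighted sample average under $\hat\Sigma$ is $n^{-2}\mathbf{1}^T\hat\Sigma\mathbf{1}$, so that uncorrelated samples give a ratio of exactly 1; the function name `rse` is illustrative.

```python
import numpy as np

def rse(Sigma_hat):
    """Plug-in ratio of standard errors: GLS estimator vs. the unweighted sample average."""
    Sigma_hat = np.asarray(Sigma_hat, dtype=float)
    n = Sigma_hat.shape[0]
    ones = np.ones(n)
    var_gls = 1.0 / (ones @ np.linalg.solve(Sigma_hat, ones))   # (1^T Sigma^{-1} 1)^{-1}
    var_mean = ones @ Sigma_hat @ ones / n**2                   # variance of the sample average
    return np.sqrt(var_gls / var_mean)

# Independent samples: no dependence to exploit, so the ratio is 1.
print(rse(np.eye(5)))
# A root and its two children with autocovariance 0.8**d: the ratio drops slightly below 1.
Sigma = np.array([[1.00, 0.80, 0.80],
                  [0.80, 1.00, 0.64],
                  [0.80, 0.64, 1.00]])
print(rse(Sigma))
```

For any positive definite input the ratio cannot exceed 1, because the sample average is one of the weighted averages over which GLS minimizes the variance.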

Summary

This paper derives and studies GLS and fGLS estimators that account for the covariance between samples in an RDS. Under the Markov model where the covariance between samples is known, Theorems 1 and 2 show that the variance of the GLS estimator decays like $O(n^{-1})$. To estimate the covariance between samples, we use the fact that the covariance between adjacent samples can be exactly specified in terms of the spectral properties of the Markov transition matrix (5, 20–24). These essential spectral properties of the network can be estimated from the observed data under the DC-SBM and the rank-two model.

Simulations on the Add Health networks show that the fGLS estimates typically have smaller RMSEs than VH estimates. This simulation is performed under a more realistic model than the models used in the technical results (Theorems 1 and 2 and Propositions 1 and 2). First, the RDS is simulated on social graphs that were recorded in the Add Health study (neither rank-two nor simulated from the DC-SBM). Second, the sampling is without replacement. Third, the edges have not been symmetrized. Despite these departures from the reversible Markov model in the technical results, the estimators appear to still perform well. This finding is empirical, and given that these networks are not necessarily representative of the typical RDS population, we must be careful when extrapolating this intuition to other scenarios.

The diagnostic plots in Fig. 3 help to determine whether the outcome of interest is correlated in the observed sample. For quantities that are correlated (e.g., race, ethnicity, and school), Fig. 2 shows that fGLS estimates significantly reduce the RMSE.

SI Appendix, Sections S2 and S3 present two additional simulations to investigate the role of (i) sample size, (ii) referral rates, (iii) alignment of the outcome y with the blocks z, and (iv) preferential recruitment. In those simulations, when the outcome of interest correlates or aligns with the underlying structure of the graph and the referral rate is larger than the critical threshold identified in ref. 5, fGLS estimators can appreciably reduce the variability over previous estimators. In some simulations, the fGLS estimators have a smaller RMSE with 500 samples than the VH estimators have with 1,000 samples. While the fGLS estimators are derived under a Markov model, all simulations were performed under a without-replacement (i.e., non-Markovian) model.

Under the Markov model in Theorem 1 and under the simulations on the networks to which we have access, our results suggest effective ways (i) to diagnose strong dependence between samples and (ii) to alleviate such dependence. However, we must be careful in extrapolating specific values from the simulations (e.g., the amount that fGLS reduces the RMSE). The Add Health and P90 networks that are available to perform simulation experiments are not necessarily representative of the typical RDS population. The RMSE of the VH estimator and magnitude of the reduction in RMSE from fGLS are two highly empirical quantities that change between networks and outcomes.

Supplementary Material

Supplementary File
pnas.1706699115.sapp.pdf (772.9KB, pdf)

Acknowledgments

S.R. is supported by NSF Grants DMS-1149312 (CAREER), DMS-1614242, and CCF-1740707 (TRIPODS). K.R. is supported by NSF Grant DMS-1612456 and Army Research Office Grant W911NF-15-1-0423.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1706699115/-/DCSupplemental.

References

  • 1. Heckathorn DD. Respondent-driven sampling: A new approach to the study of hidden populations. Soc Probl. 1997;44:174–199.
  • 2. World Health Organization, Regional Office for the Eastern Mediterranean (2013) Introduction to HIV/AIDS and sexually transmitted infection surveillance: Module 4: Introduction to respondent-driven sampling. Available at www.who.int/iris/handle/10665/116864. Accessed September 13, 2018.
  • 3. White RG, et al. Strengthening the reporting of observational studies in epidemiology for respondent-driven sampling studies: “strobe-rds” statement. J Clin Epidemiol. 2015;68:1463–1471. doi: 10.1016/j.jclinepi.2015.04.002.
  • 4. Goel S, Salganik MJ. Assessing respondent-driven sampling. Proc Natl Acad Sci USA. 2010;107:6743–6747. doi: 10.1073/pnas.1000261107.
  • 5. Rohe K. 2015. Network driven sampling; a critical threshold for design effects. arXiv:1505.05461.
  • 6. Volz E, Heckathorn DD. Probability based estimation theory for respondent driven sampling. J Off Stat. 2008;24:79–97.
  • 7. Gile KJ. Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. J Am Stat Assoc. 2011;106:135–146.
  • 8. Goel S, Salganik MJ. Respondent-driven sampling as Markov chain Monte Carlo. Stat Med. 2009;28:2202–2229. doi: 10.1002/sim.3613.
  • 9. Benjamini I, Peres Y. Markov chains indexed by trees. Ann Probab. 1994;22:219–243.
  • 10. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685.
  • 11. Aitken AC. IV.—On least squares and linear combination of observations. Proc R Soc Edinburgh. 1936;55:42–48.
  • 12. Levin DA, Peres Y, Wilmer EL. Markov chains and mixing times. American Mathematical Society; Providence, RI: 2009.
  • 13. Amemiya T. Advanced Econometrics. Harvard Univ Press; Cambridge, MA: 1985.
  • 14. Holland P, Laskey K, Leinhardt S. Stochastic blockmodels: First steps. Social Netw. 1983;5:109–137.
  • 15. Karrer B, Newman M. Stochastic blockmodels and community structure in networks. Phys Rev E. 2011;83:016107. doi: 10.1103/PhysRevE.83.016107.
  • 16. Chung F, Radcliffe M. On the spectra of general random graphs. Electron J Combinatorics. 2011;18:1–14.
  • 17. Harris KM, et al. 2009. The national longitudinal study of adolescent health: Research design. Available at www.cpc.unc.edu/projects/addhealth/design. Accessed September 13, 2018.
  • 18. Klovdahl AS, et al. Social networks and infectious disease: The Colorado springs study. Soc Sci Med. 1994;38:79–88. doi: 10.1016/0277-9536(94)90302-6.
  • 19. Baraff AJ, McCormick TH, Raftery AE. Estimating uncertainty in respondent-driven sampling using a tree bootstrap method. Proc Natl Acad Sci USA. 2016;113:14668–14673. doi: 10.1073/pnas.1617258113.
  • 20. Verdery AM, Mouw T, Bauldry S, Mucha PJ. 2013. Network structure and biased variance estimation in respondent driven sampling. arXiv:1309.5109.
  • 21. Khabbazian M, Hanlon B, Russek Z, Rohe K. Novel sampling design for respondent-driven sampling. Electron J Stat. 2017;11:4769–4812.
  • 22. Qin T, Rohe K. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems. Curran Associates; Red Hook, NY: 2013. pp. 3120–3128.
  • 23. Li X, Rohe K. 2015. Central limit theorems for network driven sampling. arXiv:1509.04704.
  • 24. Durrett R. Random Graph Dynamics. Vol 200. Cambridge Univ Press; Cambridge, UK: 2007.
