Generalized least squares can overcome the critical threshold in respondent-driven sampling

Sebastien Roch; Karl Rohe

doi:10.1073/pnas.1706699115

2018 Sep 25;115(41):10299–10304. doi: 10.1073/pnas.1706699115

PMCID: PMC6187121 PMID: 30254152

Significance

Respondent-driven sampling (RDS) is a popular technique to sample marginalized or hard-to-reach populations, where participants can refer multiple contacts into the sample. Using the sampled participants, we wish to estimate properties of the population, often the proportion of individuals that are HIV+. Because contacts often share the same HIV status, adjacent samples are dependent. As a result, RDS can lead to highly variable estimates of HIV prevalence. This paper studies an estimation technique for HIV prevalence that is based upon the classical idea of generalized least squares.

Keywords: snowball sampling, link-tracing sampling, spectral gap

Abstract

To sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like $O (n^{- 1})$ , where $n$ is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is $O (n^{- 1})$ . We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from a random walk sample of the nodes. These theoretical results point the way to entirely different classes of estimators that account for the network structure beyond node degree. Diagnostic plots help to identify situations where feasible GLS estimators are more appropriate. The computational experiments show the potential benefits and also indicate that there is room to further develop these estimators in practical settings.

Respondent-driven sampling (RDS) is a popular network-based approach to sample marginalized and/or hard-to-reach populations (1). RDS has become particularly popular in HIV research because the populations most at risk for HIV (e.g., people who inject drugs, female sex workers, and men who have sex with men) cannot be sampled using conventional techniques. Several domestic and international institutions use RDS to quantify the prevalence of HIV in at-risk populations, including the Centers for Disease Control (CDC), the World Health Organization (WHO), and the Joint United Nations Program on HIV/AIDS (UNAIDS) (2). The most recent review of the literature in 2015 counted over 460 different RDS studies, in 69 different countries (3).

Because RDS collects samples from link-tracing the relationships in a social network, adjacent samples are dependent. In a simulation study, ref. 4 showed how this can lead to highly variable estimates. Under independent sampling, the variance of standard estimators decays like $O (n^{- 1})$ . This implies that a sample size of $4 n$ will have a 50% smaller SE than a sample of size $n$ . However, this does not necessarily hold for RDS. Under a Markov model, ref. 5 showed how the dependence induced by RDS can drastically inflate the variance of traditional estimators, making it decay at a rate slower than $O (n^{- 1})$ . This implies that reducing the sampling error by 50% can require far more than 4 times as many samples. This means that confidence intervals are much wider than under independent sampling. Using the covariance function derived in ref. 5, this paper studies the generalized least squares (GLS) estimator for RDS. Our theoretical analysis establishes that the variance of the GLS estimator is $O (n^{- 1})$ . We then derive a feasible GLS (fGLS) estimator based upon the Degree Corrected Stochastic Blockmodel (DC-SBM). Two alternative estimators are derived. These estimators first construct estimates about the spectral properties of the population social graph, which might be of independent interest. Our fGLS estimators easily accommodate any preliminary reweighting of the data to adjust for the sampling biases that occur in RDS (e.g., refs. 6 and 7). We study these estimators with simulations and propose a simple diagnostic plot to compare the different fGLS estimators.

A Simple Motivating Example

Fig. 1 uses a model studied in ref. 8. In this example, the population that we wish to sample is equally divided into two groups: HIV+ and HIV–. The seed participant is selected uniformly at random. Starting with the seed participant, every participant refers two additional participants (as in a complete binary tree). The participant refers a person that matches his or her own HIV status with probability $p$ and refers a person with the opposite status with probability $1 - p$ . Each referral is independent. Using this sample, we wish to estimate the proportion of the population that is HIV+ (i.e., 50%). Fig. 1 compares two estimators: (i) the sample proportion and (ii) the GLS estimator proposed in this paper.

Under this sampling with replacement model, the variances of both the sample proportion and the GLS estimator have closed-form solutions (see ref. 8 and the proof of Theorem 2 in SI Appendix, Section S4). Fig. 1 gives the ratio of these formulas as a function of the sample size $n$. There are three lines, corresponding to $p = 0.6$, $p = 0.75$, and $p = 0.9$. In all cases, the lines are less than 1, indicating that the GLS estimator has a smaller variance than the sample proportion. Under this simulation model, if $p > 0.86$, then the variance of the sample proportion decays slower than $O(n^{-1})$ (5, 8). As Theorems 1 and 2 below show, the variance of the GLS estimator converges to 0 like $O(n^{-1})$. So, as $n$ increases, the bottom line converges to 0. The other two lines, on the other hand, do not converge to 0.
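The variance ratio described above can be reproduced numerically. The sketch below is illustrative only: it assumes the tree nodes are stored with heap indexing (a convention chosen here, not taken from the paper), and it assumes that for an HIV indicator in the symmetric two-group model the covariance at tree distance $d$ is $\tfrac{1}{4}(2p - 1)^{d}$. The function names are hypothetical.

```python
import numpy as np

def binary_tree_distances(n):
    """Pairwise graph distances in a complete binary tree with heap indexing
    (node 0 is the seed; the parent of node k > 0 is (k - 1) // 2)."""
    depth = np.zeros(n, dtype=int)
    for k in range(1, n):
        depth[k] = depth[(k - 1) // 2] + 1
    dist = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in range(a + 1, n):
            u, v = a, b
            while u != v:  # climb from the deeper node until the two paths meet
                if depth[u] >= depth[v]:
                    u = (u - 1) // 2
                else:
                    v = (v - 1) // 2
            dist[a, b] = dist[b, a] = depth[a] + depth[b] - 2 * depth[u]
    return dist

def variance_ratio(p, n):
    """Var(GLS) / Var(sample proportion) for the two-group referral model."""
    lam = 2.0 * p - 1.0  # assumed second eigenvalue of the two-group chain
    sigma = 0.25 * lam ** binary_tree_distances(n)  # assumed covariance matrix
    ones = np.ones(n)
    var_gls = 1.0 / (ones @ np.linalg.solve(sigma, ones))
    var_mean = ones @ sigma @ ones / n**2
    return var_gls / var_mean

for p in (0.6, 0.75, 0.9):
    print(p, [round(variance_ratio(p, n), 3) for n in (15, 63, 255)])
```

Under these assumptions the ratio stays below 1 for every $p$, and for $p = 0.9$ it keeps shrinking as $n$ grows.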

Preliminaries

The Markov model used in this paper is a straightforward combination of the Markov models developed in the RDS literature (e.g., refs. 1, 6, and 8).

The social network, $G = (V, E)$, consists of the node set $V = \{1, \dots, N\}$ and the edge set $E = \{(i, j) : i \text{ and } j \text{ can refer each other}\}$. To simplify the notation, $i \in G$ is used synonymously with $i \in V$. Unless otherwise noted, everything below also applies to weighted graphs. Let $w_{ij}$ be the weight of edge $(i, j) \in E$, which models preferential recruitment as described in SI Appendix, Section S1. If $(i, j) \notin E$, define $w_{ij} = 0$. If the graph is unweighted, then let $w_{ij} = 1$ for all $(i, j) \in E$. Throughout this paper, the graph is undirected; that is, $w_{ij} = w_{ji}$ for all pairs $i, j$. Define the degree of node $i$ as $\deg(i) = \sum_{j} w_{ij}$. For each node $i \in G$, let $y(i) \in \mathbb{R}$ denote some characteristic of this node (e.g., the indicator of HIV status). We wish to estimate the population average:

$$μ_{\mathrm{true}} = \frac{1}{N} \sum_{i \in G} y(i). \tag{1}$$

We assume that the nodes are sampled with a Markov process that is indexed by a rooted tree $T$ (i.e., a connected graph with $n$ nodes, no cycles, and a vertex 0). The seed participant is vertex 0 in $T$. To simplify the notation, $σ \in T$ is used synonymously with $σ$ belonging to the vertex set of $T$. For any node in the tree $σ \in T$, denote $σ' \in T$ as the parent of $σ$ (the node one step closer to the root). Define the matrix $P \in \mathbb{R}^{N \times N}$ as

$$P_{ij} = \frac{w_{ij}}{\deg(i)}. \tag{2}$$

Because the graph is undirected, $P$ is a reversible Markov transition matrix with a stationary distribution $π : G \to \mathbb{R}$. Our sample is the set of random nodes

$$\{X_{σ} \in G : σ \in T\},$$

where $X_{0}$ is initialized with $π$ and each transition $X_{σ'} \to X_{σ}$ is independent with

$$\mathbb{P}(X_{σ} = j \mid X_{σ'} = i) = P_{ij}, \quad \text{for } i, j \in G.$$

Observe that $T$ and $G$ are distinct graphs: The nodes in $T$ index the Markov process, while the nodes in $G$ are its state space. Following ref. 9, we refer to this stochastic process as a $(T, P)$-walk on $G$.

When the $X_{τ}$s sample the target population in $G$, we observe

$$Y_{τ} = y(X_{τ}) \quad \text{for } τ \in T.$$
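To make the sampling model concrete, the following sketch draws one $(T, P)$-walk on a small weighted graph. It is an illustration under stated assumptions: $T$ is taken to be a complete binary tree stored with heap indexing, and the function name `sample_tree_walk` is hypothetical.

```python
import numpy as np

def sample_tree_walk(W, n, rng):
    """Sample a (T, P)-walk on G, with T a complete binary tree of n nodes.

    W is a symmetric, nonnegative weight matrix with positive row sums.
    Tree nodes use heap indexing: node 0 is the seed and the parent of
    node k > 0 is (k - 1) // 2. Returns X, where X[k] is the graph node
    visited by tree node k.
    """
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    P = W / deg[:, None]        # P_ij = w_ij / deg(i), as in Eq. 2
    pi = deg / deg.sum()        # stationary distribution of the reversible walk
    N = W.shape[0]
    X = np.empty(n, dtype=int)
    X[0] = rng.choice(N, p=pi)  # the seed is drawn from pi
    for k in range(1, n):
        parent = (k - 1) // 2
        X[k] = rng.choice(N, p=P[X[parent]])  # one referral step from the parent
    return X

rng = np.random.default_rng(0)
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 1.0, 0.0])
X = sample_tree_walk(W, 15, rng)
Y = y[X]                        # observed outcomes Y_tau = y(X_tau)
```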

Under the stationary $(T, P)$-walk on $G$, the sample average of the $Y_{τ}$s is an estimate of

$$μ = E(Y_{0}) = \sum_{i} y(i) π_{i}.$$

In general, $μ \neq μ_{\mathrm{true}}$ (where $μ_{\mathrm{true}}$ was defined in Eq. 1), and the sample average must be adjusted with sampling weights to obtain an unbiased estimator of $μ_{\mathrm{true}}$. Define

$$y^{π}(i) = \frac{y(i)}{π_{i} N}, \quad Y_{τ}^{π} = y^{π}(X_{τ}).$$

The sample average of the $Y_{τ}^{π}$s is the inverse probability weighted (IPW) estimator; it is an unbiased estimator of $μ_{\mathrm{true}}$ (10). However, the weights $π_{i} N$ are unknown and must be estimated with additional information, as we describe next.

Under the $(T, P)$-walk on $G$,

$$N π_{i} = \frac{N \deg(i)}{\sum_{j} \deg(j)} = \frac{N \deg(i)}{N \bar{d}} = \frac{\deg(i)}{\bar{d}},$$

where $\bar{d} = N^{-1} \sum_{j} \deg(j)$. The popular Volz–Heckathorn (VH) estimator replaces $\bar{d}$ with the harmonic mean of those degrees (6). Recall that $T$ has $n$ nodes and define

$$H_{\deg}^{-1} = \frac{1}{n} \sum_{τ \in T} \frac{1}{\deg(X_{τ})}, \quad \hat{π}_{i} = H_{\deg}^{-1} \deg(i), \quad y^{\hat{π}}(i) = \frac{y(i)}{\hat{π}_{i}},$$

and $Y_{τ}^{\hat{π}} = y^{\hat{π}}(X_{τ})$. The VH estimator is the sample average of the $Y_{τ}^{\hat{π}}$s, and it is an asymptotically unbiased estimator of $μ_{\mathrm{true}}$ under the $(T, P)$-walk on $G$. [In practice, $\deg(i)$ is estimated by asking participants how many contacts they have. Recall that $\deg(i) = \sum_{j} w_{ij}$. If the graph is weighted, then the $(T, P)$-walk on $G$ exhibits preferential recruitment (as discussed in SI Appendix, Section S1) and the number of contacts will not necessarily align with $\deg$, making the estimator biased.]
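The VH estimator defined above amounts to a few lines of arithmetic. The sketch below follows the definitions of $H_{\deg}^{-1}$ and $\hat{π}_{i}$ directly; the function name and the toy inputs are hypothetical, and reported degrees are assumed to be positive.

```python
import numpy as np

def volz_heckathorn(y_sample, deg_sample):
    """VH estimate from sampled outcomes and (self-reported) degrees."""
    y_sample = np.asarray(y_sample, dtype=float)
    deg_sample = np.asarray(deg_sample, dtype=float)
    h_inv = np.mean(1.0 / deg_sample)   # H_deg^{-1}: mean of inverse degrees
    pi_hat = h_inv * deg_sample         # estimated weights, standing in for N * pi_i
    return np.mean(y_sample / pi_hat)   # sample average of the reweighted outcomes

estimate = volz_heckathorn([1, 0, 1, 0], [1, 2, 2, 4])
```

Algebraically this equals $\sum_{τ} \left(Y_{τ} / \deg(X_{τ})\right) / \sum_{τ} \left(1 / \deg(X_{τ})\right)$, so when all sampled degrees are equal it reduces to the plain sample average.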

Remark

The next section will drop the superscript $π$ and $\hat{π}$ in $Y_{τ}^{π}$ and $Y_{τ}^{\hat{π}}$. Using the $Y_{τ}$s to construct the GLS estimator will lead to an unbiased estimator of $μ$. In practice, before doing any of the GLS computations, one could replace the $Y_{τ}$s with $Y_{τ}^{π}$ or $Y_{τ}^{\hat{π}}$ to estimate $μ_{\mathrm{true}}$. The simulations in this paper use a reweighting that is similar to $\hat{π}$ but replaces $H_{\deg}^{-1}$ with a GLS estimate of $E(1 / \deg(X_{τ}))$. In ref. 7, sampling weights are estimated under an alternative, non-Markovian model. These weights could also be used before doing GLS computations.

GLS for RDS

The GLS estimator is the weighted average of the $Y_{τ}$s with smallest variance (11); that is, it is the solution $g^{*}$ to

$$\min_{g} \operatorname{Var}\Big(\sum_{τ \in T} g_{τ} Y_{τ}\Big) \quad \text{such that} \quad \sum_{τ \in T} g_{τ} = 1. \tag{3}$$

Because of the constraint that the weights $g_{τ}$ sum to 1, the linearity of expectation, and the fact that the $(T, P)$-walk on $G$ is stationary, the resulting estimator is an unbiased estimate of $E(Y_{τ})$. Define the covariance matrix $Σ \in \mathbb{R}^{n \times n}$ as

$$Σ_{σ, τ} = \operatorname{Cov}_{RDS}(Y_{σ}, Y_{τ}), \tag{4}$$

which is assumed to be nonsingular. It can be seen that the solution to Eq. 3 depends upon solving a system of equations involving the covariance matrix; namely, $g^{*} = (x^{T} 1)^{-1} x^{T}$ where $Σ x = 1$. (Throughout, we use the notation $1_{M}$ for the all-one vector of length $M$. We drop the length when clear from context.) If $Y \in \mathbb{R}^{n}$ is the vector of $Y_{τ}$s, then the GLS estimator can be expressed as

$$\hat{μ}_{GLS} = (1^{T} Σ^{-1} 1)^{-1} 1^{T} Σ^{-1} Y. \tag{5}$$

The rest of this section contains our main theoretical results, which study how

$$\operatorname{Var}_{RDS}(\hat{μ}_{GLS}) = (1^{T} Σ^{-1} 1)^{-1} \tag{6}$$

decays with the sample size.
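Eqs. 5 and 6 can be evaluated without forming $Σ^{-1}$ explicitly, by solving $Σ x = 1$ as described above. The following is a minimal sketch assuming a known, nonsingular covariance matrix; the function name `gls_mean` is hypothetical.

```python
import numpy as np

def gls_mean(Y, Sigma):
    """Return the GLS estimate (Eq. 5) and its variance (Eq. 6)."""
    Y = np.asarray(Y, dtype=float)
    ones = np.ones(len(Y))
    x = np.linalg.solve(Sigma, ones)  # x solves Sigma x = 1
    g = x / x.sum()                   # weights g*, normalized to sum to 1
    return g @ Y, 1.0 / x.sum()       # x.sum() equals 1^T Sigma^{-1} 1

# Two uncorrelated observations with variances 1 and 4:
# the noisier observation receives the smaller weight.
estimate, variance = gls_mean([1.0, 6.0], np.diag([1.0, 4.0]))
```

With $Σ$ proportional to the identity, the weights are equal and the estimator reduces to the sample average.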

Main Result.

In our main result, we assume that $T$ is a complete binary tree with $n$ nodes, but we expect the result to hold for more general tree topologies.

Theorem 1 (Main Result).

Let $\{X_{τ} : τ \in T\}$ be sampled from the $(T, P)$-walk on $G$ for a fixed $N \times N$ transition matrix $P$ that is irreducible and reversible with respect to a stationary distribution $π$. If $T$ is a complete binary tree with $n$ nodes, then the variance of the GLS estimator defined in Eq. 5 decays like $O(n^{-1})$ as $n \to \infty$.

The proof, which is contained in SI Appendix, Section S4, does not directly compute the variance of the GLS estimator. Instead, it proceeds by constructing an explicit linear estimator and relies on the variational characterization (Eq. 3) of $\hat{μ}_{GLS}$. We emphasize that computing $\hat{μ}_{GLS}$ requires the covariance matrix $Σ$, which is typically unknown. The next section proposes a technique to estimate $Σ$ that is based upon the SBM. We also point out that the result in Theorem 1 is asymptotic and, as such, is only meaningful for $n$ large enough.

Before moving on to practical estimators, we give a more precise result on the constant in the $O(n^{-1})$ by making further assumptions on the spectral properties of $P$ or of the features $y$. The eigenvectors of the reversible transition matrix $P$, denoted $f_{1}, \dots, f_{N} : V \to \mathbb{R}$, are real-valued functions of the nodes $i \in G$ that are orthonormal with respect to the inner product

$$⟨f_{a}, f_{b}⟩_{π} = \sum_{i \in G} f_{a}(i) f_{b}(i) π_{i}. \tag{7}$$

(See, e.g., lemma 12.2 of ref. 12.) We take the eigenfunction $f_{1}$ corresponding to the eigenvalue 1 to be the constant vector $1$ . Define $β_{ℓ} = {⟨ y, f_{ℓ} ⟩}_{π}$ for $ℓ = 1, \dots, N$ and note that $μ = β_{1} = \sum_{i} y (i) π_{i}$ . Let $λ_{1}, \dots, λ_{N}$ be the eigenvalues of $P$ corresponding to $f_{1}, \dots, f_{N}$ . For each node $i \in G$ , $y$ decomposes as follows:

$$y(i) = μ + \sum_{ℓ = 2}^{N} ⟨y, f_{ℓ}⟩_{π} f_{ℓ}(i) = \sum_{ℓ = 1}^{N} β_{ℓ} f_{ℓ}(i). \tag{8}$$

Under the $(T, P)$-walk on $G$, the covariance is stationary with autocovariance function

$$γ(d) = \sum_{ℓ = 2}^{N} β_{ℓ}^{2} λ_{ℓ}^{d}. \tag{9}$$

That is, the covariance matrix has the form $Σ_{σ, τ} = γ (d (σ, τ))$ , where $d (σ, τ)$ is the graph distance (i.e., minimum path length) between $σ$ and $τ$ in $T$ (5).
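When the full graph is available (as in a simulation), Eq. 9 can be evaluated directly. The sketch below is one way to do so, not the authors' code: it assumes a connected graph given as a symmetric weight matrix and obtains the eigenpairs of $P$ from the symmetric matrix $D^{-1/2} W D^{-1/2}$, which has the same eigenvalues. The function name is hypothetical.

```python
import numpy as np

def spectral_autocovariance(W, y, max_lag):
    """Evaluate gamma(d) of Eq. 9 for d = 0, ..., max_lag."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    deg = W.sum(axis=1)
    pi = deg / deg.sum()
    s = 1.0 / np.sqrt(deg)
    S = W * s[:, None] * s[None, :]            # D^{-1/2} W D^{-1/2}
    lam, V = np.linalg.eigh(S)                 # same eigenvalues as P = D^{-1} W
    F = V * s[:, None] * np.sqrt(deg.sum())    # columns orthonormal under <.,.>_pi
    beta = F.T @ (pi * y)                      # beta_l = <y, f_l>_pi
    keep = np.arange(len(lam)) != np.argmax(lam)  # drop eigenvalue 1 (constant f_1)
    lags = np.arange(max_lag + 1)
    return (beta[keep, None] ** 2 * lam[keep, None] ** lags[None, :]).sum(axis=0)

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 1.0])
gamma = spectral_autocovariance(W, y, max_lag=4)
```

The result can be cross-checked against the direct formula $\sum_{i} π_{i}\, y(i)\, (P^{d} y)(i) - μ^{2}$ for the covariance between two stationary samples at distance $d$.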

When the autocovariance further simplifies to

$$γ(d) = β^{2} λ^{d}, \tag{10}$$

for some $λ, β \in \mathbb{R}$, then we call the $(T, P)$-walk on $G$ with feature $y$ a rank-two model. For instance, if $\operatorname{rank}(P) = 2$, then as the name suggests, we have a rank-two model. In particular, all of the results in ref. 8 are for such transition matrices. Fig. 1 also studies such a rank-two model on two groups of people. There are other sufficient conditions for Eq. 10. For instance, if $y(i) = μ + β_{ℓ} f_{ℓ}(i)$ for all nodes $i \in G$, then we have a rank-two model because $β_{j} = ⟨y, f_{j}⟩_{π} = 0$ for $j \notin \{1, ℓ\}$.

Theorem 2.

Under a rank-two model,

$$n \operatorname{Var}(\hat{μ}_{GLS}) \to \left(\frac{1 + λ}{1 - λ}\right) β^{2} \quad \text{as } n \to \infty. \tag{11}$$

This proof follows from the fact that under a rank-two model, $Σ^{- 1}$ has a closed form expression (see SI Appendix, Eq. S6).
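As a hedged worked example of Eq. 11, consider the two-group model of Fig. 1 with $p = 0.9$. The values of $λ$ and $β^{2}$ below are assumptions of this example (they follow from treating the HIV indicator as a symmetric two-state chain) rather than numbers quoted from the text:

$$λ = 2p - 1 = 0.8, \qquad β^{2} = \operatorname{Var}(Y_{τ}) = \tfrac{1}{4}, \qquad \lim_{n \to \infty} n \operatorname{Var}(\hat{μ}_{GLS}) = \frac{1 + 0.8}{1 - 0.8} \cdot \frac{1}{4} = 2.25.$$

Under independent sampling the corresponding constant would be $β^{2} = 0.25$, so in this example the dependence inflates the constant by a factor of 9 while the $O(n^{-1})$ rate is retained.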

Using RDS to Estimate the Spectral Properties of the Graph for fGLS

The fGLS estimator depends upon an estimated covariance matrix $\hat{Σ}$ (e.g., see ref. 13):

$$\hat{μ}_{GLS}(\hat{Σ}) = (1^{T} \hat{Σ}^{-1} 1)^{-1} 1^{T} \hat{Σ}^{-1} Y. \tag{12}$$

With this notation, observe that $\hat{μ}_{GLS} = \hat{μ}_{GLS}(Σ)$.

In our setting, estimating $Σ$ is equivalent to estimating $γ(\cdot)$. We propose and compare several estimators for $γ$. An estimator based upon the DC-SBM is derived in this section. Two additional estimators based upon the rank-two assumption are derived in SI Appendix, Section S3. The first rank-two estimator, $\hat{μ}_{auto}$, relies upon a plug-in estimator for the correlation between $Y_{σ'}$ and $Y_{σ}$ (i.e., the autocorrelation at lag 1). The second rank-two estimator, $\hat{μ}_{Δ}$, relies upon plug-in estimators for the first and second differences, $E(Y_{σ'} - Y_{σ})^{2}$ and $E(Y_{(σ')'} - Y_{σ})^{2}$.

Estimating the Spectral Properties of a SBM from an RDS.

The DC-SBM is a generalization of the SBM (14, 15). Both are models for a random network with community structure. As the name suggests, the degree-corrected model allows for degree heterogeneity within the blocks.

Definition (DC-SBM).

Partition the $N$ nodes into $K$ blocks with $z : \{1, 2, \dots, N\} \to \{1, 2, \dots, K\}$, and assign each node $i$ a value $θ_{i} > 0$ such that the $θ$s sum to 1 within each block; that is,

$$\sum_{i : z(i) = u} θ_{i} = 1, \quad \text{for all } u \in \{1, \dots, K\}. \tag{13}$$

The block membership of node $i$ is $z(i)$, and the parameter $θ_{i}$ controls the degree heterogeneity within each block. Let $B$ be a symmetric $K \times K$ matrix such that $B_{ab} \geq 0$ for all $a, b \in \{1, \dots, K\}$. Under the DC-SBM,

$$\mathbb{P}(\{i, j\} \in E) = θ_{i} θ_{j} B_{z(i), z(j)}$$

for all pairs $i, j = 1, 2, \dots, N$ and each possible edge is independent.

In much of the previous literature on the DC-SBM, the full network is observed, and we wish to estimate the partition $z$. In this paper, we presume that $z$ is observed on the sampled nodes in the $(T, P)$-walk on $G$, and we wish to estimate the spectral properties of $P$. This is reasonable in RDS because each participant takes a survey that records several salient demographic variables (e.g., gender, race, neighborhood, etc.). In practice, the block labels should be chosen such that they are highly autocorrelated from one referral to the next. Many RDS papers already report such statistics. For example, the original RDS paper (1) presents four empirical transition matrices on four different demographic partitions (i.e., race, gender, drug preference, and location).

The derivations below condition on the block labels $z$; only the graph $G$ is random. Let $A \in \{0,1\}^{N \times N}$ be the (random) adjacency matrix; $A_{ij} = 1$ if and only if $(i, j) \in E$. Define $\mathcal{A} \in [0,1]^{N \times N}$ such that $\mathcal{A}_{ij} = E(A_{ij}) = \mathbb{P}((i, j) \in E)$. Define $\mathcal{D} \in \mathbb{R}^{N \times N}$ as a diagonal matrix with $(i, i)$-th element $\sum_{j} \mathcal{A}_{ij}$. Define $\mathcal{P} = \mathcal{D}^{-1} \mathcal{A}$ as a population version of $P$.

The inspiration for the following estimators is based on a population version of the chain and relies on three results. Define the matrix $\hat{Q} \in \mathbb{R}^{K \times K}$ such that for any two blocks $u, v$,

$$\hat{Q}_{uv} = \frac{1}{n} \times \text{number of referrals from block } u \text{ to block } v. \tag{14}$$

Proposition 1 below shows that $\hat{Q}$ is an estimator of $B$ under a $(T, \mathcal{P})$-walk on $G$, where $\mathcal{P}$ is the population version of the transition matrix $P$. Then, Proposition 2 shows that a normalized version of $B$ has spectral properties that match the spectral properties of $\mathcal{P}$. Finally, under the DC-SBM, if the smallest expected degree is growing fast enough, then $P$ converges to $\mathcal{P}$ in spectral norm (e.g., see ref. 16). So estimates of the spectral properties of $\mathcal{P}$ are similar to the spectral properties of $P$. With these facts in mind, we propose estimating the spectral properties of $P$ with the spectral properties of a normalized version of $\hat{Q}$. We let $Z \in \{0,1\}^{N \times K}$ be such that $Z_{ij} = 1$ if and only if $z(i) = j$.

Proposition 1.

If the population transition matrix $\mathcal{P}$ is constructed from the DC-SBM and if $\hat{Q}$ is computed via a sample from the $(T, \mathcal{P})$-walk on $G$, then

$$E(\hat{Q}) = B / m,$$

where $m = 1^{T} B 1$.

Proposition 2.

Define $D_{B} \in \mathbb{R}^{K \times K}$ to be a diagonal matrix that contains the row sums of $B \in \mathbb{R}^{K \times K}$ (that is, $D_{B} = \operatorname{diag}(B 1_{K})$), and define $B_{L} = D_{B}^{-1/2} B D_{B}^{-1/2}$. Define $U$ and $Λ$ via the eigendecomposition, $B_{L} = U Λ U^{T}$. Define $β_{ℓ}^{*} = ⟨y, f_{ℓ}^{*}⟩_{π^{*}}$, where $π^{*}$ is the stationary distribution of the population transition matrix $\mathcal{P}$. Then, (i) the nonzero eigenvalues of $B_{L}$ are identical to the nonzero eigenvalues of $\mathcal{P}$; (ii) the columns of

$$f^{*} = \sqrt{m}\, Z D_{B}^{-1/2} U \tag{15}$$

are eigenvectors of $\mathcal{P}$; and (iii) if $X$ is sampled from $π^{*}$, then

$$β_{ℓ}^{*} = E(y(X) f_{ℓ}^{*}(X)), \quad \text{for } ℓ \leq K. \tag{16}$$

The proofs of the propositions are given in SI Appendix, Section S5. We now introduce our estimator of $Σ$ and $μ$ .

SBM-fGLS.

Using $\tilde{z} : T \to \{1, \dots, K\}$ as an observed partition of the nodes (e.g., by demographic characteristics), the SBM estimator of $Σ$ is computed with the following steps. Each step uses a plug-in estimator using the previously derived formulas. After the statement of the algorithm, the steps are matched to the motivating equation.

For notational convenience, denote $Y_{τ}$, $\tilde{z}(τ)$, and $\deg(τ)$ as $y(X_{τ})$, $\tilde{z}(X_{τ})$, and $\deg(X_{τ})$ for each sampled individual $X_{τ}$. Moreover, suppose a one-to-one mapping between the node set of $T$ and $\{1, \dots, n\}$:

i) Compute $\hat{Q}$ via Eq. 14 using the block memberships $\tilde{z}(τ)$. Define $\hat{Q}^{(S)} = (\hat{Q} + \hat{Q}^{T}) / 2$. This symmetrization ensures the eigenvalues are real-valued.

ii) Row and column normalize $\hat{Q}^{(S)}$, as $\hat{Q}_{L} = D_{\hat{Q}}^{-1/2} \hat{Q}^{(S)} D_{\hat{Q}}^{-1/2}$, where $D_{\hat{Q}} = \operatorname{diag}(\hat{Q} 1_{K}) \in \mathbb{R}^{K \times K}$.

iii) Take an eigendecomposition of

$$\hat{Q}_{L} = \hat{U} \hat{Λ} \hat{U}^{T}. \tag{17}$$

iv) Compute $\hat{f} = \hat{Z} D_{\hat{Q}}^{-1/2} \hat{U}$, where $\hat{Z} \in \{0,1\}^{n \times K}$ contains $\hat{Z}_{ij} = 1$ if $\tilde{z}(i) = j$.

v) For $ℓ = 1, \dots, K$, compute $\hat{β}_{ℓ} = \frac{1}{n} \sum_{τ} Y_{τ} \hat{f}_{ℓ}(τ)$, where $\hat{f}_{ℓ}(τ)$ is the $(τ, ℓ)$ element of $\hat{f}$.

vi) Compute an estimate of the autocovariance function as

$$\hat{γ}_{SBM}(d) = \sum_{ℓ = 1}^{K} \hat{β}_{ℓ}^{2} \hat{Λ}_{ℓℓ}^{d}.$$

vii) Define $\hat{s}^{2}$ to be the sample variance of the $Y_{τ}$. For $σ, τ \in T$,

$$\hat{Σ}_{σ, τ}^{SBM} = \begin{cases} \hat{γ}_{SBM}(d(σ, τ)) & \text{if } σ \neq τ, \\ \hat{γ}_{SBM}(0) + \hat{s}^{2} & \text{if } σ = τ, \end{cases}$$

where $\hat{s}^{2}$ provides for Tikhonov regularization in $(\hat{Σ}^{SBM})^{-1}$.

viii) Define $\hat{g} \in \mathbb{R}^{n}$ to solve the system of equations $\hat{Σ}^{SBM} \hat{g} = 1$.

ix) Estimate $E(Y_{τ})$ with $\sum_{τ \in T} \hat{g}_{τ} Y_{τ} / \sum_{τ \in T} \hat{g}_{τ}$.

Step i comes from Proposition 1. Steps iv and v come from Eqs. 15 and 16 in Proposition 2. Step vi comes from Eq. 9. In all of the plug-in formulas, it is unnecessary to estimate $m$ because we must only specify $\hat{Σ}$ up to a constant of proportionality; this constant appears in both the numerator and denominator of $\hat{μ}_{fGLS}$ in step ix.
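Steps i through ix can be sketched in code as follows. This is a minimal illustration rather than the authors' implementation. It assumes block labels coded as integers $0, \dots, K - 1$, a referral tree supplied as a `parent` array in which `parent[0] = -1` marks the seed and every other node is listed after its parent, at least one outgoing referral from every block, and a sample variance computed with the $n - 1$ denominator. The names `tree_distances` and `sbm_fgls` are hypothetical.

```python
import numpy as np

def tree_distances(parent):
    """Pairwise distances in the referral tree (parents precede children)."""
    n = len(parent)
    dist = np.zeros((n, n), dtype=int)
    for k in range(1, n):
        p = parent[k]
        dist[k, :k] = dist[p, :k] + 1  # node k reaches earlier nodes via its parent
        dist[:k, k] = dist[k, :k]
    return dist

def sbm_fgls(Y, z, parent, K):
    """Return the SBM-fGLS estimate of E(Y_tau) and the estimated covariance."""
    Y = np.asarray(Y, dtype=float)
    z = np.asarray(z, dtype=int)
    n = len(Y)
    # Step i: block-to-block referral frequencies (Eq. 14), then symmetrize.
    Q = np.zeros((K, K))
    for k in range(1, n):
        Q[z[parent[k]], z[k]] += 1.0 / n
    Q_sym = (Q + Q.T) / 2.0
    # Step ii: normalize with the row sums of Q_hat.
    d = Q.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("every block needs at least one outgoing referral")
    d_inv_sqrt = 1.0 / np.sqrt(d)
    Q_L = Q_sym * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Step iii: eigendecomposition of the symmetric matrix Q_L.
    lam, U = np.linalg.eigh(Q_L)
    # Step iv: f_hat[tau, l] is row z(tau) of D^{-1/2} U.
    f_hat = (d_inv_sqrt[:, None] * U)[z]
    # Step v: plug-in coefficients beta_hat_l.
    beta = f_hat.T @ Y / n
    # Steps vi and vii: autocovariance at each tree distance, plus the ridge term.
    dist = tree_distances(parent)
    Sigma = np.zeros((n, n))
    for l in range(K):
        Sigma += beta[l] ** 2 * lam[l] ** dist
    Sigma[np.diag_indices(n)] += np.var(Y, ddof=1)
    # Steps viii and ix: solve for the weights and normalize them.
    g = np.linalg.solve(Sigma, np.ones(n))
    return g @ Y / g.sum(), Sigma

# Toy usage on a 15-node complete binary referral tree with two blocks.
parent = [-1] + [(k - 1) // 2 for k in range(1, 15)]
z = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
mu_hat, Sigma_hat = sbm_fgls(np.array(z, dtype=float), z, parent, K=2)
```

Because step ii uses the row sums of the unsymmetrized $\hat{Q}$, the estimated eigenvalues are not guaranteed to lie in $[-1, 1]$ in small samples; the regularization in step vii helps keep the linear system in step viii well conditioned.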

Simulations

This section compares the SBM-fGLS estimator to the VH estimator via simulation. Each simulated sample is collected by tracing contacts in social graphs collected in the National Longitudinal Study of Adolescent Health (Add Health). In the 1994–95 school year, the Add Health study collected a nationally representative sample of adolescents in grades 7–12. The sample covers 84 pairs of middle and high schools in which students nominated up to five male and five female friends in their middle/high school network (17). In this analysis, all graphs are restricted to the largest connected component. SI Appendix, Section S1 performs a similar simulation on the Colorado Springs Project 90 network (18). These networks were previously studied in refs. 4 and 19. The simulation was performed without replacement on the directed edges; both of these settings are different from the model used in the theoretical results. Details of the simulation settings are given in SI Appendix.

Fig. 2 shows the RMSE for fGLS and VH estimators; RMSE = $\sqrt{E {(μ - \hat{μ})}^{2}}$ . Overall, the SBM-fGLS estimators have a smaller RMSE. Each panel in Fig. 2 has one line with an asterisk. These lines correspond to the same school, which has both (i) a referral bottleneck between the white and black populations and (ii) a referral bottleneck between the high school and middle school. None of the fGLS estimators model both bottlenecks, yet they perform well.

The most difficult quantity to estimate, high school, also has one of the largest absolute reductions in RMSE. This is consistent with the broader pattern of the experiments and the theory: The VH estimator can have excessive error on quantities that are aligned with the community structures in the network that create referral bottlenecks, and our results suggest that fGLS can reduce the error in such cases. However, we must be careful in extrapolating either the frequency or magnitude of the fGLS improvement. These are highly empirical quantities. Moreover, the social networks that are available to perform simulation experiments (Add Health and P90) are not necessarily representative of the typical RDS population. More discussion of the idiosyncrasies of these networks is given in SI Appendix, Section S1.

Fig. 3 presents a diagnostic plot to evaluate the fGLS estimators using only data that are observed in a single sample. This diagnostic plot was created from the first simulated sample taken on the school that has the asterisk in Fig. 2.

The horizontal axis in Fig. 3 gives eigenvalue(s) of $P$ estimated by the fGLS technique. The vertical axis gives the plug-in estimate for the ratio of SEs (RSE), where $\hat{μ}$ denotes the sample average:

$$RSE(\hat{Σ}) = \sqrt{\frac{\widehat{\operatorname{Var}}(\hat{μ}_{GLS}(\hat{Σ}))}{\widehat{\operatorname{Var}}(\hat{μ})}} = \sqrt{\frac{(1^{T} \hat{Σ}^{-1} 1)^{-1}}{n^{-2}\, 1^{T} \hat{Σ} 1}}.$$

We should prefer the fGLS estimators that have a smaller ratio. As is justified in more detail in SI Appendix, Section S6, estimators with smaller RSE make reductions in the variance by taking advantage of the dependencies. Notice how the fGLS estimators have smaller ratios for the outcomes of black, white, and high school. For these outcomes, fGLS significantly reduces the RMSE in Fig. 2. The diagnostic, however, fails to identify the reduction in RMSE for the outcome male. For Asian, Hispanic, and male, the ratio of SEs is closer to 1.
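The diagnostic ratio is straightforward to compute from any estimated covariance matrix. The sketch below takes the plug-in variance of the sample average to be $n^{-2}\, 1^{T} \hat{Σ} 1$, so that an identity $\hat{Σ}$ (no dependence) gives a ratio of exactly 1; the function name is hypothetical.

```python
import numpy as np

def ratio_of_standard_errors(Sigma_hat):
    """Plug-in ratio of the GLS standard error to the sample-average standard error."""
    Sigma_hat = np.asarray(Sigma_hat, dtype=float)
    n = Sigma_hat.shape[0]
    ones = np.ones(n)
    var_gls = 1.0 / (ones @ np.linalg.solve(Sigma_hat, ones))  # (1' Sigma^{-1} 1)^{-1}
    var_mean = ones @ Sigma_hat @ ones / n**2                  # variance of the plain average
    return np.sqrt(var_gls / var_mean)

# Strong positive dependence along a chain of six samples gives a ratio below 1.
idx = np.arange(6)
chain_cov = 0.8 ** np.abs(idx[:, None] - idx[None, :])
rse = ratio_of_standard_errors(chain_cov)
```

A ratio well below 1 suggests that the estimated dependence leaves room for GLS to reduce the variance; a ratio near 1 suggests little to gain.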

Summary

This paper derives and studies GLS and fGLS estimators that account for the covariance between samples in an RDS. Under the Markov model where the covariance between samples is known, Theorems 1 and 2 show that the variance of the GLS estimator decays like $O (n^{- 1})$ . To estimate the covariance between samples, we use the fact that the covariance between adjacent samples can be exactly specified in terms of the spectral properties of the Markov transition matrix (5, 20–24). These essential spectral properties of the network can be estimated from the observed data under the DC-SBM and the rank-two model.

The Simulations section shows, on the Add Health networks, that the fGLS estimates typically have smaller RMSEs than VH estimates. This simulation is performed under a more realistic model than the models used in the technical results (Theorems 1 and 2 and Propositions 1 and 2). First, the RDS is simulated on social graphs that were recorded in the Add Health study (neither rank-two nor simulated from the DC-SBM). Second, the sampling is without replacement. Third, the edges have not been symmetrized. Despite these departures from the reversible Markov model in the technical results, the estimators appear to still perform well. This finding is empirical, and given that these networks are not necessarily representative of the typical RDS population, we must be careful in extrapolating this intuition to other scenarios.

The diagnostic plots in Fig. 3 help to determine whether the outcome of interest is correlated in the observed sample. For quantities that are correlated (e.g., race, ethnicity, and school), Fig. 2 shows that fGLS estimates significantly reduce the RMSE.

SI Appendix, Sections S2 and S3 present two additional simulations to investigate the role of (i) sample size, (ii) referral rates, (iii) alignment of the outcome $y$ with the blocks $z$ , and (iv) preferential recruitment. In those simulations, when the outcome of interest correlates or aligns with the underlying structure of the graph and the referral rate is larger than the critical threshold identified in ref. 5, fGLS estimators can appreciably reduce the variability over previous estimators. In some simulations, the fGLS estimators have a smaller RMSE with 500 samples than the VH estimators have with 1,000 samples. While the fGLS estimators are derived under a Markov model, all simulations were performed under a without-replacement (i.e., non-Markovian) model.

Under the Markov model in Theorem 1 and under the simulations on the networks to which we have access, our results suggest effective ways (i) to diagnose strong dependence between samples and (ii) to alleviate such dependence. However, we must be careful in extrapolating specific values from the simulations (e.g., the amount that fGLS reduces the RMSE). The Add Health and P90 networks that are available to perform simulation experiments are not necessarily representative of the typical RDS population. The RMSE of the VH estimator and magnitude of the reduction in RMSE from fGLS are two highly empirical quantities that change between networks and outcomes.

Supplementary Material

Supplementary File

pnas.1706699115.sapp.pdf (772.9 KB, PDF)

Acknowledgments

S.R. is supported by NSF Grants DMS-1149312 (CAREER), DMS-1614242, and CCF-1740707 (TRIPODS). K.R. is supported by NSF Grant DMS-1612456 and Army Research Office Grant W911NF-15-1-0423.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1706699115/-/DCSupplemental.

References

1. Heckathorn DD. Respondent-driven sampling: A new approach to the study of hidden populations. Soc Probl. 1997;44:174–199.
2. World Health Organization, Regional Office for the Eastern Mediterranean (2013) Introduction to HIV/AIDS and sexually transmitted infection surveillance: Module 4: Introduction to respondent-driven sampling. Available at www.who.int/iris/handle/10665/116864. Accessed September 13, 2018.
3. White RG, et al. Strengthening the reporting of observational studies in epidemiology for respondent-driven sampling studies: “STROBE-RDS” statement. J Clin Epidemiol. 2015;68:1463–1471. doi: 10.1016/j.jclinepi.2015.04.002.
4. Goel S, Salganik MJ. Assessing respondent-driven sampling. Proc Natl Acad Sci USA. 2010;107:6743–6747. doi: 10.1073/pnas.1000261107.
5. Rohe K. 2015. Network driven sampling; a critical threshold for design effects. arXiv:1505.05461.
6. Volz E, Heckathorn DD. Probability based estimation theory for respondent driven sampling. J Off Stat. 2008;24:79–97.
7. Gile KJ. Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. J Am Stat Assoc. 2011;106:135–146.
8. Goel S, Salganik MJ. Respondent-driven sampling as Markov chain Monte Carlo. Stat Med. 2009;28:2202–2229. doi: 10.1002/sim.3613.
9. Benjamini I, Peres Y. Markov chains indexed by trees. Ann Probab. 1994;22:219–243.
10. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685.
11. Aitken AC. IV.—On least squares and linear combination of observations. Proc R Soc Edinburgh. 1936;55:42–48.
12. Levin DA, Peres Y, Wilmer EL. Markov Chains and Mixing Times. American Mathematical Society; Providence, RI: 2009.
13. Amemiya T. Advanced Econometrics. Harvard Univ Press; Cambridge, MA: 1985.
14. Holland P, Laskey K, Leinhardt S. Stochastic blockmodels: First steps. Social Netw. 1983;5:109–137.
15. Karrer B, Newman M. Stochastic blockmodels and community structure in networks. Phys Rev E. 2011;83:016107. doi: 10.1103/PhysRevE.83.016107.
16. Chung F, Radcliffe M. On the spectra of general random graphs. Electron J Combinatorics. 2011;18:1–14.
17. Harris KM, et al. 2009. The national longitudinal study of adolescent health: Research design. Available at www.cpc.unc.edu/projects/addhealth/design. Accessed September 13, 2018.
18. Klovdahl AS, et al. Social networks and infectious disease: The Colorado Springs study. Soc Sci Med. 1994;38:79–88. doi: 10.1016/0277-9536(94)90302-6.
19. Baraff AJ, McCormick TH, Raftery AE. Estimating uncertainty in respondent-driven sampling using a tree bootstrap method. Proc Natl Acad Sci USA. 2016;113:14668–14673. doi: 10.1073/pnas.1617258113.
20. Verdery AM, Mouw T, Bauldry S, Mucha PJ. 2013. Network structure and biased variance estimation in respondent driven sampling. arXiv:1309.5109.
21. Khabbazian M, Hanlon B, Russek Z, Rohe K. Novel sampling design for respondent-driven sampling. Electron J Stat. 2017;11:4769–4812.
22. Qin T, Rohe K. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems. Curran Associates; Red Hook, NY: 2013. pp. 3120–3128.
23. Li X, Rohe K. 2015. Central limit theorems for network driven sampling. arXiv:1509.04704.
24. Durrett R. Random Graph Dynamics. Vol 200. Cambridge Univ Press; Cambridge, UK: 2007.
