Abstract
Many epidemiologic studies forgo probability sampling and turn to nonprobability volunteer-based samples because of cost, response burden, and the invasiveness of biological sample collection. However, finite population (FP) inference is difficult to make from a nonprobability sample because it lacks population representativeness. To make inferences at the population level using nonprobability samples, various inverse propensity score weighting methods have been studied, with the propensity defined as the participation rate of population units in the nonprobability sample. In this article, we propose an adjusted logistic propensity weighting (ALP) method to estimate the participation rates for nonprobability sample units. The proposed ALP method is easy to implement with ready-to-use software and produces approximately unbiased estimators for population quantities regardless of the nonprobability sampling rate. The efficiency of the ALP estimator can be further improved by scaling the survey sample weights in propensity estimation. Taylor linearization variance estimators are proposed for ALP estimators of FP means that account for all sources of variability. The proposed ALP methods are evaluated numerically via simulation studies and empirically using the naïve unweighted National Health and Nutrition Examination Survey III sample, with the 1994 National Health Interview Survey as the reference, to estimate 15-year mortality rates.
Keywords: finite population inference, nonprobability sample, propensity score weighting, survey sampling, variance estimation
1 |. INTRODUCTION
In the big data era, assembling volunteer-based epidemiologic cohorts within integrated healthcare systems that have electronic health records and a large preexisting base of volunteers has become increasingly popular because of its cost and time efficiency; the UK Biobank in the UK National Health Service is a prominent example.1 However, samples of volunteer-based cohorts are not randomly selected from the underlying finite target population and therefore cannot represent the target population well. As a result, naïve sample estimates obtained from such a cohort can be biased for finite population (FP) quantities. For example, the estimated all-cause mortality rate in the UK Biobank was only half that of the UK population,2 and the Biobank is not representative of the UK population with regard to many sociodemographic, physical, lifestyle, and health-related characteristics.
To make inferences at the population level using nonprobability samples, various propensity-score weighting and matching methods have been proposed in survey research to improve the population representativeness of nonprobability samples by using probability-based survey samples as external references.3–6
Inverse propensity score weighting (IPSW) methods have been studied with the propensity defined by the participation rate of population units in the nonprobability sample. We review two methods—both assume that the units in the nonprobability sample are observed according to some random, but unknown, mechanism. Because that mechanism is unknown, the inclusion probability of each unit must be estimated. As described in Section 2, all methods are based on estimating a pseudo log-likelihood, although the methods differ in their details. Valliant and Dever7 estimated participation rates by fitting a logistic regression model to the combined nonprobability sample and a reference probability sample. Sample weights for the probability sample were scaled by a constant so that the scaled probability sample was assumed to represent the complement of the nonprobability sample. Each unit in the nonprobability sample was assigned a weight of one. This results in the sum of the scaled weights in the combined probability plus nonprobability sample being an estimate of the population size. This method will be referred to as the rescaled design weight (RDW) method. The pseudo-weight for each nonprobability sample unit is then the inverse of its estimated inclusion (or participation) probability.
The RDW estimator is biased especially when the participation rate of the nonprobability sample is large, as noted by Chen et al.4 As a remedy, Chen et al4 estimated the participation rate by manipulating the log-likelihood estimating equation in a somewhat different way. The resulting estimator, denoted by CLW, is consistent and approximately unbiased regardless of the magnitude of participation rates. Compared with the CLW method, which requires special programming, the RDW method has the advantage of easy implementation by ready-to-use software such as R, Stata, or SAS. Survey practitioners can simply fit a logistic regression model with scaled survey weights in the probability sample to obtain the estimated participation rates.
In this article, we propose an adjusted logistic propensity weighting (ALP) method to estimate the participation rates for nonprobability sample units. Like the CLW method, the proposed ALP method relaxes the assumptions required by the RDW method,7,8 by formulating the propensity model in an innovative way. As in the RDW method, the proposed ALP method retains the advantage of easy implementation: one fits a propensity model with survey weights in ready-to-use software. Taylor linearization (TL) variance estimators are proposed for ALP estimates that account for variability due to differential pseudo-weights in the nonprobability sample, the complex survey design of the reference probability survey, and the estimation of the propensity scores. The variance of the proposed estimator has the order of the inverse of the nonprobability sample size (as shown in Appendix C). Moreover, under the logistic propensity model, the ALP method can flexibly scale the probability sample weights in propensity estimation to further improve efficiency. In summary, the contributions of the proposed ALP method include (1) easy implementation with ready-to-use software, (2) high efficiency, and (3) the justification of a set of pseudo-estimating equations (13) that underlie the straightforward implementation in survey software.
2 |. METHODS
2.1 |. Basic setting
Let FP = {1, ⋯, N} represent the FP with size N. We are interested in estimating the FP mean $\mu = N^{-1}\sum_{i \in FP} y_i$. Suppose a volunteer-based nonprobability sample $s_c$ of size $n_c$ is selected from FP by a self-selection mechanism, with $\delta_i^c$ (= 1 if $i \in s_c$; 0 otherwise) denoting the indicator of $s_c$ inclusion. The underlying participation rate of the nonprobability sample for a FP unit is defined as
$$\pi_i^c = \Pr(\delta_i^c = 1 \mid x_i) = E_c(\delta_i^c \mid x_i),$$
where the expectation $E_c$ is with respect to the nonprobability sample selection, and $x_i$ is a vector of self-selection variables, that is, covariates related to the probability of inclusion in $s_c$. The corresponding implicit nonprobability sample weight is $w_i^c = 1/\pi_i^c$ for $i \in FP$.
We consider the following assumptions for the nonprobability sample self-selection.
A1. The nonprobability sample selection is uncorrelated with the variable of interest given the covariates, that is, $E_c(\delta_i^c \mid x_i, y_i) = E_c(\delta_i^c \mid x_i)$ for $i \in FP$.
A2. All FP units have a positive participation rate, that is, $\pi_i^c > 0$ for $i \in FP$.
A3. The indicators of participation in the nonprobability cohort are uncorrelated with each other given the self-selection variables, that is, $\mathrm{Cov}(\delta_i^c, \delta_j^c \mid x_i, x_j) = 0$ for $i \neq j$.
An independent reference probability-based survey sample $s_p$ of size $n_p$ is randomly selected from FP. The sample inclusion indicator, selection probability, and corresponding sample weight are denoted by $\delta_i^p$ (= 1 if $i \in s_p$; 0 otherwise), $\pi_i^p = E_p(\delta_i^p)$, and $d_i = 1/\pi_i^p$, respectively, where $E_p$ is with respect to the survey sample selection.
2.2 |. Existing logistic propensity weighting method
In this section, we first briefly introduce the existing RDW and CLW methods and discuss their pros and cons.
2.2.1 |. RDW method
Valliant and Dever6,7 assumed a logistic regression model for the participation rates
$$\pi_i^c = \pi^c(x_i; \gamma) = \frac{\exp(\gamma^T x_i)}{1 + \exp(\gamma^T x_i)}, \quad (1)$$
where γ is a vector of unknown parameters, and $x_i$ is a vector of covariates for $i \in FP$. To simplify the notation, we use $\pi_i^c$ below. They considered (implicitly) the population likelihood function of $\{\delta_i^c, i \in FP\}$ as
$$L(\gamma) = \prod_{i \in FP} (\pi_i^c)^{\delta_i^c} (1 - \pi_i^c)^{1 - \delta_i^c}. \quad (2)$$
Then, the log-likelihood function can be written as
$$l(\gamma) = \sum_{i \in s_c} \log \pi_i^c + \sum_{i \in FP - s_c} \log(1 - \pi_i^c), \quad (3)$$
where the set FP − sc represents the FP units that are not self-selected into the nonprobability sample. Since FP − sc is not available in practice, the pseudo-loglikelihood function was constructed to estimate l(γ) by
$$\tilde{l}_{RDW}(\gamma) = \sum_{i \in s_c} \log \pi_i^c + \sum_{i \in s_p} \tilde{d}_i \log(1 - \pi_i^c), \quad (4)$$
where $\tilde{d}_i = d_i (\hat{N} - n_c)/\hat{N}$, with $\hat{N} = \sum_{i \in s_p} d_i$ being the survey estimate of the target FP size N. This leads to the total of the scaled weights across the probability sample units being $\hat{N} - n_c$. The rationale for rescaling is to weight the survey sample to represent the complement of $s_c$ in the FP, that is, the set FP − $s_c$. Under the logistic regression model, the nonprobability sample participation rate for $i \in s_c$ can be estimated by fitting Model (1) to the combined sample of $s_c$ and scaled-weighted $s_p$ with scaled weights $\tilde{d}_i$, leading to the RDW estimates.
The RDW method has been shown to effectively reduce the bias of the naïve nonprobability sample estimates. However, the first summand in (3) is not a fixed FP total because the units in the nonprobability sample $s_c$ are treated as being randomly observed. This leads to a bias, as shown below.
Comparing the expectation of the population log-likelihood function $l(\gamma)$ in (3) with the expectation of the pseudo log-likelihood $\tilde{l}_{RDW}(\gamma)$ in (4), and letting $E(\cdot) = E_c E_p(\cdot)$, we have
$$E\{l(\gamma)\} = \sum_{i \in FP} \left\{ \pi_i^c \log \pi_i^c + (1 - \pi_i^c) \log(1 - \pi_i^c) \right\},$$
$$E\{\tilde{l}_{RDW}(\gamma)\} = \sum_{i \in FP} \left\{ \pi_i^c \log \pi_i^c + \left(1 - \frac{n_c}{N}\right) \log(1 - \pi_i^c) \right\},$$
by assuming $E_p(\hat{N}) = N$. The difference of the two expectations, denoted by $\Delta_{RDW}$, can be written as
$$\Delta_{RDW} = E\{\tilde{l}_{RDW}(\gamma)\} - E\{l(\gamma)\} = \sum_{i \in FP} \left( \pi_i^c - \frac{n_c}{N} \right) \log(1 - \pi_i^c),$$
which, in general, is nonzero. Accordingly, the nonprobability sample participation rates estimated by solving $\partial \tilde{l}_{RDW}(\gamma)/\partial \gamma = 0$ for γ under Model (1) can be biased, unless either (i) the nonprobability sample units have small participation rates, that is, both $n_c/N$ and $\pi_i^c$ are close to 0 for all $i \in FP$, in which case $\log(1 - \pi_i^c) \approx 0$, or (ii) all population units are equally likely to participate in the nonprobability sample, that is, $\pi_i^c \equiv n_c/N$. In many practical applications, (i) will hold. For example, suppose that $n_c = 1000$ and the US population age 18 and over is the target population. The population size is approximately 210 million, so that $n_c/N \doteq 5 \times 10^{-6}$. If, instead, the population is for a small state like Wyoming, where the 18+ population size is about 365 000, then $n_c/N \doteq 0.0027$. In both examples, with such small sampling fractions, all $\pi_i^c$ should be near zero also.
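The sampling fractions quoted above follow from a couple of lines of arithmetic (the population sizes are the approximate figures cited in the text):

```python
# Sampling fractions n_c/N for the two example target populations cited above.
nc = 1_000
N_us = 210_000_000  # approximate US age-18+ population used in the text
N_wy = 365_000      # approximate Wyoming age-18+ population used in the text

print(f"{nc / N_us:.1e}")  # on the order of 5e-6
print(f"{nc / N_wy:.4f}")  # about 0.0027
```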
2.2.2 |. CLW method
Chen et al4 proposed another IPSW method using the same likelihood function L(γ) in (2), but rewriting the population log-likelihood as
$$l(\gamma) = \sum_{i \in s_c} \log\left( \frac{\pi_i^c}{1 - \pi_i^c} \right) + \sum_{i \in FP} \log(1 - \pi_i^c). \quad (5)$$
In contrast to the RDW method, CLW estimated the population total of $\log(1 - \pi_i^c)$ by a weighted reference sample total and constructed the pseudo log-likelihood as
$$\tilde{l}_{CLW}(\gamma) = \sum_{i \in s_c} \log\left( \frac{\pi_i^c}{1 - \pi_i^c} \right) + \sum_{i \in s_p} d_i \log(1 - \pi_i^c). \quad (6)$$
Under the same logistic regression model (1), the participation rate was estimated by solving the pseudo-estimating equation
$$\sum_{i \in s_c} x_i - \sum_{i \in s_p} d_i \pi_i^c x_i = 0, \quad (7)$$
derived from the pseudo log-likelihood (6). The resulting CLW weights are calculated as $w_i = 1/\hat{\pi}_i^c$ for $i \in s_c$. Chen et al4 proved that the CLW estimator of the FP mean, $\hat{\mu}_{CLW} = \left( \sum_{i \in s_c} w_i \right)^{-1} \sum_{i \in s_c} w_i y_i$, was design consistent when model (1) for the participation rates was correct.
In contrast to the RDW method, CLW does not require condition (i) or (ii) of Section 2.2.1 for unbiased estimation of the participation rates $\pi_i^c$. In the following section, we propose the ALP method, which also corrects the bias in the RDW method. The proposed ALP method provides consistent estimators of FP means and is as easy to implement as the RDW method.
2.3 |. ALP method
The ALP method also aims to estimate the cohort sample participation rates $\pi_i^c$ and to use the inverse of the estimated $\pi_i^c$ as the pseudo-weight for $i \in s_c$. As a computational device, we construct a pseudo-population $FP^* = FP \cup s_c^*$, where $s_c^*$ is a copy of $s_c$ that has the same joint distribution of the covariates x and outcome y as the original $s_c$. The number of units in $FP^*$ is $n_c + N$. In the union $FP^*$, FP and $s_c^*$ are treated as two different sets. We use $R_i$ to indicate the membership of unit i in $s_c^*$ (= 1 if $i \in s_c^*$; 0 if $i \in FP$), and let $p_i = \Pr(R_i = 1 \mid x_i)$. Instead of directly modeling $\pi_i^c$ as in the RDW and CLW methods, we model $p_i$ as a function of $\pi_i^c$:
$$p_i = \frac{\pi_i^c}{1 + \pi_i^c}. \quad (8)$$
The relationship between $p_i$ and $\pi_i^c$ follows because
$$p_i = \Pr(R_i = 1 \mid x_i) = \frac{E_c(\delta_i^c \mid x_i)}{1 + E_c(\delta_i^c \mid x_i)} = \frac{\pi_i^c}{1 + \pi_i^c}, \quad (9)$$
since $s_c^*$ is a copy of $s_c$ and $\Pr(i \in s_c^* \mid x_i) = \pi_i^c$. Notice that, derived from Formula (8),
$$p_i = \frac{\pi_i^c}{1 + \pi_i^c} \le \frac{1}{2},$$
since $\pi_i^c \le 1$, and the equality holds only if $\pi_i^c = 1$, that is, the FP unit i participates in the cohort with certainty. As illustrated by the examples at the end of Section 2.2.1, requiring $p_i \le 1/2$ is not unrealistic in typical applications because $\pi_i^c$ is generally quite small.
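The transformation in (8) and its inverse are elementary; a short sketch (function names are illustrative):

```python
# Map a participation rate pi_c in (0, 1] to p = pi_c/(1 + pi_c) in (0, 1/2],
# as in Formula (8), and invert it via pi_c = p/(1 - p).
def to_p(pi_c: float) -> float:
    return pi_c / (1.0 + pi_c)

def to_pi(p: float) -> float:
    return p / (1.0 - p)

for pi_c in (0.001, 0.2, 1.0):
    p = to_p(pi_c)
    assert p <= 0.5                      # p can never exceed 1/2
    assert abs(to_pi(p) - pi_c) < 1e-12  # the round trip recovers pi_c
```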
Suppose that $p_i$ can be modeled parametrically by $p_i = p(x_i; \beta) = \mathrm{expit}(\beta^T x_i)$, where β is a vector of unknown model parameters. That is,
$$p_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}. \quad (10)$$
Notice that β, the coefficients in Model (10), differ from the coefficients γ in Model (1) because the two logistic regression models have different dependent variables. Based on (8), expression (10) implies that $\pi_i^c$ is being modeled as $\exp(\beta^T x_i)$, which differs from the RDW/CLW model in (1), where $\pi_i^c = \mathrm{expit}(\gamma^T x_i)$. The corresponding "likelihood" function can be written as
$$L^*(\beta) = \prod_{i \in FP^*} p_i^{R_i} (1 - p_i)^{1 - R_i}, \quad (11)$$
where $R_i$ indicates the membership of unit i in $s_c^*$. We put "likelihood" in quotes because $L^*(\beta)$ varies depending on which set of units is selected for $s_c^*$. This contrasts with the population likelihood in (2), which applies regardless of which sample is selected. Note that $L^*(\beta)$ in (11) is written as if the units are independent when they are not. This is a standard procedure in pseudo-MLE estimation, and the resulting parameter estimators remain design-consistent even when some units may be correlated due to, for example, clustering.9,10 The quantity $L^*(\beta)$ should be viewed as motivation for developing the estimating equations given below in (13). The log-likelihood generated from $L^*(\beta)$ is
$$l^*(\beta) = \sum_{i \in FP^*} \left\{ R_i \log p_i + (1 - R_i) \log(1 - p_i) \right\} = \sum_{i \in s_c} \log p_i + \sum_{i \in FP} \log(1 - p_i). \quad (12)$$
Notice that the randomness of $L^*(\beta)$ and $l^*(\beta)$ comes from the cohort selection, that is, from $\delta_i^c$ in the first summand in the last line of (12). In reality, since the unit-level information of FP is unknown, we replace the second summand in (12) by a survey sample estimate, $\sum_{i \in s_p} d_i \log(1 - p_i)$, and obtain the maximum pseudo-likelihood estimator $\hat{\beta}$ by solving the pseudo-estimating equation
$$U(\beta) = \sum_{i \in s_c} (1 - p_i) x_i - \sum_{i \in s_p} d_i p_i x_i = 0. \quad (13)$$
Assuming that $p_i$ is bounded by 0 and 1/2 implies that $\hat{\pi}_i^c = \hat{p}_i/(1 - \hat{p}_i)$ is automatically bounded by 0 and 1. Note that (13) falls in a general class of estimating equations that ensure unique solutions of parameters4,11,12 (eg, equation (6) in CLW4 if their function $h(x_i, \theta)$ is set equal to $x_i$).
The ALP estimator of μ is
$$\hat{\mu}_{ALP} = \left( \sum_{i \in s_c} \hat{w}_i \right)^{-1} \sum_{i \in s_c} \hat{w}_i y_i, \quad (14)$$
where $\hat{w}_i = 1/\hat{\pi}_i^c = (1 - \hat{p}_i)/\hat{p}_i$ for $i \in s_c$. Although $L^*(\beta)$ is not a standard likelihood, $\hat{\mu}_{ALP}$ is a consistent estimator of the population mean, as shown in the theorem below.
We consider the following limiting process for the theoretical development.4,13 Suppose there is a sequence of FPs $FP_k$ of size $N_k$, for k = 1, 2, …. A cohort $s_{c,k}$ of size $n_{c,k}$ and a survey sample $s_{p,k}$ of size $n_{p,k}$ are sampled from $FP_k$. The sizes of the FP, the cohort, and the survey sample satisfy $n_{t,k}/N_k \to f_t$ as $k \to \infty$, where t = c or p and $0 < f_t \le 1$ (regularity condition C1 in Appendix A). In the following, the index k is suppressed for simplicity.
Theorem.
Consistency of ALP estimator of FP mean (see Appendix B).
Under the regularity conditions A1 to A3 and C1 to C5 in Appendix A, and assuming logistic regression model (10) for $p_i$, the ALP estimator $\hat{\mu}_{ALP}$ is design consistent for μ, with the FP variance
(15)
where b is a vector of linearization coefficients, and D is the design-based variance-covariance matrix under the probability sampling design for $s_p$.
In practice, the ALP estimator of a FP mean can be obtained by three steps:
Step 1 Search for covariates x available in both the cohort (sc) and the reference survey sample (sp) and combine the two samples. Assign Ri = 1 for i ∈ sc and Ri = 0 for i ∈ sp in the combined sample.
Step 2 Fit a logistic regression model for $p_i = P(R_i = 1)$ to the combined $s_c$ and weighted $s_p$, with the survey sample weights $\{d_i, i \in s_p\}$ and weights of 1 for $i \in s_c$, and obtain the estimate $\hat{p}_i$ for $i \in s_c$.
Step 3 Estimate the FP mean by Formula (14) with the ALP pseudo-weight $\hat{w}_i = (1 - \hat{p}_i)/\hat{p}_i$ for $i \in s_c$.
Notice that Step 2 can be accomplished by any existing survey software, such as svyglm in the survey package of R, svy:logit in Stata, and PROC SURVEYLOGISTIC in SAS. In addition to being easy to implement, the ALP estimator from (14) does not require conditions (i) or (ii) of Section 2.2.1, unlike RDW. Moreover, we prove that in large samples $\hat{\mu}_{ALP}$ is at least as efficient as $\hat{\mu}_{CLW}$ under their respective correct propensity models, with the efficiency gain depending on both the nonprobability and probability sample sizes (see Appendix C).
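In practice one would use the survey software listed above. Purely as an illustration, the three steps can also be sketched with numpy; the data layout and function names below are hypothetical, design features such as stratification are ignored, and the propensity model is fit by a plain Newton-Raphson on the weighted score:

```python
import numpy as np

def fit_weighted_logit(X, R, w, n_iter=50, tol=1e-10):
    # Weighted logistic regression of R on X (intercept added) via Newton-Raphson.
    Xd = np.column_stack([np.ones(len(R)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        step = np.linalg.solve(
            (Xd * (w * p * (1.0 - p))[:, None]).T @ Xd,  # weighted information
            Xd.T @ (w * (R - p)),                        # weighted score
        )
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def alp_mean(X_c, y_c, X_p, d_p):
    # Step 1: combine cohort (R = 1) and reference survey sample (R = 0).
    X = np.vstack([X_c, X_p])
    R = np.r_[np.ones(len(X_c)), np.zeros(len(X_p))]
    w = np.r_[np.ones(len(X_c)), d_p]  # unit weights for s_c, d_i for s_p
    # Step 2: fit the propensity model for p_i and predict for i in s_c.
    beta = fit_weighted_logit(X, R, w)
    p_c = 1.0 / (1.0 + np.exp(-np.column_stack([np.ones(len(X_c)), X_c]) @ beta))
    # Step 3: ALP pseudo-weights w_i = (1 - p_i)/p_i and the ratio-type mean (14).
    w_alp = (1.0 - p_c) / p_c
    return np.sum(w_alp * y_c) / np.sum(w_alp)
```

In a toy simulation with a correctly specified participation model, the pseudo-weighted mean from `alp_mean` tracks the population mean far more closely than the naïve cohort mean.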
An alternative method would be to omit the odds transformation (8) and use $p_i$ itself to approximate the participation rate $\pi_i^c$. Denote this method by FDW, for full design weight, which contrasts with the scaling of the survey sample weights in the RDW method. Comparing the expectation of the population log-likelihood function $l(\gamma)$ in (3) with the expectation of the pseudo log-likelihood in (12), with $p_i$ replaced by $\pi_i^c$ under the FDW method, their difference, denoted by $\Delta_{FDW}$, can be written as
$$\Delta_{FDW} = \sum_{i \in FP} \pi_i^c \log(1 - \pi_i^c).$$
The bias is zero only if the $\pi_i^c$ for $i \in FP$ are all close to zero. Thus, the odds transformation step in ALP could be skipped if all nonprobability participation rates are extremely small; but, in general, that step is essential for unbiased estimation.
2.4 |. Variance estimation
Using the FP variance formula (15), the first summand can be consistently estimated by
(16)
where $\hat{p}_i = p(x_i; \hat{\beta})$ is the prediction for $i \in s_c$, and the remaining quantities in (16) are the corresponding sample plug-in estimates. The second summand $b^T D b$ is estimated by $\hat{b}^T \hat{D} \hat{b}$, where $\hat{D}$ is the survey design consistent variance estimator of D. For example, under stratified multistage cluster sampling with H strata and $a_h$ primary sampling units (PSUs) in stratum h selected with replacement,
$$\hat{D} = \sum_{h=1}^{H} \frac{a_h}{a_h - 1} \sum_{l=1}^{a_h} (z_{hl} - \bar{z}_h)(z_{hl} - \bar{z}_h)^T, \quad (17)$$
where $z_{hl} = \sum_{i \in s_p(hl)} d_i \hat{u}_i$ is the weighted PSU total of the linearized variate $\hat{u}_i$ for cluster l in stratum h, $s_p(hl)$ is the set of sample elements in stratum h and cluster l, and $\bar{z}_h = a_h^{-1} \sum_{l=1}^{a_h} z_{hl}$ is the mean of the PSU totals in stratum h.
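For a scalar linearized variate, the between-PSU computation in (17) reduces to a few lines; a sketch under the with-replacement assumption above (for a vector variate, the squared deviations become outer products):

```python
import numpy as np

def between_psu_var(z, strata):
    # z[l] is the weighted PSU total z_hl; strata[l] labels its stratum h.
    # Accumulates sum_h a_h/(a_h - 1) * sum_l (z_hl - zbar_h)^2.
    z = np.asarray(z, dtype=float)
    strata = np.asarray(strata)
    total = 0.0
    for h in np.unique(strata):
        z_h = z[strata == h]
        a_h = len(z_h)  # number of sampled PSUs in stratum h
        total += a_h / (a_h - 1.0) * np.sum((z_h - z_h.mean()) ** 2)
    return total
```

Note that the formula requires at least two sampled PSUs per stratum, which is why single-PSU strata are collapsed in practice.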
2.5 |. Scaling survey weights in the likelihood for the ALP method
The proposed ALP method can flexibly scale the survey weights in estimating equation (13) to improve efficiency. For case-control studies, Scott and Wild14 and Li et al15 previously used the technique described below to reduce variances of estimates of relative risks when weights for cases and controls are substantially different. We multiply the second summand in $U(\beta)$ by a constant λ, say $\lambda = n_c / \sum_{i \in s_p} d_i$, so that the sum of the scaled survey weights $\lambda d_i$ is $n_c$. Accordingly, the score function becomes
$$U_S(\beta) = \sum_{i \in s_c} (1 - p_i) x_i - \lambda \sum_{i \in s_p} d_i p_i x_i = 0. \quad (18)$$
Solving (18) for β yields the vector of estimates $\hat{\beta}_S = (\hat{\beta}_{0,S}, \hat{\beta}_{1,S}^T)^T$, where $\hat{\beta}_{0,S}$ is the estimate of the intercept. Derivations similar to those in Chambers and Skinner10 and Beaumont11 can be used to prove that $\hat{\beta}_{1,S}$ is design-consistent, with various efficiency gains depending on the variability of the survey weights vs the nonprobability sample weights (with implicit common value of 1). However, the estimate of the intercept can be badly biased with scaled weights. As a result, the estimate of the participation rate, which involves $\hat{\beta}_{0,S}$, would also be biased. The bias of $\hat{\beta}_{0,S}$, however, does not affect the estimate of the population mean, because the scaled ALP-weighted mean,
$$\hat{\mu}_{ALP.S} = \left( \sum_{i \in s_c} \hat{w}_i^S \right)^{-1} \sum_{i \in s_c} \hat{w}_i^S y_i,$$
depends on $\hat{\beta}_{1,S}$ but not on $\hat{\beta}_{0,S}$, where $\hat{w}_i^S = (1 - \hat{p}_i^S)/\hat{p}_i^S$ is the scaled ALP pseudo-weight.
It can be proved that $\hat{\mu}_{ALP.S}$ is a consistent estimator of the FP mean μ. The TL variance estimator of $\hat{\mu}_{ALP.S}$ can be obtained by substituting the scaled-weight counterparts of the plug-in quantities, in particular $\lambda d_i$ for $d_i$, into Formulae (16) and (17). Details on the variance and the consistency of ALP.S are discussed in the dissertation by Wang.16
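The claim that the intercept bias does not propagate to the mean can be seen directly: under Model (10), the pseudo-weight is $(1 - \hat{p}_i)/\hat{p}_i = \exp(-\hat{\beta}_0 - \hat{\beta}_1^T x_i)$, so the intercept enters only as a common multiplicative factor that cancels in the ratio in (14). A toy numerical check (illustrative numbers only):

```python
import numpy as np

def alp_ratio_mean(beta0, beta1, x, y):
    # ALP pseudo-weights under pi_i = exp(beta0 + beta1 * x_i):
    # w_i = 1/pi_i = exp(-beta0) * exp(-beta1 * x_i).
    w = np.exp(-(beta0 + beta1 * x))
    return np.sum(w * y) / np.sum(w)

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 3.0])
m1 = alp_ratio_mean(0.0, -0.5, x, y)
m2 = alp_ratio_mean(5.0, -0.5, x, y)  # intercept shifted, slope unchanged
assert abs(m1 - m2) < 1e-12           # the intercept cancels in the ratio
```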
3 |. SIMULATIONS
3.1 |. FP generation and sample selection
We applied simulation setups similar to those in Chen et al.4 In the FP of size N = 500 000, a vector of covariates $x_i = (x_{1i}, x_{2i}, x_{3i}, x_{4i})^T$ was generated for $i \in FP$, where $x_{1i} = v_{1i}$, $x_{2i} = v_{2i} + 0.3 x_{1i}$, $x_{3i} = v_{3i} + 0.2(x_{1i} + x_{2i})$, $x_{4i} = v_{4i} + 0.1(x_{1i} + x_{2i} + x_{3i})$, with $v_{1i}$ ~ Bernoulli(0.5), $v_{2i}$ ~ Uniform(0, 2), $v_{3i}$ ~ Exponential(1), and $v_{4i}$ ~ χ²(4). The variable of interest $y_i$ ~ Normal($\mu_i$, 1), where $\mu_i = -x_{1i} - x_{2i} + x_{3i} + x_{4i}$ for $i \in FP$. The parameter of interest was the FP mean $\mu = N^{-1} \sum_{i \in FP} y_i$.
The probability-based survey sample $s_p$ with target sample size $n_p$ = 12 500 (sampling fraction $f_p$ = 2.5%) was selected by Poisson sampling, with inclusion probability $\pi_i^p \propto q_i$ for $i \in FP$, where $q_i$ = const + $x_{3i}$ + 0.03$y_i$, chosen to control the variation of the survey weights $d_i = 1/\pi_i^p$. We set const = −0.26 so that max $q_i$ / min $q_i$ = 20.
As noted in Section 2, the ALP and CLW methods assume somewhat different models for the participation rate. Thus, it is interesting to check their performance both when their underlying models are correct and when the assumed participation rate models fail. The volunteer-based nonprobability sample $s_c$ (with a target sample size $n_c$) was also selected by Poisson sampling, but with different inclusion probabilities $\pi_i^c$ for $i \in FP$. We considered two scenarios with different functional forms of $\pi_i^c$, so that either the ALP (and FDW) method or the CLW method had the true linear logistic regression propensity model in one scenario but not in the other. In Scenario 1, $\pi_i^c = \exp(\beta_0 + \beta^T x_i)$ was the specified participation rate for the ith population unit. The underlying true propensity model for the ALP (and FDW) methods, shown in (10), was $p_i = \mathrm{expit}(\beta_0 + \beta^T x_i)$, which implies $\mathrm{logit}(\pi_i^c) = \beta_0 + \beta^T x_i - \log(1 - \pi_i^c)$. This model differs from the underlying linear model (1) assumed by the CLW method by the addition of the term $-\log(1 - \pi_i^c)$. In Scenario 2, $\pi_i^c$ was specified so that $\mathrm{logit}(\pi_i^c) = \gamma_0 + \gamma^T x_i$, which was the model (1) assumed by the CLW method. This model, however, implies that $\log \pi_i^c = \gamma_0 + \gamma^T x_i - \log\{1 + \exp(\gamma_0 + \gamma^T x_i)\}$, which differs from the model assumed by the ALP and FDW methods by the extra term $-\log\{1 + \exp(\gamma_0 + \gamma^T x_i)\}$. Hence, the ALP and CLW estimates of the population mean are expected to be unbiased in one scenario but not the other, since both methods assume a linear logistic propensity model. The biases of the FDW and RDW estimates, as measured by $\Delta_{FDW}$ and $\Delta_{RDW}$, depend on $\pi_i^c$ and go to 0 as $\pi_i^c$ approaches 0. The biases become larger as $\pi_i^c$ increases in either scenario.
In both scenarios, the coefficients were set to β = γ = (0.18, 0.18, −0.27, −0.27)^T. The parameters were chosen so that $0 < \pi_i^c < 1$ for all units $i \in FP$. The intercepts $\beta_0$ and $\gamma_0$ were also controlled so that the expected number of nonprobability sample units $E_c(n_c) = \sum_{i \in FP} \pi_i^c$ was varied from 1250, 2500, 5000, to 10 000, with the corresponding overall participation rate $f_c = E_c(n_c)/N$ being 0.5%, 5%, 10%, or 20%.
3.2 |. Evaluation criteria
We examined the performance of five IPSW estimators of the FP mean μ: (1) $\hat{\mu}_{ALP}$ and (2) $\hat{\mu}_{ALP.S}$ described in Sections 2.2 to 2.5; (3) $\hat{\mu}_{FDW}$ using weights from the ALP method omitting the odds transformation; (4) $\hat{\mu}_{CLW}$ proposed by Chen et al4; and (5) $\hat{\mu}_{RDW}$.6 These were compared with the naïve nonprobability sample mean, which did not use weights, and with the weighted nonprobability sample mean $\bar{y}_{c,w}$, with weights equal to the inverse of the true nonprobability sample inclusion probabilities. Note that $\bar{y}_{c,w}$ is unavailable in practice because the true nonprobability sample inclusion probabilities are unknown. Relative bias (%RB), empirical variance (V), and mean squared error (MSE) of the point estimates were used to evaluate the performance of the IPSW point estimates, calculated by
$$\%RB = \frac{100}{\mu}\left( \frac{1}{B} \sum_{b=1}^{B} \hat{\mu}^{(b)} - \mu \right), \quad V = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\mu}^{(b)} - \bar{\hat{\mu}} \right)^2, \quad MSE = \frac{1}{B} \sum_{b=1}^{B} \left( \hat{\mu}^{(b)} - \mu \right)^2,$$
where B = 4000 is the number of simulation runs, $\hat{\mu}^{(b)}$ is one of the point estimates obtained from the bth simulated sample, $\bar{\hat{\mu}} = B^{-1} \sum_{b=1}^{B} \hat{\mu}^{(b)}$, and μ is the true FP mean.
We also evaluated the variance estimates using the variance ratio (VR) and the 95% confidence interval coverage probability (CP), which were calculated as
$$VR = \frac{B^{-1} \sum_{b=1}^{B} v^{(b)}}{V}, \quad CP = \frac{1}{B} \sum_{b=1}^{B} I\left( \mu \in CI^{(b)} \right),$$
where $v^{(b)}$ is the proposed analytical variance estimate in simulated sample b, and $CI^{(b)}$ is the 95% confidence interval from the bth simulated sample.
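The evaluation criteria above are straightforward to compute from the simulation draws; a sketch (function and argument names are illustrative):

```python
import numpy as np

def sim_metrics(est, var_est, ci_lo, ci_hi, mu):
    # %RB, empirical variance V, MSE, variance ratio VR, and 95% CI coverage CP
    # over B simulation runs.
    est = np.asarray(est, dtype=float)
    rb = 100.0 * (est.mean() - mu) / mu           # %RB
    v = est.var(ddof=1)                           # empirical variance V
    mse = np.mean((est - mu) ** 2)                # MSE
    vr = np.mean(var_est) / v                     # VR: mean analytic var / V
    cp = np.mean((np.asarray(ci_lo) <= mu) & (mu <= np.asarray(ci_hi)))
    return rb, v, mse, vr, cp
```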
3.3 |. Results
Table 1 presents simulation results for the seven nonprobability sample estimators of the FP mean. The naïve estimator that ignored the underlying sampling scheme had relative biases ranging from −36.5% to −42.8%, while the true weighted nonprobability sample estimator $\bar{y}_{c,w}$ was approximately unbiased in all scenarios. The variance of the naïve estimator was much smaller than that of the other estimators, but its bias caused its MSE to be extremely high (not reported).
TABLE 1. Simulation results for the estimators of the FP mean, by overall participation rate fc. The first five result columns are for Scenario 1 (true propensity model for ALP); the last five are for Scenario 2 (true propensity model for CLW).

| Estimator | %RB | V (×10⁵) | VR | MSE (×10⁵) | CP | %RB | V (×10⁵) | VR | MSE (×10⁵) | CP |
|---|---|---|---|---|---|---|---|---|---|---|
| *fc = 0.5%* |  |  |  |  |  |  |  |  |  |  |
| Naïve | −42.76 | 0.22 | 0.99 |  |  | −42.61 | 0.22 | 1.00 |  |  |
| True weighted | −0.13 | 4.38 | 0.93 | 4.39 | 0.90 | −0.12 | 4.38 | 0.93 | 4.38 | 0.90 |
| FDW | −0.29 | 3.73 | 0.93 | 3.75 | 0.87 | −0.40 | 3.63 | 0.93 | 3.66 | 0.87 |
| RDW | −0.28 | 3.73 | 0.93 | 3.75 | 0.87 | −0.40 | 3.63 | 0.93 | 3.66 | 0.87 |
| ALP | −0.07 | 3.70 | 0.93 | 3.77 | 0.88 | −0.19 | 3.66 | 0.93 | 3.67 | 0.88 |
| CLW | 0.05 | 3.87 | 0.93 | 3.87 | 0.89 | −0.07 | 3.76 | 0.93 | 3.76 | 0.88 |
| ALP.S | −0.11 | 3.54 | 0.92 | 3.54 | 0.87 | −0.21 | 3.45 | 0.92 | 3.45 | 0.87 |
| *fc = 5%* |  |  |  |  |  |  |  |  |  |  |
| Naïve | −42.74 | 0.02 | 0.99 |  |  | −41.21 | 0.02 | 1.01 |  |  |
| True weighted | −0.04 | 0.50 | 0.98 | 0.50 | 0.92 | −0.02 | 0.46 | 1.00 | 0.47 | 0.93 |
| FDW | −2.15 | 0.56 | 1.00 | 1.29 | 0.66 | −3.05 | 0.43 | 1.01 | 1.89 | 0.45 |
| RDW | −2.05 | 0.57 | 1.00 | 1.23 | 0.68 | −2.95 | 0.43 | 1.01 | 1.81 | 0.47 |
| ALP | −0.01 | 0.62 | 1.00 | 0.62 | 0.94 | −1.03 | 0.47 | 1.01 | 0.64 | 0.85 |
| CLW | 1.29 | 0.84 | 1.00 | 1.10 | 0.95 | 0.01 | 0.61 | 1.01 | 0.61 | 0.94 |
| ALP.S | −0.05 | 0.45 | 1.00 | 0.45 | 0.92 | −0.63 | 0.35 | 1.02 | 0.41 | 0.86 |
| *fc = 10%* |  |  |  |  |  |  |  |  |  |  |
| Naïve | −42.74 | 0.01 | 1.11 |  |  | −39.65 | 0.01 | 1.11 |  |  |
| True weighted | −0.01 | 0.25 | 1.02 | 0.25 | 0.94 | −0.01 | 0.22 | 1.01 | 0.22 | 0.94 |
| FDW | −4.25 | 0.34 | 1.00 | 3.20 | 0.17 | −5.62 | 0.22 | 0.99 | 5.20 | 0.02 |
| RDW | −3.87 | 0.35 | 1.00 | 2.71 | 0.24 | −5.28 | 0.22 | 0.99 | 4.62 | 0.03 |
| ALP | 0.01 | 0.42 | 1.00 | 0.42 | 0.95 | −1.81 | 0.27 | 0.99 | 0.79 | 0.65 |
| CLW | 2.94 | 0.80 | 1.00 | 2.16 | 0.76 | 0.03 | 0.42 | 0.99 | 0.42 | 0.95 |
| ALP.S | −0.03 | 0.27 | 1.02 | 0.27 | 0.94 | −0.86 | 0.18 | 1.01 | 0.29 | 0.81 |
| *fc = 20%* |  |  |  |  |  |  |  |  |  |  |
| Naïve | −42.75 | 0.00 | 1.26 |  |  | −36.50 | 0.01 | 1.21 |  |  |
| True weighted | −0.02 | 0.15 | 0.93 | 0.15 | 0.93 | −0.02 | 0.11 | 0.96 | 0.11 | 0.93 |
| FDW | −8.58 | 0.21 | 0.95 | 11.83 | 0.00 | −9.59 | 0.10 | 0.97 | 14.60 | 0.00 |
| RDW | −7.15 | 0.23 | 0.96 | 8.29 | 0.01 | −8.51 | 0.11 | 0.98 | 11.53 | 0.00 |
| ALP | 0.00 | 0.32 | 0.96 | 0.32 | 0.95 | −2.85 | 0.16 | 0.98 | 1.44 | 0.19 |
| CLW | 7.80 | 1.67 | 0.92 | 11.27 | 0.06 | 0.01 | 0.33 | 0.98 | 0.33 | 0.95 |
| ALP.S | −0.03 | 0.19 | 0.96 | 0.19 | 0.94 | −1.02 | 0.10 | 1.00 | 0.27 | 0.73 |
Abbreviations: ALP, adjusted logistic propensity; CP, coverage probability; MSE, mean squared error; VR, variance ratio.
Consistent with the bias theory in Section 2, the RDW and FDW point estimators were approximately unbiased when $\pi_i^c$ was small for all $i \in FP$ and the overall participation rate was low, but became more biased as $f_c$ increased. The CPs decreased correspondingly.
As expected, the ALP estimators $\hat{\mu}_{ALP}$ and $\hat{\mu}_{ALP.S}$ (and the CLW estimator $\hat{\mu}_{CLW}$) consistently provided unbiased point estimates in the scenarios where they were expected to be unbiased, that is, Scenario 1 for $\hat{\mu}_{ALP}$ and $\hat{\mu}_{ALP.S}$, and Scenario 2 for $\hat{\mu}_{CLW}$. When the underlying model was incorrect for an estimator, biases occurred. For example, the relative biases of $\hat{\mu}_{CLW}$ in Scenario 1 were 0.05%, 1.29%, 2.94%, and 7.80% as $f_c$ increased from 0.5%, through 5% and 10%, to 20%. In Scenario 2, the corresponding relative biases for $\hat{\mu}_{ALP}$ were −0.19%, −1.03%, −1.81%, and −2.85%.
Consistent with the theory in Section 2, the ALP estimator $\hat{\mu}_{ALP}$ was more efficient than $\hat{\mu}_{CLW}$, with consistently smaller empirical variances in all scenarios, especially when the nonprobability cohort size was much larger than the probability sample size. Among all considered methods, $\hat{\mu}_{ALP.S}$ was approximately unbiased with the smallest variance under Scenario 1, where its propensity model was correct. Under Scenario 2, where its model was misspecified, $\hat{\mu}_{ALP.S}$ was biased but most efficient, and therefore achieved the smallest MSE.
The variance estimators for $\hat{\mu}_{ALP}$, $\hat{\mu}_{CLW}$, and $\hat{\mu}_{ALP.S}$ performed very well (with VRs near 1), providing CPs close to the nominal level under the correct propensity models when $f_c$ was large. The lower-than-nominal coverage (about 88%) when $f_c$ = 0.5% was due to small-sample bias with skewed distributions of the underlying sampling weights in the selected nonprobability sample.
4 |. REAL DATA EXAMPLE
We use the same data example as Wang et al6 for illustration. We estimated prospective 15-year all-cause, all-cancer, and heart disease mortality rates for adults in the US using the adult household interview part of the Third US National Health and Nutrition Examination Survey (NHANES III), conducted in 1988 to 1994, with sample size nc = 20 050. We ignored all complex design features of NHANES III and treated it as a nonprobability sample. The coefficient of variation of the sample weights is 125%, indicating highly variable selection probabilities and thus low representativeness of the unweighted sample. For estimating mortality rates, we treated the entire NHANES III sample as if it had been selected in 1991 (the midpoint of the data collection period).
For the reference survey, we used 1994 US National Health Interview Survey (NHIS) respondents to the supplement for monitoring achievement of the Healthy People Year 2000 objectives. Adults aged 18 and older were included (sample size np = 19 738). The 1994 NHIS used a multistage stratified cluster sample design with 125 strata and 248 pseudo-PSUs.17,18 We collapsed strata containing only one PSU with the nearest neighboring stratum for variance estimation purposes.19 Both the NHANES III and NHIS samples were linked to the National Death Index (NDI) for mortality, allowing us to quantify the relative bias of the unweighted NHANES estimates, with the NHIS estimates taken as the gold standard. Notice that the mortality information was obtained by statistical linkage between the survey samples and the NDI,20 not from questionnaire responses. All-cancer and heart-disease deaths were classified according to National Center for Health Statistics death codes.21,22
The use of NHANES III as the "nonprobability cohort" has several advantages for illuminating the performance of the propensity weighting methods. The "nonprobability sample" and the reference survey sample have approximately the same target population, the same data collection mode, and similar questionnaires. This ensures that the pseudo-weighted "nonprobability sample" could potentially represent the target population, and thus enables us to characterize the performance of the propensity weighting methods in real data.
The distributions of selected common covariates and variables of interest in the two samples are presented in Table 2. As expected, the variables in the weighted NHANES and weighted 1994 NHIS samples have very close distributions, because both weighted samples represent approximately the same FP. By contrast, the covariate distributions in the unweighted NHANES differ considerably from those in the weighted samples, especially for design variables such as age, race/ethnicity, poverty, and region, which leads to large biases in mortality rates estimated from the unweighted NHANES.
TABLE 2. Unweighted and weighted percentage distributions of selected covariates and outcomes in the 1994 NHIS (np = 19 738) and NHANES III (nc = 20 050) samples.

| Variable | Category | NHIS % | NHIS weighted % | NHANES % | NHANES weighted % |
|---|---|---|---|---|---|
| Age group | 18–24 years | 10.5 | 13.3 | 15.8 | 15.8 |
|  | 25–44 years | 42.9 | 43.7 | 35.4 | 43.7 |
|  | 45–64 years | 26.1 | 26.6 | 22.6 | 24.6 |
|  | 65 years and older | 20.5 | 16.4 | 26.2 | 16.0 |
| Race | NH-White | 76.1 | 75.9 | 42.3 | 76.0 |
|  | NH-Black | 12.6 | 11.2 | 27.4 | 11.2 |
|  | Hispanic | 8.0 | 9.0 | 28.9 | 9.3 |
|  | NH-other | 3.3 | 4.0 | 1.5 | 3.5 |
| Region | Northeast | 20.7 | 20.5 | 14.6 | 20.8 |
|  | Midwest | 26.1 | 25.1 | 19.2 | 24.1 |
|  | South | 31.5 | 32.5 | 42.7 | 34.3 |
|  | West | 21.6 | 21.9 | 23.5 | 20.9 |
| Poverty | No | 79.1 | 82.3 | 67.9 | 80.3 |
|  | Yes | 13.1 | 10.6 | 21.4 | 12.1 |
|  | Unknown | 7.8 | 7.0 | 10.7 | 7.6 |
| Education | Lower than high school | 20.1 | 19.1 | 42.5 | 26.6 |
|  | High school/Some college | 58.7 | 59.6 | 45.9 | 54.1 |
|  | College or higher | 21.2 | 21.3 | 11.6 | 19.3 |
| Health status | Excellent/Very good | 60.5 | 62.0 | 39.0 | 51.6 |
|  | Good | 25.7 | 25.7 | 35.9 | 32.7 |
|  | Fair/Poor | 13.8 | 12.3 | 25.1 | 15.7 |
| Mortality | All-cause | 20.8 | 17.6 | 26.7 | 17.1 |
|  | Heart-disease | 9.43 | 5.69 | 4.95 | 4.04 |
|  | All-cancer | 5.57 | 4.11 | 5.10 | 4.47 |
The propensity model included main effects of common demographic characteristics (age, sex, race/ethnicity, region, and marital status), socioeconomic status (education level, poverty, and household income), tobacco use (smoking status and chewing tobacco), health variables (body mass index and self-reported health status), and a quadratic term for age. Appendix D shows the final propensity models for the five considered methods.
To evaluate the performance of the five PS-based methods, we used the percent relative difference from the NHIS estimate (%RD), the TL variance estimate (V), and the estimated mean squared error (MSE), which treated the NHIS estimate as the truth. Table 3 shows that the naïve NHANES III estimate of overall mortality was ~52% biased relative to the NHIS estimate because older people, who have higher mortality, were oversampled (Table 2). All five IPSW methods substantially reduced the bias of the naïve estimate. Consistent with the simulation results, the ALP, FDW, RDW, and CLW methods yielded close estimates because the sample fraction of the nonprobability sample was small (calculated from Table 2). The ALP.S method, by scaling the NHIS sample weights in propensity estimation, reduced more bias than the other methods and was more efficient; therefore, the ALP.S estimate had the smallest MSE. The results for all-cancer mortality showed a similar pattern to those for all-cause mortality: all pseudo-weighting methods removed most of the bias of the naïve NHANES estimate (%RD between −3.21% and 2.07%, reduced from 24.68%). By contrast, for heart-disease mortality, all pseudo-weighting methods were substantially less biased than the naïve estimate (%RD between 42.58% and 57.78%, reduced from 133.66%), but the alternative estimators still had undesirably large biases themselves. The bias reduction was not as large as that for all-cancer or all-cause mortality, possibly because important predictors of having heart disease and of being observed in the nonprobability sample were omitted from the propensity model.
TABLE 3.

| Mortality | Method | Estimate (%) | %RD | V (×10⁵) | MSE (×10⁵) |
|---|---|---|---|---|---|
| All cause | NHIS | 17.6 | | | |
| | Naïve | 26.7 | 52.16 | | |
| | ALP | 18.6 | 6.08 | 1.87 | 13.27 |
| | FDW | 18.6 | 6.08 | 1.87 | 13.28 |
| | RDW | 18.6 | 6.08 | 1.87 | 13.28 |
| | CLW | 18.6 | 6.07 | 1.87 | 13.24 |
| | ALP.S | 17.2 | −2.05 | 1.08 | 2.37 |
| All cancer | NHIS | 4.5 | | | |
| | Naïve | 5.6 | 24.68 | | |
| | ALP | 4.6 | 2.07 | 0.38 | 0.46 |
| | FDW | 4.6 | 2.07 | 0.38 | 0.46 |
| | RDW | 4.6 | 2.07 | 0.38 | 0.46 |
| | CLW | 4.6 | 2.06 | 0.37 | 0.46 |
| | ALP.S | 4.3 | −3.21 | 0.32 | 0.53 |
| Heart disease | NHIS | 4.0 | | | |
| | Naïve | 9.4 | 133.66 | | |
| | ALP | 6.4 | 57.78 | 0.50 | 54.88 |
| | FDW | 6.4 | 57.78 | 0.50 | 54.90 |
| | RDW | 6.4 | 57.78 | 0.50 | 54.90 |
| | CLW | 6.4 | 57.77 | 0.50 | 54.87 |
| | ALP.S | 5.8 | 42.58 | 0.33 | 29.86 |
Abbreviations: ALP, adjusted logistic propensity; MSE, estimated mean squared error; %RD, percent relative difference from the NHIS estimate; RDW, rescaled design weight; V, Taylor linearization variance estimate.
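The evaluation metrics reported in Table 3 can be reproduced from their definitions. The following minimal sketch assumes that MSE is estimated as V plus the squared deviation of each estimate from the NHIS benchmark, and uses a hypothetical unrounded ALP estimate of 0.1867 (the table reports 18.6%); with these inputs the results match the ALP all-cause row up to rounding:

```python
# %RD and estimated MSE, treating the NHIS estimate as the truth.
# Inputs correspond to the ALP all-cause row of Table 3; the unrounded
# ALP estimate (0.1867) is an assumption for illustration.
theta_nhis = 0.176   # NHIS all-cause mortality estimate (benchmark "truth")
theta_alp = 0.1867   # ALP estimate of all-cause mortality
v_alp = 1.87e-5      # Taylor linearization variance estimate (V)

# percent relative difference from the benchmark
rd_pct = 100.0 * (theta_alp - theta_nhis) / theta_nhis

# estimated MSE = variance + squared bias relative to the benchmark
mse = v_alp + (theta_alp - theta_nhis) ** 2
```

With these inputs, `rd_pct` is approximately 6.08 and `mse` is approximately 13.3 × 10⁻⁵, consistent with the tabulated values.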
5 |. DISCUSSION
This article proposed ALP weighting methods for population inference using nonprobability samples. The proposed ALP method corrects the bias in the RDW method7 by formulating the problem in an innovative way. Like the RDW method, the proposed ALP method retains the advantage of easy implementation: a propensity model is fitted with survey weights in ready-to-use software. The proposed ALP estimators are design consistent if the assumed model for the participation rate is correct. TL variance estimators for the ALP estimates are derived. Consistency of the ALP FP mean estimators was proved theoretically and evaluated numerically.
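To make the "ready-to-use software" workflow concrete, here is a minimal sketch of the ALP steps on simulated data. It is an illustration under assumptions, not the article's implementation: scikit-learn's weighted logistic regression stands in for any survey-capable routine, and all variable names, sample sizes, and the data-generating model are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated setting: a weighted survey sample represents the FP;
# the nonprobability cohort over-represents units with larger x.
n_s, n_c, N = 500, 300, 50_000
x_s = rng.normal(0.0, 1.0, n_s)      # covariate in the survey sample
w_s = np.full(n_s, N / n_s)          # survey design weights (sum to N)
x_c = rng.normal(0.8, 1.0, n_c)      # cohort covariate, shifted upward
y_c = 2.0 + x_c + rng.normal(0.0, 0.2, n_c)  # outcome observed in cohort only

# Stack the two samples: z = 1 flags cohort records; survey records
# carry their design weights, cohort records carry weight 1.
X = np.concatenate([x_c, x_s]).reshape(-1, 1)
z = np.concatenate([np.ones(n_c), np.zeros(n_s)])
w = np.concatenate([np.ones(n_c), w_s])

# Weighted logistic regression for the ALP propensity
# p_i = P(record i is a cohort record | combined sample).
# Large C approximates an unpenalized fit.
fit = LogisticRegression(C=1e6, max_iter=1000).fit(X, z, sample_weight=w)
p_hat = fit.predict_proba(x_c.reshape(-1, 1))[:, 1]

# ALP pseudo-weight: inverse of the implied participation rate
# pi_i = p_i / (1 - p_i), so the weight is (1 - p_i) / p_i.
pseudo_w = (1.0 - p_hat) / p_hat

naive_mean = y_c.mean()
alp_mean = float(np.sum(pseudo_w * y_c) / np.sum(pseudo_w))  # Hájek-type mean
```

In this simulation the population mean of the outcome is about 2.0; the naïve cohort mean overshoots it, and the ALP pseudo-weighted mean pulls the estimate back toward the population value.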
A primary competitor to ALP is the CLW estimator developed by Chen et al.4 If the nonprobability cohort is a small fraction of the population, ALP and CLW are very similar, although ALP does have computational advantages regardless of the size of the sampling fraction. As the sampling fraction increases, ALP and CLW become more distinct.
Both the ALP and CLW methods fit a propensity model to the combined nonprobability sample and a weighted survey sample. Highly variable weights in the combined sample can lead to low efficiency of the estimated propensity model coefficients. Therefore, the variances of the ALP and the CLW estimators of the FP means can be large in some applications. However, the proposed ALP is proved analytically and numerically to have a variance that is less than or equal to that of the CLW method, regardless of whether the propensity model underlying ALP is correct. It is worth noting that the ALP and CLW methods assume different logistic regression models for propensity score estimation: the propensity is defined as pi = P(i ∈ sc|sc ∪ FP) by ALP in (9) and as πi = P(i ∈ sc|FP) by CLW in (2). Model diagnostics to select which propensity model is more appropriate for a given dataset should be developed and will be the focus of our future research.
An alternative ALP with scaled survey weights in the logistic regression propensity model produces consistent propensity estimates and further improves efficiency, as shown in the simulation and the real data example. The scaled ALP had the smallest MSE in every scenario in our simulation study, regardless of the underlying model, and had the smallest MSE for two of the three mortality causes in our real data application. The CLW estimator with the scaled survey weights, albeit more efficient than the unscaled CLW, is biased (simulation results not shown). The extension of the scaling technique to the CLW method and to other binary regression propensity models requires further investigation.
The theory for the ALP method implies that pi ≤ 1/2, since pi defines the probability of nonprobability sample inclusion among the combined FP units and the nonprobability sample; that is, the implied participation rate πi = pi/(1 − pi) cannot exceed 1. In estimation, however, estimates with pi > 1/2 can occur, especially in the unusual case where the nonprobability sample is a large proportion of the population. Fortunately, this is not a concern for the estimation of population means or regression coefficients. By scaling the survey weights in the propensity model for the ALP method, we can constrain all ALP weights to be greater than one. As proved in Section 2.5, using scaled survey weights would not bias the mean estimates. For population total estimation, an option is to estimate the population mean first and then multiply by a known or estimated population size from an independent source. Another approach to avoid estimated pi > 1/2 is to solve the pseudo-estimating equations in (14) using a constrained optimization algorithm that requires pi ≤ 1/2 for all units. This would, of course, negate the computational advantage of the ALP estimator.
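The constraint can be sketched in symbols. Writing sc for the nonprobability sample, the relation below follows from viewing each population unit as contributing a cohort record with rate πi against one FP record (a sketch in the notation of this article, not a derivation reproduced from it):

```latex
p_i = P(i \in s_c \mid i \in s_c \cup \mathrm{FP}) = \frac{\pi_i}{1+\pi_i},
\qquad
\pi_i = \frac{p_i}{1-p_i},
\qquad
\pi_i \le 1 \iff p_i \le \tfrac{1}{2}.
```

Thus an estimated pi above 1/2 corresponds to an estimated participation rate above 1 and an ALP pseudo-weight below 1.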
Both ALP and CLW are inverse-propensity-score-weighting methods that directly use (functions of) the propensity score to estimate the cohort participation rate. They can be sensitive to propensity model misspecification (eg, missing interaction terms in the fitted propensity model) because misspecification yields inaccurate estimates of the participation rates. Furthermore, extreme pseudo-weights can occur if the estimated participation rate is close to 0. By contrast, propensity-score-based matching methods (not included in this study) may be more robust to model misspecification and less likely to produce extreme pseudo-weights, because they use propensity scores only to measure the similarity between survey and cohort sample units and distribute the survey sample weights to the cohort based on that similarity. Examples of matching methods are propensity-score adjustment by subclassification,23 propensity-score-based kernel weighting methods,5,6,16 and Rivers' matching method.24
There are a number of shortcomings associated with estimating propensity scores by logistic regression. First, the logistic model is susceptible to misspecification: it requires correct variable selection and functional form, including the choice of polynomial terms and multiway interactions. If any of these assumptions are incorrect, the propensity score estimates can be biased, and balance may not be achieved when conditioning on the estimated PS. Second, implementing a search routine for model specification, such as repeatedly fitting logistic regression models while including or excluding predictor variables, interactions, or transformations of variables, can be computationally infeasible or suboptimal. In this context, parametric regression limits the model structures that can be searched over, particularly when many potential predictors are present (high-dimensional data). Machine learning methods for estimating the propensity score that incorporate survey weights will therefore also be a focus of our future research.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the corresponding author.
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.
REFERENCES
- 1. Collins R. What makes UK Biobank special. Lancet. 2012;379(9822):1173–1174.
- 2. Fry A, Littlejohns TJ, Sudlow C, et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am J Epidemiol. 2017;186(9):1026–1034.
- 3. Elliott MR, Valliant R. Inference for nonprobability samples. Stat Sci. 2017;32(2):249–264.
- 4. Chen Y, Li P, Wu C. Doubly robust inference with nonprobability survey samples. J Am Stat Assoc. 2019;115(532):2011–2021.
- 5. Wang L, Graubard BI, Katki HA, Li Y. Improving external validity of epidemiologic cohort analyses: a kernel weighting approach. J Royal Stat Soc Ser A. 2020;183(3):1293–1311. doi:10.1111/rssa.12564.
- 6. Wang L, Graubard BI, Katki HA, Li Y. Efficient and robust propensity-score-based methods for population inference using epidemiologic cohorts; 2021. arXiv preprint arXiv:2011.14850.
- 7. Valliant R, Dever JA. Estimating propensity adjustments for volunteer web surveys. Sociol Methods Res. 2011;40(1):105–137.
- 8. Valliant R. Comparing alternatives for estimation from nonprobability samples. J Surv Stat Methodol. 2020;8(2):231–263.
- 9. Binder DA. On the variances of asymptotically normal estimators from complex surveys. Int Stat Rev. 1983;51(3):279–292.
- 10. Chambers RA, Skinner CJ. Analysis of Survey Data, Sec. 6.3. New York, NY: Wiley; 2003.
- 11. Beaumont JF. Calibrated imputation in surveys under a quasi-model-assisted approach. J Royal Stat Soc Ser B. 2005;67(3):445–458.
- 12. Kim JK, Kim JJ. Nonresponse weighting adjustment using estimated probability. Canadian J Stat. 2007;35(4):501–514.
- 13. Krewski D, Rao JN. Inference from stratified samples: properties of the linearization, jackknife and balanced repeated replication methods. Ann Stat. 1981;9(5):1010–1019.
- 14. Scott AJ, Wild CJ. Fitting logistic models under case-control or choice based sampling. J Royal Stat Soc Ser B. 1986;48(2):170–182.
- 15. Li Y, Graubard BI, DiGaetano R. Weighting methods for population-based case–control studies with complex sampling. J Royal Stat Soc Ser C. 2011;60(2):165–185.
- 16. Wang L. Improving External Validity of Epidemiologic Analyses by Incorporating Data from Population-Based Surveys [Doctoral dissertation]. University of Maryland, College Park; 2020. https://drum.lib.umd.edu/handle/1903/26125.
- 17. Massey JT. Design and Estimation for the National Health Interview Survey, 1985–94. US Department of Health and Human Services, Public Health Service, Centers for Disease Control, National Center for Health Statistics; 1989.
- 18. Ezzati TM, Massey JT, Waksberg J, Chu A, Maurer KR. Sample design: Third National Health and Nutrition Examination Survey. Vital Health Stat Ser 2 Data Eval Methods Res. 1992;113:1–35.
- 19. Hartley HO, Rao JN, Kiefer G. Variance estimation with one unit per stratum. J Am Stat Assoc. 1969;64(327):841–851.
- 20. National Center for Health Statistics. National Death Index User's Guide. Hyattsville, MD: National Center for Health Statistics; 2013. https://www.cdc.gov/nchs/data/ndi/ndi_users_guide.pdf. Accessed April 2021.
- 21. National Center for Health Statistics. Data Linkage Public-Use Linked Mortality File Data Dictionary. Hyattsville, MD: National Center for Health Statistics; 2015. https://www.cdc.gov/nchs/data/datalinkage/public-use-2015-linked-mortality-files-data-dictionary.pdf. Accessed April 2021.
- 22. National Center for Health Statistics. Data Linkage Underlying and Multiple Cause of Death Codes. Hyattsville, MD: National Center for Health Statistics; 2018. https://www.cdc.gov/nchs/data/datalinkage/underlying_and_multiple_cause_of_death_codes.pdf. Accessed April 2021.
- 23. Lee S, Valliant R. Estimation for volunteer panel web surveys using propensity score adjustment and calibration adjustment. Sociol Methods Res. 2009;37(3):319–343.
- 24. Rivers D. Sampling for web surveys. Paper presented at: Proceedings of the Joint Statistical Meetings, Section on Survey Research Methods; 2007; Salt Lake City, Utah.