Author manuscript; available in PMC: 2021 Oct 30.
Published in final edited form as: Stat Med. 2021 Jul 5;40(24):5237–5250. doi: 10.1002/sim.9122

Adjusted logistic propensity weighting methods for population inference using nonprobability volunteer-based epidemiologic cohorts

Lingxiao Wang 1, Richard Valliant 1,2, Yan Li 1
PMCID: PMC8526388  NIHMSID: NIHMS1725980  PMID: 34219260

Abstract

Many epidemiologic studies forgo probability sampling and turn to nonprobability volunteer-based samples because of cost, response burden, and invasiveness of biological samples. However, finite population (FP) inference is difficult to make from the nonprobability sample due to the lack of population representativeness. Aiming for making inferences at the population level using nonprobability samples, various inverse propensity score weighting methods have been studied with the propensity defined by the participation rate of population units in the nonprobability sample. In this article, we propose an adjusted logistic propensity weighting (ALP) method to estimate the participation rates for nonprobability sample units. The proposed ALP method is easy to implement by ready-to-use software while producing approximately unbiased estimators for population quantities regardless of the nonprobability sample rate. The efficiency of the ALP estimator can be further improved by scaling the survey sample weights in propensity estimation. Taylor linearization variance estimators are proposed for ALP estimators of FP means that account for all sources of variability. The proposed ALP methods are evaluated numerically via simulation studies and empirically using the naïve unweighted National Health and Nutrition Examination Survey III sample, while taking the 1997 National Health Interview Survey as the reference, to estimate the 15-year mortality rates.

Keywords: finite population inference, nonprobability sample, propensity score weighting, survey sampling, variance estimation

1 |. INTRODUCTION

In the big data era, assembling volunteer-based epidemiologic cohorts within integrated healthcare systems that have electronic health records and a large preexisting base of volunteers is increasingly popular because of its cost and time efficiency; one example is the UK Biobank in the UK National Health Service.1 However, volunteer-based cohorts are not randomly selected from the underlying finite target population and therefore may not represent it well. As a result, the naïve sample estimates obtained from the cohort can be biased for the finite population (FP) quantities. For example, the estimated all-cause mortality rate in the UK Biobank was only half that of the UK population,2 and the Biobank is not representative of the UK population with regard to many sociodemographic, physical, lifestyle, and health-related characteristics.

To make inferences at the population level using nonprobability samples, various propensity-score weighting and matching methods have been proposed in survey research to improve the population representativeness of nonprobability samples by using probability-based survey samples as external references.3–6

Inverse propensity score weighting (IPSW) methods have been studied with the propensity defined by the participation rate of population units in the nonprobability sample. We review two methods; both assume that the units in the nonprobability sample are observed according to some random, but unknown, mechanism. Because that mechanism is unknown, the inclusion probability of each unit must be estimated. As described in Section 2, both methods are based on a pseudo log-likelihood, although they differ in their details. Valliant and Dever7 estimated participation rates by fitting a logistic regression model to the combined nonprobability sample and a reference probability sample. Sample weights for the probability sample were scaled by a constant so that the scaled probability sample was assumed to represent the complement of the nonprobability sample. Each unit in the nonprobability sample was assigned a weight of one. As a result, the sum of the scaled weights in the combined probability plus nonprobability sample is an estimate of the population size. This method will be referred to as the rescaled design weight (RDW) method. The pseudo-weight for each nonprobability sample unit was then the inverse of its estimated inclusion (or participation) probability.

The RDW estimator is biased especially when the participation rate of the nonprobability sample is large, as noted by Chen et al.4 As a remedy, Chen et al4 estimated the participation rate by manipulating the log-likelihood estimating equation in a somewhat different way. The resulting estimator, denoted by CLW, is consistent and approximately unbiased regardless of the magnitude of participation rates. Compared with the CLW method, which requires special programming, the RDW method has the advantage of easy implementation by ready-to-use software such as R, Stata, or SAS. Survey practitioners can simply fit a logistic regression model with scaled survey weights in the probability sample to obtain the estimated participation rates.

In this article, we propose an adjusted logistic propensity weighting (ALP) method to estimate the participation rates for nonprobability sample units. Like the CLW, the proposed ALP method relaxes the assumptions required by the RDW method,7,8 by formulating the method in an innovative way. As in the RDW method, the proposed ALP method retains the advantage of easy implementation by fitting a propensity model with survey weights in ready-to-use software. Taylor linearization (TL) variance estimators are proposed for ALP estimates that account for variability due to differential pseudo-weights in the nonprobability sample, the complex survey design of the reference probability survey, and the estimation of the propensity scores. The variance of the proposed estimator has the order of the inverse of the nonprobability sample size (as shown in Appendix C). Moreover, under the logistic propensity model, the ALP method can flexibly scale the probability sample weights for propensity estimation to further improve efficiency. In summary, the contributions of the proposed ALP method include (1) easy implementation with ready-to-use software, (2) high efficiency, and (3) the justification of the set of pseudo-estimating equations (13) that underlie the straightforward implementation in survey software.

2 |. METHODS

2.1 |. Basic setting

Let $\mathrm{FP} = \{1, \ldots, N\}$ represent the FP with size $N$. We are interested in estimating the FP mean $\mu = N^{-1}\sum_{i \in \mathrm{FP}} y_i$. Suppose a volunteer-based nonprobability sample $s_c$ of size $n_c$ is selected from FP by a self-selection mechanism, with $\delta_i^{(c)}$ ($= 1$ if $i \in s_c$; 0 otherwise) denoting the indicator of $s_c$ inclusion. The underlying participation rate of the nonprobability sample for an FP unit is defined as

$$\pi_i^{(c)} = P(i \in s_c \mid \mathrm{FP}) = E_c\{\delta_i^{(c)} \mid y_i, x_i\}, \quad i \in \mathrm{FP},$$

where the expectation $E_c$ is with respect to the nonprobability sample selection, and $x_i$ is a vector of self-selection variables, that is, covariates related to the probability of inclusion in $s_c$. The corresponding implicit nonprobability sample weight is $w_i = 1/\pi_i^{(c)}$ for $i \in \mathrm{FP}$.

We consider the following assumptions for the nonprobability sample self-selection.

A1. The nonprobability sample selection is uncorrelated with the variable of interest given the covariates, that is, $\pi_i^{(c)} = E_c\{\delta_i^{(c)} \mid y_i, x_i\} = E_c\{\delta_i^{(c)} \mid x_i\}$ for $i \in \mathrm{FP}$.

A2. All FP units have a positive participation rate, that is, $\pi_i^{(c)} > 0$ for $i \in \mathrm{FP}$.

A3. The indicators of participation in the nonprobability cohort are uncorrelated with each other given the self-selection variables, that is, $\mathrm{cov}(\delta_i^{(c)}, \delta_j^{(c)} \mid x_i, x_j) = 0$ for $i \neq j$.

An independent reference probability-based survey sample $s_p$ of size $n_p$ is randomly selected from FP. The sample inclusion indicator, selection probability, and the corresponding sample weights are defined by $\delta_i^{(p)}$ ($= 1$ if $i \in s_p$; 0 otherwise), $\pi_i^{(p)} = E_p(\delta_i^{(p)} \mid x_i)$, and $d_i = 1/\pi_i^{(p)}$, respectively, where $E_p$ is with respect to the survey sample selection.

2.2 |. Existing logistic propensity weighting method

In this section, we first briefly introduce the existing RDW and CLW methods and discuss their pros and cons.

2.2.1 |. RDW method

Valliant and Dever6,7 assumed a logistic regression model for the participation rates $\pi_i^{(c)}(\gamma)$,

$$\log\left\{\frac{\pi_i^{(c)}(\gamma)}{1-\pi_i^{(c)}(\gamma)}\right\} = \gamma^T x_i, \quad \text{for } i \in \mathrm{FP}, \tag{1}$$

where $\gamma$ is a vector of unknown parameters, and $x_i$ is a vector of covariates for $i \in \mathrm{FP}$. To simplify the notation, we use $\pi_i^{(c)}$ below. They considered (implicitly) the population likelihood function of $\pi_i^{(c)}$ as

$$L(\gamma) = \prod_{i \in \mathrm{FP}} \{\pi_i^{(c)}\}^{\delta_i^{(c)}} \{1-\pi_i^{(c)}\}^{1-\delta_i^{(c)}}. \tag{2}$$

Then, the log-likelihood function can be written as

$$l(\gamma) = \sum_{i \in \mathrm{FP}} \left[\delta_i^{(c)} \log \pi_i^{(c)} + \{1-\delta_i^{(c)}\}\log\{1-\pi_i^{(c)}\}\right] = \sum_{i \in s_c} \log \pi_i^{(c)} + \sum_{i \in \mathrm{FP}-s_c} \log\{1-\pi_i^{(c)}\}, \tag{3}$$

where the set $\mathrm{FP} - s_c$ represents the FP units that are not self-selected into the nonprobability sample. Since $\mathrm{FP} - s_c$ is not available in practice, the pseudo log-likelihood function was constructed to estimate $l(\gamma)$ by

$$\tilde{l}_{RDW}(\gamma) = \sum_{i \in s_c} w_i^* \log \pi_i^{(c)} + \sum_{i \in s_p} w_i^* \log\{1-\pi_i^{(c)}\}, \tag{4}$$

where

$$w_i^* = \begin{cases} 1, & \text{for } i \in s_c, \\ d_i(\hat{N}_p - n_c)/\hat{N}_p, & \text{for } i \in s_p, \end{cases}$$

with $\hat{N}_p = \sum_{i \in s_p} d_i$ being the survey estimate of the target FP size $N$. This leads to the total of the scaled weights across the probability sample units being $\sum_{i \in s_p} w_i^* = \hat{N}_p - n_c$. The rationale for rescaling is to weight the survey sample to represent the complement of $s_c$ in the FP, that is, the set $\mathrm{FP} - s_c$. Under the logistic regression model, the nonprobability sample participation rate $\pi_i^{(c)}$ for $i \in s_c$ can be estimated by fitting Model (1) to the combined sample of $s_c$ and $s_p$ with scaled weights $w_i^*$, leading to the RDW estimates.

The RDW method has been shown to effectively reduce the bias of the naïve nonprobability sample estimates. However, the summand $\sum_{i \in \mathrm{FP}-s_c} \log\{1-\pi_i^{(c)}\}$ in (3) is not a fixed FP total because units in the nonprobability sample $s_c$ are treated as being randomly observed. This leads to a bias as shown below.

Comparing the expectation of the population log-likelihood function $l(\gamma)$ in (3) and the expectation of the pseudo log-likelihood $\tilde{l}_{RDW}(\gamma)$ in (4), and letting $E(\cdot) = E_c E_p(\cdot)$, we have

$$E\{l(\gamma)\} = \sum_{i \in \mathrm{FP}} \pi_i^{(c)} \log \pi_i^{(c)} + \sum_{i \in \mathrm{FP}} \{1-\pi_i^{(c)}\} \log\{1-\pi_i^{(c)}\},$$

and

$$E\{\tilde{l}_{RDW}(\gamma)\} = E_c E_p\{\tilde{l}_{RDW}(\gamma)\} = E_c\left[\sum_{i \in \mathrm{FP}} \delta_i^{(c)} \log \pi_i^{(c)}\right] + E_p\left[\sum_{i \in \mathrm{FP}} \delta_i^{(p)} \frac{\hat{N}_p - n_c}{\hat{N}_p}\, d_i \log\{1-\pi_i^{(c)}\}\right] \doteq \sum_{i \in \mathrm{FP}} \pi_i^{(c)} \log \pi_i^{(c)} + \sum_{i \in \mathrm{FP}} \left\{1-\frac{n_c}{N}\right\} \log\{1-\pi_i^{(c)}\}$$

by assuming $E_p(\hat{N}_p) = N$. The difference of the two expectations, denoted by $\Delta_{RDW}$, can be written as

$$\Delta_{RDW} = E\{\tilde{l}_{RDW}(\gamma)\} - E\{l(\gamma)\} = -\sum_{i \in \mathrm{FP}} \left\{\frac{n_c}{N} - \pi_i^{(c)}\right\} \log\{1-\pi_i^{(c)}\},$$

which, in general, is nonzero. Accordingly, the nonprobability sample participation rates estimated by solving for $\gamma$ in $\partial \tilde{l}_{RDW}(\gamma)/\partial\gamma = 0$ under Model (1) can be biased, unless either (i) the nonprobability sample units have small participation rates, that is, both $n_c/N$ and $\pi_i^{(c)}$ are close to 0 for all $i \in \mathrm{FP}$, in which case $\log\{1-\pi_i^{(c)}\} \approx 0$, or (ii) all population units are equally likely to participate in the nonprobability sample, that is, $\pi_i^{(c)} \equiv n_c/N$. In many practical applications, (i) will hold. For example, suppose that $n_c = 1000$ and the US population age 18 and over is the target population. The population size is approximately 210 million, so that $n_c/N \doteq 5 \times 10^{-6}$. If, instead, the population is that of a small state like Wyoming, where the 18+ population size is about 365 000, then $n_c/N \doteq 0.0027$. In both examples, with such small sampling fractions, all $\{\pi_i^{(c)}, i \in s_c\}$ should also be near zero.
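As a rough numerical illustration of conditions (i) and (ii), the sketch below evaluates the bias term $\Delta_{RDW}$ for a few hypothetical vectors of participation rates, taking $n_c$ equal to its expected value $\sum_i \pi_i^{(c)}$. The function name `delta_rdw` and the example rate vectors are assumptions made for illustration only; they do not come from the article.

```python
import math

def delta_rdw(pi):
    """Delta_RDW = -sum_i (n_c/N - pi_i) * log(1 - pi_i), with n_c set to sum(pi)."""
    N = len(pi)
    frac = sum(pi) / N  # expected sampling fraction n_c / N
    return -sum((frac - p) * math.log(1.0 - p) for p in pi)

# Sampling fractions from the two examples in the text (n_c = 1000)
us_fraction = 1000 / 210e6        # roughly 5e-6
wyoming_fraction = 1000 / 365000  # roughly 0.0027

# Hypothetical participation-rate vectors (illustrative values only)
equal_rates = [0.10] * 1000           # condition (ii): every unit equally likely
small_rates = [0.0005, 0.0015] * 500  # condition (i): all rates close to zero
large_rates = [0.05, 0.35] * 500      # neither condition holds

for name, rates in [("equal", equal_rates), ("small", small_rates), ("large", large_rates)]:
    print(name, delta_rdw(rates))
```

Under these assumed inputs, the term is essentially zero for equal rates, very small for uniformly small rates, and clearly nonzero when rates are both large and unequal.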

2.2.2 |. CLW method

Chen et al4 proposed another IPSW method using the same likelihood function $L(\gamma)$ in (2), but rewriting the population log-likelihood as

$$l(\gamma) = \sum_{i \in s_c} \log\frac{\pi_i^{(c)}}{1-\pi_i^{(c)}} + \sum_{i \in \mathrm{FP}} \log\{1-\pi_i^{(c)}\}. \tag{5}$$

In contrast to the RDW method, CLW estimated the population total of $\log\{1-\pi_i^{(c)}\}$ by a weighted reference sample total and constructed the pseudo log-likelihood as

$$\tilde{l}_{CLW}(\gamma) = \sum_{i \in s_c} \log\frac{\pi_i^{(c)}}{1-\pi_i^{(c)}} + \sum_{i \in s_p} d_i \log\{1-\pi_i^{(c)}\}. \tag{6}$$

Under the same logistic regression model (1), the participation rate $\pi_i^{(c)}$ was estimated by solving the pseudo estimating equation

$$\tilde{S}(\gamma) = \frac{1}{N}\left\{\sum_{i \in s_c} x_i - \sum_{i \in s_p} d_i \pi_i^{(c)} x_i\right\} = 0, \tag{7}$$

derived from the pseudo log-likelihood (6). The resulting CLW weights are calculated as $\{w_i^{CLW} = 1 + \exp^{-1}(\hat{\gamma}^T x_i), i \in s_c\}$. Chen et al4 proved that the CLW estimator of the FP mean, $\hat{\mu}_{CLW} = \left(\sum_{i \in s_c} w_i^{CLW}\right)^{-1} \sum_{i \in s_c} w_i^{CLW} y_i$, was design consistent when model (1) for the participation rates was correct.

In contrast to the RDW method, CLW does not require condition (i) or (ii) for unbiased estimation of the participation rates $\{\pi_i^{(c)}, i \in s_c\}$. In the following section, we propose an ALP method, which corrects the bias in the RDW method. The proposed ALP method provides consistent estimators of FP means and is as easy to implement as the RDW method.

2.3 |. ALP method

The ALP method also aims to estimate the cohort sample participation rates $\{\pi_i^{(c)}, i \in s_c\}$ and use the inverse of the estimated $\pi_i^{(c)}$ as the pseudo-weight for $i \in s_c$. As a computational device, we construct a pseudo-population $s_c^* \cup \mathrm{FP}$, where $s_c^*$ is a copy of $s_c$ that has the same joint distribution of covariates $x$ and outcome $y$ as the original $s_c$. The number of units in $s_c^* \cup \mathrm{FP}$ is $n_c + N$. In the union $s_c^* \cup \mathrm{FP}$, $s_c^*$ and $s_c$ are treated as two different sets. We use $R_i$ to indicate the membership of $s_c^*$ in $s_c^* \cup \mathrm{FP}$ ($= 1$ if $i \in s_c^*$; 0 if $i \in \mathrm{FP}$), and $p_i = P(R_i = 1) = P(i \in s_c^* \mid s_c^* \cup \mathrm{FP})$. Instead of directly modeling $\pi_i^{(c)}$ as in the RDW and CLW methods, we model $p_i$ as a function of $\pi_i^{(c)}$:

$$p_i = \frac{\pi_i^{(c)}}{1+\pi_i^{(c)}}, \quad \text{or equivalently,} \quad \pi_i^{(c)} = \frac{p_i}{1-p_i}. \tag{8}$$

The relationship between $p_i$ and $\pi_i^{(c)}$ follows because

$$\frac{p_i}{1-p_i} = \frac{P(i \in s_c^* \mid s_c^* \cup \mathrm{FP})}{P(i \in \mathrm{FP} \mid s_c^* \cup \mathrm{FP})} = \frac{P(i \in s_c)}{P(i \in \mathrm{FP})} = P(i \in s_c \mid \mathrm{FP}) = \pi_i^{(c)} \tag{9}$$

since $s_c^*$ is a copy of $s_c$ and $P(i \in s_c^* \mid s_c^* \cup \mathrm{FP}) = P(i \in s_c \mid s_c^* \cup \mathrm{FP})$. Notice that, from Formula (8),

$$p_i \le \frac{1}{2},$$

since $\pi_i^{(c)} \le 1$, and the equality holds only if $\pi_i^{(c)} = 1$, that is, the FP unit $i$ participates in the cohort with certainty. As illustrated by the examples at the end of Section 2.2.1, requiring $p_i \le 1/2$ is not unrealistic in typical applications because $\pi_i^{(c)}$ is generally quite small.
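The odds transformation in (8) and the resulting bound on $p_i$ can be checked with a few lines of code. The helper names `pi_to_p` and `p_to_pi` below are hypothetical and serve only to illustrate the mapping between the participation rate and the membership probability in the pseudo-population.

```python
def pi_to_p(pi):
    """Map a participation rate pi in [0, 1] to p = pi / (1 + pi), as in (8)."""
    return pi / (1.0 + pi)

def p_to_pi(p):
    """Inverse map: recover pi = p / (1 - p), the odds of p."""
    return p / (1.0 - p)

# p never exceeds 1/2, and reaches 1/2 only when pi = 1
grid = [k / 100 for k in range(101)]
largest_p = max(pi_to_p(pi) for pi in grid)
print(largest_p)  # attained at pi = 1
```

For small participation rates the two quantities nearly coincide (for instance, a rate of 0.0027 maps to a value only slightly smaller), which is why skipping the transformation matters mainly when rates are not small.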

Suppose that $p_i$ can be modeled parametrically by $p_i = p(x_i; \beta) = \mathrm{expit}(\beta^T x_i)$, where $\beta$ is a vector of unknown model parameters. That is,

$$\log\left\{\frac{p_i}{1-p_i}\right\} = \beta^T x_i, \quad \text{for } i \in s_c^* \cup \mathrm{FP}. \tag{10}$$

Notice that $\beta$, the coefficients in Model (10), differ from the coefficients $\gamma$ in Model (1) because the two logistic regression models have different dependent variables. Based on (8), expression (10) implies that $\pi_i^{(c)}$ is being modeled as $\exp(\beta^T x_i)$, which differs from the RDW/CLW model in (1) where $\pi_i^{(c)} = \exp(\gamma^T x_i)/\{1+\exp(\gamma^T x_i)\}$. The corresponding “likelihood” function can be written as

$$L^*(\beta) = \prod_{i \in s_c^* \cup \mathrm{FP}} p_i^{R_i} (1-p_i)^{(1-R_i)}, \tag{11}$$

where $R_i$ indicates the membership of $s_c^*$ in $s_c^* \cup \mathrm{FP}$ ($= 1$ if $i \in s_c^*$; 0 if $i \in \mathrm{FP}$). We put “likelihood” in quotes because $L^*(\beta)$ varies depending on which set of units is selected for $s_c^*$. This contrasts with the population likelihood in (2), which applies regardless of which sample is selected. Note that $L^*(\beta)$ in (11) is written as if the units are independent when they are not. This is a standard procedure in pseudo-maximum likelihood estimation, and the resulting parameter estimators remain design-consistent even when some units may be correlated due to, for example, clustering.9,10 The quantity $L^*(\beta)$ should be viewed as motivation for developing the estimating equations given below in (13). The log-likelihood generated from $L^*(\beta)$ is

$$l^*(\beta) = \sum_{i \in s_c^* \cup \mathrm{FP}} \{R_i \log p_i + (1-R_i)\log(1-p_i)\} = \sum_{i \in \mathrm{FP}} \delta_i^{(c)} \log p_i + \sum_{i \in \mathrm{FP}} \log(1-p_i). \tag{12}$$

Notice that the randomness of $L^*(\beta)$ and $l^*(\beta)$ comes from the cohort selection, that is, $\delta_i^{(c)}$ in the first summand of the last expression in (12). In reality, since the unit-level information of FP is unknown, we replace the second summand in (12) by a survey sample estimate, $\sum_{i \in s_p} d_i \log(1-p_i)$, and obtain the maximum pseudo-likelihood estimator $\hat{\beta}$ by solving the pseudo-estimating equation

$$\tilde{S}^*(\beta) = \frac{1}{N+n_c}\left\{\sum_{i \in s_c} (1-p_i) x_i - \sum_{i \in s_p} d_i p_i x_i\right\} = 0. \tag{13}$$

Assuming that $p_i$ is bounded by 0 and 1/2 implies that $\pi_i^{(c)}$ is automatically bounded by 0 and 1. Note that (13) falls in a general class of estimating equations that ensure unique solutions of parameters4,11,12 (eg, equation (6) in CLW4 if their function $h(x_i, \theta)$ is set equal to $x_i\{1+\pi_i^{(c)}\}^{-1}$).

The ALP estimator of $\mu$ is

$$\hat{\mu}_{ALP} = \frac{\sum_{i \in s_c} w_i^{ALP} y_i}{\sum_{i \in s_c} w_i^{ALP}}, \tag{14}$$

where $w_i^{ALP} = 1/\pi_i^{(c)}(\hat{\beta})$ for $i \in s_c$. Although $L^*(\beta)$ is not a standard likelihood, $\hat{\mu}_{ALP}$ is a consistent estimator of the population mean, as shown in the theorem below.

We consider the following limiting process for the theoretical development.4,13 Suppose there is a sequence of FPs $\mathrm{FP}_k$ of size $N_k$, for $k = 1, 2, \ldots$. Cohort $s_{c,k}$ of size $n_{c,k}$ and survey sample $s_{p,k}$ of size $n_{p,k}$ are sampled from $\mathrm{FP}_k$. The sizes of the FP, the cohort, and the survey sample satisfy $\lim_{k \to \infty} n_{t,k}/N_k = f_t$, where $t = c$ or $p$ and $0 < f_t \le 1$ (regularity condition C1 in Appendix A). In the following, the index $k$ is suppressed for simplicity.

Theorem.

Consistency of ALP estimator of FP mean (see Appendix B).

Under assumptions A1 to A3 and regularity conditions C1 to C5 in Appendix A, and assuming logistic regression model (10) for $p_i$, the ALP estimate $\hat{\mu}_{ALP}$ is design consistent for $\mu$, in particular $\hat{\mu}_{ALP} - \mu = O_p(n_c^{-1/2})$, with the FP variance

$$\mathrm{Var}(\hat{\mu}_{ALP}) \doteq N^{-2} \sum_{i \in \mathrm{FP}} p_i(1-2p_i)\left\{\frac{(y_i-\mu)}{p_i} - b^T x_i\right\}^2 + b^T D b, \tag{15}$$

where $p_i = \mathrm{expit}(\beta^T x_i)$, $b^T = \left\{\sum_{i \in \mathrm{FP}} (y_i-\mu) x_i^T\right\}\left\{\sum_{i \in \mathrm{FP}} p_i x_i x_i^T\right\}^{-1}$, and $D = N^{-2} V_p\left(\sum_{i \in s_p} d_i p_i x_i\right)$ is the design-based variance-covariance matrix under the probability sampling design for $s_p$.

In practice, the ALP estimator of a FP mean can be obtained by three steps:

Step 1 Search for covariates $x$ available in both the cohort ($s_c$) and the reference survey sample ($s_p$) and combine the two samples. Assign $R_i = 1$ for $i \in s_c$ and $R_i = 0$ for $i \in s_p$ in the combined sample.

Step 2 Fit a logistic regression model for $p_i = P(R_i = 1)$ in the combined $s_c$ and weighted $s_p$, with the survey sample weights $\{d_i, i \in s_p\}$ and a weight of one for each unit in $s_c$, and obtain the estimate $\hat{p}_i$ for $i \in s_c$.

Step 3 Estimate the FP mean by Formula (14) with the ALP pseudo-weight $w_i^{ALP} = (1-\hat{p}_i)/\hat{p}_i$ for $i \in s_c$, that is, the inverse of the estimated odds $\hat{p}_i/(1-\hat{p}_i)$, which by (8) estimate $\pi_i^{(c)}$.

Notice that Step 2 can be accomplished by any existing survey software, such as svyglm in the survey package of R, svy:logit in Stata, and PROC SURVEYLOGISTIC in SAS. In addition to being easy to implement, the ALP estimator from (14) does not require conditions (i) or (ii), unlike RDW. Moreover, we show that in large samples $\hat{\mu}_{ALP}$, with $\mathrm{Var}(\hat{\mu}_{ALP}) = O(n_c^{-1})$, is at least as efficient as $\hat{\mu}_{CLW}$, whose variance $\mathrm{Var}(\hat{\mu}_{CLW}) = O\{\min(n_p, n_c)^{-1}\}$ depends on both the nonprobability and probability sample sizes, each under its own correctly specified propensity model (see Appendix C).
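The three steps can be sketched end to end on a small simulated population. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single covariate, a hand-written Newton-Raphson solver (`fit_weighted_logit`, a hypothetical stand-in for the survey software named above), a simple random reference sample with constant weights, and arbitrary population values chosen only for the demonstration.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_weighted_logit(x, r, w, iters=25):
    """Newton-Raphson for a weighted logistic model with an intercept and one
    covariate. The weighted score sum_i w_i (r_i - p_i)(1, x_i) matches the
    form of estimating equation (13) when w = 1 in s_c and w = d_i in s_p."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri, wi in zip(x, r, w):
            p = expit(b0 + b1 * xi)
            g0 += wi * (ri - p)
            g1 += wi * (ri - p) * xi
            v = wi * p * (1.0 - p)
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def alp_demo(seed=2021, N=20000, n_p=2000):
    rng = random.Random(seed)
    x = [rng.random() for _ in range(N)]
    y = [xi + rng.gauss(0.0, 0.2) for xi in x]
    mu = sum(y) / N  # true finite population mean
    # Assumed participation rates pi = exp(b0 + b1 * x), the form implied by (10)
    pi_c = [math.exp(-3.0 + 1.5 * xi) for xi in x]
    cohort = [i for i in range(N) if rng.random() < pi_c[i]]  # Poisson sampling
    survey = rng.sample(range(N), n_p)  # simple random reference sample
    d = N / n_p  # constant survey weight
    # Step 1: combine the samples; R = 1 for the cohort, 0 for the survey
    xs = [x[i] for i in cohort] + [x[i] for i in survey]
    rs = [1.0] * len(cohort) + [0.0] * len(survey)
    ws = [1.0] * len(cohort) + [d] * len(survey)
    # Step 2: weighted logistic regression for p_i = P(R_i = 1)
    b0, b1 = fit_weighted_logit(xs, rs, ws)
    # Step 3: pseudo-weights are the inverse estimated odds, (1 - p) / p
    p_hat = [expit(b0 + b1 * x[i]) for i in cohort]
    wts = [(1.0 - p) / p for p in p_hat]
    alp = sum(wi * y[i] for wi, i in zip(wts, cohort)) / sum(wts)
    naive = sum(y[i] for i in cohort) / len(cohort)
    return mu, naive, alp

mu, naive, alp = alp_demo()
print(mu, naive, alp)
```

In this toy setting the unweighted cohort mean overstates the population mean because units with larger `x` participate more often, while the pseudo-weighted mean should land much closer to it. With real survey data, the design features (strata, clusters, unequal weights) would be handled by the survey software rather than by a hand-written solver.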

An alternative method would be to omit the odds transformation, that is, to use $p_i$ directly to approximate the participation rate $\pi_i^{(c)}$. Denote this method by FDW for full design weight, in contrast to the scaling of the survey sample weights in the RDW method. Comparing the expectation of the population log-likelihood function $l(\gamma)$ in (3) with the expectation of the pseudo log-likelihood under the FDW method, obtained from $l^*(\beta)$ in (12) with $\pi_i^{(c)}$ replacing $p_i$, that is, $\tilde{l}^*(\gamma) = \sum_{i \in s_c} \log \pi_i^{(c)} + \sum_{i \in s_p} d_i \log(1-\pi_i^{(c)})$, we have their difference, denoted by $\Delta_{FDW}$, written as

$$\Delta_{FDW} = E\{\tilde{l}^*(\gamma)\} - E\{l(\gamma)\} = \sum_{i \in \mathrm{FP}} \pi_i^{(c)} \log \pi_i^{(c)} + \sum_{i \in \mathrm{FP}} \log\{1-\pi_i^{(c)}\} - \sum_{i \in \mathrm{FP}} \pi_i^{(c)} \log \pi_i^{(c)} - \sum_{i \in \mathrm{FP}} (1-\pi_i^{(c)}) \log\{1-\pi_i^{(c)}\} = \sum_{i \in \mathrm{FP}} \pi_i^{(c)} \log\{1-\pi_i^{(c)}\}.$$

The bias is approximately zero only if $\pi_i^{(c)}$ for $i \in \mathrm{FP}$ are all close to zero. Thus, the odds transformation step in ALP could be skipped if all nonprobability participation rates are extremely small; but, in general, that step is essential for unbiased estimation.
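To see how quickly this term grows with the participation rate, the following sketch evaluates the per-unit contribution $\pi \log(1-\pi)$ at a few rates. The function name `delta_fdw` is hypothetical, and the rates are simply the overall participation rates used later in the simulations, applied here to a single unit for illustration.

```python
import math

def delta_fdw(pi):
    """Delta_FDW = sum_i pi_i * log(1 - pi_i) for a list of participation rates."""
    return sum(p * math.log(1.0 - p) for p in pi)

# Per-unit contribution at a few illustrative participation rates
for rate in (0.005, 0.05, 0.10, 0.20):
    print(rate, delta_fdw([rate]))
```

The contribution is of order $-\pi^2$ for small $\pi$, so it is negligible at a rate of 0.5% but no longer so at 10% or 20%.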

2.4 |. Variance estimation

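Using the FP variance formula (15), the first summand can be consistently estimated by

$$\{\hat{N}^{(c)}\}^{-2} \sum_{i \in s_c} (1-\hat{p}_i)(1-2\hat{p}_i)\left\{\frac{(y_i-\hat{\mu}_{ALP})}{\hat{p}_i} - \hat{b}^T x_i\right\}^2, \tag{16}$$

where $\hat{p}_i$ is the prediction for $i \in s_c$, $\hat{N}^{(c)} = \sum_{i \in s_c} w_i^{ALP}$, and $\hat{b}^T = \left\{\sum_{i \in s_c} (y_i-\hat{\mu}_{ALP}) x_i^T\right\}\left\{\sum_{i \in s_c} \hat{p}_i x_i x_i^T\right\}^{-1}$. The second summand $b^T D b$ is estimated by $\hat{b}^T \hat{D} \hat{b}$, where $\hat{D}$ is the survey design consistent variance estimator of $D$. For example, under stratified multistage cluster sampling with $H$ strata and $a_h$ primary sampling units (PSUs) in stratum $h$ selected with replacement,

$$\hat{D} = \{\hat{N}^{(p)}\}^{-2} \sum_{h=1}^{H} \frac{a_h}{a_h-1} \sum_{l=1}^{a_h} (z_l-\bar{z})(z_l-\bar{z})^T, \tag{17}$$

where $\hat{N}^{(p)} = \sum_{i \in s_p} d_i$, $z_l = \sum_{i \in s_p(hl)} d_i \hat{p}_i x_i$ is the weighted PSU total for cluster $l$ in stratum $h$, $s_p(hl)$ is the set of sample elements in stratum $h$ and cluster $l$, and $\bar{z} = a_h^{-1} \sum_{l=1}^{a_h} z_l$ is the mean of the PSU totals in stratum $h$.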
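The between-PSU calculation in (17) can be illustrated for the simplest case of a scalar $z_l$ (a single covariate), where the outer product reduces to a square. The function name `psu_variance` and the toy PSU totals are assumptions for illustration; a real analysis would use the vector-valued totals and the design information of the reference survey.

```python
def psu_variance(psu_totals_by_stratum, n_hat):
    """Scalar version of (17): with-replacement between-PSU variance.

    psu_totals_by_stratum: one list per stratum holding the weighted PSU
    totals z_l; each stratum needs at least two PSUs.
    n_hat: estimated population size, the sum of the survey weights.
    """
    total = 0.0
    for totals in psu_totals_by_stratum:
        a_h = len(totals)
        z_bar = sum(totals) / a_h  # mean of PSU totals in stratum h
        ss = sum((z - z_bar) ** 2 for z in totals)
        total += a_h / (a_h - 1) * ss
    return total / n_hat ** 2

# Toy example: two strata with 2 and 3 PSUs
toy = [[1.0, 3.0], [2.0, 2.0, 5.0]]
print(psu_variance(toy, n_hat=1.0))  # 2/1 * 2 + 3/2 * 6 = 13
```

The requirement of at least two PSUs per stratum is the reason strata with a single PSU are typically collapsed with a neighboring stratum before variance estimation.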

2.5 |. Scaling survey weights in the likelihood for the ALP method

The proposed ALP can flexibly scale the survey weights in estimating equation (13) to improve efficiency. For case-control studies, Scott and Wild14 and Li et al15 previously used the technique we propose below to reduce variances of estimates of relative risks when weights for cases and controls are substantially different. We multiply the second summand in $\tilde{S}^*(\beta)$ by a constant $\lambda$, say $\lambda = n_c/\left(\sum_{i \in s_p} d_i\right)$, so that the sum of the scaled survey weights ($\lambda d_i$) is $n_c$. Accordingly, the score function becomes

$$\tilde{S}_\lambda^*(\beta) = \sum_{i \in s_c} (1-p_i) x_i - \lambda \sum_{i \in s_p} d_i p_i x_i. \tag{18}$$

Solving $\tilde{S}_\lambda^*(\beta) = 0$ for $\beta$ gives the vector of estimates $\hat{\beta}_\lambda = (\hat{\beta}_{0,\lambda}, \hat{\beta}_{1,\lambda})$, where $\hat{\beta}_{0,\lambda}$ is the estimate of the intercept. Derivations similar to those in Chambers and Skinner10 and Beaumont11 can be used to prove that $\hat{\beta}_{1,\lambda}$ is design-consistent with various efficiency gains, depending on the variability of the survey weights vs the nonprobability sample weights (with implicit common value of 1). However, the estimate of the intercept $\hat{\beta}_{0,\lambda}$ can be badly biased with scaled weights. As a result, the estimate of the participation rate $\exp(\hat{\beta}_\lambda^T x_i)$, which includes $\hat{\beta}_{0,\lambda}$, would also be biased. The bias of $\hat{\beta}_{0,\lambda}$, however, would not affect the estimate of the population mean because the scaled ALP-weighted mean, $\hat{\mu}_{ALP.S}$,

$$\hat{\mu}_{ALP.S} = \frac{\sum_{i \in s_c} w_i^{ALP.S} y_i}{\sum_{i \in s_c} w_i^{ALP.S}} = \frac{\sum_{i \in s_c} \exp^{-1}(\hat{\beta}_{1,\lambda}^T x_i)\, y_i}{\sum_{i \in s_c} \exp^{-1}(\hat{\beta}_{1,\lambda}^T x_i)},$$

depends on $\hat{\beta}_{1,\lambda}$ but not on $\hat{\beta}_{0,\lambda}$, where $w_i^{ALP.S} = \exp^{-1}(\hat{\beta}_{1,\lambda}^T x_i)$ is the scaled ALP pseudo-weight.

It can be proved that $\hat{\mu}_{ALP.S}$ is a consistent estimator of the FP mean, $\mu$. The TL variance estimator of $\hat{\mu}_{ALP.S}$ can be obtained by substituting $w_i^{ALP}$, $\hat{\mu}_{ALP}$, $\hat{\beta}$, $\hat{p}_i$, and $d_i$ by $w_i^{ALP.S}$, $\hat{\mu}_{ALP.S}$, $\hat{\beta}_\lambda$, $\hat{p}_{i,\lambda} = \exp(\hat{\beta}_\lambda^T x_i)$, and $\lambda d_i$, respectively, in Formulae (16) and (17). Details on the variance and the consistency of ALP.S are discussed in the dissertation by Wang.16
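The reason the intercept does not matter for the weighted mean is that it enters every pseudo-weight as the same multiplicative factor, which cancels in the ratio. The short sketch below demonstrates this with one covariate; the function name `alp_weighted_mean` and the toy values are hypothetical.

```python
import math

def alp_weighted_mean(y, x, b0, b1):
    """Ratio mean with weights exp(-(b0 + b1 * x)); exp(-b0) cancels in the ratio."""
    w = [math.exp(-(b0 + b1 * xi)) for xi in x]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Toy data: the same slope with two very different intercepts
x = [0.1, 0.5, 0.9]
y = [1.0, 2.0, 4.0]
mean_a = alp_weighted_mean(y, x, b0=-3.0, b1=1.2)
mean_b = alp_weighted_mean(y, x, b0=2.0, b1=1.2)
print(mean_a, mean_b)  # the two means agree up to rounding error
```

The same cancellation would not hold for an estimator of a population total, which depends on the level of the weights and hence on the intercept.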

3 |. SIMULATIONS

3.1 |. FP generation and sample selection

We applied simulation setups similar to those in Chen et al.4 In the FP of size $N = 500\,000$, a vector of covariates $x_i = (x_{1i}, x_{2i}, x_{3i}, x_{4i})^T$ was generated for $i \in \mathrm{FP}$, where $x_{1i} = v_{1i}$, $x_{2i} = v_{2i} + 0.3x_{1i}$, $x_{3i} = v_{3i} + 0.2(x_{1i} + x_{2i})$, $x_{4i} = v_{4i} + 0.1(x_{1i} + x_{2i} + x_{3i})$, with $v_{1i} \sim \mathrm{Bernoulli}(0.5)$, $v_{2i} \sim \mathrm{Uniform}(0, 2)$, $v_{3i} \sim \mathrm{Exponential}(1)$, and $v_{4i} \sim \chi^2(4)$. The variable of interest was $y_i \sim \mathrm{Normal}(\mu_i, 1)$, where $\mu_i = -x_{1i} - x_{2i} + x_{3i} + x_{4i}$ for $i \in \mathrm{FP}$. The parameter of interest was the FP mean $\mu = N^{-1}\sum_{i \in \mathrm{FP}} y_i = 3.97$.
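The population-generating model can be sketched as follows. This is an illustrative reading of the description above, assuming that the Exponential(1) distribution has rate 1 and that the mean function is $-x_{1i} - x_{2i} + x_{3i} + x_{4i}$, which gives an expected value of about 3.98 and so is consistent with the reported FP mean of 3.97. The function name `generate_fp` and the reduced population size are choices made only to keep the example fast.

```python
import random

def generate_fp(N, seed=0):
    """Generate (x1, x2, x3, x4, y) tuples following the simulation description."""
    rng = random.Random(seed)
    rows = []
    for _ in range(N):
        x1 = 1.0 if rng.random() < 0.5 else 0.0        # Bernoulli(0.5)
        x2 = rng.uniform(0.0, 2.0) + 0.3 * x1
        x3 = rng.expovariate(1.0) + 0.2 * (x1 + x2)
        # chi-square(4) is a Gamma distribution with shape 2 and scale 2
        x4 = rng.gammavariate(2.0, 2.0) + 0.1 * (x1 + x2 + x3)
        mu_i = -x1 - x2 + x3 + x4
        y = rng.gauss(mu_i, 1.0)
        rows.append((x1, x2, x3, x4, y))
    return rows

fp = generate_fp(50000, seed=1)  # smaller than the article's N = 500 000
fp_mean = sum(row[4] for row in fp) / len(fp)
print(fp_mean)
```

The expected value follows from the component means: 0.5 for $x_1$, 1.15 for $x_2$, 1.33 for $x_3$, and about 4.30 for $x_4$, so that $-0.5 - 1.15 + 1.33 + 4.30 \approx 3.98$.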

The probability-based survey sample $s_p$ with the target sample size $n_p = 12\,500$ (sampling fraction $f_p = 2.5\%$) was selected by Poisson sampling, with inclusion probability $\pi_i^{(p)} = (n_p q_i)/\sum_{i \in \mathrm{FP}} q_i$ for $i \in \mathrm{FP}$, where $q_i = \mathrm{const} + x_{3i} + 0.03y_i$ to control the variation of the survey weights, $d_i = 1/\pi_i^{(p)}$. We set $\mathrm{const} = -0.26$ so that $\max q_i/\min q_i = 20$.

As noted in Section 2, the ALP and CLW methods assume somewhat different models for the participation rate. Thus, it is of interest to check their performance both when their underlying models are correct and when the assumed participation rate models fail. The volunteer-based nonprobability sample $s_c$ (with a target sample size $n_c$) was also selected by Poisson sampling but with different inclusion probabilities $\pi_i^{(c)}$ for $i \in \mathrm{FP}$. We considered two scenarios with different functional forms of $\pi_i^{(c)}$ so that the ALP (and FDW) or the CLW method had the true linear logistic regression propensity model in one scenario but not in the other. In Scenario 1, $\pi_i^{(c)} = \exp(\beta_0 + \beta^T x_i)$ was the specified participation rate for the $i$th population unit to be included in the nonprobability sample. The underlying true propensity model for the ALP (and FDW) methods, shown in (10), was $\mathrm{logit}(p_i) = \log\{\pi_i^{(c)}\} = \beta_0 + \beta^T x_i$, which implies $\mathrm{logit}\{\pi_i^{(c)}\} = \beta_0 + \beta^T x_i - \log\{1-\pi_i^{(c)}\}$. This model differs from the underlying linear model (1) assumed by the CLW method by the additional term $-\log\{1-\pi_i^{(c)}\}$. In Scenario 2, $\pi_i^{(c)} = \mathrm{expit}(\gamma_0 + \gamma^T x_i)$ was specified so that $\mathrm{logit}\{\pi_i^{(c)}\} = \gamma_0 + \gamma^T x_i$, which was the model (1) assumed by the CLW method. This model, however, implied that $\mathrm{logit}(p_i) = \gamma_0 + \gamma^T x_i + \log\{1-\pi_i^{(c)}\}$, which was different from the model assumed by the ALP and the FDW methods (by the extra term $\log\{1-\pi_i^{(c)}\}$). Hence, ALP and CLW estimates of the population mean are expected to be unbiased in one scenario but not the other, since both methods assume a linear logistic propensity model. The biases of the FDW and RDW estimates, as measured by $\Delta_{FDW}$ and $\Delta_{RDW}$, depend on $\pi_i^{(c)}$ and go to 0 as $\pi_i^{(c)}$ approaches 0. The biases become larger as $\pi_i^{(c)}$ increases in either scenario.

In both scenarios, the coefficients were set to be $\beta = \gamma = (0.18, 0.18, -0.27, -0.27)^T$. The parameters were chosen so that $0 < \pi_i^{(c)} < 1$ for all units $i \in \mathrm{FP}$. The intercepts $\beta_0$ and $\gamma_0$ were also controlled so that the expected number of nonprobability sample units $E_c(n_c) = \sum_{i \in \mathrm{FP}} \pi_i^{(c)}$ took the values 2500, 25 000, 50 000, and 100 000, with the corresponding overall participation rate $f_c = E_c(n_c)/N$ being 0.5%, 5%, 10%, or 20%.

3.2 |. Evaluation criteria

We examined the performance of five IPSW estimators of the FP mean $\mu$: (1) and (2) $\hat{\mu}_{ALP}$ and $\hat{\mu}_{ALP.S}$ described in Sections 2.2 to 2.5; (3) $\hat{\mu}_{FDW}$ using weights from the ALP method omitting the odds transformation; (4) $\hat{\mu}_{CLW}$ proposed by Chen et al4; and (5) $\hat{\mu}_{RDW}$ proposed by Wang et al,6 compared with the naïve nonprobability sample mean ($\hat{\mu}_{Naive}$) that did not use weights, and the weighted nonprobability sample mean, $\hat{\mu}_{TW}$, with weights equal to the inverse of the true nonprobability sample inclusion probabilities. Note that $\hat{\mu}_{TW}$ is unavailable in practice because the true nonprobability sample inclusion probabilities are unknown. Relative bias (%RB), empirical variance (V), and mean squared error (MSE) of the point estimates were used to evaluate the performance of the five IPSW point estimates, calculated by

$$\%RB = \frac{1}{B}\sum_{b=1}^{B} \frac{\hat{\mu}^{(b)}-\mu}{\mu} \times 100, \quad V = \frac{1}{B-1}\sum_{b=1}^{B}\left\{\hat{\mu}^{(b)} - \frac{1}{B}\sum_{b=1}^{B}\hat{\mu}^{(b)}\right\}^2, \quad MSE = \frac{1}{B}\sum_{b=1}^{B}\{\hat{\mu}^{(b)}-\mu\}^2,$$

where $B = 4000$ is the number of simulation runs, $\hat{\mu}^{(b)}$ is one of the point estimates obtained from the $b$th simulated sample, and $\mu$ is the true FP mean.

We also evaluated the variance estimates using the variance ratio (VR) and 95% confidence interval coverage probability (CP), which were calculated as

$$VR = \frac{\frac{1}{B}\sum_{b=1}^{B} \hat{v}^{(b)}}{V} \times 100, \quad \text{and} \quad CP = \frac{1}{B}\sum_{b=1}^{B} I(\mu \in CI^{(b)}),$$

where $\hat{v}^{(b)}$ is the proposed analytical variance estimate in simulated sample $b$, and $CI^{(b)} = \left(\hat{\mu}^{(b)} - 1.96\sqrt{\hat{v}^{(b)}},\ \hat{\mu}^{(b)} + 1.96\sqrt{\hat{v}^{(b)}}\right)$ is the 95% confidence interval from the $b$th simulated sample.
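These evaluation criteria are straightforward to compute from the per-replicate estimates. The helper below is a minimal sketch with a hypothetical name (`evaluate`); it returns the variance ratio as a plain ratio, which would be multiplied by 100 to match the percentage form of the VR formula.

```python
import math

def evaluate(estimates, variances, mu):
    """Compute %RB, V, MSE, VR (as a ratio), and CP from simulation replicates."""
    B = len(estimates)
    mean_est = sum(estimates) / B
    rb = 100.0 * sum((e - mu) / mu for e in estimates) / B
    V = sum((e - mean_est) ** 2 for e in estimates) / (B - 1)
    mse = sum((e - mu) ** 2 for e in estimates) / B
    vr = (sum(variances) / B) / V
    covered = sum(
        1 for e, v in zip(estimates, variances)
        if e - 1.96 * math.sqrt(v) < mu < e + 1.96 * math.sqrt(v)
    )
    return {"RB": rb, "V": V, "MSE": mse, "VR": vr, "CP": covered / B}

# Tiny worked example with two replicates and true mean 2
print(evaluate([1.0, 3.0], [1.0, 1.0], mu=2.0))
```

In the two-replicate example the estimates are symmetric around the true mean, so the relative bias is zero, the empirical variance is 2, the MSE is 1, the variance ratio is 0.5, and both intervals cover the true mean.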

3.3 |. Results

Table 1 presents simulation results for the seven nonprobability sample estimators of the FP mean. The naïve estimator $\hat{\mu}_{Naive}$ that ignored the underlying sampling scheme had relative biases ranging from −36.5% to −42.8%, while the true weighted nonprobability sample estimator, $\hat{\mu}_{TW}$, was approximately unbiased in all scenarios. The variance of $\hat{\mu}_{Naive}$ was much smaller than that of the other estimators, but its bias caused the MSE to be extremely high (not reported).

TABLE 1.

Results from 4000 simulated survey samples and nonprobability samples with low to high participation rates under various propensity score models

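Columns 2 to 6: Scenario 1 (true propensity model for ALP). Columns 7 to 11: Scenario 2 (true propensity model for CLW).

| Estimator | %RB | V (×10^5) | VR | MSE (×10^5) | CP | %RB | V (×10^5) | VR | MSE (×10^5) | CP |
|---|---|---|---|---|---|---|---|---|---|---|
| fc = 0.5% | | | | | | | | | | |
| $\hat{\mu}_{Naive}$ | −42.76 | 0.22 | 0.99 | | | −42.61 | 0.22 | 1.00 | | |
| $\hat{\mu}_{TW}$ | −0.13 | 4.38 | 0.93 | 4.39 | 0.90 | −0.12 | 4.38 | 0.93 | 4.38 | 0.90 |
| $\hat{\mu}_{RDW}$ | −0.29 | 3.73 | 0.93 | 3.75 | 0.87 | −0.40 | 3.63 | 0.93 | 3.66 | 0.87 |
| $\hat{\mu}_{FDW}$ | −0.28 | 3.73 | 0.93 | 3.75 | 0.87 | −0.40 | 3.63 | 0.93 | 3.66 | 0.87 |
| $\hat{\mu}_{ALP}$ | −0.07 | 3.70 | 0.93 | 3.77 | 0.88 | −0.19 | 3.66 | 0.93 | 3.67 | 0.88 |
| $\hat{\mu}_{CLW}$ | 0.05 | 3.87 | 0.93 | 3.87 | 0.89 | −0.07 | 3.76 | 0.93 | 3.76 | 0.88 |
| $\hat{\mu}_{ALP.S}$ | −0.11 | 3.54 | 0.92 | 3.54 | 0.87 | −0.21 | 3.45 | 0.92 | 3.45 | 0.87 |
| fc = 5% | | | | | | | | | | |
| $\hat{\mu}_{Naive}$ | −42.74 | 0.02 | 0.99 | | | −41.21 | 0.02 | 1.01 | | |
| $\hat{\mu}_{TW}$ | −0.04 | 0.50 | 0.98 | 0.50 | 0.92 | −0.02 | 0.46 | 1.00 | 0.47 | 0.93 |
| $\hat{\mu}_{RDW}$ | −2.15 | 0.56 | 1.00 | 1.29 | 0.66 | −3.05 | 0.43 | 1.01 | 1.89 | 0.45 |
| $\hat{\mu}_{FDW}$ | −2.05 | 0.57 | 1.00 | 1.23 | 0.68 | −2.95 | 0.43 | 1.01 | 1.81 | 0.47 |
| $\hat{\mu}_{ALP}$ | −0.01 | 0.62 | 1.00 | 0.62 | 0.94 | −1.03 | 0.47 | 1.01 | 0.64 | 0.85 |
| $\hat{\mu}_{CLW}$ | 1.29 | 0.84 | 1.00 | 1.10 | 0.95 | 0.01 | 0.61 | 1.01 | 0.61 | 0.94 |
| $\hat{\mu}_{ALP.S}$ | −0.05 | 0.45 | 1.00 | 0.45 | 0.92 | −0.63 | 0.35 | 1.02 | 0.41 | 0.86 |
| fc = 10% | | | | | | | | | | |
| $\hat{\mu}_{Naive}$ | −42.74 | 0.01 | 1.11 | | | −39.65 | 0.01 | 1.11 | | |
| $\hat{\mu}_{TW}$ | −0.01 | 0.25 | 1.02 | 0.25 | 0.94 | −0.01 | 0.22 | 1.01 | 0.22 | 0.94 |
| $\hat{\mu}_{RDW}$ | −4.25 | 0.34 | 1.00 | 3.20 | 0.17 | −5.62 | 0.22 | 0.99 | 5.20 | 0.02 |
| $\hat{\mu}_{FDW}$ | −3.87 | 0.35 | 1.00 | 2.71 | 0.24 | −5.28 | 0.22 | 0.99 | 4.62 | 0.03 |
| $\hat{\mu}_{ALP}$ | 0.01 | 0.42 | 1.00 | 0.42 | 0.95 | −1.81 | 0.27 | 0.99 | 0.79 | 0.65 |
| $\hat{\mu}_{CLW}$ | 2.94 | 0.80 | 1.00 | 2.16 | 0.76 | 0.03 | 0.42 | 0.99 | 0.42 | 0.95 |
| $\hat{\mu}_{ALP.S}$ | −0.03 | 0.27 | 1.02 | 0.27 | 0.94 | −0.86 | 0.18 | 1.01 | 0.29 | 0.81 |
| fc = 20% | | | | | | | | | | |
| $\hat{\mu}_{Naive}$ | −42.75 | 0.00 | 1.26 | | | −36.50 | 0.01 | 1.21 | | |
| $\hat{\mu}_{TW}$ | −0.02 | 0.15 | 0.93 | 0.15 | 0.93 | −0.02 | 0.11 | 0.96 | 0.11 | 0.93 |
| $\hat{\mu}_{RDW}$ | −8.58 | 0.21 | 0.95 | 11.83 | 0.00 | −9.59 | 0.10 | 0.97 | 14.60 | 0.00 |
| $\hat{\mu}_{FDW}$ | −7.15 | 0.23 | 0.96 | 8.29 | 0.01 | −8.51 | 0.11 | 0.98 | 11.53 | 0.00 |
| $\hat{\mu}_{ALP}$ | 0.00 | 0.32 | 0.96 | 0.32 | 0.95 | −2.85 | 0.16 | 0.98 | 1.44 | 0.19 |
| $\hat{\mu}_{CLW}$ | 7.80 | 1.67 | 0.92 | 11.27 | 0.06 | 0.01 | 0.33 | 0.98 | 0.33 | 0.95 |
| $\hat{\mu}_{ALP.S}$ | −0.03 | 0.19 | 0.96 | 0.19 | 0.94 | −1.02 | 0.10 | 1.00 | 0.27 | 0.73 |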

Abbreviations: ALP, adjusted logistic propensity; CP, coverage probability; MSE, mean squared error; VR, variance ratio.

Consistent with the bias theory in Section 2, the RDW point estimator $\hat{\mu}_{RDW}$ and the FDW point estimator $\hat{\mu}_{FDW}$ were approximately unbiased when $\pi_i^{(c)}$ was small for all $i \in \mathrm{FP}$ and the overall participation rate $f_c = N^{-1}\sum_{i \in \mathrm{FP}} \pi_i^{(c)}$ was low, but became more biased as $f_c$ increased. The CPs decreased correspondingly.

As expected, the ALP estimators $\hat{\mu}_{ALP}$ and $\hat{\mu}_{ALP.S}$ and the CLW estimator $\hat{\mu}_{CLW}$ provided approximately unbiased point estimates in the scenarios where they were expected to be unbiased, that is, Scenario 1 for $\hat{\mu}_{ALP}$ and $\hat{\mu}_{ALP.S}$, and Scenario 2 for $\hat{\mu}_{CLW}$. When the underlying model was incorrect for an estimator, biases occurred. For example, the relative biases of $\hat{\mu}_{CLW}$ in Scenario 1 were 0.05%, 1.29%, 2.94%, and 7.80% as $f_c$ increased from 0.5% to 5%, 10%, and 20%, respectively. In Scenario 2, the corresponding relative biases for $\hat{\mu}_{ALP}$ were −0.19%, −1.03%, −1.81%, and −2.85%.

Consistent with the theory in Section 2, the ALP estimator $\hat{\mu}_{ALP}$ was more efficient than $\hat{\mu}_{CLW}$, with consistently smaller empirical variances in all scenarios, especially when the nonprobability cohort size was much larger than the probability sample size. Among all considered methods, $\hat{\mu}_{ALP.S}$ was approximately unbiased with the smallest variance under Scenario 1, where its model was correct. Under Scenario 2, where its model was misspecified, $\hat{\mu}_{ALP.S}$ was biased but the most efficient, and therefore achieved the smallest MSE.

The variance estimators for $\hat{\mu}_{ALP}$, $\hat{\mu}_{ALP.S}$, and $\hat{\mu}_{CLW}$ performed very well (with VRs near 1), providing CPs close to the nominal level under the correct propensity models when $f_c$ was large. The coverage below the nominal level (about 88%) when $f_c = 0.5\%$ was due to small-sample bias with skewed distributions of the underlying sampling weights in the selected nonprobability sample.

4 |. REAL DATA EXAMPLE

We use the same data example as Wang et al6 for illustration purposes. We estimated prospective 15-year all-cause, all-cancer, and heart disease mortality rates for adults in the US using the adult household interview part of the Third U.S. National Health and Nutrition Examination Survey (NHANES III), conducted from 1988 to 1994, with sample size nc = 20 050. We ignored all complex design features of NHANES III and treated it as a nonprobability sample. The coefficient of variation of sample weights is 125%, indicating highly variable selection probabilities, and thus low representativeness of the unweighted sample. For estimating mortality rates, we made the approximation that the entire sample of NHANES III was randomly selected in 1991 (the midpoint of the data collection time period).

For the reference survey, we used 1994 U.S. National Health Interview Survey (NHIS) respondents to the supplement for monitoring achievement of the Healthy People Year 2000 objectives. Adults aged 18 and older are included (sample size np = 19 738). The 1994 NHIS used a multistage stratified cluster sample design with 125 strata and 248 pseudo-PSUs.17,18 We collapsed strata with only one PSU with the next nearest stratum for variance estimation purposes.19 Both samples of NHANES III and NHIS were linked to National Death Index (NDI) for mortality, allowing us to quantify the relative bias of unweighted NHANES estimates, assuming the NHIS estimates as the gold standard. Notice that the mortality information was obtained by statistical linkage between the survey sample and NDI,20 but not responses from the questionnaires. The all-cancer and heart-disease mortality were classified according to National Center for Health Statistics death code.21,22

Using NHANES III as the “nonprobability cohort” has several advantages for illuminating the performance of the propensity weighting methods. The “nonprobability sample” and the reference survey sample have approximately the same target population, the same data collection mode, and similar questionnaires. This ensures that the pseudo-weighted “nonprobability sample” could potentially represent the target population, and thus enables us to characterize the performance of the propensity weighting methods in real data.

The distributions of selected common covariates and variables of interest in the two samples are presented in Table 2. As expected, the variables in the weighted NHANES and 1994 NHIS samples have very close distributions because both weighted samples represent approximately the same FP. By contrast, the covariate distributions in the unweighted NHANES differ considerably from those in the weighted samples, especially for design variables such as age, race/ethnicity, poverty, and region, which leads to large biases in mortality rates estimated from the unweighted NHANES.

TABLE 2.

Distribution of selected common variables in NHANES III and NHIS

| Variable | Category | NHIS 1994: % | NHIS 1994: Weighted % | NHANES III: % | NHANES III: Weighted % |
|---|---|---|---|---|---|
| Total count | | np = 19738 | N̂p = 189608549 | nc = 20050 | N̂c = 187647206 |
| Age group | 18–24 years | 10.5 | 13.3 | 15.8 | 15.8 |
| | 25–44 years | 42.9 | 43.7 | 35.4 | 43.7 |
| | 45–64 years | 26.1 | 26.6 | 22.6 | 24.6 |
| | 65 years and older | 20.5 | 16.4 | 26.2 | 16.0 |
| Race | NH-White | 76.1 | 75.9 | 42.3 | 76.0 |
| | NH-Black | 12.6 | 11.2 | 27.4 | 11.2 |
| | Hispanic | 8.0 | 9.0 | 28.9 | 9.3 |
| | NH-other | 3.3 | 4.0 | 1.5 | 3.5 |
| Region | Northeast | 20.7 | 20.5 | 14.6 | 20.8 |
| | Midwest | 26.1 | 25.1 | 19.2 | 24.1 |
| | South | 31.5 | 32.5 | 42.7 | 34.3 |
| | West | 21.6 | 21.9 | 23.5 | 20.9 |
| Poverty | No | 79.1 | 82.3 | 67.9 | 80.3 |
| | Yes | 13.1 | 10.6 | 21.4 | 12.1 |
| | Unknown | 7.8 | 7.0 | 10.7 | 7.6 |
| Education | Lower than high school | 20.1 | 19.1 | 42.5 | 26.6 |
| | High school/Some college | 58.7 | 59.6 | 45.9 | 54.1 |
| | College or higher | 21.2 | 21.3 | 11.6 | 19.3 |
| Health status | Excellent/Very good | 60.5 | 62.0 | 39.0 | 51.6 |
| | Good | 25.7 | 25.7 | 35.9 | 32.7 |
| | Fair/Poor | 13.8 | 12.3 | 25.1 | 15.7 |
| Mortality | All-cause | 20.8 | 17.6 | 26.7 | 17.1 |
| | Heart-disease | 9.43 | 5.69 | 4.95 | 4.04 |
| | All-cancer | 5.57 | 4.11 | 5.10 | 4.47 |

The propensity model included main effects of common demographic characteristics (age, sex, race/ethnicity, region, and marital status), socioeconomic status (education level, poverty, and household income), tobacco usage (smoking status and chewing tobacco), health variables (body mass index and self-reported health status), and a quadratic term for age. Appendix D shows the final propensity models for the five considered methods.

To evaluate the performance of the five PS-based methods, we used the relative difference from the NHIS estimate, $\%RD = (\hat{\mu} - \hat{\mu}^{NHIS})/\hat{\mu}^{NHIS} \times 100$, the TL variance estimate ($V$), and the estimated $MSE = (\hat{\mu} - \hat{\mu}^{NHIS})^2 + V$, which treats the NHIS estimate as truth. Table 3 shows that the naïve NHANES III estimate of overall mortality was ~52% biased from the NHIS estimate because older people, who have higher mortality, were oversampled (Table 2). All five IPSW methods substantially reduced the bias from the naïve estimate. Consistent with the simulation results, the ALP, FDW, RDW, and CLW methods yielded close estimates when the sample fraction of the nonprobability sample was small ($\hat{f}_c = n_c/\hat{N}_p = 1.06 \times 10^{-4}$, calculated from Table 2). The ALP.S method, by scaling the NHIS sample weights in propensity estimation, reduced more bias than the other methods and was more efficient. Therefore, the ALP.S estimate had the smallest MSE. The results for all-cancer mortality followed a pattern similar to that for all-cause mortality. All pseudo-weighting methods removed most of the bias of the naïve NHANES estimate (with %RD between −3.21% and 2.07%, reduced from 24.68%). By contrast, for heart-disease mortality, all pseudo-weighting methods were substantially less biased than the naïve estimate, with %RD between 42.58% and 57.78%, reduced from 133.66%, but these estimators still had undesirably large biases. The bias reduction was smaller than that for all-cancer or all-cause mortality, which may be due to the omission from the propensity model of important predictors of both having heart disease and being observed in the nonprobability sample.
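The evaluation measures described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the inputs are the rounded values printed in Tables 2 and 3, so the results will differ slightly from the tabulated %RD and MSE, which were presumably computed from unrounded estimates.

```python
def relative_difference(estimate, reference):
    """Percent relative difference of an estimate from the reference value."""
    return (estimate - reference) / reference * 100.0

def estimated_mse(estimate, reference, variance):
    """Squared difference from the reference plus the variance estimate."""
    return (estimate - reference) ** 2 + variance

# Sampling fraction of the nonprobability sample, from the Table 2 totals.
n_c = 20050
N_p_hat = 189608549
f_c_hat = n_c / N_p_hat

# Rounded all-cause estimates from Table 3, expressed as proportions.
nhis, naive, alp = 0.176, 0.267, 0.186
v_alp = 1.87e-5  # Table 3 reports V multiplied by 10^5

print(f"f_c = {f_c_hat:.3g}")
print(f"%RD (naive) = {relative_difference(naive, nhis):.1f}")
print(f"%RD (ALP)   = {relative_difference(alp, nhis):.1f}")
print(f"MSE (ALP) x 10^5 = {estimated_mse(alp, nhis, v_alp) * 1e5:.2f}")
```

Because the squared difference is of order 10^-4 while V is of order 10^-5 for the all-cause ALP estimate, the estimated MSE in this application is driven mainly by the difference from the NHIS estimate rather than by the variance.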

TABLE 3.

Relative difference (%RD) of all-cause 15-year mortality estimates from the NHIS estimate with estimated variance (V) and mean squared error (MSE)

| Mortality | Method | Estimate (%) | %RD | V (×10^5) | MSE (×10^5) |
|---|---|---|---|---|---|
| All cause | NHIS | 17.6 | | | |
| | Naïve | 26.7 | 52.16 | | |
| | ALP | 18.6 | 6.08 | 1.87 | 13.27 |
| | FDW | 18.6 | 6.08 | 1.87 | 13.28 |
| | RDW | 18.6 | 6.08 | 1.87 | 13.28 |
| | CLW | 18.6 | 6.07 | 1.87 | 13.24 |
| | ALP.S | 17.2 | −2.05 | 1.08 | 2.37 |
| All cancer | NHIS | 4.5 | | | |
| | Naïve | 5.6 | 24.68 | | |
| | ALP | 4.6 | 2.07 | 0.38 | 0.46 |
| | FDW | 4.6 | 2.07 | 0.38 | 0.46 |
| | RDW | 4.6 | 2.07 | 0.38 | 0.46 |
| | CLW | 4.6 | 2.06 | 0.37 | 0.46 |
| | ALP.S | 4.3 | −3.21 | 0.32 | 0.53 |
| Heart disease | NHIS | 4.0 | | | |
| | Naïve | 9.4 | 133.66 | | |
| | ALP | 6.4 | 57.78 | 0.50 | 54.88 |
| | FDW | 6.4 | 57.78 | 0.50 | 54.90 |
| | RDW | 6.4 | 57.78 | 0.50 | 54.90 |
| | CLW | 6.4 | 57.77 | 0.50 | 54.87 |
| | ALP.S | 5.8 | 42.58 | 0.33 | 29.86 |

Abbreviations: ALP, adjusted logistic propensity; RDW, rescaled design weight.

5 |. DISCUSSION

This article proposed ALP weighting methods for population inference using nonprobability samples. The proposed ALP method corrects the bias in the RDW method7 by formulating the problem in an innovative way. As does the RDW method, the proposed ALP method retains the advantage of easy implementation by fitting a propensity model with survey weights in ready-to-use software. The proposed ALP estimators are design consistent if the assumed model for participation rate is correct. TL variance estimators for ALP estimates are derived. Consistency of the ALP FP mean estimators was proved theoretically and evaluated numerically.
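To make the "weighted logistic regression on a stacked sample" idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the ALP pseudo-weight is the inverse odds of the fitted propensity, (1 − p̂)/p̂, that cohort units enter the stacked data with weight 1 and survey units with their design weights, and that the FP mean is estimated by a weighted mean. The helper names, the hand-written Newton-Raphson fit, and the toy data with a single binary covariate are all hypothetical; in practice any software that fits weighted logistic regression could replace `fit_weighted_logit`.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_weighted_logit(x, z, w, n_iter=50, tol=1e-10):
    """Weighted logistic regression of z on (1, x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Score vector (g) and negative Hessian (h) of the weighted log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi, wi in zip(x, z, w):
            p = expit(b0 + b1 * xi)
            r = wi * (zi - p)
            v = wi * p * (1.0 - p)
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        # Solve the 2x2 system h * step = g in closed form.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def alp_weights(x_cohort, x_survey, d_survey):
    """Stack cohort (z=1, weight 1) on survey (z=0, design weight) and fit."""
    x = list(x_cohort) + list(x_survey)
    z = [1] * len(x_cohort) + [0] * len(x_survey)
    w = [1.0] * len(x_cohort) + list(d_survey)
    b0, b1 = fit_weighted_logit(x, z, w)
    # Under the logistic model, (1 - p) / p = exp(-(b0 + b1 * x)).
    return [math.exp(-(b0 + b1 * xi)) for xi in x_cohort]

# Toy data (hypothetical): one binary covariate, four cohort units,
# three survey units whose design weights sum to a population of 100.
x_cohort = [1, 1, 1, 0]
y_cohort = [1, 1, 0, 0]
x_survey = [1, 0, 0]
d_survey = [30.0, 40.0, 30.0]

w_alp = alp_weights(x_cohort, x_survey, d_survey)
mean_naive = sum(y_cohort) / len(y_cohort)
mean_alp = sum(w * y for w, y in zip(w_alp, y_cohort)) / sum(w_alp)
print(w_alp, mean_naive, mean_alp)
```

With a saturated model on a binary covariate, the fitted pseudo-weights within each covariate cell sum to the survey-weighted total of that cell, which gives a simple sanity check on the fit in this toy setting.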

A primary competitor to ALP is the CLW estimator developed by Chen et al.4 If the nonprobability cohort is a small fraction of the population, ALP and CLW are very similar, although ALP does have computational advantages regardless of the size of the sampling fraction. As the sampling fraction increases, ALP and CLW become more distinct.

Both the ALP and CLW methods fit a propensity model to the combined nonprobability sample and a weighted survey sample. Highly variable weights in the combined sample can lead to low efficiency of the estimated propensity model coefficients. Therefore, the variances of the ALP and the CLW estimators of the FP means can be large in some applications. However, the proposed ALP estimator is shown analytically and numerically to have a variance that is less than or equal to that of the CLW estimator regardless of whether the propensity model underlying ALP is correct. It is worth noting that the ALP and CLW methods assume different logistic regression models for propensity score estimation. Propensity is defined as $p_i = P(i \in s_c^* \mid s_c^* \cup FP)$ by ALP in (9) and as $\pi_i = P(i \in s_c \mid FP)$ by CLW in (2). Model diagnostics should be developed to select which propensity model is more appropriate for a given dataset; this will be the focus of our future research.

An alternative ALP with scaled survey weights in the logistic regression propensity model produces consistent propensity estimates and further improves efficiency, as shown in the simulation and the real data example. The scaled ALP had the smallest MSE in every scenario in our simulation study regardless of the underlying model and had the smallest MSE for two of the three mortality causes in our real data application. The CLW estimator with scaled survey weights, albeit more efficient than the unscaled CLW, is biased (simulation results not shown). The extension of the scaling technique to the CLW method and other binary regression propensity models requires further investigation.

The theory for the ALP method implies that $p_i \le 1/2$, since $p_i$ is the probability of nonprobability sample inclusion among the combined FP units and the nonprobability sample, that is, $P(i \in s_c^* \mid s_c^* \cup FP)$. In estimation, however, $\hat{p}_i > 1/2$ can happen, especially in the unusual case where the nonprobability sample $s_c^*$ is a large proportion of the population. Fortunately, this is not a concern for the estimation of population means or regression coefficients. By scaling the survey weights in the propensity model for the ALP method, we can ensure that all ALP weights $w_i^{ALP.S} = 1/\hat{\pi}_i^{ALP.S}$ are greater than one. As proved in Section 2.5, using scaled survey weights, $w_i^{ALP.S} = \lambda \exp^{-1}(\hat{\beta}_\lambda^T x_i) = \exp^{-1}(\hat{\beta}_{0\lambda}^* + \hat{\beta}_{1\lambda}^T x_i)$, would not bias the mean estimates. For population total estimation, an option is to estimate the population mean first and then multiply it by a known or estimated population size from an independent source. Another approach to avoid estimated $p_i > 1/2$ is to solve the pseudo-estimating equations in (14) using a constrained optimization algorithm that requires $\hat{p}_i \le 1/2$ for all units. This would, of course, negate the computational advantage of the ALP estimator.
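The bound on the propensity and the effect of scaling can be written out step by step. The lines below are a sketch under stated assumptions rather than a restatement of the derivation in Section 2.5: they assume that the two propensities are linked through the odds, that the logistic model is written with an explicit intercept, that $\exp^{-1}(t)$ denotes $1/\exp(t)$, and that the FP mean is estimated by a weighted mean over the nonprobability sample. Reading the starred intercept as $\hat{\beta}_{0\lambda} - \log\lambda$ is our interpretation of the notation.

```latex
\begin{align*}
p_i &= \frac{\pi_i}{1 + \pi_i}, \qquad 0 \le \pi_i \le 1
      \;\Longrightarrow\; p_i \le \tfrac{1}{2}, \\
w_i^{ALP} &= \frac{1}{\hat{\pi}_i} = \frac{1 - \hat{p}_i}{\hat{p}_i}
      = \exp^{-1}\!\big(\hat{\beta}_0 + \hat{\beta}_1^T x_i\big),
      \qquad w_i^{ALP} \ge 1 \iff \hat{p}_i \le \tfrac{1}{2}, \\
\lambda \exp^{-1}\!\big(\hat{\beta}_{0\lambda} + \hat{\beta}_{1\lambda}^T x_i\big)
      &= \exp^{-1}\!\big(\hat{\beta}_{0\lambda} - \log\lambda
         + \hat{\beta}_{1\lambda}^T x_i\big), \\
\frac{\sum_{i \in s_c} (\lambda w_i)\, y_i}{\sum_{i \in s_c} \lambda w_i}
      &= \frac{\sum_{i \in s_c} w_i y_i}{\sum_{i \in s_c} w_i}.
\end{align*}
```

The third line shows that multiplying a weight by $\lambda$ is the same as shifting the intercept by $-\log\lambda$, and the fourth shows that a common multiplicative factor cancels in a weighted mean. Together they suggest why a change confined to the intercept leaves mean estimates unaffected while totals still require a separate population size.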

Both ALP and CLW are inverse-propensity-score-weighting methods that directly use (functions of) the propensity score to estimate the cohort participation rate. They can be sensitive to propensity model misspecification (eg, missing interaction terms in the fitted propensity model) because of inaccurate estimates of participation rates. Furthermore, extreme pseudo-weights can occur if the estimate of the participation rate is close to 0. By contrast, propensity-score-based matching methods (not included in this study) may be more robust to model misspecification and less likely to produce extreme pseudo-weights, because they use propensity scores to measure the similarity between survey and cohort sample units and distribute survey sample weights to the cohort based on that similarity. Examples of matching methods are propensity-score adjustment by subclassification,23 propensity-score-based kernel weighting methods,5,6,16 and Rivers' matching method.24

There are a number of shortcomings associated with the estimation of propensity scores using logistic regression. First, the logistic model is susceptible to model misspecification, requiring assumptions regarding correct variable selection and functional form, including the choice of polynomial terms and multiple-way interactions. If any of these assumptions are incorrect, propensity score estimates can be biased, and balance may not be achieved when conditioning on the estimated PS. Second, implementing a search routine for model specification, such as repeatedly fitting logistic regression models while including or excluding predictor variables, interactions, or transformations of variables, can be computationally infeasible or suboptimal. In this context, parametric regression can be limiting in terms of the model structures that can be searched over, particularly when many potential predictors are present (high-dimensional data). Machine learning methods for estimating the propensity score that incorporate survey weights are another direction for our future research.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available on request from the corresponding author.

Supplementary Material

Appendix A-D

Footnotes

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of this article.

REFERENCES

1. Collins R. What makes UK Biobank special. Lancet. 2012;379(9822):1173–1174.
2. Fry A, Littlejohns TJ, Sudlow C, et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am J Epidemiol. 2017;186(9):1026–1034.
3. Elliott MR, Valliant R. Inference for nonprobability samples. Stat Sci. 2017;32(2):249–264.
4. Chen Y, Li P, Wu C. Doubly robust inference with nonprobability survey samples. J Am Stat Assoc. 2019;115(532):2011–2021.
5. Wang L, Graubard BI, Katki H, Li Y. Improving external validity of epidemiologic cohort analyses: a kernel weighting approach. J Royal Stat Soc Ser A. 2020;183(3):1293–1311. doi:10.1111/rssa.12564.
6. Wang L, Graubard BI, Katki HA, Li Y. Efficient and robust propensity-score-based methods for population inference using epidemiologic cohorts; 2021. arXiv preprint arXiv:2011.14850.
7. Valliant R, Dever JA. Estimating propensity adjustments for volunteer web surveys. Sociol Methods Res. 2011;40(1):105–137.
8. Valliant R. Comparing alternatives for estimation from nonprobability samples. J Surv Stat Methodol. 2020;8(2):231–263.
9. Binder DA. On the variances of asymptotically normal estimators from complex surveys. Int Stat Rev. 1983;51(3):279–292.
10. Chambers RA, Skinner CJ. Sec. 6.3 Analysis of Survey Data. New York, NY: Wiley; 2003.
11. Beaumont JF. Calibrated imputation in surveys under a quasi-model-assisted approach. J Royal Stat Soc Ser B. 2005;67(3):445–458.
12. Kim JK, Kim JJ. Nonresponse weighting adjustment using estimated probability. Canadian J Stat. 2007;35(4):501–514.
13. Krewski D, Rao JN. Inference from stratified samples: properties of the linearization, jackknife and balanced repeated replication methods. Ann Stat. 1981;9(5):1010–1019.
14. Scott AJ, Wild CJ. Fitting logistic models under case-control or choice based sampling. J Royal Stat Soc Ser B. 1986;48(2):170–182.
15. Li Y, Graubard BI, DiGaetano R. Weighting methods for population-based case–control studies with complex sampling. J Royal Stat Soc Ser C. 2011;60(2):165–185.
16. Wang L. Improving External Validity of Epidemiologic Analyses by Incorporating Data from Population-Based Surveys [Doctoral dissertation]. University of Maryland, College Park; 2020. https://drum.lib.umd.edu/handle/1903/26125.
17. Massey JT. Design and estimation for the national health interview survey, 1985–94. US Department of Health and Human Services, Public Health Service, Centers for Disease Control, National Center for Health Statistics; 1989.
18. Ezzati TM, Massey JT, Waksberg J, Chu A, Maurer KR. Sample design: third National Health and nutrition examination survey. Vital Health Stat Ser 2 Data Eval Methods Res. 1992;113:1–35.
19. Hartley HO, Rao JN, Kiefer G. Variance estimation with one unit per stratum. J Am Stat Assoc. 1969;64(327):841–851.
20. National Center for Health Statistics. National Death Index User’s Guide. Hyattsville, MD: National Center for Health Statistics; 2013. https://www.cdc.gov/nchs/data/ndi/ndi_users_guide.pdf. Accessed April 2021.
21. National Center for Health Statistics. Data Linkage Public-Use Linked Mortality File Data Dictionary. Hyattsville, MD: National Center for Health Statistics; 2015. https://www.cdc.gov/nchs/data/datalinkage/public-use-2015-linked-mortality-files-data-dictionary.pdf. Accessed April 2021.
22. National Center for Health Statistics. Data Linkage Underlying and Multiple Cause of Death Codes. Hyattsville, MD: National Center for Health Statistics; 2018. https://www.cdc.gov/nchs/data/datalinkage/underlying_and_multiple_cause_of_death_codes.pdf. Accessed April 2021.
23. Lee S, Valliant R. Estimation for volunteer panel web surveys using propensity score adjustment and calibration adjustment. Sociol Methods Res. 2009;37(3):319–343.
24. Rivers D. Sampling for web surveys. Paper presented at: Proceedings of the Joint Statistical Meetings, Section on Survey Research Methods; 2007; Salt Lake City, Utah.
