Summary
A non-probability sampling mechanism arising from non-response or non-selection is likely to bias estimates of parameters with respect to a target population of interest. This bias poses a unique challenge when selection is ‘non-ignorable’, i.e. dependent upon the unobserved outcome of interest, since it is then undetectable and thus cannot be ameliorated. We extend a simulation study by Nishimura et al. [International Statistical Review, 84, 43–62 (2016)], adding two recently published statistics: the so-called ‘standardized measure of unadjusted bias (SMUB)’ and ‘standardized measure of adjusted bias (SMAB)’, which explicitly quantify the extent of bias (in the case of SMUB) or non-ignorable bias (in the case of SMAB) under the assumption that a specified amount of non-ignorable selection exists. Our findings suggest that this new sensitivity diagnostic is more correlated with, and more predictive of, the true, unknown extent of selection bias than other diagnostics, even when the underlying assumed level of non-ignorability is incorrect.
Keywords: Non-Ignorable Selection Bias, Survey Non-Response, Multiple Imputation, Pattern Mixture Model
1. Introduction
Classical methods of scientific probability sampling and corresponding design-based frameworks for making statistical inferences about populations have long been used to advance scientific knowledge in various fields. The random selection of elements from a population of interest into a probability sample, where all population elements have a known non-zero probability of selection, ensures that elements included in the sample mirror the population in expectation. That is, for all variables of interest, the mechanism of selection of a subset of elements into the sample is ignorable, following the theoretical framework for missing-data mechanisms originally introduced by Rubin (1976).
The modern survey research environment poses significant challenges to these “tried and true” methodologies: it has become increasingly difficult to contact sampled units; survey response rates continue to decline in all modes of administration (face-to-face, telephone, etc.; Brick and Williams, 2013; Williams and Brick, 2018); and the costs of collecting and maintaining scientific probability samples are steadily rising (Presser and McCulloch, 2011). These problems raise the question of whether, and to what extent, samples can still produce reliable estimates when only a small fraction of units has responded, such that the response mechanism may in fact not be ignorable.
Given the difficulties of collecting data from probability samples, researchers are also turning to non-probability samples, which have the potential to yield large amounts of data at low cost. These may also be prone to non-ignorable selection bias, as the researcher no longer has control over the mechanism that ultimately yields the final sample. Given this trend in research methodology, indicators of the potential non-ignorable selection bias in non-probability samples and probability samples with low response rates are required.
Nishimura et al. (2016) investigated the suitability of various statistics for use as diagnostics for selection bias due to non-response mechanisms, of both the ‘ignorable’ and ‘non-ignorable’ types (Rubin, 1976). They noted that none of the diagnostics they considered were intended to directly quantify selection bias. Moreover, their simulation study found that none of them were suitable as potential diagnostics, leaving the door open for other candidates. A statistic recently proposed in Little et al. (2019) explicitly estimates this bias based on an assumed level of non-ignorability and therefore is potentially appropriate for use as a diagnostic. The primary contribution of this paper is the inclusion of this statistic in such a comparison of diagnostics. We also extend Nishimura et al. (2016) by simulating two auxiliary variables that are differentially associated with the survey variable and selection, which we argue is an important additional factor when evaluating the diagnostics.
The remainder of this paper is organized as follows. Section 2 presents notation and a brief description of the index of selection bias proposed in Little et al. (2019). Section 3 describes the other diagnostics we consider here, which were also evaluated in Nishimura et al. (2016). An important contextual difference between this paper and that of Nishimura et al. is that we consider the generic non-selection scenario, of which survey non-response – the scenario of interest in Nishimura et al. – is a special case. The practical implication of this difference is that those indices that depend upon selection probabilities may not be calculable if those probabilities cannot be estimated. Sections 4 and 5 describe and present the results from the simulation study, respectively. Section 6 concludes with a discussion of all of the diagnostics considered in light of our results.
2. An index of selection bias
For a target population of size N, with i = 1, … , N, let Si ∈ {0, 1} indicate the selection of the ith subject into the sample, Yi be the continuous outcome of interest, and Zi be an observed auxiliary variable that is relevant due to its association with Yi. The vectors S = {S1, … , SN} and Z = {Z1, … , ZN} are fully observed, and the vector Y = {Y1, … , YN} is separated into selected (observed) and unselected (missing) sub-vectors, respectively Ysel = {Yi : Si = 1} and Yunsel = {Yi : Si = 0}. When needed, we will also use this same convention to separate Z into selected and unselected subvectors, Zsel and Zunsel, although in contrast to Y both subvectors of Z are always assumed to be fully observed. The primary estimand of interest is the average outcome in the target population: E[Yi] = μy.
Two forms of models for the joint distribution of {Y,Z, S} are often considered. Selection models (Little and Rubin, 2002) factorize the joint distribution as
Pr(Y, Z, S | α, β) = Pr(Y, Z | α) Pr(S | Y, Z, β),    (1)
with parameters {α, β}, where α and/or β may themselves be vectors. A model for Pr(S|Y,Z, β) describes the missingness mechanism for Yunsel, since Yi is not observed when Si = 0. The strongest possible assumption to make regarding Pr(S|Y,Z, β) is that S and {Y,Z} are jointly independent. Modifying the ‘missing completely at random’ terminology of Little and Rubin (2002), we call this ‘selection completely at random’ (SCAR). In this case β corresponds to the average selection rate. A weaker assumption is ‘selection at random’ (SAR), which assumes that S and Y are conditionally independent given Z. The weakest assumption is ‘selection not at random’ (SNAR), and elements of both α and β are not identified in this case.
The second decomposition is the class of ‘pattern-mixture models’ (Andridge and Little, 2011; Little, 1994), which describe outcome models that are specific to the selected and unselected populations:
Pr(Y, Z, S = s | θunsel, θsel, π) = Pr(Y, Z | S = s, θs) Pr(S = s | π), s ∈ {0, 1},    (2)
with parameters {θunsel, θsel, π}, where θunsel and θsel may be vectors and π is a scalar equal to the probability of selection. Both the selection and pattern-mixture decompositions are statistically valid, and in the special case of a SCAR mechanism, the models coincide, meaning that θunsel = θsel ≡ θ and {θ, π} and {α, β} share a 1–1 correspondence (Little, 1994). Further, all parameters become identified in this special case. However, models (1) and (2) will not generally coincide under SAR for any distributional choices. Although the decomposition in (1) is more intuitive by directly capturing the data-generating mechanism, the usefulness of focusing on (2) is that the non-identified parameters are isolated to a single submodel: [Yunsel,Zunsel|θunsel]. In the pattern-mixture framework, the estimand of interest, μy, is equal to πE[Ysel|θsel] + (1 − π)E[Yunsel|θunsel]. The latter mean, E[Yunsel|θunsel], is not identified without making further assumptions.
Specifically, for the factorization in (2), assume that [Zsel, Ysel|θsel] and [Zunsel, Yunsel|θunsel] are both bivariate normal, with θsel and θunsel each denoting five parameters (two means, two variances, and a covariance). Additionally, assume that the marginal distribution Pr(S|π) is coherent with some true conditional distribution of S given Z and Y that takes the form
Pr(S = 1 | Z, Y, ϕ) = g((1 − ϕ)Z + ϕY)    (3)
for some invertible function g(t) having range in the interval (0, 1) but otherwise unspecified, and for some scalar parameter ϕ ∈ [0, 1]. The population mean μy becomes identified under these assumptions (Little, 1994), and Andridge and Little (2011) derived a maximum likelihood estimate (MLE) of μy as a function of ϕ, given by
μ̂y(ϕ) = ȳsel + [(ϕ + (1 − ϕ)rsel) / (ϕrsel + (1 − ϕ))] · (sy,sel / sz,sel) · (z̄ − z̄sel)    (4)
(these authors actually used the alternative parameterization ψ = ϕ/(1 − ϕ)). Here, ȳsel, z̄sel, and z̄ are the sample means of Ysel, Zsel, and Z, respectively; rsel is the sample Pearson correlation between Ysel and Zsel; and s²y,sel and s²z,sel are the sample variances of Ysel and Zsel, respectively. For ϕ = 0, i.e. when selection depends on Z alone, this estimator reduces to the regression estimator obtained from regressing Y on Z for the selected cases (Andridge and Little, 2011).
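To make the mechanics of (4) concrete, the estimator can be computed from six summary statistics. The following Python sketch (the paper's own code is in R; the function and argument names here are ours, purely for illustration) implements the point estimate as a function of ϕ:

```python
def ppm_mean_estimate(phi, ybar_sel, zbar_sel, zbar, r_sel, sy_sel, sz_sel):
    """MLE of the population mean under the normal pattern-mixture model
    (cf. equation (4)), as a function of the non-ignorability parameter phi.
    sy_sel and sz_sel are the sample standard deviations of Ysel and Zsel."""
    slope = (phi + (1 - phi) * r_sel) / (phi * r_sel + (1 - phi))
    return ybar_sel + slope * (sy_sel / sz_sel) * (zbar - zbar_sel)
```

At ϕ = 0 this reduces to the familiar regression estimator; at ϕ = 1 the regression coefficient rsel is replaced by 1/rsel, inflating the adjustment when the proxy is weak.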
Remark 1:
Little et al. (2019) show that the estimator (4) remains unbiased for its estimand under a more general class of functions than that given in (3), namely
Pr(S = 1 | Z, Y, ϕ) = g((1 − ϕ)Z + ϕY, W),    (5)
where W is uncorrelated with Z. This generalization will be important for explaining a key result in our simulation study.
This estimate of μy in (4) is a function of the parameter ϕ, which in turn controls the extent to which sampling depends upon the outcome Y, with larger values indicating greater dependence. When ϕ = 0, the selection mechanism is SAR, and the resulting statistic is closely related to the measure H1 in Särndal and Lundström (2010). When ϕ > 0, the sampling mechanism is ‘non-ignorable’ (Rubin, 1976), meaning that the sampled population cannot yield unbiased estimates of the target population parameter without knowledge of the true value of ϕ (Little et al., 2019). However, in any non-probability sample, ϕ is, by definition, not estimable, and Little et al. propose varying this parameter in a sensitivity analysis. Subtracting both sides of (4) from ȳsel and scaling by sy,sel to standardize the resulting difference, we obtain a direct estimate of the standardized bias that would arise in using ȳsel to estimate μy for a particular true value of ϕ. The resulting expression is the recently proposed Standardized Measure of Unadjusted Bias (SMUB, Little et al., 2019):
SMUB(ϕ) = [(ϕ + (1 − ϕ)rsel) / (ϕrsel + (1 − ϕ))] · (z̄sel − z̄) / sz,sel    (6)
This measure quantifies the sensitivity of estimates based upon the selected sample to increasing levels of non-ignorability, represented by the value of ϕ. As discussed in Little et al. (2019), in addition to a small value of the non-ignorability parameter ϕ, other characteristics that tend to decrease the standardized bias include having an auxiliary variable that is a strong correlate of the outcome, i.e. rsel close to 1, and/or obtaining a large sampled fraction, since z̄sel − z̄ = (1 − π)(z̄sel − z̄unsel) shrinks toward zero as the selection probability π in (2) approaches 1.
Little et al. (2019) also proposed a Standardized Measure of Adjusted Bias (SMAB), defined as
SMAB(ϕ) = SMUB(ϕ) − SMUB(0)    (7)
Whereas SMUB measures the summative bias arising from both ignorable and non-ignorable mechanisms, SMAB measures only the excess bias after adjusting for ignorable bias. As Little et al. (2019) caution, its utility is predicated on the underlying assumptions of the normal pattern-mixture model.
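A Python sketch of (6) and (7) (hypothetical function names; the authors' computations are in R) makes clear that SMUB and SMAB require only summary statistics of Zsel and Z, the selected-sample correlation, and the analyst-chosen ϕ:

```python
def smub(phi, zbar_sel, zbar, r_sel, sz_sel):
    """Standardized measure of unadjusted bias (equation (6))."""
    slope = (phi + (1 - phi) * r_sel) / (phi * r_sel + (1 - phi))
    return slope * (zbar_sel - zbar) / sz_sel

def smab(phi, zbar_sel, zbar, r_sel, sz_sel):
    """Standardized measure of adjusted bias (equation (7)):
    excess bias beyond the ignorable (phi = 0) component."""
    return (smub(phi, zbar_sel, zbar, r_sel, sz_sel)
            - smub(0.0, zbar_sel, zbar, r_sel, sz_sel))

# Since phi is not estimable from the data, report a sensitivity grid:
sensitivity = [smub(phi, 0.3, 0.0, 0.6, 1.0) for phi in (0.0, 0.5, 1.0)]
```

With rsel = 0.6 and a 0.3-standard-deviation shift in the proxy mean, the grid above gives standardized biases of approximately 0.18, 0.30, and 0.50 at ϕ = 0, 0.5, and 1, illustrating how the assumed degree of non-ignorability scales the implied bias.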
The simulation study in Nishimura et al. (2016), conducted prior to the proposal of the estimator in (6), found that “none of the indicators [evaluated] fully depict the impact of non-response in survey estimates” (p. 43). We consider here whether the SMUB or SMAB indices address this deficiency. Note that (6) is based on a normal pattern-mixture model and as such is less well suited to non-normal outcomes. Modifications of (6) for categorical outcomes are discussed in Andridge and Little (2020) but are not considered in this article.
3. Other Diagnostics Evaluated
Nishimura et al. (2016) grouped the diagnostics they compared based upon whether {S,Z} or {S, Ysel,Z} are required to calculate them. Except for the sample mean of the selection indicator, these other diagnostics require at least some individual-level data from the non-sampled population (or some other means of accurately assessing the selection propensity). This situation is exceedingly rare in practice and makes these diagnostics difficult, if not impossible, to compute for non-probability samples. It also provides motivation for the additional diagnostic measures we evaluate in this paper, which do not have this same requirement. We return to this important limitation in the Discussion.
The simplest diagnostic is S̄ = (1/N)ΣiSi, i.e. the sample mean of the selection indicator, or the selection rate. Small values of S̄ increase the upper bound for potential bias due to non-ignorable sampling, since a larger fraction of the data are missing (Nishimura et al., 2016), but do not necessarily indicate greater selection bias, e.g. Bootsma-van der Wiel et al. (2002). Since our focus is on how well measures reflect bias characteristics beyond the selection rate, we include the selection rate as a design factor in our simulation study rather than as a diagnostic for bias. In this section, we provide a brief rationale for the use of each of these diagnostics in the non-probability sampling setting; Nishimura et al. provide additional justification for each diagnostic in the special case of non-response conditional on being sampled.
3.1. Diagnostics using {S,Z}
This category characterizes the associations between the fully observed auxiliary variable Z and the selection indicator S. The underlying rationale for doing so is that a selection rate dependent upon Z, which is itself a surrogate for Y, is suggestive of a selection rate dependent upon Y, i.e. selection bias. Nishimura et al. (2016) consider three measures of this type, which are described below.
Consider first the selection model conditioning on Z alone:
Pr(S = 1|Z, γ0, γz) = logit−1(γ0 + γzZ). This is fit to the data {S,Z} from both the selected and unselected populations. Let the fitted probability, or propensity, of selection for the ith observation be given by
ηi = logit−1(γ̂0 + γ̂zZi).    (8)
The R-indicator (Schouten et al., 2009), where R stands for ‘representativity’, is the following linear transformation of the sample standard deviation of the ηi across both the selected and unselected samples:

R̂ = 1 − 2s(η),

where s(η) denotes the sample standard deviation of the fitted propensities. Schouten et al. proposed the R-indicator in the context of response propensities, and thus it is computed across all elements in the population and requires data sufficient to estimate the response/selection propensities ηi.
R̂ theoretically ranges from 0 to 1, where smaller values correspond to greater variability in the selection propensities and, consequently, greater potential for selection bias. However, the smallest possible value, R̂ = 0, which occurs when the sample standard deviation of the ηi’s equals 0.5, requires two strong conditions. First, the average fitted selection propensity, η̄, must be 0.5. Second, each individual propensity must be either ηi = 1 or ηi = 0, i.e. S can be completely separated by Z, in the sense of Albert and Anderson (1984). In practice, R̂ generally ranges between 0.5 and 1.
The coefficient of variation of the selection propensities is the ratio of the same standard deviation used in the R-indicator and the mean selection propensity:

CV(η) = s(η)/η̄.

The theoretical range of CV(η) is the set of non-negative numbers. The rationale for using the coefficient of variation is that both variability in selection probabilities (the numerator) and smaller selection rates (the denominator) contribute to the potential for selection bias. As with the other indices, however, the challenge is that this relationship does not always hold, nor is the converse true: selection bias may exist even in the presence of a “small” CV(η).
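Both R̂ and CV(η) are simple functionals of the fitted propensities. A minimal Python sketch (our own illustration, not the authors' code; the standard-deviation convention, e.g. the n − 1 divisor, may differ across implementations):

```python
import statistics

def r_indicator(eta):
    """R-indicator: 1 minus twice the sample standard deviation of the
    fitted selection propensities, computed over the full population."""
    return 1.0 - 2.0 * statistics.stdev(eta)

def cv_eta(eta):
    """Coefficient of variation of the fitted selection propensities."""
    return statistics.stdev(eta) / statistics.mean(eta)
```

With constant propensities both diagnostics signal no potential for bias: R̂ equals 1 and CV(η) equals 0.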
Highly variable non-selection weights may also indicate greater potential for selection bias, depending on the extent to which the variables used to create the weights are associated with the outcome of interest. Thus, the variability in non-selection weights focuses on the inverse of the estimated selection probabilities, 1/ηi. Nishimura et al. (2016) consider the sample variance of 1/ηi evaluated in the selected sample, Var(η−1) = V̂ar({1/ηi : Si = 1}).
Two other approaches limited to these same data assess the overall performance of the selection model Pr(S = 1|Z, γ0, γz) = logit−1(γ0 + γzZ) in distinguishing between selected and non-selected observations. One is the ‘Area Under the receiver-operating characteristic Curve’ (AUC), an assessment of discriminatory ability. The corresponding estimate counts the proportion of all possible selected–unselected pairs whose selection propensities are correctly ordered, with ties counted as one half:

AÛC = (nselnunsel)−1 Σ{i:Si=1} Σ{j:Sj=0} {I(ηi > ηj) + 0.5·I(ηi = ηj)},

where nsel and nunsel are the numbers of selected and unselected elements and I(·) is the indicator function.
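The pairwise AUC computation can be sketched as follows (a naive O(nsel × nunsel) illustration in Python; production code would sort the propensities instead):

```python
def auc_pairwise(eta_sel, eta_unsel):
    """AUC as the fraction of selected-unselected pairs whose fitted
    propensities are correctly ordered; ties count one half."""
    score = 0.0
    for a in eta_sel:
        for b in eta_unsel:
            if a > b:
                score += 1.0
            elif a == b:
                score += 0.5
    return score / (len(eta_sel) * len(eta_unsel))
```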
The pseudo-R2 seeks to generalize the linear model’s R2 metric, or proportion of variation explained, to a logistic framework (Nagelkerke, 1991). It is given by

psR2 = [1 − (L̂0/L̂1)^(2/N)] / [1 − L̂0^(2/N)],

where L̂0 and L̂1 are, respectively, the maximized likelihoods of the intercept-only model and the model conditioning on Z.
Both AÛC and psR2 quantify the strength of the model used to create the selection propensities. A better (stronger) relationship between auxiliary variables and selection could indicate a higher risk for selection bias, depending on the strength of the relationship between the auxiliary variables and the outcome of interest.
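Working from the log-likelihoods that logistic-regression software typically reports, the pseudo-R2 above can be sketched as (our own illustration, not the authors' code):

```python
import math

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke pseudo-R^2 from the maximized log-likelihoods of the
    intercept-only and full selection models, rescaled so that the
    maximum attainable value is 1."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_attainable = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_attainable
```

A model no better than the intercept-only fit yields 0, and a perfect fit (log-likelihood of 0) yields 1.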
3.2. Diagnostics using {S, Ysel,Z}
The two diagnostics in this section make use of all available data and are therefore potentially more sensitive to detecting selection bias. The first is the Pearson correlation between the outcome Y and the inverse of the selection propensity: Cor(Ysel, η−1), the correlation between Yi and 1/ηi computed among the selected cases.
This correlation serves as a measure of the association between the survey variable and the set of auxiliary variables used to create the selection propensities. The stronger this relationship, the more potential there is to adjust for selection bias.
The second diagnostic is called the ‘Fraction of Missing Information’ (FMI), a statistic borrowed from the literature on multiple imputation (Rubin, 2004). Given a posited model for the conditional distribution of the outcome Y given the auxiliary variable Z, fit to the observed data {Ysel, Zsel}, M sets of unselected outcomes, denoted by Y(1)unsel, …, Y(M)unsel, are imputed. Each of the M completed datasets, {Ysel, Y(m)unsel}, is used to construct an estimate of μy, say μ̂(m)y, m = 1, …, M, with corresponding variance estimate V̂(m). After some simplification, the FMI statistic can be written as

FMI = [(M + 1)/(M − 1) · Σm(μ̂(m)y − μ̄y)²] / [(M + 1)/(M − 1) · Σm(μ̂(m)y − μ̄y)² + Σm V̂(m)],

where μ̄y = M−1 Σm μ̂(m)y.
There are three contributing elements to this expression. The first element, Σm(μ̂(m)y − μ̄y)², appears in both the numerator and denominator and is the sum of the squared deviations between each imputation-specific estimate and their overall mean. It is proportional to the so-called “between-imputation variance”, capturing uncertainty in the estimate across replications of the imputation procedure. The second element, Σm V̂(m), appears only in the denominator and is the sum of the imputation-specific variance estimates of μ̂(m)y. This is proportional to the “within-imputation variance”, and the sum of the between- and within-imputation variances is the total variance. The third element, (M + 1)/(M − 1) > 1, multiplicatively inflates the between-over-total fraction and captures the loss of information due to taking a finite number of imputations; it approaches 1 from above as M increases. Ranging between 0 and 1, larger values of FMI indicate greater uncertainty about the imputed values (larger between-imputation variance), which could indicate a greater potential for selection bias.
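The FMI expression above can be computed directly from the M imputation-specific estimates and their variance estimates, as in this illustrative Python sketch (the paper's computations are in R):

```python
def fmi(estimates, variances):
    """Fraction of missing information from M imputation-specific point
    estimates and their variance estimates (cf. the expression above)."""
    M = len(estimates)
    mean_est = sum(estimates) / M
    # between-imputation sum of squares, appearing in numerator and denominator
    ss_between = sum((e - mean_est) ** 2 for e in estimates)
    # finite-M inflation factor, approaching 1 as M grows
    inflate = (M + 1) / (M - 1)
    return inflate * ss_between / (inflate * ss_between + sum(variances))
```

When the imputation-specific estimates are identical, FMI is 0; when the within-imputation variances are negligible relative to the between-imputation spread, FMI approaches 1.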
4. Simulation Study: Description
The purpose of this simulation study is to characterize the association between the true bias in a sampled dataset (only observable in a simulation framework) and each of the aforementioned candidate diagnostics, including the new SMUB and SMAB diagnostics from Little et al. (2019). The data were generated according to the ‘selection model’ decomposition described in equation (1). However, recognizing that, in practice, there may be more than one auxiliary variable having different associations with selection and the survey variable, we used two auxiliary variables, X1 and X2, in place of Z. In truth, S and X1 are conditionally independent given X2 and Y, and, similarly, Y and X2 are conditionally independent given X1.
Remark 2:
Nishimura et al. (2016) also use two auxiliary variables, but for different purposes. One of their auxiliary variables is latent and used only to control the extent of response/selection not-at-random, whereas we induce non-ignorable selection directly (see Remark 3 below). These approaches are distributionally equivalent. Their other variable is an observed explanatory variable that is assumed to correlate with both the response/selection indicator and the outcome and thus jointly serves the role of our two auxiliary variables, X1 and X2.
In more detail, at each iteration, a finite population of size N = 10⁴ was simulated, wherein each observation consisted of the random vector {Y,X1,X2, S} drawn from the true models in the second column of Table 1. X1 and X2 are bivariate normal with mean 0, variance 1, and correlation κ. When X1 and X2 are not identically equal, i.e. κ < 1, both X1 and X2 are conditioned on in fitting the outcome and selection models, to emulate what would be done in practice. The scalar parameter ρ is the Pearson correlation between Y and X1, i.e. ρ = Cor(Y, X1); Y and X2 are conditionally independent given X1. Finally, the selection probability is controlled by parameters β0, βx, and βy in a logistic framework, with Pr(S = 1|Y,X2, β0, βy, βx) = logit−1(β0 + βyY + βxX2). In total, five parameters govern this distribution: κ, ρ, β0, βx, and βy.
Table 1:
Description of generating models used in the simulation study in Section 4. Five parameters fully specify the generating distribution of the data: κ, ρ, β0, βx, and βy.
| Variable | Generating Model |
|---|---|
| Auxiliary | {X1, X2} ~ N2(0, Σκ), with unit variances and Cor(X1, X2) = κ |
| Outcome | Y \| X1 ~ N(ρX1, 1 − ρ²) |
| Selection | Pr(S = 1 \| Y, X2, β0, βy, βx) = logit−1(β0 + βyY + βxX2) |
Remark 3:
An equivalent model for inducing correlation between S and Y would be achieved by the introduction of a latent variable into the generating distribution of each, as in Heckman (1979), where S = 1 if the latent variable crosses some threshold.
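An illustrative Python sketch of one draw of the generating process in Table 1 (the study's actual code is in R; we assume here, consistent with ρ being a correlation, that Y has unit marginal variance, i.e. Y | X1 ~ N(ρX1, 1 − ρ²)):

```python
import math
import random

def simulate_population(N, kappa, rho, b0, bx, by, rng):
    """Draw one finite population of {Y, X1, X2, S} tuples per Table 1."""
    pop = []
    for _ in range(N):
        x1 = rng.gauss(0, 1)
        # X2 = kappa*X1 + sqrt(1 - kappa^2)*eps gives Cor(X1, X2) = kappa
        x2 = kappa * x1 + math.sqrt(1 - kappa ** 2) * rng.gauss(0, 1)
        # Y depends on X1 only, with Cor(Y, X1) = rho and unit variance
        y = rho * x1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # logistic selection depending on Y and X2
        p_sel = 1.0 / (1.0 + math.exp(-(b0 + by * y + bx * x2)))
        pop.append((y, x1, x2, 1 if rng.random() < p_sel else 0))
    return pop
```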
We considered κ ∈ {0, 0.5, 1}, with the last scenario corresponding to X1 ≡ X2 ≡ Z, in which case we are in the ‘single auxiliary variable’ scenario, and one would not condition on both X1 and X2. The correlation between the outcome Y and its best predictor X1 was ρ ∈ {0.10, 0.25, 0.75}. Values of βx and βy, the log-odds ratios for selection, were taken from one of the scenarios listed in Table 2. The first row, for which βx = βy = 0, corresponds to a SCAR mechanism. The second row, for which βx ∈ {0.1, 0.2, 0.3, 0.4, 0.5} and βy = 0, corresponds to five different SAR mechanisms. The remaining rows in the table, for which βy ≠ 0 and |βx| + |βy| ≡ c ∈ {0.1, 0.2, 0.3, 0.4, 0.5}, all correspond to different SNAR mechanisms, ranging from mild non-ignorability (third row: {βx, βy} = {3c/4, c/4}) to extreme non-ignorability (sixth row: {βx, βy} = {0, c}) in which selection depends entirely on Y. In total, Table 2 gives 31 unique sets of βx and βy.
Table 2:
Values of ϕtrue for the pairs of log-odds ratios in the true selection mechanism of the simulation study, grouped by the relative relationship of βx to βy, where, except for the first row, |βx| + |βy| ≡ c ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. The implied true value of the non-ignorability parameter ϕ is calculated by the expression ϕtrue = βy/(κβx + βy).
| Label | {βx, βy} | ϕtrue, κ = 1 | ϕtrue, κ = 0.5 | ϕtrue, κ = 0 |
|---|---|---|---|---|
| SCAR | {0, 0} | 0* | 0* | 0* |
| SAR | {c, 0} | 0 | 0 | 0 |
| 3X2 + Y | {3c/4, c/4} | 0.25 | 0.4 | 1 |
| X2 + Y | {c/2, c/2} | 0.5 | 0.67 | 1 |
| X2 + 3Y | {c/4, 3c/4} | 0.75 | 0.86 | 1 |
| Y | {0, c} | 1 | 1 | 1 |
| X2 − Y | {c/2, −c/2} | –† | –† | 1 |
* Mathematically, ϕtrue is undefined when βx = βy = 0, but we use 0 here to indicate that this is an ignorable sampling mechanism.
† There is no value of ϕtrue ∈ [0, 1] satisfying the assumptions required for the SMUB indices when βx or βy is negative and κ > 0.
Under this generating model, the assumption in (5) holds for any κ ∈ [0, 1]. To see this, express X2 as X2 = κX1 + √(1 − κ²)ϵ, where ϵ ~ N(0, 1) is independent of X1 and Y. Substituting this expression into the selection model, we rewrite the selection probability as

Pr(S = 1 | Y, X1, ϵ) = logit−1(β0 + κβxX1 + βyY + βx√(1 − κ²)ϵ).
Now, letting (i) g(t1, t2) = logit−1(β0 + [κβx + βy]t1 + t2), (ii) ϕ = βy/(κβx + βy), (iii) Z = X1, and (iv) W = βx√(1 − κ²)ϵ, the relaxed assumption (5) is satisfied for any κ ∈ [0, 1]. In contrast, the more restrictive assumption (3) is only satisfied for κ = 1, i.e. W ≡ 0. Under κ = 1, the third column in Table 2 gives the implied true value of ϕ, which is common to all {βx, βy} pairs in each row and which we denote as ϕtrue to distinguish it from the closely related tuning parameter ϕ used by SMUB. The last two columns give the value of ϕtrue for κ = 0.5 and κ = 0, respectively. In the last row of Table 2, for which βx > 0 and βy < 0, there is no value of ϕtrue ∈ [0, 1] satisfying (5) except in the case that κ = 0, and this is noted as such in the table.
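The ϕtrue values in Table 2 follow directly from this expression; a quick check in illustrative Python (the function name is ours):

```python
def phi_true(kappa, bx, by):
    """Implied non-ignorability parameter (Table 2).
    Assumes kappa*bx + by != 0, i.e. excludes the SCAR row."""
    return by / (kappa * bx + by)
```

For example, the X2 + Y row gives 0.5 at κ = 1 and 2/3 ≈ 0.67 at κ = 0.5, and any row with βy ≠ 0 gives 1 at κ = 0.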
With regard to the intercept β0, we did not directly set its value but rather fixed a desired overall selection probability Pr(S = 1) = 0.05 (marginally over all other random variables), which, when set equal to E[logit−1(β0 + βyY + βxX2)], can be numerically solved for β0. A 5% selection rate is fairly large for non-probability samples.
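One simple way to carry out this numerical solve, shown here purely as an illustration (not the authors' code), is to fix a set of Monte Carlo draws of βyY + βxX2 and bisect on β0, since with the draws held fixed the marginal selection rate is a deterministic, increasing function of β0:

```python
import math
import random

def solve_intercept(target, rho, kappa, bx, by, n_mc=50_000, seed=0):
    """Find b0 such that the Monte Carlo estimate of
    E[logit^-1(b0 + by*Y + bx*X2)] equals the target selection rate."""
    rng = random.Random(seed)
    lin = []
    for _ in range(n_mc):
        x1 = rng.gauss(0, 1)
        x2 = kappa * x1 + math.sqrt(1 - kappa ** 2) * rng.gauss(0, 1)
        y = rho * x1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        lin.append(by * y + bx * x2)

    def rate(b0):
        return sum(1.0 / (1.0 + math.exp(-(b0 + l))) for l in lin) / n_mc

    lo, hi = -20.0, 5.0  # rate is increasing in b0 over this bracket
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if rate(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When βx = βy = 0 the solve reduces to β0 = logit(0.05) ≈ −2.94; nonzero coefficients shift the root slightly.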
Two of the diagnostics have input values that the user must select. For SMUB, we inspected three choices of the non-ignorability tuning parameter in (6): ϕ ∈ {0, 0.5, 1.0}. When ϕ is close to the unknown ϕtrue, the SMUB statistic will be a good estimate of the unadjusted bias, as defined below. For SMAB, we used ϕ ∈ {0.5, 1.0}, since SMAB(ϕ = 0) is always equal to 0. As with SMUB, when ϕ is close to the unknown ϕtrue, the SMAB statistic will be close to its estimand, namely the adjusted bias. For FMI, we estimated μy by imputing M = 30 vectors of the unselected outcomes Yunsel conditional on the auxiliary variables X1 and X2 within a Bayesian linear regression model framework.
For each of the 3 × 3 × 31 = 279 combinations of ρ, κ, and the {βx, βy} pair taken from Table 2, we simulated 2000 independent populations of size 10⁴ and, from each, sampled a dataset according to the corresponding parameters. The available data were always {S,X1,X2, Ysel}, although not all diagnostics make use of all data, as noted in the previous sections. For those diagnostics depending on the selection propensity η, we regressed S against the auxiliary covariates X1 and X2 in the entire population data.
To assess performance, we calculated for each dataset the ‘standardized error measure’ (SEM) in using to estimate μy, which is given by
SEM = (ȳsel − ȳN) / σy,    (9)

where ȳN is the mean of Y in the simulated finite population and σy is the true standard deviation of Y.
In words, this is the difference between the empirical mean of the outcome in the selected observations and the target population mean, divided by the true standard deviation of the outcome. We plot the median value of SEM against the median value of each diagnostic to visualize the systematic relationship between these two quantities. A diagnostic that is sensitive to selection bias should be associated with SEM, and both the qualitative and quantitative nature of this association should be similar for all types of selection mechanisms, i.e. all values of ϕtrue. Also important is the pairwise relationship due to sampling variability, or “chance bias”. To that end, we also calculate the Spearman correlation between the value of SEM and each diagnostic across all 2000 datasets from each scenario.
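For concreteness, a tie-free Spearman correlation of the kind summarized here can be sketched as the Pearson correlation of the ranks (illustrative Python; ties would require midranks):

```python
def spearman(u, v):
    """Spearman rank correlation for tie-free vectors."""
    def ranks(x):
        order = sorted(range(len(x)), key=lambda i: x[i])
        r = [0] * len(x)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ru, rv = ranks(u), ranks(v)
    n = len(u)
    mu = (n - 1) / 2  # mean of the ranks 0, ..., n-1
    num = sum((a - mu) * (b - mu) for a, b in zip(ru, rv))
    den = sum((a - mu) ** 2 for a in ru)  # same for rv when tie-free
    return num / den
```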
Because calibration is often used in practice to adjust for the potential selection bias in non-probability samples, we also calculated a secondary error measure using a calibrated estimator of the average outcome. Specifically, we separately categorized X1 and X2 into groups defined by the marginal quartiles in the population data, yielding 16 bivariate categories, and then weighted each observation in the sampled data by the ratio of its corresponding category’s relative frequency in the population data versus its relative frequency in the sampled data. The calibrated estimator is the weighted mean of the outcome in the sampled data, denoted by ȳw. Then, the ‘standardized adjusted error measure’ (SAEM) is defined as
SAEM = (ȳw − ȳN) / σy.    (10)
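The cell-ratio weighting described above can be sketched as follows (illustrative Python, not the authors' code; `cells_sel` and `cells_pop` are hypothetical names holding each unit's cross-classified quartile cell):

```python
from collections import Counter

def calibrated_mean(y_sel, cells_sel, cells_pop):
    """Weighted mean of the selected outcomes, weighting each unit by the
    ratio of its cell's relative frequency in the population to its
    relative frequency in the sample."""
    f_pop, f_sel = Counter(cells_pop), Counter(cells_sel)
    N, n = len(cells_pop), len(cells_sel)
    w = [(f_pop[c] / N) / (f_sel[c] / n) for c in cells_sel]
    return sum(wi * yi for wi, yi in zip(w, y_sel)) / sum(w)
```

When the sample's cell frequencies already match the population's, every weight is 1 and the calibrated mean equals the unweighted sample mean.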
Results corresponding to SEM are given in the main text, and those corresponding to SAEM are in the supplement. All analyses were conducted in the R statistical environment (R Core Team, 2018; van Buuren and Groothuis-Oudshoorn, 2011; Wickham, 2017). Code for the simulation study is available here: https://github.com/bradytwest/IndicesOfNISB/tree/master/SelectionBiasDiagnostics.
5. Simulation Study: Results
Figures 1, 2, and 3 plot the relationship between the median value of SEM across 2000 simulated datasets from a given scenario against the median of each diagnostic, separately for κ = 1, 0.5, and 0, respectively. Figures S1, S2, and S3 give these analogous results using the alternative metric SAEM.
Figure 1:

Median standardized error measure (SEM, y-axes) against value of diagnostic (x-axes) for twelve candidate diagnostics (columns) and three values of ρ ≡ Cor(X1, Y) (rows), using the median of 2000 simulated datasets. κ ≡ Cor(X1,X2) is fixed at 1 (Figures 2 and 3 give the same results for κ = 0.5 and κ = 0, respectively). For reference, the y = x line is plotted in black. Shape and color indicate different true selection mechanisms from Table 2, and connected segments represent different values of {βx, βy} corresponding to the same selection mechanism.
Figure 2:

Median standardized error measure (SEM, y-axes) against value of diagnostic (x-axes) for ten candidate diagnostics (columns), two values of ρ ≡ Cor(X1, Y) (rows) using the median of 2000 simulated datasets. κ ≡ Cor(X1,X2) is fixed at 0.5 (Figures 1 and 3 give the same results for κ = 1 and κ = 0, respectively). For reference, the y = x line is plotted in black. Shape and color indicate different true selection mechanisms from Table 2, and connected segments represent different values of {βx, βy} corresponding to the same selection mechanism.
Figure 3:

Median standardized error measure (SEM, y-axes) against value of diagnostic (x-axes) for ten candidate diagnostics (columns), two values of ρ ≡ Cor(X1, Y) (rows) using the median of 2000 simulated datasets. κ ≡ Cor(X1,X2) is fixed at 0 (Figures 1 and 2 give the same results for κ = 1 and κ = 0.5, respectively). For reference, the y = x line is plotted in black. Shape and color indicate different true selection mechanisms from Table 2, and connected segments represent different values of {βx, βy} corresponding to the same selection mechanism.
Points whose underlying selection mechanisms share a row in Table 2 are connected. Generally speaking, a diagnostic is good at detecting bias if its value (on the x-axis) changes at a similar rate as the observed bias (on the y-axis) across all of the different selection mechanisms, i.e. each plotted segment has a similarly sized slope. It is useful for estimating bias if its value changes at the same rate as the observed bias across the selection mechanisms, i.e. each plotted segment is close to the line y = x (which is given by a solid black line but is not visible in all panels due to the scale of each diagnostic). There is no information in the data to determine the extent to which selection depends on Y, as represented by the different lines in the figures. If, for a single value of a diagnostic on the x-axis, there are many different values of SEM on the y-axis across different selection mechanisms, this is evidence against it being a good diagnostic. The candidate diagnostics are separated into two groups in each figure, with the set of six in the top three rows (one row each for ρ = 0.75, ρ = 0.25, and ρ = 0.10) roughly corresponding to the best-performing diagnostics, and the set in the bottom three rows corresponding to the worst-performing diagnostics.
Considering first the diagnostics in the bottom rows of Figure 1, Cor(Ysel, η−1) and FMI(μy) are not notably sensitive to changes in SEM, as indicated by the steep vertical segments. The Var(η−1) diagnostic changes with SEM, but the range of its x-axis is very wide, potentially limiting interpretability as to what constitutes an extreme value. The R̂, AÛC, and psR2 diagnostics are also sensitive to SEM and have a narrower range along the x-axis than Var(η−1). The better-performing diagnostics in the top three rows of Figure 1 are all visually similar to one another. Interestingly, the behavior of CV(η) very closely resembles SMUB(0.5) and relatively closely aligns with the value of SEM, as exhibited by the segments’ close proximity to the y = x line. The SMUB indices generally increase with SEM in the ρ = 0.75 scenarios and, furthermore, are often nearly in 1–1 correspondence with SEM. The extent to which this last statement is true depends upon the proximity between ϕ and ϕtrue, as the development of these estimators would suggest. The SMAB indices, which estimate the excess bias after adjusting for ignorable bias, vary little when ρ = 0.75, since in this case X1 is actually a relatively good surrogate for Y, and are therefore less sensitive to SEM. When ρ = 0.10, most of the bias is non-ignorable, and so the SMUB and SMAB indices nearly correspond. For the third and sixth rows of Figure 1, the auxiliary variable is an especially poor predictor of the survey outcome (ρ = 0.10). In this setting, all the diagnostics show a wide scatter of values across the different selection mechanisms, suggesting that none of them are of much use in predicting the bias. This finding supports the statement in Little et al. (2019) that having an auxiliary variable that is a good predictor of the survey outcome is a key requirement for detecting bias.
Figures 2 and 3 illustrate how these diagnostics change when κ < 1, that is, when the auxiliary variable for the outcome and the auxiliary variable for selection differ. As expected, diagnostics based solely on the propensity, that is, CV(η), AÛC, R̂, Var(η−1), and psR2, tend to falsely “detect” bias in these scenarios. False detection here means that segments are flat, varying in the x-value without any accompanying variation in the y-value. As noted in Table 2, smaller values of κ increase the value of ϕtrue towards 1 when βy ≠ 0, causing SMUB(0) to underestimate SEM more severely than in the corresponding results in Figure 1. In the extreme case of κ = 0, shown in Figure 3, the SMUB and SMAB indices are all nearly collinear. SMUB(1) looks most reasonable in this scenario because all selection mechanisms have either ϕtrue = 1 (when βy ≠ 0) or ϕtrue = 0 (when βy = 0). In the latter case, all results fall at the origin, and there is no bias to detect.
Figures S1–S3 give the analogous results using the alternative bias measure SAEM. Because SAEM is adjusted for ignorable bias, SMUB now tends to overestimate bias and SMAB is the better-performing estimator. None of the other diagnostics considered performs qualitatively differently.
Figures 1–3 characterize the systematic relationship between SEM and each diagnostic, but there is also sampling variability within each dataset. That is, does the realized value of a diagnostic in a given dataset change correspondingly when the realized value of SEM is higher or lower than its mean? Table 3 reports the Spearman correlation (multiplied by 100) between each candidate diagnostic and the SEM value under seven selected sets of {βx, βy} taken from Table 2 and three values of κ under ρ = 0.75. Those correlations within 5% of the largest-magnitude correlation (the row-wise maximum absolute value) are in boldface. From Table 3, all of the metrics except Cor(Ysel, η−1) and FMI(μy) exhibit strong positive or negative correlation with SEM, i.e. less than −0.6 or greater than 0.6, when κ = 1 and the selection mechanism is not SCAR. However, as κ decreases, the Spearman correlations decrease or even change sign when the signs of βx and βy are in opposite directions. This holds even for CV(η), which Figures 1–3 showed to be the most sensitive to SEM on a systematic basis among the existing diagnostics. For example, in the bottom-most three rows of Table 3, CV(η) has a Spearman correlation with SEM of about 0.70 when κ = 1, but this decreases to −0.44 when κ = 0. Insofar as one does not know the true value of κ, and thus whether to expect a positive or negative correlation with the error, this is problematic. The realized values of the SMUB measures do not exhibit this undesirable behavior but rather exhibit a consistently high Spearman correlation with the realized values of the SEM.
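The Spearman correlations reported in Table 3 can be computed, in principle, as below; this sketch uses hypothetical simulated replicates rather than the study's actual datasets.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks.
    (Mid-ranks for ties are unnecessary here, since continuous simulated
    diagnostics are tied with probability zero.)"""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical replicates: a diagnostic that partially tracks the realized SEM
rng = np.random.default_rng(7)
sem = rng.normal(size=500)
diagnostic = 0.7 * sem + 0.3 * rng.normal(size=500)
print(round(100 * spearman(sem, diagnostic)))  # reported as x100, as in Table 3
```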
Table 3:
Spearman correlations (two significant digits; ×100) between each candidate diagnostic and the standardized error measure (SEM) for seven exemplar sets of {βx, βy} taken from Table 2 and three values of κ ≡ Cor(X1,X2) with ρ ≡ Cor(X1, Y) set to 0.75 (Table S1 in the Supplement gives the same results with ρ set to 0.25). Those values in bold are within 5% of each row-wise maximum (in magnitude).
| {βx, βy} | κ | Var(η−1) | CV(η) | AÛC | R̂ | psR2 | Cor(Ysel, η−1) | FMI(μy) | SMUB(0) | SMUB(0.5) | SMUB(1.0) | SMAB(0.5) | SMAB(1.0) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SCAR | |||||||||||||
| {0, 0} | 1.0 | −4 | 4 | 4 | 4 | 4 | −51 | −1 | 73 | 73 | 73 | 73 | 73 |
| {0, 0} | 0.5 | −1 | 2 | 1 | 2 | 1 | −67 | 2 | 73 | 73 | 73 | 73 | 73 |
| {0, 0} | 0.0 | 3 | −4 | −3 | −4 | −3 | −66 | −4 | 72 | 72 | 72 | 72 | 72 |
| SAR | |||||||||||||
| {0.5, 0} | 1.0 | −67 | 62 | 74 | 71 | 73 | 25 | 9 | 70 | 70 | 65 | 51 | 47 |
| {0.5, 0} | 0.5 | −35 | 30 | 38 | 37 | 38 | −48 | 1 | 68 | 68 | 67 | 62 | 60 |
| {0.5, 0} | 0.0 | −2 | 2 | 3 | 3 | 3 | −61 | 3 | 67 | 67 | 67 | 67 | 67 |
| 3X2 + Y | |||||||||||||
| {0.375, 0.125} | 1.0 | −68 | 63 | 74 | 72 | 74 | 25 | 5 | 71 | 71 | 66 | 53 | 49 |
| {0.375, 0.125} | 0.5 | −41 | 39 | 45 | 44 | 45 | −43 | 4 | 69 | 69 | 68 | 62 | 59 |
| {0.375, 0.125} | 0.0 | −19 | 16 | 19 | 19 | 19 | −60 | −1 | 67 | 67 | 67 | 66 | 65 |
| X2 + Y | |||||||||||||
| {0.25, 0.25} | 1.0 | −65 | 62 | 72 | 69 | 71 | 25 | 8 | 69 | 68 | 63 | 51 | 48 |
| {0.25, 0.25} | 0.5 | −58 | 55 | 61 | 59 | 61 | −30 | 8 | 70 | 70 | 68 | 59 | 57 |
| {0.25, 0.25} | 0.0 | −39 | 39 | 41 | 40 | 41 | −53 | 3 | 71 | 70 | 69 | 65 | 64 |
| X2 + 3Y | |||||||||||||
| {0.125, 0.375} | 1.0 | −68 | 65 | 72 | 70 | 72 | 22 | 4 | 70 | 69 | 65 | 54 | 50 |
| {0.125, 0.375} | 0.5 | −66 | 61 | 69 | 66 | 69 | −4 | 5 | 70 | 69 | 67 | 58 | 55 |
| {0.125, 0.375} | 0.0 | −62 | 59 | 65 | 63 | 65 | −18 | 5 | 71 | 71 | 68 | 60 | 57 |
| Y | |||||||||||||
| {0, 0.5} | 1.0 | −66 | 65 | 71 | 69 | 71 | 22 | 9 | 69 | 69 | 66 | 56 | 53 |
| {0, 0.5} | 0.5 | −69 | 66 | 72 | 70 | 72 | 18 | 7 | 71 | 71 | 67 | 57 | 54 |
| {0, 0.5} | 0.0 | −67 | 66 | 71 | 69 | 71 | 14 | 12 | 69 | 69 | 65 | 54 | 51 |
| X2 – Y | |||||||||||||
| {0.25, −0.25} | 1.0 | −70 | 69 | 70 | 68 | 70 | −15 | −1 | 72 | 72 | 72 | 71 | 71 |
| {0.25, −0.25} | 0.5 | 19 | −19 | −19 | −19 | −19 | −67 | −4 | 72 | 72 | 72 | 72 | 72 |
| {0.25, −0.25} | 0.0 | 44 | −44 | −46 | −44 | −46 | −49 | −4 | 71 | 71 | 70 | 66 | 64 |
Tables S1 and S2 in the Supplement give the analogous results under ρ = 0.25 and ρ = 0.10, respectively. When ρ is small, as in Table S2, none of the diagnostics, including SMUB and SMAB, has a high correlation with SEM, highlighting the importance of obtaining auxiliary variables that correlate well with the outcome.
6. Discussion
Nishimura et al. (2016) found that none of their candidate diagnostics for detecting selection bias due to non-ignorable selection mechanisms was suitable for use. Our simulation study showed that the SMUB and SMAB family of measures proposed by Little et al. (2019) outperformed the other diagnostics, both in detecting the presence of bias and in directly estimating its value, and both systematically (Figures 1–3) and in the face of sampling variability (Table 3). The extent of non-ignorable selection is by definition inestimable, but the SMUB family is indexed by a tuning parameter ϕ, which allows the analyst to estimate the amount of selection bias directly by assuming that a specific degree of non-ignorable sampling has occurred. Our simulation study showed that the middle value of ϕ = 0.5, which minimizes the maximum possible distance from ϕtrue and which Little et al. (2019) heuristically suggested for default use, resulted in the diagnostic that most consistently estimated the true amount of selection bias.
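The minimax property of ϕ = 0.5 noted above is elementary: for ϕtrue ∈ [0, 1], the worst-case distance over ϕtrue of |ϕ − ϕtrue| equals max(ϕ, 1 − ϕ), which is minimized at ϕ = 0.5. A quick numerical check (the grids here are arbitrary):

```python
import numpy as np

phi_grid = np.linspace(0, 1, 101)   # candidate tuning parameters
phi_true = np.linspace(0, 1, 101)   # possible (unknowable) true values
# worst-case distance from phi_true for each candidate phi
worst = np.abs(phi_grid[:, None] - phi_true[None, :]).max(axis=1)
print(phi_grid[np.argmin(worst)])   # minimax choice: 0.5
```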
A number of additional qualities recommend the SMUB/SMAB family of statistics for the task of diagnosing and estimating selection bias. First, it correlates moderately well with the true measure of selection bias, as evidenced in Table 3. Second, our simulation study demonstrated that the difference between the median values of the SMUB statistic and SEM was zero when the tuning parameter ϕ matched the unknown value ϕtrue, a result consistent with the theoretical derivation of the SMUB. Third, the SMUB calculation does not require individual-level data on the non-sampled units but only summary statistics of the auxiliary variables, which makes it especially useful for non-probability samples and stands in contrast to the other diagnostics evaluated. Fourth and finally, SMUB is specific to an estimand of interest, meaning that it enables an analyst to order estimates computed from a non-probability sample in terms of their potential selection bias. Among the statistics considered in Nishimura et al. (2016), only FMI(μy) and Cor(Ysel, η−1) share this characteristic; the values of all other candidate diagnostics do not vary with the estimand. This fact alone arguably precludes those diagnostics from consideration, insofar as a single statistic cannot be expected to serve as a universal diagnostic for bias with respect to an arbitrary estimand. Moreover, the FMI statistic focuses on variance rather than bias, and the simulation study clearly points to its deficiency as a diagnostic for bias.
Because the actual selection mechanism is unknown in practice, it is not sufficient for a candidate diagnostic to correlate well with SEM under each selection mechanism separately. Rather, it must be correlated with SEM in the same way across many different selection mechanisms, since by definition of a non-probability sample one does not know the true selection mechanism. Furthermore, a high correlation between a diagnostic for selection bias and the true selection bias is only useful if there is knowledge about the distribution of the diagnostic, or even just its support. For example, although psR2 was consistently correlated with SEM, the values we observed in the simulation study were typically confined to a very small interval close to zero, such that it would be difficult to know in practice whether one has encountered a value extreme enough to suggest selection bias. The Var(η−1) diagnostic is limited in the opposite way: its range is arguably so extreme as to make it impractical for general use.
With regard to the other candidate diagnostics, our results were largely consistent with those reported in Nishimura et al. (2016). Because the only code from that paper that we used here was the function for calculating FMI(μy), our work largely represents an independent validation of their findings. Ironically, we found that the two statistics that make use of the greatest amount of data, Cor(Ysel, η−1) and FMI(μy), were among the least effective at detecting selection bias. CV(η) generally had a high correlation with the true amount of selection bias, even under non-ignorable settings; concerning, however, is its inconsistent behavior in the face of sampling variability, as demonstrated in Table 3.
Finally, the lack of a globally optimal value of the tuning parameter ϕ points to one possible and novel extension of the SMUB statistic. Although ϕtrue is, by definition of a non-probability sample, inestimable, the sampling probabilities could be learned about, e.g. through the collection of a small auxiliary probability sample or via non-response follow-up with a small sample of non-selected cases; the non-ignorable bias could then potentially be estimated and accounted for. Alternatively, one might propose a shrinkage-type SMUB statistic that adaptively combines estimates from the large non-probability sample (high bias/low variance) and the small probability sample (low bias/high variance), akin to the empirical Bayes estimator of Mukherjee and Chatterjee (2008).
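One way such a shrinkage-type combination might look is sketched below. This is purely illustrative: it is a generic precision-weighted average, not the Mukherjee and Chatterjee (2008) estimator itself, and all function names and inputs are hypothetical.

```python
def shrinkage_estimate(mean_np, var_np, mean_p, var_p):
    """Combine a large non-probability-sample estimate (low variance,
    possibly biased) with a small probability-sample estimate (unbiased,
    high variance). The weight on the non-probability sample shrinks
    toward zero when the two estimates disagree by more than their
    sampling variability would explain (an empirical-Bayes-type idea)."""
    # method-of-moments estimate of the squared bias of the non-probability mean
    bias2 = max((mean_np - mean_p) ** 2 - var_np - var_p, 0.0)
    w = var_p / (var_p + var_np + bias2)  # weight on the non-probability sample
    return w * mean_np + (1 - w) * mean_p

# Hypothetical inputs: a slightly biased big sample vs. a noisy small sample
print(shrinkage_estimate(mean_np=10.4, var_np=0.01, mean_p=10.0, var_p=0.25))
```

The combined estimate always lies between the two inputs, leaning toward the probability sample as the apparent bias of the non-probability sample grows.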
Supplementary Material
Acknowledgments
This work was supported by an R21 grant from the National Institutes of Health (1R21HD090366-01A1). The authors thank Dr. Raphael Nishimura for sharing R scripts to calculate the Fraction of Missing Information and Mr. Chen Chen for his initial work on this simulation study.
Footnotes
Disclosure
The authors report no potential conflicts of interest.
References
- Albert A and Anderson J (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71, 1–10.
- Andridge RR and Little RJ (2011). Proxy pattern-mixture analysis for survey nonresponse. Journal of Official Statistics 27, 153–180.
- Andridge RR and Little RJ (2020). Proxy pattern-mixture analysis for a binary variable subject to nonresponse. Journal of Official Statistics, to appear.
- Bootsma-van der Wiel A, Van Exel E, De Craen A, Gussekloo J, Lagaay A, Knook D, and Westendorp R (2002). A high response is not essential to prevent selection bias: results from the Leiden 85-plus Study. Journal of Clinical Epidemiology 55, 1119–1125.
- Brick JM and Williams D (2013). Explaining rising nonresponse rates in cross-sectional surveys. The Annals of the American Academy of Political and Social Science 645, 36–59.
- Heckman JJ (1979). Sample selection bias as a specification error. Econometrica 47, 153–161.
- Little RJ (1994). A class of pattern-mixture models for normal incomplete data. Biometrika 81, 471–483.
- Little RJ and Rubin DB (2002). Statistical Analysis with Missing Data. John Wiley & Sons, Hoboken, NJ, 2nd edition.
- Little RJ, West BT, Boonstra PS, and Hu J (2019). Measures of the degree of departure from ignorable sample selection. Journal of Survey Statistics and Methodology, to appear. doi:10.1093/jssam/smz023.
- Mukherjee B and Chatterjee N (2008). Exploiting gene-environment independence for analysis of case–control studies: An empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics 64, 685–694.
- Nagelkerke NJ (1991). A note on a general definition of the coefficient of determination. Biometrika 78, 691–692.
- Nishimura R, Wagner J, and Elliott M (2016). Alternative indicators for the risk of non-response bias: a simulation study. International Statistical Review 84, 43–62.
- Presser S and McCulloch S (2011). The growth of survey research in the United States: Government-sponsored surveys, 1984–2004. Social Science Research 40, 1019–1024.
- R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Rubin DB (1976). Inference and missing data. Biometrika 63, 581–592.
- Rubin DB (2004). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Hoboken, NJ.
- Särndal C-E and Lundström S (2010). Design for estimation: Identifying auxiliary vectors to reduce nonresponse bias. Survey Methodology 36, 131–144.
- Schouten B, Cobben F, and Bethlehem J (2009). Indicators for the representativeness of survey response. Survey Methodology 35, 101–113.
- van Buuren S and Groothuis-Oudshoorn K (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45, 1–67.
- Wickham H (2017). tidyverse: Easily install and load the ‘tidyverse’. R package version 1.2.1.
- Williams D and Brick JM (2018). Trends in US face-to-face household survey nonresponse and level of effort. Journal of Survey Statistics and Methodology 6, 186–211.
