Accounting for bias due to outcome data missing not at random: comparison and illustration of two approaches to probabilistic bias analysis: a simulation study

Emily Kawabata; Daniel Major-Smith; Gemma L Clayton; Chin Yang Shapland; Tim P Morris; Alice R Carter; Alba Fernández-Sanlés; Maria Carolina Borges; Kate Tilling; Gareth J Griffith; Louise A C Millard; George Davey Smith; Deborah A Lawlor; Rachael A Hughes

doi:10.1186/s12874-024-02382-4

. 2024 Nov 13;24:278. doi: 10.1186/s12874-024-02382-4

Accounting for bias due to outcome data missing not at random: comparison and illustration of two approaches to probabilistic bias analysis: a simulation study

Emily Kawabata ^1,^2,^#, Daniel Major-Smith ^1,^2,^#, Gemma L Clayton ^1,^2,^#, Chin Yang Shapland ^1,², Tim P Morris ³, Alice R Carter ^1,², Alba Fernández-Sanlés ⁴, Maria Carolina Borges ^1,², Kate Tilling ^1,², Gareth J Griffith ^1,², Louise A C Millard ^1,², George Davey Smith ^1,², Deborah A Lawlor ^1,², Rachael A Hughes ^1,^2,^✉

PMCID: PMC11558901 PMID: 39538117

Abstract

Background

Bias from data missing not at random (MNAR) is a persistent concern in health-related research. A bias analysis quantitatively assesses how conclusions change under different assumptions about missingness using bias parameters that govern the magnitude and direction of the bias. Probabilistic bias analysis specifies a prior distribution for these parameters, explicitly incorporating available information and uncertainty about their true values. A Bayesian bias analysis combines the prior distribution with the data’s likelihood function whilst a Monte Carlo bias analysis samples the bias parameters directly from the prior distribution. No study has compared a Monte Carlo bias analysis to a Bayesian bias analysis in the context of MNAR missingness.

Methods

We illustrate an accessible probabilistic bias analysis using the Monte Carlo bias analysis approach and a well-known imputation method. We designed a simulation study based on a motivating example from the UK Biobank study, where a large proportion of the outcome was missing and missingness was suspected to be MNAR. We compared the performance of our Monte Carlo bias analysis to a principled Bayesian bias analysis, complete case analysis (CCA) and multiple imputation (MI) assuming missing at random.

Results

As expected, given the simulation study design, CCA and MI estimates were substantially biased, with 95% confidence interval coverages of 7–48%. Including auxiliary variables (i.e., variables not included in the substantive analysis that are predictive of missingness and the missing data) in MI’s imputation model amplified the bias due to assuming missing at random. With reasonably accurate and precise information about the bias parameter, the Monte Carlo bias analysis performed as well as the Bayesian bias analysis. However, when very limited information was provided about the bias parameter, only the Bayesian bias analysis was able to eliminate most of the bias due to MNAR whilst the Monte Carlo bias analysis performed no better than the CCA and MI.

Conclusion

The Monte Carlo bias analysis we describe is easy to implement in standard software and, in the setting we explored, is a viable alternative to a Bayesian bias analysis. We caution careful consideration of choice of auxiliary variables when applying imputation where data may be MNAR.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12874-024-02382-4.

Keywords: Bayesian bias analysis, Inverse probability weighting, Missing not at random, Monte Carlo bias analysis, Multiple imputation, Probabilistic bias analysis, Sensitivity analysis, UK Biobank

Introduction

The main aim of many epidemiology studies is to estimate the causal effect of an exposure on an outcome (here onward, shortened to exposure effect). Inference about the exposure effect may be invalid when the sample included in the analysis is not a representative (random) sample of the target population. The choice of method for dealing with missing data partly depends on the mechanism causing the data to be missing (called missingness mechanisms). These mechanisms are commonly classified as missing completely at random (probability of missingness is independent of the observed and missing data), missing at random (MAR; probability of missingness is independent of the missing data given the observed data) and missing not at random (MNAR; probability of missingness depends on the missing data even after conditioning on the observed data) [1]. We focus on a MNAR missingness mechanism where the value of a variable directly affects its own probability of missingness [2]. Implementations of multiple imputation (MI) and inverse probability weighting (IPW) assume by default that data are MAR and so may give biased results when the missingness mechanism is MNAR. Note that implementations of MI and IPW incorporating MNAR mechanisms also exist (e.g., [3–5]).

Information about the missingness mechanism may be available from ancillary data such as instruments for missingness [6], record-linkage data [7, 8], and responsiveness data [9]. In the absence of such information, the analyst cannot distinguish between MAR and MNAR missingness mechanisms based on the observed data only [10]. Instead, the analyst must base their decision on expert knowledge or available literature. When MNAR missingness is suspected, a bias analysis (also known as a sensitivity analysis) is recommended to quantify the potential impact of MNAR missingness on their study conclusions [11–13].

A bias analysis for MNAR missingness (here onward, shortened to bias analysis) requires a model (known as a bias model) for the data and missingness mechanism. Two commonly used approaches are selection models and pattern-mixture models [11] (chapter 15, references therein). In the context of an outcome MNAR, the selection model usually consists of a model for the substantive analysis of interest and a model for the missingness mechanism that characterizes how missingness depends on the outcome. In contrast, the pattern-mixture model describes how the distribution of the outcome depends on missingness and may consist of a model for the substantive analysis that differs between participants with observed and missing outcome. Both types of model can be fitted using maximum likelihood, within a Bayesian framework or using multiple imputation [11, 14, 15].

Under MNAR both the selection and pattern-mixture models are unidentified models since the observed data does not provide any information about the parameters governing the dependency between the outcome and missingness (known as bias or sensitivity parameters). Setting the bias parameters to prespecified values enables estimation of the remaining parameters of the model and provides an estimate of the exposure effect adjusted for bias due to MNAR (here onward, called the bias-adjusted exposure effect estimate). By changing the values of these bias parameters, a bias analysis estimates the bias-adjusted exposure effect under different assumptions about the missingness mechanism.

A bias analysis can be implemented as a deterministic or probabilistic bias analysis [12, 15]. In a deterministic bias analysis, a range of values is specified for all bias parameters and then for each plausible combination of values, the bias model is estimated by fixing the bias parameters to these values. This approach provides the analyst with information about the range of possible estimates for the exposure effect but does not indicate which of these estimates are most likely to occur, making interpretation of the results challenging [12]. Alternatively, a probabilistic bias analysis specifies a prior probability distribution for the bias parameters which explicitly incorporates the analyst’s assumptions about plausible values and the combinations of values most likely to occur. The probabilistic bias analysis generates a distribution of bias-adjusted exposure effect estimates which is then summarised as a point estimate (e.g., the median as a measure of central tendency) and a 95% interval estimate (e.g., 2.5th and 97.5th percentiles as limits of the interval) that accounts for the analyst’s uncertainty about the MNAR missingness mechanism in addition to the usual random sampling error.

A probabilistic bias analysis can be implemented as a Bayesian bias analysis (where the prior distribution of the bias parameters is combined with the likelihood function for the data) or as a Monte Carlo bias analysis (where values of the bias parameters are directly sampled from their prior distribution and then used to fix the bias parameters to enable estimation of the bias-adjusted exposure effect) [16]. Generally, a Monte Carlo bias analysis is simpler to understand, quicker and easier to implement as it requires no Bayesian computation [12, 17, 18]. We note that the term “Monte Carlo” is also used to describe simulation-based techniques for Bayesian inference. To avoid confusion, we shall use the term “Markov Chain Monte Carlo (MCMC)” when referring to sampling from a posterior distribution and “Monte Carlo bias analysis” when referring to a type of probabilistic bias analysis.

In the context of bias analysis to unmeasured confounding or misclassification, a small number of studies have compared a Monte Carlo bias analysis to a Bayesian bias analysis [16–21]. Along with some theoretical arguments, these studies indicate that the Monte Carlo bias analysis is a good approximation of a Bayesian bias analysis provided the prior distribution for the bias parameters only specifies plausible values given the observed data [17–20]. Otherwise, the Monte Carlo bias analysis can give interval estimates that are either too wide or too narrow [16, 19]. No study has compared a Monte Carlo bias analysis to a Bayesian bias analysis in the context of a bias analysis to MNAR missingness.

Currently, there is limited guidance on implementing a probabilistic bias analysis to data MNAR. Recent exceptions for cross-sectional analyses include: (1) a pattern-mixture approach where draws from a prior distribution (of the bias parameters) are used to impute a categorical covariate under MNAR [22, 23] and (2) a Bayesian implementation of a selection model for a partially observed continuous outcome [24]. Additionally, in the context of selection bias due to non-random selection of participants into a study, Banack et al. review and compare a Monte Carlo bias analysis to an alternative approach that simulates the entire dataset under different assumptions about the selection bias [25] and Jayaweera et al. conducted a Monte Carlo bias analysis by inversely weighting participants based on their probability of inclusion (i.e., participating and remaining in the study combined) [26].

In this paper, we illustrate a Monte Carlo bias analysis [12, 17] using a pattern-mixture version of fully conditional specification (FCS) imputation [5, 27, 28]. Via a data example and simulations, we compare the performance of our Monte Carlo bias analysis to a Bayesian bias analysis in a setting where a large proportion of the outcome is missing and missingness is suspected to be MNAR. R and Stata software code implementing the Monte Carlo and Bayesian bias analyses is available from https://github.com/MRCIEU/COVIDITY_ProbQBA.

Methods

Hypothetical example

We want to estimate the effect of an exposure (or treatment) $X$ on an outcome $Y$ , denoted $β_{X}$ . To estimate $β_{X}$ , our substantive analysis is a generalised linear regression of $Y$ on $X$ adjusted for measured confounders $Z$ and $W$

E (Y | X, Z, W) = g_{Y}^{- 1} (β_{0} + β_{X} X + β_{Z} Z + β_{W} W)

where $g_{Y}^{- 1} (\cdot)$ denotes the inverse link function. We assume all confounders of the $Y$ – $X$ association are measured and without error, and in the absence of missing data that the substantive analysis would give unbiased results for $β_{X}$ . Outcome $Y$ is observed in a small proportion of study participants. The study recorded data on auxiliary variables (i.e., variables not included in the substantive analysis) that are predictive of the missing values of $Y$ and whether $Y$ was observed or missing. Also, a small proportion of participants are missing data on exposure $X$ and some of the confounders and auxiliary variables. Let $Z$ and $W$ denote the fully and partially observed confounders, respectively, and $A$ and $D$ denote the fully and partially observed auxiliary variables, respectively. To simplify the notation, and without loss of generality, we assume that $A$ denotes a single variable, and similarly for $D, Z$ and $W$ . Binary variables $M^{Y}, M^{X}, M^{W}$ and $M^{D}$ denote the missingness indicators of $Y, X, W$ and $D$ , respectively (e.g., $M^{Y} = 1$ when $Y$ is missing and $M^{Y} = 0$ otherwise).

Figure 1 depicts two missingness directed acyclic graphs (m-DAGs [29]) showing the relationships among the variables of our substantive analysis of interest ( $Y, X, Z$ and $W$ ), the auxiliary variables ( $A$ and $D$ ), and the missingness mechanisms of $Y$ , and of $X, W$ and $D$ . Note that m-DAGs do not specify the form of these relationships (e.g., nonlinear relationships between variables). Exposure effect, $β_{X}$ , represents the total effect of $X$ on $Y$ (i.e., direct effect and indirect effect via auxiliaries A and D). We consider two scenarios, when $β_{X}$ is not-null (Fig. 1a) and null (Fig. 1b). Note that $U^{WZ}$ and $U^{DA}$ denote unmeasured shared ancestors of $W$ and $Z$ , and $D$ and $A$ , respectively. Outcome $Y$ is MNAR depending on fully observed auxiliary $A$ , the missing values of $Y$ , and the observed and missing values of exposure $X$ and auxiliary $D$ . Note that missingness of $Y$ does not depend on $W$ or $Z$ , and we exclude the special case where the MNAR mechanism depends on $X$ and $Y$ independently [30]. Variables $X, W$ and $D$ are MAR depending on fully observed confounder $Z$ and auxiliary $A$ ; hence this MAR mechanism applies across all missing data patterns of $Y, X, W$ , and $D$ .

Fig. 1 — Missingness directed acyclic graphs (m-DAGs) of the scenario investigated by the simulation study when the exposure effect, $β_{X}$ , is (a) not-null and (b) null. Black edges depict the relationships in the fully observed data, and the blue and red edges depict the missingness mechanisms of the outcome and baseline variables (exposure, confounders, and auxiliary variables), respectively

Complete case analysis and ignorable missing data methods

Popular missing data methods include complete case analysis (CCA) and ignorable implementations of MI and IPW that assume MAR (hereafter referred to as MI and IPW, respectively). A detailed comparison of these methods is provided elsewhere (e.g., [10, 31, 32]).

For our hypothetical example, CCA and MI are expected to give a biased estimate for $β_{X}$ in both the null and not-null scenarios. Since missingness of $Y$ depends jointly on $X$ and $Y$ , CCA is an invalid approach even when the substantive analysis is a logistic regression [30]. For MI, the MAR assumption is not valid, regardless of the variables included in the imputation model, since missingness of $Y$ depends directly on $Y$ (i.e., path $Y \to M^{Y}$ ). See the Supplementary materials for information about IPW in the context of our hypothetical example.

We next describe two non-ignorable missing data methods, a Bayesian bias analysis using a selection model (here onward, called Bayesian SM) and a Monte Carlo bias analysis using a pattern-mixture model (here onward, called Monte Carlo NARFCS). The bias models of Bayesian SM and Monte Carlo NARFCS consist of a collection of generalised linear regressions. For simplicity and without loss of generality, we describe Bayesian SM and Monte Carlo NARFCS with respect to continuous variable $X$ and binary variables $Y, W, A$ and $D$ , whilst $Z$ is left unspecified and, by definition, missingness indicators $M^{Y}, M^{X}, M^{W}$ and $M^{D}$ are binary.

Bayesian SM

Bias model specified as a selection model

We use the sequential modelling approach [33–36] to jointly model the substantive analysis, the MNAR missingness mechanism for $Y$ , and the models to estimate the missing values of $X, W$ and $D$ . The sequential modelling approach factorises a joint distribution into a sequence of simpler univariate distributions, where each univariate distribution is modelled using an appropriate regression model (e.g., linear regression for continuous variables and logistic regression for binary variables). We specify the following regression models for the joint distribution of $W, X, Y, A, D, M^{Y} |Z)$ :

\begin{matrix} P r (W = 1 | Z) \sim e x p i t \{η_{0} + η_{Z} Z\}, X |W, Z) \sim N (ζ_{0} + ζ_{W} W + ζ_{Z} Z, ξ^{2}), \\ P r (Y = 1 | X, W, Z) \sim e x p i t \{β_{0} + β_{X} X + β_{W} W + β_{Z} Z\}, \\ P r (A = 1 | Y, X, W, Z) \sim e x p i t \{θ_{0} + θ_{Y} Y + θ_{X} X + θ_{W} W + θ_{Z} Z\}, \\ P r (D = 1 | A, Y, X, W, Z) \sim e x p i t \{ω_{0} + ω_{A} A + ω_{Y} Y + ω_{X} X + ω_{W} W + ω_{Z} Z\}, \\ P r (M^{Y}, = 1 | D, A, Y, X) \sim e x p i t \{ψ_{0} + ψ_{D} D + ψ_{A} A + δ^{SM} Y + ψ_{X} X\}, \end{matrix}

where $e x p i t \{k\} = e x p \{k\} / (1 + e x p \{k\})$ , and $δ^{SM}$ is the bias parameter representing the difference in the log-odds of observing $Y$ between those with $Y = 1$ and $Y = 0$ , conditional on $D, A, Y$ and $X$ . Note that as missingness of $Y$ is conditionally independent of $W$ and $Z$ given $D, A, Y$ and $X$ then the bias model correctly assumes that $Pr (M^{Y}, = 1 | D, A, Y, X, W, Z) = Pr (M^{Y}, = 1 | D, A, Y, X)$ for all values of $W$ and $Z$ . Let $Ψ^{SM}$ denote the set of all estimable parameters of model [2] (i.e., all except $δ^{SM}$ ), noting $Ψ^{SM}$ includes exposure effect $β_{X}$ .

Different orderings of these regression models may result in different joint distributions [37]. We specified this ordering because it includes: the substantive analysis, a model for the MNAR missingness mechanism of $Y$ , and incorporates auxiliary variables $A$ and $D$ without altering the substantive analysis. This ordering is compatible with a selection model framework. The ordering of the remaining models can be with respect to the amount of missing data (i.e., starting with the model for the variable with the least amount of missing data). Note that previous studies have reported that a Bayesian implementation of the sequential modelling approach appears robust to the ordering of the models [38, 39] but it may affect computational time [33].

Prior probability distributions

We assign independent prior distributions for all parameters. Following standard practice, we assign a normal distribution for each coefficient of the regression models and an inverse gamma distribution for the variance parameter of a linear regression [40]. For $δ^{SM}$ we assign Normal distribution $δ^{SM} \sim N (μ^{SM}, σ^{SM})$ where values for mean $μ^{SM}$ and variance $σ^{SM}$ are chosen based on external information such as published results, expert opinion, or external data. In practice, external information about $δ^{SM}$ may be unattainable. Instead, it may be easier to obtain external information about a related parameter (such as the marginal difference in the log-odds of observing $Y$ between those with $Y = 1$ and $Y = 0$ ) that can then be converted into information about $δ^{SM}$ . We illustrate this concept when deriving values for hyperparameters $μ^{SM}$ and $σ^{SM}$ in our motivating example. For all remaining parameters, $Ψ^{SM}$ , we assign vague priors; namely, $N (0,100)$ for the coefficients and Inv-Gamma(0.01,0.01) for the variances.

Bayesian implementation

In the Bayesian framework, Bayes’ theorem is applied to combine the prior distributions for the bias model parameters with the likelihood function for the data to obtain the joint posterior distribution of ( $δ^{SM}, Ψ^{SM}$ ). Therefore, application of Bayes’ theorem may rule out certain values of $δ^{SM}$ because they are incompatible with the data [16]. From the joint posterior distribution of ( $δ^{SM}, Ψ^{SM}$ ), we can derive the conditional posterior distribution of a single parameter, such as $β_{X}$ .

The Bayesian framework views the missing data of $Y, X, W,$ and $D$ and parameters $δ^{SM}$ and $Ψ^{SM}$ as unknown quantities to be estimated. Since direct sampling from the joint distribution of these unknown quantities is difficult, we fit the selection model using MCMC estimation, specifically Gibbs sampling implemented by JAGS (version 4.3.0) [41–43] using R package jagsUI (version 1.5.2) [44].

Monte Carlo NARFCS

Bias model specified as a pattern-mixture model

We use the Not-At-Random Fully Conditional Specification (NARFCS) approach [5] which is an MNAR extension of the MAR imputation method FCS [28] (see references therein for other variants of FCS). Like FCS, NARFCS imputes each variable under a separate univariate regression model (of type appropriate to the variable being imputed) and updates the missing data for each variable in turn using an iterative algorithm which we shall call the FCS algorithm [45, 46]. Note that the univariate distributions implied by these regression models may not be consistent with the same joint distribution and different orderings of these regression models within the FCS algorithm could lead to sampling from different joint distributions [46, 47]. In practice, FCS has been shown to be a robust approach even when the set of regression models are not compatible with the same joint distribution ( [46], references therein). The order in which the partially observed variables are updated within the FCS algorithm is typically determined by the amount of missing data [27].

We specify the following regression models for our NARFCS bias model:

\begin{matrix} P r (W = 1 | Y, X, D, A, Z, M^{Y}, M^{X}, M^{D}) \sim e x p i t \{α_{0} + α_{Y} Y + α_{X} X + α_{D} D + α_{A} A + α_{Z} Z + α_{M^{Y}} M^{Y} + α_{M^{X}} M^{X} + α_{M^{D}} M^{D}\}, \\ X |Y, W, D, A, Z, M^{Y}, M^{W}, M^{D}) \sim N (γ_{0} + γ_{Y} Y + γ_{W} W + γ_{D} D + γ_{A} A + γ_{Z} Z + γ_{M^{Y}} M^{Y} + γ_{M^{W}} M^{W} + γ_{M^{D}} M^{D}, ε^{2}), \\ P r (D = 1 | Y, X, W, A, Z, M^{Y}, M^{X}, M^{W}) \sim e x p i t \{κ_{0} + κ_{Y} Y + κ_{X} X + κ_{W} W + κ_{A} A + κ_{Z} Z + κ_{M^{Y}} M^{Y} + κ_{M^{X}} M^{X} + κ_{M^{W}} M^{W}\}, \\ P r (Y = 1 | X, W, D, A, Z, M^{Y}, M^{X}, M^{W}, M^{D}) \sim e x p i t \{λ_{0} + λ_{X} X + λ_{W} W + λ_{D} D + λ_{A} A + λ_{Z} Z + δ^{NARFCS} M^{Y} + λ_{M^{X}} M^{X} + λ_{M^{W}} M^{W} + λ_{M^{D}} M^{D}\} \end{matrix}

where $δ^{NARFCS}$ is the bias parameter, representing the difference in the log-odds of $Y = 1$ between those with observed and missing values of $Y$ . Let $Ψ^{NARFCS}$ denote the set of all estimable parameters of model [3] (i.e., all except $δ^{NARFCS}$ ), which does not include $β_{X}$ .

NARFCS differs from FCS in two ways which we shall illustrate using the regression model for $Y$ in the bias model, [3], above. First, NARFCS includes missingness indicator $M^{Y}$ as an independent variable in the regression model for $Y$ in order to quantify how the distribution of $Y$ differs between participants with observed and missing values of $Y$ . Hence NARFCS belongs to the class of pattern-mixture models. Second, NARFCS includes the missingness indicators of the other partially observed variables, $M^{X}, M^{W}, M^{D}$ , as independent variables in the regression model for $Y$ in order to maximise the amount of correlation between the variables captured by the model [5]. Note that the regression model for $X$ omits $M^{X}$ as an independent variable because we assume $X$ is MAR given $A$ and $Z$ . Similarly, for the regression models of $W$ and $D$ .

Prior probability distributions

NARFCS does not assign a prior distribution for $δ^{NARFCS},$ instead it fixes $δ^{NARFCS}$ to a prespecified value before applying the FCS algorithm. For remaining parameters, $Ψ^{NARFCS}$ , NARFCS independently samples the parameters of each regression model from a posterior distribution (or an approximation) under a vague prior distribution (to ensure uncertainty from estimating the imputation model parameters is propagated through to the resulting imputations [27]). For example, for a regression with coefficients $υ$ (and if applicable variance parameter $ς$ ) Stata command mi impute chained (version 17 [48]) and R package mice (version 3.14.0) specify prior $p (ν, ς) \propto^{1} /_{ς})$ for a linear regression and prior $p (ν) \propto 1$ for a logistic regression [49, 50].

Note that Tompsett et al. [5] illustrate a deterministic bias analysis using NARFCS where (in the context of our hypothetical example) the user prespecifies multiple values for $δ^{NARFCS}$ and then repeatedly applies NARFCS by fixing $δ^{NARFCS}$ to each pre-specified value in turn. As we are implementing a probabilistic bias analysis using NARFCS, we must specify a prior distribution for $δ^{NARFCS}$ . In keeping with Bayesian SM, we use prior $p (δ^{NARFCS}) \sim N (μ^{NARFCS}, σ^{NARFCS})$ with $μ^{NARFCS}$ and $σ^{NARFCS}$ set to values based on external information about $δ^{NARFCS}$ , or more practically on a related parameter that is then converted into information about $δ^{NARFCS}$ .

Monte Carlo bias analysis

The Monte Carlo bias analysis repeatedly samples directly from the prior distribution for $δ^{NARFCS}$ before fitting the bias model. Therefore, no sampled values of $δ^{NARFCS}$ are rejected due to incompatibility with the observed data. Using the NARFCS bias model, we generate a Monte Carlo frequency distribution of bias-adjusted estimates of $β_{X}$ by repeatedly carrying out the following steps $S$ $(S > 1)$ times: for $s = 1, \dots, S$

i.
Randomly draw a value for the bias parameter directly from its prior distribution, $δ^{N A R F C S (s)} \sim N (μ^{NARFCS}, σ^{NARFCS})$ .
ii.
Impute the observed data $K$ $(K \geq 1)$ times using the NARFCS bias model with the bias parameter fixed at $δ^{N A R F C S (s)}$ . Fit the substantive analysis separately to each imputed dataset using maximum likelihood estimation and combine the multiple sets of results for $β_{X}$ using Rubin’s rules [1]. Let ${\tilde{β}}_{X}^{N A R F C S (s)}$ and ${\tilde{V}}_{X}^{N A R F C S (s)}$ denote the combined estimate of $β_{X}$ and accompanying variance, respectively.
iii.
Incorporate random sampling error ${\hat{β}}_{X}^{N A R F C S (s)} \sim N ({\tilde{β}}_{X}^{N A R F C S (s)}, {\tilde{V}}_{X}^{N A R F C S (s)})$ .

After $S$ steps, we compute the median, 2.5th and 97.5th percentiles of the frequency distribution of ${\hat{β}}_{X}^{N A R F C S (1)}, {\hat{β}}_{X}^{N A R F C S (2)}, \dots, {\hat{β}}_{X}^{N A R F C S (S)}$ to obtain our Monte Carlo NARFCS point and interval estimates of $β_{X}$ . Monte Carlo NARFCS was implemented in R using the NARFCS extension to mice [51] and in Stata using mi impute with option offset.

Simulation study design

We compared the performance of Monte Carlo NARFCS with Bayesian SM when a large proportion of data were missing under a very strong MNAR mechanism. We evaluated these methods when the prior distribution for the bias parameter was (i) inaccurate and imprecise, (ii) accurate and reasonably precise, and (iii) accurate and very precise. We repeated the simulation study for $β_{X} = 0$ and $β_{X} = ln (3)$ and for two data generating models: based on the selection model framework (SM data generating model) and the pattern-mixture model framework (PMM data generating model). For all combinations of the simulation settings, we generated 1000 simulated data sets, each with 100,000 observations for the full sample.

Generation of the complete data

The simulation study was based on the hypothetical example described above with the exception that $Z = (Z_{1}, Z_{2}, Z_{3})$ denotes three fully observed confounders and $A = (A_{1}, A_{2})$ denotes two fully observed auxiliary variables. Exposure $X$ and $Z_{2}$ were continuous variables with mean 0 and standard deviation of 1, and the remaining variables were binary ( $Z_{1}, Z_{3}, A,$ outcome $Y$ , partially observed confounder $W$ , partially observed auxiliary $D$ and missingness indicators $M^{Y}, M^{X}, M^{W}$ , and $M^{D}$ ).

First, we simulated (complete) data on $X, Y, Z, W, A, D,$ and $M^{Y}$ from their joint distribution factorised into a series univariate regressions: logistic regression for $Y, Z_{1}, Z_{3}, W, A, D,$ and $M^{Y}$ , and linear regression for $X$ and $Z_{2}$ . We considered two factorisations of this joint distribution, with the factorisation for the SM and PMM data generating models chosen to resemble the bias model of Bayesian SM and Monte Carlo NARFCS, respectively. See the Supplementary materials for further details.

Most of the parameter values of the SM data generating model were set to estimates from an analysis of a real dataset (our motivating example, described in the next section). Note that the value of $δ^{SM}$ was artificially derived to simulate a very strong MNAR mechanism. The marginal prevalence of $Y$ was fixed at 5% for both the $β_{X} = 0$ and $β_{X} = ln (3)$ scenarios. We were unable to analytically derive the corresponding parameter values of the PMM data generating model. Instead, we fitted the PMM data generating model to a dataset of 50,000,000 observations simulated under the SM data generating model and then used the resulting estimates as the parameter values of the PMM data generating model [4].

Generation of the missing data

Following generation of the complete data, which included missingness indicator $M^{Y}$ , values of $Y$ were set to missing when $M^{Y} = 1$ . Missing data for $X, W,$ and $D$ were subsequently generated independently of each other and of $Y$ using the following missingness mechanisms of MAR given fully observed variables:

\begin{matrix} P r (M^{X} = 1 | Z_{1}, Z_{2}, Z_{3}, A_{1}, A_{2}) = e x p i t \{- 3.20 + 0.233 \times Z_{1} - 0.0570 \times Z_{2} - 0.133 \times Z_{3} + 0.363 \times A_{1} + 0.763 \times A_{2}\}, \\ P r (M^{W} = 1 | Z_{1}, Z_{2}, Z_{3}, A_{1}, A_{2}) = e x p i t \{- 2.90 + 0.0720 \times Z_{1} - 0.232 \times Z_{2} - 0.774 \times Z_{3} + 0.169 \times A_{1} + 0.417 \times A_{2}\}, \\ P r (M^{D} = 1 | Z_{1}, Z_{2}, Z_{3}, A_{1}, A_{2}) = e x p i t \{- 2.95 - 0.0590 \times Z_{1} - 0.0290 \times Z_{2} - 0.190 \times Z_{3} + 0.130 \times A_{1} + 0.192 \times A_{2}\} \end{matrix}

where all parameter values were derived from the observed data of our motivating example. These missingness mechanisms were the same for both the $β_{x} = 0$ and $β_{x} = ln (3)$ scenarios and the SM and PMM data generating models, resulting in a non-monotone missingness pattern. Close to 5% of the observations of $X, W,$ and $D$ were set as missing.

Missing data methods and evaluation

Probabilistic bias analyses, Bayesian SM and Monte Carlo NARFCS, were implemented as described previously Based on running standard convergence checks [40] on one randomly selected dataset, Bayesian SM was applied with 50,000 iterations, of which 5,000 were burn-in iterations. Monte Carlo NARFCS was applied with 10,000 Monte Carlo steps and single imputation within each step. To assess sensitivity to the number of Monte Carlo steps and imputed datasets, we also conducted Monte Carlo NARFCS using 10,000 Monte Carlo steps with five imputations, and 5,000 Monte Carlo steps with single imputation. The number of burn-in iterations of the FCS algorithm was always set to 10. We applied Bayesian SM and Monte Carlo NARFCS with three different priors for the bias parameter: (i) vague prior $N (0,100)$ , (ii) informative prior $N (t r u t h, 4)$ , and (iii) very informative prior $N (t r u t h, 1)$ , where $truth$ denotes the true value of the bias parameter. Note that the true value of $δ^{NARFCS}$ was unknown (since it was not a parameter of either data generating model) and so we instead used an estimate of $δ^{NARFCS}$ based on a simulated dataset of 50,000,000 observations.

We compared Bayesian SM and Monte Carlo NARFCS to a CCA and MI. We applied MI using FCS imputation with 10 burn-in iterations and 50 imputations, imputing the binary and continuous variables using logistic and linear regressions, respectively. (See Supplementary materials for further details on all missing data methods).

The estimand of interest was the exposure effect $β_{X}$ . For the SM data generating model, the true value of $β_{X}$ was known as it was a parameter of this model, whilst for the PMM data generating model, a value for $β_{X}$ was computed by fitting the substantive analysis to a dataset of size 50,000,000 before data deletion. Performance measures of interest were bias, empirical, and model-based standard errors, and 95% confidence interval (CI) coverage of estimates of $β_{X}$ . We used Stata version 17.0 [48] to generate the data. The remaining methods were conducted in R 4.1.0 [52]. Bayesian SM and Monte Carlo NARFCS were applied using high performance computing for parallel processing [53] across the simulated datasets. R package rsimsum (version 0.11.3) [54] was used to compute the simulation results.

Motivating example

The motivating example for our simulation study is a previously described study where the substantive analysis of interest is a logistic regression of SARS-CoV-2 infection (0 not infected, 1 infected) on body mass index (BMI) adjusted for confounders age, sex (0 female, 1 male), university degree (0 no, 1 yes), and current smoker (0 no, 1 yes) [55]. There are three auxiliary variables: diagnosis of asthma (0 no, 1 yes), diabetes (0 no, 1 yes), and hypertension (0 no, 1 yes). This motivating example illustrates derivation of an informative prior for $δ^{SM}$ and $δ^{NARFCS}$ . As this is an illustrative example, we have ignored other potential sources of bias (such as selection bias due to non-random participation in UK Biobank [56]), and we have only considered a small number of confounders of the outcome–exposure relationship.

Motivating case study

Using data from the UK Biobank study (UKB) [56], we define our target population as middle aged and elderly adults (aged 47 – 86, with close to 75% of participants aged 61 or older) resident and alive in England on 1st January 2020. Active SARS-CoV-2 infection was defined as either a positive SARS-CoV-2 PCR test (from linked Public Health England data) or COVID-19 recorded on a death certificate between 1st January 2020 and 18th May 2020 (i.e., the date mass testing became available in the UK; [57]). Testing for SARS-CoV-2 was highly restricted during this period and so data on SARS-CoV-2 infection were missing for over 98% of participants. Data on SARS-CoV-2 infection were suspected to be MNAR since testing among the majority of the UK population (i.e., non-healthcare workers) was mainly restricted to those who experienced symptoms of COVID-19 [58]. Observed factors associated with the chance of being tested in UKB included having higher BMI, being a current smoker, having a pre-existing condition (such as asthma, diabetes, or hypertension), being female, and having a university degree or higher [55].

Among the 445,377 participants included in the UKB study, we excluded 24,465 (5.49%) participants who died before 2020 and 65 (0.0146%) who were not tested for SARS-CoV-2 but were diagnosed with COVID-19 post-mortem. Of the remaining 420,847 participants eligible for analysis, 405,174 (96.3%) were missing the outcome only, 10,870 (2.58%) were missing the outcome and at least one covariate (BMI, smoker, or degree), 4,610 (1.10%) had complete data and 193 (0.0459%) had an observed outcome but were missing at least one covariate (Supplementary Table 3). Confounders age and sex, and auxiliary variables asthma, diabetes, and hypertension were fully observed. Figure 2 shows the m-DAG for this motivating example based on subject-matter knowledge and our investigations of observed predictors of missingness (Supplementary tables 4 and 5). We assume the covariate data were MAR and there were no unmeasured common causes after accounting for age, sex, degree, smoker, BMI, asthma, diabetes, and hypertension.

Fig. 2 — Missingness directed acyclic graph for the UK Biobank example. Black edges depict the assumed relationships in the fully observed data between the outcome (SARS-CoV-2 infection), exposure (body mass index (BMI)), confounders (age, sex, degree, and smoker), and auxiliary variables (asthma, diabetes, and hypertension). Tested, M^BMI, and M^{degree,smoker} denote missingness indicators for the outcome, exposure, and confounders, respectively. Blue and red edges depict the missingness mechanisms of the outcome and covariates (exposure and confounders), respectively. Note, we have not included all edges between the variables

Statistical analyses

We analysed the data using CCA, MI, Bayesian SM, Monte Carlo NARFCS, and a “population-based comparison group approach” where untested participants were assumed to be not infected with SARS-CoV-2 [59–61]. (See the Supplementary materials for further details). Due to convergence problems encountered when applying Bayesian SM to the full data, we restricted all analyses to the 409,784 participants with complete data on the covariates. This simplified the imputation, weighting, and bias models by reducing the number of parameters to be estimated. Given the small percentage of dropped participants (the majority of which had a missing outcome), the characteristics of the full sample and the restricted sample were virtually the same (Supplementary Table 6). In keeping with the preceding paper [55], and to improve the efficiency of MCMC sampling by reducing autocorrelation in the chains, each continuous variable (age and BMI) was standardized by subtracting its observed mean and dividing by its observed standard deviation. These standardised variables were used in all analyses. We applied MI with 50 imputed datasets, Bayesian SM using 50,000 MCMC iterations (including 5,000 burn-in iterations), and Monte Carlo NARFCS with 10,000 Monte Carlo steps and single imputation. Bayesian SM and Monte Carlo NARFCS were applied using an informative prior for $δ^{SM}$ and $δ^{NARFCS}$ , respectively.

Derivation of the informative prior $δ^{SM}$ and $δ^{NARFCS}$

The hyperparameters of the informative priors $p (δ^{SM}) \sim N (μ^{SM}, σ^{SM})$ and $p (δ^{NARFCS}) \sim N (μ^{NARFCS}, σ^{NARFCS})$ were derived from published results of the REal-time Assessment of Community Transmission-2 (REACT-2) national study [62]. The REACT-2 study sent home-based SARS-CoV-2 antibody test kits to over 100,000 randomly sampled adults living in England between 20th June and 13th July 2020. Among 65–74-year-olds (similar age range to our study), SARS-CoV-2 antibody prevalence was estimated to be 3.2% [95% CI 2.8–3.6%] [62] by mid-July 2020.

Bias parameters $δ^{SM}$ and $δ^{NARFCS}$ are conditional parameters on the log-odds scale. So, we used an algorithm from Tompsett et al. [5] to compute approximate values of $δ^{SM}$ and $δ^{NARFCS}$ calibrated to marginal prevalences of SARS-CoV-2 infection. For prior $p (δ^{NARFCS}) \sim N (μ^{NARFCS}, σ^{NARFCS})$ , we set $μ^{NARFCS} = - 2.6$ (the value of $δ^{NARFCS}$ calibrated to a prevalence of 3.2%) and set $σ^{NARFCS} = {0.22}^{2}$ such that 95% of the sampled values of $δ^{NARFCS}$ were expected to be between -3.0 and -2.2 (which were the values of $δ^{NARFCS}$ calibrated to prevalences of approximately 2.2% and 4.2%, respectively). Note that we allowed for additional uncertainty because the prevalence of infection was unknown in our UKB study. The comparable prior for Bayesian SM was $p (δ^{SM}) \sim N (2.6, {0.22}^{2})$ . See Supplementary materials for further details.

Results

Simulation study results

When there were no missing data, the full data estimate of $β_{X}$ was unbiased and CI coverage was close to the nominal level in all scenarios. Figure 3 shows the bias and coverage of estimating $β_{X}$ in the presence of missing data using different missing data methods when the true value of $β_{X}$ was $ln (3)$ and 0 and the data were generated using the SM data generating model (detailed results reported in Supplementary tables 8 and 9). There was substantial bias and severe CI under-coverage for the CCA estimates, with similar levels of bias for $β_{X} = ln (3)$ and $β_{X} = 0$ but slightly higher CI coverage for $β_{X} = ln (3)$ due to wider CIs. When $β_{X} = 0$ , MI had broadly comparable levels of bias and CI coverage to CCA. However, when $β_{X} = ln (3)$ , the bias of the MI estimates was noticeably larger than that of CCA. This was likely due to amplification of the bias (resulting from incorrect assumptions about the missingness mechanism) caused by including variables in the imputation model that were strongly predictive of $Y$ [63] (see Supplementary materials for further details).

Fig. 3 — Bias and 95% confidence interval coverage of exposure effect, $β_{X}$ , according to the not null ( $β_{X} = ln (3)$ ) and null ( $β_{X} = 0$ ) scenarios for data generated using SM data generating model. Error bars denote 95% Monte Carlo intervals, and the vertical dashed line denotes zero bias (top) and nominal coverage (bottom). Results for Bayesian SM were based on 926–928 simulated datasets; the remaining methods were based on 1,000 simulated datasets

For both the $β_{X} = ln (3)$ and $β_{X} = 0$ scenarios, there was negligible bias when applying Monte Carlo NARFCS with an informative or very informative prior. Applying Monte Carlo NARFCS with a vague prior resulted in biased estimates where the level of bias was slightly lower than that of CCA for the $β_{X} = 0$ scenario but higher for the $β_{X} = ln (3)$ scenario (and comparable to that of MI). In accordance with MI, the higher level of bias for the $β_{X} = ln (3)$ scenario was likely due to the auxiliary variables amplifying the bias from misspecification of the missingness mechanism. Despite the (relatively) high level of bias, CI coverage was nominal due to the imprecision of the vague prior. Very similar results were obtained when applying Monte Carlo NARFCS with 10,000 Monte Carlo steps with 5 imputations and 5,000 Monte Carlo steps with single imputation (Supplementary tables 15 and 16).

Method Bayesian SM failed to produce results for 72 to 74 simulated datasets (further details in Supplementary Sect. 3.5) whilst the other methods returned results for all 1,000 simulated datasets. Similar to Monte Carlo NARFCS, applying Bayesian SM with an informative or very informative prior resulted in minimal bias. However, compared to Monte Carlo NARFCS, Bayesian SM showed slightly higher levels of bias and inefficiency (i.e., larger empirical standard errors), leading to moderate levels of CI under-coverage. This seeming under-performance of Bayesian SM may have been due to the omitted estimates caused by nonconvergence in a small number of datasets. Unlike Monte Carlo NARFCS, Bayesian SM with a vague prior eliminated some of the bias in both the $β_{X} = ln (3)$ and $β_{X} = 0$ scenarios, with bias levels at least 50% lower than those of CCA. Also, the model-based standard errors of Bayesian SM were considerably smaller than those of Monte Carlo NARFCS. A likely explanation is that some information was gained from the application of Bayes’ theorem combining the prior for $δ^{SM}$ with the observed data. Supporting this claim, we note that when applied with an a priori mean of 0 for $δ^{SM}$ , across the simulations the mean of the posterior estimates of $δ^{SM}$ was 8.83 (95% Monte Carlo interval 8.31 to 8.96) and 6.36 (95% Monte Carlo interval 6.20 to 6.52) for the $β_{X} = ln (3)$ and $β_{X} = 0$ scenarios, respectively (where the true value was 7.85).

For both Bayesian SM and Monte Carlo NARFCS with (very) informative priors, there was CI overcoverage when the estimates of $β_{X}$ were unbiased (or negligibly biased). This overcoverage was likely due to generating the data using a fixed value for the bias parameter which is known to lead to CI overcoverage when applying an analysis with an informative prior centred on the true value of the parameter [64].

Similar patterns were noted on the relative performances of the methods for data generated using the PMM data generating model (Supplementary tables 18 and 19). For both data generating models and $β_{X} = ln (3)$ and $β_{X} = 0$ scenarios, Bayesian SM took substantially longer to run than Monte Carlo NARFCS with Monte Carlo NARFCS taking approximately 2 days per dataset in R (approximately 1 day per dataset in Stata) and Bayesian SM taking approximately 6 days per dataset.

Results of the motivating example

Of the 409,784 participants included in our analysis with complete covariate data, 4,610 (1.12%) were tested for SARS-CoV-2, leaving 405,174 (98.9%) with a missing outcome. Out of the 4,610 participants tested for SARS-CoV-2, 1,317 (28.6%) tested positive. Figure 4 shows the results for the exposure odds ratio (i.e., odds ratio of SARS-CoV-2 infection per standard deviation increase in BMI) estimated using CCA, MI, Bayesian SM, Monte Carlo NARFCS, and the population-based comparison group approach. All analyses suggested that participants with a higher BMI tended to be at a higher risk of SARS-CoV-2 infection. The two probabilistic bias analyses, Bayesian SM and Monte Carlo NARFCS, gave similar results with slightly higher point estimates than CCA and MI, although there was substantial overlap between the CIs of these methods. The results for the population-based comparison group approach were markedly different from those of the other methods.

Fig. 4 — Forest plot of the results for exposure odds ratio, $e x p \{β_{X}\}$ , estimated by complete case analysis (CCA), multiple imputation assuming missing at random (MI), population-based comparison group approach (Missing not infected), and the probabilistic bias analyses, Monte Carlo NARFCS and Bayesian SM. Dashed line denotes the null effect

The patterns in the results were consistent with our prior knowledge that untested participants tended to have a lower BMI and were less likely to have experienced symptoms of SARS-CoV-2 infection than tested participants. For example, under this missingness mechanism we expected that dropping untested participants would lead to an underestimate of the exposure odds ratio (as demonstrated by the simulation study) and setting all untested participants as “not infected” would lead to an overestimate. All analyses except CCA were based on the untested and tested participants but had similar levels of precision to that of CCA. This was unsurprising given that (i) the precision of binary outcome estimators is primarily determined by the number of cases (i.e., positive SARS-CoV-2 infections) and (ii) for our study population and study period, the prevalence of SARS-CoV-2 infection was estimated to be relatively low (3.2% [95% CI 2.8–3.6%] [62]) and so a large proportion of the untested participants were likely not infected with SARS-CoV-2. The distinct results of the population-based comparison group approach was due to the imposed extreme scenario which implied that the prevalence of infection in the study sample was only 0.32%.

Discussion

We have illustrated the feasibility and practicality of conducting a probabilistic bias analysis to data MNAR when a large proportion of an outcome is missing under a strong MNAR mechanism. In the specific setting we considered, our simulation study demonstrated that given reasonably accurate and precise information about the bias parameter, the simpler, Monte Carlo NARFCS method performed as well as the more principled, Bayesian SM method. When very limited information was provided about the bias parameter, the Bayesian bias analysis was able to eliminate most of the bias due to data MNAR while the Monte Carlo bias analysis performed no better than the CCA and the MAR implementation of MI. We have also shown how including auxiliary variables in an imputation model can amplify bias due to data MNAR.

Monte Carlo NARFCS has three key advantages for non-specialist analysts over the Bayesian SM approach: (1) a Monte Carlo bias analysis is simpler and less daunting to implement because it does not require knowledge about Bayesian inference and specialist statistical software. (2) The bias model of Monte Carlo NARFCS uses an MNAR-extension of the popular imputation approach, FCS, which has been implemented in several software environments. (3) For our study, Monte Carlo NARFCS was less computationally demanding than Bayesian SM, resulting in substantially faster run-times. Therefore, it is encouraging that Monte Carlo NARFCS can perform as well as the more principled Bayesian SM. This is supported by previous research, which has established the robustness of FCS imputation to its theoretical weakness (that the joint distribution implied by the univariate regression models may not exist [46, 47, 65]). During the simulation study we experienced some minor technical difficulties with Bayesian SM. However, these issues can be easily resolved when applying the method in practice. For example, nonconvergence would be identified using standard Bayesian diagnostic tools and resolved by running a longer burn-in, and failure of the Bayesian sampler could be rectified by using different starting values or switching to a different Monte Carlo algorithm. In keeping with McCandless and Gustafson [16], we found that applying a Bayesian bias analysis using a vague prior for the bias parameter gained some information about the MNAR mechanism and consequently eliminated some of the bias due to missing data. This was likely due to the Bayesian process ruling out certain MNAR mechanisms (i.e., values of the bias parameter) incompatible with the observed data [16]. In contrast, since the Monte Carlo bias analysis samples directly from the prior distribution of the bias parameter, irrespective of the observed data, then applying Monte Carlo NARFCS with a vague prior performed as badly as the MAR methods. Therefore, a Bayesian bias analysis is recommended when there is limited information available about the bias parameters.

Another difference between the two probabilistic bias analyses is that the bias model of Bayesian SM is a selection model while that of Monte Carlo NARFCS is a pattern-mixture model. The advantage of the selection model framework is that it is coherent with our understanding of how the observed data arises and there is a logical separation of the parameters of interest from the bias parameters [66]. However, others have argued that the bias parameters of the pattern-mixture model are usually easier to interpret and so this framework is more convenient for conducting bias analyses [67–69]. In our applied example, the available external information was not ideally suited for the bias parameter of either the selection or pattern-mixture model. Overall, the pattern-mixture framework is credited as being more accessible and widely available [70], although the selection model framework may be preferable when the missingness mechanism is of primary interest.

Our simulation study has several limitations. First, our comparison of a Bayesian bias analysis to a Monte Carlo bias analysis also differed with respect to the bias model. However, the primary objective of our study was to illustrate an easy to apply probabilistic bias analysis (for non-specialist analysts) and to compare it to a principled approach. Second, we simulated the data using a fixed value for the bias parameter (as opposed to sampling from an appropriate prior). However, we consider the anticipated overcoverage of the probabilistic bias analyses acceptable as we focus on what Rubin terms confidence validity (i.e., intervals that cover at least nominally) rather than randomisation validity (i.e., intervals that cover exactly nominally) [71]. Third, we only explored a small number of scenarios because of the time it took to run each probabilistic bias analysis in a large data setting (typical of cohort studies). To achieve our objective of evaluating the robustness of Monte Carlo NARFCS using a small-scale simulation study, we considered an extreme setting of a large proportion of missing data with a strong MNAR mechanism.

We note that the sequential modelling approach of Bayesian SM and the FCS-type approach of Monte Carlo NARFCS can both flexibly incorporate nonlinear terms and interactions between the outcome and the predictors (i.e., covariates and auxiliary variables), and between the predictors (e.g., [34, 37, 72]). Future work should compare Bayesian SM and Monte Carlo NARFCS when the bias models include nonlinear or interaction terms.

Alternative approaches to a bias analysis are available [11]. These include (i) reference based methods used for handling missing data in randomized clinical trials (e.g., [73]), (ii) placing restrictions on the model parameters (e.g., [74]), (iii) instrumental variable(s) for missingness [6], and (iv) use of additional data (e.g., information from recontacting nonparticipants [75]).

In the extreme setting we explored, our simpler Monte Carlo bias analysis is a viable alternative to a Bayesian bias analysis provided information is available on plausible values of the bias parameter. However, when limited information is available, a Bayesian bias analysis is preferred. By illustrating two different types of probabilistic bias analyses and providing code to replicate them, we hope to encourage the increased adoption of such bias analyses in epidemiological research. Finally, in keeping with [63, 76], we caution careful consideration of the choice of auxiliary variables when applying MI where data may be MNAR.

Supplementary Information

Supplementary Material 1.^{(343.5KB, docx)}

Acknowledgements

Not applicable.

Abbreviations

BMI: Body mass index
CCA: Complete case analysis
FCS: Fully conditional specification
IPW: Inverse probability weighting
MAR: Missing at random
MCMC: Markov Chain Monte Carlo
MI: Multiple imputation
MNAR: Missing not at random
NARFCS: Not-at-random fully conditional specification
PMM: Pattern mixture model
REACT-2: REal-time Assessment of Community Transmission-2
SM: Selection model
UKB: UK Biobank study

Authors’ contributions

Authors EK, DM-S, GLC, CYS, TPM, ARC, AF-S, MCB, GJG, LACM and RAH designed the study with critical review from KT, DL AND GDS. EK, DM-S and GLC performed the simulation study and statistical analyses under the supervision of RAH, CYS and TPM. EK, DM-S, GLC and RAH drafted the paper with input from the remaining authors. All authors were responsible for critical revision of the manuscript and have approved the final version to be published.

Funding

This work was supported by the Bristol British Heart Foundation (BHF) Accelerator Award (AA/18/7/34219), the University of Bristol and Medical Research Council (MRC) Integrative Epidemiology Unit (MC_UU_00032/01, 02 & 05), the BHF-National Institute of Health Research (NIHR) COVIDITY flagship project, the John Templeton Foundation (61917), and the Wellcome Trust and Royal Society (215408/Z/19/Z). TPM was funded by the UKRI Medical Research Council, grant number MC_UU_00004/09 and GJF was funded by an MQ fellowship, grant number MQF22\22. The computation work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol—http://www.bristol.ac.uk/acrc/.

Data availability

The software code to generate the simulated datasets analysed during the simulated study are available in the COVIDITY_ProbQBA repository, https://github.com/MRCIEU/COVIDITY_ProbQBA. The UK Biobank study dataset analysed during the current study is available from the UK Biobank Access Management Team (https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/contact-us) but restrictions may apply to the availability of these data, which were used under license for the current study, and so are not publicly available. All methods discussed in this paper can be implemented using the provided software code available from the COVIDITY_ProbQBA repository, https://github.com/MRCIEU/COVIDITY_ProbQBA.

Declarations

Ethics approval and consent to participate

For the simulation study, data were completely simulated, which did not require approval from an ethics committee or consent from participants. UKB received ethical approval from the UK National Health Service’s National Research Ethics Service (ref. 11/NW/0382). All participants provided written and informed consent for data collection, analysis, and record linkage. This research was conducted under UKB application number 16729.

Consent for publication

Not applicable.

Competing interests

TPM has received consultancy fees from: Bayer Healthcare Pharmaceuticals, Alliance Pharmaceuticals, Gilead Sciences, and Kite Pharmaceuticals. Since January 2023, ARC has been an employee of Novo Nordisk Research Centre Oxford, which is not related to the current work and had no involvement in the decision to publish. The remaining authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Emily Kawabata, Daniel Major-Smith and Gemma L. Clayton contributed equally to this work.

References

1.Rubin D. Inference and missing data. Biometrika. 1976;63:581–92. [Google Scholar]
2.Li Y, Miao W, Shpitser I, Tchetgen Tchetgen EJ. A self-censoring model for multivariate nonignorable nonmonotone missing data. Biometrics. 2023;: 1–12. [DOI] [PubMed]
3.Giusti C, Little RJ. An analysis of nonignorable nonresponse to income in a survey with a rotating panel design. J Official Statistics. 2011;27(2):211–29. [Google Scholar]
4.White IR, Carpenter J, Horton NJ. A mean score method for sensitivity analysis to depatures from the missing at random assumption in randomised trials. Stat Sin. 2018;28(4):1985–2003. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Tompsett DM, Leacy F, Moreno-Betancu M, Heron J, White IR. On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice. Stat Med. 2018;37:2338–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Tchetgen Tchetgen EJ, Wirth KE. A general instrumental variable framework for regression analysis with outcome missing not at random. Biometrics. 2017;73:1123–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Cornish RP, Macleod J, Carpenter JR, Tilling K. Multiple imputation using linked proxy outcome data results in important bias reduction and efficiency gains: a simulation study. Emerg Themes Epidemiol. 2017;14:14. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Gray L, Gorman E, White IR, Vittal Katikireddi S, McCartney G, Rutherford L, et al. Correcting for non-participation bias in health surveys usuing record-linkage, synthetic observations and pattern mixture modelling. Stat Methods Med Res. 2020;29(4):1212–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Doidge JC. Responsivenss-informed multiple imputation and inverse probability weighting in cohort studies with missing data that are non-monotone or not missing at random. Stat Methods Med Res. 2018;27(2):352–63. [DOI] [PubMed] [Google Scholar]
10.Hughes RA, Heron J, Sterne JA, Tilling K. Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int J Epidemiol. 2019;48(4):1294–304. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Little RA, Rubin DB. Statistical Analysis with Missing Data. 3rd ed. New York: John Wiley & Sons; 2019. [Google Scholar]
12.Fox MP, MacLehose RF, Lash TL. Applying quantitative bias analysis to epidemiologic data. 2nd ed. New York: Springer; 2021. [Google Scholar]
13.Lee K, Tilling K, Cornish R, Little R, Bell M, Goetghebeur E, et al. Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework. J Clin Epidemiol. 2021;134:79–88. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Andridge RR, Little RJ. Proxy pattern-mixture analysis for survey nonresponse. J Official Statistics. 2011;27(2):153–80. [Google Scholar]
15.Andridge R, Little RJ. Proxy pattern-mixture analysis for a binary variable subject to nonresponse. J Official Statistics. 2020;36(3):703–28. [Google Scholar]
16.McCandless LC, Gustafson P. A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding. Stat Med. 2017;36(18):2887–901. [DOI] [PubMed] [Google Scholar]
17.Greenland S. Multiple-bias modelling for analysis of observational data. J R Stat Soc Ser A. 2005;168:267–306. [Google Scholar]
18.MacLeohse RF, Gustafson P. Is probabilistic bias analysis approximately Bayesian? Epidemiology. 2012;23(1):151–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Flanders WD, Waller LA, Zhang Q, Getahun D, Silverberg M, Goodman M. Negative Control Exposures - Causal effect identifiability and use in probabilistic-bias and Bayesian analyses with unmeasured confounders. Epidemiology. 2022;33(6):832–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Gustafson P, McCandless L. Priors, parameters and probability - A Bayesian perspective on sensitivity analysis. Epidemiology. 2014;25(6):910–2. [DOI] [PubMed] [Google Scholar]
21.Corbin M, Haslett S, Pearce N, Maule M, Greenland S. A comparison of sensitivity-specificity imputation, direct imputation and fully Bayesian analysis to adjust for exposure misclassification when validation data are unavailable. Int J Epidemiol. 2017;46(3):1063–72. [DOI] [PubMed] [Google Scholar]
22.Gachau S, Quartagno M, Njeru Njagi E, Owuor N, English M, Ayieko P. Handling missing data in modelling quality of clinician-prescribed routine care: Sensitivity analysis of departure from missing at random assumption. Stat Methods Med Res. 2020;29(10):3076–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Smuk M, Carpenter JR, Morris TP. What impact do assumptions about missing data have on conclusions? A practical sensitivity analysis for a cancer survival registry. BMC Med Res Methodol. 2017;17:21. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Mason A, Richardson S, Plewis I, Best N. Strategy for modelling nonrandom missing data mechanisms in observational studies using Bayesian methods. J Official Stat. 2012;28(2):279–302. [Google Scholar]
25.Banack HR, Hayes-Larsin E, Mayeda E. Monte Carlo Simulation Approaches for Quantitative Bias Analysis: A Tutorial. Epidemiolog Rev. 2021;43:106–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Jayaweera RT, Bradshaw PT, Gerdts C, Egwuatu I, Grosso B, Kristianingrum I, et al. Accounting for misclassification and selection bias in estimating effectiveness of self-managed medication abortion. Epidemiology. 2023;34:140–9. [DOI] [PubMed] [Google Scholar]
27.White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med. 2011;30:377–99. [DOI] [PubMed] [Google Scholar]
28.van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007;16:219–42. [DOI] [PubMed] [Google Scholar]
29.Moreno-Betancur M, Lee KJ, Leacy FP, White IR, Simpson JA, Carlin JB. Canonical Causal Diagrams to Guide the Treatment of Missing Data in Epidemiologic Studies. Am J Epidemiol. 2018;187(12):2705–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Bartlett JW, Harel O, Carpenter JR. Asymptotically unbiased estimation of exposure odds ratios in complete records logistic regression. Am J Epidemiol. 2015;182(8):730–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Carpenter JR, Smuk M. Missing data: A statistical framework for practice. Biom J. 2021;63:915–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Little RJ, Carpenter JR, Lee KJ. A comparison of three popular methods for handling missing data: complete-analysis, inverse probability weighting, and multiple imputation. Sociol Methods Res. 2024;53(3):1105–35. [Google Scholar]
33.Erler NS, Rizopoulos D, Rosmalen JV, Jaddoe VW, Franco OH, Lesaffre EM. Dealing with missing covariates in epidemiologic studies: a comparison between multiple imputation and a full Bayesian approach. Stat Med. 2016;35(17):2955–74. [DOI] [PubMed] [Google Scholar]
34.Lüdtke O, Robitzsch A, West SG. Regression models involving nonlinear effects with missing data: A sequential modeling approach using Bayesian estimation. Psychol Methods. 2020;25(2):157. [DOI] [PubMed] [Google Scholar]
35.Ibrahim JG, Chen M, Lipsitz SR. Bayesian methods for generalized linear models with covariates missing at random. Canadian J Statistics. 2002;30(1):55–78. [Google Scholar]
36.Du H, Enders C, Keller BT, Bradbury TN, Karney BR. A Bayesian latent variable selection model for nonignorable missingness. Multivar Behav Res. 2022;57(2–3):478–512. [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Bartlett JW, Seaman SR, White IR, Carpenter JR. Alzheimer’s disease neuroimaging initiative. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res. 2015;24(4):462–87. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Chen M, Ibrahim JG. Maximum likelihood methods for cure rate models with missing covariates. Biometrics. 2001;57(1):43–52. [DOI] [PubMed] [Google Scholar]
39.Zhu J, Raghunathan TE. Convergence properties of a sequential regression multiple imputation algorithm. J Am Stat Assoc. 2015;110(511):1112–24. [Google Scholar]
40.Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 3rd ed. New York: Chapman Hall/CRC; 2013. [Google Scholar]
41.Gelfand AE, Smith AF. Sampling-Based Approaches to Calculating Marginal Densities. J Am Stat Assoc. 1990;85(410):398–409. [Google Scholar]
42.Casella G, George EI. Explaining the Gibbs Sampler. Am Stat. 1992;46(3):167–74. [Google Scholar]
43.Plummer M. JAGS Version 4.3.0 user manual. https://people.stat.sc.edu/hansont/stat740/jags_user_manual.pdf; 2017. Available from: chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/. Cited 2024 March 19.
44.Kellner K, Meredith M. jagsUI: a wrapper around ‘rjags’ to streamline ‘JAGS’ analyses. 2024. Available from: https://kenkellner.com/jagsUI/. Cited 2024 March 19.
45.van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine. 1999;18:681–94. [DOI] [PubMed] [Google Scholar]
46.Hughes RA, White IR, Seaman RS, Carpenter JR, Tilling K, Sterne JA. Joint modelling rationale for chained equations. BMC Med Res Methodol. 2014;14:28. [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Liu J, Gelman A, Hill J, Su Y, Kropko J. On the stationary distribution of iterative imputations. Biometrika. 2014;101(1):155–73. [Google Scholar]
48.StataCorp. Stata Statistical Software: Release 17. 2021.
49.StataCorp. Stata 17 Multiple-Imputation Reference Manual College Station. TX: Stata Press; 2021. [Google Scholar]
50.van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw. 2011;45(3):1–67. 10.18637/jss.v045.i03. [Google Scholar]
51.Moreno-Betancur M, Leacy F, Tompsett D, White I. mice: The NARFCS procedure for sensitivity analyses. 2019. Available from: https://github.com/moreno-betancur/NARFCS/blob/master/README.md. Cited 2023 September 4.
52.R Core Team. R: A Language and Environment for Statistical Computing. 2021.
53.University of Bristol. High Performance Computing. Available from: https://www.bristol.ac.uk/acrc/high-performance-computing/. Cited 2023 12 30.
54.Gasparini A. rsimsum: Summarise results from Monte Carlo simulation studies. J Open Source Software. 2018;3: 739. [Google Scholar]
55.Millard LC, Fernández-Sanlés A, Carter AR, Hughes RA, Tilling K, Morris TP, Major-Smith D, Griffith GJ, Clayton GL, Kawabata E, Davey Smith G, Lawlor DA, Borges MC. Exploring the impact of selection bias in observational studies of COVID-19: a simulation study. Int J Epidemiol. 2023;52(1):44–57. 10.1093/ije/dyac221. [DOI] [PMC free article] [PubMed] [Google Scholar]
56.Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al. Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants with Those of the General Population. Am J Epidemiol. 2017;186(9):1026–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
57.Armstrong J, Rudkin JK, Allen N, Crook DW, Wilson DJ, Wyllie DH, et al. Dynamic linkage of COVID-19 test results between public health England’s second generation surveillance system and UK Biobank. Microbial Genomics. 2020;6(7):mgen000397. [DOI] [PMC free article] [PubMed] [Google Scholar]
58.Carter AR, Clayton GL, Borges MC, Howe LD, Hughes RA, Davey Smith G, et al. Time-sensitive testing pressures and COVID-19 outcomes~: are socioeconomic inequalities over the first year of the pandemic explained by selection bias. BMC Public Health. 1863;2023(23):1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
59.COVID-19 Host Genetics Initiative. Mapping the human genetic architecture of COVID-19. Nature. 2021; 600: 472–477. [DOI] [PMC free article] [PubMed]
60.Lassale C, Gaye B, Hamer M, Gale CR, Batty GD. Ethnic disparities in hospitalisation for COVID-19 in England: the role of socioeconomic factors, mental health, and inflammatory and proinflammatory factors in a community-based cohort study. Brain Behav Immun. 2020;88:44–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Western Cape Department of Health in collaboration with the National Institute for Communicable Diseases SA. Risk factors for coronavirus disease 2019 (COVID-19) death in a population cohort study from the Western Cape Province. South Africa Clin Infect Dis. 2021;73:e2005-2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
62.Ward H, Atchison C, Whitaker M, Ainslie KE, Elliott J, Okell L, et al. SARS-CoV-2 antibody prevalence in England following the first peak of the pandemic. Nat Commun. 2021;12(1):905. [DOI] [PMC free article] [PubMed] [Google Scholar]
63.Thoemmes F, Rose N. A cautious note on auxiliary variables that can increase bias in missing data problems. Multivar Behav Res. 2014;49:443–59. [DOI] [PubMed] [Google Scholar]
64.White IR, Pham TM, Quartagno M, Morris TP. How to check a simulation study. Int J Epidemiol. 2024;53(1):1–7. 10.1093/ije/dyad134. [DOI] [PMC free article] [PubMed] [Google Scholar]
65.van Buuren S, Brand JL, Groothuis-Oudshoorn CM, Rubin DB. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006;76(12):1049–64. [Google Scholar]
66.Scharfstein DO, Daniels MJ, Robins JM. Incorporating prior beliefs about selection bias into the analysis of randomized trials with missing outcomes. Biostatistics. 2003;4:495–512. [DOI] [PMC free article] [PubMed] [Google Scholar]
67.White IR, Carpenter JE, Evans S, Schroter S. Eliciting and using expert opinions about dropout bias in randomized controlled trials. Clin Trials. 2007;4(2):125–39. [DOI] [PubMed] [Google Scholar]
68.Daniels MJ, Hogan JW. Reparameterizing the Pattern Mixture Model for Sensitivity Analyses Under Informative Dropout. Biometrics. 2000;56(4):1241–8. [DOI] [PubMed] [Google Scholar]
69.White IR, Higgins JP, Wood AM. Allowing for uncertainty due to missing data in meta-analysis - Part 1: Two-stage methods. Stat Med. 2008;27(5):711–27. [DOI] [PubMed] [Google Scholar]
70.Carpenter JR, Kenward MG. Sensitivity analysis with multiple imputation. In: Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G, editors. Handbook of missing data methodology. New York: CRC Press; 2015. p. 435–70. [Google Scholar]
71.Rubin DB. Multiple Imputation After 18+ Years. J Am Stat Assoc. 1996;91(434):473–89. [Google Scholar]
72.Tilling K, Williamson EJ, Spratt M, Sterne JA, Carpenter JR. Appropriate inclusion of interactions was needed to avoid bias in multiple imputation. J Clin Epidemiol. 2016;80:107–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
73.Cro S, Morris TP, Kenward MG, Carpenter JR. Sensitivity analysis for clinical trials with missing continuous outcome data using controlled multiple imputation: A practical guide. Stat Med. 2020;39:2815–42. [DOI] [PubMed] [Google Scholar]
74.Linero AR, Daniels MJ. Bayesian approaches for missing not at random outcome data: The role of identifying restrictions. Stat Sci. 2018;33(2):198–213. [DOI] [PMC free article] [PubMed] [Google Scholar]
75.Karvanen J, Tolonen H, Harkanen T, Jousliahti P, Kuulasamaa K. Selection bias was reduced by recontacting nonparticipants. J Clin Epidemiol. 2016;76:209–17. [DOI] [PubMed] [Google Scholar]
76.Curnow E, Cornish RP, Heron JE, Carpenter JR, Tilling K. Multiple imputation assuming missing at random: auxiliary imputation variables that only predict missingness can increase bias due to data missing not at random. 2023. Available from: https://www.medrxiv.org/content/10.1101/2023.10.17.23297137v1. Cited 2024 March 24. [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material 1.^{(343.5KB, docx)}

Data Availability Statement

[CR1] 1.Rubin D. Inference and missing data. Biometrika. 1976;63:581–92. [Google Scholar]

[CR2] 2.Li Y, Miao W, Shpitser I, Tchetgen Tchetgen EJ. A self-censoring model for multivariate nonignorable nonmonotone missing data. Biometrics. 2023;: 1–12. [DOI] [PubMed]

[CR3] 3.Giusti C, Little RJ. An analysis of nonignorable nonresponse to income in a survey with a rotating panel design. J Official Statistics. 2011;27(2):211–29. [Google Scholar]

[CR4] 4.White IR, Carpenter J, Horton NJ. A mean score method for sensitivity analysis to depatures from the missing at random assumption in randomised trials. Stat Sin. 2018;28(4):1985–2003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Tompsett DM, Leacy F, Moreno-Betancu M, Heron J, White IR. On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice. Stat Med. 2018;37:2338–53. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Tchetgen Tchetgen EJ, Wirth KE. A general instrumental variable framework for regression analysis with outcome missing not at random. Biometrics. 2017;73:1123–31. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Cornish RP, Macleod J, Carpenter JR, Tilling K. Multiple imputation using linked proxy outcome data results in important bias reduction and efficiency gains: a simulation study. Emerg Themes Epidemiol. 2017;14:14. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Gray L, Gorman E, White IR, Vittal Katikireddi S, McCartney G, Rutherford L, et al. Correcting for non-participation bias in health surveys usuing record-linkage, synthetic observations and pattern mixture modelling. Stat Methods Med Res. 2020;29(4):1212–26. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Doidge JC. Responsivenss-informed multiple imputation and inverse probability weighting in cohort studies with missing data that are non-monotone or not missing at random. Stat Methods Med Res. 2018;27(2):352–63. [DOI] [PubMed] [Google Scholar]

[CR10] 10.Hughes RA, Heron J, Sterne JA, Tilling K. Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int J Epidemiol. 2019;48(4):1294–304. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Little RA, Rubin DB. Statistical Analysis with Missing Data. 3rd ed. New York: John Wiley & Sons; 2019. [Google Scholar]

[CR12] 12.Fox MP, MacLehose RF, Lash TL. Applying quantitative bias analysis to epidemiologic data. 2nd ed. New York: Springer; 2021. [Google Scholar]

[CR13] 13.Lee K, Tilling K, Cornish R, Little R, Bell M, Goetghebeur E, et al. Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework. J Clin Epidemiol. 2021;134:79–88. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Andridge RR, Little RJ. Proxy pattern-mixture analysis for survey nonresponse. J Official Statistics. 2011;27(2):153–80. [Google Scholar]

[CR15] 15.Andridge R, Little RJ. Proxy pattern-mixture analysis for a binary variable subject to nonresponse. J Official Statistics. 2020;36(3):703–28. [Google Scholar]

[CR16] 16.McCandless LC, Gustafson P. A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding. Stat Med. 2017;36(18):2887–901. [DOI] [PubMed] [Google Scholar]

[CR17] 17.Greenland S. Multiple-bias modelling for analysis of observational data. J R Stat Soc Ser A. 2005;168:267–306. [Google Scholar]

[CR18] 18.MacLeohse RF, Gustafson P. Is probabilistic bias analysis approximately Bayesian? Epidemiology. 2012;23(1):151–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR19] 19.Flanders WD, Waller LA, Zhang Q, Getahun D, Silverberg M, Goodman M. Negative Control Exposures - Causal effect identifiability and use in probabilistic-bias and Bayesian analyses with unmeasured confounders. Epidemiology. 2022;33(6):832–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Gustafson P, McCandless L. Priors, parameters and probability - A Bayesian perspective on sensitivity analysis. Epidemiology. 2014;25(6):910–2. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Corbin M, Haslett S, Pearce N, Maule M, Greenland S. A comparison of sensitivity-specificity imputation, direct imputation and fully Bayesian analysis to adjust for exposure misclassification when validation data are unavailable. Int J Epidemiol. 2017;46(3):1063–72. [DOI] [PubMed] [Google Scholar]

[CR22] 22.Gachau S, Quartagno M, Njeru Njagi E, Owuor N, English M, Ayieko P. Handling missing data in modelling quality of clinician-prescribed routine care: Sensitivity analysis of departure from missing at random assumption. Stat Methods Med Res. 2020;29(10):3076–92. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Smuk M, Carpenter JR, Morris TP. What impact do assumptions about missing data have on conclusions? A practical sensitivity analysis for a cancer survival registry. BMC Med Res Methodol. 2017;17:21. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] 24.Mason A, Richardson S, Plewis I, Best N. Strategy for modelling nonrandom missing data mechanisms in observational studies using Bayesian methods. J Official Stat. 2012;28(2):279–302. [Google Scholar]

[CR25] 25.Banack HR, Hayes-Larsin E, Mayeda E. Monte Carlo Simulation Approaches for Quantitative Bias Analysis: A Tutorial. Epidemiolog Rev. 2021;43:106–17. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR26] 26.Jayaweera RT, Bradshaw PT, Gerdts C, Egwuatu I, Grosso B, Kristianingrum I, et al. Accounting for misclassification and selection bias in estimating effectiveness of self-managed medication abortion. Epidemiology. 2023;34:140–9. [DOI] [PubMed] [Google Scholar]

[CR27] 27.White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med. 2011;30:377–99. [DOI] [PubMed] [Google Scholar]

[CR28] 28.van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007;16:219–42. [DOI] [PubMed] [Google Scholar]

[CR29] 29.Moreno-Betancur M, Lee KJ, Leacy FP, White IR, Simpson JA, Carlin JB. Canonical Causal Diagrams to Guide the Treatment of Missing Data in Epidemiologic Studies. Am J Epidemiol. 2018;187(12):2705–15. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR30] 30.Bartlett JW, Harel O, Carpenter JR. Asymptotically unbiased estimation of exposure odds ratios in complete records logistic regression. Am J Epidemiol. 2015;182(8):730–6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Carpenter JR, Smuk M. Missing data: A statistical framework for practice. Biom J. 2021;63:915–47. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Little RJ, Carpenter JR, Lee KJ. A comparison of three popular methods for handling missing data: complete-analysis, inverse probability weighting, and multiple imputation. Sociol Methods Res. 2024;53(3):1105–35. [Google Scholar]

[CR33] 33.Erler NS, Rizopoulos D, Rosmalen JV, Jaddoe VW, Franco OH, Lesaffre EM. Dealing with missing covariates in epidemiologic studies: a comparison between multiple imputation and a full Bayesian approach. Stat Med. 2016;35(17):2955–74. [DOI] [PubMed] [Google Scholar]

[CR34] 34.Lüdtke O, Robitzsch A, West SG. Regression models involving nonlinear effects with missing data: A sequential modeling approach using Bayesian estimation. Psychol Methods. 2020;25(2):157. [DOI] [PubMed] [Google Scholar]

[CR35] 35.Ibrahim JG, Chen M, Lipsitz SR. Bayesian methods for generalized linear models with covariates missing at random. Canadian J Statistics. 2002;30(1):55–78. [Google Scholar]

[CR36] 36.Du H, Enders C, Keller BT, Bradbury TN, Karney BR. A Bayesian latent variable selection model for nonignorable missingness. Multivar Behav Res. 2022;57(2–3):478–512. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR37] 37.Bartlett JW, Seaman SR, White IR, Carpenter JR. Alzheimer’s disease neuroimaging initiative. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res. 2015;24(4):462–87. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Chen M, Ibrahim JG. Maximum likelihood methods for cure rate models with missing covariates. Biometrics. 2001;57(1):43–52. [DOI] [PubMed] [Google Scholar]

[CR39] 39.Zhu J, Raghunathan TE. Convergence properties of a sequential regression multiple imputation algorithm. J Am Stat Assoc. 2015;110(511):1112–24. [Google Scholar]

[CR40] 40.Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 3rd ed. New York: Chapman Hall/CRC; 2013. [Google Scholar]

[CR41] 41.Gelfand AE, Smith AF. Sampling-Based Approaches to Calculating Marginal Densities. J Am Stat Assoc. 1990;85(410):398–409. [Google Scholar]

[CR42] 42.Casella G, George EI. Explaining the Gibbs Sampler. Am Stat. 1992;46(3):167–74. [Google Scholar]

[CR43] 43.Plummer M. JAGS Version 4.3.0 user manual. https://people.stat.sc.edu/hansont/stat740/jags_user_manual.pdf; 2017. Available from: chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/. Cited 2024 March 19.

[CR44] 44.Kellner K, Meredith M. jagsUI: a wrapper around ‘rjags’ to streamline ‘JAGS’ analyses. 2024. Available from: https://kenkellner.com/jagsUI/. Cited 2024 March 19.

[CR45] 45.van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine. 1999;18:681–94. [DOI] [PubMed] [Google Scholar]

[CR46] 46.Hughes RA, White IR, Seaman RS, Carpenter JR, Tilling K, Sterne JA. Joint modelling rationale for chained equations. BMC Med Res Methodol. 2014;14:28. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR47] 47.Liu J, Gelman A, Hill J, Su Y, Kropko J. On the stationary distribution of iterative imputations. Biometrika. 2014;101(1):155–73. [Google Scholar]

[CR48] 48.StataCorp. Stata Statistical Software: Release 17. 2021.

[CR49] 49.StataCorp. Stata 17 Multiple-Imputation Reference Manual College Station. TX: Stata Press; 2021. [Google Scholar]

[CR50] 50.van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw. 2011;45(3):1–67. 10.18637/jss.v045.i03. [Google Scholar]

[CR51] 51.Moreno-Betancur M, Leacy F, Tompsett D, White I. mice: The NARFCS procedure for sensitivity analyses. 2019. Available from: https://github.com/moreno-betancur/NARFCS/blob/master/README.md. Cited 2023 September 4.

[CR52] 52.R Core Team. R: A Language and Environment for Statistical Computing. 2021.

[CR53] 53.University of Bristol. High Performance Computing. Available from: https://www.bristol.ac.uk/acrc/high-performance-computing/. Cited 2023 12 30.

[CR54] 54.Gasparini A. rsimsum: Summarise results from Monte Carlo simulation studies. J Open Source Software. 2018;3: 739. [Google Scholar]

[CR55] 55.Millard LC, Fernández-Sanlés A, Carter AR, Hughes RA, Tilling K, Morris TP, Major-Smith D, Griffith GJ, Clayton GL, Kawabata E, Davey Smith G, Lawlor DA, Borges MC. Exploring the impact of selection bias in observational studies of COVID-19: a simulation study. Int J Epidemiol. 2023;52(1):44–57. 10.1093/ije/dyac221. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR56] 56.Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al. Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants with Those of the General Population. Am J Epidemiol. 2017;186(9):1026–34. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR57] 57.Armstrong J, Rudkin JK, Allen N, Crook DW, Wilson DJ, Wyllie DH, et al. Dynamic linkage of COVID-19 test results between public health England’s second generation surveillance system and UK Biobank. Microbial Genomics. 2020;6(7):mgen000397. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR58] 58.Carter AR, Clayton GL, Borges MC, Howe LD, Hughes RA, Davey Smith G, et al. Time-sensitive testing pressures and COVID-19 outcomes~: are socioeconomic inequalities over the first year of the pandemic explained by selection bias. BMC Public Health. 1863;2023(23):1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR59] 59.COVID-19 Host Genetics Initiative. Mapping the human genetic architecture of COVID-19. Nature. 2021; 600: 472–477. [DOI] [PMC free article] [PubMed]

[CR60] 60.Lassale C, Gaye B, Hamer M, Gale CR, Batty GD. Ethnic disparities in hospitalisation for COVID-19 in England: the role of socioeconomic factors, mental health, and inflammatory and proinflammatory factors in a community-based cohort study. Brain Behav Immun. 2020;88:44–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR61] 61.Western Cape Department of Health in collaboration with the National Institute for Communicable Diseases SA. Risk factors for coronavirus disease 2019 (COVID-19) death in a population cohort study from the Western Cape Province. South Africa Clin Infect Dis. 2021;73:e2005-2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR62] 62.Ward H, Atchison C, Whitaker M, Ainslie KE, Elliott J, Okell L, et al. SARS-CoV-2 antibody prevalence in England following the first peak of the pandemic. Nat Commun. 2021;12(1):905. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR63] 63.Thoemmes F, Rose N. A cautious note on auxiliary variables that can increase bias in missing data problems. Multivar Behav Res. 2014;49:443–59. [DOI] [PubMed] [Google Scholar]

[CR64] 64.White IR, Pham TM, Quartagno M, Morris TP. How to check a simulation study. Int J Epidemiol. 2024;53(1):1–7. 10.1093/ije/dyad134. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR65] 65.van Buuren S, Brand JL, Groothuis-Oudshoorn CM, Rubin DB. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006;76(12):1049–64. [Google Scholar]

[CR66] 66.Scharfstein DO, Daniels MJ, Robins JM. Incorporating prior beliefs about selection bias into the analysis of randomized trials with missing outcomes. Biostatistics. 2003;4:495–512. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR67] 67.White IR, Carpenter JE, Evans S, Schroter S. Eliciting and using expert opinions about dropout bias in randomized controlled trials. Clin Trials. 2007;4(2):125–39. [DOI] [PubMed] [Google Scholar]

[CR68] 68.Daniels MJ, Hogan JW. Reparameterizing the Pattern Mixture Model for Sensitivity Analyses Under Informative Dropout. Biometrics. 2000;56(4):1241–8. [DOI] [PubMed] [Google Scholar]

[CR69] 69.White IR, Higgins JP, Wood AM. Allowing for uncertainty due to missing data in meta-analysis - Part 1: Two-stage methods. Stat Med. 2008;27(5):711–27. [DOI] [PubMed] [Google Scholar]

[CR70] 70.Carpenter JR, Kenward MG. Sensitivity analysis with multiple imputation. In: Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G, editors. Handbook of missing data methodology. New York: CRC Press; 2015. p. 435–70. [Google Scholar]

[CR71] 71.Rubin DB. Multiple Imputation After 18+ Years. J Am Stat Assoc. 1996;91(434):473–89. [Google Scholar]

[CR72] 72.Tilling K, Williamson EJ, Spratt M, Sterne JA, Carpenter JR. Appropriate inclusion of interactions was needed to avoid bias in multiple imputation. J Clin Epidemiol. 2016;80:107–15. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR73] 73.Cro S, Morris TP, Kenward MG, Carpenter JR. Sensitivity analysis for clinical trials with missing continuous outcome data using controlled multiple imputation: A practical guide. Stat Med. 2020;39:2815–42. [DOI] [PubMed] [Google Scholar]

[CR74] 74.Linero AR, Daniels MJ. Bayesian approaches for missing not at random outcome data: The role of identifying restrictions. Stat Sci. 2018;33(2):198–213. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR75] 75.Karvanen J, Tolonen H, Harkanen T, Jousliahti P, Kuulasamaa K. Selection bias was reduced by recontacting nonparticipants. J Clin Epidemiol. 2016;76:209–17. [DOI] [PubMed] [Google Scholar]

[CR76] 76.Curnow E, Cornish RP, Heron JE, Carpenter JR, Tilling K. Multiple imputation assuming missing at random: auxiliary imputation variables that only predict missingness can increase bias due to data missing not at random. 2023. Available from: https://www.medrxiv.org/content/10.1101/2023.10.17.23297137v1. Cited 2024 March 24. [DOI] [PMC free article] [PubMed]

PERMALINK

Accounting for bias due to outcome data missing not at random: comparison and illustration of two approaches to probabilistic bias analysis: a simulation study

Emily Kawabata

Daniel Major-Smith

Gemma L Clayton

Chin Yang Shapland

Tim P Morris

Alice R Carter

Alba Fernández-Sanlés

Maria Carolina Borges

Kate Tilling

Gareth J Griffith

Louise A C Millard

George Davey Smith

Deborah A Lawlor

Rachael A Hughes

Abstract

Background

Methods

Results

Conclusion

Supplementary Information

Introduction

Methods

Hypothetical example

Fig. 1.

Complete case analysis and ignorable missing data methods

Bayesian SM

Bias model specified as a selection model

Prior probability distributions

Bayesian implementation

Monte Carlo NARFCS

Bias model specified as a pattern-mixture model

Prior probability distributions

Monte Carlo bias analysis

Simulation study design

Generation of the complete data

Generation of the missing data

Missing data methods and evaluation

Motivating example

Motivating case study

Fig. 2.

Statistical analyses

Derivation of the informative prior δSM and δNARFCS

Results

Simulation study results

Fig. 3.

Results of the motivating example

Fig. 4.

Discussion

Supplementary Information

Acknowledgements

Abbreviations

Authors’ contributions

Funding

Data availability

Declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Footnotes

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Derivation of the informative prior $δ^{SM}$ and $δ^{NARFCS}$