Author manuscript; available in PMC: 2022 Jun 17.
Published in final edited form as: Stat Methods Med Res. 2021 Oct 13;30(12):2685–2700. doi: 10.1177/09622802211047346

Multiple imputation with missing data indicators

Lauren J Beesley 1, Irina Bondarenko 1, Michael R Elliott 1,2, Allison W Kurian 4, Steven J Katz 3, Jeremy M G Taylor 1
PMCID: PMC9205685  NIHMSID: NIHMS1811470  PMID: 34643465

Abstract

Multiple imputation is a well-established general technique for analyzing data with missing values. A convenient way to implement multiple imputation is sequential regression multiple imputation (SRMI), also called chained equations multiple imputation. In this approach, we impute missing values using regression models for each variable, conditional on the other variables in the data. This approach, however, assumes that the missingness mechanism is missing at random, and it is not well-justified under not-at-random missingness without additional modification. In this paper, we describe how we can generalize the SRMI imputation procedure to handle missingness not at random (MNAR) in the setting where missingness may depend on other variables that are also missing but not on the missing variable itself, conditioning on fully-observed variables. We provide algebraic justification for several generalizations of standard SRMI using Taylor series and other approximations of the target imputation distribution under MNAR. Resulting regression model approximations include indicators for missingness, interactions, or other functions of the MNAR missingness model and observed data. In a simulation study, we demonstrate that the proposed SRMI modifications result in reduced bias in the final analysis compared to standard SRMI, with an approximation strategy involving inclusion of an offset in the imputation model performing the best overall. The method is illustrated in a breast cancer study, where the goal is to estimate the prevalence of a specific genetic pathogenic variant.

Keywords: chained equations multiple imputation, not missing at random, missing data indicator, sequential regression multiple imputation

Introduction

Multiple imputation has become a popular and effective approach for analyzing datasets with missing values (1; 2; 3). This general approach relies on an assumed statistical model for the variables with missing values. If this model is appropriately specified and the mechanism generating missingness in the data depends only on fully-observed data (called missing at random [MAR]), then this method has been shown to have good theoretical and numerical properties (4). In analyzing data in practice, analysts must make good choices in specifying models used for imputation, and they must determine whether the MAR missingness assumption is plausible or at least approximately satisfied.

When missingness depends on unobserved data conditional on the observed data, called missing not at random (MNAR), then many standard multiple imputation strategies cannot be directly applied (2). For example, suppose we have three variables in our data (denoted X1, X2, and X3) and that X1 and X2 have missing values for some subjects. Let Rj be the indicator of whether Xj is observed (Rj = 1) or not (Rj = 0). Missingness in X1 is MAR if P(R1 = 1|X) depends only on X3. Missingness is MNAR if missingness in X1 depends directly on the value of X1 or if it depends on X2, which is also sometimes missing. If we were to impute missing values of X1 and X2 ignoring MNAR missingness, we may introduce bias in estimating parameters of interest later on.

It is well-known that it is impossible to distinguish between MNAR and MAR missingness using the observed data alone (5). Therefore, a general recommendation is to use a large number of observed variables to impute the missing data, since it may be more reasonable to assume MAR missingness when we condition on a larger amount of the observed data. Another general approach is to perform a sensitivity analysis exploring how much final analysis conclusions are impacted when we perform imputation from distributions incorporating different plausible MNAR assumptions (i.e., models for R1 and R2 with corresponding fixed parameter values). These imputation distributions, however, can often be complicated functions of the data models (models for X) and the assumed models for missingness (model for R|X). Approximations of these imputation distributions can provide an easier path toward routine implementation.

The ideal way to impute variables with missing values under MAR is to specify a joint distribution for all the X variables and then use the conditional distribution derived from that joint distribution to impute missing values. It is challenging to specify such a joint distribution when many variables have missing values and the variables may be of mixed types, such as binary, categorical and continuous. A convenient and pragmatic way to overcome this problem is to perform chained equations multiple imputation, also known as sequential regression multiple imputation [denoted SRMI] (6; 7; 8; 3). This method is also referred to in the statistical literature as multiple imputation by chained equations (MICE) and fully conditional specification (FCS). In this approach, a regression model is specified for imputing each variable with missing values, conditional on all the other variables. The variables with missing values in X are then imputed sequentially, and the procedure is iterated a few times until stable results are obtained.

SRMI can be thought of as mimicking an iterative Markov chain Monte Carlo (MCMC) algorithm under a full Bayesian joint model with flat priors, where missing values are viewed as parameters and are drawn from corresponding posterior distributions. The posterior distribution for imputing each variable is the conditional distribution of that variable given all the others, which is analogous to the SRMI approach. The SRMI standard practice for sequentially imputing variables in X conditional on the other X variables can be extended to also condition on response indicators, R1 and R2. As a generalization of SRMI under MNAR missingness, some researchers propose including missingness indicators R1 and R2 as predictors in regression models used for imputation (9; 10). When imputing missing values of X1, for example, we might include R2 as a covariate in the model that is used for imputation. R1 may also be incorporated into the imputation of X1 through a corresponding fixed parameter, δ, used in sensitivity analysis to control the degree of MNAR dependence between X1 and R1. However, it is unclear how well these strategies approximate the true posterior distribution and in what settings this approach is justified.

In this paper, we primarily explore a particular missingness scenario where missingness in each covariate is MNAR dependent on other variables that themselves have missing values but where it does not depend on the value of the missing covariate itself, conditional on fully-observed variables. This setting may occur, for example, if decisions for whether or not a medical test is performed are based on other incompletely-recorded patient characteristics. In this missingness setting, we derive regression model approximations for imputing normally-distributed, binary, and categorical variables within the SRMI algorithm under this form of MNAR. This work provides theoretical justification for existing modifications of the SRMI procedure under MNAR and suggests several new extensions that may outperform existing SRMI strategies in certain settings. The paper is organized as follows: we first propose extensions of SRMI for handling MNAR missingness, including an exact imputation strategy and several simple approximations. We then compare the performance of these different approximation strategies in terms of bias in estimating downstream regression model parameters in a simulation study. We then apply these methods to handle informative missingness in a motivating study of the prevalence of BRCA1 and BRCA2 pathogenic variants among women newly-diagnosed with breast cancer, where missingness in the BRCA1/2 status is likely related to familial history of breast cancer diagnosis, which is also only partially observed. Finally, we present a discussion.

Sequential regression multiple imputation under MNAR

Deriving the conditional imputation distribution

Assume we have a dataset consisting of n independent observations in p variables, denoted X1, …, Xp. For each subject, let Rj = 0 if Xj is missing and Rj = 1 if Xj is observed. Let X(−j) denote the p − 1 variables in X left after excluding Xj, and let R(−j) denote the p − 1 variables in R left after excluding Rj. To avoid the situation where an observation has missing values for all X’s, we will assume that at least one of the X’s has no missing values for every subject. We will also assume a non-monotone pattern of missingness, by which we mean there is no (j, k) pair of variables for which Rj = 0 implies Rk = 0. Our target of interest is some aspect of the joint distribution of X1, …, Xp, such as the coefficients in the regression model of X1 on all the other X’s or the mean of X1.

We propose using a sequential regression multiple imputation (SRMI) scheme to obtain B complete datasets with the missing X’s filled in. We then follow the standard approach (1) of analyzing each imputed dataset separately with the desired model and then combining those results to give final estimates and confidence intervals. We want to impute each variable Xj with missing values from its assumed distribution given X(−j) and R, denoted f(Xj|X(−j), R). Some form of regression model can be used to approximate this distribution, where each regression model is tailored to the variable type for Xj, e.g. logistic regression if Xj is binary, linear regression if Xj is continuous, etc. In practice, these regression models are usually specified to have a linear combination of the variables on the right hand side, but these models could also be more flexible and include non-linear and interaction terms. The question then becomes how R should be incorporated into the imputation regression models. One strategy is to include R(−j) directly as additional predictors in the imputation model. Since we cannot use the observed data to reliably estimate the association between Xj and Rj, Rj can be indirectly incorporated into the imputation regression model through a fixed offset term δj Rj, where δj is treated as a sensitivity analysis parameter. Mercaldo et al (2020) (10) called this strategy multiple imputation with missing indicators (MIMI), and Tompsett et al (2018) (9) also advocate for its general use. We will call this general strategy “sequential regression multiple imputation with missing indicators”, denoted SRMI-MI, and we focus on the particular setting where δj = 0 under Assumptions 1 and 2 below.
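As a rough illustration of the combining step mentioned above, the following sketch pools B point estimates and their estimated variances using Rubin's standard combining rules. The function name `pool_rubin` and the example numbers are hypothetical, and the degrees-of-freedom calculation needed for confidence intervals is omitted.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool B point estimates and their estimated variances (squared SEs)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    b = q.size
    q_bar = q.mean()                   # pooled point estimate
    within = u.mean()                  # average within-imputation variance
    between = q.var(ddof=1)            # between-imputation variance
    total = within + (1.0 + 1.0 / b) * between
    return q_bar, total

# Hypothetical estimates from B = 3 imputed datasets
est, var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The total variance adds the between-imputation component, inflated by (1 + 1/B), to the average within-imputation variance, so uncertainty due to the missing values carries through to the final standard error.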

One justification for including the extra terms R(−j) in the imputation models is simply as a way to make the imputation model more flexible and allow the whole imputation procedure to be less reliant on the possibly restrictive assumptions of imputation models with small numbers of parameters. A more formal justification can be obtained by considering a Bayesian MCMC approach for the problem. Mimicking the ideas developed for other models (11; 12), we obtain the form of the ideal conditional distribution expressed such that the imputation distribution is congenial with, or at least approximately congenial with, the desired target model of the analyst (13). Suppose that the desired target analysis model is some function of the joint distribution of X1, …, Xp, written as f(X1, …, Xp). This joint distribution would then determine the form of any submodel based on X, such as the marginal distribution of Xj or the conditional distribution of Xj|X(−j). Treating (R1, …, Rp) as random variables, we write the joint distribution of X1, …, Xp, R1, …, Rp as f(X1, …, Xp, R1, …, Rp), which can be factored in a selection model form as

$$f(X_1, \ldots, X_p) \times f(R_1, \ldots, R_p \mid X_1, \ldots, X_p).$$

In the MCMC algorithm, we would ideally draw missing values of Xj from the following conditional distribution:

$$f(X_j \mid X_{(-j)}, R) \propto f(X_j \mid X_{(-j)})\, f(R_j \mid X, R_{(-j)})\, f(R_{(-j)} \mid X) \tag{Eq. 1}$$

viewed as a function of Xj. In this expression, the distribution f(Rj|X, R(−j)) is not identified using the observed data, and f(R(−j)|X) may take a complicated form in general. In order to focus our attention on a more tractable missing data setting, we make the following two assumptions:

Assumption 1. The Rj’s are conditionally independent given X1, …, Xp.

Assumption 2. The missingness in Xj does not depend on Xj, i.e. f(Rj|X) = f(Rj|X(−j)).

The second assumption allows the missingness of one variable Rj to depend on another variable Xk, k ≠ j, which itself may be missing. In this sense, this setting is a relaxation of the usual missing at random assumption, where missingness may depend only on variables that are fully-observed given the observed data. We view the first assumption as a mild one, and it could be relaxed to have blocks of Rj’s be conditionally independent. The second assumption is a stronger one, and its reasonableness will depend on the context of the missing data problem. For a recent work on handling missingness dependent on a covariate’s own values, see Beesley and Taylor (2021) (14). Under Assumptions 1-2, we can simplify Eq. 1 as follows:

$$f(X_j \mid X_{(-j)}, R) \propto f(X_j \mid X_{(-j)}) \prod_{k \neq j} f(R_k \mid X_j, X_{(-j)}). \tag{Eq. 2}$$

We see immediately that Rj does not occur in this expression. Additionally, any missingness indicator Rk such that Rk ⊥ Xj | X(−j) can also be ignored. The imputation distribution of Xj, therefore, will depend on X(−j) and any indicator Rk, k ≠ j, such that Rk is not independent of Xj given X(−j). The distribution in Eq. 2 will generally be a messy expression. We can apply importance sampling methods, rejection sampling, weighting, Metropolis-Hastings algorithms, or grid-based sampling to draw directly from Eq. 2. In the case of rejection sampling, for example, we could draw candidate imputations from f(Xj|X(−j)) and accept the first candidate draw that satisfies $U < \prod_{k \neq j} f(R_k \mid X_j, X_{(-j)})$, where random variable U is drawn from a uniform(0,1) distribution and the missingness model densities are evaluated at draws of the corresponding model parameters (see Supplementary Section D for details). When Xj is categorical, the exact form of the probability mass function can be worked out based on Eq. 2 as in Eq. 3. In general, we may not want to specify parametric models for the missingness probabilities, or we may prefer to impute using regression model structures. In the remainder of this paper, we will consider approximations to Eq. 2 that could be more easily implemented in a SRMI-MI algorithm.
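The rejection sampling step described above might look roughly like the following sketch for a single subject. It assumes, purely for illustration, a normal model for X1 given X(−1) and logistic missingness models with already-drawn parameters; the function name `draw_x1_rejection` and its arguments are hypothetical and are not taken from the paper's supplementary code.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def draw_x1_rejection(mu, sigma, r, lin_other, phi1, rng, max_tries=10000):
    """Draw one X1 from a density proportional to Eq. 2.

    mu, sigma : mean and SD of the assumed normal model for X1 | X(-1)
    r         : 0/1 array of indicators R_k, k != 1
    lin_other : part of each missingness linear predictor not involving X1
    phi1      : coefficient of X1 in each missingness model
    """
    for _ in range(max_tries):
        cand = rng.normal(mu, sigma)              # candidate from f(X1 | X(-1))
        p_obs = expit(lin_other + phi1 * cand)    # P(R_k = 1 | candidate, X(-1))
        accept = np.prod(np.where(r == 1, p_obs, 1.0 - p_obs))
        if rng.random() < accept:                 # U < prod_k f(R_k | X)
            return cand
    raise RuntimeError("no candidate accepted")

rng = np.random.default_rng(0)
x1 = draw_x1_rejection(0.0, 1.0, np.array([1, 0]),
                       np.array([0.2, -0.1]), np.array([0.5, 0.5]), rng)
```

Because each f(Rk|X) is a probability, the acceptance threshold is always at most 1, so no extra bounding constant is needed. Acceptance rates can become low when many indicators enter the product.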

A subtle but noteworthy issue is that the distribution in Eq. 2 does not condition on model parameters and is instead only a function of the data. For multiple imputation, drawing from the conditional distribution without parameters is usually achieved in two stages, first by drawing parameters of the model and then imputing the variable from the conditional distribution based on that parameter value. The same technique would be used for Eq. 2, in which parameters for the component distributions are drawn from distributions that are derived using the available data. This is often implemented by fitting the corresponding component model on a bootstrap sample of the data or by making a multivariate normal approximation (2). The question then becomes which subset of the data should be used to derive the distribution from which to perform these parameter draws. In a Bayesian MCMC algorithm, parameters are drawn conditional on the most recently-drawn values for all other parameters. In the missing data setting, this would suggest drawing model parameters using the most recently-imputed data for the entire dataset of size n. In contrast, the usual implementation of SRMI methods draws imputation model parameters for imputing Xj using the data with Xj observed, here called local complete case data, and treating the most recent imputations of X(−j) as if they were observed. It is feasible to adapt SRMI to make use of all n observations, i.e. use current imputed values of Xj and X(−j) in the estimation of the regression model for Xj given all other variables. It is easy to show that, theoretically, either approach can be used when missingness in Xj is independent of Rj. In practice, imputing within SRMI based on all n observations may be preferred simply because the increased sample size may give better estimates of the relationship between each Xj and R(−j). This approach is used in our simulations and data analysis.
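The two-stage draw described above can be sketched as follows for a continuous variable, using a bootstrap refit as the parameter draw. This is a minimal sketch under an assumed linear imputation model; the function name `impute_normal_bootstrap` and the `use_all_rows` switch (all n rows versus local complete cases) are hypothetical names introduced for illustration.

```python
import numpy as np

def impute_normal_bootstrap(y, X, missing, rng, use_all_rows=True):
    """One stochastic imputation of y[missing] from a linear model.

    y       : current values of the variable (including current imputations)
    X       : predictors, e.g. other variables and missingness indicators
    missing : boolean array, True where y was originally missing
    """
    design = np.column_stack([np.ones(len(y)), X])
    # Rows used to fit: all n rows, or only rows where y was observed
    fit_rows = np.arange(len(y)) if use_all_rows else np.flatnonzero(~missing)
    # Stage 1: parameter draw by refitting on a bootstrap sample
    boot = rng.choice(fit_rows, size=fit_rows.size, replace=True)
    beta, *_ = np.linalg.lstsq(design[boot], y[boot], rcond=None)
    resid = y[boot] - design[boot] @ beta
    sigma = np.sqrt(resid @ resid / (boot.size - design.shape[1]))
    # Stage 2: draw the missing values given the drawn parameters
    y_new = y.copy()
    y_new[missing] = design[missing] @ beta + rng.normal(0.0, sigma, missing.sum())
    return y_new
```

With `use_all_rows=True`, the fit uses the current imputed values of y, which only makes sense inside an iterative algorithm where earlier imputations already exist.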

Regression model approximations for imputing binary, categorical and continuous variables

In this section, we approximate the imputation distribution proportional to Eq. 2 under different assumptions about the distributions of the variables in X.

Imputing binary variables

Suppose we want to impute binary variable X1 and that the distribution for X1|X(−1) is well-approximated by a logistic regression model as follows:

$$P_1 = P(X_1 = 1 \mid X_{(-1)}) = \text{expit}\Big(\theta_0 + \sum_{j=2}^{p} \theta_j X_j\Big)$$

where expit(u) = exp(u)/(1 + exp(u)). Let PRj(x1) denote the probability of observing Xj given X(−j) with X1 = x1. We note that PRj(x1) can be a function of all the X’s except Xj, but for convenience we use the notation PRj(x1). Thus, for example, PR2(1) = P(R2 = 1|X1 = 1, X3, …, Xp). Following this notation and accounting for proportionality, we can express Eq. 2 as P(X1 = 1|X(−1), R) = A/(A + B) where

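$$A = P_1 \prod_{j=2}^{p} P_{R_j}(1)^{R_j}\,[1 - P_{R_j}(1)]^{1-R_j} \quad \text{and} \quad B = (1 - P_1) \prod_{j=2}^{p} P_{R_j}(0)^{R_j}\,[1 - P_{R_j}(0)]^{1-R_j}.$$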

This expression simplifies as follows:

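$$\begin{aligned}
\log\left[\frac{P(X_1 = 1 \mid X_{(-1)}, R)}{1 - P(X_1 = 1 \mid X_{(-1)}, R)}\right]
&= \log\left[\frac{P_1}{1 - P_1}\right] + \sum_{j=2}^{p} \left\{ R_j \log\left[\frac{P_{R_j}(1)}{P_{R_j}(0)}\right] + (1 - R_j) \log\left[\frac{1 - P_{R_j}(1)}{1 - P_{R_j}(0)}\right] \right\} \\
&= \theta_0 + \sum_{j=2}^{p} \theta_j X_j + \sum_{j=2}^{p} \left\{ R_j \log\left[\frac{P_{R_j}(1)}{P_{R_j}(0)}\right] + (1 - R_j) \log\left[\frac{1 - P_{R_j}(1)}{1 - P_{R_j}(0)}\right] \right\}
\end{aligned} \tag{Eq. 3}$$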

This can also be rewritten as

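$$\log\left[\frac{P(X_1 = 1 \mid X_{(-1)}, R)}{1 - P(X_1 = 1 \mid X_{(-1)}, R)}\right] = \theta_0 + \sum_{j=2}^{p} \theta_j X_j + \sum_{j=2}^{p} R_j \log\left[\frac{P_{R_j}(1)}{P_{R_j}(0)} \cdot \frac{\{1 - P_{R_j}(0)\}}{\{1 - P_{R_j}(1)\}}\right] + \sum_{j=2}^{p} \log\left[\frac{1 - P_{R_j}(1)}{1 - P_{R_j}(0)}\right]$$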
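As a quick numerical sanity check, the sketch below evaluates A/(A + B) directly and compares it with the two log-odds forms above for one subject. All probabilities are arbitrary illustrative values, not quantities from the paper.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
p1 = 0.3                                  # P(X1 = 1 | X(-1)), illustrative
pr1 = rng.uniform(0.1, 0.9, size=4)       # P_Rj(1) for j = 2, ..., 5
pr0 = rng.uniform(0.1, 0.9, size=4)       # P_Rj(0) for j = 2, ..., 5
r = np.array([1, 0, 1, 0])                # observed missingness indicators

# Direct evaluation of A / (A + B)
A = p1 * np.prod(pr1**r * (1 - pr1)**(1 - r))
B = (1 - p1) * np.prod(pr0**r * (1 - pr0)**(1 - r))
direct = A / (A + B)

# Log-odds form of Eq. 3
log_odds = np.log(p1 / (1 - p1)) + np.sum(
    r * np.log(pr1 / pr0) + (1 - r) * np.log((1 - pr1) / (1 - pr0)))

# Rewritten form: odds-ratio term multiplied by R_j plus a term free of R_j
log_odds_alt = (np.log(p1 / (1 - p1))
                + np.sum(r * np.log(pr1 * (1 - pr0) / (pr0 * (1 - pr1))))
                + np.sum(np.log((1 - pr1) / (1 - pr0))))

via_eq3 = expit(log_odds)
via_alt = expit(log_odds_alt)
```

All three quantities agree up to floating-point error, which confirms that the missingness indicators enter the imputation model only through additive log-ratio terms on the logit scale.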

We now consider several special cases and then propose a general strategy for imputation of a binary variable.

Binary Special Case 1: logistic missingness with main effects.

Suppose that the model for missingness for each variable Xj can be expressed as follows:

$$P_{R_j}(X_1) = P(R_j = 1 \mid X_1, X_{(-1)}) = \text{expit}\Big(\phi_{j0} + \sum_{k \neq j} \phi_{jk} X_k\Big). \tag{Eq. 4}$$

In this case, Eq. 3 can be simplified as

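$$\text{logit}[P(X_1 = 1 \mid X_{(-1)}, R)] = \theta_0 + \sum_{j=2}^{p} \theta_j X_j + \sum_{j=2}^{p} \phi_{j1} R_j + \sum_{j=2}^{p} \left\{ \log\Big[1 + \exp\Big(\phi_{j0} + \sum_{k=2, k \neq j}^{p} \phi_{jk} X_k\Big)\Big] - \log\Big[1 + \exp\Big(\phi_{j0} + \phi_{j1} + \sum_{k=2, k \neq j}^{p} \phi_{jk} X_k\Big)\Big] \right\}$$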

In the special case where p = 3 and X2 and X3 are binary, all the terms involving the log’s can be simplified and combined with θ0 and the θjXj’s, and the final expression is simply a linear combination of X2, …, Xp and R2, …, Rp as follows:

$$\text{logit}[P(X_1 = 1 \mid X_{(-1)}, R)] = \omega_0 + \sum_{j=2}^{p} \omega_j X_j + \sum_{j=2}^{p} \omega_{R_j} R_j. \tag{Eq. 5}$$

In general for p > 3 and for non-binary X2 and X3, Eq. 3 does not reduce to this simple additive form. However, a first order Taylor series approximation of the logarithm terms (assuming all values of ϕjk, k ≠ j, are small) does lead to Eq. 5 as an approximation to the desired imputation distribution. A second order Taylor series approximation results in the following regression model structure:

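$$\text{logit}[P(X_1 = 1 \mid X_{(-1)}, R)] \approx \alpha_0 + \sum_{k=2}^{p} \alpha_k X_k + \sum_{k=2}^{p} \alpha_{R_k} R_k + \sum_{j=2}^{p} \sum_{k=2}^{p} \alpha_{2jk} X_j X_k \tag{Eq. 6}$$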

to impute X1, i.e. including interactions between the X’s.

Binary Special Case 2: interactions in logistic missingness model.

Suppose instead that the missingness models include interactions between other covariates and X1. For simplicity, we will assume p = 3. Suppose that

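$$\begin{aligned}
\text{logit}[P_{R_2}(X_1)] &= \phi_{20} + \phi_{21} X_1 + \phi_{23} X_3 + \phi_{24} X_1 X_3 \\
\text{logit}[P_{R_3}(X_1)] &= \phi_{30} + \phi_{31} X_1 + \phi_{32} X_2 + \phi_{34} X_1 X_2
\end{aligned}$$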

In this case, the imputation takes the following form:

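$$\begin{aligned}
\text{logit}[P(X_1 = 1 \mid X_2, X_3, R_2, R_3)] = {} & \theta_0 + \theta_2 X_2 + \theta_3 X_3 + R_2[\phi_{21} + \phi_{24} X_3] + R_3[\phi_{31} + \phi_{34} X_2] \\
& + \log[1 + \exp(\phi_{20} + \phi_{23} X_3)] - \log[1 + \exp(\phi_{20} + \phi_{21} + \phi_{23} X_3 + \phi_{24} X_3)] \\
& + \log[1 + \exp(\phi_{30} + \phi_{32} X_2)] - \log[1 + \exp(\phi_{30} + \phi_{31} + \phi_{32} X_2 + \phi_{34} X_2)].
\end{aligned}$$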

Using the same logic as before and applying a first order Taylor series approximation, we can express the imputation distribution as follows:

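$$\text{logit}[P(X_1 = 1 \mid X_2, X_3, R_2, R_3)] \approx \omega_0 + \omega_2 X_2 + \omega_3 X_3 + \omega_{R_2} R_2 + \omega_{R_3} R_3 + \omega_{3,R_2} X_3 R_2 + \omega_{2,R_3} X_2 R_3 \tag{Eq. 7}$$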

For p > 3, we can similarly approximate the imputation distributions by including interactions between the X’s and missingness indicators.

Binary General Case.

Suppose now that variables X2, …, Xp have some unspecified form and we allow PRj(X1) = P(Rj = 1|X) to take more general (e.g. non-logistic) form. We notice that Eq. 3 resembles a logistic regression model with predictors X(−1) and a term that is a function of the missingness indicators, R(−j), and the probabilities of missingness, PRj(X1). Guided by Eq. 3, we propose the following strategy for imputing missing values of X1 within each iteration of a chained equations imputation algorithm:

  1. For each j > 1, fit a model (e.g. logistic or probit regression or even a regression tree) to the current imputed dataset of size n for the probability that Xj is observed.

  2. For each observation and each j > 1, use these model estimates to calculate the probability that Xj is observed with X1 set to 0 and with X1 set to 1 to give PRj(0) and PRj(1), respectively. To calculate these probabilities, use the most recent imputed values for X(−j).

  3. Define new variables
    $$Z_j = R_j \log\left[\frac{P_{R_j}(1)}{P_{R_j}(0)}\right] + (1 - R_j) \log\left[\frac{1 - P_{R_j}(1)}{1 - P_{R_j}(0)}\right]. \tag{Eq. 8}$$
  4. Impute X1 using the following model:
    $$\text{logit}[P(X_1 = 1 \mid X_{(-1)}, R_2, R_3, Z_2, Z_3)] = \omega_0 + \sum_{k=2}^{p} \omega_k X_k + \sum_{k=2}^{p} Z_k \tag{Eq. 9}$$
    where the ω’s are first drawn from an approximation to their posterior distribution derived from a model fit to the full imputed dataset and where $\sum_{k=2}^{p} Z_k$ is a fixed offset (with coefficient equal to 1).
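Steps 1 to 4 above could be sketched as follows, assuming main-effects logistic missingness models and array layouts chosen for illustration (column 0 of `x` holds X1; `r` holds the indicators). The helper names `fit_logistic`, `offset_terms`, and `impute_binary_x1` are hypothetical. As a simplification, the sketch plugs in maximum likelihood estimates of the ω's rather than drawing them from an approximate posterior; a bootstrap refit could be substituted for that step.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic(design, y, offset=None, n_iter=25):
    """Logistic regression by Newton-Raphson, with an optional fixed offset."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = expit(design @ beta + offset)
        hess = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, design.T @ (y - p))
    return beta

def offset_terms(x, r, j):
    """Z_j of Eq. 8 from a logistic model for R_j fit to the current data."""
    others = [k for k in range(x.shape[1]) if k != j]      # X1 comes first
    design = np.column_stack([np.ones(len(x)), x[:, others]])
    phi = fit_logistic(design, r[:, j])                    # step 1
    d1, d0 = design.copy(), design.copy()
    d1[:, 1], d0[:, 1] = 1.0, 0.0                          # set X1 to 1 and 0
    p1, p0 = expit(d1 @ phi), expit(d0 @ phi)              # step 2
    return np.where(r[:, j] == 1,                          # step 3
                    np.log(p1 / p0), np.log((1 - p1) / (1 - p0)))

def impute_binary_x1(x, r, rng):
    """One update of binary X1 given current imputations in x."""
    z = np.zeros(len(x))
    for j in range(1, x.shape[1]):
        if r[:, j].all():          # fully observed X_j contributes Z_j = 0
            continue
        z += offset_terms(x, r, j)
    design = np.column_stack([np.ones(len(x)), x[:, 1:]])
    omega = fit_logistic(design, x[:, 0], offset=z)        # step 4 (MLE plug-in)
    prob = expit(design @ omega + z)
    x_new = x.copy()
    miss = r[:, 0] == 0
    x_new[miss, 0] = rng.random(miss.sum()) < prob[miss]
    return x_new
```

Fully observed variables are skipped because their observation probability does not vary with X1, and fitting a missingness model to an indicator that is always 1 would not converge.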

Imputing multinomial variables

Now, we suppose that X1 is a categorical variable taking values in 0, 1, …, S and that the distribution for X1|X(−1) is well-approximated by a multinomial regression as follows:

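$$P_s = P(X_1 = s \mid X_{(-1)}) = \frac{\exp\big(\theta_{0s} + \sum_{j=2}^{p} \theta_{js} X_j\big)}{1 + \sum_{r=1}^{S} \exp\big(\theta_{0r} + \sum_{j=2}^{p} \theta_{jr} X_j\big)}$$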

where all θj0’s are equal to zero. As in the derivation of Eq. 3, we can write the imputation distribution as follows:

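$$\log\left[\frac{P(X_1 = s \mid X_{(-1)}, R)}{P(X_1 = 0 \mid X_{(-1)}, R)}\right] = \theta_{0s} + \sum_{j=2}^{p} \theta_{js} X_j + \sum_{j=2}^{p} \left\{ R_j \log\left[\frac{P_{R_j}(s)}{P_{R_j}(0)}\right] + (1 - R_j) \log\left[\frac{1 - P_{R_j}(s)}{1 - P_{R_j}(0)}\right] \right\} \tag{Eq. 10}$$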

where PRj(s) corresponds to the probability of observing Xj with X1 = s.

In the special case where PRj(X1) corresponds to a logistic regression with main effects such that

$$\text{logit}(P(R_j = 1 \mid X_1 = s, X_{(-1)})) = \phi_{0js} + \sum_{k=2, k \neq j}^{p} \phi_{kjs} X_k,$$

we have the following for s > 0:

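$$\begin{aligned}
\log\left[\frac{P(X_1 = s \mid X_{(-1)}, R)}{P(X_1 = 0 \mid X_{(-1)}, R)}\right] = {} & \theta_{0s} + \sum_{j=2}^{p} \theta_{js} X_j \\
& + \sum_{j=2}^{p} \left\{ R_j \Big[\phi_{0js} + \sum_{k=2, k \neq j}^{p} \phi_{kjs} X_k\Big] - \log\Big[1 + \exp\Big(\phi_{0js} + \sum_{k=2, k \neq j}^{p} \phi_{kjs} X_k\Big)\Big] \right\} \\
& - \sum_{j=2}^{p} \left\{ R_j \Big[\phi_{0j0} + \sum_{k=2, k \neq j}^{p} \phi_{kj0} X_k\Big] - \log\Big[1 + \exp\Big(\phi_{0j0} + \sum_{k=2, k \neq j}^{p} \phi_{kj0} X_k\Big)\Big] \right\}
\end{aligned} \tag{Eq. 11}$$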

A first order Taylor series approximation of Eq. 11 suggests a regression of the form:

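$$\log\left[\frac{P(X_1 = s \mid X_{(-1)}, R)}{P(X_1 = 0 \mid X_{(-1)}, R)}\right] \approx \omega_{0s} + \sum_{j=2}^{p} \omega_{js} X_j + \sum_{j=2}^{p} \omega_{R_j s} R_j + \sum_{j=2}^{p} \sum_{k=2, k \neq j}^{p} \omega_{R_j X_k s} R_j X_k. \tag{Eq. 12}$$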

In other words, we can include the missingness indicators and their interactions with X as additional predictors. If we can further assume no interaction between X1 and the other X’s in the model for the missingness of Xj, then ϕkjs takes a single value across s = 1, …, S for k = 2, …, p, k ≠ j. In this case, we have

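$$\log\left[\frac{P(X_1 = s \mid X_{(-1)}, R)}{P(X_1 = 0 \mid X_{(-1)}, R)}\right] \approx \alpha_{0s} + \sum_{j=2}^{p} \alpha_{js} X_j + \sum_{j=2}^{p} \alpha_{R_j s} R_j, \tag{Eq. 13}$$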

indicating that we should just include the missingness indicators in the imputation model.

For more general missingness mechanisms, we can apply a generalization of the offset strategy of Eq. 9 where we define offsets:

$$Z_{js} = R_j \log\left[\frac{P_{R_j}(s)}{P_{R_j}(0)}\right] + (1 - R_j) \log\left[\frac{1 - P_{R_j}(s)}{1 - P_{R_j}(0)}\right] \tag{Eq. 14}$$

and impute from a regression model as follows:

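$$\log\left[\frac{P(X_1 = s \mid X_{(-1)}, R)}{P(X_1 = 0 \mid X_{(-1)}, R)}\right] = \omega_{0s} + \sum_{k=2}^{p} \omega_{ks} X_k + \sum_{k=2}^{p} Z_{ks} \tag{Eq. 15}$$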

where $\sum_{k=2}^{p} Z_{ks}$ is a fixed offset.
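For a single subject, the category-specific offsets of Eq. 14 and the resulting imputation probabilities of Eq. 15 might be computed as in the sketch below. The function name `multinomial_impute_probs` and the array layout are assumptions made for illustration.

```python
import numpy as np

def multinomial_impute_probs(eta, p_obs, r):
    """Imputation probabilities P(X1 = s | X(-1), R) for s = 0, ..., S.

    eta   : (S+1,) linear predictors of the multinomial model, with eta[0] = 0
    p_obs : (S+1, m) array; p_obs[s, j] is P_Rj(s) for each of m indicators
    r     : (m,) 0/1 missingness indicators
    """
    # Offsets Z_js of Eq. 14 relative to the reference category s = 0
    z = r * np.log(p_obs / p_obs[0]) + (1 - r) * np.log((1 - p_obs) / (1 - p_obs[0]))
    log_odds = eta + z.sum(axis=1)            # Eq. 15, one entry per category
    w = np.exp(log_odds - log_odds.max())     # stabilized softmax
    return w / w.sum()

# Illustrative values for S = 2 and two missingness indicators
eta = np.array([0.0, 0.4, -0.3])
p_obs = np.array([[0.5, 0.7], [0.6, 0.4], [0.8, 0.3]])
r = np.array([1, 0])
probs = multinomial_impute_probs(eta, p_obs, r)
```

The offsets for the reference category are zero by construction, so the result matches what one would obtain by multiplying each category's prior probability by the likelihood of the observed indicators and renormalizing.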

Imputing continuous variables

We now suppose that X1 follows some continuous distribution defined on the real line. First, we will consider the special case where X1 is normally-distributed given X(−1). Then, we will propose a strategy for more general X1.

Continuous Special Case 1: imputing normally-distributed variable.

Suppose first that X1 is normally distributed such that $X_1 \mid X_{(-1)} \sim N\big(\theta_0 + \sum_{k=2}^{p} \theta_k X_k, \sigma^2\big)$. Suppose further that the probability of observing Xj is given by

$$\text{logit}[P(R_j = 1 \mid X)] = \phi_{j0} + \sum_{k \neq j} \phi_{jk} X_k.$$

Following Eq. 2, we can express the imputation model for X1 as

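$$\begin{aligned}
f(X_1 \mid X_{(-1)}, R_{(-1)}) &\propto f(X_1 \mid X_2, \ldots, X_p) \prod_{k=2}^{p} f(R_k \mid X_1, \ldots, X_p) \\
&\propto \exp\left(-\frac{\big[X_1 - \big(\theta_0 + \sum_{k=2}^{p} \theta_k X_k\big)\big]^2}{2\sigma^2}\right) \times \prod_{k=2}^{p} \frac{\exp\big(R_k\big[\phi_{k0} + \sum_{s \neq k} \phi_{ks} X_s\big]\big)}{1 + \exp\big(\phi_{k0} + \sum_{s \neq k} \phi_{ks} X_s\big)}
\end{aligned} \tag{Eq. 16}$$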

Consider the special case where p = 3 (so X = (X1, X2, X3)). The two terms in Eq. 16 are respectively a bell-shaped curve and the product of two separate bounded sigmoid functions as a function of X1. The sigmoid curve for f(R2|X) will be increasing in X1 for one value of R2 and decreasing for the other value, and likewise f(R3|X) will be increasing in X1 for one value of R3 and decreasing for the other value. To represent a valid distribution, the product in Eq. 16 has to be normalized to integrate to 1. More generally, it is clear that the full conditional distribution of X1 will depend on Rk, assuming ϕk1 ≠ 0. Additionally, the conditional distribution of X1 is not symmetric and its mean is no longer given by $\theta_0 + \sum_{k=2}^{p} \theta_k X_k$. While it is feasible to draw from the distribution proportional to Eq. 16 exactly, we will explore approximations that may be easier to draw from in practice.
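One way to draw approximately from the normalized version of Eq. 16 is to evaluate it on a fine grid, as in the hedged sketch below for a single subject. The function name `draw_x1_grid`, the grid width, and the grid size are illustrative assumptions, and the missingness parameters are treated as already drawn.

```python
import numpy as np

def draw_x1_grid(mu, sigma, r, lin_other, phi1, rng, n_grid=2001, width=8.0):
    """Grid-based draw of X1 from the density proportional to Eq. 16.

    mu, sigma : mean and SD of the normal model for X1 | X(-1)
    r         : 0/1 indicators R_k, k = 2, ..., p
    lin_other : part of each missingness linear predictor not involving X1
    phi1      : coefficient of X1 in each missingness model
    """
    grid = np.linspace(mu - width * sigma, mu + width * sigma, n_grid)
    log_dens = -0.5 * ((grid - mu) / sigma) ** 2              # bell-shaped term
    eta = lin_other[:, None] + phi1[:, None] * grid[None, :]  # one row per R_k
    # Sigmoid terms: R_k * eta - log(1 + exp(eta)), summed over k
    log_dens += np.sum(r[:, None] * eta - np.logaddexp(0.0, eta), axis=0)
    w = np.exp(log_dens - log_dens.max())
    w /= w.sum()                                              # normalize
    return rng.choice(grid, p=w), grid, w

rng = np.random.default_rng(0)
draw, grid, w = draw_x1_grid(0.0, 1.0, np.array([1, 1]),
                             np.array([0.0, 0.0]), np.array([1.0, 1.0]), rng)
cond_mean = np.sum(grid * w)   # shifted away from mu when phi1 is nonzero
```

In this example both indicators equal 1 and both coefficients are positive, so the grid-based conditional mean sits above the unadjusted mean of 0, illustrating the mean shift noted above.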

Approximation Strategy 1:

An intuitive approximation of Eq. 16 would be to draw X1 from the following normal distribution:

$$N\Big(\omega_0 + \sum_{k=2}^{p} \omega_k X_k + \sum_{k=2}^{p} \omega_{R_k} R_k,\ \tau^2\Big). \tag{Eq. 17}$$

This strategy can be justified as a second order Taylor series approximation of Eq. 16 as follows. Assuming ϕjk is small for all k,

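$$\log[f(R_j \mid X)] \approx R_j\Big[\phi_{j0} + \sum_{k \neq j} \phi_{jk} X_k\Big] - \log(1 + \exp(\phi_{j0})) - \frac{\exp(\phi_{j0})}{1 + \exp(\phi_{j0})} \sum_{k=1}^{p} \phi_{jk} X_k - \frac{1}{2} \frac{\exp(\phi_{j0})}{[1 + \exp(\phi_{j0})]^2} \Big(\sum_{k=1}^{p} \phi_{jk} X_k\Big)^2$$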

where ϕjj = 0. Combining these expressions with the form for log f(X1|X(−1)) and collecting terms multiplied by X1 in Eq. 16, we obtain a linear regression in the form of Eq. 17.

If the association between the X’s and the R’s is stronger, then this Taylor series approximation may be less accurate, and a more involved approach to drawing values of missing X’s is needed. For example, we notice from Eq. 16 that X2 appears in f(X1|X(−1)) and may also be included in the various missingness models, suggesting something more general than a linear term in X2 may be needed for imputing X1. We propose including a spline function of X2. Similar spline terms could also be included for other covariates in the imputation model. This results in the following approximate imputation distribution:

$$N\Big(s_2(X_2) + s_3(X_3) + \cdots + s_p(X_p) + \sum_{k=2}^{p} \omega_{R_k} R_k,\ \tau^2\Big), \tag{Eq. 18}$$

where sk(Xk) denotes a spline function of covariate Xk. The presence of the product of sigmoid curves in Eq. 16 modifies both the spread and skewness of the imputation distribution. We will ignore the skewness, but we could accommodate the spread by letting it depend on the values of R. Thus, another level of approximation would be drawing X1 from a normal distribution

$$N\Big(s_2(X_2) + s_3(X_3) + \cdots + s_p(X_p) + \sum_{k=2}^{p} \omega_{R_k} R_k,\ \tau(R_2, R_3, \ldots, R_p)^2\Big). \tag{Eq. 19}$$

As an even more flexible approximation, we might allow the variance to depend on R and incorporate interactions between X and R in the mean structure of the imputation distribution. The approximations in Eq. 18 and Eq. 19 could be incorporated into a sequential regression multiple imputation procedure, provided the software being used had the ability to include splines instead of simple linear terms in the mean structure of the regression model. In practice, a large value of n may be required to actually fit the largest of the above models during the imputation procedure. To build in even more flexibility in the imputation model, we might take a generally more robust approach to multiple imputation, such as predictive mean matching (15; 16) or random forests (17), conditioning on R2, …, Rp in addition to other variables when imputing X1.

Approximation Strategy 2:

Rather than approximating the mean structure of Eq. 16 using Taylor series approximations, we could instead consider the mode of the distribution in Eq. 16, which we call mode(X(−1), R(−1)). Assuming the distribution in Eq. 16 is uni-modal, we might impute missing X1 from N(mode(X(−1), R(−1)), τ²). Taking the derivative with respect to X1 of the log of Eq. 16 leads to the following expression:

$$-\frac{X_1 - \big(\theta_0 + \sum_{k=2}^{p} \theta_k X_k\big)}{\sigma^2} + \sum_{k=2}^{p} \phi_{k1}\big[R_k - P_{R_k}(X_1)\big] \tag{Eq. 20}$$

where $P_{R_k}(X_1) = P(R_k = 1 \mid X) = \text{expit}\big(\phi_{k0} + \sum_{s \neq k} \phi_{ks} X_s\big)$ is viewed as a function of X1. Assuming a uni-modal distribution, we can obtain mode(X(−1), R(−1)) by setting Eq. 20 equal to 0 and solving for X1. Finding this mode is numerically feasible, but the form of Eq. 20 suggests an alternative approach for imputing X1 within the iterative imputation algorithm:

  1. For each j > 1, fit a logistic regression model to the current imputed dataset of size n for the probability that Xj is observed.

  2. Using the most recent imputed values for X1 and the latest estimates of ϕ and PRj(X1) obtained in step 1, define new variables Zj = ϕj1[Rj − PRj(X1)] for each j > 1.

  3. Impute X1 using the following model:
    $$N\Big(\omega_0 + \sum_{k=2}^{p} \omega_k X_k + \sigma^2 \sum_{k=2}^{p} Z_k,\ \tau^2\Big) \tag{Eq. 21}$$
    where the ω’s are drawn from the approximation to their posterior distribution obtained by fitting Eq. 21 to the full imputed dataset and $\sigma^2 \sum_{k=2}^{p} Z_k$ is treated as an offset using the estimate of σ² obtained from fitting the model for X1 given X2, …, Xp to the complete data. Alternatively, $\sum_{k=2}^{p} Z_k$ could be added as another predictor in the imputation model.

Continuous General Case: non-normal continuous variable.

Suppose X1 takes a more general non-Gaussian continuous form and that X2, …, Xp take unspecified forms. For this case, we may first transform X1 so that the conditional distribution of X1|X2, …, Xp is approximately Gaussian with constant variance σ². Using the intuition developed for normally-distributed X1 above, we propose approximating the conditional imputation distribution for X1 in Eq. 2 using one of the following three imputation distributions:

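$$N\Big(\omega_0 + \sum_{k=2}^{p} \omega_k X_k + \sum_{k=2}^{p} \omega_{R_k} R_k,\ \tau^2\Big), \tag{Eq. 22}$$

$$N\Big(\sum_{k=2}^{p} s_k(X_k) + \sum_{k=2}^{p} \omega_{R_k} R_k,\ \tau^2\Big), \tag{Eq. 23}$$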

where sk(Xk) is a spline function of Xk, and

$$N\Big(\omega_0 + \sum_{k=2}^{p} \omega_k X_k + \sigma^2 \sum_{k=2}^{p} Z_k,\ \tau^2\Big), \tag{Eq. 24}$$

where Zk = ϕk1[Rk − PRk(X1)] is a constructed variable based on the estimated probability of observing Xk, PRk(X1) = P(Rk = 1|X(−k)), obtained using the most recent imputed data.
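The relationship between the mode defined through Eq. 20 and the offset-based mean used in Eq. 21 and Eq. 24 can be illustrated with the sketch below for one subject. It assumes logistic missingness models with known parameters; the names `mode_x1` and `offset_mean` are hypothetical, and Newton's method is one possible root-finder, not necessarily the one used by the authors.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def mode_x1(mu, sigma2, r, lin_other, phi1, n_iter=50):
    """Solve Eq. 20 = 0 for X1 by Newton's method, starting at mu.

    mu        : theta_0 + sum_k theta_k X_k
    sigma2    : residual variance of X1 | X(-1)
    lin_other : part of each missingness linear predictor not involving X1
    phi1      : coefficient of X1 in each missingness model
    """
    x = mu
    for _ in range(n_iter):
        p = expit(lin_other + phi1 * x)
        score = -(x - mu) / sigma2 + np.sum(phi1 * (r - p))       # Eq. 20
        slope = -1.0 / sigma2 - np.sum(phi1 ** 2 * p * (1.0 - p))
        x = x - score / slope
    return x

def offset_mean(mu, sigma2, r, lin_other, phi1, x_current):
    """Mean of Eq. 21 with Z_k evaluated at the current value of X1."""
    z = phi1 * (r - expit(lin_other + phi1 * x_current))
    return mu + sigma2 * z.sum()

r = np.array([1, 0])
lin_other = np.array([0.2, -0.1])
phi1 = np.array([0.8, 0.5])
m = mode_x1(0.5, 1.5, r, lin_other, phi1)
```

The mode is a fixed point of the offset construction: evaluating `offset_mean` at `x_current = m` returns `m` again. Within the iterative algorithm, the offset is instead evaluated at the most recent imputed value of X1, which avoids solving for the mode explicitly.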

Simulation Studies

Simulation Set-up

We performed numerical studies to investigate the performance of the proposed method under different missingness and X distribution settings. For each setting, we generate 500 simulated datasets with 2000 subjects each. In each simulated dataset, we generate 5 correlated variables under two different scenarios. In the first scenario, we simulated 5 multivariate normal variables X1, …, X5 with mean 0, unit variances, and covariances Σjk = cov(Xj, Xk) as follows: Σ12 = 0.4, Σ14 = Σ35 = 0.3, Σ13 = Σ25 = Σ34 = 0.2, and all remaining covariances equal to 0.1. In the second scenario, covariates X1, X2, and X3 are dichotomized to take the value 1 if the drawn value is above zero. We then impose roughly 25-50% missingness in each of X1, X2, and X3 under the following models:

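$$\begin{aligned}
\text{logit}(P(R_1 = 1 \mid X_2, X_3, X_4, X_5)) &= \phi X_2 + \phi X_3 + \rho X_4 + \rho X_5 \\
\text{logit}(P(R_2 = 1 \mid X_1, X_3, X_4, X_5)) &= \phi X_1 + \phi X_3 + \rho X_4 + \rho X_5 \\
\text{logit}(P(R_3 = 1 \mid X_1, X_2, X_4, X_5)) &= \phi X_1 + \phi X_2 + \rho X_4 + \rho X_5
\end{aligned}$$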

where ϕ=0, 0.25, 0.50, 0.75, 1, or 1.5 and ρ was either 0 or 1. Corresponding complete case probabilities ranged between 12% and 50%.
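A single simulated dataset under this set-up could be generated roughly as follows. This is a sketch based only on the description above: the function name `simulate_dataset` and the seed handling are hypothetical, and the missingness models are implemented exactly as displayed, without intercepts.

```python
import numpy as np

def simulate_dataset(n=2000, phi=0.5, rho=1.0, binary=False, seed=0):
    """Generate X1..X5, missingness indicators, and the observed data."""
    rng = np.random.default_rng(seed)
    cov = np.full((5, 5), 0.1)                   # remaining covariances
    np.fill_diagonal(cov, 1.0)                   # unit variances
    stated = {(1, 2): 0.4, (1, 4): 0.3, (3, 5): 0.3,
              (1, 3): 0.2, (2, 5): 0.2, (3, 4): 0.2}
    for (j, k), v in stated.items():
        cov[j - 1, k - 1] = cov[k - 1, j - 1] = v
    x = rng.multivariate_normal(np.zeros(5), cov, size=n)
    if binary:                                   # second scenario
        x[:, :3] = (x[:, :3] > 0).astype(float)
    r = np.ones((n, 5), dtype=int)               # X4 and X5 fully observed
    for j in range(3):
        others = [k for k in range(3) if k != j]
        eta = phi * x[:, others].sum(axis=1) + rho * (x[:, 3] + x[:, 4])
        r[:, j] = rng.random(n) < 1.0 / (1.0 + np.exp(-eta))
    x_obs = np.where(r == 1, x, np.nan)          # mask unobserved values
    return x, r, x_obs

x, r, x_obs = simulate_dataset(phi=0.5, rho=1.0, seed=1)
```

Setting `phi=0` removes the dependence of each indicator on the partially observed variables, which corresponds to the MAR case in the simulations.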

For each simulated dataset in each setting, we obtained 10 multiple imputations for missing values in X1, X2, and X3 using a subset of the following methods:

  1. SRMI: usual chained equations assuming missing at random

  2. SRMI-MI: method SRMI + adjusting for missingness indicators as in Eq. 5 and Eq. 17

  3. SRMI-Interactions R: method SRMI-MI + adjusting for missingness indicator-covariate interactions as in Eq. 7 and Eq. 19

  4. SRMI-Interactions X: method SRMI-MI + adjusting for covariate-covariate interactions as in Eq. 6

  5. SRMI-TriCube: adjusting for missingness indicators and cubic splines for other covariates as in Eq. 18

  6. SRMI-Offset(Normal): method SRMI + estimated offset as in Eq. 21

  7. SRMI-Offset(Binary): method SRMI + estimated offset as in Eq. 9

  8. SRMI-Exact: imputing from “exact” distribution proportional to Eq. 2, using drawn missingness model parameters

The SRMI-Exact method imputes missing values from the correct conditional distribution after estimating missingness model parameters in the observed data. This method serves as a benchmark for the various (more easily implemented) approximations considered.
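For a binary X1, the idea of imputing from the correct conditional distribution can be illustrated with Bayes' rule. The sketch below assumes that missingness in X1 does not depend on X1 itself, that the other missingness indicators are conditionally independent given the covariates, and that each follows a logistic model. Under those assumptions, the conditional log-odds of X1 equal the log-odds given the other covariates plus a sum of log-ratios of missingness probabilities. All names and numbers are hypothetical, and the sketch is not claimed to reproduce Eq. 2 or Eq. 9 exactly.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def log_ratio_offset(r_k, lin_without_x1, phi_k1):
    """log[P(R_k = r_k | X1 = 1, rest) / P(R_k = r_k | X1 = 0, rest)].

    lin_without_x1 : linear predictor of the assumed logistic model for R_k
                     with X1 set to 0
    phi_k1         : assumed coefficient of X1 in that model
    """
    p1 = expit(lin_without_x1 + phi_k1)
    p0 = expit(lin_without_x1)
    if r_k == 1:
        return math.log(p1 / p0)
    return math.log((1.0 - p1) / (1.0 - p0))

def prob_x1_given_all(base_logit, offsets):
    """P(X1 = 1 | other X, R): base log-odds shifted by the summed offsets."""
    return expit(base_logit + sum(offsets))

# Hypothetical subject with X2 observed (R2 = 1) and X3 missing (R3 = 0).
offsets = [log_ratio_offset(1, 0.2, 0.75), log_ratio_offset(0, -0.1, 0.75)]
p_adjusted = prob_x1_given_all(base_logit=-0.4, offsets=offsets)
p_unadjusted = prob_x1_given_all(base_logit=-0.4, offsets=[])
```

When ϕ = 0 every offset is zero and the adjusted and unadjusted probabilities coincide, which matches the MAR special case.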

For scenarios with normally-distributed or binary X1, X2, and X3, we performed imputation using a subset of the above methods relevant for the corresponding covariate distributions as motivated by our derivations above. For the SRMI-Offset(Normal), SRMI-Offset(Binary), and SRMI-Exact methods, we assumed a logistic regression model structure for missingness in each variable, and we estimated or drew corresponding missingness model parameters using the most recently imputed data. Parameters in the missingness model can be estimated well, as demonstrated by simulation in Supplementary Figure A.1. For each simulation setting and imputation strategy combination, we obtained point estimates for (1) the mean of X1 and (2) regression coefficients from a model for X1|X2, X3, X4, X5 using the multiply imputed data and Rubin’s combining rules. We then calculated the average bias, empirical variance of the point estimates, and the coverage rate of 95% confidence intervals across the 500 simulated datasets. Results using a larger number of multiple imputations were similar.
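Rubin's combining rules referred to above can be sketched in a few lines. The function name and the example numbers are hypothetical; the formulas are the standard ones (pooled estimate, within- and between-imputation variance, total variance, and the classical degrees of freedom).

```python
import math
import statistics

def rubin_pool(estimates, variances):
    """Pool M point estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = sum(estimates) / m                 # pooled point estimate
    w = sum(variances) / m                     # within-imputation variance
    b = statistics.variance(estimates)         # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b                # total variance
    if b > 0:
        df = (m - 1) * (1.0 + w / ((1.0 + 1.0 / m) * b)) ** 2
    else:
        df = math.inf
    return q_bar, math.sqrt(t), df

# Hypothetical estimates of the mean of X1 from M = 5 imputed datasets.
est = [0.02, -0.01, 0.03, 0.00, 0.01]
var = [0.0005] * 5
pooled, pooled_se, pooled_df = rubin_pool(est, var)
```

The factor (1 + 1/M) inflates the between-imputation variance to account for using a finite number of imputations.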

Simulation Results

Figure 1 shows the bias in estimating the mean of X1 for different imputation methods. Shaded regions provide a visualization of the Monte Carlo standard error as discussed in Morris et al. (2019) (18). Under MAR (ϕ = 0), none of the methods gave substantial bias. For both normally-distributed and binary variables, SRMI produced substantial bias (e.g., absolute bias of 0.10 for normal X1) under MNAR (ϕ ≠ 0). In both normal and binary settings, all MNAR adjustment methods considered resulted in similar or reduced bias relative to SRMI (e.g., SRMI-MI resulted in up to 80% reduction in bias relative to SRMI for normal X1). The SRMI-MI method worked well to reduce bias from MNAR missingness when (1) MNAR missingness was weak or (2) missingness did not depend on the continuous variables (ρ = 0).
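The performance measures used in these comparisons can be computed from the per-dataset results roughly as follows. This is a simplified sketch with hypothetical names and toy inputs. It uses a normal quantile for the confidence intervals, whereas intervals based on Rubin's rules would normally use a t reference distribution.

```python
import math
import statistics

def performance(estimates, std_errors, truth, z=1.96):
    """Bias, its Monte Carlo SE, empirical variance, and CI coverage."""
    n_sim = len(estimates)
    bias = sum(estimates) / n_sim - truth
    emp_var = statistics.variance(estimates)    # empirical variance of estimates
    mcse_bias = math.sqrt(emp_var / n_sim)      # Monte Carlo SE of the bias
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if abs(e - truth) <= z * s)
    return {"bias": bias, "mcse_bias": mcse_bias,
            "emp_var": emp_var, "coverage": covered / n_sim}

# Toy input: four simulated datasets, true mean of X1 equal to 0.
perf = performance([0.1, -0.1, 0.2, 0.0], [0.15] * 4, truth=0.0)
```

A relative variance of the kind shown in Figure 2 would then be the ratio of `emp_var` for a given method to `emp_var` from the analysis of the full data.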

Figure 1.

Bias for mean of X1 across 500 simulated datasets after applying various imputation strategies.¹

(a) Normally-distributed X1, X2, and X3

(b) Binary X1, X2, and X3

¹ Results shown for M = 10 imputed datasets. Shaded regions correspond to Monte Carlo standard errors.

In the setting with very strong MNAR missingness or missingness dependent on continuous covariates, the SRMI-MI approximation resulted in large residual bias (e.g., bias of −0.07). For imputation of normally-distributed covariates, the SRMI-Exact method was the only approach that consistently produced good properties in terms of bias. Imputation models using more complicated functions of predictors (e.g., interactions, splines) often provided smaller bias relative to SRMI-MI but did not perform as well as imputation using the “exact” conditional distribution, particularly when ρ ≠ 0. For imputation of binary covariates, the offset approach generally performed well in terms of bias reduction, particularly when missingness model parameters were fixed to the simulation truth (not shown). Some small residual bias was seen for the offset method when missingness model parameters were estimated. Although not shown, complete case analysis resulted in very large bias in all simulation settings considered. Biases for regression model coefficients are presented in Supplementary Figure A.3 and show similar patterns.

Figure 2 shows the empirical variance of point estimates for the mean of X1, relative to analysis of the full data with no missingness. Under true MAR missingness, there is at most a small increase in the variability due to the extensions of the SRMI method relative to standard SRMI. Inclusion of additional interaction terms (between missingness indicators and covariates or between covariates themselves) in the imputation models resulted in larger empirical variances. In the setting with normally-distributed covariates, SRMI-Exact imputation resulted in larger empirical variance when the MNAR missingness was very strong. However, coverage rates (Supplementary Figure A.2) were similar to other methods, indicating that there may be a trade-off between wider confidence intervals and lower bias when applying these methods to account for MNAR missingness.

Figure 2.

Relative variance for estimated mean of X1 across 500 simulated datasets after applying various imputation strategies, relative to analysis of the full data with no missingness.¹

(a) Normally-distributed X1, X2, and X3

(b) Binary X1, X2, and X3

¹ Results shown for M = 10 imputed datasets.

Prevalence of genetic pathogenic variants in breast cancer patients

The methodological development in this paper was motivated by missing data challenges for the ICanCare study. This study consists of women aged 20 to 79 who were newly diagnosed with breast cancer between July 2013 and August 2015 and are part of the Surveillance, Epidemiology, and End Results (SEER) registries in Georgia and Los Angeles. SEER is a population-based registry that collects basic data on variables such as age, race, stage of disease, common breast cancer biomarkers and treatments. A subset of these women enrolled in the ICanCare study (19), in which they were surveyed about the care they received and many other factors. The ICanCare study broadly focused on treatment communication and decision-making in patients with favorable breast cancer. Women were also asked about whether they had a family history of breast cancer, and they provided other information related to their risk of being a carrier of genetic variants associated with breast cancer. The survey was completed by 5080 patients and linked to SEER data. In addition, genetic test results corresponding to pathogenic variants were available for some patients. An external company merged the survey responses and SEER clinical data with genetic testing information obtained from four laboratories that tested patients in the study regions and provided a de-identified dataset. More details regarding the combined datasets are provided elsewhere (20).

In this paper, we are interested in using data from the ICanCare study to better understand the prevalence of the pathogenic genetic variants in BRCA1 or BRCA2 among women diagnosed with breast cancer in the USA. Women with breast cancer are increasingly taking genetic tests to find out if they have pathogenic variants in important genes. This information can impact the treatments they receive and is relevant for the care of close relatives. The most well-known breast cancer genes are BRCA1 and BRCA2. The prevalence of pathogenic variants in BRCA1 and BRCA2 in the general population is quite low, estimated to be roughly in the 0.2% to 0.3% range (21). Estimates of prevalence of BRCA1/2 pathogenic variants among breast cancer patients vary from country to country (typically around 2% to 4%), but can exceed 20% among breast cancer patients with a positive familial history of breast cancer (22; 23; 24). Given the practical importance of these genetic variants to patient prognosis and treatment decision-making, there is a great need to better characterize the prevalence of these pathogenic variants in the population of women newly-diagnosed with breast cancer in the USA. Missing data, however, presents a challenge.

Genetic test results (including presence/absence of BRCA1/2 mutation) are not available for most patients in the ICanCare study. Amongst the 5080 women, 27.5% had genetic test results, and amongst those with genetic tests, 4.66% had a pathogenic variant in either BRCA1 or BRCA2. The current recommendation for genetic testing is based on patient age, personal or family history of cancer, known genetic mutation in the family, and tumor characteristics, although there is substantial variability in how closely these recommendations are followed (25). Even if genetic testing is offered, patient interest in undergoing genetic testing is influenced by factors such as age, race, education, and stage of disease (26). The sample of women who have genetic test results within the ICanCare study is very unlikely to be representative of all the women in the ICanCare study or of the population of all women in these two SEER registries, and we expect the estimated prevalence of mutation among women with observed genetic test results to be an over-estimate. More sophisticated strategies are, therefore, needed to address the missing data.

Our strategy for handling missingness in BRCA mutation status is to use other available data to multiply impute BRCA status for women with missing values. Then, we can estimate the prevalence of BRCA mutation in the ICanCare study using the multiply imputed data. For the purposes of this paper, we will consider a single variable indicating whether either BRCA1 or BRCA2 has a pathogenic variant. Some key variables that will help inform our imputation of BRCA mutation status are presence of familial history of BRCA mutation, Jewish ancestry, and familial history of breast cancer. Age, race, presence of ER/PR/HER2 mutations, tumor grade, clinical T-stage, presence of lymph node invasion, and presence of bilateral disease may also be informative. Most of these variables had low missingness rates (0%-5%), but HER2 status, family history of cancer, and known familial BRCA mutation had higher missingness rates (10%-21%). Summary statistics for these variables along with their missingness rates are given in Supplementary Table C.1.

Standard multiple imputation methods require us to assume that missingness in BRCA mutation status is independent of unobserved information given the observed data. However, this may not be the case. In particular, presence of familial history of mutation, familial history of breast cancer, and other variables may strongly influence whether or not a woman undergoes genetic testing. Since these variables are themselves subject to missingness, the MAR assumption is likely to be violated. We see evidence of this dependence in the data. Logistic regression modeling of whether a woman had an available BRCA1/2 test result using data for the 2863 patients with complete covariate information showed an association between missingness and age, race, familial history of breast or ovarian cancer or sarcoma, familial history of BRCA1/2 mutations, Jewish ancestry, HER2 status, and geographic location. The odds ratios for this logistic regression model are presented in Supplementary Table C.2. Since these variables are related to missingness in BRCA mutation status and are also occasionally or even often missing themselves, missingness in BRCA status is likely to be MNAR.

This MNAR mechanism has the potential to induce bias in resulting estimates of BRCA mutation rates, since these variables are also related to whether or not the BRCA mutation was present. In particular, we fit a logistic regression model to the 874 patients who received a BRCA1/2 test and had complete information for the clinical and demographic factors listed above. In this logistic regression, we used the Firth correction to avoid quasi-separation due to the rare outcome. The following variables were clearly associated with having a BRCA1/2 pathogenic variant: age, relatives with history of breast or ovarian cancer or sarcoma, relatives with known BRCA1/2 mutations, and Jewish ancestry (borderline). The odds ratios for this logistic regression model are presented in the first column of Supplementary Table C.3. Many of the same variables that are associated with receiving a test are also associated with the positivity rate of the test. Since these variables also have missingness themselves, there is a need to carefully guard against bias due to the MNAR missingness in the multiple imputation process.

We performed sequential regression multiple imputation of the missing data with the mice package in R, applying several of the methods explored in this paper. There were four variables with missingness exceeding 10%: BRCA1/2 test results, family history, known pathogenic variant, and HER2. For these four variables, we created response indicators Rj and offset variables Zj from Eq. 9, j = 1, …, 4. Multiple imputations were generated using the following three methods: (1) standard SRMI, (2) SRMI-MI, and (3) SRMI-Offset. When using the SRMI-MI method, we imputed each variable j in the dataset conditional on all other variables and all of the above R(−j). When using the SRMI-Offset method, we imputed BRCA1/2 test, family history, known pathogenic variant, and HER2 status conditional on all other variables and all of the above Z(−j). The rest of the variables were imputed conditional on all other variables and Rj.
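The predictor structure described in this paragraph can be made concrete with a small sketch that lists, for each imputed variable, which columns enter its imputation model under each method. The variable names and the `R_`/`Z_` prefixes are hypothetical, and the sketch only builds the predictor lists. It does not fit any imputation model or call mice.

```python
def predictor_sets(variables, high_missing, method):
    """Map each imputed variable to the names of its imputation-model predictors.

    variables    : all variable names in the dataset
    high_missing : variables given response indicators R_<name> and offsets Z_<name>
    method       : "SRMI", "SRMI-MI", or "SRMI-Offset"
    """
    sets = {}
    for target in variables:
        preds = [v for v in variables if v != target]
        if method == "SRMI-MI":
            # Indicators for every high-missingness variable except the target.
            preds += ["R_" + v for v in high_missing if v != target]
        elif method == "SRMI-Offset":
            # High-missingness targets use offsets; the rest use the indicators.
            prefix = "Z_" if target in high_missing else "R_"
            preds += [prefix + v for v in high_missing if v != target]
        sets[target] = preds
    return sets

variables = ["brca", "family_history", "known_variant", "her2", "age", "grade"]
high_missing = ["brca", "family_history", "known_variant", "her2"]
sets_srmi = predictor_sets(variables, high_missing, "SRMI")
sets_mi = predictor_sets(variables, high_missing, "SRMI-MI")
sets_offset = predictor_sets(variables, high_missing, "SRMI-Offset")
```

In an actual mice run, lists like these would be translated into rows of the predictor matrix, with the indicator and offset columns added to the data but never imputed themselves.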

Logistic regression models were used for imputing binary variables, and multinomial logistic regression was used for imputing variables with more than 2 categories. We treated clinical stage and tumor grade as categorical. For binary variables with low prevalence, imputation using the ‘logreg’ option in mice is unstable and can produce bias in downstream prevalence estimates, as shown by simulation in the Supplementary Materials. Although the main goal of this analysis is to address potential MNAR missingness, this secondary problem posed a challenge for implementation of the methods proposed in this paper. In Supplementary Section B, we describe a modified strategy for drawing parameters for any imputation model structure that has better performance in the setting of rare binary outcomes. Imputation then proceeds using the proposed methods as described previously. We applied this method to impute BRCA1/2 status. For each imputation method, we obtained 10 multiple imputations based on sequential regression algorithms that were run for 50 iterations. The marginal prevalence of BRCA1/2 mutation was then estimated, along with corresponding standard errors.

Table 1 shows the estimated prevalence of BRCA1/2 pathogenic variants from the complete cases and from the different multiple imputation methods. As expected, the multiple imputation methods give lower estimates than the complete case analysis. The extensions of SRMI that make use of the missing data indicators give slightly lower estimates than those obtained from SRMI. Since we believe missingness is MNAR, we would trust results from the SRMI-MI and SRMI-Offset methods over the estimates from SRMI. In Supplementary Table C.3, we also present the estimated associations between having a BRCA1/2 pathogenic variant and the various risk factors for each of the imputation strategies. The results are broadly similar to those from the complete case analysis, but there are some differences. Notably, the associations for tumor grade and clinical T-stage are larger than in complete case analysis, and the associations for Jewish ancestry are smaller. As expected, the 95% confidence intervals for the odds ratios from the multiply imputed datasets tend to be narrower than those seen in complete case analysis.

Table 1.

Estimated prevalence of BRCA1/2 pathogenic variants

Method            Estimate (× 100)    Standard Error (× 100)
Complete cases    4.66                0.56
SRMI              2.82                0.47
SRMI-MI           2.77                0.39
SRMI-Offset       2.65                0.34
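As a quick arithmetic check on the complete-case row, the sketch below recomputes a simple binomial standard error from figures quoted in the text (5080 women, 27.5% tested, 4.66% positive among those tested). The tested count is approximated from the rounded percentage, and the binomial formula is an assumption here, since the manuscript does not state how the complete-case standard error was obtained.

```python
import math

n_total = 5080
n_tested = round(0.275 * n_total)       # approximate number of women with a test result
p_cc = 0.0466                           # complete-case prevalence

# Simple binomial standard error, ignoring any survey design features.
se_cc = math.sqrt(p_cc * (1.0 - p_cc) / n_tested)
se_cc_x100 = round(100 * se_cc, 2)

# Relative drop from the complete-case estimate to the SRMI-Offset estimate.
relative_drop = (4.66 - 2.65) / 4.66
```

Under these assumptions the computed standard error agrees with the tabulated value of 0.56 (× 100), and the SRMI-Offset estimate is a little over 40% lower than the complete-case estimate.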

Discussion

Standard software for implementing sequential regression multiple imputation (SRMI) assumes that missingness does not depend on unobserved information, an assumption called missing at random (MAR). Several researchers have proposed adaptations of existing sequential multiple imputation procedures in settings where missingness is not at random (MNAR) (9; 27). For example, Tompsett et al. (2018) propose handling MNAR missingness by including missing data indicators as predictors in the sequential imputation models. In terms of rigorous statistical justification, however, little work has been done to provide guidance for handling of MNAR missingness within chained equations imputation algorithms in general.

In this paper, we provide statistical justification for the missing data indicators method of Tompsett et al. (2018) and propose several extensions that can result in improved performance in terms of bias in the final data analysis. We approach this problem by first deriving the ideal imputation distribution as a function of observed data and assumed models for data missingness, viewing SRMI as an approximation to Bayesian MCMC estimation. Using Taylor series approximations and other methods, we obtain regression model approximations to the ideal imputation distribution to use in practice. We focus our attention on a particular MNAR setting, where missingness for a given variable may depend on other variables with missingness. The methods in this paper are not intended to apply to MNAR situations where there is good reason to believe the probability of missingness for a variable depends on the missing value of that variable. SRMI-MI is likely to give biased estimates in that setting, and the magnitude of the bias will depend on the strength of the associations in the missingness model and the strength of the associations between all the variables. It is plausible that SRMI-MI would work better than SRMI in many but not all such situations, and evaluation of the relative performance in the setting where missingness depends on the missing variable itself is beyond the scope of the current paper. We refer the reader to Beesley and Taylor (2021) for recent work addressing missingness in a given variable based on its own missing values (14).

Through simulation, we found that inclusion of missingness indicators within sequential imputation algorithms (here, called SRMI-MI) can result in reduced bias in estimating outcome model parameters when missingness is MNAR following Assumptions 1-2. The degree of bias reduction will likely depend on the strength of the MNAR missingness and the structure of the missingness model. Although not explored here, inclusion of extra parameters in the imputation models could increase the risk of overfitting and may require larger datasets in order to see good bias reduction properties. In our simulations (datasets of size n=2000), we did not see an increase in bias or substantial increases in variance when SRMI-MI was applied instead of SRMI when missingness was truly MAR.

In some settings, SRMI-MI produced substantial residual bias. We proposed a variety of extensions to the SRMI-MI approach, including use of spline functions of model predictors, inclusion of interactions, and use of fixed offsets calculated as a function of estimated missingness model parameters. In general, approaches including additional interaction terms tended to result in increased standard errors with some benefit in terms of bias reduction. Of all the regression model approximations, the approach using missingness model-based offsets had the best properties on average across the many simulation settings considered. This may be because this approach is making use of more information from the data, since it involves assuming (and fitting) a model for the probability of missingness for each variable. Since we assume missingness in a given variable is independent of its own missing values, parameters in this missingness model may be identified using the observed data. However, this approach may be more sensitive to misspecification of the missingness model.

For comparison, we evaluated the performance of the various SRMI adaptations against imputation using the “exact” imputation model in Eq. 2. This distribution may only be known up to proportionality, and imputation using this distribution may be complicated in general. In our simulations, this approach (SRMI-Exact) resulted in little or no bias in estimating outcome model parameters.

With the exception of the SRMI-Exact method, we tried to restrict our focus to methods that are easily implemented within established sequential imputation software. The methods using the offset do require some non-trivial adaptations of the standard SRMI routine (including fitting of models for covariate missingness within the iterative imputation algorithm), and we provide example code guiding implementation with package mice in R.

The methods were applied to address potential MNAR missingness in data from the ICanCare study, which consists of a probability-sampled cohort of breast cancer patients identified from two SEER registries (19). Sampling weights and sampling design information for the ICanCare study could be incorporated into imputation and analysis in order to generalize results to the entire SEER registry (e.g., 28; 29; 30). In a naive exploration into selection bias adjustment, we performed analysis weighted by the provided sampling weights but ignoring these weights during imputation. The estimated weighted prevalence of a pathogenic variant of BRCA1/2 obtained after SRMI-MI imputation was 2.55%. The corresponding unweighted estimate was 2.65%. Future efforts will implement more sophisticated strategies in the survey literature to simultaneously account for selection bias and MNAR covariate missingness when obtaining multiple imputations for these data.
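The naive weighted analysis described above amounts to computing a weighted prevalence within each imputed dataset and averaging across imputations. A minimal sketch with hypothetical names and toy data follows. It does not reproduce the study's weights or estimates, and it omits variance estimation, which would need to reflect the sampling design.

```python
def weighted_prevalence(status, weights):
    """Weighted proportion of subjects with status equal to 1."""
    return sum(w * y for y, w in zip(status, weights)) / sum(weights)

def pooled_weighted_prevalence(imputed_statuses, weights):
    """Average the weighted prevalence over the imputed datasets."""
    per_imputation = [weighted_prevalence(s, weights) for s in imputed_statuses]
    return sum(per_imputation) / len(per_imputation)

# Toy example: four subjects, two imputed versions of a binary status.
weights = [2.0, 1.0, 1.0, 1.0]
imputations = [[1, 0, 0, 0], [0, 0, 0, 1]]
pooled = pooled_weighted_prevalence(imputations, weights)
unweighted = pooled_weighted_prevalence(imputations, [1.0] * 4)
```

Because the weights enter only at the analysis stage, this sketch mirrors the "weights ignored during imputation" simplification noted in the text.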

Acknowledgements

Lauren Beesley and Irina Bondarenko are co-first authors of this paper. This research was partially supported by National Institutes of Health grants CA225697 and CA129102.

References

  • [1]. Rubin DB. Multiple Imputation for Nonresponse in Surveys. 1st ed. New York, NY: John Wiley and Sons, Inc, 1987.
  • [2]. Little RJA and Rubin DB. Statistical analysis with missing data. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc, 2002.
  • [3]. White IR and Royston P. Multiple Imputation Using Chained Equations: Issues and Guidance for Practice. Statistics in Medicine 2011; 30(4): 377–399.
  • [4]. Carpenter JR and Kenward MG. Multiple Imputation and its Application. 1st ed. Hoboken, NJ: John Wiley and Sons, Inc, 2013.
  • [5]. Molenberghs G, Beunckens C, Sotto C et al. Every Missingness Not at Random Model Has a Missingness at Random Counterpart with Equal Fit. Journal of the Royal Statistical Society Series B 2008; 70(2): 371–388.
  • [6]. Raghunathan TE. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 2001; 27(1): 85–95.
  • [7]. Van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM et al. Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation 2006; 76(12): 1049–1064.
  • [8]. Van Buuren S. Flexible Imputation of Missing Data. 2nd ed. New York, NY: CRC Press, 2018. ISBN 9780429492259.
  • [9]. Tompsett DM and White IR. On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice. Statistics in Medicine 2018; 37(15): 2338–2353. DOI: 10.1002/sim.7643.
  • [10]. Mercaldo SF and Blume JD. Missing data and prediction: the pattern submodel. Biostatistics 2020; 21(2): 236–252. DOI: 10.1093/biostatistics/kxy040.
  • [11]. Bartlett JW, Seaman SR, White IR et al. Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Stat Methods Med Res 2015; 24(4): 462–487. DOI: 10.1177/0962280214521348. URL https://www.ncbi.nlm.nih.gov/pubmed/24525487.
  • [12]. Beesley LJ, Bartlett JW, Wolf GT et al. Multiple imputation of missing covariates for the Cox proportional hazards cure model. Statistics in Medicine 2016; 35(26): 4701–4717.
  • [13]. Meng XL. Multiple-imputation inferences with uncongenial sources of input. Statistical Science 1994; 9(4): 538–573.
  • [14]. Beesley LJ and Taylor JMG. Accounting for not-at-random missingness through imputation stacking. arXiv 2021; 1–16.
  • [15]. Morris TP, White IR and Royston P. Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology 2014; 14(75): 1–13.
  • [16]. Schenker N and Taylor JMG. Partially parametric techniques for multiple imputation. Computational Statistics and Data Analysis 1996; 22(4): 425–446.
  • [17]. Shah AD, Bartlett JW, Carpenter J et al. Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study. American Journal of Epidemiology 2014; 179(7): 764–774. DOI: 10.1093/aje/kwt312.
  • [18]. Morris TP, White IR and Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine 2019; 38(11): 2074–2102.
  • [19]. Katz SJ, Hawley ST, Bondarenko I et al. Oncologists influence on receipt of adjuvant chemotherapy: does it matter whom you see for treatment of curable breast cancer? Breast Cancer Research and Treatment 2017; 165(3): 751–756. DOI: 10.1007/s10549-017-4377-3.
  • [20]. Kurian AW, Ward KC, Hamilton AS et al. Uptake, Results, and Outcomes of Germline Multiple-Gene Sequencing After Diagnosis of Breast Cancer. JAMA Oncology 2018; 4(8): 1066–1072. DOI: 10.1200/JCO.18.01854.
  • [21]. Lippi G, Mattiuzzi C and Montagnana M. BRCA population screening for predicting breast cancer: for or against? Ann Transl Med 2017; 5(13): 275.
  • [22]. Armstrong N, Ryder S, Forbes C et al. A systematic review of the international prevalence of BRCA mutation in breast cancer. Clin Epidemiol 2019; 11(7): 543–561.
  • [23]. Hu C, Hart SN, Gnanaolivu R et al. A Population-Based Study of Genes Previously Implicated in Breast Cancer. New Engl J Med 2021; 384(5): 440–451.
  • [24]. Breast Cancer Association Consortium; Dorling L, Allen J, González-Neira A et al. Breast Cancer Risk Genes - Association Analysis in More than 113,000 Women. New Engl J Med 2021; 384(5): 428–439.
  • [25]. Katz SJ, Bondarenko I, Ward KC et al. Association of Attending Surgeon With Variation in the Receipt of Genetic Testing After Diagnosis of Breast Cancer. JAMA Surg 2018; 153(10): 909–916.
  • [26]. Owens DK, Davidson KW, Krist AH et al. Risk assessment, genetic counseling, and genetic testing for BRCA-related cancer: US Preventive Services Task Force recommendation statement. JAMA 2019; 322(7): 652–665.
  • [27]. Jolani S. Dual Imputation Strategies for Analyzing Incomplete Data. PhD Thesis, Utrecht University, 2012.
  • [28]. Zhou H, Elliott MR and Raghunathan TE. A two-step semiparametric method to accommodate sampling weights in multiple imputation. Biometrics 2016; 72: 242–252.
  • [29]. Andridge RR and Little RJ. The use of sample weights in hot deck imputation. Journal of Official Statistics 2009; 25(1): 21–36.
  • [30]. Reiter JP, Raghunathan TE and Kinney SK. The importance of modeling the sampling design in multiple imputation for missing data. Survey Methodology 2006; 32(2): 143.
