Author manuscript; available in PMC: 2021 May 13.
Published in final edited form as: Stat Sin. 2018 Oct;28(4):2069–2088. doi: 10.5705/ss.202016.0325

Discrete Choice Models for Nonmonotone Nonignorable Missing Data: Identification and Inference

Eric J Tchetgen Tchetgen 1, Linbo Wang 1, BaoLuo Sun 1
PMCID: PMC8118571  NIHMSID: NIHMS1001884  PMID: 33994754

Abstract

Nonmonotone missing data arise routinely in empirical studies of social and health sciences, and when ignored, can induce selection bias and loss of efficiency. In practice, it is common to account for nonresponse under a missing-at-random assumption which although convenient, is rarely appropriate when nonresponse is nonmonotone. Likelihood and Bayesian missing data methodologies often require specification of a parametric model for the full data law, thus a priori ruling out any prospect for semiparametric inference. In this paper, we propose an all-purpose approach which delivers semiparametric inferences when missing data are nonmonotone and not at random. The approach is based on a discrete choice model (DCM) as a means to generate a large class of nonmonotone nonresponse mechanisms that are nonignorable. Sufficient conditions for nonparametric identification are given, and a general framework for fully parametric and semiparametric inference under an arbitrary DCM is proposed. Special consideration is given to the case of logit discrete choice nonresponse model (LDCM) for which we describe generalizations of inverse-probability weighting, pattern-mixture estimation, doubly robust estimation and multiply robust estimation.

Keywords: missing not at random, nonmonotone missing data, pattern mixture, doubly robust, inverse-probability-weighting

1. Introduction

Missing data are a common occurrence in empirical research in the health and social sciences, and will often affect one’s ability to draw reliable inferences, whether from an experimental or nonexperimental study. Nonresponse can occur in sample surveys, due to dropout or non-compliance in clinical trials, or due to data excision by error or in order to protect confidentiality. In many practical situations, nonresponse is nonmonotone, that is, there may be no nested pattern of missingness such that observing variable $X_k$ implies that variable $X_j$ is also observed, for any j < k. Nonmonotone missing data patterns may occur, for instance, when individuals who dropped out of a longitudinal study re-enter at later time points; likewise, in regression analysis nonmonotone nonresponse may occur if the outcome or any of the regressors may be unobserved for a subset of the sample in an arbitrary pattern. Missing data are said to be completely-at-random (MCAR) if the nonresponse process is independent of both observed and unobserved variables in the full data, and missing-at-random (MAR) if, conditional on observed variables under a nonresponse pattern, the probability of observing the pattern does not depend on unobserved variables under the pattern (Rubin 1976; Little and Rubin 2002; Robins et al. 1994). A nonresponse process which is neither MCAR nor MAR is said to be missing-not-at-random (MNAR).

While complete-case analysis is perhaps the most widely-used method to handle missing data in practice, the approach is generally not recommended as it can give biased inferences when nonresponse is not MCAR. Formal methods to appropriately account for incomplete data include fully parametric likelihood and Bayesian approaches (Little and Rubin 2002; Horton and Laird 1999; Ibrahim and Chen 2000; Ibrahim et al. 2002, 2005) which are most commonly implemented under MAR using the EM algorithm or via multiple imputation (MI) (Dempster et al, 1977, Rubin 1977; Schafer 1997). Inverse probability weighting (IPW) is another approach to account for selection bias due to missing data (Horvitz and Thompson 1952; Robins et al. 1994; Tsiatis 2006). While IPW estimation avoids specification of a full-data likelihood, the approach does require a model for the nonresponse process. However, the development of general coherent models for nonmonotone nonresponse has proved to be particularly challenging, even under the MAR assumption; see Robins and Gill (1997) and Sun and Tchetgen Tchetgen (2016) for two concrete proposals and further discussion.

Despite recent progress in development of MAR methodology, as argued by Gill and Robins (1997), Robins (1997) and Little and Rubin (2002), the assumption is generally hard to justify on substantive grounds when nonresponse is nonmonotone. Instead, allowing for MNAR data seems particularly befitting in the context of nonmonotone nonresponse and has received substantial attention, particularly in the context of fully parametric models (Deltour et al. (1999), Albert (2000), Ibrahim et al. (2001), Fairclough et al. (1998), Troxel et al (1998), Troxel, Lipsitz & Harrington (1998)). MNAR approaches which do not necessarily rely on parametric assumptions have also been developed in recent years. Notable examples include the group permutation model (GPM) of Robins (1997) and the block conditional MAR (BCMAR) model of Zhou et al (2010). Both approaches allow for non-ignorable missing data in the sense that the nonresponse process of a given variable may depend on values of other missing variables. However, neither BCMAR nor GPM allows the missingness probability of a given variable to depend on the value of the variable. Based on subject matter considerations, it is often desirable to consider non-ignorable processes where the missingness probability of a variable depends on the possibly unobserved value of the variable, therefore, methods for non-ignorable missing data mechanisms beyond BCMAR and GPM are of interest.

In this paper, we propose a large class of non-ignorable nonmonotone nonresponse models, which unlike BCMAR and GPM, do not a priori rule out the possibility that the probability of observing a given variable may depend on the unobserved value of the variable. Our approach is based on so-called discrete choice models (DCM). DCMs were first introduced and are predominantly used in economics and other social sciences, as a principled approach for generating a large class of multinomial models to describe discrete choice decision making under rational utility maximization. In this paper, DCMs are used for a somewhat different purpose, as a means to generate a large class of nonmonotone nonresponse mechanisms which are nonignorable. Sufficient conditions for nonparametric identification are given, and a general framework for semiparametric inference under an arbitrary DCM is proposed. Special consideration is given to the case of logit discrete choice nonresponse model (LDCM). Interestingly, our identification condition in the case of the LDCM, states that the conditional distribution of unobserved variables given observed variables for any nonresponse pattern, matches the corresponding conditional distribution in complete-cases. This latter assumption is equivalent to the well-known complete-case missing value (CCMV) restriction in the pattern mixture (PM) literature which has previously been developed for fully likelihood-based inference (Little,1993). Therefore, our approach provides a comprehensive treatment of semiparametric inference for MNAR nonresponse under Little’s CCMV restriction. Specifically, in addition to reviewing Little’s (i) PM likelihood approach, we describe a generalization of (ii) inverse-probability weighting (IPW), and (iii) both doubly robust (DR) and multiply robust (MR) estimation, which are the nonmonotone MNAR analogues of existing results for monotone MAR nonresponse (Tsiatis, 2006). 
Our doubly robust estimators combine models (i) and (ii) but only require one of the two models to be correct. In fact, we establish that whenever J nonresponse patterns are observed, the proposed LDCM DR estimators can be made multiply robust (more precisely $2^J$-robust) in the sense that for each nonresponse pattern, valid inferences can be obtained if one of two pattern-specific models is correctly specified but not necessarily both. As far as we know, our paper represents the first instance of a doubly ($2^J$-) robust estimator obtained for a general nonmonotone nonignorable missing data model that is just-identified from the observed data alone. We emphasize that our proposed inferences under the LDCM are quite attractive as a generic nonignorable approach for arbitrary nonmonotone patterns, mainly because they are relatively easy to implement, have good robustness properties, and appear to have good finite sample performance, as we illustrate via simulation studies and an HIV data application. In closing, we briefly consider IPW inference for DCMs outside of the LDCM, which can generally be used to account for nonmonotone nonignorable missing data even when Little’s CCMV condition fails and therefore the LDCM may not be appropriate.

2. Notation and definitions

Suppose full data consist of n i.i.d. realizations of a random K-vector L = (L1, …, LK)′. Let R denote the scalar random variable encoding missing data patterns, and J denote the total number of observed patterns. For missing data pattern R = r, where $1 \le r \le J \le 2^K$, we use $L_{(r)}$ and $L_{(-r)}$ to denote observed and unobserved components of L, respectively, so that $L = (L_{(r)}, L_{(-r)})$. We reserve r = 1 to denote complete cases. Throughout, denote $\Pr(R = r \mid L) = \pi_r(L) = \Pi_r$ for all r. For each realization, we observe $(R, L_{(R)})$. For instance, suppose the full data L is a bivariate binary vector (L1, L2) and the following J = 3 nonmonotone nonresponse patterns are observed in the sample: R = 1, $L_{(1)} = L$; R = 2, $L_{(2)} = L_1$; and R = 3, $L_{(3)} = L_2$.

Throughout, we also make the following positivity assumption,

$$\Pi_1 > \sigma > 0 \quad \text{a.s.}, \tag{1}$$

for a fixed positive constant σ, that is, the probability of being a complete-case is bounded away from zero almost surely. Assumption (1) will be needed for nonparametric identification of the full data distribution, and its smooth functionals as well as finite asymptotic variance of IPW estimators (Robins et al, 1999). As further discussed in Section 2.3, complete-case IPW relies on obtaining a consistent estimator of $\pi_1(L) = 1 - \sum_{r \neq 1} \pi_r(L)$, which in turn requires estimating the nonresponse process $\{\pi_r(L) : r\}$. The nonresponse process clearly fails to be nonparametrically identified under assumption (1) only. In the next section, we describe a set of sufficient conditions to identify a model for the complete-case probability π1(L) under the discrete choice framework when missingness is nonmonotone and not at random.

Our first result provides a generic nonparametric representation of the joint law of f (R, L) that will be used throughout. The result adapts the generalized odds ratio parametrization of a joint distribution due to Chen (2010) to the missing data context; see also Tchetgen Tchetgen et al (2010). Let $\mathrm{Odds}_r(L) = \pi_r(L)/\pi_1(L)$. We have the following result.

Lemma 1 We have that

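$$f(R, L) = \frac{\prod_{r \neq 1} \mathrm{Odds}_r(L)^{I(R = r)}\, f(L \mid R = 1)}{\iint \prod_{r \neq 1} \mathrm{Odds}_r(l)^{I(r^* = r)}\, f(l \mid R = 1)\, d\mu(r^*, l)},$$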

provided $\iint \prod_{r \neq 1} \mathrm{Odds}_r(l)^{I(r^* = r)}\, f(l \mid R = 1)\, d\mu(r^*, l) < \infty$, with µ a dominating measure of the CDF of (R, L).

Lemma 1 clarifies what the identification task entails, because under assumption (1), f (L|R = 1) is just-identified, and therefore f (R, L) is nonparametrically just-identified only if one can just-identify Oddsr (L) for all r. Below we describe a sufficient condition for identification under the discrete choice model of the nonresponse process.

3. Identification

3.1. The discrete choice nonresponse model

The DCM associates with each realized nonresponse pattern $r = 1, \ldots, J \le 2^K$ an underlying utility function $U_r = \mu_r(L) + \varepsilon_r$, where $\{\varepsilon_r : r\}$ are i.i.d. with cumulative distribution function $F_\varepsilon$, and $\mu_r(L)$ encodes the dependence of a person’s utility on L (McFadden, 1984; Train, 2009). Some common choices of $F_\varepsilon$ include the extreme value distribution (further discussed below) and the normal distribution, although in principle any CDF could be specified. It is then assumed that a person’s observed response pattern maximizes her utility, that is, $R = \arg\max_r \{U_r : r\}$. Together, these assumptions imply that for each r,

$$\Pi_r = \pi_r(L) = \Pr(R = r \mid L) = \int \prod_{s \neq r} F_\varepsilon\{\Delta\mu_{rs}(L) + \varepsilon\}\, dF_\varepsilon(\varepsilon), \tag{2}$$

where $\Delta\mu_{rs}(L) = \mu_r(L) - \mu_s(L)$ captures the dependence on L of a difference in utility in comparing a person’s choice between nonresponse patterns r and s, see Train (2009). The integral in (2) is generally not available in closed form for most choices of $F_\varepsilon$ (with the notable exception of the extreme value distribution, see Section 2.2), but can easily be evaluated by numerical integration using, say, Gaussian quadrature. Two interesting observations about equation (2) are worth noting. Although not immediately apparent from the expression in the display, equation (2) gives rise to a proper probability mass function, that is, $\sum_r \pi_r(l) = 1$ for all values of l and for any choice of $F_\varepsilon$. This remarkable result is a direct consequence of utility maximization as a formal principle for generating multinomial probabilities $\{\pi_r : r\}$. A second interesting observation is that only differences in utility matter in determining the choice probabilities; in other words, the absolute level of a person’s utility for a given nonresponse pattern is irrelevant and only relative utility drives the choice of a nonresponse pattern over another. Clearly, model (2) is not identifiable without an additional assumption, even given knowledge of $F_\varepsilon$.
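The two observations about equation (2) can be checked numerically. The sketch below is illustrative only: it evaluates the integral in (2) with a plain trapezoid rule (used here in place of Gaussian quadrature, for simplicity) at a fixed value of L, for hypothetical utilities `mu`. The function names, the integration grid, and the utility values are assumptions made for this example. For the extreme value (Gumbel) distribution, the result is compared with the standard closed form $e^{\mu_r}/\sum_s e^{\mu_s}$.

```python
import math

def gumbel_cdf(e):
    return math.exp(-math.exp(-e))

def gumbel_pdf(e):
    return math.exp(-e - math.exp(-e))

def normal_cdf(e):
    return 0.5 * (1.0 + math.erf(e / math.sqrt(2.0)))

def normal_pdf(e):
    return math.exp(-0.5 * e * e) / math.sqrt(2.0 * math.pi)

def choice_probs(mu, cdf, pdf, lo=-12.0, hi=40.0, step=0.01):
    """Pi_r = int prod_{s != r} F(mu_r - mu_s + e) dF(e), by the trapezoid rule.

    mu holds the utilities mu_r(L) at one fixed value of L (hypothetical inputs).
    """
    n = int(round((hi - lo) / step))
    probs = []
    for r, mu_r in enumerate(mu):
        total = 0.0
        for i in range(n + 1):
            e = lo + i * step
            val = pdf(e)
            for s, mu_s in enumerate(mu):
                if s != r:
                    val *= cdf(mu_r - mu_s + e)
            total += val * (0.5 if i in (0, n) else 1.0)
        probs.append(total * step)
    return probs

def softmax(mu):
    """Closed-form choice probabilities under the extreme value distribution."""
    m = max(mu)
    w = [math.exp(v - m) for v in mu]
    return [v / sum(w) for v in w]

if __name__ == "__main__":
    mu = [0.0, 0.5, -1.0]  # hypothetical utilities for J = 3 patterns
    print(choice_probs(mu, gumbel_cdf, gumbel_pdf), softmax(mu))
    print(sum(choice_probs(mu, normal_cdf, normal_pdf)))
```

Adding the same constant to every entry of `mu` leaves the output unchanged, which is the "only differences in utility matter" property noted above.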

For the purpose of identification, we will consider the assumption that the relative utility ∆μ1r (L) of any nonresponse pattern r ≠ 1 compared with that of complete-case pattern r = 1, only depends on data observed under both patterns, that is

$$\Delta\mu_{1r}(L) = \Delta\mu_{1r}(L_{(r)}) \quad \text{for all } r, \text{ almost surely}. \tag{3}$$

The assumption essentially states that when faced with the choice between nonresponse pattern r ≠ 1 versus providing complete data, the excess utility a subject would experience choosing one over the other only depends on data observed under both choices. Under the assumption, one may write

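$$\Pi_r = \int \prod_{s \neq r} F_\varepsilon\{\Delta\mu_{1s}(L_{(s)}) - \Delta\mu_{1r}(L_{(r)}) + \varepsilon\}\, dF_\varepsilon(\varepsilon). \tag{4}$$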

Note that, even under assumption (3), Πr generally depends on unobserved variables for all r, and therefore, data are missing not at random, and the corresponding observed data likelihood is nonignorable. Nevertheless, as we show in Section 5, given any continuous Fε, equation (4) is nonparametrically identified for each r provided (1) holds. We leave the detailed discussion of inference under user-specified Fε to Section 5, instead, to fix ideas, we further discuss identification and inference under the logit DCM.

3.2. The logit discrete choice model

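In the special case where $F_\varepsilon$ is the extreme value distribution, the integral in equation (2) is available in closed form, and gives the following logit DCM (Train, 2009): $\pi_r(L) = \mathrm{Odds}_r(L)/\{1 + \sum_{s \neq 1} \mathrm{Odds}_s(L)\}$, where $\mathrm{Odds}_r(L) = \exp\{-\Delta\mu_{1r}(L)\}$ for all r. Under (3), $\mathrm{Odds}_r(L) = \mathrm{Odds}_r(L_{(r)})$, and therefore

$$\Pi_r = \frac{\mathrm{Odds}_r(L_{(r)})}{1 + \sum_{s \neq 1} \mathrm{Odds}_s(L_{(s)})}, \quad \text{for all } r \neq 1. \tag{5}$$

In order to illustrate (5), briefly consider an example with L = (L1, L2, L3). Suppose that there are 4 nonresponse patterns, $L_{(1)} = L$, $L_{(2)} = (L_1, L_2)$, $L_{(3)} = L_3$, $L_{(4)} = \emptyset$. Then, by (3), $\mathrm{Odds}_2(L) = \mathrm{Odds}_2(L_{(2)})$; $\mathrm{Odds}_3(L) = \mathrm{Odds}_3(L_{(3)})$; and $\mathrm{Odds}_4(L) = \mathrm{Odds}_4(L_{(4)}) = \mathrm{Odds}_4$ is a constant. Furthermore, according to (5), $\Pi_2 = \mathrm{Odds}_2(L_{(2)})/c(L)$; $\Pi_3 = \mathrm{Odds}_3(L_{(3)})/c(L)$; $\Pi_4 = \mathrm{Odds}_4/c(L)$, where $c(L) = 1 + \sum_{s \neq 1} \mathrm{Odds}_s(L_{(s)})$. Therefore, by virtue of c(L), the nonresponse probabilities $\Pi_j$, j = 2, 3, 4, are each a function of $\tilde{L} = \cup_{j=2,3,4} L_{(j)}$, the union set of observed variables across all the nonresponse patterns. Since the variable set $\tilde{L} \setminus L_{(j)}$ is not observed for each of the missing data patterns j = 2, 3, 4, the nonresponse process is clearly MNAR. In particular, $\Pi_4$ is a function of $\tilde{L}$ even though no variable is observed in the fourth missing data pattern.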
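The three-variable example can be made concrete with a few lines of code. This is a minimal sketch: the log-linear odds functions and their coefficients are hypothetical and chosen only for illustration. It shows that the odds of pattern 4 relative to complete cases are constant, while the probability of pattern 4 still varies with L through c(L).

```python
import math

def ldcm_probs(odds):
    """Map {r: Odds_r(L_(r))} for r != 1 to the LDCM probabilities in (5)."""
    c = 1.0 + sum(odds.values())          # c(L) = 1 + sum_{s != 1} Odds_s(L_(s))
    probs = {1: 1.0 / c}                  # complete-case probability Pi_1
    probs.update({r: o / c for r, o in odds.items()})
    return probs

def example_probs(l1, l2, l3):
    # Hypothetical odds functions; each uses only variables observed under its pattern.
    odds = {
        2: math.exp(-0.5 + 0.3 * l1 - 0.2 * l2),  # L_(2) = (L1, L2)
        3: math.exp(-1.0 + 0.4 * l3),             # L_(3) = L3
        4: math.exp(-2.0),                        # L_(4) is empty: constant odds
    }
    return ldcm_probs(odds)

if __name__ == "__main__":
    for point in [(0.0, 0.0, 0.0), (1.0, -1.0, 2.0)]:
        p = example_probs(*point)
        print(point, {r: round(v, 4) for r, v in sorted(p.items())})
```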

Interestingly, an equivalent characterization of equation (5) is:

$$L_{(-r)} \mid R = r, L_{(r)} \;\sim\; L_{(-r)} \mid R = 1, L_{(r)} \quad \text{for all } r \neq 1, \tag{6}$$

which states that the conditional distribution of unobserved variables L(−r) given observed variables L(r) for nonresponse pattern r matches the corresponding conditional distribution among complete cases. Although the LDCM is derived as a particular DCM, one could in principle take (6) as a primitive identifying condition without necessarily making reference to a DCM and the existence of its associated variables {εr : r}. This amounts to nonparametric identification under the complete-case missing value restriction of Little (1993). As shown in Section 5, adoption of the more general DCM framework is advantageous as it gives rise to a richer class of nonresponse models and facilitates identification; in fact, a different choice for the distribution Fε corresponds to a nonmonotone not at random nonresponse model which does not generally satisfy Little’s CCMV restriction but is nevertheless just-identified under (1) and (3).
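The equivalence between (5) and (6) follows from Bayes' rule. The short derivation below is a sketch added for convenience; it uses only the definitions of $\pi_r$ and $\mathrm{Odds}_r$ given earlier, with lower-case l denoting a generic value of L.

```latex
\begin{aligned}
f(l \mid R = r)
  &= \frac{\pi_r(l)\, f(l)}{\Pr(R = r)}
   = \mathrm{Odds}_r(l)\, f(l \mid R = 1)\, \frac{\Pr(R = 1)}{\Pr(R = r)}, \\[4pt]
\frac{f(l_{(-r)} \mid R = r, l_{(r)})}{f(l_{(-r)} \mid R = 1, l_{(r)})}
  &= \frac{\mathrm{Odds}_r(l)}
          {E\{\mathrm{Odds}_r(L) \mid R = 1, L_{(r)} = l_{(r)}\}} .
\end{aligned}
```

The ratio in the second line equals one for every l exactly when $\mathrm{Odds}_r(l)$ does not vary with $l_{(-r)}$, that is, when $\mathrm{Odds}_r(L) = \mathrm{Odds}_r(L_{(r)})$, which is the restriction underlying (5).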

It is instructive to compare condition (6) to standard MAR, which states that

$$L_{(-r)} \mid R = r, L_{(r)} \;\sim\; L_{(-r)} \mid L_{(r)} \quad \text{for all } r \neq 1, \tag{7}$$

i.e. the conditional distribution for pattern r matches the conditional distribution obtained upon marginalizing across all nonresponse patterns. Clearly, conditions (6) and (7) have fundamentally different implications for inference. Specifically, it is well known that when the nonresponse process and the full data distribution depend on separate parameters, the MAR assumption implies that the part of the observed data likelihood which depends on the full data parameter factorizes from the nonresponse process. The missing data mechanism is then said to be “ignorable” (Little and Rubin, 2002) because it is possible to learn about the full data law without necessarily estimating the missing data process, or equivalently, it is possible to learn about the missing data process without modeling the full data law (Sun and Tchetgen Tchetgen, 2016). No such factorization is in general available under CCMV as the missing data process is nonignorable. In spite of possible challenges due to lack of factorization, as shown later in the paper, estimation of nonmonotone non-response mechanisms under (6) is nevertheless relatively straightforward. Furthermore, assumption (6) is invariant to the number and nature of other nonresponse patterns potentially realized in the observed data. In contrast, MAR does not enjoy a similar invariance property because addition or deletion of a nonresponse pattern from the observed sample changes the interpretation of (7) as it implies marginalizing over a different set of nonresponse patterns to obtain the right-hand side of equation (7). Finally, note that assumptions (6) and (7) only coincide when there is a single nonresponse pattern, i.e. J = 2.

Remark 2 Sun and Tchetgen Tchetgen (2016) recently proposed an approach tailored specifically to model a nonmonotone nonresponse process under MAR restriction (7). However, they did not consider the MNAR restriction (3). As restrictions (3) and (7) differ, the approach proposed by Sun and Tchetgen Tchetgen (2016) cannot be used under restriction (3).

Lemma 3 Suppose that assumptions (1) and (2) hold with $F_\varepsilon$ being the extreme value distribution, then if (3) holds, the joint distribution f (R, L) is nonparametrically just-identified from the observed data $(L_{(R)}, R)$, with

$$f(R, L) = \frac{\prod_{r \neq 1} \mathrm{Odds}_r(L_{(r)})^{I(R = r)}\, f(L \mid R = 1)}{\iint \prod_{r \neq 1} \mathrm{Odds}_r(l_{(r)})^{I(r^* = r)}\, f(l \mid R = 1)\, d\mu(r^*, l)}, \tag{8}$$

where μ is a dominating measure of the CDF of (R, L).

Lemma 3 gives an explicit expression for f (R, L) which appears to be new, and can be used to compute the full data density $f(L) = \sum_r f(r, L)$. In addition, equation (8) can be used for maximum likelihood estimation. Specifically, let f (L|R = 1; η) denote a parametric model for f (L|R = 1) with unknown parameter η. Likewise, consider a parametric model for the nonresponse process $\Pi_r(\alpha) = \mathrm{Odds}_r(L_{(r)}; \alpha_r)/\{1 + \sum_{s \neq 1} \mathrm{Odds}_s(L_{(s)}; \alpha_s)\}$ with unknown parameter α = {αr : r}, where αr indexes a parametric model for $\mathrm{Odds}_r(L_{(r)}; \alpha_r)$. Let f (R, L; θ) denote the corresponding model for f (R, L), where θ = (η, α). The maximum likelihood estimator (MLE) $\hat\theta_{mle}$ maximizes the observed data log-likelihood $\mathbb{P}_n \log \int f(R, L; \theta)\, d\mu(L_{(-R)})$, where $\mathbb{P}_n(\cdot) = n^{-1} \sum_i (\cdot)_i$. The full data likelihood $f(L; \hat\theta_{mle}) = \int f(r, L; \hat\theta_{mle})\, d\mu(r)$ can then be used to make inferences about a given full data functional of interest according to the plug-in principle. By standard likelihood theory, the MLE is asymptotically efficient in the model $\mathcal{M}_{lik}$ corresponding to the set of laws {f (R, L; θ) : θ}. A major drawback of maximum likelihood inference is lack of robustness to model mis-specification, because $\hat\theta_{mle}$ is likely inconsistent if either $\Pi_r(\alpha)$ or f (L|R = 1; η) is incorrectly specified. Below, we consider four semiparametric estimators which are potentially more robust than direct likelihood maximization.

4. Semiparametric Inference

4.1. Inverse-probability weighting estimation

Suppose the parameter of interest β0 is the unique solution to the full data population estimating equation E{U(L;β0)} = 0, where expectation is taken over the distribution of the complete data L. Note that in principle, no further restriction on the distribution of L is strictly required; in fact, estimation is possible under certain weak regularity conditions (van der Vaart, 1998) as long as a full data unbiased estimating function exists. In the presence of missing data, the estimating function can only be evaluated for complete-cases, who might be highly selected even under MAR. This motivates the use of IPW estimating functions of complete-cases to form the following complete-case population estimating equation

$$E\left\{\frac{1(R = 1)}{\Pi_1}\, U(L; \beta_0)\right\} = 0, \tag{9}$$

which holds by straightforward iterated expectations. We note that the IPW estimator $\hat\beta_{ipw}$ which solves the empirical version of this equation will in general be inefficient, especially when the fraction of complete-cases is relatively small, since incomplete cases are discarded (except when estimating $\Pi_1$). In the next section we will describe a strategy to recover information from incomplete-cases by augmenting the estimating function shown in equation (9) to gain efficiency and potentially robustness. The IPW estimating equations framework encompasses a great variety of settings under which investigators may wish to account for non-monotone missing data. These include IPW of the full data score equation, where the score function is such an unbiased estimating function, given a model f (L; β0) for the law of the full data, in which case (9) reduces to $E\{1(R = 1)\, \partial \log f(L; \beta)/\partial\beta\,|_{\beta_0} / \Pi_1\} = 0$.
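For completeness, the iterated-expectations step behind (9) can be written out as follows. This is a sketch that uses only $E\{1(R = 1) \mid L\} = \pi_1(L) = \Pi_1$ and the positivity assumption (1), which guarantees that the weight $1/\Pi_1$ is bounded.

```latex
E\left\{\frac{1(R = 1)}{\Pi_1}\, U(L; \beta_0)\right\}
  = E\left[\frac{E\{1(R = 1) \mid L\}}{\pi_1(L)}\, U(L; \beta_0)\right]
  = E\{U(L; \beta_0)\}
  = 0 .
```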

We now describe a straightforward approach to obtain a consistent estimator of $\Pi_1$ in the semiparametric model which specifies a parametric LDCM $\{\Pi_r(\alpha) : r\}$, but allows f (L|R = 1) to remain unrestricted. We denote this model $\mathcal{M}_R$. The approach follows from the fact that (5) implies that:

$$\Pr\{R = r \mid L, R \in \{1, r\}\} = \Pi_{r,c} = \frac{\mathrm{Odds}_r(L_{(r)})}{1 + \mathrm{Odds}_r(L_{(r)})}, \quad \text{for all } r;$$

which also gives the following equivalent representation of the CCMV restriction:

$$R \perp\!\!\!\perp L_{(-r)} \mid R \in \{r, 1\}, L_{(r)} \quad \text{for each } r.$$

Note that L(r) is fully observed for observations R ∈ {1, r}. Thus, in order to estimate the parametric model $\{\Pi_{r,c}(\alpha) : r\}$, for each nonresponse pattern r one may fit the following logistic regression $\Pi_{r,c}(\alpha_r) = \mathrm{Odds}_r(L_{(r)}; \alpha_r)/\{1 + \mathrm{Odds}_r(L_{(r)}; \alpha_r)\}$ by maximum likelihood estimation restricted to the subset of data containing complete-cases and incomplete-cases of pattern r only. Thus, we define the restricted MLE

$$\tilde\alpha_r = \arg\max_{\alpha_r} \mathbb{P}_n\, \mathrm{llik}_{r,c}(\alpha_r) = \arg\max_{\alpha_r} \mathbb{P}_n \left[ I(R = r) \log \Pi_{r,c}(\alpha_r) + I(R = 1) \log\{1 - \Pi_{r,c}(\alpha_r)\} \right].$$

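Under assumption (1), the restricted MLE $\tilde\alpha$ is consistent and asymptotically normal under model $\mathcal{M}_R$. The resulting estimator of the complete-case probability $\Pi_1$ under $\mathcal{M}_R$ is

$$\Pi_1(\tilde\alpha) = \frac{1}{1 + \sum_{s \neq 1} \mathrm{Odds}_s(L_{(s)}; \tilde\alpha_s)},$$

which, in turn, provides the IPW estimator $\hat\beta_{ipw}$ of β which solves

$$\mathbb{P}_n U_{ipw}(L_{(R)}, R; \hat\beta_{ipw}, \tilde\alpha) = 0, \tag{10}$$

where $U_{ipw}(L_{(R)}, R; \hat\beta_{ipw}, \tilde\alpha) = 1(R = 1)\, U(L; \hat\beta_{ipw})/\Pi_1(\tilde\alpha)$. Under standard regularity conditions, one can show that under $\mathcal{M}_R$ the IPW estimator $\hat\beta_{ipw}$ will in large samples be approximately normal with mean β0 and asymptotic variance $\hat\Gamma_{ipw}^{-1} \hat\Omega_{ipw} \hat\Gamma_{ipw}^{-1}$, where

$$\hat\Gamma_{ipw} = \left.\frac{\partial}{\partial\beta^T} \mathbb{P}_n U_{ipw}(L_{(R)}, R; \beta, \tilde\alpha)\right|_{\hat\beta_{ipw}};$$

$$\hat\Omega_{ipw} = n^{-1}\, \mathbb{P}_n \left[ U_{ipw}(L_{(R)}, R; \hat\beta_{ipw}, \tilde\alpha) + \left.\frac{\partial}{\partial\alpha^T} \mathbb{P}_n U_{ipw}(L_{(R)}, R; \hat\beta_{ipw}, \alpha)\right|_{\tilde\alpha} \widehat{IF}_\alpha \right]^{\otimes 2};$$

$$\widehat{IF}_\alpha = -\left\{ \left.\frac{\partial^2}{\partial\alpha\, \partial\alpha^T} \mathbb{P}_n \sum_{r \neq 1} \mathrm{llik}_{r,c}(\alpha_r)\right|_{\tilde\alpha} \right\}^{-1} \left.\frac{\partial}{\partial\alpha} \sum_{r \neq 1} \mathrm{llik}_{r,c}(\alpha_r)\right|_{\tilde\alpha}.$$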

For inference about a component of β0, one may report the corresponding Wald-type 95% confidence interval.
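The IPW procedure of this section can be sketched in a few dozen lines. The example below is illustrative only and is not the simulation design used later in the paper. It assumes a hypothetical data generating mechanism with L = (Y, X) and three patterns (1: complete, 2: X only, 3: Y only), built so that the LDCM holds with log-linear odds: complete cases are bivariate normal and each incomplete pattern follows the correspondingly tilted normal law. The names `simulate`, `fit_logit` and `ipw_mean`, and the coefficients `A` and `B`, are all assumptions of this sketch. The target is β0 = E(Y) with U(L; β) = Y − β, so (10) has the closed-form solution coded in `ipw_mean`.

```python
import math
import random

A = (-1.0, 0.5)   # hypothetical: log Odds_2(X) = A[0] + A[1] * X
B = (-1.5, 0.5)   # hypothetical: log Odds_3(Y) = B[0] + B[1] * Y

def simulate(n, seed=0):
    """Draw (R, Y, X) pattern by pattern so that the LDCM holds; also return E(Y)."""
    rng = random.Random(seed)
    w2 = math.exp(A[0] + A[1] ** 2 / 2.0)     # E{Odds_2(X) | R = 1}
    w3 = math.exp(B[0] + B[1] + B[1] ** 2)    # E{Odds_3(Y) | R = 1}
    total = 1.0 + w2 + w3
    rows = []
    for _ in range(n):
        u = rng.random() * total
        if u < 1.0 + w2:                       # patterns 1 and 2 share f(Y | X, R = 1)
            r = 1 if u < 1.0 else 2
            x = rng.gauss(A[1] if r == 2 else 0.0, 1.0)
            y = 1.0 + x + rng.gauss(0.0, 1.0)
        else:                                  # pattern 3 shares f(X | Y, R = 1)
            r = 3
            y = rng.gauss(1.0 + 2.0 * B[1], math.sqrt(2.0))
            x = rng.gauss((y - 1.0) / 2.0, math.sqrt(0.5))
        rows.append((r, y, x))                 # Y is unused when r == 2, X when r == 3
    truth = 1.0 + (w2 * A[1] + w3 * 2.0 * B[1]) / total
    return rows, truth

def fit_logit(pairs, iters=25):
    """Newton-Raphson MLE for P(d = 1 | z) = expit(a0 + a1 * z)."""
    a0 = a1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for z, d in pairs:
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * z)))
            w = p * (1.0 - p)
            g0 += d - p
            g1 += (d - p) * z
            h00 += w
            h01 += w * z
            h11 += w * z * z
        det = h00 * h11 - h01 * h01
        a0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    return a0, a1

def ipw_mean(rows):
    """Restricted MLEs on {1, r} subsets, then solve (10) for U = Y - beta."""
    a2 = fit_logit([(x, int(r == 2)) for r, y, x in rows if r in (1, 2)])
    a3 = fit_logit([(y, int(r == 3)) for r, y, x in rows if r in (1, 3)])
    num = den = 0.0
    for r, y, x in rows:
        if r == 1:
            # 1 / Pi_1(alpha~) = 1 + Odds_2(X) + Odds_3(Y)
            w = 1.0 + math.exp(a2[0] + a2[1] * x) + math.exp(a3[0] + a3[1] * y)
            num += w * y
            den += w
    return num / den, a2, a3

if __name__ == "__main__":
    rows, truth = simulate(20000, seed=1)
    print(ipw_mean(rows)[0], truth)
```

Each logistic regression uses only complete cases and cases of one incomplete pattern, mirroring the definition of the restricted MLE; in this hypothetical design the unweighted complete-case mean of Y targets 1 rather than E(Y), which is what the weights are meant to correct.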

4.2. Pattern-mixture LDCM estimation

In this Section, we consider an alternative approach for obtaining inferences about the full data parameter β0 defined in the previous Section. The approach is a slight generalization of the well-known pattern-mixture approach due to Little (1993). To proceed, note that

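$$E\{U(L; \beta_0)\} = E\left[E\{U(L; \beta_0) \mid R, L_{(R)}\}\right] = E\left[E\{U(L; \beta_0) \mid R = 1, L_{(R)}\}\right] = E\left[\sum_r I(R = r)\, E\{U(L; \beta_0) \mid R = 1, L_{(r)}\}\right] = 0, \tag{11}$$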

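where the second equality follows from (6). Now, consider the semiparametric model $\mathcal{M}_L$ which posits a parametric model f (L|R = 1; η) while allowing the nonresponse process $\{\Pi_r : r\}$ to remain unrestricted. Let $\tilde\eta$ denote the restricted MLE of η in $\mathcal{M}_L$ obtained using only complete-case data, i.e. $\tilde\eta = \arg\max_\eta \mathbb{P}_n\, \mathrm{llik}_{l,c}(\eta) = \arg\max_\eta \mathbb{P}_n\, I(R = 1) \log f(L \mid R = 1; \eta)$. An empirical version of equation (11) can then be used to obtain the following pattern mixture estimator $\hat\beta_{pm}$ of β0,

$$0 = \mathbb{P}_n U_{pm}(L_{(R)}, R; \hat\beta_{pm}, \tilde\eta), \tag{12}$$

where

$$U_{pm}(L_{(R)}, R; \hat\beta_{pm}, \tilde\eta) = \sum_r I(R = r)\, E\{U(L; \hat\beta_{pm}) \mid R = 1, L_{(r)}; \tilde\eta\}, \tag{13}$$

and $E\{U(L; \hat\beta_{pm}) \mid R = 1, L_{(r)}; \tilde\eta\} = \int U(l_{(-r)}, L_{(r)}; \hat\beta_{pm})\, f(l_{(-r)} \mid L_{(r)}, R = 1; \tilde\eta)\, d\mu(l_{(-r)})$. Note that in order to ensure that the models $\{f(l_{(-r)} \mid L_{(r)}, R = 1; \tilde\eta), r \neq 1\}$ are compatible, one may need to specify a model for f (L|R = 1); this is effectively the approach followed by Little (1993). Also note that in the pattern mixture approach, the model for f (L) which is of primary scientific interest is indirectly specified via models for the various conditional densities $\{f(l_{(-r)} \mid L_{(r)}, R = 1), r \neq 1\}$ and the marginal densities $\{f(L_{(r)} \mid R = r), r \neq 1\}$ according to the following mixture: $f(L) = \sum_r f(L_{(-r)} \mid L_{(r)}, R = 1)\, f(L_{(r)} \mid R = r) \Pr(R = r)$ (Little, 1993). Under standard regularity conditions, one can show that in large samples, $\hat\beta_{pm}$ will be approximately normal with mean β0 and asymptotic variance consistently estimated by $\hat\Gamma_{pm}^{-1} \hat\Omega_{pm} \hat\Gamma_{pm}^{-1}$, where

$$\hat\Gamma_{pm} = \left.\frac{\partial}{\partial\beta^T} \mathbb{P}_n U_{pm}(L_{(R)}, R; \beta, \tilde\eta)\right|_{\hat\beta_{pm}};$$

$$\hat\Omega_{pm} = n^{-1}\, \mathbb{P}_n \left[ U_{pm}(L_{(R)}, R; \hat\beta_{pm}, \tilde\eta) + \left.\frac{\partial}{\partial\eta^T} \mathbb{P}_n U_{pm}(L_{(R)}, R; \hat\beta_{pm}, \eta)\right|_{\tilde\eta} \widehat{IF}_\eta \right]^{\otimes 2};$$

$$\widehat{IF}_\eta = -\left\{ \left.\frac{\partial^2}{\partial\eta\, \partial\eta^T} \mathbb{P}_n\, \mathrm{llik}_{l,c}(\eta)\right|_{\tilde\eta} \right\}^{-1} \left.\frac{\partial}{\partial\eta}\, \mathrm{llik}_{l,c}(\eta)\right|_{\tilde\eta}.$$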
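A small sketch of the pattern-mixture estimator (12)-(13) follows, again for β0 = E(Y) with U(L; β) = Y − β. It is illustrative only: the data generating mechanism, the helper names `simulate`, `ols` and `pm_mean`, and the coefficients `A` and `B` are hypothetical. Only the conditional mean of Y given X among complete cases is modelled (by least squares), because that is the only feature of f (L|R = 1) that (13) needs for this particular U. Under pattern 3, Y is observed, so the conditional expectation in (13) is Y − β itself.

```python
import math
import random

A = (-1.0, 0.5)   # hypothetical: log Odds_2(X) = A[0] + A[1] * X
B = (-1.5, 0.5)   # hypothetical: log Odds_3(Y) = B[0] + B[1] * Y

def simulate(n, seed=0):
    """Draw (R, Y, X) with patterns 1: complete, 2: X only, 3: Y only (CCMV holds)."""
    rng = random.Random(seed)
    w2 = math.exp(A[0] + A[1] ** 2 / 2.0)     # E{Odds_2(X) | R = 1}
    w3 = math.exp(B[0] + B[1] + B[1] ** 2)    # E{Odds_3(Y) | R = 1}
    total = 1.0 + w2 + w3
    rows = []
    for _ in range(n):
        u = rng.random() * total
        if u < 1.0 + w2:                       # patterns 1 and 2 share f(Y | X, R = 1)
            r = 1 if u < 1.0 else 2
            x = rng.gauss(A[1] if r == 2 else 0.0, 1.0)
            y = 1.0 + x + rng.gauss(0.0, 1.0)
        else:                                  # pattern 3 shares f(X | Y, R = 1)
            r = 3
            y = rng.gauss(1.0 + 2.0 * B[1], math.sqrt(2.0))
            x = rng.gauss((y - 1.0) / 2.0, math.sqrt(0.5))
        rows.append((r, y, x))                 # Y is unused when r == 2, X when r == 3
    truth = 1.0 + (w2 * A[1] + w3 * 2.0 * B[1]) / total
    return rows, truth

def ols(pairs):
    """Least-squares intercept and slope for pairs (z, y)."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs)
    szy = sum((z - mz) * (y - my) for z, y in pairs)
    slope = szy / szz
    return my - slope * mz, slope

def pm_mean(rows):
    """Solve (12) for U = Y - beta: average Y, imputing E(Y | X, R = 1) in pattern 2."""
    g0, g1 = ols([(x, y) for r, y, x in rows if r == 1])   # complete cases only
    total = 0.0
    for r, y, x in rows:
        total += (g0 + g1 * x) if r == 2 else y            # Y observed in patterns 1, 3
    return total / len(rows), (g0, g1)

if __name__ == "__main__":
    rows, truth = simulate(20000, seed=1)
    print(pm_mean(rows)[0], truth)
```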

4.3. Doubly robust and multiply robust LDCM estimation

We have now described two separate approaches for estimating the full data functional β0 under the LDCM, IPW and PM estimation, each of which depends on a separate part (i.e. variation independent parameter) of the joint distribution f (R, L) given in Lemma 3. As previously discussed, validity of IPW estimation relies on correct specification of the nonresponse model $\mathcal{M}_R$, while PM estimation relies for consistency on correct specification of $\mathcal{M}_L$. Because when L is sufficiently high dimensional, one cannot be confident that either, if any, model is correctly specified, it is of interest to develop a doubly robust estimation approach, which is guaranteed to deliver valid inferences about β0 provided that either $\mathcal{M}_R$ or $\mathcal{M}_L$ is correctly specified, but not necessarily both. That is, we aim to develop a consistent estimator of β0 in the semiparametric union model $\mathcal{M}_{DR} = \mathcal{M}_R \cup \mathcal{M}_L$.

In order to describe the DR approach, let

$$V(\beta, \alpha, \eta) = v(L_{(R)}, R; \beta, \alpha, \eta) = \frac{1(R = 1)}{\Pi_1(\alpha)}\, U(L; \beta) - \frac{1(R = 1)}{\Pi_1(\alpha)} \sum_{r \neq 1} \Pi_r(\alpha)\, E\{U(L; \beta) \mid L_{(r)}, R = 1; \eta\} + \sum_{r \neq 1} I(R = r)\, E\{U(L; \beta) \mid L_{(r)}, R = 1; \eta\}$$

and let $\hat\beta_{dr}$ denote the solution to the equation

$$0 = \mathbb{P}_n V(\hat\beta_{dr}, \tilde\alpha, \tilde\eta). \tag{14}$$

We have the following result.

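Theorem 4 Suppose that assumptions (1) and (2) hold with $F_\varepsilon$ the extreme value distribution. Then, under standard regularity conditions, we have that $\hat\beta_{dr}$ is consistent and asymptotically normal in the union model $\mathcal{M}_{DR}$ with asymptotic variance consistently estimated by $\hat\Gamma_{dr}^{-1} \hat\Omega_{dr} \hat\Gamma_{dr}^{-1}$, where

$$\hat\Gamma_{dr} = \left.\frac{\partial}{\partial\beta^T} \mathbb{P}_n V(\beta, \tilde\alpha, \tilde\eta)\right|_{\hat\beta_{dr}}; \qquad \hat\Omega_{dr} = n^{-1}\, \mathbb{P}_n \left[ V(\hat\beta_{dr}, \tilde\alpha, \tilde\eta) + \left.\frac{\partial}{\partial\eta^T} \mathbb{P}_n V(\hat\beta_{dr}, \tilde\alpha, \eta)\right|_{\tilde\eta} \widehat{IF}_\eta + \left.\frac{\partial}{\partial\alpha^T} \mathbb{P}_n V(\hat\beta_{dr}, \alpha, \tilde\eta)\right|_{\tilde\alpha} \widehat{IF}_\alpha \right]^{\otimes 2}.$$

The above theorem formally establishes the DR property of $\hat\beta_{dr}$. Instead of the above estimators of asymptotic variance, one may use the nonparametric bootstrap to obtain inferences based on either $\hat\beta_{dr}$, $\hat\beta_{ipw}$ or $\hat\beta_{pm}$.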
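The double robustness property can be illustrated numerically. The sketch below is not the paper's simulation: it uses a hypothetical design with L = (Y, X), patterns 1 (complete), 2 (X only) and 3 (Y only), and the target β0 = E(Y) with U(L; β) = Y − β. For this U, the pattern-3 terms of V cancel because E{U | L(3), R = 1} = U, and using $\Pi_r/\Pi_1 = \mathrm{Odds}_r$, equation (14) solves in closed form as the sample mean of 1(R = 1){Y + Odds2(X)(Y − m2(X))} + I(R = 2) m2(X) + I(R = 3) Y, where m2 is the fitted complete-case mean of Y given X. All function names, coefficients, and the `odds_feature` / `mean_feature` arguments (used to misspecify one working model at a time by swapping X for X²) are assumptions of this example.

```python
import math
import random

A = (-1.0, 0.5)   # hypothetical: log Odds_2(X) = A[0] + A[1] * X
B = (-1.5, 0.5)   # hypothetical: log Odds_3(Y) = B[0] + B[1] * Y

def simulate(n, seed=0):
    """Draw (R, Y, X) with patterns 1: complete, 2: X only, 3: Y only (LDCM holds)."""
    rng = random.Random(seed)
    w2 = math.exp(A[0] + A[1] ** 2 / 2.0)
    w3 = math.exp(B[0] + B[1] + B[1] ** 2)
    total = 1.0 + w2 + w3
    rows = []
    for _ in range(n):
        u = rng.random() * total
        if u < 1.0 + w2:
            r = 1 if u < 1.0 else 2
            x = rng.gauss(A[1] if r == 2 else 0.0, 1.0)
            y = 1.0 + x + rng.gauss(0.0, 1.0)
        else:
            r = 3
            y = rng.gauss(1.0 + 2.0 * B[1], math.sqrt(2.0))
            x = rng.gauss((y - 1.0) / 2.0, math.sqrt(0.5))
        rows.append((r, y, x))
    return rows, 1.0 + (w2 * A[1] + w3 * 2.0 * B[1]) / total

def fit_logit(pairs, iters=25):
    """Newton-Raphson MLE for P(d = 1 | z) = expit(a0 + a1 * z)."""
    a0 = a1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for z, d in pairs:
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * z)))
            w = p * (1.0 - p)
            g0 += d - p
            g1 += (d - p) * z
            h00 += w
            h01 += w * z
            h11 += w * z * z
        det = h00 * h11 - h01 * h01
        a0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    return a0, a1

def ols(pairs):
    """Least-squares intercept and slope for pairs (z, y)."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    slope = (sum((z - mz) * (y - my) for z, y in pairs)
             / sum((z - mz) ** 2 for z, _ in pairs))
    return my - slope * mz, slope

def dr_mean(rows, odds_feature=lambda x: x, mean_feature=lambda x: x):
    """Closed-form solution of (14) for U = Y - beta in this three-pattern design."""
    a2 = fit_logit([(odds_feature(x), int(r == 2)) for r, y, x in rows if r in (1, 2)])
    g = ols([(mean_feature(x), y) for r, y, x in rows if r == 1])
    total = 0.0
    for r, y, x in rows:
        if r == 3:                       # Y observed, X not used
            total += y
            continue
        m2 = g[0] + g[1] * mean_feature(x)
        if r == 1:                       # complete case: augmented IPW term
            total += y + math.exp(a2[0] + a2[1] * odds_feature(x)) * (y - m2)
        else:                            # pattern 2: imputed complete-case mean
            total += m2
    return total / len(rows)

if __name__ == "__main__":
    rows, truth = simulate(20000, seed=2)
    square = lambda v: v * v
    print(truth, dr_mean(rows), dr_mean(rows, odds_feature=square),
          dr_mean(rows, mean_feature=square))
```

In this design the estimate is expected to stay close to the truth when either working model uses X² in place of X, but not, in general, when both do.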

Remark 5 Equation (8) of Lemma 3 implies that f (R = 1|l) (which only depends on $\{\mathrm{Odds}_r(l_{(r)}) : r\}$) and f (l|R = 1) are variation independent under the CCMV restriction. This variation independence is important as double robustness is meaningful only if it is possible a priori for both of the nuisance models to be correctly specified, see Robins and Rotnitzky (2001) and Richardson et al (2016, Remark 3.1). Note however, that in general f (l|r) and f (r|l) are variation dependent even under CCMV.

Interestingly, it is possible to make the estimator $\hat\beta_{dr}$ even more robust by the following modification to estimation of the nuisance parameter η. Specifically, suppose that for each r, the conditional density $f(L_{(-r)} \mid L_{(r)}, r; \eta) = f(L_{(-r)} \mid L_{(r)}, r; \eta_r) = f(L_{(-r)} \mid L_{(r)}, R = 1; \eta_r)$ only depends on the subset of parameters $\eta_r \subset \eta$, where there may be parameter overlap across patterns, $\eta_r \cap \eta_{r'} \neq \emptyset$, for distinct patterns r and r′. Let $\mathcal{M}_L(r)$ denote the semiparametric model which only specifies $f(L_{(-r)} \mid L_{(r)}, R = 1; \eta_r)$, allowing the density $f(L_{(r)} \mid R = 1)$ and the missing data process to remain unspecified. Note that $\mathcal{M}_L \subseteq \cap_{r \neq 1} \mathcal{M}_L(r)$. Let $\bar\eta_r$ denote the complete-case MLE under $\mathcal{M}_L(r)$: $\bar\eta_r = \arg\max_{\eta_r} \mathbb{P}_n\, I(R = 1) \log f(L_{(-r)} \mid L_{(r)}, R = 1; \eta_r)$. Likewise, let $\mathcal{M}_R(r)$ denote the semiparametric model that specifies the nonresponse model $\Pi_{r,c}(\alpha_r)$, and is otherwise unspecified. Note that $\mathcal{M}_R = \cap_{r \neq 1} \mathcal{M}_R(r)$. Consider the following pattern-specific union model $\mathcal{M}_{DR}(r) = \mathcal{M}_R(r) \cup \mathcal{M}_L(r)$, which is the set of laws with either $\mathcal{M}_R(r)$ or $\mathcal{M}_L(r)$ correctly specified. The intersection submodel of these laws $\mathcal{M}_{MR} = \cap_{r \neq 1} \mathcal{M}_{DR}(r) = \cap_{r \neq 1} \{\mathcal{M}_R(r) \cup \mathcal{M}_L(r)\}$ is the set of laws such that the union model for each r holds. Note that $\mathcal{M}_{DR} \subseteq \mathcal{M}_{MR}$ since the first union model requires that either the entire nonresponse process is correctly specified, i.e. $\cap_{r \neq 1} \mathcal{M}_R(r)$ holds, or the joint complete-case distribution of L is correctly specified, i.e. $\cap_{r \neq 1} \mathcal{M}_L(r)$ holds; in contrast, $\mathcal{M}_{MR}$ requires only correct specification of one of the two models for each pattern. An estimator of β0 that is consistent in model $\mathcal{M}_{MR}$ is said to be multiply-robust, or more precisely $2^J$-robust (Vansteelandt et al, 2007), for J nonmonotone missing data patterns. We have the following result:

Corollary 6 Suppose that assumptions (1) and (2) hold with $F_\varepsilon$ the extreme value distribution. Then, under standard regularity conditions, we have that $\hat\beta_{mr}$ is consistent and asymptotically normal in the union model $\mathcal{M}_{MR}$, where $\hat\beta_{mr}$ is defined as $\hat\beta_{dr}$ with $\bar\eta_r$ used to estimate $\eta_r$.

The above corollary describes an estimator with the MR property which states that given J nonresponse patterns, the analyst would in principle have (under our identifying assumptions) $2^J$ opportunities to obtain valid inferences about β0. This is to be contrasted with the single chance to obtain valid inferences offered by the IPW or PM approaches respectively, or the two chances offered by the DR estimator. For inference, one may readily adapt the large sample variance estimator given in Theorem 4, or alternatively use the nonparametric bootstrap.

4.4. Simulation Study

We performed a simulation study to investigate the performance of the various estimators described above in finite samples. We generated 1000 samples of size n = 2000. We implemented the following data generating mechanism. Independent and identically distributed (Y, X) is generated from a normal mixture model: $(Y, X) \sim \sum_{k=1}^{3} \pi_k N(\mu_k, \Sigma)$, where π1 = 1/2, π2 = e/(2+2e), π3 = 1/(2+2e), μ1 = (0,0)ᵀ, μ2 = (1,1)ᵀ, μ3 = (1,2)ᵀ and Σ = (σij), where σ11 = σ12 = 1, σ22 = 2. We consider four missing data patterns L(R): L(1) = L, L(2) = X, L(3) = Y, L(4) = ∅. Conditional on the generated full data, the missing data pattern is then generated under the following mechanism:

$$P(R = 1 \mid X, Y) = \frac{1}{1 + \exp(X) + \exp(2Y) + \exp(1)};$$
$$P(R = 2 \mid X, Y) = \frac{\exp(X)}{1 + \exp(X) + \exp(2Y) + \exp(1)};$$
$$P(R = 3 \mid X, Y) = \frac{\exp(2Y)}{1 + \exp(X) + \exp(2Y) + \exp(1)};$$
$$P(R = 4 \mid X, Y) = \frac{\exp(1)}{1 + \exp(X) + \exp(2Y) + \exp(1)}.$$

Since for each missing data pattern r, P(R = r | X, Y) depends on the full data (X, Y), the missing data mechanism is MNAR. The identifiability of normal mixture models in the MNAR setting has previously been considered in Miao et al. (2016). The full data target parameter of interest is $\beta = E(Y) = \sum_r p_r E(Y \mid R = r) = \{2 + \exp(1)\}/\{2 + 2\exp(1)\}$, with full data estimating function $U(\beta) = Y - \beta$.

We implemented Little’s PM approach as well as our IPW and DR estimators. In doing so, correct specification of the nonresponse process entailed matching the data generating mechanism described above, i.e. $\log \mathrm{Odds}_2(L_{(2)}) = \alpha_{20} + \alpha_{21} X$, $\log \mathrm{Odds}_3(L_{(3)}) = \alpha_{30} + \alpha_{31} Y$, $\log \mathrm{Odds}_4(L_{(4)}) = \alpha_{40}$. Misspecification of these models occurred by instead fitting $\log \mathrm{Odds}_2(L_{(2)}) = \alpha_{20} + \alpha_{21} X^2$ and $\log \mathrm{Odds}_3(L_{(3)}) = \alpha_{30} + \alpha_{31} Y^2$. Likewise, correct specification for the PM approach entailed defining $E(Y \mid R = 2, X) = E(Y \mid R = 1, X) = \gamma_{20} + \gamma_{21} X$, while the incorrect model $E(Y \mid R = 1, X) = \gamma_{20} + \gamma_{21} X^2$ was used to assess the impact of model mis-specification of the complete-case distribution. Note that as U (β) does not depend on X, $E\{U(\beta) \mid R = 3, L_{(3)}\} = U(\beta)$. We explored four scenarios corresponding to (1) correct f (R|L) and f (L|R = 1); (2) correct f (R|L) but incorrect f (L|R = 1); (3) correct f (L|R = 1) but incorrect f (R|L); and finally (4) incorrect f (R|L) and f (L|R = 1).

Results in Table 1 confirm our theoretical results and show that, as expected, IPW has small bias in scenarios (1) and (2) only, PM has small bias in scenarios (1) and (3), and DR has small bias in scenarios (1)-(3). In scenario (4), where all models are incorrect, all estimators are, as expected, substantially biased. When model misspecification is absent, as in the first scenario, IPW has larger root mean squared error (RMSE) than PM, whereas DR is comparable to PM, at least in this simulation setting. Interestingly, the RMSE of DR closely follows that of PM in scenarios (1) and (3), suggesting that the potential efficiency loss incurred to obtain DR inference relative to PM inference may not be substantial in practice. Table 1 of the Supplemental Appendix summarizes simulation results assessing the performance of our estimators of asymptotic variance and the coverage of Wald confidence intervals using estimated standard errors for the three estimators under consideration. The results largely indicate that our standard error estimators are consistent in all scenarios where the point estimators are also consistent, including under partial model misspecification for the DR estimator (see the comparison to Monte Carlo standard errors in Table 1 of the Supplemental Appendix). However, our standard error estimators appear to break down severely whenever model misspecification induces bias in parameter estimates. Interestingly, the performance of the nonparametric bootstrap closely follows that of our estimators in all instances and also appears to break down under bias-inducing model misspecification. We do not view this as a serious limitation, given that inferences in such cases are unreliable even with a consistent estimator of standard error.

Table 1: Monte Carlo results of the IPW, PM and DR estimators: accuracy of standard deviation estimator and coverage probabilities. The sample size is 2000.

| Quantity | Estimator | bth* | nrm | ccm | bad |
|---|---|---|---|---|---|
| Estimated SD / Monte Carlo SD | IPW | 0.951 | 0.951 | 0.438 | 0.438 |
| | PM | 0.993 | 0.979 | 0.993 | 0.979 |
| | DR | 0.995 | 0.995 | 0.886 | 0.725 |
| Estimated SD / Bootstrapped SD | IPW | 0.994 | 0.994 | 0.932 | 0.932 |
| | PM | 1.000 | 1.002 | 1.000 | 1.002 |
| | DR | 0.999 | 0.990 | 0.973 | 0.951 |
| Coverage** | IPW | 0.938 | 0.938 | 0.080 | 0.080 |
| | PM | 0.954 | 0.001 | 0.954 | 0.001 |
| | DR | 0.948 | 0.947 | 0.953 | 0.030 |

* bth: both models correct; nrm: nonresponse model correct; ccm: complete-case model correct; bad: both models incorrect.

** Nominal level = 95%.

4.5. A data application

The empirical application concerns a study of the association between maternal exposure to highly active antiretroviral therapy (HAART) during pregnancy and birth outcomes among HIV-infected women in Botswana. A detailed description of the study cohort has been presented elsewhere (Chen et al. 2012). The entire study cohort consists of 33148 obstetrical records abstracted from 6 sites in Botswana over 24 months. Our current analysis focuses on the subset of women who were known to be HIV positive (n = 9711). The birth outcome of interest is preterm delivery, defined as delivery at < 37 weeks of gestation. The outcome is not observed for 6.7% of these women. The data also contain the following risk factors of interest, which are also subject to missingness (Table 2): whether CD4+ cell count is less than 200 cells/μL and whether a woman continued HAART from before pregnancy or not.

Table 2: Real data analysis: tabulation of missing data patterns. The total sample size is 9711. Missing variables are coded by 0. The first row represents the complete case.

| Pattern (R) | Preterm Delivery | Low CD4 Count | Cont. HAART | Percentage |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 10.5% |
| 2 | 0 | 1 | 1 | 0.7% |
| 3 | 1 | 0 | 1 | 18.3% |
| 4 | 0 | 0 | 1 | 1.6% |
| 5 | 1 | 1 | 0 | 33.9% |
| 6 | 0 | 1 | 0 | 1.5% |
| 7 | 1 | 0 | 0 | 30.6% |
| 8 | 0 | 0 | 0 | 2.9% |

Our goal is to correlate these factors with preterm delivery using a logistic regression. In other words, the parameter of interest is the vector of coefficients of the corresponding logistic regression. We implemented the complete-case (CC) analysis together with three proposed estimators that account for MNAR nonresponse: the LDCM IPW, PM and DR estimators. Estimation of the nonresponse process used the fairly generic specification $\log \mathrm{Odds}_r(L_{(r)}; \alpha_r) = \alpha_r' q_r(L_{(r)})$, where $q_r(L_{(r)})$ included all main effects and two-way interactions of components of $L_{(r)}$, while PM specified the log-linear model $\Pr(L \mid R = 1) \propto \exp(\eta' L)$.
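The covariate vector with all main effects and two-way interactions can be illustrated with a short Python sketch. The helper name `main_effects_and_interactions` is hypothetical, and including an intercept column is an assumption made here for illustration; the paper does not state how its design vectors were coded.

```python
from itertools import combinations

import numpy as np

def main_effects_and_interactions(observed):
    """Build a design matrix from the components observed under one pattern.

    observed : (n, p) array of the p components observed under the pattern
    returns  : (n, 1 + p + p*(p-1)/2) array with columns
               [intercept (assumed), main effects, pairwise products]
    """
    observed = np.asarray(observed, dtype=float)
    if observed.ndim == 1:
        observed = observed[:, None]
    n, p = observed.shape
    columns = [np.ones(n)]                                  # intercept (assumption)
    columns += [observed[:, j] for j in range(p)]           # main effects
    columns += [observed[:, j] * observed[:, k]             # two-way interactions
                for j, k in combinations(range(p), 2)]
    return np.column_stack(columns)
```

For a pattern in which two binary variables are observed, this gives four columns (intercept, two main effects, one product); for a pattern with no observed components, only the intercept remains.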

Table 3 summarizes results for the complete-case (CC) analysis together with Little’s PM analysis and our two semiparametric estimators (IPW and DR). The results suggest that the association between CD4 count and preterm delivery may be subject to selection bias to a greater extent than that between HAART and preterm delivery. In fact, the estimated odds ratio for CD4 count is between 18% and 30% larger for IPW, PM and DR than the CC odds ratio, whereas the odds ratio for HAART is quite similar for all four estimators. Although PM generally appears less variable, there are no notable differences between inferences obtained using IPW, PM or DR, providing no evidence that either IPW or PM might be subject to misspecification bias.

Table 3: Real data analysis: estimated odds ratios of preterm delivery associated with various risk factors. The 95% confidence intervals are estimated based on bootstrap samples.

| Estimator | Low CD4 Count | Cont. HAART |
|---|---|---|
| CC | 0.782 (0.531, 1.135) | 1.142 (0.810, 1.620) |
| IPW | 0.924 (0.631, 1.338) | 1.180 (0.847, 1.638) |
| PM | 0.963 (0.704, 1.318) | 1.175 (0.881, 1.598) |
| DR | 1.020 (0.742, 1.397) | 1.158 (0.869, 1.560) |

5. Inference for general DCM

Consider a DCM with user-specified $F_\varepsilon$, a well-defined continuous CDF. Local identification under assumption (3) is best understood with discrete data. In this vein, suppose that $L_{(r)}$ takes on $M_r$ levels; then $\Delta\mu_{1r}(L_{(r)})$ depends on at most $M_r$ unknown parameters. Now consider a user-supplied $M_r$-dimensional function $G_r = g_r(L_{(r)})$, and let $W_r(G_r) = G_r \times \{1(R = r) - 1(R = 1)\,\Pi_r/\Pi_1\}$. It is straightforward to verify that

$$E\{W_r(G_r)\} = 0 \quad \text{for } r = 2, \ldots, \tag{15}$$

yielding the $M_r$ restrictions needed to identify each $\Delta\mu_{1r}$. Naturally, components of $G_r$ should be chosen appropriately to avoid redundancy and linear dependence. A similar argument could in principle be carefully crafted to establish local identification if L contains continuous components. However, this is not pursued further in this paper. Interestingly, equation (15) motivates a simple approach for estimating $\Pi_r$ in practice. Suppose that one posits a parametric model $\Delta\mu_{1r}(L_{(r)}; \alpha_r)$ for $\Delta\mu_{1r}(L_{(r)})$ with finite-dimensional unknown parameter $\alpha_r$, for all r. Then the following empirical version of (15) would in principle deliver an estimator $\hat\alpha = \{\hat\alpha_r : r\}$ of $\alpha = \{\alpha_r : r\}$:

$$\mathbb{P}_n W_r(\hat G_r; \hat\alpha) = 0 \quad \text{for } r = 2, \ldots,$$

where $\mathbb{P}_n$ denotes the empirical average and $W_r(\hat G_r; \hat\alpha) = \hat G_r \times \{1(R = r) - 1(R = 1)\,\Pi_r(\hat\alpha)/\Pi_1(\hat\alpha)\}$. A convenient choice is $\hat G_r = \partial \Delta\mu_{1r}(L_{(r)}; \hat\alpha_r)/\partial \alpha_r$. Under mild regularity conditions, $\hat\alpha$ will be consistent and asymptotically normal provided that $\Delta\mu_{1r}(L_{(r)}; \alpha_r)$ is correctly specified for all r.

Given a consistent estimator of Π1, IPW inferences about β0 may be obtained as described in previous sections. Likewise, maximum likelihood estimation is straightforward by maximizing a model for the likelihood given in Lemma 1. Unfortunately, outside of the LDCM, to the best of our knowledge, it does not appear possible to obtain DR and MR inferences for DCMs.

The above analysis requires evaluation of the integral defining $\Pi_r$. Thus, let

$$Q_r(\varepsilon) = \prod_{s \neq r} F_\varepsilon\{\Delta\mu_{1s}(L_{(s)}) - \Delta\mu_{1r}(L_{(r)}) + \varepsilon\}.$$

A reliable approximation of $\Pi_r = \int Q_r(\varepsilon) f_\varepsilon(\varepsilon)\,d\varepsilon$ can be obtained numerically by Gauss-Hermite quadrature (Liu and Pierce, 1994). For instance, suppose that $f_\varepsilon$ is standard normal; then the approximate Gaussian discrete choice model is given by $\Pi_r \approx \sum_{m=1}^{M} Q_r(\varepsilon_m) w_m$, where the nodes $\varepsilon_m$ are the zeros of the $M$th-order Hermite polynomial and $w_m$ are suitably defined weights (Davis & Rabinowitz, 1975).
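The quadrature step can be sketched in Python as follows, under stated assumptions. The function name `gaussian_dcm_probabilities` is hypothetical. The argument of $F_\varepsilon$ is read as Δμ1s − Δμ1r + ε, both $F_\varepsilon$ and $f_\varepsilon$ are taken to be standard normal, and the nodes and weights come from NumPy's probabilists' Gauss-Hermite rule (`hermegauss`), rescaled by 1/√(2π) so that the weights integrate against the standard normal density. This is one convenient choice of "suitably defined weights", not necessarily the one used by the authors.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Standard normal CDF, vectorised over arrays via math.erf.
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def gaussian_dcm_probabilities(delta_mu, n_nodes=30):
    """Approximate Pi_r = int Q_r(eps) phi(eps) d eps for every pattern r.

    delta_mu : values of Delta mu_{1r} for one subject, one entry per pattern
               (the entry for r = 1 is 0 by construction)
    n_nodes  : number M of quadrature nodes
    """
    delta_mu = np.asarray(delta_mu, dtype=float)
    nodes, weights = hermegauss(n_nodes)           # weight function exp(-x^2 / 2)
    weights = weights / math.sqrt(2.0 * math.pi)   # now integrates against N(0, 1)
    probs = np.empty(delta_mu.size)
    for r in range(delta_mu.size):
        others = np.delete(delta_mu, r)
        # Q_r(eps_m) = prod_{s != r} F(Delta mu_1s - Delta mu_1r + eps_m)
        q = np.prod(_norm_cdf(others[:, None] - delta_mu[r] + nodes[None, :]), axis=0)
        probs[r] = np.dot(q, weights)
    return probs
```

With two patterns and `delta_mu = [0.0, 1.0]`, the first probability has the closed form Φ(1/√2), which gives a simple check on the quadrature; with all entries equal, the patterns are exchangeable and each probability is 1 divided by the number of patterns.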

6. Conclusion

In this paper, we have described the DCM as an all-purpose, flexible and easy-to-implement general class of models for nonmonotone nonignorable nonresponse. The LDCM has several advantages including giving rise to four distinct strategies for inference: IPW, PM, DR and MR estimation. Simulation studies and an application suggest good finite sample performance of IPW, PM and DR estimation; although not directly evaluated, we expect the same to apply to MR estimation.

Identification conditions such as CCMV are not empirically testable; it is therefore important that inferences are assessed for sensitivity to violations of such assumptions. An approach to sensitivity analysis for violations of the CCMV restriction is outlined in the Supplemental Appendix.

Supplementary Material

1

References

  1. Albert PS (2000). A transitional model for longitudinal binary data subject to nonignorable missing data. Biometrics 56(2), 602–608.
  2. Andridge RR and Little RJA (2010). A review of hot deck imputation for survey non-response. International Statistical Review 78(1), 40–64.
  3. Chen HY (2007). A semiparametric odds ratio model for measuring association. Biometrics 63(2), 413–421.
  4. Chen JY, Ribaudo HJ, Souda S, Parekh N, Ogwu A, Lockman S, Powis K, Dryden-Peterson S, Creek T, Jimbo W, Madidimalo T, Makhema J, Essex M and Shapiro RL (2012). Highly active antiretroviral therapy and adverse birth outcomes among HIV-infected women in Botswana. The Journal of Infectious Diseases 206(11), 1695–1705.
  5. Davis PJ and Rabinowitz P (1975). Methods of Numerical Integration. New York: Academic Press.
  6. Deltour I, Richardson S and Le Hesran JY (1999). Stochastic algorithms for Markov models estimation with intermittent missing data. Biometrics 55(2), 565–573.
  7. Dempster AP, Laird NM and Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1–38.
  8. Fairclough DL, Peterson HF, Cella D and Bonomi P (1998). Comparison of several model-based methods for analysing incomplete quality of life data in cancer clinical trials. Statistics in Medicine 17(5–7), 781–796.
  9. Horton NJ and Laird NM (1999). Maximum likelihood analysis of generalized linear models with missing covariates. Statistical Methods in Medical Research 8(1), 37–50.
  10. Horton NJ and Lipsitz SR (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician 55(3), 244–254.
  11. Horvitz D and Thompson D (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260), 663–685.
  12. Ibrahim JG and Chen MH (2000). Power prior distributions for regression models. Statistical Science 15(1), 46–60.
  13. Ibrahim JG, Chen MH and Lipsitz SR (2001). Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika 88(2), 551–564.
  14. Ibrahim JG, Chen MH and Lipsitz SR (2002). Bayesian methods for generalized linear models with covariates missing at random. Canadian Journal of Statistics 30(1), 55–78.
  15. Ibrahim JG, Chen MH, Lipsitz SR and Herring AH (2005). Missing-data methods for generalized linear models: A comparative review. Journal of the American Statistical Association 100(469), 332–346.
  16. Little RJ (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association 88(421), 125–134.
  17. Little RJ and Rubin DB (2002). Statistical Analysis with Missing Data. Wiley.
  18. Liu Q and Pierce DA (1994). A note on Gauss-Hermite quadrature. Biometrika 81(3), 624–629.
  19. McFadden DL (1984). Econometric analysis of qualitative response models. Handbook of Econometrics, Volume II, Chapter 24. Elsevier Science Publishers BV.
  20. Miao W, Ding P and Geng Z (2016). Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association 111(516), 1673–1683.
  21. Richardson TS, Robins JM and Wang L (2016). On Modeling and Estimation for the Relative Risk and Risk Difference. Journal of the American Statistical Association (just-accepted).
  22. Robins JM, Rotnitzky A and Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.
  23. Robins JM and Gill RD (1997). Non-response models for the analysis of non-monotone ignorable missing data. Statistics in Medicine 16, 39–56.
  24. Robins JM (1997). Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine 16, 21–37.
  25. Robins JM and Ritov Y (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine 16, 285–319.
  26. Robins JM, Rotnitzky A and Scharfstein D (1999). Sensitivity Analysis for Selection Bias and Unmeasured Confounding in Missing Data and Causal Inference Models. In: Statistical Models in Epidemiology: The Environment and Clinical Trials, Halloran ME and Berry D, eds. IMA Volume 116, NY: Springer-Verlag, pp. 1–92.
  27. Robins JM and Rotnitzky A (2001). Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer”. Statistica Sinica 11(4), 920–936.
  28. Rubin DB (1976). Inference and missing data. Biometrika 63(3), 581–592.
  29. Rubin DB (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association 72, 538–543.
  30. Schafer J (1997). Analysis of Incomplete Multivariate Data. Chapman and Hall.
  31. Sun BL and Tchetgen Tchetgen EJ (2016). On Inverse Probability Weighting for Non-monotone Missing at Random Data. Journal of the American Statistical Association. Advance online publication. doi: 10.1080/01621459.2016.1256814
  32. Tchetgen Tchetgen EJ, Robins JM and Rotnitzky A (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika 97(1), 171–180.
  33. Train K (2009). Discrete Choice Methods with Simulation. Cambridge University Press.
  34. Troxel AB, Harrington DP and Lipsitz SR (1998). Analysis of longitudinal data with non-ignorable non-monotone missing values. Journal of the Royal Statistical Society: Series C (Applied Statistics) 47(3), 425–438.
  35. Troxel AB, Lipsitz SR and Harrington DP (1998). Marginal models for the analysis of longitudinal measurements with nonignorable non-monotone missing data. Biometrika 85(3), 661–672.
  36. Tsiatis A (2006). Semiparametric Theory and Missing Data. Springer.
  37. van der Vaart A (1998). Asymptotic Statistics. Cambridge University Press.
  38. Vansteelandt S, Rotnitzky A and Robins JM (2007). Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse. Biometrika 94(4), 841–860.
  39. Zhou Y, Little RJ and Kalbfleisch JD (2010). Block-conditional missing at random models for missing data. Statistical Science 25(4), 517–532.
