Summary
We propose a new class of models for making inference about the mean of a vector of repeated outcomes when the outcome vector is incompletely observed in some study units and missingness is nonmonotone. Each model in our class is indexed by a set of unidentified selection bias functions which quantify the residual association of the outcome at each occasion t and the probability that this outcome is missing after adjusting for variables observed prior to time t and for the past nonresponse pattern. In particular, selection bias functions equal to zero encode the investigator’s a priori belief that nonresponse of the next outcome does not depend on that outcome after adjusting for the observed past. We call this assumption sequential explainability. Since each model in our class is nonparametric, it fits the data perfectly well. As such, our models are ideal for conducting sensitivity analyses aimed at evaluating the impact that different degrees of departure from sequential explainability have on inference about the marginal means of interest. Although the marginal means are identified under each of our models, their estimation is not feasible in practice because it requires the auxiliary estimation of conditional expectations and probabilities given high-dimensional variables. We henceforth discuss estimation of the marginal means under each model in our class assuming, additionally, that at each occasion either one of the following two models holds: a parametric model for the conditional probability of nonresponse given current outcomes and past recorded data, or a parametric model for the conditional mean of the outcome on the nonrespondents given the past recorded data. We call the resulting procedure 2^T-multiply robust as it protects at each of the T time points against misspecification of one of these two working models, although not against simultaneous misspecification of both.
We extend our proposed class of models and estimators to incorporate data configurations which include baseline covariates and a parametric model for the conditional mean of the vector of repeated outcomes given the baseline covariates.
Some key words: Double robustness, Generalized estimating equation, Intermittent missingness, Longitudinal study, Missing at random, Semiparametric inference
1 Introduction
Consider a follow-up study whose design prescribes measurements of an outcome of interest to be taken on n independent subjects at fixed time-points. The goal of the study is to make inference about the mean outcome vector possibly as a function of baseline covariates and time. The intended vector of outcomes is often not completely recorded because some subjects miss some study cycles. When, as usual, the mechanism leading to these outcomes being missing is unknown to the investigator, the expected outcome at each time-point is not identified from the observed data. Inference must then rely on unverifiable assumptions about the missing data.
The problem of missing data in follow-up studies has received much attention in the statistical literature, but most emphasis has been given to settings where the missing pattern is monotone, in which no subject returns to subsequent study cycles after missing previous cycles. Robins et al. (1999) described a model for monotone missing data patterns which requires the a priori specification of a selection bias parameter that encodes the residual association between the outcome vector and missingness at each occasion after adjusting for past recorded data. They showed that, regardless of the value of the selection bias parameter, the model is nonparametric (just) identified as it imposes no restriction on the observed data distribution and yet identifies the mean of the repeated outcomes. Since all values of the selection bias parameter determine the same model for the observed data distribution, the selection bias parameter is not identified. Robins et al. (1999) therefore recommended conducting inference about the mean of the outcome vector by repeating the estimation under different plausible values for the selection bias parameter as a form of sensitivity analysis. The goal of this paper is to extend the work of Robins et al. (1999) to the case in which the outcome vector is incompletely observed in some study units and missingness is nonmonotone.
Our interest in nonparametric identified models is motivated by the fact that other models fail to distinguish (i) the nonidentifiable, i.e. untestable, restrictions on the missing data process necessary to identify the full-data parameter of interest from (ii) additional identifiable restrictions that serve to increase the efficiency of estimation. The distinction between (i) and (ii) is not only conceptually important but can also be practically important. For example, when one has available a nonparametric identified model, one can first fit the model to the data. If the resulting uncertainty concerning the functionals of interest, such as the marginal means of the outcomes, is too large to be of substantive use, as measured for instance by the volume of a 95% confidence region, then, as in any inferential problem, to reduce uncertainty one can choose between fitting a nested submodel and refitting the same model after collecting data on additional subjects, where possible, to increase the sample size. Clearly, when logistically and financially feasible, the second option is to be preferred. Thus, fitting a model that is not nonparametrically identified is tantamount to supplementing additional modelling restrictions for the unavailable additional data. Furthermore, follow-up studies routinely collect high-dimensional data and models that are not nonparametrically identified require assumptions to be imposed on the mechanism generating these high-dimensional data. However, specification of realistic models is difficult, if not impossible. Nonparametric identified models meet the challenge posed by high-dimensional data because they only make assumptions about the missing data mechanism, thereby reducing the possibility of model misspecification.
Several available methods for the analysis of nonmonotone missing data assume that the data are missing at random (Laird & Ware, 1982; Shah et al., 1997; Andersson & Perlman, 2001; Fairclough et al., 1998; Little & Rubin, 1987; Troxel, Fairclough, Curran & Hahn, 1998). Although the missing at random assumption enables a fairly straightforward likelihood-based analysis without needing to model the missing process (Little & Rubin, 1987), we will argue in §3.1 that this assumption is rarely realistic for nonmonotone missing data. A recent model discussed by Lin et al. (2003) and van der Laan & Robins (2003, Ch. 6) relies on a more plausible assumption about the missingness process which nonetheless assumes no selection on unobservables for the marginal distribution of the responses. In §3.4 we show that it can be viewed as a special case of the model presented in this paper, in which selection bias is absent.
Several proposals also exist for nonmonotone missing data where selection depends on unobservables. With the exception of the selection bias permutation missingness model of Robins et al. (1999), none of the available models is nonparametric identified. The currently available models rely on parametric assumptions for both the full data and the missing data mechanisms (Deltour et al., 1999; Albert, 2000; Ibrahim et al., 2001; Fairclough et al., 1998; Troxel, Fairclough, Curran & Hahn, 1998; Troxel, Lipsitz & Harrington, 1998) or on parametric assumptions for just the missing data process (Rotnitzky et al., 1998; Robins et al., 1995). The selection bias permutation missingness model generalizes the permutation missingness model of Robins (1997) and the sequential coarsening model of Gill & Robins (1997). This model differs from but is related to the nonparametric identified model we propose in the current paper. The two models are compared in §3.5.
2 The formal setting
Consider a longitudinal study design that calls for measurements on a vector of variables Lit to be recorded at study cycles t = 0, …, T, for the ith of n independent subjects. The vector Lit, t = 1, …, T, includes an outcome of interest Yit as well as other variables Vit recorded for secondary analyses. The vector Li0 may include, in addition to Yi0 and Vi0, a baseline covariate vector Xi of interest.
Suppose that Li = (Li0, …, LiT) is not always fully recorded because some subjects miss some study cycles. In particular, for each t, Lit is either completely observed or completely missing and Li0 is always observed. Thus, for each t, the observed data for subject i is the vector Oit = (Rit,c (Rit,Lit)), where Rit is a response indicator which equals 1 if Lit is observed and is 0 otherwise, and where, for any random vector W, c (1, W) = W and c (0, W) is set to zero by convention. Under our setting, the observed data Oi = (Li0, Oi1, …, OiT), i = 1, …, n, can be regarded as n independent realizations of the random vector O = (O0, O1, …, OT). Here and throughout O0 denotes L0. Furthermore, for any vector Z = (Z0, …, ZT), Z̅t denotes the history (Z0, …, Zt) up to and including cycle t and Z(t) denotes the vector (Zt, …, ZT). Throughout we assume for each t that Yt is either a continuous or discrete random variable; for any random vector W, f (Yt|W) denotes a fixed version of the conditional density of Yt given W with respect to either Lebesgue measure or a counting measure and pr(Rt = 1|W) denotes a fixed version of the conditional probability that Rt equals 1 given W.
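As a small illustration of this notation, the sketch below builds O_t = (R_t, c(R_t, L_t)) for one subject and extracts the history Ō_t. The function names (`coarsen`, `observed_data`, `history`) and the toy numbers are hypothetical and serve only to make the convention c(0, W) = 0 concrete.

```python
import numpy as np

def coarsen(r, w):
    """c(1, W) = W; c(0, W) is set to zero by convention."""
    w = np.asarray(w, dtype=float)
    return w if r == 1 else np.zeros_like(w)

def observed_data(L, R):
    """Return [L_0, (R_1, c(R_1, L_1)), ..., (R_T, c(R_T, L_T))].

    L is a list of T + 1 vectors L_0, ..., L_T and R is a list of the T
    response indicators R_1, ..., R_T (L_0 is always observed).
    """
    O = [np.asarray(L[0], dtype=float)]
    for t in range(1, len(L)):
        O.append((R[t - 1], coarsen(R[t - 1], L[t])))
    return O

def history(O, t):
    """The history (O_0, ..., O_t) up to and including cycle t."""
    return O[: t + 1]

# Toy subject with T = 3 and L_t = (Y_t, V_t); cycle 2 is missed.
L = [[1.0, 0.2], [1.5, 0.1], [2.0, 0.4], [2.5, 0.3]]
R = [1, 0, 1]  # nonmonotone: the subject returns at cycle 3
O = observed_data(L, R)
```

Here `O[2]` carries the indicator 0 together with a zero vector, whereas `O[3]` carries the recorded values again, which is the nonmonotone pattern the paper is concerned with.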
We assume that the nonresponse patterns are nonmonotone so that Rt = 0 does not imply that Rt+1 = 0. We additionally assume that no recorded past O̅t−1 and no current outcome Yt can prevent the possibility of returning to study cycle t; that is,
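$$\mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1}, Y_t) > 0 \quad \text{with probability 1, for } t = 1, \ldots, T. \tag{1}$$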
Note that (1) would not hold in a study design where patients are withdrawn when they miss, say, four consecutive visits (Zeuzem et al., 2000) or when all individuals with extreme values of Yt are so physically impaired that clinic attendance at visit t is impossible. The methods we propose here are not applicable in such cases.
Condition (1) was also assumed in Lin et al. (2003) and van der Laan & Robins (2003, Ch. 6) except that visits could be in continuous rather than in discrete times. These authors obtained identifiability by imposing the additional assumption of sequential explainability,
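$$\mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1}, Y_t) = \mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1}), \quad t = 1, \ldots, T, \tag{2}$$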
which we discuss in detail in §3.4. A conflict as to the number and type of variables to include in the components Vt of the full data vector Lt at each cycle t arises when one wishes to impose both assumptions (1) and (2): to make sequential explainability (2) plausible one would generally wish to choose Vt−1 to be high-dimensional; however, the positivity assumption (1) may be unrealistic for high-dimensional Vt−1, as Vt−1 may then well include covariates, such as for example the subject’s state of consciousness, certain values of which, e.g. being unconscious, preclude the possibility of being observed at occasion t. Since it will often be unrealistic to impose both (2) and (1), we propose to relax the assumption of sequential explainability: we will describe methods for conducting inference about the marginal mean E (Yt), t = 1, …, T, and the conditional mean given baseline covariates E (Yt|X), t = 1, …, T, when (1) holds but (2) may fail.
3 Identifying assumptions
3.1 Preamble
Unless Rt = 1 with probability 1, the observed data O identify neither the distribution of Yt nor its conditional distribution given X because, as Theorem 1 below implies, many distinct conditional laws of Yt given O̅t−1 are compatible with the observed data law. To identify these distributions we must make unverifiable assumptions. We now review one popular such assumption, that the data are missing at random, and argue that it represents processes that are unlikely to generate nonmonotone missing data patterns in longitudinal studies. We then propose a class of unverifiable assumptions that are naturally suited to conducting sensitivity analysis of the investigator’s a priori belief about the process that generates the intermittent nonresponse in a follow-up study. In subsequent sections we discuss inference under any such assumption.
3.2 Missing at random
Robins & Rotnitzky (1992) and Gill et al. (1997) showed that the distribution of the full data vector L is identified if the data are missing at random, provided that there is a positive conditional probability of observing the full data, i.e. pr(R̅T = 1*|L) > 0 with probability 1, where 1* denotes the T × 1 vector of ones. The missing at random assumption states that
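$$\mathrm{pr}(\bar{R}_T = \bar{r}_T \mid L) = \mathrm{pr}\{\bar{R}_T = \bar{r}_T \mid L_{(\bar{r}_T)}\} \quad \text{for all } \bar{r}_T, \tag{3}$$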
where L(r̅T) denotes the observed part of L when R̅T = r̅T.
Under any model in which pr(R̅T|L) and f (L) are variation independent and the missing at random condition (3) is imposed, the likelihood factorizes into a part that depends on f (L) and another that depends on pr(R̅T|L). Any method that obeys the likelihood principle then yields the same inference whether pr(R̅T|L) is fully known, unknown or known to follow a model. As a result of this, many authors have proposed analyzing nonmonotone missing data using likelihood-based methods under models that assume missing at random. However, convenient as the missing at random assumption may be, the assumption should only be adopted if it is believed plausible. Following Robins & Gill (1997) we will now argue that missing at random mechanisms that could plausibly generate the observed data in follow-up studies with nonmonotone nonresponse are quite restrictive and would rarely be plausible.
Robins & Gill (1997) showed that the set of missingness probabilities pr(R̅T|L) that satisfy missing at random can be divided into two disjoint subsets. The first set contains processes in which the observed data are generated as follows. The variable L0 is always observed. Then, with probability p00 possibly depending on L0, no further variable is observed. Otherwise one selects which of L1, …, LT to observe next by flipping a T-sided coin with probabilities p01, …, p0T that may depend on L0. One then observes nothing else with probability p10 that may depend on L0 and the most recently observed Lt. Otherwise one selects which of the T − 1 still unobserved Lt’s to observe next with probabilities p11, …, p1(T−1) that may depend on the already observed Lt’s, and so on. The second subset contains all remaining missingness processes that satisfy (3). Robins & Gill (1997) showed that the second subset is not empty. They also showed that to generate the observed data O according to a missingness mechanism in the second set it is required, in the course of the data-generation procedure, to use information about the components of L that are not in L(R̅T), and thus are ultimately missing, in a subtle and often highly contrived manner to ensure that missing at random holds. In agreement with the discussion in Robins & Gill (1997), we believe that such missing at random missingness mechanisms are often unrealistic. Consequently, the most reasonable missing at random processes with nonmonotone data are in the first set. Moreover, of the processes in the first set, the only plausible choices for longitudinal data are those in which, with conditional probability equal to 1, the next variable to be observed comes later in time than any variable already observed, as decisions today cannot affect attendance in the past. 
However, even these are rather unlikely processes when missingness is nonmonotone because they effectively imply for example that, if a patient chooses today to miss his next two visits and then to return, he will not reassess this decision based on evolving time-dependent covariates associated with the response. This is unlikely, as often the decision to miss a given study cycle is influenced by aspects of the subject’s health and psychological status that evolved during earlier missed study cycles.
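The first subset of missing at random processes described above can be simulated directly. The following is a minimal sketch for one subject, assuming scalar L_t for simplicity; the function names and the particular stopping and selection rules are hypothetical. The key feature is that every probability depends only on values already observed, and `forward_only` imposes the restriction that the next variable observed comes later in time.

```python
import numpy as np

def rmm_pattern(L, stop_prob, pick_probs, rng):
    """Draw (R_1, ..., R_T) from a process in the first subset.

    L          : array (L_0, ..., L_T); L_0 is always observed
    stop_prob  : function(observed) -> probability of observing nothing more
    pick_probs : function(observed, remaining) -> nonnegative weights for
                 choosing which still-unobserved cycle to observe next
    Both functions see only the already observed values.
    """
    T = len(L) - 1
    observed = {0: L[0]}
    remaining = list(range(1, T + 1))
    while remaining:
        if rng.random() < stop_prob(observed):
            break
        p = np.asarray(pick_probs(observed, remaining), dtype=float)
        if p.sum() == 0.0:          # no admissible cycle is left
            break
        t = remaining[rng.choice(len(remaining), p=p / p.sum())]
        observed[t] = L[t]
        remaining.remove(t)
    return [1 if t in observed else 0 for t in range(1, T + 1)]

def stop_prob(observed):
    """Hypothetical rule: stopping depends on the most recent observed value."""
    last = observed[max(observed)]
    return 1.0 / (1.0 + np.exp(2.0 - 0.5 * last))

def forward_only(observed, remaining):
    """Equal weights on cycles later than anything already observed."""
    latest = max(observed)
    return [1.0 if t > latest else 0.0 for t in remaining]

rng = np.random.default_rng(0)
L = np.array([0.3, 1.0, -0.5, 2.0])          # toy values, T = 3
pattern = rmm_pattern(L, stop_prob, forward_only, rng)
```

Under `forward_only`, a skipped cycle can never be revisited and the skip is decided using only what was seen before it, which is exactly the feature the argument above finds implausible for intermittent nonresponse.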
3.3 Occasion-specific tilted models
Part (ii) of Theorem 1 below establishes that, when (1) holds for a fixed t, t = 1, …, T, the distributions f (Yt|X) and f (Yt) are identified under the following Assumption 1 which postulates that, among subjects with a given observed past O̅t−1, the distribution of Yt in the nonresponders at cycle t is equal to the distribution of Yt in the responders at cycle t tilted by a known function.
Assumption 1. If
$$\mathrm{pr}(R_t = 0 \mid \bar{O}_{t-1}) > 0, \tag{4}$$
then
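$$f(Y_t \mid \bar{O}_{t-1}, R_t = 0) = \frac{f(Y_t \mid \bar{O}_{t-1}, R_t = 1) \exp\{q_t(\bar{O}_{t-1}, Y_t)\}}{E[\exp\{q_t(\bar{O}_{t-1}, Y_t)\} \mid \bar{O}_{t-1}, R_t = 1]} \tag{5}$$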
for some user-specified, i.e. known, function qt(O̅t−1, Yt).
For ease of reference in the forthcoming discussion, we use 𝒜t (q) (𝒜(q)) to denote the model for the full data (L,R̅T) defined by (1) and Assumption 1 for a fixed t (for all t = 1, …, T). Note that 𝒜(q) is the intersection of models 𝒜t (q), t = 1, …, T.
By Bayes rule, Assumption 1 is equivalent to
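$$\mathrm{pr}(R_t = 0 \mid \bar{O}_{t-1}, Y_t) = \mathrm{expit}\{h_t(\bar{O}_{t-1}) + q_t(\bar{O}_{t-1}, Y_t)\} \tag{6}$$
whenever f(Yt|O̅t−1, Rt = 1) > 0, where expit(․) = exp(․)/{1 + exp(․)} and
$$h_t(\bar{O}_{t-1}) = \log\left[\frac{\mathrm{pr}(R_t = 0 \mid \bar{O}_{t-1})}{\mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1})\, E[\exp\{q_t(\bar{O}_{t-1}, Y_t)\} \mid \bar{O}_{t-1}, R_t = 1]}\right]. \tag{7}$$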
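The Bayes rule step can be written out in a few lines. The derivation below is a sketch that assumes Assumption 1 takes the exponential tilt form, that is, the nonrespondents' density equals the respondents' density multiplied by exp{q_t} and renormalized; it gives the odds of nonresponse and hence the logistic form with intercept h_t.

```latex
\begin{align*}
\frac{\mathrm{pr}(R_t = 0 \mid \bar{O}_{t-1}, Y_t)}{\mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1}, Y_t)}
 &= \frac{f(Y_t \mid \bar{O}_{t-1}, R_t = 0)\,\mathrm{pr}(R_t = 0 \mid \bar{O}_{t-1})}
         {f(Y_t \mid \bar{O}_{t-1}, R_t = 1)\,\mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1})} \\
 &= \exp\{q_t(\bar{O}_{t-1}, Y_t)\}\,
    \frac{\mathrm{pr}(R_t = 0 \mid \bar{O}_{t-1})}
         {\mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1})\,
          E[\exp\{q_t(\bar{O}_{t-1}, Y_t)\} \mid \bar{O}_{t-1}, R_t = 1]} \\
 &= \exp\{h_t(\bar{O}_{t-1}) + q_t(\bar{O}_{t-1}, Y_t)\}.
\end{align*}
```

Taking logarithms shows that the log odds of nonresponse are h_t + q_t, so q_t alone carries the dependence on Y_t.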
From (6), we interpret each function qt(O̅t−1, Yt) as quantifying, on the logistic scale, the magnitude of the residual association between the missingness probability at cycle t and the possibly missing outcome Yt, after adjustment for the observed past O̅t−1. Thus, model 𝒜(q) encodes the investigator’s a priori belief of the degree to which, for each t the decision to return to study cycle t is influenced by prognostic factors for Yt other than those included in the observed past, O̅t−1. For example, the choice qt(O̅t−1, Yt) = (1 − Rt−1) λYt encodes the belief that, for those that did not miss the prior cycle t − 1, the recorded variables Lt−1 at the prior cycle together with the observed past O̅t−2 prior to cycle t − 1 are sufficient to explain missingness at cycle t, but, for those that missed cycle t − 1, the observed past O̅t−2 is not sufficient to explain missingness at cycle t. The choice qt(O̅t−1, Yt) = {(1 − Rt−1) λ1 + λ2} Yt additionally allows a residual dependence on the current outcome. In line with the terminology used in some of the missing data literature, we call qt(․) a selection bias function (Scharfstein et al., 1999).
Part (i) of Theorem 1 below establishes that model 𝒜(q), and therefore 𝒜t (q) for each t, places no restriction on the observed data law beyond the restriction that pr(Rt = 1|O̅t−1) > 0 for all t. As such, each choice of selection bias functions qt(O̅t−1, Yt), t = 1, …, T, fits the data perfectly and cannot be rejected by any statistical test. Since there will never be any evidence from the data that can help determine the functions qt(O̅t−1, Yt), the analyst should be reluctant to analyze the data solely under one choice of functions qt(O̅t−1, Yt). Instead, he should pose a range of plausible selection bias functions and, as a form of sensitivity analysis to his prior beliefs about the missingness mechanism, repeat the analysis under each choice of qt(O̅t−1, Yt). This raises the question of how to choose the selection bias functions in practice. We suggest that one chooses, as in the example above, a collection of simple selection bias functions indexed by one or two parameters that are to be varied in a sensitivity analysis. Ideally, the parameterization should satisfy the following properties: it is easily interpretable so that a plausible parameter range can be specified by subject matter experts; values of the parameters equal to zero correspond to the assumption of no selection bias for the outcomes; and nonparametric bounds are attained when the parameters go to ±∞.
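A sensitivity analysis of this kind can be sketched for a single occasion with no observed past, taking the one-parameter family q(Y) = λY. The code below is illustrative only: the function name, the simulated data and the grid of λ values are assumptions, not part of the paper. It uses the fact that, under the tilt assumption, the nonrespondent mean is the exponentially tilted respondent mean.

```python
import numpy as np

def tilted_mean_estimate(y, r, lam):
    """Estimate E(Y) when q(Y) = lam * Y and there is no observed past.

    Only outcomes with r == 1 are used. The nonrespondent mean is the
    exponentially tilted respondent mean, and the two means are combined
    in proportion to the observed response rate.
    """
    y_obs = y[r == 1]
    w = np.exp(lam * (y_obs - y_obs.max()))   # shift for numerical stability
    m = np.sum(w * y_obs) / np.sum(w)          # estimate of E(Y | R = 0)
    p1 = r.mean()
    return p1 * y_obs.mean() + (1.0 - p1) * m

# Simulated toy data; outcomes of nonrespondents are never used.
rng = np.random.default_rng(1)
y = rng.normal(size=2000)
r = rng.binomial(1, 1.0 / (1.0 + np.exp(-1.0 + 0.5 * y)))

lams = np.linspace(-1.0, 1.0, 9)               # hypothetical plausible range
estimates = np.array([tilted_mean_estimate(y, r, lam) for lam in lams])
```

At λ = 0 the estimate reduces to the respondent mean, and it increases with λ; as λ grows without bound the tilted mean approaches the largest observed outcome, which is the sense in which such a parameterization reaches the empirical bounds.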
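In the following theorem, proved in the Appendix, and throughout, if (4) holds we define
$$m_t(\bar{O}_{t-1}) \equiv \frac{E[Y_t \exp\{q_t(\bar{O}_{t-1}, Y_t)\} \mid \bar{O}_{t-1}, R_t = 1]}{E[\exp\{q_t(\bar{O}_{t-1}, Y_t)\} \mid \bar{O}_{t-1}, R_t = 1]},$$
which under Assumption 1 equals E (Yt|O̅t−1, Rt = 0). Furthermore, we define πt(O̅t−1, Yt) ≡ 1 if (4) does not hold and
$$\pi_t(\bar{O}_{t-1}, Y_t) \equiv [1 + \exp\{h_t(\bar{O}_{t-1}) + q_t(\bar{O}_{t-1}, Y_t)\}]^{-1}$$
otherwise, where ht(O̅t−1) satisfies (7).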
Theorem 1. (i) Model 𝒜(q) determines a model for the law of the observed data whose only restriction is pr(Rt = 1|O̅t−1) > 0 with probability 1, for t = 1, …, T.
(ii) The conditional density f(Yt|O̅t−1) is identified under model 𝒜t (q), and therefore under 𝒜(q). Furthermore, under models 𝒜t (q) and 𝒜(q), E (Yt) is equal to the following functional of the observed data distribution:
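$$E(Y_t) = E\left\{\frac{R_t Y_t}{\pi_t(\bar{O}_{t-1}, Y_t)}\right\} \tag{8}$$
$$\phantom{E(Y_t)} = E\{R_t Y_t + (1 - R_t)\, m_t(\bar{O}_{t-1})\}. \tag{9}$$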
Part (ii) of the Theorem implies the identifiability of the marginal density of Yt and its conditional distribution given X under model 𝒜t (q) and therefore also under model 𝒜(q). However, the theorem says nothing about the identifiability of the dependence among outcomes at different occasions. This is so because this dependence is generally not identified under model 𝒜(q). In particular, under model 𝒜(q) the correlations among the repeated outcomes are not identified for any choice of selection-bias function.
3.4 Sequential explainability
The choice qt = 0 postulates the conditional independence of Yt and Rt given O̅t−1. This assumption would hold if the observed past variables O̅t−1 included all the predictors of Yt that explain missingness at cycle t. We therefore refer to the assumption that qt = 0 for all t as the assumption of sequential explainability. Lin et al. (2003) and van der Laan & Robins (2003) consider such processes except that visits occur in continuous time.
The assumption that qt = 0 is less restrictive than the missing at random condition
$$\mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1}, L) = \mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1}) \tag{10}$$
for t = 1, …, T, because (10) implies (6) with qt = 0 but the opposite is false. For example, (10) postulates the conditional independence given O̅t−1 of Rt with the current components Lt, the future components L(t+1) and the components of L̅t−1 corresponding to missed cycles. However, (6) says nothing about the dependence of Rt on future components L(t+1) and the components of L̅t−1 corresponding to missed cycles. In fact, while assumption (10) imposes restrictions on the observed data distribution and hence is a testable assumption, this is not true for (6) by part (i) of Theorem 1. Note also that, under model 𝒜(q) with qt = 0 for all t, nonresponse is nonignorable for inference about f (Yt), t = 1, …, T, in the sense that likelihood-based methods do not result in the same inference if pr(Rt = 1|O̅t−1) is known, unknown or known to follow a model. This is because the likelihood of the observed data at each cycle t does not factorize into a part that depends on pr(Rt = 1|O̅t−1) and another that depends on f (Yt).
3.5 Relationship between model 𝒜(q) and the selection bias permutation missingness model
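Members of the class of selection bias permutation missingness models of Robins et al. (1999) are indexed by permutations of the visit subscripts 1, …, T. One such model is particularly appropriate for longitudinal studies and is defined by the assumptions that, for each t,
$$\mathrm{pr}(R_t = 1 \mid \bar{O}_{t-1}, Y_t, Y_{(t+1)}) > 0, \qquad f(Y_t \mid \bar{O}_{t-1}, Y_{(t+1)}, R_t = 0) \propto f(Y_t \mid \bar{O}_{t-1}, Y_{(t+1)}, R_t = 1) \exp\{q_t(\bar{O}_{t-1}, Y_t, Y_{(t+1)})\},$$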
with qt(O̅t−1, Yt, Y(t+1)) known. It differs from model 𝒜(q) only in that the future outcome history Y(t+1) is added to each conditioning event and to the function qt(O̅t−1, Yt, Y(t+1)). Robins et al. (1999) showed that for each qt, t = 1, …, T, this model places no restriction on the distribution of the observed data and identifies the joint distribution of Y̅T = (Y1, …, YT). Thus this model can be used instead of model 𝒜(q) when the substantive question at issue depends on the joint law of Y̅T rather than simply on the marginals Yt of Y̅T. However, the ability to make inferences about the joint law comes at a price as it is more difficult to model f(Yt|O̅t−1, Y(t+1), Rt = 0) than f(Yt|O̅t−1, Rt = 0) both because the conditioning event (O̅t−1, Y(t+1), Rt = 0) is of greater dimension than the event (O̅t−1, Rt = 0) and because it is less natural to model the law of Yt given the observed past O̅t−1 and the, possibly unobserved, future Y(t+1) than to model the law of Yt given only the observed past O̅t−1.
4 Estimation of the unconditional occasion-specific outcome means
4.1 Nonparametric inference under model 𝒜(q)
We now discuss inference about the marginal means under models that assume (6) for user-specified functions qt(O̅t−1, Yt), t = 1, …, T.
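Models 𝒜(q) and 𝒜t (q) define the same model, i.e. the nonparametric model, for the observed data distribution. Furthermore, as established in part (ii) of Theorem 1, under either model, βt* ≡ E (Yt) is the same, unique, functional of the observed data law. Thus, inference about βt* under either model is identical. In particular, the nonparametric maximum likelihood estimator β̂NPML,t of βt* under either model is obtained by calculating the expressions in the right-hand side of (8) or (9) under the empirical distribution of O, i.e.
$$\hat{\beta}_{NPML,t} = E_n\left(R_t Y_t \left[1 + \exp\{\hat{h}_{NPML,t}(\bar{O}_{t-1}) + q_t(\bar{O}_{t-1}, Y_t)\}\right]\right),$$
where ĥNPML,t(O̅t−1) = −∞ if En(Rt|O̅t−1) = 1 and otherwise
$$\hat{h}_{NPML,t}(\bar{O}_{t-1}) = \log\left[\frac{E_n(1 - R_t \mid \bar{O}_{t-1})}{E_n\{R_t \exp\{q_t(\bar{O}_{t-1}, Y_t)\} \mid \bar{O}_{t-1}\}}\right],$$
and where for random variables W and Z, En(W|Z = z) = Σi Wi I(Zi = z)/Σi I(Zi = z).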
Unfortunately, unless T is small and Lt is discrete with few levels, with the sample sizes found in practice the data available for estimating the required conditional expectations will be sparse and consequently the estimator β̂NPML,t will be undefined. One could assume that the required conditional expectations are smooth in O̅t−1 and use multivariate smoothing techniques to estimate them. However, when O̅t−1 is high-dimensional, they would not be well estimated with moderate sample sizes because no two units would have values of O̅t−1 close enough to allow the borrowing of information needed for smoothing. Thus, in practice, because of the curse of dimensionality, we are forced to place more stringent dimension-reducing modelling restrictions on the law of the observed data.
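When the observed past is discrete with few levels, the plug-in calculation is simple to carry out. The sketch below is an illustration under stated assumptions: a single occasion, a three-level observed past x, the selection bias function q(x, y) = 0.5y, and hypothetical function and variable names. It evaluates both the weighting form and the regression form of the identifying functional within strata of x, and it returns nan as soon as a stratum contains no respondents, which is the sparsity problem described above.

```python
import numpy as np

def npml_mean(x, y, r, q):
    """Plug-in estimate of E(Y) with a discrete observed past x.

    x : discrete observed past (one value per subject)
    y : outcome, used only where r == 1
    r : response indicator
    q : function(x, y) -> values of the selection bias function
    Returns the estimates from the weighting form and the regression
    form; both are nan if some stratum of x has no respondents.
    """
    n = len(r)
    ipw_total = 0.0
    reg_total = 0.0
    for level in np.unique(x):
        s = x == level
        resp = s & (r == 1)
        n_miss = np.sum(s & (r == 0))
        if not resp.any():
            return np.nan, np.nan            # sparse stratum: undefined
        tilt = np.exp(q(x[resp], y[resp]))
        # exp(h) = En(1 - R | x) / En(R exp(q) | x); zero if all responded
        exp_h = n_miss / tilt.sum()
        m = np.sum(tilt * y[resp]) / tilt.sum()   # tilted respondent mean
        ipw_total += np.sum(y[resp] * (1.0 + exp_h * tilt))
        reg_total += np.sum(y[resp]) + n_miss * m
    return ipw_total / n, reg_total / n

# Simulated data in which the tilt model holds with q(x, y) = 0.5 * y.
rng = np.random.default_rng(0)
n = 20_000
x = rng.integers(0, 3, size=n)
y = x + rng.normal(size=n)
p_miss = 1.0 / (1.0 + np.exp(-(-1.0 + 0.3 * x + 0.5 * y)))
r = (rng.random(n) >= p_miss).astype(int)
q = lambda x_, y_: 0.5 * y_
ipw_est, reg_est = npml_mean(x, y, r, q)
```

The two returned values agree up to rounding because, in a saturated stratified analysis, the weighting and regression forms are algebraically the same plug-in quantity. With many cycles or continuous components in the past, the number of strata grows quickly and empty strata become unavoidable.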
Two dimension-reducing strategies are suggested by expressions (8) and (9) for βt*. The first strategy is to assume that the function ht(O̅t−1) follows a parametric model,
$$h_t(\bar{O}_{t-1}) = h_t(\bar{O}_{t-1}; \alpha_t^*), \tag{11}$$
where ht(O̅t−1; αt) is a known function smooth in αt, and αt* is an unknown pt,h × 1 parameter vector. The second strategy is to assume that mt (O̅t−1) follows a parametric model,
$$m_t(\bar{O}_{t-1}) = m_t(\bar{O}_{t-1}; \theta_t^*), \tag{12}$$
where mt (O̅t−1; θt) is a known function, smooth in θt, and θt* is an unknown pt,m × 1 parameter vector.
We use ℬt (q) (ℬ (q)) to denote the model for the full data (L,R̅T) defined by the assumptions of model 𝒜t (q) and the additional restrictions (4) and (11) specified just for a fixed t (for all t). Likewise, we use 𝒞t (q) (𝒞 (q)) to denote the model for the full data (L,R̅T) defined by the assumptions of model 𝒜t (q) and the additional restriction (12) specified just for a fixed t (for all t).
Models (11) and (12) are not in themselves of scientific interest. However, in practice we are forced to impose one of the two models. Estimation of βt* under model ℬt (q) is not entirely satisfactory because the resulting estimators of βt* can be biased if model (11) is incorrect, and estimation under 𝒞t (q) suffers from the same limitation if model (12) is incorrect. Luckily there is an alternative strategy for estimation of βt*. This consists of computing an estimator that is consistent and asymptotically normal in the union model ℬt (q) ∪ 𝒞t (q), i.e. an estimator of βt* that is consistent and asymptotically normal so long as one of the models ℬt (q) or 𝒞t (q), but not necessarily both, is correctly specified. Following Robins (2000) and Robins & Rotnitzky (2001) we call such an estimator a doubly robust estimator in the union model ℬt (q) ∪ 𝒞t (q) as it can protect against misspecification of either (11) or (12), although not against simultaneous misspecification of both. The following definition introduces a generalization of double robustness.
Definition. Given a collection {ℳu; u ∈ 𝕌} of models for a law F indexed by the elements of a finite set 𝕌 with K elements, we say that an estimator λ̂ of a parameter λ ≡ λ (F) is a K-multiply robust estimator in the union model ∪u∈𝕌ℳu if it is a consistent and asymptotically normal estimator of λ when one of the models ℳu, u ∈ 𝕌, but not necessarily more than one of them, holds.
4.2 Doubly and 2^T-multiply robust estimation
In this subsection we propose a doubly robust estimator of βt* in the union model ℬt (q) ∪ 𝒞t (q) and a multiply robust estimator of β* in the union model ∪u∈𝕌ℳu (q). Throughout, 𝕌 denotes the collection of T × 1 vectors u whose components are either 0 or 1, so that 𝕌 has 2^T elements. For each such vector u, ℳu (q) = {∩t:ut=0ℬt (q)} ∩ {∩t:ut=1𝒞t (q)}. Thus, a 2^T-multiply robust estimator of β* is consistent and asymptotically normal for β* so long as at each t one of the models ℬt(q) or 𝒞t(q), but not necessarily both, is correctly specified.
To construct the doubly and 2^T-multiply robust estimators we reason as follows. Suppose that we have specified working models (11) and (12). For any constant column vector dt and conformable column vector functions ϕt (O̅t−1) and ψt (O̅t−1), define
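$$H_t(d_t, \phi_t, \beta_t, \alpha_t) = d_t \frac{R_t\, \varepsilon_t(\beta_t)}{\pi_t(\bar{O}_{t-1}, Y_t; \alpha_t)} + \phi_t(\bar{O}_{t-1}) \left\{1 - \frac{R_t}{\pi_t(\bar{O}_{t-1}, Y_t; \alpha_t)}\right\}, \tag{13}$$
$$M_t(d_t, \psi_t, \beta_t, \theta_t) = d_t \left[R_t\, \varepsilon_t(\beta_t) + (1 - R_t)\{m_t(\bar{O}_{t-1}; \theta_t) - \beta_t\}\right] + \psi_t(\bar{O}_{t-1})\, R_t \exp\{q_t(\bar{O}_{t-1}, Y_t)\}\{Y_t - m_t(\bar{O}_{t-1}; \theta_t)\}, \tag{14}$$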
where εt(βt) = Yt − βt and πt(O̅t−1, Yt; αt) ≡ [1 + exp {ht(O̅t−1; αt) + qt(O̅t−1, Yt)}]⁻¹. In the Appendix we show that, provided we choose ϕt(O̅t−1) in (13) and ψt(O̅t−1) in (14) to have the specific functional forms given by
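$$\phi_{\theta_t, \beta_t, t}(\bar{O}_{t-1}) = m_t(\bar{O}_{t-1}; \theta_t) - \beta_t, \qquad \psi_{\alpha_t, t}(\bar{O}_{t-1}) = \exp\{h_t(\bar{O}_{t-1}; \alpha_t)\}, \tag{15}$$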
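for any fixed θt, αt and βt, Ht(1, ϕθt,βt,t, βt, αt) and Mt(1, ψαt,t, βt, θt) are identical, where 1 is a scalar constant function equal to 1. We therefore write, for short,
$$Q_t(\beta_t, \theta_t, \alpha_t) \equiv H_t(1, \phi_{\theta_t, \beta_t, t}, \beta_t, \alpha_t) = M_t(1, \psi_{\alpha_t, t}, \beta_t, \theta_t).$$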
In the Appendix we also show that if model (11) holds then, for any θt and regardless of whether or not model (12) holds, Ht(1, ϕθt,βt*,t, βt*, αt*) has mean zero and therefore so does Qt(βt*, θt, αt*). Furthermore, if model (12) holds then, for any αt and regardless of whether or not model (11) holds, Mt(1, ψαt,t, βt*, θt*) has mean zero and therefore so does Qt(βt*, θt*, αt). These results suggest that we can construct a doubly robust estimator β̂t of βt*, throughout also denoted by β̂t (ψt, ϕt), in model ℬt (q) ∪ 𝒞t (q) by solving the scalar estimating equation
$$E_n\{Q_t(\beta_t, \hat{\theta}_t, \hat{\alpha}_t)\} = 0, \tag{16}$$
where θ̂t and α̂t solve
$$E_n\{M_t(0, \psi_t, \beta_t, \theta_t)\} = 0 \quad \text{and} \quad E_n\{H_t(0, \phi_t, \beta_t, \alpha_t)\} = 0$$
using arbitrary pt,m × 1 and pt,h × 1 functions ψt(O̅t−1) and ϕt(O̅t−1) respectively. Theorem 2 below establishes that β̂t is doubly robust in model ℬt (q) ∪ 𝒞t (q) and that β̂ = (β̂1, …, β̂T)′, throughout also denoted by β̂ (ψ, ϕ), is 2^T-multiply robust in model ∪u∈𝕌ℳu (q).
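To make the construction concrete, the following sketch computes a doubly robust estimate at a single occasion with a scalar baseline covariate X as the observed past. Everything specific here is an assumption made for illustration: the estimating functions are taken to have the usual augmented inverse-probability-weighted form, namely En[ϕ{1 − R/π(α)}] = 0 for α, En[ψ R exp(q)(Y − m(θ))] = 0 for θ, and β̂ = En[RY/π̂ + m̂(1 − R/π̂)]; the working models are linear, h(X; α) = α0 + α1X and m(X; θ) = θ0 + θ1X, with ϕ = ψ = (1, X)′; the selection bias function is q = λY; and all function names are hypothetical.

```python
import numpy as np

def fit_alpha(Phi, y, r, lam, iters=100, tol=1e-10):
    """Solve En[phi {1 - R / pi(alpha)}] = 0, 1 / pi = 1 + exp(phi'alpha + lam*Y).

    The equation is the gradient of a concave function, so Newton steps with
    step-halving are used. The first column of Phi is assumed to be an intercept.
    """
    n = len(r)
    P1, y1 = Phi[r == 1], y[r == 1]
    b = Phi[r == 0].sum(axis=0) / n                      # En[(1 - R) phi]

    def objective(a):
        return b @ a - np.exp(P1 @ a + lam * y1).sum() / n

    a = np.zeros(Phi.shape[1])
    a[0] = np.log(np.sum(r == 0) / np.exp(lam * y1).sum())  # covariate-free start
    for _ in range(iters):
        e = np.exp(P1 @ a + lam * y1)
        grad = b - (P1 * e[:, None]).sum(axis=0) / n
        if np.max(np.abs(grad)) < tol:
            break
        hess = -(P1 * e[:, None]).T @ P1 / n
        step = np.linalg.solve(hess, -grad)
        s = 1.0
        while objective(a + s * step) < objective(a) and s > 1e-8:
            s /= 2.0
        a = a + s * step
    return a

def fit_theta(Psi, y, r, lam):
    """Solve En[psi R exp(lam*Y) {Y - psi'theta}] = 0 (weighted least squares)."""
    P1, y1 = Psi[r == 1], y[r == 1]
    w = np.exp(lam * y1)
    A = (P1 * w[:, None]).T
    return np.linalg.solve(A @ P1, A @ y1)

def dr_mean(X, y, r, lam):
    """Doubly robust estimate of E(Y) under the assumed linear working models."""
    Phi = np.column_stack([np.ones_like(X), X])
    alpha = fit_alpha(Phi, y, r, lam)
    theta = fit_theta(Phi, y, r, lam)
    y0 = np.where(r == 1, y, 0.0)          # missing outcomes are never used
    inv_pi = 1.0 + np.exp(Phi @ alpha + lam * y0)
    m = Phi @ theta
    return np.mean(r * y0 * inv_pi + m * (1.0 - r * inv_pi))

# Simulated data: the nonresponse model is correct, the mean model is not.
rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=n)
Y = 1.0 + X + rng.normal(size=n)
p_miss = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * X + 0.5 * Y)))
R = (rng.random(n) >= p_miss).astype(int)
beta_hat = dr_mean(X, Y, R, lam=0.5)
```

In this simulation the linear model for the nonrespondent mean is only an approximation, yet the estimate should remain close to the full-data mean because the nonresponse model is correctly specified, whereas the respondent mean is visibly biased. Repeating the call over a grid of `lam` values gives the sensitivity analysis.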
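To state the asymptotic properties of β̂t and β̂ in Theorem 2, we define
$$U_t(\beta_t, \theta_t, \alpha_t) = Q_t(\beta_t, \theta_t, \alpha_t) - I_{\theta_t, Q_t} I_{\theta_t, M_t}^{-1} M_t(0, \psi_t, \beta_t, \theta_t) - I_{\alpha_t, Q_t} I_{\alpha_t, H_t}^{-1} H_t(0, \phi_t, \beta_t, \alpha_t),$$
where for any random vector function K (τ, η) of a parameter (τ, η), Iτ,K (τ, η) denotes E {∂K (τ, η)/∂τ} and ∂K (τ, η)/∂τ is a derivative matrix with (i, j) entry equal to ∂Ki (τ, η)/∂τj. In what follows and throughout, we use a hat to indicate expectations calculated under the empirical distribution of the observed data and evaluation at (βt, θt, αt) = (β̂t, θ̂t, α̂t); for example,
$$\hat{U}_t = Q_t(\hat{\beta}_t, \hat{\theta}_t, \hat{\alpha}_t) - \hat{I}_{\theta_t, Q_t} \hat{I}_{\theta_t, M_t}^{-1} M_t(0, \psi_t, \hat{\beta}_t, \hat{\theta}_t) - \hat{I}_{\alpha_t, Q_t} \hat{I}_{\alpha_t, H_t}^{-1} H_t(0, \phi_t, \hat{\beta}_t, \hat{\alpha}_t)$$
and Û =(Û1, …, ÛT)′, where Îαt,Qt = En{∂Qt (β̂t, θ̂t, αt)/∂αt|αt=α̂t}, and so on.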
Parts (i) and (iii) of Theorem 2, proved in the Appendix, state the asymptotic distribution of β̂t and β̂ under models ℬt (q) ∪ 𝒞t (q) and ∪u∈𝕌ℳu (q), respectively. Parts (ii) and (iv) state that the asymptotic variance of β̂t and β̂ under these models remains the same regardless of the choice of the functions ψt and ϕt used to compute the estimators θ̂t and α̂t of θt* and αt*, when in fact the true data-generating process satisfies models ℬt (q) ∩ 𝒞t (q) and ∩u∈𝕌ℳu (q), respectively. In practice, the choice of functions ψt and ϕt should therefore have little impact on the efficiency of β̂t and β̂ when the models ℬt (q) and 𝒞t (q), t = 1, …, T, cannot be rejected using efficient goodness-of-fit tests. In what follows, for any matrix A, A^{⊗2} denotes AA′.
Theorem 2. Suppose that the regularity conditions stated in the Appendix hold.
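(i) Under model ℬt (q) ∪ 𝒞t (q), √n (β̂t − βt*) → N (0, Γt) in distribution, where
$$\Gamma_t = E\{U_t(\beta_t^*, \theta_{0t}, \alpha_{0t})^{\otimes 2}\}$$
and θ0t and α0t are the probability limits of θ̂t and α̂t. The matrix Γt can be consistently estimated with Γ̂t = En(Ût^{⊗2}).
(ii) Let (ψt(1), ϕt(1)) and (ψt(2), ϕt(2)) be two pairs of distinct pt,m × 1 and pt,h × 1 functions (ψt, ϕt) of O̅t−1. Then, under the intersection model ℬt (q) ∩ 𝒞t (q), β̂t (ψt(1), ϕt(1)) and β̂t (ψt(2), ϕt(2)) have the same asymptotic variance.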
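(iii) Under model ∪u∈𝕌ℳu (q), √n (β̂ − β*) → N (0, Γ) in distribution, where
$$\Gamma = E\{U(\beta^*, \theta_0, \alpha_0)^{\otimes 2}\}, \qquad U = (U_1, \ldots, U_T)',$$
and θ0 and α0 are the probability limits of θ̂ = (θ̂1, …, θ̂T)′ and α̂ = (α̂1, …, α̂T)′. The matrix Γ can be consistently estimated with Γ̂ = En(Û^{⊗2}).
(iv) Let (ψ(1), ϕ(1)) and (ψ(2), ϕ(2)) be two distinct sets of functions {(ψt (O̅t−1), ϕt (O̅t−1)), t = 1, …, T}. Then, under the intersection model ∩u∈𝕌ℳu (q), β̂ (ψ(1), ϕ(1)) and β̂ (ψ(2), ϕ(2)) have the same asymptotic variance.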
It is possible to show that, at the intersection model ℬt (q) ∩ 𝒞t (q), every doubly robust estimator β̂t (ψt, ϕt) has asymptotic variance that attains neither the semiparametric variance bound for estimation of βt* under model 𝒞t (q) nor, except when qt = 0, the semiparametric variance bound for estimation of βt* under model ℬt (q). In our opinion, the hope to control bias is more important than efficiency concerns, and we therefore recommend using doubly or 2^T-multiply robust estimators of βt* and β*, respectively.
So far, we have not allowed the parameters αt in model (11) and θt in model (12), respectively, to be shared across occasions. When we are faced with sample sizes that are not large enough to yield well-behaved estimators of at each occasion t, two dimension-reducing strategies can be envisaged. The first strategy is to reduce further the dimension of models (11) and (12) at each t. The second strategy is to allow parameters αt and θt, respectively, to be shared across occasions and to compute an estimator of β* that is consistent and asymptotically normal so long as at least one of holds. Denote by α and θ the resulting ph × 1 and pm × 1 parameter vectors indexing models (11) and (12) respectively, for all occasions t. Then such an estimator β̃ of β* can be obtained by solving estimating equations (16) at each occasion t, in which θ̂t ≡ θ̂ and α̂t ≡ α̂ now solve En {M(ψ, θ)} = 0 with and En {H(ϕ, α)} = 0 with , where ψ and ϕ are vectors of pm × 1 functions ψt(O̅t−1) and ph × 1 functions ϕt(O̅t−1), t = 1, …, T, respectively. Parts (iii) and (iv) of Theorem 2 continue to hold for the resulting estimator β̃ if we replace model ∪u∈𝕌ℳu (q) by , β̂ by β̃ and U (β, θ, α) by
5 Estimation of the occasion-specific conditional outcome means given baseline covariates
Suppose now that we are interested in inference about a parameter, which we denote again by β*, indexing a regression model for the conditional mean of Yt, t = 1, …, T, given baseline covariates X; that is, for t = 1, …, T,
E (Yt|X) = gt (X; β*), (17)
where gt(X; β) is a known function that is smooth in β and β* ∈ Θ ⊆ IRr is unknown. Denote by 𝒜* (q) the model for (L,R̅T) defined by the restrictions of model 𝒜(q) and the additional restriction (17) for all t = 1, …, T.
Part (i) of Theorem 1 is no longer true if model 𝒜(q) is replaced with model 𝒜* (q), i.e. 𝒜* (q) does not determine a nonparametric model for the observed data law. Hence, in principle, under (17) the postulated functions qt may sometimes be subject to an empirical test. Moreover, β* and qt, t = 1, …, T, may be jointly identified under (17). However, there would generally be very limited independent information about β* and qt, t = 1, …, T, and therefore their joint estimation would require very large sample sizes. In fact, it follows from Proposition B1, Part 6, of Rotnitzky et al. (1998) that, when pr (R1 = … = RT = 0|L) > σ > 0 with probability 1 and both ht and qt, t = 1, …, T, in (6) are unknown, β* cannot be estimated at rate √n. Thus, we continue to recommend that one regard the functions qt, t = 1, …, T, as fixed and known when estimating β* and then vary these functions in a sensitivity analysis.
As was the case for estimation of the marginal means E (Yt), unless T is small and Lt is discrete with few levels, inference about β* requires placing dimension-reducing assumptions on either ht or mt, in addition to the restrictions of model 𝒜* (q). We therefore consider, for each t = 1, …, T, models defined like models ℬt (q) and 𝒞t (q) respectively but with the additional restriction (17). Furthermore, we let be defined like ℳu (q) in §4.2 but with instead of ℬt (q) and 𝒞t (q) so that the union model stands for the model in which at each t either holds, but not necessarily both. In this section we consider estimation of β* under the union model .
Although, as shown in the Appendix, the restrictions defining 𝒞t (q) for each fixed t, and indeed simultaneously for all t, are guaranteed to be compatible, the same is not true for . To be specific, for each t, given a function qt the function mt (O̅t−1) and the conditional mean function E (Yt|X) are not variation independent; that is, fixing one restricts the range of possible functions for the other. Thus, it may happen that there exists no joint distribution of (L,R̅T) of which the marginal of O is the observed data distribution and that satisfies simultaneously (5), (12) and (17). Furthermore, even if such incompatibility is not present, it may still happen that the parameter space for β* under is much smaller than that under 𝒜* (q). This is clearly undesirable because any reasonable dimension-reducing strategy should not, a priori, eliminate values of β* regarded plausible under the model of scientific interest. The following simple example illustrates these points.
Example. Suppose that T = 1, L0 = X and Y1 is binary. Suppose that in (17) we assume that logit pr (Y1 = 1|X) = β0 + β1X, q1 (Y1, X, V) = λY1 with λ > 0 and logit pr (Y1 = 1|R1 = 0, X) = θ0 + θ1X. Under this model logit pr (Y1 = 1|R1 = 1, X) = λ + θ0 + θ1X and hence pr (Y1 = 1|R1 = 1, X) > pr (Y1 = 1|R1 = 0, X). Therefore, since
pr (Y1 = 1|X) = pr (Y1 = 1|R1 = 1, X) pr (R1 = 1|X) + pr (Y1 = 1|R1 = 0, X) pr (R1 = 0|X),
it must be that pr (Y1 = 1|R1 = 0, X) < pr (Y1 = 1|X) < pr (Y1 = 1|R1 = 1, X). This implies that logit pr (Y1 = 1|X) = λ* + θ0 + θ1X for some 0 < λ* < λ. In particular, β1 = θ1. It follows that the parameter space for β may be more restricted once we impose the restrictions on pr (Y1 = 1|R1 = 0, X). For example, if the model for pr (Y1 = 1|R1 = 0, X) restricts θ1 to lie in a strict subset of the real line and the model for pr (Y1 = 1|X) leaves β1 unrestricted, then, once the model for pr (Y1 = 1|R1 = 0, X) is imposed, the parameter space for β1 is reduced to that for θ1. The models would even become incompatible if a probit regression were considered for pr (Y1 = 1|X) and a logistic regression for pr (Y1 = 1|R1 = 0, X).
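The ordering of the three probabilities in the Example can be checked numerically. The following is a minimal sketch, not part of the original analysis: the values of λ, θ0, θ1 and the response probability `pi_resp` (standing in for pr (R1 = 1|X), taken constant here) are hypothetical choices made only for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def implied_shift(x, lam, theta0, theta1, pi_resp):
    """Return logit pr(Y1=1|X=x) minus (theta0 + theta1*x) in the Example."""
    p_nonresp = expit(theta0 + theta1 * x)         # pr(Y1=1|R1=0,X)
    p_resp = expit(lam + theta0 + theta1 * x)      # pr(Y1=1|R1=1,X)
    # Law of total probability: mix the two conditional laws.
    p_marg = pi_resp * p_resp + (1.0 - pi_resp) * p_nonresp
    return logit(p_marg) - (theta0 + theta1 * x)

# Hypothetical parameter values, chosen only for illustration.
x_grid = np.linspace(-3.0, 3.0, 13)
lam = 0.8
shift = implied_shift(x_grid, lam, theta0=-0.5, theta1=1.2, pi_resp=0.6)

# The marginal logit lies strictly between the two conditional logits.
inside = bool(np.all((shift > 0.0) & (shift < lam)))
print(inside)
```

At every grid point the computed shift lies strictly between 0 and λ, which is the inequality that drives the restriction on the parameter space for β discussed in the Example.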
In the Appendix we show that , and indeed , impose restrictions that are always compatible.
When model is incompatible with model (17) then an estimator of β* that is consistent under the union model actually converges in probability to β* only if the working model holds and hence the estimator is not really doubly robust. We do not regard this theoretical difficulty to be of concern in practice since it is ameliorated if one postulates a richly parameterized model (12). To see this note that, when no restriction is placed on mt (O̅t−1), model becomes model defined like 𝒜t (q) but with the additional restriction (17). As shown in the Appendix, the restrictions defining model are always compatible. Consequently, if (17) is correctly specified then a flexible model for mt (O̅t−1) should result in a nearly correctly specified model . In order to highlight the possibility of model incompatibility, we refer to estimators that are consistent and asymptotically normal under model as generalized 2T-multiply robust estimators.
In agreement with the discussion in Robins & Rotnitzky (2001), we recommend estimating β* with generalized 2T-multiply robust estimators because such estimators are expected to have small asymptotic bias if, at each t = 1, …, T, at least one of the models is approximately correct.
We construct generalized 2T-multiply robust estimators of β* in model as follows. Redefine Ht(dt, ϕt, βt, αt), Mt(dt, ψt, βt, θt) and ϕθt,βt,t(O̅t−1) as in (13), (14) and (15) but with εt(βt) replaced by εt(β) = Yt − gt(X; β), mt (O̅t−1; θt) − βt by mt (O̅t−1; θt) − gt (X; β) and with dt= dt(X) an arbitrary conformable vector function of X. With these redefinitions, it is true that Ht(dt, dtϕθt,β,t, β, αt) = Mt(dt, dtψαt,t, β, θt) for any arbitrary r × 1 vector function dt(X). We therefore write
Similarly to §4.2, we construct generalized 2T-multiply robust estimators β̂ of β*, also denoted throughout by β̂ (d, dθ, ψ, dα, ϕ), in model by solving an r × 1 estimating equation of the form
(18) |
where , d (X) = (d1 (X), …, dT (X)) for arbitrary r × 1 vector functions dt (X), t = 1, …, T, and where θ̃ = (θ̃1, …, θ̃T)′ and α̃ = (α̃1, …, α̃T)′ solve
respectively, for t = 1, …, T, using arbitrary collections of functions,
Parts (iii) and (iv) of Theorem 2 remain valid if we replace ∪u∈𝕌ℳu (q) by , β̂ (ψ(j), ϕ(j)) by , j = 1, 2, Qt (β, θt, αt) by Qt (dt, β, θt, αt) and we use the redefinition , where
with dt = dt(X), dtα = dtα(X) and dtθ = dtθ(X).
Theorem 3 below, proved in the Appendix, provides the optimal r × T matrix function dopt (X) = (d1,opt (X), …, dT,opt (X)) in the sense that, among all estimators β̂ (d, dθ, ψ, dα, ϕ) that solve (18) using an arbitrary r × T matrix function d (X) and fixed collections of functions dθ, ψ, dα and ϕ, the estimator with the smallest asymptotic variance is the one that uses d = dopt. In particular, since under laws in the intersection model the limiting distribution of β̂ (d, dθ, ψ, dα, ϕ) does not depend on the choice of dθ, ψ, dα and ϕ, we conclude that the estimator that solves (18) using d = dopt has the smallest asymptotic variance among all estimators β̂ (d, dθ, ψ, dα, ϕ) under any law in . In the following theorem, Γ (d, dθ, ψ, dα, ϕ) denotes the variance of the limiting normal distribution of √n {β̂ (d, dθ, ψ, dα, ϕ) − β*} under model , where each is defined as Ut (β, θt, αt) but with the r × 1 function dt(X) replaced by the constant real valued function dt (X)≡1. Also, θ0 and α0 denote the probability limits of θ̃ and α̃. In addition, for any pair of conformable square matrices A and B, A ≥ B indicates that A − B is positive semidefinite.
Theorem 3. For every fixed collection of functions dθ, ψ, dα and ϕ, we have that Γ (dopt, dθ, ψ, dα, ϕ) ≤ Γ (d, dθ, ψ, dα, ϕ), where
6 Simulation study
We conducted two simulation experiments. The first compares the behaviour in finite samples of the 2T-multiply robust estimators of marginal means with competitors that are not 2T-multiply robust, and the second evaluates the behaviour of generalized 2T-multiply robust estimators of parameters of regression models for the marginal means. Each experiment was based on 1000 replications of random samples of size 500 generated as follows. In both experiments, for t > 1, Lt comprised just the outcome Yt. In the first experiment L0 was standard normal and, for t = 1, …, 4, given (L̅t−1, R̅t−1), Rt was generated from
and then Lt was generated as Lt = 2t + 3L0 + Rt−1 + 2Rt−1εt−1 − γRt + εt for the choices γ = 0 and γ = −0.5, where ε1, …, ε4 are four independent standard normal variates. It is easy to check that the law of our simulated data satisfies the restrictions of ℬ (q) and 𝒞 (q) with qt(O̅t−1, Yt) = γYt, ht(O̅t−1; αt) = αt0 + αt1L0 + I(t > 1)(αt2Rt−1 + αt3Rt−1Yt−1) and mt(O̅t−1; θt) = θt0 + θt1L0 + I(t > 1)(θt2Rt−1 + θt3Rt−1Yt−1) for specific vector values αt and θt.
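The role of γ in this design can be illustrated with a single-occasion sketch. It assumes, as the design above suggests, that qt = γYt acts as an exponential tilt linking the law of the outcome among nonrespondents to that among respondents. The covariate-free setting, the response probability `p_resp` and the mean `mu` are hypothetical simplifications, and the estimator shown is a plain tilting estimator rather than the estimators of §4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
gamma_true = -0.5   # selection bias parameter used to generate the data
mu = 2.0            # hypothetical outcome mean among nonrespondents
p_resp = 0.7        # hypothetical response probability

r = rng.binomial(1, p_resp, size=n)               # r = 1: outcome observed
y = mu - gamma_true * r + rng.standard_normal(n)  # mimics Lt = ... - gamma*Rt + eps
true_mean = mu - gamma_true * p_resp

def tilted_mean(y_obs, gamma):
    """Mean of the respondents' empirical law tilted by exp(gamma * y)."""
    w = np.exp(gamma * (y_obs - y_obs.mean()))    # centring is for numerical stability
    return np.sum(w * y_obs) / np.sum(w)

def marginal_mean(y, r, gamma):
    """Estimate E(Y) using respondents' outcomes only, for an assumed gamma."""
    y_obs = y[r == 1]
    m_nonresp = tilted_mean(y_obs, gamma)         # implied mean among nonrespondents
    return r.mean() * y_obs.mean() + (1 - r.mean()) * m_nonresp

# Sensitivity analysis: vary the assumed gamma over a grid.
estimates = {g: marginal_mean(y, r, g) for g in (-1.0, -0.5, 0.0, 0.5)}
for g, est in estimates.items():
    print(f"gamma = {g:+.1f}: estimate = {est:.3f} (truth = {true_mean:.3f})")
```

When the assumed γ equals the value used to generate the data, the estimate is close to the true mean, while γ = 0 simply reproduces the respondents' mean. This is the sense in which varying the selection bias function constitutes a sensitivity analysis.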
In the second experiment, given a standard normal variate X, we generated a 4 × 1 multivariate normal vector (Y1, …, Y4) with E (Yt|X) = β1t + β2X, β1 = 2, β2 = 3, var(Yt|X) = 5 and cov(Yt, Yt′|X) = 4, t ≠ t′. Then we generated R̅4 given Y̅4 iteratively for t = 1, …, 4 from
where εt = Yt − β1t − β2X and γ = 0 or γ = −0.5. Under our data-generating process, model ℬ (q) holds for qt(O̅t−1, Yt) = γYt and ht(O̅t−1; αt) = αt0 + αt1X + I(t > 1)(αt2Rt−1 + αt3Rt−1Yt−1) for a specific value of αt. Since the functional form of mt(O̅t−1) is complicated we have considered an approximate working model for it, given by mt(O̅t−1; θ) = θt0 + θt1X + I(t > 1) {θt2Rt−1 + θt3Rt−1X + θt4Rt−1Yt−1}, and thus computed a generalized 2T-multiply robust estimator of (β1, β2) such that, at each t, model 𝒞t (q) assumes that mt (O̅t−1) = mt(O̅t−1; θ) for some θ.
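The outcome-generating step of the second experiment can be sketched as follows, using the values β1 = 2 and β2 = 3 and the conditional variance and covariance quoted in its description. The sketch covers only the generation of X and (Y1, …, Y4); the nonresponse mechanism is not reproduced, and the seed and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 500, 4
beta1, beta2 = 2.0, 3.0   # E(Yt|X) = beta1 * t + beta2 * X

# Conditional covariance of (Y1, ..., Y4) given X: variance 5, covariance 4.
sigma = np.full((T, T), 4.0) + np.eye(T)

x = rng.standard_normal(n)                       # baseline covariate X
t = np.arange(1, T + 1)
mean = beta1 * t[None, :] + beta2 * x[:, None]   # n x T matrix of conditional means
eps = rng.multivariate_normal(np.zeros(T), sigma, size=n)
y = mean + eps                                   # n x T matrix of outcomes
```

The exchangeable covariance matrix has eigenvalues 1 and 17 and so is positive definite, which makes the multivariate normal draw well defined.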
Under the data-generating process of the first experiment, the probability of not missing the outcome is 84.4%, 69.0%, 60.5% and 53.7% at the four occasions, respectively, for both γ = 0 and γ = −0.5. In the second experiment, these probabilities are 84.4%, 60.1%, 61.5% and 56.5% for γ = 0. The values when γ = −0.5 are similar.
Table 1 summarizes the results for estimation of β3 = E (Y3) and β4 = E (Y4) in the first experiment for the following methods: inverse probability weighted estimators, i.e. those solving En {Ht(1, 0, βt, α̂t)} = 0, labelled ‘IPW’; conditional mean imputation estimators solving En {Mt(1, 0, βt, θ̂t)} = 0, labelled ‘CM’; and 2T-multiply robust estimators, labelled ‘MR’. The estimators were computed under various conditions: correctly specified working models ht(O̅t−1; αt) and mt(O̅t−1; θt) as defined above for all t, labelled ‘None’; models m3(O̅2; θ3) and h4(O̅3; α4) that incorrectly set a priori to zero the coefficients multiplying the term Rt−1Yt−1, labelled ‘𝒞3&ℬ4’; and models ht(O̅t−1; αt) and mt(O̅t−1; θt) that for all t incorrectly set a priori to zero the coefficients multiplying the term Rt−1Yt−1, labelled ‘All’.
Table 1.
qt | Estimand | Misspec. | IPW Bias | IPW SD | IPW Cov | CM Bias | CM SD | CM Cov | MR Bias | MR SD | MR Cov |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | β3 = 6.69 | None | −1.6 | 27.7 | 92.6 | 0.1 | 16.4 | 95.1 | −0.1 | 16.8 | 94.3 |
0 | β3 = 6.69 | C3 & B4 | −1.6 | 27.7 | 92.6 | −48.8 | 16.2 | 14.4 | −0.6 | 30.7 | 93.8 |
0 | β3 = 6.69 | All | −52.2 | 17.7 | 15.8 | −48.8 | 16.4 | 14.4 | −48.9 | 16.2 | 15.4 |
0 | β4 = 8.61 | None | −1.3 | 28.5 | 91.5 | 0.2 | 16.4 | 95.4 | 0.3 | 16.8 | 95.5 |
0 | β4 = 8.61 | C3 & B4 | −52.1 | 18.4 | 18.7 | 0.2 | 16.4 | 95.4 | 0.3 | 16.4 | 95.4 |
0 | β4 = 8.61 | All | −52.1 | 18.4 | 18.7 | −47.6 | 16.4 | 18.3 | −46.7 | 16.2 | 19.4 |
−0.5Yt | β3 = 6.99 | None | −0.4 | 28.5 | 93.0 | −2.3 | 22.0 | 89.5 | 1.0 | 17.9 | 95.0 |
−0.5Yt | β3 = 6.99 | C3 & B4 | −0.4 | 28.5 | 93.0 | −86.2 | 29.5 | 11.6 | 0.7 | 44.1 | 92.9 |
−0.5Yt | β3 = 6.99 | All | −82.1 | 19.4 | 1.0 | −86.2 | 29.5 | 11.6 | −77.3 | 17.9 | 0.9 |
−0.5Yt | β4 = 8.87 | None | −0.7 | 27.6 | 92.2 | −4.0 | 23.9 | 94.1 | 1.2 | 17.3 | 94.9 |
−0.5Yt | β4 = 8.87 | C3 & B4 | −81.3 | 20.2 | 2.2 | −4.0 | 23.9 | 94.1 | −1.0 | 18.8 | 94.1 |
−0.5Yt | β4 = 8.87 | All | −81.3 | 20.2 | 2.2 | −86.9 | 34.7 | 15.2 | −74.6 | 18.3 | 1.4 |
IPW, inverse probability weighted estimator; CM, conditional mean imputation estimator; MR, 2T-multiply robust estimator; None, no model misspecification; C3 & B4, misspecification of models C3 and B4; All, misspecification of Ct and Bt for each t.
The results for the 2T-multiply robust estimators in the first simulation study are as predicted by the theory: they are nearly unbiased, and the Wald confidence intervals centred at them cover at roughly the nominal 95% level when, at each occasion, at most one of the two working models is incorrect. In contrast, and also as expected, when the model for h3(O̅2) is correctly specified but that for h4(O̅3) is not, the inverse probability weighted estimators of β3 are nearly unbiased but those of β4 are biased. The reverse occurs for the conditional mean imputation estimators. Every estimator is biased when all working models are misspecified. In addition, as predicted by theory, when qt = 0 and all working models are correct, the 2T-multiply robust estimator is more efficient than the inverse probability weighted estimator but less efficient than the conditional mean imputation estimator. Interestingly, the 2T-multiply robust estimators of β3 and β4 are also more efficient than the corresponding inverse probability weighted estimators when qt = −0.5Yt and both working models are correct, even though this cannot be deduced from the theory. Note also that the 2T-multiply robust estimator of β3 is less efficient than the inverse probability weighted estimator when ℬ3 is correct but 𝒞3 is incorrect.
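The qualitative pattern just described can be mimicked in a toy setting with a single occasion and qt = 0. This is a minimal sketch under stated assumptions: the data-generating model, the function names `ipw`, `cm` and `dr`, and the use of the familiar augmented inverse probability weighted form for the doubly robust estimator are illustrative choices and not the estimating functions (13) to (16). To isolate the effect of misspecification, the correct working models are plugged in at their true values rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal(n)
y = 2.0 + 3.0 * x + rng.standard_normal(n)   # E(Y) = 2
pi_true = expit(1.0 + x)                     # pr(R = 1 | X); nonresponse ignorable given X
r = rng.binomial(1, pi_true)                 # r = 1: outcome observed

# Deliberately misspecified working models: both ignore X.
pi_wrong = np.full(n, r.mean())
m_true = 2.0 + 3.0 * x                       # E(Y | X, R = 0) when q = 0
m_wrong = np.full(n, y[r == 1].mean())

# Outcomes with r = 0 are multiplied by zero below, so they are never used.
def ipw(y, r, pi):
    return np.mean(r * y / pi)

def cm(y, r, m):
    return np.mean(r * y + (1 - r) * m)

def dr(y, r, pi, m):
    return np.mean(r * y / pi - (r - pi) / pi * m)

results = {
    "IPW, response model wrong": ipw(y, r, pi_wrong),
    "CM, outcome model wrong": cm(y, r, m_wrong),
    "DR, response model wrong": dr(y, r, pi_wrong, m_true),
    "DR, outcome model wrong": dr(y, r, pi_true, m_wrong),
    "DR, both wrong": dr(y, r, pi_wrong, m_wrong),
}
for label, value in results.items():
    print(f"{label}: {value:.3f}")
```

In this sketch the doubly robust estimator stays close to the true mean of 2 when exactly one working model is wrong, whereas all three estimators are biased once every model they rely on is misspecified.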
Table 2 summarizes the results from the second simulation for the generalized 2T-multiply robust estimators of β1 and β2 solving the equations (18) that use, instead of d(X), the vector function of X and β,
where Q* (β, θ, α) = (Q1 (1, β, θ1, α1), …, QT (1, β, θT, αT))′. It can be shown that using d* (X; β) in (18) results in generalized 2T-multiply robust estimators that, under the union model , are asymptotically equivalent to those that solve (18) with d (X) = d* (X; β*). The estimators were computed under various conditions: the correct working model ht(O̅t−1; αt) and the approximately correct model mt(O̅t−1; θ), labelled ‘None’; the same models as in the previous case except that the coefficients corresponding to Rt−1Yt−1 of h2(O̅1; α2), h4(O̅3; α4) and m3(O̅2; θ3) were set to 0, labelled ‘ℬ2, 𝒞3 and ℬ4’; and the same models as in the first case but with the coefficients of Rt−1Yt−1 equal to 0 for all t and for all models, labelled ‘All’. The results confirm that the generalized 2T-multiply robust estimators perform well even if the model for mt(O̅t−1) is incorrectly specified provided, as in our simulations, the model is richly parameterized.
Table 2.
qt | Estimand | Misspec. | Bias | SD | Cov |
---|---|---|---|---|---|
0 | β1 = 3 | None | 0.08 | 10.60 | 94.6 |
0 | β1 = 3 | B2,C3 & B4 | 0.12 | 10.30 | 94.0 |
0 | β1 = 3 | All | 1.74 | 9.58 | 95.7 |
0 | β2 = 2 | None | 0.007 | 4.25 | 95.3 |
0 | β2 = 2 | B2,C3 & B4 | −0.66 | 3.68 | 95.9 |
0 | β2 = 2 | All | −10.7 | 3.87 | 17.2 |
−0.5Yt | β1 = 3 | None | 1.70 | 12.00 | 95.7 |
−0.5Yt | β1 = 3 | B2,C3 & B4 | 1.36 | 12.20 | 93.1 |
−0.5Yt | β1 = 3 | All | 3.37 | 11.30 | 91.2 |
−0.5Yt | β2 = 2 | None | 1.31 | 6.43 | 96.7 |
−0.5Yt | β2 = 2 | B2,C3 & B4 | −4.21 | 7.69 | 88.1 |
−0.5Yt | β2 = 2 | All | −24.8 | 9.10 | 5.71 |
None, no model misspecification; B2,C3 & B4, misspecification of models B2,C3 and B4; All, misspecification of Ct and Bt for each t.
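Summaries of the kind reported in Tables 1 and 2 can be computed from simulation replicates along the following lines. This sketch assumes that the columns denote Monte Carlo bias, Monte Carlo standard deviation and the coverage of nominal 95% Wald intervals; the scaling used in the tables is not stated and is not reproduced. A sample mean stands in for the estimators studied here, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_rep, n = 1000, 500     # replications and sample size, as in the simulation study
truth = 0.0              # true value of the stand-in estimand

estimates = np.empty(n_rep)
std_errors = np.empty(n_rep)
for b in range(n_rep):
    sample = rng.standard_normal(n)
    estimates[b] = sample.mean()                      # stand-in estimator
    std_errors[b] = sample.std(ddof=1) / np.sqrt(n)   # its estimated standard error

bias = estimates.mean() - truth                       # Monte Carlo bias
mc_sd = estimates.std(ddof=1)                         # Monte Carlo standard deviation
lower = estimates - 1.96 * std_errors
upper = estimates + 1.96 * std_errors
coverage = np.mean((lower <= truth) & (truth <= upper))   # Wald interval coverage

print(f"bias = {bias:.4f}, SD = {mc_sd:.4f}, coverage = {100 * coverage:.1f}%")
```

With 1000 replications the Monte Carlo standard error of an estimated 95% coverage is roughly 0.7 percentage points, which is worth bearing in mind when comparing coverage figures.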
Acknowledgement
This work was partially conducted while Stijn Vansteelandt was visiting the Department of Biostatistics of the Harvard School of Public Health. He wishes to thank the members of the Department for their kind hospitality and stimulating environment. Stijn Vansteelandt was funded by a postdoctoral grant from the Fund for Scientific Research, Belgium. Andrea Rotnitzky and James Robins were funded by grants from the U.S. National Institutes of Health.
Appendix
Proofs
Proof of Theorem 1. Every observed data distribution is defined by a given collection of conditional densities and probabilities, {f(Lt|O̅t−1, Rt = 1), pr(Rt|O̅t−1) : t = 1, …, T}. Thus, to show that 𝒜(q) is a model for the observed data law restricted only by pr(Rt = 1|O̅t−1) > 0 for t = 1, …, T, it suffices to show the following: (a) given {f(Lt|O̅t−1, Rt = 1), pr(Rt|O̅t−1) : pr(Rt = 1|O̅t−1) > 0, t = 1, …, T}, there exists a distribution f* (L̅T,R̅T) satisfying the restrictions of 𝒜(q) and such that, for t = 1, …, T,
f* (Lt|O̅t−1, Rt = 1) = f (Lt|O̅t−1, Rt = 1), pr* (Rt = 1|O̅t−1) = pr (Rt = 1|O̅t−1), (A1)
and (b) pr*(Rt = 1|O̅t−1) > 0 for every joint distribution f* (L̅T,R̅T) that satisfies the restrictions of model 𝒜(q).
We prove (a) by constructing a joint distribution f* (L̅T,R̅T) that satisfies (A1) iteratively as follows. For t = 1, Rt−1 is nil and f*(L̅t−1) = f (L0). For each t = 1, …, T, we define f*(Lt,Rt|L̅t−1, R̅t−1) equal to f*(Lt,Rt|O̅t−1), where f*(Lt,Rt|O̅t−1) is defined by f*(Lt|O̅t−1,Rt = 1) = f(Lt|O̅t−1,Rt = 1) and pr*(Rt = 1|O̅t−1) = pr(Rt = 1|O̅t−1), and when (4) holds f*(Lt\Yt|Yt,O̅t−1,Rt = 0) is equal to an arbitrary law and f*(Yt|O̅t−1,Rt = 0) is equal to the right-hand side of (5). By construction, f*(L̅T,R̅T) satisfies (1) and Assumption 1 for t = 1, …, T, and thus is in model 𝒜(q), and additionally satisfies (A1), thus proving (a). Part (b) holds because it is implied by (1). This concludes the proof of part (i). To show part (ii), note that, under model 𝒜t(q),
where qt(O̅t−1, Yt) is defined arbitrarily if pr(Rt = 0|O̅t−1) = 0. Thus, f (Yt|O̅t−1) is identified under 𝒜t(q) because the right-hand side of the last display is determined by the observed data law. To show that (8) holds, we use expression (7) and note that, under models 𝒜t(q) and 𝒜(q), the right-hand side of (8) equals
From
under model 𝒜t(q), it follows that the right-hand side is equal to E(Yt). The proof of (9) is now immediate.
Proof that the restrictions imposed by models 𝒜*(q), ℬ*(q), ℬ(q), 𝒞(q) and are compatible. To show that the restrictions of ℬ*(q) are compatible we will exhibit a joint law f* (L̅T,R̅T) satisfying the restrictions defining ℬ*(q). We construct such a law recursively as follows. We define f*(L0) as an arbitrary law and set R0 as nil. Then, having defined f*(L̅t−1,R̅t−1) for t = 1, …, T, we define f*(Lt,Rt|L̅t−1,R̅t−1) as follows. The density f* (Yt|L̅t−1,R̅t−1) satisfies and the integral is taken with respect to the counting dominating measure for R̅t−1 and the adequate dominating measures for and Yt. This ensures that (17) holds. Next, we define f*(Lt \ Yt|Yt,Rt,O̅t−1) as an arbitrary law. Finally, for a given fixed function , we set , which ensures that (5) and (11) hold. Thus, by construction, f* (L̅T,R̅T) satisfies the restrictions defining ℬ*(q). Since ℬ*(q) is more restrictive than ℬ(q) and 𝒜*(q), this implies that the restrictions of ℬ(q) and 𝒜*(q) are compatible.
We next recursively construct a law f* (L̅T,R̅T) that satisfies the restrictions imposed by the intersection model . We define f*(L0) as an arbitrary law and set R0 as nil. Then, having defined f*(L̅t−1,R̅t−1) for t = 1, …, T, we define f*(Lt,Rt|L̅t−1,R̅t−1) as follows. Given fixed functions , for each t = 1, …, T, we define f* (Yt|L̅t−1,R̅t−1,Rt = 0) = f*(Yt|Rt = 0,O̅t−1), where f*(Yt|Rt = 0,O̅t−1) is any law that satisfies . Next, we define pr*(Rt = 1|L̅t−1,R̅t−1) = pr*(Rt = 1|O̅t−1) by the identity
and we define f* (Yt|L̅t−1,R̅t−1,Rt = 1) = f*(Yt|O̅t−1,Rt = 1), where
Finally, we choose f*(Lt \ Yt|Yt,L̅t−1,R̅t−1) to be an arbitrary law. By construction, f* (L̅T,R̅T) satisfies for each t, (12), and hence model 𝒞t(q), as well as (5) and (11), and hence model ℬt(q). This is seen because, for the chosen law,
We conclude that f* (L̅T,R̅T) satisfies the restrictions of model and therefore also those of 𝒞(q).
Proof that Ht(1, ϕθt,βt,t, βt, αt)=Mt(1, ψαt,t, βt, θt). We have
Now, suppose that (11) and (4) hold. Then
and
because
Thus, under (11) and (4), for any θt. Next, suppose (12) holds. Then , where E (Yt|O̅t−1,Rt = 0) is defined arbitrarily if (4) does not hold, and hence . Also,
because . Thus, under model (12), for any αt.
Proof of Theorem 2. To prove part (i) of the theorem, we assume that the regularity conditions 1–9 of Appendix B of Robins et al. (1994) hold with Ut(βt, θt, αt) and replacing their Hi(γ) and γ0 respectively, and their regularity condition 3 being replaced by the assumption that pr (Rt = 1|O̅t−1, Yt; αt) > σ > 0 with probability 1 for some σ and arbitrary αt in the parameter space. By standard Taylor expansion arguments we have that , where op(1) denotes a random variable converging to 0 in probability. Furthermore, because under model ℬt(q) ∪ 𝒞t(q), another Taylor expansion gives
When, as in regularity condition 6 of Robins et al. (1994), is nonsingular, this is equivalent to
The asymptotic distribution of under model ℬt(q) ∪ 𝒞t(q) follows from the previous equation by Slutsky’s Theorem and the Central Limit Theorem. The consistency of the variance estimator follows from the Law of Large Numbers. This proves part (i). Since αt and θt, for t = 1, …, T, are variation independent parameters, it also follows that is the influence function corresponding to the 2T-multiply robust consistent and asymptotically normal estimator for β* under model . This proves part (iii).
At the intersection model ℬt (q) ∩ 𝒞t (q), and hence . It follows that the estimators have the same influence functions at the intersection model ℬt (q) ∩ 𝒞t (q). This proves part (ii). Part (iv) is similarly proved.
Proof of Theorem 3. By definition, and, by analogous arguments to Theorem 2 for estimators β̂ (d) ≡ β̂ (d, dθ, ψ, dα, ϕ), the variance matrix of the limiting distribution of √n {β̂ (d) − β*} is equal to Γ (d) = Ψ (d) Ω (d) Ψ (d)′, where
That Γ (dopt) ≤ Γ (d) follows after applying the Cauchy-Schwarz inequality.
Contributor Information
Stijn Vansteelandt, Email: stijn.vansteelandt@ugent.be, Department of Applied Mathematics and Computer Sciences, Ghent University, 9000 Ghent, Belgium.
Andrea Rotnitzky, Email: arotnitzky@utdt.edu, Department of Economics, Di Tella University, Buenos Aires, Argentina.
James Robins, Email: robins@hsph.harvard.edu, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts 02115, U.S.A.
References
- Albert PS. A transitional model for longitudinal binary data subject to nonignorable missing data. Biometrics. 2000;56:602–608. doi: 10.1111/j.0006-341x.2000.00602.x.
- Andersson SA, Perlman MD. Lattice-ordered conditional-independence models for missing data. Statist. Prob. Lett. 1991;12:465–486.
- Deltour I, Richardson S, Le Hesran J-Y. Stochastic algorithms for Markov models estimation with intermittent missing data. Biometrics. 1999;55:565–573. doi: 10.1111/j.0006-341x.1999.00565.x.
- Fairclough DL, Peterson HF, Cella D, Bonomi P. Comparison of several model-based methods for analysing incomplete quality of life data in cancer clinical trials. Statist. Med. 1998;17:781–796. doi: 10.1002/(sici)1097-0258(19980315/15)17:5/7<781::aid-sim821>3.0.co;2-o.
- Gill RD, Robins JM. Sequential models for coarsening and missingness. In: Lin DY, Fleming TR, editors. Proc. First Seattle Symp. Biostatist: Survival Anal. New York: Springer; 1997. pp. 295–305.
- Gill RD, van der Laan MJ, Robins JM. Coarsening at random: characterizations, conjectures and counterexamples. In: Lin DY, Fleming TR, editors. Proc. First Seattle Symp. Biostatist: Survival Anal. New York: Springer; 1997. pp. 255–294.
- Ibrahim JG, Chen M-H, Lipsitz SR. Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika. 2001;88:551–564.
- Laird N, Ware J. Random effects models for longitudinal data. Biometrics. 1982;38:963–974.
- Lin H, Scharfstein DO, Rosenheck RA. Analysis of longitudinal data with irregular, informative follow-up. J. R. Statist. Soc. B. 2003;66:791–813.
- Little RJ, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley; 1987.
- Robins JM. Non-response models for the analysis of non-monotone nonignorable missing data. Statist. Med. 1997;16:21–37. doi: 10.1002/(sici)1097-0258(19970115)16:1<21::aid-sim470>3.0.co;2-f.
- Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In: Proc. Am. Statist. Assoc. Sec. Bayesian Sci. 1999. Alexandria: Am. Statist. Assoc; 2000. pp. 6–10.
- Robins JM, Gill RD. Non-response models for the analysis of non-monotone ignorable missing data. Statist. Med. 1997;16:39–56. doi: 10.1002/(sici)1097-0258(19970115)16:1<39::aid-sim535>3.0.co;2-d.
- Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell N, Dietz K, Farewell V, editors. AIDS Epidemiology - Methodological Issues. Boston, MA: Birkhäuser; 1992. pp. 297–331.
- Robins JM, Rotnitzky A. Comment on a paper by Bickel P, Kwon J. Statist. Sinica. 2001;11:920–936.
- Robins JM, Rotnitzky A, Scharfstein D. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In: Halloran ME, Berry D, editors. Statistical Models in Epidemiology: The Environment and Clinical Trials. IMA Volume 116. New York: Springer-Verlag; 1999. pp. 1–92.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J. Am. Statist. Assoc. 1994;89:846–866.
- Robins JM, Rotnitzky A, Zhao L-P. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Am. Statist. Assoc. 1995;90:106–121.
- Rotnitzky A, Robins JM, Scharfstein DO. Semiparametric regression for repeated outcomes with nonignorable nonresponse. J. Am. Statist. Assoc. 1998;93:1321–1339.
- Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric non-response models. J. Am. Statist. Assoc. 1999;94:1096–1146.
- Shah A, Laird N, Schoenfeld D. A random-effects model for multiple characteristics with possibly missing data. J. Am. Statist. Assoc. 1997;92:775–779.
- Troxel AB, Fairclough DL, Curran D, Hahn EA. Statistical analysis of quality of life with missing data in cancer clinical trials. Statist. Med. 1998;17:653–666. doi: 10.1002/(sici)1097-0258(19980315/15)17:5/7<653::aid-sim812>3.0.co;2-m.
- Troxel AB, Lipsitz SR, Harrington DP. Marginal models for the analysis of longitudinal measurements with nonignorable non-monotone missing data. Biometrika. 1998;85:661–672.
- van der Laan MJ, Robins JM. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer-Verlag; 2003.
- Zeuzem S, Feinman SV, Rasenack J, Heathcote EJ, Lai MY, Gane E, O'Grady J, Reichen J, Diago M, Lin A, Hoffman J, Brunda MJ. Peginterferon alfa-2a in patients with chronic hepatitis C. New Engl. J. Med. 2000;343:1666–1672. doi: 10.1056/NEJM200012073432301.