Estimation and inference based on Neumann series approximation to locally efficient score in missing data problems

HUA YUN CHEN

doi:10.1111/j.1467-9469.2009.00646.x

Author manuscript; available in PMC: 2010 Jun 1.

Published in final edited form as: Scand Stat Theory Appl. 2009 Dec 1;36(4):713–734. doi: 10.1111/j.1467-9469.2009.00646.x

PMCID: PMC2811346 NIHMSID: NIHMS168144 PMID: 20161609

Abstract

Theory on semiparametric efficient estimation in missing data problems has been systematically developed by Robins and his coauthors. Except in relatively simple problems, semiparametric efficient scores cannot be expressed in closed forms. Instead, the efficient scores are often expressed as solutions to integral equations. Neumann series was proposed in the form of successive approximation to the efficient scores in those situations. Statistical properties of the estimator based on the Neumann series approximation are difficult to obtain and as a result, have not been clearly studied. In this paper, we reformulate the successive approximation in a simple iterative form and study the statistical properties of the estimator based on the reformulation. We show that a doubly-robust locally-efficient estimator can be obtained following the algorithm in robustifying the likelihood score. The results can be applied to, among others, the parametric regression, the marginal regression, and the Cox regression when data are subject to missing values and the missing data are missing at random. A simulation study is conducted to evaluate the performance of the approach and a real data example is analyzed to demonstrate the use of the approach.

Keywords: auxiliary covariates, information operator, non-monotone missing pattern, weighted estimating equations

1 Introduction

The semiparametric efficient estimation for missing data problems has been extensively studied (Bickel et al., 1993; Robins et al., 1994 and others). One major task in such problems is to project the estimating score onto the orthogonal complement of the nuisance score space. However, the projection often depends on the unknown underlying distribution that generated the data (Robins et al., 1994, 1995; Rotnitzky & Robins, 1995; Rotnitzky et al., 1998; Scharfstein et al., 1999). To overcome the difficulty, working models have been proposed to compute a locally efficient score. It has been shown that when data are missing at random and either the working model for the missing data mechanism or the working model for the nuisance model of the full data is correct, the locally efficient score is asymptotically unbiased (Lipsitz et al., 1999; Robins et al., 1999; Robins & Rotnitzky, 2001; van der Laan & Robins, 2003). Note that, except in simple cases, computing the projection using working models still corresponds to the hard problem of solving an integral equation. Neumann series expansion has been proposed to obtain an approximate solution through successive approximation (Robins et al., 1994, Robins & Wang, 1998, and van der Laan & Robins, 2003). Since the procedure for finding the locally efficient estimator based on the approximate locally efficient score is complicated, the study of the asymptotic properties of the estimator has been left open. In this article, we reformulate the successive approximation and show that an algorithm based on robustifying the likelihood score yields an estimator having the desired asymptotic properties, i.e., doubly robust and locally efficient.

The remainder of the article is organized as follows. In section 2, we reformulate the successive approximation in a simple form and show the robust property of the algorithm. The asymptotic properties of the estimator are carefully studied in section 3. We show that the algorithm indeed yields an estimator which is doubly-robust and locally-efficient when appropriate care is taken with regard to the number of terms used in the Neumann series approximation. Applications of the theory developed to regression models are briefly discussed in section 4. A simulation study is performed using parametric regression with missing covariates to examine the finite sample performance of the algorithm in section 5 and the algorithm is applied to a real data example. The article is concluded with some discussions in section 6. All proofs are collected in the Appendix.

2 The Neumann series approximation to the efficient score

Let Y be the full data, R be the missing data indicator for Y, and R(Y) and R̄(Y) be respectively the observed and missing parts of Y. Let the density of the distribution for (R, Y) with respect to μ, a product of counting measures and Lebesgue measures, be $d P_{(α, β, θ)} / d μ = π (R ∣ Y, α) f (Y, β, θ)$, where β ∈ Ω, θ ∈ Θ, α ∈ Ξ. Here β and α are usually Euclidean parameters, and θ is a nuisance parameter, which can be of infinite dimension. Let η = (α, β, θ), where β is the parameter of interest and (α, θ) are nuisance parameters. Let $b (Y) \in L_{0}^{2} (P_{η})$ be a mean-zero square-integrable function with respect to P_η. Define the nonparametric information operator m_η as

m_{η} {b (Y)} = E_{η} [E_{η} {b (Y) ∣ R, R (Y)} ∣ Y] .
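To make the definition concrete, the following minimal sketch evaluates m_η on a finite sample space. It is an illustration only, under assumed values: the full data Y = (Y1, Y2) are two binary variables, Y2 may be missing, and the pmf `f` and the observation probabilities `p_full` are hypothetical numbers that do not come from the paper. Under missing at random, E_η{b(Y) ∣ R, R(Y)} reduces to the full-data conditional expectation given the observed part, which is what the code uses.

```python
import numpy as np

# Hypothetical full data Y = (Y1, Y2), both binary; Y1 is always observed.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
f = np.array([0.3, 0.2, 0.1, 0.4])   # assumed pmf of Y (illustrative values)
# Assumed MAR mechanism: P(Y2 observed | Y) depends on Y1 only.
p_full = np.array([0.6 if y1 == 0 else 0.8 for y1, _ in states])

def cond_mean_given_y1(b):
    """E{b(Y) | Y1}, returned as a function on the four states."""
    out = np.empty(len(states))
    for i, (y1, _) in enumerate(states):
        same = np.array([s[0] == y1 for s in states])
        out[i] = f[same] @ b[same] / f[same].sum()
    return out

def info_operator(b):
    """m b = E[E{b(Y) | R, R(Y)} | Y] for the two missing-data patterns above."""
    return p_full * b + (1.0 - p_full) * cond_mean_given_y1(b)

a = np.array([0.0, 3.0, -2.0, 1.0])
b = np.array([1.0, -1.0, 2.0, 0.5])
print(info_operator(b))
```

In this toy case the operator leaves constants unchanged, preserves means, and is self-adjoint with respect to the inner product of L²(P_η), which can be checked by comparing `f @ (a * info_operator(b))` with `f @ (info_operator(a) * b)`.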

Neumann series approximation to the efficient score appeared as the successive approximation in Robins et al. (1994). The method first finds the efficient score for β under the full data model, denoted by $S_{β}^{F, eff}$ . The method then employs the successive approximation,

U_{N} = S_{β}^{F, eff} + P_{η} (I - m_{η}) U_{N - 1},

where P_η is the projection onto the closure of the nuisance score space of the full data model and $U_{0} = S_{β}^{F, eff}$. The semiparametric efficient score is E_η{U_∞|R, R(Y)}. There are many unanswered questions associated with the use of this approach in practice. First, we need guidelines for choosing a finite N in implementing the algorithm. Second, the successive approximation is only known to converge in the L²-norm, which is insufficient for studying the properties of the estimator when the underlying distribution used in computing the projection is estimated rather than known. Third, it is not known whether the estimator generated from the procedure involving approximations is indeed asymptotically equivalent to the estimator with known underlying distribution.

To answer these questions, we reformulate the successive approximation in another form as U_N = (I − P_η m_η)U_{N−1}, or equivalently as an explicit expression:

U_{N} = {(I - P_{η} m_{η})}^{N} S_{β},

where S_β is the likelihood score for β. The equivalence of the simpler form to the successive approximation can be shown by noting that P_η(I − P_η) = 0 and $S_{β}^{F, eff} = (I - P_{η}) S_{β}$. This allows us to express the efficient score for the missing data problem directly as

lim_{N \to \infty} E {{(I - P_{η} m_{η})}^{N} S_{β} ∣ R, R (Y)} .
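The relation between the two recursions can be checked numerically in a small finite setting. The sketch below is not the paper's implementation. It assumes binary full data (W, X) with X missing at random given W, takes the nuisance score space to be the mean-zero functions of X, and uses an arbitrary mean-zero vector `s` as a stand-in for S_β. The pmf `f`, the probabilities `pi_obs`, and all function names are hypothetical.

```python
import numpy as np

# Toy full data (W, X), both binary; X is missing at random given W.
states = [(w, x) for w in (0, 1) for x in (0, 1)]
f = np.array([0.35, 0.15, 0.20, 0.30])   # assumed pmf of (W, X)
pi_obs = {0: 0.5, 1: 0.8}                # assumed P(X observed | W = w)

def cond_mean_matrix(key):
    """Matrix of the map b -> E{b(W, X) | key(W, X)} under f."""
    A = np.zeros((4, 4))
    for i, si in enumerate(states):
        same = np.array([key(sj) == key(si) for sj in states])
        A[i, same] = f[same] / f[same].sum()
    return A

I = np.eye(4)
E_w = cond_mean_matrix(lambda s: s[0])
E_x = cond_mean_matrix(lambda s: s[1])
E_all = np.tile(f, (4, 1))               # b -> E{b}
P = E_x - E_all                          # projection onto mean-zero functions of X
D = np.diag([pi_obs[w] for w, _ in states])
m = D + (I - D) @ E_w                    # information operator under MAR

s = np.array([1.0, -2.0, 0.5, 3.0])
s = s - f @ s                            # stand-in for a mean-zero score S_beta
s_eff = (I - P) @ s                      # full-data efficient score

N = 40
U_old, U_new = s_eff.copy(), s.copy()
for _ in range(N):
    U_old = s_eff + P @ (I - m) @ U_old  # successive approximation
    U_new = (I - P @ m) @ U_new          # reformulated recursion
closed_form = np.linalg.matrix_power(I - P @ m, N) @ s_eff
print(np.max(np.abs(U_old - closed_form)), np.max(np.abs(U_old - U_new)))
```

In this example the successive approximation started at the full-data efficient score coincides with (I − P_η m_η)^N applied to that score, and the iteration started at `s` differs from it by a term that is negligible after 40 steps. Both iterates then satisfy P_η m_η U ≈ 0.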

Note that the approximation based on the new expression does not require us to first find the efficient score under the full data. The approximation is the likelihood score when N = 0 and can be regarded as a robustification of the likelihood score when N > 0. An algorithm for finding the approximate locally efficient score for estimating β based on the expression can be described as follows. First, find an estimate of the nuisance parameters under the working models using methods such as the maximum likelihood approach. Then, compute the approximate efficient score with the nuisance parameters estimated from the working models. Finally, solve the score equation to obtain the estimator of β. Results in the next section show that it is sufficient for the number N in the approximation to be of higher order than the logarithm of the sample size. Let

T_{N} {R, R (Y), η} = E {{(I - P_{η} m_{η})}^{N} S_{β} ∣ R, R (Y)}

and T{R, R(Y), η} be the limit of T_N{R, R(Y), η} under L²(P_η). The large sample properties of the approximate locally efficient estimator are studied in the next section. The following proposition describes the robust properties of the estimating scores, which are used in the next section. Similar results can be found in van der Laan & Robins (2003, sections 2.9 & 2.10). Note, however, that part (a) of the proposition is different from their results.

Let P₀ = P_{(β₀, θ₀, α₀)} be the true distribution generating the data. If, for any small ε and large M, there exists θ_{ε,M} such that

f (y, β_{0}, θ) (1 + ε [S (y) 1_{{∣ S (y) ∣ \leq M}} - E_{(β_{0}, θ)} {S (Y) 1_{{∣ S (Y) ∣ \leq M}}}]) = f (y, β_{0}, θ_{ε, M}),

where S(Y) = f(Y, β₀, θ₀)/f(Y, β₀, θ) − 1, then, in this paper, {f(y, β₀, θ), θ ∈ Θ} is called a super-convex family for θ at β₀. Note that a super-convex family of distributions is always a convex family of distributions, which corresponds to M = ∞. When the densities are bounded away from both infinity and zero, a convex family is also a super-convex family.

Proposition 1

Assume that data are missing at random and that the true distribution generating the data is P₀. Then the following results hold:

(a) For any fixed N, if the nuisance model for the full data is correct, i.e., θ = θ₀, then T_N{R, R(Y), β₀, θ₀, α} is asymptotically unbiased under P₀. The L²(P_{(β₀, θ₀, α)}) limit of T_N is also asymptotically unbiased if it is in L²(P₀) and the missing data probabilities are bounded away from zero.
(b) If the model for the missing data mechanism is correct, i.e., α = α₀, and f(y, β₀, θ) is a super-convex family with respect to θ, then E₀[T{R, R(Y), β₀, θ, α₀}] = 0 if E₀[T^⊗2{R, R(Y), β₀, θ, α₀}] < ∞, where T^⊗2 = TT′.

We omit the proof of this proposition because it is similar to the proofs of such results in the literature, such as those in Robins et al. (2000) and van der Laan & Robins (2003). The proposition suggests that T is doubly robust, i.e., it is unbiased when either the missing data mechanism model or the nuisance model for the full data is correctly specified. For a fixed N, T_N is unbiased only when the nuisance model for the full data is correct. However, note that T_N approximates T. If we allow N to depend on the sample size, we can make T_{N_n} doubly robust as n → ∞. We explain in the next section how to implement this idea.
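The two parts of Proposition 1 can be illustrated by exact calculation in a toy model. The sketch below is an assumption-laden illustration, not the paper's simulation: W is a binary outcome following a logistic model with known intercept and slope β₀, X is a binary covariate that is missing at random given W, the nuisance parameter is q = P(X = 1), and all numerical values and function names are hypothetical. The function `mean_T_N` returns E₀(T_N) when T_N is built from working values `q_work` and `pi_work`.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

states = [(w, x) for w in (0, 1) for x in (0, 1)]   # full data (W, X)
beta0, intercept = 1.0, -0.5                          # assumed regression values

def joint_pmf(q):
    """pmf of (W, X): X ~ Bernoulli(q), W | X logistic with slope beta0."""
    out = []
    for w, x in states:
        p1 = expit(intercept + beta0 * x)
        out.append((p1 if w == 1 else 1.0 - p1) * (q if x == 1 else 1.0 - q))
    return np.array(out)

def cond_mean_matrix(f, key):
    """Matrix of the map b -> E{b(W, X) | key(W, X)} under the pmf f."""
    A = np.zeros((4, 4))
    for i, si in enumerate(states):
        same = np.array([key(sj) == key(si) for sj in states])
        A[i, same] = f[same] / f[same].sum()
    return A

def mean_T_N(N, q_work, pi_work, q_true, pi_true):
    """E_0[T_N], where T_N = E_eta{(I - P m)^N S_beta | R, R(Y)} and X is missing when R = 0."""
    f = joint_pmf(q_work)
    I = np.eye(4)
    E_w = cond_mean_matrix(f, lambda s: s[0])
    P = cond_mean_matrix(f, lambda s: s[1]) - np.tile(f, (4, 1))
    D = np.diag([pi_work[w] for w, _ in states])
    m = D + (I - D) @ E_w
    score = np.array([(w - expit(intercept + beta0 * x)) * x for w, x in states])
    U = np.linalg.matrix_power(I - P @ m, N) @ score
    pi0 = np.array([pi_true[w] for w, _ in states])
    # Observed X contributes U; missing X contributes E_eta{U | W}.
    return joint_pmf(q_true) @ (pi0 * U + (1.0 - pi0) * (E_w @ U))

pi_true, q_true = {0: 0.5, 1: 0.9}, 0.2
for N in (0, 5, 40):
    wrong_pi = mean_T_N(N, q_true, {0: 0.7, 1: 0.6}, q_true, pi_true)
    wrong_q = mean_T_N(N, 0.5, pi_true, q_true, pi_true)
    print(N, wrong_pi, wrong_q)
```

With the covariate model correct and the missingness model wrong, the mean is zero for every N, in line with part (a). With the missingness model correct and the covariate model wrong, the mean is nonzero at N = 0 and shrinks toward zero as N grows, in line with part (b) applied to the limit T.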

3 Estimation and inference based on approximate locally efficient scores

To simplify notation in this section, we use θ to denote the nuisance parameter. That is, we absorb α into θ. Denote the model ∫ π(R|Y, α)f(Y, β, θ)dR̄(Y) by g(R, R(Y), β, θ) after the parameter absorption. Let θ(γ), γ ∈ Γ define a working model which is a regular submodel. Let θ(γ̂) be a $\sqrt{n}$ consistent estimator of θ(γ) under the working models.

To accommodate β of infinite dimension, we use the functional form to denote T_N and T. Let T_{N(i)}(η)(h₁) = T_N{R_i, R_i(Y_i), η}(h₁) and T_{(i)}(η)(h₁) = T{R_i, R_i(Y_i), η}(h₁), where h₁ ∈ H₁, a Hilbert space. Let θ̂ = θ(γ̂). For a given N, let β̃_N be the solution to the equation

P_{n} T_{N} ({\tilde{β}}_{N}, \hat{θ}) (h_{1}) = \frac{1}{n} \sum_{i = 1}^{n} T_{N (i)} ({\tilde{β}}_{N}, \hat{θ}) (h_{1}) = 0,

for all h₁ ∈ H₁. Let β̃ be the solution to the equation

P_{n} T (\tilde{β}, \hat{θ}) (h_{1}) = \frac{1}{n} \sum_{i = 1}^{n} T_{(i)} (\tilde{β}, \hat{θ}) (h_{1}) = 0,

for all h₁ ∈ H₁. Define the linear operator Q₀ as a map from H₁ to itself satisfying

< Q_{0} h_{1}, h_{1}^{*} >_{H_{1}} = E_{0} {\frac{\partial T (η_{0})}{\partial β} (h_{1}) (h_{1}^{*})} .

(1)

Assumption 9 in the Appendix guarantees that Q₀ exists, is uniquely defined, and is continuously invertible, for the following reason. For any given h₁, the right-hand side of the foregoing equation defines a linear functional on H₁. By the Riesz representation theorem, there exists an $h_{1}^{* *} \in H_{1}$ such that, for all $h_{1}^{*} \in H_{1}$ ,

< h_{1}^{* *}, h_{1}^{*} >_{H_{1}} = P_{0} {\frac{\partial T (η_{0})}{\partial β} (h_{1}) (h_{1}^{*})} .

We can thus define the map Q₀ from H₁ to H₁ such that $Q_{0} h_{1} = h_{1}^{* *}$. By varying h₁ ∈ H₁, we see that Q₀ is well-defined on H₁. It is straightforward to verify that Q₀ is a linear operator on H₁. Similarly, we define linear operators Q_N and Q_{0N}, which map H₁ to itself, respectively as

< Q_{N} h_{1}, h_{1}^{*} >_{H_{1}} = P_{0} {\frac{\partial T_{N} (β_{N}, θ^{*})}{\partial β} (h_{1}) (h_{1}^{*})} .

and

< Q_{0 N} h_{1}, h_{1}^{*} >_{H_{1}} = P_{0} {\frac{\partial T_{N} (β_{0}, θ^{*})}{\partial β} (h_{1}) (h_{1}^{*})},

which are respectively continuously invertible from assumption 10 in the Appendix and from the continuity of the right-hand side with respect to β.

We now state theorems on the asymptotic properties of the β estimators when data are missing at random and either the missing data mechanism model or the nuisance full data model is correctly specified. Theorem 1 describes the asymptotic behavior of β̃ when n → ∞. Theorem 2 states the asymptotic property of β̃_N when N is fixed and n → ∞. Theorem 3 describes the asymptotic behavior of β̃_N when n → ∞ and N, as a function of n, also tends to ∞. Assumptions and proofs are given in the Appendix.

Theorem 1

Under assumptions 1–9, as n → ∞, β̃ → β₀ almost surely and

< \sqrt{n} (\tilde{β} - β_{0}), h_{1} >_{H_{1}} \to N (0, V_{0} (h_{1})),

uniformly for h₁ ∈ H₁₀, and

V_{0} (h_{1}) = E_{0} {T (β_{0}, θ^{*}) (Q_{0}^{- 1} h_{1})}^{\otimes 2},

which can be consistently estimated uniformly for h₁ ∈ H₁₀ by

\frac{1}{n} \sum_{i = 1}^{n} {[T_{(i)} (\tilde{β}, \hat{θ}) {{\hat{Q}}_{0}^{- 1} (h_{1})}]}^{\otimes 2},

where θ^* = θ(γ^*) and γ^* is the limit of γ̂, and

< {\hat{Q}}_{0} (h_{1}), h_{1}^{*} >_{H_{1}} = \frac{1}{n} \sum_{i = 1}^{n} \sqrt{n} [T_{(i)} {\tilde{β} + h_{1} / \sqrt{n}, \hat{θ}} (h_{1}^{*}) - T_{(i)} {\tilde{β}, \hat{θ}} (h_{1}^{*})],

for $h_{1}^{*} \in H_{1}$ . When the nuisance models for θ are correctly specified, θ^* = θ₀ and V₀ attains the semiparametric efficient variance bound.

Theorem 2

Under assumptions 1–8, and 10–12, for any fixed N, as n → ∞, β̃_N converges almost surely to β_N satisfying E₀{T_N(β_N, θ^*)(h₁)} = 0 for all h₁ ∈ H₁. The asymptotic bias of β̃_N can be approximated by

< (β_{N} - β_{0}), h_{1} >_{H_{1}} \approx E_{0} {T_{N} (β_{0}, θ^{*}) (- Q_{0 N}^{- 1} h_{1})} .

Further,

< \sqrt{n} ({\tilde{β}}_{N} - β_{N}), h_{1} >_{H_{1}} \to N (0, V_{N} (h_{1}))

uniformly over h₁ ∈ H₁₀, where

V_{N} (h_{1}) = E_{0} {[T_{N} (β_{N}, θ^{*}) (Q_{N}^{- 1} h_{1}) + E_{0} {\frac{\partial T_{N} (β_{N}, θ^{*})}{\partial θ} (u) (Q_{N}^{- 1} h_{1})}_{u = U_{1}}]}^{\otimes 2}

can be consistently estimated by

\frac{1}{n} \sum_{i = 1}^{n} {(T_{N (i)} ({\tilde{β}}_{N}, \hat{θ}) ({\hat{Q}}_{N}^{- 1} h_{1}) + \sqrt{n} [{\bar{T}}_{N} {{\tilde{β}}_{N}, \hat{θ} + \frac{U_{i}}{\sqrt{n}}} ({\hat{Q}}_{N}^{- 1} h_{1}) - {\bar{T}}_{N} ({\tilde{β}}_{N}, \hat{θ}) ({\hat{Q}}_{N}^{- 1} h_{1})])}^{\otimes 2}

uniformly for h₁ ∈ H₁₀, where Q̂_N(h₁) is defined as

< {\hat{Q}}_{N} (h_{1}), h_{1}^{*} >_{H_{1}} = \frac{1}{n} \sum_{i = 1}^{n} T_{N (i)} {{\tilde{β}}_{N} + h_{1} / \sqrt{n}, \hat{θ}} (h_{1}^{*})

for h₁, $h_{1}^{*} \in H_{1}$ , and U₁, ···, U_n are influence functions of θ̂ in estimating θ^*, i.e.,

\hat{θ} - θ^{*} = \frac{1}{n} \sum_{i = 1}^{n} U_{i} + o_{p} (\frac{1}{\sqrt{n}}) .

Theorem 3

Let N_n be a sequence such that log n/N_n → 0 as n → ∞. Under assumptions 1–9, β̃_{N_n} − β̃ → 0 P₀-almost surely, and

< \sqrt{n} ({\tilde{β}}_{N_{n}} - \tilde{β}), h_{1} >_{H_{1}} = o_{P_{0}} (1)

uniformly over h₁ ∈ H₁₀ as n → ∞. Further, V₀(h₁) can be consistently estimated by

\frac{1}{n} \sum_{i = 1}^{n} {[T_{N_{n} (i)} ({\tilde{β}}_{N_{n}}, \hat{θ}) {{\hat{Q}}_{0 N_{n}}^{- 1} (h_{1})}]}^{\otimes 2}

uniformly over h₁ ∈ H₁₀, where

< {\hat{Q}}_{0 N_{n}} (h_{1}), h_{1}^{*} >_{H_{1}} = \frac{1}{n} \sum_{i = 1}^{n} \sqrt{n} [T_{N_{n} (i)} {{\tilde{β}}_{N_{n}} + h_{1}^{*} / \sqrt{n}, \hat{θ}} (h_{1}) - T_{N_{n} (i)} {{\tilde{β}}_{N_{n}}, \hat{θ}} (h_{1})],

for h₁, $h_{1}^{*} \in H_{1}$ .

In practice, Theorem 1 is useful only when the locally efficient score has a closed-form expression, which happens only occasionally. In general, Theorem 1 cannot be applied directly because of the unknown form of the locally efficient score. Theorem 2 can almost always be applied to the approximation score with a finite N. It can be seen from Theorem 2 that, although the bias in estimating β cannot be totally avoided, the magnitude of the bias can be controlled by selecting a sufficiently large N. This is because E₀T_N(β₀, θ^*)(h₁) → E₀T(β₀, θ^*)(h₁) = 0 uniformly over h₁ ∈ H₁₀ as N → ∞. Furthermore, if the nuisance model for the full data is correctly specified in the missing data problem, the bias in estimating β is asymptotically zero for any fixed N when n → ∞ because E₀T_N(β₀, θ^*)(h₁) = 0 from Proposition 1(a). Theorem 3 states that the approximation score with N sufficiently large relative to the sample size is asymptotically equivalent to the locally efficient score in estimating β. This confirms that the algorithm indeed finds the locally efficient estimator when carefully implemented.
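The growth condition of Theorem 3, log n/N_n → 0, leaves the exact choice of N_n open. The sketch below shows one hypothetical rule, N_n = ⌈(log n)^p⌉ with p > 1, which is not prescribed by the paper. As a heuristic, and purely as an assumption, it also supposes that the truncation error of the Neumann series decays geometrically at some unknown rate ρ < 1, so that the error on the √n scale behaves like √n ρ^{N_n}.

```python
import math

def n_terms(n, power=2.0):
    """Hypothetical rule N_n = ceil((log n)^power); log n / N_n -> 0 when power > 1."""
    return math.ceil(math.log(n) ** power)

def scaled_remainder(n, rho, power=2.0):
    """sqrt(n) * rho^{N_n}: heuristic size of the truncation error on the root-n scale."""
    return math.sqrt(n) * rho ** n_terms(n, power)

for n in (100, 400, 10_000, 1_000_000):
    print(n, n_terms(n), math.log(n) / n_terms(n), scaled_remainder(n, rho=0.9))
```

Under the assumed geometric decay, any rule with log n/N_n → 0 drives √n ρ^{N_n} to zero whatever the value of ρ < 1, whereas a rule proportional to log n would do so only for ρ small enough.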

4 Applications to examples

In this section, we apply the theorems in the previous section to several regression models frequently used in practice.

Example 1: Missing data in parametric regression models

Let Y = (V, W, X) with density p(v|w, x)f(w|x, β)q(x), where f(w|x, β) is the parametric regression model with a Euclidean parameter β ∈ R^k, which is of primary interest, and V is the auxiliary information observed in addition to outcome W and covariates X. The nuisance parameter for the complete data model is (p, q). We know from Robins et al. (1994) that doubly robust estimating scores have the form $E_{η} {m_{η}^{- 1} (D) ∣ R, R (Y)}$, where D is in the orthogonal complement of the nuisance score space. Specifically, D = S(W, X) − E_η{S(W, X)|X} for any square-integrable function S. The semiparametric efficient score is the special doubly robust estimating score with D satisfying the integral equation

E_{η} {m_{η}^{- 1} (D) ∣ X, W} - E_{η} {m_{η}^{- 1} (D) ∣ X} = \frac{\partial}{\partial β} log f (W ∣ X, β) .

When missing data form monotone patterns, $m_{η}^{- 1}$ has a closed-form expression. But the foregoing integral equation does not have a closed-form solution even with the simplest missing data pattern and when no auxiliary covariates are involved. As a result, successive approximation is needed except when the missing data form monotone patterns and we are satisfied with a doubly robust estimator.

The score operator for the parameters (q, β, p) is

A_{η} (h_{1}, h_{21}, h_{22}) = h_{1}^{T} \frac{\partial log f}{\partial β} + h_{21} (v, w, x) + h_{22} (x),

where h₁ ∈ H₁ = R^k with k the dimension of β, and

h_{21} \in H_{21} = {h_{21} (v, w, x) \in L^{2} (P_{η}) ∣ E_{η} {h_{21} (V, W, X) ∣ W, X} = 0}

and

h_{22} \in H_{22} = {h_{22} (x) \in L^{2} (P_{η}) ∣ E_{η} {h_{22} (X)} = 0} .

Note that H₁ does not vary for different β ∈ Ω. H₁₀ can be chosen as H₁₀ = {h₁ ∈ R^k : |h₁| ≤ 1}. Let H₂ = H₂₁ × H₂₂. Let A_{2η}(h₂₁, h₂₂) = A_η(0, h₂₁, h₂₂). The adjoint operator of A_{2η} is

A_{2 η}^{*} {s (v, w, x)} = {s (v, w, x) - E_{η} (s ∣ w, x), E_{η} (s ∣ x) - E_{η} (s)} .

It follows that $A_{2 η}^{*} A_{2 η} (h_{21}, h_{22}) = (h_{21} (v, w, x), h_{22} (x))$ . Hence, $A_{2 η}^{*} A_{2 η}$ is continuously invertible on H₂ and $P_{η} = A_{2 η} {(A_{2 η}^{*} A_{2 η})}^{- 1} A_{2 η}^{*}$ having the form

P_{η} (s) = s (v, w, x) - E_{η} (s ∣ w, x) + E_{η} (s ∣ x),

for any mean-zero square-integrable function s. Assuming that the densities involving the nuisance parameters are uniformly bounded, the convexity requirement for q(v|w, x)f(w|x, β)p(x) with respect to qp, where q here denotes the conditional density of V given (W, X) and p the density of X, can be verified as follows.

τ q_{1} (v ∣ w, x) p_{1} (x) + (1 - τ) q_{2} (v ∣ w, x) p_{2} (x) = q_{τ} (v ∣ w, x) p_{τ} (x),

where

q_{τ} (v ∣ w, x) = \frac{τ q_{1} (v ∣ w, x) p_{1} (x) + (1 - τ) q_{2} (v ∣ w, x) p_{2} (x)}{τ p_{1} (x) + (1 - τ) p_{2} (x)}

and

p_{τ} (x) = τ p_{1} (x) + (1 - τ) p_{2} (x)

for τ ∈ [0, 1].
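The projection P_η(s) = s − E_η(s ∣ w, x) + E_η(s ∣ x) given above for Example 1 can be written as a matrix when (V, W, X) takes finitely many values. The following sketch assumes binary V, W and X with an arbitrary positive pmf generated at random; these choices and the helper names are hypothetical. The code subtracts E_η(s) as well, so that the operator is a projection on all of L²(P_η); this extra term vanishes for the mean-zero functions considered in the text.

```python
import itertools
import numpy as np

# Hypothetical discrete full data (V, W, X), all binary, with an arbitrary positive pmf.
states = list(itertools.product((0, 1), repeat=3))      # (v, w, x)
rng = np.random.default_rng(0)
f = rng.uniform(0.5, 1.5, size=len(states))
f /= f.sum()

def cond_mean_matrix(key):
    """Matrix of the map s -> E{s(V, W, X) | key(V, W, X)} under f."""
    size = len(states)
    A = np.zeros((size, size))
    for i, si in enumerate(states):
        same = np.array([key(sj) == key(si) for sj in states])
        A[i, same] = f[same] / f[same].sum()
    return A

n = len(states)
E_wx = cond_mean_matrix(lambda s: (s[1], s[2]))
E_x = cond_mean_matrix(lambda s: s[2])
E_all = np.tile(f, (n, 1))

# P s = s - E(s | w, x) + E(s | x) - E(s); the last term vanishes for mean-zero s.
P = np.eye(n) - E_wx + E_x - E_all

g = rng.normal(size=n)
g = E_wx @ g - E_x @ g      # a function of (w, x) with E(g | x) = 0, like a regression score
print(np.max(np.abs(P @ g)))
```

The matrix is idempotent and self-adjoint with respect to the inner product weighted by `f`, and it annihilates functions of (w, x) with zero conditional mean given x, which is the form taken by the regression score.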

Example 2: Marginal mean model

Let W = (W₁, ···, W_K)^T and E(W_k|X) = g_k(X_kβ) for k = 1, ···, K. Let g(β) = {g₁(X₁β), ···, g_K(X_Kβ)}^T, and f(ε) be the joint density of W given X, where ε_k = w_k − g_k(x_kβ) and ε = (ε₁, ···, ε_K). Let the density for X be q and the density for V given (W, X) be p, where V denotes the auxiliary covariate. The nuisance parameter for the complete data model is (q, f, p). When data are missing at random, the efficient score for estimating β is $E_{η} {m_{η}^{- 1} (D) ∣ R, R (Y)}$ (Robins et al., 1994; 1995), where D = Cov_η(S, ε|X){Var_η(ε|X)}⁻¹ε satisfying

{Cov}_{η} {m_{η}^{- 1} (D), ε ∣ X} = \frac{d g}{d β} .

When X is completely observed, the efficient score has the form

\frac{d g}{d β} {[{Var}_{η} {m_{η}^{- 1} (ε) ∣ X}]}^{- 1} m_{η}^{- 1} (ε) .

Successive approximation is needed when either the missing data form non-monotone patterns or the covariates are subject to missing values.

When data are fully observed, the score for estimating β is

A_{1 η} h_{1} = - \sum_{k = 1}^{K} h_{1}^{T} X_{k}^{T} g_{k}^{'} (X_{k} β) \frac{\partial log f}{\partial ε_{k}} (ε_{1}, \dots, ε_{K}),

where h₁ ∈ H₁ = R^d and d is the dimension of parameter β. H₁₀ can be taken as the unit ball in R^d. The nuisance score is

A_{2 η} {h_{21} (V, X, W), h_{22} (X, W), h_{23} (X)} = h_{22} (X, W) + h_{21} (V, X, W) + h_{23} (X),

where H₂ = H₂₁ × H₂₂ × H₂₃ with

\begin{matrix} H_{21} = {h_{21} (v, w, x) \in L^{2} (P_{η}) ∣ E_{η} {h_{21} (V, W, X) ∣ W, X} = 0}, \\ H_{22} = {h_{22} (w, x) \in L^{2} (P_{η}) ∣ E_{η} {h_{22} (X, W) ∣ X} = 0, E_{η} {ε h_{22} (X, W) ∣ X} = 0}, \end{matrix}

and

H_{23} = {h_{23} (x) \in L^{2} (P_{η}) ∣ E_{η} {h_{23} (X)} = 0} .

The adjoint operator of A_{2η} is

\begin{array}{l} A_{2 η}^{*} S (V, X, W) = (S (V, X, W) - E_{η} (S ∣ X, W), E_{η} (S ∣ X, W) - E_{η} (S ∣ X) \\ - {Cov}_{η} (S, ε ∣ X) {{Var}_{η} (ε ∣ X)}^{- 1} ε, E_{η} (S ∣ X) - E_{η} (S)) . \end{array}

It follows that $A_{2 η}^{*} A_{2 η} (h_{21}, h_{22}, h_{23}) = (h_{21} (v, w, x), h_{22} (w, x), h_{23} (x))$. Hence, $A_{2 η}^{*} A_{2 η}$ has a continuous inverse on H₂₁ × H₂₂ × H₂₃ and $P_{η} = A_{2 η} {(A_{2 η}^{*} A_{2 η})}^{- 1} A_{2 η}^{*}$ has the form

P_{η} s = s (V, X, W) - E_{η} (s ε ∣ X) {{Var}_{η} (ε ∣ X)}^{- 1} ε

for a mean-zero square-integrable function s. The efficient score for β under the full data is

S_{β}^{F, eff} = {X_{1}^{T} g_{1}^{'} (X_{1} β), \dots, X_{K}^{T} g_{K}^{'} (X_{K} β)} {{Var}_{η} (W ∣ X)}^{- 1} ε .

When the densities involving the nuisance parameters are uniformly bounded, the convexity requirement can be verified as follows.

τ q_{1} (v ∣ w, x) f_{1} {w - g (β)} p_{1} (x) + (1 - τ) q_{2} (v ∣ w, x) f_{2} {w - g (β)} p_{2} (x) = q_{τ} (v ∣ w, x) f_{τ} {w - g (β)} p_{τ} (x),

where

\begin{matrix} q_{τ} (v ∣ w, x) = \frac{τ q_{1} (v ∣ w, x) f_{1} {w - g (β)} p_{1} (x) + (1 - τ) q_{2} (v ∣ w, x) f_{2} {w - g (β)} p_{2} (x)}{τ f_{1} {w - g (β)} p_{1} (x) + (1 - τ) f_{2} {w - g (β)} p_{2} (x)}, \\ f_{τ} {w - g (β)} = \frac{τ f_{1} {w - g (β)} p_{1} (x) + (1 - τ) f_{2} {w - g (β)} p_{2} (x)}{τ p_{1} (x) + (1 - τ) p_{2} (x)}, \end{matrix}

and p_τ(x) = τp₁(x) + (1 − τ)p₂(x) for τ ∈ [0, 1].

Example 3: The missing covariate problem in Cox regression model

Suppose that T is the survival time of a subject, which is subject to censoring by censoring time C. Given the (time-independent) covariate Z, T and C are independent. X = T ∧ C = min(T, C) and δ = 1_{{T ≤ C}}, rather than (T, C), are observed. Z is subject to missing values. Assume that, given (T, C, Z), the missing data mechanism depends on the observed data R(Y) = {X, δ, R, R(Z)} only. Suppose that the Cox proportional hazards model holds, that is

lim_{Δ \to 0} \frac{1}{Δ} P (t < T \leq t + Δ ∣ T \geq t, Z) = λ (t) φ (β Z),

where φ is a known function and λ(t) is an unknown baseline hazard function. The nuisance parameter includes the censoring distribution, baseline hazard, and covariate distribution. The efficient score for estimating β when data are subject to MAR missing values is $E_{η} [m_{η}^{- 1} {D (X, δ, Z)} ∣ R, R (X, δ, Z)]$ (Robins et al., 1994; Nan et al., 2004), where D is the unique solution to

\begin{array}{l} \frac{\partial log φ}{\partial β} - \frac{E_{η} {ξ (u) \frac{\partial φ}{\partial β}}}{E_{η} {ξ (u) φ}} = m_{η}^{- 1} (D) (u, 1, Z) - \frac{E_{η} {m_{η}^{- 1} (D) ξ (u) ∣ Z}}{E_{η} {ξ (u) ∣ Z}} \\ - E_{η} (ξ (u) φ [m_{η}^{- 1} (D) (u, 1, Z) - \frac{E_{η} {m_{η}^{- 1} (D) ξ (u) ∣ Z}}{E_{η} {ξ (u) ∣ Z}}]) / E_{η} {ξ (u) φ} \end{array}

and

D = \int {b_{1} (u, Z) - \frac{E_{η} {ξ (u) φ b_{1} (u, Z)}}{E_{η} {ξ (u) φ}}} {d N (u) - ξ (u) φ (β Z) d Λ (u)}

for some b₁, where ξ(u) = 1_{{X ≥ u}} and N(u) = 1_{{X ≤ u, δ = 1}}. Successive approximation is needed to obtain a locally efficient estimator of β.
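The observed quantities X, δ, ξ(u) and N(u) that enter these expressions can be computed as in the sketch below. It is only an illustration under assumptions not made in the paper: φ is taken to be the exponential function, the baseline hazard is constant, the covariate is binary, and censoring times are exponential. All numerical values and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
beta, base_hazard, cens_rate = 0.5, 1.0, 0.5        # assumed values for illustration
Z = rng.binomial(1, 0.5, size=n)                     # assumed binary covariate
# With phi = exp and a constant baseline hazard, T | Z is exponential
# with rate base_hazard * exp(beta * Z).
T = rng.exponential(1.0 / (base_hazard * np.exp(beta * Z)))
C = rng.exponential(1.0 / cens_rate, size=n)         # censoring independent of T given Z

X = np.minimum(T, C)                                 # observed time X = min(T, C)
delta = (T <= C).astype(int)                         # event indicator 1{T <= C}

def at_risk(u):
    """xi(u) = 1{X >= u} for every subject."""
    return (X >= u).astype(int)

def counting_process(u):
    """N(u) = 1{X <= u, delta = 1} for every subject."""
    return ((X <= u) & (delta == 1)).astype(int)

u = np.median(X)
print(at_risk(u), counting_process(u))
```

Only (X, δ) and the observed part of Z would be available to the analyst; T and C appear here solely to generate the data.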

The density for (X, δ, Z) is

f (x, δ ∣ z, β, λ) p (z) = λ^{δ} (x) φ^{δ} (β z) exp {- Λ (x) φ (β z)} g_{c}^{1 - δ} (x ∣ z) {\bar{G}}_{c}^{δ} (x ∣ z) p (z),

where $Λ (x) = \int_{0}^{x} λ (t) d t$, g_c is the density function of the censoring time distribution G_c, and Ḡ_c = 1 − G_c. Let $Λ_{c} (x, z) = \int_{0}^{x} λ_{c} (t, z) d t$, λ_c = g_c/Ḡ_c, dM_T(t, z) = dN_T(t) − Y(t)λ_T(t, z)dt with N_T(t) = 1_{{X ≤ t, δ = 1}}, and dM_C(t, z) = dN_C(t) − Y(t)λ_C(t, z)dt with N_C(t) = 1_{{X ≤ t, δ = 0}}. The score operator for the parameters (β, λ, g_c, p) is

\begin{array}{l} A_{η} {h_{11}, h_{12} (x), h_{21} (x, z), h_{22} (z)} = \int h_{11}^{T} \frac{\partial}{\partial β} log φ (β Z) d M_{T} (t, z) + \int h_{12} (t) d M_{T} (t, z) \\ + \int h_{21} (t, Z) d M_{C} (t, Z) + h_{22} (Z), \end{array}

where H₁ = H₁₁ × H₁₂, H₂ = H₂₁ × H₂₂, and H₁₁ = R^k with k being the dimension of β,

\begin{array}{l} H_{12} = {h_{12} (t) ∣ h_{12} (t) \in L^{2} {d Λ (t)}}, \\ H_{21} = {h_{21} (t, z) ∣ h_{21} (t, z) \in L^{2} {d Λ_{C} (t ∣ z) d P (z)}}, \end{array}

and

H_{22} = {h_{22} (z) \in L^{2} (P_{η}) ∣ E_{η} {h_{22} (Z)} = 0} .

For Λs that are bounded at T₀, the study stopping time, H₁₂ does not change with Λ. Hence, H₁ is fixed. Define the inner product on H₂ as

< (h_{21}, h_{22}), (h_{21}^{*}, h_{22}^{*}) >_{H_{2}} = E_{η} {\int h_{21} (t, Z) h_{21}^{*} (t, Z) d Λ_{c} (t ∣ Z)} + E_{η} {h_{22} (Z) h_{22}^{*} (Z)} .

It is not difficult to see that H₂ is a Hilbert space under the inner product. Similarly, we can define an inner product on H₁ as

< (h_{11}, h_{12}), (h_{11}^{*}, h_{12}^{*}) >_{H_{1}} = h_{11}^{T} h_{11}^{*} + \int h_{12} (t) h_{12}^{*} (t) d Λ (t)

to make it a Hilbert space. Let H₁₀ be the subset of H₁ such that for any (h₁₁, h₁₂) ∈ H₁₀, h₁₁ is bounded by 1 and h₁₂ ∈ BV [0, T₀], i.e., h₁₂ has bounded variation on [0, T₀].

Let A_{2η}(h₂₁, h₂₂) = A_η(0, 0, h₂₁, h₂₂). The adjoint of A_{2η} satisfies

< A_{2 η} (h_{21}, h_{22}), s (x, δ, z) >_{L^{2} (P_{η})} = < (h_{21}, h_{22}), A_{2 η}^{*} {s (x, δ, z)} >_{H_{2}} .

Note that, for any s(X, δ, Z) ∈ L²(P_η), s can be represented as (Sasieni, 1992; Nan et al., 2004)

\begin{array}{l} s (X, δ, Z) = \int [s (t, 1, Z) - \frac{E_{η} {s (X, δ, Z) Y (t) ∣ Z}}{E_{η} {Y (t) ∣ Z}}] d M_{T} (t, z) \\ + \int [s (t, 0, Z) - \frac{E_{η} {s (X, δ, Z) Y (t) ∣ Z}}{E_{η} {Y (t) ∣ Z}}] d M_{C} (t, z) \\ + E_{η} {s (X, δ, Z) ∣ Z}, \end{array}

and the three components are orthogonal to each other. It follows that

\begin{array}{l} < A_{2 η} (h_{21}, h_{22}), s (X, δ, Z) >_{L^{2} (P_{η})} = E_{η} {h_{22} (Z) E_{η} (s ∣ Z)} \\ + E_{η} [\int h_{21} (t, Z) {s (t, 0, Z) - \frac{E_{η} {s (X, δ, Z) Y (t) ∣ Z}}{E_{η} {Y (t) ∣ Z}}} d < M_{C} > (t, z)], \end{array}

where d < M_C > (t, z) = Y (t)dΛ_C(t, z). It can be seen that the adjoint operator $A_{2 η}^{*}$ can be obtained as

\begin{array}{l} A_{2 η}^{*} (s) = (s (t, 0, z) E_{η} {Y (t) ∣ Z} - E_{η} {s (X, δ, Z) Y (t) ∣ Z}, \\ E_{η} {s (X, δ, Z) ∣ Z} - E_{η} {s (X, δ, Z)}) . \end{array}

By direct calculation, it follows that

A_{2 η}^{*} A_{2 η} (h_{21}, h_{22}) = (h_{21} (t, Z) E {Y (t) ∣ Z}, h_{22} (Z)),

which implies that $A_{2 η}^{*} A_{2 η}$ is continuously invertible on H₂ when E_η{Y(t)|Z} > σ > 0 for all t ≤ T₀. The projection operator appears as

P_{η} (s) = \int [s (t, 0, Z) - \frac{E_{η} {s (X, δ, Z) Y (t) ∣ Z}}{E_{η} (Y (t) ∣ Z)}] d M_{C} (t, Z) + E_{η} {s (X, δ, Z) ∣ Z} .

for a mean-zero square-integrable function s. The efficient score for estimating (β, Λ) can be expressed as

lim_{N \to \infty} E_{η} {{(I - P_{η} m_{η})}^{N} a (X, δ, Z) ∣ X, δ, R, R (Z)},

where

a (X, δ, Z) = \int h_{11}^{T} \frac{\partial}{\partial β} log φ (β Z) d M_{T} (t, Z) + \int h_{12} (t) d M_{T} (t, Z) .

The convexity requirement can be verified as follows.

τ g_{c 1}^{1 - δ} (x ∣ z) {\bar{G}}_{c 1}^{δ} (x ∣ z) p_{1} (z) + (1 - τ) g_{c 2}^{1 - δ} (x ∣ z) {\bar{G}}_{c 2}^{δ} (x ∣ z) p_{2} (z) = g_{c τ}^{1 - δ} (x ∣ z) {\bar{G}}_{c τ}^{δ} (x ∣ z) p_{τ} (z),

where

g_{c τ} (t ∣ z) = \frac{τ g_{c 1} (t ∣ z) p_{1} (z) + (1 - τ) g_{c 2} (t ∣ z) p_{2} (z)}{τ p_{1} (z) + (1 - τ) p_{2} (z)}

and

p_{τ} (z) = τ p_{1} (z) + (1 - τ) p_{2} (z),

for τ ∈ [0, 1]. Note that, because of the convexity requirement, we treat both the regression parameter and the baseline hazard as parameters of interest rather than the regression parameter alone. This treatment differs from those in the literature.

5 Numeric study

5.1 Simulation studies

We perform a simulation study on missing data in parametric regression with/without auxiliary covariates. Two parametric regression models were simulated. The first was the logistic regression. The second was the linear regression with a normal error.

In the logistic regression model, two independent covariates were simulated. One was binary and the other was normally distributed. One normally distributed auxiliary covariate was also simulated in this case. It was assumed that, given the covariates, the outcome and the auxiliary covariate were independent, but the auxiliary covariate depended on the other covariates. In the simulation, both covariates were subject to missing values and the missingness depended on the outcome and the auxiliary covariate only. More specifically, we assumed that E(Y | X₁, X₂) = g(β₀ + β₁X₁ + β₂X₂), where g(t) = (1 + e^{−t})⁻¹. The parameter was set to (β₀, β₁, β₂) = (1, −0.5, 0.5). The model for the auxiliary covariate V given X₁ and X₂ was set to E(V|X₁, X₂) = ψ₀ + ψ₁X₁ + ψ₂X₂ + ψ₃X₁X₂ with a standard normal error. In the simulation, we set (ψ₀, ψ₁, ψ₂) = (0.5, 0.3, 1) and either ψ₃ = 0, which corresponds to a correct model for V given X₁ and X₂ in the data analysis, or ψ₃ = −2, which corresponds to a severely misspecified model in the data analysis. Three missing covariate patterns were simulated: both covariates observed, only X₁ observed, and only X₂ observed. Let R₁ and R₂ denote the missingness indicators for X₁ and X₂ respectively. The missing data were generated by the model

log \frac{P (R_{1} = i, R_{2} = j ∣ Y, V)}{P (R_{1} = 1, R_{2} = 1 ∣ Y, V)} = α_{0} + α_{1} Y + α_{2} V + α_{3} Y V,

where (i, j) = (1, 0) or (0, 1) and (α₀, α₁, α₂) = (−0.5, −0.5, 0.5), and α₃ = 0 corresponding to a correct missing data mechanism model and α₃ = −1.5 corresponding to an incorrect missing data mechanism model in the data analysis.
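The data-generating design just described can be sketched as follows. The regression, auxiliary-covariate and missingness coefficients are those stated in the text. The marginal distributions of the covariates are not given there, so X₁ ~ Bernoulli(0.5) and X₂ ~ N(0, 1) are assumptions made here for illustration, as are the function name `simulate` and its arguments.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def simulate(n, psi3=0.0, alpha3=0.0, seed=0):
    """Generate one data set; r1 and r2 equal 1 when X1 and X2 are observed."""
    rng = np.random.default_rng(seed)
    # Assumed covariate distributions (the text only says binary and normal).
    x1 = rng.binomial(1, 0.5, size=n)
    x2 = rng.normal(size=n)
    y = rng.binomial(1, expit(1.0 - 0.5 * x1 + 0.5 * x2))
    v = 0.5 + 0.3 * x1 + 1.0 * x2 + psi3 * x1 * x2 + rng.normal(size=n)
    # Polytomous logit: the same linear predictor for patterns (1, 0) and (0, 1).
    odds = np.exp(-0.5 - 0.5 * y + 0.5 * v + alpha3 * y * v)
    p_complete = 1.0 / (1.0 + 2.0 * odds)
    p_x1_only = odds * p_complete
    u = rng.uniform(size=n)
    r1 = (u < p_complete + p_x1_only).astype(int)
    r2 = ((u < p_complete) | (u >= p_complete + p_x1_only)).astype(int)
    return {"y": y, "v": v, "x1": x1, "x2": x2, "r1": r1, "r2": r2}

data = simulate(400)
print(np.mean(data["r1"] * data["r2"]), np.mean(1 - data["r1"]), np.mean(1 - data["r2"]))
```

Setting `psi3=-2.0` or `alpha3=-1.5` reproduces the misspecification scenarios in which the analysis models omit the corresponding interaction term.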

The correct model for Y given X₁ and X₂ was always used in the analysis of the simulated data. The distributions of the covariates and the auxiliary covariate were assumed unknown in the analysis of the simulated data. The semiparametric odds ratio models with bilinear log-odds ratio functions were used for modeling the distributions of the covariates (X₁, X₂) and of the auxiliary covariate given the outcome and the covariates (X₁, X₂) in the analysis. The polytomous logit regression model with different sets of parameters for different missing patterns and without the interaction term was always assumed in the data analysis. This implies that the missing data mechanism model was misspecified in the analytical model if the model generating the missing data included the interaction term. To compare the performance of different methods, we computed the following estimators for the regression parameter. The first one was the estimator from the complete-case analysis (CC), which is the solution to the estimating equation,

\sum_{i = 1}^{n} 1_{{R_{i} = 1}} \frac{\partial}{\partial β} log f (Y_{i} ∣ X_{i}, β) = 0,

where 1 is a vector with 1 in all of its components. The second one was the simple missing-data-probability weighted estimator (SW), which is the solution to the estimating equation,

\sum_{i = 1}^{n} \frac{1_{{R_{i} = 1}}}{π_{i} (1)} \frac{\partial}{\partial β} log f (Y_{i} ∣ X_{i}, β) = 0,

where π_i(r) = π{r, r(V_i, X_i, Y_i), α̂} for every missing-data pattern r and α̂ is the maximum likelihood estimate under the missing-data mechanism model, that is, the polytomous linear logit model without interaction. The third was the augmented missing-data-probability weighted estimator (AW), which is the solution to the estimating equation,

\begin{array}{l} \sum_{i = 1}^{n} \left[ \frac{1_{\{R_{i} = 1\}}}{π_{i}(1)} \frac{\partial}{\partial β} \log f(Y_{i} ∣ X_{i}, β) + \sum_{r} \left\{ 1_{\{R_{i} = r\}} - \frac{1_{\{R_{i} = 1\}}}{π_{i}(1)} π_{i}(r) \right\} \right. \\ \left. \times \hat{E} \left\{ \frac{\partial}{\partial β} \log f(Y_{i} ∣ X_{i}, β) \,\Big|\, R = r, r(V_{i}, X_{i}, Y_{i}) \right\} \right] = 0, \end{array}

where Ê was computed using the distribution estimated from the following maximum likelihood estimator. The fourth was the maximum likelihood estimator (ML) with the bilinear odds ratio model and without interaction for the covariate distribution. The last two were the likelihood robustification estimators as proposed in this paper using approximation with N = 10 (LR-10) and N = 20 (LR-20) respectively.
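The CC and SW estimating equations differ only in the weight attached to each complete case, which the following Python sketch makes explicit for logistic regression. It is a simplified illustration under stated assumptions: a single complete-case indicator replaces the three patterns, the probability of being complete depends on Y only, and the true probability is used instead of the fitted π_i(1). All names (`solve_weighted_logistic`, `pi_complete`, and so on) are hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def weighted_score(X, y, w, beta):
    """Weighted logistic score: sum_i w_i * x_i * (y_i - expit(x_i' beta))."""
    return X.T @ (w * (y - expit(X @ beta)))

def solve_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Solve the weighted score equation by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        info = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(info, weighted_score(X, y, w, beta))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(1)
n = 2000
x1 = rng.integers(0, 2, size=n).astype(float)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
beta_true = np.array([1.0, -0.5, 0.5])
y = (rng.random(n) < expit(X @ beta_true)).astype(float)

# Hypothetical probability of being a complete case, depending on Y only.
pi_complete = expit(0.5 - 1.0 * y)
complete = (rng.random(n) < pi_complete).astype(float)

beta_cc = solve_weighted_logistic(X, y, complete)                 # weight 1_{R_i = 1}
beta_sw = solve_weighted_logistic(X, y, complete / pi_complete)   # weight 1_{R_i = 1} / pi_i(1)
```

Incomplete cases receive weight zero in both fits, so their covariate values never enter the score; in a real analysis those entries would be missing and would need to be masked before forming `X`.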

The simulation results were based on 500 replicates of a sample of size 400. The missing proportions for X₁ and for X₂ were each approximately 25%. The average number of complete cases thus obtained was close to 200. Table 1 lists the simulation results for the binary outcome data. As expected, when all models are correct, the biases of all estimators except the CC estimator are relatively small. Their efficiency differs, however: the ML estimator is the most efficient and the SW estimator the least efficient. When the covariate model is correct and the missing data mechanism model is incorrect, the CC and SW estimators can have substantial biases, whereas the biases of all the other estimators are small. When the missing data mechanism model is correct and the covariate model is incorrect, the SW estimator is unbiased and the ML estimator is subject to a sizable bias. Both the LR-10 and LR-20 estimators largely correct the bias in the ML estimator. When neither model is correct, all the estimators are biased, but the LR-10 and LR-20 estimators, along with the AW estimator, appear to have much smaller biases than the CC, SW, and ML estimators. The variance estimates for the likelihood robustification estimators appear to work well. The AW estimator performs well in all the above cases in terms of both bias and variation. This is partly because it is doubly robust in the narrow sense that the estimator is consistent when either the covariate models or the missing data mechanism model is correct, as long as both the missing data mechanism and its model depend only on the fully observed covariates and the outcome.

Table 1.

Simulation results for the logistic regression model with missing covariates and auxiliary information.

			β₀			β₁			β₂

CM	MDM	Method	Bias	Var	Vest	Bias	Var	Vest	Bias	Var	Vest
R	R	CC	0.29	0.071		−0.01	0.132		0.08	0.043
		SW	0.03	0.062		0.02	0.154		0.03	0.052
		AW	0.01	0.038		0.00	0.088		0.02	0.021
		LR-20	0.01	0.034	0.034	0.00	0.075	0.071	0.01	0.019	0.017
		LR-10	0.01	0.034	0.034	0.00	0.075	0.071	0.01	0.019	0.017
		ML	0.01	0.034		0.00	0.075		0.01	0.019

R	W	CC	0.50	0.068		0.18	0.126		0.11	0.037
		SW	−0.08	0.061		0.22	0.131		0.22	0.039
		AW	0.00	0.036		0.01	0.083		0.01	0.018
		LR-20	0.01	0.034	0.034	0.01	0.076	0.070	0.01	0.017	0.017
		LR-10	0.01	0.034	0.034	0.01	0.076	0.070	0.01	0.017	0.017
		ML	0.01	0.033		0.00	0.075		0.01	0.017

W	R	CC	0.26	0.070		0.00	0.145		0.02	0.044
		SW	0.04	0.063		−0.02	0.163		0.03	0.054
		AW	0.02	0.046		−0.01	0.100		0.02	0.030
		LR-20	0.02	0.036	0.036	−0.01	0.083	0.079	0.02	0.023	0.021
		LR-10	0.03	0.036	0.036	−0.03	0.082	0.078	0.03	0.023	0.021
		ML	0.10	0.040		−0.17	0.089		0.07	0.025

W	W	CC	0.36	0.050		0.38	0.118		0.01	0.033
		SW	−0.42	0.032		0.37	0.117		0.03	0.038
		AW	−0.03	0.033		0.08	0.087		0.03	0.025
		LR-20	−0.03	0.032	0.030	0.07	0.077	0.070	0.00	0.020	0.020
		LR-10	−0.04	0.032	0.030	0.08	0.077	0.070	0.00	0.020	0.020
		ML	−0.07	0.030		0.15	0.080		-0.02	0.020


CM: covariate model used in generating the data; R = linear model without the interaction term; W = linear model with the interaction term. MDM: missing data model used in generating the data; R = logistic model without interaction; W = logistic model with interaction. CC: complete-case analysis. SW: simple weighted estimating equation approach. AW: augmented weighted estimating equation approach. LR-20: Neumann approximation approach with N = 20. LR-10: Neumann approximation approach with N = 10. ML: the maximum likelihood estimator. Bias: average of estimated values minus the true value. Var: variance of the estimate computed from the simulation replicates. Vest: average of the estimated variances.
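The three summary columns defined in the table note can be computed from the replicate output as in the following Python sketch. The function name `summarize_replicates` and the use of the unbiased sample variance (`ddof=1`) are assumptions; the text does not state which divisor was used.

```python
import numpy as np

def summarize_replicates(estimates, variance_estimates, true_value):
    """Summarize simulation replicates (rows) for each parameter (columns)."""
    estimates = np.asarray(estimates, dtype=float)
    variance_estimates = np.asarray(variance_estimates, dtype=float)
    return {
        # Bias: average of estimated values minus the true value.
        "Bias": estimates.mean(axis=0) - true_value,
        # Var: variance of the estimates across replicates (ddof=1 is an assumption).
        "Var": estimates.var(axis=0, ddof=1),
        # Vest: average of the estimated variances.
        "Vest": variance_estimates.mean(axis=0),
    }

# Toy input with two replicates of a single parameter.
summary = summarize_replicates([[1.0], [3.0]], [[0.5], [1.5]], true_value=1.5)
```

Comparing Var with Vest, as in the LR rows of the table, indicates whether the variance estimator tracks the actual sampling variability.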

In the linear regression model, the variance of the residual error was set to 1. Variables were generated in the same way as in the logistic regression model, except that a normal error was used in generating Y and g(t) = t. To simplify the computation involved in the analysis of the simulated data, we included V as the third covariate in the linear regression model. However, V had no effect on Y conditional on X₁ and X₂. The integral with respect to y in computing the expectations in the robustification procedure was approximated by 10-point Gauss-Hermite quadrature. The LR-5 (N = 5) and LR-10 (N = 10) estimators were computed. Five hundred replicates of a sample of size 200 were used in the simulation. The results are shown in Table 2. The behavior of the estimators is almost the same as that observed in the previous scenario for the logistic regression model. The difference between LR-5 and LR-10 is still relatively small, which indicates that the convergence rate of the likelihood robustification approximation is reasonably fast in the simulated cases.
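The Gauss-Hermite step can be sketched as follows. For a normally distributed Y with a given mean and standard deviation, an expectation E{h(Y)} is approximated by a weighted sum over 10 nodes. The function name `normal_expectation` is illustrative, and the sketch covers only the generic quadrature rule, not the specific integrands of the robustification procedure.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_expectation(h, mean, sd, n_points=10):
    """Approximate E{h(Y)} for Y ~ N(mean, sd^2) by Gauss-Hermite quadrature."""
    # Nodes and weights for integrals of the form int exp(-x^2) f(x) dx.
    nodes, weights = hermgauss(n_points)
    # Change of variables y = mean + sqrt(2) * sd * x maps the rule to the normal density.
    y = mean + np.sqrt(2.0) * sd * nodes
    return np.sum(weights * h(y)) / np.sqrt(np.pi)

second_moment = normal_expectation(lambda y: y ** 2, mean=0.5, sd=1.0)
exp_moment = normal_expectation(np.exp, mean=0.5, sd=1.0)
```

A 10-point rule integrates polynomials up to degree 19 exactly against the normal density, so the second moment above is recovered up to rounding error, while smooth non-polynomial integrands such as the exponential are approximated closely.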

Table 2.

Simulation results for the linear regression model with missing covariates.

			β₀			β₁			β₂

CM	MDM	Method	Bias	Var	Vest	Bias	Var	Vest	Bias	Var	Vest
R	R	CC	0.20	0.020		0.03	0.039		−0.04	0.021
		SW	0.00	0.022		0.00	0.058		−0.01	0.030
		AW	0.01	0.015		−0.01	0.032		0.00	0.015
		LR-10	0.01	0.013	0.014	−0.01	0.028	0.029	−0.01	0.012	0.014
		LR-5	0.01	0.013	0.014	−0.01	0.028	0.029	−0.01	0.012	0.014
		ML	0.01	0.013		−0.01	0.028		−0.01	0.012

R	W	CC	0.00	0.015		0.09	0.023		−0.10	0.014
		SW	−0.32	0.018		0.06	0.041		−0.07	0.022
		AW	0.01	0.013		−0.01	0.030		0.00	0.014
		LR-10	0.01	0.011	0.013	−0.01	0.024	0.027	0.00	0.012	0.013
		LR-5	0.01	0.011	0.013	−0.01	0.024	0.027	0.00	0.012	0.013
		ML	0.01	0.011		−0.01	0.024		0.00	0.012

W	R	CC	0.22	0.015		0.00	0.032		−0.03	0.009
		SW	0.01	0.016		−0.02	0.045		−0.01	0.015
		AW	0.02	0.012		−0.04	0.033		0.01	0.006
		LR-10	0.02	0.012	0.014	−0.04	0.029	0.033	0.00	0.005	0.007
		LR-5	0.03	0.011	0.014	−0.05	0.028	0.032	0.01	0.005	0.007
		ML	0.08	0.012		−0.16	0.027		0.04	0.006

W	W	CC	−0.03	0.017		0.22	0.028		−0.15	0.009
		SW	−0.33	0.023		0.20	0.051		−0.13	0.016
		AW	−0.04	0.014		0.07	0.050		0.02	0.009
		LR-10	−0.02	0.011	0.011	0.02	0.028	0.030	−0.01	0.007	0.007
		LR-5	−0.02	0.011	0.011	0.03	0.028	0.030	−0.01	0.007	0.007
		ML	−0.04	0.011		0.08	0.030		−0.02	0.007


Abbreviations are defined similarly as in Table 1.

In summary, the SW estimator is sensitive to misspecification of the missing data mechanism. The ML estimator can have sizable bias when the covariate models are severely misspecified. We have also simulated other scenarios (not shown) which suggest that the ML estimator with the semiparametric odds ratio model for the covariates is relatively robust against covariate model misspecification. The AW estimator is very robust although it does not have the doubly robust property in general. The likelihood robustification estimators perform better than the AW estimator in all the cases. The estimators from LR-5, LR-10 and LR-20 are nearly indistinguishable, which suggests that approximation using N = 10 or even N = 5 is good enough in the simulated cases. Other simulations not shown indicate that the number N that gives a good approximation depends on the amount of missing data. In general, the higher the percentage of missing data, the larger the N required. In practice, N can be determined empirically by comparing the estimators obtained with different values of N. In the computation, the covariates that were subject to missing values were rounded to the nearest 0.05 in the logistic regression, and to the nearest 0.1 in the linear regression. The effect of the rounding on the parameter estimates was nearly negligible, as indicated by the results (not shown) when finer roundings were used.
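The dependence of the required N on the amount of missing data can be illustrated with a finite-dimensional toy analogue, sketched below in Python. The matrix `toy_information_operator(sigma, dim)` is a hypothetical stand-in, not the operator m_η of the paper: it mixes the identity (weight `sigma`, loosely playing the role of the probability of complete observation) with an averaging projection. The function `neumann_solve` accumulates the Neumann series for the solution of m u = s and stops when the latest term falls below a tolerance, mirroring the empirical rule of comparing successive approximations.

```python
import numpy as np

def neumann_solve(m, s, tol=1e-10, max_terms=10_000):
    """Approximate the solution of m u = s by the series sum_k (I - m)^k s."""
    u = s.copy()
    term = s.copy()
    for n_terms in range(1, max_terms + 1):
        term = term - m @ term          # apply (I - m) to the previous term
        u = u + term
        if np.max(np.abs(term)) < tol:  # successive approximations agree
            return u, n_terms
    raise RuntimeError("Neumann series did not converge")

def toy_information_operator(sigma, dim):
    """Hypothetical operator: sigma * identity + (1 - sigma) * averaging projection."""
    averaging = np.full((dim, dim), 1.0 / dim)
    return sigma * np.eye(dim) + (1.0 - sigma) * averaging

rng = np.random.default_rng(2)
s = rng.normal(size=6)
m_high = toy_information_operator(0.7, 6)   # little missing data
m_low = toy_information_operator(0.3, 6)    # much missing data
u_high, n_high = neumann_solve(m_high, s)
u_low, n_low = neumann_solve(m_low, s)
```

In this toy setting the k-th term shrinks like (1 − sigma)^k, so the smaller `sigma` is, the more terms are needed before the series stabilizes.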

5.2 Application to hip fracture data

The hip fracture data were collected by Dr. Barengolts at the College of Medicine of the University of Illinois at Chicago in a study of hip fracture in veterans. The study matched a case and a control by age and race. Risk factors for bone fracture were assessed. As in Chen (2004), we concentrated on 9 of the risk factors in the analysis. One of the challenging problems in analyzing this dataset is that most of the risk factors are subject to missing values and there are a large number (38 altogether) of missing patterns. This dataset was analyzed in Chen (2004) by the likelihood method using the semiparametric odds ratio models proposed there for the covariates. Since the covariate models applied there are not guaranteed to be correctly specified, it is of interest to verify whether any substantial bias is introduced into the parameter estimator by potential covariate model misspecification. This is assessed here by computing the doubly robust estimators of the parameter and comparing them with the maximum likelihood estimator.

There were a few obstacles in applying the proposed method to this dataset. The primary problem was to estimate the missing data probabilities. Since many missing patterns (26 out of 38) have fewer than 5 observations, it is virtually impossible to estimate missing data probabilities that depend on one or more variables. As a compromise, we assumed that the missingness did not depend on the observed or unobserved data, i.e., that the data were missing completely at random (MCAR). Under this assumption, the simple missing-data-probability weighted estimator is the same as the estimator from the complete-case analysis. We computed the estimator from the complete-case analysis, the maximum likelihood estimator, the augmented weighted estimator, and the likelihood robustification estimators with N = 10 and N = 20 respectively. In computing these estimators, we rounded the data for the three continuous variables, BMI, log(HGB), and Albumin, so that each of them had about 10 categories. This reduces the computation time and the storage space required. The effect of rounding on the parameter estimators is small, as discussed in Chen (2004). All the parameter estimates except LR-20, which is the same as LR-10, are shown in Table 3.

Table 3.

Analysis of the hip fracture data

	Method
Variable	CC	CC^*	AW^*	ML^*	LR-10^*
Etoh	1.39(0.39)	1.43(0.39)	1.26(0.30)	1.30(0.30)	1.29(0.29)
Smoke	0.93(0.40)	0.96(0.40)	0.76(0.29)	0.74(0.31)	0.73(0.29)
Dementia	2.51(0.72)	2.50(0.73)	1.57(0.46)	1.53(0.47)	1.48(0.47)
Antiseiz	3.31(1.06)	3.43(1.06)	2.66(0.62)	2.62(0.64)	2.62(0.62)
LevoT4	2.01(1.02)	2.10(1.03)	1.01(0.55)	0.91(0.61)	0.96(0.55)
AntiChol	−1.92(0.77)	−2.20(0.79)	−1.77(0.52)	−1.71(0.56)	−1.76(0.54)
BMI	−0.10(0.04)	−0.10(0.04)	−0.10(0.03)	−0.10(0.03)	−0.10(0.03)
log(HGB)	−2.60(1.20)	−2.21(1.19)	−2.92(0.95)	−2.98(0.94)	−2.99(0.94)
Albumin	−0.91(0.35)	−1.13(0.37)	−1.14(0.33)	−1.15(0.29)	−1.16(0.32)


*: Results are based on roundings of BMI, log(HGB), and Albumin to create 11, 11, and 10 categories respectively. Results from LR-20 are not shown because they are the same as those of LR-10. Standard deviations are given in parentheses. Variances for the CC and CC^* estimators are estimated by the information matrix; the variance of the AW estimator is estimated by the sandwich approach, adjusting for the fact that nuisance parameters are estimated. See Table 1 for the abbreviations.

The regression coefficients for LevoT4 and dementia estimated from the complete-case analysis are substantially different from those estimated by the other methods. The estimates from the maximum likelihood, the augmented weighted estimating equation, and the likelihood robustification approaches are very close. Estimates from the latter two are even closer. This suggests that the covariate models used in the likelihood approach are reasonable in the sense that they may be close to correctly specified or, even if they are incorrectly specified, the influence of the misspecification on the parameter estimates is very small.

6 Discussion

We have shown that the Neumann series approximation can be used to find a locally efficient estimator in missing data problems under the assumption that all configurations of the full data can be observed with a probability bounded away from zero. This helps to close a gap between the semiparametric efficiency theory for the missing data problem and the implementation of the procedure for finding such an estimator. The results can be easily modified to study the asymptotic behavior of doubly robust estimators when missing data are nonmonotone. Note that the results do not cover the case where a continuous inverse of m does not exist. Similar results in the latter case are expected to be much harder to obtain.

Supplementary Material

NIHMS168144-supplement-Supplementary.pdf (97KB, pdf)

Acknowledgments

The author thanks the editor for the detailed comments which have greatly improved the presentation of the paper. The author would also like to thank Professor James Robins for his insightful comments on the earlier versions of the paper. Comments from Drs. Y. Q. Chen and C. Y. Wang at FHCRC on the earlier versions of the paper are also very much appreciated. The research was supported by a grant from NCI/NIH on statistical methods for missing data.

Appendix: Proof of Theorems

Assume that the semiparametric model under consideration is π(r|y, η)p(y, η), where p(y, η) is the density of the full data y with respect to μ, a product of Lebesgue measures and counting measures. Assume that η = (β, θ) ∈ Ω × Θ, where β is the parameter of interest and θ is the nuisance parameter. The marginal distribution for the observed data (R, R(Y)) is g(r, r(y), η) = ∫ π(r|y, η)p(y, η)dr̄(y), where r̄(y) denotes those components of y that are missing in the missing pattern r. Assume that θ(γ), γ ∈ Γ, is a restricted parameterization of θ such that θ(Γ) ⊂ Θ. Let η₀ = (β₀, θ^*) and let P₀ be the true probability measure that generated the data. The following regularity conditions are used in the theorems.

1. For any β ∈ Ω and γ ∈ Γ, aside from a μ-zero set, {(r, y)|g(r, r(y), β, θ(γ)) > 0} is the same for any fixed r. π(r|y, η) and p(y, η) are bounded away from zero and ∞ uniformly for all y and η ∈ ℰ = Ω × θ(Γ) ⊂ Ω × Θ, if π(r|y, η) > 0 for a μ-nonzero set of y. The full data model p(y, η) is a convex family with respect to θ.
2. Ω is a compact subset of a Hilbert space. The true parameter value β₀ is an inner point of Ω. Θ is a subset of another Hilbert space. As n → ∞, γ̂ → γ^* in the norm defined on Γ. θ(γ) is continuous.
3. As an L²(μ) function of (β, θ) ∈ Ω × Θ, {π(r|y, η)p(y, η)}^{1/2} is Fréchet differentiable with respect to η ∈ Ω × Θ. The score operator, defined as 2{π(r|y, η)p(y, η)}^{−1/2} times the derivative, is denoted by (A_{1η}(h₁), A_{2η}(h₂)) with h₁ ∈ H₁ and h₂ ∈ H₂. Both H₁ and H₂ are Hilbert spaces.
4. $A_{2 η}^{*} A_{2 η}$ is continuously invertible at η₀, where $A_{2 η}^{*}$, mapping L²{P_η(y)} to H₂, is the adjoint operator of A_{2η}.
5. For η ∈ ℰ, (π_η p_η)^{1/2}A_{1η}(h₁) and (π_η p_η)^{1/2}A_{2η}(h₂) are continuous with respect to η in L²(μ) and are Fréchet differentiable with respect to β in a neighborhood of η₀ in L²(μ), and $A_{2 η}^{*} ({(π_{η} p_{η})}^{- 1 / 2} s)$ is continuous with respect to η in H₂ norm for s ∈ L²(μ) and is Fréchet differentiable with respect to β in a neighborhood of η₀ in H₂ norm.

6. Suppose that p(y, η) = dP_η/d(μ₁ ×···× μ_J) where μ_j is Lebesgue measure on R¹ or a counting measure, j = 1, ···, J. Suppose that, for any missing pattern r, π(r|y, η)p(y, η), A_{1η}(h₁), A_{2η}(h₂), and $A_{2 η}^{*} (s)$, and their derivatives with respect to β for η ∈ ℰ, are all continuous with respect to the jth argument of y if μ_j is Lebesgue measure, j = 1, ···, J, and h₁, h₂, and s are continuous with respect to y_j.

7. There exists a norm on ℰ, denoted by ||·||, such that

\begin{array}{l} \| π(r ∣ y, η_{1}) - π(r ∣ y, η_{2}) \|_{L^{\infty}(P_{η_{0}})} \leq C_{1} \| η_{1} - η_{2} \|, \\ \| p(y, η_{1}) - p(y, η_{2}) \|_{L^{\infty}(P_{η_{0}})} \leq C_{2} \| η_{1} - η_{2} \|, \\ \| A_{1 η_{1}}(h_{1}) - A_{1 η_{2}}(h_{1}) \|_{L^{\infty}(P_{η_{0}})} \leq C_{3} \| η_{1} - η_{2} \| \, \| h_{1} \|_{H_{1}}, \\ \| A_{2 η_{1}}(h_{2}) - A_{2 η_{2}}(h_{2}) \|_{L^{\infty}(P_{η_{0}})} \leq C_{4} \| η_{1} - η_{2} \| \, \| h_{2} \|_{H_{2}}, \end{array}

and

\| A_{2 η_{1}}^{*}(s) - A_{2 η_{2}}^{*}(s) \|_{H_{2}} \leq C_{5} \| η_{1} - η_{2} \| \, \| s \|_{L^{\infty}(P_{η_{0}})}

for any η₁, η₂ ∈ ℰ and some constants C_i, i = 1, ···, 5. The ε-covering number for ℰ under ||·||, N(ℰ, ε, ||·||), satisfies

\int_{0}^{\infty} \sqrt{\log N(\mathcal{E}, ε / ∣ \log ε ∣, \| \cdot \|)} \, dε < \infty .

8. There exists an H₁₀ ⊂ H₁, for any fixed continuously invertible map from H₁ to itself, if < h₁, β >_{H₁} = 0 for all h₁ ∈ H₁₀, then β = 0. The covering number of H₁₀ under the supremum norm, N(H₁₀, ε, ||·||_∞), satisfies

$\int_{0}^{\infty} \sqrt{\log N(H_{10}, ε, \| \cdot \|_{\infty})} \, dε < \infty .$
9. $\inf_{\| h_{1} \|_{H_{1}} = 1, \| h_{1}^{*} \|_{H_{1}} = 1} \left| E_{0} \left\{ \frac{\partial T(η_{0})}{\partial β}(h_{1})(h_{1}^{*}) \right\} \right| > 0.$
10. $\inf_{\| h_{1} \|_{H_{1}} = 1, \| h_{1}^{*} \|_{H_{1}} = 1} \left| E_{0} \left\{ \frac{\partial T_{N}(β_{N}, θ^{*})}{\partial β}(h_{1})(h_{1}^{*}) \right\} \right| > 0.$
11. There exist U_i, i = 1, ···, n, iid with E₀U^{⊗2} finite such that

$θ(\hat{γ}) - θ^{*} = \frac{1}{n} \sum_{i = 1}^{n} U_{i} + o_{p}\left(\frac{1}{\sqrt{n}}\right),$

where θ^* = θ(γ^*).
12. A_{1η}(h₁) and A_{2η}(h₂) are Fréchet differentiable with respect to θ along the path θ(γ), γ ∈ Γ, in a neighborhood of η₀ in L²(P_{η₀}), and $A_{2 η}^{*} (s)$ is Fréchet differentiable with respect to θ along the path θ(γ), γ ∈ Γ, in a neighborhood of η₀ in H₂.

Before we prove Theorems 1–3, we first establish a set of lemmas. These lemmas mostly serve to show that T and T_N are differentiable and that $\mathcal{F} = \cup_{N = 0}^{\infty} \{T_{N}(η)(h_{1}) ∣ η \in \mathcal{E}, h_{1} \in H_{10}\} \cup \{T(η)(h_{1}) ∣ η \in \mathcal{E}, h_{1} \in H_{10}\}$ is a P₀-Donsker class. Let D_η= (I − Inline graphic m_η) where $P_{η} = A_{2 η} {(A_{2 η}^{*} A_{2 η})}^{- 1} A_{2 η}^{*}$ . It follows that $T_{N}(η, h_{1}) = E_{η}\{D_{η}^{N} A_{1 η} h_{1} ∣ R, R(Y)\}$ . We start with lemmas on the differentiability of T_N with respect to β. Proofs of the lemmas are omitted here and can be found in the supplementary material.

Lemma 1

(a) Under assumptions 1–5, $g_{η}^{1 / 2} T_{N}(η)(h_{1})$ , for any N, is Fréchet differentiable with respect to β in L²(μ) in a neighborhood of η₀ in ℰ, and both $g_{η}^{1 / 2} T_{N}(η)(h_{1})$ , for any N, and the derivatives are continuous at η₀ in L²(μ). If we define the derivative of T_N with respect to β as

$\frac{\partial T_{N}(η)}{\partial β}(h_{1})(h_{1}^{*}) = g_{η}^{- 1 / 2} \left\{ \frac{\partial \{g_{η}^{1 / 2} T_{N}(η)(h_{1})\}}{\partial β}(h_{1}^{*}) - \frac{1}{2} g_{η}^{1 / 2} B_{1 η}(h_{1}^{*}) T_{N}(η)(h_{1}) \right\},$

where the first term on the right-hand side of the equation refers to the derivative of $g_{η}^{1 / 2} T_{N}(η)(h_{1})$ with respect to β, then

$\left\| ε^{- 1} \{T_{N}(η + ε h_{1}^{*})(h_{1}) - T_{N}(η)(h_{1})\} - \frac{\partial T_{N}(η)}{\partial β}(h_{1})(h_{1}^{*}) \right\|_{L^{2}(P_{η})} \to 0,$

as ε → 0.
(b) Under assumptions 1–5 and 12, $g_{η}^{1 / 2} T_{N}(η)(h_{1})$ , for any N, is Fréchet differentiable with respect to θ along the path θ(γ), γ ∈ Γ, in L²(μ) in a neighborhood of η₀ and the derivatives are continuous at η₀ in L²(μ). The derivative of T_N with respect to θ is defined similarly.

Lemma 2

Under assumptions 1–5, T(η)(h₁) is Fréchet differentiable with respect to β in L²(P_{η₀}) in a neighborhood of η₀ and both T(η)(h₁) and the derivative are continuous at η₀ in L²(P_{η₀}). If we define the derivative of T with respect to β as

\frac{\partial T(η)}{\partial β}(h_{1})(h_{1}^{*}) = g_{η}^{- 1 / 2} \left\{ \frac{\partial \{g_{η}^{1 / 2} T(η)(h_{1})\}}{\partial β}(h_{1}^{*}) - \frac{1}{2} g_{η}^{1 / 2} B_{1 η}(h_{1}^{*}) T(η)(h_{1}) \right\},

where the first term on the right-hand side of the equation refers to the derivative of $g_{η}^{1 / 2} T(η)(h_{1})$ with respect to β, then

\left\| ε^{- 1} \{T(η + ε h_{1}^{*})(h_{1}) - T(η)(h_{1})\} - \frac{\partial T(η)}{\partial β}(h_{1})(h_{1}^{*}) \right\|_{L^{2}(P_{η})} \to 0,

as ε → 0.

Lemma 3

Under assumptions 1–3, for any p > 0, N > 0, and s, a function of Y in L^∞(P_η),

{| | D_{η}^{N + p} s - D_{η}^{N} s | |}_{L^{\infty} (P_{η})} \leq K N^{c (Y)} {(1 - σ)}^{N} {| | s | |}_{L^{\infty} (P_{η})},

where c(Y) denotes the cardinality of Y and K is a constant independent of N.

Lemma 4

Suppose that there exists a measurable set 𝒩 with P₀(𝒩) = 0 such that for all $t(Y) \in \cup_{N = 0}^{\infty} \{D_{η}^{N} s ∣ s \in S, η \in \mathcal{E}\} \cup \{\lim_{N \to \infty} D_{η}^{N} s \text{ in } L^{\infty}(P_{η}) ∣ s \in S, η \in \mathcal{E}\}$ ,

\| t(Y) \|_{L^{\infty}(P_{η})} \geq b \sup_{y \in \mathcal{N}^{c}} ∣ t(y) ∣, (2)

where S is a set of functions of y and 0 < b ≤ 1 is a constant. Then,

\cup_{N = 0}^{\infty} \{D_{η}^{N} s ∣ s \in S, η \in \mathcal{E}\} \cup \{\lim_{N \to \infty} D_{η}^{N} s \text{ in } L^{\infty}(P_{η}) ∣ s \in S, η \in \mathcal{E}\}

is a P₀-Donsker class if

S is P₀-measurable and L^∞(P₀) bounded satisfying

$\int_{0}^{\infty} \sqrt{log N (S, ε, L^{\infty} (P_{η}))} d ε < \infty,$ (3)

for a fixed η ∈ ℰ.
For any η₁, η₂ ∈ ℰ, s ∈ S, and a fixed η ∈ ℰ, there exists a C(η) < ∞ such that

$∣ D_{η_{1}} s - D_{η_{2}} s ∣ \leq C (η) | | η_{1} - η_{2} | | {| | s | |}_{L^{\infty} (P_{η})},$

and ℰ has covering numbers under ||·|| satisfying

$\int_{0}^{\infty} \sqrt{log N (E, ε / ∣ log ε ∣, | | \cdot | |)} d ε < \infty .$ (4)

Lemma 5

Under assumption 6, if s(y, η) is continuous with respect to argument j when μ_j is Lebesgue measure, then there exists a measurable set 𝒩 with P_η(𝒩) = 0 such that

\| t(Y, η) \|_{L^{\infty}(P_{η})} = \sup_{y \in \mathcal{N}^{c}} ∣ t(y, η) ∣,

for any

t \in \cup_{N = 0}^{\infty} {D_{η}^{N} s ∣ s \in S, η \in E} \cup {lim_{N \to \infty} D_{η}^{N} s ∣ s \in S, η \in E},

where the limit is in the sense of the L^∞(P_η) norm.

Lemma 6

Under assumptions 1–7, we have

{| | D_{η_{1}} s - D_{η_{2}} s | |}_{L^{\infty} (P_{η})} \leq C_{1} (η) | | η_{1} - η_{2} | | {| | s | |}_{L^{\infty} (P_{η})},

and

{| | E_{η_{1}} {D_{η_{1}} s ∣ R, R (y)} - E_{η_{2}} {D_{η_{2}} s ∣ R, R (y)} | |}_{L^{\infty} (P_{η})} \leq C_{2} (η) | | η_{1} - η_{2} | | {| | s | |}_{L^{\infty} (P_{η})},

for some constants C₁(η), C₂(η) < ∞.

Lemma 7

Under assumptions 1–8,

F = \cup_{N = 0}^{\infty} {T_{N} (η) (h_{1}) ∣ η \in E, h_{1} \in H_{10}} \cup {T (η) (h_{1}) ∣ η \in E, h_{1} \in H_{10}}

is a P₀-Donsker class.

Proof of Theorem 1

Let η̃ = (β̃, θ̂). By definition, P_nT(η̃)(h₁) = 0 for h₁ ∈ H₁₀. From Lemma 7, {T(η)(h₁)| η ∈ ℰ, h₁ ∈ H₁₀} is a P₀-Donsker class with bounded envelope function, and thus a P₀-Glivenko-Cantelli class. For a convergent point of a subsequence of η̃, denoted by $η_{0}^{*} = (β_{0}^{*}, θ^{*})$ , it follows from the continuity of T(η) in a neighborhood of η₀ (Lemma B.1 (a)) that $P_{0} T (η_{0}^{*}) (h_{1}) = 0$ . Note that P₀T(η₀)(h₁) = 0, where η₀ = (β₀, θ^*). Since T(η)(h₁) is differentiable with respect to β in a neighborhood of β₀ in L²(P_η₀) at η₀ and the derivative is continuous at η₀ in η, by the mean value theorem,

P_{0} {T (β_{0}^{*}, θ^{*}) (h_{1}) - T (β_{0}, θ^{*}) (h_{1})} = P_{0} [\frac{\partial T}{\partial β} {η_{0} + λ (η_{0}^{*} - η_{0})} (h_{1}) (β_{0}^{*} - β_{0})]

for some 0 ≤ λ ≤ 1 and all h₁ ∈ H₁₀. From assumptions 8 and 9, we can conclude that $β_{0}^{*} = β_{0}$ locally. Since θ̂ converges (Assumption 2) and β̃ varies in a compact set, which implies each convergent subsequence converges to the same limit, β̃ locally converges to β₀ almost surely.

Since E₀{T(η̃)(h₁)− T(η₀)(h₁)}² → 0 uniformly for h₁ ∈ H₁₀ and {T(η)(h₁)|η ∈ ℰ, h₁ ∈ H₁₀} is a P₀-Donsker class with bounded envelope function, it follows (van der Vaart & Wellner, 1996, Lemma 3.3.5 on page 311) that $\sqrt{n} (P_{n} - P_{0}) {T (\tilde{η}) (h_{1}) - T (η_{0}) (h_{1})} = o_{P} (1)$ , which implies that $- \sqrt{n} P_{0} T (\tilde{η}) (h_{1}) = \sqrt{n} (P_{n} - P_{0}) T (η_{0}) (h_{1}) + o_{P_{0}} (1)$ . Note that

\sqrt{n} P_{0} {T (\tilde{η}) (h_{1}) - T (β_{0}, \hat{θ}) (h_{1})} = P_{0} {\frac{\partial T}{\partial β} (β_{0}, θ^{*}) (h_{1}) \sqrt{n} (\tilde{β} - β_{0})} + o_{P_{0}} (| | \sqrt{n} (\tilde{β} - β_{0}) | |),

and that P₀T(β₀, θ̂)(h₁) = P₀T(β₀, θ^*)(h₁) = 0. It now follows that

- P_{0} [\frac{\partial T}{\partial β} (η_{0}) (h_{1}) {\sqrt{n} (\tilde{β} - β_{0})}] = \sqrt{n} (P_{n} - P_{0}) T (η_{0}) (h_{1}) + o_{P_{0}} (1 + | | \sqrt{n} (\tilde{β} - β_{0}) | |) .

By replacing h₁ in the foregoing equation by $Q_{0}^{- 1} h_{1}$ , it follows that

< \sqrt{n} (\tilde{β} - β_{0}), h_{1} >_{H_{1}} = \sqrt{n} P_{n} T (η_{0}) {- Q_{0}^{- 1} (h_{1})} + o_{P_{0}} (1 + | | \sqrt{n} (\tilde{β} - β_{0}) | |),

which implies that $< \sqrt{n} (\tilde{β} - β_{0}), h_{1} >_{H_{1}} = O_{P_{0}} (1)$ and that $< \sqrt{n} (\tilde{β} - β_{0}), h_{1} >_{H_{1}} \to N (0, V (h_{1}))$ uniformly on h₁ ∈ H₁₀, where $V (h_{1}) = E_{0} {[T (η_{0}) {- Q_{0}^{- 1} (h_{1})}]}^{2}$ .

To prove the consistency of the variance estimate, let ${\tilde{η}}_{h} = (\hat{β} + \frac{h}{\sqrt{n}}, \hat{θ})$ for a fixed h ∈ H₁. That {T(η̃_h)(h₁) − T(η̃)(h₁)|h₁ ∈ H₁₀} is a P₀-Donsker class implies that $\sqrt{n} (P_{n} - P_{0}) {T ({\tilde{η}}_{h}) - T (\tilde{η})} (h_{1}) = o_{P_{0}} (1)$ . It follows that, apart from a o_P₀ (1) term,

\begin{array}{l} \sqrt{n} P_{n} {T ({\tilde{η}}_{h}) - T (\tilde{η})} (h_{1}) = \sqrt{n} P_{0} {T ({\tilde{η}}_{h}) - T (\tilde{η})} (h_{1}) \\ = P_{0} {\frac{\partial T (η_{0})}{\partial β} (h) (h_{1})} = < Q_{0} h, h_{1} >_{H_{1}} . \end{array}

Since {T²(η̃)(h₁)|h₁ ∈ H₁₀} is a Glivenko-Cantelli class, P_n{T(η̃)(h₁)}² = P₀{T(η₀)(h₁)}²+ o_P₀ (1) uniformly in h₁ ∈ H₁₀. It can now be seen that the asymptotic variance of $< \sqrt{n} (\tilde{β} - β_{0}), h_{1} >_{H_{1}}$ can be consistently estimated by P_n{T(η̃)(h)}², where h ∈ H₁ solves the equation $< h_{1}, h_{1} >_{H_{1}} = \sqrt{n} P_{n} {T ({\tilde{η}}_{h}) (h_{1}) - T (\tilde{η}) (h_{1})}$ . Note that, from the previous derivation, for any fixed h₁ ∈ H₁₀, h thus defined converges in probability (P₀) to $Q_{0}^{- 1} (h_{1})$ uniformly over h₁ ∈ H₁₀.
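The variance estimate above combines a numerical derivative of the empirical score, taken along perturbations of size h/√n, with the empirical second moment of the score. The following Python sketch shows a finite-dimensional analogue under loud assumptions: the least-squares score x_i(y_i − x_i'β) stands in for T, β is a Euclidean vector, and no nuisance parameter is estimated. The names `score_contributions` and `numerical_sandwich` are hypothetical.

```python
import numpy as np

def score_contributions(X, y, beta):
    """Per-observation least-squares scores x_i * (y_i - x_i' beta) (stand-in for T)."""
    return X * (y - X @ beta)[:, None]

def numerical_sandwich(X, y, beta_hat):
    """Sandwich variance using a numerical derivative with step h / sqrt(n)."""
    n, d = X.shape
    base = score_contributions(X, y, beta_hat).mean(axis=0)
    Q = np.empty((d, d))
    for j in range(d):
        h = np.zeros(d)
        h[j] = 1.0
        shifted = score_contributions(X, y, beta_hat + h / np.sqrt(n)).mean(axis=0)
        # Column j approximates the derivative of the mean score in direction h.
        Q[:, j] = np.sqrt(n) * (shifted - base)
    T = score_contributions(X, y, beta_hat)
    M = T.T @ T / n                      # empirical second moment of the score
    Q_inv = np.linalg.inv(Q)
    return Q_inv @ M @ Q_inv.T / n       # estimated variance of beta_hat

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
V_hat = numerical_sandwich(X, y, beta_hat)
```

Because the least-squares score is linear in β, the finite difference is exact here and the result coincides with the usual heteroscedasticity-robust sandwich formula; for a nonlinear score the step h/√n yields an approximation whose error vanishes as n grows.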

When both the missing data mechanism model and the nuisance model for the full data are correctly specified, θ^* = θ₀. That is, P_η₀= P₀. It follows from $E_{(β_{0} + h / \sqrt{n}, θ_{0})} T (β_{0} + h / \sqrt{n}, θ_{0}) (h_{1}) = 0$ for any h, h₁ ∈ H₁₀ that

- E_{0} {\frac{\partial T}{\partial β} (η_{0}) (h) (h_{1})} = E_{0} {T (β_{0}, θ_{0}) (h_{1}) A_{1 (β_{0}, θ_{0})} (h)} = E_{0} {T (β_{0}, θ_{0}) (h_{1}) T (β_{0}, θ_{0}) (h)} .

This implies that β̃ is asymptotically efficient.

Proof of Theorem 2

For any convergent point of β̃_N, denoted by β_N, let η̂_N = (β̃_N, θ̂) and η_N = (β_N, θ̂). By definition, P_nT_N(η̂_N)(h₁) = 0. Note that ℱ (see Lemma 7 for its definition) is a P_η₀-Donsker class with bounded envelope function, and thus a P_η₀-Glivenko-Cantelli class. Since dP_η₀/dP₀ is bounded away from 0 and ∞, ℱ is a P₀-Glivenko-Cantelli class. Since P₀{T_N(η̂_N)(h₁) − T_N(η_N)(h₁)}² → 0 uniformly over η_N and h₁ ∈ H₁₀ for any fixed N as n → ∞, it follows that (P_n − P₀) {T_N(η̂_N)(h₁) − T_N(η_N)(h₁)} = o_P₀ (1). Hence, P_nT_N(η_N)(h₁) = o_P₀ (1), which implies that P₀T_N(η_N)(h₁) = 0 for all h₁ ∈ H₁₀. Note that

P_{0} {T_{N} (β_{N}, \hat{θ}) (h_{1}) - T_{N} (β_{0}, \hat{θ}) (h_{1})} = P_{0} {\frac{\partial T_{N}}{\partial β} (β_{0}, θ^{*}) (h_{1}) (β_{N} - β_{0})} + o_{P_{0}} ({| | β_{N} - β_{0} | |}_{H_{1}}) .

It follows from the definition of Q_{0N} that, for all h₁ ∈ H₁₀,

< β_{N} - β_{0}, h_{1} >_{H_{1}} = P_{0} {T_{N} (β_{N}, \hat{θ}) (- Q_{0 N}^{- 1} h_{1}) - T_{N} (β_{0}, \hat{θ}) (- Q_{0 N}^{- 1} h_{1})} + o_{P_{0}} ({| | β_{N} - β_{0} | |}_{H_{1}}) .

Next, note that $\sqrt{n} (P_{n} - P_{0}) {T_{N} ({\tilde{β}}_{N}, \hat{θ}) (h_{1}) - T_{N} (β_{N}, \hat{θ}) (h_{1})} = o_{P_{0}} (1)$ uniformly over h₁ ∈ H₁₀. It follows that, for all h₁ ∈ H₁₀,

\begin{array}{l} - \sqrt{n} P_{n} T_{N} (β_{N}, \hat{θ}) (h_{1}) = \sqrt{n} P_{0} {T_{N} ({\tilde{β}}_{N}, \hat{θ}) (h_{1}) - T_{N} (β_{N}, \hat{θ}) (h_{1})} + o_{P_{0}} (1) \\ = P_{0} {\frac{\partial T}{\partial β} (β_{N}, θ^{*}) \sqrt{n} ({\tilde{β}}_{N} - β_{N}) (h_{1})} + o_{P_{0}} (\sqrt{n} {| | {\tilde{β}}_{N} - β_{N} | |}_{H_{1}} + 1) . \end{array}

From assumption 10 and the definition of Q_N, it follows that

\begin{array}{l} < \sqrt{n} ({\tilde{β}}_{N} - β_{N}), h_{1} >_{H_{1}} = \sqrt{n} P_{0} {T_{N} (β_{N}, \hat{θ}) (- Q_{N}^{- 1} h_{1}) - T_{N} (β_{N}, θ^{*}) (- Q_{N}^{- 1} h_{1})} \\ + \sqrt{n} (P_{n} - P_{0}) T_{N} (β_{N}, θ^{*}) (- Q_{N}^{- 1} h_{1}) + o_{P_{0}} ({| | \sqrt{n} ({\tilde{β}}_{N} - β_{0})) | |}_{H_{1}} + 1) . \end{array}

Note that, apart from a o_P₀ (1) term,

\sqrt{n} P_{0} {T_{N} (β_{N}, \hat{θ}) (h_{1}) - T_{N} (β_{N}, θ^{*}) (h_{1})} = \sum_{i = 1}^{n} P_{0} {T_{N} (β_{N}, θ^{*} + \frac{U_{i}}{\sqrt{n}}) (h_{1}) - T_{N} (β_{N}, θ^{*}) (h_{1})}

because T_N is Fréchet differentiable at (β_N, θ^*) with respect to θ along the path θ(γ) in L²(P₀) and $\hat{θ} - θ^{*} = \frac{1}{n} \sum_{i = 1}^{n} U_{i} + o_{P_{0}} (\frac{1}{\sqrt{n}})$ , where E₀(U) = 0. It follows that, uniformly over h₁ ∈ H₁₀, $\sqrt{n} ({\tilde{β}}_{N} - β_{N}) (h_{1}) \to N (0, V_{N} (h_{1}))$ where

V_{N} (h_{1}) = E_{0} {[T_{N} (β_{N}, θ^{*}) (- Q_{N}^{- 1} h_{1}) + P_{0} {\frac{\partial T_{N} (β_{N}, θ^{*})}{\partial θ} (u) (- Q_{N}^{- 1} h_{1})}_{u = U}]}^{2} .

The consistency of the variance estimate can be shown by virtually identical statements as those given in the previous theorem.

Proof of Theorem 3

Let β_{N_n} denote the solution to the equation P_nT_{N_n}(β_{N_n}, θ̂)(h₁) = 0 for all h₁ ∈ H₁. From the Donsker class result and the uniform convergence of T_N to T in L²(P₀), it follows that $\sqrt{n} (P_{n} - P_{0}) {T_{N_{n}} (β_{N_{n}}, \hat{θ}) (h_{1}) - T (β_{N_{n}}, \hat{θ}) (h_{1})} = o_{P_{0}} (1)$ , uniformly over h₁ ∈ H₁₀. The equations imply that

\sqrt{n} P_{n} {T (β_{N_{n}}, \hat{θ}) (h_{1})} = \sqrt{n} P_{0} {T_{N_{n}} (β_{N_{n}}, \hat{θ}) (h_{1}) - T (β_{N_{n}}, \hat{θ}) (h_{1})} + o_{P_{0}} (1)

uniformly for h₁ ∈ H₁₀. Since $∣ \sqrt{n} P_{0} {T_{N_{n}} (β_{N_{n}}, \hat{θ}) (h_{1}) - T (β_{N_{n}}, \hat{θ}) (h_{1})} ∣ \leq K \sqrt{n} {(1 - σ)}^{N_{n}}$ , and that log(n)/N_n → 0 implies $K \sqrt{n} {(1 - σ)}^{N_{n}} \to 0$ as n → ∞, it follows that $\sqrt{n} P_{n} {T (β_{N_{n}}, \hat{θ}) (h_{1})} = o_{P_{0}} (1)$ , uniformly over h₁ ∈ H₁₀. From the definition of β̃, it follows that $\sqrt{n} P_{n} {T (β_{N_{n}}, \hat{θ}) (h_{1}) - T (\tilde{β}, \hat{θ}) (h_{1})} = o_{P_{0}} (1)$ , uniformly over h₁ ∈ H₁₀. From the fact that ℱ is a P₀-Donsker class and that T is continuous in L²(P₀) at η₀, it follows that P₀{T(β_{N_n}, θ̂)(h₁) − T(β̃, θ̂)(h₁)}² → 0, uniformly over h₁ ∈ H₁₀ and thus that $\sqrt{n} (P_{n} - P_{0}) {T (β_{N_{n}}, \hat{θ}) (h_{1}) - T (\tilde{β}, \hat{θ}) (h_{1})} = o_{P_{0}} (1)$ , uniformly over h₁ ∈ H₁₀. It follows that $\sqrt{n} P_{0} {T (β_{N_{n}}, \hat{θ}) (h_{1}) - T (\tilde{β}, \hat{θ}) (h_{1})} = o_{P_{0}} (1)$ , uniformly over h₁ ∈ H₁₀. It now follows from Lemma 2 and assumption 9 that $< \sqrt{n} (β_{N_{n}} - \tilde{β}), h_{1} >_{H_{1}} = o_{P_{0}} (1)$ , uniformly over h₁ ∈ H₁₀
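The step from log(n)/N_n → 0 to the vanishing of the approximation error can be spelled out as follows, assuming 0 < σ < 1 so that log(1 − σ) is finite and negative:

\sqrt{n} {(1 - σ)}^{N_{n}} = \exp \left\{ \frac{1}{2} \log n + N_{n} \log (1 - σ) \right\} = \exp \left[ - N_{n} \left\{ ∣ \log (1 - σ) ∣ - \frac{\log n}{2 N_{n}} \right\} \right] \to 0,

because the term in braces converges to |log(1 − σ)| > 0 while N_n → ∞. For example, any sequence N_n that grows like (log n)^{1+δ} with δ > 0 satisfies the condition.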

For the variance estimate, let $η_{N_{n} h} = ({\hat{β}}_{N_{n}} + \frac{h}{\sqrt{n}}, \hat{θ})$ for a fixed h ∈ H₁. Again, from the Donsker class result and the uniform convergence of T_N to T in L²(P₀), it follows that $\sqrt{n} (P_{n} - P_{0}) \{T_{N_{n}} (η_{N_{n} h}) (h_{1}) - T_{N_{n}} (η_{N_{n}}) (h_{1})\} = o_{P_{0}} (1)$, uniformly over h₁ ∈ H₁₀. It follows from the foregoing equation that $\sqrt{n} P_{n} \{T_{N_{n}} (η_{N_{n} h}) (h_{1}) - T_{N_{n}} (η_{N_{n}}) (h_{1})\}$ can be rewritten as $\sqrt{n} P_{0} \{T_{N_{n}} (η_{N_{n} h}) (h_{1}) - T (η_{N_{n} h}) (h_{1})\} - \sqrt{n} P_{0} \{T_{N_{n}} (η_{N_{n}}) (h_{1}) - T (η_{N_{n}}) (h_{1})\}$ plus $\sqrt{n} P_{0} \{T (η_{N_{n} h}) (h_{1}) - T (η_{N_{n}}) (h_{1})\} + o_{P_{0}} (1)$. Since

\left| \sqrt{n} P_{0} \{T_{N_{n}} (η_{N_{n} h}) (h_{1}) - T (η_{N_{n} h}) (h_{1})\} \right| \leq \sqrt{n} K {(1 - σ)}^{N_{n}} \to 0

and $\left| \sqrt{n} P_{0} \{T_{N_{n}} (η_{N_{n}}) (h_{1}) - T (η_{N_{n}}) (h_{1})\} \right| \leq \sqrt{n} K {(1 - σ)}^{N_{n}} \to 0$ as n → ∞ when log(n)/N_n → 0, it follows that

\sqrt{n} P_{n} \{T_{N_{n}} (η_{N_{n} h}) (h_{1}) - T_{N_{n}} (η_{N_{n}}) (h_{1})\} = \sqrt{n} P_{0} \{T (η_{N_{n} h}) (h_{1}) - T (η_{N_{n}}) (h_{1})\} + o_{P_{0}} (1) .
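
The geometric bound K(1 − σ)^{N_n} on the truncation error of the Neumann series, used twice in the argument above, can be illustrated numerically. The sketch below is not the paper's estimator: it replaces the operator by a hypothetical symmetric matrix `A` with eigenvalues in [0, 1 − σ], solves (I − A)x = b by successive approximation, and compares √n times the truncation error with √n times the bound (1 − σ)^{N+1}‖b‖/σ. The names `neumann_partial_sum` and `demo`, the dimension, and the choice N_n = ⌈log(n)²⌉ (one sequence satisfying log(n)/N_n → 0) are all assumptions made for illustration.

```python
import math
import numpy as np


def neumann_partial_sum(A, b, N):
    """Return x_N = sum_{k=0}^{N} A^k b via the iteration x <- b + A x."""
    x = b.copy()
    for _ in range(N):
        x = b + A @ x
    return x


def demo(sigma=0.3, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical symmetric "operator" with eigenvalues in [0, 1 - sigma],
    # so that I - A has eigenvalues in [sigma, 1] and is invertible.
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    eigvals = np.linspace(0.0, 1.0 - sigma, dim)
    A = Q @ np.diag(eigvals) @ Q.T
    b = rng.standard_normal(dim)
    exact = np.linalg.solve(np.eye(dim) - A, b)

    rows = []
    for n in (10**2, 10**4, 10**6):
        N = math.ceil(math.log(n) ** 2)  # illustrative N_n with log(n)/N_n -> 0
        err = np.linalg.norm(neumann_partial_sum(A, b, N) - exact)
        # Truncation error is A^{N+1} (I - A)^{-1} b, hence this bound.
        bound = (1.0 - sigma) ** (N + 1) * np.linalg.norm(b) / sigma
        rows.append((n, N, math.sqrt(n) * err, math.sqrt(n) * bound))
    return rows


if __name__ == "__main__":
    for n, N, scaled_err, scaled_bound in demo():
        print(f"n={n:>8d}  N_n={N:>4d}  sqrt(n)*error={scaled_err:.3e}  "
              f"sqrt(n)*bound={scaled_bound:.3e}")
```

In this toy setting the scaled error stays below the scaled bound up to floating-point rounding, and both shrink as n grows, which mirrors the role of the condition log(n)/N_n → 0 in the proof.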

Note that $\{ T_{N_{n}}^{2} (η_{N_{n}}) (h_{1}) : n \geq 1, h_{1} \in H_{10} \}$ is a Glivenko-Cantelli class. It follows that

P_{n} \{T_{N_{n}} (η_{N_{n}}) (h_{1})\}^{2} = P_{0} \{T (η_{0}) (h_{1})\}^{2} + o_{P_{0}} (1),

uniformly in h₁ ∈ H₁₀. Those two results combined with the proof of consistency of the variance estimate in Theorem 1 imply the consistency of the variance estimate.

References

Begun JM, Hall WJ, Huang WM, Wellner J. Information and asymptotic efficiency in parametric-nonparametric models. Ann Statist. 1983;11:432–452.
Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press; 1993.
Chen HY. Nonparametric and semiparametric models for missing covariates in parametric regression. J Amer Statist Assoc. 2004;99:1176–1189.
Huang Y. Calibration regression of censored lifetime medical cost. J Amer Statist Assoc. 2002;97:318–327.
Lipsitz SR, Ibrahim JG, Zhao LP. A weighted estimating equation for missing covariate data with properties similar to maximum likelihood. J Amer Statist Assoc. 1999;94:1147–1160.
Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. New York: John Wiley; 2002.
Nan B, Emond MJ, Wellner JA. Information bounds for Cox regression models with missing data. Ann Statist. 2004;32:723–735.
Robins JM, Hsieh FS, Newey W. Semiparametric efficient estimation of a conditional density with missing or mismeasured covariates. J Roy Statist Soc, Ser B. 1995;57:409–424.
Robins JM, Rotnitzky A. Comments on “Inference for semiparametric models: Some questions and an answer” by Bickel, P. J. and Kwon, J. in the millennium series of Statist. Sinica. 2001;11:920–936.
Robins JM, Rotnitzky A, van der Laan MJ. Discussion of “On profile likelihood” by Murphy, S.A. and van der Vaart, A. W. J Amer Statist Assoc. 1999;94:477–482.
Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Amer Statist Assoc. 1994;89:846–866.
Robins JM, Rotnitzky A, Van der Laan M. Comment on “On profile likelihood” by Murphy and Van der Vaart. J Amer Statist Assoc. 2000;95:477–485.
Robins JM, Wang N. Discussion on the papers by Forster and Smith and Clayton et al. J Roy Statist Soc, Ser B. 1998;60:91–93.
Rotnitzky A, Robins JM. Analysis of semiparametric models with nonignorable nonresponses. Stat Med. 1997;16:81–102. doi: 10.1002/(sici)1097-0258(19970115)16:1<81::aid-sim473>3.0.co;2-0.
Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92.
Sasieni P. Information bounds for the conditional hazard ratio in a nested family of regression models. J Roy Statist Soc, Ser B. 1992;54:617–635.
Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). J Amer Statist Assoc. 1999;94:1096–1120.
Van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer; 2003.
Van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes With Application to Statistics. New York: Springer; 1996.
Yu M, Nan B. Semiparametric regression models with missing data: the mathematical review and a new application. Statist Sinica. 2006;16:1193–1212.

Associated Data

Supplementary Materials

Supplementary: NIHMS168144-supplement-Supplementary.pdf (97KB, pdf)
