Local Influence for Generalized Linear Models with Missing Covariates

Xiaoyan Shi; Hongtu Zhu; Joseph G Ibrahim

doi:10.1111/j.1541-0420.2008.01179.x

Author manuscript; available in PMC: 2010 Feb 10.

Published in final edited form as: Biometrics. 2009 Dec;65(4):1164. doi: 10.1111/j.1541-0420.2008.01179.x


PMCID: PMC2819734 NIHMSID: NIHMS173905 PMID: 19210738

Summary

In the analysis of missing data, sensitivity analyses are commonly used to check the sensitivity of the parameters of interest with respect to the missing data mechanism and other distributional and modeling assumptions. In this article, we formally develop a general local influence method to carry out sensitivity analyses of minor perturbations to generalized linear models in the presence of missing covariate data. We examine two types of perturbation schemes (the single-case and global perturbation schemes) for perturbing various assumptions in this setting. We show that the metric tensor of a perturbation manifold provides useful information for selecting an appropriate perturbation. We also develop several local influence measures to identify influential points and test model misspecification. Simulation studies are conducted to evaluate our methods, and real datasets are analyzed to illustrate the use of our local influence measures.

Keywords: Influence measure, Local influence, Missing covariates, Perturbation manifold, Perturbation scheme

1. Introduction

Missing data are common in various settings, including surveys, clinical trials, and longitudinal studies. Methods for handling missing data strongly depend on the mechanism that generated the missing values as well as the distributional and modeling assumptions at various stages. Therefore, the resulting estimates and tests may be sensitive to these assumptions. For this reason, sensitivity analyses are commonly used to check the sensitivity of the parameter estimates of interest with respect to the model assumptions. Sensitivity analyses are often carried out in two consecutive steps: selection of perturbation schemes to various model assumptions and use of influence measures to quantify the effects of those perturbations. Some literature on sensitivity analysis for missing data problems includes Copas and Li (1997); Copas and Eguchi (2005); Troxel (1998); Jansen et al. (2003); Van Steen, Molenberghs, and Thijs (2001); Verbeke et al. (2001); Hens et al. (2005); Jansen et al. (2006); and Troxel, Ma, and Heitjan (2004). For instance, Copas and Eguchi (2005) proposed a general formulation for assessing the bias of maximum likelihood estimates due to incomplete data in the presence of small model uncertainty. Verbeke et al. (2001), Hens et al. (2005), and Jansen et al. (2006) developed local influence methods for assessing nonrandom dropout in incomplete longitudinal data.

Cook (1986) proposed a general approach for assessing the local influence of a minor perturbation to a statistical model, which has been applied to many types of models, such as mixed models (Beckman, Nachtsheim, and Cook, 1987), generalized linear models (GLMs; Thomas and Cook, 1989), among others. Zhu and Lee (2001) extended Cook's approach for assessing local influence in a minor perturbation of statistical models for latent variable models. Recently, Zhu et al. (2007) developed a perturbation manifold to select an appropriate perturbation for statistical models without missing data, which is central to the development of the local influence approach proposed here.

The aim of this article is to systematically investigate Cook's (1986) local influence methods for GLMs with missing at random (MAR) covariates as well as not missing at random (NMAR) covariates, often referred to as nonignorable missing covariates. Our local influence method provides a general framework for carrying out sensitivity analyses for missing data problems, compared to the existing literature (Van Steen et al., 2001; Troxel et al., 2004; Copas and Eguchi, 2005; Hens et al., 2005; Jansen et al., 2006). We examine two types of perturbation schemes for perturbing various modeling assumptions and individual observations. We also develop a methodology for selecting appropriate perturbation schemes. We examine two objective functions, including the maximum likelihood estimate and the likelihood ratio statistic, and then we develop influence measures based on these functions to assess appropriate perturbation schemes.

To motivate the proposed methodology, we consider a quality-of-life dataset and a liver cancer dataset. The quality-of-life study of the International Breast Cancer Study Group compares several chemotherapies in premenopausal women with breast cancer. These women were randomly assigned in a 2 × 2 factorial design to receive tamoxifen either alone or with oral cyclophosphamide, intravenous methotrexate, and fluorouracil in three early cycles, three delayed cycles, or both early and delayed cycles. For ease of exposition, the four treatment arms are labeled A, B, C, and D. The response variable is the logarithm of the survival time. The dataset has 404 observations and the covariates are: physical ability; mood; indicator for treatment A (yes, no); indicator for treatment B (yes, no); indicator for treatment C (yes, no); age (in years); and language (English, otherwise). Among these seven covariates, physical ability and mood have 13% and 31% missingness percentages, respectively, and the remaining covariates are fully observed. The liver cancer dataset has 191 patients from two Eastern Cooperative Oncology Group clinical trials (Ibrahim, Chen, and Lipsitz, 1999). Previous analyses of these data focused on characterizing how the number of cancerous liver nodes (response) when entering the trials was predicted by six other baseline characteristics: time since diagnosis of the disease (in weeks); two biochemical markers (alpha-fetoprotein and anti-hepatitis B antigen, each classified as normal or abnormal); associated jaundice (yes, no); body mass index (weight in kilograms divided by the square of height in meters); and age (in years). Among these six covariates, three have missing data and the remaining covariates are completely observed. The three with missing data, which are time since diagnosis of the disease, alpha-fetoprotein, and anti-hepatitis B antigen, have 8.9%, 5.8%, and 18.3% missingness percentages, yielding a total missingness percentage of 29%.
Here, it is of interest to carry out local influence methods to possibly detect influential cases and to carry out sensitivity analyses on the modeling assumptions. For instance, using our new methodology, we detected that cases 10, 15, 65, and 160 in the liver cancer data have abnormally large response values, and case 131 has an extreme covariate value in time since diagnosis compared to the rest of the cases (Table 1). More details regarding these two real datasets are given in Section 5.

Table 1.

The influential cases in the liver cancer data

Obs.	Response	Time	Alpha-fetoprotein	Anti-hepatitis B antigen	Jaundice	BMI	Age
10	61	2.29	0	1	1	18.81	31.06
15	21	0.57	1	0	1	23.73	42.29
65	23	2.57	1	.	1	23.72	70.52
131	6	320.86	0	.	0	20.31	66.19
160	21	1.14	1	0	1	22.94	65.40


The article is organized as follows. In Section 2, we review the model development for GLMs with missing covariates. In Section 3, we systematically develop local influence measures for assessing small perturbations to modeling assumptions in GLMs with missing covariates. We present several simulation studies in Section 4, and analyze two real datasets in Section 5. We conclude the article with some final remarks in Section 6.

2. Model and Notation

Suppose that we have complete data D_c = {d_i = (x_i, z_i, r_i, y_i): i = 1, …, n}, where y_i is the univariate response, x_i is a p₁ × 1 vector of completely observed covariates, and z_i is a p₂ × 1 vector of partially observed covariates. We use r_i, a p₂ × 1 random vector, to indicate the missingness of z_i: r_ik = 1 if z_ik is observed, and r_ik = 0 if z_ik is missing, where r_ik and z_ik are the kth component of r_i and z_i, respectively.

We use p(D_c | η) to denote the complete-data density function with η being the vector of all unknown parameters. One way of modeling the complete-data density is to use three layers of conditional densities as follows:

p (D_{c} ∣ η) = \prod_{i = 1}^{n} p (y_{i} ∣ x_{i}, z_{i}, β, τ) p (z_{i} ∣ x_{i}, α) p (r_{i} ∣ x_{i}, z_{i}, y_{i}, ξ),

(1)

where (β, τ) are the parameters for the conditional distribution of y_i given (x_i, z_i), α is the parameter vector for the covariate distribution p(z_i | x_i, α), and ξ is the parameter vector for modeling the missing data mechanism p(r_i | x_i, z_i, y_i, ξ). The three sets of parameters are assumed distinct from one another, and η = (β′, τ, α′, ξ′)′.

We need to specify each of the three components in equation (1). Under the GLM, y_i given (x_i, z_i) has a density in the exponential family

p (y_{i} ∣ x_{i}, z_{i}, β, τ) = \exp {a_{i}^{- 1} (τ) [y_{i} θ_{i} (β) - b (θ_{i} (β))] + c (y_{i}, τ)},

(2)

i = 1, …, n, indexed by the canonical parameter θ_i and the scale parameter τ, where the functions b(·) and c(·, ·) determine a particular distribution in the class. The functions a_i(τ) are commonly of the form $a_{i} (τ) = τ^{- 1} k_{i}^{- 1}$ , where the k_i's are known weights. Further, the θ_i's satisfy the equations θ_i = θ(μ_i), and μ_i = g((x_i′, z_i′) β) are the components of μ = E(y | x, z, β, τ), where g(·) is a known link function and β = (β₀, β₁, …, β_p)′ is a (p + 1) × 1 vector of regression coefficients, in which p = p₁ + p₂.

Next, we need to specify a distribution for z_i given x_i. We suggest specifying the covariate distribution via a sequence of one-dimensional conditional distributions:

\begin{matrix} p (z_{i} ∣ x_{i}, α) = & p (z_{i p_{2}} ∣ z_{i (p_{2} - 1)}, \dots, z_{i 1}, x_{i}, α) \\ & \times \dots \times p (z_{i 2} ∣ z_{i 1}, x_{i}, α) \times p (z_{i 1} ∣ x_{i}, α) . \end{matrix}

(3)

We typically assume specific parametric forms for these one-dimensional conditional distributions. This strategy allows much flexibility in the specification of the joint covariate distribution and has the potential of reducing the number of nuisance parameters (Lipsitz and Ibrahim, 1996; Ibrahim, Lipsitz, and Chen, 1999). Furthermore, we model the missing data mechanism using a sequence of one-dimensional conditional distributions as

\begin{matrix} p (r_{i} ∣ y_{i}, x_{i}, z_{i}, ξ) = & p (r_{i p_{2}} ∣ r_{i (p_{2} - 1)}, \dots, r_{i 1}, x_{i}, z_{i}, y_{i}, ξ) \\ \times \dots \times p (r_{i 2} ∣ r_{i 1}, x_{i}, z_{i}, y_{i}, ξ) \\ \times p (r_{i 1} ∣ x_{i}, z_{i}, y_{i}, ξ) . \end{matrix}

(4)

Because r_ij is binary, a sequence of logistic regressions is commonly used.

3. Local Influence

We will develop a local influence method for carrying out sensitivity analyses of various assumptions of a GLM with missing covariates. Specifically, we will address three important issues related to local influence methods: perturbation schemes for perturbing the distributions for each component in equation (1), the appropriate choice of a perturbation vector, and the development of influence measures.

3.1 A Simple Example

Throughout this section, we examine a linear regression model with one missing covariate to illustrate our methodological development. We consider the model

y_{i} = β_{0} + β_{1} x_{i} + β_{2} z_{i} + ∊_{i},

(5)

where ∊_i ~ N(0, τ). We assume that y_i and x_i are completely observed for i = 1, …, n, but the covariate z_i may be missing for some cases. We also assume that (z_i | x_i, α) ~ N(α₀ + α₁x_i, α₂), where α = (α₀, α₁, α₂). We let r_i = 1 if z_i is missing and r_i = 0 if z_i is observed. Furthermore, we assume that the z_i's are MAR with missing data mechanism

p (r_{i} = 1 ∣ y_{i}, x_{i}, z_{i}) = \frac{\exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i})}{1 + \exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i})} .

(6)
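The simple example can be simulated directly, which is convenient for experimenting with the perturbations discussed next. The following is a minimal sketch, not the authors' code: the function name `simulate_simple_example`, the standard normal distribution for x_i, and every parameter value in the usage line are assumptions made for illustration.

```python
import numpy as np

def simulate_simple_example(n, beta, tau, alpha, xi, seed=0):
    """Draw (y, x, z, r) from model (5) with the MAR mechanism (6).

    beta = (b0, b1, b2), alpha = (a0, a1, a2), xi = (xi0, xi1, xi2).
    tau and a2 are variances. Following Section 3.1, r[i] = 1 marks
    z[i] as missing; missing entries of z are returned as NaN.
    """
    rng = np.random.default_rng(seed)
    b0, b1, b2 = beta
    a0, a1, a2 = alpha
    x = rng.normal(size=n)  # assumption: x_i ~ N(0, 1)
    z = a0 + a1 * x + rng.normal(scale=np.sqrt(a2), size=n)
    y = b0 + b1 * x + b2 * z + rng.normal(scale=np.sqrt(tau), size=n)
    # Logistic missingness probability depends on (y, x) only, hence MAR.
    lin = xi[0] + xi[1] * y + xi[2] * x
    p_missing = 1.0 / (1.0 + np.exp(-lin))
    r = rng.binomial(1, p_missing)
    z_obs = np.where(r == 1, np.nan, z)
    return y, x, z_obs, r

# Illustrative (hypothetical) parameter values.
y, x, z_obs, r = simulate_simple_example(
    n=200, beta=(1.0, 1.0, 1.0), tau=1.0,
    alpha=(0.0, 0.5, 1.0), xi=(-1.0, 0.5, 0.5),
)
```

Because the missingness probability does not involve z_i, deleting z_i after drawing r_i does not change the joint law of the observed data.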

We introduce various perturbations to perturb p(D_c | η) and then we assess the sensitivity of each perturbation scheme to the proposed model and associated statistical inference. As an illustration, we consider four perturbations as follows. These perturbations illustrate two different types of perturbation schemes, which we discuss in the next subsection. The first is to perturb the variances of ∊_i such that

Var (∊_{1}, \dots, ∊_{n}) = τ diag (1 ∕ ω_{1}, \dots, 1 ∕ ω_{n}) .

(7)

Throughout, we let ω⁰ denote no perturbation. In this case, ω⁰ =1_n is an n × 1 vector with all 1's. This perturbation is designed to assess the homogeneous variance assumption of the ∊_i's. The second is to introduce a perturbation to z_i to assess the linear relationship between y_i and z_i such that

y_{i} = β_{0} + β_{1} x_{i} + β_{2} (z_{i} + ω_{i}) + ∊_{i},

(8)

for i = 1, …, n. In this case, ω⁰ = 0_n, which is an n × 1 vector with all 0's.

The third is to extend the MAR assumption such that

p (r_{i} = 1 ∣ y_{i}, x_{i}, z_{i}, ω) = \frac{\exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + ω z_{i})}{1 + \exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + ω z_{i})} .

(9)

If ω ≠ 0, then the missing data mechanism is NMAR. This strategy for checking NMAR is similar to that of Verbeke et al. (2001) in the context of longitudinal data. Thus, equation (9) explores the influence of perturbing the MAR assumption (ω⁰ = 0) in the direction of NMAR. We emphasize here that formal tests for MAR or NMAR missingness should be approached with great caution, although they might be possible. Our main goal here and throughout this article is to use local influence methods to carry out sensitivity analyses to assess the effect of perturbing the given GLM with MAR covariates in the direction of NMAR. An alternative to equation (9) is the individual-specific infinitesimal perturbation as used in Verbeke et al. (2001), Hens et al. (2005), and Jansen et al. (2006), which is given by

p (r_{i} = 1 ∣ y_{i}, x_{i}, z_{i}, ω) = \frac{\exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + ω_{i} z_{i})}{1 + \exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + ω_{i} z_{i})} .

(10)

This can provide insight into which case may have large influence.

The fourth perturbation extends the linear relationship between z_i and x_i such that (z_i | x_i, α) ~ N(α₀ + α₁x_i + g(x_i), α₂) for i = 1, …, n, where g(·) is an unknown function. For instance, we may approximate g(x) using a set of m basis functions (e.g., Fourier series, B-splines) B₁(x), …, B_m(x) such that $g (x) \approx Σ_{j = 1}^{m} ω_{j} B_{j} (x)$. Thus, we obtain

(z_{i} ∣ x_{i}, α) ~ N (α_{0} + α_{1} x_{i} + Σ_{j = 1}^{m} ω_{j} B_{j} (x_{i}), α_{2})

(11)

for i = 1, …, n. In equation (11), we are interested in assessing whether there is a nonlinear relationship between the covariate z_i and x_i. In this case, ω⁰ = 0_m.

3.2 Perturbation Schemes

We formally define two classes of perturbation schemes: the single-case and the global perturbation scheme. Let ω = (ω₁, …, ω_m) ∈ R^m be a perturbation vector for the complete-data density p(D_c | η). We use p(D_c | η, ω) to denote the perturbed complete-data density such that ∫ p(D_c | η, ω)dD_c = 1 and p(D_c | η, ω⁰) = p(D_c | η). To assess the local influence of a model perturbation, we are primarily interested in the behavior of p(D_c | η, ω) as a function of ω around ω⁰. We set η at a given value (e.g., the maximum likelihood estimate).

The single-case perturbation scheme refers to any scheme that independently perturbs individual observations (Verbeke et al., 2001). The single-case perturbation is mainly for identifying influential observations. Specifically, the perturbed complete-data density is

p (D_{c} ∣ η, ω) = \prod_{i = 1}^{n} p (x_{i}, z_{i}, r_{i}, y_{i} ∣ η, ω_{i}),

(12)

where ω = (ω₁, …, ω_n) and ω_i denotes the perturbation to the ith observation. Such perturbation schemes, for example, include case weights for each of the three components of equation (1), perturbing individual components of (x_i, z_i, r_i ) and perturbing individual components (or multiple components) of r_i. Perturbations (7), (8), and (10) of the previous subsection belong to such a class.

The global perturbation scheme refers to any scheme that perturbs all observations simultaneously (Troxel et al., 2004; Copas and Eguchi, 2005). The global perturbation is mainly for assessing the robustness of model assumptions to small perturbations. Specifically, the perturbed complete-data density is

p (D_{c} ∣ η, ω) = \prod_{i = 1}^{n} p (x_{i}, z_{i}, r_{i}, y_{i} ∣ η, ω),

(13)

where ω = (ω₁, …, ω_m) is shared by all the observations. Such a perturbation scheme includes the perturbation of each of the three components of equation (1) and simultaneous perturbations of the three components of equation (1), among many others. The number of components in ω can be as small as one, such as perturbation (9) and other examples (Gustafson, 2001; Troxel et al., 2004; Copas and Eguchi, 2005; Zhu et al., 2007). Perturbation (11) is also a global perturbation scheme, in which m in the perturbation can increase with n.

3.3 Appropriate Perturbation

We develop a new geometric framework to address the issue of selecting an appropriate perturbation scheme for equation (1). This issue is central to the development of the local influence approach, because arbitrarily perturbing a model may lead to inappropriate inference about the cause (e.g., influential observations) of a large effect.

The perturbed model p(D_c | η, ω) has a natural geometrical structure. The perturbed model M = {p(D_c | η, ω) : ω ∈ R^m} can be regarded as an m-dimensional manifold. At each ω ∈ M, there is a tangent space T_ω of M spanned by m functions ∂_{ω_j} l_c (ω), where l_c (ω) = log p(D_c | η, ω). The m² quantities g_jk (ω) = E_ω [∂_{ω_j} l_c (ω) ∂_{ω_k} l_c (ω)], j, k = 1, …, m, form the metric tensor of M, in which E_ω denotes the expectation taken with respect to p(D_c | η, ω), and the metric matrix G(ω) = (g_jk (ω)) is the Fisher information matrix with respect to the perturbation vector ω (Figure 1).

Figure 1. A graphical representation of the perturbation manifold.

An appropriate perturbation to equation (1) requires that G(ω⁰) = diag(g₁₁(ω⁰), …, g_{m m} (ω⁰)). The elements of G(ω) measure the amount of perturbation introduced by all components of the perturbation vector ω. The g_ii (ω) can be interpreted as the amount of perturbation introduced by ω_i, whereas $r_{ij} (ω) = g_{ij} (ω) ∕ \sqrt{g_{ii} (ω) g_{jj} (ω)}$ indicates an association between ω_i and ω_j. For a diagonal matrix G(ω), all components of ω may be regarded as being orthogonal to each other in the perturbed model (Cox and Reid, 1987), and therefore it becomes easy to pinpoint the cause of a large effect. In applications, although G(ω⁰) may not be diagonal, we can always choose a new perturbation vector ω̃, defined by

\tilde{ω} (ω) = ω^{0} + c^{- 1 ∕ 2} {G (ω^{0})}^{1 ∕ 2} (ω - ω^{0}),

(14)

such that G(ω̃) evaluated at ω⁰ equals cI_m, where c > 0.
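The claim follows from the usual transformation rule for a Fisher information matrix under a linear reparameterization. A short sketch, writing G = G(ω⁰), using the symmetric square root G^{1/2}, assuming G is positive definite, and writing \tilde{G} (a symbol introduced here only for illustration) for the metric matrix in the new coordinates:

```latex
\omega = \omega^{0} + c^{1/2} G^{-1/2} (\tilde{\omega} - \omega^{0}),
\qquad
\frac{\partial \omega}{\partial \tilde{\omega}'} = c^{1/2} G^{-1/2},
\\[4pt]
\tilde{G}(\omega^{0})
  = \Big( \frac{\partial \omega}{\partial \tilde{\omega}'} \Big)' \, G \,
    \Big( \frac{\partial \omega}{\partial \tilde{\omega}'} \Big)
  = c \, G^{-1/2} \, G \, G^{-1/2}
  = c \, I_{m} .
```

The first line inverts equation (14); the second applies the chain rule to the score vector, which is what turns G into a multiple of the identity at ω⁰.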

For the single-case perturbation scheme (12), we have g_jk (ω) = δ_jk E_ω {[∂_{ω_j} l_{c,j} (ω)]²}, for j, k = 1, …, n, where δ_jk is the Kronecker delta and l_{c,j} (ω) = log p(d_j | η, ω_j). The diagonal structure of G(ω) = (g_jk (ω)) indicates that all components of ω are orthogonal to each other. Furthermore, if p(d_i | η, ω_i) is invariant across all i, then G(ω) = g₁₁(ω) I_n, which indicates that different components of ω have the same influence on the corresponding distributions.

For the global perturbation scheme, we have $g_{jk} (ω) = - Σ_{i = 1}^{n} E_{ω} [\partial_{ω_{j} ω_{k}}^{2} ℓ_{c, i} (ω)]$. Although ω may not be appropriate, we can choose a new perturbation ω̃ = ω⁰ + G(ω⁰)^{1/2} (ω − ω⁰), that is, equation (14) with c = 1, such that G(ω̃⁰) = I_m. Thus, ω̃ is an appropriate perturbation at least at ω̃⁰ = ω⁰. For instance, we consider the perturbation (11) to the model in Section 3.1. It can be shown that

\begin{matrix} - \partial_{ω_{j} ω_{k}}^{2} ℓ_{c, i} (ω) & = α_{2}^{- 1} B_{j} (x_{i}) B_{k} (x_{i}) \quad \text{and} \\ g_{jk} (ω) & = α_{2}^{- 1} Σ_{i = 1}^{n} \int B_{j} (x_{i}) B_{k} (x_{i}) p (x_{i} ∣ α) d x_{i}, \end{matrix}

where p(x_i | α) is the distribution of x_i. If {B_j (x) : j = 1, …, m} forms an orthonormal basis with respect to p(x | α), then G(ω) is just an m × m identity matrix. However, because the x_i's are always observed, we can always treat x_i as fixed and approximate g_jk (ω) using $g_{jk} (ω) = α_{2}^{- 1} Σ_{i = 1}^{n} B_{j} (x_{i}) B_{k} (x_{i})$ .
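With the x_i treated as fixed, the metric matrix for perturbation (11) and the corresponding appropriate perturbation can be computed in a few lines. This is a sketch under stated assumptions: the helper names `basis_metric` and `sym_sqrt`, the polynomial basis, the grid of x values, and the value of α₂ are all hypothetical choices made for illustration.

```python
import numpy as np

def basis_metric(B, alpha2):
    """G with entries alpha2^{-1} * sum_i B_j(x_i) B_k(x_i); B is n x m."""
    return (B.T @ B) / alpha2

def sym_sqrt(G):
    """Symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(G)
    return (vecs * np.sqrt(vals)) @ vecs.T

# Hypothetical setup: basis functions x^2 and x^3 on a grid in [0, 1].
alpha2 = 1.5
x = np.linspace(0.0, 1.0, 50)
B = np.column_stack([x**2, x**3])

G = basis_metric(B, alpha2)      # not diagonal for this basis
G_half = sym_sqrt(G)

# With omega^0 = 0, omega = G^{-1/2} omega_tilde, so the perturbation
# sum_j omega_j B_j(x) equals (B G^{-1/2}) omega_tilde.
B_tilde = B @ np.linalg.inv(G_half)
G_tilde = basis_metric(B_tilde, alpha2)  # numerically the identity
```

In other words, switching to ω̃ amounts to replacing the original basis by a linear combination of its columns that is orthonormal with respect to the empirical metric.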

3.4 Influence Measures

3.4.1 First-order Influence Measures

We consider a b × 1 objective function f(ω): M → R^b such as the maximum likelihood estimate of η (Copas and Eguchi, 2001, 2005; Gustafson, 2001; Troxel et al., 2004). The objective function f(ω) defines the aspect of inference of interest for sensitivity analysis. Let ω(t) be a geodesic on M with ω(0) = ω⁰ and ∂_t ω(t)|_{t=0} = h ∈ R^m. It follows from a Taylor series expansion that f(ω(t)) = f(ω(0)) + ḟ_h(0)t + O(t²), where $\dot{f}_{h} (0) = Σ_{j} \partial_{ω_{j}} f (ω^{0}) h_{j} = \nabla_{f}^{'} h$. If ∇_f ≠ 0, then the first-order term ḟ_h(0) mainly characterizes the local influence of a perturbation vector ω to a model.

We introduce a first-order influence measure to assess the local influence of minor perturbations when ∇_f ≠ 0. The first-order influence measure (FI) in the direction h ∈ R^m is ${FI}_{f, h} = {FI}_{f (ω^{0}), h} = \frac{h^{'} \nabla_{f} W_{f} \nabla_{f}^{'} h}{h^{'} G h}$, where G = G(ω⁰) and W_f is a positive semi-definite matrix.

Although ω may not be an appropriate perturbation, we can always use the appropriate perturbation ω̃(ω) in equation (14), which yields

{FI}_{f (\tilde{ω}), h} ∣_{\tilde{ω} = ω^{0}} = \frac{h^{'} G^{- 1 ∕ 2} \nabla_{f} W_{f} \nabla_{f}^{'} G^{- 1 ∕ 2} h}{h^{'} h} .

(15)

The maximum value of FI_f,h equals the principal eigenvalue of $G^{- 1 ∕ 2} \nabla_{f} W_{f} \nabla_{f}^{'} G^{- 1 ∕ 2}$ , which quantifies the largest degree of local influence of ω̃ to a statistical model, while the corresponding eigenvector of $G^{- 1 ∕ 2} \nabla_{f} W_{f} \nabla_{f}^{'} G^{- 1 ∕ 2}$ , denoted by h_max, can be used either for identifying influential observations for single-case perturbations or for identifying influential directions for global perturbations (Copas and Eguchi, 2005). The h_max is the largest perturbation direction for f (ω̃).
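The eigenvalue characterization can be checked numerically. The sketch below assumes that ∇_f is stored as an m × b array, W_f as a b × b positive semi-definite array, and G as an m × m positive definite array; the function names are hypothetical and the inputs are random placeholders rather than quantities from any fitted model.

```python
import numpy as np

def inv_sym_sqrt(G):
    """G^{-1/2} for a symmetric positive definite matrix G."""
    vals, vecs = np.linalg.eigh(G)
    return (vecs / np.sqrt(vals)) @ vecs.T

def first_order_influence(grad_f, W_f, G):
    """Matrix in equation (15), its principal eigenvalue, and h_max."""
    G_inv_half = inv_sym_sqrt(G)
    A = G_inv_half @ grad_f @ W_f @ grad_f.T @ G_inv_half
    A = (A + A.T) / 2.0            # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(A)  # eigenvalues in ascending order
    return A, vals[-1], vecs[:, -1]

def fi_in_direction(A, h):
    """Rayleigh quotient h' A h / h' h, i.e. FI in direction h."""
    return float(h @ A @ h / (h @ h))

# Random placeholder inputs (m = 5 perturbation components, b = 2).
rng = np.random.default_rng(1)
grad_f = rng.normal(size=(5, 2))
W_f = np.eye(2)
R = rng.normal(size=(5, 5))
G = R @ R.T + 5.0 * np.eye(5)

A, fi_max, h_max = first_order_influence(grad_f, W_f, G)
```

For a single-case perturbation, plotting the absolute components of `h_max` against the case index is one way to look for influential observations; for a global perturbation, `h_max` is read as a direction in the space of ω̃.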

3.4.2 Maximum Likelihood Estimate as the Objective Function

Let D_o denote the observed data and D_m the missing data. We consider η̂_o (ω) = (β̂_o (ω), α̂_o (ω), ξ̂_o (ω))′, which is the maximum likelihood estimate of η based on the perturbed observed-data density. The perturbed observed-data density, denoted by p(D_o | η, ω), is associated with the perturbed complete-data density through p(D_o | η, ω) = ∫ p(D_c | η, ω)dD_m. It can be shown that

\partial_{ω} {\hat{η}}_{o} (ω^{0}) = I_{η, o}^{- 1} Δ_{o} (η, ω) ∣_{η = \hat{η}, ω = ω^{0}},

(16)

where $I_{η, o} = - \partial_{η}^{2} \log p (D_{o} ∣ η)$ and $Δ_{o} (η, ω) = \partial_{η ω}^{2} \log p (D_{o} ∣ η, ω)$. Then, the asymptotic bias in the estimate of η is ∂_ω η̂_o (ω⁰)(ω − ω⁰) under p(D_o | η, ω).

We choose η̂_o as the object of interest and set W_f = I_{η̂,o}. We can show that

{FI}_{{\hat{η}}_{o} (\tilde{ω}), h} = h^{'} G^{- 1 ∕ 2} {Δ_{o} (\hat{η}, ω^{0})}^{'} I_{\hat{η}, o}^{- 1} Δ_{o} (\hat{η}, ω^{0}) G^{- 1 ∕ 2} h,

(17)

where h′h = 1. If G = I_m, then it can be shown that FI_{η̂_o(ω̃),h} is the same as Cook's (1986) local influence measure based on the likelihood displacement. Finally, for most GLMs with missing covariate data, computing the matrix $G^{- 1 ∕ 2} {Δ_{o} (\hat{η}, ω^{0})}^{'} I_{\hat{η}, o}^{- 1} Δ_{o} (\hat{η}, ω^{0}) G^{- 1 ∕ 2}$ involves the computation of G, Δ_o(η̂, ω⁰), and I_{η,o}, which can be expressed as expectations with respect to the conditional distribution of z_{m,i} (the missing components of z_i) given d_{o,i} (the observed data for case i) and hence be computed using Markov chain Monte Carlo methods. For the single-case perturbation in equation (12), we obtain G(ω⁰) = g₁₁(ω⁰)I_n and the ith column of Δ_o(η, ω⁰), denoted by δ_{η,i}, is given by $\partial_{η ω_{i}}^{2} {\log \int p (d_{i} ∣ η, ω_{i}) d z_{m, i}}$. Thus,

{FI}_{{\hat{η}}_{o} (\tilde{ω}), h} = {g_{11} (ω^{0})}^{- 1} h^{'} {Δ_{o} (\hat{η}, ω^{0})}^{'} I_{\hat{η}, o}^{- 1} Δ_{o} (\hat{η}, ω^{0}) h .

(18)

In particular, for the ith observation, ${FI}_{{\hat{η}}_{o} (\tilde{ω}), e_{i}} = {g_{11} (ω^{0})}^{- 1} δ_{\hat{η}, i}^{'} I_{\hat{η}, o}^{- 1} δ_{\hat{η}, i}$, and $Σ_{i = 1}^{n} {FI}_{{\hat{η}}_{o} (\tilde{ω}), e_{i}} = {g_{11} (ω^{0})}^{- 1} tr \{Σ_{i = 1}^{n} δ_{\hat{η}, i} δ_{\hat{η}, i}^{'} I_{\hat{η}, o}^{- 1}\}$. Under some mild conditions, $Σ_{i = 1}^{n} δ_{\hat{η}, i} δ_{\hat{η}, i}^{'} ∕ n$ and I_{η̂,o}/n converge in probability to J_o and I_o, respectively. Therefore, $Σ_{i = 1}^{n} {FI}_{{\hat{η}}_{o} (\tilde{ω}), e_{i}}$ is a direct estimate of $λ_{0} = tr (J_{o} I_{o}^{- 1}) {g_{11} (ω^{0})}^{- 1}$. Under exchangeability of the observations, each FI_{η̂_o(ω̃),e_i} should be around its mean λ₀. However, in real applications, if a particular FI_{η̂_o(ω̃),e_i} is much larger than λ₀, then this observation may be regarded as an influential case.
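Once the columns δ_{η̂,i} and the observed information matrix are available (for example from Monte Carlo approximations, which are not shown here), the case-wise measures in equation (18) reduce to a set of quadratic forms. A minimal sketch, with a hypothetical function name and small made-up inputs chosen only so the arithmetic can be followed by hand:

```python
import numpy as np

def single_case_fi(Delta, I_obs, g11):
    """FI for each direction e_i: g11^{-1} * delta_i' I^{-1} delta_i.

    Delta: q x n array whose ith column is delta_i.
    I_obs: q x q positive definite observed information matrix.
    g11:   common diagonal element of G(omega^0).
    """
    solved = np.linalg.solve(I_obs, Delta)        # I^{-1} Delta, q x n
    return np.einsum("qi,qi->i", Delta, solved) / g11

# Made-up inputs: q = 2 parameters, n = 3 cases.
Delta = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 2.0]])
I_obs = np.diag([1.0, 4.0])
fi = single_case_fi(Delta, I_obs, g11=2.0)
# The sum of the case-wise values equals the trace expression in the text.
trace_form = np.trace(Delta @ Delta.T @ np.linalg.inv(I_obs)) / 2.0
```

Solving the linear system once for all columns avoids forming the inverse of the information matrix explicitly in the case-wise computation.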

3.4.3 Likelihood Ratio as the Objective Function

We consider f_lr (ω) = log p(D_o | η̂, ω) − log p(D_o | η̂) as our objective function. For the single-case perturbation in equation (12), we can obtain that G(ω⁰) = g₁₁(ω⁰) I_n and

\begin{matrix} \partial_{ω_{i}} f_{lr} (ω) & = \partial_{ω_{i}} \log p (d_{o, i} ∣ \hat{η}, ω_{i}) \\ & = E [\partial_{ω_{i}} \log p (d_{i} ∣ \hat{η}, ω_{i}) ∣ d_{o, i}, \hat{η}] \end{matrix}

for i = 1, …, n, where the expectation is taken with respect to the conditional distribution of z_{m,i} given d_{o,i}. Thus, by setting W_{f_lr} = 1, we get FI_{f_lr(ω),h} = g₁₁(ω⁰)⁻¹ h′∇_{f_lr}∇′_{f_lr}h. For the ith observation, we have FI_{f_lr(ω),e_i} = g₁₁(ω⁰)⁻¹{∂_{ω_i} f_lr (ω⁰)}². If a particular FI_{f_lr(ω),e_i} is much larger than the mean of all the FI_{f_lr(ω),e_i}'s, then the ith observation can be regarded as influential.

For the global perturbation in equation (13), we define $\log p (D_{o} ∣ \hat{η}, ω) = Σ_{i = 1}^{n} \log p (d_{o, i} ∣ \hat{η}, ω)$. Direct calculation leads to

\begin{matrix} \nabla_{f_{lr}} & = Σ_{i = 1}^{n} \partial_{ω} \log p (d_{o, i} ∣ \hat{η}, ω^{0}) \\ & = Σ_{i = 1}^{n} E [\partial_{ω} \log p (d_{i} ∣ \hat{η}, ω^{0}) ∣ d_{o, i}, \hat{η}] . \end{matrix}

(19)

Setting W_{f_lr} = 1 and choosing ω̃ in equation (14), we have ${FI}_{f_{lr} (\tilde{ω}), h} = h^{'} G^{- 1 ∕ 2} \nabla_{f_{lr}} \nabla_{f_{lr}}^{'} G^{- 1 ∕ 2} h$, where h′h = 1. The maximum value is the principal eigenvalue ${FI}_{f_{lr} (\tilde{ω}), h_{\max}} = \nabla_{f_{lr}}^{'} G^{- 1} \nabla_{f_{lr}}$ and its corresponding h_max is G^{−1/2}∇_{f_lr} / ||G^{−1/2}∇_{f_lr}||. Moreover, under some mild conditions $\nabla_{f_{lr}}^{'} G^{- 1} \nabla_{f_{lr}}$ can be used as a test statistic for testing H₀ : ω = 0. Under H₀ : ω = 0, it can be shown that $\nabla_{f_{lr}} ∕ \sqrt{n}$ converges in distribution to a Gaussian distribution with zero mean and covariance matrix Σ_{f_lr} as n → ∞. Thus, $\nabla_{f_{lr}}^{'} Σ_{f_{lr}}^{- 1 ∕ 2} Σ_{f_{lr}}^{1 ∕ 2} G^{- 1} Σ_{f_{lr}}^{1 ∕ 2} Σ_{f_{lr}}^{- 1 ∕ 2} \nabla_{f_{lr}}$ converges in distribution to a weighted chi-squared distribution as n → ∞. Therefore, we may use the asymptotic distribution of $\nabla_{f_{lr}}^{'} G^{- 1} \nabla_{f_{lr}}$ to characterize the asymptotic behavior of the influence measures FI_{f_lr(ω̃),h}.
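As a worked instance of equation (19), consider the global perturbation (9) of the simple example in Section 3.1, where only the missing data mechanism depends on the scalar ω and ω⁰ = 0. The following sketch uses the convention of Section 3.1 that r_i = 1 when z_i is missing; the symbols ζ_i, π_i, and μ_{z,i} are shorthand introduced here for illustration.

```latex
\log p(r_i \mid y_i, x_i, z_i, \omega)
  = r_i \zeta_i(\omega) - \log\{1 + \exp \zeta_i(\omega)\},
\qquad
\zeta_i(\omega) = \xi_0 + \xi_1 y_i + \xi_2 x_i + \omega z_i ,
\\[4pt]
\partial_\omega \log p(r_i \mid y_i, x_i, z_i, \omega) \big|_{\omega = 0}
  = (r_i - \pi_i) z_i ,
\qquad
\pi_i = \frac{\exp(\xi_0 + \xi_1 y_i + \xi_2 x_i)}{1 + \exp(\xi_0 + \xi_1 y_i + \xi_2 x_i)} ,
\\[4pt]
\nabla_{f_{lr}}
  = \sum_{i=1}^{n} (r_i - \hat{\pi}_i) \, E[z_i \mid d_{o,i}, \hat{\eta}] ,
\\[4pt]
E[z_i \mid d_{o,i}, \hat{\eta}] =
\begin{cases}
  z_i , & r_i = 0 , \\
  \mu_{z,i} + \dfrac{\beta_2 \alpha_2}{\tau + \beta_2^2 \alpha_2}
     \, (y_i - \beta_0 - \beta_1 x_i - \beta_2 \mu_{z,i}) , & r_i = 1 ,
\end{cases}
\qquad
\mu_{z,i} = \alpha_0 + \alpha_1 x_i ,
```

with all parameters evaluated at η̂. The factor (r_i − π_i) leaves the conditional expectation because, at ω⁰ = 0, π_i does not involve z_i; for the same reason the conditional law of a missing z_i given (y_i, x_i, r_i) is the normal law of z_i given (y_i, x_i), which gives the second case.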

4. Simulation Studies

We applied the proposed local influence measures to several simulated datasets in which various assumptions were misspecified to examine their performance. First, we applied two single-case perturbation schemes to simulated datasets in each of which an outlier was added. We expected that both schemes could detect the outlier both in the response and in the covariates. Secondly, we used several perturbation schemes to examine the functional form of the missing data mechanism and to assess the relationship between the response and covariates.

We generated 500 simulated datasets from model (5) with n = 100, β₀ = β₁ = β₂ = 1 and τ = 1. Moreover, (x_i, z_i) were generated from a N₂(0, I₂) distribution. We also assumed an MAR missing data mechanism for z_i given by

p (r_{i} = 1 ∣ x_{i}, z_{i}, y_{i}) = \frac{\exp (ξ_{0} + ξ_{1} x_{i})}{1 + \exp (ξ_{0} + ξ_{1} x_{i})},

(20)

with ξ₀ = −0.5 and ξ₁ = 1.0, resulting in an average missingness fraction of 40%. Then, we fit y_i = β₀ + β₁x_i + β₂z_i + ∊_i with MAR z_i, and changed y₁₀₀ to y₁₀₀ + δ with δ = 1.0, 2.0, 3.0, 4.0, and 5.0 to add an outlier. We applied two single-case perturbation schemes. The first was to perturb the variance of ∊_i such that Var(∊₁, …, ∊_n) = τ diag(1/ω₁, …, 1/ω_n), where ω⁰ = 1_n is an n × 1 vector with all 1's. The second perturbation was to perturb the missing covariate z_i such that y_i = β₀ + β₁x_i + β₂(z_i + ω_i) + ∊_i for i = 1, …, n, where ω⁰ = 0_n is an n × 1 vector with all 0's. We calculated FI_{η̂_o(ω⁰),e_i} and FI_{f_lr(ω⁰),e_i} for both perturbations, and their values for the last case were larger than those for the rest of the cases, especially when δ is large. The first half of Table 2 summarizes the percentages of detecting the outlier using either non-robust methods with the sample mean and standard deviation (> mean + 2 × SD or > mean + 3 × SD) or robust methods with the sample median and median absolute deviation (> median + 2 × MAD or > median + 3 × MAD) for different values of δ. As expected, the percentage of detecting the outlier increases with δ, and the results based on the robust methods are better than those based on the non-robust methods. Using a threshold of three deviations (SD or MAD) gives results that are not very different from those obtained with a threshold of two deviations. Based on a simulated dataset, in which δ = 4, the index plots of the two influence measures (Figure 2) can effectively detect the outlier.

Instead of having an outlier in the response, we next examined a scenario with an outlier in the covariates. We changed z₁₀₀ to z₁₀₀ + δ with δ = 1.0, 2.0, 3.0, 4.0, and 5.0, and applied the same two single-case perturbation schemes. The values of FI_{η̂_o(ω⁰),e_i} and FI_{f_lr(ω⁰),e_i} for both perturbations for the last case were again larger than those for the rest of the cases, especially when δ is large. The second half of Table 2 lists the percentages of detecting the outlier using either the non-robust or the robust methods mentioned previously, which shows findings similar to those when the outlier is in the response. Thus our local influence method can effectively detect the outlier in the covariates when δ is reasonably large.
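The two detection rules used above can be written as a single thresholding function applied to the vector of case-wise influence measures. This is a sketch, not the authors' implementation: the function name is hypothetical, the SD is taken with the n − 1 divisor, and the MAD is left unscaled, since the text does not specify either convention.

```python
import numpy as np

def flag_influential(fi, k=2.0, robust=False):
    """Flag cases whose influence measure exceeds center + k * spread.

    robust=False: center = sample mean,   spread = sample SD (ddof=1).
    robust=True:  center = sample median, spread = median absolute
                  deviation (assumption: no normal-consistency scaling).
    """
    fi = np.asarray(fi, dtype=float)
    if robust:
        center = np.median(fi)
        spread = np.median(np.abs(fi - center))
    else:
        center = fi.mean()
        spread = fi.std(ddof=1)
    return fi > center + k * spread

# Made-up influence values with one clearly extreme last case.
fi_values = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 1.0, 10.0]
flags_sd = flag_influential(fi_values, k=2.0, robust=False)
flags_mad = flag_influential(fi_values, k=2.0, robust=True)
```

Because a single extreme value inflates the sample SD far more than the MAD, the robust rule has a lower threshold and tends to flag the outlier more often.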

Table 2.

Percentages of detecting the outlier using non-robust methods and robust methods for different values of δ and thresholds. Five hundred simulated datasets were used for each case.

The outlier is in the response
		> mean + 2 × SD					> mean + 3 × SD
Perturbation	Stat	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5
1	FI_{η̂_o(ω⁰)}	0.108	0.420	0.718	0.888	0.966	0.078	0.354	0.618	0.840	0.944
	FI_{f_lr(ω⁰)}	0.112	0.404	0.678	0.850	0.932	0.084	0.350	0.624	0.814	0.916
2	FI_{η̂_o(ω⁰)}	0.112	0.426	0.720	0.888	0.974	0.060	0.310	0.576	0.808	0.926
	FI_{f_lr(ω⁰)}	0.140	0.466	0.740	0.884	0.950	0.084	0.350	0.636	0.812	0.908
		> median + 2 × MAD					> median + 3 × MAD
Perturbation	Stat	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5
1	FI_{η̂_o(ω⁰)}	0.484	0.772	0.924	0.982	0.998	0.434	0.746	0.906	0.974	0.998
	FI_{f_lr(ω⁰)}	0.274	0.642	0.858	0.960	0.996	0.254	0.642	0.844	0.958	0.994
2	FI_{η̂_o(ω⁰)}	0.336	0.642	0.854	0.958	0.994	0.228	0.526	0.792	0.936	0.986
	FI_{f_lr(ω⁰)}	0.448	0.756	0.920	0.974	0.996	0.358	0.696	0.872	0.966	0.996
The outlier is in the covariates
		> mean + 2 × SD					> mean + 3 × SD
Perturbation	Stat	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5
1	FI_{η̂_o(ω⁰)}	0.150	0.464	0.762	0.806	0.830	0.100	0.366	0.712	0.804	0.830
	FI_{f_lr(ω⁰)}	0.102	0.320	0.580	0.710	0.800	0.080	0.264	0.516	0.680	0.784
2	FI_{η̂_o(ω⁰)}	0.192	0.534	0.784	0.816	0.832	0.090	0.388	0.724	0.814	0.830
	FI_{f_lr(ω⁰)}	0.136	0.376	0.644	0.750	0.820	0.082	0.270	0.532	0.688	0.788
		> median + 2 × MAD					> median + 3 × MAD
Perturbation	Stat	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5
1	FI_{η̂_o(ω⁰)}	0.460	0.734	0.828	0.844	0.858	0.422	0.704	0.820	0.840	0.852
	FI_{f_lr(ω⁰)}	0.246	0.498	0.750	0.804	0.834	0.232	0.482	0.738	0.798	0.834
2	FI_{η̂_o(ω⁰)}	0.382	0.706	0.820	0.828	0.840	0.282	0.620	0.806	0.822	0.836
	FI_{f_lr(ω⁰)}	0.398	0.644	0.810	0.828	0.848	0.338	0.586	0.780	0.814	0.838


Figure 2.

Index plots of influence measures from a simulated dataset with y₁₀₀ as an influential case: (a) FI_{η̂_o(ω⁰),e_i} and (b) FI_{f_lr(ω⁰),e_i} for the variance perturbation; (c) FI_{η̂_o(ω⁰),e_i} and (d) FI_{f_lr(ω⁰),e_i} for the missing covariate perturbation.

Next, we explored the potential deviations of the MAR missing data mechanism in the direction of NMAR. We generated data from model (5) with n = 200, β₀ = β₁ = β₂ = 1, τ = 1, (x_i, z_i) were generated from a N₂(0, I₂) distribution, and the following missing data mechanism for z_i was assumed,

p (r_{i} = 1 ∣ x_{i}, z_{i}, y_{i}) = \frac{\exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + a z_{i})}{1 + \exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + a z_{i})},

(21)

with ξ₀ = −1.8, ξ₁ = 1.0, and ξ₂ = 1.0 being chosen to make the missing data fraction approximately 40% for various values of a. If a ≠ 0, then the missing data mechanism is nonignorable. We fit y_i = β₀ + β₁x_i + β₂z_i + ∊_i with the MAR missing data mechanism given by

p (r_{i} = 1 ∣ x_{i}, z_{i}, y_{i}) = \frac{\exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i})}{1 + \exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i})} .

(22)

Then, we applied a global perturbation given by

p (r_{i} = 1 ∣ y_{i}, x_{i}, z_{i}, ω) = \frac{\exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + ω z_{i})}{1 + \exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + ω z_{i})} .

(23)

The values of FI_flr (ω⁰) were 0.084, 1.448, and 4.795 for a = 0, 0.5, and 1.0, respectively. Thus FI_flr (ω⁰) increases with a, which may suggest that an NMAR model is tenable for large a. We also used the corresponding single-case perturbation given by

p (r_{i} = 1 ∣ y_{i}, x_{i}, z_{i}, ω) = \frac{\exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + ω_{i} z_{i})}{1 + \exp (ξ_{0} + ξ_{1} y_{i} + ξ_{2} x_{i} + ω_{i} z_{i})} .

(24)

No large FI_{f_lr(ω⁰),e_i} was observed for any i, even when a is large. This result might suggest that this type of NMAR mechanism is not detectable using only FI_{f_lr(ω⁰),e_i}, the diagonal entries of $G^{-1/2} \nabla_{f_{lr}} W_{f_{lr}} \nabla_{f_{lr}}^{'} G^{-1/2}$, confirming the analyses in Jansen et al. (2006). However, we observed increases in the off-diagonal entries of $G^{-1/2} \nabla_{f_{lr}} W_{f_{lr}} \nabla_{f_{lr}}^{'} G^{-1/2}$ as a increases, indicating influence through combinations of cases.
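The data-generating step of this experiment can be sketched as below. The sketch rests on two assumptions that are not stated in this section: model (5) is taken to be the linear model y_i = β₀ + β₁x_i + β₂z_i + ∊_i that is fitted here, and r_i = 1 is taken to flag a missing z_i (this coding is inferred from the stated missing fraction of about 40%; under the opposite coding the same probabilities would give about 60% missing). The function names are hypothetical, and a larger n than the paper's n = 200 is used only to stabilize the empirical fraction.

```python
import math
import random


def expit(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def missing_prob(y, x, z, xi, omega=0.0):
    """P(r = 1 | y, x, z) under the global perturbation (23).

    omega = 0 recovers the MAR mechanism (22); omega = a recovers the
    data-generating mechanism (21).
    """
    xi0, xi1, xi2 = xi
    return expit(xi0 + xi1 * y + xi2 * x + omega * z)


def simulate(n, a, seed=0):
    """Draw (y, x, z, r) with beta0 = beta1 = beta2 = 1, tau = 1, and mechanism (21)."""
    rng = random.Random(seed)
    xi = (-1.8, 1.0, 1.0)
    data = []
    for _ in range(n):
        x, z, eps = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        y = 1.0 + x + z + eps
        r = 1 if rng.random() < missing_prob(y, x, z, xi, omega=a) else 0
        data.append((y, x, z, r))
    return data


for a in (0.0, 0.5, 1.0):
    sample = simulate(20000, a)
    fraction = sum(row[3] for row in sample) / len(sample)
    print(f"a = {a}: empirical fraction with r = 1 is {fraction:.3f}")
```

Writing the mechanism as a single function of ω makes the nesting explicit: the fitted MAR model is the point ω⁰ = 0 of the perturbed family, and the true NMAR model sits at ω = a.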

As noted in Hens et al. (2005) and Jansen et al. (2006), a local influence tool for the missing data mechanism is able to pick up anomalous features of cases that are not necessarily related to the missing data mechanism. To study this notion, we generated an original dataset from model (5) with n = 200, β₀ = β₁ = β₂ = 1, τ = 1, where (x_i, z_i) were generated from a N₂(0, I₂) distribution, and MAR was assumed. Then we generated a perturbed dataset in which we added 20 to the responses of the last five cases. We fit y_i = β₀ + β₁x_i + β₂z_i + ∊_i with the MAR missing data mechanism given by equation (22). The perturbation (24) identified the last five cases as influential. Thus single-case perturbation for the missing data mechanism is able to pick up some deviations in the data even though the deviations are different from the functional form of the missing data mechanism. The global perturbation (23) resulted in FI_flr (ω⁰) = 1.61, a big qualitative change compared to FI_flr (ω⁰) = 0.011 for the original dataset. These results may thus raise some concerns about the MAR assumption, and/or about the model as a whole.

We also examined whether our influence measures can assess the relationship between the response and the covariates of interest. We generated data from $y_{i} = 1 + x_{i} + z_{i} + c z_{i}^{2} + ∊_{i}$ for i = 1, …, 100, where ∊_i ~ N(0, 1) and (x_i, z_i) were independently generated from a N₂(0, I₂) distribution. The missing data mechanism was assumed MAR as in equation (20) with a 40% missingness fraction. We fit y_i = β₀ + β₁x_i + β₂z_i + ∊_i assuming MAR z_i's, and thus the fitted model would be misspecified if c ≠ 0. We considered a global perturbation scheme $y_{i} = β_{0} + β_{1} x_{i} + β_{2} z_{i} + \sum_{j = 1}^{m + 3} ω_{j} B_{j} (z_{i}) + ∊_{i}$, where the B_j (z) are truncated polynomials of order 2 to 4, given by $z^{2}, z^{3}, z^{4}, {(z - k_{1})}_{+}^{4}, \dots, {(z - k_{m})}_{+}^{4}$, and k₁, …, k_m are the m = 3 prefixed knots. The principal eigenvalue of FI_flr (ω⁰) was 0.582, 13.675, 24.535, and 33.233 for c = 0, 0.4, 0.8, and 1.2, respectively. The principal eigenvalue of FI_flr (ω⁰) was statistically significant at the 5% significance level (p-value = 0.002) for c = 0.8, but not significant for c < 0.8. Thus, the local influence measures are useful for detecting model misspecification in this example.
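The truncated polynomial basis used in this perturbation can be written out as a short sketch. The knot locations below are hypothetical, since the text only says that m = 3 knots were fixed in advance, and the function names are ours.

```python
def truncated_power_basis(z, knots):
    """Return [z^2, z^3, z^4, (z - k_1)_+^4, ..., (z - k_m)_+^4] for a scalar z."""
    powers = [z ** 2, z ** 3, z ** 4]
    truncated = [max(z - k, 0.0) ** 4 for k in knots]
    return powers + truncated


def perturbed_mean(x, z, beta, omega, knots):
    """Mean of y under the global perturbation: linear fit plus sum_j omega_j * B_j(z)."""
    basis = truncated_power_basis(z, knots)
    linear = beta[0] + beta[1] * x + beta[2] * z
    return linear + sum(w * b for w, b in zip(omega, basis))


knots = (-1.0, 0.0, 1.0)          # hypothetical knots, m = 3
omega0 = [0.0] * (len(knots) + 3)  # no perturbation: m + 3 zeros
print(truncated_power_basis(2.0, knots))
print(perturbed_mean(0.5, 2.0, (1.0, 1.0, 1.0), omega0, knots))
```

With ω⁰ = 0 the perturbed mean coincides with the fitted linear model, so the perturbation probes departures from linearity in z in m + 3 directions, one of which (z²) is exactly the misspecification introduced by c ≠ 0.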

5. Real Data Analysis

5.1 Quality-of-Life Data

As mentioned in Section 1, the response variable for these data is the logarithm of the survival time, in which all cases are light censored. The dataset has 404 observations and the covariates are: physical ability (z₁); mood (z₂); indicator for treatment A (yes, no) (x₁); indicator for treatment B (yes, no) (x₂); indicator for treatment C (yes, no) (x₃); age (x₄); and language (English, otherwise)(x₅). Among these seven covariates, z₁ has 13% missingness and z₂ has 31% missingness, and the remaining covariates are fully observed.

We fit a regression model $y_{i} = v_{i}^{'} β + ∊_{i}$ , where ∊_i ~ N (0, τ), $v_{i}^{'} = (1, z_{i 1}, z_{i 2}, x_{i 1}, \dots, x_{i 5})$ is the 1 × 8 vector of covariates and β = (β₀, β₁, …, β₇)^T are the corresponding regression coefficients. Because only the continuous covariates z₁ and z₂ have missing values, we assumed (z_i1, z_i2) ~ N₂(α₁, α₂), for i = 1, …, n. We assumed that the missing covariates are MAR and calculated the maximum likelihood estimates of (β, τ, α₁, α₂) using the expectation-maximization (EM) algorithm.
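To make the EM step concrete, the following sketch treats a simplified analogue of this model: a single missing covariate z with z ~ N(μ, σ²) independent of one fully observed covariate x, rather than the two missing covariates and five observed covariates of the quality-of-life data. It is an illustration under those assumptions and not the authors' implementation. All function and variable names are hypothetical, and the simulated missingness mechanism is borrowed from equation (22) only to produce MAR data.

```python
import math
import random


def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def em_missing_covariate(y, x, z, n_iter=100):
    """EM for y = b0 + b1*x + b2*z + e, e ~ N(0, tau), z ~ N(mu, s2).

    z[i] is None when missing; the missingness is assumed ignorable (MAR).
    """
    n = len(y)
    obs = [v for v in z if v is not None]
    mu = sum(obs) / len(obs)
    s2 = sum((v - mu) ** 2 for v in obs) / len(obs)
    beta, tau = [0.0, 0.0, 0.0], 1.0
    for _ in range(n_iter):
        # E-step: conditional mean and variance of each z_i given (y_i, x_i).
        var_mis = 1.0 / (1.0 / s2 + beta[2] ** 2 / tau)
        ez, vz = [], []
        for yi, xi, zi in zip(y, x, z):
            if zi is None:
                resid = yi - beta[0] - beta[1] * xi
                ez.append(var_mis * (mu / s2 + beta[2] * resid / tau))
                vz.append(var_mis)
            else:
                ez.append(zi)
                vz.append(0.0)
        # M-step: normal equations built from expected sufficient statistics.
        A = [[0.0] * 3 for _ in range(3)]
        b = [0.0] * 3
        for yi, xi, m, v in zip(y, x, ez, vz):
            d = (1.0, xi, m)
            for j in range(3):
                b[j] += d[j] * yi
                for k in range(3):
                    A[j][k] += d[j] * d[k]
            A[2][2] += v  # E[z^2] = E[z]^2 + Var(z)
        beta = solve(A, b)
        tau = sum((yi - beta[0] - beta[1] * xi - beta[2] * m) ** 2 + beta[2] ** 2 * v
                  for yi, xi, m, v in zip(y, x, ez, vz)) / n
        mu = sum(ez) / n
        s2 = sum(m * m + v for m, v in zip(ez, vz)) / n - mu * mu
    return beta, tau, mu, s2


# Toy check on simulated data (true beta = (1, 1, 1), tau = 1, mu = 0, s2 = 1).
rng = random.Random(1)
y, x, z = [], [], []
for _ in range(1000):
    xi, zi = rng.gauss(0, 1), rng.gauss(0, 1)
    yi = 1.0 + xi + zi + rng.gauss(0, 1)
    p_missing = 1.0 / (1.0 + math.exp(-(-1.8 + yi + xi)))  # depends on y, x only
    y.append(yi)
    x.append(xi)
    z.append(None if rng.random() < p_missing else zi)
beta_hat, tau_hat, mu_hat, s2_hat = em_missing_covariate(y, x, z)
print([round(v, 2) for v in beta_hat], round(tau_hat, 2))
```

Because the missingness depends only on observed quantities, the mechanism can be ignored in the likelihood, which is why the sketch never models r_i. Both M-step updates are in closed form here; that convenience is specific to the normal linear case.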

To detect the influential cases, we employed two single-case perturbation schemes. The first was to perturb the variance of ∊_i such that Var(∊₁, …, ∊_n) = τ diag(1/ω₁, …, 1/ω_n). We calculated FI_{η̂_o(ω⁰),e_i} and FI_{f_lr(ω⁰),e_i}, and both indicated that cases 132 and 404 were very influential (Figures 3a and b). The second perturbation was to simultaneously perturb the missing covariates z_i1 and z_i2 such that y_i = β₀ + β₁(z_i1 + ω_i) + β₂(z_i2 + ω_i) + β₃x_i1 + ⋯ + β₇x_i5 + ∊_i. Again, cases 132 and 404 were very influential (Figures 3c and d). The response values of cases 132 and 404 are very small compared to the rest of the cases.

Figure 3.

Index plots of influence measures for quality-of-life data: (a) FI_{η̂_o(ω⁰),e_i} and (b) FI_{f_lr(ω⁰),e_i} for the variance perturbation; (c) FI_{η̂_o(ω⁰),e_i} and (d) FI_{f_lr(ω⁰),e_i} for the missing covariate perturbation.

Next, we were interested in sensitivity analyses regarding the MAR assumption in the direction of NMAR. First, we fit the model with a MAR missing data mechanism

p (r_{i} ∣ x_{i}, y_{i}, ξ) = p (r_{i 2} ∣ r_{i 1}, x_{i}, y_{i}, ξ_{2}) p (r_{i 1} ∣ x_{i}, y_{i}, ξ_{1}),

(25)

where $p (r_{i 1} ∣ x_{i}, y_{i}, ξ_{1}) = \frac{\exp (r_{i 1} f_{i 1})}{1 + \exp (f_{i 1})}$, $p (r_{i 2} ∣ r_{i 1}, x_{i}, y_{i}, ξ_{2}) = \frac{\exp (r_{i 2} f_{i 2})}{1 + \exp (f_{i 2})}$, f_i1 = ξ₁₀ + ξ₁₁x_i1 + ⋯ + ξ₁₅x_i5 + ξ₁₆y_i, and f_i2 = ξ₂₀ + ξ₂₁x_i1 + ⋯ + ξ₂₅x_i5 + ξ₂₆y_i + ξ₂₇r_i1. Then, we considered a global perturbation

\begin{aligned} p (r_{i} ∣ x_{i}, z_{i}, y_{i}, ξ, ω) &= p (r_{i 2} ∣ r_{i 1}, x_{i}, z_{i}, y_{i}, ξ_{2}, ω) \times p (r_{i 1} ∣ x_{i}, z_{i}, y_{i}, ξ_{1}, ω), \\ p (r_{i 1} ∣ x_{i}, z_{i}, y_{i}, ξ_{1}, ω) &= \frac{\exp [r_{i 1} (f_{i 1} + ω_{1} z_{i 1} + ω_{2} z_{i 2})]}{1 + \exp (f_{i 1} + ω_{1} z_{i 1} + ω_{2} z_{i 2})}, \\ p (r_{i 2} ∣ r_{i 1}, x_{i}, z_{i}, y_{i}, ξ_{2}, ω) &= \frac{\exp [r_{i 2} (f_{i 2} + ω_{3} z_{i 1} + ω_{4} z_{i 2})]}{1 + \exp (f_{i 2} + ω_{3} z_{i 1} + ω_{4} z_{i 2})} . \end{aligned}

(26)

The principal eigenvalue of FI_flr (ω⁰) was 0.11, far smaller than the weighted chi-squared 0.05 cut-off point. This may suggest that the missing data mechanism is likely to be MAR.

In fitting the model using equation (25), the large value of the estimate for ξ₂₆ indicated that the missingness of z₂ might depend on the response, whereas the estimates for all other ξ's were nonsignificant. To check whether the local influence measure picks up this dependence, we dropped the y_i term in f_i2 of equation (25), leading to f_i2 = ξ₂₀ + ξ₂₁x_i1 + ⋯ + ξ₂₅x_i5 + ξ₂₇r_i1. Then we used the global perturbation in equation (26) with

\begin{aligned} p (r_{i 1} ∣ x_{i}, y_{i}, ξ_{1}, ω) &= \frac{\exp (r_{i 1} f_{i 1})}{1 + \exp (f_{i 1})} \quad \text{and} \\ p (r_{i 2} ∣ r_{i 1}, x_{i}, y_{i}, ξ_{2}, ω) &= \frac{\exp [r_{i 2} (f_{i 2} + ω y_{i})]}{1 + \exp (f_{i 2} + ω y_{i})} . \end{aligned}

It turned out that FI_flr (ω⁰) was 4.51, which is larger than the chi-squared 0.05 cut-off point. This suggests that the missingness of z₂ may depend on the response.
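The comparison with the cut-off can be reproduced numerically. The sketch below assumes that, because ω is a scalar here, the reference distribution is chi-squared with one degree of freedom; the text does not state the degrees of freedom, so this is an assumption. It uses the identity that a chi-squared variable with one degree of freedom is the square of a standard normal variable, and the function name is ours.

```python
from statistics import NormalDist


def chi2_1df_cutoff(alpha=0.05):
    """Upper-alpha quantile of chi-squared with 1 df, using chi2_1 = Z^2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


cutoff = chi2_1df_cutoff(0.05)
fi_value = 4.51  # value of the influence measure reported in the text
print(f"cut-off = {cutoff:.3f}, FI = {fi_value}, exceeds cut-off: {fi_value > cutoff}")
```

Under this assumption the reported value lies above the cut-off, in line with the conclusion drawn in the text.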

Furthermore, to assess the linear relationship between the response and the covariates (z₁, z₂), we employed a global perturbation scheme as follows:

\begin{aligned} y_{i} &= β_{0} + β_{1} z_{i 1} + β_{2} z_{i 2} + β_{3} x_{i 1} + \dots + β_{7} x_{i 5} \\ &\quad + \sum_{j = 1}^{m + 3} ω_{j} B_{j} (z_{i 1}) + \sum_{j = 1}^{m + 3} ω_{j + m + 3} B_{j} (z_{i 2}) + ∊_{i}, \end{aligned}

where the B_j (z) are truncated polynomials of order 2 to 4 given by $z^{2}, z^{3}, z^{4}, {(z - k_{1})}_{+}^{4}, \dots, {(z - k_{m})}_{+}^{4}$, and k₁, …, k_m are the m = 3 prefixed knots. The principal eigenvalue of FI_flr (ω⁰) was 3.44, which was not statistically significant at the 5% significance level (p-value = 0.65). Thus, the fitted model appears to be robust to this global perturbation scheme.

5.2 Liver Cancer Data

To further illustrate our proposed methods, we revisit the liver cancer data as introduced in Section 1 (Ibrahim, Chen, and Lipsitz, 1999). We are interested in how the number of cancerous liver nodes (y) when entering the trials is predicted by six other baseline characteristics: time since diagnosis of the disease (in weeks) (z₁); two biochemical markers (each classified as normal or abnormal), alpha-fetoprotein (z₂) and anti-hepatitis B antigen (z₃); associated jaundice (yes, no) (x₁); body mass index (weight in kilograms divided by the square of height in meters) (x₂); and age (in years) (x₃).

We used a Poisson regression model, $p (y_{i} ∣ x_{i}, z_{i}, β) \propto \exp [y_{i} (v_{i}^{T} β) - \exp (v_{i}^{T} β)]$ , where $v_{i}^{T} = (1, x_{i 1}, x_{i 2}, x_{i 3}, z_{i 1}, z_{i 2}, z_{i 3})$ is the 1 × 7 vector of covariates including an intercept, and β = (β₀, β₁, …, β₆)^T are the corresponding regression coefficients. Logarithm of the time since diagnosis was used to achieve approximate normality. Because only z_i = (z_i1, z_i2, z_i3) has missing values, we need to consider a joint distribution only for these covariates. Because z_i2 and z_i3 were both dichotomous, we used logistic regressions. Thus,

\begin{aligned} p (z_{i 1}, z_{i 2}, z_{i 3} ∣ x_{i}, α) &= p (z_{i 3} ∣ z_{i 1}, z_{i 2}, x_{i}, α_{3}) \\ &\quad \times p (z_{i 2} ∣ z_{i 1}, x_{i}, α_{2}) \times p (z_{i 1} ∣ x_{i}, α_{1}), \end{aligned}

where α = (α₁, α₂, α₃) and (z_i3 | z_i1, z_i2, x_i) is a logistic regression with

p (z_{i 3} = 1 ∣ z_{i 1}, z_{i 2}, x_{i}, α_{3}) = \frac{\exp (α_{30} + α_{31} z_{i 1} + α_{32} z_{i 2} + α_{3 x}^{T} x_{i})}{1 + \exp (α_{30} + α_{31} z_{i 1} + α_{32} z_{i 2} + α_{3 x}^{T} x_{i})},

and $α_{3 x}^{T} = (α_{33}, α_{34}, α_{35})$ . Similarly,

p (z_{i 2} = 1 ∣ z_{i 1}, x_{i}, α_{2}) = \frac{\exp (α_{20} + α_{21} z_{i 1} + α_{2 x}^{T} x_{i})}{1 + \exp (α_{20} + α_{21} z_{i 1} + α_{2 x}^{T} x_{i})},

and $α_{2 x}^{T} = (α_{22}, α_{23}, α_{24})$ . In addition, we took a normal distribution for the missing covariate z₁, specifically, z_i1 ~ N (α₁₁, α₁₂) and $α_{1}^{T} = (α_{11}, α_{12})$ . We assumed that the missing covariates are MAR and estimated (β, α) using the EM algorithm.
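The model specification above can be summarized by the complete-data log-likelihood contribution of a single case, sketched below. The sketch assumes that α₁₂ denotes a variance (as τ does in N(0, τ)) and that every covariate of the case is available, as it would be inside an E-step; the function names are hypothetical.

```python
import math


def log_expit(t):
    """log(1 / (1 + exp(-t))), computed stably."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def bernoulli_logit_logpmf(value, eta):
    """log P(value) for value in {0, 1} when the log-odds of 1 is eta."""
    return log_expit(eta) if value == 1 else log_expit(-eta)


def complete_data_loglik(y, x, z, beta, alpha1, alpha2, alpha3):
    """Log of p(y | x, z, beta) * p(z3 | z1, z2, x) * p(z2 | z1, x) * p(z1) for one case.

    x = (x1, x2, x3); z = (z1, z2, z3) with z2, z3 in {0, 1};
    alpha1 = (mean, variance) of z1 (variance is an assumption about alpha_12).
    """
    z1, z2, z3 = z
    v = [1.0, *x, z1, z2, z3]  # same ordering as v_i in the Poisson model
    eta = sum(b * vi for b, vi in zip(beta, v))
    ll_y = y * eta - math.exp(eta) - math.lgamma(y + 1)  # Poisson log pmf
    mean1, var1 = alpha1
    ll_z1 = -0.5 * math.log(2 * math.pi * var1) - (z1 - mean1) ** 2 / (2 * var1)
    eta2 = alpha2[0] + alpha2[1] * z1 + sum(a * xi for a, xi in zip(alpha2[2:], x))
    eta3 = (alpha3[0] + alpha3[1] * z1 + alpha3[2] * z2
            + sum(a * xi for a, xi in zip(alpha3[3:], x)))
    ll_z2 = bernoulli_logit_logpmf(z2, eta2)
    ll_z3 = bernoulli_logit_logpmf(z3, eta3)
    return ll_y + ll_z1 + ll_z2 + ll_z3


value = complete_data_loglik(
    y=0, x=(0.0, 0.0, 0.0), z=(0.0, 1, 0),
    beta=[0.0] * 7, alpha1=(0.0, 1.0), alpha2=[0.0] * 5, alpha3=[0.0] * 6,
)
print(round(value, 4))
```

The term −log y! is dropped in the proportionality statement of the text but kept here so that the function returns a proper log density; it does not affect estimation of β.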

To detect the influential cases, we employed a perturbation to simultaneously perturb the missing covariates z_i1, z_i2, and z_i3 such that y_i = β₀ + β₁(z_i1 + ω_i) + β₂(z_i2 + ω_i) + β₃(z_i3 + ω_i) + ⋯ + β₆x_i3 + ∊_i. Both FI_{η̂_o(ω⁰),e_i} and FI_{f_lr(ω⁰),e_i} indicated that cases 10, 15, 65, and 160 were very influential for this perturbation (Figures 4a and b). Then we employed a perturbation to the distribution of z_i1 such that z_i1 ~ N (α₁₁ + ω_i, α₁₂), i = 1, …, n, and both influence measures detected case 131 to be influential for the distributional assumption of z_i1 (Figures 4c and d). These findings confirmed the suspected cases reported in Table 1.

Figure 4.

Index plots of influence measures for liver cancer data: (a) FI_{η̂_o(ω⁰),e_i} and (b) FI_{f_lr(ω⁰),e_i} for the missing covariate perturbation; (c) FI_{η̂_o(ω⁰),e_i} and (d) FI_{f_lr(ω⁰),e_i} for the perturbation to the distribution of z_i1.

Next, we examined the functional form of the missing data mechanism given by

\begin{aligned} p (r_{i} ∣ x_{i}, y_{i}, ξ) &= p (r_{i 3} ∣ r_{i 1}, r_{i 2}, x_{i}, y_{i}, ξ_{3}) \\ &\quad \times p (r_{i 2} ∣ r_{i 1}, x_{i}, y_{i}, ξ_{2}) \times p (r_{i 1} ∣ x_{i}, y_{i}, ξ_{1}), \end{aligned}

p (r_{i 1} ∣ y_{i}, x_{i}, ξ_{1}) = \frac{\exp (r_{i 1} f_{i 1})}{1 + \exp (f_{i 1})},

(27)

p (r_{i 2} ∣ r_{i 1}, y_{i}, x_{i}, ξ_{2}) = \frac{\exp (r_{i 2} f_{i 2})}{1 + \exp (f_{i 2})},

(28)

p (r_{i 3} ∣ r_{i 1}, r_{i 2}, y_{i}, x_{i}, ξ_{3}) = \frac{\exp (r_{i 3} f_{i 3})}{1 + \exp (f_{i 3})},

(29)

in which f _i1 = ξ₁₀ + ξ₁₁x_i1 + ξ₁₂x_i2 + ξ₁₃x_i3 + ξ₁₄y_i, f _i2 = ξ₂₀ + ξ₂₁x_i1 + ξ₂₂x_i2 + ξ₂₃x_i3 + ξ₂₄y_i + ξ₂₅r_i1, and f _i3 = ξ₃₀ + ξ₃₁x_i1 + ξ₃₂x_i2 + ξ₃₃x_i3 + ξ₃₄y_i + ξ₃₅r_i1 + ξ₃₆r_i2. Then, we considered a global perturbation for the missing mechanism:

\begin{aligned} p (r_{i} ∣ x_{i}, z_{i}, y_{i}, ξ, ω) &= p (r_{i 3} ∣ r_{i 1}, r_{i 2}, x_{i}, z_{i}, y_{i}, ξ_{3}, ω) \\ &\quad \times p (r_{i 2} ∣ r_{i 1}, x_{i}, z_{i}, y_{i}, ξ_{2}, ω) \\ &\quad \times p (r_{i 1} ∣ x_{i}, z_{i}, y_{i}, ξ_{1}, ω), \end{aligned}

\begin{aligned} p (r_{i 1} ∣ x_{i}, z_{i}, y_{i}, ξ_{1}, ω) &= \frac{\exp [r_{i 1} (f_{i 1} + ω_{1} z_{i 1} + ω_{2} z_{i 2} + ω_{3} z_{i 3})]}{1 + \exp (f_{i 1} + ω_{1} z_{i 1} + ω_{2} z_{i 2} + ω_{3} z_{i 3})}, \\ p (r_{i 2} ∣ r_{i 1}, x_{i}, z_{i}, y_{i}, ξ_{2}, ω) &= \frac{\exp [r_{i 2} (f_{i 2} + ω_{4} z_{i 1} + ω_{5} z_{i 2} + ω_{6} z_{i 3})]}{1 + \exp (f_{i 2} + ω_{4} z_{i 1} + ω_{5} z_{i 2} + ω_{6} z_{i 3})}, \\ p (r_{i 3} ∣ r_{i 2}, r_{i 1}, x_{i}, z_{i}, y_{i}, ξ_{3}, ω) &= \frac{\exp [r_{i 3} (f_{i 3} + ω_{7} z_{i 1} + ω_{8} z_{i 2} + ω_{9} z_{i 3})]}{1 + \exp (f_{i 3} + ω_{7} z_{i 1} + ω_{8} z_{i 2} + ω_{9} z_{i 3})} . \end{aligned}

The principal eigenvalue of FI_flr (ω⁰) (0.24) was quite small, which suggests that the missing data mechanism is likely to be MAR. Following the arguments in Zhu et al. (2007), we considered a single-case perturbation for the missing mechanism as follows:

p (r_{i 1} ∣ x_{i}, z_{i}, y_{i}, ξ_{1}, ω) = \frac{\exp [r_{i 1} (f_{i 1} + ω_{i} (z_{i 1} / s_{z 1} + z_{i 2} / s_{z 2} + z_{i 3} / s_{z 3}))]}{1 + \exp (f_{i 1} + ω_{i} (z_{i 1} / s_{z 1} + z_{i 2} / s_{z 2} + z_{i 3} / s_{z 3}))},

where s_{z 1}, s_{z 2}, and s_{z 3} are the sample standard deviations for z₁, z₂, and z₃, respectively. Then, a similar perturbation was introduced for r_i2 and r_i3. All perturbations revealed case 131 to be influential, whereas case 65 was revealed as influential only by the perturbation for r_i3. The reason that cases 10, 15, and 160 did not stand out under the single-case perturbation for any of the missing covariates, and that case 65 did not stand out under the single-case perturbation for z₁ or z₂, is that: (i) they all have very large values in the response; (ii) large response values y_i tend to yield large values of p(r_i = 1 | x_i, y_i, ξ) for all z₁, z₂, and z₃; and (iii) cases 10, 15, and 160 have no missing values in z₁, z₂, and z₃, so they fit equations (27), (28), and (29) well, whereas case 65 has no missing values in z₁ and z₂, so it fits equations (27) and (28) well.
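The standardized single-case perturbation can be sketched as follows. The sketch assumes complete rows of (z₁, z₂, z₃) purely for illustration; with real data the standard deviations would be computed from the observed values only. The function names and the toy rows are hypothetical.

```python
import math
import statistics


def standardized_sum(z_rows):
    """For each case, return z_i1/s_z1 + z_i2/s_z2 + z_i3/s_z3 (column sample SDs)."""
    columns = list(zip(*z_rows))
    sds = [statistics.stdev(col) for col in columns]
    return [sum(v / s for v, s in zip(row, sds)) for row in z_rows]


def perturbed_prob(f_i1, omega_i, t_i):
    """P(r_i1 = 1) under the single-case perturbation: expit(f_i1 + omega_i * t_i).

    omega_i = 0 recovers the unperturbed model in equation (27).
    """
    return 1.0 / (1.0 + math.exp(-(f_i1 + omega_i * t_i)))


z_rows = [(1.0, 0.0, 0.0), (3.0, 1.0, 1.0)]  # toy rows, not the liver cancer data
t = standardized_sum(z_rows)
print([round(v, 4) for v in t])
print(perturbed_prob(0.3, 0.0, t[1]), perturbed_prob(0.3, 0.1, t[1]))
```

Dividing each covariate by its standard deviation puts the continuous z₁ and the binary z₂ and z₃ on a comparable scale, so that a single ω_i per case does not weight one covariate disproportionately.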

6. Discussion

We have developed a general local influence methodology for carrying out sensitivity analyses in GLMs with MAR or NMAR covariate data. We have also proposed a novel methodology for choosing an appropriate perturbation scheme and examined several influence measures within this context. The simulation studies and the real datasets showed very promising results for the proposed methods. We emphasize again that in missing data problems, there is typically little information in the data regarding the form of the missing data mechanism, and the parametric assumption of the missing data mechanism itself is not “testable” from the data. Thus, NMAR modeling should be viewed as a sensitivity analysis concerning a more complicated model. In this sense, it is not advisable to carry out formal tests directly to assess and compare MAR and NMAR models. Future work in this area includes extending these methodologies to the Cox proportional hazards model with right censored survival data and missing covariates, as well as to parametric and semi-parametric models for longitudinal data with MAR or NMAR response and/or covariate data.

Acknowledgements

This work was supported in part by National Science Foundation (NSF) grants SES-06-43663 and BCS-08-26844 and National Institutes of Health (NIH) grants ULI-RR025747-01 and AG033387 to Dr Zhu and NIH grants GM70335 and CA74015 to Dr Ibrahim. We thank the editor and two referees for valuable suggestions that greatly improved the article.

References

Beckman RJ, Nachtsheim CJ, Cook RD. Diagnostics for mixed-model analysis of variance. Technometrics. 1987;29:413–426.
Cook RD. Assessment of local influence (with discussion). Journal of the Royal Statistical Society, Series B. 1986;48:133–169.
Copas JB, Eguchi S. Local sensitivity approximations for selectivity bias. Journal of the Royal Statistical Society, Series B. 2001;63:871–895.
Copas JB, Eguchi S. Local model uncertainty and incomplete-data bias (with discussion). Journal of the Royal Statistical Society, Series B. 2005;67:459–513.
Copas JB, Li HG. Inference for non-random samples (with discussion). Journal of the Royal Statistical Society, Series B. 1997;59:55–96.
Cox DR, Reid N. Parameter orthogonality and approximate conditional inference (with discussion). Journal of the Royal Statistical Society, Series B. 1987;49:1–39.
Gustafson P. On measuring sensitivity to parametric model misspecification. Journal of the Royal Statistical Society, Series B. 2001;63:81–94.
Hens N, Aerts M, Molenberghs G, Thijs H, Verbeke G. Kernel weighted influence measures. Computational Statistics and Data Analysis. 2005;48:467–487.
Ibrahim JG, Chen MH, Lipsitz SR. Monte Carlo EM for missing covariates in parametric regression models. Biometrics. 1999;55:591–596. doi: 10.1111/j.0006-341x.1999.00591.x.
Ibrahim JG, Lipsitz SR, Chen MH. Missing covariates in generalized linear models when the missing data mechanism is nonignorable. Journal of the Royal Statistical Society, Series B. 1999;61:173–190.
Jansen I, Molenberghs G, Aerts M, Thijs H, Van Steen K. A local influence approach to binary data from a psychiatric study. Biometrics. 2003;59:410–419. doi: 10.1111/1541-0420.00048.
Jansen I, Hens N, Molenberghs G, Aerts M, Verbeke G, Kenward MG. The nature of sensitivity in monotone missing not at random models. Computational Statistics and Data Analysis. 2006;50:830–858.
Lipsitz SR, Ibrahim JG. A conditional model for incomplete covariates in parametric regression models. Biometrika. 1996;83:916–922.
Thomas W, Cook RD. Assessing influence on regression coefficients in generalized linear models. Biometrika. 1989;76:741–749.
Troxel AB. A comparative analysis of quality of life data from a southwest oncology group randomized trial of advanced colorectal cancer. Statistics in Medicine. 1998;17:767–779. doi: 10.1002/(sici)1097-0258(19980315/15)17:5/7<767::aid-sim820>3.0.co;2-b.
Troxel AB, Ma G, Heitjan DF. An index of local sensitivity to nonignorability. Statistica Sinica. 2004;14:1221–1237.
Van Steen K, Molenberghs G, Thijs H. A local influence approach to sensitivity analysis of incomplete longitudinal ordinal data. Statistical Modelling: An International Journal. 2001;1:125–142.
Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG. Sensitivity analysis for non-random dropout: A local influence approach. Biometrics. 2001;57:43–50. doi: 10.1111/j.0006-341x.2001.00007.x.
Zhu HT, Lee SY. Local influence for incomplete-data models. Journal of the Royal Statistical Society, Series B. 2001;63:111–126.
Zhu HT, Ibrahim JG, Lee S, Zhang HP. Assessment of local influence by invariant measures. Annals of Statistics. 2007;35:2565–2588.
