Estimators based on Unconventional Likelihoods with Nonignorable Missing Data and its Application to a Children’s Mental Health Study

Jiwei Zhao; Chi Chen

doi:10.1080/10485252.2019.1664739

Author manuscript; available in PMC: 2020 Oct 2.

Published in final edited form as: J Nonparametr Stat. 2019 Sep 18;31(4):911–931. doi: 10.1080/10485252.2019.1664739


PMCID: PMC7531040 NIHMSID: NIHMS1540645 PMID: 33013146

Abstract

Nonignorable missing data are common in studies where the outcome is relevant to the subject’s behavior. Ibrahim et al. (2001) fitted a logistic regression for a binary outcome subject to nonignorable missing data, and they proposed to replace the outcome in the mechanism model with an auxiliary variable that is completely observed. Their approach requires a correctly specified model for the auxiliary variable; unfortunately, that model still involves the outcome variable subject to nonignorable missingness, so its correct specification is difficult to verify. Instead, we propose two unconventional likelihood based estimation procedures in which the nonignorable missingness mechanism model is completely bypassed. We apply our proposed methods to the children’s mental health study and compare their performance with existing methods. The large sample properties of the proposed estimators are rigorously justified, and their finite sample behaviors are examined via comprehensive simulation studies.

Keywords: Nonignorable missing data, Missingness mechanism, Unconventional likelihood, Pseudo likelihood, Conditional likelihood, Asymptotic normality

1. Introduction

Missing data are a serious problem in many biomedical studies, including clinical trials and observational studies (Ibrahim et al., 2012; Little et al., 2012). Consider a situation in which the outcome variable of interest, Y, is subject to missing values. If one believes that its missingness is unrelated to the possibly unobserved Y-value, the missing data are called ignorable (Little and Rubin, 2002). For ignorable missing data, many well-known statistical procedures are available, and their theoretical properties are rigorously justified. Often, however, there is a suspicion that the missingness is actually related to the value of the possibly unobserved Y itself. This type of missing data is termed nonignorable, and it arises in disciplines where the outcome is directly relevant to the subject’s behavior or is reported by the subjects themselves, for instance through questionnaires or surveys.

Ibrahim et al. (2001) analyzed a data set from a study of mental health of children in Connecticut (Zahner et al., 1992, 1993; Zahner and Daskalakis, 1997), and their outcome variable of interest was the teacher’s report of the psychopathology of the child. Using logistic regression, this variable was modelled as a function of the parental status of the household, father, and the physical health of the child, health. In this study, 42.7% of subjects were missing the teacher’s report of psychopathology. A child’s possibly unobserved psychopathology status may be related to missingness because a teacher is more likely to fill out the psychopathology status when the teacher feels that the child is normal or not normal; hence the missingness is nonignorable. A naive logistic regression model that discards subjects with missing psychopathology status, the so-called complete-case analysis, could lead to highly biased estimates. Thus, methods that take the nonignorable missingness mechanism into account are needed.

Ibrahim et al. (2001) proposed to approximate the nonignorable missingness mechanism with an ignorable one, by taking advantage of the auxiliary variable, Z, the parents’ report of the psychopathology of the child. Their rationale is that the missingness indicator and the outcome variable Y would be independent conditional on Z and other covariates, if Y and Z have a sufficiently high partial correlation. Afterwards their parametric likelihood under this ignorable missingness assumption not only involves the model of interest for Y, but also the regression model for Z on all other variables including Y. They showed that, if Y and Z have a reasonably large correlation, their proposed estimator reduced the estimation bias resulting from the complete-case analysis.

In nonignorable missing data analysis, one is usually interested in estimating the unknown parameter in a parametric outcome model. In this situation, if one fits another parametric model for the missingness mechanism, the overall model specification issue becomes involved, and as a consequence, it may result in severe estimation bias. By observing the existence of the auxiliary variable Z, Ibrahim et al. (2001) avoided the direct parametric modeling of the mechanism. However, to implement their parametric likelihood, they have to fit a parametric model for Z regressing on Y and all other variables. Therefore essentially they still cannot completely avoid the problem of fitting a parametric model with nonignorable missingness.

There is a considerable amount of recent literature on nonignorable missingness mechanism beyond a simplistic parametric modeling. Zhao and Shao (2015) proposed a semi-parametric pseudo likelihood method for generalized linear models based on the existence of a nonresponse instrument (or, shadow variable), and Zhao and Ma (2018) investigated the estimation optimality under their framework. Fang and Shao (2016) and Zhao et al. (2018) studied the variable selection problem with nonignorable missing data from different perspectives. Besides likelihood based methods, Zhao et al. (2017) and Chen and Fang (2019) also considered estimating equation based methods under appropriate assumptions.

In this paper, motivated by the initial analysis result of Ibrahim et al. (2001) that the variable father is unrelated to missingness, we propose two estimation methods based on unconventional likelihoods without fitting the mechanism model or any other model. Our first method is based on conditional likelihood and the second on pseudo likelihood. In both methods we assume the mechanism model is arbitrary except that the missingness is independent of father conditional on all other variables, as supported by the evidence from Ibrahim et al. (2001). The most attractive feature of using these unconventional likelihoods is that, in the objective function, the pieces involving the mechanism model all cancel out; therefore our methods are robust to misspecification of the mechanism model. Unlike Ibrahim et al. (2001), who adopted a purely parametric specification for their models and hence had to conduct heavy sensitivity analyses, we use a semiparametric framework in which we treat the mechanism model as a nonparametric nuisance, and we propose nuisance-free methods to estimate the parameter of interest. We also rigorously derive the asymptotic properties of our proposed estimators.

The auxiliary variable Z is important since it is completely observed and it provides key surrogate information about the outcome variable Y. It is therefore of interest to quantify the relation between Z and Y in our outcome model and to make further use of the auxiliary variable Z in our framework. To this end we consider a more general outcome model for Y in which the auxiliary variable Z also serves as one of the covariates. Instead of relying on the conditional independence between the missingness and father, our proposed methods under this more general outcome model rely on the conditional independence between the missingness and Z. Similarly, we can propose estimators based on unconventional likelihoods for which fitting the mechanism or any other model is not needed.

The structure of the paper is as follows. We first introduce the notation and briefly review the analysis results in Ibrahim et al. (2001) in Section 2. We propose our unconventional likelihood based estimation methods in Section 3 for the model considered in Ibrahim et al. (2001). In Section 4 we present the results for the more general outcome model discussed in the last paragraph. We implement our proposed estimators and compare their performance with Ibrahim et al. (2001) in Section 5. Finally, in Section 6 we conduct comprehensive simulation studies to examine the finite sample performance of our proposed estimators when the data generation process is known, and in Section 7 we conclude our paper.

2. A children’s mental health study

We first briefly describe the study in Ibrahim et al. (2001) and their analysis results. Besides the notation Y as outcome variable (Y = 1 if the teacher reports borderline or clinical psychopathology and Y = 0 if the teacher reports that the child is normal) and Z as auxiliary variable (Z = 1 if the parents’ report is abnormal and Z = 0 if normal) that are used in Section 1, we also introduce the following notation. We denote U = (U₁, U₂)^T where U₁ is the effect of health (U₁ = 1 means there is fair or poor health, a chronic condition or a limitation in activity and U₁ = 0 means there are no health problems), and U₂ is the effect of father (U₂ = 1 if no father figure is present and U₂ = 0 if a father figure is present). The variables U₁, U₂ and Z are all completely observed, but the variable Y has missing values. We use binary variable R as the missingness indicator where R = 1 represents that Y is observed and R = 0 if Y is missing. We use N as the total sample size and n as the sample size without missing values. Without loss of generality, we assume the first n samples are completely observed. Throughout our paper we also use p(⋅ | ⋅) as a generic notation for the conditional probability density function.

Throughout Ibrahim et al. (2001), the model of interest is

logit {P (Y = 1 ∣ U_{1}, U_{2})} = β_{0} + β_{1} U_{1} + β_{2} U_{2},

(1)

where β = (β₀, β₁, β₂)^T is the unknown parameter. One key step of Ibrahim et al. (2001) is to formulate a parametric likelihood function and adopt the EM algorithm for the maximization. To this end, they first proposed a logistic regression model for the missingness mechanism

logit {P (R = 1 ∣ Y, U_{1}, U_{2})} = α_{0} + α_{1} Y + α_{2} U_{1} + α_{3} U_{2} .

They found that the variable U₂ was unrelated to missingness, hence they dropped this term and the mechanism model becomes

logit {P (R = 1 ∣ Y, U_{1})} = α_{0} + α_{1} Y + α_{2} U_{1} .

(2)

One limitation of using a purely parametric approach for the nonignorable mechanism model (2) is potential misspecification. To guard against this problem, heavy sensitivity analyses have to be conducted to assess the parametric specification. If more terms were added as covariates to the model (2), for instance, the interaction between Y and U₁, the model (2) would become non-identifiable.
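To make the consequence of mechanism (2) concrete, the following sketch computes the log-odds of Y among respondents exactly, using Bayes' rule: logit P(Y = 1 | U, R = 1) equals β₀ + β₁U₁ + β₂U₂ plus the offset log{P(R = 1 | Y = 1, U₁) / P(R = 1 | Y = 0, U₁)}. The parameter values are hypothetical and chosen only for illustration; they are not estimates from the study. Because the offset depends on U₁ but not on U₂, the complete-case limits of β₀ and β₁ shift while that of β₂ does not.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def complete_case_logit(u1, u2, beta, alpha):
    """Log-odds of Y = 1 among respondents (R = 1) under models (1) and (2)."""
    b0, b1, b2 = beta
    a0, a1, a2 = alpha
    # Selection offset: log P(R=1 | Y=1, U1) - log P(R=1 | Y=0, U1)
    offset = math.log(expit(a0 + a1 + a2 * u1)) - math.log(expit(a0 + a2 * u1))
    return b0 + b1 * u1 + b2 * u2 + offset

# Hypothetical parameter values, for illustration only
beta = (-1.0, 0.3, 0.5)
alpha = (0.2, -1.0, 0.8)

cc = {(u1, u2): complete_case_logit(u1, u2, beta, alpha)
      for u1 in (0, 1) for u2 in (0, 1)}

# Coefficients a complete-case logistic fit would converge to
cc_beta0 = cc[(0, 0)]
cc_beta1 = cc[(1, 0)] - cc[(0, 0)]
cc_beta2 = cc[(0, 1)] - cc[(0, 0)]
print(cc_beta0, cc_beta1, cc_beta2)
```

Under these assumed values the limits of the intercept and the health coefficient both move away from the values used to generate them, whereas the father coefficient is recovered exactly.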

By taking advantage of the existence of the auxiliary variable Z and recognizing it as a proxy to the outcome Y, Ibrahim et al. (2001) proposed that

P (R = 1 ∣ Y, U_{1}) = P (R = 1 ∣ Z, U_{1}),

hence the mechanism model becomes ignorable and can be eliminated from the parametric likelihood for estimating β. The log-likelihood function to be maximized is then

\sum_{i = 1}^{N} (r_{i} [\log {p (z_{i} ∣ y_{i}, u_{i}; γ)} + \log {p (y_{i} ∣ u_{i}; β)}] + (1 - r_{i}) \log {\int_{y} p (z_{i} ∣ y, u_{i}; γ) p (y ∣ u_{i}; β) dy}),

(3)

where, besides the model of interest p(y | u; β), the model p(z | y, u; γ) is also involved where the parameter γ is unknown.

Therefore, to correctly estimate the parameter of interest β, one has to correctly fit the model p(z | y, u; γ), which also involves nonignorable missing data. The essential difficulty of correctly fitting a parametric model with nonignorable missing data still remains. In their paper, Ibrahim et al. (2001) tried different models to fit p(z | y, u; γ) by adding different combinations of interaction terms.

The major conclusion of Ibrahim et al. (2001) is that the estimator of β from maximizing (3) with the model p(z | y, u; γ) correctly specified, although theoretically biased, will numerically reduce the estimation bias compared to the complete-case estimator if there is a sufficiently strong correlation between Y and Z. They supported this conclusion by analyzing the children’s mental health study data set.

3. Proposed methods

Motivated by the difficulty of correctly specifying a parametric model for the nonignorable missingness mechanism, we propose two unconventional likelihood based methods in which we do not need to specify a concrete form for the mechanism model. In this Section, as in Ibrahim et al. (2001), we aim to estimate the parameter β in model (1); in contrast to their approach, however, we only assume that the missingness mechanism satisfies

P (R = 1 ∣ Y, U_{1}, U_{2}) = P (R = 1 ∣ Y, U_{1}) .

(4)

Note that we are not using the explicit form of the mechanism model as in (2). Note also that assumption (4) is supported by the initial analysis results of Ibrahim et al. (2001). A similar framework that integrates the auxiliary variable Z will be presented in Section 4.

3.1. Method based on conditional likelihood

The idea of the first method originates from studying the model p(y, u₁ | u₂). Note that the likelihood function based on the model p(y, u₁ | u₂) with the observed data {y_i, u_1i, u_2i}, i = 1, …, n, is

\prod_{i = 1}^{n} p (y_{i}, u_{1 i} ∣ u_{2 i}, r_{i} = 1) = \prod_{i = 1}^{n} \frac{P (R = 1 ∣ y_{i}, u_{1 i})}{P (R = 1 ∣ u_{2 i})} p (y_{i}, u_{1 i} ∣ u_{2 i}),

(5)

where P(R = 1 | u₂) = ∫ ∫P(R = 1 | Y, U₁)p(Y, U₁ | u₂)dY dU₁. However, one cannot proceed directly with this likelihood since the terms P(R = 1 | y, u₁) and P(R = 1 | u₂) are all unknown. Motivated by previous work, e.g., Zhao et al. (2018), on decomposing independent and identically distributed observations into their rank statistic and order statistic and working solely with the likelihood of the rank statistic conditional on the order statistic, we consider a similar type of conditional likelihood

\frac{\prod_{i = 1}^{n} p (y_{i}, u_{1 i} ∣ u_{2 i}, r_{i} = 1)}{\sum_{σ \in \sum} \prod_{i = 1}^{n} p (y_{σ (i)}, u_{1 σ (i)} ∣ u_{2 i}, r_{i} = 1)},

(6)

where the denominator comes from a similar type of likelihood for the order statistic, and ∑ represents the collection of all one-to-one mappings from {1, …, n} to {1, …, n}. Plugging the expression in (5) of p(y_i, u_1i | u_2i, r_i = 1) into (6), it can be simplified to

\frac{\prod_{i = 1}^{n} p (y_{i}, u_{1 i} ∣ u_{2 i})}{\sum_{σ \in \sum} \prod_{i = 1}^{n} p (y_{σ (i)}, u_{1 σ (i)} ∣ u_{2 i})},

(7)

since all terms P(R = 1 | y_i, u_1i) and P(R = 1 | u_2i) cancel out.
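The cancellation from (6) to (7) can be checked numerically on a toy sample. The sketch below is only an illustration under assumed ingredients: the joint model `p_y_u1_given_u2`, the response probabilities in `mechanism`, and the four data points are all hypothetical. The mechanism is deliberately not logistic, but it is free of u₂ as required by (4). The conditional likelihood is evaluated once with the respondent densities from (5) and once with the unselected densities, summing over all permutations.

```python
import itertools
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_y_u1_given_u2(y, u1, u2):
    # Hypothetical joint model p(y, u1 | u2) = p(u1 | u2) p(y | u1, u2)
    p_u1 = expit(-0.2 + 0.7 * u2)
    p_y = expit(-1.0 + 0.3 * u1 + 0.5 * u2)
    return (p_u1 if u1 == 1 else 1.0 - p_u1) * (p_y if y == 1 else 1.0 - p_y)

def mechanism(y, u1):
    # Arbitrary P(R = 1 | y, u1); it does not involve u2, as in assumption (4)
    return 0.15 + 0.5 * y * (1 - u1) + 0.3 * u1

def selected_density(y, u1, u2):
    # Equation (5): p(y, u1 | u2, R = 1)
    norm = sum(mechanism(a, b) * p_y_u1_given_u2(a, b, u2)
               for a in (0, 1) for b in (0, 1))
    return mechanism(y, u1) * p_y_u1_given_u2(y, u1, u2) / norm

def conditional_likelihood(density, sample):
    # Ratio of the observed product to the sum over all reassignments of
    # the (y, u1) pairs to the fixed u2 values, as in (6) and (7)
    n = len(sample)
    numer = math.prod(density(y, u1, u2) for (y, u1, u2) in sample)
    denom = 0.0
    for sigma in itertools.permutations(range(n)):
        denom += math.prod(
            density(sample[sigma[i]][0], sample[sigma[i]][1], sample[i][2])
            for i in range(n))
    return numer / denom

sample = [(1, 0, 1), (0, 1, 0), (0, 0, 1), (1, 1, 0)]  # toy (y, u1, u2) triples
with_mechanism = conditional_likelihood(selected_density, sample)
without_mechanism = conditional_likelihood(p_y_u1_given_u2, sample)
print(with_mechanism, without_mechanism)
```

The two printed values agree up to floating-point error, whatever response probabilities are placed in `mechanism`, as long as they do not depend on u₂.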

Computing the expression (7) is infeasible in practice because its denominator sums over all n! permutations; therefore we propose to use a pairwise version of (7):

\prod_{1 \leq i < j \leq n} \frac{p (y_{i}, u_{1 i} ∣ u_{2 i}) p (y_{j}, u_{1 j} ∣ u_{2 j})}{p (y_{i}, u_{1 i} ∣ u_{2 i}) p (y_{j}, u_{1 j} ∣ u_{2 j}) + p (y_{i}, u_{1 i} ∣ u_{2 j}) p (y_{j}, u_{1 j} ∣ u_{2 i})},

which is equivalent to maximizing the objective function

l_{c} (β) = - \frac{2}{N (N - 1)} \sum_{1 \leq i < j \leq N} r_{i} r_{j} \log \{1 + Q_{ij} (β)\},

(8)

where

Q_{ij} (β) = W_{ij} \cdot \frac{p (y_{i} ∣ u_{1 i}, u_{2 j}; β) p (y_{j} ∣ u_{1 j}, u_{2 i}; β)}{p (y_{i} ∣ u_{1 i}, u_{2 i}; β) p (y_{j} ∣ u_{1 j}, u_{2 j}; β)}, and W_{ij} = \frac{p (u_{1 i} ∣ u_{2 j}) p (u_{1 j} ∣ u_{2 i})}{p (u_{1 i} ∣ u_{2 i}) p (u_{1 j} ∣ u_{2 j})} .

(9)

The idea of using conditional likelihood has been applied in the missing data literature, for example, by Liang and Qin (2000) and Zhao and Shao (2017). However, their methods were based on a less flexible assumption and, consequently, not all parameters were identifiable. In our framework, on the contrary, one can verify that the parameter β is fully identifiable in Q_ij(β), hence also in the objective function l_c(β).

To address the term W_ij, note that in the application of the children’s mental health study, both U₁ and U₂ are binary and they are completely observed, therefore we can simply fit a saturated logistic regression model as

logit {P (U_{1} = 1 ∣ U_{2})} = γ_{0} + γ_{1} U_{2},

where γ = (γ₀, γ₁)^T is the unknown parameter. Thereby we can simplify

W_{ij} = \frac{V_{ij} V_{ji}}{V_{ii} V_{jj}} = \exp \{γ_{1} (u_{1 i} - u_{1 j}) (u_{2 j} - u_{2 i})\},

where $V_{ij} = p (u_{1 i} ∣ u_{2 j}; γ) = \frac{\exp \{u_{1 i} (γ_{0} + γ_{1} u_{2 j})\}}{1 + \exp (γ_{0} + γ_{1} u_{2 j})}$. To better facilitate our presentation in the following, we now re-denote Q_ij(β) as Q_ij(β, γ), and l_c(β) as l_c(β, γ).

In numerical implementation, one will first fit a logistic regression model between completely observed variables U₁ and U₂ to obtain $\hat{γ}$ . Then one will plug ${\hat{W}}_{ij}$ into function l_c(β) in (8), represented as $l_{c} (β, \hat{γ})$ . Finally the unknown parameter β will be estimated by maximizing the objective function $l_{c} (β, \hat{γ})$ .
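The steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the eight-subject toy data are hypothetical. For a saturated logistic model with binary U₁ and U₂, the maximum likelihood fit reproduces the observed cell proportions, so `fit_gamma` uses them directly (it assumes every proportion lies strictly between 0 and 1). The objective `l_c` follows (8) and (9) with the closed form of W_ij; its maximization over β is left to a general-purpose optimizer and is not shown.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_gamma(u1, u2):
    """Saturated logistic fit of binary U1 on binary U2 via cell proportions."""
    def logit(p):
        return math.log(p / (1.0 - p))
    rate = {}
    for level in (0, 1):
        values = [a for a, b in zip(u1, u2) if b == level]
        rate[level] = sum(values) / len(values)
    gamma0 = logit(rate[0])
    return gamma0, logit(rate[1]) - gamma0

def p_u1(u1_value, u2_value, gamma):
    p = expit(gamma[0] + gamma[1] * u2_value)
    return p if u1_value == 1 else 1.0 - p

def p_y(y_value, u1_value, u2_value, beta):
    # Outcome model (1)
    p = expit(beta[0] + beta[1] * u1_value + beta[2] * u2_value)
    return p if y_value == 1 else 1.0 - p

def q_ij(i, j, beta, gamma, y, u1, u2):
    # Equation (9), with W_ij in its closed form exp{gamma1 (u1i - u1j)(u2j - u2i)}
    w = math.exp(gamma[1] * (u1[i] - u1[j]) * (u2[j] - u2[i]))
    swapped = p_y(y[i], u1[i], u2[j], beta) * p_y(y[j], u1[j], u2[i], beta)
    matched = p_y(y[i], u1[i], u2[i], beta) * p_y(y[j], u1[j], u2[j], beta)
    return w * swapped / matched

def l_c(beta, gamma, y, u1, u2, r):
    # Objective (8): only pairs with both outcomes observed contribute
    n_total = len(r)
    total = 0.0
    for i in range(n_total):
        for j in range(i + 1, n_total):
            if r[i] == 1 and r[j] == 1:
                total += math.log1p(q_ij(i, j, beta, gamma, y, u1, u2))
    return -2.0 * total / (n_total * (n_total - 1))

# Hypothetical toy data; None marks a missing outcome (r = 0)
u1 = [0, 1, 1, 0, 1, 0, 1, 1]
u2 = [0, 0, 1, 1, 1, 0, 1, 0]
r = [1, 1, 1, 0, 1, 1, 0, 1]
y = [0, 1, 1, None, 0, 0, None, 1]

gamma_hat = fit_gamma(u1, u2)
value = l_c((-1.0, 0.3, 0.5), gamma_hat, y, u1, u2, r)
print(gamma_hat, value)
```

Two properties are visible in the code: Q_ij is symmetric in i and j, and it equals 1 whenever the two subjects share the same value of U₂, so such pairs carry no information about β.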

Denote the true value of β as β₀ and that of γ as γ₀. We have the following asymptotic result for $\hat{β}$ , and its proof is contained in the Appendix.

Theorem 1. Assume (1), (4), and $E_{β_{0}} \| \frac{\partial}{\partial β} \log \{1 + Q_{12} (β, γ)\} \|^{2} < \infty$ for any β in the parameter space. Then

\sqrt{N} (\hat{β} - β_{0}) \overset{d}{\to} N (0, A^{- 1} Σ A^{- 1}),

where

A = E [- R_{i} R_{j} \frac{\partial^{2}}{\partial β \partial β^{T}} \log \{1 + Q_{ij} (β_{0}, γ_{0})\}],

Σ = 4 E [ζ_{12} (β_{0}, γ_{0}) ζ_{13} (β_{0}, γ_{0})^{T}],

ζ_{ij} (β_{0}, γ_{0}) = B G^{- 1} M_{ij} (γ_{0}) - N_{ij} (β_{0}, γ_{0}),

G = E [\frac{\partial^{2}}{\partial γ \partial γ^{T}} \log \{p (u_{1} ∣ u_{2}; γ_{0})\}],

B = E [- R_{i} R_{j} \frac{\partial^{2}}{\partial β \partial γ^{T}} \log \{1 + Q_{ij} (β_{0}, γ_{0})\}],

M_{ij} (γ_{0}) = \frac{1}{2} \{\frac{\partial}{\partial γ} \log p (u_{1 i} ∣ u_{2 i}; γ_{0}) + \frac{\partial}{\partial γ} \log p (u_{1 j} ∣ u_{2 j}; γ_{0})\},

N_{ij} (β_{0}, γ_{0}) = - r_{i} r_{j} \frac{\partial}{\partial β} \log \{1 + Q_{ij} (β_{0}, γ_{0})\} .

3.2. Method based on pseudo likelihood

Our second method is motivated by the fact that, under assumption (4), the missingness indicator R and the covariate U₂ are conditionally independent given the outcome Y and covariate U₁. Specifically, we have p(U₂ | Y, U₁, R = 1) = p(U₂ | Y, U₁). This means that the model regressing U₂ on (Y, U₁) inferred from the completely observed subjects (those with R = 1) is the same as the model that would be inferred from all data (as if there were no missing data). Therefore we aim to maximize

\prod_{i = 1}^{N} {p (u_{2 i} ∣ y_{i}, u_{1 i})}^{r_{i}} = \prod_{i = 1}^{N} {\frac{p (y_{i} ∣ u_{1 i}, u_{2 i}; β) p (u_{1 i} ∣ u_{2 i}) p (u_{2 i})}{\int p (y_{i} ∣ u_{1 i}, u_{2}; β) p (u_{1 i} ∣ u_{2}) p (u_{2}) {du}_{2}}}^{r_{i}} .

(10)
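The identity p(U₂ | Y, U₁, R = 1) = p(U₂ | Y, U₁) that underlies (10) can be verified by exact enumeration. In the sketch below every distribution is hypothetical and chosen only for illustration; the single structural requirement is that the response probability in `p_respond` does not involve u₂, which is assumption (4).

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

P_U2 = 0.3  # hypothetical P(U2 = 1)

def p_u1_given_u2(u1, u2):
    p = expit(-0.2 + 0.7 * u2)
    return p if u1 == 1 else 1.0 - p

def p_y_given_u(y, u1, u2):
    p = expit(-1.0 + 0.3 * u1 + 0.5 * u2)
    return p if y == 1 else 1.0 - p

def p_respond(y, u1):
    # Arbitrary P(R = 1 | y, u1); free of u2 by assumption (4)
    return 0.15 + 0.5 * y * (1 - u1) + 0.3 * u1

def joint(y, u1, u2, r):
    # Full joint probability of (Y, U1, U2, R)
    base = ((P_U2 if u2 == 1 else 1.0 - P_U2)
            * p_u1_given_u2(u1, u2) * p_y_given_u(y, u1, u2))
    respond = p_respond(y, u1)
    return base * (respond if r == 1 else 1.0 - respond)

def p_u2_is_one(y, u1, respondents_only):
    # P(U2 = 1 | y, u1), either among respondents or in the full population
    r_values = (1,) if respondents_only else (0, 1)
    numer = sum(joint(y, u1, 1, r) for r in r_values)
    denom = sum(joint(y, u1, u2, r) for u2 in (0, 1) for r in r_values)
    return numer / denom

max_gap = max(abs(p_u2_is_one(y, u1, True) - p_u2_is_one(y, u1, False))
              for y in (0, 1) for u1 in (0, 1))
print(max_gap)
```

The gap is zero up to rounding for every (y, u₁) cell, which is why the respondents alone suffice for the regression of U₂ on (Y, U₁).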

Note that there are two nuisance components in (10). The p(u₂) in the integral is unknown, but since the whole integral can be treated as a mean with respect to U₂, we can estimate p(u₂) using its empirical distribution function and replace the integral with its empirical version. The other component, the model p(u₁ | u₂), is also unknown. As in Section 3.1, we fit a saturated logistic regression model p(u₁ | u₂; γ).

We introduce the notation

H_{i} (β, γ, F) = r_{i} [\log {p (y_{i} ∣ u_{1 i}, u_{2 i}; β)} - \log \int p (y_{i} ∣ u_{1 i}, u_{2}; β) p (u_{1 i} ∣ u_{2}; γ) d F (u_{2})],

and

H_{i} (β, γ, \hat{F}) = r_{i} [\log {p (y_{i} ∣ u_{1 i}, u_{2 i}; β)} - \log {\sum_{j = 1}^{N} p (y_{i} ∣ u_{1 i}, u_{2 j}; β) p (u_{1 i} ∣ u_{2 j}; γ)}],

where F denotes the cumulative distribution function of U₂. We also adopt the same method as in Section 3.1 to estimate γ. Therefore, our objective function becomes

l_{p} (β, \hat{γ}, \hat{F}) = \frac{1}{N} \sum_{i = 1}^{N} H_{i} (β, \hat{γ}, \hat{F}),

(11)

and our estimator $\hat{β}$ is

\hat{β} = \arg \max_{β} l_{p} (β, \hat{γ}, \hat{F}) .

(12)
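As a minimal sketch of the objective in (11), assuming the logistic forms of model (1) and of p(u₁ | u₂; γ), the code below evaluates H_i(β, γ̂, F̂) and l_p on hypothetical toy data. The function names, the data, and the value used for γ̂ are illustrative assumptions; the maximization in (12) would be carried out by passing the negative of `l_p` to a general-purpose optimizer, which is not shown.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_y(y_value, u1_value, u2_value, beta):
    # Outcome model (1)
    p = expit(beta[0] + beta[1] * u1_value + beta[2] * u2_value)
    return p if y_value == 1 else 1.0 - p

def p_u1(u1_value, u2_value, gamma):
    # Saturated logistic model for U1 given U2
    p = expit(gamma[0] + gamma[1] * u2_value)
    return p if u1_value == 1 else 1.0 - p

def h_i(i, beta, gamma, y, u1, u2):
    # Contribution of respondent i: the integral over u2 is replaced by
    # a sum over all N observed u2 values (the empirical version)
    log_outcome = math.log(p_y(y[i], u1[i], u2[i], beta))
    mixture = sum(p_y(y[i], u1[i], u2_j, beta) * p_u1(u1[i], u2_j, gamma)
                  for u2_j in u2)
    return log_outcome - math.log(mixture)

def l_p(beta, gamma, y, u1, u2, r):
    # Objective (11): average over all N subjects, respondents contribute
    n_total = len(r)
    return sum(h_i(i, beta, gamma, y, u1, u2)
               for i in range(n_total) if r[i] == 1) / n_total

# Hypothetical toy data; None marks a missing outcome (r = 0)
u1 = [0, 1, 1, 0, 1, 0, 1, 1]
u2 = [0, 0, 1, 1, 1, 0, 1, 0]
r = [1, 1, 1, 0, 1, 1, 0, 1]
y = [0, 1, 1, None, 0, 0, None, 1]

beta = (-1.0, 0.3, 0.5)
gamma_hat = (0.0, math.log(3.0))  # assumed already fitted from (u1, u2)
value = l_p(beta, gamma_hat, y, u1, u2, r)
print(value)
```

Note that the sum over j uses the u₂ values of all N subjects, including those with missing outcomes, since U₂ is completely observed.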

We point out that the idea of using pseudo likelihood in a generalized linear model was rigorously investigated in Zhao and Shao (2015). The method presented here is a direct application of Zhao and Shao (2015) to the case where U₁ and U₂ are both binary, but some nuances still deserve attention. For instance, because the covariates U₁ and U₂ are both binary, one can easily check that the identifiability conditions in Zhao and Shao (2015) are all satisfied, hence the parameter β is fully identifiable. Also, since both U₁ and U₂ are binary, a saturated logistic regression model can be fitted for p(u₁ | u₂) and there is no model misspecification issue.

Similar to Section 3.1, we also assume

\sqrt{N} (\hat{γ} - γ_{0}) = - G^{- 1} \sqrt{N} \frac{1}{N} \sum_{i = 1}^{N} \frac{\partial}{\partial γ} \log p (u_{1 i} ∣ u_{2 i}; γ_{0}) + o_{p} (1) .

We present the asymptotic result for $\hat{β}$ below and its proof is contained in the Appendix.

Theorem 2. Assume (1) and (4). Assume that l_p(β, γ, F) is twice continuously differentiable with respect to β, and that there exists an open set $Ω \subset B$ containing β₀ such that

\sup_{β \in Ω} ‖ \frac{\partial^{2}}{\partial β \partial β^{T}} l_{p} (β, γ, F) ‖ < M (R, Y, X),

where E{M(R, Y, X)} < ∞. Then, as N → ∞,

\sqrt{N} (\hat{β} - β_{0}) \overset{d}{\to} N (0, C^{- 1} Λ C^{- 1}),

(13)

where

C = E [\frac{\partial^{2}}{\partial β \partial β^{T}} H_{i} (β_{0}, γ_{0}, F_{0})],

Λ = var [D G^{- 1} \frac{\partial}{\partial γ} \log \{p (u_{1 i} ∣ u_{2 i}; γ_{0})\} - \frac{\partial}{\partial β} H_{i} (β_{0}, γ_{0}, F_{0}) - 2 v_{1} (u_{2 i}; β_{0}, γ_{0}, F_{0})],

D = E [\frac{\partial^{2}}{\partial β \partial γ^{T}} H_{i} (β_{0}, γ_{0}, F_{0})],

and v₁(u_2i; β₀, γ₀, F₀) is defined in the Appendix.

4. Results under a more general outcome model

The outcome model (1), considered in Ibrahim et al. (2001) and in Section 3, cannot directly capture the association between Y and its surrogate Z, nor the adjusted effect of U on Y after controlling for the variable Z. In this Section, we study a more general outcome model

logit {P (Y = 1 ∣ U_{1}, U_{2}, Z)} = β_{0} + β_{1} U_{1} + β_{2} U_{2} + β_{3} Z,

(14)

where we simply denote β = (β₀, β₁, β₂, β₃)^T throughout this Section.

In the same spirit as Section 3, where only a minimal assumption is imposed on the mechanism model, throughout this Section we assume

P (R = 1 ∣ Y, U_{1}, U_{2}, Z) = P (R = 1 ∣ Y, U_{1}, U_{2}),

(15)

which is reasonable in the sense that, after adjusting for the effect of Y, U₁ and U₂, the missingness of Y, i.e., the variable R, does not depend on Z, the surrogate variable of Y.

Recall that in Section 3, since both U₁ and U₂ are binary, a saturated logistic regression model for p(u₁ | u₂) could be fitted and there is no model misspecification issue. This will not be the case in this Section. Since p(u₁, u₂ | z) = p(u₁ | u₂, z)p(u₂ | z), a total of 5 parameters are estimated if one fits parametric logistic regression models without interaction terms (3 for p(u₁ | u₂, z) and 2 for p(u₂ | z)). But a saturated model for p(u₁, u₂ | z) requires 6 parameters since the variables U₁, U₂ and Z are all binary (3 free cell probabilities for each value of Z). In other words, simple parametric models for p(u₁, u₂ | z) could be misspecified, and this issue would be more pronounced and more common if the variables U and Z were continuous or multi-dimensional.

In the following, we will briefly present the two unconventional likelihood based estimators under model (14) and assumption (15). We will consider both parametric and nonparametric estimation approaches for p(u₁, u₂, z). The nonparametric estimation approach enjoys the attractive property of no model misspecification issue. To distinguish it from the simple parametric modeling, we denote q(u₁, u₂, z) as the nonparametric model for p(u₁, u₂, z).

4.1. Method based on conditional likelihood

Similar to the idea of conditional likelihood introduced in Section 3.1, we form the objective function as

l_{c} (β) = - \frac{2}{N (N - 1)} \sum_{1 \leq i < j \leq N} r_{i} r_{j} \log \{1 + Q_{ij} (β)\},

(16)

where

Q_{ij} (β) = W_{ij} \cdot \frac{p (y_{i} ∣ u_{1 i}, u_{2 i}, z_{j}; β) p (y_{j} ∣ u_{1 j}, u_{2 j}, z_{i}; β)}{p (y_{i} ∣ u_{1 i}, u_{2 i}, z_{i}; β) p (y_{j} ∣ u_{1 j}, u_{2 j}, z_{j}; β)},

and

W_{ij} = \frac{p (u_{1 i}, u_{2 i} ∣ z_{j}) p (u_{1 j}, u_{2 j} ∣ z_{i})}{p (u_{1 i}, u_{2 i} ∣ z_{i}) p (u_{1 j}, u_{2 j} ∣ z_{j})} .

We now consider both parametric and nonparametric estimation approaches for W_ij. Parametrically, we assume

logit {P (U_{1} = 1 ∣ U_{2}, Z)} = γ_{0} + γ_{1} U_{2} + γ_{2} Z, and logit {P (U_{2} = 1 ∣ Z)} = γ_{3} + γ_{4} Z .

After some algebra, we have

W_{ij} = \exp \{γ_{2} (u_{1 i} - u_{1 j}) (z_{j} - z_{i}) + γ_{4} (u_{2 i} - u_{2 j}) (z_{j} - z_{i})\} \cdot \frac{\{1 + \exp (γ_{0} + γ_{1} u_{2 i} + γ_{2} z_{i})\} \{1 + \exp (γ_{0} + γ_{1} u_{2 j} + γ_{2} z_{j})\}}{\{1 + \exp (γ_{0} + γ_{1} u_{2 i} + γ_{2} z_{j})\} \{1 + \exp (γ_{0} + γ_{1} u_{2 j} + γ_{2} z_{i})\}} .
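The algebra behind this closed form can be checked numerically. The sketch below compares the ratio of densities that defines W_ij with the closed-form expression over every pair of (u₁, u₂, z) cells. The γ values are hypothetical and the function names are illustrative only.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_u1u2_given_z(u1, u2, z, gamma):
    # p(u1, u2 | z) = p(u1 | u2, z) p(u2 | z) under the two logistic models
    g0, g1, g2, g3, g4 = gamma
    p1 = expit(g0 + g1 * u2 + g2 * z)
    p2 = expit(g3 + g4 * z)
    return (p1 if u1 == 1 else 1.0 - p1) * (p2 if u2 == 1 else 1.0 - p2)

def w_direct(cell_i, cell_j, gamma):
    # Definition of W_ij as a ratio of four conditional densities
    u1i, u2i, zi = cell_i
    u1j, u2j, zj = cell_j
    numer = p_u1u2_given_z(u1i, u2i, zj, gamma) * p_u1u2_given_z(u1j, u2j, zi, gamma)
    denom = p_u1u2_given_z(u1i, u2i, zi, gamma) * p_u1u2_given_z(u1j, u2j, zj, gamma)
    return numer / denom

def w_closed_form(cell_i, cell_j, gamma):
    # Closed form obtained after the algebra
    g0, g1, g2, g3, g4 = gamma
    u1i, u2i, zi = cell_i
    u1j, u2j, zj = cell_j

    def one_plus(u2, z):
        return 1.0 + math.exp(g0 + g1 * u2 + g2 * z)

    lead = math.exp(g2 * (u1i - u1j) * (zj - zi) + g4 * (u2i - u2j) * (zj - zi))
    ratio = (one_plus(u2i, zi) * one_plus(u2j, zj)) / (one_plus(u2i, zj) * one_plus(u2j, zi))
    return lead * ratio

gamma = (-0.4, 0.6, 0.9, 0.1, -0.7)  # hypothetical values
cells = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
max_gap = max(abs(w_direct(ci, cj, gamma) - w_closed_form(ci, cj, gamma))
              for ci in cells for cj in cells)
print(max_gap)
```

The normalizing terms of p(u₂ | z) cancel between numerator and denominator, which is why γ₃ does not appear in the closed form.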

Hence, given the similar regularity conditions discussed in Section 3.1, the asymptotic property of $\hat{β}$ estimated via conditional likelihood is similar to Theorem 1, i.e.,

\sqrt{N} (\hat{β} - β_{0}) \overset{d}{\to} N (0, A^{- 1} Σ A^{- 1}),

where more details are similar to Theorem 1 hence are omitted.

Alternatively, we consider estimating W_ij nonparametrically, i.e.,

W_{ij} = \frac{p (u_{1 i}, u_{2 i} ∣ z_{j}) p (u_{1 j}, u_{2 j} ∣ z_{i})}{p (u_{1 i}, u_{2 i} ∣ z_{i}) p (u_{1 j}, u_{2 j} ∣ z_{j})} = \frac{q (u_{1 i}, u_{2 i}, z_{j}) q (u_{1 j}, u_{2 j}, z_{i})}{q (u_{1 i}, u_{2 i}, z_{i}) q (u_{1 j}, u_{2 j}, z_{j})} .

Since all variables U₁, U₂ and Z are binary, we estimate q(u₁, u₂, z) using its empirical distribution, i.e., we will have

{\hat{W}}_{ij} = \frac{(\sum_{k = 1}^{N} I \{u_{1 k} = u_{1 i}, u_{2 k} = u_{2 i}, z_{k} = z_{j}\}) (\sum_{k = 1}^{N} I \{u_{1 k} = u_{1 j}, u_{2 k} = u_{2 j}, z_{k} = z_{i}\})}{(\sum_{k = 1}^{N} I \{u_{1 k} = u_{1 i}, u_{2 k} = u_{2 i}, z_{k} = z_{i}\}) (\sum_{k = 1}^{N} I \{u_{1 k} = u_{1 j}, u_{2 k} = u_{2 j}, z_{k} = z_{j}\})} .

Once ${\hat{W}}_{ij}$ is obtained, all subsequent steps are the same as in the parametric approach, and an asymptotic result similar to Theorem 1 can also be derived.
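The nonparametric weight reduces to a ratio of cell counts, since the 1/N factors in the empirical frequencies cancel. A minimal sketch on hypothetical toy data follows; the helper name `w_hat` is illustrative.

```python
from collections import Counter

def w_hat(i, j, cell_counts, u1, u2, z):
    # Ratio of empirical cell counts; a cell that never occurs has count 0,
    # so the swapped (numerator) cells may give a weight of exactly 0
    numer = (cell_counts[(u1[i], u2[i], z[j])]
             * cell_counts[(u1[j], u2[j], z[i])])
    denom = (cell_counts[(u1[i], u2[i], z[i])]
             * cell_counts[(u1[j], u2[j], z[j])])
    return numer / denom

# Hypothetical toy data: all three variables are completely observed
u1 = [0, 0, 0, 1, 1, 1, 1, 0]
u2 = [0, 0, 0, 1, 1, 1, 1, 0]
z = [0, 1, 1, 0, 0, 0, 1, 0]

cell_counts = Counter(zip(u1, u2, z))
print(w_hat(0, 6, cell_counts, u1, u2, z))
print(w_hat(0, 3, cell_counts, u1, u2, z))
```

The denominator is never zero because it only involves cells that subjects i and j themselves occupy. When z_i = z_j the weight equals 1, mirroring the parametric closed form.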

4.2. Method based on pseudo likelihood

If we apply the pseudo likelihood method of Section 3.2 to the general case where the model p(Y | U₁, U₂, Z) is assumed as in (14), the estimator of β can be derived similarly. We introduce the analogous notation

H_{i} (β, F) = r_{i} [\log {p (y_{i} ∣ u_{1 i}, u_{2 i}, z_{i}; β)} - \log {\int p (y_{i} ∣ u_{1 i}, u_{2 i}, z; β) p (u_{1 i}, u_{2 i} ∣ z) d F (z)}],

and

H_{i} (β, \hat{F}) = r_{i} [\log \{p (y_{i} ∣ u_{1 i}, u_{2 i}, z_{i}; β)\} - \log \{\sum_{j = 1}^{N} p (y_{i} ∣ u_{1 i}, u_{2 i}, z_{j}; β) p (u_{1 i}, u_{2 i} ∣ z_{j})\}] .

In general, the estimator $\hat{β}$ is the maximizer of the objective function

l_{p} (β, \hat{F}) = \frac{1}{N} \sum_{i = 1}^{N} H_{i} (β, \hat{F}) .

We also briefly present both parametric and nonparametric approaches for handling p(u₁, u₂ | z). Parametrically we have

\hat{β} = \arg \max_{β} l_{p} (β, \hat{γ}, \hat{F}),

(17)

where

l_{p} (β, \hat{γ}, \hat{F}) = \frac{1}{N} \sum_{i = 1}^{N} r_{i} [\log {p (y_{i} ∣ u_{1 i}, u_{2 i}, z_{i}; β)} - \log {\sum_{j = 1}^{N} p (y_{i} ∣ u_{1 i}, u_{2 i}, z_{j}; β) p (u_{1 i}, u_{2 i} ∣ z_{j}; \hat{γ})}] .

Given the similar regularity conditions discussed in Section 3.2, the asymptotic property of $\hat{β}$ obtained from (17) is similar to Theorem 2, i.e.,

\sqrt{N} (\hat{β} - β_{0}) \overset{d}{\to} N (0, C^{- 1} Λ C^{- 1}),

where further notation and more details are omitted.

Alternatively, we consider a nonparametric approach for H_i(β, F), where the essential component

\log \int p (y_{i} ∣ u_{1 i}, u_{2 i}, z; β) p (u_{1 i}, u_{2 i} ∣ z) d F (z)

can be written as

\log \int p (y_{i} ∣ u_{1 i}, u_{2 i}, z; β) p (u_{1 i}, u_{2 i} ∣ z) p (z) dz

= \log \{p (y_{i} ∣ u_{1 i}, u_{2 i}, z = 1; β) p (u_{1 i}, u_{2 i} ∣ z = 1) p (z = 1) + p (y_{i} ∣ u_{1 i}, u_{2 i}, z = 0; β) p (u_{1 i}, u_{2 i} ∣ z = 0) p (z = 0)\}

= \log \{p (y_{i} ∣ u_{1 i}, u_{2 i}, z = 1; β) p (u_{1 i}, u_{2 i}, z = 1) + p (y_{i} ∣ u_{1 i}, u_{2 i}, z = 0; β) p (u_{1 i}, u_{2 i}, z = 0)\}

= \sum_{a, b \in \{0, 1\}} I \{u_{1 i} = a, u_{2 i} = b\} \log \{p (y_{i} ∣ u_{1 i} = a, u_{2 i} = b, z = 1; β) p (u_{1 i} = a, u_{2 i} = b, z = 1) + p (y_{i} ∣ u_{1 i} = a, u_{2 i} = b, z = 0; β) p (u_{1 i} = a, u_{2 i} = b, z = 0)\},

and estimated empirically by replacing each joint probability p(u₁ = a, u₂ = b, z) with the corresponding empirical cell frequency.

Afterwards, the asymptotic property of $\hat{β}$ similar to Theorem 2 can also be derived.
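The nonparametric version of the log-integral term can be sketched as below, assuming the logistic outcome model (14) and empirical cell frequencies for the joint distribution of (U₁, U₂, Z). The function names, the toy data and the β values are hypothetical. The second function evaluates the same quantity as an average over the sample, using the empirical conditional frequency of (u₁, u₂) given z_j, to show that the two forms coincide.

```python
import math
from collections import Counter

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_y(y_value, u1_value, u2_value, z_value, beta):
    # Outcome model (14)
    p = expit(beta[0] + beta[1] * u1_value + beta[2] * u2_value + beta[3] * z_value)
    return p if y_value == 1 else 1.0 - p

def log_mixture_cells(y_i, u1_i, u2_i, beta, u1, u2, z):
    # Sum over z in {0, 1}, weighting by empirical joint cell frequencies
    n_total = len(z)
    cells = Counter(zip(u1, u2, z))
    total = sum(p_y(y_i, u1_i, u2_i, zv, beta) * cells[(u1_i, u2_i, zv)] / n_total
                for zv in (0, 1))
    return math.log(total)

def log_mixture_sample(y_i, u1_i, u2_i, beta, u1, u2, z):
    # Same quantity as a sample average over j, with the empirical
    # conditional frequency of (u1_i, u2_i) given z_j as the weight
    n_total = len(z)
    cells = Counter(zip(u1, u2, z))
    z_counts = Counter(z)
    total = sum(p_y(y_i, u1_i, u2_i, z_j, beta)
                * cells[(u1_i, u2_i, z_j)] / z_counts[z_j]
                for z_j in z) / n_total
    return math.log(total)

# Hypothetical toy data and parameter values
u1 = [0, 0, 0, 1, 1, 1, 1, 0]
u2 = [0, 0, 0, 1, 1, 1, 1, 0]
z = [0, 1, 1, 0, 0, 0, 1, 0]
beta = (-1.5, -0.5, 0.2, 1.4)

by_cells = log_mixture_cells(1, 0, 0, beta, u1, u2, z)
by_sample = log_mixture_sample(1, 0, 0, beta, u1, u2, z)
print(by_cells, by_sample)
```

The 1/N factor only shifts the objective by a constant that does not involve β, so it does not affect the maximizer.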

5. Analyses on children’s mental health study

In this Section, we present the analysis results from the children’s mental health study, introduced in Section 2.

We first consider the model studied in Section 3. For this model, we compare our proposed conditional likelihood method (Method 1) and pseudo likelihood method (Method 2) to the complete-case analysis method (CC), the likelihood method assuming a parametric nonignorable mechanism (NI), and the auxiliary variable method proposed in Ibrahim et al. (2001) (AUX). In Table 1, we summarize the comparison results of these five methods for parameter estimate, standard error, z-statistic and p-value.

Table 1:

Comparison of five methods for the estimates and standard errors of the three parameters in the model considered in Section 3.

Effect	Parameter	Method	Estimate	Standard Error	z-statistic	p-value
Intercept	β₀	CC	−1.7372	0.1070	−16.2358	0.0000
		NI	−0.6410	0.5170	−1.2398	0.2150
		AUX	−1.7030	0.1050	−16.2190	0.0000
		Method 1	−1.6626	0.1238	−13.4324	0.0000
		Method 2	−0.7243	0.3856	−1.8784	0.0603

Health	β₁	CC	0.2465	0.1380	1.7863	0.0740
		NI	0.1150	0.1420	0.8099	0.4180
		AUX	0.1810	0.1370	1.3212	0.1864
		Method 1	0.4214	0.6554	0.6429	0.5203
		Method 2	0.2848	0.5123	0.5560	0.5782

Father	β₂	CC	0.5419	0.1607	3.3724	0.0007
		NI	0.5450	0.1610	3.3851	0.0007
		AUX	0.5120	0.1580	3.2405	0.0012
		Method 1	0.5411	0.1243	4.3535	0.0000
		Method 2	0.5424	0.1231	4.4061	0.0000


From Table 1, it can be seen that the effect father has approximately the same parameter estimate no matter which method is used. For the effect health, the methods NI and AUX give smaller estimates than the method CC, whereas the proposed methods give larger ones, suggesting that the method CC is biased in the opposite direction. We believe that the direction indicated by our proposed methods is more plausible, since a poorer health condition would be expected to result in a higher probability of clinical psychopathology in the teacher’s report.

We then consider the model studied in Section 4. For this model, since p(u₁ | u₂, z)p(u₂ | z) is potentially misspecified, we consider different modeling strategies. For either Method 1 (conditional likelihood based) or Method 2 (pseudo likelihood based), we take each of the models p(u₁ | u₂, z) and p(u₂ | z) to be a logistic regression model or a probit regression model, which gives eight parametric modeling approaches. We also consider nonparametric modeling approaches in both Method 1 and Method 2. We compare our proposed methodologies to the complete-case analysis, the CC method. The comparison of all eleven methods is summarized in Table 2, in which “logit-probit” indicates that the link function for p(u₁ | u₂, z) is logit and that for p(u₂ | z) is probit.

Table 2:

Comparison of eleven methods for the estimates and standard errors of the four parameters in the model considered in Section 4.

Effect	Parameter	Modeling approach	Method	Estimate	Standard Error	z-statistic	p-value
Intercept	β₀	CC		−1.9307	0.1132	−17.0618	0.0000
		Nonparametric	Method 1	−1.8727	0.2625	−7.1338	0.0000
		Nonparametric	Method 2	−1.3750	0.2867	−4.7966	0.0000
		logit-logit	Method 1	−1.6218	0.3392	−4.7813	0.0000
		logit-logit	Method 2	−1.3750	0.3055	−4.5001	0.0000
		logit-probit	Method 1	−1.4353	0.5499	−2.6100	0.0091
		logit-probit	Method 2	−1.1340	0.2969	−3.8195	0.0001
		probit-logit	Method 1	−0.6827	0.4770	−1.4314	0.1523
		probit-logit	Method 2	−0.7217	0.3002	−2.4040	0.0162
		probit-probit	Method 1	−0.5156	0.6188	−0.8331	0.4048
		probit-probit	Method 2	−0.5073	0.3077	−1.6489	0.0992

Health	β₁	CC		−0.0516	0.1480	−0.3487	0.7273
		Nonparametric	Method 1	−0.9972	0.4704	−2.1198	0.0340
		Nonparametric	Method 2	−0.9814	0.3622	−2.7098	0.0067
		logit-logit	Method 1	−1.2819	0.6980	−1.8364	0.0663
		logit-logit	Method 2	−0.9814	0.4008	−2.4484	0.0144
		logit-probit	Method 1	−1.2680	0.5530	−2.2929	0.0219
		logit-probit	Method 2	−1.2370	0.4190	−2.9526	0.0032
		probit-logit	Method 1	−2.2843	0.5662	−4.0343	0.0001
		probit-logit	Method 2	−2.6201	0.5934	−4.4151	0.0000
		probit-probit	Method 1	−2.4730	0.5266	−4.6965	0.0000
		probit-probit	Method 2	−3.4659	0.7197	−4.8157	0.0000

Father	β₂	CC		0.3652	0.1690	2.1608	0.0307
		Nonparametric	Method 1	0.0526	0.5258	0.1000	0.9203
		Nonparametric	Method 2	−0.0699	0.4460	−0.1567	0.8755
		logit-logit	Method 1	−0.4041	0.9355	−0.4319	0.6658
		logit-logit	Method 2	−0.0700	0.5384	−0.1300	0.8966
		logit-probit	Method 1	−2.2393	0.9099	−2.4610	0.0139
		logit-probit	Method 2	−1.5588	0.7286	−2.1393	0.0324
		probit-logit	Method 1	−0.1721	0.7500	−0.2294	0.8185
		probit-logit	Method 2	−0.5305	0.6475	−0.8193	0.4126
		probit-probit	Method 1	−1.3461	0.7213	−1.8662	0.0620
		probit-probit	Method 2	−1.5535	0.7480	−2.0770	0.0378

Parent’s Report	β₃	CC		1.4621	0.1583	9.2380	0.0000
		Nonparametric	Method 1	1.4085	0.1356	10.3902	0.0000
		Nonparametric	Method 2	1.4687	0.1366	10.7523	0.0000
		logit-logit	Method 1	1.4043	0.1370	10.2476	0.0000
		logit-logit	Method 2	1.4687	0.1376	10.6736	0.0000
		logit-probit	Method 1	1.4149	0.1340	10.5569	0.0000
		logit-probit	Method 2	1.4562	0.1311	11.1084	0.0000
		probit-logit	Method 1	1.4122	0.1371	10.3007	0.0000
		probit-logit	Method 2	1.4733	0.1320	11.1602	0.0000
		probit-probit	Method 1	1.4170	0.1322	10.7186	0.0000
		probit-probit	Method 2	1.4475	0.1265	11.4411	0.0000


The comparison results from Table 2 are also meaningful. First, no matter which method is used, the estimates of the effect parent’s report are roughly the same, and the corresponding p-value is always reported as 0.0000. This indicates that the adjusted association between the teacher’s report and the parent’s report indeed exists and is statistically significant. Second, although the estimates of the effect father differ somewhat among the proposed methods, the nonparametric approach shows that the effect is almost zero and not significant. Third, in contrast to the CC method, the proposed methods present a statistically significant effect for the variable health at the 5% level, with the single exception of the logit-logit version of Method 1 (p-value 0.0663). This observation strongly suggests that the CC method is heavily biased.
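The z-statistics and p-values in Table 2 are consistent with Wald tests against a standard normal reference, although this section does not state the test explicitly, so that reading is an assumption. Under it, a minimal Python sketch (the function name `wald_test` is hypothetical) recomputes both columns from an estimate and its standard error, using two rows for the effect health copied from Table 2. Small differences in the last digit are expected because the table entries are rounded.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z-statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # Two-sided tail area: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Health effect (beta_1); estimates and standard errors copied from Table 2.
z_cc, p_cc = wald_test(-0.0516, 0.1480)  # CC method
z_ll, p_ll = wald_test(-1.2819, 0.6980)  # logit-logit, Method 1

print(f"CC:                   z = {z_cc:.4f}, p = {p_cc:.4f}")
print(f"logit-logit Method 1: z = {z_ll:.4f}, p = {p_ll:.4f}")
```

The same function can be applied to any other row of Tables 1 and 2 to check which effects are significant at a chosen level.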

We also want to report that, across our numerical studies, the standard error estimates based on the asymptotic results are sensitive to the initial values of the unknown parameters. Therefore, we pursue a nonparametric Bootstrap approach instead. To investigate how many Bootstrap samples are sufficient in our setting, we plot the standard error estimates against the number of Bootstrap samples, in Figure 1 for the model considered in Section 3 and in Figure 2 for the nonparametric approach for the model considered in Section 4. It can be seen that, in general, 200 Bootstrap samples are sufficient. In each situation of our numerical investigations, we draw 300 Bootstrap samples.
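The idea of tracking the Bootstrap standard error as the number of resamples grows can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the estimator `log_odds_ratio` (a log odds ratio with 0.5 added to each cell), and the helper `bootstrap_se_path` are hypothetical stand-ins and are not the estimators of Method 1 or Method 2.

```python
import math
import random
import statistics


def log_odds_ratio(sample):
    """Toy estimator (an assumption): log odds ratio of y on x from (x, y) pairs."""
    counts = {(a, b): 0.5 for a in (0, 1) for b in (0, 1)}  # 0.5 avoids empty cells
    for x, y in sample:
        counts[(x, y)] += 1
    return math.log(
        counts[(1, 1)] * counts[(0, 0)] / (counts[(1, 0)] * counts[(0, 1)])
    )


def bootstrap_se_path(data, estimator, n_boot, rng):
    """Return the bootstrap SE computed from the first 2, 3, ..., n_boot resamples."""
    n = len(data)
    estimates, path = [], []
    for _ in range(n_boot):
        # Nonparametric bootstrap: resample subjects with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
        if len(estimates) >= 2:
            path.append(statistics.stdev(estimates))
    return path


rng = random.Random(2019)
data = []
for _ in range(500):
    x = int(rng.random() < 0.6)
    y = int(rng.random() < (0.7 if x else 0.4))
    data.append((x, y))

path = bootstrap_se_path(data, log_odds_ratio, n_boot=300, rng=rng)
for b in (50, 100, 200, 300):
    # path[b - 2] is the SE based on the first b resamples.
    print(b, round(path[b - 2], 4))
```

Plotting `path` against the number of resamples gives a curve of the kind described for Figures 1 and 2; a curve that is flat beyond a given point suggests that more resamples add little.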

Figure 1: Change in the standard errors of the three parameters in the model considered in Section 3 as the number of Bootstrap samples increases, for the proposed Method 1 and Method 2.

Figure 2: Change in the standard errors of the four parameters in the model considered in Section 4 as the number of Bootstrap samples increases, for the proposed nonparametric Method 1 and Method 2.

6. Simulation studies

We examine the finite sample performance of the proposed methods using simulation studies where the data generation process is known.

For the model considered in Section 3, we first generate a binary variable U₂ from a Bernoulli distribution with P(U₂ = 1) = 0.6, and then a binary variable U₁ from logit{P(U₁ = 1 | U₂)} = γ₀ + γ₁U₂ with γ₀ = 1 and γ₁ = 0.5. The outcome Y is generated from logit{P(Y = 1 | U₁, U₂)} = β₀ + β₁U₁ + β₂U₂ with β = (−0.5, 0.1, 3)^T. The missingness indicator R is generated from logit{P(R = 1 | Y, U₁; θ)} = θ₀ + θ₁Y + θ₂U₁ with θ = (−2.20, 3.58, 0.81)^T, so that approximately 70% of the subjects are completely observed. We consider the total sample size N = 2,000. We compare the two proposed methods with the method using all data, termed the benchmark, and with the CC method. Based on 500 simulation replicates, the estimation bias, the Monte Carlo approximation of the standard deviation, the estimated standard error, and the coverage probability are summarized in Table 3.
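The data generation process for the Section 3 model can be sketched directly from the description above. The parameter values are those stated in the text; the function names `expit` and `simulate_section3` and the random seed are illustrative choices and are not part of the original study code.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_section3(n, rng):
    """Draw (u1, u2, y, r) for n subjects; y is treated as missing when r == 0."""
    gamma = (1.0, 0.5)            # logit P(U1 = 1 | U2)
    beta = (-0.5, 0.1, 3.0)       # logit P(Y = 1 | U1, U2)
    theta = (-2.20, 3.58, 0.81)   # logit P(R = 1 | Y, U1): depends on Y, hence nonignorable
    rows = []
    for _ in range(n):
        u2 = int(rng.random() < 0.6)
        u1 = int(rng.random() < expit(gamma[0] + gamma[1] * u2))
        y = int(rng.random() < expit(beta[0] + beta[1] * u1 + beta[2] * u2))
        r = int(rng.random() < expit(theta[0] + theta[1] * y + theta[2] * u1))
        rows.append((u1, u2, y, r))
    return rows


rng = random.Random(1)
rows = simulate_section3(2000, rng)
observed_rate = sum(row[3] for row in rows) / len(rows)
print(f"proportion of subjects with Y observed: {observed_rate:.3f}")
```

A complete-case analysis would keep only the rows with `r == 1`, whereas the benchmark uses `y` for every row; the printed proportion should be close to the roughly 70% stated above.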

Table 3:

Comparison of four methods for the estimation bias (Bias), Monte Carlo approximation of standard deviation (SD), estimated standard error (SE), and coverage probability (CP) of the three parameters in the model considered in Section 3.

Parameter	Method	Bias	SD	SE	CP
β₀	Benchmark	0.0055	0.1218	0.1273	0.9600
	CC	2.1164	0.3046	0.2898	0.0000
	Method 1	0.0029	0.0113	0.0124	0.9480
	Method 2	0.0049	0.2417	0.2293	0.9560

β₁	Benchmark	−0.0031	0.1358	0.1423	0.9620
	CC	−0.6033	0.3240	0.3139	0.5200
	Method 1	0.0224	0.3086	0.2942	0.9460
	Method 2	0.0006	0.2837	0.2729	0.9460

β₂	Benchmark	0.0001	0.1352	0.1348	0.9300
	CC	0.0370	0.3146	0.2963	0.9320
	Method 1	0.0329	0.3077	0.2996	0.9420
	Method 2	0.0397	0.3157	0.3034	0.9480


Similarly, for the model considered in Section 4, we first generate a binary variable Z from a Bernoulli distribution with P(Z = 1) = 0.6, then a binary variable U₂ from logit{P(U₂ = 1 | Z)} = γ₃ + γ₄Z with γ₃ = 1 and γ₄ = 0.5, and a binary variable U₁ from logit{P(U₁ = 1 | U₂, Z)} = γ₀ + γ₁U₂ + γ₂Z with γ₀ = −1, γ₁ = 0.5 and γ₂ = 0.5. The outcome Y is generated from logit{P(Y = 1 | U₁, U₂, Z)} = β₀ + β₁U₁ + β₂U₂ + β₃Z with β = (−0.5, 0.1, 0.1, 3)^T. The missingness indicator R is generated from logit{P(R = 1 | Y, U₁, U₂; θ)} = θ₀ + θ₁Y + θ₂U₁ + θ₃U₂ with θ = (−2.20, 3.58, 0.46, 0.46)^T, so that again approximately 70% of the subjects are completely observed. We consider the total sample size N = 2,000. We compare the two proposed methods, under both parametric and nonparametric modeling approaches, with the benchmark method and the CC method. Based on 500 simulation replicates, the corresponding results are summarized in Table 4.
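The four summaries reported in Tables 3 and 4 can be computed from the per-replicate estimates and standard errors as in the sketch below. Two conventions are assumptions here, since the text does not spell them out: SE is taken as the average of the estimated standard errors across replicates, and CP is the coverage of nominal 95% Wald intervals (estimate ± 1.96 × SE). The function name `summarize_replicates` and the toy inputs are hypothetical.

```python
import random
import statistics


def summarize_replicates(estimates, std_errors, truth, z_crit=1.96):
    """Summarize Monte Carlo replicates for one parameter."""
    bias = statistics.fmean(estimates) - truth  # Bias: mean estimate minus true value
    sd = statistics.stdev(estimates)            # SD: Monte Carlo standard deviation
    se = statistics.fmean(std_errors)           # SE: average estimated standard error
    covered = [
        abs(est - truth) <= z_crit * s for est, s in zip(estimates, std_errors)
    ]
    cp = sum(covered) / len(covered)            # CP: share of intervals covering truth
    return {"Bias": bias, "SD": sd, "SE": se, "CP": cp}


# Toy replicates (an assumption): 500 unbiased normal estimates with a correct SE.
rng = random.Random(0)
truth = 0.1
estimates = [rng.gauss(truth, 0.3) for _ in range(500)]
std_errors = [0.3] * 500

summary = summarize_replicates(estimates, std_errors, truth)
print({name: round(value, 4) for name, value in summary.items()})
```

For a well-behaved estimator, Bias is close to zero, SD and SE agree, and CP is close to 0.95; a biased estimator such as the CC method shows up as a large Bias and a low CP.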

Table 4:

Comparison of six methods for the estimation bias (Bias), Monte Carlo approximation of standard deviation (SD), estimated standard error (SE), and coverage probability (CP) of the four parameters in the model considered in Section 4.

Parameter	Method	Bias	SD	SE	CP
β₀	Benchmark	−0.0006	0.1428	0.1320	0.9420
	CC	2.1102	0.3041	0.2872	0.0000
	Method 1-Parametric	0.0226	0.2069	0.1978	0.9696
	Method 1-Nonparametric	0.0097	0.0729	0.0773	0.9760
	Method 2-Parametric	−0.0028	0.2666	0.2615	0.9520
	Method 2-Nonparametric	0.0012	0.2077	0.2034	0.9600

β₁	Benchmark	0.0083	0.1245	0.1274	0.9520
	CC	−0.3222	0.2319	0.2408	0.7300
	Method 1-Parametric	0.0548	0.4122	0.4172	0.9798
	Method 1-Nonparametric	0.0201	0.2573	0.2509	0.9760
	Method 2-Parametric	0.0101	0.2514	0.2614	0.9640
	Method 2-Nonparametric	0.0044	0.1790	0.1880	0.9600

β₂	Benchmark	0.0020	0.1555	0.1434	0.9280
	CC	−0.3428	0.3219	0.3050	0.8240
	Method 1-Parametric	0.0659	0.5004	0.4979	0.9696
	Method 1-Nonparametric	0.0277	0.2987	0.3204	0.9539
	Method 2-Parametric	0.0040	0.3084	0.2993	0.9480
	Method 2-Nonparametric	0.0004	0.2239	0.2197	0.9440

β₃	Benchmark	0.0000	0.1404	0.1372	0.9460
	CC	0.0209	0.3209	0.3107	0.9340
	Method 1-Parametric	0.0191	0.3149	0.3130	0.9393
	Method 1-Nonparametric	0.0176	0.3159	0.3147	0.9479
	Method 2-Parametric	0.0213	0.3167	0.3168	0.9480
	Method 2-Nonparametric	0.0206	0.3191	0.3197	0.9500


The conclusions from Tables 3 and 4 are quite clear. First, the benchmark method and all the proposed methods show negligible bias and coverage probabilities of around 95% in every scenario. This matches our theoretical investigation of the proposed methodology well. Second, in most of the situations, the CC method is severely biased and hence has very poor coverage probability. This means that simply using the CC method is generally incorrect. Third, in Table 4, since the parametric modeling is correct in our proposed methods, its performance is very similar to that of the nonparametric approach, but it is less efficient than the nonparametric approach in general. Finally, from either Table 3 or Table 4, it is hard to draw a definitive conclusion about the relative efficiency of the two proposed methods based on unconventional likelihoods.

7. Discussion

The limitation of modeling the nonignorable missingness mechanism parametrically motivates our methods based on unconventional likelihoods, in which the assumptions imposed on the mechanism model are kept to a minimum. Both methods presented in this paper have deep roots in fundamental statistical methodologies: the conditional likelihood based method relies on the decomposition of the data into rank statistics and order statistics, while the pseudo likelihood based method relies on conditional independence and biased sampling.

From the data perspective, our work stems from a children’s mental health study where all the variables are binary. Binary data, or categorical data in general, may not be as easy to analyze as continuous data. For instance, one should pay extra attention to the identifiability issue and the model misspecification issue for binary data. Our proposed methods, however, do not place any restrictions on data types. Although nuances exist as to model identifiability and model misspecification, each of our proposed methods can be applied equally to categorical data or continuous data.

In each of our proposed methods, β is the parameter of interest and γ is the nuisance parameter. Our proposal entails a first-stage estimation procedure that yields $\hat{γ}$; the parameter β is then estimated by maximizing $l_{c} (β, \hat{γ})$ or $l_{p} (β, \hat{γ}, \hat{F})$. One could certainly consider directly maximizing $l_{c} (β, γ)$ or $l_{p} (β, γ, \hat{F})$ to locate $\hat{β}$ and $\hat{γ}$ simultaneously, without a first-stage procedure for the nuisance parameter. One possible strength of doing so is that it might improve the estimation efficiency of β, although this is not clear and warrants further careful investigation; one definite limitation is that it increases the computational burden, since the optimization involves a higher-dimensional parameter.
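The contrast between the two strategies can be written side by side. The first display restates the two-stage procedure for the Section 3 model, with the first-stage objective taken from the Appendix; the tilde notation for the joint maximizer is introduced here purely for illustration and does not appear in the paper.

```latex
% Two-stage (plug-in) estimation, as proposed:
\hat{\gamma} = \arg\max_{\gamma} \frac{1}{N} \sum_{i=1}^{N} \log p(u_{1i} \mid u_{2i}; \gamma),
\qquad
\hat{\beta} = \arg\max_{\beta} \, l_{c}(\beta, \hat{\gamma})
\quad \text{or} \quad
\hat{\beta} = \arg\max_{\beta} \, l_{p}(\beta, \hat{\gamma}, \hat{F}).

% Joint alternative (tilde notation is an assumption made for this illustration):
(\tilde{\beta}, \tilde{\gamma}) = \arg\max_{\beta, \gamma} \, l_{c}(\beta, \gamma)
\quad \text{or} \quad
(\tilde{\beta}, \tilde{\gamma}) = \arg\max_{\beta, \gamma} \, l_{p}(\beta, \gamma, \hat{F}).
```

In the two-stage version the second optimization runs over β only, whereas the joint version searches over the stacked parameter (β, γ), which is the source of the additional computational burden.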

Last but not least, one potential limitation of our proposed methods is the sample size requirement. Our simulation settings with sample size N = 2,000 mimic the children’s mental health study. We did carry out experiments with a smaller sample size, e.g., N = 200, but they showed some numerical estimation bias. This phenomenon has been noted in the literature; for instance, Zhao (2017) studied a resampling based procedure to reduce the estimation bias in a similar context.

Acknowledgements

The authors would like to thank the editor, the associate editor and two anonymous referees for their constructive comments, which have led to a significantly improved paper. This work was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under award number UL1TR001412. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Appendix

Proof of Theorem 1. Let $\hat{γ}$ represent the MLE of γ,

\hat{γ} = \arg \max_{γ} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; γ)} .

We first develop the asymptotic properties of $\hat{γ}$. We start from the estimating equation

\frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; \hat{γ})} = 0 .

By Taylor expansion,

0 = \frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; \hat{γ})} = \frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; γ_{0})} + {\frac{\partial^{2}}{\partial γ \partial γ^{T}} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; γ_{0})}} (\hat{γ} - γ_{0}) + o_{p} (N^{- \frac{1}{2}}) .

Thus,

\sqrt{N} \frac{\partial^{2}}{\partial γ \partial γ^{T}} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; γ_{0})} (\hat{γ} - γ_{0}) + \sqrt{N} \frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; γ_{0})} + o_{p} (1) = 0 .

As a result,

\sqrt{N} (\hat{γ} - γ_{0}) = - {[\frac{\partial^{2}}{\partial γ \partial γ^{T}} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; γ_{0})}]}^{- 1} \sqrt{N} \frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log {p (u_{1 i} ∣ u_{2 i}; γ_{0})} + o_{p} (1) = - G^{- 1} \sqrt{N} \frac{1}{N} \sum_{i = 1}^{N} \frac{\partial}{\partial γ} \log {p (u_{1 i} ∣ u_{2 i}; γ_{0})} + o_{p} (1) .
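The expansion above yields the limiting distribution of the nuisance estimator in one further step. The sketch below assumes that G denotes the probability limit of the Hessian of the first-stage log-likelihood and that G is nonsingular; G is defined in the main text of the paper, so this reading is an assumption here. The symbols S_i and Σ_S are introduced only for this illustration.

```latex
% Score of the first-stage model for subject i (notation introduced here):
S_{i} = \frac{\partial}{\partial \gamma} \log p(u_{1i} \mid u_{2i}; \gamma_{0}),
\qquad E(S_{i}) = 0, \qquad \Sigma_{S} = E(S_{i} S_{i}^{T}).

% Central limit theorem applied to the linear representation:
\sqrt{N} (\hat{\gamma} - \gamma_{0})
  = - G^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} S_{i} + o_{p}(1)
  \;\xrightarrow{d}\; N\!\left(0, \; G^{-1} \Sigma_{S} G^{-1}\right).

% If p(u_1 \mid u_2; \gamma) is correctly specified, the information equality
% gives \Sigma_{S} = -G, and the limiting variance reduces to -G^{-1}.
```

The influence function of the first stage, −G⁻¹S_i, is the quantity carried into the expansion of the estimator of β.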

Once the asymptotic property of the nuisance parameter γ is known, we can develop the asymptotic properties of β. We can obtain $\hat{β}$ by solving

\frac{\partial l_{c} (\hat{β}, \hat{γ})}{\partial β} = 0,

which is equivalent to

{\frac{\partial l_{c} (\hat{β}, \hat{γ})}{\partial β} - \frac{\partial l_{c} (β_{0}, \hat{γ})}{\partial β}} + {\frac{\partial l_{c} (β_{0}, \hat{γ})}{\partial β} - \frac{\partial l_{c} (β_{0}, γ_{0})}{\partial β}} + \frac{\partial l_{c} (β_{0}, γ_{0})}{\partial β} = 0 .

(18)

Specifically,

\frac{\partial l_{c} (\hat{β}, \hat{γ})}{\partial β} - \frac{\partial l_{c} (β_{0}, \hat{γ})}{\partial β} = \frac{\partial^{2}}{\partial β \partial β^{T}} l_{c} (β_{0}, \hat{γ}) (\hat{β} - β_{0}) + o_{p} (N^{- \frac{1}{2}}),

(19)

by Taylor expansion. Similarly,

\frac{\partial l_{c} (β_{0}, \hat{γ})}{\partial β} - \frac{\partial l_{c} (β_{0}, γ_{0})}{\partial β} = \frac{\partial^{2}}{\partial β \partial γ^{T}} l_{c} (β_{0}, γ_{0}) (\hat{γ} - γ_{0}) + o_{p} (N^{- \frac{1}{2}}) .

(20)

Plugging (19) and (20) into (18), we obtain the following equation

\sqrt{N} \frac{\partial^{2}}{\partial β \partial β^{T}} l_{c} (β_{0}, \hat{γ}) (\hat{β} - β_{0}) + \sqrt{N} \frac{\partial^{2}}{\partial β \partial γ^{T}} l_{c} (β_{0}, γ_{0}) (\hat{γ} - γ_{0}) + \sqrt{N} \frac{\partial l_{c} (β_{0}, γ_{0})}{\partial β} + o_{p} (1) = 0 .

(21)

As $\sqrt{N} (\hat{γ} - γ_{0}) = - G^{- 1} \sqrt{N} \frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log p (u_{1 i} ∣ u_{2 i}; γ_{0}) + o_{p} (1)$ , (21) is equivalent to

\sqrt{N} \frac{\partial^{2}}{\partial β \partial β^{T}} l_{c} (β_{0}, \hat{γ}) (\hat{β} - β_{0}) + \frac{\partial^{2}}{\partial β \partial γ^{T}} l_{c} (β_{0}, γ_{0}) (- G^{- 1} \sqrt{N} \frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log p (u_{1 i} ∣ u_{2 i}; γ_{0})) + \sqrt{N} \frac{\partial l_{c} (β_{0}, γ_{0})}{\partial β} + o_{p} (1) = 0 .

Thus,

\sqrt{N} (\hat{β} - β_{0}) = - \left\{ \frac{\partial^{2}}{\partial β \partial β^{T}} l_{c} (β_{0}, \hat{γ}) \right\}^{- 1} \left\{ \frac{\partial^{2}}{\partial β \partial γ^{T}} l_{c} (β_{0}, γ_{0}) \left( - G^{- 1} \sqrt{N} \frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log p (u_{1 i} ∣ u_{2 i}; γ_{0}) \right) + \sqrt{N} \frac{\partial l_{c} (β_{0}, γ_{0})}{\partial β} \right\} + o_{p} (1) .

(22)

In addition, we need to rewrite $\frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log p (u_{1 i} ∣ u_{2 i}; γ_{0})$ in (22) as a U-statistic through

\frac{\partial}{\partial γ} \frac{1}{N} \sum_{i = 1}^{N} \log p (u_{1 i} ∣ u_{2 i}; γ_{0}) = \binom{N}{2}^{- 1} \sum_{1 \leq i < j \leq N} \frac{1}{2} \left\{ \frac{\partial}{\partial γ} \log p (u_{1 i} ∣ u_{2 i}; γ_{0}) + \frac{\partial}{\partial γ} \log p (u_{1 j} ∣ u_{2 j}; γ_{0}) \right\},

and

\frac{\partial l_{c} (β_{0}, γ_{0})}{\partial β} = \binom{N}{2}^{- 1} \sum_{1 \leq i < j \leq N} \frac{\partial}{\partial β} \left[ - r_{i} r_{j} \log \{ 1 + Q_{ij} (β_{0}, γ_{0}) \} \right] .

To sum up, (22) can be written as

\sqrt{N} (\hat{β} - β_{0}) = A^{- 1} \sqrt{N} \binom{N}{2}^{- 1} \sum_{1 \leq i < j \leq N} \left\{ B G^{- 1} M_{ij} (γ_{0}) - N_{ij} (β_{0}, γ_{0}) \right\} + o_{p} (1),

(23)

and this completes the proof. □

Proof of Theorem 2. By Taylor’s expansion and $\frac{\partial}{\partial β} l_{p} (\hat{β}, \hat{γ}, \hat{F}) = 0$ , we have

- \frac{\partial}{\partial β} l_{p} (β_{0}, γ_{0}, F_{0}) = {\frac{\partial}{\partial β} l_{p} (\hat{β}, \hat{γ}, \hat{F}) - \frac{\partial}{\partial β} l_{p} (β_{0}, \hat{γ}, \hat{F})} + {\frac{\partial}{\partial β} l_{p} (β_{0}, \hat{γ}, \hat{F}) - \frac{\partial}{\partial β} l_{p} (β_{0}, γ_{0}, \hat{F})} + \frac{\partial}{\partial β} l_{p} (β_{0}, γ_{0}, \hat{F}) - \frac{\partial}{\partial β} l_{p} (β_{0}, γ_{0}, F_{0}) = \frac{\partial^{2}}{\partial β \partial β^{T}} l_{p} (β_{0}, \hat{γ}, \hat{F}) (\hat{β} - β_{0}) + \frac{\partial^{2}}{\partial β \partial γ^{T}} l_{p} (β_{0}, γ_{0}, \hat{F}) (\hat{γ} - γ_{0}) + \frac{\partial}{\partial β} l_{p} (β_{0}, γ_{0}, \hat{F}) - \frac{\partial}{\partial β} l_{p} (β_{0}, γ_{0}, F_{0}) + o_{p} (N^{- \frac{1}{2}}),

(24)

and

\frac{\partial}{\partial β} l_{p} (β_{0}, γ_{0}, \hat{F}) - \frac{\partial}{\partial β} l_{p} (β_{0}, γ_{0}, F_{0}) = \frac{1}{N} \sum_{i = 1}^{N} r_{i} \times {\frac{\int \frac{\partial}{\partial β} p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d \hat{F} \int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d (\hat{F} - F_{0})}{\int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d F_{0} \int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d \hat{F}} - \frac{\int \frac{\partial}{\partial β} p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d (\hat{F} - F_{0})}{\int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d F_{0}}} = V_{N} + o_{p} (N^{- \frac{1}{2}}),

(25)

where

V_{N} = \frac{1}{N} \sum_{i = 1}^{N} r_{i} \times {\frac{\int \frac{\partial}{\partial β} p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d \hat{F} \int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d (\hat{F} - F_{0})}{{[\int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d F_{0}]}^{2}} - \frac{\int \frac{\partial}{\partial β} p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d (\hat{F} - F_{0})}{\int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d F_{0}}}

is a V-statistic based on data w_i = (r_i, y_i, u_1i, u_2i) and the following kernel function

v (w_{i}, w_{j}) = \frac{1}{2} \left\{ \frac{r_{i} \int \frac{\partial}{\partial β} p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d F_{0} \cdot p (y_{i} ∣ u_{1 i}, u_{2 j}; β_{0}) p (u_{1 i} ∣ u_{2 j}; γ_{0})}{{[\int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d F_{0}]}^{2}} + \frac{r_{j} \int \frac{\partial}{\partial β} p (y_{j} ∣ u_{1 j}, u_{2}; β_{0}) p (u_{1 j} ∣ u_{2}; γ_{0}) d F_{0} \cdot p (y_{j} ∣ u_{1 j}, u_{2 i}; β_{0}) p (u_{1 j} ∣ u_{2 i}; γ_{0})}{{[\int p (y_{j} ∣ u_{1 j}, u_{2}; β_{0}) p (u_{1 j} ∣ u_{2}; γ_{0}) d F_{0}]}^{2}} - \frac{r_{i} \frac{\partial}{\partial β} p (y_{i} ∣ u_{1 i}, u_{2 j}; β_{0}) p (u_{1 i} ∣ u_{2 j}; γ_{0})}{\int p (y_{i} ∣ u_{1 i}, u_{2}; β_{0}) p (u_{1 i} ∣ u_{2}; γ_{0}) d F_{0}} - \frac{r_{j} \frac{\partial}{\partial β} p (y_{j} ∣ u_{1 j}, u_{2 i}; β_{0}) p (u_{1 j} ∣ u_{2 i}; γ_{0})}{\int p (y_{j} ∣ u_{1 j}, u_{2}; β_{0}) p (u_{1 j} ∣ u_{2}; γ_{0}) d F_{0}} \right\} .

Let

v_{1} (w_{i}) = E [v (w_{i}, w_{j}) ∣ w_{i}] = \frac{P (R = 1)}{2} E \left\{ \frac{\int \frac{\partial}{\partial β} p (y_{j} ∣ u_{1 j}, u_{2}; β_{0}) p (u_{1 j} ∣ u_{2}; γ_{0}) d F_{0} \cdot p (y_{j} ∣ u_{1 j}, u_{2 i}; β_{0}) p (u_{1 j} ∣ u_{2 i}; γ_{0})}{{[\int p (y_{j} ∣ u_{1 j}, u_{2}; β_{0}) p (u_{1 j} ∣ u_{2}; γ_{0}) d F_{0}]}^{2}} - \frac{\frac{\partial}{\partial β} p (y_{j} ∣ u_{1 j}, u_{2 i}; β_{0}) p (u_{1 j} ∣ u_{2 i}; γ_{0})}{\int p (y_{j} ∣ u_{1 j}, u_{2}; β_{0}) p (u_{1 j} ∣ u_{2}; γ_{0}) d F_{0}} ∣ r_{j} = 1, u_{2 i} \right\},

which does not depend on r_i, y_i or u_1i, and will be denoted by v₁(u_2i; β₀, γ₀, F₀). From the theory of V-statistics, we have

V_{N} = \frac{1}{N} \sum_{i = 1}^{N} 2 v_{1} (u_{2 i}; β_{0}, γ_{0}, F_{0}) + o_{p} (N^{- \frac{1}{2}}) .

(26)
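The step from V_N to (26) is an instance of the standard first-order projection of a V-statistic. A general statement is sketched below for a symmetric kernel; it assumes that $\hat{F}$ is the empirical distribution of the u_2 values, so that V_N is a double sum over pairs of subjects, and that the kernel has a finite second moment with integrable diagonal terms. The symbol ϑ for the mean of the kernel is introduced only for this illustration.

```latex
% V-statistic with symmetric kernel v and mean \vartheta = E\{v(W_1, W_2)\}:
V_{N} = \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N} v(w_{i}, w_{j}),
\qquad
v_{1}(w) = E\{v(w, W_{2})\}.

% First-order (Hoeffding) projection:
V_{N} - \vartheta
  = \frac{2}{N} \sum_{i=1}^{N} \{ v_{1}(w_{i}) - \vartheta \} + o_{p}(N^{-1/2}).

% In (26) the kernel has mean zero: for i \neq j, the two terms of v(w_i, w_j)
% that carry r_i cancel after integrating u_{2j} against F_0, and likewise for
% the two terms that carry r_j, so \vartheta = 0 and only 2 v_1 remains.
```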

Under the given conditions, we have

\sqrt{N} (\hat{β} - β_{0}) = C^{- 1} \frac{1}{\sqrt{N}} \sum_{i = 1}^{N} \left\{ D G^{- 1} \frac{\partial}{\partial γ} \log p (u_{1 i} ∣ u_{2 i}; γ_{0}) - \frac{\partial}{\partial β} H_{i} (β_{0}, γ_{0}, F_{0}) - 2 v_{1} (u_{2 i}; β_{0}, γ_{0}, F_{0}) \right\} + o_{p} (1),

and this completes the proof. □

References

Chen J and Fang F (2019), “Semiparametric likelihood for estimating equations with non-ignorable non-response by non-response instrument,” Journal of Nonparametric Statistics, 1–15.
Fang F and Shao J (2016), “Model selection with nonignorable nonresponse,” Biometrika, 103, 861–874.
Ibrahim JG, Chu H, and Chen M-H (2012), “Missing data in clinical studies: issues and methods,” Journal of Clinical Oncology, 30, 3297–3303.
Ibrahim JG, Lipsitz SR, and Horton N (2001), “Using auxiliary data for parameter estimation with non-ignorably missing outcomes,” Journal of the Royal Statistical Society: Series C (Applied Statistics), 50, 361–373.
Liang K-Y and Qin J (2000), “Regression analysis under non-standard situations: a pairwise pseudolikelihood approach,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 773–786.
Little RJ, D’Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, Frangakis C, Hogan JW, Molenberghs G, Murphy SA, et al. (2012), “The prevention and treatment of missing data in clinical trials,” New England Journal of Medicine, 367, 1355–1360.
Little RJ and Rubin DB (2002), Statistical Analysis with Missing Data, Wiley, 2nd ed.
Zahner GE and Daskalakis C (1997), “Factors associated with mental health, general health, and school-based service use for child psychopathology,” American Journal of Public Health, 87, 1440–1448.
Zahner GE, Jacobs JH, Freeman DH, and Trainor KF (1993), “Rural-urban child psychopathology in a northeastern US state: 1986–1989,” Journal of the American Academy of Child & Adolescent Psychiatry, 32, 378–387.
Zahner GE, Pawelkiewicz W, DeFrancesco JJ, and Adnopoz J (1992), “Children’s mental health service needs and utilization patterns in an urban community: An epidemiological assessment,” Journal of the American Academy of Child & Adolescent Psychiatry, 31, 951–960.
Zhao J (2017), “Reducing bias for maximum approximate conditional likelihood estimator with general missing data mechanism,” Journal of Nonparametric Statistics, 29, 577–593.
Zhao J and Ma Y (2018), “Optimal pseudolikelihood estimation in the analysis of multivariate missing data with nonignorable nonresponse,” Biometrika, 105, 479–486.
Zhao J and Shao J (2015), “Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data,” Journal of the American Statistical Association, 110, 1577–1590.
Zhao J and Shao J (2017), “Approximate conditional likelihood for generalized linear models with general missing data mechanism,” Journal of Systems Science and Complexity, 30, 139–153.
Zhao J, Yang Y, and Ning Y (2018), “Penalized pairwise pseudo likelihood for variable selection with nonignorable missing data,” Statistica Sinica, 28, 2125–2148.
Zhao P, Tang N, Qu A, and Jiang D (2017), “Semiparametric estimating equations inference with nonignorable missing data,” Statistica Sinica, 89–113.

