Abstract
Nonignorable missing data are common in studies where the outcome is relevant to the subject’s behavior. Ibrahim et al. (2001) fitted a logistic regression for a binary outcome subject to nonignorable missingness, and proposed replacing the outcome in the mechanism model with a completely observed auxiliary variable. Their approach requires a correctly specified model for the auxiliary variable; unfortunately, this model still involves the outcome variable subject to nonignorable missingness, and it is unclear how to specify it correctly. Instead, we propose two unconventional likelihood based estimation procedures in which the nonignorable missingness mechanism model is completely bypassed. We apply our proposed methods to the children’s mental health study and compare their performance with existing methods. The large sample properties of the proposed estimators are rigorously justified, and their finite sample behavior is examined via comprehensive simulation studies.
Keywords: Nonignorable missing data, Missingness mechanism, Unconventional likelihood, Pseudo likelihood, Conditional likelihood, Asymptotic normality
1. Introduction
The issue of missing data is a serious problem in various biomedical studies such as clinical trials and observational studies (Ibrahim et al., 2012; Little et al., 2012). Consider a situation in which the outcome variable of interest, Y, is subject to missing values. If one believes that its missingness is unrelated to the possibly unobserved Y-value, the missing data are called ignorable (Little and Rubin, 2002). For ignorable missing data, many well-known statistical procedures are available, and their theoretical properties are rigorously justified. Often, however, there is a suspicion that the missingness is actually related to the value of the possibly unobserved Y itself. This type of missing data is termed nonignorable, and it arises in disciplines where the outcome is directly relevant to the subject’s behavior or is reported by the subjects themselves, for instance, through questionnaires or surveys.
Ibrahim et al. (2001) analyzed a data set from a study of mental health of children in Connecticut (Zahner et al., 1992, 1993; Zahner and Daskalakis, 1997), and their outcome variable of interest was the teacher’s report of the psychopathology of the child. Using logistic regression, this variable was modelled as a function of the parental status of the household, father, and the physical health of the child, health. In this study, the teacher’s report of psychopathology was missing for 42.7% of the subjects. A child’s possibly unobserved psychopathology status may be related to missingness because a teacher is more likely to fill out the psychopathology status when the teacher feels that the child is normal or not normal, hence the missingness is nonignorable. A naive logistic regression model discarding subjects with missing psychopathology status, the so-called complete-case analysis, could lead to highly biased estimates. Thus, methods that take the nonignorable missingness mechanism into account are needed.
Ibrahim et al. (2001) proposed to approximate the nonignorable missingness mechanism with an ignorable one, by taking advantage of the auxiliary variable, Z, the parents’ report of the psychopathology of the child. Their rationale is that the missingness indicator and the outcome variable Y would be independent conditional on Z and other covariates, if Y and Z have a sufficiently high partial correlation. As a result, their parametric likelihood under this ignorable missingness assumption involves not only the model of interest for Y, but also the regression model for Z on all other variables including Y. They showed that, if Y and Z have a reasonably large correlation, their proposed estimator reduces the estimation bias resulting from the complete-case analysis.
In nonignorable missing data analysis, one is usually interested in estimating the unknown parameter in a parametric outcome model. In this situation, if one fits another parametric model for the missingness mechanism, the overall model specification issue becomes involved, and as a consequence, it may result in severe estimation bias. By exploiting the auxiliary variable Z, Ibrahim et al. (2001) avoided the direct parametric modeling of the mechanism. However, to implement their parametric likelihood, they have to fit a parametric model for Z regressed on Y and all other variables. Therefore, they still cannot completely avoid the problem of fitting a parametric model with nonignorable missingness.
There is a considerable amount of recent literature on nonignorable missingness mechanism beyond a simplistic parametric modeling. Zhao and Shao (2015) proposed a semi-parametric pseudo likelihood method for generalized linear models based on the existence of a nonresponse instrument (or, shadow variable), and Zhao and Ma (2018) investigated the estimation optimality under their framework. Fang and Shao (2016) and Zhao et al. (2018) studied the variable selection problem with nonignorable missing data from different perspectives. Besides likelihood based methods, Zhao et al. (2017) and Chen and Fang (2019) also considered estimating equation based methods under appropriate assumptions.
In this paper, motivated by the initial analysis result of Ibrahim et al. (2001) that the variable father is unrelated to missingness, we propose two estimation methods based on unconventional likelihoods without fitting the mechanism model or any other model. Our first method is based on conditional likelihood and the second on pseudo likelihood. In both methods we assume the mechanism model is arbitrary except that the missingness is independent of father conditional on all other variables, as evidenced in Ibrahim et al. (2001). The most attractive feature of using these unconventional likelihoods is that, in the objective function, the pieces involving the mechanism model all cancel out; therefore our methods are robust to misspecification of the mechanism model. Unlike Ibrahim et al. (2001), who adopted purely parametric specifications for their models and hence had to conduct heavy sensitivity analyses, we use a semiparametric framework where we treat the mechanism model as a nonparametric nuisance, and we propose nuisance-free methods to estimate the parameter of interest. We also rigorously derive the asymptotic properties of our proposed estimators.
The auxiliary variable Z is important since it is completely observed and it provides key surrogate information for the outcome variable Y; therefore it is of interest to quantify the relation between Z and Y in our outcome model and to further use the auxiliary variable Z in our framework. To this end we consider a more general outcome model for Y where the auxiliary variable Z also serves as one of the covariates. Instead of relying on the conditional independence between the missingness and Y, our proposed methods under this more general outcome model rely on the conditional independence between the missingness and Z. Similarly, we can also propose estimators based on unconventional likelihoods for which fitting the mechanism model or any other model is not needed.
The structure of the paper is as follows. We first introduce the notation and briefly review the analysis results of Ibrahim et al. (2001) in Section 2. In Section 3 we propose our unconventional likelihood based estimation methods for the model considered in Ibrahim et al. (2001). In Section 4 we present the results for the more general outcome model discussed in the previous paragraph. In Section 5 we implement our proposed estimators and compare their performance with Ibrahim et al. (2001). Finally, in Section 6 we conduct comprehensive simulation studies to examine the finite sample performance of our proposed estimators when the data generating process is known, and in Section 7 we conclude the paper.
2. A children’s mental health study
We first briefly describe the study in Ibrahim et al. (2001) and their analysis results. Besides the notation Y as outcome variable (Y = 1 if the teacher reports borderline or clinical psychopathology and Y = 0 if the teacher reports that the child is normal) and Z as auxiliary variable (Z = 1 if the parents’ report is abnormal and Z = 0 if normal) that are used in Section 1, we also introduce the following notation. We denote U = (U1, U2)T where U1 is the effect of health (U1 = 1 means there is fair or poor health, a chronic condition or a limitation in activity and U1 = 0 means there are no health problems), and U2 is the effect of father (U2 = 1 if no father figure is present and U2 = 0 if a father figure is present). The variables U1, U2 and Z are all completely observed, but the variable Y has missing values. We use binary variable R as the missingness indicator where R = 1 represents that Y is observed and R = 0 if Y is missing. We use N as the total sample size and n as the sample size without missing values. Without loss of generality, we assume the first n samples are completely observed. Throughout our paper we also use p(⋅ | ⋅) as a generic notation for the conditional probability density function.
Throughout Ibrahim et al. (2001), the model of interest is
$$\operatorname{logit}\{P(Y = 1 \mid u; \beta)\} = \beta_0 + \beta_1 u_1 + \beta_2 u_2, \tag{1}$$
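where β = (β0, β1, β2)T is the unknown parameter. One key step of Ibrahim et al. (2001) is to formulate a parametric likelihood function and adopt the EM algorithm for the maximization. To this end, they first proposed a logistic regression model for the missingness mechanism,
$$\operatorname{logit}\{P(R = 1 \mid y, u; \theta)\} = \theta_0 + \theta_1 y + \theta_2 u_1 + \theta_3 u_2.$$
They found that the variable U2 was unrelated to missingness, hence they dropped this term and the mechanism model becomes
$$\operatorname{logit}\{P(R = 1 \mid y, u_1; \theta)\} = \theta_0 + \theta_1 y + \theta_2 u_1. \tag{2}$$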
One limitation of using a purely parametric approach for the nonignorable mechanism model (2) is the potential for misspecification. To address this problem, heavy sensitivity analyses have to be conducted to validate the parametric specification. If more terms were added as covariates to the model (2), for instance, the multi-way interactions between Y and U1, the model (2) becomes non-identifiable.
By taking advantage of the existence of the auxiliary variable Z and recognizing it as a proxy for the outcome Y, Ibrahim et al. (2001) proposed that
$$P(R = 1 \mid y, z, u) = P(R = 1 \mid z, u),$$
hence the mechanism model becomes ignorable and it can be eliminated from the parametric likelihood for estimating β. The likelihood function to be maximized is then
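$$\prod_{i=1}^{n} p(y_i \mid u_i; \beta)\, p(z_i \mid y_i, u_i; \gamma) \prod_{i=n+1}^{N} \sum_{y \in \{0,1\}} p(y \mid u_i; \beta)\, p(z_i \mid y, u_i; \gamma), \tag{3}$$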
where, besides the model of interest p(y | u; β), the model p(z | y, u; γ) is also involved where the parameter γ is unknown.
Therefore, to correctly estimate the parameter of interest β, one has to correctly fit the model p(z | y, u; γ), which also involves nonignorable missing data. The essential difficulty of correctly fitting a parametric model with nonignorable missing data remains. In their paper, Ibrahim et al. (2001) tried different models to fit p(z | y, u; γ) by adding different combinations of interaction terms.
The major conclusion of Ibrahim et al. (2001) is that the estimator of β from maximizing (3) with the model p(z | y, u; γ) correctly specified, although theoretically biased, will numerically reduce the estimation bias compared to the complete-case estimator, provided there is a sufficiently strong correlation between Y and Z. They supported this conclusion by analyzing the children’s mental health study data set.
3. Proposed methods
Motivated by the difficulty of correctly specifying a parametric model for the nonignorable missingness mechanism, we propose two unconventional likelihood based methods where we do not need to specify a concrete form for the mechanism model. In this Section, as in Ibrahim et al. (2001), we aim to estimate the parameter β in model (1), but in contrast to their approach, we only assume that the missingness mechanism satisfies
$$P(R = 1 \mid y, u_1, u_2) = P(R = 1 \mid y, u_1). \tag{4}$$
Note that we are not using the explicit form of the mechanism model as in (2). Note also that the validity of assumption (4) is already supported by the initial analysis results of Ibrahim et al. (2001). A similar framework that integrates the auxiliary variable Z will be presented in Section 4.
3.1. Method based on conditional likelihood
The idea of the first method originates from studying the model p(y, u1 | u2). Note that the likelihood function based on the model p(y, u1 | u2) with the observed data {yi, u1i, u2i}, i = 1, …, n is
$$\prod_{i=1}^{n} p(y_i, u_{1i} \mid u_{2i}, r_i = 1) = \prod_{i=1}^{n} \frac{P(R = 1 \mid y_i, u_{1i})\, p(y_i, u_{1i} \mid u_{2i})}{P(R = 1 \mid u_{2i})}, \tag{5}$$
where P(R = 1 | u2) = ∫ ∫P(R = 1 | Y, U1)p(Y, U1 | u2)dY dU1. However, we cannot proceed directly with this likelihood since all terms P(R = 1 | y, u1) and P(R = 1 | u2) are unknown. Motivated by previous work, e.g., Zhao et al. (2018), on decomposing independent and identically distributed observations into their rank statistic and order statistic and working solely with the likelihood function of the rank statistic conditional on the order statistic, we consider a similar type of conditional likelihood
| (6) |
where the denominator comes from a similar type of likelihood for the order statistic, and ∑ represents the collection of all one-to-one mappings from {1, …, n} to {1, …, n}. Plugging the expression in (5) for p(yi, u1i | u2i, ri = 1) into (6), it can be simplified to
| (7) |
since all terms P(R = 1 | yi, u1i) and P(R = 1 | u2i) cancel out.
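Computing the expression (7) is infeasible in practice because its denominator runs over all n! one-to-one mappings. We therefore propose to use a pairwise version of (7):
$$\prod_{1 \le i < j \le n} \frac{p(y_i, u_{1i} \mid u_{2i})\, p(y_j, u_{1j} \mid u_{2j})}{p(y_i, u_{1i} \mid u_{2i})\, p(y_j, u_{1j} \mid u_{2j}) + p(y_i, u_{1i} \mid u_{2j})\, p(y_j, u_{1j} \mid u_{2i})},$$
whose maximization is equivalent to maximizing the objective function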
| (8) |
where
| (9) |
The idea of using conditional likelihood has been applied in the missing data literature, for example, by Liang and Qin (2000) and Zhao and Shao (2017). However, their methods were based on a less flexible assumption and, consequently, not all parameters were identifiable. In our framework, on the contrary, one can verify that the parameter β is fully identifiable in Qij(β), hence also in the objective function lc(β).
To address the term Wij, note that in the application of the children’s mental health study, both U1 and U2 are binary and completely observed, therefore we can simply fit a saturated logistic regression model as
$$\operatorname{logit}\{P(U_1 = 1 \mid u_2; \gamma)\} = \gamma_0 + \gamma_1 u_2,$$
where γ = (γ0, γ1)T is the unknown parameter. Thereby we can simplify
where . To better facilitate our presentation in the following, we now re-denote Qij(β) as Qij(β, γ), and lc(β) as lc(β, γ).
In the numerical implementation, one first fits a logistic regression model between the completely observed variables U1 and U2 to obtain an estimate γ̂ of γ. One then plugs γ̂ into the function lc(β, γ) in (8), which gives lc(β, γ̂). Finally, the unknown parameter β is estimated by maximizing the objective function lc(β, γ̂).
Denote the true value of β as β0 and that of γ as γ0. We have the following asymptotic result for the resulting estimator of β, and its proof is contained in the Appendix.
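The claim that the mechanism cancels out of the pairwise objective can be checked numerically. The sketch below is an illustration, not the authors' implementation. It borrows the data generating values reported in Section 6 (β = (−0.5, 0.1, 3), γ = (1, 0.5), θ = (−2.20, 3.58, 0.81), P(U2 = 1) = 0.6), and it assumes that each pair (i, j) contributes the probability of the observed pairing of (y, u1) with u2 relative to the swapped pairing. All function names are hypothetical. It evaluates the large-sample version of the pairwise objective among complete cases, and checks that the true β is not beaten by a few alternative values, although the mechanism model never enters the objective itself.

```python
import math
from itertools import product

# Values taken from the simulation design in Section 6; names are illustrative.
BETA_TRUE = (-0.5, 0.1, 3.0)
GAMMA_TRUE = (1.0, 0.5)
THETA_TRUE = (-2.20, 3.58, 0.81)
P_U2 = 0.6


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def bern(p, v):
    """Probability that a Bernoulli(p) variable equals v."""
    return p if v == 1 else 1.0 - p


def f(y, u1, u2, beta, gamma):
    """p(y, u1 | u2) = p(y | u1, u2; beta) * p(u1 | u2; gamma)."""
    p_y = bern(expit(beta[0] + beta[1] * u1 + beta[2] * u2), y)
    p_u1 = bern(expit(gamma[0] + gamma[1] * u2), u1)
    return p_y * p_u1


def complete_case_cells():
    """Exact distribution of (y, u1, u2) among subjects with R = 1."""
    w = {}
    for y, u1, u2 in product((0, 1), repeat=3):
        p_obs = expit(THETA_TRUE[0] + THETA_TRUE[1] * y + THETA_TRUE[2] * u1)
        w[(y, u1, u2)] = bern(P_U2, u2) * f(y, u1, u2, BETA_TRUE, GAMMA_TRUE) * p_obs
    total = sum(w.values())
    return {cell: v / total for cell, v in w.items()}


def log_pair_term(a, b, beta, gamma):
    """Log-probability of the observed pairing versus the pairing with u2 swapped."""
    (yi, u1i, u2i), (yj, u1j, u2j) = a, b
    observed = f(yi, u1i, u2i, beta, gamma) * f(yj, u1j, u2j, beta, gamma)
    swapped = f(yi, u1i, u2j, beta, gamma) * f(yj, u1j, u2i, beta, gamma)
    return math.log(observed / (observed + swapped))


def population_objective(beta, gamma=GAMMA_TRUE):
    """Expected pairwise log-term for two independent complete cases."""
    cells = complete_case_cells()
    return sum(pa * pb * log_pair_term(a, b, beta, gamma)
               for a, pa in cells.items() for b, pb in cells.items())


at_truth = population_objective(BETA_TRUE)
alternatives = [(-0.5, 0.1, 2.0), (1.6, -0.5, 3.0), (0.0, 0.0, 0.0)]
gaps = [at_truth - population_objective(b) for b in alternatives]
print(all(g > 0 for g in gaps))
```

Note that `THETA_TRUE` is used only to build the complete-case distribution; `log_pair_term` depends on β and γ alone, which is the point of the construction.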
Theorem 1. Assume (1), (4), for any β in the parameter space. Then
where
3.2. Method based on pseudo likelihood
Our second method is motivated by the fact that, under assumption (4), the missingness indicator R and the covariate U2 are conditionally independent given the outcome Y and covariate U1. Specifically, we have p(U2 | Y, U1, R = 1) = p(U2 | Y, U1). This means that the model regressing U2 on (Y, U1) that can be inferred using all completely observed subjects (with R = 1) is the same as the model that can be inferred using all data (as if there were no missing data). Therefore we aim to maximize
| (10) |
Note that there are two nuisance components in (10). The p(u2) in the integral is unknown, but since the whole integral can be treated as a mean with respect to U2, we can estimate p(u2) using its empirical distribution function and replace the integral with its empirical version. The other component, the model p(u1 | u2), is also unknown. As in Section 3.1, we fit a saturated logistic regression model p(u1 | u2; γ).
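The identity p(U2 | Y, U1, R = 1) = p(U2 | Y, U1) that underlies this method can be verified by exact enumeration. The following sketch assumes the data generating values of Section 6 purely for illustration; the function names are hypothetical. It also shows the contrast with p(Y | U1, U2), which does change once attention is restricted to complete cases.

```python
import math
from itertools import product

# Illustrative values borrowed from the simulation design in Section 6.
BETA = (-0.5, 0.1, 3.0)
GAMMA = (1.0, 0.5)
THETA = (-2.20, 3.58, 0.81)
P_U2 = 0.6


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def bern(p, v):
    return p if v == 1 else 1.0 - p


def joint(y, u1, u2, r):
    """P(Y = y, U1 = u1, U2 = u2, R = r); R depends on (Y, U1) only, as in (4)."""
    p = bern(P_U2, u2)
    p *= bern(expit(GAMMA[0] + GAMMA[1] * u2), u1)
    p *= bern(expit(BETA[0] + BETA[1] * u1 + BETA[2] * u2), y)
    p *= bern(expit(THETA[0] + THETA[1] * y + THETA[2] * u1), r)
    return p


def prob_u2_given(y, u1, observed_only):
    """P(U2 = 1 | y, u1), optionally restricted to R = 1."""
    rs = (1,) if observed_only else (0, 1)
    num = sum(joint(y, u1, 1, r) for r in rs)
    den = sum(joint(y, u1, u2, r) for u2 in (0, 1) for r in rs)
    return num / den


def prob_y_given(u1, u2, observed_only):
    """P(Y = 1 | u1, u2), optionally restricted to R = 1."""
    rs = (1,) if observed_only else (0, 1)
    num = sum(joint(1, u1, u2, r) for r in rs)
    den = sum(joint(y, u1, u2, r) for y in (0, 1) for r in rs)
    return num / den


pairs = list(product((0, 1), repeat=2))
max_gap_u2 = max(abs(prob_u2_given(y, u1, True) - prob_u2_given(y, u1, False))
                 for y, u1 in pairs)
max_gap_y = max(abs(prob_y_given(u1, u2, True) - prob_y_given(u1, u2, False))
                for u1, u2 in pairs)
print(max_gap_u2 < 1e-12, max_gap_y > 0.1)
```

The first gap vanishes up to rounding error, while the second is large, which is why a regression of U2 on (Y, U1) can be fitted on complete cases whereas a regression of Y on (U1, U2) cannot.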
We introduce the notation
and
where F denotes the cumulative distribution function of U2. We also adopt the same method as in Section 3.1 to estimate γ. Therefore, our objective function becomes
| (11) |
and our estimator is
| (12) |
We point out that the idea of using pseudo likelihood in a generalized linear model was rigorously investigated in Zhao and Shao (2015). The method presented here is a direct application of Zhao and Shao (2015) to the case where U1 and U2 are both binary, but some nuances still deserve attention. For instance, because the covariates U1 and U2 are both binary, one can easily check that the identifiability conditions in Zhao and Shao (2015) are all satisfied, hence the parameter β is fully identifiable. Also, since both U1 and U2 are binary, a saturated logistic regression model can be fitted for p(u1 | u2) and there is no model misspecification issue.
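A large-sample version of the pseudo likelihood can be evaluated exactly when all variables are binary. The sketch below is an illustration under stated assumptions rather than the authors' code: it uses the data generating values of Section 6, replaces the integral over U2 with a two-term sum weighted by P(U2 = u2), and treats γ and the distribution of U2 as known. Function names are hypothetical. It checks that the expected complete-case log pseudo likelihood is larger at the true β than at a few alternatives.

```python
import math
from itertools import product

# Illustrative values borrowed from the simulation design in Section 6.
BETA_TRUE = (-0.5, 0.1, 3.0)
GAMMA_TRUE = (1.0, 0.5)
THETA_TRUE = (-2.20, 3.58, 0.81)
P_U2 = 0.6


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def bern(p, v):
    return p if v == 1 else 1.0 - p


def numerator(y, u1, u2, beta, gamma):
    """p(y | u1, u2; beta) * p(u1 | u2; gamma) * p(u2)."""
    p_y = bern(expit(beta[0] + beta[1] * u1 + beta[2] * u2), y)
    p_u1 = bern(expit(gamma[0] + gamma[1] * u2), u1)
    return p_y * p_u1 * bern(P_U2, u2)


def p_u2_given_y_u1(u2, y, u1, beta, gamma):
    """Model-implied p(u2 | y, u1); the sum over v plays the role of the integral."""
    den = sum(numerator(y, u1, v, beta, gamma) for v in (0, 1))
    return numerator(y, u1, u2, beta, gamma) / den


def population_pseudo_loglik(beta, gamma=GAMMA_TRUE):
    """E[ R * log p(U2 | Y, U1; beta) ] under the assumed design."""
    total = 0.0
    for y, u1, u2 in product((0, 1), repeat=3):
        p_obs = expit(THETA_TRUE[0] + THETA_TRUE[1] * y + THETA_TRUE[2] * u1)
        weight = numerator(y, u1, u2, BETA_TRUE, GAMMA_TRUE) * p_obs  # P(cell, R = 1)
        total += weight * math.log(p_u2_given_y_u1(u2, y, u1, beta, gamma))
    return total


at_truth = population_pseudo_loglik(BETA_TRUE)
alternatives = [(-0.5, 0.1, 2.0), (1.6, -0.5, 3.0), (0.0, 0.0, 0.0)]
gaps = [at_truth - population_pseudo_loglik(b) for b in alternatives]
print(all(g > 0 for g in gaps))
```

In a real analysis γ and the distribution of U2 would be replaced by estimates from all N subjects, since U1 and U2 are completely observed.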
Similar to Section 3.1, we also assume
We present the asymptotic result for the estimator defined in (12) below, and its proof is contained in the Appendix.
Theorem 2. Assume (1), (4). Assume that lp(β, γ, F) is twice continuously differentiable with respect to β, and there exists an open set containing β0 such that
where E{M(R, Y, X)} < ∞. Then, as N → ∞,
| (13) |
where
and v1(u2i; β0, γ0, F0) is defined in the Appendix.
4. Results under a more general outcome model
The outcome model (1), considered in Ibrahim et al. (2001) and in Section 3, cannot directly capture the association between Y and its surrogate Z, nor the adjusted effect of U on Y after controlling for the variable Z. In this Section, we study a more general outcome model
$$\operatorname{logit}\{P(Y = 1 \mid u, z; \beta)\} = \beta_0 + \beta_1 u_1 + \beta_2 u_2 + \beta_3 z, \tag{14}$$
where we simply denote β = (β0, β1, β2, β3)T throughout this Section.
In the same spirit as Section 3, where we impose only a minimal assumption on the mechanism model, throughout this Section we assume
$$P(R = 1 \mid y, u_1, u_2, z) = P(R = 1 \mid y, u_1, u_2), \tag{15}$$
which is reasonable in the sense that, after adjusting for the effect of Y, U1 and U2, the missingness of Y, i.e., the variable R, does not depend on Z, the surrogate variable of Y.
Recall that in Section 3, since both U1 and U2 are binary, a saturated logistic regression model for p(u1 | u2) could be fitted and there is no model misspecification issue. This will not be the case in this Section. Since p(u1, u2 | z) = p(u1 | u2, z)p(u2 | z), in total 5 parameters could be estimated if one fits parametric logistic regression models without multi-way interaction terms. But one needs to estimate 6 parameters in modeling p(u1, u2 | z) since all variables U1, U2 and Z are binary. In other words, simple parametric models for p(u1, u2 | z) could be misspecified, and this phenomenon would be more pronounced and more common if, in general, the variables U and Z are continuous or multi-dimensional.
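The count behind this comparison can be written out explicitly. The following is a sketch of the arithmetic, assuming main-effects logistic models with an intercept:

$$
\begin{aligned}
\text{main-effects logistic models:}\quad & \underbrace{3}_{p(u_1 \mid u_2, z)} + \underbrace{2}_{p(u_2 \mid z)} = 5,\\
\text{saturated } p(u_1, u_2 \mid z):\quad & \underbrace{2}_{\text{values of } z} \times \underbrace{(4 - 1)}_{\text{free cell probabilities}} = 6.
\end{aligned}
$$

The one missing degree of freedom corresponds to the interaction between U2 and Z in p(u1 | u2, z).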
In the following, we will briefly present the two unconventional likelihood based estimators under model (14) and assumption (15). We will consider both parametric and nonparametric estimation approaches for p(u1, u2, z). The nonparametric estimation approach enjoys the attractive property of no model misspecification issue. To distinguish it from the simple parametric modeling, we denote q(u1, u2, z) as the nonparametric model for p(u1, u2, z).
4.1. Method based on conditional likelihood
Similar to the idea of conditional likelihood introduced in Section 3.1, we form the objective function as
| (16) |
where
and
We now consider both parametric and nonparametric estimation approaches for Wij. Parametrically, we assume
After some algebra, we have
Hence, under regularity conditions similar to those discussed in Section 3.1, the asymptotic property of the estimator of β obtained via conditional likelihood is similar to Theorem 1, i.e.,
where more details are similar to Theorem 1 hence are omitted.
Alternatively, we consider estimating Wij nonparametrically, i.e.,
Since all variables U1, U2 and Z are binary, we estimate q(u1, u2, z) using its empirical distribution, i.e., we will have
Once the empirical estimate of q(u1, u2, z) is obtained, all subsequent steps are the same as in the parametric approach, and an asymptotic result similar to Theorem 1 can also be derived.
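Because U1, U2 and Z are binary and completely observed, the empirical estimate of q(u1, u2, z) is simply a table of eight cell frequencies computed from all N subjects. A minimal sketch follows; the function name `empirical_q` and the toy rows are hypothetical.

```python
from collections import Counter
from itertools import product


def empirical_q(rows):
    """Relative frequency of each (u1, u2, z) cell among all subjects.

    rows: iterable of (u1, u2, z) tuples with entries in {0, 1}.
    Returns a dict with one entry per cell, including cells with zero count.
    """
    rows = list(rows)
    counts = Counter(rows)
    n = len(rows)
    return {cell: counts.get(cell, 0) / n for cell in product((0, 1), repeat=3)}


# Hypothetical toy data, for illustration only.
toy_rows = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0)]
q_hat = empirical_q(toy_rows)
print(q_hat[(1, 0, 1)], q_hat[(1, 1, 1)])
```

Cells with a zero frequency would need care in any ratio that divides by the estimated q; how such cells are handled is not specified here.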
4.2. Method based on pseudo likelihood
If we apply the pseudo likelihood method of Section 3.2 to the general case where the model p(Y | U1, U2, Z) is assumed as in (14), the estimator of β can be derived similarly. We introduce the similar notation
and
In general, the estimator is the maximizer of the objective function
We also briefly present both parametric and nonparametric approaches for handling p(u1, u2 | z). Parametrically we have
| (17) |
where
Under regularity conditions similar to those discussed in Section 3.2, the asymptotic property of the estimator obtained from (17) is similar to Theorem 2, i.e.,
where further notation and more details are omitted.
Alternatively, we consider a nonparametric approach for Hi(β, F), where the essential component
can be empirically estimated by
Afterwards, an asymptotic property of the resulting estimator similar to Theorem 2 can also be derived.
5. Analyses on children’s mental health study
In this Section, we present the analysis results from the children’s mental health study, introduced in Section 2.
We first consider the model studied in Section 3. For this model, we compare our proposed conditional likelihood method (Method 1) and pseudo likelihood method (Method 2) to the complete-case analysis method (CC), the likelihood method assuming a parametric nonignorable mechanism (NI), and the auxiliary variable method proposed in Ibrahim et al. (2001) (AUX). In Table 1, we summarize the comparison results of these five methods for parameter estimate, standard error, z-statistic and p-value.
Table 1:
Comparison of five methods for the estimates and standard errors of the three parameters in the model considered in Section 3.
| Effect | Parameter | Method | Estimate | Standard Error | z-statistic | p-value |
|---|---|---|---|---|---|---|
| Intercept | β0 | CC | −1.7372 | 0.1070 | −16.2358 | 0.0000 |
| Intercept | β0 | NI | −0.6410 | 0.5170 | −1.2398 | 0.2150 |
| Intercept | β0 | AUX | −1.7030 | 0.1050 | −16.2190 | 0.0000 |
| Intercept | β0 | Method 1 | −1.6626 | 0.1238 | −13.4324 | 0.0000 |
| Intercept | β0 | Method 2 | −0.7243 | 0.3856 | −1.8784 | 0.0603 |
| Health | β1 | CC | 0.2465 | 0.1380 | 1.7863 | 0.0740 |
| Health | β1 | NI | 0.1150 | 0.1420 | 0.8099 | 0.4180 |
| Health | β1 | AUX | 0.1810 | 0.1370 | 1.3212 | 0.1864 |
| Health | β1 | Method 1 | 0.4214 | 0.6554 | 0.6429 | 0.5203 |
| Health | β1 | Method 2 | 0.2848 | 0.5123 | 0.5560 | 0.5782 |
| Father | β2 | CC | 0.5419 | 0.1607 | 3.3724 | 0.0007 |
| Father | β2 | NI | 0.5450 | 0.1610 | 3.3851 | 0.0007 |
| Father | β2 | AUX | 0.5120 | 0.1580 | 3.2405 | 0.0012 |
| Father | β2 | Method 1 | 0.5411 | 0.1243 | 4.3535 | 0.0000 |
| Father | β2 | Method 2 | 0.5424 | 0.1231 | 4.4061 | 0.0000 |
From Table 1, the estimate of the effect father is approximately the same no matter which method is used. For the effect health, the methods NI and AUX give smaller estimates than the method CC, suggesting that CC is biased upward, whereas the proposed methods give larger estimates, suggesting that CC is biased in the opposite direction. We believe that the direction indicated by our proposed methods is more plausible, since a poorer health condition would result in a much higher probability of clinical psychopathology in the teacher’s report.
We then consider the model studied in Section 4. For this model, since there is a potential misspecification issue for p(u1 | u2, z)p(u2 | z), we consider different modeling strategies. For either Method 1 (conditional likelihood based) or Method 2 (pseudo likelihood based), we take each of the models p(u1 | u2, z) and p(u2 | z) to be a logistic regression model or a probit regression model, which gives eight parametric modeling approaches. We also consider nonparametric modeling approaches in both Method 1 and Method 2. We compare our proposed methodologies to the complete-case analysis, the CC method. The comparison of all eleven methods is summarized in Table 2, in which “logit-probit” indicates that the link function for p(u1 | u2, z) is logit and that for p(u2 | z) is probit.
Table 2:
Comparison of eleven methods for the estimates and standard errors of the four parameters in the model considered in Section 4.
| Effect | Parameter | Modeling approach | Method | Estimate | Standard Error | z-statistic | p-value |
|---|---|---|---|---|---|---|---|
| Intercept | β0 | | CC | −1.9307 | 0.1132 | −17.0618 | 0.0000 |
| Intercept | β0 | Nonparametric | Method 1 | −1.8727 | 0.2625 | −7.1338 | 0.0000 |
| Intercept | β0 | Nonparametric | Method 2 | −1.3750 | 0.2867 | −4.7966 | 0.0000 |
| Intercept | β0 | logit-logit | Method 1 | −1.6218 | 0.3392 | −4.7813 | 0.0000 |
| Intercept | β0 | logit-logit | Method 2 | −1.3750 | 0.3055 | −4.5001 | 0.0000 |
| Intercept | β0 | logit-probit | Method 1 | −1.4353 | 0.5499 | −2.6100 | 0.0091 |
| Intercept | β0 | logit-probit | Method 2 | −1.1340 | 0.2969 | −3.8195 | 0.0001 |
| Intercept | β0 | probit-logit | Method 1 | −0.6827 | 0.4770 | −1.4314 | 0.1523 |
| Intercept | β0 | probit-logit | Method 2 | −0.7217 | 0.3002 | −2.4040 | 0.0162 |
| Intercept | β0 | probit-probit | Method 1 | −0.5156 | 0.6188 | −0.8331 | 0.4048 |
| Intercept | β0 | probit-probit | Method 2 | −0.5073 | 0.3077 | −1.6489 | 0.0992 |
| Health | β1 | | CC | −0.0516 | 0.1480 | −0.3487 | 0.7273 |
| Health | β1 | Nonparametric | Method 1 | −0.9972 | 0.4704 | −2.1198 | 0.0340 |
| Health | β1 | Nonparametric | Method 2 | −0.9814 | 0.3622 | −2.7098 | 0.0067 |
| Health | β1 | logit-logit | Method 1 | −1.2819 | 0.6980 | −1.8364 | 0.0663 |
| Health | β1 | logit-logit | Method 2 | −0.9814 | 0.4008 | −2.4484 | 0.0144 |
| Health | β1 | logit-probit | Method 1 | −1.2680 | 0.5530 | −2.2929 | 0.0219 |
| Health | β1 | logit-probit | Method 2 | −1.2370 | 0.4190 | −2.9526 | 0.0032 |
| Health | β1 | probit-logit | Method 1 | −2.2843 | 0.5662 | −4.0343 | 0.0001 |
| Health | β1 | probit-logit | Method 2 | −2.6201 | 0.5934 | −4.4151 | 0.0000 |
| Health | β1 | probit-probit | Method 1 | −2.4730 | 0.5266 | −4.6965 | 0.0000 |
| Health | β1 | probit-probit | Method 2 | −3.4659 | 0.7197 | −4.8157 | 0.0000 |
| Father | β2 | | CC | 0.3652 | 0.1690 | 2.1608 | 0.0307 |
| Father | β2 | Nonparametric | Method 1 | 0.0526 | 0.5258 | 0.1000 | 0.9203 |
| Father | β2 | Nonparametric | Method 2 | −0.0699 | 0.4460 | −0.1567 | 0.8755 |
| Father | β2 | logit-logit | Method 1 | −0.4041 | 0.9355 | −0.4319 | 0.6658 |
| Father | β2 | logit-logit | Method 2 | −0.0700 | 0.5384 | −0.1300 | 0.8966 |
| Father | β2 | logit-probit | Method 1 | −2.2393 | 0.9099 | −2.4610 | 0.0139 |
| Father | β2 | logit-probit | Method 2 | −1.5588 | 0.7286 | −2.1393 | 0.0324 |
| Father | β2 | probit-logit | Method 1 | −0.1721 | 0.7500 | −0.2294 | 0.8185 |
| Father | β2 | probit-logit | Method 2 | −0.5305 | 0.6475 | −0.8193 | 0.4126 |
| Father | β2 | probit-probit | Method 1 | −1.3461 | 0.7213 | −1.8662 | 0.0620 |
| Father | β2 | probit-probit | Method 2 | −1.5535 | 0.7480 | −2.0770 | 0.0378 |
| Parent’s Report | β3 | | CC | 1.4621 | 0.1583 | 9.2380 | 0.0000 |
| Parent’s Report | β3 | Nonparametric | Method 1 | 1.4085 | 0.1356 | 10.3902 | 0.0000 |
| Parent’s Report | β3 | Nonparametric | Method 2 | 1.4687 | 0.1366 | 10.7523 | 0.0000 |
| Parent’s Report | β3 | logit-logit | Method 1 | 1.4043 | 0.1370 | 10.2476 | 0.0000 |
| Parent’s Report | β3 | logit-logit | Method 2 | 1.4687 | 0.1376 | 10.6736 | 0.0000 |
| Parent’s Report | β3 | logit-probit | Method 1 | 1.4149 | 0.1340 | 10.5569 | 0.0000 |
| Parent’s Report | β3 | logit-probit | Method 2 | 1.4562 | 0.1311 | 11.1084 | 0.0000 |
| Parent’s Report | β3 | probit-logit | Method 1 | 1.4122 | 0.1371 | 10.3007 | 0.0000 |
| Parent’s Report | β3 | probit-logit | Method 2 | 1.4733 | 0.1320 | 11.1602 | 0.0000 |
| Parent’s Report | β3 | probit-probit | Method 1 | 1.4170 | 0.1322 | 10.7186 | 0.0000 |
| Parent’s Report | β3 | probit-probit | Method 2 | 1.4475 | 0.1265 | 11.4411 | 0.0000 |
The comparison results from Table 2 are also meaningful. Firstly, no matter which method is used, the estimates of the effect parent’s report are roughly the same, and the corresponding p-value is always reported as 0.0000. This indicates that the adjusted association between the teacher’s report and the parent’s report indeed exists and is statistically significant. Secondly, although the estimates of the effect father differ somewhat among the proposed methods, the nonparametric approach shows that the effect is almost zero and not significant. Thirdly, in contrast to the CC method, the proposed methods present a statistically significant effect at the 0.05 level for the variable health, with the single exception of Method 1 under the logit-logit specification (p-value 0.0663). This observation suggests that the CC method is heavily biased.
We also want to report that, across our numerical studies, the standard error estimates based on the asymptotic results are sensitive to the initial values of the unknown parameters. Therefore, we pursue a nonparametric Bootstrap approach instead. To investigate how many Bootstrap samples are sufficient in our setting, we illustrate the relation between the standard error estimates and the number of Bootstrap samples, as shown in Figure 1 for the model considered in Section 3, and in Figure 2 for the nonparametric approach for the model considered in Section 4. It can be seen that, in general, 200 Bootstrap samples are sufficient. In each situation of our numerical investigations, we draw 300 Bootstrap samples.
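The diagnostic described here, tracking the Bootstrap standard error as the number of resamples grows, can be sketched generically. The code below is an illustration only: the function name `bootstrap_se_path`, the toy data, and the use of a sample mean as the estimator are all assumptions made for the example, and in practice the estimator would be one of the proposed methods applied to the resampled subjects.

```python
import random
import statistics


def bootstrap_se_path(data, estimator, n_boot, seed=0):
    """Running nonparametric Bootstrap standard errors.

    Returns a list whose k-th entry is the standard deviation of the first
    k + 2 Bootstrap estimates, so the list has n_boot - 1 entries.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates, path = [], []
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping the original sample size.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
        if len(estimates) >= 2:
            path.append(statistics.stdev(estimates))
    return path


def sample_mean(xs):
    return sum(xs) / len(xs)


# Hypothetical toy data: 400 binary outcomes, 30% of which equal one.
toy_data = [1] * 120 + [0] * 280
se_path = bootstrap_se_path(toy_data, sample_mean, n_boot=300, seed=1)
print(len(se_path), se_path[-1])
```

Plotting `se_path` against the number of resamples gives the kind of curve that can be inspected for stabilization.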
Figure 1:
![Figure 1](figure_1_sim7943.jpg)
Change in the standard errors of the three parameters in the model considered in Section 3 as the number of Bootstrap samples increases, for the proposed Method 1 and Method 2.
Figure 2:
![Figure 2](figure_2_sim7943.jpg)
Change in the standard errors of the four parameters in the model considered in Section 4 as the number of Bootstrap samples increases, for the proposed nonparametric Method 1 and Method 2.
6. Simulation studies
We examine the finite sample performance of the proposed methods using simulation studies where the data generating process is known.
For the model considered in Section 3, we first generate the binary variable U2 from a Bernoulli distribution with P(U2 = 1) = 0.6, then we generate the binary variable U1 from logit{P(U1 = 1 | U2)} = γ0 + γ1U2 with γ0 = 1 and γ1 = 0.5. The outcome Y is generated from logit{P(Y = 1 | U1, U2)} = β0 + β1U1 + β2U2 with β = (−0.5, 0.1, 3)T. The missingness indicator is generated from logit{P(R = 1 | Y, U1; θ)} = θ0 + θ1Y + θ2U1 with θ = (−2.20, 3.58, 0.81)T, so that approximately 70% of the subjects are completely observed. We consider the total sample size N = 2,000. We compare the two proposed methods to the method using all data, termed the benchmark, and the CC method. Based on 500 simulation replicates, the results for the estimation bias, the Monte Carlo approximation of the standard deviation, the estimated standard error, and the coverage probability are summarized in Table 3.
Table 3:
Comparison of four methods for the estimation bias (Bias), Monte Carlo approximation of standard deviation (SD), estimated standard error (SE), and coverage probability (CP) of the three parameters in the model considered in Section 3.
| Parameter | Method | Bias | SD | SE | CP |
|---|---|---|---|---|---|
| β0 | Benchmark | 0.0055 | 0.1218 | 0.1273 | 0.9600 |
| β0 | CC | 2.1164 | 0.3046 | 0.2898 | 0.0000 |
| β0 | Method 1 | 0.0029 | 0.0113 | 0.0124 | 0.9480 |
| β0 | Method 2 | 0.0049 | 0.2417 | 0.2293 | 0.9560 |
| β1 | Benchmark | −0.0031 | 0.1358 | 0.1423 | 0.9620 |
| β1 | CC | −0.6033 | 0.3240 | 0.3139 | 0.5200 |
| β1 | Method 1 | 0.0224 | 0.3086 | 0.2942 | 0.9460 |
| β1 | Method 2 | 0.0006 | 0.2837 | 0.2729 | 0.9460 |
| β2 | Benchmark | 0.0001 | 0.1352 | 0.1348 | 0.9300 |
| β2 | CC | 0.0370 | 0.3146 | 0.2963 | 0.9320 |
| β2 | Method 1 | 0.0329 | 0.3077 | 0.2996 | 0.9420 |
| β2 | Method 2 | 0.0397 | 0.3157 | 0.3034 | 0.9480 |
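The pattern of the CC biases in Table 3 can be anticipated from the design itself. Because the selection probability depends on (Y, U1) only, the complete-case odds of Y = 1 equal the true odds multiplied by P(R = 1 | Y = 1, u1)/P(R = 1 | Y = 0, u1), which shifts the intercept and the coefficient of U1 but leaves the coefficient of U2 unchanged. The sketch below computes these large-sample shifts and the expected fraction of complete cases by exact enumeration; the function names are hypothetical.

```python
import math
from itertools import product

# Data generating values of the Section 3 simulation design.
BETA = (-0.5, 0.1, 3.0)
GAMMA = (1.0, 0.5)
THETA = (-2.20, 3.58, 0.81)
P_U2 = 0.6


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def bern(p, v):
    return p if v == 1 else 1.0 - p


def p_observed(y, u1):
    """P(R = 1 | y, u1) under the logistic mechanism."""
    return expit(THETA[0] + THETA[1] * y + THETA[2] * u1)


def observed_fraction():
    """Exact P(R = 1), summing over all (y, u1, u2) cells."""
    total = 0.0
    for y, u1, u2 in product((0, 1), repeat=3):
        cell = bern(P_U2, u2)
        cell *= bern(expit(GAMMA[0] + GAMMA[1] * u2), u1)
        cell *= bern(expit(BETA[0] + BETA[1] * u1 + BETA[2] * u2), y)
        total += cell * p_observed(y, u1)
    return total


def cc_limit_shift():
    """Large-sample shift of the CC estimates of (beta0, beta1, beta2)."""
    def log_ratio(u1):
        return math.log(p_observed(1, u1) / p_observed(0, u1))
    return (log_ratio(0), log_ratio(1) - log_ratio(0), 0.0)


frac = observed_fraction()
shift = cc_limit_shift()
print(round(frac, 3), [round(s, 3) for s in shift])
```

Under these assumptions the shifts come out near 2.08 for β0 and −0.57 for β1, with none for β2, which is broadly in line with the CC biases reported in Table 3 (2.1164, −0.6033 and 0.0370).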
Similarly, for the model considered in Section 4, we first generate the binary variable Z from a Bernoulli distribution with P(Z = 1) = 0.6, then we generate the binary variable U2 from logit{P(U2 = 1 | Z)} = γ3 + γ4Z with γ3 = 1 and γ4 = 0.5, and the binary variable U1 from logit{P(U1 = 1 | U2, Z)} = γ0 + γ1U2 + γ2Z with γ0 = −1, γ1 = 0.5 and γ2 = 0.5. The outcome Y is generated from logit{P(Y = 1 | U1, U2, Z)} = β0 + β1U1 + β2U2 + β3Z with β = (−0.5, 0.1, 0.1, 3)T. The missingness indicator is generated from logit{P(R = 1 | Y, U1, U2; θ)} = θ0 + θ1Y + θ2U1 + θ3U2 with θ = (−2.20, 3.58, 0.46, 0.46)T, so that again approximately 70% of the subjects are completely observed. We consider the total sample size N = 2,000. We compare the two proposed methods, under both parametric and nonparametric modeling approaches, to the benchmark method and the CC method. Based on 500 simulation replicates, the corresponding results are summarized in Table 4.
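The data generating steps for this design can be sketched as follows. This is an illustrative generator, not the authors' simulation code: the function name `simulate_section4` and the record layout are hypothetical, and the coefficient of Z in the outcome model is taken to be the last entry of β, namely 3.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_section4(n, seed=0):
    """Draw n subjects from the Section 4 design; y is None when r == 0."""
    rng = random.Random(seed)
    beta = (-0.5, 0.1, 0.1, 3.0)
    theta = (-2.20, 3.58, 0.46, 0.46)
    rows = []
    for _ in range(n):
        z = int(rng.random() < 0.6)
        u2 = int(rng.random() < expit(1.0 + 0.5 * z))
        u1 = int(rng.random() < expit(-1.0 + 0.5 * u2 + 0.5 * z))
        y = int(rng.random() < expit(beta[0] + beta[1] * u1 + beta[2] * u2 + beta[3] * z))
        # The mechanism depends on (y, u1, u2) but not on z, as in assumption (15).
        r = int(rng.random() < expit(theta[0] + theta[1] * y + theta[2] * u1 + theta[3] * u2))
        rows.append({"z": z, "u1": u1, "u2": u2, "y": y if r == 1 else None, "r": r})
    return rows


sample = simulate_section4(2000, seed=2024)
observed_rate = sum(row["r"] for row in sample) / len(sample)
print(round(observed_rate, 3))
```

The observed rate from such a draw can be compared with the roughly 70% of completely observed subjects stated for this design.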
Table 4:
Comparison of six methods for the estimation bias (Bias), Monte Carlo approximation of standard deviation (SD), estimated standard error (SE), and coverage probability (CP) of the four parameters in the model considered in Section 4.
| Parameter | Method | Bias | SD | SE | CP |
|---|---|---|---|---|---|
| β0 | Benchmark | −0.0006 | 0.1428 | 0.1320 | 0.9420 |
| | CC | 2.1102 | 0.3041 | 0.2872 | 0.0000 |
| | Method 1-Parametric | 0.0226 | 0.2069 | 0.1978 | 0.9696 |
| | Method 1-Nonparametric | 0.0097 | 0.0729 | 0.0773 | 0.9760 |
| | Method 2-Parametric | −0.0028 | 0.2666 | 0.2615 | 0.9520 |
| | Method 2-Nonparametric | 0.0012 | 0.2077 | 0.2034 | 0.9600 |
| β1 | Benchmark | 0.0083 | 0.1245 | 0.1274 | 0.9520 |
| | CC | −0.3222 | 0.2319 | 0.2408 | 0.7300 |
| | Method 1-Parametric | 0.0548 | 0.4122 | 0.4172 | 0.9798 |
| | Method 1-Nonparametric | 0.0201 | 0.2573 | 0.2509 | 0.9760 |
| | Method 2-Parametric | 0.0101 | 0.2514 | 0.2614 | 0.9640 |
| | Method 2-Nonparametric | 0.0044 | 0.1790 | 0.1880 | 0.9600 |
| β2 | Benchmark | 0.0020 | 0.1555 | 0.1434 | 0.9280 |
| | CC | −0.3428 | 0.3219 | 0.3050 | 0.8240 |
| | Method 1-Parametric | 0.0659 | 0.5004 | 0.4979 | 0.9696 |
| | Method 1-Nonparametric | 0.0277 | 0.2987 | 0.3204 | 0.9539 |
| | Method 2-Parametric | 0.0040 | 0.3084 | 0.2993 | 0.9480 |
| | Method 2-Nonparametric | 0.0004 | 0.2239 | 0.2197 | 0.9440 |
| β3 | Benchmark | 0.0000 | 0.1404 | 0.1372 | 0.9460 |
| | CC | 0.0209 | 0.3209 | 0.3107 | 0.9340 |
| | Method 1-Parametric | 0.0191 | 0.3149 | 0.3130 | 0.9393 |
| | Method 1-Nonparametric | 0.0176 | 0.3159 | 0.3147 | 0.9479 |
| | Method 2-Parametric | 0.0213 | 0.3167 | 0.3168 | 0.9480 |
| | Method 2-Nonparametric | 0.0206 | 0.3191 | 0.3197 | 0.9500 |
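For readers who wish to reproduce the four summary columns, the following Python sketch shows one common way to compute them from per-replicate results. It is not taken from the paper: the function name `mc_summary` is hypothetical, SE is assumed to be the average of the estimated standard errors across replicates, and CP is assumed to be the coverage of nominal 95% Wald intervals.

```python
import numpy as np


def mc_summary(estimates, std_errors, truth, z=1.96):
    """Monte Carlo summaries for one parameter (hypothetical helper).

    estimates  : point estimates, one per simulation replicate
    std_errors : estimated standard errors, one per replicate
    truth      : true parameter value used to generate the data
    z          : normal quantile for the Wald interval (1.96 for 95%)
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)

    bias = est.mean() - truth      # Bias: average estimate minus the truth
    sd = est.std(ddof=1)           # SD: empirical std. dev. of the estimates
    mean_se = se.mean()            # SE: average estimated standard error
    # CP: share of replicates whose interval est +/- z * se covers the truth
    cp = np.mean(np.abs(est - truth) <= z * se)

    return {"Bias": bias, "SD": sd, "SE": mean_se, "CP": cp}


if __name__ == "__main__":
    # Toy numbers for illustration only; these are not results from the paper.
    print(mc_summary([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.05], truth=1.0))
```

When SD and SE are close, the variance estimator tracks the true sampling variability, which is the pattern to look for when reading the table.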
The conclusions from Tables 3 and 4 are quite clear. First, the benchmark method and all the proposed methods are essentially unbiased and have coverage probabilities of around 95% in every scenario, which agrees well with our theoretical investigation of the proposed methodology. Second, in most situations, the CC method is severely biased and hence has very poor coverage probability, so simply using the CC method is generally incorrect. Third, in Table 4, since the parametric model is correctly specified in our proposed methods, its performance is very similar to that of the nonparametric approach, although it is in general less efficient than the nonparametric approach. Finally, neither Table 3 nor Table 4 gives a definitive answer as to which of the two proposed methods based on unconventional likelihoods is more efficient.
7. Discussion
The limitation of modeling the nonignorable missingness mechanism parametrically motivates us to develop methods based on unconventional likelihoods, in which the assumptions imposed on the mechanism model are kept to a minimum. Both methods presented in this paper have deep roots in fundamental statistical methodology: the conditional likelihood based method relies on the decomposition into rank statistics and order statistics, while the pseudo likelihood based method relies on conditional independence and biased sampling.
From the data perspective, our work stems from a children’s mental health study in which all the variables are binary. Binary data, or categorical data in general, may not be as easy to analyze as continuous data. For instance, one should pay extra attention to identifiability and model misspecification when the data are binary. Our proposed methods, however, place no restrictions on data types. Although nuances exist as to model identifiability and model misspecification, each of our proposed methods applies equally to categorical and continuous data.
In each of our proposed methods, β is the parameter of interest and γ is the nuisance parameter. Our proposal entails a first-stage estimation procedure for γ; the parameter β is then estimated by maximizing the conditional likelihood or the pseudo likelihood with γ replaced by its first-stage estimate. One could certainly consider maximizing lc(β, γ) directly, or locating the estimators of β and γ simultaneously, without a first-stage procedure for the nuisance parameter. A possible strength of doing so is that it might improve the estimation efficiency of β, but this is not clear and warrants further careful investigation. A definite limitation is that it increases the computational burden, since the optimization involves a higher-dimensional parameter.
Last but not least, one potential limitation of our proposed methods is the sample size requirement. Our simulation settings, with sample size N = 2,000, mimic the children’s mental health study. We also carried out experiments with a smaller sample size, e.g., N = 200, but the estimates showed some numerical bias. This phenomenon has been noted in the literature; for instance, Zhao (2017) studied a resampling based procedure to reduce the estimation bias in a similar context.
Acknowledgements
The authors would like to thank the editor, the associate editor and two anonymous referees for their constructive comments, which have led to a significantly improved paper. This work was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under award number UL1TR001412. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Appendix
Proof of Theorem 1. Let γ̂ represent the MLE of γ,
We first develop the asymptotic properties of γ̂. We start from the estimating equation
By Taylor expansion,
Thus,
As a result,
Once the asymptotic properties of the estimator of the nuisance parameter γ are known, we can develop the asymptotic properties of the estimator of β. We can obtain the estimator of β by solving
which is equivalent to
| (18) |
Specifically,
| (19) |
by Taylor expansion. Similarly,
| (20) |
Plugging (19) and (20) into (18), we obtain the following equation
| (21) |
As , (21) is equivalent to
Thus,
| (22) |
In addition, we need to form a projection of in (22) through
and
To sum up, (22) can be formed as
| (23) |
and this completes the proof. □
Proof of Theorem 2. By Taylor’s expansion and , we have
| (24) |
and
| (25) |
where
is a V-statistic based on data wi = (ri, yi, u1i, u2i) and the following kernel function
Let
which does not depend on ri, yi or u1i, and will be denoted by v1(u2i; β0, γ0, F0). From the theory of V-statistics, we have
| (26) |
Under the given conditions, we have
and this completes the proof. □
References
- Chen J and Fang F (2019), “Semiparametric likelihood for estimating equations with non-ignorable non-response by non-response instrument,” Journal of Nonparametric Statistics, 1–15.
- Fang F and Shao J (2016), “Model selection with nonignorable nonresponse,” Biometrika, 103, 861–874.
- Ibrahim JG, Chu H, and Chen M-H (2012), “Missing data in clinical studies: issues and methods,” Journal of Clinical Oncology, 30, 3297–3303.
- Ibrahim JG, Lipsitz SR, and Horton N (2001), “Using auxiliary data for parameter estimation with non-ignorably missing outcomes,” Journal of the Royal Statistical Society: Series C (Applied Statistics), 50, 361–373.
- Liang K-Y and Qin J (2000), “Regression analysis under non-standard situations: a pairwise pseudolikelihood approach,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 773–786.
- Little RJ, D’agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, Frangakis C, Hogan JW, Molenberghs G, Murphy SA, et al. (2012), “The prevention and treatment of missing data in clinical trials,” New England Journal of Medicine, 367, 1355–1360.
- Little RJ and Rubin DB (2002), Statistical Analysis with Missing Data, Wiley, 2nd ed.
- Zahner GE and Daskalakis C (1997), “Factors associated with mental health, general health, and school-based service use for child psychopathology,” American Journal of Public Health, 87, 1440–1448.
- Zahner GE, Jacobs JH, Freeman DH, and Trainor KF (1993), “Rural-urban child psychopathology in a northeastern US state: 1986–1989,” Journal of the American Academy of Child & Adolescent Psychiatry, 32, 378–387.
- Zahner GE, Pawelkiewicz W, DeFrancesco JJ, and Adnopoz J (1992), “Children’s mental health service needs and utilization patterns in an urban community: An epidemiological assessment,” Journal of the American Academy of Child & Adolescent Psychiatry, 31, 951–960.
- Zhao J (2017), “Reducing bias for maximum approximate conditional likelihood estimator with general missing data mechanism,” Journal of Nonparametric Statistics, 29, 577–593.
- Zhao J and Ma Y (2018), “Optimal pseudolikelihood estimation in the analysis of multivariate missing data with nonignorable nonresponse,” Biometrika, 105, 479–486.
- Zhao J and Shao J (2015), “Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data,” Journal of the American Statistical Association, 110, 1577–1590.
- — (2017), “Approximate conditional likelihood for generalized linear models with general missing data mechanism,” Journal of Systems Science and Complexity, 30, 139–153.
- Zhao J, Yang Y, and Ning Y (2018), “Penalized pairwise pseudo likelihood for variable selection with nonignorable missing data,” Statistica Sinica, 28, 2125–2148.
- Zhao P, Tang N, Qu A, and Jiang D (2017), “Semiparametric estimating equations inference with nonignorable missing data,” Statistica Sinica, 89–113.
