SUMMARY
Tang et al. (2003) considered a regression model with missing response, where the missingness mechanism depends on the value of the response variable and hence is nonignorable. They proposed three pseudolikelihood estimators, based on different treatments of the probability distribution of the completely observed covariates. The first assumes the distribution of the covariate to be known, the second estimates this distribution parametrically, and the third estimates the distribution nonparametrically. While it is not hard to show that the second estimator is more efficient than the first, Tang et al. (2003) only conjectured that the third estimator is more efficient than the first two. In this paper, we investigate the asymptotic behaviour of the third estimator by deriving a closed-form representation of its asymptotic variance. We then prove that the third estimator is more efficient than the other two. Our result can be straightforwardly applied to missingness mechanisms that are more general than that in Tang et al. (2003).
Keywords: Efficiency, Missing data, Nonignorable nonresponse, Pseudolikelihood estimator
1. Introduction
Tang et al. (2003) considered multivariate regression analysis of a $p$-dimensional response $Y$ on a $q$-dimensional covariate $X$, with the joint density function of $(Y, X)$ factorized as $f(y \mid x; \theta) g(x)$, where $g(x)$ represents the marginal density function of $X$ and the estimation of $\theta$ is of main interest.
They considered the situation where $X$ is fully observed but $Y$ has missing values. Let $R = 1$ if $Y$ is completely observed and $R = 0$ otherwise. Tang et al. (2003) assumed that the missing data mechanism depends only on the underlying value of the response and hence is nonignorable,
$$\mathrm{pr}(R = 1 \mid Y, X) = \mathrm{pr}(R = 1 \mid Y). \tag{1}$$
They proposed estimators built on the fact that $R$ and $X$ are conditionally independent given $Y$, so the completely observed subjects form a random sample from the distribution of $X$ given $Y$.
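The step from the conditional independence to the sampling statement is a one-line application of Bayes' rule. The display below is a sketch in assumed notation: $Y$ is the response, $X$ the covariate, $R$ the response indicator and $p(x \mid y)$ the conditional density of $X$ given $Y$.

```latex
\begin{align*}
p(x \mid y, R = 1)
  &= \frac{\mathrm{pr}(R = 1 \mid y, x)\, p(x \mid y)}{\mathrm{pr}(R = 1 \mid y)} \\
  &= \frac{\mathrm{pr}(R = 1 \mid y)\, p(x \mid y)}{\mathrm{pr}(R = 1 \mid y)}
   = p(x \mid y).
\end{align*}
```

The second equality uses the assumption that the response probability does not depend on $X$ once $Y$ is given, so the unknown response probability cancels and never has to be modelled.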
The missing data mechanism (1) has been widely adopted in response-biased sampling (Brown, 1990; Liang & Qin, 2000; Chen, 2001) and is relevant in many applications. For example, Chen (2001) studied a case of a univariate response $Y$ and a multivariate covariate $X$, where the observed data form a nonrandom sample from $(Y, X)$, with the sampling probability depending only on $Y$. Therefore, the observed data can be viewed as a random sample from the distribution of $X$ given $Y$, instead of from the original regression model (Chen, 2001). Assumption (1) is also sensible in other situations. For example, when evaluating a new biomarker in analytical chemistry, scientists usually encounter a laboratory quality control limit, the so-called detection limit (Navas-Acien et al., 2008; Caldwell et al., 2009; Carter et al., 2016), defined as the lowest concentration of analyte distinguishable from the background noise. Although theoretically available, concentration values below the detection limit are usually not released by laboratories. When the concentration value $Y$ is the outcome of interest and needs to be regressed against covariates $X$, assumption (1) is satisfied, with $R = I(Y \geq c)$, where $c$ is the detection limit. Eliminating observations with values below the detection limit can lead to severe bias; see Hopke et al. (2001), Moulton et al. (2002), Richardson & Ciampi (2003), Schisterman et al. (2006) and the references therein. Other examples where (1) is valid include survey sampling (Deville & Särndal, 1992; Deville, 2000; Kott, 2014), case-control studies (Chen, 2007), and some survival analysis contexts. Tang et al. (2003) also discussed some extensions of the assumption (1).
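The bias caused by discarding values below a detection limit can be seen in a small simulation. The sketch below is purely illustrative and is not taken from the paper: the linear model with slope 2, the standard normal covariate and error, the detection limit of 0 and the helper names `ols_slope` and `detection_limit_demo` are all assumptions made for the example.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs, with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def detection_limit_demo(n=5000, slope=2.0, limit=0.0, seed=0):
    """Compare the full-data slope with the slope fitted after dropping y < limit."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, 1.0) for x in xs]
    # Hypothetical laboratory rule: responses below the limit are not released.
    kept = [(x, y) for x, y in zip(xs, ys) if y >= limit]
    full = ols_slope(xs, ys)
    complete_case = ols_slope([x for x, _ in kept], [y for _, y in kept])
    return full, complete_case, len(kept)


full_slope, complete_case_slope, n_kept = detection_limit_demo()
print("full-data slope:", round(full_slope, 3))
print("complete-case slope:", round(complete_case_slope, 3), "from", n_kept, "observations")
```

In this design the complete-case slope is attenuated towards zero, because whether an observation is retained depends on the response itself rather than on the covariate.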
The novelty of the idea in Tang et al. (2003) has led to recent developments such as Kim & Yu (2011), Zhao & Shao (2015), Shao & Wang (2016) and Miao & Tchetgen Tchetgen (2016). In this paper, we first present our results under assumption (1) and then show that they can be straightforwardly applied to more general missingness mechanisms.
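The estimation of $\theta$ is based on independent and identically distributed observations $(y_i, x_i, r_i = 1)$ for $i = 1, \ldots, m$ and $(x_i, r_i = 0)$ for $i = m + 1, \ldots, n$. Based on (1), we estimate $\theta$ through maximizing the likelihood of the parameters of $p(x \mid y)$ based on the complete observations,
$$\prod_{i=1}^{m} p(x_i \mid y_i) = \prod_{i=1}^{m} \frac{f(y_i \mid x_i; \theta)\, g(x_i)}{\int f(y_i \mid x; \theta)\, g(x)\, dx}.$$
Equivalently, since the factors $g(x_i)$ in the numerator do not involve $\theta$, we estimate $\theta$ by maximizing the complete-case conditional loglikelihood
$$\ell(\theta, g) = \sum_{i=1}^{m} \log f(y_i \mid x_i; \theta) - \sum_{i=1}^{m} \log \int f(y_i \mid x; \theta)\, g(x)\, dx, \tag{2}$$
which contains $g$, the unspecified probability density function of $X$. If the true $g$, $g_0$, is known, the corresponding estimator of $\theta$, denoted by $\hat\theta_0$, is the maximizer of $\ell(\theta, g_0)$. If $g$ is unknown, with $X$ fully observed, any appropriate complete-data technique can be applied to estimate $g$. Tang et al. (2003) considered two situations and obtained two pseudolikelihood estimators of $\theta$ (Gong & Samaniego, 1981; Parke, 1986). The first is when a parametric model $g(x; \alpha)$ is adopted and the full-data maximum likelihood estimator $\hat\alpha$ is used to obtain $\hat g(x) = g(x; \hat\alpha)$. Then $\hat g$ is used to replace $g$ in (2), which leads to the estimator $\hat\theta_P$, the maximizer of $\ell(\theta, \hat g)$. The second is when $g$ is unspecified and its cumulative distribution $G$ is estimated by its empirical version $G_n$. This gives the estimator $\hat\theta_N$, the maximizer of
$$\ell(\theta, G_n) = \sum_{i=1}^{m} \log f(y_i \mid x_i; \theta) - \sum_{i=1}^{m} \log \Big\{ n^{-1} \sum_{j=1}^{n} f(y_i \mid x_j; \theta) \Big\}. \tag{3}$$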
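The nonparametric pseudolikelihood can be computed directly: it is the complete-case loglikelihood with the integral over the covariate distribution replaced by an average over all observed covariates, including those of subjects whose response is missing. The sketch below is a minimal illustration under assumptions that are not in the paper: a scalar working model in which the response given covariate $x$ is normal with mean $\theta x$ and unit variance, a logistic response probability depending on the response only, and the hypothetical helper names `simulate`, `pseudo_loglik` and `maximise`. A grid search stands in for a proper optimizer.

```python
import math
import random


def simulate(n=300, theta=0.5, seed=1):
    """Draw X ~ N(0, 1) and Y | X ~ N(theta * X, 1); Y is observed with
    probability expit(Y), so missingness depends on the response only."""
    rng = random.Random(seed)
    all_x, complete = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = theta * x + rng.gauss(0.0, 1.0)
        all_x.append(x)  # the covariate is always observed
        if rng.random() < 1.0 / (1.0 + math.exp(-y)):
            complete.append((x, y))  # the response is observed for this subject
    return complete, all_x


def pseudo_loglik(theta, complete, all_x):
    """Sum over complete cases of log f(y_i | x_i) - log mean_j f(y_i | x_j).
    The normal normalising constant cancels between the two terms."""
    total = 0.0
    for x_i, y_i in complete:
        exps = [-0.5 * (y_i - theta * x_j) ** 2 for x_j in all_x]
        top = max(exps)  # shift by the maximum for numerical stability
        log_mean = top + math.log(sum(math.exp(e - top) for e in exps) / len(all_x))
        total += -0.5 * (y_i - theta * x_i) ** 2 - log_mean
    return total


def maximise(complete, all_x, lo=-1.5, hi=1.5):
    """Coarse grid search followed by a finer grid around the best coarse point."""
    coarse = [lo + 0.1 * k for k in range(int(round((hi - lo) / 0.1)) + 1)]
    best = max(coarse, key=lambda t: pseudo_loglik(t, complete, all_x))
    fine = [best - 0.1 + 0.01 * k for k in range(21)]
    return max(fine, key=lambda t: pseudo_loglik(t, complete, all_x))


complete, all_x = simulate()
theta_hat = maximise(complete, all_x)
print("complete cases:", len(complete), "of", len(all_x))
print("nonparametric pseudolikelihood estimate:", round(theta_hat, 2))
```

The response probability is used only to generate the data and never enters `pseudo_loglik`, which is the practical appeal of conditioning on the response.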
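An interesting issue is the efficiency of $\hat\theta_0$, $\hat\theta_P$ and $\hat\theta_N$, the estimators obtained with the known, the parametrically estimated and the empirically estimated covariate distribution, respectively. Theorem 2 of Tang et al. (2003) established the asymptotic normality of $\hat\theta_0$ and $\hat\theta_P$ and showed that $\hat\theta_P$ is more efficient than $\hat\theta_0$. However, the authors did not give an explicit expression for the asymptotic variance of $\hat\theta_N$, so could not provide a theoretical efficiency comparison with $\hat\theta_0$ and $\hat\theta_P$. Based on simulation studies, they conjectured that $\hat\theta_N$ is more efficient than both $\hat\theta_0$ and $\hat\theta_P$.
We derive the asymptotic variance of $\hat\theta_N$ in closed form and prove that $\hat\theta_N$ is more efficient than $\hat\theta_0$ and $\hat\theta_P$; thus we establish the correctness of the conjecture in Tang et al. (2003) and provide a clear explanation of their numerical observations. We also show that, in general, no other method of estimating $g$ can lead to a more efficient estimator of $\theta$ than $\hat\theta_N$, which is recommended for use in practice.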
2. Asymptotic distribution and optimality
We use uppercase letters to denote random variables and lowercase letters to denote their realizations. We let and . Sometimes we also write and . We define , and . We let and . We also write and .
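In this section, we establish the asymptotic distribution of the nonparametric pseudolikelihood estimator $\hat\theta_N$ and briefly describe the asymptotic distributions of $\hat\theta_0$, which uses the true covariate density $g_0$, and $\hat\theta_P$, which uses a parametric estimate of it. The results for $\hat\theta_0$ and $\hat\theta_P$ can be found in Theorem 2 of Tang et al. (2003). Recall that $\hat\theta_0$ is the maximizer of $\ell(\theta, g_0)$. It is straightforward to show that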
so in distribution as . The estimator is the maximizer of . Because the maximum likelihood estimate of satisfies ,
Hence in distribution as , where . It is obvious that .
Theorem 1.
Under the conditions of Theorem 3 in Tang et al. (2003), the estimator has the asymptotic representation
so in distribution as , where
This implies that:
Corollary 1.
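The nonparametric pseudolikelihood estimator $\hat\theta_N$ is more efficient than the parametric pseudolikelihood estimator $\hat\theta_P$, and hence more efficient than $\hat\theta_0$, the estimator that uses the known covariate distribution.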
Although other nonparametric methods can be used to estimate and thus obtain alternative estimators of , doing so cannot increase the efficiency of . To see this, let denote the empirical estimator of , which results in , and let be an alternative consistent estimator of using data , which gives rise to an alternative estimator of , denoted by . The derivation in Theorem 1 yields
We write as . As regular asymptotically linear estimators of based on , and satisfy
(Huber, 1981) for some influence functions and , which we inspect in order to compare estimation efficiency. Here . Therefore
where in the last step we have used the zero-mean properties of and . This leads to
Using the technique in Corollary 1, we obtain
so is also less efficient than . Therefore the pseudolikelihood estimator is superior to any other possible parametrically or nonparametrically based estimators.
3. Extension to more general missingness mechanisms
The missing data mechanism in (1) assumes that given $Y$, $R$ and $X$ are conditionally independent. Although reasonable in many situations, this is not always true. For instance, in a randomized clinical trial comparing a treatment with a placebo, the dichotomous treatment indicator may influence the missingness. Consider a very simple scenario where $Y$ denotes the outcome, $T$ the binary treatment indicator and $X$ the covariate. We are interested in the unknown parameters $\theta$ in $f(y \mid x, t; \theta)$. Compared with (1), it is more cautious to assume that
$$\mathrm{pr}(R = 1 \mid Y, T, X) = \mathrm{pr}(R = 1 \mid Y, T). \tag{4}$$
Under (4), the methods in § 2 still apply. To see this, similar to the idea in § 1, the unknown parameter $\theta$ can be estimated based on the conditional likelihood
$$\prod_{i=1}^{m} p(x_i \mid y_i, t_i) = \prod_{i=1}^{m} \frac{f(y_i \mid x_i, t_i; \theta)\, g(x_i \mid t_i)}{\int f(y_i \mid x, t_i; \theta)\, g(x \mid t_i)\, dx}.$$
Thus the same reasoning and derivation can be applied, and we can show that the estimator for $\theta$ with $g(x \mid t)$ estimated by its empirical version under $T = 0$ and $T = 1$ separately, i.e., the counterpart of the nonparametric pseudolikelihood estimator in § 2, is also optimal among the three estimators.
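Estimating the covariate distribution separately within each treatment arm amounts to averaging the working density over the covariates observed in the same arm as the complete case. The sketch below illustrates this under assumptions that are not in the paper: a working model in which the outcome is normal with mean $\theta_x x + \theta_t t$ and unit variance, a hypothetical function name `stratified_pseudo_loglik`, and a tiny made-up data set used only to exercise the function.

```python
import math
from collections import defaultdict


def stratified_pseudo_loglik(theta, complete, all_cov):
    """theta = (theta_x, theta_t); complete holds (y, t, x) for subjects with an
    observed outcome; all_cov holds (t, x) for every subject in the study."""
    theta_x, theta_t = theta
    x_by_arm = defaultdict(list)
    for t, x in all_cov:
        x_by_arm[t].append(x)  # empirical covariate distribution within arm t
    total = 0.0
    for y, t, x in complete:
        pool = x_by_arm[t]  # only covariates from the same arm enter the average
        exps = [-0.5 * (y - theta_x * x_j - theta_t * t) ** 2 for x_j in pool]
        top = max(exps)  # shift by the maximum for numerical stability
        log_mean = top + math.log(sum(math.exp(e - top) for e in exps) / len(pool))
        total += -0.5 * (y - theta_x * x - theta_t * t) ** 2 - log_mean
    return total


# Made-up data: three subjects per arm, three of them with an observed outcome.
all_cov = [(0, -1.0), (0, 0.5), (0, 1.2), (1, -0.3), (1, 0.8), (1, 2.0)]
complete = [(0.4, 0, 0.5), (1.9, 1, 0.8), (2.5, 1, 2.0)]
value = stratified_pseudo_loglik((0.7, 0.5), complete, all_cov)
print("stratified pseudo-loglikelihood:", round(value, 4))
```

When the covariate coefficient is zero every term in the average is identical, so the objective is exactly zero whatever the treatment coefficient; this reflects that the conditional distribution of the covariate given the outcome and treatment carries no information about the treatment effect in that case.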
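We now generalize the assumption (1) to the case where the missing data indicator $R$ and some components in the covariates $X$, say $Z$, are conditionally independent given $Y$ and the remaining components of $X$, say $U$; that is,
$$\mathrm{pr}(R = 1 \mid Y, U, Z) = \mathrm{pr}(R = 1 \mid Y, U). \tag{5}$$
Here, the covariates are represented by $X = (U, Z)$, and $Z$ is called a nonresponse instrument (Zhao & Shao, 2015) or a shadow variable (Miao & Tchetgen Tchetgen, 2016). The objective function becomes
$$\sum_{i=1}^{m} \log f(y_i \mid u_i, z_i; \theta) - \sum_{i=1}^{m} \log \int f(y_i \mid u_i, z; \theta)\, g(z \mid u_i)\, dz. \tag{6}$$
The distribution of $Z$ conditional on $U$ in (6) poses extra challenges. The true $g(z \mid u)$, or a parametric or nonparametric estimator of it, can be incorporated into the estimation, resulting in counterparts of the three estimators in § 2. Theory similar to that in § 2 can be developed and leads to the same optimality of the estimator based on the nonparametric estimator of $g(z \mid u)$.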
The nonignorable missing data mechanism assumption is usually difficult to specify or verify (d’Haultfoeuille, 2010), but a nonresponse instrument is often available and assumption (5) is often reasonable. For example, in a study of children’s mental health (Zahner et al., 1992), investigators were interested in evaluating the prevalence of children with abnormal psychopathological status based on their teacher’s assessment, $Y$, which was subject to missing values. A missing teacher report may be related to the teacher’s assessment of the student even after adjusting for fully observed covariates such as physical health of the child and parental status of the household (Ibrahim et al., 2001). A separate parental report on the psychopathology of the child was also available for all children in the study. Such a report is likely to be highly correlated with that of the teacher, but is unlikely to be correlated with the teacher’s response status conditional on the teacher’s assessment of that student. Therefore, the parental assessment constitutes a valid nonresponse instrument and assumption (5) is reasonable.
Acknowledgement
We thank the editor, associate editor and three referees for their constructive comments, which have led to a significantly improved paper. This work was partially supported by the National Center for Advancing Translational Sciences of the U.S. National Institutes of Health and the U.S. National Science Foundation.
Appendix
Proof of Theorem 1.
If we estimate with its empirical distribution, we approximate by its sample average, i.e., . Also, . To obtain , we maximize (3), which is equivalent to maximizing
Thus, by the mean value theorem, there must exist some lying between and such that
Now
(A1) where, using the decomposition and representation techniques related to V-statistics (Serfling, 1980; Shao, 2003), we have
Substituting this into (A1), we get
Hence in distribution as , where
□
Proof of Corollary 1.
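To prove that the nonparametric pseudolikelihood estimator $\hat\theta_N$ is more efficient than the parametric pseudolikelihood estimator $\hat\theta_P$, note that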
where the second equality comes from the fact that
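Therefore, the nonparametric pseudolikelihood estimator $\hat\theta_N$ is also more efficient than $\hat\theta_0$, the estimator based on the known covariate distribution. □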
Supplementary material
Supplementary material available at Biometrika online contains some simulation results.
References
- Brown C. H. (1990). Protecting against nonrandomly missing data in longitudinal studies. Biometrics 46, 143–55.
- Caldwell K. L., Jones R. L., Verdon C. P., Jarrett J. M., Caudill S. P. & Osterloh J. D. (2009). Levels of urinary total and speciated arsenic in the US population: National Health and Nutrition Examination Survey 2003–2004. J. Expos. Sci. Envir. Epidemiol. 19, 59–68.
- Carter R. L., Wrabetz L., Jalal K., Orsini J. J., Barczykowski A. L., Matern D. & Langan T. J. (2016). Can psychosine and galactocerebrosidase activity predict early-infantile Krabbe’s disease presymptomatically? J. Neurosci. Res. 94, 1084–93.
- Chen H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics 63, 413–21.
- Chen K. (2001). Parametric models for response-biased sampling. J. R. Statist. Soc. B 63, 775–89.
- Deville J.-C. (2000). Generalized calibration and application to weighting for non-response. In Proceedings in Computational Statistics: 14th Symposium held in Utrecht, The Netherlands, 2000. Heidelberg: Springer, pp. 65–76.
- Deville J.-C. & Särndal C. E. (1992). Calibration estimators in survey sampling. J. Am. Statist. Assoc. 87, 376–82.
- d’Haultfoeuille X. (2010). A new instrumental method for dealing with endogenous selection. J. Economet. 154, 1–15.
- Gong G. & Samaniego F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. Ann. Statist. 9, 861–9.
- Hopke P. K., Liu C. & Rubin D. B. (2001). Multiple imputation for multivariate data with missing and below-threshold measurements: Time-series concentrations of pollutants in the Arctic. Biometrics 57, 22–33.
- Huber P. J. (1981). Robust Statistics. New York: Wiley.
- Ibrahim J. G., Lipsitz S. R. & Horton N. (2001). Using auxiliary data for parameter estimation with non-ignorably missing outcomes. Appl. Statist. 50, 361–73.
- Kim J. K. & Yu C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. J. Am. Statist. Assoc. 106, 157–65.
- Kott P. (2014). Calibration weighting when model and calibration variables can differ. In Contributions to Sampling Statistics. Cham: Springer International Publishing, pp. 1–18.
- Liang K.-Y. & Qin J. (2000). Regression analysis under non-standard situations: A pairwise pseudolikelihood approach. J. R. Statist. Soc. B 62, 773–86.
- Miao W. & Tchetgen Tchetgen E. J. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika 103, 475–82.
- Moulton L. H., Curriero F. C. & Barroso P. F. (2002). Mixture models for quantitative HIV RNA data. Statist. Meth. Med. Res. 11, 317–25.
- Navas-Acien A., Silbergeld E. K., Pastor-Barriuso R. & Guallar E. (2008). Arsenic exposure and prevalence of type 2 diabetes in US adults. J. Am. Med. Assoc. 300, 814–22.
- Parke W. R. (1986). Pseudo maximum likelihood estimation: The asymptotic distribution. Ann. Statist. 14, 355–7.
- Richardson D. B. & Ciampi A. (2003). Effects of exposure measurement error when an exposure variable is constrained by a lower limit. Am. J. Epidemiol. 157, 355–63.
- Schisterman E. F., Vexler A., Whitcomb B. W. & Liu A. (2006). The limitations due to exposure detection limits for regression models. Am. J. Epidemiol. 163, 374–83.
- Serfling R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley.
- Shao J. (2003). Mathematical Statistics. New York: Springer, 2nd ed.
- Shao J. & Wang L. (2016). Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika 103, 175–87.
- Tang G., Little R. J. & Raghunathan T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika 90, 747–64.
- Zahner G. E., Pawelkiewicz W., DeFrancesco J. J. & Adnopoz J. (1992). Children’s mental health service needs and utilization patterns in an urban community: An epidemiological assessment. J. Am. Acad. Child Adolesc. Psychiat. 31, 951–60.
- Zhao J. & Shao J. (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. J. Am. Statist. Assoc. 110, 1577–90.