First of all, I wholeheartedly congratulate Tang and Ju (referred to as TJ hereafter) on a well-written, comprehensive review paper that surveys cutting-edge statistical theory and methodology relevant to estimation, influence analysis and model selection in regression models with missing data.
TJ begin their presentation with the missing data mechanism, a fundamental concept in the missing data literature (Little and Rubin, 2002; Tsiatis, 2006; Kim and Shao, 2013; Molenberghs et al., 2014). In their Section 2, TJ present a detailed explanation of this definition and underline its importance for developing downstream statistical methodology. To facilitate this discussion, I adopt the same notation as follows. Consider a regression model where Y is a response variable and X is a p-dimensional explanatory variable, and (xi, yi), i = 1, …, n, are n independent and identically distributed realizations of (X, Y). Assume X is always fully observed but Y is subject to missingness. Let δ be the missing data indicator for Y, that is, δ = 0 if Y is missing, and δ = 1 otherwise. Then the missing data mechanism is the conditional distribution of δ given X and Y, i.e.,
$$\mathrm{pr}(\delta = 1 \mid X, Y). \tag{1}$$
One intrinsic complication of the missing data mechanism is that, except in a few scenarios (Little, 1988; d’Haultfoeuille, 2010), its underlying truth is difficult to verify. This is due to its plausible dependence on Y, an incompletely observed variable. The issue becomes more pronounced in real applications, where investigators would be more satisfied with a statistical method that imposes less stringent assumptions on the mechanism and can therefore be applied flexibly to various scenarios.
My discussion, motivated by the need to develop versatile statistical procedures that provide robust protection against certain mechanism misspecification, showcases up-to-date statistical treatments in which the mechanism model assumption is imposed only at a minimum level. The discussion concentrates on a brief introduction of two types of such assumptions and spans diverse statistical topics including model identification, point estimation, hypothesis testing, and high dimensional variable selection.
One distinct feature of the methods in this discussion is that the mechanism model is treated as a nuisance; hence all the methods can be carried out without estimating the mechanism.
1. Mechanism based on conditional independence
The instrumental variable is a well-studied method in econometrics, epidemiology and related disciplines. The key step in applying this method is a requirement of conditional independence among certain variables. Zhao and Shao (2015) proposed to take advantage of a nonresponse instrument Z, a component of X, to analyze missing data, especially nonignorable missing data. The concept of the nonresponse instrument is similar in spirit to that of the instrumental variable. To be more specific, Zhao and Shao (2015) assumed that
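$$\mathrm{pr}(\delta = 1 \mid x, y) = \mathrm{pr}(\delta = 1 \mid u, y), \tag{2}$$
where x = (u, z), with u collecting the components of x other than the nonresponse instrument z. A further requirement, e.g., that Y still depends on Z given U, is also needed for model identification.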
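When X by itself serves as the nonresponse instrument, Tang et al. (2003) studied this special situation and proposed to estimate the unknown parameter θ in p(y | x; θ) through the conditional likelihood of X given Y among the fully observed subjects:
$$p(x \mid y, \delta = 1) = \frac{p(y \mid x; \theta)\, g(x)}{\int p(y \mid \tilde{x}; \theta)\, g(\tilde{x})\, d\tilde{x}},$$
where g(x) represents the unspecified probability density function of X. Then the objective becomes a semiparametric function:
$$\prod_{i:\, \delta_i = 1} \frac{p(y_i \mid x_i; \theta)\, g(x_i)}{\int p(y_i \mid \tilde{x}; \theta)\, g(\tilde{x})\, d\tilde{x}}.$$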
To solve for θ̂, an estimator of g(x) is needed. Three straightforward choices for g(x) could be considered: the true g(x); a parametric g(x) = g(x; α) with α estimated as α̂ through the full data likelihood method; and a nonparametric g(x) with its cumulative distribution function estimated by its empirical version. These three alternatives lead to three different pseudolikelihood estimators of θ: PL0, PL1 and PL2. At first sight, one would believe that PL0 is superior to the other two in terms of estimation efficiency. However, Tang et al. (2003) showed that PL0 is always less efficient than PL1. In a recent paper, Zhao and Ma (2018) further proved that PL1 is always less efficient than PL2 and that no other method could lead to a more efficient estimator than PL2; hence PL2 is optimal.
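As an illustration of the PL2 idea, the following minimal sketch evaluates the log pseudolikelihood when g(x) is replaced by the empirical distribution of all n covariate values, so that the integral in the denominator becomes a sample average. Everything specific here is an assumption made for illustration and is not taken from the cited papers: a scalar covariate, a normal linear model for Y given X, a logistic response probability depending on y only, and the function names `log_normal_pdf` and `pl2_log_pseudolikelihood`.

```python
import math
import random


def log_normal_pdf(y, mean, sigma):
    """Log density of N(mean, sigma^2) evaluated at y."""
    z = (y - mean) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def pl2_log_pseudolikelihood(theta, x_all, x_obs, y_obs):
    """Sum over respondents of log p(y_i|x_i) - log{(1/n) sum_k p(y_i|x_k)}.

    theta = (alpha, beta, sigma) for the assumed model Y | X ~ N(alpha + beta*x, sigma^2).
    x_all holds the covariate for all n subjects; (x_obs, y_obs) are the respondents.
    """
    alpha, beta, sigma = theta
    n = len(x_all)
    total = 0.0
    for x_i, y_i in zip(x_obs, y_obs):
        log_num = log_normal_pdf(y_i, alpha + beta * x_i, sigma)
        logs = [log_normal_pdf(y_i, alpha + beta * x_k, sigma) for x_k in x_all]
        top = max(logs)  # log-sum-exp trick for numerical stability
        log_den = top + math.log(sum(math.exp(v - top) for v in logs) / n)
        total += log_num - log_den
    return total


# Hypothetical simulated data: missingness depends on y only (nonignorable).
rng = random.Random(2018)
n = 300
x_all = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_all = [1.0 * x + rng.gauss(0.0, 1.0) for x in x_all]
observed = [rng.random() < 1.0 / (1.0 + math.exp(-y)) for y in y_all]
x_obs = [x for x, d in zip(x_all, observed) if d]
y_obs = [y for y, d in zip(y_all, observed) if d]

value_at_truth = pl2_log_pseudolikelihood((0.0, 1.0, 1.0), x_all, x_obs, y_obs)
value_no_effect = pl2_log_pseudolikelihood((0.0, 0.0, 1.0), x_all, x_obs, y_obs)
print(value_at_truth, value_no_effect)
```

The sketch only evaluates the objective; in practice it would be handed to a numerical optimizer. When beta = 0 the conditional density no longer depends on x, so every term vanishes and the objective is zero, which gives a simple sanity check.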
Other work along this line includes Miao and Tchetgen Tchetgen (2016), who explored different types of doubly robust estimators, and Fang et al. (2018), who extended the idea to missing covariates and proposed an imputation approach based on estimating equations.
2. Mechanism based on statistical chromatography
The other unspecified missing data mechanism investigated in the literature assumes a decomposable model
$$\mathrm{pr}(\delta = 1 \mid x, y) = s(x)\, t(y), \tag{3}$$
where s(·) and t(·) are two unspecified functions. Clearly, MCAR (s = t = constant) and MAR (t = constant) are special cases of this assumption. When s = constant, it reduces to the case discussed in Section 1 where X on its own serves as the nonresponse instrument.
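The special cases of the decomposable mechanism can be made concrete with a small sketch. The helper `make_mechanism` and the particular choices of s and t below are hypothetical and serve only to illustrate how MCAR, MAR and the nonresponse-instrument case arise from a product of a function of x and a function of y.

```python
import math


def logistic(v):
    """Standard logistic function, used here only as a convenient choice."""
    return 1.0 / (1.0 + math.exp(-v))


def make_mechanism(s, t):
    """Return the map (x, y) -> pr(delta = 1 | x, y) = s(x) * t(y)."""
    def response_prob(x, y):
        prob = s(x) * t(y)
        if not 0.0 <= prob <= 1.0:
            raise ValueError("s(x) * t(y) must be a valid probability")
        return prob
    return response_prob


# MCAR: both s and t are constant.
mcar = make_mechanism(lambda x: 0.7, lambda y: 1.0)
# MAR: t is constant, so the mechanism depends on the always-observed x only.
mar = make_mechanism(lambda x: logistic(x), lambda y: 1.0)
# s constant: the mechanism depends on y only, X acts as the nonresponse instrument.
instrument = make_mechanism(lambda x: 1.0, lambda y: logistic(y))
# General decomposable case: depends on both x and y, but only through a product.
general = make_mechanism(lambda x: logistic(1.0 + x), lambda y: logistic(y))

print(mcar(0.0, 5.0), mar(1.0, 5.0), instrument(1.0, 2.0), general(1.0, 2.0))
```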
A pivotal observation following (3) is that p(y | x, δ = 1) and p(y | x; θ) can be bridged as
$$p(y \mid x, \delta = 1) = \frac{t(y)\, p(y \mid x; \theta)}{\int t(\tilde{y})\, p(\tilde{y} \mid x; \theta)\, d\tilde{y}}.$$
Note that the ratio p(y | x, δ = 1)/p(y | x; θ) remains the product of a function of x only and a function of y only. Using the idea of the conditional likelihood (Kalbfleisch, 1978), decomposing the observed yi’s into their rank statistic and order statistic, and considering the likelihood conditional on the order statistic, Liang and Qin (2000) proposed the following objective function for estimating θ:
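$$\prod_{1 \le i < j \le m} \frac{p(y_i \mid x_i; \theta)\, p(y_j \mid x_j; \theta)}{p(y_i \mid x_i; \theta)\, p(y_j \mid x_j; \theta) + p(y_i \mid x_j; \theta)\, p(y_j \mid x_i; \theta)}, \tag{4}$$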
where the first m subjects are fully observed without loss of generality.
The key here is that we model the data at the more refined granularity of rank and order statistics, so that sophisticated conditioning arguments can be applied to separate the parameter of interest from the nuisance components. Hence we call this procedure statistical chromatography.
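To make the conditioning argument concrete, the following sketch works through a single pair (i, j) of fully observed subjects. It assumes the decomposable mechanism (3) and the pairwise form of the conditional likelihood associated with Liang and Qin (2000); the normalizing function c(x) is notation introduced here for illustration only.

$$
\begin{aligned}
p(y \mid x, \delta = 1) &= \frac{s(x)\, t(y)\, p(y \mid x; \theta)}{\int s(x)\, t(\tilde{y})\, p(\tilde{y} \mid x; \theta)\, d\tilde{y}} = \frac{t(y)\, p(y \mid x; \theta)}{c(x)}, \qquad c(x) = \int t(\tilde{y})\, p(\tilde{y} \mid x; \theta)\, d\tilde{y}, \\[4pt]
\frac{p(y_i \mid x_i, \delta = 1)\, p(y_j \mid x_j, \delta = 1)}{p(y_i \mid x_i, \delta = 1)\, p(y_j \mid x_j, \delta = 1) + p(y_j \mid x_i, \delta = 1)\, p(y_i \mid x_j, \delta = 1)} &= \frac{p(y_i \mid x_i; \theta)\, p(y_j \mid x_j; \theta)}{p(y_i \mid x_i; \theta)\, p(y_j \mid x_j; \theta) + p(y_j \mid x_i; \theta)\, p(y_i \mid x_j; \theta)}.
\end{aligned}
$$

Every term on the left-hand side of the second line carries the common factor t(y_i) t(y_j) / {c(x_i) c(x_j)}, which cancels between numerator and denominator. The left-hand side is the probability of the observed assignment of the two responses to the two subjects, given the unordered pair of response values, so neither s(·) nor t(·) has to be specified or estimated.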
We elaborate under the generalized linear model framework where
$$p(y \mid x; \theta) = \exp\left[\phi^{-1}\{y\eta - b(\eta)\} + c(y; \phi)\right],$$
with link function structure η = η(α + βᵀx). With the canonical link η = α + βᵀx, maximizing (4) is equivalent to minimizing
$$\sum_{1 \le i < j \le m} \log\left[1 + \exp\{-(y_i - y_j)(x_i - x_j)^{T} \gamma\}\right],$$
where γ = β/φ. Hence, as the price paid for accommodating missing data, we can only estimate γ as opposed to the whole unknown parameter θ. Although only γ is estimable, the hypothesis test of β = 0 versus β ≠ 0 can still be carried out, since the null hypothesis β = 0 is equivalent to γ = 0. The detailed Wald-type test statistic requires the asymptotic distribution of the estimator γ̂ of γ under this scheme (Zhao and Shao, 2017). With a noncanonical link, Zhao and Shao (2017) showed that, interestingly, the whole unknown parameter is estimable in some situations.
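A minimal sketch of the canonical-link case may help fix ideas. It assumes that, with the canonical link, the pairwise objective reduces to the convex loss given by the sum over pairs i < j of log[1 + exp{-(y_i - y_j)(x_i - x_j) γ}] with γ = β/φ, which follows from cancelling the common factors in each pairwise term. The scalar covariate, the normal linear model with unit variance (so that φ = 1 and γ = β), the particular s and t used to generate missingness, and the bisection solver are all choices made here for illustration, not the authors' implementation.

```python
import math
import random


def pair_products(xs, ys):
    """d_ij = (y_i - y_j)(x_i - x_j) for all pairs i < j of respondents."""
    m = len(xs)
    return [(ys[i] - ys[j]) * (xs[i] - xs[j])
            for i in range(m) for j in range(i + 1, m)]


def pairwise_loss(gamma, d):
    """Sum over pairs of log(1 + exp(-d_ij * gamma)), computed stably."""
    total = 0.0
    for d_ij in d:
        z = -d_ij * gamma
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total


def pairwise_grad(gamma, d):
    """Derivative of pairwise_loss with respect to gamma."""
    total = 0.0
    for d_ij in d:
        z = -d_ij * gamma
        if z >= 0.0:
            w = 1.0 / (1.0 + math.exp(-z))
        else:
            e = math.exp(z)
            w = e / (1.0 + e)
        total -= d_ij * w
    return total


def solve_gamma(d, lo=-1.0, hi=1.0, iterations=60):
    """Bisection on the gradient, which is increasing because the loss is convex."""
    for _ in range(30):  # widen the bracket until it contains the root
        if pairwise_grad(lo, d) < 0.0:
            break
        lo *= 2.0
    for _ in range(30):
        if pairwise_grad(hi, d) > 0.0:
            break
        hi *= 2.0
    if not (pairwise_grad(lo, d) < 0.0 < pairwise_grad(hi, d)):
        raise ValueError("no finite minimizer found in the search range")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if pairwise_grad(mid, d) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical data: Y = 0.5 + X + N(0, 1), response probability s(x) * t(y).
rng = random.Random(7)
xs_all = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys_all = [0.5 + x + rng.gauss(0.0, 1.0) for x in xs_all]
keep = [rng.random() < (0.9 if x > 0 else 0.6) / (1.0 + math.exp(-y))
        for x, y in zip(xs_all, ys_all)]
xs = [x for x, k in zip(xs_all, keep) if k]
ys = [y for y, k in zip(ys_all, keep) if k]

d = pair_products(xs, ys)
gamma_hat = solve_gamma(d)
print(len(xs), gamma_hat)
```

Only the respondents enter the loss, and neither s nor t is used by the estimator. In this simulated setting the estimate would be expected to land in the neighbourhood of β/φ = 1, up to sampling variability; the intercept is not recovered, in line with the remark that only γ is estimable.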
Finally, I would like to point out a regularization approach for high dimensional variable selection with missing data built on this idea. The essential idea is to identify “important” variables according to whether the corresponding estimator equals zero or not. The penalized likelihood function is
$$\sum_{1 \le i < j \le m} \log\left[1 + \exp\{-(y_i - y_j)(x_i - x_j)^{T} \gamma\}\right] + \sum_{l=1}^{p} p_{\lambda}(|\gamma_l|),$$
where pλ(·) could be any penalty function, and λ ≥ 0 is the tuning parameter. Zhao et al. (2018) proved selection consistency while allowing p to grow exponentially fast with n, with log p of order n^k for some 0 < k < 1/4. In the penalized likelihood approach to variable selection, the choice of the tuning parameter is also critical. Zhao and Yang (2017) further studied stability-enhanced tuning parameter selection methods following this approach.
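The penalized criterion can be sketched with a simple proximal gradient loop. This is an illustration under stated assumptions rather than the procedure of Zhao et al. (2018): the penalty is taken to be the lasso, the pairwise loss is averaged over pairs (which only rescales λ), the dimension is kept tiny so the pure-Python loop stays fast, and the names `soft_threshold`, `pair_vectors`, `mean_gradient` and `lasso_pairwise_fit` are hypothetical.

```python
import math
import random


def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def pair_vectors(xs, ys):
    """d_ij = (y_i - y_j) * (x_i - x_j) as a vector, for all pairs i < j."""
    out = []
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dy = ys[i] - ys[j]
            out.append([dy * (a - b) for a, b in zip(xs[i], xs[j])])
    return out


def mean_gradient(gamma, d):
    """Gradient of the pair-averaged loss mean log(1 + exp(-d_ij . gamma))."""
    grad = [0.0] * len(gamma)
    for d_ij in d:
        z = -sum(a * g for a, g in zip(d_ij, gamma))
        if z >= 0.0:
            w = 1.0 / (1.0 + math.exp(-z))
        else:
            e = math.exp(z)
            w = e / (1.0 + e)
        for k, a in enumerate(d_ij):
            grad[k] -= a * w
    return [g / len(d) for g in grad]


def lasso_pairwise_fit(d, lam, iterations=100):
    """Proximal gradient (ISTA) for the averaged pairwise loss plus lam * ||gamma||_1."""
    # 0.25 * mean ||d_ij||^2 bounds the Lipschitz constant of the gradient.
    lipschitz = 0.25 * sum(sum(a * a for a in d_ij) for d_ij in d) / len(d)
    step = 1.0 / lipschitz
    gamma = [0.0] * len(d[0])
    for _ in range(iterations):
        grad = mean_gradient(gamma, d)
        gamma = [soft_threshold(g - step * dg, step * lam)
                 for g, dg in zip(gamma, grad)]
    return gamma


# Hypothetical data: only the first of three covariates affects Y.
rng = random.Random(11)
x_all = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(100)]
y_all = [row[0] + rng.gauss(0.0, 1.0) for row in x_all]
keep = [rng.random() < 1.0 / (1.0 + math.exp(-y)) for y in y_all]
xs = [row for row, k in zip(x_all, keep) if k]
ys = [y for y, k in zip(y_all, keep) if k]

d = pair_vectors(xs, ys)
fit_none = lasso_pairwise_fit(d, lam=0.0)   # unpenalized fit
fit_some = lasso_pairwise_fit(d, lam=0.05)  # moderate shrinkage
fit_huge = lasso_pairwise_fit(d, lam=1e6)   # penalty dominates
print(fit_none, fit_some, fit_huge)
```

Variables are declared “important” when their fitted coefficient is nonzero. With a very large λ every coefficient is set exactly to zero, while λ = 0 recovers the unpenalized pairwise fit; how to pick λ between these extremes is the tuning question mentioned above.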
References
- d’Haultfoeuille X (2010), “A new instrumental method for dealing with endogenous selection,” Journal of Econometrics, 154, 1–15.
- Fang F, Zhao J, and Shao J (2018), “Imputation-based adjusted score equations in generalized linear models with nonignorable missing covariate values,” Statistica Sinica, 28.
- Kalbfleisch JD (1978), “Likelihood methods and nonparametric tests,” Journal of the American Statistical Association, 73, 167–170.
- Kim JK and Shao J (2013), Statistical Methods for Handling Incomplete Data, Chapman & Hall/CRC.
- Liang K-Y and Qin J (2000), “Regression analysis under non-standard situations: a pairwise pseudolikelihood approach,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 773–786.
- Little RJ (1988), “A test of missing completely at random for multivariate data with missing values,” Journal of the American Statistical Association, 83, 1198–1202.
- Little RJ and Rubin DB (2002), Statistical Analysis with Missing Data, Wiley, 2nd ed.
- Miao W and Tchetgen Tchetgen EJ (2016), “On varieties of doubly robust estimators under missingness not at random with a shadow variable,” Biometrika, 103, 475–482.
- Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis AA, and Verbeke G (2014), Handbook of Missing Data Methodology, Boca Raton, Florida: Chapman & Hall/CRC Press.
- Tang G, Little RJ, and Raghunathan TE (2003), “Analysis of multivariate missing data with nonignorable nonresponse,” Biometrika, 90, 747–764.
- Tsiatis AA (2006), Semiparametric Theory and Missing Data, New York: Springer.
- Zhao J and Ma Y (2018), “Optimal pseudolikelihood estimation in the analysis of multivariate missing data with nonignorable nonresponse,” Biometrika, 105, 479–486.
- Zhao J and Shao J (2015), “Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data,” Journal of the American Statistical Association, 110, 1577–1590.
- Zhao J and Shao J (2017), “Approximate conditional likelihood for generalized linear models with general missing data mechanism,” Journal of Systems Science and Complexity, 30, 139–153.
- Zhao J and Yang Y (2017), “Tuning Parameter Selection in the LASSO with Unspecified Propensity,” in New Advances in Statistics and Data Science, Springer, pp. 109–125.
- Zhao J, Yang Y, and Ning Y (2018), “Penalized pairwise pseudo likelihood for variable selection with nonignorable missing data,” Statistica Sinica, 28.
