Summary
Measurement error data, or errors-in-variables data, are collected in many studies. Natural criterion functions are often unavailable for general functional measurement error models because of the lack of information on the distribution of the unobservable covariates. Typically, the parameters are estimated by solving estimating equations. In addition, the construction of such estimating equations routinely requires solving integral equations, so the computation is often much more intensive than for ordinary regression models. Because of these difficulties, traditional best subset variable selection procedures are not applicable, and in the measurement error model context, variable selection remains an unsolved issue. In this paper, we develop a framework for variable selection in measurement error models via penalized estimating equations. We first propose a class of selection procedures for general parametric measurement error models and for general semiparametric measurement error models, and study the asymptotic properties of the proposed procedures. Then, under certain regularity conditions and with a properly chosen regularization parameter, we demonstrate that the proposed procedure performs as well as an oracle procedure. We assess the finite sample performance via Monte Carlo simulation studies and illustrate the proposed methodology through an empirical analysis of the Framingham heart study data.
Keywords: Measurement error models, Errors-in-variables, Estimating equations, Semiparametric methods, Nonconcave penalty function, SCAD
1. Introduction
In regression analysis, some covariates can often only be measured imprecisely or indirectly, resulting in measurement error models, also known as errors-in-variables models in the literature. Various statistical procedures have been developed for inference in measurement error models (Carroll, Ruppert, Stefanski and Crainiceanu, 2006). The study of linear measurement error models dates back to Bickel and Ritov (1987), who provided an efficient estimator. Stefanski and Carroll (1987) constructed consistent estimators for generalized linear measurement error models. Recently, Tsiatis and Ma (2004) extended the model framework to an arbitrary parametric regression setting. Liang, Härdle and Carroll (1999) proposed partially linear measurement error models, and Ma and Carroll (2006) studied generalized partially linear measurement error models. Further active research has recently been carried out in the nonparametric measurement error area; see, for example, Delaigle and Hall (2007) and Delaigle and Meister (2007). The goal of this paper is to develop a class of variable selection procedures for general measurement error models. We emphasize that the scope of the paper is not limited to generalized linear models.
This study was motivated by examining the effects of systolic blood pressure (SBP), a covariate measured with error, and of three other covariates, serum cholesterol, age and smoking status, on the probability of the occurrence of heart disease. In our initial analysis, we included interactions between SBP and the other covariates, interactions among those covariates, and quadratic terms of the covariates, in order to reduce modeling bias. Our preliminary analysis found that some interactions and quadratic terms are not significant and should be excluded to achieve a parsimonious model. When selecting significant variables for further analysis, we realized that the traditional AIC and BIC criteria are not well defined for the model we consider in Section 4.4. Recently, a class of variable selection procedures for partially linear measurement error models using penalized least squares and penalized quantile regression was proposed in Liang and Li (2009). However, their procedures are not applicable to cases beyond partially linear models, such as partially linear logistic regression models, and therefore the procedures in Liang and Li (2009) cannot be applied to the model in Section 4.4 either. In fact, variable selection for general parametric or semiparametric measurement error models is challenging. One major difficulty is the lack of a likelihood function in these models, due to the difficulty in obtaining the distribution of the error-prone covariates. For example, use Y to denote the response variable, X the unobservable covariate, and W an observed surrogate of X. The likelihood of a single observation (w, y) is then ∫ pY|X(y|x, β) pW|X(w|x) pX(x) dx. To calculate this likelihood, one needs to estimate pX, yielding a deconvolution problem which is known to have very slow convergence rates (Carroll and Hall, 1988; Fan, 1991) and is typically avoided in parametric measurement error models. Although a reasonable criterion function could in principle be used in place of the likelihood, the difficulty persists in that, except for very special cases such as linear or partially linear models, even a sensible criterion function is unavailable. In other models, such as the ones arising in survival analysis, the lack of a likelihood function also causes a problem. To perform variable selection in these models, rather complicated methods have been proposed: for each potential model, one needs to fit the model, derive the asymptotic properties of the estimator, form an artificial criterion function based on these asymptotic properties, and finally add a penalty to perform the selection. Such a procedure is complicated and unnatural. This motivates us to develop simple variable selection procedures for measurement error models when a reasonable criterion function is unavailable. Although there exist a few variable selection procedures for linear or partially linear measurement error models (Liang and Li, 2009), to the best of our knowledge, variable selection for general parametric or semiparametric measurement error models has never been systematically studied in the literature. This paper intends to fill this gap by developing a class of variable selection procedures for both parametric and semiparametric measurement error models. In addition, the method proposed here is applicable to the more general situation where a likelihood or a natural criterion is not available and estimation is performed through solving a set of estimating equations.
The variable selection procedure we propose is in fact a penalized estimating equation method, which can be applied to both parametric and semiparametric measurement error models. In addition, the penalized estimating equation method is applicable to any set of consistent estimating equations. Note that the measurement error model we consider is completely general and not limited to generalized linear models. Variable selection and feature selection are very active research topics in the current literature. Candès and Tao (2007) and Fan and Lv (2008) studied variable selection for linear models when the sample size is much smaller than the dimension of the regression parameter space. Their results are inspiring, but are only valid for linear models under very strong assumptions on the design matrix or the distribution of the covariates. Thus, in this paper we follow Fan and Peng (2004) and consider the statistical setting in which the number of regression coefficients diverges to infinity at a certain rate as the sample size tends to infinity. We systematically study the asymptotic properties of the proposed estimator. It is worth pointing out that the theoretical results in this paper provide explicit results on the asymptotic properties when the dimension of the regression coefficients increases with the sample size. This advances the current literature, where estimation and inference for measurement error models are studied only for a fixed, finite-dimensional parameter. In our asymptotic analysis, we show that with a proper choice of the regularization parameter and the penalty function, our estimator possesses the oracle property, which roughly means that the estimate is as good as the one obtained when the true model is known (Fan and Li, 2001). We also demonstrate that the oracle property holds in a simpler form in the more familiar setting where the true number of regression coefficients is fixed.
In addition, we address issues of practical implementation of the proposed methodology. It is desirable to have an automatic, data-driven method to select the regularization parameter. To this end, we propose GCV-type and BIC-type tuning parameter selectors for the proposed penalized estimating equation method. Monte Carlo simulation studies are conducted to assess finite sample performance in terms of model complexity and model error. In our simulation studies, both tuning parameter selectors result in sparse models, while the BIC-type selector outperforms the GCV-type selector.
The rest of the paper is organized as follows. In Section 2, we propose a new class of variable selection procedures for parametric measurement error models and study asymptotic properties of the proposed procedures. We develop a new variable selection procedure for semiparametric measurement error models in Section 3. Implementation issues and numerical examples are presented in Section 4, where we describe data driven automatic tuning parameter selection methods (Section 4.1), define the concept of approximate model error to evaluate the selected model (Section 4.2), carry out a simulation study to assess the finite sample performance of the proposed procedures (Section 4.3), and illustrate our method in an example (Section 4.4). Technical details are collected in the Appendix.
2. Parametric measurement error models
A general parametric measurement error model has two parts, written as

$$p_{Y|X,Z}(Y \mid X, Z, \beta), \qquad p_{W|X,Z}(W \mid X, Z, \xi). \tag{2.1}$$
Here, the main model is pY|X,Z(Y|X, Z, β), which denotes the conditional probability density function (pdf) of the response variable Y given the covariates measured with error, X, and the covariates measured without error, Z. Note that the conditional distribution of Y given the covariates is completely general, hence it includes many familiar regression families. For example, the linear model with normal error Y = Xβ + e, the logistic model Pr(Y = 1|X) = exp(β0 + Xβ1)/{1 + exp(β0 + Xβ1)} and the Poisson model pY|X(Y|X) = exp{−(β0 + XTβ1)}(β0 + XTβ1)^Y/Y! are all special forms of the model we consider. The error model is denoted pW|X,Z(W|X, Z, ξ), where W is an observable surrogate of X. The parameter β is a d-dimensional regression coefficient, ξ is a finite dimensional parameter, and our main interest is in selecting the relevant subset of covariates in X and Z and estimating the corresponding parameters contained in β. Typically, ξ is a nuisance parameter, and its estimation usually requires multiple observations or instrumental variables. As routinely done in the literature, we assume that the model is identifiable. Furthermore, for simplicity, we assume in the main context of this paper that the error model pW|X,Z(W|X, Z) is completely known and hence ξ is suppressed. The extension to the unknown ξ case is rather straightforward and is discussed in Section 5. The observed data are of the form {(Wi, Zi, Yi), i = 1, …, n}.
Denote S*β(W, Z, Y, β) as the purported score function. That is,

$$S^*_\beta(W, Z, Y, \beta) = \frac{\partial}{\partial \beta} \log \int p_{Y|X,Z}(Y \mid x, Z, \beta)\, p_{W|X,Z}(W \mid x, Z)\, p^*_{X|Z}(x \mid Z)\, dx,$$

where p*X|Z(X|Z) is a conditional pdf that one posits, which may or may not equal the true pdf pX|Z(X|Z). Let the function a(X, Z) satisfy

$$E\left\{ E^*\big(a(X, Z) \mid W, Z, Y\big) \mid X, Z \right\} = E\left\{ S^*_\beta(W, Z, Y, \beta) \mid X, Z \right\},$$

where E* indicates that the conditional expectation given (W, Z, Y) is calculated using the posited p*X|Z. Note that here and in the sequel, a model p*X|Z has to be proposed in order to actually construct the estimator. Define

$$S_{\mathrm{eff}}(W, Z, Y, \beta) = S^*_\beta(W, Z, Y, \beta) - E^*\{a(X, Z) \mid W, Z, Y\}.$$
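To fix ideas, the following sketch computes the purported score for one observation in a toy special case: a logistic main model, normal additive error, and a posited standard normal p*X. The integral over x is approximated by Gauss–Hermite quadrature and the score by numerical differentiation. The function names, the specific models and all numerical settings are our illustrative assumptions, not part of the paper's general construction (which additionally requires solving the integral equation above for a(X, Z)).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # nodes for weight exp(-t^2/2)

def purported_loglik(w, y, beta, sigma_u=0.1, n_nodes=20):
    """Log purported likelihood of one observation (w, y) under a logistic
    main model Pr(Y=1|X=x) = expit(b0 + b1*x), additive error W = X + U with
    U ~ N(0, sigma_u^2), and a posited X ~ N(0, 1).  Under these normal
    assumptions W ~ N(0, 1 + sigma_u^2) and X | W = w is normal, so the
    integral over x is taken against X | W by Gauss-Hermite quadrature."""
    b0, b1 = beta
    s2 = sigma_u ** 2
    m, v = w / (1.0 + s2), s2 / (1.0 + s2)   # posited moments of X | W = w
    t, wts = hermegauss(n_nodes)
    wts = wts / wts.sum()                    # averages against the N(0,1) law
    x = m + np.sqrt(v) * t                   # quadrature nodes for X | W
    p1 = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    lik_y = p1 if y == 1 else 1.0 - p1
    log_pw = -0.5 * (w ** 2 / (1.0 + s2) + np.log(2.0 * np.pi * (1.0 + s2)))
    return log_pw + np.log(np.sum(wts * lik_y))

def purported_score(w, y, beta, eps=1e-6):
    """Central-difference gradient of the purported log-likelihood in beta."""
    beta = np.asarray(beta, dtype=float)
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (purported_loglik(w, y, beta + step)
                   - purported_loglik(w, y, beta - step)) / (2.0 * eps)
    return grad
```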
To select significant variables and estimate the corresponding parameters simultaneously, we propose the penalized estimating equations for model (2.1) as

$$\frac{1}{n} \sum_{i=1}^{n} S_{\mathrm{eff}}(W_i, Z_i, Y_i, \beta) - \dot p_\lambda(|\beta|) \circ \mathrm{sgn}(\beta) = 0, \tag{2.2}$$
where ṗλ(|β|) ∘ sgn(β) denotes the d-dimensional vector {ṗλ(|β1|)sgn(β1), …, ṗλ(|βd|)sgn(βd)}T and ṗλ(·) is the first order derivative of a penalty function pλ(·). Solving (2.2) for β̂ gives the estimate of β. In practice, we may allow different coefficients to have penalty functions with different regularization parameters; for example, we may want to keep some variables in the model without penalizing their coefficients. For ease of presentation, we assume that the penalty functions and the regularization parameters are the same for all coefficients in this paper.
The penalties in the classic variable selection criteria, such as AIC and BIC, cannot be applied to the penalized estimating equations. Following the study of penalty functions in Fan and Li (2001), we use the SCAD penalty, whose first order derivative is defined as

$$\dot p_\lambda(\gamma) = \lambda\, \mathrm{sgn}(\gamma) \left\{ I(|\gamma| \le \lambda) + \frac{(a\lambda - |\gamma|)_+}{(a - 1)\lambda}\, I(|\gamma| > \lambda) \right\} \tag{2.3}$$
for any scalar γ, where sgn(·) is the sign function, i.e., sgn(γ) = −1, 0 and 1 when γ < 0, γ = 0 and γ > 0, respectively, (t)+ denotes the positive part of t, and I(·) is the indicator function. Here, a > 2 is a constant, and the choice a = 3.7 is appropriate from a Bayesian point of view. A property of (2.2) is that with a proper choice of penalty function, such as the SCAD penalty, the resulting estimate contains some exact zero coefficients. This is equivalent to excluding the corresponding variables from the final selected model, so variable selection is achieved at the same time as parameter estimation.
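As a quick illustration, a vectorized implementation of the SCAD derivative (2.3) might look as follows; the function name and defaults are ours.

```python
import numpy as np

def scad_derivative(gamma, lam, a=3.7):
    """First-order derivative of the SCAD penalty, as in (2.3):
    lam*sgn(g)*{I(|g| <= lam) + (a*lam - |g|)_+ / ((a-1)*lam) * I(|g| > lam)}."""
    gamma = np.asarray(gamma, dtype=float)
    abs_g = np.abs(gamma)
    deriv = np.where(abs_g <= lam,
                     lam,                                           # L1-like part
                     np.maximum(a * lam - abs_g, 0.0) / (a - 1.0))  # tapering part
    return deriv * np.sign(gamma)
```

Note that the derivative vanishes once |γ| ≥ aλ, which is what leaves large coefficients unbiased and gives an = 0 in (2.4) below for sufficiently small λn.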
Concerns about model bias often prompt us to build models that contain many variables, especially when the sample size is large. A reasonable way to capture this tendency is to consider the situation where the dimension of the parameter β increases along with the sample size n. We therefore study the asymptotic properties of the penalized estimating equation estimator under the setting in which both the dimension of the true non-zero components of β and the total length of β tend to infinity as n goes to infinity. Denote β0 = (β10, ···, βd0)T as the true value of β. Let

$$a_n = \max_{1 \le j \le d} \left\{ |\dot p_{\lambda_n}(|\beta_{j0}|)| : \beta_{j0} \ne 0 \right\}, \tag{2.4}$$

where we write λ as λn to emphasize its dependence on the sample size n.
Theorem 1
Suppose that Condition (P1) in the Appendix holds. Under regularity conditions (A1)–(A3) in the Appendix, if dn^4/n → 0 and λn → 0 when n → ∞, then with probability tending to one, there exists a root of (2.2), denoted β̂, such that ||β̂ − β0|| = Op{dn^{1/2}(n^{−1/2} + an)}, where we write d as dn to emphasize its dependence on the sample size n.
The proof of Theorem 1 is given in the Appendix. Theorem 1 demonstrates that the convergence rate depends on the penalty function and the regularization parameter λn through an. From Theorem 1, an = Op(n^{−1/2}) is required to achieve the root n/dn convergence rate. For the L1 penalty, an = λn; thus, the root n/dn convergence rate requires λn = Op(n^{−1/2}). For the SCAD penalty, on the other hand, an = 0 for sufficiently small λn, so the resulting estimate with the SCAD penalty is root n/dn consistent.
To present the oracle property of the resulting estimate, we first introduce some notation. Without loss of generality, we assume β0 = (βI0T, βII0T)T, where in the true model every element of βI0 is nonzero while βII0 ≡ 0. Denote the dimension of βI as d1. Furthermore, denote

$$b = \left\{ \dot p_{\lambda_n}(|\beta_{10}|)\,\mathrm{sgn}(\beta_{10}), \ldots, \dot p_{\lambda_n}(|\beta_{d_1 0}|)\,\mathrm{sgn}(\beta_{d_1 0}) \right\}^T, \qquad \Sigma = \mathrm{diag}\left\{ \ddot p_{\lambda_n}(|\beta_{10}|), \ldots, \ddot p_{\lambda_n}(|\beta_{d_1 0}|) \right\}, \tag{2.5}$$

and denote the first d1 components of S_eff as S_eff,I. In the following theorem, we use the same formulation as that in Cai, Fan, Li and Zhou (2005).
Theorem 2
Suppose that Condition (P1) holds. Under regularity conditions (A1)–(A3), assume λn → 0, (n/dn)^{1/2} λn → ∞ and dn^5/n → 0 when n → ∞. If

$$\liminf_{n \to \infty}\ \liminf_{\gamma \to 0+}\ \dot p_{\lambda_n}(\gamma)/\lambda_n > 0, \tag{2.6}$$

then with probability tending to one, any root n/dn consistent solution of (2.2) must satisfy:

(i) β̂II = 0;

(ii) for any d1 × 1 vector v, s.t. vTv = 1,

$$\sqrt{n}\, v^T I^{*-1/2} (A + \Sigma) \left\{ \hat\beta_I - \beta_{I0} + (A + \Sigma)^{-1} b \right\} \to N(0, 1),$$

where A = E{∂S_eff,I(W, Z, Y, β0)/∂βIT}, I* = E{S_eff,I(W, Z, Y, β0) S_eff,I(W, Z, Y, β0)T}, and the notation → stands for convergence in distribution.
The proof of Theorem 2 is given in the Appendix. For some penalty functions, including the SCAD penalty, b and Σ are zero when λn is sufficiently small, hence the results in Theorem 2 imply that the proposed procedure has the celebrated oracle property: i.e., β̂II = 0, and for any d1 × 1 vector v, s.t. vTv = 1,

$$\sqrt{n}\, v^T I^{*-1/2} A (\hat\beta_I - \beta_{I0}) \to N(0, 1). \tag{2.7}$$

Theorems 1 and 2 imply that for fixed and finite d, ||β̂ − β0|| = Op(n^{−1/2} + an), and with probability tending to one, any root n consistent solution of (2.2) must satisfy β̂II = 0 and

$$\sqrt{n}\left\{ \hat\beta_I - \beta_{I0} + (A + \Sigma)^{-1} b \right\} \to N\left\{ 0, (A + \Sigma)^{-1} I^* (A + \Sigma)^{-T} \right\},$$
where the notation M−T denotes (M−1)T for a matrix M. The same results remain valid under much weaker conditions; see an elaborated version of this paper, Ma and Li (2007), for details.
For the SCAD penalty and for fixed and finite d, (2.7) becomes

$$\sqrt{n}(\hat\beta_I - \beta_{I0}) \to N(0, A^{-1} I^* A^{-T})$$

in distribution. In other words, with probability tending to 1, the penalized estimator performs the same as the locally efficient estimator under the correct model.
3. Semiparametric measurement error models
To motivate the problems considered in this section, we start with some commonly used semiparametric regression models to which the procedure proposed in this section can be directly applied. Consider first the error-free regression case, and let Y be the response and Z and S be covariates. Throughout this paper, we consider univariate Z only. Consider the partially linear model defined as

$$Y = \theta(Z) + \beta^T S + \varepsilon. \tag{3.1}$$
The partially linear model keeps the flexibility of the nonparametric model for the baseline function while maintaining the explanatory power of parametric models, and it has therefore received much attention in the literature; see, for example, Härdle, Liang and Gao (2000) and the references therein. Various extensions of the partially linear model have been proposed. Li and Nie (2007, 2008) proposed the partially nonlinear model

$$Y = \theta(Z) + f(S; \beta) + \varepsilon, \tag{3.2}$$
where f(S; β) is a specific, known function, which may be nonlinear in β. See Li and Nie (2007, 2008) for some interesting examples. Li and Liang (2008) and Lam and Fan (2008) studied the generalized varying-coefficient partially linear model
$$g\{E(Y \mid S_1, S_2, Z)\} = S_1^T \theta(Z) + S_2^T \beta, \tag{3.3}$$
where g(·) is a link function, and (S1, S2, Z) are covariates. Model (3.3) includes most commonly-used semiparametric models, such as the partially linear models (3.1), the generalized partially linear models (Severini and Staniswalis, 1994), and semi-varying coefficient partially linear models (Fan and Huang, 2005).
In the presence of covariates measured with error, one may extend the aforementioned semiparametric regression models to measurement error data. As in the last section, let X be the covariate vector measured with error. Among these semiparametric models with measurement error, the partially linear measurement error model

$$Y = \theta(Z) + \beta_1^T X + \beta_2^T S + \varepsilon \tag{3.4}$$
has been studied in Liang, Härdle and Carroll (1999). Liang and Li (2009) proposed a class of variable selection procedures for model (3.4) using penalized least squares and penalized quantile regression. Our procedure in this section, however, is directly applicable both to the generalized varying-coefficient partially linear measurement error model

$$g\{E(Y \mid X, S_1, S_2, Z)\} = S_1^T \theta(Z) + \beta^T V, \tag{3.5}$$

where V = (XT, S2T)T, and to the partially nonlinear measurement error model

$$Y = \theta(Z) + f(X, S; \beta) + \varepsilon \tag{3.6}$$
when an error distribution is assumed. It is worth noting that model (3.6) includes as a special case the model

$$Y = \theta(Z) + \beta_1^T X + \beta_2^T S + \beta_3^T (XZ) + \varepsilon, \tag{3.7}$$

where (XZ) consists of all interaction terms between X and Z, but model (3.4) does not. Thus, the variable selection procedures proposed in Liang and Li (2009) are not directly applicable to model (3.7).
In summary, in this section we consider a general semiparametric measurement error model that includes models (3.4), (3.5) and (3.6) as special cases. Specifically, the semiparametric measurement error model we consider here also has two parts:

$$p_{Y|X,Z,S}\{Y \mid X, Z, S, \beta, \theta(Z)\}, \qquad p_{W|X,Z,S}(W \mid X, Z, S). \tag{3.8}$$
The major difference from its parametric counterpart is that the main model contains unknown functions θ (Z). It is easy to check that models (3.4), (3.5) and (3.6) are special cases of model (3.8). Note that a simpler version of this model is considered in Ma and Carroll (2006), where the dimension of β is assumed to be fixed, and the dimension of θ is assumed to be one.
Throughout this paper, we assume that the model is identifiable. We propose the penalized estimating equation for the semiparametric model:

$$\frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\{W_i, Z_i, S_i, Y_i, \beta, \hat\theta_i(\beta)\} - \dot p_\lambda(|\beta|) \circ \mathrm{sgn}(\beta) = 0, \tag{3.9}$$
where ṗλ(·) has the same form as in (2.3). However, the computation of ℒ is more involved, as we now describe. Denote the dimension of θ(Z) as m, a fixed and finite integer. If we replace θ(Z) with a single unknown m-dimensional parameter α and append α to β, we obtain from (3.8) a parametric measurement error model with parameter (βT, αT)T. For this parametric model, we can compute its corresponding S_eff(W, Z, S, Y, β, α) as done in the last section, where now a(X, Z, S) satisfies the integral equation analogous to that of Section 2. Note that S_eff has dimension equal to the dimension of β plus m. We write the last m components of S_eff as Ψ(W, Z, S, Y, β, α) and the rest as ℒ(W, Z, S, Y, β, α). We now solve for θ̂i, i = 1, …, n, from

$$\sum_{j=1}^{n} \Psi(W_j, Z_j, S_j, Y_j, \beta, \theta_i)\, K_h(Z_j - Z_i) = 0, \qquad i = 1, \ldots, n, \tag{3.10}$$
where Kh(z) = h−1 K(z/h), K is a smooth symmetric kernel function with compact support that satisfies ∫ K(t)t2dt = 1, and h is a bandwidth. Note that θ1, …, θn are all m-dimensional parameters. Inserting the θ̂i’s into ℒ in (3.9), we obtain a complete description of the estimator. Note that θ̂i depends on β, so a more precise notation for θ̂i is θ̂i(β). Solving equation (3.9) yields a penalized estimating equation estimate. Theorem 3 below gives its convergence rate.
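For intuition, a bare-bones implementation of the local step (3.10) might look like the following; `psi_fn`, its signature, and the choice of a biweight kernel are our illustrative assumptions.

```python
import numpy as np
from scipy.optimize import fsolve

def solve_theta_hat(z0, Z, psi_fn, beta, h, theta_init):
    """Solve the kernel-weighted local estimating equation (3.10) at z = z0:
    sum_j Psi(data_j, beta, theta) * K_h(Z_j - z0) = 0 in the m-vector theta.
    `psi_fn(j, beta, theta)` is assumed to return Psi for observation j."""
    t = (Z - z0) / h
    w = np.where(np.abs(t) <= 1.0,
                 15.0 / 16.0 * (1.0 - t ** 2) ** 2, 0.0) / h   # compact kernel

    def local_ee(theta):
        rows = np.array([psi_fn(j, beta, theta) for j in range(len(Z))])
        return rows.T @ w          # kernel-weighted sum of Psi over the sample

    return fsolve(local_ee, theta_init)
```

In practice one loops this over z0 = Z1, …, Zn inside each update of β, warm-starting `theta_init` at the previous solution.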
Theorem 3
Suppose that condition (P1) holds. Under regularity conditions (B1)–(B4) in the Appendix, if dn^4/n → 0 and λn → 0 when n → ∞, then with probability tending to one, there exists a root of (3.9), denoted β̂, such that ||β̂ − β0|| = Op{dn^{1/2}(n^{−1/2} + an)}.
The proof of Theorem 3 is given in the Appendix. Theorem 3 indicates that to achieve root n/dn convergence rate (or root n convergence rate for finite and fixed d), λn and the penalty function must be chosen such that an = Op(n−1/2).
Let ℒI be the first d1 components of ℒ; ℒIβI the partial derivative of ℒI with respect to βI; ℒIθ the partial derivative of ℒI with respect to θ; Ψθ the partial derivative of Ψ with respect to θ; and ΨβI the partial derivative of Ψ with respect to βI. Also define Ω(Z) = E(Ψθ|Z), Λ(Z) = E(ℒIθ|Z)Ω−1(Z) and θβI(Z) = −Ω−1(Z)E(ΨβI|Z). Further defining

$$A = E\left\{ \mathcal{L}_{I\beta_I} + \mathcal{L}_{I\theta}\, \theta_{\beta_I}(Z) \right\}, \qquad B = E\left[ \{\mathcal{L}_I - \Lambda(Z)\Psi\}\{\mathcal{L}_I - \Lambda(Z)\Psi\}^T \right],$$

we obtain the following results.
Theorem 4
Suppose that condition (P1) holds. Under regularity conditions (B1)–(B4), if λn → 0, (n/dn)^{1/2} λn → ∞ and (2.6) holds, then with probability tending to one, any root n/dn consistent solution of (3.9) must satisfy:

(i) β̂II = 0;

(ii) for any d1 × 1 vector v such that vTv = 1,

$$\sqrt{n}\, v^T B^{-1/2} (A + \Sigma) \left\{ \hat\beta_I - \beta_{I0} + (A + \Sigma)^{-1} b \right\} \to N(0, 1).$$
The proof of Theorem 4 is given in the Appendix. Theorem 4 implies that for fixed and finite d, the convergence rate of the resulting estimate is n^{−1/2} + an. It also implies that any root n consistent solution of (3.9) must satisfy β̂II = 0 and that β̂I has the asymptotic normality

$$\sqrt{n}\left\{ \hat\beta_I - \beta_{I0} + (A + \Sigma)^{-1} b \right\} \to N\left\{ 0, (A + \Sigma)^{-1} B (A + \Sigma)^{-T} \right\}.$$

See the earlier version of this work, Ma and Li (2007), for details.
4. Numerical studies and application
In this section, we provide implementation details such as tuning parameter selection and model error approximation. Issues related to the numerical procedure for solving (2.2) and (3.9), the choice of kernel and bandwidth in the semiparametric model, and the treatment of multiple roots have been addressed in Ma and Carroll (2006) and hence are not further discussed here. We assess the finite sample performance of the proposed procedure by Monte Carlo simulation, and illustrate the proposed methodology by an empirical analysis of the Framingham heart study data. In our simulation, we concentrate on the performance of the proposed procedure for a quadratic logistic measurement error model and a partially linear logistic measurement error model, in terms of model complexity and model error.
4.1 Tuning parameter selection
An MM algorithm (Hunter and Li, 2005) and a local linear approximation (LLA) algorithm (Zou and Li, 2008) have been proposed for penalized likelihood with a nonconcave penalty. However, both algorithms are difficult to implement for the measurement error models we consider. Thus, we employ the local quadratic approximation (LQA) algorithm of Fan and Li (2001) to solve the penalized estimating equations: in implementing the Newton–Raphson algorithm, we locally approximate the first order derivative of the penalty function by a linear function. Specifically, suppose that at the kth step of the iteration we obtain the value β(k). Then, for βj(k) not very close to zero,

$$\dot p_\lambda(|\beta_j|)\, \mathrm{sgn}(\beta_j) \approx \left\{ \dot p_\lambda(|\beta_j^{(k)}|)/|\beta_j^{(k)}| \right\} \beta_j;$$

otherwise, we set β̂j = 0 and exclude the corresponding covariate from the model. This approximation is updated in every step of the Newton–Raphson iteration. In practice, we set the initial value of β to be the unpenalized estimating equation estimate. It can be shown that when the algorithm converges, the solution satisfies the penalized estimating equations. Following Theorems 2 and 4, we can further approximate the estimation variance of the resulting estimator. That is,

$$\widehat{\mathrm{cov}}(\hat\beta_I) = n^{-1} (E + \Sigma_\lambda)^{-1} F (E + \Sigma_\lambda)^{-T},$$

where Σλ is a diagonal matrix with elements ṗλ(|β̂j|)/|β̂j| for the nonvanishing β̂j, a linear approximation of Σ defined in (2.5). We use E to denote the sample approximation of E(∂S_eff/∂βT) evaluated at β̂ for the parametric model (2.1), and the sample approximation of the matrix A evaluated at β̂ for the semiparametric model (3.8). Similarly, we use F to denote the sample approximation of E(S_eff S_effT) evaluated at β̂ for the parametric model, and the sample approximation of the matrix B evaluated at β̂ for the semiparametric model. The consistency of the proposed sandwich formula can be shown using techniques similar to those in Fan and Peng (2004). The accuracy of this sandwich formula is tested in our simulation studies.
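The following sketch shows one way to organize the LQA Newton–Raphson iteration; `score_fn` and `jac_fn` (the averaged estimating function and its Jacobian), the deletion threshold, and the convergence rule are our assumptions, and `scad_derivative` is the helper sketched in Section 2.

```python
import numpy as np

def lqa_penalized_ee(score_fn, jac_fn, beta_init, lam, a=3.7,
                     thresh=1e-4, tol=1e-8, max_iter=200):
    """Newton-Raphson for the penalized estimating equations with the local
    quadratic approximation: the penalty derivative is linearized as
    {p'_lam(|b_j|)/|b_j|} * b_j around the current iterate, and coefficients
    whose magnitude drops below `thresh` are set to zero and removed."""
    beta = np.array(beta_init, dtype=float)
    active = np.abs(beta) >= thresh
    beta[~active] = 0.0
    for _ in range(max_iter):
        if not active.any():
            break
        b = beta[active]
        d = scad_derivative(np.abs(b), lam, a) / np.maximum(np.abs(b), 1e-12)
        U = score_fn(beta)[active] - d * b            # penalized estimating fn
        J = jac_fn(beta)[np.ix_(active, active)] - np.diag(d)
        step = np.linalg.solve(J, U)                  # Newton direction
        beta[active] = b - step
        dropped = active & (np.abs(beta) < thresh)    # delete near-zero coefs
        beta[dropped] = 0.0
        active = active & ~dropped
        if np.max(np.abs(step)) < tol:
            break
    return beta
```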
It is desirable to have an automatic, data-driven method to select the tuning parameter λ. Here we consider two tuning parameter selectors, the GCV and BIC selectors. To define the GCV and BIC statistics, we need to define the degrees of freedom and a goodness of fit measure for the final selected model. Similar to the nonconcave penalized likelihood approach, we may define the effective number of parameters, or degrees of freedom, to be

$$e(\lambda) = \mathrm{tr}\left\{ (\hat I + n \Sigma_\lambda)^{-1} \hat I \right\},$$

where Î stands for the estimated Fisher information matrix. For the logistic regression models employed in this section, a natural approximation of Î, ignoring the measurement error effect, is VTQV, where V represents the covariates included in the model and Q is a diagonal matrix with ith element μ̂λ,i(1 − μ̂λ,i). Here, μ̂λ,i is the fitted value of P(Yi = 1|Vi).
In the logistic regression model context of this section, we may employ the deviance as a goodness of fit measure. Specifically, let μi be the conditional expectation of Yi given its covariates, i = 1, ···, n. The deviance of a model fit μ̂λ = (μ̂λ,1, ···, μ̂λ,n) is defined to be

$$D(\hat\mu_\lambda) = -2 \sum_{i=1}^{n} \left\{ Y_i \log \hat\mu_{\lambda,i} + (1 - Y_i) \log(1 - \hat\mu_{\lambda,i}) \right\}.$$

Define the GCV statistic to be

$$\mathrm{GCV}(\lambda) = \frac{D(\hat\mu_\lambda)}{n \{1 - e(\lambda)/n\}^2},$$

and the BIC statistic to be

$$\mathrm{BIC}(\lambda) = D(\hat\mu_\lambda) + \log(n)\, e(\lambda).$$
The GCV and BIC tuning parameter selectors select λ by minimizing GCV(λ) and BIC(λ), respectively. Note that the BIC tuning parameter selector is distinct from the traditional BIC variable selection criterion, which is not well defined for estimating equation methods. Wang, Li and Tsai (2007) studied the asymptotic behavior of the GCV and BIC tuning parameter selectors for nonconcave penalized least squares variable selection in linear and partially linear regression models. The asymptotic properties of the proposed tuning parameter selectors deserve further study, but this is beyond the scope of this paper.
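In code, evaluating both selectors for one candidate λ could look like the sketch below. The exact normalizations of GCV and BIC shown here follow Fan and Li (2001) and Wang, Li and Tsai (2007) and are our assumption, since several equivalent variants exist; `scad_derivative` is the helper from Section 2.

```python
import numpy as np

def deviance(y, mu):
    """Binomial deviance D = -2 sum{y log(mu) + (1-y) log(1-mu)}."""
    mu = np.clip(mu, 1e-10, 1.0 - 1e-10)   # guard the logs
    return -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

def gcv_bic(y, mu_hat, V, beta_hat, lam, a=3.7):
    """GCV(lam) and BIC(lam) built from the error-ignored information V'QV and
    the effective number of parameters e(lam) = tr{(I + n*Sigma_lam)^{-1} I}."""
    n = len(y)
    q = mu_hat * (1.0 - mu_hat)
    info = V.T @ (q[:, None] * V)                      # V'QV
    d = (scad_derivative(np.abs(beta_hat), lam, a)
         / np.maximum(np.abs(beta_hat), 1e-12))        # diagonal of Sigma_lam
    e_lam = np.trace(np.linalg.solve(info + n * np.diag(d), info))
    dev = deviance(y, mu_hat)
    gcv = dev / (n * (1.0 - e_lam / n) ** 2)
    bic = dev + np.log(n) * e_lam
    return gcv, bic
```

One then scans a grid of λ values, refits β̂ for each, and keeps the minimizer of the chosen score.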
4.2 Model error
Model error is an effective way of evaluating model adequacy versus model complexity. To implement the concept of model error in evaluating our procedure, we first make its definition concrete for the logistic partially linear measurement error model. Denote μ(S, X, Z) = E(Y|S, X, Z), and define the model error of a fitted model μ̂(S, X, Z) as

$$\mathrm{ME}(\hat\mu) = E\left[ \{\hat\mu(S^+, X^+, Z^+) - \mu(S^+, X^+, Z^+)\}^2 \right],$$

where the expectation is taken over a new observation (S+, X+, Z+). Let g(·) be the logit link. For the logistic partially linear model, the mean function has the form μ(S, X, Z) = g−1{θ(Z) + βTV}, where V = (ST, XT)T. If θ̂(·) and β̂ are consistent estimators of θ(·) and β, respectively, then by a Taylor expansion the model error can be approximated by a sum of three components: the inherent model error due to θ̂(·), a component due to the lack of fit of β̂, and the cross-product of the first two. Thus, to assess the performance of the proposed variable selection procedure, we define the approximate model error (AME) for β̂ to be the second of these components, which can be written as the quadratic form

$$\mathrm{AME}(\hat\beta) = (\hat\beta - \beta)^T C_X (\hat\beta - \beta), \tag{4.1}$$

where CX = E[{dg^{−1}(η+)/dη}^2 V+V+T] with η+ = θ(Z+) + βTV+.
In our simulation, the matrix CX is estimated from 1,000,000 Monte Carlo samples. For measurement error data, we observe W instead of X. We therefore also consider the alternative approximate model error

$$\mathrm{AME}_W(\hat\beta) = (\hat\beta - \beta)^T C_W (\hat\beta - \beta), \tag{4.2}$$

where CW is obtained by replacing X with W in the definition of CX. The quantities AME(β̂) and AMEW(β̂) are defined for the parametric model case by setting θ(·) ≡ 0. Note that although we have defined the model error in the context of a logistic link function, it is certainly not restricted to this case: the general approach for calculating the AME is to expand the probability density function, evaluated at the estimated parameters, around the true parameter values and to extract the term that is linear in the parameter of interest. AMEW is calculated by replacing X with W.
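A Monte Carlo sketch of (4.1) is given below, specialized to the logistic link; the weighting of each term by the squared derivative of the inverse link is our reading of the Taylor expansion above, and `draw_covariates` is a hypothetical sampler for (V+, Z+).

```python
import numpy as np

def estimate_C(draw_covariates, beta0, theta_fn, n_mc=1_000_000, seed=0):
    """Monte Carlo estimate of C_X = E[{d g^{-1}(eta)/d eta}^2 V V'] at the
    true parameters, for the logistic link g^{-1}(eta) = expit(eta)."""
    rng = np.random.default_rng(seed)
    V, Z = draw_covariates(n_mc, rng)         # V has rows (S', X')'
    mu = 1.0 / (1.0 + np.exp(-(theta_fn(Z) + V @ beta0)))
    w = (mu * (1.0 - mu)) ** 2                # {d expit(eta)/d eta}^2
    return (V * w[:, None]).T @ V / n_mc

def ame(beta_hat, beta0, C):
    """Approximate model error (4.1): (beta_hat - beta0)' C (beta_hat - beta0)."""
    diff = beta_hat - beta0
    return diff @ C @ diff
```

C_W follows by drawing W in place of X inside `draw_covariates`.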
4.3 Simulation examples
To demonstrate the performance of our method in both parametric and semiparametric measurement error models, we conduct two simulation studies. In our simulation, we will examine only the performance of the penalized estimating equation method with the SCAD penalty.
Example 1
In this example, we generate data from a logistic model in which the covariate measured with error enters the model through a quadratic function, the covariates measured without error enter linearly, and the measurement error is normal and additive. Specifically,

$$\mathrm{logit}\{\Pr(Y = 1 \mid X, Z)\} = \beta_1 + \beta_2 X + \beta_3 X^2 + \beta_4 Z_1 + \cdots + \beta_{10} Z_7, \qquad W = X + U,$$

where β = (0, 1.5, 2, 0, 3, 0, 1.5, 0, 0, 0)T, the covariate X is generated from a normal distribution with mean 0 and variance 1, and (Z1, …, Z6)T is generated from a normal distribution with mean 0 and covariance 0.5^{|i−j|} between Zi and Zj. The last component of the Z covariates, Z7, is a binary variable taking the values 0 and 1 with equal probability. U is normally distributed with mean 0 and standard deviation 0.1. In our simulation, the sample size is taken to be either n = 500 or n = 1000.
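For reproducibility, one way to simulate a data set of this design is sketched below; the pairing of the ten coefficients with (1, X, X², Z1, …, Z7) mirrors the display above, which is itself our reconstruction.

```python
import numpy as np

def generate_example1(n, seed=0):
    """Simulate (Y, W, Z) for Example 1: quadratic logistic model in X with
    additive normal measurement error W = X + U, sd(U) = 0.1."""
    rng = np.random.default_rng(seed)
    beta = np.array([0.0, 1.5, 2.0, 0.0, 3.0, 0.0, 1.5, 0.0, 0.0, 0.0])
    x = rng.standard_normal(n)
    w = x + rng.normal(0.0, 0.1, size=n)
    idx = np.arange(6)
    z16 = rng.multivariate_normal(np.zeros(6),
                                  0.5 ** np.abs(np.subtract.outer(idx, idx)),
                                  size=n)                 # serially correlated Z's
    z7 = rng.integers(0, 2, size=n).astype(float)         # Bernoulli(0.5)
    design = np.column_stack([np.ones(n), x, x ** 2, z16, z7])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-design @ beta)))
    return y, w, np.column_stack([z16, z7])
```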
For the selected model, the model complexity is summarized in terms of the number of zero coefficients, and the model error is summarized in terms of the relative approximate model error (RAME), defined as the ratio of the model error of the selected model to that of the full model. In Table 1, the RAME column reports the sample median and the median absolute deviation (MAD), divided by a factor of 0.6745, of the RAME values over 1000 simulations; the RAMEW column reports the corresponding summaries of the RAMEW values. From Table 1, it can be seen that the values of RAME and RAMEW are very close. The average count of zero coefficients is also reported in Table 1, where the column labeled “C” presents the average count restricted to the true zero coefficients, while the column labeled “E” displays the average count of coefficients erroneously set to 0.
Table 1.
MRMEs and Model Complexity for Example 1
| Method | n | RAME: Median (MAD) | RAMEW: Median (MAD) | C | E |
|---|---|---|---|---|---|
| GCV | 500 | 0.694 (0.231) | 0.698 (0.228) | 4.574 | 0.006 |
| BIC | 500 | 0.396 (0.188) | 0.396 (0.187) | 5.857 | 0.074 |
| GCV | 1000 | 0.766 (0.187) | 0.770 (0.185) | 4.456 | 0 |
| BIC | 1000 | 0.390 (0.157) | 0.401 (0.158) | 5.758 | 0.010 |
We next verify the consistency of the estimators and test the accuracy of the proposed standard error formula. Table 2 displays the bias and sample standard deviation (SD) of the estimates for two nonzero coefficients, (β1, β2), over 1000 simulations, and the sample average and the sample standard deviations of the 1000 standard errors obtained by using the sandwich formula. The row labeled ‘EE’ corresponds to the unpenalized estimating equation estimator. We omit here the results for other nonzero coefficients and the results under sample size n = 500. Interested readers can find them in an earlier version of this work, Ma and Li (2007). Overall, the estimators are consistent and the sandwich formula works well.
Table 2.
Bias and Standard Errors for Example 1 (n = 1000)
| Method | β̂1: bias (SD) | β̂1: SE (Std(SE)) | β̂2: bias (SD) | β̂2: SE (Std(SE)) |
|---|---|---|---|---|
| EE | 0.072 (0.273) | 0.268 (0.062) | 0.124 (0.332) | 0.321 (0.088) |
| GCV | 0.029 (0.254) | 0.250 (0.048) | 0.009 (0.258) | 0.253 (0.057) |
| BIC | 0.024 (0.290) | 0.249 (0.054) | 0.052 (0.255) | 0.244 (0.052) |
Example 2
In this example, we illustrate the performance of the method for a semiparametric measurement error model. Simulation data are generated from the logistic partially linear model

$$\mathrm{logit}\{\Pr(Y = 1 \mid X, S, Z)\} = \theta(Z) + \beta_1 X + \beta_2 S_1 + \cdots + \beta_{10} S_9, \qquad W = X + U,$$

where X and W are generated in the same way as in the previous simulation. We generate the S’s in a fashion similar to the Z’s in Example 1: (S1, …, S8) is generated from a normal distribution with mean zero and covariance 0.5^{|i−j|} between Si and Sj, and S9 is a binary variable equally likely to be zero or one. The variable Z is generated from a uniform distribution on [−π/2, π/2], and the true function is θ(Z) = 0.5 cos(Z). The parameter takes the value β = (1.5, 2, 0, 0, 3, 0, 1.5, 0, 0, 0)T.
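A matching simulator, under the same caveat that the logistic partially linear form above is our reconstruction:

```python
import numpy as np

def generate_example2(n, seed=0):
    """Simulate (Y, W, S, Z) for Example 2: logistic partially linear model
    logit Pr(Y=1) = 0.5*cos(Z) + beta'(X, S1, ..., S9), with W = X + U."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.5, 2.0, 0.0, 0.0, 3.0, 0.0, 1.5, 0.0, 0.0, 0.0])
    x = rng.standard_normal(n)
    w = x + rng.normal(0.0, 0.1, size=n)
    idx = np.arange(8)
    s18 = rng.multivariate_normal(np.zeros(8),
                                  0.5 ** np.abs(np.subtract.outer(idx, idx)),
                                  size=n)
    s9 = rng.integers(0, 2, size=n).astype(float)
    z = rng.uniform(-np.pi / 2.0, np.pi / 2.0, size=n)
    eta = 0.5 * np.cos(z) + np.column_stack([x, s18, s9]) @ beta
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return y, w, np.column_stack([s18, s9]), z
```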
The simulation results are summarized in Table 3, with notation similar to that of Table 1. From Table 3, we can see that the penalized estimating equation estimators significantly reduce model complexity. Overall, the BIC tuning parameter selector performs better, while GCV is too conservative. We further tested the consistency of the estimators and the accuracy of the standard error formula derived from the sandwich formula; the results are summarized in Table 4, with notation similar to that of Table 2. We note that the estimator is consistent and that the standard error formula performs very well. More simulation results are given in the earlier version of this work, Ma and Li (2007).
Table 3.
MRMEs and Model Complexity for Example 2
| Method | n | RAME: Median (MAD) | RAMEW: Median (MAD) | C | E |
|---|---|---|---|---|---|
| GCV | 500 | 0.878 (0.161) | 0.880 (0.158) | 4.060 | 0 |
| BIC | 500 | 0.381 (0.158) | 0.387 (0.155) | 5.713 | 0 |
| GCV | 1000 | 0.868 (0.164) | 0.873 (0.160) | 4.061 | 0 |
| BIC | 1000 | 0.386 (0.162) | 0.392 (0.161) | 5.694 | 0 |
Table 4.
Bias and Standard Errors for Example 2 (n = 1000)
| Method | β̂1: Bias (SD) | β̂1: SE (Std(SE)) | β̂2: Bias (SD) | β̂2: SE (Std(SE)) |
|---|---|---|---|---|
| EE | 0.039 (0.170) | 0.166 (0.018) | 0.057 (0.194) | 0.190 (0.018) |
| GCV | 0.047 (0.174) | 0.172 (0.020) | 0.069 (0.196) | 0.191 (0.021) |
| BIC | 0.031 (0.169) | 0.170 (0.019) | 0.044 (0.179) | 0.185 (0.019) |
4.4 An application
The Framingham heart study data set (Kannel et al., 1986) is a well known data set in which it is generally accepted that the long term systolic blood pressure (SBP) is measured with error. In addition to SBP, other measurements include age, smoking status and serum cholesterol. In the literature, there has been speculation that a second order term in age might be needed to model the dependence of heart disease occurrence on age. In addition, it is unclear whether interactions between the various covariates play a role in influencing the heart disease rate. The data set includes 1,615 observations.
With the method developed here, it is possible to perform variable selection to address these issues. Following the literature, we adopt the measurement error model log(MSBP − 50) = log(SBP − 50) + U, where U is a mean zero normal random variable with a known variance and MSBP is the measured SBP. We denote the standardized log(MSBP − 50) by W; the standardization applied with the same parameters to log(SBP − 50) is denoted X. The standardized serum cholesterol and age are denoted Z1 and Z2, respectively, and Z3 denotes the binary smoking status variable. Using Y to denote the occurrence of heart disease, the saturated model, which includes all the interaction terms and also the squared age term, is of the form

$$\mathrm{logit}\{\Pr(Y = 1)\} = \beta_1 X + \beta_2 XZ_1 + \beta_3 XZ_2 + \beta_4 XZ_3 + \beta_5 + \beta_6 Z_1 + \beta_7 Z_2 + \beta_8 Z_3 + \beta_9 Z_2^2 + \beta_{10} Z_1 Z_2 + \beta_{11} Z_1 Z_3 + \beta_{12} Z_2 Z_3.$$
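The corresponding design matrix, with columns ordered as in Table 5, can be assembled as follows; one might use W in place of the unobservable X when computing naive starting values.

```python
import numpy as np

def framingham_design(w, z1, z2, z3):
    """Columns of the saturated Framingham model, in the order of Table 5."""
    return np.column_stack([
        w, w * z1, w * z2, w * z3,        # error-prone term and its interactions
        np.ones_like(w), z1, z2, z3,      # intercept and main effects
        z2 ** 2,                          # quadratic age term
        z1 * z2, z1 * z3, z2 * z3])       # interactions among error-free terms
```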
We used both the GCV and BIC tuning parameter selectors to choose λ; the tuning parameters and the corresponding GCV and BIC scores are presented in Figure 1. The final chosen λ is 0.073 and 0.172 for the GCV and BIC selectors, respectively. The GCV criterion selects the covariates X, XZ1, the intercept, Z1, Z2, Z3, Z2² and Z2Z3 into the model, while the BIC criterion selects the covariates X, the intercept, Z1 and Z2. We report the selection and estimation results in Table 5, together with the unpenalized estimating equation (EE) results without variable selection.
Figure 1.
Tuning parameters and their corresponding BIC and GCV scores for the Framingham data. The scores are normalized to the range [0, 1].
Table 5.
Results for the Framingham data set.
| Term | EE: β̂ (SE) | GCV: β̂ (SE) | BIC: β̂ (SE) |
|---|---|---|---|
| X | 0.643 (0.248) | 0.416 (0.093) | 0.179 (0.039) |
| XZ1 | −0.167 (0.097) | −0.072 (0.041) | 0 (NA) |
| XZ2 | −0.059 (0.111) | 0 (NA) | 0 (NA) |
| XZ3 | −0.214 (0.249) | 0 (NA) | 0 (NA) |
| Intercept | −3.415 (0.428) | −3.255 (0.356) | −2.555 (0.092) |
| Z1 | 0.516 (0.212) | 0.332 (0.085) | 0.124 (0.033) |
| Z2 | 1.048 (0.341) | 1.044 (0.329) | 0.398 (0.067) |
| Z3 | 1.060 (0.443) | 0.907 (0.373) | 0 (NA) |
| Z2² | −0.253 (0.125) | −0.262 (0.121) | 0 (NA) |
| Z1Z2 | −0.072 (0.103) | 0 (NA) | 0 (NA) |
| Z1Z3 | −0.161 (0.225) | 0 (NA) | 0 (NA) |
| Z2Z3 | −0.442 (0.336) | −0.473 (0.326) | 0 (NA) |
As seen, the terms X, the intercept, Z1 and Z2 are selected by both criteria, while Z3, Z2² and some of the interaction terms are selected only by GCV. The BIC criterion is very aggressive and results in a very simple final model, while the GCV criterion is much more conservative, hence the resulting model is more complex. This agrees with our simulation results. Since both criteria retain the error-prone covariate X, accounting for the measurement error in the Framingham data is unavoidable.
5. Discussion
In this paper, we have proposed a new class of variable selection procedures in the framework of measurement error models. The procedure is proposed in a completely general functional measurement error model setting, and is suitable both for parametric models and for semiparametric models that contain unspecified smooth functions of an observable covariate. We have assumed the error model pW|X,Z(W|X, Z) to be completely known for ease of presentation. When the error model contains an unknown parameter ξ, the identifiability of the problem requires additional information, such as multiple measurements or instruments, and this information should be incorporated to estimate ξ. Specifically, in the variable selection context, we can simply append to the estimating equations the additional estimating equations obtained from the corresponding score functions with respect to ξ, and append zeros to the penalty function. Because the augmented estimating equations still have the same convergence property, as illustrated in Ma and Carroll (2006), the same asymptotic convergence rates and oracle properties hold as in the known ξ case, without any efficiency loss. When the error model pW|X,Z(W|X, Z) is completely unspecified, a nonparametric estimate of the measurement error distribution has to be obtained first and then plugged into the proposed variable selection and estimation procedure. In this case, the asymptotic convergence rate of the parameters and the oracle property remain the same, but the asymptotic variance will increase. The details of incorporating the estimation of an unknown error distribution, and of demonstrating the subsequent convergence properties in the estimation framework, are the focus of Hall and Ma (2007).
We also point out that in the special case of generalized linear models and normal additive error with possible heteroscedasticity, the step of solving linear integral equations can be avoided and the estimating equations simplify significantly (Ma and Tsiatis, 2006). In such situations, the computational complexity of the proposed procedure is reduced to about the same level as for variable selection in regressions without errors in variables.
As pointed out by the referee, it is of great interest to perform variable selection for high dimensional data. In this paper, we allow the number of covariates to grow to infinity at the rate o(n^{1/5}) as the sample size n increases. However, the proposed procedures and the algorithm used in this paper may not be directly applicable to large p, small n problems. Variable selection in the large p, small n setting is a very active research topic, and it is very challenging to extend the existing variable selection procedures for large p, small n problems to measurement error data. Further research is needed on this topic, but it is beyond the scope of this paper.
Acknowledgments
Ma’s work was supported by a Swiss NSF grant. Li’s research was supported by an NSF grant DMS-0348869 and a National Institute on Drug Abuse (NIDA) grant P50 DA10075. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the National Institutes of Health.
Appendix
Global assumption (P1) on the penalty function
(P1) Let cn = max_j{|p̈λn(|βj0|)| : βj0 ≠ 0}. Assume that λn → 0, an = O(n^{−1/2}) and cn → 0 as n → ∞. In addition, there exist constants C and D such that when γ1, γ2 > Cλn, |p̈λn(γ1) − p̈λn(γ2)| ≤ D|γ1 − γ2|.
It is easy to verify that both the L1 and the SCAD penalties satisfy this condition.
The regularity conditions for Theorems 1 and 2
(A1) The expectation of the first derivative of S_eff with respect to β exists at β0 and its left eigenvalues are bounded away from zero and infinity uniformly for all n. For any entry Sjk in ∂S_eff/∂βT, E(Sjk^2) < C1 < ∞ for all n and d.

(A2) The eigenvalues of the matrix E(S_eff S_effT) satisfy 0 < C2 < λmin ≤ ··· ≤ λmax < C3 < ∞ for all n. For any entries Sk, Sj in S_eff, E(Sk^2 Sj^2) < C4 < ∞ for all n and d.

(A3) The second derivatives of S_eff with respect to β exist and their entries are uniformly bounded by a function M(Wi, Zi, Yi) in a large enough neighborhood of β0. In addition, E(M^2) < C5 < ∞ for all n, d.
Conditions (A1)–(A3) are mild regularity conditions. They guarantee that the solution of the unpenalized estimating equation

$$\sum_{i=1}^{n} S_{\mathrm{eff}}(W_i, Z_i, Y_i, \beta) = 0$$

is root n/dn consistent and possesses asymptotic normality.
Proof of Theorem 1
Condition (A1) allows us to define the matrix M = E{∂S_eff(W, Z, Y, β0)/∂βT}. Let αn = n^{−1/2} + an. It suffices to show that

$$\frac{1}{n} \sum_{i=1}^{n} S_{\mathrm{eff}}(W_i, Z_i, Y_i, \beta) - \dot p_{\lambda_n}(|\beta|) \circ \mathrm{sgn}(\beta) = 0 \tag{A1}$$

has a solution β̂ that satisfies ||β̂ − β0|| = Op(dn^{1/2} αn). This will be shown using the Brouwer fixed point theorem. Using the Taylor expansion, we have
has a solution β̂ that satisfies . This will be shown using the Brouwer fixed point theorem. Using the Taylor expansion, we have
where β* is in between β and β0. The remainder terms in this expansion can be controlled using Conditions (A1)–(A3) and the definition of S_eff.
We next check the key condition for the Brouwer fixed point theorem. For any β such that ||β − β0|| = C dn^{1/2} αn for some constant C, we examine the inner product of β − β0 with the left-hand side of (A1); it follows by Condition (P1) and the Cauchy–Schwarz inequality that the score term is dominated by the term arising from the expansion around β0 with probability 1 − ε for any ε > 0, as long as C is large enough. Thus, for any ε > 0 and for large enough C, the probability that this inner product is larger than zero is at least 1 − ε. From the Brouwer fixed point theorem, we know that with probability at least 1 − ε, there exists at least one solution of (A1) in the region {β : ||β − β0|| ≤ C dn^{1/2} αn}.
Lemma on sparsity for Theorem 2
Lemma 1
If the conditions in Theorem 2 hold, then for any given β that satisfies ||β − β0|| = Op{(dn/n)^{1/2}}, with probability tending to 1, any solution of (2.2) satisfies βII = 0.
Proof
Denote the kth element of n^{−1} Σi S_eff(Wi, Zi, Yi, β) as Lnk(β), k = d1 + 1, …, dn. We next show that Lnk(β) is of order Op{(dn/n)^{1/2}}. A Taylor expansion gives

$$L_{nk}(\beta) = L_{nk}(\beta_0) + \frac{\partial L_{nk}(\beta_0)}{\partial \beta^T}(\beta - \beta_0) + \frac{1}{2}(\beta - \beta_0)^T \frac{\partial^2 L_{nk}(\beta^*)}{\partial \beta\, \partial \beta^T}(\beta - \beta_0), \tag{A2}$$

where β* is in between β and β0. Because of condition (A2), the first term of (A2) is of order Op(n^{−1/2}). The second term of (A2) can be further written as

$$E\left\{ \frac{\partial S_{\mathrm{eff},k}(\beta_0)}{\partial \beta^T} \right\}(\beta - \beta_0) + \left[ \frac{\partial L_{nk}(\beta_0)}{\partial \beta^T} - E\left\{ \frac{\partial S_{\mathrm{eff},k}(\beta_0)}{\partial \beta^T} \right\} \right](\beta - \beta_0). \tag{A3}$$

Using the Cauchy–Schwarz inequality and condition (A1), it can be shown by straightforward calculation that the first term in (A3) is of order Op{(dn/n)^{1/2}}; using the Cauchy–Schwarz inequality again, the second term in (A3) is of no larger order. Thus, the second term of (A2) is of order Op{(dn/n)^{1/2}}. The third term of (A2) admits a decomposition similar to (A3); using the Cauchy–Schwarz inequality in matrix form and condition (A3), it is of order Op(dn/n). Thus, Lnk(β) is of order Op{(dn/n)^{1/2}}, and hence the kth component of the penalized estimating function satisfies

$$L_{nk}(\beta) - \dot p_{\lambda_n}(|\beta_k|)\,\mathrm{sgn}(\beta_k) = \lambda_n \left\{ O_p\!\left( \frac{(d_n/n)^{1/2}}{\lambda_n} \right) - \frac{\dot p_{\lambda_n}(|\beta_k|)}{\lambda_n}\,\mathrm{sgn}(\beta_k) \right\}.$$
Since (n/dn)^{1/2} λn → ∞ and condition (2.6) holds, the sign of Lnk(β) − ṗλn(|βk|)sgn(βk) is decided completely by sign(βk) when n is large enough. From the continuity of the penalized estimating function in βk, we obtain that any solution must satisfy βk = 0.
Proof of Theorem 2
From Theorem 1 and condition (P1), there is a root n/dn consistent estimator β̂. From Lemma 1, β̂II = 0, so (i) is shown. Denote the first d1 equations of n^{−1} Σi S_eff(Wi, Zi, Yi, β) as Ln(βI). Now consider solving the first d1 equations in (2.2) for βI while βII = 0; we have
where the intermediate value is between βI0 and β̂I. It follows by condition (P1) that
Furthermore, for any fixed ε > 0, it follows by conditions (A1), (A3) and the Chebyshev inequality that
since d1 ≤ dn. Thus, we have
Therefore,
and subsequently,
We thus obtain that
Denote I* = E{S_eff,I(W, Z, Y, βI0) S_eff,I(W, Z, Y, βI0)T}. Using condition (A2), it follows that
Let Yi = n−1/2vTI*−1/2 Sn,eff,I(Wi, Zi, Yi, βI0). It follows that for any ε > 0,
Using Chebyshev’s inequality, we have
Note that E(||Y1||^4) = n^{−2} E||vT I*^{−1/2} S_eff,I(W1, Z1, Y1, βI0)||^4, and that the rank of vvT is one, so λmax(vvT) equals the trace of vvT, i.e., λmax(vvT) = 1 since vTv = 1. Thus, it follows that
due to condition (A2). Hence,
On the other hand,
The results in (ii) now follow from the Lindeberg–Feller central limit theorem.
Regularity conditions for Theorems 3 and 4
The notation Ci below is generic and is allowed to be different from that in Conditions (A1)–(A3).
(B1) The first derivatives of ℒ with respect to β and θ exist and are denoted ℒβ and ℒθ, respectively. The first derivative of θ with respect to β exists and is denoted θβ. Thus E(ℒβ + ℒθθβ) exists, and its left eigenvalues are bounded away from zero and infinity uniformly for all n at β0 and the true function θ0(Z). For any entry Sjk of the matrix d(ℒβ + ℒθθβ)/dβ, E(Sjk^2) < C1 < ∞ for all n and d.

(B2) The eigenvalues of the matrix E[{ℒI − Λ(Z)Ψ}{ℒI − Λ(Z)Ψ}T] satisfy 0 < C2 < λmin ≤ ··· ≤ λmax < C3 < ∞ for all n. For any entries Sk, Sj in (ℒβ + ℒθθβ), E(Sk^2 Sj^2) < C4 < ∞ for all n and d.

(B3) The second derivatives of ℒ with respect to β and θ exist, the second derivative of θ with respect to β exists, and the entries are uniformly bounded by a function M(Wi, Zi, Si, Yi) in a neighborhood of (β0, θ0). In addition, E(M^2) < C5 < ∞ for all n, d.
(B4) The random variable Z has compact support and its density fZ(z) is positive on that support. The bandwidth h satisfies nh^4 → 0 and nh^2 → ∞. θ(z) has a bounded second derivative.
Proof of Theorem 3
Denote the unpenalized estimating function by Gn(β) = n^{−1} Σi ℒ{Wi, Zi, Si, Yi, β, θ̂i(β)}. Let αn = n^{−1/2} + an. It will be shown that

$$G_n(\beta) - \dot p_{\lambda_n}(|\beta|) \circ \mathrm{sgn}(\beta) = 0 \tag{A4}$$

has a solution β̂ that satisfies ||β̂ − β0|| = Op(dn^{1/2} αn).
Due to the usual local estimating equation expansion, we have

$$\hat\theta(z, \beta_0) - \theta_0(z) = O_p(h^2 + n^{-1/2} h^{-1/2}) \tag{A5}$$

uniformly in z. For any β such that ||β − β0|| = C dn^{1/2} αn for some constant C, we obtain the expansion
where β* is in between β and β0. Because of condition (B3), each component of the last term is uniformly of order Op(n1/2||β − β0||2). The second term can be written as n1/2{1 + op(1)}(β −β0) under conditions (B1), (B3) and (B4). The first term can be further expanded as
under conditions (B3) and (B4). Summarizing the above results, making use of (A5), we obtain
under condition (B4). Similar to the situation in Theorem 1, under condition (P1) we further obtain an expansion analogous to (A1), in which the first term is dominated by the second as long as C is large enough, and the last term is dominated by the first two. Thus, for any ε > 0, as long as C is large enough, the probability that this expansion is larger than zero is at least 1 − ε. From the Brouwer fixed point theorem, we know that with probability at least 1 − ε, there exists at least one solution of (A4) in the region {β : ||β − β0|| ≤ C dn^{1/2} αn}.
Lemma for Theorem 4
Lemma 2
If the conditions in Theorem 4 hold, then for any given β that satisfies ||β − β0|| = Op{(dn/n)^{1/2}}, with probability tending to 1, any solution of (3.9) satisfies βII = 0.
Proof
Denote the kth element of the left-hand side of (A4), evaluated at (β, θ̂), as Lnk(β, θ̂), and the kth element of Gn evaluated at (β0, θ0) as Gnk(β0, θ0), k = d1 + 1, …, dn. The expansion in Theorem 3 then leads to a decomposition of Lnk(β, θ̂) whose first three terms, similar to the derivation in Lemma 1, are all of order Op{(dn/n)^{1/2}}, so the penalty term dominates. Because of (2.6), the sign of Lnk(β, θ̂) − ṗλn(|βk|)sgn(βk) is decided completely by sign(βk). From the continuity of the penalized estimating function, we obtain that, with probability larger than 1 − ε for any ε > 0, any solution must satisfy βk = 0.
Proof of Theorem 4
(i) immediately follows from Lemma 2. Denote the first d1 equations of the penalized estimating function as Ln{βI, θ̂(βI)}, and the first d1 equations of the unpenalized estimating function, evaluated at (βI0, θ0), as Gn(βI0, θ0). Note that the d1 × d1 upper left block of J−1 is the matrix A defined before Theorem 4. Using a Taylor expansion of the penalized estimating function at βI0, the first d1 equations yield
Using condition (B2), we have
Let Yi = n^{−1/2} vT B^{−1/2} {ℒIi(βI0, θ0) − Λ(Zi) Ψi(βI0, θ0)}, i = 1, …, n. It follows that for any ε > 0,
Using the Chebyshev inequality, we have Pr(||Y1|| > ε) = O(n^{−1}), and E(||Y1||^4) is bounded by a constant multiple of n^{−2} by condition (B2). Hence,
On the other hand,
(ii) follows from the Lindeberg–Feller central limit theorem.
Contributor Information
Yanyuan Ma, Email: ma@stat.tamu.edu, Department of Statistics, Texas A&M University, College Station, TX 77843.
Runze Li, Email: rli@stat.psu.edu, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802.
References
- Bickel PJ, Ritov Y. Efficient estimation in the errors-in-variables model. Annals of Statistics. 1987;15:513–540.
- Cai J, Fan J, Li R, Zhou H. Variable selection for multivariate failure time data. Biometrika. 2005;92:303–316.
- Candès E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Annals of Statistics. 2007;35:2313–2392.
- Carroll RJ, Hall P. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association. 1988;83:1184–1186.
- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu C. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. London: CRC Press; 2006.
- Delaigle A, Hall P. Using SIMEX for smoothing-parameter choice in errors-in-variables problems. Journal of the American Statistical Association. 2007;103:280–287.
- Delaigle A, Meister A. Nonparametric regression estimation in the heteroscedastic errors-in-variables problem. Journal of the American Statistical Association. 2007;102:1416–1426.
- Fan J. On the optimal rates of convergence for nonparametric deconvolution problems. Annals of Statistics. 1991;19:1257–1272.
- Fan J, Huang T. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli. 2005;11:1031–1057.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Lv J. Sure independence screening for ultra-high dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B. 2008;70:849–911.
- Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics. 2004;32:928–961.
- Hall P, Ma Y. Semiparametric estimators of functional measurement error models with unknown error. Journal of the Royal Statistical Society, Series B. 2007;69:429–446.
- Härdle W, Liang H, Gao J. Partially Linear Models. Heidelberg: Springer Physica; 2000.
- Hunter D, Li R. Variable selection using MM algorithms. Annals of Statistics. 2005;33:1617–1642.
- Kannel WB, Neaton JD, Wentworth D, Thomas HE, Stamler J, Hulley SB, Kjelsberg MO. Overall and coronary heart disease mortality rates in relation to major risk factors in 325,348 men screened for MRFIT. American Heart Journal. 1986;112:825–836.
- Lam C, Fan J. Profile-kernel likelihood inference with diverging number of parameters. Annals of Statistics. 2008;36:2232–2260.
- Li R, Liang H. Variable selection in semiparametric regression modeling. Annals of Statistics. 2008;36:261–286.
- Li R, Nie L. A new estimation procedure for a partially nonlinear model via a mixed-effects approach. Canadian Journal of Statistics. 2007;35:399–411.
- Li R, Nie L. Efficient statistical inference procedures for partially nonlinear models and their applications. Biometrics. 2008;64:904–911.
- Liang H, Härdle W, Carroll RJ. Estimation in a semiparametric partially linear errors-in-variables model. Annals of Statistics. 1999;27:1519–1535.
- Liang H, Li R. Variable selection for partially linear models with measurement errors. Journal of the American Statistical Association. 2009, in press.
- Ma Y, Carroll RJ. Locally efficient estimators for semiparametric models with measurement error. Journal of the American Statistical Association. 2006;101:1465–1474.
- Ma Y, Li R. Variable selection in measurement error models. Technical report. 2007. Available at http://www2.unine.ch/webdav/site/statistics/shared/documents/v10.pdf.
- Ma Y, Tsiatis AA. Closed form semiparametric estimators for measurement error models. Statistica Sinica. 2006;16:183–193.
- Severini TA, Staniswalis JG. Quasilikelihood estimation in semiparametric models. Journal of the American Statistical Association. 1994;89:501–511.
- Stefanski LA, Carroll RJ. Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika. 1987;74:703–716.
- Tsiatis AA, Ma Y. Locally efficient semiparametric estimators for functional measurement error models. Biometrika. 2004;91:835–848.
- Wang H, Li R, Tsai C-L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568.
- Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Annals of Statistics. 2008;36:1509–1566.