Abstract
Varying coefficient models allow us to generalize standard linear regression models to incorporate complex covariate effects by modeling the regression coefficients as functions of another covariate. For nonparametric varying coefficients, we can borrow the idea of parametrically guided estimation to improve asymptotic bias. In this paper, we develop a guided estimation procedure for the nonparametric varying coefficient models. Asymptotic properties are established for the guided estimators and a method of bandwidth selection via bias-variance tradeoff is proposed. We compare the performance of the guided estimator with that of the unguided estimator via both simulation and real data examples.
Keywords: generalized linear models, local polynomial smoothing, nonparametric regression, parametrically guided estimation, varying coefficient model
1. Introduction
A common scenario in studies is when the researcher observes some covariates and a response and wants to estimate the conditional mean function of the response given the covariates. A typical approach is to fit a generalized linear model (GLM) (Nelder and Wedderburn 1972) where a parametric assumption is made on a transformation of the conditional mean. This approach is easily interpreted and efficient if the correct parametric model is chosen, but can have serious consequences when the model is misspecified, a common problem in real-world applications. Alternatively, nonparametric methods make few or no assumptions about the model and are robust to model misspecification; however, they are slower to converge (Glad 1998) and can fail when the dimension of the covariates is too high (Fan and Zhang 2008). In this paper, we combine the benefits of both parametric and nonparametric methods by considering estimation of nonparametric varying coefficient models using a pre-specified parametric family of functions as a guide. We use a local likelihood kernel smoothing based estimation procedure and investigate the asymptotic properties of the resulting estimates. We evaluate the finite sample performance of our method via a simulation study and demonstrate it by applying it to the AIDS Clinical Trials Group (ACTG) Protocol 315 data described in Section 5.
Varying coefficient models (VCMs) are nonparametric generalized additive models (Wood 2006) that increase the flexibility of linear models by using a smooth function to model the parameters. There are numerous advantages of using VCMs over their parametric linear model counterparts. First, fitting a standard linear model is too restrictive because few real world problems satisfy the assumption of linearity. The true, possibly nonlinear, relationship between the covariates is not well captured by polynomial fitting, which can lead to large bias in estimation. Second, VCMs allow for the interaction between covariates to be modeled in a nonparametric way. Models with only main effects included may miss the effect from the interaction between covariates. Another advantage of using VCMs is that the dimensionality issue is avoided because the coefficient functions are allowed to vary as a function of another covariate (Hastie and Tibshirani 1993). VCMs are easy to interpret and arise when it is desirable to know how regression coefficients change over groups, or over time in longitudinal studies. These flexible models can be applied to a variety of data types, including longitudinal data (Brumback and Rice 1998; Hoover, Rice, Wu, and Yang 1998), time series data (Chen and Tsay 1993; Huang and Shen 2004), environmental data (Fan and Zhang 1999), and genetic data (Ma, Yang, Romero, and Cui 2011).
VCMs consist of smooth, functional parameters that need to be estimated and this can be done using penalized spline approaches (Eilers and Marx 1996; Cao, Lin, Wu, and Yu 2010), basis expansion methods (Holdeman 1969), or by applying regression locally (Chen and Tsay 1993; Cai, Fan, and Li 2000). In this paper, we utilize the latter method since these estimators are efficient and have nice sampling properties (Fan 1993). This method involves using a kernel function to weight the likelihood and applying polynomial regression locally using Taylor series approximations. In practice, the full likelihood may be unknown or difficult to construct, so we replace it with the quasi-likelihood introduced by Wedderburn (1974). With the quasi-likelihood, only the relationship between the mean and the variance needs to be specified, and the model will still retain most of the efficiency of a maximum likelihood estimation procedure.
Nonparametric estimators can be enhanced by using a parametric guide. In practice, previous information or exploratory analysis may give some insight on the shape of the unknown functions and this information can be used to speed up convergence and reduce bias. The basic process is as follows: 1) identify a parametric family of functions that captures the shape, 2) remove the trend and carry out local polynomial estimation, and 3) add the trend back to obtain the final estimators of the functions. This guided estimation scheme was first studied in the density estimation framework where Hjort and Glad (1995) showed that their estimator had better bias and similar variance compared to the traditional nonparametric estimator, even when the guide was superficial. It has also been studied in least squares regression (Glad 1998; Martins-Filho, Mishra, and Ullah 2008), quasi-likelihood models (Fan, Wu, and Feng 2009), and nonparametric additive models (Fan, Maity, Wang, and Wu 2013).
In this paper, we propose a way to improve the coefficient functions in VCMs by using guided techniques. Cai et al. (2000) first proposed local estimation techniques for VCM coefficient functions using local-likelihood equations, and presented asymptotic properties of their estimators. Fan et al. (2009) propose using quasi-likelihood models as a general case when the likelihood is unavailable and present guided estimation when a single covariate and response are observed. Fan et al. (2013) extend this methodology to an additive model where multiple covariates are observed but do not interact with each other. The main contribution of this paper is that we allow for interaction between two observed covariates and thus extend previous works to VCMs, of which Fan et al. (2009) is a special case. We borrow the generality of the quasi-likelihood models and the idea of parametrically guided estimation to improve the bias of our estimators, and we extend the optimal bandwidth selection methodology for the kernel weight function when we apply local polynomial regression. We also give asymptotic theory and results for the guided estimators in the VCM framework, which differs from the GLM and the additive model in Fan et al. (2009) and Fan et al. (2013), respectively. We show in simulations that the guided estimators have lower bias and similar variance when a fixed bandwidth is used, and lower bias and variance when the optimal bandwidth is used. We then estimate the functional parameters in the ACTG Protocol 315 data.
The rest of this paper is organized as follows. In Section 2 we give an overview of quasi-likelihood models (QLMs) and the standard nonparametric estimation procedure using local polynomial fitting. We propose our parametrically guided estimation scheme in Section 2.2 using two different types of corrections, and give some asymptotic properties of our estimators in Section 2.3. In Section 3, we present one method of choosing the bandwidth parameter in local polynomial fitting. In Section 4, we evaluate the performance of our estimators compared to the standard ones and find that our estimators have lower bias when a fixed bandwidth is used, and lower bias and variability when the optimal bandwidth is chosen. We then apply our methodology to the ACTG data in Section 5 and provide some concluding remarks in Section 6.
2. Guided Estimation for Varying Coefficient Models
Assume that for each of n subjects we observe covariates Xi = (1, Xi1, …, Xiq)T and Ti, and a response Yi. The VCM for these covariates is defined as
$$g(\mu_i) = X_i^{T}\theta(T_i) = \theta_0(T_i) + \sum_{j=1}^{q} X_{ij}\,\theta_j(T_i), \tag{1}$$
where μi = E(Yi|Xi, Ti) is the conditional mean of the response, g(·) is a link function from the GLM framework, and θ(·) = {θ0(·), θ1(·), …, θq (·)}T are unknown, smooth functions. The first term models the unique effect of T and the remaining terms model the interaction between X and T. This VCM is more flexible than a linear regression model because it allows the effect of X to vary smoothly with T and the effect of T is not restricted to a linear assumption. The goal is to estimate θ(·) and there are several ways to do this (Cleveland, Grosse, and Shyu 1991; Hastie and Tibshirani 1993). The method we adopt is to use local-likelihood kernel smoothing using a quasi-likelihood.
2.1. Framework and Local Likelihood Estimation
Quasi-likelihood models (QLMs), an extension of GLMs, are ideal because the full likelihood is often unknown or difficult to construct. In QLMs, only the relationship between the conditional mean and variance of the response needs to be specified, which is often doable in practice. The full conditional log-likelihood is replaced with a quasi-likelihood function Q(μi, Yi) and if we define var(Yi|Xi, Ti) = V(μi), then Q satisfies
$$\frac{\partial}{\partial\mu}Q(\mu, y) = \frac{y - \mu}{V(\mu)}.$$
Wedderburn (1974) shows that Q has similar properties to the log-likelihood and that Q is exactly the log-likelihood when the response comes from a one-parameter exponential family.
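As a small illustration of the quasi-likelihood idea, the sketch below takes the variance function V(μ) = μ, for which Q(μ, y) = y log μ − μ (the Poisson log-likelihood up to a term free of μ), and checks numerically that its derivative in μ equals (y − μ)/V(μ). The function names and the evaluation point are illustrative assumptions, not part of the paper.

```python
import math

def quasi_likelihood_poisson(mu, y):
    """Q(mu, y) = y*log(mu) - mu, the quasi-likelihood for V(mu) = mu
    (up to an additive term that does not depend on mu)."""
    return y * math.log(mu) - mu

def quasi_score(mu, y, variance):
    """Quasi-score (y - mu) / V(mu), the derivative of Q with respect to mu."""
    return (y - mu) / variance(mu)

# Compare a central finite difference of Q with the quasi-score
# at an arbitrary (hypothetical) point.
mu, y, eps = 3.0, 5.0, 1e-6
numeric = (quasi_likelihood_poisson(mu + eps, y)
           - quasi_likelihood_poisson(mu - eps, y)) / (2 * eps)
exact = quasi_score(mu, y, lambda m: m)
print(numeric, exact)
```

Only the variance function enters the score, which is why a mean-variance relationship is enough to define the estimating equations.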
The standard, unguided, nonparametric procedure for estimating θ(·) is to use local polynomial fitting by first approximating the functions using a Taylor series expansion so that
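$$\theta_j(T_i) \approx \sum_{p=0}^{P}\beta_{jp}(T_i - t_0)^{p}, \tag{2}$$
where $\beta_{jp} = \theta_j^{(p)}(t_0)/p!$ for p = 0, …, P and j = 0, …, q. Substituting this approximation into (1) yields $g(\mu_i) \approx G_i^{T}\beta$, where $G_i = \{1, (T_i - t_0), \ldots, (T_i - t_0)^{P}\}^{T} \otimes X_i$, $\beta = (\beta_0^{T}, \ldots, \beta_P^{T})^{T}$, and $\beta_p = (\beta_{0p}, \ldots, \beta_{qp})^{T}$. The approximation in (2) is only accurate when Ti is close to t0, so the quasi-likelihood is weighted in such a way that more weight is given to observations with Ti close to t0 and little to no weight to those with Ti far from t0. This is done by using a kernel function Kh(·) = K(·/h)/h and defining the local quasi-likelihood as
$$Q(\beta) = \sum_{i=1}^{n} Q\{g^{-1}(G_i^{T}\beta), Y_i\}\,K_h(T_i - t_0). \tag{3}$$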
The parameter h is the bandwidth and needs to be estimated (see Section 3). We maximize (3) with respect to β and the solution β̂0 will be the estimate of θ(t0).
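The unguided local fit can be sketched as follows for a Poisson response with the log link, a single covariate (q = 1), and a local linear approximation (P = 1). This is a minimal sketch, not the authors' implementation: the function names, the Newton-Raphson solver, the sample size, and the bandwidth h = 0.4 are assumptions chosen for illustration, and the data mimic the Poisson design of Section 4.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def local_linear_poisson(x, t, y, t0, h, n_iter=50, tol=1e-10):
    """Maximize the kernel-weighted Poisson log-likelihood (log link) at t0.

    Returns the first q + 1 = 2 coefficients, the estimate of theta(t0).
    """
    X = np.column_stack([np.ones_like(x), x])      # X_i = (1, X_i1)^T
    G = np.hstack([X, X * (t - t0)[:, None]])      # G_i = {1, (T_i - t0)}^T kron X_i
    w = epanechnikov((t - t0) / h) / h             # kernel weights K_h(T_i - t0)
    beta = np.zeros(G.shape[1])
    beta[0] = np.log(np.sum(w * y) / np.sum(w))    # start at the local weighted mean
    for _ in range(n_iter):                        # Newton-Raphson iterations
        mu = np.exp(G @ beta)
        score = G.T @ (w * (y - mu))
        hessian = (G * (w * mu)[:, None]).T @ G
        step = np.linalg.solve(hessian, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta[:2]

# Simulated data with theta_0(t) = sin(pi t / 2) + 4 and
# theta_1(t) = sin(pi t / 4 - pi / 2) / 2 + 1.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1.0, 1.0, n)
t = rng.uniform(-2.0, 2.0, n)
theta0 = np.sin(np.pi * t / 2) + 4
theta1 = np.sin(np.pi * t / 4 - np.pi / 2) / 2 + 1
y = rng.poisson(np.exp(theta0 + theta1 * x))

theta_hat = local_linear_poisson(x, t, y, t0=0.0, h=0.4)
print(theta_hat)  # true values at t0 = 0 are (4.0, 0.5)
```

Repeating the call over a grid of t0 values traces out the estimated coefficient functions.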
2.2. Guided Estimation
The local likelihood estimators can be enhanced by using a parametric guide. Intuitively speaking, more curvature in a function makes it more difficult to estimate. If, through exploratory analysis or prior information, one has some idea of the shape of the true function, then one can identify a parametric family that captures this trend. Using the parametric guide, the curvature of the function can be removed yielding a flatter curve that is easier to estimate. Once this flatter curve is estimated, then the guide can be used to add the trend back and obtain the final estimate of the original function. This type of guided estimation has been shown to reduce bias of nonparametric estimators (Hjort and Glad 1995; Glad 1998; Martins-Filho et al. 2008) and improve variance since a larger bandwidth can be selected (Fan et al. 2009, 2013).
Define a parametric family that captures the trend of the function as {θjg(t, αj): αj = (αj1, …, αjmj)T} for j = 0, …, q, where each parameter vector αj ranges over a subset of ℝ^{mj}. The optimal guides can be found by maximizing the quasi-likelihood
$$\sum_{i=1}^{n} Q[g^{-1}\{X_i^{T}\theta_g(T_i, \alpha)\}, Y_i]$$
with respect to $\alpha = (\alpha_0^{T}, \ldots, \alpha_q^{T})^{T}$, where θg(Ti, α) = {θ0g(Ti, α0), …, θqg(Ti, αq)}T. The best fit is denoted by θ̂g(t) = θg(t, α̂), where we suppress the dependence on α in our notation. In this paper we present two methods of removing the trend: an additive correction and a multiplicative correction.
2.2.1. Additive Correction
If the curvature of θj is well approximated by θ̂jg, then estimating the quantity θj(t) − θ̂jg(t) will yield more accurate and less variable estimates since this quantity is close to flat. Once estimated, the guide is added back to give the final estimate of θj. This process can be achieved in one step by defining ηj(t) = θj(t) − θ̂jg(t) + θ̂jg(t0) and estimating ηj at t0. This definition of ηj is known as the additive correction.
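Two short identities, which follow directly from the definition of ηj above, make explicit why the one-step approach works: evaluating ηj at t0 returns θj(t0), and the curvature of ηj is the curvature of θj minus that of the guide.

```latex
\begin{aligned}
\eta_j(t_0) &= \theta_j(t_0) - \hat\theta_{jg}(t_0) + \hat\theta_{jg}(t_0) = \theta_j(t_0), \\
\eta_j''(t) &= \theta_j''(t) - \hat\theta_{jg}''(t).
\end{aligned}
```

In particular, if the guide matched the second derivative of θj exactly, ηj would be linear in t and a local linear fit would have no leading-order approximation error.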
Using this correction, the VCM in (1) can be rewritten as
$$g(\mu_i) = X_i^{T}\{\eta(T_i) + h(T_i)\},$$
where η(t) = {η0(t), …, ηq(t)}T, and h(t) = {θ̂0g(t) − θ̂0g(t0), …, θ̂qg(t) − θ̂qg(t0)}T. Similar to Section 2.1 and with a slight abuse of notation, a Taylor series approximation is used for ηj(Ti) about the point t0 such that
$$\eta_j(T_i) \approx \sum_{p=0}^{P}\beta_{jp}(T_i - t_0)^{p},$$
where $\beta_{jp} = \eta_j^{(p)}(t_0)/p!$ for p = 0, …, P. The local quasi-likelihood
$$Q(\beta) = \sum_{i=1}^{n} Q[g^{-1}\{G_i^{T}\beta + X_i^{T}h(T_i)\}, Y_i]\,K_h(T_i - t_0)$$
is maximized with respect to β, and β̂0 corresponds to η̂(t0), which gives the final estimate θ̂(t0) because ηj(t0) = θj(t0). Because h(Ti) is known once the guide has been fitted, our model can be fit using standard software with XiT h(Ti) as an offset.
2.2.2. Multiplicative Correction
An alternative correction which leads to a different guided estimator is the multiplicative correction. As in the additive case, the ratio θj(t)/θ̂jg(t) will be flat if θ̂jg(t) captures the trend of θj(t) and estimating this ratio will be less biased than estimating the unknown function directly. Once estimated, the ratio is then multiplied by the guide to get the final estimate of θj. The multiplicative correction is defined as ηj(t) = θj(t){θ̂jg(t0)/θ̂jg(t)} and the one step solution requires estimating ηj(t) at t0.
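Using the multiplicative correction, (1) is written as
$$g(\mu_i) = \sum_{j=0}^{q} X_{ij}\,\frac{\hat\theta_{jg}(T_i)}{\hat\theta_{jg}(t_0)}\,\eta_j(T_i) = \tilde X_i^{T}\eta(T_i),$$
where $X_{i0} = 1$ and $\tilde X_i$ has elements $\tilde X_{ij} = X_{ij}\,\hat\theta_{jg}(T_i)/\hat\theta_{jg}(t_0)$. Estimating η(t0) is achieved by first using a Taylor series expansion of ηj(t) about the point t0 and then maximizing
$$Q(\beta) = \sum_{i=1}^{n} Q\{g^{-1}(\tilde G_i^{T}\beta), Y_i\}\,K_h(T_i - t_0)$$
with respect to β, where $\tilde G_i = \{1, (T_i - t_0), \ldots, (T_i - t_0)^{P}\}^{T} \otimes \tilde X_i$, and ⊗ denotes the Kronecker product between the two vectors. The solution β̂0 gives our final estimate θ̂(t0). Because the correction is absorbed into the design matrix, there is no offset for the multiplicative correction and this model can easily be fit using standard GLM software.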
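The two corrections can be sketched side by side for a Gaussian response with the identity link, where maximizing the local quasi-likelihood reduces to kernel-weighted least squares. This is a minimal sketch under stated assumptions: the helper names, sample size, evaluation point t0 = 1, and bandwidth h = 0.5 are hypothetical, the guides are a cubic for θ0 and a quadratic for θ1 fitted by ordinary least squares, and the true functions are those of the Gaussian simulation example in Section 4.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def local_linear_ls(X, t, y, t0, h, offset=0.0):
    """Kernel-weighted least squares at t0; returns the leading block beta_0."""
    G = np.hstack([X, X * (t - t0)[:, None]])      # {1, (T_i - t0)}^T kron X_i
    w = epanechnikov((t - t0) / h) / h
    A = (G * w[:, None]).T @ G
    b = G.T @ (w * (y - offset))
    return np.linalg.solve(A, b)[: X.shape[1]]

def theta0(s):
    return np.sin(np.pi * s / 2) - 2

def theta1(s):
    return 2 * np.sin(np.pi * s / 4 - np.pi / 2) + 3

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(-1.0, 1.0, n)
t = rng.uniform(-2.0, 2.0, n)
y = theta0(t) + theta1(t) * x + rng.normal(0.0, 1.0, n)

# Step 1: fit the parametric guides (cubic for theta_0, quadratic for theta_1).
D = np.column_stack([np.ones(n), t, t ** 2, t ** 3, x, x * t, x * t ** 2])
alpha = np.linalg.lstsq(D, y, rcond=None)[0]

def guide(s):
    """Fitted guides evaluated at s; one row per point, one column per function."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    g0 = alpha[0] + alpha[1] * s + alpha[2] * s ** 2 + alpha[3] * s ** 3
    g1 = alpha[4] + alpha[5] * s + alpha[6] * s ** 2
    return np.column_stack([g0, g1])

t0, h = 1.0, 0.5
X = np.column_stack([np.ones(n), x])
g_t, g_t0 = guide(t), guide(t0)[0]

# Additive correction: offset X_i^T h(T_i) with h(t) = guide(t) - guide(t0).
offset = np.sum(X * (g_t - g_t0), axis=1)
theta_add = local_linear_ls(X, t, y, t0, h, offset=offset)

# Multiplicative correction: rescale column j by guide_j(T_i) / guide_j(t0); no offset.
theta_mult = local_linear_ls(X * (g_t / g_t0), t, y, t0, h)

theta_naive = local_linear_ls(X, t, y, t0, h)      # unguided fit for comparison
truth = np.array([theta0(t0), theta1(t0)])
print(truth, theta_naive, theta_add, theta_mult)
```

The multiplicative version divides by the guide at t0, so it is only usable where the fitted guides stay away from zero, as they do in this design.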
2.3. Asymptotic Properties
In this section, we investigate the asymptotic properties of the proposed guided estimators. For illustration purposes, we present the case where estimation is performed using a local linear approximation (P = 1) and an additive guide. Similar results for general P and the multiplicative guide can be obtained by straightforward but tedious algebra.
Define κd = ∫u^d K(u) du and νd = ∫u^d K²(u) du. Define M to be a 2 × 2 matrix with elements Mkl = κ_{k+l−2} and R to be a 2 × 2 matrix with elements Rkl = ν_{k+l−2}. Let ρ(x, t) = 1/[V{μ(x, t)}g′{μ(x, t)}²] and γ(x, t0) = var(Y1|X1 = x, T1 = t0)/[V{μ(x, t0)}g′{μ(x, t0)}]². Make the definitions
where ft(·) is the marginal density function of T. Then for a fixed guide, we have the following result.
Theorem 2.1
Fix a point t0 and assume the guide is fixed. Under the conditions stated in Section A.1 of the Appendix, as h → 0, nh → ∞, and nh⁵ → constant, we have
where Σ is the leading (q + 1) × (q + 1) submatrix of .
There are two noteworthy points to make about Theorem 2.1. The first is that if there is no (or a constant) guide and the model belongs to a one-parameter exponential family with the canonical link and correctly specified variance function, then η″(t0) = θ″(t0), ρ(x, t) = γ(x, t) = V{μ(x, t)}, and our result reduces to Theorem 1 of Cai et al. (2000). The second is that only the bias term B1(t0) is affected when a guide is used, and not the asymptotic variance. To be specific, consider the integrated squared bias for each function θj(·). Because the bias depends on the guide only through η″j, the integrated squared bias of the guided estimate is smaller than that of the unguided one when ∫{η″j(t)}² dt < ∫{θ″j(t)}² dt. This is analogous to Remark 3 of Fan et al. (2009). Therefore when a fixed bandwidth is used, the squared bias (and hence the MSE) is reduced if an appropriate guide is chosen. Furthermore, finding the optimal bandwidth using the procedure from Section 3 allows for a larger bandwidth to be selected compared to the unguided estimate, which will reduce the variance as well as the bias. We demonstrate this in our simulation study in Section 4.
Theorem 2.1 assumes a fixed guide but in practice the guide needs to be estimated. Theorem 2.2 states that the expressions above are still valid for estimated guides as well.
Theorem 2.2
Let fjnt(x, t, y) = ft(t) exp(Q[g−1{xT θ(t)}, y]) be the true joint density and fprp(x, t, y, α) = ft(t) exp(Q[g−1{xTθg(t, α)}, y]) be the proposed joint density. Define α* to be the minimizer of the Kullback-Leibler distance between fprp and fjnt. Then under the White (1982)-type conditions in the Appendix, the same result as in Theorem 2.1 holds when an estimated guide θ̂g(·) is used in place of a fixed guide with the modification that α is now replaced by α*.
The proofs of Theorems 2.1 and 2.2 are given in the Appendix.
Remark 1
The asymptotic results presented in this section are an extension of the results from Fan et al. (2009). Consider the special case where the covariate contains only the intercept Xi = 1 and parameter function θ(·) ≡ θ0(·). Then we have ρ(x, t) = ρ(t) = 1/[V {μ(t)}g′2{μ(t)}], γ(x, t) = γ(t) = var(Y1|T1 = t)/[V {μ(t)}g′{μ(t)}]2, V1(t0) = ft(t0)ρ(t0)M, V2(t0) = ft(t0)γ(t0)R, and B1(t0) = M−1(κ2, κ3)T η″(t0)/2. With these simplified definitions, the result in Theorem 2.1 implies that
where σ(t0) = ν0var(Y1|T1 = t0)g′2{μ(t)}/ft (t0). This result is as in Theorem 1 in Fan et al. (2009) with the assumption that κ0 = ∫K(z) dz = 1.
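For a concrete sense of the kernel constants appearing in this section, the sketch below evaluates κd and νd numerically for the Epanechnikov kernel used in Sections 4 and 5 and assembles the 2 × 2 matrices M and R. The midpoint integration rule and the function names are assumptions made for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def integrate(f, a=-1.0, b=1.0, steps=20000):
    """Midpoint rule over [a, b]; adequate for the polynomial integrands here."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

def kappa(d):
    """kappa_d = integral of u^d K(u) du."""
    return integrate(lambda u: u ** d * epanechnikov(u))

def nu(d):
    """nu_d = integral of u^d K(u)^2 du."""
    return integrate(lambda u: u ** d * epanechnikov(u) ** 2)

# With k, l = 1, 2, the entries are M_kl = kappa_{k+l-2} and R_kl = nu_{k+l-2};
# the zero-based indices below therefore use kappa(k + l) and nu(k + l).
M = [[kappa(k + l) for l in range(2)] for k in range(2)]
R = [[nu(k + l) for l in range(2)] for k in range(2)]
print(M)  # approximately [[1, 0], [0, 0.2]]
print(R)  # approximately [[0.6, 0], [0, 3/35]]
```

For this kernel κ0 = 1, so the normalization assumed at the end of Remark 1 holds, and the odd moments vanish by symmetry, which makes both M and R diagonal.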
3. Optimal Bandwidth Selection
Note that for simplicity of presentation, we describe our bandwidth selection method using the additive correction; the multiplicative correction follows easily by using the multiplicative definition of ηj(·), replacing Gi with its rescaled counterpart from Section 2.2.2, and omitting the offset term XiT h(Ti).
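Once β̂ is obtained, the bias arises from the approximation term of the Taylor series expansions. Hence, using more terms in the series should theoretically produce less bias. Let $r_{ji} = \eta_j(T_i) - \sum_{p=0}^{P}\beta_{jp}(T_i - t_0)^{p}$ be the approximation error. If a higher order Taylor approximation with a additional terms is substituted for ηj(Ti) in $r_{ji}$, then $r_{ji} \approx \sum_{p=P+1}^{P+a}\beta_{jp}(T_i - t_0)^{p}$. We then maximize the local quasi-likelihood including the approximation errors
$$Q^{*}(\beta) = \sum_{i=1}^{n} Q[g^{-1}\{G_i^{T}\beta + X_i^{T}h(T_i) + X_i^{T}r_i\}, Y_i]\,K_h(T_i - t_0)$$
with respect to β where ri = (r0i, …, rqi)T. Define the maximizer as β̂*. The local quasi-likelihood Q*(β) is differentiated with respect to β to get the gradient vector
$$Q^{*\prime}(\beta) = \sum_{i=1}^{n} \frac{Y_i - \mu_i^{*}}{V(\mu_i^{*})\,g'(\mu_i^{*})}\,G_i\,K_h(T_i - t_0), \qquad \mu_i^{*} = g^{-1}\{G_i^{T}\beta + X_i^{T}h(T_i) + X_i^{T}r_i\},$$
and the second derivative is taken to get the Hessian matrix Q*″(β). A Taylor series expansion is then applied to Q*′ around β̂ to get
$$0 = Q^{*\prime}(\hat\beta^{*}) \approx Q^{*\prime}(\hat\beta) + Q^{*\prime\prime}(\hat\beta)(\hat\beta^{*} - \hat\beta),$$
and thus, an approximation of the estimation bias is β̂ − β̂* ≈ {Q*″(β̂)}−1Q*′(β̂).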
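To get an approximation of the variance, a Taylor series expansion of Q′(β̂) is done about the true β, denoted by β0. Note that
$$0 = Q'(\hat\beta) \approx Q'(\beta_0) + Q''(\beta_0)(\hat\beta - \beta_0),$$
which implies β̂ − β0 ≈ −{Q″(β0)}−1Q′(β0), and the estimate of the conditional variance is
$$\mathrm{var}(\hat\beta - \beta_0 \mid x, t) \approx \{Q''(\beta_0)\}^{-1}\,\mathrm{var}\{Q'(\beta_0) \mid x, t\}\,\{Q''(\beta_0)\}^{-1},$$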
where Q″(β0) can be approximated by Q″(β̂). To approximate the variance term, note that
The last approximation follows from the fact that Ti only has significant weight in the neighborhood of t0.
To compute the MSE, we denote the bias of θ̂ as B(t0; h) = {B0(t0; h), …, Bq(t0; h)}T, corresponding to the first q + 1 components of {Q*″(β̂)}−1Q*′(β̂). The variance-covariance matrix V(t0; h) of θ̂ is the first (q + 1) × (q + 1) submatrix of var(β̂ − β0|x, t), and the variance of the estimated VCM is xT V(t0; h)x. The conditional MSE of XT θ̂ given X = x is
$$\mathrm{MSE}(x, t_0; h) = \{x^{T}B(t_0; h)\}^{2} + x^{T}V(t_0; h)\,x.$$
The sample MSE is derived as
We propose to choose h such that
To summarize, we use a grid of t0’s and a grid of candidate bandwidths. We fit the local quasi-likelihood for each t0 and h candidate and calculate the extended residual squares criterion (ERSC) defined as ERSC(x, t; h) = σ̂2{1+(p+1)/N}, where σ̂2 is the weighted residual sum of squares after fitting a local pth-order polynomial and N is the number of local data points (see (5.6) of Fan, Farmen, and Gijbels (1998) for details). We then sum the ERSCs over t0 and the bandwidth with the lowest sum becomes the pilot bandwidth. Using this bandwidth, we fit a local quasi-likelihood for each t0 to obtain β̂. For a new set of candidate bandwidths, we fit the local quasi-likelihood using higher order Taylor series approximation and obtain β̂*, which is theoretically more accurate. We compute the bias and variance using the gradient and Hessian of the quasi-likelihood, compute the MSE and the candidate bandwidth with the lowest MSE is our optimal bandwidth.
4. Simulation
We conducted a simulation study to evaluate the performance of our estimators. We generated each observation (Xi, Ti, Yi) by first simulating the covariates Xi and Ti from a uniform distribution. Then, given Xi and Ti, the conditional mean μi of the response was generated as
$$g(\mu_i) = \theta_0(T_i) + \theta_1(T_i)X_i,$$
where g is the canonical link. We used a grid of equally spaced values tk for k = 1, …, K = 100 to estimate the two functions. A cubic guide θ0g(t, α0) = α01 + α02t + α03t² + α04t³ was used for estimating θ0 and a quadratic guide θ1g(t, α1) = α11 + α12t + α13t² was used for estimating θ1. We used local linear polynomial estimators with the Epanechnikov kernel weight, and for bias calculations, we chose the degree of the Taylor expansion to be a = 1.
For R = 1000 simulations, we generated the data and estimated the parameters for the cubic and quadratic guides as described above. To get the final estimates θ̂0 and θ̂1, we maximized the local quasi-likelihood with the appropriate distribution and canonical link, and Epanechnikov kernel weight. The design matrix of the local likelihood was constructed using both the additive and multiplicative corrections (Gi and its rescaled counterpart from Section 2.2.2, respectively). To find the optimal smoothing bandwidth, we first simulated 15 data sets and applied our methods from Section 3. We took the median of these 15 values as the optimal bandwidth and used that value as fixed for the 1000 simulations. We also compared the two methods using the optimal bandwidth from the unguided method as the fixed bandwidth.
Once the two functions were estimated, we computed the marginal squared bias, marginal variance, and marginal MSE of each. Let $\hat\theta_j^{(r)}(t_k)$ denote the estimate of θj(tk) from simulation r and let $\bar\theta_j(t_k) = R^{-1}\sum_{r=1}^{R}\hat\theta_j^{(r)}(t_k)$, where j = 0, 1 and r indexes the simulation. The average marginal squared bias of θ̂j is $K^{-1}\sum_{k=1}^{K}\{\bar\theta_j(t_k) - \theta_j(t_k)\}^{2}$, the average marginal variance is $K^{-1}\sum_{k=1}^{K}R^{-1}\sum_{r=1}^{R}\{\hat\theta_j^{(r)}(t_k) - \bar\theta_j(t_k)\}^{2}$, and the average marginal MSE is $K^{-1}\sum_{k=1}^{K}R^{-1}\sum_{r=1}^{R}\{\hat\theta_j^{(r)}(t_k) - \theta_j(t_k)\}^{2}$. However, instead of averaging over all k, we used the 10% trimmed mean. Tables A1 – A3 show the results of our simulations. “Best h” is each estimation method’s optimal smoothing bandwidth obtained via the bias-variance tradeoff in Section 3. “Same h” refers to the fixed bandwidth obtained from the optimal bandwidth of the unguided estimators. All values for squared bias, variance, and MSE in the tables are multiplied by 100.
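The summary measures can be computed along the following lines. This is a sketch rather than the authors' code: the function names are hypothetical, and the trimming convention (dropping 10% of the grid values from each tail before averaging) is an assumption, since the text does not say how the trimming was applied.

```python
def trimmed_mean(values, proportion=0.10):
    """Mean after dropping `proportion` of the sorted values from each tail
    (assumed convention for the 10% trimmed mean)."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def marginal_summaries(estimates, truth, proportion=0.10):
    """Trimmed averages of squared bias, variance, and MSE over the grid.

    estimates[r][k] is the estimate at grid point k in simulation r;
    truth[k] is the true function value at grid point k.
    """
    n_sim, n_grid = len(estimates), len(truth)
    bias2, var, mse = [], [], []
    for k in range(n_grid):
        column = [estimates[r][k] for r in range(n_sim)]
        mean_k = sum(column) / n_sim
        bias2.append((mean_k - truth[k]) ** 2)
        var.append(sum((v - mean_k) ** 2 for v in column) / n_sim)
        mse.append(sum((v - truth[k]) ** 2 for v in column) / n_sim)
    return tuple(trimmed_mean(m, proportion) for m in (bias2, var, mse))

# Tiny hypothetical example: two simulations, two grid points.
summary = marginal_summaries([[1.0, 2.0], [3.0, 2.0]], [2.0, 1.0])
print(summary)
```

With the variance normalized by the number of simulations, the squared bias and variance add up to the MSE at each grid point before trimming.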
Table A1.
Results of trimmed average squared bias, variance, and MSE for Example 1. “Same h” refers to the fixed bandwidth and “Best h” refers to the optimal bandwidth. All values are multiplied by 100.
| n | Bandwidth | Function | Bias² Naive | Bias² Additive | Bias² Multiplicative | Variance Naive | Variance Additive | Variance Multiplicative | MSE Naive | MSE Additive | MSE Multiplicative |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | Same h | θ0(t) | 0.129 | 0.022 | 0.022 | 0.179 | 0.173 | 0.174 | 0.308 | 0.195 | 0.196 |
| 100 | Same h | θ1(t) | 0.003 | 0.001 | 0.001 | 0.552 | 0.522 | 0.522 | 0.556 | 0.522 | 0.522 |
| 100 | Best h | θ0(t) | 0.129 | 0.098 | 0.096 | 0.179 | 0.104 | 0.106 | 0.308 | 0.203 | 0.202 |
| 100 | Best h | θ1(t) | 0.003 | 0.001 | 0.001 | 0.552 | 0.294 | 0.296 | 0.556 | 0.295 | 0.297 |
| 200 | Same h | θ0(t) | 0.080 | 0.014 | 0.015 | 0.090 | 0.088 | 0.089 | 0.171 | 0.103 | 0.104 |
| 200 | Same h | θ1(t) | 0.003 | 0.001 | 0.001 | 0.269 | 0.261 | 0.262 | 0.272 | 0.262 | 0.262 |
| 200 | Best h | θ0(t) | 0.080 | 0.026 | 0.026 | 0.090 | 0.074 | 0.075 | 0.171 | 0.100 | 0.101 |
| 200 | Best h | θ1(t) | 0.003 | 0.001 | 0.001 | 0.269 | 0.217 | 0.217 | 0.272 | 0.217 | 0.217 |
Table A3.
Results of trimmed average squared bias, variance, and MSE for Example 3. All values are multiplied by 100.
| n | Bandwidth | Function | Bias² Naive | Bias² Additive | Variance Naive | Variance Additive | MSE Naive | MSE Additive |
|---|---|---|---|---|---|---|---|---|
| 500 | Same h | θ0(t) | 0.84 | 0.17 | 62.43 | 67.23 | 63.27 | 67.40 |
| 500 | Same h | θ1(t) | 0.21 | 0.02 | 27.10 | 29.08 | 27.31 | 29.10 |
| 500 | Best h | θ0(t) | 0.84 | 0.18 | 62.34 | 57.60 | 63.27 | 57.77 |
| 500 | Best h | θ1(t) | 0.21 | 0.02 | 27.10 | 24.95 | 27.31 | 24.96 |
Example 1: Poisson Response
For the Poisson response, n = 100 or 200 observations were generated with covariates Xi ~ Unif[−1, 1] and Ti ~ Unif[−2, 2]. The true functions were θ0(t) = sin(πt/2) + 4 and θ1(t) = sin(πt/4 − π/2)/2 + 1. The response Yi was generated from a Poisson(μi) distribution. We used a grid of K = 100 equally spaced values in [−2, 2] for t0 to estimate the two functions. Table A1 gives the (trimmed) average marginal squared bias, variance, and MSE for the two functions estimated by the original method using no guide and by our method using guided estimation with additive and multiplicative corrections. When the same bandwidth is used, the guided estimation procedure reduces bias but has little effect on variance. When the optimal bandwidth is used, the guided estimates have lower bias and lower variance. This is because the guides account for much of the trend in the true curve, so the nonparametric correction is flatter and easier to estimate, which lowers the bias and allows a larger bandwidth to be chosen. As the sample size increased, we saw a reduction in bias, variance, and MSE.
Example 2: Normal Response
For the Gaussian response, n = 100 or 200 observations were generated with covariates Xi ~ Unif[−1, 1] and Ti ~ Unif[−2, 2]. The true functions were θ0(t) = sin(πt/2) − 2 and θ1(t) = 2 sin(πt/4 − π/2) + 3. The response Yi was generated from a Normal(μi, 1) distribution. Table A2 gives the (trimmed) average marginal squared bias, variance, and MSE for the two functions estimated by the original method using no guide and by our method using guided estimation with additive and multiplicative corrections. In this example, the guided estimates still have lower bias and variance when the optimal bandwidth is used. When the same bandwidth is used, the variance of the additive and multiplicative corrections is slightly higher than that of the unguided estimates. The gains in bias reduction are counteracted by the higher variance, so the MSE is approximately the same for both methods.
Table A2.
Results of trimmed average squared bias, variance, and MSE for Example 2. All values are multiplied by 100.
| n | Bandwidth | Function | Bias² Naive | Bias² Additive | Bias² Multiplicative | Variance Naive | Variance Additive | Variance Multiplicative | MSE Naive | MSE Additive | MSE Multiplicative |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | Same h | θ0(t) | 1.05 | 0.11 | 0.10 | 3.82 | 4.15 | 4.29 | 4.86 | 4.26 | 4.39 |
| 100 | Same h | θ1(t) | 0.20 | 0.02 | 0.02 | 11.68 | 11.81 | 12.21 | 11.89 | 11.83 | 12.23 |
| 100 | Best h | θ0(t) | 1.05 | 0.22 | 0.21 | 3.82 | 3.41 | 3.51 | 4.86 | 3.63 | 3.73 |
| 100 | Best h | θ1(t) | 0.20 | 0.04 | 0.03 | 11.68 | 8.73 | 9.92 | 11.89 | 8.77 | 8.96 |
| 200 | Same h | θ0(t) | 0.46 | 0.07 | 0.07 | 2.20 | 2.31 | 2.38 | 2.66 | 2.38 | 2.45 |
| 200 | Same h | θ1(t) | 0.13 | 0.01 | 0.01 | 6.65 | 6.70 | 6.85 | 6.78 | 6.71 | 6.85 |
| 200 | Best h | θ0(t) | 0.46 | 0.11 | 0.11 | 2.20 | 2.00 | 2.05 | 2.66 | 2.11 | 2.16 |
| 200 | Best h | θ1(t) | 0.13 | 0.01 | 0.01 | 6.65 | 5.55 | 5.63 | 6.78 | 5.56 | 5.64 |
Example 3: Bernoulli Response
For the Bernoulli response, we chose a larger sample size of n = 500 since the estimation of the success probability is more difficult than estimating the mean in the Gaussian and Poisson cases. The covariates Xi and Ti were generated with Xi ~ Unif[1, 2] and Ti ~ Unif[−1, 1]. The true functions were θ0(t) = sin(πt)/2 + 1 and θ1(t) = 0.7 sin{(t + 1)π/2} − 1. The response Yi was generated from a Bernoulli(μi) distribution. The results from this example are presented in Table A3. Using a multiplicative correction with Bernoulli data is very unstable due to the possibility of dividing by zero, thus this correction was not used. We see that using the guides reduces bias and variance when the optimal bandwidth is used, and reduces bias but slightly increases variance when a fixed bandwidth is used.
5. HIV Data Analysis
In the AIDS Clinical Trials Group (ACTG) Protocol 315, 48 individuals infected with HIV-1 were given potent antiviral medicine to evaluate the efficacy of treatment on the reduction of viral load (plasma HIV-1 RNA). The viral load was measured repeatedly over three months and 31 baseline covariates were measured for each individual. Details of this study can be found in Lederman et al. (1998) and Liang, Wu, and Carroll (2003). Of these 31 covariates, Wu and Wu (2002) identified those that were significant predictors and in this illustration, we chose two of these covariates that appeared frequently in their models, baseline viral load and baseline CD4+ counts. Viral load is the number of copies per milliliter and is measured on a logarithmic scale with base 10. CD4+ counts are the number of lymphocytes that are CD4+. The response was the change from baseline viral load measured at day 7. If an individual did not have a viral load measurement on day 7, then the preceding or following day measurement was used. The data for our illustration are presented in Figure A1. We would like to determine if baseline viral load and baseline CD4+ counts have an effect on the change from baseline viral load measurement while adjusting for interaction between them.
Figure A1.
Scatterplot of the HIV Data variables of interest presented in Section 5. The response is change from baseline viral load at day 7.
Using a grid of size 25 for t0, the Epanechnikov kernel, and the Gaussian log-likelihood for Q, we fit the model in (1) and estimated θ0 and θ1 using the original unguided nonparametric method, represented by the solid red lines in Figure A2. The pre-asymptotic bandwidth selector gave a bandwidth of 0.67. Based on the shape of these unguided estimates, we chose a cubic guide for our guided estimation of θ0 and θ1 and used the additive correction. The results from our fit are represented by the green dot-dashed lines in Figure A2 and the guides used are the black dashed lines. The bandwidth for our method was also 0.67. Our estimated θ̂0 had more curvature than the naive counterpart and followed the parametric guide very closely, suggesting that there is little model misspecification when using a cubic guide for θ0. For θ̂1, the naive estimate and our guided estimate had somewhat similar shapes, with the largest difference at the left endpoint.
Figure A2.
Nonparametric estimate (solid red) and parametrically guided estimate (green dot-dashed) of θ0 (a) and θ1 (b) along with the cubic guide (black dashed).
The point-wise bootstrap 95% confidence intervals are given in Figure A3. The confidence intervals for the guided estimates were slightly wider at the boundaries but overall were similar to those of the naive estimate. In Figure A3(c) and (d), the point-wise confidence intervals contain zero over the entire range. Recall that θ1 is the slope function and this term models the interaction between the two covariates. This indicates that there is no significant interaction between baseline viral load and baseline CD4+ counts and that the response can be adequately modeled as a cubic function of baseline viral load alone.
Figure A3.
Nonparametric estimate (solid red) of θ0 (a) and θ1 (c) and parametrically guided estimate (solid green) of θ0 (b) and θ1 (d) along with point-wise 95% bootstrap confidence intervals (black dashed).
We also used the VCM to separately model the viral load after day 14, day 21, and day 28 with the same two covariates in order to compare the estimated functions on different days to day 7. Again, if an individual did not have a viral load measurement on these exact days, then the preceding or following day was used, and if all three days were missing, the individual was dropped from the analysis. This yielded 35 individuals for the day 14 analysis, 39 individuals for the day 21 analysis, and 38 individuals for the day 28 analysis. The estimated functions corresponding to these responses are given in Figure A4. The pre-asymptotic bandwidth selector gave bandwidths of 0.53, 1.22, and 0.77 for the naive estimates of day 14, 21, and 28, respectively. The bandwidths for the guided estimates were 0.63, 1.26, and 0.77 for day 14, 21, and 28, respectively. The shape of the slope function θ1 changes drastically for the different days, but the entire confidence interval for all three θ1 functions (not presented) contains zero. Thus CD4+ has a very different interaction effect on baseline viral load for different days, but the overall effect is not significant. The shape of the intercept function θ0 was similar for days 14 and 21, and had more of a cubic shape for day 28. Similar to day 7, the parametrically guided estimates of θ0 followed their respective guides very closely.
Figure A4.
Nonparametric estimates (solid red) and parametrically guided estimates (green dot-dashed) of θ0 (a)–(c) and θ1 (d)–(f) for day 14, day 21, and day 28 responses. A cubic guide was used in (a), (c), and (f), a quadratic guide was used in (b) and (e), and a quartic guide was used in (d).
6. Discussion
In this paper, we used parametric guides to enhance the performance of nonparametric estimators of the parameter functions in varying coefficient models. We generalized to quasi-likelihoods since the true likelihood is often unavailable. We presented two ways of using the guide and estimated the corrected functions using local polynomial fitting. We developed the asymptotic properties of the guided estimators and a method of selecting the optimal bandwidth parameter of the kernel function. We conducted a simulation study to compare our guided estimators to their standard nonparametric equivalents, and found that the guided estimators had lower bias when a fixed bandwidth was used, and lower bias and variance when the optimal bandwidth was used. In general, even if the shape of the parameter function is not captured by the guide, the guided estimator will still have better bias than the unguided counterpart and the two will have similar variability.
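To make the additive use of a guide concrete, here is a minimal sketch under assumptions not taken from the paper's implementation: an identity link with squared-error loss, polynomial guides fitted by ordinary least squares, and an Epanechnikov kernel. It relies on the decomposition η_j(t) = θ_j(t) − θ_jg(t) + θ_jg(t0) used in the appendix, so that η_j(t0) = θ_j(t0). The function names `fit_guide`, `guide_value`, and `guided_estimate` are hypothetical.

```python
import numpy as np

def kernel(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

def fit_guide(T, Xd, Y, degree):
    """Least-squares polynomial guides theta_jg(t) = sum_k alpha_jk * t**k."""
    P = np.vander(T, degree + 1, increasing=True)             # n x (degree + 1)
    D = (Xd[:, :, None] * P[:, None, :]).reshape(len(T), -1)  # columns X_ij * T_i**k
    alpha, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return alpha.reshape(Xd.shape[1], degree + 1)

def guide_value(alpha, t):
    """Evaluate all guides at the points t; returns an array of shape (len(t), q + 1)."""
    P = np.vander(np.atleast_1d(t), alpha.shape[1], increasing=True)
    return P @ alpha.T

def guided_estimate(t0, T, Xd, Y, h, alpha):
    """Additively corrected local linear estimate of theta_j(t0), j = 0, ..., q."""
    w = kernel((T - t0) / h)
    # offset_i = sum_j {theta_jg(T_i) - theta_jg(t0)} * X_ij
    shift = guide_value(alpha, T) - guide_value(alpha, t0)
    offset = np.sum(shift * Xd, axis=1)
    G = np.hstack([Xd, Xd * (T - t0)[:, None]])               # {1, (T_i - t0)} kron X_i
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(G * sw[:, None], (Y - offset) * sw, rcond=None)
    return beta[: Xd.shape[1]]                                # eta_j(t0) = theta_j(t0)

# Simulated example: theta0(t) = 1 + t**3 and theta1(t) = 2 * t.
rng = np.random.default_rng(0)
n = 300
T = rng.uniform(0.0, 1.0, n)
Xd = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = (1 + T**3) * Xd[:, 0] + (2 * T) * Xd[:, 1] + rng.normal(scale=0.2, size=n)

alpha = fit_guide(T, Xd, Y, degree=3)
guided = guided_estimate(0.5, T, Xd, Y, 0.3, alpha)
naive = guided_estimate(0.5, T, Xd, Y, 0.3, np.zeros_like(alpha))  # zero guide = unguided fit
```

Passing a zero guide makes the offset vanish, so the same routine returns the usual unguided local linear estimate. This makes it easy to compare the two estimators on the same data and bandwidth.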
In this paper we presented the additive and multiplicative corrections, which are special cases of a unified family of guided estimators proposed by Fan et al. (2009). This work could be extended to include this unified family and its asymptotic properties. Other future work includes extending our methods to functional data where the covariates of interest are smooth functions and the response can be functional or scalar. The functional covariates correspond to unknown functional parameters that need to be estimated, and this can be done using our guided estimation scheme. This work can also be extended to multivariate unknown functions (e.g. θ (Ti)) by using an empirical basis expansion of the function and reducing it to a VCM.
Acknowledgments
Maity’s research was supported by NIH grant R00 ES017744 and an NCSU Faculty Research and Professional Development (FRPD) grant. Wu’s research was partially supported by NSF grant DMS-1055210 and NIH grant R01-CA149569. We also thank two anonymous referees, an anonymous Associate Editor, and the editor for their constructive and helpful comments, which have improved the presentation of the paper.
Appendix A
A.1. Assumptions, Definitions and Facts
Recall that $X_i = (1, X_{i1}, \ldots, X_{iq})^T$ and that $G_i = \{1, (T_i - t_0)\}^T \otimes X_i$ for $P = 1$. Also recall that $\rho(x, t) = 1/[V\{\mu(x, t)\}\, g'\{\mu(x, t)\}^2]$. Define $Q_k(u, v) = \partial^k Q\{g^{-1}(u), v\}/\partial u^k$. Since $Q(\cdot)$ is a quasi-likelihood function, $Q_k(u, v)$ is linear in $v$ for each fixed $u$. Thus for fixed $X = x$, we have
The following assumptions, which we use to prove Theorems 2.1 and 2.2, are commonly used in the nonparametric regression and nonparametric varying coefficient modeling literature; see, for example, Cai et al. (2000) and Fan et al. (2009).
The kernel K(·) is a symmetric, positive, bounded function with support [−1, 1], and it satisfies ∫ uK(u) du = 0.
$E(|X|^3 \mid T = t_0)$ is continuous at $t_0$.
$E(Y^4 \mid X = x, T = t_0)$ is bounded in a neighborhood of $t_0$.
The function Q2(u, v) < 0 for each u and v in the range of the response variable.
The functions ft(·), V1(·), V{μ(x, ·)}, V′{μ(x, ·)} and g‴{μ(x, ·)} are continuous at t0. Also, ft(t0) > 0 and V1(t0) > 0.
The functions for j = 0, …, q are continuous in a neighborhood around t0.
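As an illustration of the kernel assumption, the sketch below checks numerically that the Epanechnikov kernel (one common choice, used here only as an example) integrates to one and has zero first moment, and then assembles the 2 × 2 matrices M and R that appear in the proof of Theorem 2.1. The definitions κ_j = ∫ u^j K(u) du and ν_j = ∫ u^j K²(u) du are assumed here because they are not restated in this appendix, and the helper `kernel_moment` is hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Symmetric, bounded kernel with support [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kernel_moment(kernel, j, power=1, m=200001):
    """Trapezoidal approximation of the integral of u**j * K(u)**power over [-1, 1]."""
    u = np.linspace(-1.0, 1.0, m)
    f = u**j * kernel(u) ** power
    du = u[1] - u[0]
    return float(np.sum(0.5 * (f[1:] + f[:-1])) * du)

def kappa(j):
    """Assumed definition: kappa_j = integral of u**j * K(u)."""
    return kernel_moment(epanechnikov, j, power=1)

def nu(j):
    """Assumed definition: nu_j = integral of u**j * K(u)**2."""
    return kernel_moment(epanechnikov, j, power=2)

# M_kl = kappa_{k+l-2} and R_jk = nu_{j+k-2} for indices 1, 2.
M = np.array([[kappa(k + l - 2) for l in (1, 2)] for k in (1, 2)])
R = np.array([[nu(j + k - 2) for k in (1, 2)] for j in (1, 2)])

print("integral of K:", kappa(0))      # close to 1
print("first moment of K:", kappa(1))  # close to 0 by symmetry
print("M =", M)
print("R =", R)
```

For this kernel the exact values are κ_2 = 1/5, ν_0 = 3/5, and ν_2 = 3/35, and the odd moments vanish, so both M and R are diagonal.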
For our results to be valid for the case when the guide is estimated, we use the following White (1982)-type condition, see e.g., Fan et al. (2009). Recall that we denote the guides as θjg(t) = θjg(t, α) and similarly θ̂jg(t) = θjg(t, α̂), omitting the dependence on α to ease notation.
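We assume that $E[\log\{f_{\mathrm{jnt}}(x, t, y)\}]$ exists. Also, there exists a function $m(x, t, y)$ such that $|\log\{f_{\mathrm{prp}}(x, t, y)\}| \le m(x, t, y)$.
$E[\log\{f_{\mathrm{jnt}}(x, t, y)\} - \log\{f_{\mathrm{prp}}(x, t, y)\}]$ has a unique minimizer $\alpha^*$.
Each element of $(\partial/\partial\alpha) \log\{f_{\mathrm{prp}}(x, t, y)\}$ is continuously differentiable in $\alpha$.
There are functions $m_2(x, t, y)$ and $m_3(x, t, y)$ such that, for any $\alpha$, the absolute value of each element of $[(\partial/\partial\alpha) \log\{f_{\mathrm{prp}}(x, t, y)\}][(\partial/\partial\alpha) \log\{f_{\mathrm{prp}}(x, t, y)\}]^T$ is bounded by $m_2(x, t, y)$, and the absolute value of each element of $(\partial^2/\partial\alpha\,\partial\alpha^T) \log\{f_{\mathrm{prp}}(x, t, y)\}$ is bounded by $m_3(x, t, y)$. Also, $E\{m_2(x, t, y)\}$ and $E\{m_3(x, t, y)\}$ exist.
The matrix $E\big([(\partial/\partial\alpha) \log\{f_{\mathrm{prp}}(x, t, y)\}][(\partial/\partial\alpha) \log\{f_{\mathrm{prp}}(x, t, y)\}]^T\big)$ is nonsingular at $\alpha^*$, and $\alpha^*$ is a regular point of the matrix $E[(\partial^2/\partial\alpha\,\partial\alpha^T) \log\{f_{\mathrm{prp}}(x, t, y)\}]$.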
A.2. Proof of Theorem 2.1
We first prove the result with a fixed guide and then address the case when the guide is estimated.
Fix a point t0. Recall that ηj(t) = θj(t) − θjg(t) + θjg(t0). Define bn = (nh)−1/2 and . Then it is straightforward to see that δ̂ minimizes
as a function of δ, where . By Taylor’s expansion we obtain
where
with lying between s(Ti, t0) and .
We analyze each term separately. First we note that, using an argument similar to that in Cai et al. (2000) and Fan et al. (2009), E(| (δ)|) is bounded by a term of the order
(A1)
The previous line makes use of the assumptions that K(·) is bounded, that $Q_3(u, v)$ is linear in $v$, and that $E(|X|^3 \mid T = t_0)$ is continuous at $t_0$.
Next, we take up
. For each Ti such that |Ti − t0| < h, a Taylor’s series expansion gives us
| (A2) |
Thus we have
The expected value of the (j, k)th element of
, with 1 ≤ j, k ≤ (q + 1) is
Similarly, it is fairly straightforward to derive that
where M is a 2 × 2 matrix with elements $M_{kl} = \kappa_{k+l-2}$. Using similar calculations, we can show that var( ) = $O\{(nh)^{-1}\}$. Thus we have that
(A3)
Hence we can write
Note that δ̂ actually minimizes
−
since
is free of d. Now we use the quadratic approximation lemma (Fan and Gijbels 1996) and derive that
provided that
is a sequence of stochastically bounded vectors.
To prove the asymptotic normality of δ̂, it is sufficient to show the same for
. Note that using (A2) we have
Denote the joint density of X and T by $f_{x,t}(\cdot, \cdot)$. Using the assumption that $nh^5 \to$ constant, we derive for 1 ≤ j ≤ q + 1,
Similarly we derive that for q + 2 ≤ j ≤ 2(q + 1). Hence we have that
Now we take up var( ). Define $Z_i$ to be the term inside the sum in the definition of . Since $Z_j$ is independent of $Z_k$ whenever j ≠ k, it is straightforward to derive
Using arguments similar to those in the computation of E( ), we derive that $E(Z_{1j}) = O_p(h^3)$ for any j and thus . Next, for 1 ≤ j, k ≤ (q + 1),
where $\gamma(x, t_0) = \mathrm{var}(Y_1 \mid X = x, T_1 = t_0)/[V\{\mu(x, t_0)\}\, g'\{\mu(x, t_0)\}]^2$. Similarly we obtain
(A4)
where R is a 2 × 2 matrix with elements $R_{jk} = \nu_{j+k-2}$.
The final step of the argument is to use the Cramer-Wold device, that is to show that for any vector d
We only need to check Lyapounov’s condition. To this end, recall that . It is sufficient to show that . This result follows from an argument similar to that in (A1). Hence we obtain var( )$^{-1/2}$ { − E( )} → N(0, 1).
Combining (A3) and (A4), we obtain
Theorem 2.1 now follows from the definition of δ.
A.3. Proof of Theorem 2.2
Recall the definition of Q(β; h, t, α) from Section 2.2 for various guides. Denote by β̂(t, α) the maximizer of Q(β; h, t, α). When an estimated guide α̂ is used, it suffices to show that ||β̂(t, α̂) − β̂(t, α*)|| converges to zero in probability at a rate faster than $(nh)^{1/2}$, the convergence rate shown in Theorem 2.1 with a fixed guide.
We start by observing two facts. First, $n^{1/2}\|\hat\alpha - \alpha^*\| = O_p(1)$, which can be shown using the second set of assumptions at the end of Section A.1. Second, using the assumption that $Q_2(u, v) < 0$ for each u and v in the range of the response variable, it is evident that both Q(β; h, t, α*) and Q(β; h, t, α̂) are strictly concave in β. Thus we have
(A5)
(A6)
(A7)
for any β, where the last two equations hold entry-wise. Thus it follows from (A5) that ||β̂(t, α̂) − β̂(t, α*)|| = op(1).
Using (A6), and a Taylor’s expansion, we have
Recall that by definition of β̂(t, α*) we have . Also, using an argument similar to that used to derive (A3), the Frobenius norm of the second derivative term in the expansion above can be shown to be $O_p(1)$. Thus we have that
The rate $n^{1/2}$ is much faster than the rate $(nh)^{1/2}$ shown in Theorem 2.1, and hence the result follows.
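The comparison of rates can be spelled out in one line. The display below is a sketch which assumes that the bound just obtained is $\|\hat\beta(t, \hat\alpha) - \hat\beta(t, \alpha^*)\| = O_p(n^{-1/2})$ and that the bandwidth satisfies $h \to 0$, as it does when $nh^5$ tends to a constant:

```latex
\[
(nh)^{1/2}\,\bigl\|\hat\beta(t,\hat\alpha)-\hat\beta(t,\alpha^*)\bigr\|
  = h^{1/2}\cdot n^{1/2}\,\bigl\|\hat\beta(t,\hat\alpha)-\hat\beta(t,\alpha^*)\bigr\|
  = h^{1/2}\,O_p(1)
  = o_p(1).
\]
```

With $h$ proportional to $n^{-1/5}$, the factor $h^{1/2}$ is of order $n^{-1/10}$, so replacing the fixed guide parameter by its estimate does not change the limiting distribution obtained at the $(nh)^{1/2}$ scale.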
References
- Brumback B, Rice JA. ‘Smoothing Spline Models for the Analysis of Nested and Crossed Samples of Curves’ (with discussion) Journal of the American Statistical Association. 1998;93:961–994.
- Cai Z, Fan J, Li R. Efficient Estimation and Inferences for Varying-coefficient Models. Journal of the American Statistical Association. 2000;95:888–902.
- Cao Y, Lin H, Wu TZ, Yu Y. Penalized Spline Estimation for Functional Coefficient Regression Models. Computational Statistics and Data Analysis. 2010;54:891–905. doi: 10.1016/j.csda.2009.09.036.
- Chen R, Tsay RS. Functional-coefficient Autoregressive Models. Journal of the American Statistical Association. 1993;88:298–308.
- Cleveland W, Grosse E, Shyu W. Local Regression Models. In: Chambers J, Hastie T, editors. Statistical Models in S. 2. Pacific Grove, CA: Wadsworth and Brooks/Cole; 1991. pp. 309–376.
- Eilers PHC, Marx BD. Flexible Smoothing with B-splines and Penalties. Statistical Science. 1996;11:89–102.
- Fan J. Local Linear Regression Smoothers and their Minimax Efficiencies. The Annals of Statistics. 1993;21:196–216.
- Fan J, Farmen M, Gijbels I. Local Maximum Likelihood Estimation and Inference. Journal of the Royal Statistical Society Series B. 1998;60:591–608.
- Fan J, Gijbels I. Local Polynomial Modelling and its Applications. London: Chapman and Hall; 1996.
- Fan J, Maity A, Wang Y, Wu Y. Parametrically Guided Generalised Additive Models with Application to Mergers and Acquisitions Data. Journal of Non-parametric Statistics. 2013;25:109–128. doi: 10.1080/10485252.2012.735233.
- Fan J, Wu Y, Feng Y. Local Quasi-likelihood with a Parametric Guide. The Annals of Statistics. 2009;37:4153–4183. doi: 10.1214/09-AOS713.
- Fan J, Zhang W. Statistical Estimation in Varying Coefficient Models. The Annals of Statistics. 1999;27:1491–1518.
- Fan J, Zhang W. Statistical Methods with Varying Coefficient Models. Statistics and Its Interface. 2008;1:179–195. doi: 10.4310/sii.2008.v1.n1.a15.
- Glad IK. Parametrically Guided Non-parametric Regression. Scandinavian Journal of Statistics. 1998;25:649–668.
- Hastie T, Tibshirani R. Varying-coefficient Models. Journal of the Royal Statistical Society Series B. 1993;55:757–796.
- Hjort NL, Glad IK. Nonparametric Density Estimation with a Parametric Start. The Annals of Statistics. 1995;23:882–904.
- Holdeman JT. A Method for the Approximation of Functions Defined by Formal Series Expansions in Orthogonal Polynomials. Mathematics of Computation. 1969;23:275–287.
- Hoover DR, Rice JA, Wu CO, Yang LP. Nonparametric Smoothing Estimates of Time-varying Coefficient Models with Longitudinal Data. Biometrika. 1998;85:809–822.
- Huang JZ, Shen H. Functional Coefficient Regression Models for Nonlinear Time Series: A Polynomial Spline Approach. Scandinavian Journal of Statistics. 2004;31:515–534.
- Lederman MM, Connick E, Landay A, Kuritzkes DR, Spritzler J, StClair M, Kotzin BL, Fox L, Chiozzi MH, Leonard JM, et al. Immunologic Responses Associated with 12 Weeks of Combination Antiretroviral Therapy Consisting of Zidovudine, Lamivudine, and Ritonavir: Results of AIDS Clinical Trials Group Protocol 315. The Journal of Infectious Diseases. 2002;179:70–79. doi: 10.1086/515591.
- Liang H, Wu H, Carroll RJ. The Relationship between Virologic and Immunologic Responses in AIDS Clinical Research using Mixed-effects Varying-coefficient Models with Measurement Error. Biostatistics. 2003;4:297–312. doi: 10.1093/biostatistics/4.2.297.
- Ma S, Yang L, Romero R, Cui Y. Varying Coefficient Model for Gene-environment Interaction: A Non-linear Look. Bioinformatics. 2011;27:2119–2126. doi: 10.1093/bioinformatics/btr318.
- Martins-Filho C, Mishra S, Ullah A. A Class of Improved Parametrically Guided Nonparametric Regression Estimators. Econometric Reviews. 2008;27:542–573.
- Nelder JA, Wedderburn RWM. Generalized Linear Models. Journal of the Royal Statistical Society Series A. 1972;135:370–384.
- Wedderburn RWM. Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika. 1974;61:439–447.
- White H. Maximum Likelihood Estimation of Misspecified Models. Econometrica. 1982;50:1–25.
- Wood SN. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman and Hall/CRC; 2006.
- Wu H, Wu L. Identification of Significant Host Factors for HIV Dynamics Modelled by Non-linear Mixed-effects Models. Statistics in Medicine. 2002;21:753–771. doi: 10.1002/sim.1015.




