Parametrically guided estimation in nonparametric varying coefficient models with quasi-likelihood

Clemontina A Davenport; Arnab Maity; Yichao Wu

doi:10.1080/10485252.2015.1026903

Author manuscript; available in PMC: 2015 Jul 1.

Published in final edited form as: J Nonparametr Stat. 2015 Apr 7;27(2):195–213. doi: 10.1080/10485252.2015.1026903

PMCID: PMC4484785 NIHMSID: NIHMS673477 PMID: 26146469

Abstract

Varying coefficient models allow us to generalize standard linear regression models to incorporate complex covariate effects by modeling the regression coefficients as functions of another covariate. For nonparametric varying coefficients, we can borrow the idea of parametrically guided estimation to improve asymptotic bias. In this paper, we develop a guided estimation procedure for the nonparametric varying coefficient models. Asymptotic properties are established for the guided estimators and a method of bandwidth selection via bias-variance tradeoff is proposed. We compare the performance of the guided estimator with that of the unguided estimator via both simulation and real data examples.

Keywords: generalized linear models, local polynomial smoothing, nonparametric regression, parametrically guided estimation, varying coefficient model

1. Introduction

In many studies, the researcher observes some covariates and a response and wants to estimate the conditional mean function of the response given the covariates. A typical approach is to fit a generalized linear model (GLM) (Nelder and Wedderburn 1972) where a parametric assumption is made on a transformation of the conditional mean. This approach is easily interpreted and efficient if the correct parametric model is chosen, but can have serious consequences when the model is misspecified, a common problem in real-world applications. Alternatively, nonparametric methods make few or no assumptions about the model and are robust to model misspecification; however, they are slower to converge (Glad 1998) and can fail when the dimension of the covariates is too high (Fan and Zhang 2008). In this paper, we combine the benefits of both parametric and nonparametric methods by considering estimation of nonparametric varying coefficient models using a pre-specified parametric family of functions as a guide. We use a local likelihood kernel smoothing based estimation procedure and investigate the asymptotic properties of the resulting estimates. We evaluate the finite sample performance of our method via a simulation study and demonstrate it by applying it to the AIDS Clinical Trials Group (ACTG) Protocol 315 data described in Section 5.

Varying coefficient models (VCMs) are nonparametric generalized additive models (Wood 2006) that increase the flexibility of linear models by using a smooth function to model the parameters. There are numerous advantages of using VCMs over their parametric linear model counterparts. First, fitting a standard linear model is too restrictive because few real-world problems satisfy the assumption of linearity. The true, possibly nonlinear, relationship between the covariates is not well captured by polynomial fitting, which can lead to large bias in estimation. Second, VCMs allow for the interaction between covariates to be modeled in a nonparametric way. Models with only main effects included may miss the effect from the interaction between covariates. Another advantage of using VCMs is that the dimensionality issue is avoided because the coefficient functions are allowed to vary as a function of another covariate (Hastie and Tibshirani 1993). VCMs are easy to interpret and arise when it is desirable to know how regression coefficients change over groups, or over time in longitudinal studies. These flexible models can be applied to a variety of data types, including longitudinal data (Brumback and Rice 1998; Hoover, Rice, Wu, and Yang 1998), time series data (Chen and Tsay 1993; Huang and Shen 2004), environmental data (Fan and Zhang 1999), and genetic data (Ma, Yang, Romero, and Cui 2011).

VCMs consist of smooth, functional parameters that need to be estimated and this can be done using penalized spline approaches (Eilers and Marx 1996; Cao, Lin, Wu, and Yu 2010), basis expansion methods (Holdeman 1969), or by applying regression locally (Chen and Tsay 1993; Cai, Fan, and Li 2000). In this paper, we utilize the latter method since these estimators are efficient and have nice sampling properties (Fan 1993). This method involves using a kernel function to weight the likelihood and applying polynomial regression locally using Taylor series approximations. In practice, the full likelihood may be unknown or difficult to construct, so we replace it with the quasi-likelihood introduced by Wedderburn (1974). With the quasi-likelihood, only the relationship between the mean and the variance needs to be specified, and the model will still retain most of the efficiency of a maximum likelihood estimation procedure.

Nonparametric estimators can be enhanced by using a parametric guide. In practice, previous information or exploratory analysis may give some insight into the shape of the unknown functions and this information can be used to speed up convergence and reduce bias. The basic process is as follows: 1) identify a parametric family of functions that captures the shape, 2) remove the trend and carry out local polynomial estimation, and 3) add the trend back to obtain the final estimators of the functions. This guided estimation scheme was first studied in the density estimation framework where Hjort and Glad (1995) showed that their estimator had better bias and similar variance compared to the traditional nonparametric estimator, even when the guide was superficial. It has also been studied in least squares regression (Glad 1998; Martins-Filho, Mishra, and Ullah 2008), quasi-likelihood models (Fan, Wu, and Feng 2009), and nonparametric additive models (Fan, Maity, Wang, and Wu 2013).

In this paper, we propose a way to improve the coefficient functions in VCMs by using guided techniques. Cai et al. (2000) first proposed local estimation techniques for VCM coefficient functions using local-likelihood equations, and presented asymptotic properties of their estimators. Fan et al. (2009) proposed using quasi-likelihood models as a general case when the likelihood is unavailable and presented guided estimation when a single covariate and response are observed. Fan et al. (2013) extended this methodology to an additive model where multiple covariates are observed but do not interact with each other. The main contribution of this paper is that we allow for interaction between two observed covariates and thus extend previous works to VCMs, of which Fan et al. (2009) is a special case. We borrow the generality of the quasi-likelihood models and the idea of parametrically guided estimation to improve the bias of our estimators, and we extend the optimal bandwidth selection methodology for the kernel weight function when we apply local polynomial regression. We also give asymptotic theory and results for the guided estimators in the VCM framework, which differs from the GLM and the additive model in Fan et al. (2009) and Fan et al. (2013), respectively. We show in simulations that the guided estimators have lower bias and similar variance when a fixed bandwidth is used, and lower bias and variance when the optimal bandwidth is used. We then estimate the functional parameters in the ACTG Protocol 315 data.

The rest of this paper is organized as follows. In Section 2 we give an overview of QLMs and the standard nonparametric estimation procedure using local polynomial fitting. We propose our parametrically guided estimation scheme in Section 2.2 using two different types of corrections, and give some asymptotic properties of our estimators. In Section 3, we present one method of choosing the bandwidth parameter in local polynomial fitting. We evaluate the performance of our estimators compared to the standard ones in Section 4 and find that our estimators have lower bias when a fixed bandwidth is used, and lower bias and variability when the optimal bandwidth is chosen. We then apply our methodology to the ACTG data in Section 5 and provide some concluding remarks in Section 6.

2. Guided Estimation for Varying Coefficient Models

Assume that for each of n subjects we observe covariates X_i = (1, X_{i1}, …, X_{iq})^T and T_i, and a response Y_i. The VCM for these covariates is defined as

\begin{array}{l} g (μ_{i}) = θ_{0} (T_{i}) + X_{i 1} θ_{1} (T_{i}) + \dots + X_{i q} θ_{q} (T_{i}) \\ = X_{i}^{T} θ (T_{i}), \end{array}

(1)

where μ_i = E(Y_i|X_i, T_i) is the conditional mean of the response, g(·) is a link function from the GLM framework, and θ(·) = {θ₀(·), θ₁(·), …, θ_q (·)}^T are unknown, smooth functions. The first term models the unique effect of T and the remaining terms model the interaction between X and T. This VCM is more flexible than a linear regression model because it allows the effect of X to vary smoothly with T and the effect of T is not restricted to a linear assumption. The goal is to estimate θ(·) and there are several ways to do this (Cleveland, Grosse, and Shyu 1991; Hastie and Tibshirani 1993). The method we adopt is to use local-likelihood kernel smoothing using a quasi-likelihood.

2.1. Framework and Local Likelihood Estimation

Quasi-likelihood models (QLMs), an extension of GLMs, are ideal because often the full likelihood may be unknown or difficult to construct. In QLMs, only the relationship between the conditional mean and variance of the responses needs to be specified, which is often doable in practice. The full conditional log-likelihood is replaced with a quasi-likelihood function Q(Y_i, μ_i) and if we define var(Y_i|X_i, T_i) = V (μ_i), then Q satisfies

\frac{\partial}{\partial μ} Q (y, μ) = \frac{y - μ}{V (μ)} .

Wedderburn (1974) shows that Q has similar properties to the log-likelihoods and that Q is exactly the likelihood when the response comes from a single parameter exponential family.
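As a small illustration of the defining relation above, the sketch below builds Q(y, μ) by numerically integrating (y − s)/V(s) from y to μ and compares it with the closed form for the Poisson variance function V(μ) = μ. The function names, the choice of Simpson's rule, and the values y = 3 and μ = 4.5 are assumptions made for this example only and are not taken from the paper.

```python
import math

def quasi_likelihood(y, mu, variance, steps=2000):
    """Approximate Q(y, mu) = integral from y to mu of (y - s) / V(s) ds.

    Uses composite Simpson's rule; `steps` must be even. The lower limit y
    must lie where V(s) > 0 (for V(mu) = mu this excludes y = 0).
    """
    width = (mu - y) / steps

    def integrand(s):
        return (y - s) / variance(s)

    total = integrand(y) + integrand(mu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(y + k * width)
    return total * width / 3.0

def poisson_variance(mu):
    """Variance function V(mu) = mu of the Poisson family."""
    return mu

y, mu = 3.0, 4.5  # illustrative values
q_numeric = quasi_likelihood(y, mu, poisson_variance)
# Closed form of the same integral: y * log(mu / y) - (mu - y).
# It differs from the Poisson log-likelihood y*log(mu) - mu - log(y!)
# only by terms that do not involve mu.
q_closed = y * math.log(mu / y) - (mu - y)
print(q_numeric, q_closed)
```

Only the variance function enters the construction, which is the practical appeal of the quasi-likelihood: replacing `poisson_variance` with another V(·) changes Q without requiring a full distributional model.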

The standard, unguided, nonparametric procedure for estimating θ(·) is to use local polynomial fitting by first approximating the functions using a Taylor series expansion so that

θ_{j} (T_{i}) \approx β_{j 0} + (T_{i} - t_{0}) β_{j 1} + \dots + {(T_{i} - t_{0})}^{P} β_{j P},

(2)

where $β_{j p} = θ_{j}^{(p)} (t_{0}) / p!$ for p = 0, …, P and j = 0, …, q. Substituting this approximation into (1) yields $g (μ_{i}) = G_{i}^{T} β$ where G_i = {1, (T_i − t₀)¹, …, (T_i − t₀)^P}^T ⊗ X_i, and $β = {(β_{0}^{T}, \dots, β_{P}^{T})}^{T}$ where β_p = (β_{0p}, …, β_{qp})^T. The approximation in (2) is only accurate when T_i is close to t₀, so the quasi-likelihood is weighted in such a way that more weight is given to observations with T_i close to t₀ and little to no weight to those far from t₀. This is done by using a kernel function K_h(·) = K(·/h)/h and defining the local quasi-likelihood as

\sum_{i = 1}^{n} Q {g^{- 1} (G_{i}^{T} β), Y_{i}} K_{h} (T_{i} - t_{0}) .

(3)

The parameter h is the bandwidth and needs to be estimated (see Section 3). We maximize (3) with respect to β and the solution β̂₀ will be the estimate of θ(t₀).
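To make the local fitting step concrete, here is a minimal sketch for the special case of an identity link with constant variance function, where maximizing (3) reduces to kernel-weighted least squares. It uses a local linear fit (P = 1) with the Epanechnikov kernel; the helper names (`solve`, `local_linear_vcm`), the toy coefficient functions, and the bandwidth are assumptions for illustration. For other links the same structure would sit inside an iteratively reweighted fit, which is not shown.

```python
import random

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def local_linear_vcm(x_rows, t, y, t0, h):
    """Estimate theta(t0) by kernel-weighted least squares with P = 1.

    x_rows[i] is X_i including its leading 1. The returned list holds the
    first q + 1 entries of beta-hat, i.e. the estimate of theta(t0).
    """
    q1 = len(x_rows[0])
    dim = 2 * q1
    gram = [[0.0] * dim for _ in range(dim)]
    rhs = [0.0] * dim
    for xi, ti, yi in zip(x_rows, t, y):
        w = epanechnikov((ti - t0) / h) / h  # K_h(T_i - t0)
        if w == 0.0:
            continue
        g = list(xi) + [(ti - t0) * v for v in xi]  # G_i = (1, T_i - t0)^T kron X_i
        for r in range(dim):
            rhs[r] += w * g[r] * yi
            for c in range(dim):
                gram[r][c] += w * g[r] * g[c]
    return solve(gram, rhs)[:q1]

# Noise-free toy data with linear coefficient functions (hypothetical choices),
# for which a local linear fit is exact.
rng = random.Random(0)
n = 200
t = [-2.0 + 4.0 * i / (n - 1) for i in range(n)]
x_rows = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(n)]
y = [(1.0 + 2.0 * ti) + xi[1] * (3.0 - 0.5 * ti) for xi, ti in zip(x_rows, t)]
estimate = local_linear_vcm(x_rows, t, y, t0=0.3, h=0.5)
print(estimate)  # close to [theta_0(0.3), theta_1(0.3)]
```

Repeating the call over a grid of t₀ values traces out the estimated coefficient functions.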

2.2. Guided Estimation

The local likelihood estimators can be enhanced by using a parametric guide. Intuitively speaking, more curvature in a function makes it more difficult to estimate. If, through exploratory analysis or prior information, one has some idea of the shape of the true function, then one can identify a parametric family that captures this trend. Using the parametric guide, the curvature of the function can be removed yielding a flatter curve that is easier to estimate. Once this flatter curve is estimated, then the guide can be used to add the trend back and obtain the final estimate of the original function. This type of guided estimation has been shown to reduce bias of nonparametric estimators (Hjort and Glad 1995; Glad 1998; Martins-Filho et al. 2008) and improve variance since a larger bandwidth can be selected (Fan et al. 2009, 2013).

Define a parametric family that captures the trend of the function as {θ_{jg}(t, α_j): α_j = (α_{j1}, …, α_{jm_j})^T} for j = 0, …, q, where each α_j ranges over a parameter space contained in ℝ^{m_j}. The optimal guides can be found by maximizing the quasi-likelihood

\sum_{i = 1}^{n} Q [g^{- 1} {X_{i}^{T} θ_{g} (T_{i}, α)}, Y_{i}]

with respect to $α = {(α_{0}^{T}, \dots, α_{q}^{T})}^{T}$ where θ_g(T_i, α) = {θ_{0g}(T_i, α₀), …, θ_{qg}(T_i, α_q)}^T. The best fit is denoted by θ̂_g(t) = θ_g(t, α̂) where we suppress the dependence on α in our notation. In this paper we present two methods of removing the trend: an additive correction and a multiplicative correction.

2.2.1. Additive Correction

If the curvature of θ_j is well approximated by θ̂_jg, then estimating the quantity θ_j(t) − θ̂_jg(t) will yield more accurate and less variable estimates since this quantity is close to flat. Once estimated, the guide is added back to give the final estimate of θ_j. This process can be achieved in one step by defining η_j(t) = θ_j(t) − θ̂_jg(t) + θ̂_jg(t₀) and estimating η_j at t₀. This definition of η_j is known as the additive correction.

Using this correction, the VCM in (1) can be rewritten as

g (μ_{i}) = X_{i}^{T} η (T_{i}) + X_{i}^{T} h (T_{i})

where η(t) = {η₀(t), …, η_q(t)}^T, and h(t) = {θ̂_{0g}(t) − θ̂_{0g}(t₀), …, θ̂_{qg}(t) − θ̂_{qg}(t₀)}^T. Similar to Section 2.1 and with a slight abuse of notation, a Taylor series approximation is used for η_j(T_i) about the point t₀ such that

η_{j} (T_{i}) \approx β_{j 0} + (T_{i} - t_{0}) β_{j 1} + \dots + {(T_{i} - t_{0})}^{P} β_{j P},

where $β_{j p} = η_{j}^{(p)} (t_{0}) / p!$ for p = 0, …, P. The local quasi-likelihood

\begin{array}{l} Q (β) \equiv Q (β; h, t_{0}, \hat{α}) \\ = \sum_{i = 1}^{n} Q [g^{- 1} {G_{i}^{T} β + X_{i}^{T} h (T_{i})}, Y_{i}] \times K_{h} (T_{i} - t_{0}) \end{array}

is maximized with respect to β. The leading block β̂₀ of the maximizer is the estimate η̂(t₀), which is also the final estimate θ̂(t₀) because η(t₀) = θ(t₀) by construction. Because h(T_i) is known once the guide has been fitted, our model can be fit using standard software with $X_{i}^{T} h (T_{i})$ as an offset.
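The effect of the additive correction can be seen in a short sketch: it computes the offset X_i^T h(T_i) and compares how much θ₀ and the corrected function η₀ vary over a local window. The truth used here is θ₀ from Example 1, while the cubic guide (a third-order Taylor polynomial of that function about zero), the point t₀ = 0.5, and the window width are hypothetical choices for illustration rather than a fitted guide.

```python
import math

def additive_offset(x_row, t_i, t0, guides):
    """Return X_i^T h(T_i), where h_j(t) = guide_j(t) - guide_j(t0)."""
    return sum(x * (g(t_i) - g(t0)) for x, g in zip(x_row, guides))

def eta(theta_j, guide_j, t, t0):
    """Additive correction eta_j(t) = theta_j(t) - guide_j(t) + guide_j(t0)."""
    return theta_j(t) - guide_j(t) + guide_j(t0)

def theta0(t):
    """Coefficient function theta_0 of Example 1."""
    return math.sin(math.pi * t / 2) + 4.0

def guide0(t):
    """Hypothetical cubic guide: Taylor polynomial of theta0 about t = 0."""
    a = math.pi / 2
    return 4.0 + a * t - (a ** 3) * t ** 3 / 6.0

t0 = 0.5
window = [t0 + 0.3 * (k / 10.0 - 1.0) for k in range(21)]  # t0 - 0.3, ..., t0 + 0.3

theta_vals = [theta0(t) for t in window]
eta_vals = [eta(theta0, guide0, t, t0) for t in window]
spread_theta = max(theta_vals) - min(theta_vals)
spread_eta = max(eta_vals) - min(eta_vals)
print(spread_theta, spread_eta)  # eta varies far less than theta over the window
```

With an identity link, subtracting `additive_offset(...)` from each response and then running an ordinary local polynomial fit is equivalent to fitting with the offset; with other links the offset would be passed to the GLM-type fitting routine instead.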

2.2.2. Multiplicative Correction

An alternative correction which leads to a different guided estimator is the multiplicative correction. As in the additive case, the ratio θ_j(t)/θ̂_jg(t) will be flat if θ̂_jg(t) captures the trend of θ_j(t) and estimating this ratio will be less biased than estimating the unknown function directly. Once estimated, the ratio is then multiplied by the guide to get the final estimate of θ_j. The multiplicative correction is defined as η_j(t) = θ_j(t){θ̂_jg(t₀)/θ̂_jg(t)} and the one step solution requires estimating η_j(t) at t₀.

Using the multiplicative correction, (1) is written as

g (μ_{i}) = \frac{{\hat{θ}}_{0 g} (T_{i})}{{\hat{θ}}_{0 g} (t_{0})} η_{0} (T_{i}) + \frac{{\hat{θ}}_{1 g} (T_{i})}{{\hat{θ}}_{1 g} (t_{0})} X_{i 1} η_{1} (T_{i}) + \dots + \frac{{\hat{θ}}_{q g} (T_{i})}{{\hat{θ}}_{q g} (t_{0})} X_{i q} η_{q} (T_{i}) .

Estimating η_j(T) is achieved by first using a Taylor series expansion of η_j(t) about the point t₀ and then maximizing

Q (β) = \sum_{i = 1}^{n} Q {g^{- 1} (G_{i}^{* T} β), Y_{i}} K_{h} (T_{i} - t_{0})

with respect to β, where $G_{i}^{*} = {1, T_{i} - t_{0}, \dots, {(T_{i} - t_{0})}^{P}} \otimes (\frac{{\hat{θ}}_{0 g} (T_{i})}{{\hat{θ}}_{0 g} (t_{0})}, X_{i 1} \frac{{\hat{θ}}_{1 g} (T_{i})}{{\hat{θ}}_{1 g} (t_{0})}, \dots, X_{i q} \frac{{\hat{θ}}_{q g} (T_{i})}{{\hat{θ}}_{q g} (t_{0})})$ , and ⊗ denotes the Kronecker product between the two vectors. The solution β̂₀ gives our final estimate θ̂(t₀). Because the guide enters only through the design matrix, there is no offset for the multiplicative correction and this model can easily be fit using standard GLM software.
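The construction of the modified design vector can be sketched in a few lines. The function name `multiplicative_design` and the two guide functions below are hypothetical; the ordering of the entries follows the Kronecker product of the polynomial terms with the guide-scaled covariates.

```python
def multiplicative_design(x_row, t_i, t0, guides, degree=1):
    """Build G_i^* for one observation.

    x_row is X_i including its leading 1 and guides[j] is the fitted guide
    for theta_j. A guide that vanishes at t0 raises ZeroDivisionError, which
    mirrors the instability of this correction when a guide is near zero.
    """
    scaled = [x * g(t_i) / g(t0) for x, g in zip(x_row, guides)]
    return [((t_i - t0) ** p) * s for p in range(degree + 1) for s in scaled]

def guide0(t):
    """Hypothetical positive guide for theta_0."""
    return 4.0 + t

def guide1(t):
    """Hypothetical positive guide for theta_1."""
    return 1.0 + 0.25 * t * t

x_row, t0 = [1.0, 0.4], 0.5
g_star = multiplicative_design(x_row, 0.9, t0, [guide0, guide1])
g_at_t0 = multiplicative_design(x_row, t0, t0, [guide0, guide1])
print(g_star)
print(g_at_t0)  # reduces to (X_i, 0, ..., 0) when T_i = t0
```

With constant guides every ratio equals one and the vector reduces to the unguided G_i, so the unguided estimator is recovered as a special case.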

2.3. Asymptotic Properties

In this section, we investigate the asymptotic properties of the proposed guided estimators. For illustration purposes, we present the case where estimation is performed using a local linear approximation (P = 1) and an additive guide. Similar results for general P and the multiplicative guide can be obtained by straightforward but tedious algebra.

Define κ_d = ∫u^dK(u) du and ν_d = ∫u^dK²(u) du. Define M to be a 2 × 2 matrix with elements M_{kl} = κ_{k+l−2} and R to be a 2 × 2 matrix with elements R_{kl} = ν_{k+l−2}. Let ρ(x, t) = 1/[V {μ(x, t)}g′²{μ(x, t)}] and γ(x, t₀) = var(Y₁|X₁ = x, T₁ = t₀)/[V{μ(x, t₀)}g′{μ(x, t₀)}]². Make the definitions

\begin{array}{l} V_{1} (t_{0}) = f_{t} (t_{0}) M \otimes E {ρ (X_{1}, T_{1}) X_{1} X_{1}^{T} ∣ T_{1} = t_{0}}, \\ V_{2} (t_{0}) = f_{t} (t_{0}) R \otimes E {γ (X_{1}, T_{1}) X_{1} X_{1}^{T} ∣ T_{1} = t_{0}}, \\ B_{1} (t_{0}) = M^{- 1} {(κ_{2}, κ_{3})}^{T} \otimes η^{″} (t_{0}) / 2, \end{array}

where f_t(·) is the marginal density function of T. Then for a fixed guide, we have the following result.

Theorem 2.1

Fix a point t₀ and assume the guide is fixed. Under the conditions stated in Section A.1 of the Appendix, as h → 0, nh → ∞, and nh⁵ → constant, we have

{(n h)}^{1 / 2} {\hat{θ} (t_{0}) - θ (t_{0}) - h^{2} B_{1} (t_{0})} \to^{d} N (0, \Sigma),

where Σ is the leading (q + 1) × (q + 1) submatrix of $V_{1}^{- 1} V_{2} V_{1}^{- 1}$ .

There are two noteworthy points to make about Theorem 2.1. The first is that if there is no (or constant) guide and the model belongs to a one-parameter exponential family with the canonical link and correctly specified variance function, then η″(t₀) = θ″(t₀), ρ(x, t) = γ(x, t) = V {μ(x, t)}, and our result reduces to Theorem 1 of Cai et al. (2000). The second is that only the bias term B₁(t₀) is affected when a guide is used, and not the asymptotic variance. To be specific, consider the integrated squared bias for each function θ_j(·). It is evident that the integrated squared bias of the guided estimate is smaller than that of the unguided one when $\int {θ_{j}^{″} (t) - θ_{j g}^{″} (t)}^{2} d t < \int {θ_{j}^{″} (t)}^{2} d t$ . This is analogous to Remark 3 of Fan et al. (2009). Therefore when a fixed bandwidth is used, the squared bias (and hence the MSE) is reduced if an appropriate guide is chosen. Furthermore, finding the optimal bandwidth using the procedure from Section 3 allows for a larger bandwidth to be selected compared to the unguided estimate, which will reduce the variance as well as the bias. We demonstrate this in our simulation study in Section 4.

Theorem 2.1 assumes a fixed guide, but in practice the guide needs to be estimated. Theorem 2.2 states that the expressions above remain valid for estimated guides.

Theorem 2.2

Let f_jnt(x, t, y) = f_t(t) exp(Q[g⁻¹{x^T θ(t)}, y]) be the true joint density and f_prp(x, t, y, α) = f_t(t) exp(Q[g⁻¹{x^Tθ_g(t, α)}, y]) be the proposed joint density. Define α^* to be the minimizer of the Kullback-Leibler distance between f_prp and f_jnt. Then under the White (1982)-type conditions in the Appendix, the same result as in Theorem 2.1 holds when an estimated guide θ̂_g(·) is used in place of a fixed guide with the modification that α is now replaced by α^*.

The proofs of Theorems 2.1 and 2.2 are given in the Appendix.

Remark 1

The asymptotic results presented in this section are an extension of the results from Fan et al. (2009). Consider the special case where the covariate contains only the intercept X_i = 1 and parameter function θ(·) ≡ θ₀(·). Then we have ρ(x, t) = ρ(t) = 1/[V {μ(t)}g′²{μ(t)}], γ(x, t) = γ(t) = var(Y₁|T₁ = t)/[V {μ(t)}g′{μ(t)}]², V₁(t₀) = f_t(t₀)ρ(t₀)M, V₂(t₀) = f_t(t₀)γ(t₀)R, and B₁(t₀) = M⁻¹(κ₂, κ₃)^T η″(t₀)/2. With these simplified definitions, the result in Theorem 2.1 implies that

{(n h)}^{1 / 2} {\hat{θ} (t_{0}) - θ (t_{0}) - h^{2} κ_{0} κ_{2} η^{″} (t_{0}) / 2} \to^{d} N (0, σ (t_{0})),

where σ(t₀) = ν₀var(Y₁|T₁ = t₀)g′²{μ(t₀)}/f_t (t₀). This result agrees with Theorem 1 in Fan et al. (2009) under the assumption that κ₀ = ∫K(z) dz = 1.
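For a concrete sense of the constants in these expressions, the following sketch evaluates κ_d and ν_d numerically for the Epanechnikov kernel (the kernel used later in the simulations) and computes the first component of M⁻¹(κ₂, κ₃)^T, which multiplies η″(t₀)/2 in the bias. The Simpson integrator and all function names are assumptions for this illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    width = (b - a) / steps
    total = f(a) + f(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(a + k * width)
    return total * width / 3.0

def kappa(d):
    """kappa_d = integral of u^d K(u) du."""
    return simpson(lambda u: u ** d * epanechnikov(u), -1.0, 1.0)

def nu(d):
    """nu_d = integral of u^d K(u)^2 du."""
    return simpson(lambda u: u ** d * epanechnikov(u) ** 2, -1.0, 1.0)

k0, k1, k2, k3 = (kappa(d) for d in range(4))
# M = [[k0, k1], [k1, k2]]; first component of M^{-1} (k2, k3)^T by Cramer's rule.
det = k0 * k2 - k1 * k1
bias_constant = (k2 * k2 - k1 * k3) / det
print(k0, k2, nu(0), bias_constant)
```

For this kernel κ₀ = 1, κ₂ = 1/5 and ν₀ = 3/5, while the odd moments vanish by symmetry, so the leading bias of the local linear fit is approximately h² η″(t₀)/10.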

3. Optimal Bandwidth Selection

Note that for simplicity of presentation, we will consider our bandwidth selection method using the additive correction; the multiplicative correction follows easily by using the multiplicative definition of η_j(·), replacing G_i with $G_{i}^{*}$ , and omitting the offset term $X_{i}^{T} h (T_{i})$ .

Once β̂ is obtained, the bias arises from the remainder term of the Taylor series expansions. Hence, using more terms in the series should theoretically produce less bias. Let $r_{j} (T_{i}) = η_{j} (T_{i}) - \sum_{k = 0}^{P} η_{j}^{(k)} (t_{0}) {(T_{i} - t_{0})}^{k} / k!$ be the approximation error. If a higher order Taylor approximation is substituted for η_j(T_i) in r_j, then $r_{j} (T_{i}) \approx \sum_{k = 1}^{a} η_{j}^{(P + k)} (t_{0}) {(T_{i} - t_{0})}^{P + k} / (P + k)! \overset{def}{=} r_{j i}$ . We then maximize the local quasi-likelihood including the approximation errors

Q^{*} (β) = \sum_{i = 1}^{n} Q [g^{- 1} {G_{i}^{T} β + X_{i}^{T} h (T_{i}) + X_{i}^{T} r_{i}}, Y_{i}] \times K_{h} (T_{i} - t_{0})

with respect to β where r_i = (r_{0i}, …, r_{qi})^T. Define the maximizer as β̂^*. The local quasi-likelihood Q^*(β) is differentiated with respect to β to get the gradient vector

{Q^{*}}^{'} (β) = \sum_{i = 1}^{n} \frac{Y_{i} - g^{- 1} {G_{i}^{T} β + X_{i}^{T} h (T_{i}) + X_{i}^{T} r_{i}}}{V [g^{- 1} {G_{i}^{T} β + X_{i}^{T} h (T_{i}) + X_{i}^{T} r_{i}}]} {(g^{- 1})}^{'} (G_{i}^{T} β + X_{i}^{T} h (T_{i}) + X_{i}^{T} r_{i}) G_{i} K_{h} (T_{i} - t_{0})

and the second derivative is taken to get the Hessian matrix Q^*″(β). A Taylor series expansion is then applied to Q^*′ around β̂ to get

{Q^{*}}^{'} ({\hat{β}}^{*}) \approx {Q^{*}}^{'} (\hat{β}) + {Q^{*}}^{″} (\hat{β}) ({\hat{β}}^{*} - \hat{β}) = 0

and thus, an approximation of the estimation bias is β̂ − β̂^* ≈ {Q^*″(β̂)}⁻¹Q^*′(β̂).

To get an approximation of the variance, a Taylor series expansion of Q′(β̂) is done about the true β, denoted by β⁰. Note that

0 = Q^{'} (\hat{β}) \approx Q^{'} (β^{0}) + Q^{″} (β^{0}) (\hat{β} - β^{0})

which implies β̂ − β⁰ ≈ −{Q″(β⁰)}⁻¹Q′(β⁰), and the estimate of the conditional variance is

var (\hat{β} - β^{0} ∣ X, T) \approx {Q^{″} (β^{0})}^{- 1} var {Q^{'} (β^{0}) ∣ X, T} {Q^{″} (β^{0})}^{- 1}

where Q″(β⁰) can be approximated by Q″(β̂). To approximate the variance term, note that

\begin{array}{l} var {Q^{'} (β^{0}) ∣ X, T} = \sum_{i = 1}^{n} var (\frac{\partial}{\partial β} Q [g^{- 1} {G_{i}^{T} β + X_{i}^{T} h (T_{i})}, Y_{i}] ∣ X_{i}, T_{i}) K_{h}^{2} (T_{i} - t_{0}) \\ = \sum_{i = 1}^{n} var (\frac{Y_{i} - g^{- 1} {G_{i}^{T} β + X_{i}^{T} h (T_{i})}}{V [g^{- 1} {G_{i}^{T} β + X_{i}^{T} h (T_{i})}]} {(g^{- 1})}^{'} {G_{i}^{T} β + X_{i}^{T} h (T_{i})} ∣ X_{i}, T_{i}) \times G_{i} G_{i}^{T} K_{h}^{2} (T_{i} - t_{0}) \\ \approx \sum_{i = 1}^{n} \frac{{[{(g^{- 1})}^{'} {X_{i}^{T} θ (t_{0})}]}^{2}}{V [g^{- 1} {X_{i}^{T} θ (t_{0})}]} G_{i} G_{i}^{T} K_{h}^{2} (T_{i} - t_{0}) . \end{array}

The last approximation follows from the fact that T_i only has significant weight in the neighborhood of t₀.

To compute the MSE, we denote the bias of θ̂ as B(t₀; h) = {B₀(t₀; h), …, B_q (t₀; h)}^T corresponding to the first q+1 components of [Q^*″(β̂)]⁻¹Q^*′(β̂). The variance-covariance matrix V(t₀; h) of θ̂ is the first (q + 1) × (q + 1) submatrix of var(β̂ − β⁰|X, T) and the variance of the estimated VCM $X_{i}^{T} \hat{θ}$ is $X_{i}^{T} V (t_{0}; h) X_{i}$ . The conditional MSE of X^T θ̂ given X = x is

MSE (t_{0}; h) = x^{T} {B (t_{0}; h) B {(t_{0}; h)}^{T} + V (t_{0}; h)} x .

The sample MSE is derived as

\begin{array}{l} \hat{MSE} (t_{0}; h) = \frac{1}{n} \sum_{i = 1}^{n} X_{i}^{T} {B (t_{0}; h) B {(t_{0}; h)}^{T} + V (t_{0}; h)} X_{i} \\ = trace [{B (t_{0}; h) B {(t_{0}; h)}^{T} + V (t_{0}; h)} n^{- 1} \sum_{i = 1}^{n} X_{i} X_{i}^{T}] . \end{array}

We propose to choose h such that

\hat{h} = {argmin}_{h} \int \hat{MSE} (t; h) d t .
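The last two steps can be sketched as follows: the trace form of the sample MSE is evaluated from a bias vector and a variance matrix, and the bandwidth is picked from a candidate grid by summing the estimated MSE over a grid of t values as a stand-in for the integral. The function names, the toy inputs, and the toy MSE curve h⁴ + 0.01/h are hypothetical and serve only to show the mechanics.

```python
def sample_mse(bias, var, x_rows):
    """Evaluate trace[{B B^T + V} n^{-1} sum_i X_i X_i^T].

    bias is the (q + 1)-vector B(t0; h), var the (q + 1) x (q + 1) matrix
    V(t0; h), and x_rows the observed covariate vectors X_i.
    """
    dim, n = len(bias), len(x_rows)
    a = [[bias[r] * bias[c] + var[r][c] for c in range(dim)] for r in range(dim)]
    xx = [[sum(x[r] * x[c] for x in x_rows) / n for c in range(dim)]
          for r in range(dim)]
    return sum(a[r][c] * xx[c][r] for r in range(dim) for c in range(dim))

def select_bandwidth(candidates, t_grid, mse_at):
    """Return the candidate h minimizing the sum of mse_at(t, h) over t_grid.

    On an equally spaced grid the sum is proportional to a Riemann
    approximation of the integrated MSE, so the minimizer is unchanged.
    """
    return min(candidates, key=lambda h: sum(mse_at(t, h) for t in t_grid))

# Toy inputs (hypothetical numbers, not results from the paper).
x_rows = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
bias = [0.1, -0.2]
var = [[0.04, 0.01], [0.01, 0.09]]
mse_trace = sample_mse(bias, var, x_rows)

def toy_mse(t, h):
    """Stylized tradeoff: squared bias grows like h^4, variance like 1/h."""
    return h ** 4 + 0.01 / h

h_hat = select_bandwidth([0.1, 0.2, 0.3, 0.4, 0.5], [-1.0, 0.0, 1.0], toy_mse)
print(mse_trace, h_hat)
```

In practice `mse_at` would wrap the bias and variance approximations derived above, evaluated at each grid point and candidate bandwidth.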

To summarize, we use a grid of t₀’s and a grid of candidate bandwidths. We fit the local quasi-likelihood for each t₀ and h candidate and calculate the extended residual squares criterion (ERSC) defined as ERSC(x, t; h) = σ̂²{1+(p+1)/N}, where σ̂² is the weighted residual sum of squares after fitting a local pth-order polynomial and N is the number of local data points (see (5.6) of Fan, Farmen, and Gijbels (1998) for details). We then sum the ERSCs over t₀ and the bandwidth with the lowest sum becomes the pilot bandwidth. Using this bandwidth, we fit a local quasi-likelihood for each t₀ to obtain β̂. For a new set of candidate bandwidths, we fit the local quasi-likelihood using higher order Taylor series approximation and obtain β̂^*, which is theoretically more accurate. We compute the bias and variance using the gradient and Hessian of the quasi-likelihood and then compute the MSE; the candidate bandwidth with the lowest MSE is our optimal bandwidth.

4. Simulation

We conducted a simulation study to evaluate the performance of our estimators. We generated each observation (X_i, T_i, Y_i) by first simulating the covariates X_i and T_i from a uniform distribution. Then, given X_i and T_i, the conditional mean μ_i of the response was generated as

μ_{i} = g^{- 1} {θ_{0} (T_{i}) + X_{i 1} θ_{1} (T_{i})}

where g is the canonical link. We used a grid of equally spaced values t_k for k = 1, …, K = 100 to estimate the two functions. A cubic guide θ_{0g}(t, α₀) = α₀₁ + α₀₂t + α₀₃t² + α₀₄t³ was used for estimating θ₀ and a quadratic guide θ_{1g}(t, α₁) = α₁₁ + α₁₂t + α₁₃t² was used for estimating θ₁. We used local linear polynomial estimators with the Epanechnikov kernel weight, and for bias calculations, we chose the degree of the Taylor expansion to be a = 1.

For R = 1000 simulations, we generated the data and estimated the parameters for the cubic and quadratic guides as described above. To get the final estimates θ̂₀ and θ̂₁, we maximized the local quasi-likelihood with the appropriate distribution and canonical link, and Epanechnikov kernel weight. The design matrix of the local likelihood was constructed using both the additive and multiplicative corrections (G_i and $G_{i}^{*}$ , respectively). To find the optimal smoothing bandwidth, we first simulated 15 data sets and applied our methods from Section 3. We took the median of these 15 values as the optimal bandwidth and used that value as fixed for the 1000 simulations. We also compared the two methods using the optimal bandwidth from the unguided method as the fixed bandwidth.

Once the two functions were estimated, we computed the marginal squared bias, marginal variance, and marginal MSE of each. Define $B_{j k} = R^{- 1} \sum_{r = 1}^{R} [{\hat{θ}}_{j, r} (t_{k}) - θ_{j} (t_{k})], V_{j k} = R^{- 1} \sum_{r = 1}^{R} {[{\hat{θ}}_{j, r} (t_{k}) - R^{- 1} \sum_{r^{'} = 1}^{R} {\hat{θ}}_{j, r^{'}} (t_{k})]}^{2}$ , and ${MSE}_{j k} = B_{j k}^{2} + V_{j k}$ where j = 0, 1 and r indexes the simulation. The average marginal squared bias of θ̂_j is $B_{j}^{2} = K^{- 1} \sum_{k = 1}^{K} B_{j k}^{2}$ , the average marginal variance is $V_{j} = K^{- 1} \sum_{k = 1}^{K} V_{j k}$ and the average marginal MSE is ${MSE}_{j} = K^{- 1} \sum_{k = 1}^{K} {MSE}_{j k}$ . However, instead of averaging over all k, we used the 10% trimmed mean. Tables A1 – A3 show the results of our simulations. “Best h” is each estimation method’s optimal smoothing bandwidth obtained via the bias-variance tradeoff in Section 3. “Same h” refers to the fixed bandwidth obtained from the optimal bandwidth of the unguided estimators. All values for squared bias, variance, and MSE in the tables are multiplied by 100.
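The summary measures defined above can be computed along the following lines. The function names are hypothetical, and the trimming convention (dropping the same fraction of values from each end before averaging) is an assumption, since the text does not specify how the 10% trimmed mean was taken.

```python
def marginal_summaries(estimates, truth):
    """Per-grid-point squared bias, variance, and MSE for one function.

    estimates[r][k] is the estimate at grid point t_k in replication r and
    truth[k] is the true value at t_k. Returns a list of
    (B_k^2, V_k, MSE_k) tuples with MSE_k = B_k^2 + V_k.
    """
    reps = len(estimates)
    out = []
    for k, true_k in enumerate(truth):
        vals = [est[k] for est in estimates]
        mean_k = sum(vals) / reps
        bias = mean_k - true_k
        var = sum((v - mean_k) ** 2 for v in vals) / reps
        out.append((bias ** 2, var, bias ** 2 + var))
    return out

def trimmed_mean(values, proportion=0.10):
    """Average after dropping floor(proportion * n) values from each end.

    Assumed convention; other definitions of a 10% trimmed mean exist.
    """
    ordered = sorted(values)
    cut = int(proportion * len(ordered))
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

# Toy check with R = 2 replications and K = 2 grid points (made-up numbers).
summaries = marginal_summaries([[1.0, 2.0], [3.0, 2.0]], [2.0, 1.0])
avg_mse = trimmed_mean([mse for _, _, mse in summaries])
print(summaries, avg_mse)
```

Applying `trimmed_mean` separately to the squared-bias, variance, and MSE columns gives the three averaged quantities that the tables report.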

Table A1.

Results of trimmed average squared bias, variance, and MSE for Example 1. “Same h” refers to the fixed bandwidth and “Best h” refers to the optimal bandwidth. All values are multiplied by 100.

			Bias²			Variance			MSE
			Naive	Additive	Multiplicative	Naive	Additive	Multiplicative	Naive	Additive	Multiplicative
n = 100	Same h	θ₀(t)	0.129	0.022	0.022	0.179	0.173	0.174	0.308	0.195	0.196
	Same h	θ₁(t)	0.003	0.001	0.001	0.552	0.522	0.522	0.556	0.522	0.522

	Best h	θ₀(t)	0.129	0.098	0.096	0.179	0.104	0.106	0.308	0.203	0.202
	Best h	θ₁(t)	0.003	0.001	0.001	0.552	0.294	0.296	0.556	0.295	0.297

n = 200	Same h	θ₀(t)	0.080	0.014	0.015	0.090	0.088	0.089	0.171	0.103	0.104
	Same h	θ₁(t)	0.003	0.001	0.001	0.269	0.261	0.262	0.272	0.262	0.262

	Best h	θ₀(t)	0.080	0.026	0.026	0.090	0.074	0.075	0.171	0.100	0.101
	Best h	θ₁(t)	0.003	0.001	0.001	0.269	0.217	0.217	0.272	0.217	0.217

Table A3.

Results of trimmed average bias, variance, and MSE for Example 3. All values are multiplied by 100.

			Bias²		Variance		MSE
			Naive	Additive	Naive	Additive	Naive	Additive
n=500	Same h	θ₀(t)	0.84	0.17	62.43	67.23	63.27	67.40
	Same h	θ₁(t)	0.21	0.02	27.10	29.08	27.31	29.10

	Best h	θ₀(t)	0.84	0.18	62.34	57.60	63.27	57.77
	Best h	θ₁(t)	0.21	0.02	27.10	24.95	27.31	24.96

Example 1: Poisson Response

For the Poisson response, we generated n = 100 or 200 observations with covariates X_i ~ Unif[−1, 1] and T_i ~ Unif[−2, 2]. The true functions were θ₀(t) = sin(πt/2) + 4 and θ₁(t) = sin(πt/4 − π/2)/2 + 1. The response Y_i was generated from a Poisson(μ_i). We used a grid of K = 100 equally spaced values in [−2, 2] for t₀ to estimate the two functions. Table A1 gives the (trimmed) average marginal squared bias, variance, and MSE for the two functions estimated by the original method using no guide and by our method using guided estimation with additive and multiplicative corrections. When the same bandwidth is used, the guided estimation procedure reduces bias but has little effect on variance. When the optimal bandwidth is used, the guided estimates have lower bias and lower variance. This is because the guides account for much of the trend in the true curve, so the nonparametric correction is flatter and easier to estimate; this lowers the bias and allows a larger bandwidth, which lowers the variance. As the sample size increased, we saw a reduction in bias, variance and MSE.
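The data-generating step of this example can be reproduced roughly as follows, assuming the canonical log link for the Poisson family so that μ_i = exp{θ₀(T_i) + X_i θ₁(T_i)}. The function names, the seed handling, and the use of Knuth's multiplication method for Poisson draws are choices made for this sketch and are not described in the paper.

```python
import math
import random

def theta0(t):
    """True intercept function of Example 1."""
    return math.sin(math.pi * t / 2) + 4.0

def theta1(t):
    """True slope function of Example 1."""
    return math.sin(math.pi * t / 4 - math.pi / 2) / 2 + 1.0

def poisson_draw(mu, rng):
    """Draw one Poisson(mu) variate by Knuth's multiplication method.

    Adequate for the moderate means in this design (at most about exp(6.5)).
    """
    limit = math.exp(-mu)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def simulate_example1(n, seed=0):
    """Return a list of (X_i, T_i, Y_i) triples for the Poisson example."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        t = rng.uniform(-2.0, 2.0)
        mu = math.exp(theta0(t) + x * theta1(t))  # log link: g^{-1} = exp
        data.append((x, t, poisson_draw(mu, rng)))
    return data

data = simulate_example1(100, seed=1)
print(data[:3])
```

Repeating the call with different seeds gives independent replications for the Monte Carlo summaries.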

Example 2: Normal Response

For the Gaussian response, we generated n = 100 or 200 observations with covariates X_i ~ Unif[−1, 1] and T_i ~ Unif[−2, 2]. The true functions were θ₀(t) = sin(πt/2) − 2 and θ₁(t) = 2 sin(πt/4 − π/2) + 3. The response Y_i was generated from a Normal(μ_i, 1). Table A2 gives the (trimmed) average marginal squared bias, variance, and MSE for the two functions estimated by the original method using no guide and by our method using guided estimation with additive and multiplicative corrections. In this example, the guided estimates still have lower bias and variance when the optimal bandwidth is used. When the same bandwidth is used, the variances of the additively and multiplicatively corrected estimates are slightly higher than those of the unguided estimates. The gains in bias reduction are counteracted by the higher variance, so the MSE is approximately the same for both methods.

Table A2.

Results of trimmed average bias, variance, and MSE for Example 2. All values are multiplied by 100.

			Bias²			Variance			MSE
			Naive	Additive	Multiplicative	Naive	Additive	Multiplicative	Naive	Additive	Multiplicative
n = 100	Same h	θ₀(t)	1.05	0.11	0.10	3.82	4.15	4.29	4.86	4.26	4.39
	Same h	θ₁(t)	0.20	0.02	0.02	11.68	11.81	12.21	11.89	11.83	12.23

	Best h	θ₀(t)	1.05	0.22	0.21	3.82	3.41	3.51	4.86	3.63	3.73
	Best h	θ₁(t)	0.20	0.04	0.03	11.68	8.73	9.92	11.89	8.77	8.96

n = 200	Same h	θ₀(t)	0.46	0.07	0.07	2.20	2.31	2.38	2.66	2.38	2.45
	Same h	θ₁(t)	0.13	0.01	0.01	6.65	6.70	6.85	6.78	6.71	6.85

	Best h	θ₀(t)	0.46	0.11	0.11	2.20	2.00	2.05	2.66	2.11	2.16
	Best h	θ₁(t)	0.13	0.01	0.01	6.65	5.55	5.63	6.78	5.56	5.64

Example 3: Bernoulli Response

For the Bernoulli response, we chose a larger sample size of n = 500 since the estimation of the success probability is more difficult than estimating the mean in the Gaussian and Poisson case. The covariates X_i and T_i were generated with X_i ~ Unif[1, 2] and T_i ~ Unif[-1, 1]. The true functions were θ₀(t) = sin(πt)/2+1 and θ₁(t) = 0.7 sin{(t+1)π/2}−1. The response Y_i was generated from a Bernoulli(μ_i). The results from this example are presented in Table A3. Using a multiplicative correction with Bernoulli data is very unstable due to the possibility of dividing by zero, thus this correction was not used. We see that using the guides reduces bias and variance when the optimal bandwidth is used, and reduces bias but has little effect on variance when a fixed bandwidth is used.

5. HIV Data Analysis

In the AIDS Clinical Trials Group (ACTG) Protocol 315, 48 individuals infected with HIV-1 were given potent antiviral medicine to evaluate the efficacy of treatment on the reduction of viral load (plasma HIV-1 RNA). The viral load was measured repeatedly over three months and 31 baseline covariates were measured for each individual. Details of this study can be found in Lederman et al. (1998) and Liang, Wu, and Carroll (2003). Of these 31 covariates, Wu and Wu (2002) identified those that were significant predictors and in this illustration, we chose two of these covariates that appeared frequently in their models, baseline viral load and baseline CD4+ counts. Viral load is the number of copies per milliliter and is measured on a logarithmic scale with base 10. CD4+ counts are the number of lymphocytes that are CD4+. The response was the change from baseline viral load measured at day 7. If an individual did not have a viral load measurement on day 7, then the preceding or following day measurement was used. The data for our illustration are presented in Figure A1. We would like to determine if baseline viral load and baseline CD4+ counts have an effect on the change from baseline viral load measurement while adjusting for interaction between them.

Figure A1 — Scatterplot of the HIV Data variables of interest presented in Section 5. The response is change from baseline viral load at day 7.

Using a grid of 25 points for t₀, the Epanechnikov kernel, and the Gaussian log-likelihood for Q, we fit the model in (1) and estimated θ₀ and θ₁ using the original unguided nonparametric method; these estimates are shown as the solid red lines in Figure A2. The pre-asymptotic bandwidth selector gave a bandwidth of 0.67. Based on the shape of these unguided estimates, we chose a cubic guide for our guided estimation of θ₀ and θ₁ and used the additive correction. The results of our fit are shown as the green dot-dashed lines in Figure A2, and the guides are shown as the black dashed lines. The bandwidth for our method was also 0.67. Our estimate θ̂₀ had more curvature than its naive counterpart and followed the parametric guide very closely, suggesting that there is little model misspecification when a cubic guide is used for θ₀. For θ̂₁, the naive estimate and our guided estimate had broadly similar shapes, with the largest difference near the left endpoint.

Figure A2 — Nonparametric estimate (solid red) and parametrically guided estimate (green dot-dashed) of θ₀ (a) and θ₁ (b) along with the cubic guide (black dashed).
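
The fit described above can be sketched for the Gaussian case with one covariate, where the local quasi-likelihood reduces to kernel-weighted least squares. This is a minimal sketch under stated assumptions, not the authors' implementation: the guide is fitted by ordinary least squares on a polynomial varying coefficient model, the additive correction is implemented by subtracting X_i^T{θ_g(T_i) − θ_g(t₀)} from the response before the local linear fit (following the definition of η_j in Appendix A.2), and the function names are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    # K(u) = 0.75 (1 - u^2) on [-1, 1], zero elsewhere
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

def fit_polynomial_guide(x, t, y, degree=3):
    """Least-squares polynomial guide for (theta_0, theta_1); returns a callable."""
    powers = np.vander(t, degree + 1, increasing=True)       # columns t^0..t^degree
    design = np.hstack([powers, powers * x[:, None]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    a0, a1 = coef[:degree + 1], coef[degree + 1:]

    def guide(s):
        p = np.vander(np.atleast_1d(np.asarray(s, dtype=float)),
                      degree + 1, increasing=True)
        return np.column_stack([p @ a0, p @ a1])              # (m, 2) array

    return guide

def local_linear_vcm(x, t, y, t0, h, guide=None):
    """Local linear estimate of (theta_0(t0), theta_1(t0)); guided if guide is given."""
    X = np.column_stack([np.ones_like(x), x])                 # X_i = (1, X_i1)^T
    w = epanechnikov((t - t0) / h)
    offset = 0.0
    if guide is not None:
        # additive correction: remove X_i^T {theta_g(T_i) - theta_g(t0)}
        offset = np.sum(X * (guide(t) - guide(t0)), axis=1)
    G = np.hstack([X, X * (t - t0)[:, None]])                 # G_i = {1, (T_i - t0)}^T (x) X_i
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(G * sw[:, None], (y - offset) * sw, rcond=None)
    return beta[:2]                                           # intercept block = theta(t0)
```

Evaluating `local_linear_vcm` over a grid of t₀ values, with and without `guide`, gives guided and naive curves of the kind compared in Figure A2. When the guide captures the true functions exactly, the corrected response is locally constant in T and the local linear fit carries no smoothing bias.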

The point-wise bootstrap 95% confidence intervals are given in Figure A3. The confidence intervals for the guided estimates were slightly wider near the boundaries but overall were similar to those of the naive estimates. In Figure A3(c) and (d), the confidence intervals contain zero over the entire range. Recall that θ₁ is the slope function and that this term models the interaction between the two covariates. This suggests that there is no significant interaction between baseline viral load and baseline CD4+ counts, and that the response can be adequately modeled as a cubic function of baseline viral load alone.
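
One common way to obtain point-wise intervals of this kind is a percentile bootstrap that resamples individuals with replacement. The text does not say which bootstrap scheme was used, so the sketch below is an assumption; `estimator` is a hypothetical callable that maps a data array and a grid to a fitted curve on that grid.

```python
import numpy as np

def pointwise_bootstrap_ci(data, estimator, grid, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap band for a curve estimate.

    data      : (n, p) array, one row per individual
    estimator : callable (data, grid) -> array of length len(grid)
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    curves = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)          # resample rows with replacement
        curves[b] = estimator(data[rows], grid)    # refit on the resampled data
    tail = (1.0 - level) / 2.0
    lower = np.quantile(curves, tail, axis=0)      # point-wise lower limits
    upper = np.quantile(curves, 1.0 - tail, axis=0)
    return lower, upper

def covers_zero(lower, upper):
    """True if zero lies inside the interval at every grid point."""
    return bool(np.all((lower <= 0.0) & (upper >= 0.0)))
```

A band for which `covers_zero` returns True corresponds to the situation described for θ₁, where the interval contains zero at every grid point.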

We also used the VCM to separately model the viral load after day 14, day 21, and day 28 with the same two covariates, in order to compare the estimated functions on these days with those for day 7. Again, if an individual did not have a viral load measurement on the exact day, then the measurement from the preceding or following day was used, and if all three days were missing, the individual was dropped from the analysis. This yielded 35 individuals for the day 14 analysis, 39 individuals for the day 21 analysis, and 38 individuals for the day 28 analysis. The estimated functions corresponding to these responses are given in Figure A4. The pre-asymptotic bandwidth selector gave bandwidths of 0.53, 1.22, and 0.77 for the naive estimates for days 14, 21, and 28, respectively. The bandwidths for the guided estimates were 0.63, 1.26, and 0.77 for days 14, 21, and 28, respectively. The shape of the slope function θ₁ changes drastically across days, but the entire confidence interval for each of the three θ₁ functions (not presented) contains zero. Thus the estimated interaction between CD4+ counts and baseline viral load looks very different on different days, but it is not significant on any of them. The shape of the intercept function θ₀ was similar for days 14 and 21, and was closer to a cubic for day 28. As for day 7, the parametrically guided estimates of θ₀ followed their respective guides very closely.

Figure A4 — Nonparametric estimate (solid red) and parametrically guided estimate (green dot-dashed) of θ₀ (a)–(c) and θ₁ (d)–(f) for day 14, day 21, and day 28 responses. A cubic guide was used in (a), (c), and (f), a quadratic guide was used in (b) and (e), and a quartic guide was used in (d).

6. Discussion

In this paper, we used parametric guides to enhance the performance of nonparametric estimators of the parameter functions in varying coefficient models. We generalized to quasi-likelihoods since the true likelihood is often unavailable. We presented two ways of using the guide and estimated the corrected functions using local polynomial fitting. We developed the asymptotic properties of the guided estimators and a method of selecting the optimal bandwidth parameter of the kernel function. We conducted a simulation study to compare our guided estimators to their standard nonparametric equivalents, and found that the guided estimators had lower bias when a fixed bandwidth was used, and lower bias and variance when the optimal bandwidth was used. In general, even if the shape of the parameter function is not captured by the guide, the guided estimator will still have better bias than the unguided counterpart and the two will have similar variability.

In this paper we presented the additive and multiplicative corrections, which are special cases of a unified family of guided estimators proposed by Fan et al. (2009). This work could be extended to cover this unified family and its asymptotic properties. Other future work includes extending our methods to functional data, where the covariates of interest are smooth functions and the response can be functional or scalar. The functional covariates correspond to unknown functional parameters that need to be estimated, and this can be done using our guided estimation scheme. This work can also be extended to multivariate unknown functions (e.g. θ (T_i)) by using an empirical basis expansion of the function and reducing it to a VCM.

Acknowledgments

Maity’s research was supported by NIH grant R00 ES017744 and an NCSU Faculty Research and Professional Development (FRPD) grant. Wu’s research was partially supported by NSF grant DMS-1055210 and NIH grant R01-CA149569. We also thank two anonymous referees, an anonymous Associate Editor and the editor for their constructive and helpful comments, which have improved the presentation of the paper.

Appendix A. Appendix

A.1. Assumptions, Definitions and Facts

Recall that X_i = (1, X_i₁, …, X_iq)^T and that G_i = {1, (T_i − t₀)}^T ⊗ X_i for P = 1. Also recall that ρ(x, t) = 1/[V{μ(x, t)}g′²{μ(x, t)}]. Define Q_k(u, v) = ∂^kQ{g⁻¹(u), v}/∂u^k. Since Q(·) is a quasi-likelihood function, Q_k(u, v) is linear in v for each fixed u. Thus for fixed X = x, we have

\begin{array}{l} Q_{1} \{x^{T} θ (t_{0}), μ (x, t_{0})\} = 0, \\ Q_{2} \{x^{T} θ (t_{0}), μ (x, t_{0})\} = - ρ (x, t_{0}) . \end{array}
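
The two identities above can be checked directly. The short derivation below is a sketch that assumes the usual defining property of a quasi-likelihood, ∂Q(μ, y)/∂μ = (y − μ)/V(μ) (Wedderburn 1974), which is not restated in this appendix; the Poisson case at the end is only an illustration.

```latex
% Assumption: \partial Q(\mu, y) / \partial \mu = (y - \mu) / V(\mu).
% Write \mu = g^{-1}(u), so that d\mu / du = 1 / g'(\mu).
\begin{aligned}
Q_1(u, v) &= \frac{\partial}{\partial u} Q\{g^{-1}(u), v\}
           = \frac{v - \mu}{V(\mu)\, g'(\mu)}, \\
Q_2(u, v) &= -\frac{1}{V(\mu)\, g'(\mu)^{2}}
           + (v - \mu)\, \frac{\partial}{\partial u}
             \left[ \frac{1}{V(\mu)\, g'(\mu)} \right].
\end{aligned}
% Both expressions are linear in v. Setting v = \mu removes every term
% containing (v - \mu), which gives Q_1 = 0 and Q_2 = -\rho.
% Illustration (Poisson, log link): V(\mu) = \mu and g'(\mu) = 1/\mu,
% so \rho = 1 / \{\mu \cdot \mu^{-2}\} = \mu.
```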

The following assumptions, which we use to prove Theorems 2.1 and 2.2, are commonly used in the nonparametric regression and nonparametric varying coefficient modeling literature; see, for example, Cai et al. (2000) and Fan et al. (2009).

The kernel K(·) is a symmetric positive bounded function with support [−1, 1], and satisfies ∫ uK(u) du = 0.
E(|X|³|T = t₀) is continuous at t₀.
E(Y⁴|X = x, T = t₀) is bounded in a neighborhood of t₀.
The function Q₂(u, v) < 0 for each u and v in the range of the response variable.
The functions f_t(·), V₁(·), V{μ(x, ·)}, V′{μ(x, ·)} and g‴{μ(x, ·)} are continuous at t₀. Also, f_t(t₀) > 0 and V₁(t₀) > 0.
The functions $θ_{j}^{″} (\cdot)$ for j = 0, …, q are continuous in a neighborhood around t₀.

For our results to be valid for the case when the guide is estimated, we use the following White (1982)-type condition, see e.g., Fan et al. (2009). Recall that we denote the guides as θ_jg(t) = θ_jg(t, α) and similarly θ̂_jg(t) = θ_jg(t, α̂), omitting the dependence on α to ease notation.

We assume that E[log{f_jnt(x, t, y)}] exists. Also, there exists a function m(x, t, y) such that |log{f_prp(x, t, y)}| ≤ m(x, t, y).
E[log{f_jnt(x, t, y)} – log{f_prp(x, t, y)}] has a unique minimizer α^*.
Each element (∂/∂α) log{f_prp(x, t, y)} is continuously differentiable in α.
There are functions m₂(x, t, y) and m₃(x, t, y) such that for any α, the absolute value of each element of [(∂/∂α) log{f_prp(x, t, y)}][(∂/∂α) log{f_prp(x, t, y)}]^T is bounded by m₂(x, t, y), and that the absolute value of each element of [(∂²/∂αα^T) log{f_prp(x, t, y)}] is bounded by m₃(x, t, y). Also E{m₂(x, t, y)} and E{m₃(x, t, y)} exist.
The matrix E [(∂/∂α) log{f_prp(x, t, y)}][(∂/∂α) log{f_prp(x, t, y)}]^T is nonsingular at α^*, and α^* is a regular point of the matrix E[(∂²/∂αα^T) log{f_prp(x, t, y)}].

A.2. Proof of Theorem 2.1

We first prove the result with a fixed guide and then address the case when the guide is estimated.

Fix a point t₀. Recall that η_j(t) = θ_j(t) − θ_jg(t) + θ_jg(t₀). Define b_n = (nh)^{−1/2} and $δ = b_{n}^{- 1} [β_{00} - θ_{0} (t_{0}), \dots, β_{q 0} - θ_{q} (t_{0}), h \{β_{01} - η_{0}^{'} (t_{0})\}, \dots, h \{β_{q 1} - η_{q}^{'} (t_{0})\}]^{T}$. Then it is straightforward to see that δ̂ maximizes

\sum_{i = 1}^{n} Q [g^{- 1} \{b_{n} G_{i}^{T} δ + s (T_{i}, t_{0}) + X_{i}^{T} θ_{g} (T_{i})\}, Y_{i}] K \{(T_{i} - t_{0}) / h\}

as a function of δ, where $s (T_{i}, t_{0}) = X_{i}^{T} \{η (t_{0}) + (T_{i} - t_{0}) η^{'} (t_{0})\}$. By Taylor’s expansion we obtain

\begin{array}{l} L (δ) \equiv \sum_{i = 1}^{n} Q [g^{- 1} \{b_{n} G_{i}^{T} δ + s (T_{i}, t_{0}) + X_{i}^{T} θ_{g} (T_{i})\}, Y_{i}] K \{(T_{i} - t_{0}) / h\} \\ = L_{0} + L_{1}^{T} δ + δ^{T} L_{2} δ / 2 + L_{3} (δ), \end{array}

where

\begin{array}{rl} L_{0} & = \sum_{i = 1}^{n} Q [g^{- 1} \{s (T_{i}, t_{0}) + X_{i}^{T} θ_{g} (T_{i})\}, Y_{i}] K \{(T_{i} - t_{0}) / h\}; \\ L_{1} & = b_{n} \sum_{i = 1}^{n} Q_{1} [\{s (T_{i}, t_{0}) + X_{i}^{T} θ_{g} (T_{i})\}, Y_{i}] K \{(T_{i} - t_{0}) / h\} G_{i}; \\ L_{2} & = b_{n}^{2} \sum_{i = 1}^{n} Q_{2} [\{s (T_{i}, t_{0}) + X_{i}^{T} θ_{g} (T_{i})\}, Y_{i}] K \{(T_{i} - t_{0}) / h\} G_{i} G_{i}^{T}; \\ L_{3} (δ) & = (b_{n}^{3} / 6) \sum_{i = 1}^{n} Q_{3} [\{s_{i}^{*} + X_{i}^{T} θ_{g} (T_{i})\}, Y_{i}] K \{(T_{i} - t_{0}) / h\} (G_{i}^{T} δ)^{3} \end{array}

with $s_{i}^{*}$ lying between s(T_i, t₀) and $b_{n} G_{i}^{T} δ + s (T_{i}, t_{0})$ .

We analyze each term separately. First we note that, using an argument similar to those in Cai et al. (2000) and Fan et al. (2009), E{|L₃(δ)|} is bounded by a term of the order

O (n b_{n}^{3} E ∣ Q_{3} (s_{1}^{*}, Y_{1}) (G_{1}^{T} δ)^{3} K \{(T_{1} - t_{0}) / h\} ∣) = O (b_{n}) .

(A1)

The previous line makes use of the assumptions that K(·) is bounded, that Q₃(u, v) is linear in v, and that E(|X|³|T = t₀) is continuous at t₀.

Next, we take up L₂. For each T_i such that |T_i − t₀| < h, a Taylor series expansion gives us

X_{i}^{T} η (T_{i}) = s (T_{i}, t_{0}) + {(T_{i} - t_{0})}^{2} X_{i}^{T} η^{″} (t_{0}) / 2 + o_{p} (h^{2}) .

(A2)

Thus we have

Q_{2} [\{s (T_{i}, t_{0}) + X_{i}^{T} θ_{g} (T_{i})\}, μ (X_{i}, T_{i})] = Q_{2} [X_{i}^{T} θ (T_{i}), μ (X_{i}, T_{i})] + o_{p} (1) = - ρ (X_{i}, T_{i}) + o_{p} (1) .

The expected value of the (j, k)th element of L₂, with 1 ≤ j, k ≤ (q + 1), is

\begin{array}{l} E (L_{2, j k}) = h^{- 1} \int Q_{2} [\{s (T_{1}, t_{0}) + X_{1}^{T} θ_{g} (T_{1})\}, μ (X_{1}, T_{1})] X_{1 j} X_{1 k} K \{(T_{1} - t_{0}) / h\} f_{x, t} (X_{1}, T_{1}) d T_{1} d X_{1} \\ = \int Q_{2} [\{s (w h + t_{0}, t_{0}) + X_{1}^{T} θ_{g} (w h + t_{0})\}, μ (X_{1}, w h + t_{0})] X_{1 j} X_{1 k} K (w) f_{x, t} (X_{1}, w h + t_{0}) d w d X_{1} \\ = - \int ρ (X_{1}, w h + t_{0}) X_{1 j} X_{1 k} K (w) f_{x, t} (X_{1}, w h + t_{0}) d w d X_{1} + o_{p} (1) \\ = - f_{t} (t_{0}) E [ρ (X_{1}, T_{1}) X_{1 j} X_{1 k} ∣ T_{1} = t_{0}] + o_{p} (1) . \end{array}

Similarly, it is fairly straightforward to derive that

E (L_{2}) = - f_{t} (t_{0}) M \otimes E [ρ (X_{1}, T_{1}) X_{1} X_{1}^{T} ∣ T_{1} = t_{0}] + o (1),

where M is a 2 × 2 matrix with elements M_kl = κ_{k+l−2}. Using similar calculations, we can show that var(L₂) = O{(nh)⁻¹}. Thus we have that

L_{2} = - f_{t} (t_{0}) M \otimes E [ρ (X_{1}, T_{1}) X_{1} X_{1}^{T} ∣ T_{1} = t_{0}] + o_{p} (1) = - V_{1} (t_{0}) + o_{p} (1) .

(A3)

Hence we can write

L (δ) - L_{0} = L_{1}^{T} δ - δ^{T} V_{1} (t_{0}) δ / 2 + o_{p} (1) .

Note that δ̂ actually maximizes L(δ) − L₀, since L₀ is free of δ. Now we use the quadratic approximation lemma (Fan and Gijbels 1996) and derive that

\hat{δ} = V_{1}^{- 1} L_{1} + o_{p} (1),
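
The form of this limit can be read off from the Taylor expansion of L(δ) together with (A3). The following sketch treats the leading quadratic part only and assumes that V₁(t₀) is symmetric and positive definite; the symbol q(δ) is introduced here purely for illustration.

```latex
% Leading quadratic part of L(\delta) - L_0, using L_2 = -V_1(t_0) + o_p(1):
\begin{aligned}
q(\delta) &= L_1^{T} \delta - \tfrac{1}{2}\, \delta^{T} V_1(t_0)\, \delta, \\
\nabla q(\delta) &= L_1 - V_1(t_0)\, \delta = 0
  \quad \Longrightarrow \quad \delta = V_1^{-1}(t_0)\, L_1 .
\end{aligned}
% The Hessian of q is -V_1(t_0), which is negative definite under the stated
% assumption, so this stationary point is the unique maximizer of q.
```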

provided that L₁ is a sequence of stochastically bounded vectors.

To prove the asymptotic normality of δ̂, it is sufficient to show the same for L₁. Note that using (A2) we have

Q_{1} [{s (T_{i}, t_{0}) + X_{i}^{T} θ_{g} (T_{i})}, μ (X_{i}, T_{i})] = ρ (X_{i}, T_{i}) {(T_{i} - t_{0})}^{2} X_{i}^{T} η^{″} (t_{0}) / 2 + o_{p} (h^{2}) .

Denote the joint density of X and T by f_{x,t}(·, ·). Using the assumption that nh⁵ → constant, we derive, for 1 ≤ j ≤ q + 1,

\begin{array}{l} E (L_{1, j}) = n b_{n} \int Q_{1} [\{s (T_{1}, t_{0}) + X_{1}^{T} θ_{g} (T_{1})\}, μ (X_{1}, T_{1})] K \{(T_{1} - t_{0}) / h\} X_{1 j} f_{x, t} (X_{1}, T_{1}) d X_{1} d T_{1} \\ = (n h)^{1 / 2} \int Q_{1} [\{s (w h + t_{0}, t_{0}) + X_{1}^{T} θ_{g} (w h + t_{0})\}, μ (X_{1}, w h + t_{0})] K (w) X_{1 j} f_{x, t} (X_{1}, w h + t_{0}) d X_{1} d w \\ = \{(n h^{5})^{1 / 2} / 2\} \int ρ (X_{1}, w h + t_{0}) w^{2} X_{1}^{T} η^{″} (t_{0}) K (w) X_{1 j} f_{x, t} (X_{1}, w h + t_{0}) d X_{1} d w + o_{p} \{(n h^{5})^{1 / 2}\} \\ = \{(n h^{5})^{1 / 2} / 2\} κ_{2} f_{t} (t_{0}) E [X_{1}^{T} η^{″} (t_{0}) ρ (X_{1}, T_{1}) X_{1 j} ∣ T_{1} = t_{0}] + o_{p} \{(n h^{5})^{1 / 2}\} . \end{array}

Similarly we derive that $E (L_{1, j}) = \{(n h^{5})^{1 / 2} / 2\} κ_{3} f_{t} (t_{0}) E [X_{1}^{T} η^{″} (t_{0}) ρ (X_{1}, T_{1}) X_{1 j} ∣ T_{1} = t_{0}] + o_{p} \{(n h^{5})^{1 / 2}\}$ for q + 2 ≤ j ≤ 2(q + 1). Hence we have that

\begin{array}{l} E (L_{1}) = {({n h}^{5})}^{1 / 2} {f_{t} (t_{0}) / 2} (\begin{matrix} κ_{2} \\ κ_{3} \end{matrix}) \otimes E [ρ (X_{1}, T_{1}) X_{1} X_{1}^{T} ∣ T_{1} = t_{0}] η^{″} (t_{0}) + o_{p} {{({n h}^{5})}^{1 / 2}} \\ = {({n h}^{5})}^{1 / 2} B_{1} (t_{0}) + o_{p} (1) . \end{array}

Now we take up var(L₁). Define Z_i to be the term inside the sum in the definition of L₁. Since for any j, k, Z_j is independent of Z_k when j ≠ k, it is straightforward to derive

cov (L_{1, j}, L_{1, k}) = n b_{n}^{2} [E (Z_{1 j} Z_{1 k}) - E (Z_{1 j}) E (Z_{1 k})] .

Using arguments similar to those in the computation of E(L₁), we derive that E(Z_{1j}) = O_p(h³) for any j and thus $n b_{n}^{2} E (Z_{1 j}) E (Z_{1 k}) = o_{p} (1)$. Next, for 1 ≤ j, k ≤ (q + 1),

\begin{array}{l} n b_{n}^{2} E (Z_{1 j} Z_{1 k}) = n b_{n}^{2} E [Q_{1}^{2} [\{s (T_{1}, t_{0}) + X_{1}^{T} θ_{g} (T_{1})\}, Y_{1}] K^{2} \{(T_{1} - t_{0}) / h\} X_{1 j} X_{1 k}] \\ = ν_{0} f_{t} (t_{0}) E \{γ (X_{1}, T_{1}) X_{1 j} X_{1 k} ∣ T_{1} = t_{0}\} + o_{p} (1), \end{array}

where γ(x, t₀) = var(Y₁|X = x, T₁ = t₀)/[V {μ(x, t₀)}g′{μ(x, t₀)}]². Similarly we obtain

var (L_{1}) = f_{t} (t_{0}) R \otimes E {γ (X_{1}, T_{1}) X_{1} X_{1}^{T} ∣ T_{1} = t_{0}} + o_{p} (1) = V_{2} (t_{0}) + o_{p} (1),

(A4)

where R is a 2 × 2 matrix with elements R_jk = ν_{j+k−2}.
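
For a concrete kernel the matrices M and R are easy to evaluate. The sketch below assumes the standard definitions κ_j = ∫ u^j K(u) du and ν_j = ∫ u^j K²(u) du, which are not restated in this appendix, and uses the Epanechnikov kernel from Section 5; the helper `moment` is hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1], written as a polynomial in u.
K = Polynomial([0.75, 0.0, -0.75])

def moment(poly, j):
    """Integrate u^j * poly(u) over [-1, 1] exactly."""
    monomial = Polynomial([0.0] * j + [1.0])       # the polynomial u^j
    antiderivative = (monomial * poly).integ()
    return antiderivative(1.0) - antiderivative(-1.0)

kappa = [moment(K, j) for j in range(3)]           # kappa_0, kappa_1, kappa_2
nu = [moment(K * K, j) for j in range(3)]          # nu_0, nu_1, nu_2

# 2 x 2 matrices with entries kappa_{k+l-2} and nu_{j+k-2} (1-based indices).
M = np.array([[kappa[a + b] for b in range(2)] for a in range(2)])
R = np.array([[nu[a + b] for b in range(2)] for a in range(2)])
```

Because the kernel is symmetric, the odd moments vanish and both matrices are diagonal, so the intercept and slope blocks of V₁ and V₂ decouple at interior points.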

The final step of the argument is to use the Cramér–Wold device, that is, to show that for any vector d

\{d^{T} var (L_{1}) d\}^{- 1 / 2} \{d^{T} L_{1} - d^{T} E (L_{1})\} \to^{d} N (0, 1) .

We only need to check Lyapounov’s condition. To this end, recall that $L_{1} = b_{n} \sum_{i = 1}^{n} Z_{i}$. It is sufficient to show that $n b_{n}^{3} E (∣ d^{T} Z_{1} ∣^{3}) \to 0$. This result follows from an argument similar to that in (A1). Hence we obtain var(L₁)^{−1/2}{L₁ − E(L₁)} → N(0, 1).

Combining (A3) – (A4) we obtain

\hat{δ} - (n h^{5})^{1 / 2} V_{1}^{- 1} (t_{0}) B_{1} (t_{0}) \to^{d} N (0, V_{1}^{- 1} V_{2} V_{1}^{- 1})

Theorem 2.1 now follows from the definition of δ.

A.3. Proof of Theorem 2.2

Recall the definition of Q(β; h, t, α) from Section 2.2 for various guides. Denote by β̂(t, α) the maximizer of Q(β; h, t, α). When an estimated guide α̂ is used, it suffices to show that ||β̂(t, α̂) − β̂(t, α^*)|| converges to zero in probability at a rate faster than (nh)^{1/2}, the convergence rate shown in Theorem 2.1 with a fixed guide.

We start by observing two facts. First, n^{1/2}||α̂ − α^*|| = O_p(1), which can be shown using the second set of assumptions at the end of Section A.1. Second, using the assumption that Q₂(u, v) < 0 for each u and v in the range of the response variable, it is evident that both Q(β; h, t, α^*) and Q(β; h, t, α̂) are strictly concave in β. Thus we have

n^{- 1} Q (β; h, t, \hat{α}) = n^{- 1} Q (β; h, t, α^{*}) + O_{p} (n^{- 1 / 2});

(A5)

n^{- 1} \frac{\partial}{\partial β} Q (β; h, t, \hat{α}) = n^{- 1} \frac{\partial}{\partial β} Q (β; h, t, α^{*}) + O_{p} (n^{- 1 / 2});

(A6)

n^{- 1} \frac{\partial^{2}}{\partial β \partial β^{T}} Q (β; h, t, \hat{α}) = n^{- 1} \frac{\partial^{2}}{\partial β \partial β^{T}} Q (β; h, t, α^{*}) + O_{p} (n^{- 1 / 2}),

(A7)

for any β, where the last two equations hold entry-wise. Thus it follows from (A5) that ||β̂(t, α̂) − β̂(t, α^*)|| = o_p(1).

Using (A6), and a Taylor’s expansion, we have

\begin{array}{l} 0 = n^{- 1} \frac{\partial}{\partial β} Q (β; h, t, \hat{α}) ∣_{β = \hat{β} (t, \hat{α})} \\ = n^{- 1} \frac{\partial}{\partial β} Q (β; h, t, α^{*}) ∣_{β = \hat{β} (t, \hat{α})} + O_{p} (n^{- 1 / 2}) \\ = n^{- 1} \frac{\partial}{\partial β} Q (β; h, t, α^{*}) ∣_{β = \hat{β} (t, α^{*})} + n^{- 1} \frac{\partial^{2}}{\partial β \partial β^{T}} Q (β; h, t, α^{*}) ∣_{β = \hat{β} (t, α^{*})} [\hat{β} (t, \hat{α}) - \hat{β} (t, α^{*})] + O_{p} (n^{- 1 / 2}) . \end{array}

Recall that by definition of β̂(t, α^*) we have $\frac{\partial}{\partial β} Q (β; h, t, α^{*}) ∣_{β = \hat{β} (t, α^{*})} = 0$. Also, using an argument similar to that used to derive (A3), the Frobenius norm of the second derivative term in the expansion above can be shown to be O_p(1). Thus we have that

‖ \hat{β} (t, \hat{α}) - \hat{β} (t, α^{*}) ‖ = O_{p} (n^{- 1 / 2}) .

The rate n^1/2 is much faster than the rate (nh)^1/2 shown in Theorem 2.1, and hence the result follows.

References

Brumback B, Rice JA. ‘Smoothing Spline Models for the Analysis of Nested and Crossed Samples of Curves’ (with discussion) Journal of the American Statistical Association. 1998;93:961–994.
Cai Z, Fan J, Li R. Efficient Estimation and Inferences for Varying-coefficient Models. Journal of the American Statistical Association. 2000;95:888–902.
Cao Y, Lin H, Wu TZ, Yu Y. Penalized Spline Estimation for Functional Coefficient Regression Models. Computational Statistics and Data Analysis. 2010;54:891–905. doi: 10.1016/j.csda.2009.09.036.
Chen R, Tsay RS. Functional-coefficient Autoregressive Models. Journal of the American Statistical Association. 1993;88:298–308.
Cleveland W, Grosse E, Shyu W. Local Regression Models. In: Chambers J, Hastie T, editors. Statistical Models in S. 2. Pacific Grove, CA: Wadsworth and Brooks/Cole; 1991. pp. 309–376.
Eilers PHC, Marx BD. Flexible Smoothing with B-splines and Penalties. Statistical Science. 1996;11:89–102.
Fan J. Local Linear Regression Smoothers and their Minimax Efficiencies. The Annals of Statistics. 1993;21:196–216.
Fan J, Farmen M, Gijbels I. Local Maximum Likelihood Estimation and Inference. Journal of the Royal Statistical Society Series B. 1998;60:591–608.
Fan J, Gijbels I. Local Polynomial Modelling and its Applications. London: Chapman and Hall; 1996.
Fan J, Maity A, Wang Y, Wu Y. Parametrically Guided Generalised Additive Models with Application to Mergers and Acquisitions Data. Journal of Nonparametric Statistics. 2013;25:109–128. doi: 10.1080/10485252.2012.735233.
Fan J, Wu Y, Feng Y. Local Quasi-likelihood with a Parametric Guide. The Annals of Statistics. 2009;37:4153–4183. doi: 10.1214/09-AOS713.
Fan J, Zhang W. Statistical Estimation in Varying Coefficient Models. The Annals of Statistics. 1999;27:1491–1518.
Fan J, Zhang W. Statistical Methods with Varying Coefficient Models. Statistics and Its Interface. 2008;1:179–195. doi: 10.4310/sii.2008.v1.n1.a15.
Glad IK. Parametrically Guided Non-parametric Regression. Scandinavian Journal of Statistics. 1998;25:649–668.
Hastie T, Tibshirani R. Varying-coefficient Models. Journal of the Royal Statistical Society Series B. 1993;55:757–796.
Hjort NL, Glad IK. Nonparametric Density Estimation with a Parametric Start. The Annals of Statistics. 1995;23:882–904.
Holdeman JT. A Method for the Approximation of Functions Defined by Formal Series Expansions in Orthogonal Polynomials. Mathematics of Computation. 1969;23:275–287.
Hoover DR, Rice JA, Wu CO, Yang LP. Nonparametric Smoothing Estimates of Time-varying Coefficient Models with Longitudinal Data. Biometrika. 1998;85:809–822.
Huang JZ, Shen H. Functional Coefficient Regression Models for Nonlinear Time Series: A Polynomial Spline Approach. Scandinavian Journal of Statistics. 2004;31:515–534.
Lederman MM, Connick E, Landay A, Kuritzkes DR, Spritzler J, StClair M, Kotzin BL, Fox L, Chiozzi MH, Leonard JM, et al. Immunologic Responses Associated with 12 Weeks of Combination Antiretroviral Therapy Consisting of Zidovudine, Lamivudine, and Ritonavir: Results of AIDS Clinical Trials Group Protocol 315. The Journal of Infectious Diseases. 1998;178:70–79. doi: 10.1086/515591.
Liang H, Wu H, Carroll RJ. The Relationship between Virologic and Immunologic Responses in AIDS Clinical Research using Mixed-effects Varying-coefficient Models with Measurement Error. Biostatistics. 2003;4:297–312. doi: 10.1093/biostatistics/4.2.297.
Ma S, Yang L, Romero R, Cui Y. Varying Coefficient Model for Gene-environment Interaction: A Non-linear Look. Bioinformatics. 2011;27:2119–2126. doi: 10.1093/bioinformatics/btr318.
Martins-Filho C, Mishra S, Ullah A. A Class of Improved Parametrically Guided Nonparametric Regression Estimators. Econometric Reviews. 2008;27:542–573.
Nelder JA, Wedderburn RWM. Generalized Linear Models. Journal of the Royal Statistical Society Series A. 1972;135:370–384.
Wedderburn RWM. Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika. 1974;61:439–447.
White H. Maximum Likelihood Estimation of Misspecified Models. Econometrica. 1982;50:1–25.
Wood SN. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman and Hall/CRC; 2006.
Wu H, Wu L. Identification of Significant Host Factors for HIV Dynamics Modelled by Non-linear Mixed-effects Models. Statistics in Medicine. 2002;21:753–771. doi: 10.1002/sim.1015.
