Abstract
We study a semiparametric generalized additive coefficient model, in which the linear predictor in a conventional generalized linear model is generalized to unknown functions of certain covariates, and the nonparametric functions are approximated by polynomial splines. The asymptotic expansion with optimal rates of convergence is established for the estimators of the nonparametric part. A semiparametric generalized likelihood ratio test is also proposed to check whether a nonparametric coefficient can be simplified to a parametric one. A conditional bootstrap version is suggested to approximate the distribution of the test statistic under the null hypothesis. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed methods. We further apply the proposed model and methods to a data set from a human visceral Leishmaniasis (HVL) study conducted in Brazil from 1994 to 1997. The numerical results show that the proposed generalized additive coefficient model outperforms the traditional generalized linear model and is preferable for these data.
Keywords: Conditional bootstrap, generalized additive models, knots, maximum likelihood estimation, optimal rate of convergence, spline approximation
1 INTRODUCTION
The most common model used in analyzing the relationship between a discrete response variable and covariates is the generalized linear model (McCullagh & Nelder, 1989). Given a link function, it models the relationship between the dependent and explanatory variables through a linear functional form. However, many data sets arising in a variety of disciplines, such as economics, political science, geography, and epidemiology, require more flexible forms than the usual linearity. Recently, many non- and semi-parametric models have been proposed to relax the strict linearity assumption in generalized linear models, such as the generalized additive models (Hastie & Tibshirani, 1990; Härdle et al., 2004), the generalized varying coefficient models (Cai et al., 2000), the generalized partially linear models (Green & Silverman, 1994; Härdle et al., 2000; Liang & Ren, 2005), and the generalized partially linear single-index models (Carroll et al., 1997).
In this paper, we propose a new semiparametric model, namely the generalized additive coefficient model (GACM), which is an extension of the varying coefficient models (Hastie & Tibshirani, 1993). Similar to the generalized varying coefficient model, it allows the coefficients of the linear covariates to depend on certain covariates, called tuning variables. But it further imposes an additive functional form on the coefficient functions to circumvent the so-called "curse of dimensionality" when the dimension of the tuning variables is large. As seen in Section 2, the proposed GACM is flexible enough to include the aforementioned non- and semiparametric models as special cases.
A motivation for this study comes from the analysis of an epidemiological data set, which consists of the numbers of human visceral Leishmaniasis (HVL) cases in 117 health zones in Belo Horizonte, Brazil, from 1994 to 1997. HVL is mainly a rural disease that has become prevalent in recent years in Brazilian urban areas. The first human case of HVL was recorded in March 1989 in Sabará, a municipality located in the Belo Horizonte metropolitan region. Afterwards, in spite of the control actions undertaken, the disease spread into the city from the northeast. The annual numbers of human cases recorded in 1994, 1995, and 1996 are 29, 46, and 45 respectively, and a total of 40 cases were reported in the first semester of 1997 alone. As argued in Assunção (2003), the small number of cases in each area produced very unstable rates, preventing a more focused public health action. One of the main interests of the study is to model the disease diffusion over time, to better monitor the disease and allocate resources for disease control. A possible approach is to model the HVL case numbers using a traditional Poisson regression model with a polynomial time trend; see model (8). But Belo Horizonte is a large Brazilian city with great social, economic, and geographical diversity, and the disease first appeared in the northeast and spread into the city afterwards. Thus the dynamics of the disease progression over time differ across the region, and a traditional Poisson model with constant coefficients over the whole space, such as model (8), may not be able to capture this spatially varying phenomenon. This can, however, be incorporated into the generalized additive coefficient model by allowing the coefficients of the linear covariates to vary smoothly with the location indexes (the latitude and longitude); see model (7). Our analysis in Section 6 shows that the GACM outperforms the generalized linear model in terms of both estimation and prediction.
In the least squares setting, Xue & Yang (2006a, b) considered the estimation of the additive coefficient model for Gaussian data using both kernel and polynomial spline methods. In contrast, this paper studies the estimation, as well as testing, of the model for non-Gaussian data through maximizing the likelihood with polynomial spline smoothing. The convergence results for the maximum likelihood estimates in this paper are similar to those for regression established by Xue & Yang (2006b). But, as Huang (1998) pointed out, it is more technically challenging to establish the rate of convergence for the maximum likelihood estimates, since they cannot be viewed simply as an orthogonal projection, due to the nonlinear structure. Another contribution of this paper is an efficient testing procedure for the coefficient functions that combines polynomial spline smoothing with the conditional bootstrap.
The use of polynomial spline smoothing in generalized nonparametric models has been investigated in various contexts. Stone (1986) first obtained the rate of convergence of the polynomial spline estimates for the generalized additive model. Stone (1994) and Huang (1998) focused on the polynomial spline estimation of the generalized functional ANOVA model, while Huang et al. (2000) and Huang & Liu (2006) considered the functional ANOVA model and the single-index model, respectively, in proportional hazards regression via maximum partial likelihood estimation. Polynomial spline smoothing is a global smoothing method, which approximates the unknown functions via polynomial splines characterized by only a finite number of coefficients. After the spline basis is chosen, the coefficients can be estimated by an efficient one-step procedure of maximizing the likelihood function. It is computationally cheaper than kernel-based methods, where the maximization has to be conducted repeatedly at every local data point. Thus the application of polynomial spline smoothing in the current context is particularly computationally efficient.
This paper is organized as follows. Section 2 introduces the GACM. Section 3 gives an efficient polynomial spline estimation method for the proposed model. Mean square (or L2) convergence results are established for the estimators under mild assumptions. Section 4 discusses a testing procedure of the coefficient functions via conditional bootstrap approach. The simulation studies and an application of the proposed methods in a real data example are included in Sections 5 and 6 respectively. The technical lemmas and proofs are given in the appendix.
2 THE MODEL
In our definition of the generalized regression models, we follow the notation in Stone (1986, 1994) and Huang (1998). The set-up involves an exponential family of distributions of the form

exp {B (η) y − C (η)} dρ (y),

where B (·) and C (·) are known, sufficiently smooth functions, and ρ is a non-zero measure on R that is not concentrated at a single point. Correspondingly, the mean of the distribution is μ = A (η) = C′ (η) / B′ (η), where B′ (·) and C′ (·) are the first-order derivatives of B (·) and C (·) respectively. Equivalently, η = A−1 (μ), with the function A−1 being the link function.
Consider a random vector (Y, X, T), in which Y is a real-valued response variable, and (X, T) are predictor variables with X = (X1, . . . , Xd2)T, T = (T1, . . . , Td1)T. The conditional distribution of Y given (X, T) is connected to the above exponential family distribution through the assumption that
η (X, T) = Σl=1,...,d1 αl (X) Tl, with αl (X) = αl0 + Σs=1,...,d2 αls (Xs). (1)
The model (1) is called the generalized additive coefficient model (GACM). To ensure that the components in (1) are identifiable, we impose the restriction E {αls (Xs)} = 0, for l = 1, . . . , d1, s = 1, . . . , d2. As in most work on nonparametric smoothing, estimation of the functions is conducted on a compact support. Without loss of generality, let this compact set be χ = [0, 1]d2.
The proposed GACM in (1) is quite general and flexible enough to cover a variety of situations. For example, when d2 = 0, or equivalently, when there are no predictors X, (1) reduces to the generalized linear model. When d2 = 1, (1) becomes the generalized varying coefficient model (Cai et al., 2000). When T1 = · · · = Td1 = constant, (1) becomes the generalized additive model (Hastie & Tibshirani, 1990; Härdle et al., 2004).
Similar to Huang (1998), if the conditional distribution of Y given X = x, T = t follows the exponential family distribution described above with η = η (x, t), then assumption (1) is satisfied and the log-likelihood function is given by l (h, X, T, Y) = B (h (X, T)) Y − C (h (X, T)), for any function h defined on χ × Rd1. If the conditional distribution of Y given X = x, T = t does not follow the exponential family distribution, we can regard l (h, X, T, Y) as a pseudo-log-likelihood. For simplicity, we refer to both cases as the log-likelihood function.
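To make the model concrete, the following minimal sketch simulates data from a GACM with a Poisson response and log link. The coefficient functions, dimensions, and seed are illustrative choices of ours, not the settings used later in the paper.

```python
import numpy as np

def eta_gacm(x, t, alpha0, alpha_fns):
    """Additive-coefficient predictor eta(x, t) = sum_l alpha_l(x) * t_l,
    with alpha_l(x) = alpha_l0 + sum_s alpha_ls(x_s), as in model (1).

    x : (n, d2) tuning variables; t : (n, d1) linear covariates;
    alpha0 : length-d1 constants; alpha_fns[l][s] : centered function alpha_ls.
    """
    n, d1 = t.shape
    eta = np.zeros(n)
    for l in range(d1):
        coef = np.full(n, float(alpha0[l]))
        for s, f in enumerate(alpha_fns[l]):
            coef = coef + f(x[:, s])
        eta += coef * t[:, l]
    return eta

# Illustrative components (hypothetical, not the paper's examples):
alpha_fns = [
    [lambda u: np.sin(2 * np.pi * u), lambda u: np.zeros_like(u)],
    [lambda u: np.zeros_like(u), lambda u: 2.0 * u - 1.0],
]
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(size=(n, 2))          # tuning variables on [0, 1]^2
t = rng.normal(size=(n, 2))           # linear covariates
eta = eta_gacm(x, t, [1.0, 0.0], alpha_fns)
y = rng.poisson(np.exp(eta))          # Poisson response, log link
```

Each alpha_ls above has mean zero over [0, 1], matching the identifiability restriction E {αls (Xs)} = 0.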
3 POLYNOMIAL SPLINE ESTIMATION
We propose to estimate the nonparametric functions in model (1) using the polynomial spline smoothing method. It involves an approximation of the nonparametric functions using polynomial splines.
For each of the tuning variable directions, i.e., s = 1, . . . , d2, we introduce a knot sequence ks,n on [0, 1] with Nn interior knots, ks,n = {0 = vs,0 < vs,1 < · · · < vs,Nn < vs,Nn+1 = 1}. For any nonnegative integer p, we denote by φs = φp ([0, 1], ks,n) the space of functions each element of which
(i) is a polynomial of degree p (or less) on each of the intervals [vs,i, vs,i+1), for i = 0, . . . , Nn − 1, and on [vs,Nn, vs,Nn+1];

(ii) is p − 1 times continuously differentiable on [0, 1] if p ≥ 1.
A function that satisfies (i) and (ii) is called a polynomial spline: a piecewise polynomial connected smoothly at the interior knots. For example, a polynomial spline of degree p = 0 is a piecewise constant function, and a polynomial spline of degree p = 1 is a piecewise linear function that is continuous on [0, 1]. The polynomial spline space φs is determined by the degree p and the knot sequence ks,n. Let hs = hs,n = maxi=0,...,Nn |vs,i+1 − vs,i|, which is called the mesh size of ks,n and can be understood as a smoothness parameter, analogous to the bandwidth in the local polynomial context. Define h = maxs=1,...,d2 hs, which measures the overall smoothness.
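As a small numerical illustration of these definitions, the sketch below builds an equally spaced knot sequence and computes its mesh size h; the helper names are ours.

```python
import numpy as np

def equally_spaced_knots(N):
    """Knot sequence 0 = v_0 < v_1 < ... < v_N < v_{N+1} = 1
    with N equally spaced interior knots."""
    return np.linspace(0.0, 1.0, N + 2)

def mesh_size(knots):
    """Mesh size h = max_i |v_{i+1} - v_i|, the smoothness parameter that
    plays a role analogous to a bandwidth."""
    return float(np.max(np.diff(knots)))

knots = equally_spaced_knots(4)   # [0, 0.2, 0.4, 0.6, 0.8, 1]
h = mesh_size(knots)              # 0.2
```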
To consistently estimate the functions in (1), which are centered with E {αls (Xs)} = 0, we introduce the empirically centered polynomial spline space φs0 = {φ ∈ φs : En {φ (Xs)} = 0}, where En denotes the empirical expectation.

A basis of φs0 can be conveniently constructed. For example, we have used the empirically centered truncated power basis in the implementation, i.e.

bsj0 (xs) = bsj (xs) − En {bsj (Xs)}, j = 1, . . . , Jn,

where Jn = Nn + p, and {bs1, . . . , bsJn} is the truncated power basis given as

bsj (xs) = xs^j for j = 1, . . . , p, and bs,p+k (xs) = (xs − vs,k)+^p for k = 1, . . . , Nn,

in which (x)+ = max (x, 0). If the functions αls are smooth enough, then one can approximate them by polynomial splines in φs0. That is, for each l = 1, . . . , d1, s = 1, . . . , d2, one has

αls (xs) ≈ Σj=1,...,Jn βls,j bsj0 (xs),

with a set of coefficients {βls,j}. Denote l (η (X, T), Y) = B [η (X, T)] Y − C [η (X, T)], and ln (η) = En {l (η (X, T), Y)} for any candidate function η. Then the log-likelihood function ln (η) can be approximated by
l̃n (α0, β) = En l (Σl=1,...,d1 {αl0 + Σs=1,...,d2 Σj=1,...,Jn βls,j bsj0 (Xs)} Tl, Y), (2)

in which bsj0 (xs) = bsj (xs) − En {bsj (Xs)} are the empirically centered basis functions, and the coefficients α0 = (α10, . . . , αd10)T and β = (βls,j) can be obtained by maximizing this approximate log-likelihood, i.e.

(α̂0, β̂) = argmax α0,β l̃n (α0, β). (3)
Then the resulting estimators of the functions are given as

α̂ls (xs) = Σj=1,...,Jn β̂ls,j bsj0 (xs),
for l = 1, . . . , d1, s = 1, . . . , d2. As a result,
η̂ (x, t) = Σl=1,...,d1 {α̂l0 + Σs=1,...,d2 α̂ls (xs)} tl. (4)
The maximization of (3) can be easily carried out using existing software for generalized linear models. Furthermore, only one maximum likelihood procedure is needed to estimate all the components in the coefficient functions, which is much more computationally efficient than kernel-based methods, where one needs to perform the maximum likelihood estimation at each local point. On the other hand, the next theorem shows that the polynomial spline estimators enjoy the same optimal rate of convergence as the kernel estimators. In the following, ||·|| denotes the theoretical L2-norm defined in (A.1) in the appendix.
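As a concrete illustration of this one-step computation, the sketch below assembles the spline design matrix from empirically centered truncated power bases and maximizes a Poisson log-likelihood by iteratively reweighted least squares, the same computation that standard generalized linear model software performs. All function names, the warm start, and the simulated example are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def centered_tp_basis(x, knots, p=1):
    """Truncated power basis {x, ..., x^p, (x - v_k)_+^p} on [0, 1],
    each column empirically centered to mean zero (identification of alpha_ls)."""
    cols = [x ** j for j in range(1, p + 1)]
    cols += [np.clip(x - v, 0.0, None) ** p for v in knots[1:-1]]
    B = np.column_stack(cols)
    return B - B.mean(axis=0)

def fit_gacm_poisson(y, x, t, knots, p=1, n_iter=30):
    """One-step spline ML fit in the spirit of (3) for a Poisson GACM with
    log link, via iteratively reweighted least squares (IRLS)."""
    n, d1 = t.shape
    blocks = []
    for l in range(d1):                     # column t_l carries alpha_l0
        blocks.append(t[:, [l]])
        for s in range(x.shape[1]):         # t_l times centered basis in x_s
            blocks.append(t[:, [l]] * centered_tp_basis(x[:, s], knots, p))
    D = np.hstack(blocks)
    beta, *_ = np.linalg.lstsq(D, np.log(y + 0.5), rcond=None)  # warm start
    for _ in range(n_iter):                 # Newton/IRLS steps
        eta = np.clip(D @ beta, -30, 30)
        mu = np.exp(eta)
        z = eta + (y - mu) / np.maximum(mu, 1e-8)
        w = np.sqrt(mu)
        beta, *_ = np.linalg.lstsq(D * w[:, None], z * w, rcond=None)
    return beta, D

# Hypothetical simulated example; the model pieces are illustrative only.
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(size=(n, 2))
t = rng.normal(size=(n, 2))
eta_true = (1 + np.sin(2 * np.pi * x[:, 0])) * t[:, 0] + (2 * x[:, 1] - 1) * t[:, 1]
y = rng.poisson(np.exp(eta_true))
beta_hat, D = fit_gacm_poisson(y, x, t, np.linspace(0, 1, 6), p=1)
```

One could equally pass the same design matrix to off-the-shelf GLM routines; the explicit IRLS loop simply makes the single maximization step visible.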
Theorem 1
Under the assumptions (C1)-(C8) in the Appendix, one has

||η̂ − η||^2 = Op (h^{2(p+1)} + Jn/n),

and for l = 1, . . . , d1, s = 1, . . . , d2,

||α̂ls − αls||^2 = Op (h^{2(p+1)} + Jn/n).
Theorem 1 shows the mean square (or L2) consistency of the polynomial spline estimators. When the smoothing parameter h takes the optimal order n^{−1/(2p+3)}, ||α̂ls − αls|| = Op (n^{−(p+1)/(2p+3)}), which is the optimal rate of convergence for univariate nonparametric functions. As a result, each of the d2-dimensional coefficient functions is also estimated at the univariate optimal rate. Therefore the proposed GACM effectively avoids the "curse of dimensionality" by assuming an additive structure on the coefficient functions.
3.1 Knot number selection
The proposed polynomial spline estimation procedure crucially depends on an appropriate choice of the knot sequences ks,n and, in particular, on the number of interior knots Nn. Here we propose to select Nn using an AIC procedure. For the knot locations, we use either equally spaced knots or quantile knots (sample quantiles with the same number of observations between any two adjacent knots). A similar procedure was also used in Huang et al. (2002) and Xue & Yang (2006b).
According to Theorem 1, the optimal order of Nn is n^{1/(2p+3)}. Thus we propose to choose the optimal knot number, N̂n, from a neighborhood of n^{1/(2p+3)}. For our examples in Sections 5 and 6, we have used the range [0.5Nr, min(5Nr, Tb)], where Nr = ceiling(n^{1/(2p+3)}) and Tb = {n/(4d1) − 1}/d2, to ensure that the total number of parameters in (2) is less than n/4. Let η̂Nn (·) be the estimator of η (·) with Nn interior knots, and let l̂n (Nn) be the resulting maximized log-likelihood. Let qn = (1 + d2Jn) d1 be the total number of parameters in (2). Then the optimal knot number N̂n is the one which minimizes the AIC value. That is,

N̂n = argmin Nn AIC (Nn), with AIC (Nn) = −2 l̂n (Nn) + 2qn.
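A sketch of the AIC knot-number search, for simplicity in a univariate Poisson special case (one coefficient function, T ≡ 1); the candidate range mimics the neighborhood of n^{1/(2p+3)} described above, and all names and settings are illustrative assumptions.

```python
import numpy as np

def poisson_spline_aic(y, x, N, p=1, n_iter=25):
    """AIC = -2 * maximized log-likelihood + 2 * (number of parameters) for a
    Poisson log-rate modeled by a degree-p truncated power spline in x with
    N equally spaced interior knots."""
    knots = np.linspace(0.0, 1.0, N + 2)[1:-1]
    D = np.column_stack([np.ones_like(x)]
                        + [x ** j for j in range(1, p + 1)]
                        + [np.clip(x - v, 0.0, None) ** p for v in knots])
    beta, *_ = np.linalg.lstsq(D, np.log(y + 0.5), rcond=None)  # warm start
    for _ in range(n_iter):                                     # IRLS
        eta = np.clip(D @ beta, -30, 30)
        mu = np.exp(eta)
        z = eta + (y - mu) / np.maximum(mu, 1e-8)
        w = np.sqrt(mu)
        beta, *_ = np.linalg.lstsq(D * w[:, None], z * w, rcond=None)
    eta = np.clip(D @ beta, -30, 30)
    loglik = np.sum(y * eta - np.exp(eta))     # up to a term free of eta
    return -2.0 * loglik + 2.0 * D.shape[1]

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(size=n)
y = rng.poisson(np.exp(np.sin(2 * np.pi * x)))
p = 1
Nr = int(np.ceil(n ** (1.0 / (2 * p + 3))))    # optimal order n^{1/(2p+3)}
candidates = list(range(max(1, Nr // 2), 5 * Nr + 1))
aics = [poisson_spline_aic(y, x, N, p) for N in candidates]
N_hat = candidates[int(np.argmin(aics))]
```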
4 HYPOTHESIS TESTING
After fitting the GACM (1), a natural question that arises is whether the coefficient functions are actually varying or, more generally, whether certain parametric forms, such as polynomials, fit the nonparametric components. This leads us to consider hypothesis testing problems such as:
H0 : αls (xs) = αls (xs, θ) versus H1 : αls (xs) ≠ αls (xs, θ),

where αls (·, θ) is a given parametric family of functions, such as polynomials, and θ is a vector of unknown parameters in the polynomial function. It includes testing whether the component αls is varying, in which case αls (xs, θ) ≡ 0 (since E {αls (Xs)} = 0, a non-varying αls must be identically zero). One option is the nonparametric likelihood ratio test statistic (Fan et al., 2001), which is defined as
Tn = ln (H1) − ln (H0), (5)
in which ln (H0) and ln (H1) are the maximized log-likelihood functions calculated under the null and alternative hypotheses respectively. To be more specific, under the null hypothesis we model αls as a polynomial of degree q and approximate all the other functions in the model with polynomial splines of degree p, with p ≥ q. Under the alternative hypothesis, all functions in the model are approximated with polynomial splines of degree p. We have used the AIC procedure in subsection 3.1 to choose the optimal knot number for the full generalized additive coefficient model under the alternative hypothesis; the same number of knots is then used for estimation of the nonparametric functions in the null model.
Theorem 2
Under the assumptions (C1)-(C8) in the Appendix, one has, under H0, Tn → 0 in probability as n → ∞; otherwise, there exists δ > 0 such that Tn > δ with probability tending to one.
The result of Theorem 2 suggests rejecting H0 for large Tn. To obtain an appropriate critical value, we approximate the null distribution of Tn using the conditional bootstrap method; see also Cai et al. (2000) and Fan & Huang (2005). Let η̃ (x, t) denote the estimator of η (x, t) under H0, obtained from the estimators of the constants and coefficient functions in the null model. In the conditional bootstrap procedure, a total of B bootstrap samples are generated; in our examples given in Sections 5 and 6, we have used B = 500. In each sample b = 1, . . . , B, the values of the independent variables (Xi, Ti) are kept the same as the observed ones, while the bootstrap responses are generated from the assumed distribution of Y with η (x, t) set to η̃ (x, t). Then a test statistic Tn(b) as in (5) is computed from the b-th bootstrap sample. In the implementation, the AIC knot selection procedure for the alternative model is performed for each bootstrap sample. The empirical distribution of {Tn(b), b = 1, . . . , B} is used to approximate the null distribution of the test statistic Tn. In particular, for a given significance level α, the (1 − α)100% percentile of {Tn(b)} is used as the critical value.
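The conditional bootstrap can be sketched as follows for a toy Poisson example, testing a constant log-rate (H0) against a linear spline (H1). For brevity the AIC knot-selection step is omitted and B = 200 is used; everything here is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def avg_loglik_poisson(D, y, n_iter=25):
    """IRLS fit of a Poisson log-linear model with design D; returns the
    average log-likelihood (up to a term free of eta) and the fitted means."""
    beta, *_ = np.linalg.lstsq(D, np.log(y + 0.5), rcond=None)
    for _ in range(n_iter):
        eta = np.clip(D @ beta, -30, 30)
        mu = np.exp(eta)
        z = eta + (y - mu) / np.maximum(mu, 1e-8)
        w = np.sqrt(mu)
        beta, *_ = np.linalg.lstsq(D * w[:, None], z * w, rcond=None)
    eta = np.clip(D @ beta, -30, 30)
    return np.mean(y * eta - np.exp(eta)), np.exp(eta)

def lr_stat(y, D0, D1):
    """T_n = l_n(H1) - l_n(H0) as in (5); also returns the null fitted means."""
    l0, mu0 = avg_loglik_poisson(D0, y)
    l1, _ = avg_loglik_poisson(D1, y)
    return l1 - l0, mu0

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(size=n)
y = rng.poisson(np.exp(np.sin(2 * np.pi * x)))
D0 = np.ones((n, 1))                                   # H0: constant log-rate
knots = np.linspace(0.0, 1.0, 5)[1:-1]
D1 = np.column_stack([np.ones(n), x]                   # H1: linear spline in x
                     + [np.clip(x - v, 0.0, None) for v in knots])
T_obs, mu0 = lr_stat(y, D0, D1)

B = 200                                                # bootstrap replicates
T_boot = np.empty(B)
for b in range(B):
    yb = rng.poisson(mu0)                              # resample Y from the null fit
    T_boot[b], _ = lr_stat(yb, D0, D1)
p_value = float(np.mean(T_boot >= T_obs))
```

Here the covariate values stay fixed across bootstrap samples, exactly as in the conditional bootstrap described above.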
5 SIMULATION STUDY
In this section, we investigate the finite-sample performance of the proposed estimation and testing methods through two simulation studies. We use the averaged integrated squared error (AISE) to evaluate the performance of the function estimators α̂ls, which is defined as

AISE (α̂ls) = (1/nrep) Σr=1,...,nrep (1/ngrid) Σm=1,...,ngrid {α̂ls^(r) (xms) − αls (xms)}^2,

where α̂ls^(r) denotes the estimator of αls (·) in the r-th replication for r = 1, . . . , nrep, and {xms, m = 1, . . . , ngrid} are the grid points where the nonparametric functions are evaluated. In both examples, we have used sample sizes n = 250, 500, 750, and the number of replications nrep = 1000. The nonparametric functions αls (·) are all evaluated on a grid of equally spaced points xms, m = 1, . . . , ngrid, with x1s = 0.025, x(ngrid)s = 0.975, and ngrid = 96.
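A minimal sketch of the AISE computation, with two artificial "replications" whose integrated squared errors are known exactly (the estimator values are fabricated for illustration):

```python
import numpy as np

def aise(estimates, truth_fn, grid):
    """AISE: average over replications of the integrated squared error,
    where each ISE is approximated by the mean squared error on the grid."""
    truth = truth_fn(grid)
    return float(np.mean([np.mean((est - truth) ** 2) for est in estimates]))

grid = np.linspace(0.025, 0.975, 96)       # the 96 evaluation points used here
ests = [np.sin(2 * np.pi * grid) + 0.1,    # two artificial "replications",
        np.sin(2 * np.pi * grid) - 0.1]    # each off by a constant 0.1
val = aise(ests, lambda u: np.sin(2 * np.pi * u), grid)   # 0.1^2 = 0.01
```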
5.1 Logistic regression
Data sets are generated from a logistic regression model in which the binary response variable Yi has the distribution

P (Yi = 1 | Xi, Ti) = exp (ηi) / {1 + exp (ηi)}, ηi = {α10 + α11 (Xi1) + α12 (Xi2)} Ti1 + {α20 + α21 (Xi1) + α22 (Xi2)} Ti2,
in which α10 = 1, α20 = 0, α11 (x) = α21 (x) = sin (2πx), α12 (x) = 0, and α22 (x) = 2x − 1. The covariates Ti = (Ti1, Ti2)T and Xi = (Xi1, Xi2)T are independently generated from the standard bivariate normal distribution and Uniform([0, 1]2) respectively.
We have applied the proposed polynomial spline estimation method with both the linear spline (p = 1) and the cubic spline (p = 3). Estimation with other degrees, such as the quadratic spline (p = 2), can also be used and gives similar findings. We have used the AIC procedure to select the number of knots, which are evenly placed over the range of Xis for each s = 1, . . . , d2. Table 1 summarizes the means and standard errors (in parentheses) of {α̂l0}l=1,2 and the AISEs of the function estimators from both the linear and cubic spline estimations. It shows that the two spline fits are generally comparable, with the cubic spline slightly better than the linear spline at the smaller sample sizes. In both cases, the standard errors of the constant estimators and the AISEs of the function estimators decrease as the sample size n increases, which confirms Theorem 1. The typical estimated curves (those whose ISE is the median among the 1000 replications) from the linear polynomial spline estimation are plotted in Figure 1, together with the point-wise 95% confidence intervals when n = 500. It clearly shows that the linear spline fits are reasonably good.
Table 1.
Logistic regression example: the means and standard errors (in parentheses) of α̂10, and α̂20 and the AISEs of α̂11 (·), α̂12 (·), α̂21 (·), and α̂22 (·) from 1000 replications
| Methods | n | α10 = 1 | α20 = 0 | α11 | α12 | α21 | α22 |
|---|---|---|---|---|---|---|---|
| Linear spline | 250 | 1.169(0.321) | −0.009(0.251) | 0.4234 | 0.2182 | 0.3256 | 0.2684 |
| Linear spline | 500 | 1.068(0.157) | 0.005(0.134) | 0.0953 | 0.0726 | 0.0804 | 0.0686 |
| Linear spline | 750 | 1.047(0.126) | 0.003(0.108) | 0.0679 | 0.0451 | 0.0546 | 0.0418 |
| Cubic spline | 250 | 1.197(0.304) | 0.002(0.236) | 0.3531 | 0.2165 | 0.2429 | 0.2689 |
| Cubic spline | 500 | 1.082(0.159) | 0.007(0.137) | 0.0954 | 0.0799 | 0.0751 | 0.0682 |
| Cubic spline | 750 | 1.066(0.125) | 0.001(0.111) | 0.0688 | 0.0495 | 0.0527 | 0.0467 |
Figure 1.
Logistic regression example: plots of the typical estimated coefficient functions using linear polynomial spline method: (a) α11; (b) α12; (c) α21; (d) α22. In each plot, the solid curve represents the true curve, the dashed curve is the typical estimated curve with n = 250, the dotted curve is with n = 500 and the dotdash curve is with n = 750. The two long-dashed curves are the point-wise 95% confidence intervals when n = 500.
Next we examine the proposed testing procedure and consider the following hypothesis
H0 : α12 (x) ≡ 0 versus H1 : α12 (x) ≠ 0. (6)
The powers are evaluated under a sequence of alternative models, H1 : α12 (x) = λ sin (2πx), where λ controls the degree of departure from the null hypothesis, with λ = 0 corresponding to H0. The value of λ is taken over a grid of equally spaced points on [0, 1.5]. Based on 1000 replications for sample sizes n = 250, 500, and 750, Figure 2 plots the power functions at significance level α = 0.05. It shows that the power increases to 1 rapidly as λ increases. The powers at λ = 0 are 0.054, 0.056, and 0.051 for n = 250, 500, and 750 respectively, all close to the nominal significance level.
Figure 2.
Logistic regression example: the power function of the test statistic Tn is plotted against λ, for sample sizes n = 250 (solid curve), 500 (dashed curve), and 750 (dotted curve). The significance level is 0.05.
5.2 Poisson regression
In this example, we consider a Poisson regression model with

Yi | Xi, Ti ~ Poisson (exp (ηi)), ηi = {α10 + α11 (Xi1) + α12 (Xi2)} Ti1 + {α20 + α21 (Xi1) + α22 (Xi2)} Ti2,

where different forms of the coefficient functions are considered, with α10 = 1, α20 = 0, and α11 (x) = 4x (1 − x) − 2/3, α12 (x) = 0, α21 (x) = sin^2 (πx) − 0.5, α22 (x) = exp (2x − 1) / {exp (1) − exp (−1)} − 1/2. The covariates are generated in the same way as in the logistic regression example.
As in the logistic regression example, we have used both linear spline (p = 1) and cubic spline (p = 3) estimation of the coefficient functions. Equally spaced knots are used, with the number of interior knots chosen by the AIC procedure. The simulation results are summarized in Table 2, which contains the means and standard errors (in parentheses) of {α̂l0}l=1,2 and the AISEs of the function estimators from the two spline fits. As in the logistic regression example, Table 2 shows the convergence of both {α̂l0}l=1,2 and the function estimators as n increases, which again corroborates Theorem 1. The typical estimated curves from the linear spline method, with their point-wise 95% confidence intervals when n = 500, in Figure 3 show that the proposed spline method gives reasonable estimators of the coefficient functions. We also studied the performance of the proposed testing procedure for this Poisson regression. The same hypothesis (6) as in the logistic regression example is considered. The powers are evaluated under the alternative models H1 : α12 (x) = λ {4x (1 − x) − 2/3}, with λ taken over a grid of equally spaced points on [0, 1.5]. Based on 1000 replications for sample sizes n = 250, 500, and 750, Figure 4 plots the power functions at significance level α = 0.05. The powers at λ = 0 are 0.045, 0.053, and 0.054 respectively, which are close to the nominal significance level.
Table 2.
Poisson regression example: the means and standard errors (in parentheses) of α̂10, and α̂20 and the AISEs of α̂11 (·), α̂12 (·), α̂21 (·), and α̂22 (·) using linear and cubic spline estimations from 1000 replications
| Methods | n | α10 = 1 | α20 = 0 | α11 | α12 | α21 | α22 |
|---|---|---|---|---|---|---|---|
| Linear spline | 250 | 0.9923(0.054) | 0.0013(0.061) | 0.0099 | 0.0084 | 0.0168 | 0.0119 |
| Linear spline | 500 | 0.9963(0.033) | 0.0022(0.041) | 0.0047 | 0.0036 | 0.0084 | 0.0054 |
| Linear spline | 750 | 0.9957(0.027) | 0.0005(0.033) | 0.0030 | 0.0022 | 0.0056 | 0.0034 |
| Cubic spline | 250 | 0.9896(0.053) | 0.0016(0.060) | 0.0092 | 0.0091 | 0.0134 | 0.0112 |
| Cubic spline | 500 | 0.9945(0.033) | 0.0017(0.039) | 0.0041 | 0.0032 | 0.0053 | 0.0048 |
| Cubic spline | 750 | 0.9941(0.027) | 0.0009(0.032) | 0.0031 | 0.0024 | 0.0042 | 0.0037 |
Figure 3.
Poisson regression example: plots of the typical estimated coefficient functions using linear polynomial spline method: (a) α11; (b) α12; (c) α21; (d) α22. In each plot, the solid curve represents the true curve, the dashed curve is the typical estimated curve with n = 250, the dotted curve is with n = 500 and the dotdash curve is with n = 750. The two long-dashed curves are the point-wise 95% confidence intervals when n = 500.
Figure 4.
Poisson regression example: the power function of the test statistic Tn is plotted against λ, for sample sizes n = 250 (solid curve), 500 (dashed curve), and 750 (dotted curve). The significance level is 0.05.
6 REAL DATA ANALYSIS
In this section we apply the proposed GACM to analyze the data set from the HVL study introduced in Section 1. The data consist of the annual number of human HVL cases and the total population count for each of the 117 zones in each of the periods 1994/1995, 1995/1996, and 1996/1997. A period comprises the second semester of a year (starting July 1) and the first semester of the following year (ending June 30). For more information on the data, see Assunção et al. (2001). Belo Horizonte is a large Brazilian city with more than 2 million inhabitants, and the spatial effects are not necessarily homogeneous over the whole area. Assunção et al. (2001) and Assunção (2003) took the varying spatial effect into account and used a Bayesian spatially varying parameter model to study the diffusion of the disease. Motivated by their analysis, we model the varying spatial effect by using the generalized additive coefficient model as follows.
Let yit be the annual count of cases and Pit the risk population in each zone i, i = 1, . . . , 117, for the three periods t = 0, 1, 2, which represent 1994/1995, 1995/1996, and 1996/1997 respectively. As in Assunção et al. (2001) and Assunção (2003), we assume that, conditional on the relative risk exp (λit), the counts are independently distributed according to a Poisson distribution with mean Pit exp (λit). A second-degree polynomial in t is assumed for λit to model the time trend. To allow for varying spatial effects, we further allow the coefficients of the polynomial terms to vary with the spatial coordinates (xi1, xi2) of each zone. To be more specific, we assume
λit = {α00 + α01 (xi1) + α02 (xi2)} + {α10 + α11 (xi1) + α12 (xi2)} t + {α20 + α21 (xi1) + α22 (xi2)} t^2. (7)
Model (7) allows the time profile to vary smoothly over space, and thus effectively models the space-time interaction. The coefficient functions in (7) are estimated using both the linear spline (p = 1) and the quadratic spline (p = 2), with knot numbers selected by the AIC procedure of subsection 3.1. Figure 5 plots the estimated coefficient functions and shows that the two spline fits are very close. Therefore, for simplicity, we only report the results using the linear spline in what follows. For comparison, we also consider a standard Poisson regression model with constant coefficients, i.e.,
λit = β0 + β1 t + β2 t^2. (8)
Figure 5.
The estimated coefficient functions in model (7) using linear splines (dashed curve) and quadratic splines (solid curve). Plotted are (a) α̂01, (b) α̂02, (c) α̂11, (d) α̂12, (e) α̂21, (f) α̂22.
Fits are measured by Akaike's Information Criterion (AIC), which is minus twice the maximized log-likelihood plus twice the number of parameters. Models (7) and (8) give AIC values of 470.98 and 626.88 respectively, which indicates that model (7) gives a better fit even with model complexity taken into account. Figure 6 graphically compares the residuals from the two models, where yit and ŷit are the observed and estimated annual counts of cases for the ith zone and tth period. We also compare the models by their prediction performance. We randomly select 15 zones from the 117 health zones; the observations from the last period, 1996/1997, in the selected 15 zones are left out for prediction, while the remaining observations are used for estimation. Then the averaged squared prediction errors (ASPE) are calculated. We replicated the prediction procedure ten times; the averaged ASPEs over the ten replications using models (7) and (8) are 1.14 and 1.51 respectively. That is, by efficiently taking the varying spatial effect into account, model (7) not only provides better estimation performance, but also better prediction accuracy, than the traditional Poisson regression model (8). Figure 7 plots the estimated HVL rates (per 100 thousand) from model (7) in the health zones for each of the three periods.
Figure 6.
Figures (a)-(c) are the residual plots for the three periods: 1994/1995 (a), 1995/1996 (b), and 1996/1997 (c). In each plot, triangles and crosses denote the residuals from models (7) and (8) respectively.
Figure 7.
The estimated HVL rates (per 100 thousand) in the zones of Belo Horizonte using model (7). The plots from left to right correspond to the periods 1994/1995, 1995/1996, and 1996/1997 respectively.
Finally, one may ask whether the coefficient functions in model (7) are all significantly different from zero, or whether model (7) can be simplified by deleting some of the coefficient functions. For each l = 0, 1, 2 and s = 1, 2, we apply the idea in Section 4 to test the hypothesis H0,ls : αls (·) ≡ 0. Based on 1000 bootstrap samples for each hypothesis, all coefficient functions are significantly different from 0 at level 0.05, with p-values < 0.0001, < 0.0001, 0.02, 0.03, 0.01, 0.03 for the hypotheses on α01, α02, α11, α12, α21, α22 respectively. We therefore conclude that model (7) is more appropriate for this data set than model (8), and the improvement is statistically significant. Furthermore, as a referee pointed out, it is also of interest to test the linearity of the unknown coefficient functions in model (7). For each l = 0, 1, 2 and s = 1, 2, consider the null hypothesis that αls (xs) = β0,ls + β1,ls xs for some coefficients β0,ls, β1,ls. The p-values are 0.016, < 0.0001, 0.032, 0.025, 0.169, 0.021 for the linearity hypotheses on α01, α02, α11, α12, α21, α22 respectively, which suggests that only the coefficient function α21 in model (7) is of linear form at significance level 0.05.
7 CONCLUSIONS
A polynomial spline estimation method, together with a generalized likelihood ratio testing procedure, has been proposed for the semiparametric generalized additive coefficient model. Theoretical results have been established under very broad assumptions on the data generating process. Based on our experience with both simulated and empirical examples, implementation of the proposed estimation method is as easy and fast as estimating a simple generalized linear model, and the estimators' performance and prediction power are both promising, as Theorem 1 stipulates. These two aspects of the estimators, together with similar desirable properties of the generalized likelihood ratio testing procedure, make them highly recommendable for statistical inference in multivariate regression settings. A third feature, as mentioned in the introduction, is that the procedures of this paper automatically adapt to the generalized additive models (Hastie & Tibshirani, 1990; Härdle et al., 2004), the generalized varying coefficient models (Cai et al., 2000), the generalized partially linear models (Green & Silverman, 1994; Härdle et al., 2000; Liang & Ren, 2005), the generalized partially linear single-index models (Carroll et al., 1997), and simple generalized linear models. Hence, all these models can be fitted to any given data set, and the most appropriate one can be selected via the generalized likelihood ratio testing procedure.
Acknowledgments
This research was partially supported by NIH/NIAID grants AI62247 and AI059773 and NSF grant DMS0806097. The paper is greatly improved due to constructive comments from the editor and two referees. The authors thank Drs Renato M. Assunção and Ilka A. Reis for providing the data from the Brazil HVL study, and Drs. David Birks, Jianhua Huang, and Lijian Yang for helpful discussions.
APPENDIX
A.1 Notation and Assumptions
To formalize the discussion, we introduce two function spaces: the model space M and the spline approximation space Mn. The model space M is a collection of functions on χ × R^d1 defined as
M = { m(x, t) = Σl=1..d1 αl(x) tl : αl(x) = αl0 + Σs=1..d2 αls(xs) },
where the αl0 are finite constants, and E[αls(Xs)] = 0, for 1 ≤ s ≤ d2. The predictor function η(x, t) in (1) is then modeled as an element of M. Let {(Yi, Xi, Ti), 1 ≤ i ≤ n} be a random sample of size n from the distribution of (Y, X, T). In what follows, denote by En the empirical expectation. For functions m1, m2 on χ × R^d1, define the theoretical inner product and the empirical inner product respectively as
⟨m1, m2⟩ = E[m1(X, T) m2(X, T)],   ⟨m1, m2⟩n = En[m1(X, T) m2(X, T)] = n^−1 Σi=1..n m1(Xi, Ti) m2(Xi, Ti).
The induced norms are denoted as
||m||^2 = ⟨m, m⟩,   ||m||n^2 = ⟨m, m⟩n.  (A.1)
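To make the empirical quantities concrete, the empirical inner product and norm are plain sample averages over the data. The following Python sketch (purely illustrative; the functions m1, m2 and the design distribution are hypothetical choices of ours, not from the paper) computes them by Monte Carlo:

```python
import math
import random

random.seed(0)

# Hypothetical functions on (x, t); stand-ins for elements of the model space.
def m1(x, t):
    return math.sin(2 * math.pi * x) * t

def m2(x, t):
    return (x - 0.5) * t

# A sample {(X_i, T_i)} from an illustrative design: X uniform, T standard normal.
sample = [(random.random(), random.gauss(0, 1)) for _ in range(10000)]

def empirical_inner(f, g, data):
    # <f, g>_n = E_n[f(X,T) g(X,T)] = (1/n) sum_i f(X_i,T_i) g(X_i,T_i)
    return sum(f(x, t) * g(x, t) for x, t in data) / len(data)

def empirical_norm(f, data):
    # ||f||_n = sqrt(<f, f>_n)
    return math.sqrt(empirical_inner(f, f, data))

print(empirical_inner(m1, m2, sample))
print(empirical_norm(m1, sample))
```

By the law of large numbers these empirical quantities converge to their theoretical counterparts ⟨m1, m2⟩ and ||m1||, which is the substance of Lemma A.2 below (uniformly over the spline space).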
We now define the polynomial spline approximation space Mn,
Mn = { mn(x, t) = Σl=1..d1 gl(x) tl : gl(x) = cl0 + Σs=1..d2 gls(xs), gls ∈ φs^0 },
in which φs^0 = { g ∈ φs : En[g(Xs)] = 0 }, the empirically centered polynomial spline space, with φs the space of polynomial splines of degree p on [0, 1] with knots ks,n given in assumption (C8). Lemma A.3 shows that the functions in M and Mn have essentially unique representations.
For any m ∈ M, let l(m(X, T), Y) = B[m(X, T)] Y − C[m(X, T)]. When they exist, define the log-likelihood and the expected log-likelihood function respectively as
ln(m) = En{ B[m(X, T)] Y − C[m(X, T)] },   Λ(m) = E{ B[m(X, T)] A[η(X, T)] − C[m(X, T)] }.
Note that the expected log-likelihood function Λ(·) need not be defined for all m ∈ M. We therefore restrict our attention to a subset of M, denoted M̃, which is the collection of functions in M taking values in a fixed compact interval S1. That is,
M̃ = { m ∈ M : m(x, t) ∈ S1, for all (x, t) ∈ χ1 × χ2 }.
Then Λ(·) is well-defined on M̃, under assumptions (C1)-(C5). The subinterval S1 is chosen large enough to contain the range of the true predictor function η, which is bounded on χ1 × χ2 under assumptions (C3) and (C6). Similarly, one defines M̃n, the corresponding subset of Mn.
To prove the theoretical results, we need the following assumptions. In what follows, we denote by c, C generic positive constants that may differ from occurrence to occurrence. For any function f on χ, denote ||f||∞ = supx∈χ |f(x)|, and denote by C^p([0, 1]) the space of p-times continuously differentiable functions on [0, 1].
(C1) The function B(·) is twice continuously differentiable and its first derivative B′(·) is strictly positive.

(C2) There is a subinterval S of R such that the distribution ρ is concentrated on S, and B′′(η)y − C′′(η) < 0, for all η and all y ∈ S.

(C3) The tuning variables X = (X1, . . . , Xd2)^T and the linear covariates T = (T1, . . . , Td1)^T are compactly supported and, without loss of generality, we assume that their supports are χ1 = [0, 1]^d2 and χ2 = [0, 1]^d1, respectively.

(C4) The joint density of X, denoted f(x), is absolutely continuous and bounded away from zero and infinity; that is, 0 < c ≤ f(x) ≤ C < ∞, for all x ∈ χ1.

(C5) Let λ0(x) ≤ . . . ≤ λd1(x) be the eigenvalues of E(TT^T | X = x). We assume that λ0(x) and λd1(x) are uniformly bounded away from 0 and infinity, for all x ∈ χ1.

(C6) The coefficient functions αls ∈ C^(p+1)([0, 1]), for l = 1, . . . , d1, s = 1, . . . , d2.

(C7) There is a constant C > 0 such that supx∈χ1, t∈χ2 Var(Y | X = x, T = t) ≤ C.

(C8) The d2 sets of knots, denoted ks,n = {0 = xs,0 ≤ xs,1 ≤ · · · ≤ xs,Nn ≤ xs,Nn+1 = 1}, s = 1, . . . , d2, are quasi-uniform; that is, there exists c > 0 such that
max s,j (xs,j+1 − xs,j) ≤ c · min s,j (xs,j+1 − xs,j).
Furthermore, the number of interior knots Nn ≍ n^(1/(2p+3)), where p denotes the degree of the spline space and '≍' means that both sides have the same order.
Assumption (C1) implies that the function C(·) is twice continuously differentiable, A(·) is continuously differentiable, and A′(·) is strictly positive. Furthermore, for each η, the function ξ ↦ B(ξ)A(η) − C(ξ) has a unique maximum at ξ = η. Thus the function that maximizes Λ(·) is given by the true predictor function η. Let h = maxs=1,...,d2; j=0,...,Nn |xs,j+1 − xs,j| be the mesh size. Then (C8) implies that h ≍ Nn^−1 ≍ n^(−1/(2p+3)).
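The knot scheme in (C8) is easy to realize in practice. The sketch below (illustrative only: the proportionality constant in the knot-number rule and the use of equally spaced knots are our own choices, which trivially satisfy quasi-uniformity) computes a knot sequence and its mesh size h:

```python
import math

def interior_knot_count(n, p):
    # Knot number N_n of order n^(1/(2p+3)), as in assumption (C8);
    # the multiplicative constant (here 1) is an illustrative choice.
    return max(1, int(round(n ** (1.0 / (2 * p + 3)))))

def quasi_uniform_knots(N):
    # Equally spaced knots on [0, 1]: the simplest quasi-uniform choice,
    # with max/min adjacent-spacing ratio equal to 1.
    return [j / (N + 1) for j in range(N + 2)]  # includes boundary knots 0 and 1

n, p = 500, 3                  # cubic splines (degree p = 3)
N = interior_knot_count(n, p)
knots = quasi_uniform_knots(N)
gaps = [b - a for a, b in zip(knots, knots[1:])]
h = max(gaps)                  # mesh size h = max_j (x_{j+1} - x_j)
ratio = max(gaps) / min(gaps)  # quasi-uniformity constant c for this sequence
print(N, h, ratio)
```

With n = 500 and p = 3 this gives only N = 2 interior knots: the rate n^(1/(2p+3)) grows very slowly, which is what keeps the approximation space low-dimensional relative to n.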
The assumptions (C1)-(C8) are common in the polynomial spline estimation literature. Assumptions (C1) and (C2) are the same as Assumptions 1 and 2 of Huang (1998) and the conditions on page 591 of Stone (1986); they are satisfied by many familiar exponential families, including the Normal, Binomial, Poisson, and Gamma distributions. Assumptions (C3)-(C5) and (C7) are similar to Assumptions 1-4 in Huang et al. (2002). Assumptions (C6) and (C8) are also used in Xue & Yang (2006b).
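As a sanity check on (C1) and (C2), consider the Bernoulli family with the canonical logit link, written in the B/C form used above. This is our own worked example (not from the paper): B(η) = η and C(η) = log(1 + e^η), so B′ = 1 > 0, B′′ = 0, and C′′(η) = p(1 − p) with p the logistic mean, whence B′′(η)y − C′′(η) = −p(1 − p) < 0 for every η and y:

```python
import math

# Bernoulli log-likelihood in the B/C form: l(eta, y) = B(eta)*y - C(eta),
# with B(eta) = eta and C(eta) = log(1 + exp(eta)).
B = lambda eta: eta
C = lambda eta: math.log1p(math.exp(eta))

def Bpp_y_minus_Cpp(eta, y):
    # B''(eta) = 0 and C''(eta) = p*(1-p) with p = logistic(eta), so
    # B''(eta)*y - C''(eta) = -p*(1-p) < 0: assumption (C2) holds.
    p = 1.0 / (1.0 + math.exp(-eta))
    return 0.0 * y - p * (1.0 - p)

# B'(eta) = 1 > 0, so assumption (C1) holds as well.
for eta in (-3.0, 0.0, 3.0):
    for y in (0, 1):
        assert Bpp_y_minus_Cpp(eta, y) < 0
print("(C1)-(C2) hold numerically for the Bernoulli family")
```

Analogous elementary calculations verify (C1)-(C2) for the Normal, Poisson, and Gamma families mentioned above.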
A.2 Technical lemmas
The first three lemmas present properties of the spaces M and Mn, which were proved in Xue & Yang (2006b) under a more general set-up.
Lemma A.1
Under assumptions (C3)-(C5), there exists a constant c > 0 such that
||m||^2 ≥ c Σl=1..d1 ( αl0^2 + Σs=1..d2 ||αls||^2 ),
for all m(x, t) = Σl=1..d1 [αl0 + Σs=1..d2 αls(xs)] tl ∈ M.
Lemma A.2
Under assumptions (C3)-(C8), one has
sup{mn ∈ Mn, mn ≠ 0} | ||mn||n^2 / ||mn||^2 − 1 | = op(1).
In particular, there exist constants 0 < c < 1 < C such that, except on an event whose probability tends to zero as n → ∞, c||mn|| ≤ ||mn||n ≤ C||mn||, for all mn ∈ Mn.
Lemma A.3
Under assumptions (C3)-(C8), the model space M is theoretically identifiable; i.e., for any m ∈ M, ||m|| = 0 implies m = 0 a.s. The approximation space Mn is empirically identifiable; i.e., for any mn ∈ Mn, ||mn||n = 0 implies mn = 0 a.s.
Lemma A.4
Under assumptions (C1)-(C6), Λ(·) is strictly concave over M̃. That is, for any m0, m1 ∈ M̃ that are essentially different (different on a set of positive probability relative to the joint distribution of (X, T)), and any 0 < t < 1,
Λ(m0 + t(m1 − m0)) > (1 − t)Λ(m0) + tΛ(m1).
Proof: For any m ∈ M̃, one has
Λ(m) = E{ B[m(X, T)] A[η(X, T)] − C[m(X, T)] }.
Note that A[η(X, T)] ∈ S almost surely, since ρ is concentrated on S (assumption (C2)). Therefore, assumption (C2) gives B′′[m(X, T)] A[η(X, T)] − C′′[m(X, T)] < 0 almost surely, for any m ∈ M̃. Let λ[m, X, T] = B[m(X, T)] A[η(X, T)] − C[m(X, T)] and m(t) = m0 + t(m1 − m0). As a result, almost surely,
(d^2/dt^2) λ[m(t), X, T] = { B′′[m(t)(X, T)] A[η(X, T)] − C′′[m(t)(X, T)] } { m1(X, T) − m0(X, T) }^2 < 0.
The lemma follows from the above inequality by taking expectations.
Lemma A.5
Under assumptions (C1)-(C6), there exist positive numbers c1 and c2 such that, for all m ∈ M̃,
c1 ||m − η||^2 ≤ Λ(η) − Λ(m) ≤ c2 ||m − η||^2.
Proof: For any m ∈ M̃, let m(t) = (1 − t)η + tm. Then
(d/dt) Λ(m(t)) |t=0 = 0,
since η is the true predictor function and maximizes Λ(·). Hence integration by parts (Taylor expansion with integral remainder) gives
Λ(m) − Λ(η) = ∫0^1 (1 − t) (d^2/dt^2) Λ(m(t)) dt,
where
(d^2/dt^2) Λ(m(t)) = E { B′′[m(t)(X, T)] A[η(X, T)] − C′′[m(t)(X, T)] } { m(X, T) − η(X, T) }^2.
Assumptions (C1) and (C2) entail that there exist constants 0 < c < C such that
−C ≤ B′′(ξ) A(η) − C′′(ξ) ≤ −c, for all ξ, η ∈ S1.
The proof of the Lemma is completed by taking c1 = c/2 and c2 = C/2. The proof is similar to those of Lemma 4.1 of Huang (1998) and Lemma 6 of Stone (1986).
A.3 Proof of Theorem 1
Note that η̂ maximizes the log-likelihood ln(·) over M̃n, while η* maximizes the expected log-likelihood Λ(·) over M̃n. Write η̂ − η = (η̂ − η*) + (η* − η). In this error decomposition, η̂ − η* and η* − η can be understood as the estimation error and the approximation error, respectively. The first part of Theorem 1 is then obtained by showing that ||η* − η|| = Op(h^(p+1)) and ||η̂ − η*|| = Op((In/n)^(1/2)), with In the dimension of Mn defined in Step 2; these are proved in Step 1 and Step 2 respectively. The rest of Theorem 1 follows immediately from Lemma A.1. The same idea of error decomposition is used in Huang (1998), where a generalized functional ANOVA model is assumed instead.
Step 1 (the approximation error):
Lemma A.5 entails that there exist constants c1, c2 > 0 such that, for all mn ∈ M̃n,
c1 ||mn − η||^2 ≤ Λ(η) − Λ(mn) ≤ c2 ||mn − η||^2.  (A.2)
Thus, for any constant c > 0 and any mn ∈ M̃n with ||mn − η|| = c h^(p+1) (if such mn exists), (A.2) entails that
Λ(η) − Λ(mn) ≥ c1 c^2 h^(2(p+1)).  (A.3)
On the other hand, the approximation theorem (de Boor, 2001) ensures that there exist spline functions gls ∈ φs^0 and a constant C > 0 that does not depend on n, such that ||gls − αls||∞ ≤ C h^(p+1), for 1 ≤ l ≤ d1, 1 ≤ s ≤ d2. Let g(x, t) = Σl=1..d1 [αl0 + Σs=1..d2 gls(xs)] tl ∈ Mn. By assumption (C3), one has
||g − η|| ≤ ||g − η||∞ ≤ c3 h^(p+1),
in which c3 = d1d2C. Thus (A.2) also gives
Λ(η) − Λ(g) ≤ c2 c3^2 h^(2(p+1)).  (A.4)
By choosing c such that c1 c^2 > c2 c3^2, (A.3) and (A.4) entail that, when n is sufficiently large,
Λ(mn) < Λ(g),  (A.5)
for all mn ∈ M̃n with ||mn − η|| = c h^(p+1). Note that ||g − η|| < c h^(p+1) for such a choice of c. Let B(c) = { mn ∈ M̃n : ||mn − η|| ≤ c h^(p+1) }, which is a closed bounded convex set in M̃n. Assumption (C1) and Lemma A.5 entail that Λ(·) is a continuous concave functional on B(c). Therefore, Theorem 2 in Pietrzykowski (1972) ensures that Λ(·) has a maximum on B(c). On the other hand, (A.5) ensures that the maximum must be in the interior of B(c). Together with the concavity of Λ(·) and the definition of η*, this shows that η* exists and satisfies ||η* − η|| < c h^(p+1), for n sufficiently large. Hence ||η* − η|| = Op(h^(p+1)).
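The h^(p+1) approximation rate from de Boor's theorem is easy to observe numerically. The following self-contained check (our own illustration, not from the paper) uses linear splines, i.e. degree p = 1, for which the sup-norm error is O(h^2): halving h should cut the error roughly four-fold. The target function is an arbitrary smooth choice.

```python
import math

def sup_error_linear_spline(f, N, grid=1000):
    # Piecewise-linear interpolant of f on N+2 equally spaced knots in [0, 1];
    # returns an approximation of sup_x |f(x) - spline(x)| over a fine grid.
    knots = [j / (N + 1) for j in range(N + 2)]
    vals = [f(x) for x in knots]
    h = 1.0 / (N + 1)
    err = 0.0
    for i in range(grid + 1):
        x = i / grid
        j = min(int(x / h), N)          # index of the knot interval containing x
        w = (x - knots[j]) / h          # linear interpolation weight in [0, 1]
        s = (1 - w) * vals[j] + w * vals[j + 1]
        err = max(err, abs(f(x) - s))
    return err

f = lambda x: math.sin(2 * math.pi * x)   # an illustrative smooth coefficient function
e1 = sup_error_linear_spline(f, 9)        # mesh size h = 1/10
e2 = sup_error_linear_spline(f, 19)       # mesh size h = 1/20
print(e1, e2, e1 / e2)                    # ratio near 4, consistent with O(h^2)
```

For degree-p splines the analogous experiment exhibits the h^(p+1) decay used in (A.3)-(A.5) above.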
Step 2 (the estimation error):
Let {Bj, 1 ≤ j ≤ In} be an orthonormal basis of Mn with respect to the theoretical inner product, where In = (1 + d2Jn)d1. Then one can write η* = Σj=1..In βj* Bj and η̂ = Σj=1..In β̂j Bj, for some coefficient vectors β* = (β1*, . . . , βIn*)^T and β̂ = (β̂1, . . . , β̂In)^T. For any β = (β1, . . . , βIn)^T, write ln(Σj=1..In βj Bj) as ln(β). Let S(β) = (∂/∂β) ln(β) be the score at β, which is an In-dimensional vector having entries Sj(β) = (∂/∂βj) ln(β), and let D(β) = (∂^2/∂β∂β^T) ln(β) be the In × In Hessian matrix, which has entries Djk(β) = (∂^2/∂βj∂βk) ln(β).
Then, for any β with ||β − β*|| = a (In/n)^(1/2), for some constant a > 0 to be chosen later, a Taylor expansion gives
ln(β) − ln(β*) = (β − β*)^T S(β*) + (1/2)(β − β*)^T D(β̃)(β − β*),
for some β̃ between β and β*. For any fixed ε > 0, Lemma A.7 implies that one can choose a sufficiently large such that P{ ||S(β*)|| > (c5 a/4)(In/n)^(1/2) } < ε, where c5 is the constant in Lemma A.8. Let Ωn = { ||S(β*)|| ≤ (c5 a/4)(In/n)^(1/2) }. Then on the event Ωn,
|(β − β*)^T S(β*)| ≤ ||β − β*|| ||S(β*)|| ≤ (c5 a^2/4)(In/n).  (A.6)
Moreover, for such a, Lemma A.8 implies that
(1/2)(β − β*)^T D(β̃)(β − β*) ≤ −(c5 a^2/2)(In/n),  (A.7)
except on an event whose probability tends to zero as n → ∞. Thus (A.6) and (A.7) entail that, except on an event whose probability tends to zero as n → ∞, ln(β) < ln(β*) for all β with ||β − β*|| = a (In/n)^(1/2). Hence, by the concavity of ln(β) (Lemma A.6) and similar arguments as in Step 1, β̂ exists and satisfies ||β̂ − β*|| ≤ a (In/n)^(1/2), except on an event whose probability tends to zero. Since ε is arbitrary, η̂ exists except on an event whose probability tends to zero as n → ∞, and satisfies ||η̂ − η*|| = ||β̂ − β*|| = Op((In/n)^(1/2)).
In the following, we present the lemmas used in the proof above. They are stated here because they require the notation introduced in the proof of Theorem 1.
Lemma A.6
Under assumptions (C1)-(C8), there exists a constant c4 > 0 such that, except on an event whose probability tends to zero as n → ∞,
(d^2/dt^2) ln(m1 + t(m2 − m1)) ≤ −c4 ||m2 − m1||^2,
for 0 < t < 1 and all m1, m2 ∈ M̃n. Thus the log-likelihood ln(·) is strictly concave on M̃n, except on an event whose probability tends to zero as n → ∞.
Proof: Let mt = m1 + t(m2 − m1), for 0 < t < 1. It follows from assumptions (C1) and (C2) that
(d^2/dt^2) ln(mt) = En{ B′′[mt(X, T)] Y − C′′[mt(X, T)] } { m2(X, T) − m1(X, T) }^2.
Note that there is a constant δ > 0 such that B′′(ξ) y − C′′(ξ) ≤ −δ, for all ξ ∈ S1 and y ∈ S. Thus the right-hand side of the above equality is bounded above by
−δ ||m2 − m1||n^2 ≤ −cδ ||m2 − m1||^2,
where the last inequality holds except on an event whose probability tends to zero, by Lemma A.2. The result follows by letting c4 = cδ.
Lemma A.7
Under assumptions (C1)-(C8), for any constant c > 0,
P{ ||S(β*)|| > c (In/n)^(1/2) } ≤ C/c^2,
for some constant C > 0 that does not depend on c or n. In particular, ||S(β*)|| = Op((In/n)^(1/2)).
Proof: Note that β* maximizes the expected log-likelihood Λ(β) = Λ(Σj=1..In βj Bj). Thus (∂/∂β) Λ(β) |β=β* = 0, which implies E[S(β*)] = 0. Thus
E ||S(β*)||^2 = Σj=1..In Var[Sj(β*)],
where
Var[Sj(β*)] ≤ C/n, for 1 ≤ j ≤ In,
by assumption (C7) and the orthonormality of {Bj}. Thus, by the Chebyshev inequality,
P{ ||S(β*)|| > c (In/n)^(1/2) } ≤ E ||S(β*)||^2 / { c^2 (In/n) } ≤ C/c^2,
and we complete the proof of the Lemma.
Lemma A.8
Under assumptions (C1)-(C8), there exists a constant c5 > 0 such that, for any fixed positive constant a, with probability approaching 1 as n → ∞,
(β − β*)^T D(β̃)(β − β*) ≤ −c5 ||β − β*||^2,
for all β with ||β − β*|| ≤ a (In/n)^(1/2), and all β̃ on the line segment between β* and β.
Proof: For any β with ||β − β*|| ≤ a (In/n)^(1/2), write mβ = Σj=1..In βj Bj ∈ Mn. Lemma A.6 entails that there exists a constant c4 > 0 such that, for 0 < t < 1,
(β − β*)^T D(β* + t(β − β*))(β − β*) = (d^2/dt^2) ln(mβ* + t(mβ − mβ*)) ≤ −c4 ||mβ − mβ*||^2,
except on an event whose probability tends to zero as n → ∞. Also note that, by the orthonormality of {Bj} with respect to the theoretical inner product,
||mβ − mβ*|| = ||β − β*||.
Thus the Lemma follows with c5 = c4.
A.4 Proof of Theorem 2
We prove the result only for the null hypothesis H0 : αls = 0, for notational convenience; for polynomial null hypotheses of higher order, the proof follows similarly. Define the approximation space under H0 as Mn0, the subspace of Mn obtained by leaving out the spline approximation term for αls. Note that Mn0 ⊂ Mn. Denote by η̂0 the restricted maximum likelihood estimator of η under H0, and recall that η̂ is the unrestricted MLE. Write η̂ = Σj=1..In β̂j Bj and η̂0 = Σj=1..In β̂j0 Bj, for some coefficient vectors β̂ = (β̂1, . . . , β̂In)^T and β̂0 = (β̂10, . . . , β̂In0)^T. Then a Taylor expansion gives
ln(η̂0) − ln(η̂) = (β̂0 − β̂)^T S(β̂) + (1/2)(β̂0 − β̂)^T D(β̃)(β̂0 − β̂),
in which S(β̂) = 0 by the definition of the MLE, and β̃ = β̂ + t(β̂0 − β̂) for some 0 < t < 1, with corresponding predictor ηt = η̂ + t(η̂0 − η̂). By (C1) and (C2), there exist constants 0 < c < C such that −C ≤ B′′(ξ) y − C′′(ξ) ≤ −c, for all ξ ∈ S1 and y ∈ S. Also note that, by the orthonormality of {Bj}, ||β̂0 − β̂|| = ||η̂0 − η̂||. Therefore Tn = 2[ln(η̂) − ln(η̂0)] satisfies c ||η̂0 − η̂||n^2 ≤ Tn ≤ C ||η̂0 − η̂||n^2; that is, Tn ≍ ||η̂0 − η̂||n^2. Lemma A.2 further gives that Tn ≍ ||η̂0 − η̂||^2. Then the result of Theorem 2 follows by noting the following. Under H0, ||η̂0 − η̂|| ≤ ||η̂0 − η|| + ||η̂ − η|| = op(1), while under H1, ||η̂0 − η̂|| ≥ ||η̂0 − η|| − ||η̂ − η||, in which, by Lemma A.1, ||η̂0 − η||^2 ≥ c ||αls||^2 > 0, and ||η̂ − η|| = op(1).
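In the simplest parametric special case, Tn = 2[ln(η̂) − ln(η̂0)] reduces to the classical likelihood ratio statistic. The sketch below (illustrative only: the logistic model, sample size, and data-generating parameters are our own choices, not from the paper) computes Tn for testing a zero slope in a logistic regression, where the restricted MLE has the closed form logit(ȳ):

```python
import math, random

random.seed(1)

def loglik(params, data):
    a, b = params
    # Bernoulli log-likelihood in B/C form: sum_i [eta_i*y_i - log(1 + e^{eta_i})]
    return sum((a + b * x) * y - math.log1p(math.exp(a + b * x)) for x, y in data)

def fit_logistic(data, steps=25):
    a = b = 0.0
    for _ in range(steps):  # Newton-Raphson on (a, b)
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = 1 / (1 + math.exp(-(a + b * x)))
            w = p * (1 - p)                   # Fisher weight
            g0 += y - p;  g1 += (y - p) * x   # score vector
            h00 += w;     h01 += w * x;  h11 += w * x * x   # information matrix
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Simulated data with a genuinely nonzero slope (so H0 is false here).
data = []
for _ in range(400):
    x = random.random()
    p = 1 / (1 + math.exp(-(-1 + 2 * x)))
    data.append((x, 1 if random.random() < p else 0))

a_hat, b_hat = fit_logistic(data)                 # unrestricted MLE
ybar = sum(y for _, y in data) / len(data)
a0 = math.log(ybar / (1 - ybar))                  # restricted MLE under H0: slope = 0
Tn = 2 * (loglik((a_hat, b_hat), data) - loglik((a0, 0.0), data))
print(Tn)   # large values reject H0
```

In the semiparametric setting of Theorem 2 the null distribution of Tn is nonstandard, which is why the conditional bootstrap of Section 3 is used to calibrate the test rather than a chi-square reference.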
Contributor Information
Lan Xue, Department of Statistics, Oregon State University.
Hua Liang, Department of Biostatistics, University of Rochester.
REFERENCES
- Assunção R. Space varying coefficient models for small area data. Environmetrics. 2003;14:453–473.
- Assunção R, Reis I, Oliveira C. Diffusion and prediction of Leishmaniasis in a large metropolitan area in Brazil with Bayesian space-time model. Statist. Med. 2001;20:2319–2335.
- Cai Z, Fan J, Li R. Efficient estimation and inferences for varying-coefficient models. J. Am. Statist. Assoc. 2000;95:888–902.
- Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. J. Am. Statist. Assoc. 1997;92:477–489.
- de Boor C. A Practical Guide to Splines. Springer; New York: 2001.
- Fan J, Huang T. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli. 2005;11:1031–1057.
- Fan J, Zhang C, Zhang J. Generalized likelihood ratio statistics and Wilks phenomenon. Ann. Statist. 2001;29:153–193.
- Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall; London: 1994.
- Härdle W, Liang H, Gao J. Partially Linear Models. Physica-Verlag; Heidelberg: 2000.
- Härdle W, Müller M, Sperlich S, Werwatz A. Nonparametric and Semiparametric Models. Springer-Verlag; Heidelberg: 2004.
- Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman & Hall; London: 1990.
- Hastie T, Tibshirani RJ. Varying-coefficient models. J. R. Statist. Soc. B. 1993;55:757–796.
- Huang JZ. Functional ANOVA models for generalized regression. J. Mult. Anal. 1998;67:49–71.
- Huang JZ, Liu L. Polynomial spline estimation and inference of proportional hazards regression models with flexible relative risk form. Biometrics. 2006;62:793–802.
- Huang JZ, Kooperberg C, Stone CJ, Truong YK. Functional ANOVA modeling for proportional hazards regression. Ann. Statist. 2000;28:961–999.
- Huang JZ, Wu CO, Zhou L. Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika. 2002;89:111–128.
- Liang H, Ren H. Generalized partially linear measurement error models. J. Comp. Graph. Statist. 2005;14:237–250.
- McCullagh P, Nelder JA. Generalized Linear Models. Chapman & Hall; London: 1989.
- Pietrzykowski T. A generalization of the potential method for conditional maxima on Banach, reflexive spaces. Numer. Math. 1972;18:367–372.
- Stone CJ. The dimensionality reduction principle for generalized additive models. Ann. Statist. 1986;14:590–606.
- Stone CJ. The use of polynomial splines and their tensor products in multivariate function estimation. Ann. Statist. 1994;22:118–184.
- Xue L, Yang L. Estimation of semiparametric additive coefficient model. J. Statist. Plan. Infer. 2006a;136:2506–2534.
- Xue L, Yang L. Additive coefficient modeling via polynomial spline. Statist. Sinica. 2006b;16:1423–1446.