Abstract
In the low-dimensional case, the generalized additive coefficient model (GACM) proposed by Xue and Yang [Statist. Sinica 16 (2006) 1423–1446] has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. In this paper, we propose estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we propose a groupwise penalization based procedure to distinguish significant covariates for the “large p small n” setting. The procedure is shown to be consistent for model structure identification. Further, we construct simultaneous confidence bands for the coefficient functions in the selected model based on a refined two-step spline estimator. We also discuss how to choose the tuning parameters. To estimate the standard deviation of the functional estimator, we adopt the smoothed bootstrap method. We conduct simulation experiments to evaluate the numerical performance of the proposed methods and analyze an obesity data set from a genome-wide association study as an illustration.
Key words and phrases: Adaptive group lasso, bootstrap smoothing, curse of dimensionality, gene-environment interaction, generalized additive partially linear models, inference for high-dimensional data, oracle property, penalized likelihood, polynomial splines, two-step estimation, undersmoothing
1. Introduction
Regression analysis is a commonly used statistical tool for modeling the relationship between a scalar dependent variable Y and one or more explanatory variables denoted as T = (T1, T2, …, Tp)T. To study the marginal effects of the predictors on the response, one may fit a generalized linear model (GLM),
$$E(Y \mid T) = \mu(T) = g^{-1}\Bigl(\sum_{\ell=1}^{p} \alpha_{\ell 0} T_{\ell}\Bigr), \tag{1}$$
where g is a known monotone link function, and αℓ0, 1 ≤ ℓ ≤ p, are unknown parameters. Sometimes, the effect of one variable may change with other variables; that is, there is an interaction effect. By letting T1 = 1, to incorporate the interaction effects of T and the other variables, denoted as X = (X1,…, Xd)T, model (1) can be modified to E(Y|X, T) = μ(X, T) = g−1{η(X, T)} with
$$\eta(X, T) = \sum_{\ell=1}^{p} \alpha_{\ell 0} T_{\ell} + \sum_{\ell=1}^{p} \sum_{k=1}^{d} \alpha_{\ell k} X_{k} T_{\ell}, \tag{2}$$
where αℓk for 0 ≤ k ≤ d and 1 ≤ ℓ ≤ p are parameters. After a direct reformulation, model (2) can be written as
$$\eta(X, T) = \sum_{\ell=1}^{p} \Bigl(\alpha_{\ell 0} + \sum_{k=1}^{d} \alpha_{\ell k} X_{k}\Bigr) T_{\ell}. \tag{3}$$
Here the effect of each Tℓ changes linearly with Xk. However, in practice, this simple linear relationship may not reflect the true changing patterns of the coefficient with other covariates. We here use an example of gene and environment (G × E) interactions for illustration. It has been noticed in the literature that obesity is linked to genetic factors. Their effects, however, can be altered under different environmental factors such as sleeping hours [Knutson (2012)] and physical activity [Wareham, van Sluijs and Ekelund (2005)]. To have a rough idea of how the effects of the genetic factors change with the environment, we explore data from the Framingham Heart Study [Dawber, Meadors and Moore (1951)]. In Figure 1 we plot the estimated mean body mass index (BMI) against sleeping hours per day and activity hours per day, respectively, for people with three possible genotype categories represented by AA, Aa and aa, and for one single nucleotide polymorphism (SNP). A detailed description and the analysis of this data set are given in Section 5. We define allele A as the minor (less frequent) allele. This figure clearly shows different nonlinear curves for the three groups in each of the two plots. By letting Tℓ be the indicator for the group ℓ, the linear function in model (3) is clearly misspecified.
Fig. 1. Plots of the estimated BMI against sleeping hours per day (left panel) and activity hours per day (right panel) for the three genotypes AA (solid line), Aa (dashed line) and aa (dotted line) of SNP rs242263 in the Framingham study, where A is the minor allele.
To relax the linearity assumption, we allow each αℓk Xk term to be an unknown nonlinear function of Xk, and thus extend model (3) to the generalized additive coefficient model (GACM)
$$\eta(X, T) = \sum_{\ell=1}^{p} \alpha_{\ell}(X) T_{\ell}, \qquad \alpha_{\ell}(X) = \alpha_{\ell 0} + \sum_{k=1}^{d} \alpha_{\ell k}(X_{k}). \tag{4}$$
For identifiability, the functional components satisfy E{αℓk(Xk)} = 0 for 1 ≤ k ≤ d and 1 ≤ ℓ ≤ p. The conditional variance of Y is modeled as a function of the mean, that is, var(Y|X, T) = V{μ(X, T)} = σ2(X, T). In each coefficient function of the GACM, covariates Xk are continuous variables. If some of them are discrete, they will enter linearly. For example, if Xk is binary, we let αℓk(Xk) = αℓk Xk. In such a case, model (4) turns out to be a partially linear additive coefficient model. The linearity of (4) in Tℓ is particularly appropriate when those factors are discrete, for example, SNPs in a genome-wide association study (GWAS), as in the data example of Section 5.
For the low-dimensional case in which the dimensions of X and T are fixed, estimation of model (4) has been studied; see Liu and Yang (2010), Xue and Liang (2010) and Xue and Yang (2006) for spline estimation procedures and Lee, Mammen and Park (2012) for a backfitting algorithm. In modern data applications, however, model (4) is particularly useful when p is large. For example, in GWAS the number of SNPs, which is p, can be very large, but the dimension of X, such as the environmental factors, which is d, is inevitably relatively small. Moreover, the number of variables in T which have nonzero effects is small. Applying model (4) in the high-dimensional case therefore poses new challenges, including: (i) how to identify the important variables in T, (ii) how to estimate the coefficient functions for the important covariates and (iii) how to conduct inference for the nonzero coefficient functions. For example, it is of interest to know whether they have a specific parametric form such as constant, linear or quadratic.
In the high-dimensional data setting, the study of nonlinear interaction effects has attracted much attention in recent years, and a few strategies have been proposed. For example, Jiang and Liu (2014) proposed to detect variables under the general index model, which enables the study of high-order interactions among components of continuous predictors that are assumed to have a multivariate normal distribution. Moreover, Lian (2012) considered variable selection in varying coefficient models, which allow the coefficient functions to depend on one index variable, such as a time-dependent variable.
When we would like to see how the effect of each genetic factor changes under the influence of multiple environmental variables, the proposed high-dimensional GACM (4) becomes a natural approach to consider, since both the index model [Jiang and Liu (2014)] and the varying coefficient model [Lian (2012)] cannot address this question; the former is used to study interactions of components in a set of continuous predictors, and the latter only allows one index variable. For model selection and estimation, we apply a groupwise penalization method. Moreover, most existing high-dimensional nonparametric modeling papers [Lian (2012), Meier, van de Geer and Bühlmann (2009), Ravikumar et al. (2009), Wang et al. (2014), Huang, Horowitz and Wei (2010)] focus on variable selection and estimation. In this paper, after variable selection, we also propose a simultaneous inferential tool to further test the shape of the coefficient function for each selected variable, which has not been studied in the previous works.
To this end, we aim to address questions (i)–(iii). Specifically, for estimation and model selection, we apply a groupwise regularization method based on a penalized quasi-likelihood criterion. The penalty is imposed on the L2 norm of the spline coefficient vector of the estimator of each αℓ(·). We establish the asymptotic consistency of model selection and estimation for the proposed group penalized estimators with the quasi-likelihood criterion in the high-dimensional GACM (4), allowing p to grow with n at an almost exponential rate. Importantly, establishing these results is technically more difficult than in work based on least squares, since the estimators from the penalized quasi-likelihood method have no closed form.
After selecting the important variables, the next question of interest is what shapes the nonzero coefficient functions may have. We therefore need an inferential tool to check whether a coefficient function has a specific parametric form; for example, when it is constant or linear, the corresponding covariate has no interaction effect or a linear interaction effect with the other covariate, respectively. For global inference, we construct simultaneous confidence bands (SCBs) for the nonparametric additive functions based on a two-step estimation procedure. Using the selected variables, we first propose a refined two-step spline estimator for the function of interest, which is proved to have a pointwise asymptotic normal distribution and oracle efficiency. We then establish the bounds for the SCBs based on the absolute maxima distribution of a Gaussian process and on the strong approximation lemma [Csörgő and Révész (1981)]. Other related works on SCBs for nonparametric functions include Claeskens and Van Keilegom (2003), Hall and Titterington (1988) and Härdle and Marron (1991), among others. We provide an asymptotic formula for the standard deviation of the spline estimator of the coefficient function, which involves unknown population parameters to be estimated. The formula has a somewhat complex expression and contains many parameters, so direct estimation may not be accurate, particularly with small or moderate sample sizes. As an alternative, the bootstrap provides a reliable way to calculate the standard deviation while avoiding estimation of those population parameters. We here apply the smoothed bootstrap method suggested by Efron (2014), who advocated that the method can improve coverage probability, to calculate the pointwise estimated standard deviations of the estimators of the coefficient functions. The method was originally proposed for calculating the estimated standard deviation of the estimate of a parameter of interest, such as a conditional mean; we extend it to the case of functional estimation. We demonstrate by simulation studies in Section 4 that, compared to the traditional resampling bootstrap, the smoothed bootstrap method successfully improves the empirical coverage rate.
The paper is organized as follows. Section 2 introduces the B-spline estimation procedure for the nonparametric functions, describes the adaptive group Lasso estimators and the initial Lasso estimators and presents asymptotic results. Section 3 describes the two-step spline estimators and introduces the simultaneous confidence bands and the bootstrap methods for calculating the estimated standard deviation. Section 4 describes simulation studies, and Section 5 illustrates the method through the analysis of an obesity data set from a genome-wide association study. Proofs are in the Appendix and additional supplementary material [Ma et al. (2015)].
2. Penalization based variable selection
Let (Yi, XiT, TiT)T, i = 1,…, n, be random vectors that are independently and identically distributed as (Y, XT, TT)T, where Xi = (Xi1, …, Xid)T and Ti = (Ti1, …, Tip)T. Write Q(m, y) for the quasi-likelihood function, which satisfies ∂Q(m, y)/∂m = (y − m)/V(m). Estimation of the mean function can be achieved by minimizing the negative quasi-likelihood of the observed data
$$\hat L(\alpha) = -n^{-1} \sum_{i=1}^{n} Q\Bigl[g^{-1}\Bigl\{\sum_{\ell=1}^{p} \alpha_{\ell}(X_i) T_{i\ell}\Bigr\}, Y_i\Bigr]. \tag{5}$$
2.1. Spline approximation
We approximate the smooth functions αℓk(·), 1 ≤ k ≤ d and 1 ≤ ℓ ≤ p, in (4) by B-splines. As in most work on nonparametric smoothing, estimation of the functions αℓk(·) is conducted on compact sets. Without loss of generality, let the compact set be χ = [0, 1], and consider the space of polynomial splines of order q ≥ 2. We introduce a sequence of spline knots

$$t_{-q+1} = \cdots = t_{-1} = t_0 = 0 < t_1 < \cdots < t_N < 1 = t_{N+1} = \cdots = t_{N+q},$$
where N ≡ Nn is the number of interior knots. In the following, let Jn = Nn + q. For 0 ≤ j ≤ N, let Hj = tj+1 − tj be the distance between neighboring knots and let H = max0≤j≤N Hj. Following Zhou, Shen and Wolfe (1998), to study asymptotic properties of the spline estimators for αℓk(·), we assume that max0≤j≤N−1 |Hj+1 − Hj| = o(N−1) and H/min0≤j≤N Hj ≤ M, where M > 0 is a predetermined constant. Such an assumption is necessary for numerical implementation. In practice, we can use the sample quantiles as the locations of the knots. Let {bj,k(xk) : 1 ≤ j ≤ Jn} be the qth order B-spline basis functions given on page 87 of de Boor (2001). For positive numbers an and bn, an ≍ bn means that limn→∞ an/bn = c, where c is some nonzero finite constant. For 1 ≤ j ≤ Jn, we adopt the centered B-spline functions given in Xue and Yang (2006) such that Bj,k(xk) = √N[bj,k(xk) − {E(bj,k)/E(b1,k)}b1,k(xk)], so that E{Bj,k(Xk)} = 0 and var{Bj,k(Xk)} ≍ 1. Define the space Gn of additive spline functions as the linear space spanned by B(x) = {1, Bj,k(xk), 1 ≤ j ≤ Jn, 1 ≤ k ≤ d}T, where x = (x1,…, xd)T. According to the result on page 149 of de Boor (2001), for αℓk(·) satisfying condition (C3) in Appendix A.2, that is, α(r−1)ℓk ∈ C0,1[0, 1] for a given integer r ≥ 1, where C0,1[0, 1] is the space of Lipschitz continuous functions on [0, 1] defined in Appendix A.2, there is a function
$$\alpha^{*}_{\ell k}(x_k) = \sum_{j=1}^{J_n} \gamma_{j,\ell k} B_{j,k}(x_k) \tag{6}$$
such that ‖α*ℓk − αℓk‖∞ = O(Jn−r). Then for every 1 ≤ ℓ ≤ p, αℓ(x) can be approximated well by a linear combination of spline functions in Gn, so that
$$\alpha^{*}_{\ell}(x) = \gamma_{\ell 0} + \sum_{k=1}^{d} \sum_{j=1}^{J_n} \gamma_{j,\ell k} B_{j,k}(x_k) = \gamma_{\ell}^{T} B(x), \tag{7}$$
where γℓ = (γℓ0, γTℓ1,…, γTℓd)T, in which γℓk = (γj,ℓk : 1 ≤ j ≤ Jn)T. Thus the minimization problem in (5) is equivalent to finding γ̂ = (γ̂T1,…, γ̂Tp)T, with γ̂ℓ = (γ̂ℓ0, γ̂Tℓ1,…, γ̂Tℓd)T, to minimize the negative quasi-likelihood L̂(γ) with αℓ replaced by γTℓB. The components of the additive coefficients are then estimated by α̂ℓk(xk) = Σ1≤j≤Jn γ̂j,ℓk Bj,k(xk) and α̂ℓ0 = γ̂ℓ0.
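To make the basis construction concrete, the following sketch (all helper names are ours) builds the design matrix of a clamped B-spline basis with quantile knots using scipy and applies plain empirical column centering. The paper's centering instead subtracts a multiple of the first basis function and rescales by √N, so this is a simplified variant.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, n_interior, q=4):
    """Design matrix of the J_n = N + q B-spline basis functions of order q
    (degree q - 1) on [0, 1], with interior knots at sample quantiles."""
    interior = np.quantile(x, np.linspace(0, 1, n_interior + 2)[1:-1])
    knots = np.concatenate([np.zeros(q), interior, np.ones(q)])  # clamped knots
    n_basis = len(knots) - q  # equals n_interior + q, matching J_n = N + q
    B = np.empty((len(x), n_basis))
    for j in range(n_basis):
        coef = np.zeros(n_basis)
        coef[j] = 1.0
        B[:, j] = BSpline(knots, coef, q - 1, extrapolate=False)(x)
    return np.nan_to_num(B)

def center_columns(B):
    """Simplified empirical stand-in for the centering E{B_jk(X_k)} = 0."""
    return B - B.mean(axis=0, keepdims=True)
```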
2.2. Adaptive group Lasso estimator
We now describe the procedure for estimating and selecting the additive coefficient functions by using the adaptive group Lasso. The estimators are obtained by minimizing a penalized negative quasi-likelihood criterion. We establish asymptotic selection consistency as well as the convergence rate of the estimators to the true nonzero functions. For any vector a = (a1, …, as)T, let its L2 norm be ‖a‖2 = (a12 + ⋯ + as2)1/2. For any measurable, square-integrable function ϕ on [0, 1]d, define the L2 norm as ‖ϕ‖2 = [E{ϕ2(X)}]1/2.
We are interested in identifying the significant components of the vector T = (T1, …, Tp)T. Let s, a fixed number, be the total number of nonzero αℓ's and I1 = {ℓ: ‖αℓ‖ ≠ 0, 1 ≤ ℓ ≤ p}. Let I2 be the complementary set of I1; that is, I2 = {ℓ: αℓ(·) ≡ 0, 1 ≤ ℓ ≤ p}. Recalling the approximation given in (7), α*ℓ is the zero function if and only if each element of γℓ is zero; that is, ‖γℓ‖2 = 0. We apply the adaptive group Lasso approach in Huang, Horowitz and Wei (2010) for variable selection in model (4). In order to identify zero additive coefficients, we penalize the L2 norm of the coefficients γℓ for 1 ≤ ℓ ≤ p. Let wn = (wn1, …, wnp)T be a given vector of weights, which needs to be chosen appropriately to achieve selection consistency. Their choice will be discussed in Section 2.3. We consider the penalized negative quasi-likelihood
$$\hat L_P(\gamma) = \hat L(\gamma) + \lambda_n \sum_{\ell=1}^{p} w_{n\ell} \|\gamma_{\ell}\|_2, \tag{8}$$
where λn is a regularization parameter controlling the amount of shrinkage. The estimator γ̂ is obtained by minimizing (8); the minimization is carried out by local quadratic approximation, as adopted by Fan and Li (2001).
For ℓ = 1, …, p, the ℓth additive coefficient function is estimated by α̂ℓ(x) = α̂ℓ0 + Σ1≤k≤d α̂ℓk(xk), with α̂ℓ0 = γ̂ℓ0 and α̂ℓk(xk) = Σ1≤j≤Jn γ̂j,ℓk Bj,k(xk).
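The paper minimizes (8) by local quadratic approximation. As an alternative illustration of the same criterion for the logistic special case, the sketch below uses proximal gradient descent with the groupwise soft-threshold operator. The design matrix Z (columns Tiℓ Bj,k(Xik) arranged in per-ℓ blocks), the group index lists and all function names are our assumptions, not the paper's implementation.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Shrink the L2 norm of the block v by t; blocks with norm <= t become 0."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

def adaptive_group_lasso_logistic(Z, y, groups, weights, lam,
                                  step=0.01, n_iter=5000):
    """Proximal-gradient minimizer of the criterion in (8) with a logistic
    link; `groups` lists the column indices of each gamma_l block and
    `weights` holds the adaptive weights w_nl."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p_hat = 1.0 / (1.0 + np.exp(-(Z @ gamma)))
        gamma = gamma - step * Z.T @ (p_hat - y) / len(y)  # gradient step
        for g, w in zip(groups, weights):                  # proximal step
            gamma[g] = (np.zeros(len(g)) if np.isinf(w)
                        else group_soft_threshold(gamma[g], step * lam * w))
    return gamma
```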
We will make the following two assumptions on the order requirements of the tuning parameters. Write wn, I1 = (wnℓ : ℓ ∈ I1).
Assumption 1. and λn ‖wn, I1‖2 → 0, as n → ∞.
Assumption 2. , for all ℓ ∈ I2.
The following theorem presents the selection consistency and estimation properties of the adaptive group Lasso estimators.
Theorem 1. Under conditions (C1)–(C5) in the Appendix and Assumptions 1 and 2: (i) as n → ∞, P(‖α̂ℓ‖ > 0, ℓ ∈ I1 and ‖α̂ℓ‖ = 0, ℓ ∈ I2) → 1, and (ii) .
2.3. Choice of the weights
We now discuss how to choose the weights used in (8) based on initial estimates. For low-dimensional data settings with p < n, an unpenalized estimator such as the least squares estimator [Zou (2006)] can be used as an initial estimate. For high-dimensional settings with p ≫ n, it has been argued [Meier and Bühlmann (2007)] that the Lasso estimator is a more appropriate choice. Following Huang, Horowitz and Wei (2010), we obtain an initial estimate with the group Lasso by minimizing

$$\hat L(\gamma) + \lambda_{n1} \sum_{\ell=1}^{p} \|\gamma_{\ell}\|_2$$

with respect to γ = (γT1, …, γTp)T. Denote the resulting estimators by γ̃ = (γ̃T1, …, γ̃Tp)T. Let Ĩ1 = {ℓ : ‖γ̃ℓ‖2 ≠ 0, 1 ≤ ℓ ≤ p}, and let s̃ be the number of elements in Ĩ1.
Under conditions (C1)–(C5) in the Appendix, and when λn1 is at least a sufficiently large constant C times the appropriate rate, we have: (i) the number of estimated nonzero functions is bounded; that is, as n → ∞, there exists a constant 1 < C1 < ∞ such that P(s̃ ≤ C1s) → 1; (ii) if λn1 → 0, then P(‖γ̃ℓ‖2 > 0 for all ℓ ∈ I1) → 1; (iii) the group Lasso estimators converge to the true nonzero functions. We refer to Theorems 1(i) and (ii) of Huang, Horowitz and Wei (2010) for the proofs of (i) and (ii), and to Theorem 1 in our paper for the proof of (iii).
The weights we use are wnℓ = ‖γ̃ℓ‖2−1 if ‖γ̃ℓ‖2 > 0, and wnℓ = ∞ if ‖γ̃ℓ‖2 = 0.
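In code, the weights follow directly from the initial group-lasso fit. A minimal sketch in our own notation, mapping zero groups to an infinite weight so that they remain excluded downstream:

```python
import numpy as np

def agl_weights(gamma_tilde, groups):
    """w_nl = 1 / ||gamma_tilde_l||_2 for nonzero groups; infinity otherwise."""
    weights = []
    for g in groups:
        nrm = np.linalg.norm(gamma_tilde[g])
        weights.append(1.0 / nrm if nrm > 0 else np.inf)
    return np.asarray(weights)
```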
Remark 1. Assumptions 1 and 2 give the order requirements for Jn and λn. Based on the first condition given in Assumption 1, we need Jn ≪ {n log(n)}1/2, where an ≪ bn denotes an/bn = o(1) for positive sequences an and bn, and λn needs to satisfy the corresponding rate condition. From the above theoretical properties of the group Lasso estimators, we know that, with probability approaching 1, ‖γ̃ℓ‖2 > 0 for the nonzero components, so the corresponding weights wnℓ are bounded away from 0 and infinity for ℓ ∈ I1. By defining 0 · ∞ = 0, the components not selected by the group Lasso are excluded from the adaptive group Lasso procedure. Let Jn ≍ n1/(2r+1), so that Jn has the optimal order for spline regression. Then p can be as large as exp[o{n2r/(2r+1)}]; that is, the dimension p can diverge with the sample size at an almost exponential rate.
2.4. Selection of tuning parameters
Tuning parameter selection always plays an important role in model and variable selection. An underfitted model can lead to severely biased estimation, and an overfitted model can seriously degrade the estimation efficiency. Among data-driven methods, the Bayesian information criterion (BIC) tuning parameter selector has been shown to identify the true model consistently in the fixed-dimensional setting [Wang, Li and Tsai (2007)]. In the high-dimensional setting, an extended BIC (EBIC) and a generalized information criterion have been proposed by Chen and Chen (2008) and Fan and Tang (2013), respectively. In this paper, we adopt the EBIC method [Chen and Chen (2008)] to select the tuning parameter λn in (8). Specifically, the EBIC(λn) is defined as

$$\mathrm{EBIC}(\lambda_n) = 2n \hat L(\hat\gamma_{\lambda_n}) + s^{*}(dJ_n + 1)\log n + 2\nu s^{*}\log p,$$

where γ̂λn is the minimizer of (8) for a given λn, s* is the number of nonzero estimated functions and 0 ≤ ν ≤ 1 is a constant. Here we use ν = 0.5; when ν = 0, the EBIC reduces to the ordinary BIC.
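A sketch of the selector follows, assuming (dJn + 1) free spline parameters per selected group; the paper's exact degrees-of-freedom bookkeeping may differ, so the formula is illustrative rather than definitive.

```python
import numpy as np

def ebic(neg_quasi_loglik, n, p, s_star, params_per_group, nu=0.5):
    """EBIC = 2 * (summed negative quasi-likelihood) + df * log(n)
    + 2 * nu * s* * log(p); setting nu = 0 recovers the ordinary BIC."""
    df = s_star * params_per_group  # assumed accounting: d*Jn + 1 per group
    return (2.0 * neg_quasi_loglik + df * np.log(n)
            + 2.0 * nu * s_star * np.log(p))

# lam_best = min(lambda_grid, key=lambda lam: ebic(*fit_and_summarize(lam)))
```

The last line indicates the intended use: fit (8) on a grid of λn values and keep the minimizer (fit_and_summarize is a hypothetical helper returning the quantities above).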
We use cubic B-splines for the nonparametric function estimation, so that q = 4. In the penalized estimation procedure, we let the number of interior knots N = ⎿cn1/(2q + 1)⏌ satisfy the optimal order, where ⎿a⏌ denotes the largest integer no greater than a and c is a constant. In the simulations, we take c = 2.
3. Inference and the bootstrap smoothing procedure
3.1. Background
After model selection, our next step is to conduct statistical inference for the coefficient functions of the important variables. We will establish a simultaneous confidence band (SCB) based on a two-step estimator for global inference. An asymptotic formula for the SCB will be provided based on the distribution of the maximum of the normalized deviation of the spline functional estimate. To improve accuracy, we calculate the estimated standard deviation in the SCB by using the nonparametric bootstrap smoothing method as discussed in Efron (2014). For concreteness, we focus on the construction of the SCB for αℓ1(x1), with the bands for αℓk(xk), k ≥ 2, defined similarly, for ℓ ∈ Î1, where Î1 = {ℓ : ‖α̂ℓ‖ ≠ 0, 1 ≤ ℓ ≤ p}.
Although the one-step penalized estimation in Section 2 can quickly identify nonzero coefficient functions, no asymptotic distribution is available for the resulting estimators. Thus we construct the SCB based on a refined two-step spline estimator for αℓ1(x1), which will be shown to have the oracle property that the estimator of αℓ1(x1) has the same asymptotic distribution as the univariate oracle estimator obtained by pretending that αℓ0 and αℓk (Xk) for ℓ ∈ Î1, k ≥ 2 and αℓ(X) for ℓ ∉ Î1 are known. See Horowitz, Klemelä and Mammen (2006), Horowitz and Mammen (2004), Liu, Yang and Härdle (2013) for kernel-based two-step estimators in generalized additive models, which also have the oracle property but are not as computationally efficient as the two-step spline method. We next introduce the oracle estimator and the proposed two-step estimator before we present the SCB.
3.2. Oracle estimator
In the following, we describe the oracle estimator of αℓ1(x1). We rewrite model (4) as
$$g\{\mu(X, T)\} = \sum_{\ell \in \hat I_1} \Bigl\{\alpha_{\ell 0} + \alpha_{\ell 1}(X_1) + \sum_{k \geq 2} \alpha_{\ell k}(X_k)\Bigr\} T_{\ell} + \sum_{\ell \notin \hat I_1} \alpha_{\ell}(X) T_{\ell}. \tag{9}$$
By assuming that αℓ0 and αℓk(Xk) for ℓ ∈ Î1, k ≥ 2 and αℓ(X) for ℓ ∉ Î1 are known, estimation in (9) involves only the nonparametric functions αℓ1(X1) of the scalar covariate X1. It will be shown in Theorem 2 that the resulting estimator achieves the univariate optimal convergence rate when the optimal order for the number of knots is applied. We estimate α1(x1) = {αℓ1(x1), ℓ ∈ Î1}T by minimizing the negative quasi-likelihood function as follows. Denote the oracle estimator by α̂ORℓ1(x1) = Σ1≤j≤JS γ̂ORj,ℓ1 BSj,1(x1), where γ̂OR is defined directly below and BSj,1(x1) is the centered B-spline function defined in the same way as Bj,1(x1) in Section 2, but with NS interior knots and JS = NS + q basis functions. Rates of increase for NS are described in Assumptions 3 and 4 below. Let αℓ,−1(Xi) = αℓ0 + Σk≥2 αℓk(Xik). Then γ̂OR is obtained by minimizing the negative quasi-likelihood
$$\hat L^{\mathrm{OR}}(\gamma_{,1}) = -n^{-1} \sum_{i=1}^{n} Q\Bigl[g^{-1}\Bigl\{\sum_{\ell \in \hat I_1}\Bigl(\alpha_{\ell,-1}(X_i) + \sum_{j=1}^{J_S}\gamma_{j,\ell 1}B^{S}_{j,1}(X_{i1})\Bigr)T_{i\ell} + \sum_{\ell \notin \hat I_1}\alpha_{\ell}(X_i)T_{i\ell}\Bigr\}, Y_i\Bigr], \tag{10}$$
where γ,1 = {(γℓ1)T, ℓ ∈ Î1}T. Similarly, the oracle estimator of α0 = {αℓ0, ℓ ∈ Î1}T, denoted as α̂OR0, is obtained by minimizing the analogous negative quasi-likelihood in γ,0 = (γℓ0, ℓ ∈ Î1)T.
3.3. Initial estimator
The oracle estimator is infeasible because it assumes knowledge of the other functions. In order to obtain the two-step estimators of αℓ1(x1) for ℓ ∈ Î1, we first need initial estimators α̂iniℓ0 and α̂iniℓk(xk) for k ≥ 2 and ℓ ∈ Î1, constructed from B-spline basis functions with Nini interior knots; rates of increase for Nini are described in Assumptions 3 and 4 below. We need an undersmoothed procedure in the first step, so that the approximation bias is reduced and the difference between the two-step and oracle estimators is asymptotically negligible. We obtain the initial spline estimators by minimizing the negative quasi-likelihood over the selected variables only. The adaptive group Lasso penalized estimator γ̂Î1 = {(γ̂ℓ)T : ℓ ∈ Î1}T obtained in Section 2 could also be used as the initial estimator; we, however, refit the model with the selected variables to improve estimation accuracy in high-dimensional data settings.
3.4. Final estimator
In the second step, we construct the two-step estimator of αℓ1 for ℓ ∈ Î1. We replace αℓ0 and αℓk(Xk) by the initial estimators α̂iniℓ0 and α̂iniℓk(Xk) for ℓ ∈ Î1 and k ≥ 2, and replace αℓ(X) for ℓ ∉ Î1 by α̂ℓ(X) = 0. Let α̂iniℓ,−1(Xi) = α̂iniℓ0 + Σk≥2 α̂iniℓk(Xik). Denote the two-step spline estimator of αℓ1(x1) as α̂ℓ1(x1) = Σ1≤j≤JS γ̂j,ℓ1 BSj,1(x1), with γ̂,1 minimizing
$$-n^{-1} \sum_{i=1}^{n} Q\Bigl[g^{-1}\Bigl\{\sum_{\ell \in \hat I_1}\Bigl(\hat\alpha^{\,\mathrm{ini}}_{\ell,-1}(X_i) + \sum_{j=1}^{J_S}\gamma_{j,\ell 1}B^{S}_{j,1}(X_{i1})\Bigr)T_{i\ell}\Bigr\}, Y_i\Bigr]. \tag{11}$$
Then the two-step estimator of αℓ0 is obtained in the same way as the oracle version, by replacing αℓ,−1(Xi) with α̂iniℓ,−1(Xi) for ℓ ∈ Î1 and replacing αℓ(Xi) with α̂ℓ(Xi) = 0 for ℓ ∉ Î1.
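Conceptually, the second step is a generalized linear model refit in which the plugged-in first-step estimates enter the linear predictor as a fixed offset. A minimal sketch with statsmodels, assuming a binomial family and our own variable names:

```python
import numpy as np
import statsmodels.api as sm

def two_step_fit(B1_blocks, eta_offset, y):
    """Refit only the X1 spline coefficients: the columns of X are
    T_il * B^S_j(X_i1) for the selected l, while eta_offset holds the
    first-step estimates of alpha_l0 + sum_{k>=2} alpha_lk(X_ik) summed
    into the linear predictor (and 0 for l not in I1-hat)."""
    X = np.hstack(B1_blocks)
    res = sm.GLM(y, X, family=sm.families.Binomial(), offset=eta_offset).fit()
    return res.params  # second-step spline coefficients gamma_{j,l1}
```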
3.5. Asymptotic normality and uniform oracle efficiency
We now establish the asymptotic normality and uniform oracle efficiency of the oracle and final estimators. Let s* be the number of elements in Î1. By Theorem 1, P(s* = s) → 1. For simplicity of notation, denote ηi = η(Xi, Ti), and define the matrix B(x1) entering the asymptotic variance below.
To establish the asymptotic distribution of the two-step estimator, in addition to Assumptions 1 and 2 given in Section 2, we make the following two assumptions on the numbers of basis functions used in the two estimation steps:
Assumption 3. (i) and , and (ii) , as n → ∞.
Assumption 4. , as n → ∞.
First we describe the asymptotic normality of the oracle estimator of αℓ1(x1).
Theorem 2. Under conditions (C1)–(C5) and Assumption 3(i), for any vector a ∈ Rs* with ‖a‖2 = 1 and any x1 ∈ [0, 1], aT{α̂OR1(x1) − α1(x1)}/σn,a(x1) converges in distribution to N(0, 1), where σn,a(x1) is given by
| (12) |
where ġ−1(ηi) is the first-order derivative of g−1(ηi) with respect to ηi, and
Thus for ℓ ∈ Î1, {α̂ORℓ1(x1) − αℓ1(x1)}/σn,ℓ(x1) converges in distribution to N(0, 1), where σn,ℓ(x1) is given by
| (13) |
and eℓ is the s*-dimensional vector with the ℓth element 1 and the other elements 0.
The next result shows the uniform oracle efficiency of the two-step estimator: the difference between the two-step and oracle estimators is uniformly asymptotically negligible, and thus the two-step estimator is oracle in the sense that it has the same asymptotic distribution as the oracle estimator.
Theorem 3. Under conditions (C1)–(C5) in the Appendix and Assumptions 1–3, the two-step estimator α̂ℓ1 is uniformly asymptotically equivalent to the oracle estimator α̂ORℓ1 on [0, 1]; furthermore, under Assumption 4, aT{α̂1(x1) − α1(x1)}/σn,a(x1) converges in distribution to N(0, 1) for any vector a ∈ Rs* with ‖a‖2 = 1 and σn,a(x1) given in (12). Hence, for any x1 ∈ [0, 1], {α̂ℓ1(x1) − αℓ1(x1)}/σn,ℓ(x1) converges in distribution to N(0, 1).
Remark 2. Under Assumptions 1 and 2, by Theorem 1, with probability approaching 1, s* = s, which is a fixed number. In the second step, the nonparametric functions αℓ1 for ℓ ∈ Î1 are approximated by spline functions with the optimal number of knots. By the conditions given in Assumptions 3 and 4, the number of basis functions in the first step needs to grow slightly faster than the optimal order, where r ≥ 1; when the adaptive group lasso estimator is used as the initial estimator, Assumption 1 imposes the corresponding restriction on Jn. We therefore can let the first-step number of basis functions exceed the optimal order by a factor nϑ, where ϑ is any small positive number close to 0. This increase in the number of basis functions ensures undersmoothing in the first step, so that the uniform difference between the two-step and the oracle estimators becomes asymptotically negligible. Based on Assumptions 1 and 2, the tuning parameter λn needs to satisfy the corresponding rate condition.
Remark 3. The number of interior knots has the same order requirement as the number of basis functions. In the first step, with the undersmoothing requirement discussed in Remark 2, we let the number of interior knots be Nini = ⎿cn(1+0.01)/(2q+1)⏌, where c is a constant, assuming that r = q. In the simulations, we let c = 2. In the second-step estimation, we select the number of knots NS by minimizing BIC over the range [⎿n1/(2q+1)⏌, ⎿2n1/(2q+1)⏌].
3.6. Simultaneous confidence bands
In this section, we propose a SCB for αℓ1(x1) by studying the asymptotic behavior of the maximum of the normalized deviation of the spline functional estimate. To construct asymptotic SCBs for αℓ1(x1) over the interval x1 ∈ [0, 1] with confidence level 100(1 − α)%, α ∈ (0, 1), we need to find two functions lℓn(x1) and uℓn(x1) such that
$$\lim_{n\to\infty} P\bigl\{l_{\ell n}(x_1) \le \alpha_{\ell 1}(x_1) \le u_{\ell n}(x_1),\ \forall x_1 \in [0, 1]\bigr\} = 1 - \alpha. \tag{14}$$
In practice, we consider a variant of (14) and construct SCBs over a subset Sn,1 of [0, 1], with Sn,1 becoming denser as n → ∞. We therefore partition [0, 1] into equally spaced intervals based on the points 0 = ξ0 < ξ1 < ⋯ < ξLn < ξLn+1 = 1, where Ln → ∞ as n → ∞. Let Sn,1 = {ξ0,…, ξLn}. Define dLn(α) = 1 − {2 log(Ln + 1)}−1[log{−(1/2)log(1 − α)} + (1/2){log log(Ln + 1) + log(4π)}], and QLn(α) = {2 log(Ln + 1)}1/2dLn(α).
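The critical value QLn(α) depends only on Ln and α and can be computed directly from the displayed formulas; a short sketch:

```python
import numpy as np

def scb_critical_value(L_n, alpha):
    """Q_{Ln}(alpha) = {2 log(Ln + 1)}^{1/2} d_{Ln}(alpha), as defined above."""
    two_log = 2.0 * np.log(L_n + 1)
    d = 1.0 - (1.0 / two_log) * (np.log(-0.5 * np.log(1.0 - alpha))
        + 0.5 * (np.log(np.log(L_n + 1)) + np.log(4.0 * np.pi)))
    return np.sqrt(two_log) * d

# Example: scb_critical_value(20, 0.05) gives the multiplier for a 95% band.
```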
Theorem 4. Under conditions (C1)–(C5) in the Appendix, Assumptions 1–4 and suitable rate conditions on Ln, we have

$$\lim_{n\to\infty} P\Bigl\{\max_{x_1 \in S_{n,1}} \bigl|\hat\alpha_{\ell 1}(x_1) - \alpha_{\ell 1}(x_1)\bigr|\big/\sigma_{n1}(x_1) \le Q_{L_n}(\alpha)\Bigr\} = 1 - \alpha,$$
and thus an asymptotic 100(1 − α)% confidence band for αℓ1(x1) over x1 ∈ Sn,1 is
$$\hat\alpha_{\ell 1}(x_1) \pm Q_{L_n}(\alpha)\,\sigma_{n1}(x_1). \tag{15}$$
Remark 4. Compared to pointwise confidence intervals with width 2Z1−α/2 σn1(x1), the width of the confidence band (15) is inflated by the factor {2 log(Ln + 1)}1/2dLn(α)/Z1−α/2, where Z1−α/2 is the 100(1 − α/2)th percentile of the standard normal distribution.
3.7. Bootstrap smoothing for calculating the standard error
Theorem 4 establishes a thresholding value QLn(α) for the SCB. One critical question is how to estimate the standard deviation σn1(x1) in order to construct the SCB. We can use a sample estimate of σn1(x1) based on the asymptotic formula given in (12), which may have approximation error and thus lead to inaccurate inference. The bootstrap estimate of the standard deviation provides an alternative. We here propose a bootstrap smoothed confidence band by adopting the nonparametric bootstrap smoothing idea from Efron (2014), which eliminates discontinuities in jumpy estimates. The procedure is described as follows.
Let D = {D1,…, Dn} be the observed data, where Di = {Yi, Xi, (Tiℓ, ℓ ∈ Î1)}. Denote by D* a nonparametric bootstrap sample from {D1,…, Dn}, and by D*j the jth bootstrap sample in B draws. Let α̂*ℓ1,j(x1) be the two-step estimator of αℓ1(x1) computed from the data D*j. We first present an empirical standard deviation obtained by the traditional resampling method, given as

$$\hat\sigma^{*}_{\ell 1}(x_1) = \Bigl[(B-1)^{-1}\sum_{j=1}^{B}\bigl\{\hat\alpha^{*}_{\ell 1, j}(x_1) - \bar\alpha^{*}_{\ell 1}(x_1)\bigr\}^{2}\Bigr]^{1/2}, \tag{16}$$

where ᾱ*ℓ1(x1) = B−1 Σ1≤j≤B α̂*ℓ1,j(x1). Then a 100(1 − α)% unsmoothed bootstrap SCB for αℓ1(x1) over x1 ∈ Sn,1 is given as
$$\hat\alpha_{\ell 1}(x_1) \pm Q_{L_n}(\alpha)\,\hat\sigma^{*}_{\ell 1}(x_1). \tag{17}$$
Another choice is the smoothed bootstrap SCB, which eliminates discontinuities in the estimates [Efron (2014)]. Let α̃ℓ1(x1) = B−1 Σ1≤j≤B α̂*ℓ1,j(x1) be the smoothed estimate of αℓ1(x1) obtained by averaging over the bootstrap replications. Let Y*ij be the number of elements in D*j equaling Di.
Proposition 1. At each point x1 ∈ Sn,1, the nonparametric delta-method estimate of the standard deviation for the smoothed bootstrap statistic is {Σ1≤i≤n covi(x1)2}1/2, where covi(x1) = cov*{Y*ij, α̂*ℓ1,j(x1)}, which is the bootstrap covariance between Y*ij and α̂*ℓ1,j(x1).
The proof of Proposition 1 essentially follows the same arguments as the proof for Theorem 1 in Efron (2014). Based on Proposition 1, to construct the smoothed bootstrap SCB, we use the nonparametric estimate of the standard deviation given as
$$\tilde\sigma_{\ell 1}(x_1) = \Bigl\{\sum_{i=1}^{n}\widehat{\operatorname{cov}}_i(x_1)^{2}\Bigr\}^{1/2}, \tag{18}$$

where

$$\widehat{\operatorname{cov}}_i(x_1) = B^{-1}\sum_{j=1}^{B}(Y^{*}_{ij} - \bar Y^{*}_{i\cdot})\bigl\{\hat\alpha^{*}_{\ell 1, j}(x_1) - \bar\alpha^{*}_{\ell 1}(x_1)\bigr\},$$

with Ȳ*i· = B−1 Σ1≤j≤B Y*ij. The 100(1 − α)% smoothed bootstrap SCB for αℓ1(x1) over x1 ∈ Sn,1 is given as
$$\tilde\alpha_{\ell 1}(x_1) \pm Q_{L_n}(\alpha)\,\tilde\sigma_{\ell 1}(x_1). \tag{19}$$
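Once the B bootstrap curves are stored row-wise, the two standard deviations in (16) and (18) reduce to a few array operations. A sketch under our own naming, with idx holding the resampled observation indices of each draw:

```python
import numpy as np

def bootstrap_sds(idx, curves):
    """idx: (B, n) resampled indices; curves: (B, m) two-step estimates of
    alpha_{l1} at m grid points, one row per bootstrap fit. Returns the
    unsmoothed sd of (16) and the smoothed delta-method sd of (18)."""
    B, n = idx.shape
    sd_unsmoothed = curves.std(axis=0, ddof=1)                    # equation (16)
    Y = np.stack([np.bincount(row, minlength=n) for row in idx])  # counts Y*_ij
    cov = (Y - Y.mean(0)).T @ (curves - curves.mean(0)) / B       # cov_i(x1)
    sd_smoothed = np.sqrt((cov ** 2).sum(axis=0))                 # equation (18)
    return sd_unsmoothed, sd_smoothed
```

The bands then follow by adding and subtracting QLn(α) times the corresponding standard deviation at each grid point.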
4. A simulation study
In this section, we present a simulation study to evaluate the finite-sample performance of our proposed penalized estimation procedure and the simultaneous confidence bands. More numerical studies are provided in the supplementary materials [Ma et al. (2015)].
Example 1. In this example, we use 1286 SNPs located on the sixth chromosome from the Framingham Heart Study to simulate the binary response from the logistic model
$$P(Y_i = 1 \mid X_i, T_i) = \frac{\exp\{\sum_{\ell=1}^{1286}\alpha_{\ell}(X_i)T_{i\ell}\}}{1 + \exp\{\sum_{\ell=1}^{1286}\alpha_{\ell}(X_i)T_{i\ell}\}}, \qquad \alpha_{\ell}(X_i) = \alpha_{\ell 0} + \alpha_{\ell 1}(X_{i1}) + \alpha_{\ell 2}(X_{i2}), \tag{20}$$
with the four SNPs ss66063578, ss66236230, ss66194604 and ss66533844 selected from the real data analysis in Section 5 as important covariates and the other SNPs as unimportant covariates, so that s = 4 (the number of important covariates), p = 1286 and the sample size n = 300. The three possible allele combinations are coded as 1, 0 and −1 for each SNP. The covariates Xik, k = 1, 2, are simulated environmental effects, generated from independent uniform distributions on [0, 1]. We set the coefficient functions to α10 = 0.5, α11(x1) = 4cos(2πx1), α12(x2) = 5{(2x2 − 1)2 − 1/3}, α20 = 0.5, α21(x1) = 6x1 − 3, α22(x2) = 4{sin(2πx2) + cos(2πx2)}, α30 = 0.5, α31(x1) = 4sin(2πx1), α32(x2) = 6x2 − 3, α40 = 0.5, α41(x1) = 4cos(2πx1), α42(x2) = 5{(2x2 − 1)2 − 1/3} and αℓ(Xi) = 0 for ℓ = 5,…, 1286. We conducted 500 replications for each simulation. We fit the data with the GACM (20) by using the adaptive group lasso (AGL) and the group lasso (GL). In the literature, the generalized varying coefficient model [GVCM; Lian (2012)], which allows one index variable in the coefficient function of each predictor Tiℓ, has been widely used to study nonlinear interactions. To apply the GVCM method [Lian (2012)] in this setting, we first perform principal component analysis (PCA) on Xi and then use the first principal component Ui as the index variable, so that the coefficient of each Tiℓ is an unknown function αℓ(Ui); we then apply the AGL and GL methods to this GVCM. Moreover, we also fit the data with parametric logistic regression by assuming the linear coefficient functions (3), with the AGL method. We further compare our proposed method with the conventional screening method based on parametric logistic regression for genome-wide association studies [GWAS; Murcray, Lewinger and Gauderman (2009)]. In the screening method, we fit a logistic model separately for each SNP ℓ = 1,…, 1286, with main effects for Tiℓ, Xi1 and Xi2 and interaction terms between Tiℓ and the environmental covariates, the SNP-related coefficients being βℓ, βℓ1, βℓ2 and βℓ3. We then conduct a likelihood ratio test of the genetic and interaction effects, H0 : βℓ = βℓ1 = βℓ2 = βℓ3 = 0. Let α0 = 0.05 be the overall type I error for the study and M = 1286 the number of SNPs in this study. We apply the multiple testing correction procedure for GWAS, rejecting H0 when the p-value < α0/Meff, where Meff is the Cheverud–Nyholt estimate of the effective number of tests [Cheverud (2001), Nyholt (2004)], calculated as Meff = 1 + (M − 1){1 − var(λ)/M}, with λ the eigenvalues of the SNP correlation matrix formed from the pairwise correlation coefficients rjk; we obtain Meff = 1275.65.
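For reference, here is a sketch of the data-generating step of this example, assuming the coded genotypes of the four important SNPs are stored column-wise in a hypothetical matrix T_snps:

```python
import numpy as np
rng = np.random.default_rng(2015)

def generate_example1(T_snps, n=300):
    """Simulate (X, Y) from the logistic GACM (20) with the coefficient
    functions above; T_snps is an (n, 4) matrix with codes in {-1, 0, 1}."""
    X = rng.uniform(size=(n, 2))
    x1, x2 = X[:, 0], X[:, 1]
    a = np.column_stack([
        0.5 + 4 * np.cos(2 * np.pi * x1) + 5 * ((2 * x2 - 1) ** 2 - 1 / 3),
        0.5 + (6 * x1 - 3) + 4 * (np.sin(2 * np.pi * x2) + np.cos(2 * np.pi * x2)),
        0.5 + 4 * np.sin(2 * np.pi * x1) + (6 * x2 - 3),
        0.5 + 4 * np.cos(2 * np.pi * x1) + 5 * ((2 * x2 - 1) ** 2 - 1 / 3),
    ])
    eta = (a * T_snps).sum(axis=1)  # all other alpha_l are identically zero
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return X, Y
```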
Table 1 presents the percentages of correct fitting (C) (exactly the important covariates are selected), overfitting (O) (both the important covariates and some unimportant covariates are selected) and incorrect fitting (I) (some of the important covariates are not selected); the average true positives (TP), that is, the average number of selected covariates among the important covariates; the average false positives (FP), that is, the average number of selected covariates among the unimportant covariates; and the average model errors (MR), the latter defined as n−1 Σ1≤i≤n {μ̂i(Xi, Ti) − μi(Xi, Ti)}2, where μ̂i(Xi, Ti) and μi(Xi, Ti) are the estimated and true conditional means of Yi, respectively. We see that when fitting the proposed GACM, the GL method has a larger percentage of overfitting, as well as larger average false positives, than the AGL method. The AGL raises the correct-fitting percentage from 0.140 to 0.410. Moreover, the model error of the GL method is (0.083 − 0.059)/0.059 = 40.7% larger than that of the AGL. Both the logistic model and the GVCM fail to identify the important covariates, with incorrect-fitting percentages close to or equal to 1. Furthermore, with the screening method based on logistic regression, the average true positive is 1.056, which is much less than 4 (the number of important SNPs). This further illustrates that the traditional screening method is not an effective tool for identifying important genetic factors in this context. In addition, we observe that the results for the AGL method in Table 1 are comparable to the results in Table S.1 of Example 2 (in the supplementary materials) at p = 1000 with simulated SNPs, having similar correct-fitting percentages and MR values.
Table 1.
Variable selection and estimation results by the adaptive group lasso and the group lasso with the GACM and GVCM, respectively, and parametric logistic regression with adaptive group lasso and screening methods based on 500 replications. The columns of C, O and I show the percentage of correct-fitting, over-fitting and incorrect-fitting. The columns TP, FP and MR show true positives, false positives and model errors, respectively
| Model | Method | C | O | I | TP | FP | MR |
|---|---|---|---|---|---|---|---|
| GACM | AGL | 0.410 | 0.460 | 0.130 | 3.860 | 0.870 | 0.059 |
| GACM | GL | 0.140 | 0.764 | 0.096 | 3.904 | 2.540 | 0.083 |
| GVCM | AGL | 0.030 | 0.000 | 0.970 | 1.636 | 5.685 | 0.142 |
| GVCM | GL | 0.060 | 0.000 | 0.940 | 2.076 | 20.670 | 0.120 |
| Logistic regression | AGL | 0.000 | 0.000 | 1.000 | 1.872 | 1.174 | 0.159 |
| Logistic regression | Screening | 0.000 | 0.000 | 1.000 | 1.056 | 0.786 | 0.141 |
Next, we investigate the empirical coverage rates of the unsmoothed and smoothed SCBs given in (17) and (19). To calculate the unsmoothed and smoothed bootstrap standard deviations (16) and (18), we use B = 500 bootstrap replications. The confidence bands are constructed at Ln = 20 equally spaced points. At the 95% confidence level, Table 2 reports the empirical coverage rates (cov) and the sample averages of the median and mean standard deviations (sd.median and sd.mean) for the unsmoothed SCB (17) and the smoothed SCB (19) for the coefficient functions αℓ1(x1), ℓ = 1, 2, 3, 4. We see that the smoothed bootstrap method leads to better performance, having empirical coverage rates closer to the nominal confidence level 0.95.
Table 2. The empirical coverage rates (cov) and the sample averages of the median and mean standard deviations (sd.median and sd.mean) for the unsmoothed SCB (17) and the smoothed SCB (19) for the coefficient functions αℓ1(x1), ℓ = 1, 2, 3, 4

| | Unsmoothed bootstrap | | | Smoothed bootstrap | | |
|---|---|---|---|---|---|---|
| | cov | sd.median | sd.mean | cov | sd.median | sd.mean |
| α11 | 0.610 | 0.689 | 0.809 | 0.818 | 0.735 | 0.982 |
| α21 | 0.628 | 0.563 | 0.725 | 0.846 | 0.666 | 0.932 |
| α31 | 0.636 | 0.736 | 0.832 | 0.869 | 0.837 | 1.053 |
| α41 | 0.646 | 0.768 | 0.843 | 0.882 | 0.891 | 1.064 |
5. Data application
We illustrate our method via an analysis of the Framingham Heart Study [Dawber, Meadors and Moore (1951)] to investigate the effects of G × E interactions on obesity. People are defined as obese when their body mass index (BMI) is 30 or greater; this is the definition used by the U.S. Centers for Disease Control and Prevention; see http://www.cdc.gov/obesity/adult/defining.html. We define the response variable as Y = 1 if BMI ≥ 30 and Y = 0 if BMI < 30. We use X1 = sleeping hours per day, X2 = activity hours per day and X3 = diastolic blood pressure as the environmental factors, and single nucleotide polymorphisms (SNPs) located on the sixth chromosome as the genetic factors. The three possible allele combinations are coded as 1, 0 and −1. As in the simulation, we are thus fitting a multiplicative risk model in the SNPs. For details on genotyping, see http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?studyid=phs000007.v3.p2. A total of 1286 SNPs remain in our analysis after eliminating SNPs with minor allele frequency < 0.05, those departing from Hardy–Weinberg equilibrium and those having correlation coefficient with the response between −0.1 and 0.1. We have n = 300 individuals left in our study after deleting observations with missing values.
To see possible nonlinear main effects of the environmental factors, we first fit a generalized additive model by using X1, X2 and X3 as predictors such that
$$\mathrm{logit}\{P(Y_i = 1 \mid X_i)\} = m_0 + m_1(X_{i1}) + m_2(X_{i2}) + m_3(X_{i3}). \tag{21}$$
Figure S.1 given in the supplementary material [Ma et al. (2015)] depicts the plots of m̂k(·) for k = 1, 2, 3 by one-step cubic spline estimation. Clearly the estimate of each nonparametric function has a nonlinear pattern. We refer to Section S.2 for the detailed description of this figure. Based on the plots shown in Figure S.1, we fit the GACM model
$$\mathrm{logit}\{P(Y_i = 1 \mid X_i, T_i)\} = \sum_{\ell=1}^{1287}\Bigl\{\alpha_{\ell 0} + \sum_{k=1}^{3}\alpha_{\ell k}(X_{ik})\Bigr\}T_{i\ell}, \tag{22}$$
where Ti = (Ti1, Ti2,…, Ti1287)T with Ti1 = 1, and the Tiℓ are the SNP covariates for ℓ = 2,…, 1287. The nonparametric functions αℓk(·) are estimated by cubic splines, and the number of interior knots for each step is selected based on the criterion described in Section 2.4. We select variables in model (22) by the proposed adaptive group lasso (AGL) and the group lasso (GL). To compare the proposed model with linear models, we perform the group lasso by assuming linear interaction effects (Linear), such that αℓ(Xi) = αℓ0 + Σ1≤k≤3 αℓk Xik, and we also perform the lasso by assuming no interaction effects (No interaction), such that αℓ(Xi) = αℓ0. We also apply the screening method with parametric logistic regression (Screening) as described in Example 1. Table 3 reports the variable selection results in these five scenarios. After model selection, we calculate the estimated leave-one-out cross-validation prediction error (CVPE) for the model with the selected variables, shown in the last row of Table 3. Among the SNPs selected by the AGL method, two SNPs, rs4714924 and rs6543930, have been scientifically confirmed by Randall et al. (2013) to have strong associations with obesity. Moreover, compared to the linear, no-interaction and screening methods, our proposed AGL with the GACM enables us to identify more genetic factors, which may be important to the response but missed by the other methods. As a result, it has the smallest CVPE (0.078), so that it significantly improves model prediction compared to the other methods. We also see that the logistic model that completely ignores interactions has the largest CVPE (0.152). The screening method has the second largest CVPE (0.149), which is larger than that of the penalization method (0.124) obtained by fitting the same logistic regression model but with interactions included. This result demonstrates that the screening method is not as effective as the penalization method for the analysis of this data set, a result which also agrees with our simulations.
Table 3.
Variable selection results for the group lasso (GL) and the adaptive group lasso (AGL) in model (22), the group lasso by assuming linear interaction effects (Linear), the lasso by assuming no interaction effects (No interaction) and the screening method (Screening). The symbol ✓ indicates that the SNP was selected into the model. The last row shows the cross-validation prediction errors (CVPE)
| SNPs | GL | AGL | Linear | No interaction | Screening |
|---|---|---|---|---|---|
| rs9296244 | ✓ | ✓ | |||
| rs6910353 | ✓ | ✓ | |||
| rs3130813 | ✓ | ✓ | |||
| rs9353447 | ✓ | ✓ | |||
| rs4714924 | ✓ | ✓ | ✓ | ✓ | |
| rs242263 | ✓ | ✓ | ✓ | ||
| rs282123 | ✓ | ||||
| rs282128 | ✓ | ✓ | |||
| rs6929006 | ✓ | ||||
| rs9353711 | ✓ | ||||
| rs12199154 | ✓ | ✓ | |||
| rs2277114 | ✓ | ||||
| rs749517 | ✓ | ||||
| rs729888 | ✓ | ||||
| rs203139 | ✓ | ||||
| rs6914589 | ✓ | ✓ | |||
| rs6543930 | ✓ | ✓ | |||
| CVPE | 0.099 | 0.078 | 0.124 | 0.152 | 0.149 |
Next we fit the final GACM with the variables selected by the AGL procedure:
$$\mathrm{logit}\{P(Y_i = 1 \mid X_i, T_i)\} = \sum_{\ell \in \hat I_1}\Bigl\{\alpha_{\ell 0} + \sum_{k=1}^{3}\alpha_{\ell k}(X_{ik})\Bigr\}T_{i\ell}. \tag{23}$$
To illustrate the main effects of the environmental factors, Figure 2 plots the smoothed two-step estimates of the main-effect functions α1k(·) for k = 1, 2, 3, and the associated 95% smoothed SCBs (upper and lower solid lines). The plots of the functional estimates show the same nonlinear patterns as the corresponding plots in Figure S.1, although, because of the addition of the SCBs, the scale of the plots has changed.
Fig. 2. Plots of the smoothed two-step estimated main-effect functions α̂1k(·) for k = 1, 2, 3 and the associated 95% SCBs based on model (23).
To illustrate how the effects of the genetic factors change with the environmental factors, in Figure 3 we plot the smoothed two-step estimates and the associated 95% smoothed SCBs of the coefficient functions for the SNP rs242263. To further demonstrate how the probability of developing obesity changes with the environmental factors for each category of SNP rs242263, Figure 4 plots the estimated conditional probability of obesity against each environmental factor, letting Tiℓ = 0 for ℓ ≠ 6. Letting A be the minor allele, the curves are for aa (solid line), Aa (dashed line) and AA (dotted line). Figure 3 indicates different changing patterns of the interaction effects under different environments. For example, sleeping hours appear to have an overall stronger interaction effect with this particular SNP than the other two variables. The effect of this SNP changes from positive to negative and then to positive again as the sleeping hours increase. The coefficient functions of the SNP have an increasing pattern in the activity hours and in the diastolic blood pressure, respectively. From Figure 4, we observe stronger differences among the levels AA, Aa and aa of SNP rs242263 for both large and small values of the environmental factors. There are other interesting results worth further study. For example, in the 2–6 hours per day sleeping range, the AA group (dotted lines) has much higher rates of obesity than the aa group (solid line), but the opposite occurs in the 6–9 hour range. For those with low amounts of activity per day, again the AA group is more obese than the aa group, while when activity increases, the AA group is less obese than the aa group. A similar noticeable difference occurs between the <60 diastolic blood pressure group, those who are hypotensive, and the >90 group, those who are hypertensive, although there are few subjects in the former group.
Fig. 3. Plots of the smoothed two-step estimated coefficient functions of SNP rs242263 for k = 1, 2, 3 and the associated 95% SCBs based on model (23).
Fig. 4.
Plots of the estimated conditional probability of obesity against each environmental factor, letting Tiℓ = 0 for ℓ ≠ 6. With A being the minor allele, the curves are aa (solid line), Aa (dashed line) and AA (dotted line), based on model (23).
6. Discussions
The generalized additive coefficient model (GACM) proposed by Xue and Yang (2006) and Xue and Liang (2010) has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. To promote the use of the GACM in modern data applications, such as gene–environment (G × E) interaction effects in GWAS, we have proposed estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we have devised a groupwise penalization method in the GACM for simultaneous model selection and estimation. We showed by numerical studies that the proposed nonparametric model can effectively identify important genetic factors, while traditional parametric models, such as logistic regression, fail to do so when nonlinear interactions exist. Moreover, by comparison with the conventional logistic-regression screening method commonly used in the GWAS community, our proposed groupwise penalization method with the GACM has been demonstrated to be more effective for variable selection and model estimation. After identifying the important covariates, we have further constructed simultaneous confidence bands for the nonzero coefficient functions based on a refined two-step estimator. We estimate the standard deviation of the functional estimator by the smoothed bootstrap method proposed in Efron (2014); the method was shown to have good numerical performance, reducing variability as well as improving the empirical coverage rate of the proposed simultaneous confidence bands. Our methods can be extended to longitudinal data settings through marginal models or mixed-effects models; more work, however, is needed to understand the properties of the estimators in such settings. Moreover, extending this work to settings in which the dimensions of both the genetic and environmental factors grow with the sample size is a possible future project; the associated theoretical properties for model selection, estimation and inference would need to be carefully investigated.
Acknowledgments
The authors thank the Co-Editors, an Associate Editor and three referees for their valuable suggestions and comments that have substantially improved an earlier version of this paper.
Appendix
Denote the space of qth order smooth functions as C(q)([0, 1]) = {ϕ : ϕ(q) ∈ C[0, 1]}. For any s × s symmetric matrix A, denote its Lq norm as ‖A‖q = maxς∈Rs,‖ς‖2=1 ‖Aς‖q. For a vector a, let ‖a‖∞ = max1≤i≤s |ai|.
Let C0,1(χw) be the space of Lipschitz continuous functions on χw; that is,

$$C^{0,1}(\chi_w) = \Bigl\{\varphi : \|\varphi\|_{0,1} = \sup_{w \neq w',\, w, w' \in \chi_w} \frac{|\varphi(w) - \varphi(w')|}{|w - w'|} < +\infty\Bigr\},$$

in which ‖φ‖0,1 is the C0,1-norm of φ. Denote qj(η, y) = ∂jQ{g−1(η), y}/∂ηj, so that

$$q_1(\eta, y) = \{y - g^{-1}(\eta)\}\rho_1(\eta), \qquad q_2(\eta, y) = \{y - g^{-1}(\eta)\}\dot\rho_1(\eta) - \rho_2(\eta),$$

where ρj(η) = {ġ−1(η)}j/V{g−1(η)}.
A.1. Assumptions
Throughout the paper, we assume the following regularity conditions:
(C1) The joint density of X, denoted by f(x), is absolutely continuous, and there exist constants 0 < cf ≤ Cf < ∞, such that cf ≤minx∈[0, 1]d f(x) ≤ maxx∈[0, 1]d f(x) ≤ Cf.
(C2) The function V is twice continuously differentiable, and the link function g is three times continuously differentiable. The function q2(η, y) < 0 for η ∈ R and y in the range of the response variable.
(C3) For 1 ≤ ℓ ≤ p, 1 ≤ k ≤ d, α(r−1)ℓk ∈ C0,1([0, 1]) for a given integer r ≥ 1. The spline order satisfies q ≥ r.
(C4) Let εi = Yi − μ(Xi, Ti), 1 ≤ i ≤ n. The random variables ε1,…, εn are i.i.d. with E(εi) = 0 and var(εi|Xi, Ti) = σ2(Xi, Ti). Furthermore, their tail probabilities satisfy P(|εi| > x) < K exp(−Cx2), i = 1,…, n, for all x ≥ 0 and for some positive constants C and K.
(C5) The eigenvalues of E(TI1TTI1 | X = x), where TI1 = (Tℓ, ℓ ∈ I1)T, are uniformly bounded away from 0 and ∞ for all x ∈ [0, 1]d. There exist constants 0 < c1 < C1 < ∞ such that c1 ≤ E(T2ℓ | X = x) ≤ C1 for all x ∈ [0, 1]d and ℓ ∈ I2.
Conditions (C1)–(C5) are standard conditions for nonparametric estimation. Condition (C1) is the same as condition (C1) in Xue and Yang (2006) and condition (C5) in Xue and Liang (2010). The first condition in (C2) gives the assumptions on V and the link function g, which can be found in condition (E) of Lam and Fan (2008). The second condition in (C2) guarantees that the negative quasi-likelihood −Q{g−1(η), y} is convex in η ∈ R; it is also given in condition (D) of Lam and Fan (2008) and in (a) of condition 1 in Carroll et al. (1997). Condition (C3) is typical for polynomial spline smoothing; see the same condition in Section 5.2 of Huang (2003). Condition (C4) is the same as assumption (A2) in Huang, Horowitz and Wei (2010). Condition (C5) is given in condition (C5) of Xue and Liang (2010) and condition (A5) in Ma and Yang (2011b).
A.2. Preliminary lemmas
Define the spline approximations α*ℓ as in (6) and (7). Let γI1 = (γℓ : ℓ ∈ I1)T. To prove Theorem 1, we next define the oracle estimator of γI1 by minimizing the penalized negative quasi-likelihood with all irrelevant predictors eliminated:

$$\hat\gamma^{\mathrm{OR}}_{I_1} = \arg\min_{\gamma_{I_1}}\Bigl\{\hat L(\gamma_{I_1}) + \lambda_n \sum_{\ell \in I_1} w_{n\ell}\|\gamma_{\ell}\|_2\Bigr\}. \tag{24}$$

Define γ̂OR = {(γ̂ORI1)T, (γ̂ORI2)T}T with γ̂ORℓ = 0dJn+1 for ℓ ∈ I2, where 0dJn+1 is a (dJn + 1)-dimensional zero vector. We next present several lemmas, whose detailed proofs are given in the online supplementary materials [Ma et al. (2015)]. Lemma A.1 is used in the proof of Theorem 1, while Lemma A.2 is needed in the proof of Theorem 3.
Lemma A.1. Under the conditions of Theorem 1, one has
| (25) |
and as n → ∞,
| (26) |
Lemma A.2. Under conditions (C1)–(C5) and Assumptions 1–3,
| (27) |
A.3. Proof of Theorem 1
A.4. Proof of Theorem 2
Let γ,1 = (γℓ1, ℓ ∈ Î1)T, where γℓ1 is defined in (7). By Taylor's expansion applied to (10), and following similar reasoning as in the proofs of (25), one obtains the representation below, where
| (28) |
Therefore, by Theorem 5.4.2 of DeVore and Lorentz (1993), for sufficiently large n there exist constants 0 < cB ≤ CB < ∞ bounding the relevant B-spline Gram matrix. By condition (C5), for n large enough, there are constants 0 < CT, C′ < ∞ such that the corresponding upper bound holds with C = C′CTCB. Similarly, a lower bound holds for some constant 0 < c < ∞. Thus, following the same reasoning as the proof of (S.5) in the supplementary materials [Ma et al. (2015)], we have, with probability 1, as n → ∞,
| (29) |
By the Lindeberg central limit theorem, it can be proved that
| (30) |
for any a ∈ Rs* with ‖a‖2 = 1. By (30) and Slutsky's theorem, we have
| (31) |
By (28) and (29), with probability approaching 1,
It can then be proved that the remaining terms are asymptotically negligible. Hence
By (31), the conclusion then follows from the central limit theorem.
A.5. Proof of Theorem 3
By (27) in Lemma A.2,
That the right-hand side is of the required order can be proved following the same procedure and is thus omitted. By (29), with probability approaching 1, for large enough n, for any x1 ∈ [0, 1] and a ∈ Rs* with ‖a‖2 = 1, one has
where the asymptotic variance is defined in (12). Thus
A.6. Proof of Theorem 4
Using the strong approximation lemma given in Theorem 2.6.7 of Csörgő and Révész (1981), we can prove by the same procedure as Lemma A.7 in Ma, Yang and Carroll (2012) that
| (32) |
for some t < −r/(2r + 1) < 0, where ei, 1 ≤ i ≤ n, are i.i.d. N(0, 1) random variables independent of Zi,1. For the asymptotic standard deviation defined in (12), uniformly in x1 ∈ [0, 1], by (32) and t < −r/(2r + 1) < 0, we have
| (33) |
Define the standardized Gaussian process η(ξJ), 0 ≤ J ≤ Ln, at the grid points. It is apparent that the conditional distribution of η(ξJ) given {Zi,1, 1 ≤ i ≤ n} is N(0, 1), so η(ξJ) is distributed as N(0, 1) for 0 ≤ J ≤ Ln. Moreover, with probability approaching 1, for J ≠ J′, the correlation between η(ξJ) and η(ξJ′) is bounded by a constant 0 < C < ∞ when |jJ − jJ′| ≤ q − 1 and vanishes when |jJ − jJ′| > q − 1, in which jJ denotes the index of the knot closest to ξJ from the left. Therefore, there exist constants 0 < C1 < ∞ and 0 < C2 < ∞ such that, with probability approaching 1, the correlations for J ≠ J′ are suitably bounded. By Lemma A1 given in Ma and Yang (2011a), we have
and hence
| (34) |
Furthermore, according to the result on page 149 of de Boor (2001), we have
| (35) |
Hence, by (33) and (35), we have
| (36) |
where the last step follows from (34). By the oracle property given in Theorem 3, and and , we have
| (37) |
Therefore, by (36) and (37), we have
and hence the result in Theorem 4 is proved.
Contributor Information
Shujie Ma, Email: shujie.ma@ucr.edu.
Raymond J. Carroll, Email: carroll@stat.tamu.edu.
Hua Liang, Email: hliang@gwu.edu.
Shizhong Xu, Email: shizhong.xu@ucr.edu.
References
- Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. J Amer Statist Assoc. 1997;92:477–489. MR1467842.
- Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95:759–771. MR2443189.
- Cheverud JM. A simple correction for multiple comparisons in interval mapping genome scans. Heredity (Edinb). 2001;87:52–58. doi: 10.1046/j.1365-2540.2001.00901.x.
- Claeskens G, Van Keilegom I. Bootstrap confidence bands for regression curves and their derivatives. Ann Statist. 2003;31:1852–1884. MR2036392.
- Csörgő M, Révész P. Strong Approximations in Probability and Statistics. Academic Press; New York: 1981. MR0666546.
- Dawber TR, Meadors GF, Moore FE. Epidemiological approaches to heart disease: The Framingham study. American Journal of Public Health. 1951;41:279–286. doi: 10.2105/ajph.41.3.279.
- de Boor C. A Practical Guide to Splines. Revised ed. Applied Mathematical Sciences, Vol. 27. Springer; New York: 2001. MR1900298.
- DeVore RA, Lorentz GG. Constructive Approximation. Grundlehren der Mathematischen Wissenschaften, Vol. 303. Springer; Berlin: 1993. MR1261635.
- Efron B. Estimation and accuracy after model selection. J Amer Statist Assoc. 2014;109:991–1007. doi: 10.1080/01621459.2013.823775. MR3265671.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360. MR1946581.
- Fan Y, Tang CY. Tuning parameter selection in high dimensional penalized likelihood. J R Stat Soc Ser B Stat Methodol. 2013;75:531–552. MR3065478.
- Hall P, Titterington DM. On confidence bands in nonparametric density estimation and regression. J Multivariate Anal. 1988;27:228–254. MR0971184.
- Härdle W, Marron JS. Bootstrap simultaneous error bars for nonparametric regression. Ann Statist. 1991;19:778–796. MR1105844.
- Horowitz J, Klemelä J, Mammen E. Optimal estimation in additive regression models. Bernoulli. 2006;12:271–298. MR2218556.
- Horowitz JL, Mammen E. Nonparametric estimation of an additive model with a link function. Ann Statist. 2004;32:2412–2443. MR2153990.
- Huang JZ. Local asymptotics for polynomial spline regression. Ann Statist. 2003;31:1600–1635. MR2012827.
- Huang J, Horowitz JL, Wei F. Variable selection in nonparametric additive models. Ann Statist. 2010;38:2282–2313. doi: 10.1214/09-AOS781. MR2676890.
- Jiang B, Liu JS. Variable selection for general index models via sliced inverse regression. Ann Statist. 2014;42:1751–1786. MR3262467.
- Knutson KL. Does inadequate sleep play a role in vulnerability to obesity? Am J Hum Biol. 2012;24:361–371. doi: 10.1002/ajhb.22219.
- Lam C, Fan J. Profile-kernel likelihood inference with diverging number of parameters. Ann Statist. 2008;36:2232–2260. doi: 10.1214/07-AOS544. MR2458186.
- Lee YK, Mammen E, Park BU. Flexible generalized varying coefficient regression models. Ann Statist. 2012;40:1906–1933. MR3015048.
- Lian H. Variable selection for high-dimensional generalized varying-coefficient models. Statist Sinica. 2012;22:1563–1588. MR3027099.
- Liu R, Yang L. Spline-backfitted kernel smoothing of additive coefficient model. Econometric Theory. 2010;26:29–59. MR2587102.
- Liu R, Yang L, Härdle WK. Oracally efficient two-step estimation of generalized additive model. J Amer Statist Assoc. 2013;108:619–631. MR3174646.
- Ma S, Yang L. A jump-detecting procedure based on spline estimation. J Nonparametr Stat. 2011a;23:67–81. MR2780816.
- Ma S, Yang L. Spline-backfitted kernel smoothing of partially linear additive model. J Statist Plann Inference. 2011b;141:204–219. MR2719488.
- Ma S, Yang L, Carroll RJ. A simultaneous confidence band for sparse longitudinal regression. Statist Sinica. 2012;22:95–122. doi: 10.5705/ss.2010.034. MR2933169.
- Ma S, Carroll RJ, Liang H, Xu S. Supplement to "Estimation and inference in generalized additive coefficient models for nonlinear interactions with high-dimensional covariates." 2015. doi: 10.1214/15-AOS1344SUPP.
- Meier L, Bühlmann P. Smoothing ℓ1-penalized estimators for high-dimensional time-course data. Electron J Stat. 2007;1:597–615. MR2369027.
- Meier L, van de Geer S, Bühlmann P. High-dimensional additive modeling. Ann Statist. 2009;37:3779–3821. MR2572443.
- Murcray CE, Lewinger JP, Gauderman WJ. Gene-environment interaction in genome-wide association studies. Am J Epidemiol. 2009;169:219–226. doi: 10.1093/aje/kwn353.
- Nyholt DR. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet. 2004;74:765–769. doi: 10.1086/383251.
- Randall JC, Winkler TM, Kutalik Z, Berndt SI, Jackson AU, et al. Sex-stratified genome-wide association studies including 270,000 individuals show sexual dimorphism in genetic loci for anthropometric traits. PLOS Genetics. 2013;9:e1003500. doi: 10.1371/journal.pgen.1003500.
- Ravikumar P, Lafferty J, Liu H, Wasserman L. Sparse additive models. J R Stat Soc Ser B Stat Methodol. 2009;71:1009–1030. MR2750255.
- Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053. MR2410008.
- Wang L, Xue L, Qu A, Liang H. Estimation and model selection in generalized additive partial linear models for correlated data with diverging number of covariates. Ann Statist. 2014;42:592–624. MR3210980.
- Wareham NJ, van Sluijs EMF, Ekelund U. Physical activity and obesity prevention: A review of the current evidence. Proc Nutr Soc. 2005;64:229–247. doi: 10.1079/pns2005423.
- Xue L, Liang H. Polynomial spline estimation for a generalized additive coefficient model. Scand J Stat. 2010;37:26–46. doi: 10.1111/j.1467-9469.2009.00655.x. MR2675938.
- Xue L, Yang L. Additive coefficient modeling via polynomial spline. Statist Sinica. 2006;16:1423–1446. MR2327498.
- Zhou S, Shen X, Wolfe DA. Local asymptotics for regression splines and confidence regions. Ann Statist. 1998;26:1760–1782. MR1673277.
- Zou H. The adaptive lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429. MR2279469.