Published in final edited form as: Ann Stat. 2015 Oct;43(5):2102–2131. doi: 10.1214/15-AOS1344

Estimation and Inference in Generalized Additive Coefficient Models for Nonlinear Interactions with High-Dimensional Covariates

Shujie Ma, Raymond J. Carroll, Hua Liang, Shizhong Xu
PMCID: PMC4578655  NIHMSID: NIHMS719947  PMID: 26412908

Abstract

In the low-dimensional case, the generalized additive coefficient model (GACM) proposed by Xue and Yang [Statist. Sinica 16 (2006) 1423–1446] has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. In this paper, we propose estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we propose a groupwise penalization based procedure to distinguish significant covariates for the “large p small n” setting. The procedure is shown to be consistent for model structure identification. Further, we construct simultaneous confidence bands for the coefficient functions in the selected model based on a refined two-step spline estimator. We also discuss how to choose the tuning parameters. To estimate the standard deviation of the functional estimator, we adopt the smoothed bootstrap method. We conduct simulation experiments to evaluate the numerical performance of the proposed methods and analyze an obesity data set from a genome-wide association study as an illustration.

Key words and phrases: Adaptive group lasso, bootstrap smoothing, curse of dimensionality, gene-environment interaction, generalized additive partially linear models, inference for high-dimensional data, oracle property, penalized likelihood, polynomial splines, two-step estimation, undersmoothing

1. Introduction

Regression analysis is a commonly used statistical tool for modeling the relationship between a scalar dependent variable Y and one or more explanatory variables denoted as T = (T1, T2, …, Tp)T. To study the marginal effects of the predictors on the response, one may fit a generalized linear model (GLM),

$$E(Y\mid T)=\mu(T)=g^{-1}\{\eta(T)\},\qquad \eta(T)=\sum_{\ell=1}^{p}\alpha_{\ell 0}T_{\ell}, \quad (1)$$

where g is a known monotone link function and the $\alpha_{\ell 0}$, $1\le \ell\le p$, are unknown parameters. Sometimes the effect of one variable may change with other variables; that is, there is an interaction effect. By letting $T_1\equiv 1$, to incorporate the interaction effects of the $T_\ell$ with the other variables, denoted as $X=(X_1,\dots,X_d)^T$, model (1) can be modified to $E(Y\mid X,T)=\mu(X,T)=g^{-1}\{\eta(X,T)\}$ with

$$\eta(X,T)=\alpha_{10}+\sum_{\ell=2}^{p}\alpha_{\ell 0}T_{\ell}+\sum_{k=1}^{d}\alpha_{1k}X_{k}+\sum_{\ell=2}^{p}\sum_{k=1}^{d}\alpha_{\ell k}X_{k}T_{\ell}, \quad (2)$$

where the $\alpha_{\ell k}$ for $0\le k\le d$ and $1\le \ell\le p$ are parameters. After a direct reformulation, model (2) can be written as

$$\eta(X,T)=\sum_{\ell=1}^{p}\Big(\alpha_{\ell 0}+\sum_{k=1}^{d}\alpha_{\ell k}X_{k}\Big)T_{\ell}. \quad (3)$$

Here the effect of each $T_\ell$ changes linearly with the $X_k$. In practice, however, this simple linear relationship may not reflect the true pattern by which the coefficient changes with the other covariates. We use an example of gene–environment (G × E) interactions for illustration. It has been noted in the literature that obesity is linked to genetic factors. Their effects, however, can be altered by environmental factors such as sleeping hours [Knutson (2012)] and physical activity [Wareham, van Sluijs and Ekelund (2005)]. To get a rough idea of how the effects of the genetic factors change with the environment, we explore data from the Framingham Heart Study [Dawber, Meadors and Moore (1951)]. In Figure 1 we plot the estimated mean body mass index (BMI) against sleeping hours per day and activity hours per day, respectively, for people with the three possible genotype categories AA, Aa and aa of one single nucleotide polymorphism (SNP). A detailed description and the analysis of this data set are given in Section 5. We define allele A as the minor (less frequent) allele. This figure clearly shows different nonlinear curves for the three groups in each of the two plots. By letting $T_\ell$ be the indicator for the $\ell$th group, the linear function in model (3) is clearly misspecified.

Fig. 1. Plots of the estimated BMI against sleeping hours per day (left panel) and activity hours per day (right panel) for the three genotypes AA (solid line), Aa (dashed line) and aa (dotted line) of SNP rs242263 in the Framingham study, where A is the minor allele.


To relax the linearity assumption, we allow each $\alpha_{\ell k}X_k$ term to be an unknown nonlinear function of $X_k$, and thus extend model (3) to the generalized additive coefficient model (GACM)

$$\eta(X,T)=\sum_{\ell=1}^{p}\Big\{\alpha_{\ell 0}+\sum_{k=1}^{d}\alpha_{\ell k}(X_{k})\Big\}T_{\ell}=\sum_{\ell=1}^{p}\alpha_{\ell}(X)T_{\ell}. \quad (4)$$

For identifiability, the functional components satisfy $E\{\alpha_{\ell k}(X_k)\}=0$ for $1\le k\le d$ and $1\le \ell\le p$. The conditional variance of Y is modeled as a function of the mean, that is, $\mathrm{var}(Y\mid X,T)=V\{\mu(X,T)\}=\sigma^{2}(X,T)$. In each coefficient function of the GACM, the covariates $X_k$ are continuous variables. If some of them are discrete, they enter linearly. For example, if $X_k$ is binary, we let $\alpha_{\ell k}(X_k)=\alpha_{\ell k}X_k$. In such a case, model (4) becomes a partially linear additive coefficient model. The linearity of (4) in T is particularly appropriate when those factors are discrete, for example, SNPs in a genome-wide association study (GWAS), as in the data example of Section 5.

For the low-dimensional case in which the dimensions of X and T are fixed, estimation of model (4) has been studied; see Liu and Yang (2010), Xue and Liang (2010) and Xue and Yang (2006) for spline estimation procedures and Lee, Mammen and Park (2012) for a backfitting algorithm. In modern data applications, however, model (4) is particularly useful when p is large. For example, in GWAS the number of SNPs, which is p, can be very large, while the dimension of X, the environmental factors, which is d, is inevitably relatively small. Moreover, the number of variables in T with nonzero effects is small. Applying model (4) in the high-dimensional case therefore poses new challenges, including: (i) how to identify the important variables in T; (ii) how to estimate the coefficient functions for the important covariates; and (iii) how to conduct inference for the nonzero coefficient functions. For example, it is of interest to know whether they have a specific parametric form such as constant, linear or quadratic.

In the high-dimensional data setting, the study of nonlinear interaction effects has attracted much attention in recent years, and a few strategies have been proposed. For example, Jiang and Liu (2014) proposed to detect variables under the general index model, which enables the study of high-order interactions among components of continuous predictors that are assumed to have a multivariate normal distribution. Moreover, Lian (2012) considered variable selection in varying coefficient models, which allow the coefficient functions to depend on one index variable, such as a time-dependent variable.

When we would like to see how the effect of each genetic factor changes under the influence of multiple environmental variables, the proposed high-dimensional GACM (4) becomes a natural approach to consider, since neither the index model [Jiang and Liu (2014)] nor the varying coefficient model [Lian (2012)] can address this question; the former is used to study interactions of components in a set of continuous predictors, and the latter allows only one index variable. For model selection and estimation, we apply a groupwise penalization method. Moreover, most existing high-dimensional nonparametric modeling papers [Lian (2012), Meier, van de Geer and Bühlmann (2009), Ravikumar et al. (2009), Wang et al. (2014), Huang, Horowitz and Wei (2010)] focus on variable selection and estimation. In this paper, after variable selection, we also propose a simultaneous inferential tool to test the shape of the coefficient function for each selected variable, which has not been studied in previous work.

To this end, we aim to address questions (i)–(iii). Specifically, for estimation and model selection, we apply a groupwise regularization method based on a penalized quasi-likelihood criterion. The penalty is imposed on the $L_2$ norm of the spline coefficients of the spline estimators for $\alpha_\ell(\cdot)$. We establish the asymptotic consistency of model selection and estimation for the proposed group penalized estimators with the quasi-likelihood criterion in the high-dimensional GACM (4). We allow p to grow with n at an almost exponential rate. Importantly, establishing these results is technically more difficult than in other work based on least squares, since the estimators from the penalized quasi-likelihood method have no closed form.

After selecting the important variables, the next question of interest is what shapes the nonzero coefficient functions may have. We then need an inferential tool to check whether a coefficient function has some specific parametric form. For example, when it is a constant or a linear function, the corresponding covariate has no interaction effect, or a linear interaction effect, with the other covariate, respectively. For global inference, we construct simultaneous confidence bands (SCBs) for the nonparametric additive functions based on a two-step estimation procedure. Using the selected variables, we first propose a refined two-step spline estimator for the function of interest, which is proved to have a pointwise asymptotic normal distribution and oracle efficiency. We then establish the bounds of the SCBs based on the distribution of the absolute maximum of a Gaussian process and on the strong approximation lemma [Csörgő and Révész (1981)]. Other related works on SCBs for nonparametric functions include Claeskens and Van Keilegom (2003), Hall and Titterington (1988) and Härdle and Marron (1991), among others. We provide an asymptotic formula for the standard deviation of the spline estimator of the coefficient function, which involves unknown population parameters to be estimated. The formula has a somewhat complex expression and contains many parameters, so direct estimation may not be accurate, particularly with small or moderate sample sizes. As an alternative, the bootstrap provides a reliable way to calculate the standard deviation while avoiding estimation of those population parameters. We apply the smoothed bootstrap method suggested by Efron (2014), who advocated that smoothing can improve coverage probability, to calculate the pointwise estimated standard deviations of the estimators of the coefficient functions. This method was originally proposed for calculating the estimated standard deviation of the estimate of a parameter of interest, such as a conditional mean. We extend it to the case of functional estimation. We demonstrate by simulation studies in Section 4 that, compared to the traditional resampling bootstrap, the smoothed bootstrap successfully improves the empirical coverage rate.

The paper is organized as follows. Section 2 introduces the B-spline estimation procedure for the nonparametric functions, describes the adaptive group Lasso estimators and the initial Lasso estimators and presents asymptotic results. Section 3 describes the two-step spline estimators and introduces the simultaneous confidence bands and the bootstrap methods for calculating the estimated standard deviation. Section 4 describes simulation studies, and Section 5 illustrates the method through the analysis of an obesity data set from a genome-wide association study. Proofs are in the Appendix and additional supplementary material [Ma et al. (2015)].

2. Penalization-based variable selection

Let $(Y_i, X_i^T, T_i^T)$, $i=1,\dots,n$, be random vectors that are independently and identically distributed as $(Y, X^T, T^T)$, where $X_i=(X_{i1},\dots,X_{id})^T$ and $T_i=(T_{i1},\dots,T_{ip})^T$. Write the negative quasi-likelihood function as $Q(\mu,y)=\int_{\mu}^{y}\{(y-\zeta)/V(\zeta)\}\,d\zeta$. Estimation of the mean function can be achieved by minimizing the negative quasi-likelihood of the observed data

$$\sum_{i=1}^{n}Q\big[g^{-1}\{\eta(X_i,T_i)\},Y_i\big]. \quad (5)$$

2.1. Spline approximation

We approximate the smooth functions $\alpha_{\ell k}(\cdot)$, $1\le k\le d$ and $1\le \ell\le p$, in (4) by B-splines. As in most work on nonparametric smoothing, estimation of the functions $\alpha_{\ell k}(\cdot)$ is conducted on compact sets. Without loss of generality, let the compact set be $\chi=[0,1]$. Let $G_n^0$ be the space of polynomial splines of order $q\ge 2$. We introduce a sequence of spline knots

$$t_{-(q-1)}=\cdots=t_{-1}=t_{0}=0<t_{1}<\cdots<t_{N}<1=t_{N+1}=\cdots=t_{N+q},$$

where $N\equiv N_n$ is the number of interior knots. In the following, let $J_n=N_n+q$. For $0\le j\le N$, let $H_j=t_{j+1}-t_j$ be the distance between neighboring knots and let $H=\max_{0\le j\le N}H_j$. Following Zhou, Shen and Wolfe (1998), to study asymptotic properties of the spline estimators for $\alpha_{\ell k}(\cdot)$, we assume that $\max_{0\le j\le N-1}|H_{j+1}-H_j|=o(N^{-1})$ and $H/\min_{0\le j\le N}H_j\le M$, where $M>0$ is a predetermined constant. Such an assumption is necessary for numerical implementation. In practice, we can use the sample quantiles as the locations of the knots. Let $\{b_{j,k}(x_k):1\le j\le J_n\}^T$ be the qth order B-spline basis functions given on page 87 of de Boor (2001). For positive numbers $a_n$ and $b_n$, $a_n\asymp b_n$ means that $\lim_{n\to\infty}a_n/b_n=c$, where c is some nonzero finite constant. For $1\le j\le J_n$, we adopt the centered B-spline functions given in Xue and Yang (2006), $B_{j,k}(x_k)=\sqrt{N}\,[\,b_{j,k}(x_k)-\{E(b_{j,k})/E(b_{1,k})\}b_{1,k}(x_k)]$, so that $E\{B_{j,k}(X_k)\}=0$ and $\mathrm{var}\{B_{j,k}(X_k)\}\asymp 1$. Define the space $G_n$ of additive spline functions as the linear space spanned by $B(x)=\{1, B_{j,k}(x_k), 1\le j\le J_n, 1\le k\le d\}^T$, where $x=(x_1,\dots,x_d)^T$. According to the result on page 149 of de Boor (2001), for $\alpha_{\ell k}(\cdot)$ satisfying condition (C3) in the Appendix, namely $\alpha_{\ell k}^{(r-1)}(x_k)\in C^{0,1}[0,1]$ for a given integer $r\ge 1$, where $C^{0,1}[0,1]$ is the space of Lipschitz continuous functions on [0, 1] defined in the Appendix, there is a function

$$\alpha_{\ell k}^{0}(x_k)=\sum_{j=1}^{J_n}\gamma_{j,\ell k}B_{j,k}(x_k)\in G_n^{0}, \quad (6)$$

such that $\sup_{x_k\in[0,1]}|\alpha_{\ell k}^{0}(x_k)-\alpha_{\ell k}(x_k)|=O(J_n^{-r})$. Then for every $1\le \ell\le p$, $\alpha_\ell(x)$ can be approximated well by a linear combination of spline functions in $G_n^{0}$, so that

$$\alpha_{\ell}(x)\approx\alpha_{\ell}^{0}(x)=\gamma_{\ell 0}+\sum_{k=1}^{d}\sum_{j=1}^{J_n}\gamma_{j,\ell k}B_{j,k}(x_k)=B(x)^{T}\gamma_{\ell}, \quad (7)$$

where $\gamma_{\ell}=(\gamma_{\ell 0},\gamma_{\ell 1}^{T},\dots,\gamma_{\ell d}^{T})^{T}$, in which $\gamma_{\ell k}=(\gamma_{j,\ell k}:1\le j\le J_n)^{T}$. Thus the minimization problem in (5) is equivalent to finding $\gamma^{0}=\{(\gamma_{\ell}^{0})^{T},1\le \ell\le p\}^{T}$, with $\gamma_{\ell}^{0}=(\gamma_{\ell 0}^{0},\gamma_{\ell 1}^{0T},\dots,\gamma_{\ell d}^{0T})^{T}$ and $\gamma_{\ell k}^{0}=(\gamma_{j,\ell k}^{0}:1\le j\le J_n)^{T}$, to minimize $\sum_{i=1}^{n}Q[g^{-1}\{\sum_{\ell=1}^{p}B(X_i)^{T}\gamma_{\ell}T_{i\ell}\},Y_i]$. The components of the additive coefficients are estimated by $\alpha_{\ell k}^{0}(x_k)=\sum_{j=1}^{J_n}\gamma_{j,\ell k}^{0}B_{j,k}(x_k)$ and $\alpha_{\ell 0}^{0}=\gamma_{\ell 0}^{0}$.
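As a concrete illustration of this construction, the following minimal Python sketch builds the centered B-spline design for one covariate, with quantile-based interior knots as suggested above; the function name and the exact indexing convention are ours, and the expectations in the centering are replaced by sample means.

```python
# Sketch: centered B-spline basis of Section 2.1 for a single covariate
# x in [0, 1], with N interior knots at sample quantiles and spline order q
# (cubic splines: q = 4). The J = N + q raw B-splines b_1, ..., b_J are
# centered as B_j = sqrt(N) * [ b_j - {E(b_j)/E(b_1)} b_1 ], here for
# j = 2, ..., J, with expectations estimated by sample means.
import numpy as np
from scipy.interpolate import BSpline

def centered_bspline_design(x, n_interior, q=4):
    x = np.asarray(x, dtype=float)
    N, degree = n_interior, q - 1
    interior = np.quantile(x, np.linspace(0, 1, N + 2)[1:-1])
    knots = np.concatenate([np.zeros(q), interior, np.ones(q)])
    J = len(knots) - q                      # = N + q raw basis functions
    raw = np.column_stack([BSpline(knots, np.eye(J)[j], degree)(x)
                           for j in range(J)])
    means = raw.mean(axis=0)                # sample estimate of E(b_j)
    B = np.sqrt(N) * (raw[:, 1:] - np.outer(raw[:, 0], means[1:] / means[0]))
    return B                                # n x (N + q - 1) centered design

# Example: design = centered_bspline_design(np.random.rand(300), n_interior=6)
```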

2.2. Adaptive group Lasso estimator

We now describe the procedure for estimating and selecting the additive coefficient functions by the adaptive group Lasso. The estimators are obtained by minimizing a penalized negative quasi-likelihood criterion. We establish asymptotic selection consistency as well as the convergence rate of the estimators to the true nonzero functions. For any vector $a=(a_1,\dots,a_s)^{T}$, let its $L_2$ norm be $\|a\|_2=(a_1^{2}+\cdots+a_s^{2})^{1/2}$. For any measurable $L_2$-integrable function $\phi$ on $[0,1]^{d}$, define the $L_2$ norm by $\|\phi\|_2^{2}=E\{\phi^{2}(X)\}$.

We are interested in identifying the significant components of the vector $T=(T_1,\dots,T_p)^{T}$. Let s, a fixed number, be the total number of nonzero $\alpha_\ell$'s, and let $I_1=\{\ell:\|\alpha_\ell\|\ne 0, 1\le \ell\le p\}$. Let $I_2$ be the complement of $I_1$; that is, $I_2=\{\ell:\alpha_\ell(\cdot)\equiv 0, 1\le \ell\le p\}$. Recalling the approximation given in (7), $\gamma_\ell$ is zero if and only if each element of $\gamma_\ell$ is zero; that is, $\|\gamma_\ell\|_2=0$. We apply the adaptive group Lasso approach of Huang, Horowitz and Wei (2010) for variable selection in model (4). In order to identify zero additive coefficients, we penalize the $L_2$ norm of the coefficients $\gamma_\ell$ for $1\le \ell\le p$. Let $w_n=(w_{n1},\dots,w_{np})^{T}$ be a given vector of weights, which needs to be chosen appropriately to achieve selection consistency; their choice is discussed in Section 2.3. We consider the penalized negative quasi-likelihood

$$L_n(\gamma)=\sum_{i=1}^{n}Q\Big[g^{-1}\Big\{\sum_{\ell=1}^{p}B^{T}(X_i)\gamma_{\ell}T_{i\ell}\Big\},Y_i\Big]+n\lambda_n\sum_{\ell=1}^{p}w_{n\ell}\|\gamma_{\ell}\|_2, \quad (8)$$

where $\lambda_n$ is a regularization parameter controlling the amount of shrinkage. The estimator $\hat\gamma=(\hat\gamma_1^{T},\dots,\hat\gamma_p^{T})^{T}$ is obtained by minimizing (8). The minimization of (8) is solved by local quadratic approximation as adopted by Fan and Li (2001).
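For readers who want a concrete template, the sketch below fits the same group-penalized criterion in the canonical logistic case, but by proximal gradient descent with groupwise soft-thresholding rather than the paper's local quadratic approximation; the function names, the fixed step size, and the restriction to the logistic likelihood are our simplifying assumptions.

```python
# Sketch: group-penalized logistic quasi-likelihood (8), minimized by
# proximal gradient with groupwise soft-thresholding (a substitute for the
# paper's local quadratic approximation). Z is the n x (p*m) spline design
# whose l-th block of m = 1 + d*J_n columns holds B(X_i)^T T_{il}; `groups`
# lists the column indices of each block; `w` holds the weights w_{nl}.
import numpy as np

def prox_group(gamma, groups, thresholds):
    out = gamma.copy()
    for idx, t in zip(groups, thresholds):
        norm = np.linalg.norm(gamma[idx])
        out[idx] = 0.0 if norm <= t else (1.0 - t / norm) * gamma[idx]
    return out

def fit_group_lasso_logistic(Z, y, groups, lam, w, n_iter=500):
    n, m = Z.shape
    step = 4.0 * n / np.linalg.norm(Z, 2) ** 2   # 1/L for the logistic loss
    gamma = np.zeros(m)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ gamma)))  # fitted probabilities
        grad = Z.T @ (mu - y) / n                # gradient of -loglik / n
        gamma = prox_group(gamma - step * grad, groups,
                           [step * lam * wl for wl in w])
    return gamma
```

With all weights equal to one this is the initial group Lasso of Section 2.3; plugging in data-driven weights gives the adaptive version, and a group with infinite weight is zeroed out, matching the $0\cdot\infty=0$ convention of Remark 1 below.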

For $\ell=1,\dots,p$, the $\ell$th additive coefficient function is estimated by

$$\hat\alpha_{\ell}(x)=\hat\gamma_{\ell 0}+\sum_{k=1}^{d}\sum_{j=1}^{J_n}\hat\gamma_{j,\ell k}B_{j,k}(x_k)=B^{T}(x)\hat\gamma_{\ell}.$$

We make the following two assumptions on the order requirements of the tuning parameters. Write $w_{n,I_1}=(w_{n\ell}:\ell\in I_1)$.

Assumption 1. $J_n^{2}\{n\log(n)\}^{-1}\to 0$ and $\lambda_n\|w_{n,I_1}\|_2\to 0$, as $n\to\infty$.

Assumption 2. $n\lambda_n\|w_{n,I_1}\|_2+n^{1/2}J_n^{1/2}\log(pJ_n)+nJ_n^{-r}=o(n\lambda_n w_{n\ell})$, for all $\ell\in I_2$.

The following theorem presents the selection consistency and estimation properties of the adaptive group Lasso estimators.

Theorem 1. Under conditions (C1)–(C5) in the Appendix and Assumptions 1 and 2: (i) as $n\to\infty$, $P(\|\hat\alpha_\ell\|>0, \ell\in I_1$ and $\|\hat\alpha_\ell\|=0, \ell\in I_2)\to 1$; and (ii) $\|\hat\alpha_\ell-\alpha_\ell\|=O_p(\lambda_n\|w_{n,I_1}\|_2+n^{-1/2}J_n^{1/2}+J_n^{-r})$, $\ell\in I_1$.

2.3. Choice of the weights

We now discuss how to choose the weights used in (8) based on initial estimates. For low-dimensional data settings with $p<n$, an unpenalized estimator such as the least squares estimator [Zou (2006)] can be used as an initial estimate. For high-dimensional settings with $p\gg n$, it has been argued [Meier and Bühlmann (2007)] that the Lasso estimator is a more appropriate choice. Following Huang, Horowitz and Wei (2010), we obtain an initial estimate with the group Lasso by minimizing

$$L_{n1}(\gamma)=\sum_{i=1}^{n}Q\Big[g^{-1}\Big\{\sum_{\ell=1}^{p}B(X_i)^{T}\gamma_{\ell}T_{i\ell}\Big\},Y_i\Big]+n\lambda_{n1}\sum_{\ell=1}^{p}\|\gamma_{\ell}\|_2,$$

with respect to $\gamma=(\gamma_1^{T},\dots,\gamma_p^{T})^{T}$. Denote the resulting estimator by $\tilde\gamma=(\tilde\gamma_1^{T},\dots,\tilde\gamma_p^{T})^{T}$. Let $\tilde I_1=\{\ell:\|\tilde\gamma_\ell\|_2\ne 0, 1\le \ell\le p\}$, and let $\tilde s$ be the number of elements in $\tilde I_1$.

Under conditions (C1)–(C5) in the Appendix, and when $\lambda_{n1}\ge Cn^{-1/2}J_n^{1/2}\log(pJ_n)$ for a sufficiently large constant C, we have: (i) the number of estimated nonzero functions is bounded, that is, as $n\to\infty$, there exists a constant $1<C_1<\infty$ such that $P(\tilde s\le C_1 s)\to 1$; (ii) if $\lambda_{n1}\to 0$, then $P(\|\tilde\gamma_\ell\|_2>0$ for all $\ell\in I_1)\to 1$; (iii) $\|\tilde\gamma-\gamma\|_2=O_p(\lambda_{n1}+n^{-1/2}J_n^{1/2}+J_n^{-r})$. We refer to Theorems 1(i) and (ii) of Huang, Horowitz and Wei (2010) for the proofs of (i) and (ii), and to the proof of Theorem 1 in our paper for (iii).

The weights we use are $w_{n\ell}=\|\tilde\gamma_\ell\|_2^{-1}$ if $\|\tilde\gamma_\ell\|_2>0$, and $w_{n\ell}=\infty$ if $\|\tilde\gamma_\ell\|_2=0$.
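In code, the weight rule is one line per group; a minimal sketch reusing the group-lasso fit above (function names are ours):

```python
# Sketch: adaptive weights w_nl = ||gamma_tilde_l||_2^{-1}, with w_nl = inf
# for groups the initial group Lasso estimates as zero; an infinite weight
# makes the proximal threshold infinite, so the group stays at zero (the
# 0 * inf = 0 convention of Remark 1).
import numpy as np

def adaptive_weights(gamma_tilde, groups):
    norms = [np.linalg.norm(gamma_tilde[idx]) for idx in groups]
    return [1.0 / v if v > 0 else np.inf for v in norms]
```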

Remark 1. Assumptions 1 and 2 give the order requirements for $J_n$ and $\lambda_n$. Based on the condition $J_n^{2}\{n\log(n)\}^{-1}\to 0$ given in Assumption 1, we need $J_n\ll\{n\log(n)\}^{1/2}$, where $a_n\ll b_n$ denotes $a_n/b_n=o(1)$ for positive numbers $a_n$ and $b_n$, and $\lambda_n$ needs to satisfy $n^{-1/2}J_n^{1/2}\log(pJ_n)\{\min_{\ell\in I_2}(w_{n\ell})\}^{-1}\ll\lambda_n\ll 1$. From the above theoretical properties of the group Lasso estimators, we know that, with probability approaching 1, $\|\tilde\gamma_\ell\|_2>0$ for the nonzero components, and then the corresponding weights $w_{n\ell}$ are bounded away from 0 and infinity for $\ell\in I_1$. By defining $0\cdot\infty=0$, the components not selected by the group Lasso are not included in the adaptive group Lasso procedure. Let $J_n\asymp n^{1/(2r+1)}$, so that $J_n$ has the optimal order for spline regression. If $p=\exp[o\{n^{2r/(2r+1)}\}]$, then $n^{-1/2}J_n^{1/2}\log(pJ_n)\to 0$. This means the dimension p can diverge with the sample size at an almost exponential rate.

2.4. Selection of tuning parameters

Tuning parameter selection always plays an important role in model and variable selection. An underfitted model can lead to severely biased estimation, and an overfitted model can seriously degrade the estimation efficiency. Among different data-driven methods, the Bayesian information criterion (BIC) tuning parameter selector has been shown to identify the true model consistently in the fixed-dimensional setting [Wang, Li and Tsai (2007)]. In the high-dimensional setting, an extended BIC (EBIC) and a generalized information criterion have been proposed by Chen and Chen (2008) and Fan and Tang (2013), respectively. In this paper, we adopt the EBIC [Chen and Chen (2008)] to select the tuning parameter $\lambda_n$ in (8). Specifically, EBIC($\lambda_n$) is defined as

$$2\sum_{i=1}^{n}Q\Big[g^{-1}\Big\{\sum_{\ell=1}^{p}B(X_i)^{T}\hat\gamma_{\ell}T_{i\ell}\Big\},Y_i\Big]+s^{*}(1+dJ_n)\log(n)+2\nu\log\binom{p}{s^{*}},$$

where $(\hat\gamma_\ell)_{\ell=1}^{p}$ is the minimizer of (8) for a given $\lambda_n$, $s^{*}$ is the number of nonzero estimated functions among $(\hat\alpha_\ell)_{\ell=1}^{p}$, and $0\le\nu\le 1$ is a constant. Here we use $\nu=0.5$. When $\nu=0$, the EBIC reduces to the ordinary BIC.

We use cubic B-splines for the nonparametric function estimation, so that q = 4. In the penalized estimation procedure, we let the number of interior knots be $N=\lfloor cn^{1/(2q+1)}\rfloor$, which satisfies the optimal order, where $\lfloor a\rfloor$ denotes the largest integer no greater than a and c is a constant. In the simulations, we take c = 2.
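Putting Sections 2.2–2.4 together, a minimal sketch of EBIC-based tuning for the logistic case follows; it reuses fit_group_lasso_logistic from above, reads the combinatorial term of the EBIC as $\log\binom{p}{s^{*}}$ following Chen and Chen (2008), and all function names are ours.

```python
# Sketch: select lambda_n by minimizing EBIC(lambda) with nu = 0.5 over a
# grid, for the logistic model where Q is the negative log-likelihood.
import numpy as np
from scipy.special import gammaln

def log_binom(p, s):
    return gammaln(p + 1) - gammaln(s + 1) - gammaln(p - s + 1)

def ebic_path(Z, y, groups, w, lam_grid, d, Jn, nu=0.5):
    n, p = len(y), len(groups)
    best_ebic, best_fit = np.inf, None
    for lam in lam_grid:
        gamma = fit_group_lasso_logistic(Z, y, groups, lam, w)
        eta = Z @ gamma
        neg_loglik = np.sum(np.logaddexp(0.0, eta) - y * eta)
        s_star = sum(np.linalg.norm(gamma[idx]) > 0 for idx in groups)
        ebic = (2 * neg_loglik + s_star * (1 + d * Jn) * np.log(n)
                + 2 * nu * log_binom(p, s_star))
        if ebic < best_ebic:
            best_ebic, best_fit = ebic, (lam, gamma)
    return best_fit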

3. Inference and the bootstrap smoothing procedure

3.1. Background

After model selection, the next step is to conduct statistical inference for the coefficient functions of the important variables. We will establish a simultaneous confidence band (SCB) based on a two-step estimator for global inference. An asymptotic formula for the SCB will be provided based on the distribution of the maximum of the normalized deviation of the spline functional estimate. To improve accuracy, we calculate the estimated standard deviation in the SCB by the nonparametric bootstrap smoothing method discussed in Efron (2014). For specificity, we focus on the construction for $\alpha_{\ell 1}(x_1)$, with $\alpha_{\ell k}(x_k)$ for $k\ge 2$ handled similarly, for $\ell\in\hat I_1$, where $\hat I_1=\{\ell:\|\hat\alpha_\ell\|\ne 0, 1\le \ell\le p\}$.

Although the one-step penalized estimation in Section 2 can quickly identify the nonzero coefficient functions, no asymptotic distribution is available for the resulting estimators. Thus we construct the SCB based on a refined two-step spline estimator of $\alpha_{\ell 1}(x_1)$, which will be shown to have the oracle property that the estimator of $\alpha_{\ell 1}(x_1)$ has the same asymptotic distribution as the univariate oracle estimator obtained by pretending that $\alpha_{\ell 0}$ and $\alpha_{\ell k}(X_k)$ for $\ell\in\hat I_1$, $k\ge 2$, and $\alpha_\ell(X)$ for $\ell\notin\hat I_1$ are known. See Horowitz, Klemelä and Mammen (2006), Horowitz and Mammen (2004) and Liu, Yang and Härdle (2013) for kernel-based two-step estimators in generalized additive models, which also have the oracle property but are not as computationally efficient as the two-step spline method. We next introduce the oracle estimator and the proposed two-step estimator before presenting the SCB.

3.2. Oracle estimator

In the following, we describe the oracle estimator of $\alpha_{\ell 1}(x_1)$. We rewrite model (4) as

$$\mu(X,T)=g^{-1}\{\eta(X,T)\},\qquad \eta(X,T)=\sum_{\ell\in\hat I_1}\alpha_{\ell 1}(X_1)T_{\ell}+\sum_{\ell\in\hat I_1}\Big\{\alpha_{\ell 0}+\sum_{k\ge 2}\alpha_{\ell k}(X_k)\Big\}T_{\ell}+\sum_{\ell\notin\hat I_1}\alpha_{\ell}(X)T_{\ell}. \quad (9)$$

By assuming that $\alpha_{\ell 0}$ and $\alpha_{\ell k}(X_k)$ for $\ell\in\hat I_1$, $k\ge 2$, and $\alpha_\ell(X)$ for $\ell\notin\hat I_1$ are known, estimation in (9) involves only the nonparametric functions $\alpha_{\ell 1}(X_1)$ of the scalar covariate $X_1$. It will be shown in Theorem 2 that the estimator achieves the univariate optimal convergence rate when the optimal order for the number of knots is applied. We estimate $\alpha_1(x_1)=\{\alpha_{\ell 1}(x_1), \ell\in\hat I_1\}^{T}$ by minimizing the negative quasi-likelihood function as follows. Denote the oracle estimator by $\hat\alpha_{\ell 1}^{OR}(x_1)=B_1^{S}(x_1)^{T}\hat\gamma_{\ell 1}^{OR}$, where $\hat\gamma_{\ell 1}^{OR}$ is defined directly below and $B_1^{S}(x_1)=\{B_{j,1}^{S}(x_1), 1\le j\le J_n^{S}\}^{T}$, in which $B_{j,1}^{S}(x_1)$ is the centered B-spline function defined in the same way as $B_{j,1}(x_1)$ in Section 2, but with $N^{S}=N_n^{S}$ interior knots and $J_n^{S}=N_n^{S}+q$. Rates of increase for $J_n^{S}$ are described in Assumptions 3 and 4 below. Let $\alpha_{\ell,-1}(X_i)=\alpha_{\ell 0}+\sum_{k\ge 2}\alpha_{\ell k}(X_{ik})$. Then $\hat\gamma_{\cdot,1}^{OR}=\{(\hat\gamma_{\ell 1}^{OR})^{T}, \ell\in\hat I_1\}^{T}$ is obtained by minimizing the negative quasi-likelihood

$$L_n^{OR}(\gamma_{\cdot,1})=\sum_{i=1}^{n}Q\Big[g^{-1}\Big\{\sum_{\ell\in\hat I_1}B_1^{S}(X_{i1})^{T}\gamma_{\ell 1}T_{i\ell}+\sum_{\ell\in\hat I_1}\alpha_{\ell,-1}(X_i)T_{i\ell}+\sum_{\ell\notin\hat I_1}\alpha_{\ell}(X_i)T_{i\ell}\Big\},Y_i\Big], \quad (10)$$

where $\gamma_{\cdot,1}=\{(\gamma_{\ell 1})^{T}, \ell\in\hat I_1\}^{T}$. Similarly, the oracle estimator of $\alpha_0=\{\alpha_{\ell 0}, \ell\in\hat I_1\}^{T}$, denoted $\hat\alpha_0^{OR}=\{\hat\alpha_{\ell 0}^{OR}, \ell\in\hat I_1\}^{T}=\{\hat\gamma_{\ell 0}^{OR}, \ell\in\hat I_1\}^{T}$, is obtained by minimizing $L_n^{OR}(\gamma_{\cdot,0})=\sum_{i=1}^{n}Q[g^{-1}\{\sum_{\ell\in\hat I_1}\gamma_{\ell 0}T_{i\ell}+\sum_{\ell\in\hat I_1}\alpha_{\ell,-0}(X_i)T_{i\ell}+\sum_{\ell\notin\hat I_1}\alpha_{\ell}(X_i)T_{i\ell}\},Y_i]$, where $\gamma_{\cdot,0}=(\gamma_{\ell 0}, \ell\in\hat I_1)$ and $\alpha_{\ell,-0}(X_i)=\sum_{k=1}^{d}\alpha_{\ell k}(X_{ik})$.

3.3. Initial estimator

The oracle estimator is infeasible because it assumes knowledge of the other functions. In order to obtain the two-step estimators of $\alpha_{\ell 1}(x_1)$ for $\ell\in\hat I_1$, we first need initial estimators of $\alpha_{\ell 0}$ and $\alpha_{\ell k}(x_k)$ for $k\ge 2$ and $\ell\in\hat I_1$, denoted $\hat\alpha_{\ell 0}^{ini}=\hat\gamma_{\ell 0}^{ini}$ and $\hat\alpha_{\ell k}^{ini}(x_k)=B_k^{ini}(x_k)^{T}\hat\gamma_{\ell k}^{ini}$, where $B_k^{ini}(x_k)=\{B_{j,k}^{ini}(x_k):1\le j\le J_n^{ini}\}^{T}$ and the $B_{j,k}^{ini}(x_k)$ are B-spline functions with $N_n^{ini}$ interior knots and $J_n^{ini}=N_n^{ini}+q$. Rates of increase for $J_n^{ini}$ are described in Assumptions 3 and 4 below. We need an undersmoothed procedure in the first step, so that the approximation bias is reduced and the difference between the two-step and oracle estimators is asymptotically negligible. We obtain $\hat\gamma_{\hat I_1}^{ini}=\{(\hat\gamma_{\ell}^{ini})^{T}:\ell\in\hat I_1\}^{T}$, where $\hat\gamma_{\ell}^{ini}=\{\hat\gamma_{\ell 0}^{ini},(\hat\gamma_{\ell k}^{ini})^{T}\}^{T}$, by minimizing the negative quasi-likelihood $\sum_{i=1}^{n}Q[g^{-1}\{\sum_{\ell\in\hat I_1}B(X_i)^{T}\gamma_{\ell}T_{i\ell}\},Y_i]$. The adaptive group Lasso penalized estimator $\hat\gamma_{\hat I_1}=\{(\hat\gamma_\ell)^{T}:\ell\in\hat I_1\}^{T}$ obtained in Section 2 could also be used as the initial estimator. We, however, refit the model with the selected variables and use the resulting $\hat\gamma_{\hat I_1}^{ini}$ in order to improve estimation accuracy in high-dimensional data settings.

3.4. Final estimator

In the second step, we construct the two-step estimator of $\alpha_{\ell 1}$ for $\ell\in\hat I_1$. We replace $\alpha_{\ell 0}$ and $\alpha_{\ell k}(X_k)$ by the initial estimators $\hat\alpha_{\ell 0}^{ini}$ and $\hat\alpha_{\ell k}^{ini}(X_k)$ for $\ell\in\hat I_1$ and $k\ge 2$, and replace $\alpha_\ell(X)$ for $\ell\notin\hat I_1$ by $\hat\alpha_\ell(X)=0$. Let $\hat\alpha_{\ell,-1}^{ini}(X_i)=\hat\alpha_{\ell 0}^{ini}+\sum_{k\ge 2}\hat\alpha_{\ell k}^{ini}(X_{ik})$. Denote the two-step spline estimator of $\alpha_{\ell 1}(x_1)$ by $\hat\alpha_{\ell 1}^{S}(x_1)=B_1^{S}(x_1)^{T}\hat\gamma_{\ell 1}^{S}$, with $\hat\gamma_{\cdot,1}^{S}=\{(\hat\gamma_{\ell 1}^{S})^{T}, \ell\in\hat I_1\}^{T}$ minimizing

$$L_n^{S}(\gamma_{\cdot,1})=\sum_{i=1}^{n}Q\Big[g^{-1}\Big\{\sum_{\ell\in\hat I_1}B_1^{S}(X_{i1})^{T}\gamma_{\ell 1}T_{i\ell}+\sum_{\ell\in\hat I_1}\hat\alpha_{\ell,-1}^{ini}(X_i)T_{i\ell}+\sum_{\ell\notin\hat I_1}\hat\alpha_{\ell}(X_i)T_{i\ell}\Big\},Y_i\Big]. \quad (11)$$

The two-step estimator of $\alpha_{\ell 0}$, denoted $\hat\alpha_{\ell 0}^{S}=\hat\gamma_{\ell 0}^{S}$, is obtained in the same way as $\hat\alpha_{\ell 0}^{OR}$, replacing $\alpha_{\ell,-0}(X_i)$ with $\hat\alpha_{\ell,-0}^{ini}(X_i)=\sum_{k=1}^{d}\hat\alpha_{\ell k}^{ini}(X_{ik})$ for $\ell\in\hat I_1$ and replacing $\alpha_\ell(X_i)$ with $\hat\alpha_\ell(X_i)=0$ for $\ell\notin\hat I_1$. Let $\hat\alpha_0^{S}=\{\hat\alpha_{\ell 0}^{S}, \ell\in\hat I_1\}^{T}$.
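For the logistic case, the second-step fit (11) is an ordinary spline logistic regression with the first-step pieces held fixed as an offset; a minimal Newton–Raphson sketch under that assumption (names ours):

```python
# Sketch: second-step fit (11) for a logistic model. `offset` is the frozen
# part sum_{l in I_hat} alpha_hat_ini_{l,-1}(X_i) T_{il} (the non-selected
# terms contribute 0), and Z1 is the n x (s* J_n^S) design whose blocks are
# B_1^S(X_i1) T_{il} for l in I_hat.
import numpy as np

def two_step_fit(Z1, offset, y, n_iter=25):
    gamma = np.zeros(Z1.shape[1])
    for _ in range(n_iter):                       # Newton-Raphson updates
        eta = offset + Z1 @ gamma
        mu = 1.0 / (1.0 + np.exp(-eta))
        W = mu * (1.0 - mu)                       # logistic variance weights
        grad = Z1.T @ (y - mu)
        hess = (Z1 * W[:, None]).T @ Z1
        gamma += np.linalg.solve(hess, grad)
    return gamma                                  # stacked gamma_hat^S_{l1}
```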

3.5. Asymptotic normality and uniform oracle efficiency

We now establish the asymptotic normality and uniform oracle efficiency of the oracle and final estimators. Let $Z_{i\ell j,1}=B_{j,1}^{S}(X_{i1})T_{i\ell}$ and $Z_{i,1}=(Z_{i\ell j,1}, 1\le j\le J_n^{S}, \ell\in\hat I_1)^{T}$. Let $s^{*}$ be the number of elements in $\hat I_1$. By Theorem 1, $P(s^{*}=s)\to 1$. For simplicity of notation, write $\sigma_i^{2}=\sigma^{2}(X_i,T_i)$ and $\eta_i=\eta(X_i,T_i)$. Define the $s^{*}\times s^{*}J_n^{S}$ block-diagonal matrix
$$\mathbf{B}^{S}(x_1)=\begin{bmatrix} B_{1,1}^{S}(x_1)\cdots B_{J_n^{S},1}^{S}(x_1) & & \mathbf{0}\\ & \ddots & \\ \mathbf{0} & & B_{1,1}^{S}(x_1)\cdots B_{J_n^{S},1}^{S}(x_1)\end{bmatrix}.$$

To establish the asymptotic distribution of the two-step estimator, in addition to Assumptions 1 and 2 given in Section 2, we make the following two assumptions on the numbers of basis functions $J_n^{S}$ and $J_n^{ini}$:

Assumption 3. (i) $s(J_n^{S})^{2}\{n\log(n)\}^{-1}=o(1)$ and $s(J_n^{S})^{-r}=o(1)$; and (ii) $n(\log n)^{-1}(J_n^{S}J_n^{ini})^{-1}\to\infty$, as $n\to\infty$.

Assumption 4. $(n/J_n^{S})^{1/2}(J_n^{ini})^{-r}\to 0$, as $n\to\infty$.

First we describe the asymptotic normality of the oracle estimator $\hat\alpha_{\ell 1}^{OR}(x_1)$ of $\alpha_{\ell 1}(x_1)$. Let $\hat\alpha_1^{OR}(x_1)=\{\hat\alpha_{\ell 1}^{OR}(x_1), \ell\in\hat I_1\}^{T}$. Let $b_1(x_1)=E\{\hat\alpha_1^{OR}(x_1)\mid\mathbf{X},\mathbf{T}\}$ and $b_{\ell 1}(x_1)=E\{\hat\alpha_{\ell 1}^{OR}(x_1)\mid\mathbf{X},\mathbf{T}\}$ for $\ell\in\hat I_1$, where $(\mathbf{X},\mathbf{T})=(X_i,T_i)_{i=1}^{n}$.

Theorem 2. Under conditions (C1)–(C5) and Assumption 3(i), for any vector $a\in R^{s^{*}}$ with $\|a\|_2=1$ and any $x_1\in[0,1]$, $a^{T}\sigma_n^{-1}(x_1)\{\hat\alpha_1^{OR}(x_1)-b_1(x_1)\}\to N(0,1)$, where

$$\sigma_n^{2}(x_1)=\mathbf{B}^{S}(x_1)\Big[\sum_{i=1}^{n}Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\eta_i)\}^{2}/\sigma_i^{2}\Big]^{-1}\mathbf{B}^{S}(x_1)^{T}, \quad (12)$$

in which $\dot g^{-1}(\eta_i)$ is the first-order derivative of $g^{-1}(\eta)$ with respect to $\eta$, evaluated at $\eta_i$, and

$$\sum_{\ell\in\hat I_1}\|\hat\alpha_{\ell 1}^{OR}-b_{\ell 1}\|^{2}=O_p(sJ_n^{S}n^{-1}),\qquad \sum_{\ell\in\hat I_1}\|b_{\ell 1}-\alpha_{\ell 1}\|^{2}=O_p\{s^{2}(J_n^{S})^{-2r}\}.$$

Thus for $\ell\in\hat I_1$, $\sigma_{n,\ell 1}^{-1}(x_1)\{\hat\alpha_{\ell 1}^{OR}(x_1)-b_{\ell 1}(x_1)\}\to N(0,1)$, where
$$\sigma_{n,\ell 1}^{2}(x_1)=e_{\ell}^{T}\sigma_n^{2}(x_1)e_{\ell}, \quad (13)$$
and $e_\ell$ is the $s^{*}$-dimensional vector with $\ell$th element 1 and all other elements 0; moreover $\|\hat\alpha_0^{OR}-\alpha_0\|_2=O_p\{(s/n)^{1/2}\}$.
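A plug-in evaluation of (12)–(13) is straightforward once the design and the fitted values are in hand; the sketch below assumes the canonical logistic link, for which $\{\dot g^{-1}(\eta_i)\}^{2}/\sigma_i^{2}$ reduces to $\mu_i(1-\mu_i)$, and the names are ours.

```python
# Sketch: plug-in estimate of sigma_n^2(x1) in (12) and its diagonal (13)
# for the logistic link, where {g^{-1}-dot(eta)}^2 / sigma^2 = mu(1 - mu).
# BS_x is the s* x (s* J_n^S) block matrix B^S(x1); Z1 is the n x (s* J_n^S)
# design of Section 3.5; eta_hat is the fitted linear predictor.
import numpy as np

def sigma_n(BS_x, Z1, eta_hat):
    mu = 1.0 / (1.0 + np.exp(-eta_hat))
    W = mu * (1.0 - mu)
    M = (Z1 * W[:, None]).T @ Z1            # sum_i Z Z^T mu_i(1 - mu_i)
    S = BS_x @ np.linalg.solve(M, BS_x.T)   # the s* x s* matrix sigma_n^2(x1)
    return np.sqrt(np.diag(S))              # sigma_{n,l1}(x1), l in I_hat
```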

The next result shows the uniform oracle efficiency of the two-step estimator: the difference between the two-step estimator $\hat\alpha_1^{S}(x_1)$ and the oracle estimator $\hat\alpha_1^{OR}(x_1)$ is uniformly asymptotically negligible, and thus the two-step estimator is oracle in the sense that it has the same asymptotic distribution as the oracle estimator. Let $\hat\alpha_1^{S}(x_1)=\{\hat\alpha_{\ell 1}^{S}(x_1), \ell\in\hat I_1\}^{T}$.

Theorem 3. Under conditions (C1)–(C5) in the Appendix and Assumptions 1–3,

$$\sup_{x_1\in[0,1]}\|\hat\alpha_1^{S}(x_1)-\hat\alpha_1^{OR}(x_1)\|_{\infty}=O_p\{(n^{-1}\log n)^{1/2}+(J_n^{ini})^{-r}\},$$

$\|\hat\alpha_0^{S}-\hat\alpha_0^{OR}\|_2=o_p(n^{-1/2})$, and furthermore, under Assumption 4,

$$\sup_{x_1\in[0,1]}\big|a^{T}\sigma_n^{-1}(x_1)\{\hat\alpha_1^{S}(x_1)-\hat\alpha_1^{OR}(x_1)\}\big|=o_p(1),$$

for any vector $a\in R^{s^{*}}$ with $\|a\|_2=1$ and $\sigma_n^{2}(x_1)$ given in (12). Hence, for any $x_1\in[0,1]$, $a^{T}\sigma_n^{-1}(x_1)\{\hat\alpha_1^{S}(x_1)-b_1(x_1)\}\to N(0,1)$.

Remark 2. Under Assumptions 1 and 2, by Theorem 1, with probability approaching 1, $s^{*}=s$, which is a fixed number. In the second step, by letting $J_n^{S}\asymp n^{1/(2r+1)}$, the nonparametric functions $\alpha_{\ell 1}$ for $\ell\in\hat I_1$ are approximated by spline functions with the optimal number of knots. By the conditions $(n/J_n^{S})^{1/2}(J_n^{ini})^{-r}\to 0$ and $n(\log n)^{-1}(J_n^{S}J_n^{ini})^{-1}\to\infty$ given in Assumptions 3 and 4, $J_n^{ini}$ needs to satisfy $n^{1/(2r+1)}\ll J_n^{ini}\ll n^{2r/(2r+1)}(\log n)^{-1}$, where $r\ge 1$. When the adaptive group lasso estimator is used as the initial estimator, Assumption 1 requires $J_n^{ini}\ll\{n\log(n)\}^{1/2}$. Hence $n^{1/(2r+1)}\ll J_n^{ini}\ll\{n\log(n)\}^{1/2}$. We therefore can let $J_n^{ini}\asymp n^{(1+\vartheta)/(2r+1)}$, where $\vartheta$ is any small positive number close to 0. This increase in the number of basis functions ensures undersmoothing in the first step, so that the uniform difference between the two-step and the oracle estimators becomes asymptotically negligible. Based on Assumptions 1 and 2, the tuning parameter $\lambda_n$ needs to satisfy $n^{-1/2}(J_n^{ini})^{1/2}\log(pJ_n^{ini})\{\min_{\ell\in I_2}(w_{n\ell})\}^{-1}\ll\lambda_n\ll 1$.

Remark 3. The number of interior knots has the same order requirement as the number of basis functions. In the first step, with the undersmoothing requirement discussed in Remark 2, we let the number of interior knots be $N^{ini}=\lfloor cn^{(1+0.01)/(2q+1)}\rfloor$, where c is a constant, by assuming that r = q. In the simulations, we let c = 2. In the second-step estimation, we select the number of knots $N^{S}$ over the range $[\lfloor n^{1/(2q+1)}\rfloor, \lfloor 2n^{1/(2q+1)}\rfloor]$ by minimizing $\mathrm{BIC}(N^{S})=2L_n^{S}(\hat\gamma_{\cdot,1}^{S})+d(N^{S}+q)\log(n)$.

3.6. Simultaneous confidence bands

In this section, we propose an SCB for $\alpha_{\ell 1}(x_1)$ by studying the asymptotic behavior of the maximum of the normalized deviation of the spline functional estimate. To construct asymptotic SCBs for $\alpha_{\ell 1}(x_1)$ over the interval $x_1\in[0,1]$ with confidence level $100(1-\alpha)\%$, $\alpha\in(0,1)$, we need to find two functions $l_{\ell n}(x_1)$ and $u_{\ell n}(x_1)$ such that

$$\lim_{n\to\infty}P\big(l_{\ell n}(x_1)\le\alpha_{\ell 1}(x_1)\le u_{\ell n}(x_1)\ \text{for all}\ x_1\in[0,1]\big)=1-\alpha. \quad (14)$$

In practice, we consider a variant of (14) and construct SCBs over a subset $S_{n,1}$ of [0, 1], with $S_{n,1}$ becoming denser as $n\to\infty$. We therefore partition [0, 1] into equally spaced intervals with $0<\xi_0<\xi_1<\cdots<\xi_{L_n}<\xi_{L_n+1}=1$, where $L_n\to\infty$ as $n\to\infty$, and let $S_{n,1}=(\xi_0,\dots,\xi_{L_n})$. Define $d_{L_n}(\alpha)=1-\{2\log(L_n+1)\}^{-1}[\log\{-(1/2)\log(1-\alpha)\}+(1/2)\{\log\log(L_n+1)+\log(4\pi)\}]$ and $Q_{L_n}(\alpha)=\{2\log(L_n+1)\}^{1/2}d_{L_n}(\alpha)$.
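The critical value depends only on $L_n$ and $\alpha$ and can be computed directly from the displayed formula; a one-function sketch:

```python
# Sketch: the SCB critical value Q_Ln(alpha) of Section 3.6, computed from
# the displayed formula for d_Ln(alpha).
import numpy as np

def q_ln(alpha, Ln):
    log_l = np.log(Ln + 1.0)
    d = 1.0 - (2.0 * log_l) ** -1 * (np.log(-0.5 * np.log(1.0 - alpha))
                                     + 0.5 * (np.log(log_l) + np.log(4.0 * np.pi)))
    return np.sqrt(2.0 * log_l) * d

# Example: q_ln(0.05, 20) gives the 95% critical value for Ln = 20 grid
# points, the grid size used in the simulation study of Section 4.
```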

Theorem 4. Under conditions (C1)–(C5) in the Appendix, and with $L_n\asymp J_n^{S}\asymp n^{1/(2r+1)}$ and $n^{1/(2r+1)}\ll J_n^{ini}\ll n^{2r/(2r+1)}\{\log(n)\}^{-1}$, we have
$$\lim_{n\to\infty}P\Big\{\sup_{x_1\in S_{n,1}}\big|\sigma_{n,\ell 1}^{-1}(x_1)\{\hat\alpha_{\ell 1}^{S}(x_1)-\alpha_{\ell 1}(x_1)\}\big|\le Q_{L_n}(\alpha)\Big\}=1-\alpha,$$

and thus an asymptotic 100(1 − α)% confidence band for αℓ1(x1) over x1Sn,1 is

$$\hat\alpha_{\ell 1}^{S}(x_1)\pm\sigma_{n,\ell 1}(x_1)Q_{L_n}(\alpha). \quad (15)$$

Remark 4. Compared to pointwise confidence intervals of width $2Z_{1-\alpha/2}\,\sigma_{n,\ell 1}(x_1)$, the width of the confidence bands (15) is inflated by the factor $\{2\log(L_n+1)\}^{1/2}d_{L_n}(\alpha)/Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution.

3.7. Bootstrap smoothing for calculating the standard error

Theorem 4 establishes the threshold value $Q_{L_n}(\alpha)$ for the SCB. One critical question is how to estimate the standard deviation $\sigma_{n,\ell 1}(x_1)$ in order to construct the SCB. We can use a sample estimate of $\sigma_{n,\ell 1}(x_1)$ according to the asymptotic formula given in (12), but this may have approximation error and thus lead to inaccurate inference. The bootstrap estimate of the standard deviation provides an alternative. We here propose a bootstrap smoothed confidence band by adopting the nonparametric bootstrap smoothing idea of Efron (2014), which can eliminate discontinuities in jumpy estimates. The procedure is described as follows.

Let $D=\{D_1,\dots,D_n\}$ be the observed data, where $D_i=\{Y_i,X_i,(T_{i\ell},\ell\in\hat I_1)\}$. Denote by $D^{*}=\{D_1^{*},\dots,D_n^{*}\}$ a nonparametric bootstrap sample from $\{D_1,\dots,D_n\}$, and by $D_{(j)}^{*}=\{D_{(j)1}^{*},\dots,D_{(j)n}^{*}\}$ the jth bootstrap sample in B draws. Let $\hat\alpha_{\ell 1,(j)}^{S*}(x_1)$ be the two-step estimator of $\alpha_{\ell 1}(x_1)$ computed from $D_{(j)}^{*}$. We first present the empirical standard deviation from the traditional resampling method, given as

$$\hat\sigma_{\ell 1,B}(x_1)=\Big[\sum_{j=1}^{B}\big\{\hat\alpha_{\ell 1,(j)}^{S*}(x_1)-\hat\alpha_{\ell 1,\cdot}^{S*}(x_1)\big\}^{2}\big/(B-1)\Big]^{1/2}, \quad (16)$$

where $\hat\alpha_{\ell 1,\cdot}^{S*}(x_1)=\sum_{j=1}^{B}\hat\alpha_{\ell 1,(j)}^{S*}(x_1)/B$. Then a $100(1-\alpha)\%$ unsmoothed bootstrap SCB for $\alpha_{\ell 1}(x_1)$ over $x_1\in S_{n,1}$ is given as

$$\hat\alpha_{\ell 1}^{S}(x_1)\pm\hat\sigma_{\ell 1,B}(x_1)Q_{L_n}(\alpha). \quad (17)$$

Another choice is the smoothed bootstrap SCB, which eliminates discontinuities in the estimates [Efron (2014)]. Let

$$\tilde\alpha_{\ell 1}^{S}(x_1)=\sum_{j=1}^{B}\hat\alpha_{\ell 1,(j)}^{S*}(x_1)/B$$

be the smoothed estimate of $\alpha_{\ell 1}(x_1)$, obtained by averaging over the bootstrap replications. Let $C_{(j)i}^{*}=\#\{D_{(j)i'}^{*}=D_i\}$ be the number of elements of $D_{(j)}^{*}$ equal to $D_i$.

Proposition 1. At each point $x_1\in S_{n,1}$, the nonparametric delta-method estimate of the standard deviation of the smoothed bootstrap statistic $\tilde\alpha_{\ell 1}^{S}(x_1)$ is $\tilde\sigma_{\ell 1}(x_1)=\{\sum_{i=1}^{n}\mathrm{cov}_i^{2}(x_1)\}^{1/2}$, where $\mathrm{cov}_i(x_1)=\mathrm{cov}\{C_{(j)i}^{*},\hat\alpha_{\ell 1,(j)}^{S*}(x_1)\}$ is the bootstrap covariance between $C_{(j)i}^{*}$ and $\hat\alpha_{\ell 1,(j)}^{S*}(x_1)$.

The proof of Proposition 1 essentially follows the same arguments as the proof for Theorem 1 in Efron (2014). Based on Proposition 1, to construct the smoothed bootstrap SCB, we use the nonparametric estimate of the standard deviation given as

$$\tilde\sigma_{\ell 1,B}(x_1)=\Big\{\sum_{i=1}^{n}\widehat{\mathrm{cov}}_{i,B}^{2}(x_1)\Big\}^{1/2}, \quad (18)$$
where

$$\widehat{\mathrm{cov}}_{i,B}(x_1)=\sum_{j=1}^{B}\big(C_{(j)i}^{*}-\bar C_{i}^{*}\big)\big\{\hat\alpha_{\ell 1,(j)}^{S*}(x_1)-\hat\alpha_{\ell 1,\cdot}^{S*}(x_1)\big\}\big/B,$$

with $\bar C_{i}^{*}=\sum_{j=1}^{B}C_{(j)i}^{*}/B$. The $100(1-\alpha)\%$ smoothed bootstrap SCB for $\alpha_{\ell 1}(x_1)$ over $x_1\in S_{n,1}$ is given as

$$\tilde\alpha_{\ell 1}^{S}(x_1)\pm\tilde\sigma_{\ell 1,B}(x_1)Q_{L_n}(\alpha). \quad (19)$$
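The whole of Section 3.7 reduces to a few array operations once the B bootstrap fits are stored; a minimal sketch of (16), (18) and (19), assuming the bootstrap estimates and resampling counts are held in two arrays (names ours):

```python
# Sketch: unsmoothed sd (16) and Efron-style smoothed sd (18) at the grid
# points of S_{n,1}. alpha_boot[j, m] = two-step estimate from bootstrap
# sample j at grid point m; counts[j, i] = C*_{(j)i}, the number of times
# observation i appears in bootstrap sample j.
import numpy as np

def bootstrap_sds(alpha_boot, counts):
    B = alpha_boot.shape[0]
    alpha_bar = alpha_boot.mean(axis=0)                # smoothed estimate
    sd_unsmoothed = alpha_boot.std(axis=0, ddof=1)     # formula (16)
    dev_c = counts - counts.mean(axis=0)               # C*_{(j)i} - Cbar*_i
    cov_hat = dev_c.T @ (alpha_boot - alpha_bar) / B   # n x grid covariances
    sd_smoothed = np.sqrt((cov_hat ** 2).sum(axis=0))  # formula (18)
    return alpha_bar, sd_unsmoothed, sd_smoothed

# Band (19): alpha_bar -/+ sd_smoothed * q_ln(0.05, Ln) at each grid point.
```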

4. A simulation study

In this section, we present a simulation study to evaluate the finite-sample performance of the proposed penalized estimation procedure and the simultaneous confidence bands. Further numerical studies can be found in the supplementary materials [Ma et al. (2015)].

Example 1. In this example, we use 1286 SNPs located on the sixth chromosome from the Framingham Heart Study to simulate the binary response from the logistic model

$$\mathrm{logit}\{P(Y_i=1\mid X_i,T_i)\}=\sum_{\ell=1}^{p}\alpha_{\ell}(X_i)T_{i\ell}=\sum_{\ell=1}^{p}\Big\{\alpha_{\ell 0}+\sum_{k=1}^{2}\alpha_{\ell k}(X_{ik})\Big\}T_{i\ell}, \quad (20)$$

with the four SNPs ss66063578, ss66236230, ss66194604 and ss66533844, selected from the real data analysis in Section 5, as important covariates and the other SNPs as unimportant covariates, so that s = 4 (the number of important covariates), p = 1286 and the sample size n = 300. The three possible allele combinations are coded as 1, 0 and −1 for each SNP. The covariates $X_{ik}$, k = 1, 2, are simulated environmental effects, generated from independent uniform distributions on [0, 1]. We set the coefficient functions to $\alpha_{10}=0.5$, $\alpha_{11}(x_1)=4\cos(2\pi x_1)$, $\alpha_{12}(x_2)=5\{(2x_2-1)^{2}-1/3\}$, $\alpha_{20}=0.5$, $\alpha_{21}(x_1)=6x_1-3$, $\alpha_{22}(x_2)=4\{\sin(2\pi x_2)+\cos(2\pi x_2)\}$, $\alpha_{30}=0.5$, $\alpha_{31}(x_1)=4\sin(2\pi x_1)$, $\alpha_{32}(x_2)=6x_2-3$, $\alpha_{40}=0.5$, $\alpha_{41}(x_1)=4\cos(2\pi x_1)$, $\alpha_{42}(x_2)=5\{(2x_2-1)^{2}-1/3\}$, and $\alpha_{\ell}(X_i)=0$ for $\ell=5,\dots,1286$. We conducted 500 replications for each simulation. We fit the data with the GACM (20) by the adaptive group lasso (AGL) and the group lasso (GL). In the literature, the generalized varying coefficient model [GVCM; Lian (2012)], which allows one index variable in the coefficient function of each predictor $T_{i\ell}$, has been widely used to study nonlinear interactions. To apply the GVCM method [Lian (2012)] in this setting, we first perform principal component analysis (PCA) on $X_i$ and then use the first principal component as the index variable in the GVCM. We then apply the AGL and GL methods to the GVCM $\mathrm{logit}\{P(Y_i=1\mid X_i,T_i)\}=\sum_{\ell=1}^{p}\alpha_{\ell}(U_i)T_{i\ell}$, where $U_i$ is the first principal component obtained by PCA on $X_i$. Moreover, we also fit the data with parametric logistic regression by assuming the linear coefficient functions (3), using the AGL method. We further compare our proposed method with the conventional screening method based on parametric logistic regression for genome-wide association studies [GWAS; Murcray, Lewinger and Gauderman (2009)]. In the screening method, we fit a logistic model for each SNP, $\mathrm{logit}\{P(Y_i=1\mid X_i,T_i)\}=\alpha_0+\alpha^{T}X_i+\beta_{\ell}T_{i\ell}+\sum_{k=1}^{2}\beta_{\ell k}X_{ik}T_{i\ell}$, for $\ell=1,\dots,1286$, and then conduct a likelihood ratio test of the genetic and interaction effects, $H_0:\beta_{\ell}=\beta_{\ell 1}=\beta_{\ell 2}=0$. Let $\alpha_0=0.05$ be the overall type I error for the study and M = 1286 the number of SNPs in this study. We apply the multiple testing correction procedure for GWAS, rejecting $H_0$ when the p-value is less than $\alpha_0/M_{\mathrm{eff}}$, where $M_{\mathrm{eff}}$ is the Cheverud–Nyholt estimate of the effective number of tests [Cheverud (2001), Nyholt (2004)], calculated as $M_{\mathrm{eff}}=1+M^{-1}\sum_{j=1}^{M}\sum_{k=1}^{M}(1-r_{jk}^{2})$ with $r_{jk}$ the correlation coefficients of the SNPs; we obtain $M_{\mathrm{eff}}=1275.65$.
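For reproducibility, here is a minimal sketch that generates one replicate of this design; the real chromosome-6 SNP matrix is not reproduced here, so a random −1/0/1 placeholder stands in for it (an assumption flagged in the comments).

```python
# Sketch: one replicate of Example 1's logistic GACM with the stated
# coefficient functions; `snp` is a placeholder for the n x p matrix of
# SNPs coded -1/0/1 (the paper uses real chromosome-6 genotypes, with the
# four causal SNPs in the first four columns here).
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 1286
snp = rng.integers(-1, 2, size=(n, p)).astype(float)   # placeholder genotypes
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)      # environmental effects

alpha = np.zeros((n, p))                                # coefficient surfaces
alpha[:, 0] = 0.5 + 4*np.cos(2*np.pi*x1) + 5*((2*x2 - 1)**2 - 1/3)
alpha[:, 1] = 0.5 + (6*x1 - 3) + 4*(np.sin(2*np.pi*x2) + np.cos(2*np.pi*x2))
alpha[:, 2] = 0.5 + 4*np.sin(2*np.pi*x1) + (6*x2 - 3)
alpha[:, 3] = 0.5 + 4*np.cos(2*np.pi*x1) + 5*((2*x2 - 1)**2 - 1/3)
# columns 4..p-1 stay zero: the remaining SNPs carry no signal

eta = (alpha * snp).sum(axis=1)                         # linear predictor
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))         # binary response
```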

Table 1 presents the percentages of correct fitting (C) (exactly the important covariates are selected), overfitting (O) (both the important covariates and some unimportant covariates are selected) and incorrect fitting (I) (some of the important covariates are not selected); the average true positives (TP), that is, the average number of selected covariates among the important covariates; the average false positives (FP), that is, the average number of selected covariates among the unimportant covariates; and the average model error (MR), defined as $\sum_{i=1}^{n}\{\hat\mu_i(X_i,T_i)-\mu_i(X_i,T_i)\}^{2}/n$, where $\hat\mu_i(X_i,T_i)$ and $\mu_i(X_i,T_i)$ are the estimated and true conditional means of $Y_i$, respectively. We see that, when fitting the proposed GACM, the GL method has a larger percentage of overfitting as well as larger average false positives than the AGL method. The AGL improves the correct-fitting percentage by 26%. As a result, the AGL reduces the model fitting error by (0.083 − 0.059)/0.059 = 40.7% compared to the GL method. Moreover, both the logistic model and the GVCM fail to identify the important covariates, with incorrect-fitting percentages close to or equal to 1. Furthermore, with the screening method based on logistic regression, the average true positive count is 1.056, far below 4 (the number of important SNPs). This further illustrates that the traditional screening method is not an effective tool for identifying important genetic factors in this context. In addition, we observe that the results for the AGL method in Table 1 are comparable to those in Table S.1 of Example 2 (in the supplementary materials) at p = 1000 with simulated SNPs, with similar correct-fitting percentages and MR values.

Table 1.

Variable selection and estimation results by the adaptive group lasso and the group lasso with the GACM and GVCM, respectively, and parametric logistic regression with adaptive group lasso and screening methods based on 500 replications. The columns of C, O and I show the percentage of correct-fitting, over-fitting and incorrect-fitting. The columns TP, FP and MR show true positives, false positives and model errors, respectively

C O I TP FP MR
GACM AGL 0.410 0.460 0.130 3.860 0.870 0.059
GL 0.140 0.764 0.096 3.904 2.540 0.083
GVCM AGL 0.030 0.000 0.970 1.636 5.685 0.142
GL 0.060 0.000 0.940 2.076 20.670 0.120
Logistic regression AGL 0.000 0.000 1.000 1.872 1.174 0.159
Screening 0.000 0.000 1.000 1.056 0.786 0.141

Next, we investigate the empirical coverage rates of the unsmoothed and smoothed SCBs given in (17) and (19). To calculate the unsmoothed and smoothed bootstrap standard deviations (16) and (18), we use B = 500 bootstrap replications. The confidence bands are constructed at $L_n=20$ equally spaced points. At the 95% confidence level, Table 2 reports the empirical coverage rates (cov) and the sample averages of the median and mean standard deviations (sd.median and sd.mean) for the unsmoothed SCB (17) and smoothed SCB (19) for the coefficient functions $\alpha_{\ell 1}(x_1)$, $\ell=1,2,3,4$. We see that the smoothed bootstrap method leads to better performance, with empirical coverage rates closer to the nominal confidence level 0.95.

Table 2. The empirical coverage rates (cov) and the sample averages of the median and mean of the standard deviations (sd.median and sd.mean) for the unsmoothed SCB (17) and smoothed SCB (19) for the coefficient functions $\alpha_{\ell 1}(x_1)$, $\ell=1,2,3,4$.

        Unsmoothed bootstrap            Smoothed bootstrap
        cov    sd.median   sd.mean      cov    sd.median   sd.mean
α11 0.610 0.689 0.809 0.818 0.735 0.982
α21 0.628 0.563 0.725 0.846 0.666 0.932
α31 0.636 0.736 0.832 0.869 0.837 1.053
α41 0.646 0.768 0.843 0.882 0.891 1.064

5. Data application

We illustrate our method via analysis of the Framingham Heart Study [Dawber, Meadors and Moore (1951)] to investigate the effects of G × E interactions on obesity. People are defined as obese when their body mass index (BMI) is 30 or greater; this is the definition of obesity used by the U.S. Centers for Disease Control and Prevention (see http://www.cdc.gov/obesity/adult/defining.html). We defined the response variable to be Y = 1 for BMI ≥ 30 and Y = 0 for BMI < 30. We use X1 = sleeping hours per day, X2 = activity hours per day and X3 = diastolic blood pressure as the environmental factors, and single nucleotide polymorphisms (SNPs) located on the sixth chromosome as the genetic factors. The three possible allele combinations are coded as 1, 0 and −1. As in the simulation, we are thus fitting a multiplicative risk model in the SNPs. For details on genotyping, see http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?studyid=phs000007.v3.p2. A total of 1286 SNPs remain in our analysis after eliminating SNPs with minor allele frequency < 0.05, SNPs departing from Hardy–Weinberg equilibrium and SNPs whose correlation coefficient with the response is between −0.1 and 0.1. We have n = 300 individuals left in our study after deleting observations with missing values.

To see possible nonlinear main effects of the environmental factors, we first fit a generalized additive model by using X1, X2 and X3 as predictors such that

$$E(Y_i\mid X_i,T_i)=g^{-1}\{\eta(X_i)\}\quad\text{with}\quad \eta(X_i)=m_0+\sum_{k=1}^{3}m_k(X_{ik}). \quad (21)$$

Figure S.1, given in the supplementary material [Ma et al. (2015)], depicts the plots of $\hat m_k(\cdot)$ for k = 1, 2, 3 obtained by one-step cubic spline estimation. Clearly, the estimate of each nonparametric function has a nonlinear pattern. We refer to Section S.2 for a detailed description of this figure. Based on the plots shown in Figure S.1, we fit the GACM

$$\eta(X_i,T_i)=\sum_{\ell=1}^{1287}\Big\{\alpha_{\ell 0}+\sum_{k=1}^{3}\alpha_{\ell k}(X_{ik})\Big\}T_{i\ell}, \quad (22)$$

where $T_i=(T_{i1},T_{i2},\dots,T_{i1287})^{T}$ with $T_{i1}=1$, and the $T_{i\ell}$ are the SNP covariates for $\ell=2,\dots,1287$. The nonparametric functions $\alpha_{\ell k}(\cdot)$ are estimated by cubic splines, and the number of interior knots in each step is selected by the criterion described in Section 2.4. We select variables in model (22) by the proposed adaptive group lasso (AGL) and the group lasso (GL). To compare the proposed model with linear models, we perform the group lasso assuming linear interaction effects (Linear), such that $\alpha_{\ell}(X_i)=\alpha_{\ell 0}+\sum_{k=1}^{3}\beta_{\ell k}X_{ik}$, and we also perform the lasso assuming no interaction effects (No interaction), such that $\alpha_{\ell}(X_i)=\alpha_{\ell 0}$. We also apply the screening method with parametric logistic regression (Screening) as described in Example 1. Table 3 reports the variable selection results in these five scenarios. After model selection, we calculate the estimated leave-one-out cross-validation prediction error (CVPE) for the model with the selected variables, shown in the last row of Table 3. Among the SNPs selected by the AGL method, two, rs4714924 and rs6543930, have been scientifically confirmed by Randall et al. (2013) to have strong associations with obesity. Moreover, compared to the linear, no-interaction and screening methods, our proposed AGL with the GACM identifies more genetic factors, which may be important to the response but missed by the other methods. As a result, it has the smallest CVPE (0.078), so it significantly improves model prediction compared to the other methods. We also see that the logistic model that completely ignores interactions has the largest CVPE (0.152). The screening method has the second largest CVPE (0.149), which is larger than that of the penalization method (0.124) obtained by fitting the same logistic regression model but with the interactions included. This result demonstrates that the screening method is not as effective as the penalization method for the analysis of this data set, which also agrees with our simulations.

Table 3.

Variable selection results for the group lasso (GL) and the adaptive group lasso (AGL) in model (22), the group lasso by assuming linear interaction effects (linear), the lasso by assuming no interaction effects (no interaction) and the screening method (screening). The symbol ✓ indicates that the SNP was selected into the model. The last row shows the cross validation prediction errors (CVPE)

SNPs GL AGL Linear No interaction Screening
rs9296244
rs6910353
rs3130813
rs9353447
rs4714924
rs242263
rs282123
rs282128
rs6929006
rs9353711
rs12199154
rs2277114
rs749517
rs729888
rs203139
rs6914589
rs6543930
CVPE 0.099 0.078 0.124 0.152 0.149

Next we fit the final GACM with the variables selected by the AGL procedure:

$$\eta(X_i,T_i)=\sum_{\ell=1}^{10}\Big\{\alpha_{\ell 0}+\sum_{k=1}^{3}\alpha_{\ell k}(X_{ik})\Big\}T_{i\ell}. \quad (23)$$

To illustrate the main effects of the environmental factors, Figure 2 plots the smoothed two-step estimates $\tilde\alpha_{1k}^{S}(\cdot)$ of the functions $\alpha_{1k}(\cdot)$ for k = 1, 2, 3 and the associated 95% smoothed SCBs (upper and lower solid lines). The plots of the functional estimates show the same nonlinear patterns as the corresponding plots in Figure S.1, although, because of the addition of the SCBs, the scale of the plots has changed.

Fig. 2. Plots of the smoothed two-step estimated functions $\tilde\alpha_{1k}^{S}(\cdot)$ for k = 1, 2, 3 and the associated 95% SCBs based on model (23).


To illustrate how the effects of the genetic factors change with the environmental factors, in Figure 3 we plot the smoothed two-step estimates $\tilde\alpha_{6k}^{S}(\cdot)$ and the associated 95% smoothed SCBs of the coefficient functions for the SNP rs242263. To further demonstrate how the probability of developing obesity changes with the environmental factors for each category of SNP rs242263, Figure 4 plots the estimated conditional probability of obesity against each environmental factor, setting $T_{i\ell}=0$ for $\ell\ne 6$. Letting A be the minor allele, the curves are for aa (solid line), Aa (dashed line) and AA (dotted line). Figure 3 indicates different changing patterns of the interaction effects under different environments. For example, sleeping hours seem to have an overall more significant interaction effect with this particular SNP than the other two variables. The effect of this SNP changes from positive to negative and then to positive again as sleeping hours increase. The coefficient functions of the SNP have an increasing pattern in activity hours and in diastolic blood pressure, respectively. From Figure 4, we observe stronger differences among the genotypes AA, Aa and aa of SNP rs242263 at both large and small values of the environmental factors. There are other interesting results worth further study. For example, in the 2–6 hours per day sleeping range, the AA group (dotted lines) has a much higher rate of obesity than the aa group (solid line), but the opposite occurs in the 6–9 hour range. For those with low amounts of activity per day, again the AA group is more obese than the aa group, while when activity increases, the AA group is less obese than the aa group. A similar noticeable difference occurs between the < 60 diastolic blood pressure group, those who are hypotensive, and the > 90 group, those who are hypertensive, although there are few subjects in the former group.

Fig. 3. Plots of the smoothed two-step estimated functions $\tilde\alpha_{5k}^{S}(\cdot)$ for k = 1, 2, 3 and the associated 95% SCBs based on model (23).


Fig. 4. Plots of the estimated conditional probability of obesity against each environmental factor, setting $T_{i\ell}=0$ for $\ell\ne 5$. With A being the minor allele, the curves are aa (solid line), Aa (dashed line) and AA (dotted line), based on model (23).

6. Discussion

The generalized additive coefficient model (GACM) proposed by Xue and Yang (2006) and Xue and Liang (2010) has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. To promote the use of the GACM in modern data applications, such as gene–environment (G × E) interaction effects in GWAS, we have proposed estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we have devised a groupwise penalization method in the GACM for simultaneous model selection and estimation. We showed by numerical studies that the proposed nonparametric model can effectively identify important genetic factors, while traditional generalized parametric models, such as the logistic regression model, fail to do so when nonlinear interactions exist. Moreover, by comparison with the conventional screening method based on logistic regression, as commonly used in the GWAS community, our proposed groupwise penalization method with the GACM has been demonstrated to be more effective for variable selection and model estimation. After identifying the important covariates, we have further constructed simultaneous confidence bands for the nonzero coefficient functions based on a refined two-step estimator. We estimate the standard deviation of the functional estimator by the smoothed bootstrap method proposed in Efron (2014). The method was shown to have good numerical performance, reducing variability and improving the empirical coverage rate of the proposed simultaneous confidence bands. Our methods can be extended to longitudinal data settings through marginal models or mixed-effects models; more work, however, is needed to understand the properties of the estimators in such new settings. Moreover, extending this work to settings where the dimensions of both the genetic and the environmental factors grow with the sample size is a possible future project. The associated theoretical properties with respect to model selection, estimation and inference would need to be carefully investigated.

Supplementary Material

Supplement

Acknowledgments

The authors thank the Co-Editors, an Associate Editor and three referees for their valuable suggestions and comments that have substantially improved an earlier version of this paper.

Appendix

Denote the space of qth order smooth functions by $C^{(q)}([0,1])=\{\phi:\phi^{(q)}\in C[0,1]\}$. For any $s\times s$ symmetric matrix A, denote its $L_2$ norm by $\|A\|_2=\max_{\varsigma\in R^{s},\|\varsigma\|_2=1}\|A\varsigma\|_2$, and let $\|A\|_\infty=\max_{1\le i\le s}\sum_{j=1}^{s}|a_{ij}|$. For a vector a, let $\|a\|_\infty=\max_{1\le i\le s}|a_i|$.

Let $C^{0,1}(\chi_w)$ be the space of Lipschitz continuous functions on $\chi_w$, that is,

$$C^{0,1}(\chi_w)=\Big\{\varphi:\|\varphi\|_{0,1}=\sup_{w\ne w',\,w,w'\in\chi_w}\frac{|\varphi(w)-\varphi(w')|}{|w-w'|}<+\infty\Big\},$$

in which $\|\varphi\|_{0,1}$ is the $C^{0,1}$-norm of $\varphi$. Denote $q_j(\eta,y)=\partial^{j}Q\{g^{-1}(\eta),y\}/\partial\eta^{j}$, so that
$$q_1(\eta,y)=\frac{\partial}{\partial\eta}Q\{g^{-1}(\eta),y\}=-\{y-g^{-1}(\eta)\}\rho_1(\eta),$$
$$q_2(\eta,y)=\frac{\partial^{2}}{\partial\eta^{2}}Q\{g^{-1}(\eta),y\}=\rho_2(\eta)-\{y-g^{-1}(\eta)\}\rho_1'(\eta),$$
where $\rho_j(\eta)=\{\dot g^{-1}(\eta)\}^{j}/V\{g^{-1}(\eta)\}$.

A.1. Assumptions

Throughout the paper, we assume the following regularity conditions:

(C1) The joint density of X, denoted f(x), is absolutely continuous, and there exist constants $0<c_f\le C_f<\infty$ such that $c_f\le\min_{x\in[0,1]^d}f(x)\le\max_{x\in[0,1]^d}f(x)\le C_f$.

(C2) The function V is twice continuously differentiable, and the link function g is three times continuously differentiable. The function $q_2(\eta,y)>0$ for $\eta\in R$ and y in the range of the response variable.

(C3) For $1\le \ell\le p$ and $1\le k\le d$, $\alpha_{\ell k}^{(r-1)}(x_k)\in C^{0,1}[0,1]$ for a given integer $r\ge 1$. The spline order satisfies $q\ge r$.

(C4) Let $\varepsilon_i=Y_i-\mu(X_i,T_i)$, $1\le i\le n$. The random variables $\varepsilon_1,\dots,\varepsilon_n$ are i.i.d. with $E(\varepsilon_i)=0$ and $\mathrm{var}(\varepsilon_i\mid X_i,T_i)=\sigma^{2}(X_i,T_i)$. Furthermore, their tail probabilities satisfy $P(|\varepsilon_i|>x)\le K\exp(-Cx^{2})$, $i=1,\dots,n$, for all $x\ge 0$ and some positive constants C and K.

(C5) The eigenvalues of $E(T_{I_1}T_{I_1}^{T}\mid X=x)$, where $T_{I_1}=(T_\ell, \ell\in I_1)^{T}$, are uniformly bounded away from 0 and ∞ for all $x\in[0,1]^{d}$. There exist constants $0<c_1<C_1<\infty$ such that $c_1\le E(T_\ell^{2}\mid X=x)\le C_1$ for all $x\in[0,1]^{d}$ and $\ell\in I_2$.

Conditions (C1)–(C5) are standard conditions for nonparametric estimation. Condition (C1) is the same as condition (C1) in Xue and Yang (2006) and condition (C5) in Xue and Liang (2010). The first condition in (C2) gives the assumptions on V and the link function g, which can be found in condition (E) of Lam and Fan (2008). The second condition in (C2) guarantees that the negative quasi-likelihood function Q{g−1(η), y} is convex in ηR, which is also given in condition (D) of Lam and Fan (2008) and (a) of condition 1 in Carroll et al. (1997). Condition (C3) is typical for polynomial spline smoothing; see the same condition given in Section 5.2 of Huang (2003). Condition (C4) is the same as assumption (A2) given in Huang, Horowitz and Wei (2010). Condition (C5) is given in condition (C5) of Xue and Liang (2010) and condition (A5) in Ma and Yang (2011b).

A.2. Preliminary lemmas

Define $\alpha_{\ell}^{0}(x)=\gamma_{\ell 0}+\sum_{k=1}^{d}\alpha_{\ell k}^{0}(x_k)=B(x)^{T}\gamma_{\ell}$, where $\alpha_{\ell k}^{0}(x_k)$ is defined in (6). Let $\gamma_{I_1}=(\gamma_\ell:\ell\in I_1)^{T}$. To prove Theorem 1, we define the oracle estimator of $\gamma_{I_1}$ by minimizing the penalized negative quasi-likelihood with all irrelevant predictors eliminated, namely

$$L_n(\gamma_{I_1})=\sum_{i=1}^{n}Q\Big[g^{-1}\Big\{\sum_{\ell\in I_1}B(X_i)^{T}\gamma_{\ell}T_{i\ell}\Big\},Y_i\Big]+n\lambda_n\sum_{\ell\in I_1}w_{n\ell}\|\gamma_{\ell}\|_2, \quad (24)$$

so that $\hat\gamma_{I_1}^{0}=(\hat\gamma_{\ell}^{0}:\ell\in I_1)^{T}=\arg\min_{\gamma_{I_1}}L_n(\gamma_{I_1})$. Define $\hat\gamma_{I_2}^{0}=(\hat\gamma_{\ell}^{0}:\ell\in I_2)^{T}$ with $\hat\gamma_{\ell}^{0}\equiv\mathbf{0}_{dJ_n+1}$ for $\ell\in I_2$, where $\mathbf{0}_{dJ_n+1}$ is the $(dJ_n+1)$-dimensional zero vector. We next present several lemmas, whose detailed proofs are given in the online supplementary materials [Ma et al. (2015)]. Lemma A.1 is used in the proof of Theorem 1, while Lemma A.2 is needed in the proof of Theorem 3.

Lemma A.1. Under the conditions of Theorem 1, one has

$$\|\hat\gamma_{I_1}^{0}-\gamma_{I_1}\|_2=O_p(\lambda_n\|w_{n,I_1}\|_2+n^{-1/2}J_n^{1/2}+J_n^{-r}), \quad (25)$$

and as n → ∞,

$$P\big\{\hat\gamma=(\hat\gamma_{I_1}^{0T},\hat\gamma_{I_2}^{0T})^{T}\big\}\to 1. \quad (26)$$

Lemma A.2. Under conditions (C1)–(C5) and Assumptions 1–3,

$$\|\hat\gamma_{\cdot,1}^{S}-\hat\gamma_{\cdot,1}^{OR}\|_{\infty}=O_p\big\{\sqrt{\log n/(J_n^{S}n)}+(J_n^{S})^{-1/2}(J_n^{ini})^{-r}\big\}. \quad (27)$$

A.3. Proof of Theorem 1

By (25) and (26),
$$\sum_{\ell\in I_1}\|\hat\alpha_{\ell}-\alpha_{\ell}\|\asymp\|\hat\gamma_{I_1}^{0}-\gamma_{I_1}\|_2=O_p(\lambda_n\|w_{n,I_1}\|_2+n^{-1/2}J_n^{1/2}+J_n^{-r}),$$
and $P(\|\hat\alpha_\ell\|>0, \ell\in I_1$ and $\|\hat\alpha_\ell\|=0, \ell\in I_2)\to 1$.

A.4. Proof of Theorem 2

Let $\gamma_{\cdot,1}=(\gamma_{\ell 1}, \ell\in\hat I_1)^{T}$, where $\gamma_{\ell 1}$ is defined in (7). By a Taylor expansion, from (10), one has
$$\hat\gamma_{\cdot,1}^{OR}-\gamma_{\cdot,1}=\Big[\sum_{i=1}^{n}Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}\Big]^{-1}\Big[\sum_{i=1}^{n}Z_{i,1}\{Y_i-g^{-1}(\eta_i^{0})\}\{\dot g^{-1}(\eta_i^{0})/\sigma_i^{2}\}\Big],$$
where
$$\eta_i^{0}=\sum_{\ell=1}^{p}\Big\{\alpha_{\ell 0}+\sum_{k=2}^{d}\alpha_{\ell k}(X_{ik})\Big\}T_{i\ell}+\sum_{\ell=1}^{p}B_1^{S}(X_{i1})^{T}\gamma_{\ell 1}T_{i\ell}\quad\text{and}$$
$$\bar\eta_i=\sum_{\ell=1}^{p}\Big\{\alpha_{\ell 0}+\sum_{k=2}^{d}\alpha_{\ell k}(X_{ik})\Big\}T_{i\ell}+\sum_{\ell=1}^{p}B_1^{S}(X_{i1})^{T}\bar\gamma_{\ell 1}T_{i\ell},$$
in which $\bar\gamma_{\cdot,1}=(\bar\gamma_{\ell 1}, \ell\in\hat I_1)^{T}$ lies between $\gamma_{\cdot,1}$ and $\hat\gamma_{\cdot,1}^{OR}$. Following similar reasoning as in the proof of (25), we have $\|\hat\gamma_{\cdot,1}^{OR}-\gamma_{\cdot,1}\|_2=o_p(1)$. Then $\hat\gamma_{\cdot,1}^{OR}-\gamma_{\cdot,1}=(\hat\gamma_{\cdot,1e}^{OR}+\hat\gamma_{\cdot,1\mu}^{OR})+o_p(1)$, where
$$\hat\gamma_{\cdot,1e}^{OR}=\Big[\sum_{i=1}^{n}Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}\Big]^{-1}\Big[\sum_{i=1}^{n}Z_{i,1}\varepsilon_i\{\dot g^{-1}(\bar\eta_i)/\sigma_i^{2}\}\Big],$$
$$\hat\gamma_{\cdot,1\mu}^{OR}=\Big[\sum_{i=1}^{n}Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}\Big]^{-1}\Big[\sum_{i=1}^{n}Z_{i,1}\{g^{-1}(\eta_i)-g^{-1}(\eta_i^{0})\}\{\dot g^{-1}(\bar\eta_i)/\sigma_i^{2}\}\Big]. \quad (28)$$

Therefore, $\mathrm{var}(\hat\gamma_{\cdot,1e}^{OR}\mid\mathbf{X},\mathbf{T})=[\sum_{i=1}^{n}Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}]^{-1}$. By Theorem 5.4.2 of DeVore and Lorentz (1993), for sufficiently large n there exist constants $0<c_B\le C_B<\infty$ such that $c_BI_{J_n^{S}\times J_n^{S}}\le E\{B_1^{S}(X_{i1})B_1^{S}(X_{i1})^{T}\}\le C_BI_{J_n^{S}\times J_n^{S}}$. By condition (C5), for n large enough there are constants $0<C_T,C'<\infty$ such that
$$E\big[Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}\big]\le C'E\big[\{B_1^{S}(X_{i1})B_1^{S}(X_{i1})^{T}\}\otimes\{E(T_{\ell}T_{\ell'}\mid X)\}_{\ell,\ell'\in\hat I_1}\big]\le C'C_Ts\,E\{B_1^{S}(X_{i1})B_1^{S}(X_{i1})^{T}\}\otimes I_{s\times s}\le C'C_TC_Bs\,I_{J_n^{S}\times J_n^{S}}\otimes I_{s\times s}=Cs\,I_{J_n^{S}s\times J_n^{S}s},$$
where $C=C'C_TC_B$. Similarly, $E[Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}]\ge c\,I_{J_n^{S}s\times J_n^{S}s}$ for some constant $0<c<\infty$. Thus, following the same reasoning as in the proof of (S.5) in the supplementary materials [Ma et al. (2015)], we have, with probability 1, as $n\to\infty$,
$$C^{-1}s^{-1}n^{-1}I_{J_n^{S}s\times J_n^{S}s}\le\Big[\sum_{i=1}^{n}Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}\Big]^{-1}\le c^{-1}n^{-1}I_{J_n^{S}s\times J_n^{S}s}. \quad (29)$$

By the Lindeberg central limit theorem, it can be proved that
$$a^{T}\sigma_n^{-1}(x_1)\{\mathbf{B}^{S}(x_1)\hat\gamma_{\cdot,1e}^{OR}\}\to N(0,1), \quad (30)$$
for any $a\in R^{s^{*}}$ with $\|a\|_2=1$. Since $a^{T}\sigma_n^{-1}(x_1)\{\hat\alpha_1^{OR}(x_1)-b_1(x_1)\}=a^{T}\sigma_n^{-1}(x_1)\{\mathbf{B}^{S}(x_1)\hat\gamma_{\cdot,1e}^{OR}\}+o_p(1)$, by (30) and Slutsky's theorem we have
$$a^{T}\sigma_n^{-1}(x_1)\{\hat\alpha_1^{OR}(x_1)-b_1(x_1)\}\to N(0,1). \quad (31)$$

By (28) and (29), with probability approaching 1,

$$\sum_{\ell\in\hat I_1}\|\hat\alpha_{\ell 1}^{OR}-b_{\ell 1}\|^{2}\asymp\|\hat\gamma_{\cdot,1e}^{OR}\|_2^{2}\le c^{-2}n^{-2}\Big[\sum_{i=1}^{n}\varepsilon_iZ_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)/\sigma_i^{2}\}\Big]\Big[\sum_{i=1}^{n}Z_{i,1}\varepsilon_i\{\dot g^{-1}(\bar\eta_i)/\sigma_i^{2}\}\Big]\asymp c^{-2}n^{-1}E\big[Z_{i,1}^{T}Z_{i,1}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}\big]\asymp sJ_n^{S}n^{-1};$$
$$\{a^{T}(\hat\alpha_1^{OR}-b_1)\}^{2}\le C_a\|\hat\gamma_{\cdot,1e}^{OR}\|_2^{2}\le C_ac^{-1}n^{-2}\Big(\sum_{i=1}^{n}\varepsilon_iZ_{i,1}^{T}\Big)\Big(\sum_{i=1}^{n}Z_{i,1}\varepsilon_i\Big)\asymp C_ac^{-1}n^{-1}E(Z_{i,1}^{T}Z_{i,1})\asymp sJ_n^{S}n^{-1}.$$
Since $\sup_{x_1\in[0,1]}|\alpha_{\ell 1}(x_1)-B_1^{S}(x_1)^{T}\gamma_{\ell 1}|=O\{(J_n^{S})^{-r}\}$, it can be proved that $|a^{T}\hat\gamma_{\cdot,1\mu}^{OR}|\le\|\hat\gamma_{\cdot,1\mu}^{OR}\|_2=O_p\{s^{1/2}(J_n^{S})^{-r}\}$ and $a^{T}(b_1-\alpha_1^{0})\asymp a^{T}\mathbf{B}^{S}(x_1)\hat\gamma_{\cdot,1\mu}^{OR}=O_p\{s^{1/2}(J_n^{S})^{-r}\}$. Hence
$$a^{T}(b_1-\alpha_1)\le a^{T}(b_1-\alpha_1^{0})+a^{T}(\alpha_1^{0}-\alpha_1)=O_p\{s^{1/2}(J_n^{S})^{-r}\}.$$
By (31), $\{e_{\ell}^{T}\sigma_n^{2}(x_1)e_{\ell}\}^{-1/2}\{\hat\alpha_{\ell 1}^{OR}(x_1)-b_{\ell 1}(x_1)\}\to N(0,1)$, and $\sup_{\ell\in\hat I_1}|\hat\alpha_{\ell 0}^{OR}-\alpha_{\ell 0}|=O_p(n^{-1/2})$ follows from the central limit theorem.

A.5. Proof of Theorem 3

By (27) in Lemma A.2,

$$\sup_{x_1\in[0,1]}\|\hat\alpha_1^{S}(x_1)-\hat\alpha_1^{OR}(x_1)\|_{\infty}\le\sup_{x_1\in[0,1]}\sum_{j=1}^{J_n^{S}}|B_{j,1}^{S}(x_1)|\,\|\hat\gamma_{\cdot,1}^{S}-\hat\gamma_{\cdot,1}^{OR}\|_{\infty}.$$

The right-hand side is bounded by $O_p\{(n^{-1}\log n)^{1/2}+(J_n^{ini})^{-r}\}$. The bound $\|\hat\alpha_0^{S}-\hat\alpha_0^{OR}\|_2=o_p(n^{-1/2})$ can be proved following the same procedure and is thus omitted. By (29), with probability approaching 1, for large enough n, for any $x_1\in[0,1]$ and $a\in R^{s^{*}}$ with $\|a\|_2=1$, one has

$$a^{T}\sigma_n^{2}(x_1)a\ge C^{-1}(s^{*})^{-1}n^{-1}a^{T}\mathbf{B}^{S}(x_1)\mathbf{B}^{S}(x_1)^{T}a\ge c_1J_n^{S}(s^{*})^{-1}n^{-1}a^{T}a,$$
$$a^{T}\sigma_n^{2}(x_1)a\le c^{-1}n^{-1}a^{T}\mathbf{B}^{S}(x_1)\mathbf{B}^{S}(x_1)^{T}a\le C_1J_n^{S}n^{-1}a^{T}a,$$

where σn2(x1) is defined in (12). Thus

$$\sup_{x_1\in[0,1]}\big|a^{T}\sigma_n^{-1}(x_1)\{\hat\alpha_1^{S}(x_1)-\hat\alpha_1^{OR}(x_1)\}\big|\le\sup_{x_1\in[0,1]}\|\sigma_n^{-1}(x_1)\|_2\,\|\hat\alpha_1^{S}(x_1)-\hat\alpha_1^{OR}(x_1)\|_2=O_p\big[s\big\{(\log n/J_n^{S})^{1/2}+(n/J_n^{S})^{1/2}(J_n^{ini})^{-r}\big\}\big]=o_p(1).$$

A.6. Proof of Theorem 4

Using the strong approximation lemma given in Theorem 2.6.7 of Csörgő and Révész (1981), we can prove by the same procedure as Lemma A.7 in Ma, Yang and Carroll (2012) that

$$\sup_{x_1\in[0,1]}\big|\hat\alpha_{\ell 1}^{OR}(x_1)-b_{\ell 1}(x_1)-\hat\alpha_{\ell 1,\varepsilon}^{0}(x_1)\big|=o_{a.s.}(n^{t}) \quad (32)$$

for some $t<-r/(2r+1)<0$, where
$$\hat\alpha_{\ell 1,\varepsilon}^{0}(x_1)=e_{\ell}^{T}\mathbf{B}^{S}(x_1)\Big[\sum_{i=1}^{n}Z_{i,1}Z_{i,1}^{T}\{\dot g^{-1}(\bar\eta_i)\}^{2}/\sigma_i^{2}\Big]^{-1}\Big[\sum_{i=1}^{n}Z_{i,1}e_i\{\dot g^{-1}(\bar\eta_i)/\sigma_i^{2}\}\Big],$$
and the $e_i$, $1\le i\le n$, are i.i.d. N(0, 1) random variables independent of the $Z_{i,1}$. For $\sigma_n^{2}(x_1)$ defined in (12), $\sigma_{n,\ell 1}(x_1)\asymp(J_n^{S}/n)^{1/2}\{1+o_p(1)\}$ uniformly in $x_1\in[0,1]$. By (32), $J_n^{S}\asymp n^{1/(2r+1)}$ and $t<-r/(2r+1)<0$, we have

$$\sup_{x_1\in[0,1]}\big|\{\log(L_n+1)\}^{1/2}\sigma_{n,\ell 1}^{-1}(x_1)\{\hat\alpha_{\ell 1}^{OR}(x_1)-b_{\ell 1}(x_1)-\hat\alpha_{\ell 1,\varepsilon}^{0}(x_1)\}\big|=o_{a.s.}\big(\{\log(L_n+1)\}^{1/2}(n/J_n^{S})^{1/2}n^{t}\big)=o_{a.s.}\big(\{\log(L_n+1)\}^{1/2}n^{r/(2r+1)+t}\big)=o_{a.s.}(1). \quad (33)$$

Define $\eta(x_1)=\sigma_{n,\ell 1}^{-1}(x_1)\hat\alpha_{\ell 1,\varepsilon}^{0}(x_1)$. It is apparent that $\mathcal{L}\{\eta(\xi_J)\mid Z_{i,1},1\le i\le n\}=N(0,1)$, so $\mathcal{L}\{\eta(\xi_J)\}=N(0,1)$ for $0\le J\le L_n$. Moreover, the eigenvalues of $(EZ_{i,1}Z_{i,1}^{T})^{-1}$ are bounded. Then, with probability approaching 1, for $J\ne J'$,
$$\big|E\{\eta(\xi_J)\eta(\xi_{J'})\}\big|\asymp(n/J_n^{S})n^{-1}\big|e_{\ell}^{T}\mathbf{B}^{S}(\xi_J)(EZ_{i,1}Z_{i,1}^{T})^{-1}\mathbf{B}^{S}(\xi_{J'})^{T}e_{\ell}\big|\asymp(J_n^{S})^{-1}\big|e_{\ell}^{T}\mathbf{B}^{S}(\xi_J)\mathbf{B}^{S}(\xi_{J'})^{T}e_{\ell}\big|=(J_n^{S})^{-1}\sum_{j=1}^{J_n^{S}}B_{j,1}^{S}(\xi_J)B_{j,1}^{S}(\xi_{J'}),$$

and $(J_n^{S})^{-1}\sum_{j=1}^{J_n^{S}}B_{j,1}^{S}(\xi_J)B_{j,1}^{S}(\xi_{J'})\le C$ for a constant $0<C<\infty$ when $|j_J-j_{J'}|\le q-1$, while $\sum_{j=1}^{J_n^{S}}B_{j,1}^{S}(\xi_J)B_{j,1}^{S}(\xi_{J'})=0$ when $|j_J-j_{J'}|>q-1$, in which $j_J$ denotes the index of the knot closest to $\xi_J$ from the left. Therefore, by $L_n\asymp J_n^{S}$, there exist constants $0<C_1<\infty$ and $0<C_2<\infty$ such that, with probability approaching 1, for $J\ne J'$, $|E\{\eta(\xi_J)\eta(\xi_{J'})\}|\le C_1^{-|j_J-j_{J'}|}\le C_2^{-|J-J'|}$. By Lemma A1 of Ma and Yang (2011a), we have

$\lim_{n\to\infty}P\Big\{\sup_{0\le J\le L_n}\big|\{2\log(L_n+1)\}^{-1/2}\eta(\xi_J)\big|\le d_{N_n}(\alpha)\Big\}=1-\alpha,$

and hence

$\lim_{n\to\infty}P\Big\{\sup_{x_1\in S_{n,1}}\big|\{2\log(L_n+1)\}^{-1/2}\sigma_{n1}^{-1}(x_1)\hat\alpha_{1,\varepsilon}^0(x_1)\big|\le d_{N_n}(\alpha)\Big\}=1-\alpha.$ (34)
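The $\{2\log(L_n+1)\}^{-1/2}$ scaling in (34) reflects the classical extreme-value behavior of Gaussian maxima: the maximum of $m$ standard normal variables grows like $(2\log m)^{1/2}$, so the normalized supremum stabilizes and admits the Gumbel-type limit above. A quick simulation of this well-known fact (our illustration, not from the paper; it treats the $\eta(\xi_J)$ as independent, whereas Lemma A1 of Ma and Yang (2011a) shows that their weak, banded dependence does not change the limit):

import numpy as np

# Max of m i.i.d. N(0,1) draws, scaled by sqrt(2 log m): the ratio
# concentrates near 1 as m grows, which is what makes the
# {2 log(L_n + 1)}^{-1/2} normalization in (34) work.
rng = np.random.default_rng(2015)
for m in (50, 500, 5000):
    draws = rng.standard_normal((2000, m))   # 2000 Monte Carlo replicates
    ratio = np.abs(draws).max(axis=1) / np.sqrt(2 * np.log(m))
    print(f"m = {m:5d}:  mean ratio = {ratio.mean():.3f}")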

Furthermore, according to the result on page 149 of de Boor (2001), we have

$\sup_{x_1\in[0,1]}\big|\{\log(L_n+1)\}^{1/2}\sigma_{n1}^{-1}(x_1)\{b_1(x_1)-\alpha_1(x_1)\}\big|=O_p\big(\{\log(L_n+1)\}^{1/2}(n/J_n^S)^{1/2}(J_n^S)^{-r}\big)=o_p(1).$ (35)

Moreover, $\hat\alpha_1^{OR}(x_1)-\alpha_1(x_1)=\hat\alpha_{1,\varepsilon}^0(x_1)+\{\hat\alpha_1^{OR}(x_1)-b_1(x_1)-\hat\alpha_{1,\varepsilon}^0(x_1)\}+\{b_1(x_1)-\alpha_1(x_1)\}$. Hence, by (33) and (35), we have

$\lim_{n\to\infty}P\Big\{\sup_{x_1\in S_{n,1}}\{2\log(L_n+1)\}^{-1/2}\sigma_{n1}^{-1}(x_1)\big|\hat\alpha_1^{OR}(x_1)-\alpha_1(x_1)\big|\le d_{N_n}(\alpha)\Big\}=\lim_{n\to\infty}P\Big\{\sup_{x_1\in S_{n,1}}\{2\log(L_n+1)\}^{-1/2}\sigma_{n1}^{-1}(x_1)\big|\hat\alpha_{1,\varepsilon}^0(x_1)\big|\le d_{N_n}(\alpha)\Big\}=1-\alpha,$ (36)

where the last step follows from (34). By the oracle property given in Theorem 3, together with $J_n^Sn^{-1/(2r+1)}\to\infty$ and $J_n^{ini}\asymp n^{1/(2r+1)}$, we have

$\sup_{x_1\in[0,1]}\{\log(L_n+1)\}^{1/2}\sigma_{n1}^{-1}(x_1)\big|\hat\alpha_1^S(x_1)-\hat\alpha_1^{OR}(x_1)\big|=O_p\big[\{\log(L_n+1)\}^{1/2}(n/J_n^S)^{1/2}\{(n^{-1}\log n)^{1/2}+(J_n^{ini})^{-r}\}\big]=o_p(1).$ (37)
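For completeness, here is the arithmetic behind the $o_p(1)$ claim in (37) (our addition, under $J_n^{ini}\asymp n^{1/(2r+1)}$ and $J_n^Sn^{-1/(2r+1)}\to\infty$):

$(n/J_n^S)^{1/2}(n^{-1}\log n)^{1/2}=(\log n/J_n^S)^{1/2}\to 0,\qquad (n/J_n^S)^{1/2}(J_n^{ini})^{-r}=o\{n^{r/(2r+1)}\}\,O\{n^{-r/(2r+1)}\}=o(1),$

and since $L_n\asymp J_n^S$ is polynomial in $n$, the factor $\{\log(L_n+1)\}^{1/2}$ grows only logarithmically and is absorbed by both terms for polynomial-order choices of $J_n^S$.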

Therefore, by (36) and (37), we have

$\lim_{n\to\infty}P\Big\{\sup_{x_1\in S_{n,1}}\{2\log(L_n+1)\}^{-1/2}\sigma_{n1}^{-1}(x_1)\big|\hat\alpha_1^S(x_1)-\alpha_1(x_1)\big|\le d_{N_n}(\alpha)\Big\}=1-\alpha,$

and hence the result in Theorem 4 is proved.
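Finally, a small numerical illustration of the banded B-spline overlap $\sum_j B_{j,1}^S(\xi_J)B_{j,1}^S(\xi_{J'})$ invoked before (34) (our sketch, not part of the proof; it assumes SciPy ≥ 1.8 for BSpline.design_matrix, and the degree and knot sequence are arbitrary illustrative choices):

import numpy as np
from scipy.interpolate import BSpline

degree = 3                                    # cubic B-splines (order q = 4)
interior = np.linspace(0.0, 1.0, 11)          # equally spaced knots on [0, 1]
t = np.r_[[0.0] * degree, interior, [1.0] * degree]   # clamped knot vector
x = np.linspace(0.0, 0.99, 12)                # evaluation points

# Row i of B holds B_1(x_i), ..., B_J(x_i), with J = len(t) - degree - 1.
B = BSpline.design_matrix(x, t, degree).toarray()

# overlap[i, j] = sum_k B_k(x_i) B_k(x_j): exactly zero once x_i and x_j
# are more than (q - 1) knot spacings apart, so the matrix is banded.
overlap = B @ B.T
print(np.round(overlap, 2))

The printed matrix has nonzero entries only near its diagonal, which is the compact-support property that yields the geometric covariance decay used with Lemma A1 of Ma and Yang (2011a).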

Contributor Information

MA Shujie, Email: shujie.ma@ucr.edu.

Raymond J. Carroll, Email: carroll@stat.tamu.edu.

Hua Liang, Email: hliang@gwu.edu.

Shizhong Xu, Email: shizhong.xu@ucr.edu.

References

1. Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. J Amer Statist Assoc. 1997;92:477–489. MR1467842.
2. Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95:759–771. MR2443189.
3. Cheverud JM. A simple correction for multiple comparisons in interval mapping genome scans. Heredity (Edinb). 2001;87:52–58. doi:10.1046/j.1365-2540.2001.00901.x.
4. Claeskens G, Van Keilegom I. Bootstrap confidence bands for regression curves and their derivatives. Ann Statist. 2003;31:1852–1884. MR2036392.
5. Csörgő M, Révész P. Strong Approximations in Probability and Statistics. Academic Press; New York: 1981. MR0666546.
6. Dawber TR, Meadors GF, Moore FE. Epidemiological approaches to heart disease: The Framingham Study. American Journal of Public Health. 1951;41:279–286. doi:10.2105/ajph.41.3.279.
7. de Boor C. A Practical Guide to Splines. Revised ed. Applied Mathematical Sciences, Vol. 27. Springer; New York: 2001. MR1900298.
8. DeVore RA, Lorentz GG. Constructive Approximation. Grundlehren der Mathematischen Wissenschaften, Vol. 303. Springer; Berlin: 1993. MR1261635.
9. Efron B. Estimation and accuracy after model selection. J Amer Statist Assoc. 2014;109:991–1007. doi:10.1080/01621459.2013.823775. MR3265671.
10. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360. MR1946581.
11. Fan Y, Tang CY. Tuning parameter selection in high dimensional penalized likelihood. J R Stat Soc Ser B Stat Methodol. 2013;75:531–552. MR3065478.
12. Hall P, Titterington DM. On confidence bands in nonparametric density estimation and regression. J Multivariate Anal. 1988;27:228–254. MR0971184.
13. Härdle W, Marron JS. Bootstrap simultaneous error bars for nonparametric regression. Ann Statist. 1991;19:778–796. MR1105844.
14. Horowitz J, Klemelä J, Mammen E. Optimal estimation in additive regression models. Bernoulli. 2006;12:271–298. MR2218556.
15. Horowitz JL, Mammen E. Nonparametric estimation of an additive model with a link function. Ann Statist. 2004;32:2412–2443. MR2153990.
16. Huang JZ. Local asymptotics for polynomial spline regression. Ann Statist. 2003;31:1600–1635. MR2012827.
17. Huang J, Horowitz JL, Wei F. Variable selection in nonparametric additive models. Ann Statist. 2010;38:2282–2313. doi:10.1214/09-AOS781. MR2676890.
18. Jiang B, Liu JS. Variable selection for general index models via sliced inverse regression. Ann Statist. 2014;42:1751–1786. MR3262467.
19. Knutson KL. Does inadequate sleep play a role in vulnerability to obesity? Am J Hum Biol. 2012;24:361–371. doi:10.1002/ajhb.22219.
20. Lam C, Fan J. Profile-kernel likelihood inference with diverging number of parameters. Ann Statist. 2008;36:2232–2260. doi:10.1214/07-AOS544. MR2458186.
21. Lee YK, Mammen E, Park BU. Flexible generalized varying coefficient regression models. Ann Statist. 2012;40:1906–1933. MR3015048.
22. Lian H. Variable selection for high-dimensional generalized varying-coefficient models. Statist Sinica. 2012;22:1563–1588. MR3027099.
23. Liu R, Yang L. Spline-backfitted kernel smoothing of additive coefficient model. Econometric Theory. 2010;26:29–59. MR2587102.
24. Liu R, Yang L, Härdle WK. Oracally efficient two-step estimation of generalized additive model. J Amer Statist Assoc. 2013;108:619–631. MR3174646.
25. Ma S, Yang L. A jump-detecting procedure based on spline estimation. J Nonparametr Stat. 2011a;23:67–81. MR2780816.
26. Ma S, Yang L. Spline-backfitted kernel smoothing of partially linear additive model. J Statist Plann Inference. 2011b;141:204–219. MR2719488.
27. Ma S, Yang L, Carroll RJ. A simultaneous confidence band for sparse longitudinal regression. Statist Sinica. 2012;22:95–122. doi:10.5705/ss.2010.034. MR2933169.
28. Ma S, Carroll RJ, Liang H, Xu S. Supplement to "Estimation and inference in generalized additive coefficient models for nonlinear interactions with high-dimensional covariates." 2015. doi:10.1214/15-AOS1344SUPP.
29. Meier L, Bühlmann P. Smoothing ℓ1-penalized estimators for high-dimensional time-course data. Electron J Stat. 2007;1:597–615. MR2369027.
30. Meier L, van de Geer S, Bühlmann P. High-dimensional additive modeling. Ann Statist. 2009;37:3779–3821. MR2572443.
31. Murcray CE, Lewinger JP, Gauderman WJ. Gene-environment interaction in genome-wide association studies. Am J Epidemiol. 2009;169:219–226. doi:10.1093/aje/kwn353.
32. Nyholt DR. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet. 2004;74:765–769. doi:10.1086/383251.
33. Randall JC, Winkler TM, Kutalik Z, Berndt SI, Jackson AU, et al. Sex-stratified genome-wide association studies including 270,000 individuals show sexual dimorphism in genetic loci for anthropometric traits. PLOS Genetics. 2013;9:e1003500. doi:10.1371/journal.pgen.1003500.
34. Ravikumar P, Lafferty J, Liu H, Wasserman L. Sparse additive models. J R Stat Soc Ser B Stat Methodol. 2009;71:1009–1030. MR2750255.
35. Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi:10.1093/biomet/asm053. MR2410008.
36. Wang L, Xue L, Qu A, Liang H. Estimation and model selection in generalized additive partial linear models for correlated data with diverging number of covariates. Ann Statist. 2014;42:592–624. MR3210980.
37. Wareham NJ, van Sluijs EMF, Ekelund U. Physical activity and obesity prevention: A review of the current evidence. Proc Nutr Soc. 2005;64:229–247. doi:10.1079/pns2005423.
38. Xue L, Liang H. Polynomial spline estimation for a generalized additive coefficient model. Scand J Stat. 2010;37:26–46. doi:10.1111/j.1467-9469.2009.00655.x. MR2675938.
39. Xue L, Yang L. Additive coefficient modeling via polynomial spline. Statist Sinica. 2006;16:1423–1446. MR2327498.
40. Zhou S, Shen X, Wolfe DA. Local asymptotics for regression splines and confidence regions. Ann Statist. 1998;26:1760–1782. MR1673277.
41. Zou H. The adaptive lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429. MR2279469.

Supplementary Materials

Supplement to "Estimation and inference in generalized additive coefficient models for nonlinear interactions with high-dimensional covariates" [Ma et al. (2015); doi:10.1214/15-AOS1344SUPP].