Published in final edited form as: Biometrika. 2015 Mar 11;102(2):381–395. doi: 10.1093/biomet/asu070

Automatic structure recovery for additive models

YICHAO WU 1, LEONARD A STEFANSKI 1

SUMMARY

We propose an automatic structure recovery method for additive models, based on a backfitting algorithm coupled with local polynomial smoothing, in conjunction with a new kernel-based variable selection strategy. Our method produces estimates of the set of noise predictors, the sets of predictors that contribute polynomially at different degrees up to a specified degree M, and the set of predictors that contribute beyond polynomially of degree M. We prove consistency of the proposed method, and describe an extension to partially linear models. Finite-sample performance of the method is illustrated via Monte Carlo studies and a real-data example.

Keywords: Backfitting, Bandwidth estimation, Kernel, Local polynomial, Measurement-error model selection likelihood, Model selection, Profiling, Smoothing, Variable selection

1. Introduction

Because of recent developments in data acquisition and storage, statisticians often encounter datasets with large numbers of observations or predictors. The demand for analysing such data has led to the current heightened interest in variable selection. For parametric models, classical methods include backward, forward and stepwise selection. More recently, many approaches to variable selection have been developed that use regularization via penalty functions. Examples include the lasso (Tibshirani, 1996), smoothly clipped absolute deviation (Fan & Li, 2001), the adaptive lasso (Zou, 2006; Zhang & Lu, 2007), and the L0 penalty (Shen et al., 2013). Fan & Lv (2010) give a selective overview of variable selection methods.

Variable selection for nonparametric modelling has advanced at a slower pace than for parametric modelling, and has been studied primarily in the context of additive models, which are an important extension of multivariate linear regression. An additive model presupposes that each predictor contributes a possibly nonlinear effect, and that the effects of multiple predictors are additive. Such models were proposed by Friedman & Stuetzle (1981) and serve as surrogates for fully nonparametric models. Most nonparametric variable selection methods studied so far deal with additive models; see Ravikumar et al. (2009), Huang et al. (2010), Fan et al. (2011), and references therein. An exception is Stefanski et al. (2014), which considers variable selection in a fully nonparametric classification model. Additionally, variable selection and structure recovery for the varying coefficient model, a popular extension of the additive model, have been studied: Xia et al. (2004) considered structure recovery towards semi-varying coefficient modelling; Fan et al. (2014) studied a new variable screening method; and for the longitudinal setting Cheng et al. (2014) investigated variable screening and selection, as well as structure recovery.

Partially linear models were proposed by Engle et al. (1986). Combining the advantages of linearity and additivity, these models assume that some covariates have nonlinear additive effects while others contribute linearly. Estimation for a partially linear model requires knowing which covariates have linear effects and which have nonlinear effects, information that is usually not available a priori. Recently, Zhang et al. (2011) and Huang et al. (2012) proposed methods to identify covariates that have linear effects and ones that have nonlinear effects.

We consider an additive model for a scalar response Y and predictors X = (X1, … , XD)T,

$$
Y = m(X) + \varepsilon = \alpha + \sum_{d=1}^{D} m_d(X_d) + \varepsilon, \qquad E(\varepsilon) = 0, \quad \operatorname{var}(\varepsilon) = \sigma^2, \tag{1}
$$

under the identifiability conditions $E\{m_d(X_d)\} = 0$ $(d = 1, \ldots, D)$. Denote a random sample from model (1) by $\{(Y_i, X_i) : i = 1, \ldots, n\}$, where $X_i = (X_{i1}, \ldots, X_{iD})^{\mathrm{T}}$. The goal is to estimate $\alpha$ and $m_d(\cdot)$ $(d = 1, \ldots, D)$.

We propose a method for estimating $m_d(\cdot)$ that first distinguishes between important predictors and unimportant predictors, i.e., those $X_d$ for which $m_d(\cdot) = 0$. Next, motivated by Zhang et al. (2011) and Huang et al. (2012), the method identifies predictors that have linear effects from the estimated set of important predictors. Then the method identifies the predictors that have quadratic effects, and so on. This process continues and results in estimates of the sets of predictors that have polynomial effects at different degrees, up to some degree M, and of the set of predictors for which the corresponding $m_d(\cdot)$ are not polynomial of any degree up to M.

At the core of our structure recovery method is a new nonparametric kernel-based variable selection method derived from the measurement-error model selection likelihood approach of Stefanski et al. (2014). They studied the relationship between lasso estimation and measurement error attenuation in linear models, and used that connection to develop a general approach to variable selection in nonparametric models.

2. Backfitting algorithm

Backfitting coupled with smoothing is commonly used for fitting model (1). Here we use univariate local polynomial smoothing, with SK,h,p denoting the univariate local polynomial smoother with kernel function K (·), bandwidth h and degree p.

Local polynomial smoothing (Fan & Gijbels, 1996) is a well-studied nonparametric smoothing technique. To estimate the regression function $g(t) = E(Z \mid T = t)$ from an independent and identically distributed random sample $\{(T_i, Z_i) : i = 1, \ldots, n\}$, local polynomial regression uses Taylor series approximations and weighted least squares. The local polynomial smoothing estimate $\hat g(t_0)$ of $g(t_0)$ based on the smoother $S_{K,h,p}$ is given by $\hat a_0$, the optimizer of

$$
\min_{a_0, a_1, \ldots, a_p} \sum_{i=1}^{n} \Big\{ Z_i - \sum_{j=0}^{p} a_j (T_i - t_0)^j \Big\}^2 K\Big( \frac{T_i - t_0}{h} \Big).
$$

See Fan & Gijbels (1996) for a detailed account of local polynomial modelling.
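To make the weighted least-squares formulation above concrete, the following minimal Python/NumPy sketch computes $\hat g(t_0)$ at a single evaluation point. The function name, the Gaussian kernel and the use of a generic least-squares solver are illustrative choices, not part of the paper; the theory in the Appendix assumes a compactly supported kernel.

```python
import numpy as np

def local_poly_fit(t0, T, Z, h, p, kernel=lambda u: np.exp(-0.5 * u ** 2)):
    """Local polynomial estimate of g(t0) = E(Z | T = t0), bandwidth h, degree p.

    Solves the weighted least-squares problem in the display above and
    returns the fitted intercept a_0, which estimates g(t0).
    """
    w = kernel((T - t0) / h)                          # weights K{(T_i - t0)/h}
    X = np.vander(T - t0, N=p + 1, increasing=True)   # columns (T_i - t0)^j, j = 0,...,p
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * X, sw * Z, rcond=None)
    return coef[0]                                    # a_0-hat = g-hat(t0)
```

Setting h to infinity makes all weights equal, so the same routine returns a global polynomial fit; this is the limiting behaviour exploited in §§ 3 and 4.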

Using univariate local polynomial smoothing with kernel K(·), bandwidth hd and degree pd to estimate md(·) in the additive model (1), the backfitting algorithm consists of the following steps.

Step 1

Initialize by setting $\hat\alpha = n^{-1} \sum_{i=1}^{n} Y_i$ and $\hat m_d(\cdot) \equiv 0$ for d = 1, …, D.

Step 2

For d = 1, … , D:

  • (a) apply the local polynomial smoother $S_{K,h_d,p_d}$ to $\{(X_{id}, \, Y_i - \hat\alpha - \sum_{k \neq d} \hat m_k(X_{ik})) : i = 1, \ldots, n\}$ and set the estimated function to be the updated estimate $\hat m_d(\cdot)$ of $m_d(\cdot)$;

  • (b) if necessary, apply centring by updating $\hat m_d(\cdot)$ with $\hat m_d(\cdot) - n^{-1} \sum_{i=1}^{n} \hat m_d(X_{id})$.

Step 3

Repeat Step 2 until the changes in all m^d() (d = 1, … , D) between successive iterations are less than a specified tolerance.

Denote the estimates at convergence by $\hat m_d^{\mathrm{BF}}(\cdot\,; h, p)$ (d = 1, …, D) and $\hat\alpha^{\mathrm{BF}}$, where $h = (h_1, \ldots, h_D)^{\mathrm{T}}$ and $p = (p_1, \ldots, p_D)^{\mathrm{T}}$ are the vectors of smoothing bandwidths and local polynomial degrees, respectively. For simplicity we use the same kernel K(·) for each component, and thus K(·) is omitted from the notation $\hat m_d^{\mathrm{BF}}(\cdot\,; h, p)$.
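A compact sketch of the backfitting loop, usable with any univariate smoother such as the local_poly_fit sketch above, is given below. It is a schematic of Steps 1–3 under the stated conventions, not the authors' implementation, and the function names are hypothetical.

```python
import numpy as np

def backfit(X, Y, h, p, smoother, tol=1e-6, max_iter=100):
    """Backfitting for the additive model (1) with a univariate smoother.

    X: n x D predictor matrix; h, p: length-D bandwidth and degree vectors;
    smoother(t0, T, Z, h_d, p_d) returns the fit at t0 (e.g. local_poly_fit).
    Returns (alpha_hat, M_hat) with M_hat[i, d] = m_d-hat(X[i, d]).
    """
    n, D = X.shape
    alpha = Y.mean()                 # Step 1: alpha-hat = mean(Y), m_d-hat = 0
    M = np.zeros((n, D))
    for _ in range(max_iter):
        M_old = M.copy()
        for d in range(D):           # Step 2(a): smooth the partial residuals
            resid = Y - alpha - M.sum(axis=1) + M[:, d]
            fit = np.array([smoother(x0, X[:, d], resid, h[d], p[d])
                            for x0 in X[:, d]])
            M[:, d] = fit - fit.mean()               # Step 2(b): centring
        if np.max(np.abs(M - M_old)) < tol:          # Step 3: convergence check
            break
    return alpha, M
```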

Backfitting works well for problems of moderate dimension D. However, even though backfitting entails only univariate smoothing, its performance deteriorates as D gets larger. Consequently, variable selection is important for the additive model.

3. Variable selection via backfitting local constants

3·1. Variable selection

In Step 2(a) of the backfitting algorithm, any smoothing method can be applied. In this section we consider the local constant smoother, i.e., the local polynomial smoother of degree 0, and propose a nonparametric variable selection method for the additive model (1).

With $p_d = 0$ in Step 2(a) of the backfitting algorithm, we update the estimate $\hat m_d(x_d)$ of $m_d(x_d)$ by the minimizer $\hat a_0$ of

$$
\min_{a_0} \sum_{i=1}^{n} \Big[ \Big\{ Y_i - \hat\alpha - \sum_{k \neq d} \hat m_k(X_{ik}) \Big\} - a_0 \Big]^2 K\Big( \frac{X_{id} - x_d}{h_d} \Big). \tag{2}
$$

Note that $K\{h_d^{-1}(X_{id} - x_d)\} = K(0)$ for all i when $h_d = \infty$. In this case, $m_d(\cdot)$ is globally approximated by $a_0$. The corresponding optimizer of (2) is $\hat a_0 = n^{-1} \sum_{i=1}^{n} \{Y_i - \hat\alpha - \sum_{k \neq d} \hat m_k(X_{ik})\}$, which equals zero because $\hat\alpha = n^{-1} \sum_{i=1}^{n} Y_i$ and centring is applied in Step 2(b) of the backfitting algorithm.

Thus, when $p_d = 0$, the backfitting algorithm leads to the constant zero function $\hat m_d^{\mathrm{BF}}(x_d; h, p)$ when $h_d = \infty$. Consequently, the dth predictor is excluded from the model when $h_d = \infty$ and is included only when $h_d < \infty$. This equivalence, $h_d < \infty$ if and only if the dth predictor is included, is key to the approach described in Stefanski et al. (2014), and we exploit it repeatedly in our method. We assume in this section that the degree of local polynomial smoothing for every function component is 0, i.e., $p_d \equiv 0$ for d = 1, …, D. In this way, the smoothing bandwidth relates directly to the importance of each predictor, with small $h_d$ corresponding to important predictors.

As in Stefanski et al. (2014), we reparameterize hd as λd = 1/hd, so that large λd will correspond to important predictors and λd = 0 to unimportant predictors. Predictor smoothing is now determined by the inverse bandwidths λd. Stefanski et al. (2014) show that variable selection can be obtained by minimizing a loss function with respect to the λd, subject to a bound on the total amount of smoothing as determined by the sum of the λd or, equivalently, a bound on the harmonic mean of the bandwidths.

A suitable loss function for backfitting is the sum of squared errors

$$
\mathrm{SSE} = \sum_{i=1}^{n} \Big\{ Y_i - \hat\alpha^{\mathrm{BF}} - \sum_{d=1}^{D} \hat m_d^{\mathrm{BF}}(X_{id}; \lambda^{-1}, 0_D) \Big\}^2,
$$

where in the last two arguments of $\hat m_d^{\mathrm{BF}}(\cdot\,;\cdot,\cdot)$, $\lambda^{-1} = (1/\lambda_1, \ldots, 1/\lambda_D)^{\mathrm{T}}$ and $0_D$ denotes the D × 1 zero vector. We shall also write $1_s$ for the s × 1 vector of ones.

In the absence of constraints, the minimum of SSE with respect to λ is 0, which corresponds to overfitted models. As in Stefanski et al. (2014), appropriate regularization is achieved by solving the constrained optimization problem

$$
\min_{\lambda_1, \ldots, \lambda_D} \frac{1}{n} \sum_{i=1}^{n} \Big\{ Y_i - \hat\alpha^{\mathrm{BF}} - \sum_{d=1}^{D} \hat m_d^{\mathrm{BF}}(X_{id}; \lambda^{-1}, 0_D) \Big\}^2, \quad \text{subject to } \lambda_d \ge 0 \ (d = 1, \ldots, D) \ \text{and} \ \sum_{d=1}^{D} \lambda_d = \tau_0. \tag{3}
$$

Solving (3) distinguishes important predictors, for which $\hat\lambda_d > 0$, from unimportant predictors, for which $\hat\lambda_d = 0$.

3·2. Modified coordinate descent algorithm

The constrained optimization problem (3) is not convex because of the complicated dependence on λ through the backfitting algorithm and the univariate local polynomial smoothing. We have had success using a modified coordinate descent algorithm.

Coordinate descent for high-dimensional lasso regression (Fu, 1998; Daubechies et al., 2004) cycles through variables one at a time, solving simple marginal univariate optimization problems at each step, and thus is computationally efficient. It has been studied extensively, for example by Friedman et al. (2007) and Wu & Lange (2008), among others. However, standard coordinate descent cannot be applied directly to (3).

In our modified coordinate descent algorithm, we initialize with equal smoothing, i.e., we set $\lambda_d = \tau_0 / D$ for all d = 1, …, D. We then cycle through all components and make univariate updates. Current solutions are denoted by $\lambda_d^c$ (d = 1, …, D), where the superscript c means current. Suppose that we are updating the d′th component. Let $\lambda_{d'}^{(1)} = (0, \ldots, 0, 1, 0, \ldots, 0)^{\mathrm{T}}$, a vector whose elements are all zero except for the d′th, which is 1, and set $\lambda_{-d'}^{c} = (\lambda_1^c, \ldots, \lambda_{d'-1}^c, 0, \lambda_{d'+1}^c, \ldots, \lambda_D^c)^{\mathrm{T}} \big/ \sum_{d \neq d'} \lambda_d^c$. The components of $\lambda_{d'}^{(1)}$ and of $\lambda_{-d'}^{c}$ each sum to unity. Thus, for any γ ∈ [0, τ0], the components of $\gamma \lambda_{d'}^{(1)} + (\tau_0 - \gamma) \lambda_{-d'}^{c}$ sum to τ0 and satisfy the constraints in (3). We then update the set of current solutions to the components of the vector $\hat\gamma \lambda_{d'}^{(1)} + (\tau_0 - \hat\gamma) \lambda_{-d'}^{c}$, where $\hat\gamma$ is the optimizer of

$$
\min_{\gamma} \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \hat\alpha^{\mathrm{BF}} - \sum_{j=1}^{D} \hat m_j^{\mathrm{BF}}\big[ X_{ij}; \{\gamma \lambda_{d'}^{(1)} + (\tau_0 - \gamma) \lambda_{-d'}^{c}\}^{-1}, 0_D \big] \Big)^2
$$

subject to 0 ≤ γτ0. Cycling through d’ = 1, … , D completes one iteration of the algorithm. Iterations continue until the change in solutions between successive iterations becomes small enough.
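The update just described can be sketched as follows. Here `fit_backfit_sse` is an assumed helper that runs the backfitting algorithm of § 2 with local constant smoothing at the supplied bandwidths and returns the criterion value in (3); the one-dimensional search over γ uses a generic bounded optimizer rather than any particular routine of the authors.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def modified_coordinate_descent(X, Y, tau0, fit_backfit_sse, n_sweeps=10):
    """One possible implementation of the modified coordinate descent for (3).

    fit_backfit_sse(X, Y, h) is assumed to return the mean squared residual of
    the backfitted local constant fit with bandwidth vector h (np.inf allowed).
    """
    D = X.shape[1]
    lam = np.full(D, tau0 / D)                  # equal smoothing at initialization

    def criterion(lam_vec):
        # h_d = 1/lambda_d, with h_d = infinity when lambda_d = 0
        h = np.where(lam_vec > 0, 1.0 / np.maximum(lam_vec, 1e-300), np.inf)
        return fit_backfit_sse(X, Y, h)

    for _ in range(n_sweeps):                   # iterate until changes are small
        for d in range(D):                      # update the d'-th component
            rest = lam.copy()
            rest[d] = 0.0
            total = rest.sum()
            if total > 0:
                rest /= total                   # lambda^c_{-d'}, components sum to one
            # (if all other components are zero, only gamma = tau0 keeps the sum at tau0)
            e_d = np.zeros(D)
            e_d[d] = 1.0                        # lambda^(1)_{d'}
            obj = lambda g: criterion(g * e_d + (tau0 - g) * rest)
            g_hat = minimize_scalar(obj, bounds=(0.0, tau0), method='bounded').x
            lam = g_hat * e_d + (tau0 - g_hat) * rest
    return lam
```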

3·3. Tuning

The tuning parameter τ0 controls the total amount of smoothing, and can be selected by standard methods such as crossvalidation, AIC or BIC. Denote the optimizer of (3) by $\hat\lambda(\tau_0)$. If an independent tuning dataset is available, then τ0 can be selected by minimizing, over the tuning set, the sum of squared prediction errors of the estimator $\hat m\{x; \hat\lambda(\tau_0), 0\} = \hat\alpha^{\mathrm{BF}} + \sum_{d=1}^{D} \hat m_d^{\mathrm{BF}}[x_d; \{\hat\lambda(\tau_0)\}^{-1}, 0_D]$. For methods such as AIC and BIC, the degrees of freedom are needed. The local constant smoothing estimator is a linear smoother, and we couple the backfitting algorithm with marginal local constant smoothing. For each function component, the trace of the corresponding smoothing matrix minus 1 can be used as its degrees of freedom, the reduction by 1 accounting for the centring applied in Step 2(b) of the backfitting algorithm.
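The degrees-of-freedom calculation just described can be sketched as follows; the Gaussian kernel, the function names, and the exact form of the BIC-type criterion (Gaussian errors, unknown variance) are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

def local_constant_df(x, h, kernel=lambda u: np.exp(-0.5 * u ** 2)):
    """Trace of the marginal local constant smoother matrix minus 1, i.e. one
    component's contribution to the degrees of freedom described above."""
    if not np.isfinite(h):
        return 0.0                            # h = infinity: trace is 1, minus 1 gives 0
    W = kernel((x[:, None] - x[None, :]) / h)  # W_ij = K{(x_i - x_j)/h}
    s_diag = np.diag(W) / W.sum(axis=1)        # diagonal of the smoother matrix
    return s_diag.sum() - 1.0                  # minus 1 for the centring in Step 2(b)

def bic(Y, fitted, df):
    """A BIC-type criterion for a fit with the given degrees of freedom."""
    n = len(Y)
    rss = np.sum((Y - fitted) ** 2)
    return n * np.log(rss / n) + np.log(n) * (df + 1.0)   # + 1 for the intercept alpha
```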

4. Higher-degree local polynomial regression

The approach in § 3 extends readily to higher-degree local polynomial regression. Suppose that we use local polynomial regression of degree p* in Step 2(a) of the backfitting algorithm. With $p_d = p^*$ in Step 2(a), we update the estimate $\hat m_d(x_d)$ of $m_d(x_d)$ by the minimizer $\hat a_0$ of

$$
\min_{a_0, \ldots, a_{p^*}} \sum_{i=1}^{n} \Big[ \Big\{ Y_i - \hat\alpha - \sum_{k \neq d} \hat m_k(X_{ik}) \Big\} - \sum_{j=0}^{p^*} a_j (X_{id} - x_d)^j \Big]^2 K\Big( \frac{X_{id} - x_d}{h_d} \Big).
$$

As remarked previously, $K\{h_d^{-1}(X_{id} - x_d)\} = K(0)$ for all i when $h_d = \infty$. Consequently, $\sum_{j=0}^{p^*} a_j (X_{id} - x_d)^j$ is a global approximation of $m_d(x_d)$. In this case, in Step 2(a) of the backfitting algorithm the estimate $\hat m_d(x_d)$ is updated by $\sum_{j=0}^{p^*} \hat c_j x_d^j$, where $\hat c_0, \ldots, \hat c_{p^*}$ are the optimizers of $\min_{c_0, \ldots, c_{p^*}} \sum_{i=1}^{n} \big[ \{Y_i - \hat\alpha - \sum_{k \neq d} \hat m_k(X_{ik})\} - \sum_{j=0}^{p^*} c_j X_{id}^j \big]^2$.

Thus, when pd = p*, the backfitting algorithm yields a polynomial of degree p* as the estimate of md(·) when hd = ∞. The interpretation is that the dth predictor makes a polynomial contribution of degree up to p*. Based on this, we derive a method for detecting predictors that contribute polynomially up to degree p* in the additive model.

Now we use backfitting with univariate local polynomial smoothing of degree p* for every function component, again parameterizing via inverse bandwidths λd = 1/hd. As in the previous section, the general approach of Stefanski et al. (2014) leads to solving

$$
\min_{\lambda_1, \ldots, \lambda_D} \frac{1}{n} \sum_{i=1}^{n} \Big\{ Y_i - \hat\alpha^{\mathrm{BF}} - \sum_{d=1}^{D} \hat m_d^{\mathrm{BF}}(X_{id}; \lambda^{-1}, p^* 1_D) \Big\}^2, \quad \text{subject to } \lambda_d \ge 0 \ (d = 1, \ldots, D) \ \text{and} \ \sum_{d=1}^{D} \lambda_d = \tau_{p^*}, \tag{4}
$$

where $\tau_{p^*}$ plays the same role as τ0 in (3), and in the last two arguments of $\hat m_d^{\mathrm{BF}}(\cdot\,;\cdot,\cdot)$, $\lambda^{-1} = (1/\lambda_1, \ldots, 1/\lambda_D)^{\mathrm{T}}$ and $p^* 1_D$ is the product of $p^*$ and $1_D$. In this case, an optimizer $\hat\lambda_d = 0$ means that the dth predictor contributes polynomially at a degree no greater than $p^*$.

The modified coordinate descent algorithm and the tuning procedures discussed in § § 3·2 and 3·3 extend naturally to (4).

5. Automatic structure recovery

The procedures in § § 3 and 4 provide the building blocks for the automatic structure recovery method for additive models. In combination, they enable estimation of the set of unimportant predictors as well as the sets of predictors that contribute polynomially at each degree p.

We let $\mathcal{A}_p$ denote the set of predictors that contribute polynomially at degree p and only at degree p, and we let $\mathcal{B}_p$ denote the set of predictors that contribute beyond polynomial degree p, which includes higher-degree polynomials as well as nonpolynomial functions. Thus $\mathcal{A}_p$ means at degree p, and $\mathcal{B}_p$ indicates beyond degree p. With these naming conventions, a predictor that contributes at degree p = 0 is no more informative than a constant and is therefore unimportant. Specifically, $\mathcal{A}_0 = \{d : m_d(\cdot) = 0\}$, $\mathcal{A}_p = \{d : m_d^{(p)}(\cdot) \neq 0, \ m_d^{(p+1)}(\cdot) = 0\}$ for p = 1, 2, …, and $\mathcal{B}_p = \{d : m_d^{(p+1)}(\cdot) \neq 0\}$ for p = 0, 1, 2, …. Here $m_d(\cdot) = 0$ means that $m_d(t) = 0$ for every t in its domain of interest, and $m_d(\cdot) \neq 0$ means that $m_d(t) \neq 0$ for some t.
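For example, for the function $f_3(t) = 2(3t - 1)^2$ used in simulation Model 1 of § 7,
$$
m_d(x) = 2(3x - 1)^2, \qquad m_d'(x) = 12(3x - 1) \neq 0, \qquad m_d''(x) = 36 \neq 0, \qquad m_d'''(x) = 0,
$$
so $m_d^{(2)}(\cdot) \neq 0$ and $m_d^{(3)}(\cdot) = 0$, and hence $d \in \mathcal{A}_2$: such a predictor contributes quadratically and not beyond. Our automatic structure recovery method proceeds in the following steps.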

Step 1

Identify important and unimportant predictors.

With appropriately chosen τ0, solve (3) to obtain $\hat\lambda_0 = (\hat\lambda_{01}, \ldots, \hat\lambda_{0D})^{\mathrm{T}}$, and set $\hat{\mathcal{A}}_0 = \{d : \hat\lambda_{0d} = 0\}$ and $\hat{\mathcal{B}}_0 = \hat{\mathcal{A}}_0^{C}$, the complement of $\hat{\mathcal{A}}_0$ in the set {1, …, D}. Then $\hat{\mathcal{A}}_0$ and $\hat{\mathcal{B}}_0$ are estimates of the sets of unimportant and important predictors, respectively.

Step 2

Identify from B^0 the predictors that contribute linearly.

After identifying the set $\hat{\mathcal{B}}_0$ of important predictors, the next step is to identify the subset of $\hat{\mathcal{B}}_0$ whose component functions are linear. We remove the unimportant predictors $X_d$ with $d \in \hat{\mathcal{A}}_0$ from consideration and apply the method of § 4, namely (4) with p* = 1, to the data $\{(X_{i\hat{\mathcal{B}}_0}, Y_i) : i = 1, \ldots, n\}$, where $X_{i\hat{\mathcal{B}}_0}$ denotes the subvector of $X_i$ with indices in $\hat{\mathcal{B}}_0$. We define $\hat\alpha^{\mathrm{BF}}$ and $\hat m_d^{\mathrm{BF}}(x_d; \lambda_{\hat{\mathcal{B}}_0}^{-1}, 1_{\#\hat{\mathcal{B}}_0})$ $(d \in \hat{\mathcal{B}}_0)$ to be the backfitting estimates obtained using the data $\{(X_{i\hat{\mathcal{B}}_0}, Y_i) : i = 1, \ldots, n\}$ with bandwidths $\lambda_d^{-1}$ for $d \in \hat{\mathcal{B}}_0$. Let $\#\hat{\mathcal{B}}_0$ denote the cardinality of the set $\hat{\mathcal{B}}_0$. For an appropriately tuned τ1, we solve the optimization problem

$$
\min_{\lambda_d : d \in \hat{\mathcal{B}}_0} \frac{1}{n} \sum_{i=1}^{n} \Big\{ Y_i - \hat\alpha^{\mathrm{BF}} - \sum_{d \in \hat{\mathcal{B}}_0} \hat m_d^{\mathrm{BF}}(X_{id}; \lambda_{\hat{\mathcal{B}}_0}^{-1}, 1_{\#\hat{\mathcal{B}}_0}) \Big\}^2, \quad \text{subject to } \lambda_d \ge 0 \ (d \in \hat{\mathcal{B}}_0) \ \text{and} \ \sum_{d \in \hat{\mathcal{B}}_0} \lambda_d = \tau_1,
$$

and denote its optimizer by $\hat\lambda_{1d}$ for $d \in \hat{\mathcal{B}}_0$. The set $\hat{\mathcal{A}}_1 = \{d \in \hat{\mathcal{B}}_0 : \hat\lambda_{1d} = 0\}$ estimates the set of predictors that contribute linearly, and $\hat{\mathcal{B}}_1 = \{d \in \hat{\mathcal{B}}_0 : \hat\lambda_{1d} > 0\}$ estimates the set of predictors that contribute beyond linearly.

Step 3

Identify from B^1 the predictors that contribute quadratically.

In Step 2 the set $\hat{\mathcal{A}}_1$ of linear predictors was identified. The linear effects of the predictors in $\hat{\mathcal{A}}_1$ are global in nature, and this property can be ensured via profiling (Severini & Wong, 1992). For $d \in \hat{\mathcal{A}}_1$ we denote the global-fit coefficient of the linear predictor $X_d$ by $\beta_{1d}$ or, in vector form, $\beta_{1\hat{\mathcal{A}}_1}$. For a given $\beta_{1\hat{\mathcal{A}}_1}$, we apply the backfitting algorithm of degree $p_d = 2$ with bandwidths $\lambda_d^{-1}$ $(d \in \hat{\mathcal{B}}_1)$ to the data $\{(X_{i\hat{\mathcal{B}}_1}, \, Y_i - \sum_{d \in \hat{\mathcal{A}}_1} X_{id} \beta_{1d}) : i = 1, \ldots, n\}$ and denote the corresponding estimates by $\hat\alpha^{\mathrm{BF}}(\beta_{1\hat{\mathcal{A}}_1})$ and $\hat m_d^{\mathrm{BF}}(x_d; \lambda_{\hat{\mathcal{B}}_1}^{-1}, 2 \times 1_{\#\hat{\mathcal{B}}_1}, \beta_{1\hat{\mathcal{A}}_1})$ $(d \in \hat{\mathcal{B}}_1)$; we have given $\hat m_d^{\mathrm{BF}}$ a fourth argument to emphasize its dependence on $\beta_{1\hat{\mathcal{A}}_1}$. Using profiling, we solve

$$
\min_{\beta_{1d} : d \in \hat{\mathcal{A}}_1} \frac{1}{n} \sum_{i=1}^{n} \Big\{ Y_i - \sum_{j \in \hat{\mathcal{A}}_1} X_{ij} \beta_{1j} - \hat\alpha^{\mathrm{BF}}(\beta_{1\hat{\mathcal{A}}_1}) - \sum_{d \in \hat{\mathcal{B}}_1} \hat m_d^{\mathrm{BF}}(X_{id}; \lambda_{\hat{\mathcal{B}}_1}^{-1}, 2 \times 1_{\#\hat{\mathcal{B}}_1}, \beta_{1\hat{\mathcal{A}}_1}) \Big\}^2
$$

to obtain the best estimate of $\beta_{1\hat{\mathcal{A}}_1}$, which we denote by $\hat\beta_{1\hat{\mathcal{A}}_1}(\lambda_{\hat{\mathcal{B}}_1})$ to emphasize its dependence on $\lambda_{\hat{\mathcal{B}}_1}$. With the above notation and definitions, we solve the optimization problem

$$
\begin{aligned}
\min_{\lambda_d : d \in \hat{\mathcal{B}}_1} \; & \frac{1}{n} \sum_{i=1}^{n} \Big[ Y_i - \sum_{j \in \hat{\mathcal{A}}_1} X_{ij} \hat\beta_{1j}(\lambda_{\hat{\mathcal{B}}_1}) - \hat\alpha^{\mathrm{BF}}\{\hat\beta_{1\hat{\mathcal{A}}_1}(\lambda_{\hat{\mathcal{B}}_1})\} - \sum_{d \in \hat{\mathcal{B}}_1} \hat m_d^{\mathrm{BF}}\{X_{id}; \lambda_{\hat{\mathcal{B}}_1}^{-1}, 2 \times 1_{\#\hat{\mathcal{B}}_1}, \hat\beta_{1\hat{\mathcal{A}}_1}(\lambda_{\hat{\mathcal{B}}_1})\} \Big]^2, \\
& \text{subject to } \lambda_d \ge 0 \ (d \in \hat{\mathcal{B}}_1) \ \text{and} \ \sum_{d \in \hat{\mathcal{B}}_1} \lambda_d = \tau_2
\end{aligned}
$$

for appropriately tuned τ2, and denote its optimizer by $\hat\lambda_{2d}$ $(d \in \hat{\mathcal{B}}_1)$. Then $\hat{\mathcal{A}}_2 = \{d \in \hat{\mathcal{B}}_1 : \hat\lambda_{2d} = 0\}$ estimates the set of predictors that contribute quadratically, and $\hat{\mathcal{B}}_2 = \{d \in \hat{\mathcal{B}}_1 : \hat\lambda_{2d} > 0\}$ estimates the set of predictors that contribute beyond quadratically.

Steps 4 to M + 1

Identify predictors that have kth-degree polynomial effects by using appropriate τk for k = 3, 4, ….

After Step 3, we have estimated the set of linear predictors and the set of quadratic predictors. In a straightforward manner Step 3 can be modified to identify the sets of predictors that contribute polynomially at degrees k = 3, k = 4, and so on, up to k = M.

Step M + 2

Fit the final model.

After Step M + 1, we have obtained the estimates $\hat{\mathcal{A}}_k$ for k = 0, 1, …, M and $\hat{\mathcal{B}}_M$. The final model is then estimated by combining the profiling and backfitting techniques as in Step 3. In this final step, we couple the backfitting algorithm with local linear smoothing, as we want to estimate only the function $m_d(\cdot)$ itself for $d \in \hat{\mathcal{B}}_M$.
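The overall flow of Steps 1 to M + 1 can be summarized by the schematic sketch below. The helper `select`, which is assumed to solve (3) or (4) on a subset of the predictors (for instance via the coordinate descent sketch of § 3·2) and return the optimized inverse bandwidths, is hypothetical, and the profiling of already-classified polynomial components from Step 3 onwards is omitted for brevity.

```python
def structure_recovery(X, Y, taus, M, select):
    """Schematic of Steps 1 to M+1: peel off noise, linear, quadratic, ...
    predictors in turn.  select(X_sub, Y, tau, degree) is an assumed solver for
    the constrained problem (3) (degree 0) or (4) (degree p) on the columns of
    X_sub; profiling of the components already classified is omitted here.
    """
    D = X.shape[1]
    B = list(range(D))              # current candidate set, B-hat_{p-1}
    A = {}                          # A[p]: predictors classified at degree p
    for p in range(M + 1):          # p = 0: noise; p = 1: linear; p = 2: quadratic; ...
        lam = select(X[:, B], Y, taus[p], degree=p)
        A[p] = [B[j] for j in range(len(B)) if lam[j] == 0]
        B = [B[j] for j in range(len(B)) if lam[j] > 0]
    return A, B                     # B is B-hat_M, the beyond-degree-M set
```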

6. Theoretical properties

In this section we study the consistency of the proposed structure recovery scheme. We first show that the estimated set of unimportant predictors is consistent.

Proposition 1

Suppose that $\tau_0 \to \infty$ such that $\tau_0^3/n \to 0$ as $n \to \infty$, and assume that Conditions A1–A5 in the Appendix hold. Then the solution to (3) satisfies $\hat\lambda_d \to \infty$ in probability for $d \in \mathcal{B}_0$ and $\hat\lambda_d \to 0$ in probability for $d \in \mathcal{A}_0$. Consequently, $\mathrm{pr}(\hat{\mathcal{A}}_0 = \mathcal{A}_0) \to 1$ as $n \to \infty$.

Remark 1

There is a small gap between Proposition 1 and the proposed procedure, which concerns the numerical behaviour of the optimizer. In our procedure $\hat{\mathcal{A}}_0$ is defined as $\{d : \hat\lambda_d = 0\}$, but Proposition 1 shows only that the solution to (3) satisfies $\hat\lambda_d \to 0$ in probability for $d \in \mathcal{A}_0$. Thus there would be closer agreement between the proposition and the algorithm if we had defined $\hat{\mathcal{A}}_0$ as $\{d : \hat\lambda_d < \delta\}$ for some small δ > 0. However, in our numerical experience so far, the optimizer is indeed sparse, returning exact zeros. Therefore, we keep the definition of $\hat{\mathcal{A}}_0$ as $\{d : \hat\lambda_d = 0\}$. This remark also applies to Theorem 1.

The consistency in Proposition 1 is readily extended to the estimated set of predictors that contribute polynomially at different degrees, resulting in the following consistency property for the proposed structure recovery scheme.

Theorem 1

Suppose that for k = 0, 1, …, M, $\tau_k \to \infty$ and $\tau_k^{2k+3}/n \to 0$ as $n \to \infty$, and assume that Conditions A1–A5 in the Appendix hold. Then the estimators $\hat{\mathcal{A}}_k$ (k = 0, …, M) and $\hat{\mathcal{B}}_M$ satisfy $\mathrm{pr}(\hat{\mathcal{A}}_k = \mathcal{A}_k \ \text{for } k = 0, \ldots, M, \ \hat{\mathcal{B}}_M = \mathcal{B}_M) \to 1$ as $n \to \infty$.

The proofs of Proposition 1 and Theorem 1 are given in the Appendix.

7. Simulation studies

We use simulation models adapted from Zhang et al. (2011) to study the finite-sample performance of the proposed method. In each model, predictors are generated as Xj = (Uj + ηU)/ (1 + η) for j = 1, … , D, where U1, … , UD, U are independent and identically distributed Un[0, 1] variables, with η at levels 0 and 0·5, resulting in pairwise correlations of 0 and 0·2. We compare our method with the smoothing spline analysis-of-variance method (Kimeldorf & Wahba, 1971; Gu, 2002), as well as the linear and nonlinear discovery method of Zhang et al. (2011) and its refitted version, in terms of both integrated squared error and predictor-type identification.

For an estimate $\hat f(x)$ of the additive model (1), define the integrated squared error $\mathrm{ISE}(\hat f) = E_X\{f(X) - \hat f(X)\}^2$, estimated via Monte Carlo using an independent test set of size 1000. The linear and nonlinear discovery methods, original and refitted, are designed to identify predictors having null, linear, nonlinear, or a mixture of linear and nonlinear effects, whereas the goal of the proposed method is to identify predictors that contribute polynomially at different degrees and predictors that contribute beyond polynomially of a specified degree M. In the simulation studies, we fixed M to be 2. We use the Bayesian information criterion for tuning parameter selection in all methods.
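A minimal sketch of two simulation ingredients described above, the correlated-uniform predictor design and the Monte Carlo estimate of the integrated squared error, is given below; the function names are illustrative.

```python
import numpy as np

def gen_predictors(n, D, eta, rng=None):
    """Predictors X_j = (U_j + eta * U)/(1 + eta) with U_1, ..., U_D, U iid Un[0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.uniform(size=(n, D))
    U0 = rng.uniform(size=(n, 1))
    return (U + eta * U0) / (1.0 + eta)

def integrated_squared_error(f_true, f_hat, X_test):
    """Monte Carlo estimate of ISE(f-hat) = E_X{f(X) - f-hat(X)}^2 on a test set."""
    return np.mean((f_true(X_test) - f_hat(X_test)) ** 2)
```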

Model 1

In the first data-generation model, $Y = f_1(X_1) + f_2(X_2) + f_3(X_3) + \varepsilon$ with $f_1(t) = 3t$, $f_2(t) = 2\sin(2\pi t)$ and $f_3(t) = 2(3t - 1)^2$. Here ε ~ N(0, σ²) is independent of the predictors. Two values of the standard deviation, σ = 1 and σ = 2, and two dimensions, D = 10 and D = 20, are considered. For all settings, X4, …, XD are unimportant. In terms of the linear and nonlinear discovery classification scheme, X1 has a purely linear effect, X2 has a purely nonlinear effect, and X3 has a mixture of linear and nonlinear effects (Zhang et al., 2011). For our method, X1 is a linear predictor, X3 is a quadratic predictor, and X2 contributes beyond quadratically. Training samples of size n = 100 are used. The results from 100 simulated datasets are reported in Tables 1 and 2.

Table 1.

Simulation results for Model 1: average integrated squared errors (×102) with standard deviations (×102) in parentheses

D    σ   η     Oracle 1      SSANOVA        LAND           LANDr          PSC            Oracle 2
10   1   0·0   12·5 (5·5)    23·9 (9·0)     11·9 (7·4)     14·9 (6·8)     11·6 (6·2)     9·8 (4·9)
10   1   0·5   13·2 (5·4)    25·5 (8·8)     58·9 (36·8)    17·2 (7·3)     12·3 (5·9)     10·4 (4·4)
10   2   0·0   44·0 (22·9)   92·7 (41·6)    82·1 (48·0)    71·2 (42·4)    52·0 (34·7)    37·3 (21·9)
10   2   0·5   50·3 (26·3)   100·4 (35·9)   142·6 (47·9)   88·1 (38·0)    67·4 (34·9)    42·8 (21·7)
20   1   0·0   12·8 (5·4)    45·5 (14·3)    16·4 (12·6)    18·7 (9·7)     14·8 (7·8)     9·9 (4·4)
20   1   0·5   13·4 (5·7)    50·6 (19·6)    64·4 (36·0)    24·4 (11·6)    14·3 (8·9)     11·2 (5·4)
20   2   0·0   45·1 (21·0)   187·0 (65·8)   135·8 (70·9)   130·7 (60·1)   69·3 (40·7)    37·8 (18·5)
20   2   0·5   45·7 (18·9)   192·8 (62·8)   169·5 (63·5)   155·3 (63·8)   76·2 (41·7)    40·4 (19·6)

D, dimension; σ, model error standard deviation; η, predictor correlation parameter; SSANOVA, smoothing spline analysis-of-variance method; LAND, linear and nonlinear discovery method; LANDr, linear and nonlinear discovery method, refitted version; PSC, the proposed polynomial structure classification method; Oracle 1, oracle for LAND and LANDr; Oracle 2, oracle for PSC.

Table 2.

Summary of predictor classification performance of the linear and nonlinear discovery method and the proposed method for Model 1

                            LAND                                     PSC
D    σ   η        L     NL    L-N   Nil    CC        L     Q     BQ    Nil    CC
10   1   0·0      0·9   0·8   1·0   6·5    54        0·9   1·0   1·0   7·0    91
10   1   0·5      1·0   0·2   0·9   5·9    10        1·0   1·0   1·0   7·0    92
10   2   0·0      1·0   0·4   1·0   4·8    5         0·9   1·0   0·9   7·0    74
10   2   0·5      1·0   0·0   0·7   3·8    0         0·6   0·9   0·6   7·0    35
     Average SD   0·1   0·4   0·2   1·5              0·3   0·2   0·2   0·0
20   1   0·0      1·0   0·7   1·0   15·6   34        1·0   1·0   1·0   17·0   90
20   1   0·5      1·0   0·2   0·9   14·8   4         1·0   1·0   1·0   17·0   93
20   2   0·0      1·0   0·4   1·0   11·0   1         0·9   0·9   0·9   17·0   72
20   2   0·5      0·9   0·1   0·7   10·1   0         0·6   0·9   0·6   17·0   33
     Average SD   0·1   0·4   0·2   2·5              0·3   0·3   0·2   0·0

LAND, linear and nonlinear discovery method; PSC, the proposed polynomial structure classification method; D, dimension; σ, model error standard deviation; η, predictor correlation parameter; L, linear; NL, nonlinear; L-N, linear-nonlinear mixture; Nil, noise; CC, proportion (%) of models with fully correct variable classifications; Q, quadratic; BQ, beyond quadratic; Average SD, block-averaged standard deviation.

Table 1 summarizes the performance of the smoothing spline method, the two linear and nonlinear discovery methods, the proposed polynomial structure classification method, and two oracle methods, which are included to assess the utility of the underlying classification schemes irrespective of estimation error. Oracle 1 is the linear and nonlinear discovery oracle that makes use of information on which predictors are noise, purely linear, purely nonlinear, or a mixture of linear and nonlinear. Oracle 2 is the oracle for our method and uses information on which predictors are noise, linear, quadratic, or beyond quadratic. The polynomial structure classification method has the smallest integrated squared error values among all methods for all generating distributions, and oracle 2 dominates oracle 1.

Table 2 summarizes the predictor classification performance of the methods under comparison. For the linear and nonlinear discovery method, the table entries are the average and standard deviation of the numbers of predictors identified as purely linear (X1), purely nonlinear (X2), a linear-nonlinear mixture (X3), and noise (X4, … , XD). Similarly, for the proposed method, the table entries are the average and standard deviation of the numbers of predictors identified as linear (X1), quadratic (X3), beyond quadratic (X2), and noise (X4, … , XD). Also reported are the percentages of times that all variables were correctly classified according to each method’s underlying classification scheme.

The proposed polynomial structure classification method performs well for all generative models, although both it and the linear and nonlinear discovery method generally perform less well for larger values of D, σ and η. Because the two methods are based on different classification schemes, predictor classification is not directly comparable, except with respect to the noise variables. The proposed method identified all noise variables for all 100 simulated datasets for all generative models, whereas the linear and nonlinear discovery method missed some noise predictors, more in cases where σ, η and especially D were large.

For one simulated dataset with σ = 1, η = 0·5 and D = 10, we plot in Fig. 1 the nonzero optimizers $\hat\lambda_j$ of (3) in the first variable selection step as functions of τ0. The optimizers $\hat\lambda_j$ corresponding to the important predictors, X1, X2 and X3, change to nonzero values quickly as τ0 increases and before any unimportant predictors show changes. Only four lines are visible in the graph because the optimizers corresponding to the other six predictors are zero for 0 ≤ τ0 ≤ 45.

Fig. 1.


Solution paths to optimization problem (3) for one simulated dataset from Model 1, plotted over 0 < τ0 ≤ 45. At τ0 = 40 the paths are, from bottom to top, for variables X9, X1, X3 and X2.

Model 2

In the second data-generation model, $Y = \sum_{j=1}^{7} f_j(X_j) + \varepsilon$ with $f_1(t) = 3t$, $f_2(t) = -4t$, $f_3(t) = 2t$, $f_4(t) = 2\sin(2\pi t)$, $f_5(t) = 3\sin(2\pi t)/\{2 - \sin(2\pi t)\}$, $f_6(t) = 5[0·1\sin(2\pi t) + 0·2\cos(2\pi t) + 0·3\{\sin(2\pi t)\}^2 + 0·4\{\cos(2\pi t)\}^3 + 0·5\{\sin(2\pi t)\}^3] + 2t$ and $f_7(t) = 2(3t - 1)^2$. Here ε ~ N(0, σ²) is independent of the predictors, with σ controlling the noise level; we take σ = 1 or 2. The dimension is D = 20, so there are 13 noise predictors, X8, …, X20. For the linear and nonlinear discovery methods, X1, X2 and X3 are linear, X4 and X5 are nonlinear, and X6 and X7 are mixed linear-nonlinear. For our method, X1, X2 and X3 are linear, X7 is quadratic, and X4, X5 and X6 are beyond quadratic. Training samples of size n = 250 are used. Tables 3 and 4 present the results from 100 simulated datasets in the same format as for Model 1. Conclusions about the performance of the methods are similar to those for Model 1.

Table 3.

Simulation results for Model 2: average integrated squared errors (×102) with standard deviations (×102) in parentheses

σ   η     Oracle 1   SSANOVA   LAND      LANDr     PSC       Oracle 2
1   0·0   15 (4)     25 (6)    16 (5)    17 (5)    14 (4)    14 (4)
1   0·5   15 (4)     25 (6)    16 (5)    17 (6)    15 (6)    13 (4)
2   0·0   52 (14)    91 (22)   58 (20)   66 (19)   56 (17)   48 (13)
2   0·5   47 (16)    86 (24)   77 (34)   73 (31)   64 (26)   44 (15)

σ, model error standard deviation; η, predictor correlation parameter; SSANOVA, smoothing spline analysis-of-variance method; LAND, linear and nonlinear discovery method; LANDr, linear and nonlinear discovery method, refitted version; PSC, the proposed polynomial structure classification method; Oracle 1, oracle for LAND and LANDr; Oracle 2, oracle for PSC.

Table 4.

Summary of predictor classification performance of the linear and nonlinear discovery method and the proposed method for Model 2

                        LAND                                    PSC
σ   η        L     NL    L-N   Nil    CC        L     Q     BQ    Nil    CC
1   0·0      3·0   1·6   2·0   12·5   47        3·0   1·0   3·0   13·0   99
1   0·5      3·0   1·1   2·0   11·8   16        2·9   1·0   3·0   13·0   89
2   0·0      3·0   1·5   1·9   11·2   14        2·8   1·0   3·0   13·0   74
2   0·5      3·0   0·8   1·8   9·7    1         2·2   1·0   2·8   13·0   23
    Average SD   0·2   0·6   0·2   2·0          0·3   0·1   0·1   0·1

LAND, linear and nonlinear discovery method; PSC, the proposed polynomial structure classification method; σ, model error standard deviation; η, predictor correlation parameter; L, linear; NL, nonlinear; L-N, linear-nonlinear mixture; Nil, noise; CC, proportion (%) of models with fully correct variable classifications; Q, quadratic; BQ, beyond quadratic; Average SD, block-averaged standard deviation.

8. Extension to partially linear models

The partially linear model is an extension of the additive model where, in addition to predictors X1, … , XD, there are predictors Z1, … , Zq known to contribute linearly, so that

$$
Y = \alpha + \sum_{d=1}^{D} f_d(X_d) + \sum_{j=1}^{q} Z_j \gamma_j + \varepsilon. \tag{5}
$$

The Zj commonly include indicators for categorical variables such as gender, race and location.

We illustrate the extension of our method to the partially linear model (5) via profiling by analysing the diabetes data from Willems et al. (1997). The goal of that study was to understand the prevalence of obesity, diabetes and other cardiovascular risk factors. The data include 18 variables on each of 403 African American subjects from central Virginia, with some missing values. The response variable in our analysis is glycosolated haemoglobin, which is of interest because a value greater than 7·0 is usually regarded as giving a positive diagnosis of diabetes.

We exclude two variables, the second systolic blood pressure and second diastolic blood pressure, as they are missingness-prone replicates of two other included variables, the first systolic and first diastolic blood pressures. Doing so leaves 15 candidate predictors. Twelve of these 15 variables are continuous and are rescaled to the unit interval: X1 = cholesterol, X2 = stabilized glucose, X3 = high density lipoprotein (hdl), X4 = cholesterol to hdl ratio, X5 = age, X6 = height, X7 = weight, X8 = first systolic blood pressure, X9 = first diastolic blood pressure, X10 = waist circumference, X11 = hip circumference, and X12 = postprandial time when samples were drawn. The remaining three variables are categorical variables for location, gender and frame. Location is a factor with two levels, Buckingham and Louisa; we set Z1 = 0 for Buckingham and Z1 = 1 otherwise. For gender we set Z2 =0 for female and Z2 =1 for male. The frame variable has three levels; we set Z3 = 0 and Z4 = 0 for small frames, Z3 = 1 and Z4 = 0 for medium frames, and Z3 = 0 and Z4 = 1 for large frames.
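The coding of the three categorical variables into Z1, …, Z4 described above can be written as follows; the data-frame column names are hypothetical, not those of the distributed dataset.

```python
import numpy as np
import pandas as pd

def code_categoricals(df: pd.DataFrame) -> np.ndarray:
    """Dummy coding of location, gender and frame as described in the text."""
    Z1 = (df['location'] == 'Louisa').astype(int)    # 0 = Buckingham, 1 = Louisa
    Z2 = (df['gender'] == 'male').astype(int)        # 0 = female, 1 = male
    Z3 = (df['frame'] == 'medium').astype(int)       # small frame: Z3 = Z4 = 0
    Z4 = (df['frame'] == 'large').astype(int)
    return np.column_stack([Z1, Z2, Z3, Z4])
```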

We fit the partially linear model (5) to study the dependence of glycosolated haemoglobin on the variables X1, … , X12, Z1, … , Z4, using the natural extension of the automatic structure recovery algorithm to discern the effects of X1, … , X12. We use the Bayesian information criterion to select tuning parameters. Our method identified X1, X3, X6, … , X12 as unimportant predictors; X5 was selected to have a linear effect, X4 a quadratic effect, and X2 a beyond-quadratic effect.

As Fig. 2(a) shows, on the original scale X4 has one outlier, 19·3, far outside the range of the other observations, which are between 1·5 and 12·2. Upon removing the outlier and reapplying our method, X1, X3, X4, X6, … , X12 are identified as unimportant, X5 as having a linear effect, and X2 as having a beyond-quadratic effect. The nonlinear fit for X2 is shown in Fig. 2(b).

Fig. 2.


Results of applying our method to the diabetes data: (a) index plot of variable X4, with an outlier visible in the upper left corner; (b) the beyond-quadratic fit for X2 (stabilized glucose) after removing the X4 outlier.

9. Discussion

We have proposed and studied the properties of a new automatic structure recovery and estimation method for the additive model. The method is readily generalizable and can be extended to generalized additive models (Hastie & Tibshirani, 1990) and survival data models (Cheng & Lee, 2009).

Classifying predictors into those that are unimportant and those that contribute through polynomials of specified degree is accomplished by using a nonparametric variable selection method based on the results of Stefanski et al. (2014). Like the nonparametric classification method proposed by Stefanski et al. (2014), each step of our new method uses regularization by bounding a sum of inverse kernel bandwidths. Other choices of penalty, such as $\sum_d \lambda_d^\gamma$ for fixed γ ≥ 0, are possible and worthy of study. Based on our limited experience, γ = 1 gives good overall performance. However, in light of the results in Stefanski et al. (2014), it is expected that γ < 1 would increase sparsity whereas γ > 1 would decrease sparsity but yield regression function estimators that have better mean-squared error properties.

As noted by a referee, modifications to our method and additional development of the theory to accommodate the case of a diverging dimension D would be useful directions to pursue. Alternatively, for data with large D, one could precede the use of our method with application of one of the screening procedures recently developed by Fan et al. (2011) and Cheng et al. (2014).

Acknowledgement

We thank two reviewers, an associate editor, and the editor for their most helpful comments. This research was supported by the U.S. National Institutes of Health and National Science Foundation.

Appendix

For d = 1, … , D, denote the density of Xd by fd(·). We assume the following technical conditions from Fan & Jiang (2005), as our method is based on backfitting coupled with local polynomial smoothing.

Condition A1

The kernel function K(·) is bounded and Lipschitz continuous with bounded support.

Condition A2

The densities fd(·) are Lipschitz continuous and bounded away from zero over their bounded supports.

Condition A3

For all pairs d and d′, the joint density functions of Xd and Xd′ are Lipschitz continuous on their supports.

Condition A4

For all d = 1, … , D, the (pd + 1)th derivatives of md(·) exist and are bounded and continuous.

Condition A5

The error has finite fourth moment, $E(|\varepsilon|^4) < \infty$.

For the backfitting estimate coupled with local polynomial smoothing as defined in § 3, we have the following asymptotic result.

Lemma A1

Assume that Conditions A1–A5 hold and that $h_d \to 0$ such that $n h_d^{2p_d+2} \to \infty$ as $n \to \infty$, for d = 1, …, D. Then

$$
\frac{1}{n} \sum_{i=1}^{n} \Big\{ Y_i - \hat\alpha^{\mathrm{BF}} - \sum_{d=1}^{D} \hat m_d^{\mathrm{BF}}(X_{id}; h, p) \Big\}^2 = \sigma^2 + O_p\Big( \sum_{d=1}^{D} h_d^{2p_d+2} \Big) + O_p\Big( \sum_{d=1}^{D} \frac{1}{n h_d} \Big). \tag{A1}
$$

Proof

Lemma A1 is a straightforward consequence of the third step in the proof of Theorem 1 in Fan & Jiang (2005). The term on the left-hand side of (A1) is $\mathrm{RSS}_1/n$ in their notation, which they show converges in probability to σ². Fan & Jiang (2005) did not explicitly state the rate of convergence, but it is readily deduced from their proof. We use their notation in the rest of this proof.

Note that $\mathrm{RSS}_1 = \varepsilon^{\mathrm{T}} A_{n2} \varepsilon + B^{\mathrm{T}} B + 2 B^{\mathrm{T}} (W_M - I_n) \varepsilon$ and $A_{n2} = I_n + S^{\mathrm{T}} S - S - S^{\mathrm{T}} + R_{n2}$, according to Fan & Jiang (2005, p. 903). Thus it is enough to track the convergence rates of the different terms. Also, $B = O(\sum_{d=1}^{D} h_d^{p_d+1}) + (\sum_{d=1}^{D} \bar m_d) O(1)$ almost surely, according to (B.4) in Fan & Jiang (2005), and $\sum_{d=1}^{D} \bar m_d = O_p(n^{-1/2})$ by definition. Consequently, $B^{\mathrm{T}} B / n = O_p(\sum_{d=1}^{D} h_d^{2p_d+2})$. By line 6 from the bottom of the right-hand column of p. 901 in Fan & Jiang (2005), $B^{\mathrm{T}} (W_M - I_n) \varepsilon / n = O_p(n^{-1} + \sum_{d=1}^{D} n^{-1/2} h_d^{p_d+1})$. By (B.11)–(B.24), $\varepsilon^{\mathrm{T}} (S^{\mathrm{T}} S - S - S^{\mathrm{T}}) \varepsilon / n = O_p\{\sum_{d=1}^{D} 1/(n h_d)\}$. As $R_{n2} = O(1_n 1_n^{\mathrm{T}} / n)$ uniformly over its elements, $\varepsilon^{\mathrm{T}} R_{n2} \varepsilon / n = O_p\{\sum_{d=1}^{D} 1/(n h_d)\}$. Combining these terms and noting that $n h_d^{2p_d+2} \to \infty$ completes the proof.

For k = 0, 1, …, let $\pi_k$ be the set of all polynomial functions of degree at most k. Define $\mathcal{I} = \{d : m_d(\cdot) \in \pi_{p_d}, \ d = 1, \ldots, D\}$ to be the set of predictors whose additive component functions are polynomial of degree at most that of the corresponding local polynomial used in the backfitting algorithm. Denote its complement by $\mathcal{I}^{C} = \{1, \ldots, D\} \setminus \mathcal{I}$.

Proposition A1

Suppose that Conditions A1–A5 hold, that $h_d \ge c_0 > 0$ for $d \in \mathcal{I}$ and some constant $c_0 > 0$, and that $h_d \to 0$ and $n h_d^{2p_d+2} \to \infty$ as $n \to \infty$ for $d \in \mathcal{I}^{C}$. Then

$$
\frac{1}{n} \sum_{i=1}^{n} \Big\{ Y_i - \hat\alpha^{\mathrm{BF}} - \sum_{d=1}^{D} \hat m_d^{\mathrm{BF}}(X_{id}; h, p) \Big\}^2 = \sigma^2 + O_p\Big( \sum_{d \in \mathcal{I}^{C}} h_d^{2p_d+2} \Big) + O_p\Big( \sum_{d \in \mathcal{I}^{C}} \frac{1}{n h_d} \Big). \tag{A2}
$$

Proof

Opsomer & Ruppert (1999) studied a backfitting estimator for semiparametric additive models and showed that the estimator of the parametric component is $n^{1/2}$-consistent. The condition $h_d \ge c_0 > 0$ for $d \in \mathcal{I}$ implies that the smoothing bandwidth $h_d$ is bounded away from zero for $d \in \mathcal{I}$. As $m_d(\cdot) \in \pi_{p_d}$ for $d \in \mathcal{I}$, there is no approximation bias in using local polynomial smoothing to estimate the corresponding component in the backfitting algorithm. Thus the techniques of Opsomer & Ruppert (1999) can be used to show that the estimator of $m_d(\cdot)$ for $d \in \mathcal{I}$ is $n^{1/2}$-consistent when $h_d \ge c_0 > 0$. This $n^{1/2}$ rate is faster than the rate of consistency attained when smoothing the other components.

Proof of Proposition 1

It is easy to show by contradiction that $\hat\lambda_d \to \infty$ in probability for $d \in \mathcal{B}_0$. If $\hat\lambda_d$ is bounded for some $d \in \mathcal{B}_0$, then the objective function of (3) converges to σ² plus an additional positive term due to smoothing bias. The additional bias term arises because, for bounded $\hat\lambda_d$, the corresponding smoothing bandwidth $h_d = 1/\hat\lambda_d$ does not shrink to zero. This is suboptimal, as the smallest limit of the objective is σ². This proves that $\hat\lambda_d \to \infty$ in probability for $d \in \mathcal{B}_0$.

In (3), local constant smoothing is used for every component function, so $p_d = 0$ for d = 1, …, D. The second term on the right-hand side of (A1) or (A2) is due to bias, and the third term is due to variance when using local polynomial smoothing for every component. The condition $\tau_0^3/n \to 0$ as $n \to \infty$ ensures that the variance term is dominated by the bias term. At the same time, note that a bounded and small $\lambda_d$ for $d \in \mathcal{A}_0$ does not introduce any extra term.

If $\sum_{d \in \mathcal{A}_0} \hat\lambda_d$ does not converge to 0 as $n \to \infty$, consider $\tilde\lambda_d = \hat\lambda_d \tau_0 / (\tau_0 - \sum_{d' \in \mathcal{A}_0} \hat\lambda_{d'})$ for $d \in \mathcal{B}_0$ and $\tilde\lambda_d = 0$ for $d \in \mathcal{A}_0$. In this case, $\tilde\lambda_d$ diverges to infinity at a faster rate than $\hat\lambda_d$ for $d \in \mathcal{B}_0$. Consequently, the asymptotic bias term on the right-hand side of (A2) evaluated at the $\tilde\lambda_d$ values is smaller than the bias term on the right-hand side of (A1) evaluated at the $\hat\lambda_d$ values. Here, smaller is in the sense of asymptotic order if $\sum_{d \in \mathcal{A}_0} \hat\lambda_d \to \infty$, and in the sense of the constant multiplying the asymptotic order if $\sum_{d \in \mathcal{A}_0} \hat\lambda_d$ is bounded. Thus a set of $\hat\lambda_d$ values with $\sum_{d \in \mathcal{A}_0} \hat\lambda_d$ not converging to 0 as $n \to \infty$ is suboptimal, as we are solving the minimization problem (3). This shows that $\hat\lambda_d \to 0$ in probability for $d \in \mathcal{A}_0$ and completes the proof.

Proof of Theorem 1

From Proposition 1, we have that $\mathrm{pr}(\hat{\mathcal{A}}_0 = \mathcal{A}_0) \to 1$ as $n \to \infty$. Conditional on $\hat{\mathcal{A}}_0 = \mathcal{A}_0$, we can prove that $\mathrm{pr}(\hat{\mathcal{A}}_1 = \mathcal{A}_1) \to 1$ as $n \to \infty$ using arguments similar to those in the proof of Proposition 1. This process is iterated to complete the proof of the theorem.

References

  1. Cheng M-Y, Lee C-Y. Local polynomial estimation of hazard rates under random censoring. J. Chin. Statist. Assoc. 2009;47:19–38.
  2. Cheng M-Y, Honda T, Li J, Peng H. Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Ann. Statist. 2014;42:1819–49.
  3. Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 2004;57:1413–57.
  4. Engle RF, Granger CWJ, Rice J, Weiss A. Nonparametric estimates of the relation between weather and electricity sales. J. Am. Statist. Assoc. 1986;81:310–20.
  5. Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Chapman & Hall; London: 1996.
  6. Fan J, Jiang J. Nonparametric inference for additive models. J. Am. Statist. Assoc. 2005;100:890–907.
  7. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 2001;96:1348–60.
  8. Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statist. Sinica. 2010;20:101–48.
  9. Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high dimensional additive models. J. Am. Statist. Assoc. 2011;106:544–57. doi: 10.1198/jasa.2011.tm09779.
  10. Fan J, Ma Y, Dai W. Nonparametric independence screening in sparse ultra-high dimensional varying coefficient models. J. Am. Statist. Assoc. 2014;109:1270–84. doi: 10.1080/01621459.2013.879828.
  11. Friedman JH, Stuetzle W. Projection pursuit regression. J. Am. Statist. Assoc. 1981;76:817–23.
  12. Friedman JH, Hastie TJ, Hofling H, Tibshirani RJ. Pathwise coordinate optimization. Ann. Appl. Statist. 2007;1:302–32.
  13. Fu WJ. Penalized regressions: The bridge versus the lasso. J. Comp. Graph. Statist. 1998;7:397–416.
  14. Gu C. Smoothing Spline ANOVA Models. Springer; New York: 2002.
  15. Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman & Hall; London: 1990.
  16. Huang J, Horowitz JL, Wei F. Variable selection in nonparametric additive models. Ann. Statist. 2010;38:2282–313. doi: 10.1214/09-AOS781.
  17. Huang J, Wei F, Ma S. Semiparametric regression pursuit. Statist. Sinica. 2012;22:1403–26. doi: 10.5705/ss.2010.298.
  18. Kimeldorf GS, Wahba G. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic. 1971;33:82–95.
  19. Opsomer JD, Ruppert D. A root-n consistent backfitting estimator for semiparametric additive modeling. J. Comp. Graph. Statist. 1999;8:715–32.
  20. Ravikumar PK, Lafferty JD, Liu H, Wasserman L. Sparse additive models. J. R. Statist. Soc. B. 2009;71:1009–30.
  21. Severini TA, Wong WH. Profile likelihood and conditionally parametric models. Ann. Statist. 1992;20:1768–802.
  22. Shen X, Pan W, Zhu Y, Zhou H. On constrained and regularized high-dimensional regression. Ann. Inst. Statist. Math. 2013;65:807–32. doi: 10.1007/s10463-012-0396-3.
  23. Stefanski LA, Wu Y, White K. Variable selection in nonparametric classification via measurement error model selection likelihoods. J. Am. Statist. Assoc. 2014;109:574–89. doi: 10.1080/01621459.2013.858630.
  24. Tibshirani RJ. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996;58:267–88.
  25. Willems JP, Saunders JT, Hunt DE, Schorling JB. Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Med. J. 1997;90:814–20. doi: 10.1097/00007611-199708000-00008.
  26. Wu TT, Lange K. Coordinate descent procedures for lasso penalized regression. Ann. Appl. Statist. 2008;2:224–44.
  27. Xia Y, Zhang W, Tong H. Efficient estimation for semivarying-coefficient models. Biometrika. 2004;91:661–81.
  28. Zhang HH, Lu W. Adaptive lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
  29. Zhang HH, Cheng G, Liu Y. Linear or nonlinear? Automatic structure discovery for partially linear models. J. Am. Statist. Assoc. 2011;106:1099–112. doi: 10.1198/jasa.2011.tm10281.
  30. Zou H. The adaptive lasso and its oracle properties. J. Am. Statist. Assoc. 2006;101:1418–29.
