Published in final edited form as: Stat Sin. 2011 Apr;21(2):679–705. doi: 10.5705/ss.2011.030a

Surface Estimation, Variable Selection, and the Nonparametric Oracle Property

Curtis B Storlie, Howard D Bondell, Brian J Reich, Hao Helen Zhang

Abstract

Variable selection for multivariate nonparametric regression is an important, yet challenging, problem due, in part, to the infinite dimensionality of the function space. An ideal selection procedure should be automatic, stable, easy to use, and have desirable asymptotic properties. In particular, we define a selection procedure to be nonparametric oracle (np-oracle) if it consistently selects the correct subset of predictors and at the same time estimates the smooth surface at the optimal nonparametric rate, as the sample size goes to infinity. In this paper, we propose a model selection procedure for nonparametric models, and explore the conditions under which the new method enjoys the aforementioned properties. Developed in the framework of smoothing spline ANOVA, our estimator is obtained via solving a regularization problem with a novel adaptive penalty on the sum of functional component norms. Theoretical properties of the new estimator are established. Additionally, numerous simulated and real examples further demonstrate that the new approach substantially outperforms other existing methods in the finite sample setting.

Keywords: Adaptive LASSO, Nonparametric Regression, Regularization Method, Variable Selection, Smoothing Spline ANOVA

1 Introduction

In this paper, we consider the multiple predictor nonparametric regression model yi = f (xi) + εi, i = 1, …, n, where f is the unknown regression function, xi = (x1,i,…, xp,i) is a p-dimensional vector of predictors, and the εi’s are independent noise terms with mean 0 and variances $\sigma_i^2$. Many approaches to this problem have been proposed, such as kernel regression (Nadaraya 1964 and others) and locally weighted polynomial regression (LOESS; Cleveland 1979). See Schimek (2000) for a detailed list of references. When there are multiple predictors, these procedures suffer from the well known curse of dimensionality. Additive models (GAMs; Hastie & Tibshirani 1990) avoid some of the problems with high dimensionality and have been shown to be quite useful in cases when the true surface is nearly additive. A generalization of additive modeling is the Smoothing Spline ANOVA (SS-ANOVA) approach (Wahba 1990, Stone, Buja & Hastie 1994, Wahba, Wang, Gu, Klein & Klein 1995, Lin 2000, and Gu 2002). In SS-ANOVA, the function f is decomposed into several orthogonal functional components.

We are interested in the variable selection problem in the context of multiple predictor nonparametric regression. For example, it might be thought that the function f depends on only a subset of the p predictors. Traditionally this problem has been addressed with stepwise or best subset model selection. The MARS procedure (Friedman 1991) and variations thereof (Stone, Hansen, Kooperberg & Truong 1997) build an estimate of f by adding and deleting individual basis functions in a stepwise manner so that the omission of entire variables occurs as a side effect. However, stepwise variable selection is known to be unstable due to its inherent discreteness (Breiman 1995). The COmponent Selection and Smoothing Operator (COSSO; Lin & Zhang 2006) performs variable selection via continuous shrinkage in SS-ANOVA models by penalizing the sum of norms of the functional components. Since each of the components is continuously shrunk towards zero, the resulting estimate is more stable than in subset or stepwise regression.

What are the desired properties of a variable selection procedure? For the parametric linear model, Fan & Li (2001) discuss the oracle property. A method is said to possess the oracle property if it selects the correct subset of predictors with probability tending to one and estimates the non-zero parameters as efficiently as would be possible if we knew which variables were uninformative ahead of time. Parametric methods with the oracle property include those of Fan & Li (2001) and Zou (2006). In the context of nonparametric regression, we extend the notion of the oracle property. We say a nonparametric regression estimator has the nonparametric (np)-oracle property if it selects the correct subset of predictors with probability tending to one and estimates the regression surface f at the optimal nonparametric rate.

None of the aforementioned nonparametric regression methods have been demonstrated to possess the np-oracle property. In particular, COSSO has a tendency to over-smooth the nonzero functional components in order to set the unimportant functional components to zero. In this paper we propose the adaptive COSSO (ACOSSO) to alleviate this major stumbling block. The intuition behind the ACOSSO is to penalize each component differently so that more flexibility is given to estimate functional components with more trend and/or curvature, while penalizing unimportant components more heavily. Hence it is easier to shrink uninformative components to zero without much degradation to the overall model fit. This is motivated by the adaptive LASSO procedure for linear models of Zou (2006). We explore a special case under which the ACOSSO possesses the np-oracle property. This is the first result of this type for a nonparametric regression estimator. The practical benefit of possessing this property is demonstrated on several real and simulated data examples where the ACOSSO substantially outperforms other existing methods.

In Section 2 we review the necessary literature on smoothing spline ANOVA. The ACOSSO is introduced in Section 3 and its asymptotic properties are presented in Section 4. In Section 5 we discuss the computational details of the estimate. Its superior performance to the COSSO and MARS is demonstrated on simulated data in Section 6 and real data in Section 7. Section 8 concludes. Proofs are given in an appendix.

2 Smoothing Splines and the COSSO

In this section we review only the necessary concepts of SS-ANOVA needed for development. For a more detailed overview of Smoothing Splines and SS-ANOVA see Wahba (1990), Wahba et al. (1995), Schimek (2000), Gu (2002), and Berlinet & Thomas-Agnan (2004).

In the smoothing spline literature it is typically assumed that f ∈ ℱ where ℱ is a reproducing kernel Hilbert space (RKHS). Denote the reproducing kernel (r.k.), inner product, and norm of ℱ as K, 〈·, ·〉, and ∥ · ∥ respectively. Often ℱ is chosen to contain only functions with a certain degree of smoothness. For example, functions of one variable are often assumed to belong to the second order Sobolev space, 𝒮2 = {g : g, g′ are absolutely continuous and g″ ∈ ℒ2[0, 1]}.

Smoothing spline models are usually assumed, without loss of generality, to be over $x \in \mathcal{X} = [0, 1]^p$. In what is known as smoothing spline (SS)-ANOVA, the space ℱ is constructed by first taking a tensor product of p one dimensional RKHS’s. For example, let ℋj be a RKHS on [0, 1] such that ℋj = {1} ⊕ ℋ̄j, where {1} is the RKHS consisting of only the constant functions and ℋ̄j is the RKHS consisting of functions fj ∈ ℋj such that $\langle f_j, 1\rangle_j = 0$. The space ℱ can be taken to be the tensor product of the ℋj, j = 1,…, p, which can be written as

$$\mathcal{F} = \bigotimes_{j=1}^{p}\mathcal{H}_j = \{1\} \oplus \left\{\bigoplus_{j=1}^{p}\bar{\mathcal{H}}_j\right\} \oplus \left\{\bigoplus_{j<k}\left(\bar{\mathcal{H}}_j \otimes \bar{\mathcal{H}}_k\right)\right\} \oplus \cdots. \tag{1}$$

The right side of the above equation has decomposed ℱ into the constant space, the main effect spaces, the two-way interaction spaces, etc. which gives rise to the name SS-ANOVA. Typically (1) is truncated so that ℱ includes only lower order interactions for better estimation and ease of interpretation. Regardless of the order of the interactions involved, we see that the space ℱ can be written in general as

$$\mathcal{F} = \{1\} \oplus \left\{\bigoplus_{j=1}^{q}\mathcal{F}_j\right\} \tag{2}$$

where {1}, ℱ1,…, ℱq is an orthogonal decomposition of the space and each of the ℱj is itself a RKHS. In this presentation we will focus on two special cases, the additive model $f(x) = b + \sum_{j=1}^{p} f_j(x_j)$ and the two-way interaction model $f(x) = b + \sum_{j=1}^{p} f_j(x_j) + \sum_{j<k} f_{jk}(x_j, x_k)$, where b ∈ {1}, fj ∈ ℋ̄j and fjk ∈ ℋ̄j ⊗ ℋ̄k.

A traditional smoothing spline type estimate, f̂, is given by the function f ∈ ℱ that minimizes

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 + \lambda_0\sum_{j=1}^{q}\frac{1}{\theta_j}\|P^j f\|^2, \tag{3}$$

where $P^j f$ is the orthogonal projection of f onto ℱj, j = 1,…, q, which form an orthogonal partition of the space as in (2). We will use the convention 0/0 = 0, so that when θj = 0 the minimizer satisfies $\|P^j f\| = 0$.

The COSSO (Lin & Zhang 2006) penalizes on the sum of the norms instead of the squared norms as in (3) and hence achieves sparse solutions (i.e. some of the functional components are estimated to be exactly zero). Specifically, the COSSO estimate, , is given by the function f ∈ ℱ that minimizes

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 + \lambda\sum_{j=1}^{q}\|P^j f\|. \tag{4}$$

In Lin & Zhang (2006), ℱ was formed using 𝒮² with squared norm

$$\|f\|^2 = \left(\int_0^1 f(x)\,dx\right)^2 + \left(\int_0^1 f'(x)\,dx\right)^2 + \int_0^1\left(f''(x)\right)^2 dx \tag{5}$$

for each of the ℋj of (1). The reproducing kernel can be found in Wahba (1990).
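For reference, the kernel associated with the norm in (5) can be written in terms of scaled Bernoulli polynomials (Wahba 1990, Gu 2002). The following sketch evaluates this kernel in R; the function names are ours and the code is illustrative rather than the authors' implementation.

```r
# Scaled Bernoulli polynomials k_r(t) = B_r(t)/r! used in the kernel for the
# second order Sobolev space on [0, 1] under the norm in (5).
k1 <- function(t) t - 0.5
k2 <- function(t) (k1(t)^2 - 1/12) / 2
k4 <- function(t) (k1(t)^4 - k1(t)^2 / 2 + 7/240) / 24

# Reproducing kernel of the zero-mean subspace (the H-bar_j of (1)):
K.sob <- function(s, t) k1(s) * k1(t) + k2(s) * k2(t) - k4(abs(s - t))

# Adding the constant space {1} gives the kernel of the full space:
K.full <- function(s, t) 1 + K.sob(s, t)

# Gram matrix on a set of design points in [0, 1]:
gram <- function(x, y = x) outer(x, y, K.sob)
gram(seq(0.1, 0.9, length.out = 5))
```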

3 An Adaptive Proposal

Although the COSSO is a significant improvement over classical stepwise procedures, it tends to oversmooth functional components. This seemingly prevents COSSO from achieving a nonparametric version of the oracle property (defined in Section 4). To alleviate this problem, we propose an adaptive approach. The proposed adaptive COSSO uses individually weighted norms to smooth each of the components. Specifically we select as our estimate the function f ∈ ℱ that minimizes

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 + \lambda\sum_{j=1}^{q} w_j\|P^j f\|, \tag{6}$$

where the 0 < wj ≤ ∞ are weights that can depend on an initial estimate of f, which we denote f̃. For example, we could initially estimate f via the traditional smoothing spline of (3) with all θj = 1 and λ0 chosen by the generalized cross validation (GCV) criterion (Craven & Wahba 1979). Note that there is only one tuning parameter, λ, in (6). The wj’s are not tuning parameters like the θj’s in (3); rather, they are weights to be estimated from the data in a manner described below.

3.1 Choosing the Adaptive Weights, wj

Given an initial estimate f̃, we wish to construct wj’s so that the prominent functional components enjoy the benefit of a smaller penalty relative to less important functional components. In contrast to the linear model, there is no single coefficient, or set of coefficients, to measure the importance of a variable. One possible scheme is to make use of the L2 norm of $P^j\tilde f$, given by $\|P^j\tilde f\|_{L_2} = \left(\int_{\mathcal{X}}(P^j\tilde f(x))^2\,dx\right)^{1/2}$. For a reasonable initial estimator, this quantity will be a consistent estimate of $\|P^j f\|_{L_2}$, which is often used to quantify the importance of functional components. This would suggest using

$$w_j = \|P^j\tilde f\|_{L_2}^{-\gamma}. \tag{7}$$

In Section 4, the use of these weights results in favorable theoretical properties.

There are other reasonable possibilities one could consider for the wj’s. In fact, at first glance, as an extension of the adaptive LASSO for linear models, it may seem more natural to make use of an estimate of the RKHS norm used in the COSSO penalty and set $w_j = \|P^j\tilde f\|^{-\gamma}$. However, the use of these weights is not recommended because they do not provide an estimator with sound theoretical properties. Consider for example building an additive model using RKHS’s with norm given by (5). Then this wj essentially requires estimation of the functionals $\int_0^1\left(f_j''(x_j)\right)^2 dx_j$, which is known to be a harder problem (requiring more smoothness assumptions) than estimating $\int_0^1 f_j^2(x_j)\,dx_j$ (Efromovich & Samarov 2000). In fact, using $w_j = \|P^j\tilde f\|^{-\gamma}$ instead of (7) would at the very least require stronger smoothness assumptions about the underlying function f in Section 4 to achieve asymptotically correct variable selection. Because of this and the results of preliminary empirical studies, we recommend the use of the weights in (7) instead.
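The weights in (7) are straightforward to compute from any initial additive fit. The sketch below approximates each $\|P^j\tilde f\|_{L_2}$ by averaging the squared fitted component over a grid; the evaluator `comp.fun` is a hypothetical interface (any initial smoother, such as the traditional smoothing spline of (3), could supply it).

```r
# Sketch: adaptive weights (7) from an initial additive estimate f-tilde.
# comp.fun(j, x) is assumed to return the j-th fitted component of f-tilde
# evaluated at the points x in [0, 1]; this interface is ours, for illustration.
acosso.weights <- function(comp.fun, p, gamma = 2, n.grid = 1000) {
  grid <- seq(0, 1, length.out = n.grid)
  w <- numeric(p)
  for (j in 1:p) {
    fj <- comp.fun(j, grid)      # P^j f-tilde on the grid
    l2 <- sqrt(mean(fj^2))       # grid approximation of the L2 norm
    w[j] <- l2^(-gamma)          # weight in (7); Inf if the component is identically zero
  }
  w
}
```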

4 Asymptotic Properties

In this section we demonstrate the desirable asymptotic properties of the ACOSSO. In particular, we show that the ACOSSO possesses a nonparametric analog of the oracle property. This result is the first of its type for nonparametric surface estimation.

Throughout this section we assume the true regression model is yi = f0 (xi) + εi, i = 1,…, n. The regression function f0 ∈ ℱ is additive in the predictors so that ℱ = {1} ⊕ ℱ1 ⊕⋯⊕ ℱp where each ℱj is a space of functions corresponding to xj . We assume that εi are independent with Eεi = 0 and are uniformly sub-Gaussian. Following van de Geer (2000), we define a sequence of random variables to be uniformly sub-Gaussian if there exists some K > 0 and C > 0 such that

$$\sup_{n}\max_{i=1,\dots,n} E\left[\exp\left(\varepsilon_i^2 / K\right)\right] \le C. \tag{8}$$

Let 𝒮² denote the RKHS of the second order Sobolev space endowed with the norm in (5), with 𝒮² = {1} ⊕ 𝒮̄². Also, define the squared norm of a function at the design points as $\|f\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} f^2(x_i)$. Let U be the set of indexes for all uninformative functional components in the model $f_0 = b + \sum_{j=1}^{p} P^j f_0$. That is, U = {j : Pj f0 ≡ 0}.

Theorem 1 below states the convergence rate of ACOSSO when used to estimate an additive model. Corollary 1 following Theorem 1 states that the weights given by (7) lead to an estimator with optimal convergence rate of n−2/5. We sometimes write wj and λ as wj,n and λn respectively to explicitly denote the dependence on n. We also use the notation Xn ~ Yn to mean Xn/Yn = Op(1) and Yn/Xn = Op(1) for some sequences Xn and Yn. The proofs of Theorem 1 and the other results in this section are deferred to the appendix.

Theorem 1. (Convergence Rate)

Assume that $f_0 \in \mathcal{F} = \{1\} \oplus \bar{\mathcal{S}}_1^2 \oplus \cdots \oplus \bar{\mathcal{S}}_p^2$, where $\bar{\mathcal{S}}_j^2$ is the space 𝒮̄² corresponding to the jth input variable, xj. Also assume that the εi are independent and satisfy (8). Consider the ACOSSO estimate, f̂, defined in (6). Suppose that $w_{j,n}^{-1} = O_p(1)$ for j = 1,…, p and further that $w_{j,n} = O_p(1)$ for j ∈ Uc. Also assume that $\lambda_n^{-1} = O_p(n^{4/5})$. If

  1. Pj f0 ≠ 0 for some j, then $\|\hat f - f_0\|_n = O_p\left(\lambda_n^{1/2} w_{*,n}^{1/2}\right)$, where $w_{*,n} = \min\{w_{1,n},\dots,w_{p,n}\}$.

  2. Pj f0 = 0 for all j, then $\|\hat f - f_0\|_n = O_p\left(\max\left\{n^{-1/2},\ n^{-2/3}\lambda_n^{-1/3} w_{*,n}^{-1/3}\right\}\right)$.

Corollary 1. (Optimal Convergence of ACOSSO)

Assume that $f_0 \in \mathcal{F} = \{1\} \oplus \bar{\mathcal{S}}_1^2 \oplus \cdots \oplus \bar{\mathcal{S}}_p^2$ and that the εi are independent and satisfy (8). Consider the ACOSSO estimate, f̂, with weights $w_{j,n} = \|P^j\tilde f\|_{L_2}^{-\gamma}$, for f̃ given by the traditional smoothing spline (3) with θ = 1p and λ0,n ~ n−4/5. If γ > 3/4 and λn ~ n−4/5, then ∥f̂ − f0∥n = Op(n−2/5) if Pj f0 ≠ 0 for some j, and ∥f̂ − f0∥n = Op(n−1/2) otherwise.

We now turn to discuss the attractive properties of the ACOSSO in terms of model selection. In Theorem 2 and Corollary 2 we will consider functions in the second order Sobolev space of periodic functions, denoted $\mathcal{S}^2_{per}$, where $\mathcal{S}^2_{per} = \{1\} \oplus \bar{\mathcal{S}}^2_{per}$. We also assume that the observations come from a tensor product design. That is, the design points are $\{(x_{1,i_1}, x_{2,i_2},\dots, x_{p,i_p}) : i_j = 1,\dots,m,\ j = 1,\dots,p\}$, where xj,k = k/m, k = 1,…,m, j = 1,…, p. Therefore the total sample size is $n = m^p$. These assumptions were also used by Lin & Zhang (2006) to examine the model selection properties of the COSSO.

Theorem 2. (Selection Consistency)

Assume a tensor product design and $f_0 \in \mathcal{F} = \{1\} \oplus \bar{\mathcal{S}}^2_{per,1} \oplus \cdots \oplus \bar{\mathcal{S}}^2_{per,p}$, where $\bar{\mathcal{S}}^2_{per,j}$ is the space $\bar{\mathcal{S}}^2_{per}$ corresponding to the jth input variable, xj. Also assume that the εi are independent and satisfy (8). The ACOSSO estimate f̂ will be such that $P^j\hat f \equiv 0$ for all j ∈ U with probability tending to one if and only if $n w_{j,n}^2\lambda_n^2 \to_p \infty$ as $n \to \infty$ for all j ∈ U.

We will say a nonparametric regression estimator, f̂, has the nonparametric (np)-oracle property if ∥f̂ − f0∥n → 0 at the optimal rate while also setting $P^j\hat f \equiv 0$ for all j ∈ U with probability tending to one. This means that the error associated with surface estimation has the same order as that for any other optimal estimator. One could also define a strong np-oracle property which would require asymptotically correct variable selection and the error being asymptotically the same as an oracle estimator (an estimator where the correct variables were known in advance). That is, to possess the strong np-oracle property, the proposed estimator must match the constant as well as the rate of an oracle estimator. The strong np-oracle definition is slightly ambiguous however, as one must specify what estimator should be used as the oracle estimator for comparison (e.g., smoothing spline with one λ, smoothing spline with differing λj’s, etc.). The weaker version of the np-oracle property, which was stated first, avoids this dilemma. The corollary below states that the ACOSSO with weights given by (7) has the np-oracle property.

Corollary 2. (Nonparametric Oracle Property)

Assume a tensor product design and f0 ∈ ℱ, where $\mathcal{F} = \{1\} \oplus \bar{\mathcal{S}}^2_{per,1} \oplus \cdots \oplus \bar{\mathcal{S}}^2_{per,p}$, and that the εi are independent and satisfy (8). Define weights $w_{j,n} = \|P^j\tilde f\|_{L_2}^{-\gamma}$, for f̃ given by the traditional smoothing spline with λ0 ~ n−4/5, and γ > 3/4. If also λn ~ n−4/5, then the ACOSSO estimator has the np-oracle property.

Remark 1

The derivation of the variable selection properties of the adaptive COSSO requires a detailed investigation of the eigen-properties of the reproducing kernel, which is generally intractable. However, Theorem 2 and Corollary 2 assume that f belongs to the class of periodic functions while the design points form a tensor product design. This makes the derivation more tractable, since the eigenfunctions and eigenvalues of the associated reproducing kernel have a particularly simple form. Results for this specific design are often instructive for general cases, as suggested in Wahba (1990). We conjecture that the selection consistency of the adaptive COSSO also holds more generally, and this is also supported by the numerical results in Section 6. The derivation of variable selection properties in the general case is a technically difficult problem which is worthy of future investigation. Neither the tensor product design nor the periodic functions assumption is required for establishing the MSE consistency of the adaptive COSSO estimator in Theorem 1 and Corollary 1.

Remark 2

The COSSO (which is the ACOSSO with wj,n = 1 for all j and n) does not appear to enjoy the np-oracle property. Notice that by Theorem 2, λn must go to zero slower than n−1/2 in order to achieve asymptotically correct variable selection. However, even if λn is as small as λn = n−1/2, Theorem 1 implies that the convergence rate is Op(n−1/4), which is not optimal. These results are not surprising given that the linear model can be obtained as a special case of ACOSSO by using $\mathcal{F} = \{f : f = \beta_0 + \sum_{j=1}^{p}\beta_j(x_j - 1/2)\}$. For this ℱ the COSSO reduces to the LASSO, which is known to be unable to achieve the oracle property (Knight & Fu 2000, Zou 2006). In contrast, the ACOSSO reduces to the adaptive LASSO (Zou 2006), which is well known to achieve the oracle property.

Remark 3

The error terms εi in Theorems 1 and 2 need only be independent with sub-Gaussian tails (8). The common assumption that $\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$ satisfies (8). But the distributions need not be Gaussian, and further need not even be the same for each of the εi. In particular, this allows for heteroskedastic errors.

Remark 4

Theorems 1 and 2 assume an additive model, in which case functional component selection is equivalent to variable selection. In higher order interaction models, the main effect for xj and all of the interaction functional components involving xj must be set to zero in order to eliminate xj from the model and achieve true variable selection. Thus, in some other areas of the paper, when interactions are involved, we use the term variable selection to refer to functional component selection.

5 Computation

Since the ACOSSO in (6) may be viewed as the COSSO in (4) with an “adaptive” RKHS, the computation proceeds in a similar manner as that for the COSSO. We first present an equivalent formulation of the ACOSSO, then describe how to minimize this equivalent formulation for a fixed value of the tuning parameter. Discussion of tuning parameter selection is delayed until Section 5.3.

5.1 Equivalent Formulation

Consider the problem of finding θ = (θ1,…, θq)′ and f ∈ ℱ to minimize

$$\frac{1}{n}\sum_{i=1}^{n}\left[y_i - f(x_i)\right]^2 + \lambda_0\sum_{j=1}^{q}\theta_j^{-1} w_j^{2-\vartheta}\|P^j f\|^2 + \lambda_1\sum_{j=1}^{q} w_j^{\vartheta}\theta_j, \quad\text{subject to } \theta_j \ge 0\ \forall j, \tag{9}$$

where 0 ≤ ϑ ≤ 2, λ0 > 0 is a fixed constant, and λ1 > 0 is a smoothing parameter. The following Lemma says that the above optimization problem is equivalent to (6). This has important implications for computation since (9) is easier to solve.

Lemma 1

Set $\lambda_1 = \lambda^2/(4\lambda_0)$. (i) If f̂ minimizes (6), set $\hat\theta_j = \lambda_0^{1/2}\lambda_1^{-1/2} w_j^{1-\vartheta}\|P^j\hat f\|$, j = 1,…, q; then the pair (θ̂, f̂) minimizes (9). (ii) On the other hand, if a pair (θ̂, f̂) minimizes (9), then f̂ minimizes (6).

5.2 Computational Algorithm

The equivalent form in (9) gives a class of equivalent problems for ϑ ∈ [0, 2]. For simplicity we will consider the case ϑ = 0, since the ACOSSO can then be viewed as having the same equivalent form as the COSSO with an adaptive RKHS. For a given value of θ = (θ1,…, θq)′, the minimizer of (9) is the smoothing spline of (3) with θj replaced by $\theta_j/w_j^2$. Hence it is known (Wahba 1990, for example) that the solution has the form $f(x) = b + \sum_{i=1}^{n} c_i K_{w,\theta}(x, x_i)$, where $c \in \mathbb{R}^n$, $b \in \mathbb{R}$, and $K_{w,\theta} = \sum_{j=1}^{q}(\theta_j/w_j^2)K^j$, with $K^j$ the reproducing kernel of ℱj in the decomposition (2).

Let $K^j$ be the n × n matrix $\{K^j(x_i, x_k)\}_{i,k=1}^{n}$ and let 1n be the column vector consisting of n ones. Write the vector f = (f(x1),…,f(xn))′ as $f = b 1_n + \left(\sum_{j=1}^{q}(\theta_j/w_j^2)K^j\right)c$, let y = (y1,…, yn)′, and define $\|\upsilon\|_n^2 = \frac{1}{n}\sum_{i=1}^{n}\upsilon_i^2$ for a vector υ of length n. Now, for fixed θ, minimizing (9) is equivalent to

$$\min_{b,c}\left\{\left\|y - b 1_n - \sum_{j=1}^{q}\frac{\theta_j}{w_j^2}K^j c\right\|_n^2 + \lambda_0\sum_{j=1}^{q}\frac{\theta_j}{w_j^2}\,c' K^j c\right\} \tag{10}$$

which is just the traditional smoothing spline problem in Wahba (1990). On the other hand if b and c were fixed, the θ that minimizes (9) is the same as the solution to

$$\min_{\theta}\left\{\|z - G\theta\|_n^2 + \lambda_1\sum_{j=1}^{q}\theta_j\right\},\quad\text{subject to } \theta_j \ge 0\ \forall j, \tag{11}$$

where $g_j = w_j^{-2} K^j c$, G is the n × q matrix with the jth column being gj, and $z = y - b 1_n - (n/2)\lambda_0 c$. Notice that (11) is equivalent to

$$\min_{\theta}\ \|z - G\theta\|_n^2\quad\text{subject to } \theta_j \ge 0\ \forall j\ \text{ and } \sum_{j=1}^{q}\theta_j \le M, \tag{12}$$

for some M > 0. The formulation in (12) is a quadratic programming problem with linear constraints, for which there exist many algorithms (see Goldfarb & Idnani 1982, for example). A reasonable scheme is then to iterate between (10) and (12); each iteration decreases (9). We have observed that after the second iteration the change between successive iterations is small and decreases slowly.
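The two steps can be sketched in R as follows. This is an illustrative implementation under our own conventions (the handling of λ0 and the small ridge added for numerical stability are our choices), not the authors' released code, which is available at the URL given in Section 8. The quadratic program (12) is solved with solve.QP from the quadprog package.

```r
library(quadprog)

# Sketch of the alternating algorithm of Section 5.2 (case vartheta = 0).
# y: response; Klist: list of q Gram matrices K^j at the design points;
# w: adaptive weights; M: tuning parameter; lambda0: fixed constant.
acosso.fit <- function(y, Klist, w, M, lambda0 = 1, n.iter = 2) {
  n <- length(y); q <- length(Klist)
  theta <- rep(1, q)                            # start with all components included
  for (it in 1:n.iter) {
    ## Step 1: fixed theta -- the smoothing spline problem (10), solved via the
    ## bordered linear system (Ks + n*lambda0*I)c + b*1 = y, 1'c = 0.
    Ks <- Reduce(`+`, Map(function(K, th, wj) (th / wj^2) * K, Klist, theta, w))
    A  <- rbind(cbind(Ks + n * lambda0 * diag(n), rep(1, n)), c(rep(1, n), 0))
    sol <- solve(A, c(y, 0))
    cc <- sol[1:n]; b <- sol[n + 1]
    ## Step 2: fixed (b, c) -- the constrained problem (12) as a quadratic program.
    G <- sapply(seq_len(q), function(j) (Klist[[j]] %*% cc) / w[j]^2)  # g_j = w_j^{-2} K^j c
    z <- y - b - (n / 2) * lambda0 * cc
    Dmat <- 2 * crossprod(G) + 1e-8 * diag(q)   # small ridge keeps Dmat positive definite
    dvec <- 2 * drop(crossprod(G, z))
    Amat <- cbind(diag(q), -rep(1, q))          # theta_j >= 0 and sum(theta_j) <= M
    bvec <- c(rep(0, q), -M)
    theta <- solve.QP(Dmat, dvec, Amat, bvec)$solution
    theta[theta < 1e-10] <- 0                   # components that are numerically zero
  }
  Ks <- Reduce(`+`, Map(function(K, th, wj) (th / wj^2) * K, Klist, theta, w))
  list(b = b, c = cc, theta = theta, fitted = drop(b + Ks %*% cc))
}
```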

5.3 Selecting the Tuning Parameter

In (9) there is really only one tuning parameter, λ1, or equivalently M of (12). Changing the value of λ0 will only rescale the value of M being used, so λ0 can be fixed at any positive value. Therefore, we choose to initially fix θ = 1q and find λ0 to minimize the GCV score of the smoothing spline problem in (10). This has the effect of placing the θj’s on a scale so that M roughly translates into the number of non-zero components. Hence, it seems reasonable to tune M on [0, 2q], for example.

We will use 5-fold cross validation (5CV) in the examples of the subsequent sections to tune M. However, we also found that the BIC criterion (Schwarz 1978) was quite useful for selecting M. We approximate the effective degrees of freedom, ν, by ν = tr(S) where S is the weight matrix corresponding to the smoothing spline fit with θ = θ̂. This type of approximation gives an under-estimate of the actual df, but has been demonstrated to be useful (Tibshirani 1996). We have found that the ACOSSO with 5CV tends to over select non-zero components just as Zou, Hastie & Tibshirani (2007) found that AIC-type criteria over select non-zero coefficients in the LASSO. They recommend using BIC with the LASSO when the goal is variable selection as do we for the ACOSSO.
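A minimal sketch of the BIC computation described above is given below. The Gaussian-likelihood form of BIC and the `fit` object interface (fields `fitted` and `S`) are our own illustrative choices.

```r
# Sketch: BIC for a candidate value of M, with nu = tr(S) as the approximate
# effective degrees of freedom. `fit` is assumed to carry the fitted values and
# the smoother matrix S of the smoothing spline fit at theta = theta-hat.
bic.acosso <- function(y, fit) {
  n   <- length(y)
  rss <- sum((y - fit$fitted)^2)
  nu  <- sum(diag(fit$S))              # approximate effective degrees of freedom
  n * log(rss / n) + log(n) * nu       # one common Gaussian-likelihood form of BIC
}

# Tuning: evaluate a grid of M values on [0, 2q] and keep the minimizer, e.g.
# M.grid <- seq(0, 2 * q, length.out = 41)
# M.best <- M.grid[which.min(sapply(M.grid, function(M) bic.acosso(y, fit.for.M(M))))]
# where fit.for.M() refits the model at a given M (hypothetical helper).
```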

6 Simulated Data Results

In this section we study the empirical performance of the ACOSSO estimate and compare it to several other existing methods. We display the results of four different versions of the ACOSSO. All versions use weights wj given by (7) with γ = 2 since we found that γ = 2 produced the best overall results among γ ∈ {1.0, 1.5, 2.0, 2.5, 3.0}. The initial estimate, , is either the traditional smoothing spline or the COSSO with λ selected by GCV. We also use either 5CV or BIC to tune M. Hence the four versions of ACOSSO are ACOSSO-5CV-T, ACOSSO-5CV-C, ACOSSO-BIC-T, and ACOSSO-BIC-C where (−T) and (−C) stand for using the traditional smoothing spline and COSSO respectively for the initial estimate.

We include the methods COSSO, MARS, stepwise GAM, Random Forest (Breiman 2001), and the Gradient Boosting Method (GBM) (Friedman 2001). The tuning parameter for COSSO is chosen via 5CV. To fit MARS models we have used the ’polymars’ procedure in the R-package ’polspline’. Stepwise GAM, Random Forest, and GBM fits were obtained using the R-packages ’gam’, ’randomForest’, and ’gbm’ respectively. All input parameters for these methods such as gcv for MARS, n.trees for GBM, etc. were appropriately set to give the best results on these examples.

Note that Random Forest and GBM are both black box prediction machines. That is, they produce function estimates that are difficult to interpret, and they are not intended for variable selection. They are, however, well known for making accurate predictions. Thus, they are included here to demonstrate the utility of the ACOSSO even in situations where prediction is the only goal.

We also include the results of the traditional smoothing spline of (3) when fit with only the informative variables. That is, we set θj = 0 if Pj f = 0 and θj = 1 otherwise, then choose λ0 by GCV. This will be referred to as the ORACLE estimator. Notice that the ORACLE estimator is only available in simulations where we know ahead of time which variables are informative. Though the ORACLE cannot be used in practice, it is useful to display its results here because it gives us a baseline for the best estimation risk we could hope to achieve with the other methods.

Performance is measured in terms of estimation risk and model selection. Specifically, the variables defined in Table 1 will be used to compare the different methods. We first present a very simple example to highlight the benefit of using the ACOSSO. We then repeat the same examples used in the COSSO paper to offer a direct comparison on examples where the COSSO is known to perform well. The only difference is that we have increased the noise level to make these problems a bit more challenging.

Table 1.

Definitions of the variables R̂, ᾱ, 1 − β̄, and model size used to summarize the results of the simulations.

R̂ is a Monte Carlo estimate of the estimation risk, R(f̂) = E{f̂(X) − f(X)}². Specifically, let the integrated squared error for a fixed estimate f̂ be given by ISE = EX{f(X) − f̂(X)}². The ISE is calculated for each realization via a Monte Carlo integration with 1000 points. The quantity R̂ is then the average of the ISE values over the N = 100 realizations.

ᾱ is a Monte Carlo estimate of the type I error rate averaged over all of the uninformative functional components. Specifically, let α̂j be the proportion of realizations for which Pj f̂ ≠ 0, j = 1,…, q. Then ᾱ = 1/|U| ∑j∈U α̂j, where U = {j : Pj f ≡ 0} and |U| is the number of elements in U.

1 − β̄ is a Monte Carlo estimate of the variable selection power averaged over all of the informative functional components. Specifically, let β̂j be the proportion of realizations for which Pj f̂ = 0, j = 1,…, q. Then β̄ = 1/|Uc| ∑j∈Uc β̂j, where Uc is the complement of U.

model size is the number of functional components included in the model averaged over the N = 100 realizations.
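For concreteness, the Monte Carlo approximation of the ISE used above can be sketched as follows, assuming X uniform on [0, 1]^p (as in the uniform designs here); `f.true` and `f.hat` are assumed to be vectorized functions of a p-column input matrix (an illustrative interface).

```r
# Sketch: Monte Carlo approximation of the ISE for one realization, using 1000
# uniformly drawn evaluation points as described in Table 1.
ise.mc <- function(f.true, f.hat, p, n.mc = 1000) {
  X <- matrix(runif(n.mc * p), ncol = p)
  mean((f.true(X) - f.hat(X))^2)
}
```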

Example 1. The following four functions on [0, 1] are used as building blocks of regression functions in the following simulations:

$$g_1(t) = t;\quad g_2(t) = (2t - 1)^2;\quad g_3(t) = \frac{\sin(2\pi t)}{2 - \sin(2\pi t)};$$
$$g_4(t) = 0.1\sin(2\pi t) + 0.2\cos(2\pi t) + 0.3\sin^2(2\pi t) + 0.4\cos^3(2\pi t) + 0.5\sin^3(2\pi t). \tag{13}$$

In this example, we let X ∈ ℜ10. We observe n = 100 observations from the model y = f (X) + ε where the underlying regression function is additive,

$$f(x) = 5 g_1(x_1) + 3 g_2(x_2) + 4 g_3(x_3) + 6 g_4(x_4)$$

and $\varepsilon \overset{iid}{\sim} \mathcal{N}(0, 3.03)$. Therefore X5,…,X10 are uninformative. We first consider the case where X is uniform on [0, 1]10, in which case the signal to noise ratio (SNR) is 3:1 (here we have adopted the variance definition of the signal to noise ratio, SNR = Var(f(X))/σ²). For comparison, the variances of the functional components are Var{5g1(X1)} = 2.08, Var{3g2(X2)} = 0.80, Var{4g3(X3)} = 3.30 and Var{6g4(X4)} = 9.45.
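A realization from this example can be generated directly from (13) and the model above; the sketch below does so in R, with the noise variance 3.03 as stated in the text.

```r
# Sketch: one realization from Example 1 (uniform case).
g1 <- function(t) t
g2 <- function(t) (2 * t - 1)^2
g3 <- function(t) sin(2 * pi * t) / (2 - sin(2 * pi * t))
g4 <- function(t) 0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) +
                  0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 +
                  0.5 * sin(2 * pi * t)^3

n <- 100; p <- 10
X <- matrix(runif(n * p), ncol = p)               # X uniform on [0, 1]^10
f.X <- 5 * g1(X[, 1]) + 3 * g2(X[, 2]) + 4 * g3(X[, 3]) + 6 * g4(X[, 4])
y <- f.X + rnorm(n, sd = sqrt(3.03))              # X5, ..., X10 are uninformative
```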

For the purposes of estimation in the methods ACOSSO, COSSO, MARS, and GBM, we restrict f̂ to be a strictly additive function. The Random Forest function, however, does not have an option for this. Hence there are 10 functional components that are considered for inclusion in the ACOSSO model. Figure 1 gives plots of y versus the first four variables x1,…, x4 along with the true Pj f component curves for a realization from Example 1. The true component curves Pj f, j = 1,…,4, along with the estimates given by ACOSSO-5CV-T and COSSO, are then shown in Figure 2 without the data for added clarity. Notice that the ACOSSO captures more of the features of the P3 f component and particularly the P4 f component, since the reduced penalty on these components allows more curvature. In addition, since the weights more easily allow for curvature on components that need it, M does not need to be (relatively) large to allow a good fit to components like P3 f and P4 f. This has the effect that components with less curvature, like the straight line P1 f, can also be estimated more accurately by ACOSSO than by COSSO, as seen in Figure 2.

Figure 1. Plot of the true functional components, Pj f, j = 1,…, 4, along with the data for a realization from Example 1.

Figure 2. Plot of Pj f, j = 1,…, 4, along with their estimates given by ACOSSO, COSSO, and MARS for a realization from Example 1.

Figure 6 shows how the magnitudes of the estimated components change with the tuning parameter M for both the COSSO and ACOSSO for the above realization. The magnitudes of the estimated components are measured by their L2 norm ∥Pj f̂∥L2. Dashed lines are drawn at the true values of ∥Pj f∥L2 for reference. Notice that the estimated functional component norms given by ACOSSO are generally closer to the true values than those given by the COSSO. Also, the uninformative components are more heavily penalized in the ACOSSO, making it harder for them to enter the model.

Incidentally using GCV or 5CV for tuning parameter selection for the ACOSSO on the above realization gives M = 3.81 and M = 4.54 respectively, both resulting in a model of 5 functional components for this run. The BIC method however gives M = 2.97, which results in the correct model of 4 functional components. This is a typical occurrence for realizations from this example as can be seen in Table 2.

Table 2.

Results of 100 realizations from Example 1 in the uniform case. The standard error for each of the summary statistics is given in parentheses.

R̂  ᾱ  1 − β̄  model size
ACOSSO-5CV-T 1.204 (0.042) 0.252 (0.034) 0.972 (0.008) 5.4 (0.21)
ACOSSO-5CV-C 1.186 (0.048) 0.117 (0.017) 0.978 (0.007) 4.6 (0.11)
ACOSSO-BIC-T 1.257 (0.048) 0.032 (0.008) 0.912 (0.012) 3.8 (0.08)
ACOSSO-BIC-C 1.246 (0.064) 0.018 (0.006) 0.908 (0.014) 3.7 (0.07)
COSSO 1.523 (0.058) 0.095 (0.023) 0.935 (0.012) 4.3 (0.15)
MARS 2.057 (0.064) 0.050 (0.010) 0.848 (0.013) 3.7 (0.08)
GAM 1.743 (0.053) 0.197 (0.019) 0.805 (0.011) 4.4 (0.13)
Random Forest 4.050 (0.062) NA NA 10.0 (0.00)
GBM 1.935 (0.039) NA NA 10.0 (0.00)
ORACLE 1.160 (0.034) 0.000 (0.000) 1.000 (0.000) 4.0 (0.00)

In Table 2 we can compare the risk and variable selection capability of the ACOSSO to the COSSO and the other methods on Example 1 with X uniform on [0, 1]10. Notice that all four of the ACOSSO methods are significantly better than COSSO and the other methods in terms of risk. COSSO, MARS, GAM, Random Forest and GBM have 131%, 180%, 150%, 349%, and 167% the risk of the ORACLE respectively, while the ACOSSO methods all have risk less than 108% that of the ORACLE. In terms of variable selection, the two ACOSSO-5CV methods again have a much higher average type I error rate than the two ACOSSO-BIC methods and MARS. In fact, ACOSSO-5CV-T has ᾱ = .25 which is quite high. Both ACOSSO-BIC methods however, have ᾱ ≤ 0.03 and have an average model size of close to 4.0, the correct number of components.

Although the ACOSSO-5CV methods have higher ᾱ, they have better power than the other methods, as can be seen in the 1 − β̄ column of Table 2. In this example, it turns out that 1 − β̄ is almost completely determined by how well the methods do at including the second variable (component P2 f). The components P1 f, P3 f, and P4 f were included in the model in nearly 100% of the realizations for all of the methods. The percentage of the realizations that included P2 f was 65% for the ACOSSO-BIC methods, 75% for the COSSO, and only 40% and 23% for MARS and GAM respectively. The percentage of the realizations that included P2 f is close to 90% for the ACOSSO-5CV methods but, as mentioned, the price paid is a higher type I error rate.

Example 2. This is a large p example with X ∈ ℜ60. We observe n = 500 observations from y = f(X) + ε. The regression function is additive in the predictors,

$$f(x) = g_1(x_1) + g_2(x_2) + g_3(x_3) + g_4(x_4) + 1.5 g_1(x_5) + 1.5 g_2(x_6) + 1.5 g_3(x_7) + 1.5 g_4(x_8) + 2 g_1(x_9) + 2 g_2(x_{10}) + 2 g_3(x_{11}) + 2 g_4(x_{12}),$$

where g1,…, g4 are given in (13). The noise variance is set to σ2 = 2.40 yielding a SNR of 3:1 in the uniform case. Notice that X13,…,X60 are uninformative. In this example, we consider the same two distributional families for the input vector X as in Lin & Zhang (2006).

Compound Symmetry

For an input X = (X1,…,Xp), let Xj = (Wj + tU)/(1 + t), j = 1,…,p, where W1,…,Wp and U are iid from Unif(0, 1). Thus, Corr(Xj, Xk) = t²/(1 + t²) for j ≠ k. The uniform distribution design corresponds to the case t = 0.

Trimmed AR(1)

Let W1,…,Wp be iid 𝒩(0, 1), and let X1 = W1, Xj = ρXj−1 + (1 − ρ²)1/2Wj, j = 2,…,p. Then trim each Xj to [−2.5, 2.5] and rescale to [0, 1].
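The two designs are simple to simulate; a sketch is given below (the affine rescaling of the trimmed AR(1) variables to [0, 1] is one natural choice).

```r
# Sketch: generating the two predictor designs of Example 2.

# Compound symmetry: Corr(Xj, Xk) = t^2 / (1 + t^2) for j != k; t = 0 is uniform.
compound.symmetry <- function(n, p, t) {
  W <- matrix(runif(n * p), ncol = p)
  U <- runif(n)
  (W + t * U) / (1 + t)          # U is recycled down each column
}

# Trimmed AR(1): Gaussian AR(1) with parameter rho, trimmed to [-2.5, 2.5]
# and rescaled to [0, 1].
trimmed.ar1 <- function(n, p, rho) {
  W <- matrix(rnorm(n * p), ncol = p)
  X <- W
  for (j in 2:p) X[, j] <- rho * X[, j - 1] + sqrt(1 - rho^2) * W[, j]
  X <- pmin(pmax(X, -2.5), 2.5)
  (X + 2.5) / 5
}

X.cs  <- compound.symmetry(n = 500, p = 60, t = 1)
X.ar1 <- trimmed.ar1(n = 500, p = 60, rho = 0.5)
```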

Table 3 shows the results of estimation risk (standard error in parentheses) in six different cases which correspond to different distributions for the predictors. Again the ACOSSO methods have estimation risk much closer to ORACLE than the other methods. COSSO and GAM have very similar performance in this example and generally have the best risk among the methods other than ACOSSO. One notable exception is the extremely high correlation case (Compound Symmetry, t = 3, where Corr(Xj, Xk) = .9 for j ≠ k). Here the ACOSSO-BIC-T and ACOSSO-5CV-T seem to struggle a bit, as they have risk near or above the risk of COSSO and GAM. GBM actually has the lowest risk of the non-ACOSSO methods in this particular case. However, the ACOSSO variants are substantially better overall than any of the other methods. A similar trend was also noticed (table not presented) for these six different cases on the test function from Example 1.

Table 3.

Estimation Risk based on 100 realizations from Example 2 under various Covariance Structures; standard error given in parentheses.

Compound Symmetry Trimmed AR(1)
t = 0 t = 1 t = 3 ρ = −0.5 ρ = 0.0 ρ = 0.5
ACOSSO-5CV-T 0.41 (0.01) 0.41 (0.01) 0.52 (0.01) 0.40 (0.01) 0.40 (0.01) 0.40 (0.01)
ACOSSO-5CV-C 0.42 (0.01) 0.40 (0.01) 0.43 (0.01) 0.40 (0.01) 0.40 (0.01) 0.40 (0.01)
ACOSSO-BIC-T 0.42 (0.01) 0.41 (0.01) 0.60 (0.02) 0.42 (0.01) 0.39 (0.01) 0.42 (0.01)
ACOSSO-BIC-C 0.42 (0.01) 0.42 (0.01) 0.47 (0.01) 0.45 (0.02) 0.39 (0.01) 0.43 (0.01)
COSSO 0.48 (0.01) 0.60 (0.01) 0.54 (0.01) 0.60 (0.02) 0.57 (0.01) 0.57 (0.01)
MARS 0.97 (0.02) 0.66 (0.01) 1.05 (0.02) 0.64 (0.01) 0.62 (0.01) 0.64 (0.01)
GAM 0.49 (0.01) 0.52 (0.01) 0.50 (0.01) 0.52 (0.01) 0.52 (0.01) 0.52 (0.01)
Random Forest 1.86 (0.01) 1.50 (0.01) 0.76 (0.00) 1.26 (0.01) 1.19 (0.01) 1.25 (0.01)
GBM 0.73 (0.01) 0.52 (0.01) 0.47 (0.00) 0.58 (0.01) 0.58 (0.01) 0.57 (0.01)
ORACLE 0.30 (0.00) 0.28 (0.00) 0.27 (0.00) 0.29 (0.00) 0.29 (0.01) 0.29 (0.01)

Example 3. Here we consider a regression model with 10 predictors and several two way interactions. The regression function is

$$f(x) = g_1(x_1) + g_2(x_2) + g_3(x_3) + g_4(x_4) + g_3(x_1 x_2) + g_2\!\left(\frac{x_1 + x_3}{2}\right) + g_1(x_3 x_4)$$

so that x5,…, x10 are uninformative. The noise variance is set at σ² = 0.44098 to give a SNR of 3:1. Here, we consider only the uniform distribution on the predictors and evaluate performance at various sample sizes, n = 100, n = 250, and n = 500.

A summary of the estimation risk over 100 realizations can be found in Table 4. When n = 100 we see that COSSO seems to struggle a bit, as all of the other methods except Random Forest have substantially better risk. The ACOSSO methods, however, have risk comparable to or better than the other methods and less than half that of COSSO. The estimation risk for all methods improves as the sample size increases. However, stepwise GAM does not improve from n = 250 to n = 500, probably because of its inability to model the interactions in this example. Also notice that the ACOSSO methods maintain close to 50% of the risk of COSSO for all sample sizes. In fact, for n = 500 the ACOSSO methods have risk nearly the same as that of the ORACLE and roughly half that of the next best methods (COSSO and GBM).

Table 4.

Estimation Risk based on 100 realizations from Example 3 with n = 100, 250, and 500; standard error given in parentheses.

n = 100 n = 250 n = 500
ACOSSO-5CV-T 0.139 (0.017) 0.055 (0.001) 0.034 (0.001)
ACOSSO-5CV-C 0.120 (0.011) 0.055 (0.001) 0.036 (0.001)
ACOSSO-BIC-T 0.200 (0.027) 0.054 (0.001) 0.034 (0.001)
ACOSSO-BIC-C 0.138 (0.016) 0.050 (0.001) 0.034 (0.001)
COSSO 0.290 (0.016) 0.093 (0.002) 0.057 (0.001)
MARS 0.245 (0.021) 0.149 (0.009) 0.110 (0.008)
GAM 0.149 (0.005) 0.137 (0.001) 0.136 (0.001)
Random Forest 0.297 (0.006) 0.190 (0.002) 0.148 (0.001)
GBM 0.126 (0.003) 0.084 (0.001) 0.065 (0.001)
ORACLE 0.071 (0.003) 0.042 (0.001) 0.029 (0.000)

Computation Time

Table 5 gives the computation times (in seconds) for the various methods on the three sample size cases in Example 3. Recall that in Example 3 the number of predictors is p = 10, but the total number of functional components including interactions is q = 55. The times given are the average over the 100 realizations and include the time required for tuning parameter selection.

Table 5.

Average CPU time (in seconds) for each method to compute a model fit (including tuning parameter selection) for the various sample size simulations of Example 3.

n = 100 n = 250 n = 500
ACOSSO-5CV-T 8.7 47.2 155.2
ACOSSO-5CV-C 10.5 49.7 163.6
ACOSSO-BIC-T 3.0 11.9 89.0
ACOSSO-BIC-C 3.9 26.6 118.4
COSSO 7.5 43.4 150.1
MARS 9.2 11.2 13.6
GAM 5.7 7.9 11.3
Random Forest 0.3 6.4 15.3
GBM 4.2 9.1 17.0

For larger sample sizes, ACOSSO and COSSO take significantly longer than the other methods. Considering the performance in the simulation examples and the computation times, the best overall ACOSSO variant seems to be ACOSSO-BIC-T. It is important to point out that the other methods (besides ACOSSO or COSSO) are computed via more polished R-packages that take advantage of the speed of compiled languages such as C or Fortran. The computing time of ACOSSO (and COSSO) could also be decreased substantially by introducing more efficient approximations and by taking advantage of a compiled language.

7 Application to Real Data

In this section we apply the ACOSSO to three real datasets. We only report the results of the two ACOSSO-BIC methods since they performed much better overall than the ACOSSO-5CV methods in our simulations. The first two data sets are the popular Ozone data and the Tecator data which were also used by Lin & Zhang (2006). Both data sets are available from the datasets archive of StatLib at http://lib.stat.cmu.edu/datasets/. The Ozone data was also used in Breiman & Friedman (1995), Buja, Hastie & Tibshirani (1989), and Breiman (1995). This data set contains the daily maximum one-hour-average ozone reading and 8 meteorological variables recorded in the Los Angeles basin for 330 days of 1976.

The Tecator data was recorded on a Tecator Infratec Food and Feed Analyzer. Each sample contains finely chopped pure meat with different moisture, fat and protein contents. The input vector consists of a 100 channel spectrum of absorbances. The absorbance is – log10 of the transmittance measured by the spectrometer. As in Lin & Zhang (2006), we use the first 13 principal components to predict fat content. The total sample size is 215.

The third data set comes from a computer model for two phase fluid flow (Vaughn, Bean, Helton, Lord, MacKinnon & Schreiber 2000). Uncertainty/sensitivity analysis of this model was carried out as part of the 1996 compliance certification application for the Waste Isolation Pilot Plant (WIPP) (Helton & Marietta, editors 2000). There were 31 uncertain variables that were inputs into the two-phase fluid flow analysis; see Storlie & Helton (2008) for a full description. Here we consider only a specific scenario which was part of the overall analysis. The variable BRN-REPTC10K is used as the response. This variable corresponds to cumulative brine flow in m3 into the waste repository at 10,000 years assuming there was a drilling intrusion at 1000 years. The sample size is n = 300. This data set is available at http://www.stat.unm.edu/~storlie/acosso/.

We apply each of the methods to these three data sets and estimate the prediction risk, $E[Y - \hat f(X)]^2$, by ten-fold cross validation. We select the tuning parameter using only data within the training set (i.e., a new value of the tuning parameter is selected for each of the 10 training sets without using any data from the test sets). The estimate obtained is then evaluated on the test set. We repeat this ten-fold cross validation 50 times and average. The resulting prediction risk estimates along with standard errors are displayed in Table 6. The interaction model is presented for all of the methods (except GAM) since it had better prediction accuracy on all three data sets than the additive model.
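The repeated ten-fold cross-validation estimate of prediction risk can be sketched generically as follows; `fit.fun(X, y)` and `pred.fun(fit, X)` stand in for any of the fitting methods compared here (a hypothetical interface), with tuning done inside `fit.fun` using training data only.

```r
# Sketch: repeated K-fold cross-validation estimate of E[Y - f.hat(X)]^2.
cv.risk <- function(X, y, fit.fun, pred.fun, K = 10, reps = 50) {
  n <- nrow(X)
  risk <- numeric(reps)
  for (r in 1:reps) {
    fold <- sample(rep(1:K, length.out = n))
    sq.err <- numeric(n)
    for (k in 1:K) {
      test <- which(fold == k)
      fit  <- fit.fun(X[-test, , drop = FALSE], y[-test])
      sq.err[test] <- (y[test] - pred.fun(fit, X[test, , drop = FALSE]))^2
    }
    risk[r] <- mean(sq.err)
  }
  mean(risk)                         # average over the 50 repetitions
}
```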

Table 6.

Estimated Prediction Risk for the Real Data Examples; standard error given in parentheses. Risk for BRN-REPTC10K in the WIPP data is in units of 100 m⁶.

Ozone Tecator WIPP
ACOSSO-BIC-T 15.07 (0.07) 1.44 (0.02) 1.04 (0.00)
ACOSSO-BIC-C 14.81 (0.08) 1.38 (0.02) 1.05 (0.01)
COSSO 15.99 (0.06) 0.88 (0.02) 1.30 (0.01)
MARS 14.24 (0.12) 3.01 (0.17) 1.12 (0.01)
GAM 15.91 (0.12) 592.52 (4.26) 1.83 (0.01)
Random Forest 18.11 (0.07) 14.35 (0.10) 1.29 (0.01)
GBM 10.69 (0.00) 3.35 (0.00) 0.97 (0.00)

For the Ozone data set, the ACOSSO is comparable to MARS but better than COSSO, GAM, and Random Forest. GBM seems to be the best method for prediction accuracy on this data set, though. For the Tecator data, both the COSSO and ACOSSO are much better than all of the other methods. There are several significant interactions, which is why GAM performs so poorly relative to the other methods here. Interestingly, COSSO is better than ACOSSO here, which shows that the adaptive weights are not always an advantage. The reason that COSSO performs better in this case is likely that 12 out of the 13 variables are selected for inclusion into the model and around 62 out of the 91 total functional components are estimated to be non-zero, so this is not a very sparse model. Hence using all weights equal to 1 (the COSSO) should work quite well here. In cases like this, it may be that using adaptive weights in the ACOSSO can detract from the COSSO fit by adding more noise to the estimation process. In contrast, the WIPP data set has only about 8 informative input variables out of the 31 inputs. Hence the ACOSSO significantly outperforms the COSSO and is comparable to GBM for prediction accuracy.

8 Conclusions & Further Work

In this article, we have developed the ACOSSO, a new regularization method for simultaneous model fitting and variable selection in the context of nonparametric regression. The relationship between the ACOSSO and the COSSO is analogous to that between the adaptive LASSO and the LASSO. We have explored a special case under which the ACOSSO has a nonparametric version of the oracle property, which the COSSO does not appear to possess. This is the first result of this type for a nonparametric regression estimator. In addition, we have demonstrated that the ACOSSO outperforms COSSO, MARS, and stepwise GAMs for variable selection and prediction on all simulated examples and all but one of the real data examples. The ACOSSO also has very competitive performance for prediction when compared with the other well known prediction methods Random Forest and GBM. R code to fit ACOSSO models is available at http://www.stat.unm.edu/~storlie/acosso/.

It remains to show that ACOSSO has the np-oracle property under more general conditions such as random designs. It may also be possible to further improve the performance of the ACOSSO by using a different weighting scheme. The Tecator example in Section 7 suggests that using a weight power of γ = 2 is not always ideal. Perhaps it would be better to cross validate over a few different choices of γ so that it could be chosen smaller in cases (such as the Tecator example) where the weights are not as helpful. In addition, there are certainly a number of other ways to use the initial estimate, f̃, in the creation of the penalty term. These are topics for further research.

Figure 6. Plot of ∥Pj f̂∥L2 along with ∥Pj f∥L2 by M for both ACOSSO and COSSO on a realization from Example 1.

Appendix

A Proofs

A.1 Equivalent Form

Proof of Lemma 1

Denote the functional in (6) by A(f) and the functional in (9) by B(θ, f). Since $a + b \ge 2\sqrt{ab}$ for $a, b \ge 0$, with equality if and only if a = b, we have for each j = 1,…, q

$$\lambda_0\theta_j^{-1} w_j^{2-\vartheta}\|P^j f\|^2 + \lambda_1 w_j^{\vartheta}\theta_j \ \ge\ 2\lambda_0^{1/2}\lambda_1^{1/2} w_j\|P^j f\| = \lambda w_j\|P^j f\|,$$

for any θj ≥ 0 and any f ∈ ℱ. Hence B(θ, f) ≥ A(f), with equality only when $\theta_j = \lambda_0^{1/2}\lambda_1^{-1/2} w_j^{1-\vartheta}\|P^j f\|$, and the result follows.

A.2 Convergence Rate

The proof of Theorem 1 uses Lemma 2 below, which is a generalization of Theorem 10.2 of van de Geer (2000). Consider the regression model yi = g0(xi) + εi, i = 1,…, n, where g0 is known to lie in a class of functions 𝒢, the xi’s are given covariates in [0, 1]p, and the εi’s are independent and sub-Gaussian as in (8). Let In : 𝒢 → [0,∞) be a pseudonorm on 𝒢. Define $\hat g_n = \arg\min_{g\in\mathcal{G}}\ \frac{1}{n}\sum_{i=1}^{n}(y_i - g(x_i))^2 + \tau_n^2 I_n(g)$. Let H(δ, 𝒢) be the δ-entropy of the function class 𝒢 under the supremum norm $\|g\|_\infty = \sup_x|g(x)|$ (see van de Geer 2000, page 17).

Lemma 2

Suppose there exists I* such that $I^*(g) \le I_n(g)$ for all g ∈ 𝒢, n ≥ 1. Also assume that there exist constants A > 0 and 0 < α < 2 such that

$$H\left(\delta,\ \left\{\frac{g - g_0}{I^*(g) + I^*(g_0)} : g\in\mathcal{G},\ I^*(g) + I^*(g_0) > 0\right\}\right) \le A\delta^{-\alpha} \tag{A1}$$

for all δ > 0 and n ≥ 1. Then if $I^*(g_0) > 0$ and $\tau_n^{-1} = O_p\left(n^{1/(2+\alpha)}\right) I_n^{(2-\alpha)/(4+2\alpha)}(g_0)$, we have $\|\hat g_n - g_0\|_n = O_p(\tau_n)\, I_n^{1/2}(g_0)$. Moreover, if $I_n(g_0) = 0$ for all n ≥ 1, then $\|\hat g_n - g_0\|_n = O_p\left(n^{-1/(2-\alpha)}\right)\tau_n^{-2\alpha/(2-\alpha)}$.

Proof. This follows the same logic as the proof of Theorem 10.2 of van de Geer (2000), so we have intentionally made the following argument somewhat terse. Notice that

$$\|\hat g_n - g_0\|_n^2 + \tau_n^2 I_n(\hat g_n) \le 2(\varepsilon, \hat g_n - g_0)_n + \tau_n^2 I_n(g_0). \tag{A2}$$

Also, condition (A1) along with Lemma 8.4 in van de Geer (2000) guarantees that

$$\sup_{g\in\mathcal{G}}\frac{|(\varepsilon, g - g_0)_n|}{\|g - g_0\|_n^{1-\alpha/2}\left(I^*(g) + I^*(g_0)\right)^{\alpha/2}} = O_p\left(n^{-1/2}\right). \tag{A3}$$

Case (i) Suppose that I*(ĝn) > I*(g0). Then by (A2) and (A3) we have

$$\begin{aligned}\|\hat g_n - g_0\|_n^2 + \tau_n^2 I_n(\hat g_n) &\le O_p\left(n^{-1/2}\right)\|\hat g_n - g_0\|_n^{1-\alpha/2}\left(I^*(\hat g_n)\right)^{\alpha/2} + \tau_n^2 I_n(g_0)\\ &\le O_p\left(n^{-1/2}\right)\|\hat g_n - g_0\|_n^{1-\alpha/2} I_n^{\alpha/2}(\hat g_n) + \tau_n^2 I_n(g_0).\end{aligned}$$

The rest of the argument is identical to that on page 170 of van de Geer (2000). Case (ii) Suppose that I*(ĝn) ≤ I*(g0) and I*(g0) > 0. By (A2) and (A3) we have

$$\begin{aligned}\|\hat g_n - g_0\|_n^2 &\le O_p\left(n^{-1/2}\right)\|\hat g_n - g_0\|_n^{1-\alpha/2}\left(I^*(g_0)\right)^{\alpha/2} + \tau_n^2 I_n(g_0)\\ &\le O_p\left(n^{-1/2}\right)\|\hat g_n - g_0\|_n^{1-\alpha/2} I_n^{\alpha/2}(g_0) + \tau_n^2 I_n(g_0).\end{aligned}$$

The remainder of this case is identical to that on page 170 of van de Geer (2000).

Proof of Theorem 1. The conditions of Lemma 2 do not hold directly for the ℱ and $I_n(f) = \sum_{j=1}^{p} w_{j,n}\|P^j f\|$ of Theorem 1. The following orthogonality argument, used in van de Geer (2000) and Lin & Zhang (2006), remedies this problem. For any f ∈ ℱ we can write f(x) = b + g(x) = b + f1(x1) +⋯+ fp(xp), such that $\sum_{i=1}^{n} f_j(x_{j,i}) = 0$, j = 1,…, p. Similarly write f̂(x) = b̂ + ĝ(x) and f0(x) = b0 + g0(x). Then $\sum_{i=1}^{n}\left(g(x_i) - g_0(x_i)\right) = 0$, and we can write (6) as

$$(b_0 - b)^2 + \frac{2}{n}(b_0 - b)\sum_{i=1}^{n}\varepsilon_i + \frac{1}{n}\sum_{i=1}^{n}\left(g_0(x_i) - g(x_i)\right)^2 + \lambda_n\sum_{j=1}^{p} w_{j,n}\|P^j g\|.$$

Therefore b̂ must minimize $(b_0 - b)^2 + \frac{2}{n}(b_0 - b)\sum_{i=1}^{n}\varepsilon_i$, so that $\hat b = b_0 + \frac{1}{n}\sum_i\varepsilon_i$. Hence $(\hat b - b_0)^2 = O_p(n^{-1})$. On the other hand, ĝ must minimize

$$\frac{1}{n}\sum_{i=1}^{n}\left(g_0(x_i) - g(x_i)\right)^2 + \lambda_n\sum_{j=1}^{p} w_{j,n}\|P^j g\| \tag{A4}$$

over all g ∈ 𝒢 where

$$\mathcal{G} = \left\{g : g(x) = f_1(x_1) + \cdots + f_p(x_p),\ \sum_{i=1}^{n} f_j(x_{j,i}) = 0,\ j = 1,\dots,p\right\}. \tag{A5}$$

Now rewrite (A4) as

$$\frac{1}{n}\sum_{i=1}^{n}\left(g_0(x_i) - g(x_i)\right)^2 + \tilde\lambda_n\sum_{j=1}^{p}\tilde w_{j,n}\|P^j g\|, \tag{A6}$$

where $\tilde\lambda_n = \lambda_n w_{*,n}$, $w_{*,n} = \min\{w_{1,n},\dots,w_{p,n}\}$, and $\tilde w_{j,n} = w_{j,n}/w_{*,n}$.

The problem is now reduced to showing that the conditions of Lemma 2 hold for $\tau_n^2 = \tilde\lambda_n$ and $I_n(g) = \sum_{j=1}^{p}\tilde w_{j,n}\|P^j g\|$. Notice that $\min\{\tilde w_{1,n},\dots,\tilde w_{p,n}\} = 1$ for all n. This implies that $I_n(g) \ge I^*(g) = \sum_{j=1}^{p}\|P^j g\|$ for all g ∈ 𝒢 and n ≥ 1. Also notice that the entropy bound in (A1) holds whenever

$$H\left(\delta,\ \{g\in\mathcal{G} : I^*(g) \le 1\}\right) \le A\delta^{-\alpha}, \tag{A7}$$

since $I^*(g - g_0) \le I^*(g) + I^*(g_0)$, so that the set in brackets in (A7) contains that in (A1). And (A7) holds by Lemma 4 of Lin & Zhang (2006) with α = 1/2. We complete the proof by treating the cases Uc not empty and Uc empty separately.

Case (i) Suppose that $P^j f_0 \ne 0$ for some j. Then $I^*(g_0) > 0$. Also, $w_{*,n}^{-1} = O_p(1)$ and $w_{j,n} = O_p(1)$ for j ∈ Uc by assumption. This implies that $\tilde w_{j,n} = O_p(1)$ for j ∈ Uc, so that $I_n(g_0) = O_p(1)$. Also $\tilde\lambda_n^{-1} = O_p(1)\lambda_n^{-1} = O_p(n^{4/5})$. The result now follows from Lemma 2.

Case (ii) Suppose now that Pj f = 0 for all j. Then In (g0) = 0 for all n and the result follows from Lemma 2.

Proof of Corollary 1. For the traditional smoothing spline with λ0 ~ n−4/5 it is known (Lin 2000) that $\|P^j\tilde f - P^j f_0\|_{L_2} = O_p(n^{-2/5})$. This implies $\left|\|P^j\tilde f\|_{L_2} - \|P^j f_0\|_{L_2}\right| \le O_p(n^{-2/5})$. Hence $w_{j,n}^{-1} = O_p(1)$ for j = 1,…, p and $w_{j,n} = O_p(1)$ for j ∈ Uc, which also implies $w_{*,n} = O_p(1)$. The conditions of Theorem 1 are now satisfied and we have $\|\hat f - f_0\|_n = O_p(n^{-2/5})$ if Pj f0 ≠ 0 for some j. On the other hand, also notice that $w_{j,n}^{-1} = O_p(n^{-2\gamma/5})$ for j ∈ U. Hence $w_{*,n}^{-1} = O_p(n^{-2\gamma/5})$ whenever Pj f0 = 0 for all j, so that $n^{-2/3}\lambda_n^{-1/3} w_{*,n}^{-1/3} = O_p(n^{-1/2})$ for γ > 3/4, and the result follows.

A.3 Oracle Property

Proof of Theorem 2. Define $\Sigma = \{\bar K(x_{1,i}, x_{1,j})\}_{i,j=1}^{m}$, the m × m marginal Gram matrix corresponding to the reproducing kernel for $\bar{\mathcal{S}}^2_{per}$. Also let $K^j$ stand for the n × n Gram matrix corresponding to the reproducing kernel for $\bar{\mathcal{S}}^2_{per}$ on variable xj, j = 1,…, p. Let 1m be a vector of m ones. Assuming the observations are permuted appropriately, we can write

$$\begin{aligned} K^1 &= \Sigma\otimes(1_m 1_m')\otimes\cdots\otimes(1_m 1_m')\\ K^2 &= (1_m 1_m')\otimes\Sigma\otimes\cdots\otimes(1_m 1_m')\\ &\ \,\vdots\\ K^p &= (1_m 1_m')\otimes\cdots\otimes(1_m 1_m')\otimes\Sigma, \end{aligned}$$

where ⊗ here stands for the Kronecker product between two matrices.

Straightforward calculation shows that $\Sigma 1_m = \frac{1}{720 m^3} 1_m$. So write the eigenvectors of Σ as {υ1 = 1m, υ2,…, υm} and let ϒ be the m × m matrix with these eigenvectors as its columns. The corresponding eigenvalues are {mϕ1, mϕ2,…, mϕm}, where $\phi_1 = 1/(720 m^4)$ and ϕ2 ≥ ϕ3 ≥ ⋯ ≥ ϕm. It is known (Uteras 1983) that $\phi_i \sim i^{-4}$ for i ≥ 2. Notice {υ1, υ2,…, υm} are also the eigenvectors of $(1_m 1_m')$, with eigenvalues {m, 0,…, 0}. Write O = ϒ ⊗ ϒ ⊗ ⋯ ⊗ ϒ and let ξi be the ith column of O, i = 1,…, n. It is easy to verify that {ξ1,…, ξn} form an eigensystem for each of K1,…,Kp.

Let {ζ1,j,…, ζn,j} be the collection of vectors {ξ1,…, ξn} sorted so that those corresponding to nonzero eigenvalues for Kj are listed first. Specifically, let

$$\zeta_{i,1} = \upsilon_i\otimes 1_m\otimes\cdots\otimes 1_m,\quad \zeta_{i,2} = 1_m\otimes\upsilon_i\otimes\cdots\otimes 1_m,\quad\dots,\quad \zeta_{i,p} = 1_m\otimes\cdots\otimes 1_m\otimes\upsilon_i, \tag{A8}$$

for i = 1,…,m. Notice that each ζi,j, i = 1,…,m, j = 1,…, p, corresponds to a distinct ξk for some k ∈ {1,…, n}. So let the first m elements of the collection {ζ1,j,…, ζm,j, ζm+1,j,…, ζn,j} be given by (A8) and the remaining n − m be given by the remaining ξi in any order. The corresponding eigenvalues are then

$$\eta_{i,j} = \begin{cases} n\phi_i & \text{for } i = 1,\dots,m\\ 0 & \text{for } i = m+1,\dots,n. \end{cases}$$

It is clear that {ξ1,…, ξn} is also an orthonormal basis in ℜn with respect to the inner product

$$\langle u, \upsilon\rangle_n = \frac{1}{n}\sum_i u_i\upsilon_i. \tag{A9}$$

Let f = (f(x1),…, f(xn))′. Denote $a = (1/n)O'f$ and $z = (1/n)O'y$. That is, $z_i = \langle y, \xi_i\rangle_n$, $a_i = \langle f, \xi_i\rangle_n$, $\delta_i = \langle\varepsilon, \xi_i\rangle_n$, and we have that zi = ai + δi. With some abuse of notation, also let

$$z_{i,j} = \langle y, \zeta_{i,j}\rangle_n,\qquad a_{i,j} = \langle f, \zeta_{i,j}\rangle_n,\qquad \delta_{i,j} = \langle\varepsilon, \zeta_{i,j}\rangle_n.$$

Now, using ϑ = 2 in (9), the ACOSSO estimate is the minimizer of

$$\frac{1}{n}\left(y - K_\theta c - b 1_n\right)'\left(y - K_\theta c - b 1_n\right) + c' K_\theta c + \lambda_1\sum_{j=1}^{p} w_j^2\theta_j, \tag{A10}$$

where $K_\theta = \sum_{j=1}^{p}\theta_j K^j$. Let $s = O'c$ and let $D_j = (1/n^2)O' K^j O$, a diagonal matrix with diagonal elements ϕi. Then (A10) is equivalent to

$$\left(z - D_\theta s - (b, 0,\dots,0)'\right)'\left(z - D_\theta s - (b, 0,\dots,0)'\right) + s' D_\theta s + \lambda_1\sum_{j=1}^{p} w_j^2\theta_j, \tag{A11}$$

where $D_\theta = \sum_{j=1}^{p}\theta_j D_j$. Straightforward calculation shows that this minimization problem is equivalent to minimizing

$$\ell(s, \theta) = \sum_{i=1}^{m}\sum_{j=1}^{p}\left(z_{ij} - \phi_i\theta_j s_{ij}\right)^2 + \sum_{i=1}^{m}\sum_{j=1}^{p}\phi_i\theta_j s_{ij}^2 + \lambda_1\sum_{j=1}^{p} w_j^2\theta_j, \tag{A12}$$

where $s_{ij} = \zeta_{i,j}' c$, i = 1,…,m, j = 1,…,p, are distinct elements of s.

Now, we first condition on θ and minimize over s. Given θ, ℓ(s, θ) is a convex function of s and is minimized at $\hat s(\theta) = \{\hat s_{ij}(\theta_j)\}_{i=1,\dots,m;\ j=1,\dots,p}$, where $\hat s_{ij}(\theta_j) = z_{ij}/(1 + \phi_i\theta_j)$. Inserting ŝ(θ) into (A12) gives

$$\ell(\hat s(\theta), \theta) = \sum_{i=1}^{m}\sum_{j=1}^{p}\frac{z_{ij}^2}{(1 + \phi_i\theta_j)^2} + \sum_{i=1}^{m}\sum_{j=1}^{p}\frac{\phi_i\theta_j z_{ij}^2}{(1 + \phi_i\theta_j)^2} + \lambda_1\sum_{j=1}^{p} w_j^2\theta_j = \sum_{i=1}^{m}\sum_{j=1}^{p}\frac{z_{ij}^2}{1 + \phi_i\theta_j} + \lambda_1\sum_{j=1}^{p} w_j^2\theta_j. \tag{A13}$$

Notice that ℓ(ŝ(θ), θ) is continuous in θj,

$$\frac{\partial^2\ell(\hat s(\theta), \theta)}{\partial\theta_j^2} = 2\sum_{i=1}^{m}\frac{z_{ij}^2\phi_i^2}{(1 + \phi_i\theta_j)^3} > 0\quad\text{for each } j, \tag{A14}$$

and $\partial^2\ell(\hat s(\theta), \theta)/\partial\theta_j\partial\theta_k = 0$ for j ≠ k. Therefore ℓ(ŝ(θ), θ) is convex in θ and has a unique minimum, θ̂.

Clearly, $P^j\hat f \equiv 0$ if and only if θ̂j = 0. So it suffices to consider θ̂j. Since we must have θj ≥ 0, the minimizer θ̂j = 0 if and only if

$$\left.\frac{\partial}{\partial\theta_j}\ell(\hat s(\theta), \theta)\right|_{\theta_j = 0} \ge 0,$$

which is equivalent to

$$T = n\sum_{i=1}^{m}\phi_i z_{ij}^2 \le n w_{j,n}^2\lambda_{1,n}. \tag{A15}$$

If we assume that Pj f = 0, then we have zij = δij. In the following, we obtain bounds for E(T) and Var(T) to demonstrate that T is bounded in probability when Pj f = 0. To this end, we first obtain bounds for $E(\delta_{ij}^2)$ and $\mathrm{Var}(\delta_{ij}^2)$. Recall that $\delta_{ij} = (1/n)\zeta_{i,j}'\varepsilon$ and that the individual elements of ε are independent with E(ε) = 0. For notational convenience, let ξ = ζi,j, which is some column of the O matrix. Also, recall that the vector ξ is normalized with respect to the inner product in (A9). Now,

$$E(\delta_{ij}^2) = \frac{1}{n^2}E\left[(\xi'\varepsilon)^2\right] = \frac{1}{n^2}E\left(\sum_{a=1}^{n}\sum_{b=1}^{n}\xi_a\xi_b\varepsilon_a\varepsilon_b\right) = \frac{1}{n^2}\sum_{a=1}^{n}\xi_a^2 E(\varepsilon_a^2) \le \frac{1}{n^2}\sum_{a=1}^{n}\xi_a^2 M_1 = \frac{M_1}{n}, \tag{A16}$$

where $M_1 = \max_a E(\varepsilon_a^2)$, which is bounded because of the sub-Gaussian condition (8).

The variance of δij2 is

$$\mathrm{Var}(\delta_{ij}^2) = \frac{1}{n^4}\mathrm{Var}\left(\sum_{a=1}^{n}\sum_{b=1}^{n}\xi_a\xi_b\varepsilon_a\varepsilon_b\right) = \frac{1}{n^4}\sum_{a=1}^{n}\sum_{b=1}^{n}\sum_{c=1}^{n}\sum_{d=1}^{n}\xi_a\xi_b\xi_c\xi_d\,\mathrm{Cov}(\varepsilon_a\varepsilon_b,\ \varepsilon_c\varepsilon_d). \tag{A17}$$

But the εa’s are independent, so Cov(εaεb, εcεd) ≠ 0 only in the three mutually exclusive cases (i) a = b = c = d, (ii) a = c and b = d with a ≠ b, or (iii) a = d and b = c with a ≠ b. Thus, (A17) becomes

$$\mathrm{Var}(\delta_{ij}^2) = \frac{1}{n^4}\left[\sum_{a=1}^{n}\xi_a^4\,\mathrm{Cov}(\varepsilon_a^2, \varepsilon_a^2) + 2\sum_{a=1}^{n}\sum_{b\ne a}\xi_a^2\xi_b^2\,\mathrm{Cov}(\varepsilon_a\varepsilon_b, \varepsilon_a\varepsilon_b)\right] \le \frac{2}{n^4}\sum_{a=1}^{n}\sum_{b=1}^{n}\xi_a^2\xi_b^2\,\mathrm{Var}(\varepsilon_a\varepsilon_b) \le \frac{2 M_2}{n^4}\left(\sum_{a=1}^{n}\xi_a^2\right)^2 = \frac{2 M_2}{n^2}, \tag{A18}$$

where $M_2 = \max_{a,b}\{\mathrm{Var}(\varepsilon_a\varepsilon_b)\}$, which is bounded because of the sub-Gaussian condition in (8). Notice that the derivations of the bounds in (A16) and (A18) do not depend on i or j (i.e., they do not depend on which column of O ξ comes from). Thus, the bounds in (A16) and (A18) are uniform in i and j.

Using (A16) we can write E(T) as

$$E(T) = n\sum_{i=1}^{m}\phi_i E\left[\delta_{ij}^2\right] \le M_1\sum_{i=1}^{m}\phi_i \le M_1. \tag{A19}$$

Further, we can use (A18) to write Var(T) as

$$\mathrm{Var}(T) = n^2\,\mathrm{Var}\left(\sum_{i=1}^{m}\phi_i\delta_{i,j}^2\right) = n^2\sum_{k=1}^{m}\sum_{l=1}^{m}\phi_k\phi_l\,\mathrm{Cov}\left(\delta_{k,j}^2, \delta_{l,j}^2\right) \le 2 M_2\sum_{k=1}^{m}\sum_{l=1}^{m}\phi_k\phi_l = 2 M_2\left(\sum_{k=1}^{m}\phi_k\right)^2 \le 2 M_2. \tag{A20}$$

Finally, as n increases, (A19) and (A20) guarantee that the left-hand side of (A15) is bounded in probability when Pj f = 0. Assuming that $n w_{j,n}^2\lambda_n^2 \to_p \infty$, or equivalently that $n w_{j,n}^2\lambda_{1,n} \to_p \infty$ by Lemma 1, the right-hand side of (A15) increases to ∞ in probability. Therefore, if Pj f = 0 then θ̂j = 0 with probability tending to one. If, on the other hand, $n w_{j,n}^2\lambda_n^2 = O_p(1)$, then the probability that $T > n w_{j,n}^2\lambda_{1,n}$ converges to a positive constant. Hence the probability that θ̂j > 0 converges to a positive constant.

Proof of Corollary 2. It is straightforward to show that Theorem 1 still holds with $\mathcal{S}^2_{per}$ in place of 𝒮². Also recall from the proof of Corollary 1 that these weights satisfy the conditions of Theorem 1. Also, since $w_{j,n}^{-1} = O_p(n^{-2\gamma/5})$ for j ∈ U, we have $n w_{j,n}^2\lambda_n^2 \to_p \infty$ for j ∈ U whenever γ > 3/4. The conditions of Theorem 2 are now satisfied. Lastly, in light of Theorem 1, we also know that if Pj f0 ≠ 0, the probability that $P^j\hat f \ne 0$ also tends to one as the sample size increases, due to the consistency. Corollary 2 follows.

References

  1. Berlinet A, Thomas-Agnan C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Norwell, MA: Kluwer Academic Publishers; 2004.
  2. Breiman L. Better subset selection using the nonnegative garrote. Technometrics. 1995;37:373–384.
  3. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
  4. Breiman L, Friedman J. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association. 1995;80:580–598.
  5. Buja A, Hastie T, Tibshirani R. Linear smoothers and additive models (with discussion). Annals of Statistics. 1989;17:453–555.
  6. Cleveland W. Robust locally weighted fitting and smoothing scatterplots. Journal of the American Statistical Association. 1979;74:829–836.
  7. Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerical Mathematics. 1979;31:377–403.
  8. Efromovich S, Samarov A. Adaptive estimation of the integral of squared regression derivatives. Scandinavian Journal of Statistics. 2000;27(2):335–351.
  9. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  10. Friedman J. Multivariate adaptive regression splines (with discussion). Annals of Statistics. 1991;19:1–141.
  11. Friedman J. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001;29(5):1189–1232.
  12. Goldfarb D, Idnani A. Dual and primal-dual methods for solving strictly convex quadratic programs. In: Hennart JP, editor. Numerical Analysis. Berlin: Springer-Verlag; 1982.
  13. Gu C. Smoothing Spline ANOVA Models. New York, NY: Springer-Verlag; 2002.
  14. Hastie T, Tibshirani R. Generalized Additive Models. Chapman & Hall/CRC; 1990.
  15. Helton J, Marietta M, editors. Special issue: The 1996 performance assessment for the Waste Isolation Pilot Plant. Reliability Engineering and System Safety. 2000;69(1–3):1–451.
  16. Knight K, Fu W. Asymptotics for lasso-type estimators. Annals of Statistics. 2000;28:1356–1378.
  17. Lin Y. Tensor product space ANOVA models. Annals of Statistics. 2000;28(3):734–755.
  18. Lin Y, Zhang H. Component selection and smoothing in smoothing spline analysis of variance models. Annals of Statistics. 2006;34(5):2272–2297.
  19. Nadaraya E. On estimating regression. Theory of Probability and its Applications. 1964;9:141–142.
  20. Schimek M, editor. Smoothing and Regression: Approaches, Computation, and Application. New York, NY: John Wiley & Sons, Inc.; 2000.
  21. Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6(2):461–464.
  22. Stone C, Buja A, Hastie T. The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics. 1994;22(1):118–184.
  23. Stone C, Hansen M, Kooperberg C, Truong Y. 1994 Wald Memorial Lectures: Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics. 1997;25(4):1371–1425.
  24. Storlie C, Helton J. Multiple predictor smoothing methods for sensitivity analysis: example results. Reliability Engineering and System Safety. 2008;93(1):55–77.
  25. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B. 1996;58:267–288.
  26. Uteras F. Natural spline functions: their associated eigenvalue problem. Numerische Mathematik. 1983;42:107–117.
  27. van de Geer S. Empirical Processes in M-Estimation. Cambridge University Press; 2000.
  28. Vaughn P, Bean J, Helton J, Lord M, MacKinnon R, Schreiber J. Representation of two-phase flow in the vicinity of the repository in the 1996 performance assessment for the Waste Isolation Pilot Plant. Reliability Engineering and System Safety. 2000;69(1–3):205–226.
  29. Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics; 1990.
  30. Wahba G, Wang Y, Gu C, Klein R, Klein B. Smoothing spline ANOVA for exponential families, with application to the WESDR. Annals of Statistics. 1995;23:1865–1895.
  31. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429.
  32. Zou H, Hastie T, Tibshirani R. On the "degrees of freedom" of the lasso. Annals of Statistics. 2007;35(5):2173–2192.
