Author manuscript; available in PMC 2021 Apr 5. Published in final edited form as: Stat (Int Stat Inst). 2020 Jul 6; 9(1): e300. doi: 10.1002/sta4.300

Sparse Nonparametric Regression With Regularized Tensor Product Kernel

Hang Yu 1,*, Yuanjia Wang 2, Donglin Zeng 3
PMCID: PMC8021131  NIHMSID: NIHMS1657089  PMID: 33824723

Summary

With growing interest in using black-box machine learning for complex data with many feature variables, it is critical to obtain a prediction model that depends only on a small set of features to maximize generalizability. Therefore, feature selection remains an important and challenging problem in modern applications. Most existing methods for feature selection are based on either parametric or semiparametric models, so the resulting performance can suffer severely from model misspecification when high-order nonlinear interactions among the features are present. Only a limited number of approaches for nonparametric feature selection have been proposed, and they are computationally intensive and may not even converge. In this paper, we propose a novel and computationally efficient approach for nonparametric feature selection in regression based on a tensor-product kernel function over the feature space. The importance of each feature is governed by a parameter in the kernel function, and these parameters can be computed efficiently and iteratively via a modified alternating direction method of multipliers (ADMM) algorithm. We prove the oracle selection property of the proposed method. Finally, we demonstrate the superior performance of our approach compared with existing methods via simulation studies and an application to the prediction of Alzheimer’s disease.

Keywords: Alternating direction method of multipliers, Fisher consistency, Reproducing kernel Hilbert space, Oracle property, Tensor product

1 |. INTRODUCTION

Applications with big data often contain many noisy features that obscure true signals and deteriorate prediction. With growing interest in using complex data and black-box models to predict an outcome in the presence of noisy features, it is critical to obtain generalizable and accurate prediction models that depend only on a small set of features. Therefore, feature selection remains crucial for current big data applications. For example, for neurodegenerative diseases such as Parkinson’s disease and Alzheimer’s disease, identifying a few diagnostic and prognostic biomarkers is the focus of research on early disease detection and intervention development. In particular, distinguishing useful biomarkers from noisy ones is essential for identifying individuals at risk long before irreversible damage has occurred, which has implications for prevention and therapeutic development. As another example, the health and medical records of type 2 diabetes patients are routinely captured electronically over time. Such electronic health records contain patients’ vital signs (blood pressure, heart rate), disease diagnostic biomarkers (e.g., glucose and cholesterol levels), comorbidities and medication history. Thus, it is important to determine which features are predictive of diseases and their treatment outcomes in order to manage individual patients’ healthcare under the framework of precision medicine.

There is an extensive literature on variable selection methods in regression for parametric and semiparametric models, including linear, generalized linear and additive models (e.g., LASSO, Tibshirani (1996); SCAD, Fan and Li (2001); COSSO, Lin and Zhang (2006); and MCP, Zhang (2010)). In these methods, the importance of feature variables is uniquely determined by non-zero coefficients or certain univariate functions in the models. However, high-order interactions are often present among the features in many biomedical applications, so parametric or semiparametric models are likely to be misspecified. Theoretical results on variable selection under these misspecified models thus no longer hold. Approaches proposed for nonparametric feature selection include filter methods and wrapper methods. Filter methods perform feature selection using various dependence measures. For example, Guyon and Elisseeff (2003) assigned each feature an importance score based on its correlation or mutual information with the outcome of interest and then removed the features with low scores; Fan and Lv (2008) proposed Sure Independence Screening (SIS) to reduce ultra-high dimensionality to a relatively large scale, which was further extended to marginal nonparametric learning (Fan, Feng, and Song (2011)). Song, Smola, Gretton, Bedo, and Borgwardt (2012) proposed the Hilbert-Schmidt Independence Criterion as the dependence measure together with a greedy procedure for feature selection. However, filter methods rely on the marginal relationship between each feature and the outcome, so they cannot correctly capture higher-order interactions among the features. Wrapper methods (Kohavi and John (1997); Liu and Zheng (2006); Maldonado and Weber (2009); Chen and Chen (2015); Dasgupta, Goldberg, and Kosorok (2019)) adopt a greedy search algorithm to generate subsets of the features via forward selection or backward elimination. These methods are computationally demanding, and sequential elimination procedures are likely to accumulate errors over steps.

Since nonparametric prediction can be achieved using approximation from a reproducing kernel Hilbert space (RKHS), several works have considered incorporating feature selection into the construction of such a space. Specifically, Weston et al. (2001) introduced a binary indicator variable for each feature in the kernel function that generated the RKHS and then performed variable selection using greedy search, which is computationally intensive. More recently, Allen (2013) proposed a procedure named KerNel Iterative Feature Extraction, in which the prediction function is constructed in a Gaussian RKHS over the feature inputs. A different bandwidth is used in the Gaussian kernel for each feature, so that a larger bandwidth implies less importance of the corresponding feature. In this way, variable selection can be achieved by tuning the bandwidths data-adaptively. However, due to the nonlinearity of the Gaussian kernel and the high sensitivity to the bandwidth choices, in our numerical experience this method is unstable even when the number of feature variables is moderate.

In this work, we propose a novel and computationally efficient approach for nonparametric feature selection when predicting continuous outcomes. Our method considers a feature space given by an RKHS that is defined through a novel tensor-product kernel. Tensor product kernels have been commonly used to integrate features from multiple domains (c.f., Gao and Wu (2012)) in order to account for highly nonlinear interactions among the domains. For feature selection, we treat each individual feature as a separate domain, so the tensor product kernel can potentially capture nonlinear high-order interactions among the features, yielding an adequate approximation to any underlying prediction function. Furthermore, we introduce regularization parameters in the tensor product kernel, where each parameter determines the importance of the corresponding feature variable. In this way, we can estimate the regularization parameters adaptively from the data in order to achieve feature selection and nonparametric function estimation at the same time. Computationally, the estimation of the regularization parameters can be solved efficiently using an iterative procedure, where each iteration is based on a modified alternating direction method of multipliers (ADMM) algorithm. The method essentially reduces to minimizing quadratic functions under positivity constraints. The unique construction of the tensor product kernel yields much greater computational efficiency and numerical stability compared with previous approaches that either depend on subset search or use a highly nonlinear Gaussian kernel function. We prove the theoretical properties of the proposed method, including Fisher consistency and feature selection consistency.

The paper is organized as follows. In Section 2, we describe the proposed method based on a regularized tensor product kernel and then discuss the details of the computational algorithm. In Section 3, we provide theorems establishing the Fisher consistency and the oracle variable selection property of our method. Numerical evidence based on simulations and a real-data application is given in Sections 4 and 5. We conclude the paper with a discussion in Section 6.

2 |. METHOD

Let $Y$ denote the outcome of interest and $X = (X_1, \ldots, X_p)$ denote the $p$-dimensional feature variables. Our goal is to learn a nonparametric prediction function, denoted by $f(X)$, to predict $Y$ using data from $n$ independent subjects, denoted by $(X_i, Y_i)$, $i = 1, 2, \ldots, n$. In the following sections, we focus on the $L_2$-loss to quantify the prediction performance in our method development, although the whole framework applies to any other convex loss function.

2.1 |. Empirical risk minimization on RKHS

Let $H_\kappa$ denote an RKHS with kernel function $\kappa(X, \tilde X)$ (Hofmann, Schölkopf, and Smola (2008)), equipped with the norm $\|\cdot\|_{H_\kappa}$. Commonly used kernel functions for $\kappa$ on $\mathbb{R}^p$ include the Gaussian kernel, $\kappa(x, y) = \exp(-\|x - y\|^2/2\sigma^2)$, and the Epanechnikov kernel, $\kappa(x, y) = \frac{3}{4h}\big(1 - \frac{\|x - y\|^2}{h^2}\big)\, I(\|x - y\| \le h)$. The empirical regularized risk minimization on the RKHS for estimating $f(X)$ solves the following problem:

$\min_f\; \mathbb{P}_n\big((Y - f(X))^2\big) + \gamma_n \|f\|_{H_\kappa}^2,$

where $\mathbb{P}_n$ denotes the empirical measure from the $n$ observations, i.e., for any function $g(Y, X)$, $\mathbb{P}_n g(Y, X) = n^{-1}\sum_{i=1}^n g(Y_i, X_i)$, and $\gamma_n$ is a tuning parameter controlling the complexity of $f$. Based on the representer theorem for RKHS, this optimization is equivalent to solving

$\min_\alpha\; \frac{1}{n}\sum_{i=1}^n \Big\{Y_i - \sum_{j=1}^n \alpha_j \kappa(X_i, X_j)\Big\}^2 + \gamma_n\, \alpha^T K \alpha,$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)^T$. If we denote the kernel matrix by $K = \{\kappa(X_i, X_j)\} \in \mathbb{R}^{n\times n}$, then the solution for $\alpha$ is

$\hat\alpha = (K^T K + n\gamma_n K)^{-1} K^T Y,$

where $Y = (Y_1, \ldots, Y_n)^T$. The resulting prediction function is $\hat f(X) = \sum_{i=1}^n \hat\alpha_i\, \kappa(X, X_i)$. The tuning parameter $\gamma_n$ is estimated via cross-validation.
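As a concrete illustration of the closed-form solution above, the following minimal Python sketch (our own code, not from the paper; the Gaussian kernel and all function names are our assumptions) computes $\hat\alpha = (K^T K + n\gamma_n K)^{-1}K^T Y$ and the corresponding prediction function.

```python
import numpy as np

def gaussian_kernel_matrix(X1, X2, sigma):
    """Gaussian kernel matrix: exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def fit_kernel_ridge(X, Y, sigma, gamma_n):
    """Closed-form solution alpha_hat = (K'K + n*gamma_n*K)^{-1} K'Y."""
    n = len(Y)
    K = gaussian_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K.T @ K + n * gamma_n * K, K.T @ Y)

def predict(X_new, X_train, alpha_hat, sigma):
    """Prediction function f_hat(x) = sum_i alpha_hat_i * kappa(x, X_i)."""
    return gaussian_kernel_matrix(X_new, X_train, sigma) @ alpha_hat
```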

2.2 |. Feature selection using a regularized tensor product kernel

In this section, we describe our proposed method for nonparametric feature selection. First, we introduce a regularized tensor product kernel as follows: for a given nonnegative vector $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_p)^T$, we define the $\lambda$-regularized kernel function for $X = (X_1, X_2, \ldots, X_p)^T$ and $\tilde X = (\tilde X_1, \tilde X_2, \ldots, \tilde X_p)^T$ as

$\kappa_{\lambda,\sigma_n}(X, \tilde X) = \prod_{m=1}^p \big\{1 + \lambda_m \kappa_n(X_m, \tilde X_m)\big\},$    (1)

where $\kappa_n(x, y) = \exp\{-(x - y)^2/2\sigma_n^2\}$ is proportional to the univariate Gaussian kernel with a pre-defined bandwidth $\sigma_n$ in $\mathbb{R}$. Essentially, this is a tensor-product kernel in which the kernel function for each domain (an individual feature in our case, an idea similar to that commonly used in multitask learning, e.g., Suzuki, Kanagawa, Kobayashi, Shimizu, and Tagami (2016)) is given by $1 + \lambda_m \kappa_n(x, y)$. One significant feature of this kernel is the non-negative parameter $\lambda_m$, which regularizes the contribution of feature $m$ to the entire feature space. In Figure 1, we plot this regularized tensor-product kernel in a 2-dimensional feature space for varying $\lambda_1$ and $\lambda_2$. Clearly, when $\lambda_1$ becomes relatively smaller than $\lambda_2$, the entire kernel function is increasingly dominated by $X_2$. When $\lambda_1$ decreases to $\lambda_1 = 0$ with $\lambda_2 > 0$, the kernel function is flat along the direction of $X_1$ and only $X_2$ actively contributes to the distance measure. Similarly, when $\lambda_1 > 0$ and $\lambda_2 = 0$, only $X_1$ actively contributes to the kernel function. Note that for categorical feature variables, $\kappa_n(x, y)$ reduces to $I(x = y)$ when $\sigma_n$ is small enough.
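For illustration, the following minimal sketch (our own Python code; the function name and arguments are our assumptions) evaluates the regularized tensor-product kernel matrix in (1); setting a component of `lam` to zero makes the kernel constant in the corresponding feature.

```python
import numpy as np

def tensor_product_kernel(X1, X2, lam, sigma_n):
    """Regularized tensor-product kernel (1):
    prod_m { 1 + lam[m] * exp(-(x_m - y_m)^2 / (2 sigma_n^2)) }."""
    K = np.ones((X1.shape[0], X2.shape[0]))
    for m in range(X1.shape[1]):
        if lam[m] == 0:
            continue  # feature m drops out of the kernel entirely
        diff = X1[:, m][:, None] - X2[:, m][None, :]
        K *= 1.0 + lam[m] * np.exp(-diff**2 / (2.0 * sigma_n**2))
    return K
```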

FIGURE 1.

Plots of the tensor product kernel in $\mathbb{R}^2$.

Note: The bandwidth is $\sigma_n = 5$ and each kernel is centered at 0. Settings of $\lambda$, from left to right: $\lambda_1 = 0$, $\lambda_2 = 10$; $\lambda_1 = 4$, $\lambda_2 = 10$; $\lambda_1 = 2$, $\lambda_2 = 10$; $\lambda_1 = 2$, $\lambda_2 = 0$.

Several interesting properties of the proposed tensor-product kernel are noteworthy. First, the $\kappa_n$ used in the construction is a Gaussian kernel, so the resulting RKHS preserves the universal approximation property of the Gaussian kernel when $\sigma_n$ is chosen to be small (see Lemmas A.1 and A.2 in the Appendix). In this way, we expect that estimation over the RKHS generated by this tensor-product kernel can approximate any underlying nonparametric prediction function. Second, the regularization parameters, which determine the contribution of each feature variable, can be estimated data-adaptively to reveal the true importance of the feature variables and the true shape of the underlying prediction function. In particular, if $\lambda_m = 0$, the kernel function no longer depends on the $m$-th feature variable. Therefore, we can achieve the goal of feature selection by estimating the regularization parameters from the data through this tensor-product kernel. Finally, there is a significant computational advantage when searching for sparse functions via the $\lambda$'s, as detailed below.

Denote by $H_{\lambda,\sigma_n}$ the RKHS corresponding to $\kappa_{\lambda,\sigma_n}$. We aim to minimize

$L_n(\lambda, f) = \mathbb{P}_n\big((Y - f(X))^2\big) + \gamma_{1n}\|f\|^2_{H_{\lambda,\sigma_n}} + \gamma_{2n}\|\lambda\|_0 \quad \text{subject to } \lambda_1, \lambda_2, \ldots, \lambda_p \ge 0,$    (2)

where $\|\lambda\|_0 = \sum_{m=1}^p I(\lambda_m \ne 0)$, and both $\gamma_{1n}$ and $\gamma_{2n}$ are tuning parameters. Note that, in order to perform variable selection, the objective function (2) includes an $L_0$-penalty as the third term to select the non-zero regularization parameters. With the $L_2$-loss, the optimization problem is equivalent to

$\min_{\lambda,\alpha}\; \frac{1}{n}\sum_{i=1}^n \Big\{Y_i - \sum_{j=1}^n \alpha_j \kappa_{\lambda,\sigma_n}(X_i, X_j)\Big\}^2 + \gamma_{1n}\,\alpha^T K_{\lambda,\sigma_n}\alpha + \gamma_{2n}\|\lambda\|_0 \quad \text{subject to } \lambda_1, \ldots, \lambda_p \ge 0,$    (3)

where $K_{\lambda,\sigma_n}$ is the matrix given by $\{\kappa_{\lambda,\sigma_n}(X_i, X_j)\}$. Note that we optimize over both $\alpha$ and $\lambda$, so this procedure performs estimation of the nonparametric prediction function (via updating $\alpha$) and search of a sparse function space (via updating $\lambda$) simultaneously. This is analogous to feature selection in LASSO for (parametric) linear models, where one aims to find the optimal prediction and the most sparse linear function at the same time. In fact, this optimization is NP-hard, and when $p$ is large it requires evaluating all possible subsets of non-zero components of $\lambda$. Instead, we solve an approximate optimization problem to (3) based on a modified ADMM algorithm. First, with a surrogate parameter $\theta$, we reformulate the objective function (3) as

$\min_{\lambda,\alpha}\; \frac{1}{n}\sum_{i=1}^n \Big(Y_i - \sum_{j=1}^n \alpha_j \kappa_{\lambda,\sigma_n}(X_i, X_j)\Big)^2 + \gamma_{1n}\,\alpha^T K_{\lambda,\sigma_n}\alpha + \gamma_{2n}\|\theta\|_0 \quad \text{subject to } \sum_{m=1}^p |\lambda_m - \theta_m| \le 0,\; \lambda_1, \ldots, \lambda_p \ge 0.$

Denoting the Lagrange multiplier by $\gamma_{3n}$ ($\gamma_{3n} > 0$), the Lagrangian form of the reformulated objective function becomes

$\frac{1}{n}\sum_{i=1}^n \Big(Y_i - \sum_{j=1}^n \alpha_j \kappa_{\lambda,\sigma_n}(X_i, X_j)\Big)^2 + \gamma_{1n}\,\alpha^T K_{\lambda,\sigma_n}\alpha + \gamma_{2n}\|\theta\|_0 + \gamma_{3n}\sum_{m=1}^p |\lambda_m - \theta_m|,$    (4)

subject to $\lambda_m \ge 0$, $m = 1, \ldots, p$. The advantage of the approximation in (4) is that the objective function is strictly convex in the $\lambda_m$'s, while the solution for the $\theta_m$'s is explicit given the other parameters. The details of the algorithm are given in the next section.

2.3 |. Algorithms

We iteratively update all parameters to minimize (4). At the k-th iteration,

$\alpha^{k+1} = \big(K_{\lambda^k,\sigma_n}^T K_{\lambda^k,\sigma_n} + n\gamma_{1n} K_{\lambda^k,\sigma_n}\big)^{-1} K_{\lambda^k,\sigma_n}^T Y,$    (5)
$\lambda^{k+1} = \arg\min_\lambda\; \frac{1}{n}\sum_{i=1}^n \Big(Y_i - \sum_{j=1}^n \alpha_j^{k+1} \kappa_{\lambda,\sigma_n}(X_i, X_j)\Big)^2 + \gamma_{1n}\,(\alpha^{k+1})^T K_{\lambda,\sigma_n}\alpha^{k+1} + \gamma_{3n}\sum_{m=1}^p |\lambda_m - \theta_m^k|,$    (6)
$\theta^{k+1} = \arg\min_\theta\; \gamma_{3n}\sum_{m=1}^p |\lambda_m^{k+1} - \theta_m| + \gamma_{2n}\sum_{m=1}^p I(\theta_m \ne 0).$    (7)

Note that (5) is an explicit expression and that the update of $\theta$ in (7) is given by

$\theta_q^{k+1} = \lambda_q^{k+1}\, I\big(|\lambda_q^{k+1}| > \rho_n\big),$

where $\rho_n = \gamma_{2n}/\gamma_{3n}$.
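To see why this hard threshold solves (7), note that the objective is separable over the coordinates of $\theta$ and, for each $q$, only $\theta_q = 0$ and $\theta_q = \lambda_q^{k+1}$ need to be compared (our own one-line verification, not reproduced from the paper):

$\theta_q^{k+1} = \arg\min_{\theta_q}\; \gamma_{3n}\,|\lambda_q^{k+1} - \theta_q| + \gamma_{2n}\, I(\theta_q \ne 0) = \begin{cases}\lambda_q^{k+1}, & \text{if } \gamma_{3n}|\lambda_q^{k+1}| > \gamma_{2n},\\ 0, & \text{otherwise},\end{cases}$

since $\theta_q = 0$ costs $\gamma_{3n}|\lambda_q^{k+1}|$ while $\theta_q = \lambda_q^{k+1}$ costs at most $\gamma_{2n}$, which gives exactly the threshold $\rho_n = \gamma_{2n}/\gamma_{3n}$.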

The update in (6) is essentially a regression problem with a LASSO-type penalty. Specifically, we use a coordinate descent algorithm to obtain each $\lambda_q$ ($q = 1, 2, \ldots, p$). To obtain $\lambda_q^{k+1}$, we fix $\lambda_1^{k+1}, \lambda_2^{k+1}, \ldots, \lambda_{q-1}^{k+1}, \lambda_{q+1}^{k}, \lambda_{q+2}^{k}, \ldots, \lambda_p^{k}$; after a simple calculation, the objective function takes the following form:

$\min_{\lambda_q}\; \frac{1}{n}\sum_{i=1}^n (a_{iq} + b_{iq}\lambda_q)^2 + d_q \lambda_q,$

where the $a_{iq}$, $b_{iq}$ and $d_q$ are constants whose expressions are given in Appendix S1. This is a quadratic function of $\lambda_q$ under the constraint $\lambda_q \ge 0$; thus, its solution is obtained easily by checking whether the unconstrained minimizer is non-negative.
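Concretely, since the subproblem is a one-dimensional convex quadratic with a nonnegativity constraint, its solution is the projected unconstrained minimizer (our own expansion, keeping the paper's notation for the constants):

$\lambda_q^{k+1} = \max\left\{0,\; -\,\frac{\frac{2}{n}\sum_{i=1}^n a_{iq} b_{iq} + d_q}{\frac{2}{n}\sum_{i=1}^n b_{iq}^2}\right\},$

obtained by setting the derivative of the quadratic to zero and truncating at zero when the unconstrained minimizer is negative.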

The prediction function at an iteration is defined as $\hat f_{\lambda^{k+1}}(X) = \sum_{i=1}^n \alpha_i^{k+1}\, \kappa_{\lambda^{k+1},\sigma_n}(X, X_i).$

Since our goal is to minimize an objective function that penalizes the number of non-zero $\lambda$'s, we set the convergence criterion to be the change in both the objective function and the number of non-zero $\lambda$'s. Let $\delta = |L_n(\hat\lambda^{k+1}, \hat f_{\hat\lambda^{k+1}}) - L_n(\hat\lambda^{k}, \hat f_{\hat\lambda^{k}})|$ and $e = \|\hat\lambda^{k+1}\|_0$. Then our algorithm can be summarized as follows:

  1. At the initial step, set $\hat\lambda^0 = 0$ and $\hat\theta^0 = 0$.

  2. Fix $\hat\lambda^k$ and update $\hat\alpha^{k+1}$ via (5).

  3. For fixed $\hat\alpha^{k+1}$, update $\hat\lambda^{k+1}$ and $\hat\theta^{k+1}$ via the coordinate descent algorithm and the threshold rule (7).

  4. Calculate $\delta = |L_n(\hat\lambda^{k+1}, \hat f_{\hat\lambda^{k+1}}) - L_n(\hat\lambda^{k}, \hat f_{\hat\lambda^{k}})|$ and $e = \|\hat\lambda^{k+1}\|_0$.

  5. Stop if $\delta \le c$ ($c$ is a given cutoff) and $e$ does not change. Otherwise, return to step 2 with the updated $\hat\lambda^{k+1}$.

Since our computation alternates between $\alpha$ and $\lambda$, each update given the other parameters is a convex optimization problem. Thus, the objective function decreases over the iterations and the algorithm converges to a local minimum.
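The following compact sketch puts the pieces together (our own Python illustration, not the authors' code): the $\alpha$-step uses the closed form (5), the $\theta$-step uses the hard threshold with $\rho_n = \gamma_{2n}/\gamma_{3n}$, and the $\lambda$-step replaces the closed-form coordinate descent of Section 2.3 with a bounded scalar minimization per coordinate; `tensor_product_kernel` is the helper sketched in Section 2.2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_sparse_tensor_kernel(X, Y, sigma_n, g1, g2, g3,
                             max_iter=50, tol=1e-4, lam_cap=1e3):
    """Alternating updates for the Lagrangian objective (4)."""
    n, p = X.shape
    lam, theta = np.zeros(p), np.zeros(p)
    rho_n = g2 / g3
    prev_obj = np.inf
    for _ in range(max_iter):
        # alpha-step: closed form (5) with the current lambda
        K = tensor_product_kernel(X, X, lam, sigma_n)
        alpha = np.linalg.solve(K.T @ K + n * g1 * K, K.T @ Y)

        # objective of (6) as a function of lambda (theta held fixed)
        def obj6(l):
            Kl = tensor_product_kernel(X, X, l, sigma_n)
            resid = Y - Kl @ alpha
            return (resid @ resid) / n + g1 * alpha @ Kl @ alpha \
                   + g3 * np.abs(l - theta).sum()

        # lambda-step: coordinate-wise minimization over lambda_q >= 0
        # (lam_cap is an arbitrary upper bound used only for the scalar solver)
        for q in range(p):
            def obj_q(v, q=q):
                l = lam.copy()
                l[q] = v
                return obj6(l)
            lam[q] = minimize_scalar(obj_q, bounds=(0.0, lam_cap),
                                     method="bounded").x

        # theta-step: hard threshold (7)
        theta = lam * (np.abs(lam) > rho_n)

        obj = obj6(lam) + g2 * np.count_nonzero(theta)
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return alpha, lam, theta
```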

In our algorithm, both the tuning parameters and the bandwidth need to be determined. First, following the median trick for the bandwidth of the Gaussian kernel (Jaakkola, Diekhans, and Haussler (1999)), we calculate the pairwise distances between the feature vectors and set $\sigma_n$ such that the proportion of pairs with distance less than $\sigma_n$ is about one half. To tune the other parameters, including $\gamma_{1n}$, $\gamma_{3n}$ and $\rho_n$ (equivalently, $\gamma_{2n}$), we use 5-fold cross-validation, varying them on a grid of $2^{-15}, 2^{-14}, \ldots, 2^{15}$.
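A minimal sketch of the bandwidth rule described above (our own implementation of the median heuristic; not the authors' code):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_bandwidth(X):
    """Choose sigma_n as the median pairwise distance between feature
    vectors, so about half of the pairs are closer than sigma_n."""
    return np.median(pdist(X))
```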

3 |. THEORETICAL RESULTS

In this section, we provide theoretical results to justify the proposed method. In particular, we show that, under some assumptions, the prediction function produced by our method achieves the Bayes risk asymptotically. Furthermore, we show that, with probability tending to one, the variable selection based on the non-zero $\lambda$'s is oracle, i.e., it is as if we had known which variables were important. Without loss of generality, we assume that the first $r$ feature variables are important while the others are not; that is, the Bayes rule, $E[Y\mid X]$, depends only on $X_1, \ldots, X_r$ in the sense that for $m \le r$,

$E\big[(E[Y\mid X] - E[Y\mid X_{-m}])^2\big] > 0,$

and, with probability 1,

$E[Y\mid X] = E[Y\mid X_1, X_2, \ldots, X_r],$

where $X_{-m}$ denotes the random vector $X$ excluding $X_m$. We use $f_0(X_1, \ldots, X_r)$ to denote $E[Y\mid X]$. We further denote by $(\hat\lambda, \hat f_{\hat\lambda})$ the optimal solution of the objective function in (2), where $\hat\lambda = (\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_p)$. Our first main result is the following.

Theorem 1. Assume that $\gamma_{1n}, \gamma_{2n} \to 0$ and let $\gamma_{1n} = \sigma_n^{p/2}$ with $n^{1/2}\sigma_n^{p} \to \infty$. Let $P$ denote the true probability measure, i.e., $P g(Y, X) = E[g(Y, X)]$ for any measurable function $g(Y, X)$ with a finite first moment. Then the following hold:

  1. With probability one, $\lim_{n\to\infty} P\big((Y - \hat f_{\hat\lambda}(X))^2\big) = E\big[(Y - f_0(X))^2\big]$;

  2. $\Pr\{\hat\lambda_m > 0 \text{ for all } m = 1, 2, \ldots, r\} \to 1$.

Theorem 1 implies that the loss of the estimated prediction function converges to the Bayes risk. Moreover, the $\hat\lambda_m$'s associated with the important feature variables are non-zero, i.e., the estimated function does depend on $X_1, \ldots, X_r$.

The following theorem states that, under additional regularity conditions, our method also identifies the unimportant features with probability tending to 1.

Theorem 2. In addition to the assumptions of Theorem 1, assume that $f_0(X_1, \ldots, X_r)$ is twice continuously differentiable and that $\gamma_{2n}$ satisfies

$\gamma_{2n} \gg n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1} + \sigma_n^{\min(2,p)}.$

Then $\Pr\{\hat\lambda_m = 0 \text{ for } m = r+1, \ldots, p\} \to 1$.

Theorem 2 implies that the proposed method estimates the prediction function as if we knew which variables are truly important. The proofs of the theorems are given in the Appendix.

4 |. SIMULATION STUDIES

We conducted simulation studies to examine the performance of the proposed method. First, we considered a model with continuous outcomes and a total of ten feature variables ($p = 10$) and then gradually increased $p$ up to 100. We generated $X_1, \ldots, X_{10}$ from a multivariate normal distribution with mean zero and variance 1, where all variables were independent except that $X_7, X_8, X_9, X_{10}$ were correlated with $\mathrm{corr}(X_7, X_8) = 0.4$, $\mathrm{corr}(X_7, X_9) = 0.3$, $\mathrm{corr}(X_8, X_9) = 0.5$ and $\mathrm{corr}(X_9, X_{10}) = 0.2$. We treated $X_7, X_8, X_9$ as the important variables and simulated the continuous response $Y$ from the following model:

$Y_i = 2X_{i7}X_{i8}X_{i9} + 3.3\exp(X_{i9}) + \epsilon_i,$

where $\epsilon_i \sim N(0, 1)$. We centered the outcome $Y$ to have mean zero. To examine the properties of the proposed method in higher dimensions, we also simulated scenarios with $p = 20, 40, 100$, where the additional independent noise features were generated from the standard normal distribution. We varied the training sample size over $n = 100$, 200 and 400.
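For reference, a sketch of this data-generating mechanism (our own Python code; indices 7-10 correspond to columns 6-9 under zero-based indexing):

```python
import numpy as np

def simulate_data(n, p=10, seed=None):
    """Signal variables X7, X8, X9; Y = 2*X7*X8*X9 + 3.3*exp(X9) + eps, eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # correlated block for (X7, X8, X9, X10); the remaining features are independent noise
    cov = np.array([[1.0, 0.4, 0.3, 0.0],
                    [0.4, 1.0, 0.5, 0.0],
                    [0.3, 0.5, 1.0, 0.2],
                    [0.0, 0.0, 0.2, 1.0]])
    X[:, 6:10] = rng.multivariate_normal(np.zeros(4), cov, size=n)
    Y = 2 * X[:, 6] * X[:, 7] * X[:, 8] + 3.3 * np.exp(X[:, 8]) + rng.standard_normal(n)
    return X, Y - Y.mean()  # outcome centered to mean zero
```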

For each simulated data set, we used the proposed method to learn the prediction function. The tuning parameters were chosen as described in Section 2.3, and $\rho_n$ was set to 0.001. We reported the true positive rates, true negative rates, average numbers of selected variables and prediction errors for our method. In addition, we compared our method with COSSO and LASSO, where COSSO can handle variable selection in nonlinear settings based on smoothing spline ANOVA (SS-ANOVA) and LASSO assumes a linear regression model in which coefficients estimated to be less than 0.0001 are set to zero. The tuning parameters of LASSO and COSSO were chosen by 5-fold cross-validation. The performance comparison was based on the mean squared errors in an independent testing sample.

The results based on 500 replicates are summarized in Table 1. From the feature selection columns, we observe that, for fixed $p$, both the true positive rate and the true negative rate of our method increase as the sample size $n$ grows. For fixed $n$, the true positive rate decreases as $p$ increases due to the additional noise variables. As shown in Table 1, our method successfully selects all three important variables, with both the true positive rate and the true negative rate around 95%, and the average number of selected variables is approximately 3. However, COSSO cannot select all of the important variables, as reflected in both its true positive rates and its average numbers of selected variables; this is because COSSO fits a misspecified model. LASSO does not yield any reasonable variable selection results in this setting (not shown here). The prediction error columns show the mean and median absolute deviation of the prediction errors. Clearly, LASSO gives the worst result since its model is the most severely misspecified. Our method has the best prediction performance, and as the sample size $n$ grows, the prediction errors decrease toward the Bayes error, which is 1 in this case.

TABLE 1.

Summary of feature selection results and prediction errors

           Feature Selection Results                            Prediction Errors
           Proposed                   COSSO
p    n     TPR     TNR     avg.#      TPR     TNR     avg.#     Proposed         COSSO            LASSO
10   100   88.4%   90.7%   3.3        65.4%   96.1%   2.2       2.743 (0.313)    5.650 (0.556)    5.954 (0.121)
10   200   95.4%   95.4%   3.2        82.7%   97.0%   2.7       2.198 (0.205)    5.256 (0.591)    5.689 (0.057)
10   400   97.5%   96.3%   3.2        91.5%   98.3%   2.9       1.889 (0.171)    4.708 (0.698)    5.606 (0.038)
20   100   85.0%   94.7%   3.5        49.5%   97.5%   1.9       3.455 (0.396)    6.664 (0.455)    7.056 (0.168)
20   200   94.1%   97.5%   3.3        55.6%   96.5%   2.3       2.820 (0.253)    6.312 (0.492)    6.659 (0.077)
20   400   97.3%   98.6%   3.2        74.7%   95.3%   3.0       2.464 (0.232)    5.913 (0.570)    6.500 (0.043)
40   100   81.5%   96.7%   3.7        44.3%   98.8%   1.8       2.835 (0.437)    5.693 (0.341)    6.985 (0.434)
40   200   93.4%   98.6%   3.3        54.0%   98.7%   2.1       2.101 (0.220)    5.502 (0.263)    6.042 (0.174)
40   400   97.1%   99.2%   3.2        58.3%   98.2%   2.4       1.783 (0.127)    5.048 (0.386)    5.672 (0.076)
100  100   74.0%   98.4%   3.8        NA      NA      NA        3.562 (0.593)    NA               30.370 (8.124)
100  200   88.0%   99.1%   3.6        49.3%   99.4%   2.1       2.600 (0.312)    5.602 (0.319)    7.597 (0.504)
100  400   93.3%   99.4%   3.4        61.7%   99.2%   2.6       2.172 (0.200)    5.506 (0.232)    6.427 (0.200)

Note: TPR: True positive rate; TNR: True negative rate; avg.#: Average number of selected variables. The numbers are the mean of prediction errors and the numbers within parentheses are the median absolute deviations from 500 replicates. “NA”: Results are not available due to failure of the methods.

5 |. APPLICATION TO THE ALZHEIMER’S DISEASE NEUROIMAGING INITIATIVE STUDY

We applied the proposed method to analyze data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study (Toledo, Bjerke, Da, and Landau (2015)). The feature variables included demographic variables (age, gender, race, education level), APoE4 mutation status, clinical variables (functional assessment questionnaire (FAQ), ADAS-cog11, MMSE) and 7 imaging biomarkers (fluorodeoxyglucose, ventricles, hippocampus, whole brain, entorhinal cortex, fusiform gyrus, middle temporal gyrus). Our goal was to assess how well Alzheimer’s disease (AD) biomarkers measured from the invasive cerebrospinal fluid (CSF) procedure can be predicted from clinical or biomarker data collected by non-invasive procedures. Thus, we aimed to identify important feature variables to predict t-tau and Aβ-42 protein levels measured from CSF in the ADNI study. There were 535 subjects included in the analysis of t-tau and 542 in that of Aβ-42. We randomly divided the subjects so that 70% were used for training and 30% for testing. We applied the proposed method to learn the prediction rule using the training sample. Since the gender and race variables are binary, as mentioned before, we set the individual kernels for these two features in (1) to $\kappa(X, \tilde X) = I(X = \tilde X)$. For the other feature variables, we used individual Gaussian kernels with the same bandwidth as described before. We standardized the outcomes and all continuous feature variables. The tuning parameters $\gamma$ were obtained from 5-fold cross-validation in the training sample. For comparison, we also fit LASSO and COSSO to the same data with 5-fold cross-validation for tuning, where coefficients estimated to be less than 0.0001 in LASSO were thresholded to zero, and then compared their prediction performance in the testing sample. To obtain a reliable comparison, we repeated the same analysis for 500 random splits into training and testing samples.

Figure 2 shows smoothed plots of the outcome variables versus the imaging feature variables, which give an intuition for the nonlinear relationships between them. Figure 3 gives the frequency with which each variable was selected. The prediction error of each method, together with the average and range of the numbers of selected variables over the 500 random splits, is shown in Table 2. It is clear from this table that our method yields the smallest prediction errors. Among the 500 replications, for the outcome t-tau, the most frequently chosen features in our method were gender, APoE4, MMSE, ADAS, ventricles, hippocampus and middle temporal gyrus, while for Aβ-42, APoE4, ADAS, MMSE and hippocampus were highly selected. In addition, FAQ is moderately important for the outcome Aβ-42 but shows no importance for t-tau. In contrast, for t-tau, COSSO highly selected APoE4, MMSE, ADAS, ventricles and middle temporal gyrus but failed to select gender and hippocampus, which may be one reason for its large prediction error. For Aβ-42, age, APoE4, ADAS, FAQ, hippocampus and middle temporal gyrus were frequently chosen by COSSO. We also notice that COSSO gave a large mean prediction error with high variability but a reasonable median prediction error, indicating the presence of outlying prediction errors among the 500 replications. LASSO removed almost no noise variables and selected approximately 15 variables on average for both outcomes. Finally, when applying our method to the whole sample, for the t-tau outcome, our method selected 8 feature variables as important, namely gender, APoE4, FDG, ADAS11, MMSE, ventricles, hippocampus and MidTemp, with a prediction error of 0.839. For Aβ-42, there were 4 important features, namely APoE4, ADAS11, MMSE and hippocampus, with a prediction error of 0.811. The data that support the findings of this study are openly available at http://adni.loni.usc.edu.

FIGURE 2.

Smoothed plots of outcome variables versus imaging feature variables in the ADNI data.

FIGURE 3.

Frequency of variables selected over 500 random sample splits.

TABLE 2.

Summary of feature selection results in the application to the ADNI study

                              t-tau                                             Aβ-42
                              Proposed        COSSO            LASSO            Proposed        COSSO            LASSO
Mean of prediction error      0.870 (0.028)   1.745 (1.869)    0.881 (0.027)    0.830 (0.032)   6.748 (13.176)   0.838 (0.031)
Median of prediction error    0.869           0.923            0.880            0.829           0.859            0.836
Avg.#                         8.08            6.12             14.96            5.72            6.34             14.98
Range #                       [4,12]          [1,14]           [14,15]          [3,13]          [1,14]           [14,15]

Note: The numbers are the mean and median of the prediction errors from 500 replicates; the numbers within parentheses are the corresponding standard deviations. Avg.#: average number of selected variables. Range #: range of the number of selected variables.

6 |. CONCLUSIONS

In this work, we propose a regularized tensor product kernel for sparse nonparametric regression in the presence of nonlinear relationships. The importance of each feature is captured by a non-negative parameter in the kernel function. Our approach is computationally efficient because it iteratively optimizes convex quadratic functions within a modified ADMM algorithm. Theoretically, we have shown that our method achieves oracle feature selection. The superior performance of the proposed method was demonstrated via simulation studies and a real data application. Note that both our algorithm and theory can be extended to higher-dimensional feature variables, or even ultra-high-dimensional settings.

Here we focus on a regression problem with the $L_2$ loss function. However, our method can be extended to other machine learning approaches with different losses, such as the hinge loss (support vector machines) and boosting. Feature selection can then be performed simultaneously when training these machine learning algorithms. We expect that the same iterative algorithm applies, although the step of updating the regularization parameters may differ; it remains a convex optimization problem with linear constraints.

Finally, our method can also be generalized to feature selection problems in which the feature variables are collected from different domains (imaging, genomics, clinical biomarkers), which is common in integrative data analysis. To account for the hierarchical structure of multiple domains, one possibility is to construct a hierarchical tensor product kernel with regularization parameters for both the domains and the features within each domain. In this way, we can perform domain selection and feature selection at the same time. We will consider such extensions in future work.


ACKNOWLEDGMENTS

This research is supported by U.S. NIH grants NS073671, GM124104, and MH117458.


APPENDIX

Proof of Theorems

Before proving the theorems, we need two lemmas; their proofs can be found in the Supporting Information. The first lemma shows that the proposed kernel function is positive semi-definite.

Lemma A.1. For any positive constants $\lambda_1, \ldots, \lambda_r > 0$,

$\kappa_{\lambda,\sigma_n}(X, \tilde X) = \prod_{m=1}^p \big(1 + \lambda_m \kappa_n(X_m, \tilde X_m)\big)$

is a kernel satisfying the positive semi-definiteness condition.

The next lemma shows that, when $\sigma_n$ goes to zero, the closure of the reproducing kernel Hilbert space generated by the tensor product kernel contains the true function $f_0(X_1, \ldots, X_r)$ provided $\lambda_m > 0$ for all $m \le r$. We define $d(f_0, H_{\lambda,\sigma_n})$ as the $L_2(P)$ distance between $f_0$ and the reproducing kernel Hilbert space.

Lemma A.2. Assume that $\sigma_n \to 0$. For any $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_p)$ with $\lambda_m \ge 0$, $m \le p$:

  1. If $\lambda_m \ne 0$ for all $m \le r$, then $d(f_0(X_1, X_2, \ldots, X_r), H_{\lambda,\sigma_n}) \to 0$. In fact, the closure of $\limsup_n H_{\lambda,\sigma_n}$ contains any $L_2$-integrable function that depends only on $(X_1, \ldots, X_r)$.

  2. If $\lambda_m = 0$ for some $m \le r$, then $\liminf_n d(f_0(X_1, X_2, \ldots, X_r), H_{\lambda,\sigma_n}) > 0$.

Before proving the two theorems, recall that $P$ denotes the true probability measure and $\mathbb{P}_n$ denotes the empirical measure from the $n$ observations.

Proof of Theorem 1. By Lemma A.2, for any $\lambda_0 = (\lambda_{01}, 0)$, where $\lambda_{01} = (\lambda_{011}, \lambda_{012}, \ldots, \lambda_{01r})$ and $\lambda_{01m} > 0$ for all $m \le r$, there exists $\tilde f_{\lambda_0} \in H_{\lambda_0,\sigma_n}$ such that $d(f_0, \tilde f_{\lambda_0}) = \|\tilde f_{\lambda_0} - f_0\|_{L_2} \to 0$. Since $(\hat\lambda, \hat f_{\hat\lambda})$ is the optimal solution of the objective function (2), we have

$\mathbb{P}_n l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{1n}\|\hat f\|^2_{H_{\hat\lambda,\sigma_n}} + \gamma_{2n}\|\hat\lambda\|_0 \le \mathbb{P}_n l(Y, \tilde f_{\lambda_0}(X)) + \gamma_{1n}\|\tilde f\|^2_{H_{\lambda_0,\sigma_n}} + \gamma_{2n}\|\lambda_0\|_0.$

That is,

$(\mathbb{P}_n - P)\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{1n}\|\hat f\|^2_{H_{\hat\lambda,\sigma_n}} + P\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{2n}\|\hat\lambda\|_0 \le (\mathbb{P}_n - P)\, l(Y, \tilde f_{\lambda_0}(X)) + \gamma_{1n}\|\tilde f\|^2_{H_{\lambda_0,\sigma_n}} + P\, l(Y, \tilde f_{\lambda_0}(X)) + \gamma_{2n}\|\lambda_0\|_0.$

Following arguments similar to those for Theorem 3.1 in Steinwart and Scovel (2007), since $\|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})$, the uniform covering numbers of this bounded set in the reproducing kernel Hilbert space can be bounded as follows.

First, the entropy number (van der Vaart and Wellner (1996)) for the unit ball in $H_{\hat\lambda,\sigma_n}$, denoted by $O_n$, satisfies

$\log N(\epsilon, O_n, \|\cdot\|_\infty) \le c_1(v, p)\, \sigma_n^{-(1 - v/4)p}\, \epsilon^{-v},$

where $0 < v < 2$ and $c_1(v, p)$ is a constant that depends only on $v$ and $p$. It then follows that

$\log N_{[\,]}(\epsilon, O_n, L_4(P)) \le c_1(v, p)\, \sigma_n^{-(1 - v/4)p}\, \epsilon^{-v}.$

Thus, we obtain

$\log N_{[\,]}\big(\epsilon, \{\hat f : \hat f \in H_{\hat\lambda,\sigma_n},\ \|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})\}, L_4(P)\big) \le c_1(v, p)\, \sigma_n^{-(1 - v/4)p}\, \epsilon^{-v}\gamma_{1n}^{-v/2}.$

Note that

$\|l(Y, \hat f_1) - l(Y, \hat f_2)\|_{L_2(P)} = \|(Y - \hat f_1)^2 - (Y - \hat f_2)^2\|_{L_2(P)} = \big(E\big[(2Y - \hat f_1 - \hat f_2)^2(\hat f_2 - \hat f_1)^2\big]\big)^{1/2} \le \big(E(\hat f_2 - \hat f_1)^4\big)^{1/4}\big(E(2Y - \hat f_1 - \hat f_2)^4\big)^{1/4},$

where $\big(E(2Y - \hat f_1 - \hat f_2)^4\big)^{1/2} \le 3^{3/2}\big(16 E(Y^4) + E(\hat f_1^4) + E(\hat f_2^4)\big)^{1/2} \le c$ for a constant $c$. Thus, we obtain $\|l(Y, \hat f_1) - l(Y, \hat f_2)\|_{L_2(P)} \le c\, \big(\|\hat f_1 - \hat f_2\|_{L_2(P)}\big)^{2}$, which yields

$\log N_{[\,]}\big(\epsilon, \{l(Y, \hat f) : \hat f \in H_{\hat\lambda,\sigma_n},\ \|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})\}, L_4(P)\big) \le c_2(v, p, c)\, \sigma_n^{-(1 - v/4)p}\, \epsilon^{-v/2}\gamma_{1n}^{-v/2}.$

Let $c$ denote a constant. The above result implies

$E\Big[\sup_{\hat f \in H_{\hat\lambda,\sigma_n},\, \|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})} \big|(\mathbb{P}_n - P)\, l(Y, \hat f)\big|\Big] \le c_3 n^{-1/2}\int_0^{c\gamma_{1n}^{-1}} \sqrt{1 + \log N_{[\,]}\big(\epsilon, \{l(Y, \hat f) : \hat f \in H_{\hat\lambda,\sigma_n},\ \|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})\}, L_4(P)\big)}\, d\epsilon \le k(v, p)\, n^{-1/2}\sigma_n^{-(1/2 - v/8)p}\gamma_{1n}^{-1}.$

In other words, $(\mathbb{P}_n - P)\, l(Y, \hat f_{\hat\lambda}(X)) = O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1})$ by the choice of $\gamma_{1n}$ in Theorem 1. Similarly,

$(\mathbb{P}_n - P)\, l(Y, \tilde f_{\lambda_0}(X)) = O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}).$

Therefore, letting $n \to \infty$ and using $\gamma_{1n} \to 0$, we have

$\limsup_{n\to\infty}\big(P\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{2n}\|\hat\lambda\|_0\big) \le \limsup_{n\to\infty}\big(P\, l(Y, \tilde f_{\lambda_0}(X)) + O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}) + \gamma_{2n}\|\lambda_0\|_0\big).$

Under the assumption that $n^{-1/2}\sigma_n^{-p} \to 0$ and $\gamma_{2n} \to 0$, we have

$\limsup_{n\to\infty} P\, l(Y, \hat f_{\hat\lambda}(X)) \le \limsup_{n\to\infty} P\, l(Y, \tilde f_{\lambda_0}(X)) = E\, l(Y, f_0(X)).$

Since $E\, l(Y, f_0(X)) \le \limsup_{n\to\infty} P\, l(Y, \hat f_{\hat\lambda}(X))$, we obtain

$\lim_{n\to\infty} P\, l(Y, \hat f_{\hat\lambda}) = E\, l(Y, f_0(X)).$

Together with Lemma A.2, this also gives the result that $\hat\lambda_m > 0$ for all $m \le r$. Otherwise, if $\hat\lambda_m = 0$ for some $m \le r$, then by Lemma A.2(ii) it always holds that $\lim_{n\to\infty} P\, l(Y, \hat f_{\hat\lambda}) - E\, l(Y, f_0(X)) \ge \liminf_n d\big(f_0(X_1, X_2, \ldots, X_r), H_{\hat\lambda,\sigma_n}\big) > 0$, a contradiction. □

Proof of Theorem 2. For this proof, we choose $\tilde f_{\lambda_0}$ to be the Gaussian-kernel convolution of $f_0$ in the space of $(X_1, \ldots, X_r)$, where the kernel is given by $\prod_{m=1}^r \kappa_n(X_m, \tilde X_m)$. Clearly, $\tilde f_{\lambda_0}$ belongs to $H_{\lambda_0,\sigma_n}$, and since $f_0(X)$ is twice continuously differentiable,

$d(\tilde f_{\lambda_0}, f_0) = O(\sigma_n^2).$

By the same arguments as in the proof of Theorem 1, we have

$P\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{2n}\|\hat\lambda\|_0 \le P\, l(Y, \tilde f_{\lambda_0}(X)) + O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}) + O(\gamma_{1n}) + \gamma_{2n}\|\lambda_0\|_0.$

Thus,

$P\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{2n}\|\hat\lambda\|_0 \le E\, l(Y, f_0(X)) + O(\sigma_n^{\min(2,p)}) + O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}) + \gamma_{2n}\|\lambda_0\|_0.$

Since it always holds that $E\, l(Y, f_0(X)) \le P\, l(Y, \hat f_{\hat\lambda}(X))$,

$\gamma_{2n}\|\hat\lambda\|_0 \le O(\sigma_n^{\min(2,p)}) + O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}) + \gamma_{2n}\|\lambda_0\|_0.$

Thus, dividing both sides by $\gamma_{2n}$, we obtain

$\|\hat\lambda\|_0 \le O(\gamma_{2n}^{-1}\sigma_n^{\min(2,p)}) + O(\gamma_{1n}^{-1}\gamma_{2n}^{-1}n^{-1/2}\sigma_n^{-p/2}) + r.$

Under the assumptions of Theorem 2, we conclude that, with probability tending to one, the number of non-zero components of $\hat\lambda$ cannot be larger than $r$. However, Theorem 1 implies that this number is at least $r$. We thus obtain Theorem 2. □

Footnotes

Financial disclosure

None reported.

Conflict of interest

The authors declare no potential conflict of interests.

SUPPORTING INFORMATION

Additional information for this article is available:

Appendix S1. Expression of Constants in Updating λ’s

Appendix S2. Proof of two lemmas

References

  1. Allen GI (2013). Automatic feature selection via weighted kernels and regularization. Journal of Computational and Graphical Statistics, 22(2), 284–299.
  2. Chen G, & Chen J (2015). A novel wrapper method for feature selection and its application. Neurocomputing, 159(2), 219–226.
  3. Dasgupta S, Goldberg Y, & Kosorok MR (2019). Feature elimination in kernel machines in moderately high dimensions. The Annals of Statistics, 47(1), 497–526.
  4. Fan J, Feng Y, & Song R (2011). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106(494), 544–557.
  5. Fan J, & Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
  6. Fan J, & Lv J (2008). Sure independence screening for ultra-high dimensional feature space. Journal of the Royal Statistical Society, Series B, 70(5), 849–911.
  7. Gao C, & Wu X (2012). Kernel support tensor regression. 2012 International Workshop on Information and Electronics Engineering (IWIEE), 29, 3986–3990.
  8. Guyon I, & Elisseeff A (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
  9. Hofmann T, Schölkopf B, & Smola AJ (2008). Kernel methods in machine learning. The Annals of Statistics, 36(3), 1171–1220.
  10. Jaakkola T, Diekhans M, & Haussler D (1999). Using the Fisher kernel method to detect remote protein homologies. ISMB, 99, 149–158.
  11. Kohavi R, & John GH (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.
  12. Lin Y, & Zhang HH (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5), 2272–2297.
  13. Liu Y, & Zheng YF (2006). A novel feature selection method for support vector machines. Pattern Recognition, 39(7), 1333–1345.
  14. Maldonado S, & Weber R (2009). A wrapper method for feature selection using support vector machines. Information Sciences, 179(13), 2208–2217.
  15. Song L, Smola A, Gretton A, Bedo J, & Borgwardt K (2012). Feature selection via dependence maximization. Journal of Machine Learning Research, 13, 1393–1434.
  16. Steinwart I, & Scovel C (2007). Fast rates for support vector machines using Gaussian kernels. The Annals of Statistics, 35(2), 575–607.
  17. Suzuki T, Kanagawa H, Kobayashi H, Shimizu N, & Tagami Y (2016). Minimax optimal alternating minimization for kernel nonparametric tensor learning. Advances in Neural Information Processing Systems, 29, 3783–3791.
  18. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
  19. Toledo JB, Bjerke M, Da X, & Landau SM (2015). Nonlinear association between cerebrospinal fluid and florbetapir F-18 β-amyloid measures across the spectrum of Alzheimer disease. JAMA Neurology, 72(5), 571–581.
  20. van der Vaart A, & Wellner JA (1996). Weak Convergence and Empirical Processes. New York: Springer.
  21. Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, & Vapnik V (2001). Feature selection for SVMs. In Advances in Neural Information Processing Systems.
  22. Zhang C (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
