Author manuscript; available in PMC 2021 Apr 5. Published in final edited form as: Stat (Int Stat Inst). 2020 Jul 6; 9(1): e300. doi: 10.1002/sta4.300

Sparse Nonparametric Regression With Regularized Tensor Product Kernel

Hang Yu 1,*, Yuanjia Wang 2, Donglin Zeng 3
PMCID: PMC8021131  NIHMSID: NIHMS1657089  PMID: 33824723

Summary

With growing interest in using black-box machine learning for complex data with many feature variables, it is critical to obtain a prediction model that depends only on a small set of features to maximize generalizability. Therefore, feature selection remains an important and challenging problem in modern applications. Most existing methods for feature selection are based on either parametric or semiparametric models, so the resulting performance can suffer severely from model misspecification when high-order nonlinear interactions among the features are present. Only a limited number of approaches for nonparametric feature selection have been proposed, and they are computationally intensive and may not even converge. In this paper, we propose a novel and computationally efficient approach for nonparametric feature selection in regression based on a tensor-product kernel function over the feature space. The importance of each feature is governed by a parameter in the kernel function, and these parameters can be computed efficiently and iteratively via a modified alternating direction method of multipliers (ADMM) algorithm. We prove the oracle selection property of the proposed method. Finally, we demonstrate the superior performance of our approach compared with existing methods via simulation studies and an application to the prediction of Alzheimer’s disease.

Keywords: Alternating direction method of multipliers, Fisher consistency, Reproducing kernel Hilbert space, Oracle property, Tensor product

1 |. INTRODUCTION

Applications with big data often contain many noisy features that obscure true signals and deteriorate prediction. With growing interest in using complex data and black-box models to predict an outcome in the presence of noisy features, it is critical to obtain generalizable and accurate prediction models that depend only on a small set of features. Therefore, feature selection remains crucial for current big data applications. For example, for neurodegenerative diseases such as Parkinson’s disease and Alzheimer’s disease, identifying a few diagnostic and prognostic biomarkers is the focus of research on early disease detection and intervention development. In particular, distinguishing useful biomarkers from noisy ones is essential for identifying individuals at risk long before irreversible damage has occurred, which has implications for prevention and therapeutic development. As another example, the health and medical records of type 2 diabetes patients are routinely captured electronically over time. Such electronic health records contain patients’ vital signs (blood pressure, heart rate), disease diagnostic biomarkers (e.g., glucose and cholesterol levels), comorbidities and medication history. Thus, it is important to determine which features are predictive of diseases and their treatment outcomes in order to manage individual patients’ healthcare under the framework of precision medicine.

There is an extensive literature on variable selection methods in regression for parametric and semiparametric models, including linear, generalized linear and additive models (e.g., LASSO, Tibshirani (1996); SCAD, Fan and Li (2001); COSSO, Lin and Zhang (2006); and MCP, Zhang (2010)). In these methods, the importance of feature variables is uniquely determined by non-zero coefficients or certain univariate functions in the models. However, high-order interactions are often present among the features in many biomedical applications, so parametric or semiparametric models are likely to be misspecified. Theoretical results on variable selection under these misspecified models thus no longer hold. Approaches proposed for nonparametric feature selection include filter methods and wrapper methods. Filter methods perform feature selection using various dependence measures. For example, Guyon and Elisseeff (2003) assigned each feature an importance score based on its correlation or mutual information with the outcome of interest and then removed the features with low scores; Fan and Lv (2008) proposed Sure Independence Screening (SIS) to reduce ultra-high dimensionality to a relatively large scale, which was further extended to marginal nonparametric learning (Fan, Feng, and Song (2011)). Song, Smola, Gretton, Bedo, and Borgwardt (2012) proposed the Hilbert-Schmidt Independence Criterion as the dependence measure together with a greedy procedure for feature selection. However, filter methods rely on the marginal relationship between each feature and the outcome, so they cannot correctly capture higher-order interactions among the features. Wrapper methods (Kohavi and John (1997); Liu and Zheng (2006); Maldonado and Weber (2009); Chen and Chen (2015); Dasgupta, Goldberg, and Kosorok (2019)) adopt a greedy search algorithm to generate subsets of the features via forward selection or backward elimination. These methods are computationally demanding, and sequential elimination procedures are likely to accumulate errors over steps.

Since nonparametric prediction can be achieved using approximation from a reproducing kernel Hilbert space (RKHS), several works have considered incorporating feature selection into the construction of such a space. Specifically, Weston et al. (2001) introduced a binary indicator variable for each feature in the kernel function that generated the RKHS and then performed variable selection using greedy search, which is computationally intensive. More recently, Allen (2013) proposed a procedure named KerNel Iterative Feature Extraction, in which the prediction function is constructed in a Gaussian RKHS over the feature inputs. A different bandwidth is used in the Gaussian kernel for each feature, so that a larger bandwidth implies less importance of the corresponding feature. In this way, variable selection can be achieved by tuning the bandwidths data-adaptively. However, due to the nonlinearity of the Gaussian kernel and the high sensitivity to the bandwidth choices, in our numerical experience this method is unstable even when the number of feature variables is moderate.

In this work, we propose a novel and computationally efficient approach for nonparametric feature selection when predicting continuous outcomes. Our method considers a feature space given by an RKHS that is defined through a novel tensor-product kernel. Tensor product kernels have been commonly used to integrate features from multiple domains (c.f., Gao and Wu (2012)) in order to account for highly nonlinear interactions among the domains. For feature selection, we treat each individual feature as a separate domain, so the tensor product kernel can potentially capture nonlinear high-order interactions among the features, yielding an adequate approximation to any underlying prediction function. Furthermore, we introduce regularization parameters in the tensor product kernel, where each parameter determines the importance of the corresponding feature variable. In this way, we can estimate the regularization parameters adaptively from the data in order to achieve feature selection and nonparametric function estimation at the same time. Computationally, the estimation of the regularization parameters can be solved efficiently using an iterative procedure, where each iteration is based on a modified alternating direction method of multipliers (ADMM) algorithm. The method essentially reduces to minimizing quadratic functions under positivity constraints. The unique construction of the tensor product kernel yields much greater computational efficiency and numerical stability compared with previous approaches that either depend on subset search or use a highly nonlinear Gaussian kernel function. We prove the theoretical properties of the proposed method, including Fisher consistency and feature selection consistency.

The paper is organized as follows. In Section 2, we describe the proposed method based on a regularized tensor product kernel and then discuss the details of the computational algorithm. In Section 3, we provide theorems establishing the Fisher consistency and the oracle variable selection property of our method. Numerical evidence based on simulations and a real-data application is given in Sections 4 and 5. We conclude the paper with a discussion in Section 6.

2 |. METHOD

Let $Y$ denote the outcome of interest and $X = (X_1, \ldots, X_p)$ denote the $p$-dimensional feature variables. Our goal is to learn a nonparametric prediction function, denoted by $f(X)$, to predict $Y$ using data from $n$ independent subjects, denoted by $(X_i, Y_i)$, $i = 1, 2, \ldots, n$. In the following sections, we focus on the $L_2$-loss to quantify the prediction performance in our method development, although the whole framework applies to any other convex loss function.

2.1 |. Empirical risk minimization on RKHS

Let $H_\kappa$ denote an RKHS with kernel function $\kappa(X, \tilde X)$ (Hofmann, Schölkopf, and Smola (2008)), equipped with the norm $\|\cdot\|_{H_\kappa}$. Commonly used kernel functions for $\kappa$ on $\mathbb{R}^p$ include the Gaussian kernel, $\kappa(x, y) = \exp(-\|x - y\|^2/2\sigma^2)$, and the Epanechnikov kernel, $\kappa(x, y) = \frac{3}{4h}\big(1 - \frac{\|x - y\|^2}{h^2}\big)\, I(\|x - y\| \le h)$. The empirical regularized risk minimization on the RKHS for estimating $f(X)$ solves the following problem:

$\min_f\; \mathbb{P}_n\big((Y - f(X))^2\big) + \gamma_n \|f\|_{H_\kappa}^2,$

where $\mathbb{P}_n$ denotes the empirical measure from the $n$ observations, i.e., for any function $g(Y, X)$, $\mathbb{P}_n g(Y, X) = n^{-1}\sum_{i=1}^n g(Y_i, X_i)$, and $\gamma_n$ is a tuning parameter controlling the complexity of $f$. Based on the representer theorem for RKHS, this optimization is equivalent to solving

$\min_\alpha\; \frac{1}{n}\sum_{i=1}^n \Big\{Y_i - \sum_{j=1}^n \alpha_j \kappa(X_i, X_j)\Big\}^2 + \gamma_n\, \alpha^T K \alpha,$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)^T$. If we denote the kernel matrix by $K = \{\kappa(X_i, X_j)\} \in \mathbb{R}^{n\times n}$, then the solution for $\alpha$ is

$\hat\alpha = (K^T K + n\gamma_n K)^{-1} K^T Y,$

where $Y = (Y_1, \ldots, Y_n)^T$. The resulting prediction function is $\hat f(X) = \sum_{i=1}^n \hat\alpha_i\, \kappa(X, X_i)$. The tuning parameter $\gamma_n$ is estimated via cross-validation.
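As a concrete illustration of the closed-form solution above, the following minimal Python sketch (our own code, not from the paper; the Gaussian kernel and all function names are our assumptions) computes $\hat\alpha = (K^T K + n\gamma_n K)^{-1}K^T Y$ and the corresponding prediction function.

```python
import numpy as np

def gaussian_kernel_matrix(X1, X2, sigma):
    """Gaussian kernel matrix: exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def fit_kernel_ridge(X, Y, sigma, gamma_n):
    """Closed-form solution alpha_hat = (K'K + n*gamma_n*K)^{-1} K'Y."""
    n = len(Y)
    K = gaussian_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K.T @ K + n * gamma_n * K, K.T @ Y)

def predict(X_new, X_train, alpha_hat, sigma):
    """Prediction function f_hat(x) = sum_i alpha_hat_i * kappa(x, X_i)."""
    return gaussian_kernel_matrix(X_new, X_train, sigma) @ alpha_hat
```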

2.2 |. Feature selection using a regularized tensor product kernel

In this section, we describe our proposed method for nonparametric feature selection. First, we introduce a regularized tensor product kernel as follows: for a given nonnegative vector $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_p)^T$, we define the $\lambda$-regularized kernel function for $X = (X_1, X_2, \ldots, X_p)^T$ and $\tilde X = (\tilde X_1, \tilde X_2, \ldots, \tilde X_p)^T$ as

$\kappa_{\lambda,\sigma_n}(X, \tilde X) = \prod_{m=1}^p \big\{1 + \lambda_m \kappa_n(X_m, \tilde X_m)\big\},$    (1)

where $\kappa_n(x, y) = \exp\{-(x - y)^2/2\sigma_n^2\}$ is proportional to the univariate Gaussian kernel with a pre-defined bandwidth $\sigma_n$ in $\mathbb{R}$. Essentially, this is a tensor-product kernel in which the kernel function for each domain (an individual feature in our case, an idea similar to that commonly used in multitask learning, e.g., Suzuki, Kanagawa, Kobayashi, Shimizu, and Tagami (2016)) is given by $1 + \lambda_m \kappa_n(x, y)$. One significant feature of this kernel is the non-negative parameter $\lambda_m$, which regularizes the contribution of feature $m$ to the entire feature space. In Figure 1, we plot this regularized tensor-product kernel in a 2-dimensional feature space for varying $\lambda_1$ and $\lambda_2$. Clearly, when $\lambda_1$ becomes relatively smaller than $\lambda_2$, the entire kernel function is increasingly dominated by $X_2$. When $\lambda_1$ decreases to $\lambda_1 = 0$ with $\lambda_2 > 0$, the kernel function is flat along the direction of $X_1$ and only $X_2$ actively contributes to the distance measure. Similarly, when $\lambda_1 > 0$ and $\lambda_2 = 0$, only $X_1$ actively contributes to the kernel function. Note that for categorical feature variables, $\kappa_n(x, y)$ reduces to $I(x = y)$ when $\sigma_n$ is small enough.
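For illustration, the following minimal sketch (our own Python code; the function name and arguments are our assumptions) evaluates the regularized tensor-product kernel matrix in (1); setting a component of `lam` to zero makes the kernel constant in the corresponding feature.

```python
import numpy as np

def tensor_product_kernel(X1, X2, lam, sigma_n):
    """Regularized tensor-product kernel (1):
    prod_m { 1 + lam[m] * exp(-(x_m - y_m)^2 / (2 sigma_n^2)) }."""
    K = np.ones((X1.shape[0], X2.shape[0]))
    for m in range(X1.shape[1]):
        if lam[m] == 0:
            continue  # feature m drops out of the kernel entirely
        diff = X1[:, m][:, None] - X2[:, m][None, :]
        K *= 1.0 + lam[m] * np.exp(-diff**2 / (2.0 * sigma_n**2))
    return K
```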

FIGURE 1.

Plots of the tensor product kernel in $\mathbb{R}^2$.

Note: The bandwidth is $\sigma_n = 5$ and each kernel is centered at 0. Settings of $\lambda$, from left to right: $\lambda_1 = 0$, $\lambda_2 = 10$; $\lambda_1 = 4$, $\lambda_2 = 10$; $\lambda_1 = 2$, $\lambda_2 = 10$; $\lambda_1 = 2$, $\lambda_2 = 0$.

Several interesting properties of the proposed tensor-product kernel are noteworthy. First, the $\kappa_n$ used in the construction is a Gaussian kernel, so the resulting RKHS preserves the universal approximation property of the Gaussian kernel when $\sigma_n$ is chosen to be small (see Lemmas A.1 and A.2 in the Appendix). In this way, we expect that estimation over the RKHS generated by this tensor-product kernel can approximate any underlying nonparametric prediction function. Second, the regularization parameters, which determine the contribution of each feature variable, can be estimated data-adaptively to reveal the true importance of the feature variables and the true shape of the underlying prediction function. In particular, if $\lambda_m = 0$, the kernel function no longer depends on the $m$-th feature variable. Therefore, we can achieve the goal of feature selection by estimating the regularization parameters from the data through this tensor-product kernel. Finally, there is a significant computational advantage when searching for sparse functions via the $\lambda$'s, as detailed below.

Denote by $H_{\lambda,\sigma_n}$ the RKHS corresponding to $\kappa_{\lambda,\sigma_n}$. We aim to minimize

$L_n(\lambda, f) = \mathbb{P}_n\big((Y - f(X))^2\big) + \gamma_{1n}\|f\|^2_{H_{\lambda,\sigma_n}} + \gamma_{2n}\|\lambda\|_0 \quad \text{subject to } \lambda_1, \lambda_2, \ldots, \lambda_p \ge 0,$    (2)

where $\|\lambda\|_0 = \sum_{m=1}^p I(\lambda_m \ne 0)$, and both $\gamma_{1n}$ and $\gamma_{2n}$ are tuning parameters. Note that, in order to perform variable selection, the objective function (2) includes an $L_0$-penalty as the third term to select the non-zero regularization parameters. With the $L_2$-loss, the optimization problem is equivalent to

$\min_{\lambda,\alpha}\; \frac{1}{n}\sum_{i=1}^n \Big\{Y_i - \sum_{j=1}^n \alpha_j \kappa_{\lambda,\sigma_n}(X_i, X_j)\Big\}^2 + \gamma_{1n}\,\alpha^T K_{\lambda,\sigma_n}\alpha + \gamma_{2n}\|\lambda\|_0 \quad \text{subject to } \lambda_1, \ldots, \lambda_p \ge 0,$    (3)

where $K_{\lambda,\sigma_n}$ is the matrix given by $\{\kappa_{\lambda,\sigma_n}(X_i, X_j)\}$. Note that we optimize over both $\alpha$ and $\lambda$, so this procedure performs estimation of the nonparametric prediction function (via updating $\alpha$) and search of a sparse function space (via updating $\lambda$) simultaneously. This is analogous to feature selection in LASSO for (parametric) linear models, where one aims to find the optimal prediction and the most sparse linear function at the same time. In fact, this optimization is NP-hard, and when $p$ is large it requires evaluating all possible subsets of non-zero components of $\lambda$. Instead, we solve an approximate optimization problem to (3) based on a modified ADMM algorithm. First, with a surrogate parameter $\theta$, we reformulate the objective function (3) as

$\min_{\lambda,\alpha}\; \frac{1}{n}\sum_{i=1}^n \Big(Y_i - \sum_{j=1}^n \alpha_j \kappa_{\lambda,\sigma_n}(X_i, X_j)\Big)^2 + \gamma_{1n}\,\alpha^T K_{\lambda,\sigma_n}\alpha + \gamma_{2n}\|\theta\|_0 \quad \text{subject to } \sum_{m=1}^p |\lambda_m - \theta_m| \le 0,\; \lambda_1, \ldots, \lambda_p \ge 0.$

Denoting the Lagrange multiplier by $\gamma_{3n}$ ($\gamma_{3n} > 0$), the Lagrangian form of the reformulated objective function becomes

$\frac{1}{n}\sum_{i=1}^n \Big(Y_i - \sum_{j=1}^n \alpha_j \kappa_{\lambda,\sigma_n}(X_i, X_j)\Big)^2 + \gamma_{1n}\,\alpha^T K_{\lambda,\sigma_n}\alpha + \gamma_{2n}\|\theta\|_0 + \gamma_{3n}\sum_{m=1}^p |\lambda_m - \theta_m|,$    (4)

subject to $\lambda_m \ge 0$, $m = 1, \ldots, p$. The advantage of the approximation in (4) is that the objective function is strictly convex in the $\lambda_m$'s, while the solution for the $\theta_m$'s is explicit given the other parameters. The details of the algorithm are given in the next section.

2.3 |. Algorithms

We iteratively update all parameters to minimize (4). At the k-th iteration,

$\alpha^{k+1} = \big(K_{\lambda^k,\sigma_n}^T K_{\lambda^k,\sigma_n} + n\gamma_{1n} K_{\lambda^k,\sigma_n}\big)^{-1} K_{\lambda^k,\sigma_n}^T Y,$    (5)
$\lambda^{k+1} = \arg\min_\lambda\; \frac{1}{n}\sum_{i=1}^n \Big(Y_i - \sum_{j=1}^n \alpha_j^{k+1} \kappa_{\lambda,\sigma_n}(X_i, X_j)\Big)^2 + \gamma_{1n}\,(\alpha^{k+1})^T K_{\lambda,\sigma_n}\alpha^{k+1} + \gamma_{3n}\sum_{m=1}^p |\lambda_m - \theta_m^k|,$    (6)
$\theta^{k+1} = \arg\min_\theta\; \gamma_{3n}\sum_{m=1}^p |\lambda_m^{k+1} - \theta_m| + \gamma_{2n}\sum_{m=1}^p I(\theta_m \ne 0).$    (7)

Note that (5) is an explicit expression and that the update of $\theta$ in (7) is given by

$\theta_q^{k+1} = \lambda_q^{k+1}\, I\big(|\lambda_q^{k+1}| > \rho_n\big),$

where $\rho_n = \gamma_{2n}/\gamma_{3n}$.
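To see why this hard threshold solves (7), note that the objective is separable over the coordinates of $\theta$ and, for each $q$, only $\theta_q = 0$ and $\theta_q = \lambda_q^{k+1}$ need to be compared (our own one-line verification, not reproduced from the paper):

$\theta_q^{k+1} = \arg\min_{\theta_q}\; \gamma_{3n}\,|\lambda_q^{k+1} - \theta_q| + \gamma_{2n}\, I(\theta_q \ne 0) = \begin{cases}\lambda_q^{k+1}, & \text{if } \gamma_{3n}|\lambda_q^{k+1}| > \gamma_{2n},\\ 0, & \text{otherwise},\end{cases}$

since $\theta_q = 0$ costs $\gamma_{3n}|\lambda_q^{k+1}|$ while $\theta_q = \lambda_q^{k+1}$ costs at most $\gamma_{2n}$, which gives exactly the threshold $\rho_n = \gamma_{2n}/\gamma_{3n}$.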

The update in (6) is essentially a regression problem with a LASSO-type penalty. Specifically, we use a coordinate descent algorithm to obtain each $\lambda_q$ ($q = 1, 2, \ldots, p$). To obtain $\lambda_q^{k+1}$, we fix $\lambda_1^{k+1}, \lambda_2^{k+1}, \ldots, \lambda_{q-1}^{k+1}, \lambda_{q+1}^{k}, \lambda_{q+2}^{k}, \ldots, \lambda_p^{k}$; after a simple calculation, the objective function takes the following form:

$\min_{\lambda_q}\; \frac{1}{n}\sum_{i=1}^n (a_{iq} + b_{iq}\lambda_q)^2 + d_q \lambda_q,$

where the $a_{iq}$, $b_{iq}$ and $d_q$ are constants whose expressions are given in Appendix S1. This is a quadratic function of $\lambda_q$ under the constraint $\lambda_q \ge 0$; thus, its solution is obtained easily by checking whether the unconstrained minimizer is non-negative.
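Concretely, since the subproblem is a one-dimensional convex quadratic with a nonnegativity constraint, its solution is the projected unconstrained minimizer (our own expansion, keeping the paper's notation for the constants):

$\lambda_q^{k+1} = \max\left\{0,\; -\,\frac{\frac{2}{n}\sum_{i=1}^n a_{iq} b_{iq} + d_q}{\frac{2}{n}\sum_{i=1}^n b_{iq}^2}\right\},$

obtained by setting the derivative of the quadratic to zero and truncating at zero when the unconstrained minimizer is negative.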

The prediction function at an iteration is defined as $\hat f_{\lambda^{k+1}}(X) = \sum_{i=1}^n \alpha_i^{k+1}\, \kappa_{\lambda^{k+1},\sigma_n}(X, X_i).$

Since our goal is to minimize an objective function that penalizes the number of non-zero $\lambda$'s, we set the convergence criterion to be the change in both the objective function and the number of non-zero $\lambda$'s. Let $\delta = |L_n(\hat\lambda^{k+1}, \hat f_{\hat\lambda^{k+1}}) - L_n(\hat\lambda^{k}, \hat f_{\hat\lambda^{k}})|$ and $e = \|\hat\lambda^{k+1}\|_0$. Then our algorithm can be summarized as follows:

  1. At the initial step, set $\hat\lambda^0 = 0$ and $\hat\theta^0 = 0$.

  2. Fix $\hat\lambda^k$ and update $\hat\alpha^{k+1}$ via (5).

  3. For fixed $\hat\alpha^{k+1}$, update $\hat\lambda^{k+1}$ and $\hat\theta^{k+1}$ via the coordinate descent algorithm and the threshold rule (7).

  4. Calculate $\delta = |L_n(\hat\lambda^{k+1}, \hat f_{\hat\lambda^{k+1}}) - L_n(\hat\lambda^{k}, \hat f_{\hat\lambda^{k}})|$ and $e = \|\hat\lambda^{k+1}\|_0$.

  5. Stop if $\delta \le c$ ($c$ is a given cutoff) and $e$ does not change. Otherwise, return to step 2 with the updated $\hat\lambda^{k+1}$.

Since our computation alternates between $\alpha$ and $\lambda$, each update given the other parameters is a convex optimization problem. Thus, the objective function decreases over the iterations and the algorithm converges to a local minimum.
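The following compact sketch puts the pieces together (our own Python illustration, not the authors' code): the $\alpha$-step uses the closed form (5), the $\theta$-step uses the hard threshold with $\rho_n = \gamma_{2n}/\gamma_{3n}$, and the $\lambda$-step replaces the closed-form coordinate descent of Section 2.3 with a bounded scalar minimization per coordinate; `tensor_product_kernel` is the helper sketched in Section 2.2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_sparse_tensor_kernel(X, Y, sigma_n, g1, g2, g3,
                             max_iter=50, tol=1e-4, lam_cap=1e3):
    """Alternating updates for the Lagrangian objective (4)."""
    n, p = X.shape
    lam, theta = np.zeros(p), np.zeros(p)
    rho_n = g2 / g3
    prev_obj = np.inf
    for _ in range(max_iter):
        # alpha-step: closed form (5) with the current lambda
        K = tensor_product_kernel(X, X, lam, sigma_n)
        alpha = np.linalg.solve(K.T @ K + n * g1 * K, K.T @ Y)

        # objective of (6) as a function of lambda (theta held fixed)
        def obj6(l):
            Kl = tensor_product_kernel(X, X, l, sigma_n)
            resid = Y - Kl @ alpha
            return (resid @ resid) / n + g1 * alpha @ Kl @ alpha \
                   + g3 * np.abs(l - theta).sum()

        # lambda-step: coordinate-wise minimization over lambda_q >= 0
        # (lam_cap is an arbitrary upper bound used only for the scalar solver)
        for q in range(p):
            def obj_q(v, q=q):
                l = lam.copy()
                l[q] = v
                return obj6(l)
            lam[q] = minimize_scalar(obj_q, bounds=(0.0, lam_cap),
                                     method="bounded").x

        # theta-step: hard threshold (7)
        theta = lam * (np.abs(lam) > rho_n)

        obj = obj6(lam) + g2 * np.count_nonzero(theta)
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return alpha, lam, theta
```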

In our algorithm, both the tuning parameters and the bandwidth need to be determined. First, following the median trick for the bandwidth of the Gaussian kernel (Jaakkola, Diekhans, and Haussler (1999)), we calculate the pairwise distances between the feature vectors and set $\sigma_n$ such that the proportion of pairs with distance less than $\sigma_n$ is about one half. To tune the other parameters, including $\gamma_{1n}$, $\gamma_{3n}$ and $\rho_n$ (equivalently, $\gamma_{2n}$), we use 5-fold cross-validation, varying them on a grid of $2^{-15}, 2^{-14}, \ldots, 2^{15}$.
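A minimal sketch of the bandwidth rule described above (our own implementation of the median heuristic; not the authors' code):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_bandwidth(X):
    """Choose sigma_n as the median pairwise distance between feature
    vectors, so about half of the pairs are closer than sigma_n."""
    return np.median(pdist(X))
```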

3 |. THEORETICAL RESULTS

In this section, we provide theoretical results to justify the proposed method. In particular, we show that, under some assumptions, the prediction function produced by our method achieves the Bayes risk asymptotically. Furthermore, we show that, with probability tending to one, the variable selection based on the non-zero $\lambda$'s is oracle, i.e., it is as if we had known which variables were important. Without loss of generality, we assume that the first $r$ feature variables are important while the others are not; that is, the Bayes rule, $E[Y\mid X]$, depends only on $X_1, \ldots, X_r$ in the sense that for $m \le r$,

$E\big[(E[Y\mid X] - E[Y\mid X_{-m}])^2\big] > 0,$

and, with probability 1,

$E[Y\mid X] = E[Y\mid X_1, X_2, \ldots, X_r],$

where $X_{-m}$ denotes the random vector $X$ excluding $X_m$. We use $f_0(X_1, \ldots, X_r)$ to denote $E[Y\mid X]$. We further denote by $(\hat\lambda, \hat f_{\hat\lambda})$ the optimal solution of the objective function in (2), where $\hat\lambda = (\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_p)$. Our first main result is the following.

Theorem 1. Assume that $\gamma_{1n}, \gamma_{2n} \to 0$ and let $\gamma_{1n} = \sigma_n^{p/2}$ with $n^{1/2}\sigma_n^{p} \to \infty$. Let $P$ denote the true probability measure, i.e., $P g(Y, X) = E[g(Y, X)]$ for any measurable function $g(Y, X)$ with a finite first moment. Then the following hold:

  1. With probability one, $\lim_{n\to\infty} P\big((Y - \hat f_{\hat\lambda}(X))^2\big) = E\big[(Y - f_0(X))^2\big]$;

  2. $\Pr\{\hat\lambda_m > 0 \text{ for all } m = 1, 2, \ldots, r\} \to 1$.

Theorem 1 implies that the loss of the estimated prediction function converges to the Bayes risk. Moreover, the $\hat\lambda_m$'s associated with the important feature variables are non-zero, i.e., the estimated function does depend on $X_1, \ldots, X_r$.

The following theorem states that, under additional regularity conditions, our method also identifies the unimportant features with probability tending to 1.

Theorem 2. In addition to the assumptions of Theorem 1, assume that $f_0(X_1, \ldots, X_r)$ is twice continuously differentiable and that $\gamma_{2n}$ satisfies

$\gamma_{2n} \gg n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1} + \sigma_n^{\min(2,p)}.$

Then $\Pr\{\hat\lambda_m = 0 \text{ for } m = r+1, \ldots, p\} \to 1$.

Theorem 2 implies that the proposed method estimates the prediction function as if we knew which variables are truly important. The proofs of the theorems are given in the Appendix.

4 |. SIMULATION STUDIES

We conducted simulation studies to examine the performance of the proposed method. First, we considered a model with continuous outcomes and a total of ten feature variables ($p = 10$) and then gradually increased $p$ up to 100. We generated $X_1, \ldots, X_{10}$ from a multivariate normal distribution with mean zero and variance 1, where all variables were independent except that $X_7, X_8, X_9, X_{10}$ were correlated with $\mathrm{corr}(X_7, X_8) = 0.4$, $\mathrm{corr}(X_7, X_9) = 0.3$, $\mathrm{corr}(X_8, X_9) = 0.5$ and $\mathrm{corr}(X_9, X_{10}) = 0.2$. We treated $X_7, X_8, X_9$ as the important variables and simulated the continuous response $Y$ from the following model:

$Y_i = 2X_{i7}X_{i8}X_{i9} + 3.3\exp(X_{i9}) + \epsilon_i,$

where $\epsilon_i \sim N(0, 1)$. We centered the outcome $Y$ to have mean zero. To examine the properties of the proposed method in higher dimensions, we also simulated scenarios with $p = 20, 40, 100$, where the additional independent noise features were generated from the standard normal distribution. We varied the training sample size over $n = 100$, 200 and 400.
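For reference, a sketch of this data-generating mechanism (our own Python code; indices 7-10 correspond to columns 6-9 under zero-based indexing):

```python
import numpy as np

def simulate_data(n, p=10, seed=None):
    """Signal variables X7, X8, X9; Y = 2*X7*X8*X9 + 3.3*exp(X9) + eps, eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # correlated block for (X7, X8, X9, X10); the remaining features are independent noise
    cov = np.array([[1.0, 0.4, 0.3, 0.0],
                    [0.4, 1.0, 0.5, 0.0],
                    [0.3, 0.5, 1.0, 0.2],
                    [0.0, 0.0, 0.2, 1.0]])
    X[:, 6:10] = rng.multivariate_normal(np.zeros(4), cov, size=n)
    Y = 2 * X[:, 6] * X[:, 7] * X[:, 8] + 3.3 * np.exp(X[:, 8]) + rng.standard_normal(n)
    return X, Y - Y.mean()  # outcome centered to mean zero
```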

For each simulated data set, we used the proposed method to learn the prediction function. The tuning parameters were chosen as described in Section 2.3, and $\rho_n$ was set to 0.001. We reported the true positive rates, true negative rates, average numbers of selected variables and prediction errors for our method. In addition, we compared our method with COSSO and LASSO, where COSSO can handle variable selection in nonlinear settings based on smoothing spline ANOVA (SS-ANOVA) and LASSO assumes a linear regression model in which coefficients estimated to be less than 0.0001 are set to zero. The tuning parameters of LASSO and COSSO were chosen by 5-fold cross-validation. The performance comparison was based on the mean squared errors in an independent testing sample.

The results based on 500 replicates are summarized in Table 1. From the feature selection columns, we observe that, for fixed $p$, both the true positive rate and the true negative rate of our method increase as the sample size $n$ grows. For fixed $n$, the true positive rate decreases as $p$ increases due to the additional noise variables. As shown in Table 1, our method successfully selects all three important variables, with both the true positive rate and the true negative rate around 95%, and the average number of selected variables is approximately 3. However, COSSO cannot select all of the important variables, as reflected in both its true positive rates and its average numbers of selected variables; this is because COSSO fits a misspecified model. LASSO does not yield any reasonable variable selection results in this setting (not shown here). The prediction error columns show the mean and median absolute deviation of the prediction errors. Clearly, LASSO gives the worst result since its model is the most severely misspecified. Our method has the best prediction performance, and as the sample size $n$ grows, the prediction errors decrease toward the Bayes error, which is 1 in this case.

TABLE 1.

Summary of feature selection results and prediction errors

           Feature Selection Results                            Prediction Errors
           Proposed                   COSSO
p    n     TPR     TNR     avg.#      TPR     TNR     avg.#     Proposed         COSSO            LASSO
10   100   88.4%   90.7%   3.3        65.4%   96.1%   2.2       2.743 (0.313)    5.650 (0.556)    5.954 (0.121)
10   200   95.4%   95.4%   3.2        82.7%   97.0%   2.7       2.198 (0.205)    5.256 (0.591)    5.689 (0.057)
10   400   97.5%   96.3%   3.2        91.5%   98.3%   2.9       1.889 (0.171)    4.708 (0.698)    5.606 (0.038)
20   100   85.0%   94.7%   3.5        49.5%   97.5%   1.9       3.455 (0.396)    6.664 (0.455)    7.056 (0.168)
20   200   94.1%   97.5%   3.3        55.6%   96.5%   2.3       2.820 (0.253)    6.312 (0.492)    6.659 (0.077)
20   400   97.3%   98.6%   3.2        74.7%   95.3%   3.0       2.464 (0.232)    5.913 (0.570)    6.500 (0.043)
40   100   81.5%   96.7%   3.7        44.3%   98.8%   1.8       2.835 (0.437)    5.693 (0.341)    6.985 (0.434)
40   200   93.4%   98.6%   3.3        54.0%   98.7%   2.1       2.101 (0.220)    5.502 (0.263)    6.042 (0.174)
40   400   97.1%   99.2%   3.2        58.3%   98.2%   2.4       1.783 (0.127)    5.048 (0.386)    5.672 (0.076)
100  100   74.0%   98.4%   3.8        NA      NA      NA        3.562 (0.593)    NA               30.370 (8.124)
100  200   88.0%   99.1%   3.6        49.3%   99.4%   2.1       2.600 (0.312)    5.602 (0.319)    7.597 (0.504)
100  400   93.3%   99.4%   3.4        61.7%   99.2%   2.6       2.172 (0.200)    5.506 (0.232)    6.427 (0.200)

Note: TPR: True positive rate; TNR: True negative rate; avg.#: Average number of selected variables. The numbers are the mean of prediction errors and the numbers within parentheses are the median absolute deviations from 500 replicates. “NA”: Results are not available due to failure of the methods.

5 |. APPLICATION TO THE ALZHEIMER’S DISEASE NEUROIMAGING INITIATIVE STUDY

We applied the proposed method to analyze data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study (Toledo, Bjerke, Da, and Landau (2015)). The feature variables included demographic variables (age, gender, race, education level), APoE4 mutation status, clinical variables (functional assessment questionnaire (FAQ), ADAS-cog11, MMSE) and 7 imaging biomarkers (fluorodeoxyglucose, ventricles, hippocampus, whole brain, entorhinal cortex, fusiform gyrus, middle temporal gyrus). Our goal was to assess how well Alzheimer’s disease (AD) biomarkers measured from the invasive cerebrospinal fluid (CSF) procedure can be predicted from clinical or biomarker data collected by non-invasive procedures. Thus, we aimed to identify important feature variables to predict t-tau and Aβ-42 protein levels measured from CSF in the ADNI study. There were 535 subjects included in the analysis of t-tau and 542 in that of Aβ-42. We randomly divided the subjects so that 70% were used for training and 30% for testing. We applied the proposed method to learn the prediction rule using the training sample. Since the gender and race variables are binary, as mentioned before, we set the individual kernels for these two features in (1) to $\kappa(X, \tilde X) = I(X = \tilde X)$. For the other feature variables, we used individual Gaussian kernels with the same bandwidth as described before. We standardized the outcomes and all continuous feature variables. The tuning parameters $\gamma$ were obtained from 5-fold cross-validation in the training sample. For comparison, we also fit LASSO and COSSO to the same data with 5-fold cross-validation for tuning, where coefficients estimated to be less than 0.0001 in LASSO were thresholded to zero, and then compared their prediction performance in the testing sample. To obtain a reliable comparison, we repeated the same analysis for 500 random splits into training and testing samples.

Figure 2 shows smoothed plots of the outcome variables versus the imaging feature variables, which give an intuition for the nonlinear relationships between them. Figure 3 gives the frequency with which each variable was selected. The prediction error of each method, together with the average and range of the numbers of selected variables over the 500 random splits, is shown in Table 2. It is clear from this table that our method yields the smallest prediction errors. Among the 500 replications, for the outcome t-tau, the most frequently chosen features in our method were gender, APoE4, MMSE, ADAS, ventricles, hippocampus and middle temporal gyrus, while for Aβ-42, APoE4, ADAS, MMSE and hippocampus were highly selected. In addition, FAQ is moderately important for the outcome Aβ-42 but shows no importance for t-tau. In contrast, for t-tau, COSSO highly selected APoE4, MMSE, ADAS, ventricles and middle temporal gyrus but failed to select gender and hippocampus, which may be one reason for its large prediction error. For Aβ-42, age, APoE4, ADAS, FAQ, hippocampus and middle temporal gyrus were frequently chosen by COSSO. We also notice that COSSO gave a large mean prediction error with high variability but a reasonable median prediction error, indicating the presence of outlying prediction errors among the 500 replications. LASSO removed almost no noise variables and selected approximately 15 variables on average for both outcomes. Finally, when applying our method to the whole sample, for the t-tau outcome, our method selected 8 feature variables as important, namely gender, APoE4, FDG, ADAS11, MMSE, ventricles, hippocampus and MidTemp, with a prediction error of 0.839. For Aβ-42, there were 4 important features, namely APoE4, ADAS11, MMSE and hippocampus, with a prediction error of 0.811. The data that support the findings of this study are openly available at http://adni.loni.usc.edu.

FIGURE 2.

Smoothed plots of outcome variables versus imaging feature variables in the ADNI data.

FIGURE 3.

Frequency of variables selected over 500 random sample splits.

TABLE 2.

Summary of feature selection results in the application to the ADNI study

                              t-tau                                             Aβ-42
                              Proposed        COSSO            LASSO            Proposed        COSSO            LASSO
Mean of prediction error      0.870 (0.028)   1.745 (1.869)    0.881 (0.027)    0.830 (0.032)   6.748 (13.176)   0.838 (0.031)
Median of prediction error    0.869           0.923            0.880            0.829           0.859            0.836
Avg.#                         8.08            6.12             14.96            5.72            6.34             14.98
Range #                       [4,12]          [1,14]           [14,15]          [3,13]          [1,14]           [14,15]

Note: The numbers are the mean and median of the prediction errors from 500 replicates; the numbers within parentheses are the corresponding standard deviations. Avg.#: average number of selected variables. Range #: range of the number of selected variables.

6 |. CONCLUSIONS

In this work, we propose a regularized tensor product kernel for sparse nonparametric regression in the presence of nonlinear relationships. The importance of each feature is captured by a non-negative parameter in the kernel function. Our approach is computationally efficient because it iteratively optimizes convex quadratic functions within a modified ADMM algorithm. Theoretically, we have shown that our method achieves oracle feature selection. The superior performance of the proposed method was demonstrated via simulation studies and a real data application. Note that both our algorithm and theory can be extended to higher-dimensional feature variables, or even ultra-high-dimensional settings.

Here we focus on a regression problem with the $L_2$ loss function. However, our method can be extended to other machine learning approaches with different losses, such as the hinge loss (support vector machines) and boosting. Feature selection can then be performed simultaneously when training these machine learning algorithms. We expect that the same iterative algorithm applies, although the step of updating the regularization parameters may differ; it remains a convex optimization problem with linear constraints.

Finally, our method can also be generalized to feature selection problems in which the feature variables are collected from different domains (imaging, genomics, clinical biomarkers), which is common in integrative data analysis. To account for the hierarchical structure of multiple domains, one possibility is to construct a hierarchical tensor product kernel with regularization parameters for both the domains and the features within each domain. In this way, we can perform domain selection and feature selection at the same time. We will consider such extensions in future work.


ACKNOWLEDGMENTS

This research is supported by U.S. NIH grants NS073671, GM124104, and MH117458.


APPENDIX

Proof of Theorems

Before proving the theorems, we need two lemmas; their proofs can be found in the Supporting Information. The first lemma shows that the proposed kernel function is positive semi-definite.

Lemma A.1. For any positive constants $\lambda_1, \ldots, \lambda_r > 0$,

$\kappa_{\lambda,\sigma_n}(X, \tilde X) = \prod_{m=1}^p \big(1 + \lambda_m \kappa_n(X_m, \tilde X_m)\big)$

is a kernel satisfying the positive semi-definiteness condition.

The next lemma shows that, when $\sigma_n$ goes to zero, the closure of the reproducing kernel Hilbert space generated by the tensor product kernel contains the true function $f_0(X_1, \ldots, X_r)$ provided $\lambda_m > 0$ for all $m \le r$. We define $d(f_0, H_{\lambda,\sigma_n})$ as the $L_2(P)$ distance between $f_0$ and the reproducing kernel Hilbert space.

Lemma A.2. Assume that $\sigma_n \to 0$. For any $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_p)$ with $\lambda_m \ge 0$, $m \le p$:

  1. If $\lambda_m \ne 0$ for all $m \le r$, then $d(f_0(X_1, X_2, \ldots, X_r), H_{\lambda,\sigma_n}) \to 0$. In fact, the closure of $\limsup_n H_{\lambda,\sigma_n}$ contains any $L_2$-integrable function that depends only on $(X_1, \ldots, X_r)$.

  2. If $\lambda_m = 0$ for some $m \le r$, then $\liminf_n d(f_0(X_1, X_2, \ldots, X_r), H_{\lambda,\sigma_n}) > 0$.

Before proving the two theorems, recall that $P$ denotes the true probability measure and $\mathbb{P}_n$ denotes the empirical measure from the $n$ observations.

Proof of Theorem 1. By Lemma A.2, for any $\lambda_0 = (\lambda_{01}, 0)$, where $\lambda_{01} = (\lambda_{011}, \lambda_{012}, \ldots, \lambda_{01r})$ and $\lambda_{01m} > 0$ for all $m \le r$, there exists $\tilde f_{\lambda_0} \in H_{\lambda_0,\sigma_n}$ such that $d(f_0, \tilde f_{\lambda_0}) = \|\tilde f_{\lambda_0} - f_0\|_{L_2} \to 0$. Since $(\hat\lambda, \hat f_{\hat\lambda})$ is the optimal solution of the objective function (2), we have

$\mathbb{P}_n l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{1n}\|\hat f\|^2_{H_{\hat\lambda,\sigma_n}} + \gamma_{2n}\|\hat\lambda\|_0 \le \mathbb{P}_n l(Y, \tilde f_{\lambda_0}(X)) + \gamma_{1n}\|\tilde f\|^2_{H_{\lambda_0,\sigma_n}} + \gamma_{2n}\|\lambda_0\|_0.$

That is,

$(\mathbb{P}_n - P)\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{1n}\|\hat f\|^2_{H_{\hat\lambda,\sigma_n}} + P\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{2n}\|\hat\lambda\|_0 \le (\mathbb{P}_n - P)\, l(Y, \tilde f_{\lambda_0}(X)) + \gamma_{1n}\|\tilde f\|^2_{H_{\lambda_0,\sigma_n}} + P\, l(Y, \tilde f_{\lambda_0}(X)) + \gamma_{2n}\|\lambda_0\|_0.$

Following arguments similar to those for Theorem 3.1 in Steinwart and Scovel (2007), since $\|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})$, the uniform covering numbers of this bounded set in the reproducing kernel Hilbert space can be bounded as follows.

First, the entropy number (van der Vaart and Wellner (1996)) for the unit ball in $H_{\hat\lambda,\sigma_n}$, denoted by $O_n$, satisfies

$\log N(\epsilon, O_n, \|\cdot\|_\infty) \le c_1(v, p)\, \sigma_n^{-(1 - v/4)p}\, \epsilon^{-v},$

where $0 < v < 2$ and $c_1(v, p)$ is a constant that depends only on $v$ and $p$. It then follows that

$\log N_{[\,]}(\epsilon, O_n, L_4(P)) \le c_1(v, p)\, \sigma_n^{-(1 - v/4)p}\, \epsilon^{-v}.$

Thus, we obtain

$\log N_{[\,]}\big(\epsilon, \{\hat f : \hat f \in H_{\hat\lambda,\sigma_n},\ \|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})\}, L_4(P)\big) \le c_1(v, p)\, \sigma_n^{-(1 - v/4)p}\, \epsilon^{-v}\gamma_{1n}^{-v/2}.$

Note that

$\|l(Y, \hat f_1) - l(Y, \hat f_2)\|_{L_2(P)} = \|(Y - \hat f_1)^2 - (Y - \hat f_2)^2\|_{L_2(P)} = \big(E\big[(2Y - \hat f_1 - \hat f_2)^2(\hat f_2 - \hat f_1)^2\big]\big)^{1/2} \le \big(E(\hat f_2 - \hat f_1)^4\big)^{1/4}\big(E(2Y - \hat f_1 - \hat f_2)^4\big)^{1/4},$

where $\big(E(2Y - \hat f_1 - \hat f_2)^4\big)^{1/2} \le 3^{3/2}\big(16 E(Y^4) + E(\hat f_1^4) + E(\hat f_2^4)\big)^{1/2} \le c$ for a constant $c$. Thus, we obtain $\|l(Y, \hat f_1) - l(Y, \hat f_2)\|_{L_2(P)} \le c\, \big(\|\hat f_1 - \hat f_2\|_{L_2(P)}\big)^{2}$, which yields

$\log N_{[\,]}\big(\epsilon, \{l(Y, \hat f) : \hat f \in H_{\hat\lambda,\sigma_n},\ \|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})\}, L_4(P)\big) \le c_2(v, p, c)\, \sigma_n^{-(1 - v/4)p}\, \epsilon^{-v/2}\gamma_{1n}^{-v/2}.$

Let $c$ denote a constant. The above result implies

$E\Big[\sup_{\hat f \in H_{\hat\lambda,\sigma_n},\, \|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})} \big|(\mathbb{P}_n - P)\, l(Y, \hat f)\big|\Big] \le c_3 n^{-1/2}\int_0^{c\gamma_{1n}^{-1}} \sqrt{1 + \log N_{[\,]}\big(\epsilon, \{l(Y, \hat f) : \hat f \in H_{\hat\lambda,\sigma_n},\ \|\hat f\|_{H_{\hat\lambda,\sigma_n}} \le O(\gamma_{1n}^{-1/2})\}, L_4(P)\big)}\, d\epsilon \le k(v, p)\, n^{-1/2}\sigma_n^{-(1/2 - v/8)p}\gamma_{1n}^{-1}.$

In other words, $(\mathbb{P}_n - P)\, l(Y, \hat f_{\hat\lambda}(X)) = O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1})$ by the choice of $\gamma_{1n}$ in Theorem 1. Similarly,

$(\mathbb{P}_n - P)\, l(Y, \tilde f_{\lambda_0}(X)) = O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}).$

Therefore, letting $n \to \infty$ and using $\gamma_{1n} \to 0$, we have

$\limsup_{n\to\infty}\big(P\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{2n}\|\hat\lambda\|_0\big) \le \limsup_{n\to\infty}\big(P\, l(Y, \tilde f_{\lambda_0}(X)) + O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}) + \gamma_{2n}\|\lambda_0\|_0\big).$

Under the assumption that $n^{-1/2}\sigma_n^{-p} \to 0$ and $\gamma_{2n} \to 0$, we have

$\limsup_{n\to\infty} P\, l(Y, \hat f_{\hat\lambda}(X)) \le \limsup_{n\to\infty} P\, l(Y, \tilde f_{\lambda_0}(X)) = E\, l(Y, f_0(X)).$

Since $E\, l(Y, f_0(X)) \le \limsup_{n\to\infty} P\, l(Y, \hat f_{\hat\lambda}(X))$, we obtain

$\lim_{n\to\infty} P\, l(Y, \hat f_{\hat\lambda}) = E\, l(Y, f_0(X)).$

Together with Lemma A.2, this also gives the result that $\hat\lambda_m > 0$ for all $m \le r$. Otherwise, if $\hat\lambda_m = 0$ for some $m \le r$, then by Lemma A.2(ii) it always holds that $\lim_{n\to\infty} P\, l(Y, \hat f_{\hat\lambda}) - E\, l(Y, f_0(X)) \ge \liminf_n d\big(f_0(X_1, X_2, \ldots, X_r), H_{\hat\lambda,\sigma_n}\big) > 0$, a contradiction. □

Proof of Theorem 2. For this proof, we choose $\tilde f_{\lambda_0}$ to be the Gaussian-kernel convolution of $f_0$ in the space of $(X_1, \ldots, X_r)$, where the kernel is given by $\prod_{m=1}^r \kappa_n(X_m, \tilde X_m)$. Clearly, $\tilde f_{\lambda_0}$ belongs to $H_{\lambda_0,\sigma_n}$, and since $f_0(X)$ is twice continuously differentiable,

$d(\tilde f_{\lambda_0}, f_0) = O(\sigma_n^2).$

By the same arguments as in the proof of Theorem 1, we have

$P\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{2n}\|\hat\lambda\|_0 \le P\, l(Y, \tilde f_{\lambda_0}(X)) + O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}) + O(\gamma_{1n}) + \gamma_{2n}\|\lambda_0\|_0.$

Thus,

$P\, l(Y, \hat f_{\hat\lambda}(X)) + \gamma_{2n}\|\hat\lambda\|_0 \le E\, l(Y, f_0(X)) + O(\sigma_n^{\min(2,p)}) + O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}) + \gamma_{2n}\|\lambda_0\|_0.$

Since it always holds that $E\, l(Y, f_0(X)) \le P\, l(Y, \hat f_{\hat\lambda}(X))$,

$\gamma_{2n}\|\hat\lambda\|_0 \le O(\sigma_n^{\min(2,p)}) + O(n^{-1/2}\sigma_n^{-p/2}\gamma_{1n}^{-1}) + \gamma_{2n}\|\lambda_0\|_0.$

Thus, dividing both sides by $\gamma_{2n}$, we obtain

$\|\hat\lambda\|_0 \le O(\gamma_{2n}^{-1}\sigma_n^{\min(2,p)}) + O(\gamma_{1n}^{-1}\gamma_{2n}^{-1}n^{-1/2}\sigma_n^{-p/2}) + r.$

Under the assumptions of Theorem 2, we conclude that, with probability tending to one, the number of non-zero components of $\hat\lambda$ cannot be larger than $r$. However, Theorem 1 implies that this number is at least $r$. We thus obtain Theorem 2. □

Footnotes

Financial disclosure

None reported.

Conflict of interest

The authors declare no potential conflict of interests.

SUPPORTING INFORMATION

Additional information for this article is available:

Appendix S1. Expression of Constants in Updating λ’s

Appendix S2. Proof of two lemmas

References

  1. Allen GI (2013). Automatic feature selection via weighted kernels and regularization. Journal of Computational and Graphical Statistics, 22(2), 284–299.
  2. Chen G, & Chen J (2015). A novel wrapper method for feature selection and its application. Neurocomputing, 159(2), 219–226.
  3. Dasgupta S, Goldberg Y, & Kosorok MR (2019). Feature elimination in kernel machines in moderately high dimensions. The Annals of Statistics, 47(1), 497–526.
  4. Fan J, Feng Y, & Song R (2011). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106(494), 544–557.
  5. Fan J, & Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
  6. Fan J, & Lv J (2008). Sure independence screening for ultra-high dimensional feature space. Journal of the Royal Statistical Society, Series B, 70(5), 849–911.
  7. Gao C, & Wu X (2012). Kernel support tensor regression. 2012 International Workshop on Information and Electronics Engineering (IWIEE), 29, 3986–3990.
  8. Guyon I, & Elisseeff A (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
  9. Hofmann T, Schölkopf B, & Smola AJ (2008). Kernel methods in machine learning. The Annals of Statistics, 36(3), 1171–1220.
  10. Jaakkola T, Diekhans M, & Haussler D (1999). Using the Fisher kernel method to detect remote protein homologies. ISMB, 99, 149–158.
  11. Kohavi R, & John GH (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.
  12. Lin Y, & Zhang HH (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5), 2272–2297.
  13. Liu Y, & Zheng YF (2006). A novel feature selection method for support vector machines. Pattern Recognition, 39(7), 1333–1345.
  14. Maldonado S, & Weber R (2009). A wrapper method for feature selection using support vector machines. Information Sciences, 179(13), 2208–2217.
  15. Song L, Smola A, Gretton A, Bedo J, & Borgwardt K (2012). Feature selection via dependence maximization. Journal of Machine Learning Research, 13, 1393–1434.
  16. Steinwart I, & Scovel C (2007). Fast rates for support vector machines using Gaussian kernels. The Annals of Statistics, 35(2), 575–607.
  17. Suzuki T, Kanagawa H, Kobayashi H, Shimizu N, & Tagami Y (2016). Minimax optimal alternating minimization for kernel nonparametric tensor learning. Advances in Neural Information Processing Systems, 29, 3783–3791.
  18. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
  19. Toledo JB, Bjerke M, Da X, & Landau SM (2015). Nonlinear association between cerebrospinal fluid and florbetapir F-18 β-amyloid measures across the spectrum of Alzheimer disease. JAMA Neurology, 72(5), 571–581.
  20. van der Vaart A, & Wellner JA (1996). Weak Convergence and Empirical Processes. New York: Springer.
  21. Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, & Vapnik V (2001). Feature selection for SVMs. In Advances in Neural Information Processing Systems.
  22. Zhang C (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
