Entropy. 2021 May 22;23(6):651. doi: 10.3390/e23060651

Error Bound of Mode-Based Additive Models

Hao Deng 1, Jianghong Chen 2, Biqin Song 1,*, Zhibin Pan 1,*
Editor: Ercan Kuruoglu
PMCID: PMC8224641  PMID: 34067420

Abstract

Due to their flexibility and interpretability, additive models are powerful tools for high-dimensional mean regression and variable selection. However, the least-squares-loss-based mean regression models suffer from sensitivity to non-Gaussian noise, so there is a need to improve the model's robustness. This paper considers estimation and variable selection via modal regression in reproducing kernel Hilbert spaces (RKHSs). Based on the mode-induced metric and a two-fold Lasso-type regularizer, we proposed a sparse modal regression algorithm and gave its excess generalization error bound. The experimental results demonstrated the effectiveness of the proposed model.

Keywords: modal regression, additive models, reproducing kernel Hilbert spaces, error bound

1. Introduction

Regression estimation and variable selection are two important tasks for high-dimensional data mining [1]. Sparse additive models [2,3], aiming to deal with the above tasks simultaneously, have been extensively investigated in the mean regression setting. As a class of models between linear and nonparametric regression, these methods inherit the flexibility from nonparametric regression and the interpretability from linear regression. Typical methods include COSSO [4] and SpAM [2] and its variants, such as Group SpAM [3], SAM [5], Group SAM [6], SALSA [7], MAM [8], SSAM [9], and ramp-SAM [10]. From the lens of nonparametric regression, the additive structure on the hypothesis space is crucial to overcome the curse of dimensionality [7,11,12].

Usually, the aforementioned models are limited to estimating the conditional mean under the mean-squared error (MSE) criterion. However, under complex non-Gaussian noise (e.g., skewed or heavy-tailed noise), it is difficult for mean-based approaches to extract the intrinsic trend, resulting in degraded performance. Beyond traditional mean regression, it is therefore interesting to formulate a new regression framework under the (conditional) mode-based criterion. With the help of the recent works in [13,14,15,16,17,18,19], this paper aimed to propose a new robust sparse additive model, rooted in modal regression associated with the RKHS.

As an alternative to mean regression, modal regression has been investigated in terms of both statistical behavior [14,15,17] and real-world applications [20,21]. Yao [14] proposed a modal linear regression algorithm and characterized its theoretical properties under a global mode assumption. As a natural extension of the Lasso [22], Wang et al. [15] considered regularized modal regression and established its generalization bound and variable selection consistency. Feng et al. [17] studied modal regression from a learning theory perspective and illustrated its relation with MCC [23,24]. Different from these global approaches, local modal regression algorithms were formulated in [16,25] with convergence guarantees. The recent survey [26] gives a general overview of modal regression, and a more comprehensive list of references can be found there.

The proposed robust additive models are formulated under the Tikhonov regularization scheme, which involves three building blocks: the mode-based metric, the RKHS-based hypothesis space, and two Lasso-type penalties. Since the linear function space, the polynomial function space, and Sobolev/Besov spaces are special cases of the RKHS, the kernel-based function space is more flexible than traditional spline-based spaces or other dictionary-based hypotheses [2,5,27,28,29]. The mode-induced regression metric is robust to non-Gaussian noise according to both theoretical and empirical evaluations [14,15,17]. The regularized penalty addresses the sparsity and smoothness of the estimator, which has shown promising performance for mean regression [2,29,30,31]. Therefore, different from mean-based kernel regression and additive models, the mode-based approach enjoys robustness and interpretability simultaneously due to its metric criterion and trade-off penalty. The estimator of our approach can be obtained by integrating half-quadratic (HQ) optimization [32] and second-order cone programming (SOCP) [33].

The rest of this article is organized as follows. After introducing the robust additive model in Section 2, we state its generalization error bound in Section 3 and report the experimental evaluation in Section 4. Finally, Section 5 ends this paper with a brief conclusion.

2. Methodology

2.1. Modal Regression

In this section, we recall the basic background on modal regression [19,34]. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^p$ associated with the input covariate vector and $\mathcal{Y}\subset\mathbb{R}$ be the response variable set. In this paper, we considered the following nonparametric model:

$Y=f^*(X)+\epsilon,$ (1)

where $X=(X_1,\dots,X_p)^T\in\mathcal{X}$, $Y\in\mathcal{Y}$, and $\epsilon$ is a random noise. For simplicity, we denote by $\rho$ the underlying joint distribution of $(X,Y)$ generated by (1).

Being different from the traditional mean regression under the noise condition $\mathbb{E}(\epsilon|X=x)=0$ (e.g., Gaussian noise), we just require that the mode of the conditional distribution of $\epsilon$ equal zero at each $x\in\mathcal{X}$. That is:

$\forall x\in\mathcal{X},\quad \mathrm{mode}(\epsilon|X=x)=\arg\max_{t\in\mathbb{R}}P_{\epsilon|X}(t|X=x)=0,$ (2)

where $P_{\epsilon|X}$ is the conditional density of $\epsilon$ given $X$. Notice that this zero-mode condition does not require homogeneity or symmetry of the noise distribution, so some non-Gaussian noises (e.g., skewed noise, heavy-tailed noise) are not excluded.
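To make the mode-zero condition concrete, the following small sketch (our illustration, not code from the paper; NumPy with a Gaussian kernel density estimate) draws exponential noise, whose density is maximized at zero while its mean equals one, so condition (2) holds even though $\mathbb{E}(\epsilon|X=x)\neq0$:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.exponential(scale=1.0, size=20_000)  # skewed noise: mode 0, mean 1

# Gaussian KDE on a grid; the argmax of the estimated density approximates the mode.
sigma = 0.05
grid = np.linspace(-0.5, 2.0, 251)
dens = np.exp(-0.5 * ((grid[:, None] - eps[None, :]) / sigma) ** 2).mean(axis=1)
mode_hat = grid[np.argmax(dens)]

print("sample mean:", eps.mean())   # close to 1, so a mean fit would be shifted
print("estimated mode:", mode_hat)  # close to 0 (within a few bandwidths)
```

Under such noise, the conditional mean is biased by the noise mean, while the conditional mode still identifies the underlying trend.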

From (1), we further deduce that:

$f^*(u):=\sum_{j=1}^p f_j^*(u_j)=\mathrm{mode}(Y|X=u)=\arg\max_{t}P_{Y|X}(t|X=u),$

where $u=(u_1,\dots,u_p)^T\in\mathcal{X}$ and $P_{Y|X}$ denotes the density of $Y$ conditional on $X$. Then, the purpose of modal regression is to find the target function $f^*$ according to the empirical data $\mathbf{z}=\{z_i\}_{i=1}^n=\{(x_i,y_i)\}_{i=1}^n$ drawn independently from $\rho$.

For modal regression, the performance of a predictor $f:\mathcal{X}\to\mathbb{R}$ is measured by the mode-based metric:

$\mathcal{R}(f)=\int_{\mathcal{X}}P_{Y|X}(f(x)|X=x)\,d\rho_X(x),$ (3)

where $\rho_X$ is the marginal distribution of $\rho$ with respect to the input space $\mathcal{X}$.

Although the target function $f^*$ is the maximizer of $\mathcal{R}(f)$ over all measurable functions, it cannot be estimated directly via maximizing (3) because $P_{Y|X}$ and $\rho_X$ are unknown. Fortunately, some indirect density-estimation-based strategies were proposed in [14,15,17]. As shown in Theorem 5 of [17], $\mathcal{R}(f)$ equals the density function of the random variable $E_f=Y-f(X)$ at zero, i.e.,

$\mathcal{R}(f)=P_{E_f}(0).$

Therefore, we can find an approximation of $f^*$ by maximizing the empirical version of $P_{E_f}(0)$ with the help of kernel density estimation (KDE).

Let $K_\sigma:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$ be a kernel with bandwidth $\sigma>0$ whose representing function $\phi:\mathbb{R}\to[0,\infty)$ satisfies $\phi\big(\frac{u-u'}{\sigma}\big)=K_\sigma(u,u')$ for all $u,u'\in\mathbb{R}$. Typical kernels used in KDE include the Gaussian kernel, the Epanechnikov kernel, the logistic kernel, and the sigmoid kernel. The KDE-based estimator of $P_{E_f}(0)$ is defined as:

$\hat{P}_{E_f}(0)=\frac{1}{n\sigma}\sum_{i=1}^n K_\sigma(y_i-f(x_i),0)=\frac{1}{n\sigma}\sum_{i=1}^n\phi\Big(\frac{y_i-f(x_i)}{\sigma}\Big):=\hat{\mathcal{R}}_\sigma(f).$

Learning models for modal regression are usually formulated by Tikhonov regularization schemes associated with the empirical metric $\hat{\mathcal{R}}_\sigma(f)$; see, e.g., [15,35].
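As a hedged illustration of this empirical metric (our own sketch, not the paper's code; the Gaussian representing function is assumed), the KDE-based risk can be computed directly, and under skewed mode-zero noise it prefers the true regression function over the mean-shifted fit that least squares would target:

```python
import numpy as np

def empirical_modal_risk(residuals, sigma):
    # (1/(n*sigma)) * sum_i phi((y_i - f(x_i)) / sigma), phi = standard Gaussian density
    r = np.asarray(residuals) / sigma
    return np.exp(-0.5 * r**2).sum() / (len(r) * sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 20_000)
f_star = np.sin(2.0 * np.pi * x)            # true target: the conditional mode is f_star
y = f_star + rng.exponential(1.0, 20_000)   # skewed noise with mode 0 and mean 1

risk_mode = empirical_modal_risk(y - f_star, sigma=0.2)          # residual mode at 0
risk_mean = empirical_modal_risk(y - (f_star + 1.0), sigma=0.2)  # mean-shifted fit
print(risk_mode, risk_mean)  # the modal metric scores the true function higher
```

Maximizing this quantity over a hypothesis space is exactly the estimation principle used below; the bandwidth $\sigma$ trades off robustness against statistical efficiency.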

Naturally, the data-free (population) counterpart of $\hat{\mathcal{R}}_\sigma(f)$ can be defined as:

$\mathcal{R}_\sigma(f)=\frac{1}{\sigma}\int_{\mathcal{X}\times\mathcal{Y}}\phi\Big(\frac{y-f(x)}{\sigma}\Big)\,d\rho(x,y).$

In theory, the learning performance of an estimator $f:\mathcal{X}\to\mathbb{R}$ can be evaluated in terms of $\mathcal{R}(f^*)-\mathcal{R}(f)$, which can be further bounded via $\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(f)$ (see Theorem 10 in [17]).

Remark 1.

As illustrated in [17], when taking $K_\sigma$ as a Gaussian kernel, modal regression maximizing $\mathcal{R}_\sigma(f)$ is consistent with learning under the maximum correntropy criterion (MCC). By employing different kernels, we obtain a rich family of evaluation metrics for robust estimation.

2.2. Mode-Based Sparse Additive Models

The additive model is formulated as follows:

$Y=\sum_{j=1}^p f_j^*(X_j)+\epsilon,$ (4)

where $X_j\in\mathcal{X}_j$ $(j=1,2,\dots,p)$, $Y\in\mathcal{Y}$, and the $f_j^*$ are unknown component functions. By employing nonlinear hypothesis function spaces with an additive structure, the additive model provides better flexibility for regression estimation and variable selection [19]. In [28], the theoretical properties of the sparse additive model with the quantile loss function were discussed. We introduce some basic notation and assumptions in a similar way.

Suppose that $\mathbb{E}f_j^*(X_j)=0$ and $\|f_j^*\|_{K_j}\le 1$ for each $f_j^*$ in (4) with $j\in S$. Here, $f_j^*:\mathcal{X}_j\to\mathbb{R}$ is an unknown univariate function in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_j:=\mathcal{H}_{K_j}$ associated with kernel $K_j$ and norm $\|\cdot\|_{K_j}$ [30,31], and $S\subset\{1,\dots,p\}$ is the intrinsic subset of active components with cardinality $|S|<p$. This means each observation $(x_i,y_i)$ is generated according to:

$y_i=\sum_{j\in S}f_j^*(x_{ij})+\epsilon_i,\quad i=1,\dots,n,$

where $x_i=(x_{i1},\dots,x_{ip})^T\in\mathbb{R}^p$, $f_j^*\in\mathcal{H}_j$, and $\epsilon$ satisfies condition (2).

For any given $j\in\{1,\dots,p\}$, denote $B_r(\mathcal{H}_j)=\{g\in\mathcal{H}_j:\|g\|_{K_j}\le r\}$. The hypothesis space considered here is defined by:

$\mathcal{F}=\Big\{f=\sum_{j=1}^p f_j:\ f_j\in B_r(\mathcal{H}_j),\ j=1,\dots,p\Big\},$ (5)

which is a subset of the RKHS $\mathcal{H}=\{f=\sum_{j=1}^p f_j:\ f_j\in\mathcal{H}_j\}$ with the norm:

$\|f\|_K^2=\inf\Big\{\sum_{j=1}^p\|f_j\|_{K_j}^2:\ f=\sum_{j=1}^p f_j\Big\}.$

For each $X_j$ and the corresponding marginal distribution $\rho_{X_j}$, we denote $\|f_j\|_2^2:=\int_{\mathcal{X}_j}|f_j(u)|^2\,d\rho_{X_j}(u)$. Given inputs $\{x_i\}_{i=1}^n$, define the empirical norm of each $f_j$ as:

$\|f_j\|_n^2:=\frac{1}{n}\sum_{i=1}^n f_j^2(x_{ij}),\quad f_j\in\mathcal{H}_j,\ j\in\{1,\dots,p\}.$

With the help of the mode-based metric (3) and the hypothesis space (5), we formulated the mode-based sparse additive model as:

$\hat{f}=\arg\max_{f\in\mathcal{F}}\Big\{\hat{\mathcal{R}}_\sigma(f)-\lambda_1\sum_{j=1}^p\|f_j\|_n-\lambda_2\sum_{j=1}^p\|f_j\|_{K_j}\Big\},$ (6)

where $(\lambda_1,\lambda_2)$ is a pair of positive regularization parameters. The first regularization term promotes sparsity [11,36], and the second one guarantees smoothness of the solution.

By the representer theorem of kernel methods (e.g., [37]), the solution of (6) admits the following form:

$\hat{f}(u)=\sum_{i=1}^n\sum_{j=1}^p\hat{\alpha}_{ij}K_j(u_j,x_{ij}),\quad u=(u_1,\dots,u_p)^T,$

with a collection of coefficients $\{\hat{\alpha}_j=(\hat{\alpha}_{1j},\dots,\hat{\alpha}_{nj})^T\in\mathbb{R}^n:\ j=1,\dots,p\}$.

The optimal coefficients with respect to (6) solve the following non-convex optimization problem:

$\max_{\alpha_j\in\mathbb{R}^n,\ \alpha_j^TK_j\alpha_j\le1}\Big\{\frac{1}{n}\sum_{i=1}^n\phi\Big(\frac{y_i-\sum_{j=1}^pK_{ji}^T\alpha_j}{\sigma}\Big)-\frac{\lambda_1}{\sqrt{n}}\sum_{j=1}^p\|K_j\alpha_j\|_2-\lambda_2\sum_{j=1}^p\sqrt{\alpha_j^TK_j\alpha_j}\Big\},$

where $K_{ji}=(K_j(x_{1j},x_{ij}),\dots,K_j(x_{nj},x_{ij}))^T\in\mathbb{R}^n$ and $K_j=(K_j(x_{ij},x_{lj}))_{i,l=1}^n=(K_{j1},\dots,K_{jn})\in\mathbb{R}^{n\times n}$.
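The finite-dimensional problem can be made concrete with a small sketch (ours; the Gaussian coordinate kernels and all helper names are illustrative assumptions) that builds the Gram matrices $K_j$ and evaluates the objective for a given coefficient collection $\{\alpha_j\}$:

```python
import numpy as np

def coord_gram(u, v, width=0.5):
    # Gram matrix of a Gaussian kernel K_j restricted to one coordinate
    return np.exp(-0.5 * ((u[:, None] - v[None, :]) / width) ** 2)

def additive_objective(alphas, grams, y, sigma=0.5, lam1=0.01, lam2=0.01):
    # KDE-based modal risk of the additive expansion minus the two Lasso-type penalties
    n = len(y)
    pred = sum(Kj @ aj for Kj, aj in zip(grams, alphas))      # sum_j K_j alpha_j
    r = (y - pred) / sigma
    risk = np.exp(-0.5 * r**2).mean() / (sigma * np.sqrt(2.0 * np.pi))
    pen_n = lam1 / np.sqrt(n) * sum(np.linalg.norm(Kj @ aj) for Kj, aj in zip(grams, alphas))
    pen_K = lam2 * sum(np.sqrt(max(aj @ Kj @ aj, 0.0)) for Kj, aj in zip(grams, alphas))
    return risk - pen_n - pen_K

# toy usage: n = 40 samples, p = 3 coordinates, random coefficients
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, (40, 3))
y = np.sin(2.0 * np.pi * X[:, 0]) + rng.normal(0.0, 0.1, 40)
grams = [coord_gram(X[:, j], X[:, j]) for j in range(3)]
alphas = [rng.normal(0.0, 0.1, 40) for _ in range(3)]
val = additive_objective(alphas, grams, y)
print(val)  # a finite scalar; the optimization maximizes it over the alphas
```

Note that the empirical norm satisfies $\|f_j\|_n=\|K_j\alpha_j\|_2/\sqrt{n}$, which is exactly the $\lambda_1$ term of the finite-dimensional objective.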

Remark 2.

There are various combinations of sparsity and smoothness regularization for additive models [2,3,29,30,31]. The regularization in this paper adopts a two-fold group-Lasso scheme, which was employed in [28] in the quantile regression setting; it is also different from the coefficient-based regularized modal regression in [19].

Remark 3.

From the lens of computation, the proposed algorithm (6) can be transformed into a sequence of regularized least-squares problems by HQ optimization [32]. Each transformed subproblem can then be tackled easily with SOCP [33].
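Remark 3 can be illustrated with a simplified sketch (ours, not the paper's implementation): for the Gaussian representing function, each HQ step fixes weights $w_i=\phi((y_i-f(x_i))/\sigma)$ and solves a weighted regularized least-squares problem. For readability we replace the two Lasso-type norms with a single smooth kernel-ridge penalty; the actual algorithm keeps both norms and solves an SOCP at this step.

```python
import numpy as np

def hq_modal_fit(K, y, sigma=0.5, lam=0.01, iters=30):
    # Half-quadratic iteration for kernel modal regression with Gaussian phi:
    # (i) fix the HQ auxiliary weights w_i from the current residuals,
    # (ii) solve the resulting weighted regularized least-squares problem.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(iters):
        r = y - K @ alpha
        w = np.exp(-0.5 * (r / sigma) ** 2)          # HQ auxiliary variables
        # weighted ridge step: (diag(w) K + n*lam*I) alpha = diag(w) y
        alpha = np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * y)
    return alpha

# toy usage: a 1-D curve contaminated by a few gross outliers
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 1.0, 60))
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, 60)
y[::15] += 5.0                                       # gross outliers
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)
fit = K @ hq_modal_fit(K, y)
```

Outliers receive weights $w_i\approx0$ and are effectively ignored by the least-squares step, which is the computational source of the robustness discussed above.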

3. Error Analysis

This section states upper bounds for the excess quantity $\mathcal{R}(f^*)-\mathcal{R}(\hat{f})$. For ease of presentation, we only considered the special setting where $\mathcal{H}_j\equiv\mathcal{H}_{j'}$ for all $j,j'\in\{1,\dots,p\}$, and we denote $\sum_{j=1}^p\mathcal{H}_j$ by $\mathcal{H}_K$ with $\sup_x K(x,x)\le1$.

Recall that the Mercer kernel $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ admits the following spectral expansion [38]:

$K(x,x')=\sum_{\ell\ge1}b_\ell\psi_\ell(x)\psi_\ell(x'),\quad x,x'\in\mathcal{X},$

where $\{(b_\ell,\psi_\ell)\}_{\ell\ge1}$ are the eigenvalue–eigenfunction pairs of the integral operator $T:f\mapsto\int_{\mathcal{X}}K(\cdot,x)f(x)\,d\rho_X(x)$ with $b_1\ge b_2\ge\cdots\ge0$.

To evaluate the complexity of $\mathcal{H}_K$ in terms of the decay rate of the eigenvalues $\{b_\ell\}_{\ell\ge1}$ [27,28], we refer to Assumption 1 in [28] as the basis of our analysis.

Assumption 1.

There exist $s\in(0,1)$ and a constant $c_1>0$ such that $b_\ell\le c_1\ell^{-\frac{1}{s}}$, $\forall\ell\ge1$.

As illustrated in [27,28], the requirement $s<1$ is a weak condition since $\sum_{\ell\ge1}b_\ell=\mathbb{E}K(x,x)\le1$. In particular, $b_\ell\asymp\ell^{-2h}$ holds for the Sobolev space $\mathcal{H}_K=W_2^h$ $(h>\frac{1}{2})$ with the Lebesgue measure on $[0,1]$.

To describe the hypothesis space in the RKHS, we refer to Assumption 2 in [28].

Assumption 2.

For some $s\in(0,1)$ given in Assumption 1, there exists a positive constant $c_2$ such that $\|f\|_\infty\le c_2\|f\|_2^{1-s}\|f\|_K^s$, $\forall f\in\mathcal{H}_K$.

Remark 4.

To understand the statistical performance of the proposed estimator without any “correlatedness” conditions on the covariates, Rademacher complexity [39] was used in [28] to measure functional complexity; our analysis draws on that experience.

In general, Assumption 2 is stronger than Assumption 1 and is satisfied when the RKHS is continuously embeddable in a Sobolev space. For uniformly bounded $\{\psi_\ell\}_{\ell\ge1}$, this sup-norm condition is consistent with Assumption 1.

For any given independent input variables $\{x_i\}_{i=1}^n\subset\mathcal{X}$, define the Rademacher complexity:

$R_n(f):=\frac{1}{n}\sum_{i=1}^n\sigma_i f(x_i),\quad f\in\mathcal{H}_K,$

where $\{\sigma_i\}_{i=1}^n$ is an i.i.d. sequence of Rademacher variables taking values in $\{\pm1\}$ with probability $1/2$ each. As shown in [40], it holds:

$\mathbb{E}\Big[\sup_{\|f\|_K=1,\|f\|_2\le t}|R_n(f)|\Big]\le\frac{1}{\sqrt{n}}\Big[\sum_{\ell\ge1}\min\{t^2,b_\ell\}\Big]^{\frac{1}{2}}.$

Moreover, from Assumption 1, define:

$\gamma_n:=\inf\Big\{\gamma\ge\sqrt{\tfrac{A\log\tilde{p}}{n}}:\ \mathbb{E}\Big[\sup_{\|f\|_K=1,\|f\|_2\le t}|R_n(f)|\Big]\le\gamma t+\gamma^2,\ \forall t\in(0,1)\Big\}\asymp\max\Big\{\sqrt{\tfrac{A\log\tilde{p}}{n}},\Big(\tfrac{1}{n}\Big)^{\frac{1}{2(1+\alpha)}}\Big\}.$

The main idea of our error analysis is to first establish a probabilistic result on a suitably defined event and then investigate the behavior of $\hat{f}$ in (6) conditional on that event.

Define $\eta(t):=\max\{1,\sqrt{t},t/\sqrt{n}\}$ for any $t>0$ and $\xi_n:=\xi_n(\lambda)=\max\Big\{\lambda^{\frac{\alpha}{2}}n^{-\frac{1}{2}},\ \lambda^{-\frac{1}{2}}n^{-\frac{1}{1+\alpha}},\ \sqrt{\tfrac{\log p}{n}}\Big\}$, and consider the event:

$\theta(t)=\Big\{\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i f(x_i)\Big|\le c_\alpha\eta(t)\xi_n\big(\|f\|_2+\lambda^{\frac{1}{2}}\|f\|_K\big),\ \forall f\in\mathcal{H}_K\Big\},$

where $\{\epsilon_i\}_{i=1}^n$ are zero-mean i.i.d. random variables with $|\epsilon_i|\le L$, and $c_\alpha$ is a constant depending on $\alpha$ and $L$.

Remark 5.

To analyze the behavior of the regularized estimator conditioned on this event, several basic facts about empirical processes were introduced in [28]. Our work can be boiled down to this framework, and we introduce the relevant lemmas from [28] as stepping stones.

Lemma 1.

Let Assumptions 1 and 2 hold. If $\frac{\log p}{n}\le1$, it holds:

$P(\theta(t))\ge1-\exp\{-t\},\quad\forall\lambda>0,\ t\ge1.$

The following lemma (see also Theorem 4 in [41]) demonstrates the relationship between the empirical norm $\|\cdot\|_n$ and $\|\cdot\|_2$ for functions in $\mathcal{H}_K$.

Lemma 2.

For $A\ge1$ and any given $\tilde{p}\ge p$ with $\log\tilde{p}\ge2\log\log n$, there exists a constant $c$ such that:

$\|f\|_2\le c\big(\|f\|_n+\gamma_n\|f\|_K\big)$

and:

$\|f\|_n\le c\big(\|f\|_2+\gamma_n\|f\|_K\big)$

with confidence at least $1-\tilde{p}^{-A}$, where $\gamma_n\asymp\max\Big(\sqrt{\tfrac{A\log\tilde{p}}{n}},\big(\tfrac{1}{n}\big)^{\frac{1}{2(1+\alpha)}}\Big)$.

Lemma 3.

Let $\{z_i\}_{i=1}^n\subset\mathcal{Z}$ be independent random variables, and let $\Gamma$ be a class of real-valued functions on $\mathcal{Z}$ satisfying:

$\|\gamma\|_\infty\le\eta_n,\ \forall\gamma\in\Gamma,\quad\text{and}\quad\frac{1}{n}\sum_{i=1}^n\mathrm{var}(\gamma(z_i))\le\iota_n^2,$

for some positive constants $\eta_n$ and $\iota_n$. Define $\zeta:=\sup_{\gamma\in\Gamma}\big|\frac{1}{n}\sum_{i=1}^n\gamma(z_i)-\mathbb{E}\gamma(z)\big|$. Then,

$P\Big\{\zeta\ge\mathbb{E}\zeta+\sqrt{\frac{2t(\iota_n^2+2\eta_n\mathbb{E}\zeta)}{n}}+\frac{2\eta_n t}{3n}\Big\}\le\exp\{-t\}.$

For any given $\Delta_-$ and $\Delta_+$, define:

$\mathcal{F}(\Delta_-,\Delta_+)=\Big\{f=\sum_{j=1}^p f_j\in\mathcal{H}_K:\ \gamma_n\sum_{j=1}^p\|f_j-f_j^*\|_2\le\Delta_-,\ \gamma_n^2\sum_{j=1}^p\|f_j-f_j^*\|_K\le\Delta_+\Big\}.$

Lemma 4.

Let Assumptions 1 and 2 hold for each $\mathcal{H}_j$. For any given $A\ge2$, with confidence at least $1-\tilde{p}^{-A}$, it holds:

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(f)-\big(\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(f)\big)\le c^*\eta(t_0)(\Delta_-+\Delta_+)+e^{-\tilde{p}},$

for any $f\in\mathcal{F}(\Delta_-,\Delta_+)$ with $\max\{\Delta_-,\Delta_+\}\le e^{\tilde{p}}$, where $t_0=2\log(2\sqrt{3}/\log2)+A\log\tilde{p}+2\log\tilde{p}$, $\lambda=n^{-\frac{1}{1+\alpha}}$, and $c^*$ is a positive constant.

Proof. 

Denote $\Gamma=\Big\{\gamma(z):\ \gamma(z)=\frac{1}{\sigma}\phi\Big(\frac{y-f^*(x)}{\sigma}\Big)-\frac{1}{\sigma}\phi\Big(\frac{y-f(x)}{\sigma}\Big),\ f\in\mathcal{F}(\Delta_-,\Delta_+)\Big\}$. It is easy to verify that:

$\mathbb{E}\gamma(z)-\frac{1}{n}\sum_{i=1}^n\gamma(z_i)=\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(f)-\big(\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(f)\big),\quad\forall\gamma\in\Gamma.$

Let $\zeta:=\sup_{\gamma\in\Gamma}\big|\frac{1}{n}\sum_{i=1}^n\gamma(z_i)-\mathbb{E}\gamma(z)\big|$. From Lemma 3, we have:

$\zeta\le\mathbb{E}\zeta+\sqrt{\frac{2t(\iota_n^2+2\eta_n\mathbb{E}\zeta)}{n}}+\frac{2\eta_n t}{3n},$ (7)

with probability at least $1-\exp\{-t\}$, where $\eta_n=\sup_{\gamma\in\Gamma}\|\gamma\|_\infty$ and $\iota_n^2=\sup_{\gamma\in\Gamma}\frac{1}{n}\sum_{i=1}^n\mathrm{var}(\gamma(z_i))$. Observing that:

$\sqrt{\frac{2t(\iota_n^2+2\eta_n\mathbb{E}\zeta)}{n}}\le\sqrt{\frac{2t\iota_n^2}{n}}+2\sqrt{\frac{t\eta_n\mathbb{E}\zeta}{n}}\le\sqrt{\frac{2t}{n}}\,\iota_n+\mathbb{E}\zeta+\frac{t\eta_n}{n},$ (8)

we can take:

$\iota_n^2\le2\mathbb{E}(\gamma(z))^2=2\mathbb{E}\Big(\frac{1}{\sigma}\phi\Big(\frac{y-f^*(x)}{\sigma}\Big)-\frac{1}{\sigma}\phi\Big(\frac{y-f(x)}{\sigma}\Big)\Big)^2\le\frac{2\|\phi'\|_\infty^2}{\sigma^4}\|f-f^*\|_2^2\le\frac{2\|\phi'\|_\infty^2}{\sigma^4}\cdot\frac{\Delta_-^2}{\gamma_n^2},$ (9)

and:

$\eta_n=\sup_{\gamma\in\Gamma}\|\gamma\|_\infty\le\frac{\|\phi'\|_\infty}{\sigma^2}\|f^*-f\|_\infty\le\frac{\|\phi'\|_\infty}{\sigma^2}\|f^*-f\|_K\le\frac{\|\phi'\|_\infty}{\sigma^2}\cdot\frac{\Delta_+}{\gamma_n^2}.$ (10)

Combining (7)–(10), we obtain, with confidence at least $1-\exp\{-t\}$,

$\zeta\le2\mathbb{E}\zeta+\frac{\sqrt{2}\|\phi'\|_\infty\Delta_-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}}+\frac{\kappa\|\phi'\|_\infty\Delta_+}{\sigma^2\gamma_n^2}\cdot\frac{1+t}{n}.$

By a symmetrization technique in [42], we have:

$\mathbb{E}\zeta\le2\mathbb{E}R_n(\Gamma)\le\frac{2\|\phi'\|_\infty}{\sigma^2}\mathbb{E}R_n(\mathcal{F}-f^*).$

Applying Lemma 3 to $R_n(\mathcal{F}-f^*)$, we obtain that:

$\mathbb{E}[R_n(\mathcal{F}-f^*)]\le R_n(\mathcal{F}-f^*)+\frac{4\Delta_-}{\gamma_n}\sqrt{\frac{2t}{n}}+\frac{\Delta_+}{\gamma_n^2}\cdot\frac{1+t}{n},$

with probability at least $1-2\exp\{-t\}$. Moreover, with probability at least $1-2\exp\{-t\}$, it holds:

$\zeta\le\frac{8\|\phi'\|_\infty}{\sigma^2}R_n(\mathcal{F}-f^*)+\frac{6\|\phi'\|_\infty\Delta_-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}}+\frac{5\|\phi'\|_\infty\Delta_+}{\gamma_n^2\sigma^2}\cdot\frac{1+t}{n}\le\frac{8\|\phi'\|_\infty}{\sigma^2}\sum_{j=1}^p R_n(\mathcal{H}_j-f_j^*)+\frac{6\|\phi'\|_\infty\Delta_-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}}+\frac{5\|\phi'\|_\infty\Delta_+}{\gamma_n^2\sigma^2}\cdot\frac{1+t}{n}.$

For the event θ(t), Lemma 1 demonstrates that:

$|R_n(f)|\le c_\alpha\eta(t)\xi_n\big(\|f\|_2+\lambda^{\frac{1}{2}}\|f\|_K\big),\quad\forall f\in\mathcal{H}_K,\ \lambda>0,$

with confidence $1-\exp\{-t\}$. Then,

$\zeta\le\frac{8\|\phi'\|_\infty c_\alpha\eta(t)\xi_n}{\sigma^2}\sup_{f\in\mathcal{F}}\Big\{\sum_{j=1}^p\|f_j-f_j^*\|_2+\lambda^{\frac{1}{2}}\sum_{j=1}^p\|f_j-f_j^*\|_K\Big\}+\frac{6\|\phi'\|_\infty\Delta_-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}}+\frac{5\|\phi'\|_\infty\Delta_+}{\gamma_n^2\sigma^2}\cdot\frac{1+t}{n}.$

Taking $\lambda=n^{-\frac{1}{1+\alpha}}$, we can verify that $\xi_n\le c\gamma_n$ and $\xi_n\lambda^{\frac{1}{2}}\le c\gamma_n^2$. Then,

$\zeta\le\frac{8c_\alpha\eta(t)\|\phi'\|_\infty}{\sigma^2}(\Delta_++\Delta_-)+\frac{6\Delta_-\|\phi'\|_\infty}{\sigma^2}\sqrt{\frac{t}{A\log\tilde{p}}}+\frac{5\Delta_+\|\phi'\|_\infty}{\sigma^2}\cdot\frac{1+t}{A\log\tilde{p}},$

on the corresponding event $\theta(\Delta_-,\Delta_+)$.

For $t_0=2\log(2\sqrt{3}/\log2)+A\log\tilde{p}+2\log\tilde{p}$, the range $e^{-\tilde{p}}\le\Delta_-\le e^{\tilde{p}}$ and $e^{-\tilde{p}}\le\Delta_+\le e^{\tilde{p}}$ can be covered by the $(2\tilde{p}+1)^2$ different discrete pairs $\Delta_{-j}=\Delta_{+j}:=2^j$, $j=-\tilde{p},\dots,\tilde{p}$, and we deduce that:

$P\Big(\bigcap_{j,k}\theta(\Delta_{-j},\Delta_{+k})\Big)\ge1-3\Big(\frac{2}{\log2}\Big)^2\tilde{p}^2\exp\Big\{-2\log\frac{2\sqrt{3}}{\log2}-A\log\tilde{p}-2\log\tilde{p}\Big\}\ge1-\tilde{p}^{-A}.$

When $\Delta_-\le e^{-\tilde{p}}$ or $\Delta_+\le e^{-\tilde{p}}$, it is trivial to obtain the desired result. □

The proof of Lemma 4 adapts the proof of Proposition 1 in [28] for quantile regression. We now state our main result on the error bound.

Theorem 1.

Let the regularization parameters of $\hat{f}$ defined in (6) be $\lambda_1=\xi\gamma_n$ and $\lambda_2=\xi\gamma_n^2$, where $\xi=\max\{2c\eta(t_0),4\}$ with $\eta(t)=\max\{1,\sqrt{t},t/\sqrt{n}\}$, $t_0=2\log(2\sqrt{3}/\log2)+A\log\tilde{p}+2\log\tilde{p}$, and $\gamma_n\asymp\max\Big(\sqrt{\tfrac{A\log\tilde{p}}{n}},\big(\tfrac{1}{n}\big)^{\frac{1}{2(1+\alpha)}}\Big)$. Under the conditions of Assumptions 1 and 2, for any $\tilde{p}\ge p$ such that $\log p\le n$ and $\log\tilde{p}\ge2\log\log n$, and for some constant $A\ge2$, it holds with probability at least $1-2\tilde{p}^{-A}$:

$\mathcal{R}(f^*)-\mathcal{R}(\hat{f})\le c_s\sqrt{\|\phi'\|_\infty}\,(\eta(t_0))^{\frac{5}{4}}\sqrt{\gamma_n}\le c(\eta(t_0))^{\frac{5}{4}}\max\Big\{\Big(\frac{A\log\tilde{p}}{n}\Big)^{\frac{1}{4}},\Big(\frac{1}{n}\Big)^{\frac{1}{4(1+\alpha)}}\Big\}\le c\max\Big\{(A\log\tilde{p})^{\frac{7}{8}}n^{-\frac{1}{4}},\ (A\log\tilde{p})^{\frac{5}{8}}n^{-\frac{1}{4+4\alpha}},\ (A\log\tilde{p})^{\frac{3}{2}}n^{-\frac{3}{4}},\ (A\log\tilde{p})^{\frac{5}{4}}n^{-\frac{3+2\alpha}{4+4\alpha}}\Big\}.$

Proof. 

By the definition of $\hat{f}$ in (6), we know that:

$\hat{\mathcal{R}}_\sigma(\hat{f})-\lambda_1\sum_{j=1}^p\|\hat{f}_j\|_n-\lambda_2\sum_{j=1}^p\|\hat{f}_j\|_{K_j}\ge\hat{\mathcal{R}}_\sigma(f^*)-\lambda_1\sum_{j=1}^p\|f_j^*\|_n-\lambda_2\sum_{j=1}^p\|f_j^*\|_{K_j}.$

This implies that:

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})+\lambda_1\sum_{j=1}^p\|\hat{f}_j\|_n+\lambda_2\sum_{j=1}^p\|\hat{f}_j\|_{K_j}\le\big[\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\big]-\big[\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(\hat{f})\big]+\lambda_1\sum_{j=1}^p\|f_j^*\|_n+\lambda_2\sum_{j=1}^p\|f_j^*\|_{K_j}.$

Moreover,

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\le\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})+\lambda_1\sum_{j\notin S}\|\hat{f}_j\|_n+\lambda_2\sum_{j\notin S}\|\hat{f}_j\|_K\le\big[\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\big]-\big[\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(\hat{f})\big]+\lambda_1\sum_{j\in S}\big(\|f_j^*\|_n-\|\hat{f}_j\|_n\big)+\lambda_2\sum_{j\in S}\big(\|f_j^*\|_K-\|\hat{f}_j\|_K\big)\le\big[\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\big]-\big[\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(\hat{f})\big]+\lambda_1\sum_{j\in S}\|\hat{f}_j-f_j^*\|_n+\lambda_2\sum_{j\in S}\|\hat{f}_j-f_j^*\|_K.$ (11)

Taking $\lambda_1=\xi\gamma_n$ and $\lambda_2=\xi\gamma_n^2$ with $\gamma_n\asymp\max\Big\{\sqrt{\tfrac{A\log\tilde{p}}{n}},\big(\tfrac{1}{n}\big)^{\frac{1}{2+2\alpha}}\Big\}$, $\alpha\in(0,1)$, we deduce that:

$\gamma_n\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_2\le2p\gamma_n\le2\tilde{p}\le e^{\tilde{p}},\quad\forall n\ge1,\ \tilde{p}\ge p,$

and:

$\gamma_n^2\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_{K_j}\le\gamma_n\cdot\gamma_n\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_{K_j}\le e^{\tilde{p}}.$

Therefore, we verify that $\hat{f}\in\mathcal{F}(\Delta_-,\Delta_+)$ with $\Delta_-\le e^{\tilde{p}}$ and $\Delta_+\le e^{\tilde{p}}$. With the choices $\lambda_2=\lambda_1\gamma_n=\xi\gamma_n^2$, it holds:

$\lambda_1\|\hat{f}_j-f_j^*\|_n+\lambda_2\|\hat{f}_j-f_j^*\|_K\le2(\lambda_1+\lambda_2)\le4\xi\gamma_n,\quad\forall j\in S,$

due to the fact that $\|f_j\|_n\le\|f_j\|_K\le1$ for the component functions considered here.

According to Lemma 4 and (11), we obtain:

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\le\frac{c\eta(t_0)\|\phi'\|_\infty}{\sigma^2}\Big(\gamma_n\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_2+\gamma_n^2\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_K\Big)+\lambda_1\sum_{j\in S}\|\hat{f}_j-f_j^*\|_n+\lambda_2\sum_{j\in S}\|\hat{f}_j-f_j^*\|_K+e^{-\tilde{p}}\le\frac{c\eta(t_0)\|\phi'\|_\infty}{\sigma^2}\xi\gamma_n+e^{-\tilde{p}},$

with probability at least $1-2\tilde{p}^{-A}$.

Notice that $\log\tilde{p}\ge2\log\log n$ implies $e^{-\tilde{p}}\le n^{-2}\le\gamma_n$. Then:

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\le\frac{c\eta(t_0)\|\phi'\|_\infty\xi\gamma_n}{\sigma^2}.$

Combining this with Theorem 9 in [17] and setting $\sigma=(\|\phi'\|_\infty\eta(t_0)\xi\gamma_n)^{\frac{1}{4}}$, we obtain the desired result. □

The proof of Theorem 1 is inspired by that of Theorem 1 in [28]; see [28] for more details. According to Theorem 1, we can conclude that the mode-based SpAM achieves a learning rate with polynomial decay $O(n^{-\frac{1}{4+4\alpha}})$, since $\alpha\in(0,1)$ and $A,\tilde{p}$ are positive constants.

4. Experimental Evaluation

To demonstrate the effectiveness of our method, in this section, we evaluated our model on some synthetic datasets. The data in $\mathbb{R}^p$ with dimension $p=5$ and $p=10$ were generated randomly according to the uniform distribution on the interval $[0,1]$. Then, we computed the MSE of our estimator $\hat{f}$. Figure 1, Figure 2, and Figure 3 depict the MSE of $\hat{f}$ when the parameter pair $(\lambda_1,\lambda_2)=(0,1),(1,0)$, and $(1,1)$, respectively, while the number of samples $n$ varies from 50/60 to 80/90. This paper used Yalmip [43] for modeling in the MATLAB environment and called fmincon to solve the problem. From the figures, we can observe that the MSEs tended to decrease as the number of samples $n$ increased under all three parameter settings, which verified that our method is effective for the regression of high-dimensional data.
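A minimal version of this protocol can be sketched as follows (our illustration: the paper's exact component functions and solver are not listed, so the additive target and the plain kernel ridge stand-in estimator below are assumptions). The MSE against the noise-free target typically shrinks as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n, p=5):
    # covariates uniform on [0,1]^p as in the experiments; the additive target
    # below (two active components out of p) is purely illustrative
    X = rng.uniform(0.0, 1.0, (n, p))
    f = np.sin(2.0 * np.pi * X[:, 0]) + X[:, 1] ** 2
    return X, f, f + rng.normal(0.0, 0.2, n)

def krr(Xtr, ytr, Xte, width=0.5, lam=1e-2):
    # plain Gaussian kernel ridge regression as a stand-in estimator
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * width**2))
    alpha = np.linalg.solve(gram(Xtr, Xtr) + lam * len(ytr) * np.eye(len(ytr)), ytr)
    return gram(Xte, Xtr) @ alpha

Xte, fte, _ = make_data(500)
mses = []
for n in (50, 200, 800):
    Xtr, _, ytr = make_data(n)
    mses.append(np.mean((krr(Xtr, ytr, Xte) - fte) ** 2))
    print(n, round(mses[-1], 4))  # MSE tends to decrease as n grows
```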

Figure 1. MSE of $\hat{f}$ when $(\lambda_1,\lambda_2)=(0,1)$.

Figure 2. MSE of $\hat{f}$ when $(\lambda_1,\lambda_2)=(1,0)$.

Figure 3. MSE of $\hat{f}$ when $(\lambda_1,\lambda_2)=(1,1)$.

5. Conclusions

In this work, we proposed a mode-based sparse additive model and established its generalization error bound. The theoretical results extend the previous mean-based analysis to the mode-based setting. We demonstrated that the mode-based SpAM can achieve a learning rate with polynomial decay $O(n^{-\frac{1}{4+4\alpha}})$, which is comparable to the previous result $O(n^{-\frac{1}{7}})$ in [15]. In the future, it will be important to further explore the variable selection consistency of the proposed model.

Author Contributions

Conceptualization, H.D., B.S., J.C. and Z.P.; methodology, H.D. and Z.P.; validation, B.S. and Z.P.; formal analysis, H.D., B.S. and Z.P.; investigation, H.D. and J.C.; resources, Z.P.; data curation, H.D. and J.C.; writing—original draft preparation, H.D. and J.C.; writing—review and editing, H.D. and J.C.; visualization, H.D. and J.C.; supervision, B.S. and Z.P.; project administration, B.S. and Z.P.; funding acquisition, H.D., B.S. and Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Fundamental Research Funds for the Central Universities of China (Grant Nos. 2662019FW003 and 2662020LXQD001) and the National Natural Science Foundation of China (Grant No. 12001217).

Data Availability Statement

The synthetic data generation method of the simulation experiment has been introduced in the experimental part.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Xia Y., Hou Y., Lv S. Learning Rates for Partially Linear Support Vector Machine in High Dimensions. Anal. Appl. 2021;19:167–182. doi: 10.1142/S0219530520400126.
2. Ravikumar P., Liu H., Lafferty J., Wasserman L. SpAM: Sparse Additive Models. J. R. Stat. Soc. Ser. B. 2009;71:1009–1030. doi: 10.1111/j.1467-9868.2009.00718.x.
3. Yin J., Chen X., Xing E.P. Group Sparse Additive Models. Proceedings of the International Conference on Machine Learning (ICML); Edinburgh, UK, 26 June–1 July 2012; pp. 1643–1650.
4. Lin Y., Zhang H.H. Component Selection and Smoothing in Multivariate Nonparametric Regression. Ann. Stat. 2006;34:2272–2297. doi: 10.1214/009053606000000722.
5. Zhao T., Liu H. Sparse Additive Machine. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS); La Palma, Spain, 21–23 April 2012; pp. 1435–1443.
6. Chen H., Wang X., Deng C., Huang H. Group Sparse Additive Machine. Proceedings of the Advances in Neural Information Processing Systems (NIPS); Long Beach, CA, USA, 4–9 December 2017; pp. 197–207.
7. Kandasamy K., Yu Y. Additive Approximations in High Dimensional Nonparametric Regression via the SALSA. Proceedings of the International Conference on Machine Learning (ICML); New York, NY, USA, 19–24 June 2016; pp. 69–78.
8. Wang Y., Chen H., Zheng F., Xu C., Gong T., Chen Y. Multi-task Additive Models for Robust Estimation and Automatic Structure Discovery. Proceedings of the Advances in Neural Information Processing Systems (NIPS); Online, 6–12 December 2020; pp. 11744–11755.
9. Chen H., Liu G., Huang H. Sparse Shrunk Additive Models. Proceedings of the International Conference on Machine Learning (ICML); Vienna, Austria, 12–18 July 2020; pp. 6194–6204.
10. Chen H., Guo C., Xiong H., Wang Y. Sparse Additive Machine with Ramp Loss. Anal. Appl. 2021;19:509–528. doi: 10.1142/S0219530520400011.
11. Meier L., Geer S.V.D., Buhlmann P. High-dimensional Additive Modeling. Ann. Stat. 2009;37:3779–3821. doi: 10.1214/09-AOS692.
12. Raskutti G., Wainwright M.J., Yu B. Minimax-optimal Rates for Sparse Additive Models over Kernel Classes via Convex Programming. J. Mach. Learn. Res. 2012;13:389–427.
13. Kemp G.C.R., Silva J.M.C.S. Regression towards the Mode. J. Econom. 2012;170:92–101. doi: 10.1016/j.jeconom.2012.03.002.
14. Yao W., Li L. A New Regression Model: Modal Linear Regression. Scand. J. Stat. 2014;41:656–671. doi: 10.1111/sjos.12054.
15. Wang X., Chen H., Cai W., Shen D., Huang H. Regularized Modal Regression with Applications in Cognitive Impairment Prediction. Proceedings of the Advances in Neural Information Processing Systems (NIPS); Long Beach, CA, USA, 4–9 December 2017; pp. 1448–1458.
16. Chen Y.C., Genovese C.R., Tibshirani R.J., Wasserman L. Nonparametric Modal Regression. Ann. Stat. 2014;44:489–514. doi: 10.1214/15-AOS1373.
17. Feng Y., Fan J., Suykens J. A Statistical Learning Approach to Modal Regression. J. Mach. Learn. Res. 2020;21:1–35.
18. Collomb G., Härdle W., Hassani S. A Note on Prediction via Estimation of the Conditional Mode Function. J. Stat. Plan. Inference. 1986;15:227–236. doi: 10.1016/0378-3758(86)90099-6.
19. Chen H., Wang Y., Zheng F., Deng C., Huang H. Sparse Modal Additive Model. IEEE Trans. Neural Netw. Learn. Syst. 2020:1–15. doi: 10.1109/TNNLS.2020.3005144.
20. Li J., Ray S., Lindsay B.G. A Nonparametric Statistical Approach to Clustering via Mode Identification. J. Mach. Learn. Res. 2007;8:1687–1723.
21. Einbeck J., Tutz G. Modeling beyond Regression Function: An Application of Multimodal Regression to Speed-flow Data. J. R. Stat. Soc. Ser. C Appl. Stat. 2006;55:461–475. doi: 10.1111/j.1467-9876.2006.00547.x.
22. Tibshirani R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996;58:267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
23. Feng Y., Huang X., Shi L., Yang Y., Suykens J.A. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. J. Mach. Learn. Res. 2015;16:993–1034.
24. Lv F., Fan J. Optimal Learning with Gaussians and Correntropy Loss. Anal. Appl. 2019;19:107–124. doi: 10.1142/S0219530519410124.
25. Yao W., Lindsay B.G., Li R. Local Modal Regression. J. Nonparametr. Stat. 2012;24:647–663. doi: 10.1080/10485252.2012.678848.
26. Chen Y. Modal Regression using Kernel Density Estimation: A Review. Wiley Interdiscip. Rev. Comput. Stat. 2018;10:e1431. doi: 10.1002/wics.1431.
27. Steinwart I., Christmann A. Support Vector Machines. Springer Science and Business Media; Berlin/Heidelberg, Germany: 2008.
28. Lv S., Lin H., Lian H., Huang J. Oracle Inequalities for Sparse Additive Quantile Regression in Reproducing Kernel Hilbert Space. Ann. Stat. 2018;46:781–813. doi: 10.1214/17-AOS1567.
29. Huang J., Horowitz J.L., Wei F. Variable Selection in Nonparametric Additive Models. Ann. Stat. 2010;38:2282–2313. doi: 10.1214/09-AOS781.
30. Christmann A., Zhou D.X. Learning Rates for the Risk of Kernel-based Quantile Regression Estimators in Additive Models. Anal. Appl. 2016;14:449–477. doi: 10.1142/S0219530515500050.
31. Yuan M., Zhou D.X. Minimax Optimal Rates of Estimation in High Dimensional Additive Models. Ann. Stat. 2016;44:2564–2593. doi: 10.1214/15-AOS1422.
32. Nikolova M., Ng M.K. Analysis of Half-quadratic Minimization Methods for Signal and Image Recovery. SIAM J. Sci. Comput. 2006;27:937–966. doi: 10.1137/030600862.
33. Alizadeh F., Goldfarb D. Second-Order Cone Programming. Math. Program. 2003;95:3–51. doi: 10.1007/s10107-002-0339-5.
34. Guo C., Song B., Wang Y., Chen H., Xiong H. Robust Variable Selection and Estimation Based on Kernel Modal Regression. Entropy. 2019;21:403. doi: 10.3390/e21040403.
35. Wang Y., Tang Y.Y., Li L., Chen H. Modal Regression-based Atomic Representation for Robust Face Recognition and Reconstruction. IEEE Trans. Cybern. 2020;50:4393–4405. doi: 10.1109/TCYB.2019.2903205.
36. Suzuki T., Sugiyama M. Fast Learning Rate of Multiple Kernel Learning: Trade-off between Sparsity and Smoothness. Ann. Stat. 2013;41:1381–1405. doi: 10.1214/13-AOS1095.
37. Schölkopf B., Smola A.J. Learning with Kernels. The MIT Press; Cambridge, MA, USA: 2002.
38. Aronszajn N. Theory of Reproducing Kernels. Trans. Am. Math. Soc. 1950;68:337–404. doi: 10.1090/S0002-9947-1950-0051437-7.
39. Bartlett P.L., Bousquet O., Mendelson S. Localized Rademacher Complexities. Proceedings of the Conference on Computational Learning Theory (COLT); Sydney, Australia, 8–10 July 2002; pp. 44–58.
40. Mendelson S. Geometric Parameters of Kernel Machines. Proceedings of the Conference on Computational Learning Theory (COLT); Sydney, Australia, 8–10 July 2002; pp. 29–43.
41. Koltchinskii V., Yuan M. Sparsity in Multiple Kernel Learning. Ann. Stat. 2010;38:3660–3695. doi: 10.1214/10-AOS825.
42. Van de Geer S. Empirical Processes in M-Estimation. Cambridge University Press; Cambridge, UK: 2000.
43. Löfberg J. Automatic Robust Convex Programming. Optim. Methods Softw. 2012;27:115–129. doi: 10.1080/10556788.2010.517532.
