Entropy. 2021 May 22;23(6):651. doi: 10.3390/e23060651

Error Bound of Mode-Based Additive Models

Hao Deng 1, Jianghong Chen 2, Biqin Song 1,*, Zhibin Pan 1,*
Editor: Ercan Kuruoglu
PMCID: PMC8224641  PMID: 34067420

Abstract

Due to their flexibility and interpretability, additive models are powerful tools for high-dimensional mean regression and variable selection. However, the least-squares-loss-based mean regression models suffer from sensitivity to non-Gaussian noise, so there is a need to improve the model's robustness. This paper considers estimation and variable selection via modal regression in reproducing kernel Hilbert spaces (RKHSs). Based on the mode-induced metric and a two-fold Lasso-type regularizer, we proposed a sparse modal regression algorithm and gave its excess generalization error bound. The experimental results demonstrated the effectiveness of the proposed model.

Keywords: modal regression, additive models, reproducing kernel Hilbert spaces, error bound

1. Introduction

Regression estimation and variable selection are two important tasks for high-dimensional data mining [1]. Sparse additive models [2,3], aiming to deal with the above tasks simultaneously, have been extensively investigated in the mean regression setting. As a class of models between linear and nonparametric regression, these methods inherit the flexibility from nonparametric regression and the interpretability from linear regression. Typical methods include COSSO [4] and SpAM [2] and its variants, such as Group SpAM [3], SAM [5], Group SAM [6], SALSA [7], MAM [8], SSAM [9], and ramp-SAM [10]. From the lens of nonparametric regression, the additive structure on the hypothesis space is crucial to overcome the curse of dimensionality [7,11,12].

Usually, the aforementioned models are limited to estimating the conditional mean under the mean-squared error (MSE) criterion. However, under complex non-Gaussian noise (e.g., skewed or heavy-tailed noise), it is difficult for mean-based approaches to extract the intrinsic trend, resulting in degraded performance. Beyond traditional mean regression, it is therefore interesting to formulate a new regression framework under the (conditional) mode-based criterion. With the help of the recent works in [13,14,15,16,17,18,19], this paper aimed to propose a new robust sparse additive model, rooted in modal regression associated with the RKHS.

As an alternative to mean regression, modal regression has been investigated in terms of both statistical behavior [14,15,17] and real-world applications [20,21]. Yao [14] proposed a modal linear regression algorithm and characterized its theoretical properties under a global mode assumption. As a natural extension of the Lasso [22], Wang et al. [15] considered regularized modal regression and established its generalization bound and variable selection consistency. Feng et al. [17] studied modal regression from a learning theory perspective and illustrated its relation with MCC [23,24]. Different from these global approaches, local modal regression algorithms were formulated in [16,25] with convergence guarantees. The recent survey [26] gives a general overview of modal regression, and a more comprehensive list of references can be found there.

The proposed robust additive models are formulated under the Tikhonov regularization scheme, which involves three building blocks: the mode-based metric, the RKHS-based hypothesis space, and two Lasso-type penalties. Since the linear function space, the polynomial function space, and Sobolev/Besov spaces are special cases of the RKHS, the kernel-based function space is more flexible than traditional spline-based spaces or other dictionary-based hypotheses [2,5,27,28,29]. The mode-induced regression metric is robust to non-Gaussian noise according to both theoretical and empirical evaluations [14,15,17]. The regularized penalty addresses the sparsity and smoothness of the estimator, which has shown promising performance for mean regression [2,29,30,31]. Therefore, different from mean-based kernel regression and additive models, the mode-based approach enjoys robustness and interpretability simultaneously due to its metric criterion and trade-off penalty. The estimator of our approach can be obtained by integrating half-quadratic (HQ) optimization [32] and second-order cone programming (SOCP) [33].

The rest of this article is organized as follows. After introducing the robust additive model in Section 2, we state its generalization error bound in Section 3 and report the experimental evaluation in Section 4. Finally, Section 5 ends this paper with a brief conclusion.

2. Methodology

2.1. Modal Regression

In this section, we recall the basic background on modal regression [19,34]. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^p$ associated with the input covariate vector and $\mathcal{Y}\subset\mathbb{R}$ be the response variable set. In this paper, we considered the following nonparametric model:

$Y=f^*(X)+\epsilon,$ (1)

where $X=(X_1,\dots,X_p)^T\in\mathcal{X}$, $Y\in\mathcal{Y}$, and $\epsilon$ is a random noise. For simplicity, we denote by $\rho$ the underlying joint distribution of $(X,Y)$ generated by (1).

Being different from the traditional mean regression under the noise condition $\mathbb{E}(\epsilon|X=x)=0$ (e.g., Gaussian noise), we just require that the mode of the conditional distribution of $\epsilon$ equal zero at each $x\in\mathcal{X}$. That is:

$\forall x\in\mathcal{X},\quad \mathrm{mode}(\epsilon|X=x)=\arg\max_{t\in\mathbb{R}}P_{\epsilon|X}(t|X=x)=0,$ (2)

where $P_{\epsilon|X}$ is the conditional density of $\epsilon$ given $X$. Notice that this zero-mode condition does not require homogeneity or symmetry of the noise distribution, so some non-Gaussian noises (e.g., skewed noise, heavy-tailed noise) are not excluded.
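To make the mode-zero condition concrete, the following small sketch (our illustration, not code from the paper; NumPy with a Gaussian kernel density estimate) draws exponential noise, whose density is maximized at zero while its mean equals one, so condition (2) holds even though $\mathbb{E}(\epsilon|X=x)\neq0$:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.exponential(scale=1.0, size=20_000)  # skewed noise: mode 0, mean 1

# Gaussian KDE on a grid; the argmax of the estimated density approximates the mode.
sigma = 0.05
grid = np.linspace(-0.5, 2.0, 251)
dens = np.exp(-0.5 * ((grid[:, None] - eps[None, :]) / sigma) ** 2).mean(axis=1)
mode_hat = grid[np.argmax(dens)]

print("sample mean:", eps.mean())   # close to 1, so a mean fit would be shifted
print("estimated mode:", mode_hat)  # close to 0 (within a few bandwidths)
```

Under such noise, the conditional mean is biased by the noise mean, while the conditional mode still identifies the underlying trend.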

From (1), we further deduce that:

$f^*(u):=\sum_{j=1}^p f_j^*(u_j)=\mathrm{mode}(Y|X=u)=\arg\max_{t}P_{Y|X}(t|X=u),$

where $u=(u_1,\dots,u_p)^T\in\mathcal{X}$ and $P_{Y|X}$ denotes the density of $Y$ conditional on $X$. Then, the purpose of modal regression is to find the target function $f^*$ according to the empirical data $\mathbf{z}=\{z_i\}_{i=1}^n=\{(x_i,y_i)\}_{i=1}^n$ drawn independently from $\rho$.

For modal regression, the performance of a predictor $f:\mathcal{X}\to\mathbb{R}$ is measured by the mode-based metric:

$\mathcal{R}(f)=\int_{\mathcal{X}}P_{Y|X}(f(x)|X=x)\,d\rho_X(x),$ (3)

where $\rho_X$ is the marginal distribution of $\rho$ with respect to the input space $\mathcal{X}$.

Although the target function $f^*$ is the maximizer of $\mathcal{R}(f)$ over all measurable functions, it cannot be estimated directly via maximizing (3) because $P_{Y|X}$ and $\rho_X$ are unknown. Fortunately, some indirect density-estimation-based strategies were proposed in [14,15,17]. As shown in Theorem 5 of [17], $\mathcal{R}(f)$ equals the density function of the random variable $E_f=Y-f(X)$ at zero, i.e.,

$\mathcal{R}(f)=P_{E_f}(0).$

Therefore, we can find an approximation of $f^*$ by maximizing the empirical version of $P_{E_f}(0)$ with the help of kernel density estimation (KDE).

Let $K_\sigma:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$ be a kernel with bandwidth $\sigma>0$ whose representing function $\phi:\mathbb{R}\to[0,\infty)$ satisfies $\phi\big(\frac{u-u'}{\sigma}\big)=K_\sigma(u,u')$ for all $u,u'\in\mathbb{R}$. Typical kernels used in KDE include the Gaussian kernel, the Epanechnikov kernel, the logistic kernel, and the sigmoid kernel. The KDE-based estimator of $P_{E_f}(0)$ is defined as:

$\hat{P}_{E_f}(0)=\frac{1}{n\sigma}\sum_{i=1}^n K_\sigma(y_i-f(x_i),0)=\frac{1}{n\sigma}\sum_{i=1}^n\phi\Big(\frac{y_i-f(x_i)}{\sigma}\Big):=\hat{\mathcal{R}}_\sigma(f).$

Learning models for modal regression are usually formulated by Tikhonov regularization schemes associated with the empirical metric $\hat{\mathcal{R}}_\sigma(f)$; see, e.g., [15,35].
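As a hedged illustration of this empirical metric (our own sketch, not the paper's code; the Gaussian representing function is assumed), the KDE-based risk can be computed directly, and under skewed mode-zero noise it prefers the true regression function over the mean-shifted fit that least squares would target:

```python
import numpy as np

def empirical_modal_risk(residuals, sigma):
    # (1/(n*sigma)) * sum_i phi((y_i - f(x_i)) / sigma), phi = standard Gaussian density
    r = np.asarray(residuals) / sigma
    return np.exp(-0.5 * r**2).sum() / (len(r) * sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 20_000)
f_star = np.sin(2.0 * np.pi * x)            # true target: the conditional mode is f_star
y = f_star + rng.exponential(1.0, 20_000)   # skewed noise with mode 0 and mean 1

risk_mode = empirical_modal_risk(y - f_star, sigma=0.2)          # residual mode at 0
risk_mean = empirical_modal_risk(y - (f_star + 1.0), sigma=0.2)  # mean-shifted fit
print(risk_mode, risk_mean)  # the modal metric scores the true function higher
```

Maximizing this quantity over a hypothesis space is exactly the estimation principle used below; the bandwidth $\sigma$ trades off robustness against statistical efficiency.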

Naturally, the data-free (population) counterpart of $\hat{\mathcal{R}}_\sigma(f)$ can be defined as:

$\mathcal{R}_\sigma(f)=\frac{1}{\sigma}\int_{\mathcal{X}\times\mathcal{Y}}\phi\Big(\frac{y-f(x)}{\sigma}\Big)\,d\rho(x,y).$

In theory, the learning performance of an estimator $f:\mathcal{X}\to\mathbb{R}$ can be evaluated in terms of $\mathcal{R}(f^*)-\mathcal{R}(f)$, which can be further bounded via $\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(f)$ (see Theorem 10 in [17]).

Remark 1.

As illustrated in [17], when taking $K_\sigma$ as a Gaussian kernel, modal regression maximizing $\mathcal{R}_\sigma(f)$ is consistent with learning under the maximum correntropy criterion (MCC). By employing different kernels, we obtain a rich family of evaluation metrics for robust estimation.

2.2. Mode-Based Sparse Additive Models

The additive model is formulated as follows:

$Y=\sum_{j=1}^p f_j^*(X_j)+\epsilon,$ (4)

where $X_j\in\mathcal{X}_j$ $(j=1,2,\dots,p)$, $Y\in\mathcal{Y}$, and the $f_j^*$ are unknown component functions. By employing nonlinear hypothesis function spaces with an additive structure, the additive model provides better flexibility for regression estimation and variable selection [19]. In [28], the theoretical properties of the sparse additive model with the quantile loss function were discussed. We introduce some basic notation and assumptions in a similar way.

Suppose that $\mathbb{E}f_j^*(X_j)=0$ and $\|f_j^*\|_{K_j}\le 1$ for each $f_j^*$ in (4) with $j\in S$. Here, $f_j^*:\mathcal{X}_j\to\mathbb{R}$ is an unknown univariate function in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_j:=\mathcal{H}_{K_j}$ associated with kernel $K_j$ and norm $\|\cdot\|_{K_j}$ [30,31], and $S\subset\{1,\dots,p\}$ is the intrinsic subset of active components with cardinality $|S|<p$. This means each observation $(x_i,y_i)$ is generated according to:

$y_i=\sum_{j\in S}f_j^*(x_{ij})+\epsilon_i,\quad i=1,\dots,n,$

where $x_i=(x_{i1},\dots,x_{ip})^T\in\mathbb{R}^p$, $f_j^*\in\mathcal{H}_j$, and $\epsilon$ satisfies condition (2).

For any given $j\in\{1,\dots,p\}$, denote $B_r(\mathcal{H}_j)=\{g\in\mathcal{H}_j:\|g\|_{K_j}\le r\}$. The hypothesis space considered here is defined by:

$\mathcal{F}=\Big\{f=\sum_{j=1}^p f_j:\ f_j\in B_r(\mathcal{H}_j),\ j=1,\dots,p\Big\},$ (5)

which is a subset of the RKHS $\mathcal{H}=\{f=\sum_{j=1}^p f_j:\ f_j\in\mathcal{H}_j\}$ with the norm:

$\|f\|_K^2=\inf\Big\{\sum_{j=1}^p\|f_j\|_{K_j}^2:\ f=\sum_{j=1}^p f_j\Big\}.$

For each $X_j$ and the corresponding marginal distribution $\rho_{X_j}$, we denote $\|f_j\|_2^2:=\int_{\mathcal{X}_j}|f_j(u)|^2\,d\rho_{X_j}(u)$. Given inputs $\{x_i\}_{i=1}^n$, define the empirical norm of each $f_j$ as:

$\|f_j\|_n^2:=\frac{1}{n}\sum_{i=1}^n f_j^2(x_{ij}),\quad f_j\in\mathcal{H}_j,\ j\in\{1,\dots,p\}.$

With the help of the mode-based metric (3) and the hypothesis space (5), we formulated the mode-based sparse additive model as:

$\hat{f}=\arg\max_{f\in\mathcal{F}}\Big\{\hat{\mathcal{R}}_\sigma(f)-\lambda_1\sum_{j=1}^p\|f_j\|_n-\lambda_2\sum_{j=1}^p\|f_j\|_{K_j}\Big\},$ (6)

where $(\lambda_1,\lambda_2)$ is a pair of positive regularization parameters. The first regularization term promotes sparsity [11,36], and the second one guarantees smoothness of the solution.

By the representer theorem of kernel methods (e.g., [37]), the solution of (6) admits the following form:

$\hat{f}(u)=\sum_{i=1}^n\sum_{j=1}^p\hat{\alpha}_{ij}K_j(u_j,x_{ij}),\quad u=(u_1,\dots,u_p)^T,$

with a collection of coefficients $\{\hat{\alpha}_j=(\hat{\alpha}_{1j},\dots,\hat{\alpha}_{nj})^T\in\mathbb{R}^n:\ j=1,\dots,p\}$.

The optimal coefficients with respect to (6) solve the following non-convex optimization problem:

$\max_{\alpha_j\in\mathbb{R}^n,\ \alpha_j^TK_j\alpha_j\le1}\Big\{\frac{1}{n}\sum_{i=1}^n\phi\Big(\frac{y_i-\sum_{j=1}^pK_{ji}^T\alpha_j}{\sigma}\Big)-\frac{\lambda_1}{\sqrt{n}}\sum_{j=1}^p\|K_j\alpha_j\|_2-\lambda_2\sum_{j=1}^p\sqrt{\alpha_j^TK_j\alpha_j}\Big\},$

where $K_{ji}=(K_j(x_{1j},x_{ij}),\dots,K_j(x_{nj},x_{ij}))^T\in\mathbb{R}^n$ and $K_j=(K_j(x_{ij},x_{lj}))_{i,l=1}^n=(K_{j1},\dots,K_{jn})\in\mathbb{R}^{n\times n}$.
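The finite-dimensional problem can be made concrete with a small sketch (ours; the Gaussian coordinate kernels and all helper names are illustrative assumptions) that builds the Gram matrices $K_j$ and evaluates the objective for a given coefficient collection $\{\alpha_j\}$:

```python
import numpy as np

def coord_gram(u, v, width=0.5):
    # Gram matrix of a Gaussian kernel K_j restricted to one coordinate
    return np.exp(-0.5 * ((u[:, None] - v[None, :]) / width) ** 2)

def additive_objective(alphas, grams, y, sigma=0.5, lam1=0.01, lam2=0.01):
    # KDE-based modal risk of the additive expansion minus the two Lasso-type penalties
    n = len(y)
    pred = sum(Kj @ aj for Kj, aj in zip(grams, alphas))      # sum_j K_j alpha_j
    r = (y - pred) / sigma
    risk = np.exp(-0.5 * r**2).mean() / (sigma * np.sqrt(2.0 * np.pi))
    pen_n = lam1 / np.sqrt(n) * sum(np.linalg.norm(Kj @ aj) for Kj, aj in zip(grams, alphas))
    pen_K = lam2 * sum(np.sqrt(max(aj @ Kj @ aj, 0.0)) for Kj, aj in zip(grams, alphas))
    return risk - pen_n - pen_K

# toy usage: n = 40 samples, p = 3 coordinates, random coefficients
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, (40, 3))
y = np.sin(2.0 * np.pi * X[:, 0]) + rng.normal(0.0, 0.1, 40)
grams = [coord_gram(X[:, j], X[:, j]) for j in range(3)]
alphas = [rng.normal(0.0, 0.1, 40) for _ in range(3)]
val = additive_objective(alphas, grams, y)
print(val)  # a finite scalar; the optimization maximizes it over the alphas
```

Note that the empirical norm satisfies $\|f_j\|_n=\|K_j\alpha_j\|_2/\sqrt{n}$, which is exactly the $\lambda_1$ term of the finite-dimensional objective.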

Remark 2.

There are various combinations of sparsity and smoothness regularization for additive models [2,3,29,30,31]. The regularization in this paper adopts a two-fold group-Lasso scheme, which was employed in [28] in the quantile regression setting; it is also different from the coefficient-based regularized modal regression in [19].

Remark 3.

From the lens of computation, the proposed algorithm (6) can be transformed into a sequence of regularized least-squares problems by HQ optimization [32]. Each transformed subproblem can then be tackled easily with SOCP [33].
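Remark 3 can be illustrated with a simplified sketch (ours, not the paper's implementation): for the Gaussian representing function, each HQ step fixes weights $w_i=\phi((y_i-f(x_i))/\sigma)$ and solves a weighted regularized least-squares problem. For readability we replace the two Lasso-type norms with a single smooth kernel-ridge penalty; the actual algorithm keeps both norms and solves an SOCP at this step.

```python
import numpy as np

def hq_modal_fit(K, y, sigma=0.5, lam=0.01, iters=30):
    # Half-quadratic iteration for kernel modal regression with Gaussian phi:
    # (i) fix the HQ auxiliary weights w_i from the current residuals,
    # (ii) solve the resulting weighted regularized least-squares problem.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(iters):
        r = y - K @ alpha
        w = np.exp(-0.5 * (r / sigma) ** 2)          # HQ auxiliary variables
        # weighted ridge step: (diag(w) K + n*lam*I) alpha = diag(w) y
        alpha = np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * y)
    return alpha

# toy usage: a 1-D curve contaminated by a few gross outliers
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 1.0, 60))
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, 60)
y[::15] += 5.0                                       # gross outliers
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)
fit = K @ hq_modal_fit(K, y)
```

Outliers receive weights $w_i\approx0$ and are effectively ignored by the least-squares step, which is the computational source of the robustness discussed above.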

3. Error Analysis

This section states upper bounds for the excess quantity $\mathcal{R}(f^*)-\mathcal{R}(\hat{f})$. For ease of presentation, we only considered the special setting where $\mathcal{H}_j\equiv\mathcal{H}_{j'}$ for all $j,j'\in\{1,\dots,p\}$, and we denote $\sum_{j=1}^p\mathcal{H}_j$ by $\mathcal{H}_K$ with $\sup_x K(x,x)\le1$.

Recall that the Mercer kernel $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ admits the following spectral expansion [38]:

$K(x,x')=\sum_{\ell\ge1}b_\ell\psi_\ell(x)\psi_\ell(x'),\quad x,x'\in\mathcal{X},$

where $\{(b_\ell,\psi_\ell)\}_{\ell\ge1}$ are the eigenvalue–eigenfunction pairs of the integral operator $T:f\mapsto\int_{\mathcal{X}}K(\cdot,x)f(x)\,d\rho_X(x)$ with $b_1\ge b_2\ge\cdots\ge0$.

To evaluate the complexity of $\mathcal{H}_K$ in terms of the decay rate of the eigenvalues $\{b_\ell\}_{\ell\ge1}$ [27,28], we refer to Assumption 1 in [28] as the basis of our analysis.

Assumption 1.

There exist $s\in(0,1)$ and a constant $c_1>0$ such that $b_\ell\le c_1\ell^{-\frac{1}{s}}$, $\forall\ell\ge1$.

As illustrated in [27,28], the requirement $s<1$ is a weak condition since $\sum_{\ell\ge1}b_\ell=\mathbb{E}K(x,x)\le1$. In particular, $b_\ell\asymp\ell^{-2h}$ holds for the Sobolev space $\mathcal{H}_K=W_2^h$ $(h>\frac{1}{2})$ with the Lebesgue measure on $[0,1]$.

To describe the hypothesis space in the RKHS, we refer to Assumption 2 in [28].

Assumption 2.

For some $s\in(0,1)$ given in Assumption 1, there exists a positive constant $c_2$ such that $\|f\|_\infty\le c_2\|f\|_2^{1-s}\|f\|_K^s$, $\forall f\in\mathcal{H}_K$.

Remark 4.

To understand the statistical performance of the proposed estimator without any “correlatedness” conditions on the covariates, Rademacher complexity [39] was used in [28] to measure functional complexity; our analysis draws on that experience.

In general, Assumption 2 is stronger than Assumption 1 and is satisfied when the RKHS is continuously embeddable in a Sobolev space. For uniformly bounded $\{\psi_\ell\}_{\ell\ge1}$, this sup-norm condition is consistent with Assumption 1.

For any given independent input variables $\{x_i\}_{i=1}^n\subset\mathcal{X}$, define the Rademacher complexity:

$R_n(f):=\frac{1}{n}\sum_{i=1}^n\sigma_i f(x_i),\quad f\in\mathcal{H}_K,$

where $\{\sigma_i\}_{i=1}^n$ is an i.i.d. sequence of Rademacher variables taking values in $\{\pm1\}$ with probability $1/2$ each. As shown in [40], it holds:

$\mathbb{E}\Big[\sup_{\|f\|_K=1,\|f\|_2\le t}|R_n(f)|\Big]\le\frac{1}{\sqrt{n}}\Big[\sum_{\ell\ge1}\min\{t^2,b_\ell\}\Big]^{\frac{1}{2}}.$

Moreover, from Assumption 1, define:

$\gamma_n:=\inf\Big\{\gamma\ge\sqrt{\tfrac{A\log\tilde{p}}{n}}:\ \mathbb{E}\Big[\sup_{\|f\|_K=1,\|f\|_2\le t}|R_n(f)|\Big]\le\gamma t+\gamma^2,\ \forall t\in(0,1)\Big\}\asymp\max\Big\{\sqrt{\tfrac{A\log\tilde{p}}{n}},\Big(\tfrac{1}{n}\Big)^{\frac{1}{2(1+\alpha)}}\Big\}.$

The main idea of our error analysis is to first establish a probabilistic result on a suitably defined event and then investigate the behavior of $\hat{f}$ in (6) conditional on that event.

Define $\eta(t):=\max\{1,\sqrt{t},t/\sqrt{n}\}$ for any $t>0$ and $\xi_n:=\xi_n(\lambda)=\max\Big\{\lambda^{\frac{\alpha}{2}}n^{-\frac{1}{2}},\ \lambda^{-\frac{1}{2}}n^{-\frac{1}{1+\alpha}},\ \sqrt{\tfrac{\log p}{n}}\Big\}$, and consider the event:

$\theta(t)=\Big\{\Big|\frac{1}{n}\sum_{i=1}^n\epsilon_i f(x_i)\Big|\le c_\alpha\eta(t)\xi_n\big(\|f\|_2+\lambda^{\frac{1}{2}}\|f\|_K\big),\ \forall f\in\mathcal{H}_K\Big\},$

where $\{\epsilon_i\}_{i=1}^n$ are zero-mean i.i.d. random variables with $|\epsilon_i|\le L$, and $c_\alpha$ is a constant depending on $\alpha$ and $L$.

Remark 5.

To analyze the behavior of the regularized estimator conditioned on this event, several basic facts about empirical processes were introduced in [28]. Our work can be boiled down to this framework, and we introduce the relevant lemmas from [28] as stepping stones.

Lemma 1.

Let Assumptions 1 and 2 hold. If $\frac{\log p}{n}\le1$, it holds:

$P(\theta(t))\ge1-\exp\{-t\},\quad\forall\lambda>0,\ t\ge1.$

The following lemma (see also Theorem 4 in [41]) demonstrates the relationship between the empirical norm $\|\cdot\|_n$ and $\|\cdot\|_2$ for functions in $\mathcal{H}_K$.

Lemma 2.

For $A\ge1$ and any given $\tilde{p}\ge p$ with $\log\tilde{p}\ge2\log\log n$, there exists a constant $c$ such that:

$\|f\|_2\le c\big(\|f\|_n+\gamma_n\|f\|_K\big)$

and:

$\|f\|_n\le c\big(\|f\|_2+\gamma_n\|f\|_K\big)$

with confidence at least $1-\tilde{p}^{-A}$, where $\gamma_n\asymp\max\Big(\sqrt{\tfrac{A\log\tilde{p}}{n}},\big(\tfrac{1}{n}\big)^{\frac{1}{2(1+\alpha)}}\Big)$.

Lemma 3.

Let $\{z_i\}_{i=1}^n\subset\mathcal{Z}$ be independent random variables, and let $\Gamma$ be a class of real-valued functions on $\mathcal{Z}$ satisfying:

$\|\gamma\|_\infty\le\eta_n,\ \forall\gamma\in\Gamma,\quad\text{and}\quad\frac{1}{n}\sum_{i=1}^n\mathrm{var}(\gamma(z_i))\le\iota_n^2,$

for some positive constants $\eta_n$ and $\iota_n$. Define $\zeta:=\sup_{\gamma\in\Gamma}\big|\frac{1}{n}\sum_{i=1}^n\gamma(z_i)-\mathbb{E}\gamma(z)\big|$. Then,

$P\Big\{\zeta\ge\mathbb{E}\zeta+\sqrt{\frac{2t(\iota_n^2+2\eta_n\mathbb{E}\zeta)}{n}}+\frac{2\eta_n t}{3n}\Big\}\le\exp\{-t\}.$

For any given $\Delta_-$ and $\Delta_+$, define:

$\mathcal{F}(\Delta_-,\Delta_+)=\Big\{f=\sum_{j=1}^p f_j\in\mathcal{H}_K:\ \gamma_n\sum_{j=1}^p\|f_j-f_j^*\|_2\le\Delta_-,\ \gamma_n^2\sum_{j=1}^p\|f_j-f_j^*\|_K\le\Delta_+\Big\}.$

Lemma 4.

Let Assumptions 1 and 2 hold for each $\mathcal{H}_j$. For any given $A\ge2$, with confidence at least $1-\tilde{p}^{-A}$, it holds:

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(f)-\big(\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(f)\big)\le c^*\eta(t_0)(\Delta_-+\Delta_+)+e^{-\tilde{p}},$

for any $f\in\mathcal{F}(\Delta_-,\Delta_+)$ with $\max\{\Delta_-,\Delta_+\}\le e^{\tilde{p}}$, where $t_0=2\log(2\sqrt{3}/\log2)+A\log\tilde{p}+2\log\tilde{p}$, $\lambda=n^{-\frac{1}{1+\alpha}}$, and $c^*$ is a positive constant.

Proof. 

Denote $\Gamma=\Big\{\gamma(z):\ \gamma(z)=\frac{1}{\sigma}\phi\Big(\frac{y-f^*(x)}{\sigma}\Big)-\frac{1}{\sigma}\phi\Big(\frac{y-f(x)}{\sigma}\Big),\ f\in\mathcal{F}(\Delta_-,\Delta_+)\Big\}$. It is easy to verify that:

$\mathbb{E}\gamma(z)-\frac{1}{n}\sum_{i=1}^n\gamma(z_i)=\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(f)-\big(\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(f)\big),\quad\forall\gamma\in\Gamma.$

Let $\zeta:=\sup_{\gamma\in\Gamma}\big|\frac{1}{n}\sum_{i=1}^n\gamma(z_i)-\mathbb{E}\gamma(z)\big|$. From Lemma 3, we have:

$\zeta\le\mathbb{E}\zeta+\sqrt{\frac{2t(\iota_n^2+2\eta_n\mathbb{E}\zeta)}{n}}+\frac{2\eta_n t}{3n},$ (7)

with probability at least $1-\exp\{-t\}$, where $\eta_n=\sup_{\gamma\in\Gamma}\|\gamma\|_\infty$ and $\iota_n^2=\sup_{\gamma\in\Gamma}\frac{1}{n}\sum_{i=1}^n\mathrm{var}(\gamma(z_i))$. Observing that:

$\sqrt{\frac{2t(\iota_n^2+2\eta_n\mathbb{E}\zeta)}{n}}\le\sqrt{\frac{2t\iota_n^2}{n}}+2\sqrt{\frac{t\eta_n\mathbb{E}\zeta}{n}}\le\sqrt{\frac{2t}{n}}\,\iota_n+\mathbb{E}\zeta+\frac{t\eta_n}{n},$ (8)

we can take:

$\iota_n^2\le2\mathbb{E}(\gamma(z))^2=2\mathbb{E}\Big(\frac{1}{\sigma}\phi\Big(\frac{y-f^*(x)}{\sigma}\Big)-\frac{1}{\sigma}\phi\Big(\frac{y-f(x)}{\sigma}\Big)\Big)^2\le\frac{2\|\phi'\|_\infty^2}{\sigma^4}\|f-f^*\|_2^2\le\frac{2\|\phi'\|_\infty^2}{\sigma^4}\cdot\frac{\Delta_-^2}{\gamma_n^2},$ (9)

and:

$\eta_n=\sup_{\gamma\in\Gamma}\|\gamma\|_\infty\le\frac{\|\phi'\|_\infty}{\sigma^2}\|f^*-f\|_\infty\le\frac{\|\phi'\|_\infty}{\sigma^2}\|f^*-f\|_K\le\frac{\|\phi'\|_\infty}{\sigma^2}\cdot\frac{\Delta_+}{\gamma_n^2}.$ (10)

Combining (7)–(10), we obtain, with confidence at least $1-\exp\{-t\}$,

$\zeta\le2\mathbb{E}\zeta+\frac{\sqrt{2}\|\phi'\|_\infty\Delta_-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}}+\frac{\kappa\|\phi'\|_\infty\Delta_+}{\sigma^2\gamma_n^2}\cdot\frac{1+t}{n}.$

By a symmetrization technique in [42], we have:

$\mathbb{E}\zeta\le2\mathbb{E}R_n(\Gamma)\le\frac{2\|\phi'\|_\infty}{\sigma^2}\mathbb{E}R_n(\mathcal{F}-f^*).$

Applying Lemma 3 to $R_n(\mathcal{F}-f^*)$, we obtain that:

$\mathbb{E}[R_n(\mathcal{F}-f^*)]\le R_n(\mathcal{F}-f^*)+\frac{4\Delta_-}{\gamma_n}\sqrt{\frac{2t}{n}}+\frac{\Delta_+}{\gamma_n^2}\cdot\frac{1+t}{n},$

with probability at least $1-2\exp\{-t\}$. Moreover, with probability at least $1-2\exp\{-t\}$, it holds:

$\zeta\le\frac{8\|\phi'\|_\infty}{\sigma^2}R_n(\mathcal{F}-f^*)+\frac{6\|\phi'\|_\infty\Delta_-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}}+\frac{5\|\phi'\|_\infty\Delta_+}{\gamma_n^2\sigma^2}\cdot\frac{1+t}{n}\le\frac{8\|\phi'\|_\infty}{\sigma^2}\sum_{j=1}^p R_n(\mathcal{H}_j-f_j^*)+\frac{6\|\phi'\|_\infty\Delta_-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}}+\frac{5\|\phi'\|_\infty\Delta_+}{\gamma_n^2\sigma^2}\cdot\frac{1+t}{n}.$

For the event θ(t), Lemma 1 demonstrates that:

$|R_n(f)|\le c_\alpha\eta(t)\xi_n\big(\|f\|_2+\lambda^{\frac{1}{2}}\|f\|_K\big),\quad\forall f\in\mathcal{H}_K,\ \lambda>0,$

with confidence $1-\exp\{-t\}$. Then,

$\zeta\le\frac{8\|\phi'\|_\infty c_\alpha\eta(t)\xi_n}{\sigma^2}\sup_{f\in\mathcal{F}}\Big\{\sum_{j=1}^p\|f_j-f_j^*\|_2+\lambda^{\frac{1}{2}}\sum_{j=1}^p\|f_j-f_j^*\|_K\Big\}+\frac{6\|\phi'\|_\infty\Delta_-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}}+\frac{5\|\phi'\|_\infty\Delta_+}{\gamma_n^2\sigma^2}\cdot\frac{1+t}{n}.$

Taking $\lambda=n^{-\frac{1}{1+\alpha}}$, we can verify that $\xi_n\le c\gamma_n$ and $\xi_n\lambda^{\frac{1}{2}}\le c\gamma_n^2$. Then,

$\zeta\le\frac{8c_\alpha\eta(t)\|\phi'\|_\infty}{\sigma^2}(\Delta_++\Delta_-)+\frac{6\Delta_-\|\phi'\|_\infty}{\sigma^2}\sqrt{\frac{t}{A\log\tilde{p}}}+\frac{5\Delta_+\|\phi'\|_\infty}{\sigma^2}\cdot\frac{1+t}{A\log\tilde{p}},$

on the corresponding event $\theta(\Delta_-,\Delta_+)$.

For $t_0=2\log(2\sqrt{3}/\log2)+A\log\tilde{p}+2\log\tilde{p}$, the range $e^{-\tilde{p}}\le\Delta_-\le e^{\tilde{p}}$ and $e^{-\tilde{p}}\le\Delta_+\le e^{\tilde{p}}$ can be covered by the $(2\tilde{p}+1)^2$ different discrete pairs $\Delta_{-j}=\Delta_{+j}:=2^j$, $j=-\tilde{p},\dots,\tilde{p}$, and we deduce that:

$P\Big(\bigcap_{j,k}\theta(\Delta_{-j},\Delta_{+k})\Big)\ge1-3\Big(\frac{2}{\log2}\Big)^2\tilde{p}^2\exp\Big\{-2\log\frac{2\sqrt{3}}{\log2}-A\log\tilde{p}-2\log\tilde{p}\Big\}\ge1-\tilde{p}^{-A}.$

When $\Delta_-\le e^{-\tilde{p}}$ or $\Delta_+\le e^{-\tilde{p}}$, it is trivial to obtain the desired result. □

The proof of Lemma 4 adapts the proof of Proposition 1 in [28] for quantile regression. We now state our main result on the error bound.

Theorem 1.

Let the regularization parameters of $\hat{f}$ defined in (6) be $\lambda_1=\xi\gamma_n$ and $\lambda_2=\xi\gamma_n^2$, where $\xi=\max\{2c\eta(t_0),4\}$ with $\eta(t)=\max\{1,\sqrt{t},t/\sqrt{n}\}$, $t_0=2\log(2\sqrt{3}/\log2)+A\log\tilde{p}+2\log\tilde{p}$, and $\gamma_n\asymp\max\Big(\sqrt{\tfrac{A\log\tilde{p}}{n}},\big(\tfrac{1}{n}\big)^{\frac{1}{2(1+\alpha)}}\Big)$. Under the conditions of Assumptions 1 and 2, for any $\tilde{p}\ge p$ such that $\log p\le n$ and $\log\tilde{p}\ge2\log\log n$, and for some constant $A\ge2$, it holds with probability at least $1-2\tilde{p}^{-A}$:

$\mathcal{R}(f^*)-\mathcal{R}(\hat{f})\le c_s\sqrt{\|\phi'\|_\infty}\,(\eta(t_0))^{\frac{5}{4}}\sqrt{\gamma_n}\le c(\eta(t_0))^{\frac{5}{4}}\max\Big\{\Big(\frac{A\log\tilde{p}}{n}\Big)^{\frac{1}{4}},\Big(\frac{1}{n}\Big)^{\frac{1}{4(1+\alpha)}}\Big\}\le c\max\Big\{(A\log\tilde{p})^{\frac{7}{8}}n^{-\frac{1}{4}},\ (A\log\tilde{p})^{\frac{5}{8}}n^{-\frac{1}{4+4\alpha}},\ (A\log\tilde{p})^{\frac{3}{2}}n^{-\frac{3}{4}},\ (A\log\tilde{p})^{\frac{5}{4}}n^{-\frac{3+2\alpha}{4+4\alpha}}\Big\}.$

Proof. 

By the definition of $\hat{f}$ in (6), we know that:

$\hat{\mathcal{R}}_\sigma(\hat{f})-\lambda_1\sum_{j=1}^p\|\hat{f}_j\|_n-\lambda_2\sum_{j=1}^p\|\hat{f}_j\|_{K_j}\ge\hat{\mathcal{R}}_\sigma(f^*)-\lambda_1\sum_{j=1}^p\|f_j^*\|_n-\lambda_2\sum_{j=1}^p\|f_j^*\|_{K_j}.$

This implies that:

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})+\lambda_1\sum_{j=1}^p\|\hat{f}_j\|_n+\lambda_2\sum_{j=1}^p\|\hat{f}_j\|_{K_j}\le\big[\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\big]-\big[\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(\hat{f})\big]+\lambda_1\sum_{j=1}^p\|f_j^*\|_n+\lambda_2\sum_{j=1}^p\|f_j^*\|_{K_j}.$

Moreover,

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\le\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})+\lambda_1\sum_{j\notin S}\|\hat{f}_j\|_n+\lambda_2\sum_{j\notin S}\|\hat{f}_j\|_K\le\big[\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\big]-\big[\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(\hat{f})\big]+\lambda_1\sum_{j\in S}\big(\|f_j^*\|_n-\|\hat{f}_j\|_n\big)+\lambda_2\sum_{j\in S}\big(\|f_j^*\|_K-\|\hat{f}_j\|_K\big)\le\big[\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\big]-\big[\hat{\mathcal{R}}_\sigma(f^*)-\hat{\mathcal{R}}_\sigma(\hat{f})\big]+\lambda_1\sum_{j\in S}\|\hat{f}_j-f_j^*\|_n+\lambda_2\sum_{j\in S}\|\hat{f}_j-f_j^*\|_K.$ (11)

Taking $\lambda_1=\xi\gamma_n$ and $\lambda_2=\xi\gamma_n^2$ with $\gamma_n\asymp\max\Big\{\sqrt{\tfrac{A\log\tilde{p}}{n}},\big(\tfrac{1}{n}\big)^{\frac{1}{2+2\alpha}}\Big\}$, $\alpha\in(0,1)$, we deduce that:

$\gamma_n\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_2\le2p\gamma_n\le2\tilde{p}\le e^{\tilde{p}},\quad\forall n\ge1,\ \tilde{p}\ge p,$

and:

$\gamma_n^2\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_{K_j}\le\gamma_n\cdot\gamma_n\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_{K_j}\le e^{\tilde{p}}.$

Therefore, we verify that $\hat{f}\in\mathcal{F}(\Delta_-,\Delta_+)$ with $\Delta_-\le e^{\tilde{p}}$ and $\Delta_+\le e^{\tilde{p}}$. With the choices $\lambda_2=\lambda_1\gamma_n=\xi\gamma_n^2$, it holds:

$\lambda_1\|\hat{f}_j-f_j^*\|_n+\lambda_2\|\hat{f}_j-f_j^*\|_K\le2(\lambda_1+\lambda_2)\le4\xi\gamma_n,\quad\forall j\in S,$

due to the fact that $\|f_j\|_n\le\|f_j\|_K\le1$ for the component functions considered here.

According to Lemma 4 and (11), we obtain:

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\le\frac{c\eta(t_0)\|\phi'\|_\infty}{\sigma^2}\Big(\gamma_n\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_2+\gamma_n^2\sum_{j=1}^p\|\hat{f}_j-f_j^*\|_K\Big)+\lambda_1\sum_{j\in S}\|\hat{f}_j-f_j^*\|_n+\lambda_2\sum_{j\in S}\|\hat{f}_j-f_j^*\|_K+e^{-\tilde{p}}\le\frac{c\eta(t_0)\|\phi'\|_\infty}{\sigma^2}\xi\gamma_n+e^{-\tilde{p}},$

with probability at least $1-2\tilde{p}^{-A}$.

Notice that $\log\tilde{p}\ge2\log\log n$ implies $e^{-\tilde{p}}\le n^{-2}\le\gamma_n$. Then:

$\mathcal{R}_\sigma(f^*)-\mathcal{R}_\sigma(\hat{f})\le\frac{c\eta(t_0)\|\phi'\|_\infty\xi\gamma_n}{\sigma^2}.$

Combining this with Theorem 9 in [17] and setting $\sigma=(\|\phi'\|_\infty\eta(t_0)\xi\gamma_n)^{\frac{1}{4}}$, we obtain the desired result. □

The proof of Theorem 1 is inspired by that of Theorem 1 in [28]; see [28] for more details. According to Theorem 1, we can conclude that the mode-based SpAM achieves a learning rate with polynomial decay $O(n^{-\frac{1}{4+4\alpha}})$, since $\alpha\in(0,1)$ and $A,\tilde{p}$ are positive constants.

4. Experimental Evaluation

To demonstrate the effectiveness of our method, in this section, we evaluated our model on some synthetic datasets. The data in $\mathbb{R}^p$ with dimension $p=5$ and $p=10$ were generated randomly according to the uniform distribution on the interval $[0,1]$. Then, we computed the MSE of our estimator $\hat{f}$. Figure 1, Figure 2, and Figure 3 depict the MSE of $\hat{f}$ when the parameter pair $(\lambda_1,\lambda_2)=(0,1),(1,0)$, and $(1,1)$, respectively, while the number of samples $n$ varies from 50/60 to 80/90. This paper used Yalmip [43] for modeling in the MATLAB environment and called fmincon to solve the problem. From the figures, we can observe that the MSEs tended to decrease as the number of samples $n$ increased under all three parameter settings, which verified that our method is effective for the regression of high-dimensional data.
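A minimal version of this protocol can be sketched as follows (our illustration: the paper's exact component functions and solver are not listed, so the additive target and the plain kernel ridge stand-in estimator below are assumptions). The MSE against the noise-free target typically shrinks as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n, p=5):
    # covariates uniform on [0,1]^p as in the experiments; the additive target
    # below (two active components out of p) is purely illustrative
    X = rng.uniform(0.0, 1.0, (n, p))
    f = np.sin(2.0 * np.pi * X[:, 0]) + X[:, 1] ** 2
    return X, f, f + rng.normal(0.0, 0.2, n)

def krr(Xtr, ytr, Xte, width=0.5, lam=1e-2):
    # plain Gaussian kernel ridge regression as a stand-in estimator
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * width**2))
    alpha = np.linalg.solve(gram(Xtr, Xtr) + lam * len(ytr) * np.eye(len(ytr)), ytr)
    return gram(Xte, Xtr) @ alpha

Xte, fte, _ = make_data(500)
mses = []
for n in (50, 200, 800):
    Xtr, _, ytr = make_data(n)
    mses.append(np.mean((krr(Xtr, ytr, Xte) - fte) ** 2))
    print(n, round(mses[-1], 4))  # MSE tends to decrease as n grows
```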

Figure 1. MSE of $\hat{f}$ when $(\lambda_1,\lambda_2)=(0,1)$.

Figure 2. MSE of $\hat{f}$ when $(\lambda_1,\lambda_2)=(1,0)$.

Figure 3. MSE of $\hat{f}$ when $(\lambda_1,\lambda_2)=(1,1)$.

5. Conclusions

In this work, we proposed a mode-based sparse additive model and established its generalization error bound. The theoretical results extend the previous mean-based analysis to the mode-based setting. We demonstrated that the mode-based SpAM can achieve a learning rate with polynomial decay $O(n^{-\frac{1}{4+4\alpha}})$, which is comparable to the previous result $O(n^{-\frac{1}{7}})$ in [15]. In the future, it will be important to further explore the variable selection consistency of the proposed model.

Author Contributions

Conceptualization, H.D., B.S., J.C. and Z.P.; methodology, H.D. and Z.P.; validation, B.S. and Z.P.; formal analysis, H.D., B.S. and Z.P.; investigation, H.D. and J.C.; resources, Z.P.; data curation, H.D. and J.C.; writing—original draft preparation, H.D. and J.C.; writing—review and editing, H.D. and J.C.; visualization, H.D. and J.C.; supervision, B.S. and Z.P.; project administration, B.S. and Z.P.; funding acquisition, H.D., B.S. and Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Fundamental Research Funds for the Central Universities of China (Grant Nos. 2662019FW003 and 2662020LXQD001) and the National Natural Science Foundation of China (Grant No. 12001217).

Data Availability Statement

The synthetic data generation method of the simulation experiment has been introduced in the experimental part.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Xia Y., Hou Y., Lv S. Learning Rates for Partially Linear Support Vector Machine in High Dimensions. Anal. Appl. 2021;19:167–182. doi: 10.1142/S0219530520400126.
2. Ravikumar P., Liu H., Lafferty J., Wasserman L. SpAM: Sparse Additive Models. J. R. Stat. Soc. Ser. B. 2009;71:1009–1030. doi: 10.1111/j.1467-9868.2009.00718.x.
3. Yin J., Chen X., Xing E.P. Group Sparse Additive Models. Proceedings of the International Conference on Machine Learning (ICML); Edinburgh, UK, 26 June–1 July 2012; pp. 1643–1650.
4. Lin Y., Zhang H.H. Component Selection and Smoothing in Multivariate Nonparametric Regression. Ann. Stat. 2006;34:2272–2297. doi: 10.1214/009053606000000722.
5. Zhao T., Liu H. Sparse Additive Machine. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS); La Palma, Spain, 21–23 April 2012; pp. 1435–1443.
6. Chen H., Wang X., Deng C., Huang H. Group Sparse Additive Machine. Proceedings of the Advances in Neural Information Processing Systems (NIPS); Long Beach, CA, USA, 4–9 December 2017; pp. 197–207.
7. Kandasamy K., Yu Y. Additive Approximations in High Dimensional Nonparametric Regression via the SALSA. Proceedings of the International Conference on Machine Learning (ICML); New York, NY, USA, 19–24 June 2016; pp. 69–78.
8. Wang Y., Chen H., Zheng F., Xu C., Gong T., Chen Y. Multi-task Additive Models for Robust Estimation and Automatic Structure Discovery. Proceedings of the Advances in Neural Information Processing Systems (NIPS); Online, 6–12 December 2020; pp. 11744–11755.
9. Chen H., Liu G., Huang H. Sparse Shrunk Additive Models. Proceedings of the International Conference on Machine Learning (ICML); Vienna, Austria, 12–18 July 2020; pp. 6194–6204.
10. Chen H., Guo C., Xiong H., Wang Y. Sparse Additive Machine with Ramp Loss. Anal. Appl. 2021;19:509–528. doi: 10.1142/S0219530520400011.
11. Meier L., Geer S.V.D., Buhlmann P. High-dimensional Additive Modeling. Ann. Stat. 2009;37:3779–3821. doi: 10.1214/09-AOS692.
12. Raskutti G., Wainwright M.J., Yu B. Minimax-optimal Rates for Sparse Additive Models over Kernel Classes via Convex Programming. J. Mach. Learn. Res. 2012;13:389–427.
13. Kemp G.C.R., Silva J.M.C.S. Regression towards the Mode. J. Econom. 2012;170:92–101. doi: 10.1016/j.jeconom.2012.03.002.
14. Yao W., Li L. A New Regression Model: Modal Linear Regression. Scand. J. Stat. 2014;41:656–671. doi: 10.1111/sjos.12054.
15. Wang X., Chen H., Cai W., Shen D., Huang H. Regularized Modal Regression with Applications in Cognitive Impairment Prediction. Proceedings of the Advances in Neural Information Processing Systems (NIPS); Long Beach, CA, USA, 4–9 December 2017; pp. 1448–1458.
16. Chen Y.C., Genovese C.R., Tibshirani R.J., Wasserman L. Nonparametric Modal Regression. Ann. Stat. 2014;44:489–514. doi: 10.1214/15-AOS1373.
17. Feng Y., Fan J., Suykens J. A Statistical Learning Approach to Modal Regression. J. Mach. Learn. Res. 2020;21:1–35.
18. Collomb G., Härdle W., Hassani S. A Note on Prediction via Estimation of the Conditional Mode Function. J. Stat. Plan. Inference. 1986;15:227–236. doi: 10.1016/0378-3758(86)90099-6.
19. Chen H., Wang Y., Zheng F., Deng C., Huang H. Sparse Modal Additive Model. IEEE Trans. Neural Netw. Learn. Syst. 2020:1–15. doi: 10.1109/TNNLS.2020.3005144.
20. Li J., Ray S., Lindsay B.G. A Nonparametric Statistical Approach to Clustering via Mode Identification. J. Mach. Learn. Res. 2007;8:1687–1723.
21. Einbeck J., Tutz G. Modeling beyond Regression Function: An Application of Multimodal Regression to Speed-flow Data. J. R. Stat. Soc. Ser. C Appl. Stat. 2006;55:461–475. doi: 10.1111/j.1467-9876.2006.00547.x.
22. Tibshirani R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996;58:267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
23. Feng Y., Huang X., Shi L., Yang Y., Suykens J.A. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. J. Mach. Learn. Res. 2015;16:993–1034.
24. Lv F., Fan J. Optimal Learning with Gaussians and Correntropy Loss. Anal. Appl. 2019;19:107–124. doi: 10.1142/S0219530519410124.
25. Yao W., Lindsay B.G., Li R. Local Modal Regression. J. Nonparametr. Stat. 2012;24:647–663. doi: 10.1080/10485252.2012.678848.
26. Chen Y. Modal Regression using Kernel Density Estimation: A Review. Wiley Interdiscip. Rev. Comput. Stat. 2018;10:e1431. doi: 10.1002/wics.1431.
27. Steinwart I., Christmann A. Support Vector Machines. Springer Science and Business Media; Berlin/Heidelberg, Germany: 2008.
28. Lv S., Lin H., Lian H., Huang J. Oracle Inequalities for Sparse Additive Quantile Regression in Reproducing Kernel Hilbert Space. Ann. Stat. 2018;46:781–813. doi: 10.1214/17-AOS1567.
29. Huang J., Horowitz J.L., Wei F. Variable Selection in Nonparametric Additive Models. Ann. Stat. 2010;38:2282–2313. doi: 10.1214/09-AOS781.
30. Christmann A., Zhou D.X. Learning Rates for the Risk of Kernel-based Quantile Regression Estimators in Additive Models. Anal. Appl. 2016;14:449–477. doi: 10.1142/S0219530515500050.
31. Yuan M., Zhou D.X. Minimax Optimal Rates of Estimation in High Dimensional Additive Models. Ann. Stat. 2016;44:2564–2593. doi: 10.1214/15-AOS1422.
32. Nikolova M., Ng M.K. Analysis of Half-quadratic Minimization Methods for Signal and Image Recovery. SIAM J. Sci. Comput. 2006;27:937–966. doi: 10.1137/030600862.
33. Alizadeh F., Goldfarb D. Second-Order Cone Programming. Math. Program. 2003;95:3–51. doi: 10.1007/s10107-002-0339-5.
34. Guo C., Song B., Wang Y., Chen H., Xiong H. Robust Variable Selection and Estimation Based on Kernel Modal Regression. Entropy. 2019;21:403. doi: 10.3390/e21040403.
35. Wang Y., Tang Y.Y., Li L., Chen H. Modal Regression-based Atomic Representation for Robust Face Recognition and Reconstruction. IEEE Trans. Cybern. 2020;50:4393–4405. doi: 10.1109/TCYB.2019.2903205.
36. Suzuki T., Sugiyama M. Fast Learning Rate of Multiple Kernel Learning: Trade-off between Sparsity and Smoothness. Ann. Stat. 2013;41:1381–1405. doi: 10.1214/13-AOS1095.
37. Schölkopf B., Smola A.J. Learning with Kernels. The MIT Press; Cambridge, MA, USA: 2002.
38. Aronszajn N. Theory of Reproducing Kernels. Trans. Am. Math. Soc. 1950;68:337–404. doi: 10.1090/S0002-9947-1950-0051437-7.
39. Bartlett P.L., Bousquet O., Mendelson S. Localized Rademacher Complexities. Proceedings of the Conference on Computational Learning Theory (COLT); Sydney, Australia, 8–10 July 2002; pp. 44–58.
40. Mendelson S. Geometric Parameters of Kernel Machines. Proceedings of the Conference on Computational Learning Theory (COLT); Sydney, Australia, 8–10 July 2002; pp. 29–43.
41. Koltchinskii V., Yuan M. Sparsity in Multiple Kernel Learning. Ann. Stat. 2010;38:3660–3695. doi: 10.1214/10-AOS825.
42. Van de Geer S. Empirical Processes in M-Estimation. Cambridge University Press; Cambridge, UK: 2000.
43. Löfberg J. Automatic Robust Convex Programming. Optim. Methods Softw. 2012;27:115–129. doi: 10.1080/10556788.2010.517532.
