Journal of Inequalities and Applications. 2018 Aug 30;2018(1):225. doi: 10.1186/s13660-018-1819-3

M-estimation in high-dimensional linear model

Kai Wang, Yanling Zhu

Abstract

We study the M-estimation method for the high-dimensional linear regression model and discuss the properties of the M-estimator when the penalty term is handled by a local linear approximation. The M-estimation framework covers least absolute deviation, quantile regression, least squares regression and Huber regression as special cases. We show that the proposed estimator possesses good properties under certain assumptions. In the numerical simulations we select a suitable algorithm and demonstrate the robustness of the method.

Keywords: M-estimation, High-dimensionality, Variable selection, Oracle property, Penalized method

Introduction

For the classical linear regression model $Y=X\beta+\varepsilon$, we are interested in the problem of variable selection and estimation, where $Y=(y_1,y_2,\ldots,y_n)^T$ is the response vector, $X=(X_1,X_2,\ldots,X_{p_n})=(x_1,x_2,\ldots,x_n)^T=(x_{ij})_{n\times p_n}$ is an $n\times p_n$ design matrix, and $\varepsilon=(\varepsilon_1,\varepsilon_2,\ldots,\varepsilon_n)^T$ is a random error vector. The main question is how to estimate the coefficient vector $\beta\in\mathbb{R}^{p_n}$ when $p_n$ increases with the sample size $n$ and many elements of $\beta$ equal zero. This problem can be cast as the minimization of a penalized least squares objective function

$$\hat{\beta}_n=\arg\min_{\beta_n}Q_n(\beta_n),\qquad Q_n(\beta_n)=\|Y-X\beta_n\|^2+\sum_{j=1}^{p_n}p_{\lambda_n}(|\beta_{nj}|),$$

where $\|\cdot\|$ is the $\ell_2$ norm of a vector, $\lambda_n$ is a tuning parameter, and $p_{\lambda_n}(|t|)$ is a penalty function. It is well known that the least squares estimator is not robust, especially when the data contain outliers or the error term has a heavy-tailed distribution.

In this paper we consider the loss function to be the least absolute deviation, i.e., we minimize the following objective function:

$$\hat{\beta}_n=\arg\min_{\beta_n}Q_n(\beta_n),\qquad Q_n(\beta_n)=\frac{1}{n}\sum_{i=1}^n|y_i-x_i^T\beta_n|+\sum_{j=1}^{p_n}p_{\lambda_n}(|\beta_{nj}|),$$

where the loss function is the least absolute deviation (LAD for short), which does not require the noise to follow a Gaussian distribution and is more robust than least squares estimation. In fact, the LAD estimator is a special case of the M-estimator, first introduced by Huber (1964, 1973, 1981) [1–3], which is obtained by minimizing the objective function

$$Q_n(\beta_n)=\frac{1}{n}\sum_{i=1}^n\rho(y_i-x_i^T\beta_n),$$

where the function $\rho$ can be chosen freely. For example, if we choose $\rho(x)=\frac{1}{2}x^2\mathbf{1}_{\{|x|\le c\}}+(c|x|-c^2/2)\mathbf{1}_{\{|x|>c\}}$ with $c>0$, we obtain the Huber estimator; if we choose $\rho(x)=|x|^q$ with $1\le q\le2$, we obtain the $L_q$ estimator, with two special cases: the LAD estimator for $q=1$ and the OLS estimator for $q=2$. If we choose $\rho(x)=\alpha x_++(1-\alpha)(-x)_+$ with $0<\alpha<1$ and $x_+=\max(x,0)$, we obtain quantile regression, which again reduces to the LAD estimator for $\alpha=1/2$.
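
To make these choices concrete, the following R sketch implements the loss functions listed above; the function names (huber_loss, lq_loss, check_loss) and the default tuning constant are ours and only illustrative.

```r
# Huber loss: quadratic near zero, linear in the tails (c > 0; 1.345 is a common default)
huber_loss <- function(x, c = 1.345) {
  ifelse(abs(x) <= c, 0.5 * x^2, c * abs(x) - c^2 / 2)
}

# L_q loss, 1 <= q <= 2: q = 1 gives LAD, q = 2 gives OLS
lq_loss <- function(x, q = 1) abs(x)^q

# Quantile (check) loss: alpha * x_+ + (1 - alpha) * (-x)_+
check_loss <- function(x, alpha = 0.5) {
  alpha * pmax(x, 0) + (1 - alpha) * pmax(-x, 0)
}

# For alpha = 1/2 the check loss equals |x| / 2, i.e. LAD up to a constant factor
stopifnot(all.equal(check_loss(c(-2, 1, 3)), abs(c(-2, 1, 3)) / 2))
```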

When $p_n$ approaches infinity as $n$ tends to infinity, we assume that the function $\rho$ is convex (though not monotone) and that its derivative $\varphi$ is monotone. Under appropriate regularity conditions, Huber (1973), Portnoy (1984) [4], Welsh (1989) [5] and Mammen (1989) [6] proved that the M-estimator enjoys consistency and asymptotic normality, where Welsh (1989) imposed a weaker condition on $\varphi$ and a stronger condition on $p_n/n$. Bai and Wu [7] further pointed out that the condition on $p_n$ can be made part of an integrability condition imposed on the design matrix. Moreover, He and Shao (2000) [8] studied the asymptotic properties of the M-estimator in a generalized model setting with diverging dimension $p_n$. Li, Peng and Zhu (2011) [9] obtained the oracle property of the non-concave penalized M-estimator in the high-dimensional model under the conditions $p_n\log n/n\to0$ and $p_n^2/n\to0$, and proposed the rank sure independence screening (RSIS) method for variable selection in the ultra-high-dimensional model. Zou and Li (2008) [10] combined a penalty function with the local linear approximation (LLA) method, proved that the resulting estimator enjoys good asymptotic properties, and demonstrated in simulations that this method improves on the computational efficiency of the local quadratic approximation (LQA).

Inspired by this, in this paper we consider the following problem:

$$\hat{\beta}_n=\arg\min_{\beta_n}Q_n(\beta_n),\qquad Q_n(\beta_n)=\frac{1}{n}\sum_{i=1}^n\rho(y_i-x_i^T\beta_n)+\sum_{j=1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|\beta_{nj}|, \tag{1.1}$$

where $p'_{\lambda_n}(\cdot)$ is the derivative of the penalty function, and $\tilde{\beta}_n=(\tilde{\beta}_{n1},\tilde{\beta}_{n2},\ldots,\tilde{\beta}_{np_n})^T$ is the unpenalized estimator.

In this paper we assume that the function $\rho$ is convex; hence the objective function in (1.1) is still convex and any local minimizer is a global minimizer.
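
As an illustration, the following R sketch evaluates the LLA-penalized objective $Q_n(\beta_n)$ of (1.1) for the LAD loss; the helper name lla_objective and the synthetic example data are our own choices, not the authors' code.

```r
# Q_n(beta) = (1/n) * sum(rho(y - X %*% beta)) + sum(w_j * |beta_j|),
# where w_j = p'_lambda(|beta_tilde_j|) are the LLA weights from (1.1).
lla_objective <- function(beta, y, X, weights, rho = abs) {
  mean(rho(y - as.vector(X %*% beta))) + sum(weights * abs(beta))
}

# Tiny example with synthetic data and constant weights (a plain weighted L1 penalty)
set.seed(1)
X <- matrix(rnorm(50 * 10), 50, 10)
beta_true <- c(2, 2.5, 3, 1, rep(0, 6))
y <- as.vector(X %*% beta_true) + rnorm(50)
lla_objective(rep(0, 10), y, X, weights = rep(0.1, 10))
```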

Main results

For convenience, we first give some notation. Let $\beta_0=(\beta_{01},\beta_{02},\ldots,\beta_{0p_n})^T$ be the true parameter. Without loss of generality, assume that the first $k_n$ coefficients are nonzero, so that there are $p_n-k_n$ covariates with zero coefficients, and partition $\beta_0=(\beta_{0(1)}^T,\beta_{0(2)}^T)^T$ and $\hat{\beta}_n=(\hat{\beta}_{n(1)}^T,\hat{\beta}_{n(2)}^T)^T$ correspondingly. For a symmetric matrix $Z$, denote by $\lambda_{\min}(Z)$ and $\lambda_{\max}(Z)$ the minimum and maximum eigenvalues of $Z$, respectively. Denote $D:=\frac{X^TX}{n}$ and write $D=\begin{pmatrix}D_{11}&D_{12}\\D_{21}&D_{22}\end{pmatrix}$, where $D_{11}=\frac{1}{n}X_{(1)}^TX_{(1)}$ and $X_{(1)}$ consists of the first $k_n$ columns of $X$. Finally, denote $c_n=\max\{|p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|:\tilde{\beta}_{nj}\neq0,\ 1\le j\le p_n\}$.

Next, we state some assumptions which will be needed in the following results.

(A1)

The function $\rho$ is convex on $\mathbb{R}$, and its left and right derivatives $\varphi_-(\cdot)$, $\varphi_+(\cdot)$ satisfy $\varphi_-(t)\le\varphi(t)\le\varphi_+(t)$ for all $t\in\mathbb{R}$.

(A2)

The error terms $\varepsilon_1,\ldots,\varepsilon_n$ are i.i.d., and the distribution function $F$ of $\varepsilon_i$ satisfies $F(S)=0$, where $S$ is the set of discontinuity points of $\varphi$.

Moreover, $E[\varphi(\varepsilon_i)]=0$, $0<E[\varphi^2(\varepsilon_i)]=\sigma^2<\infty$, and $G(t):=E[\varphi(\varepsilon_i+t)]=\gamma t+o(|t|)$, where $\gamma>0$. In addition, we assume that $\lim_{t\to0}E[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)]^2=0$.

(A3)

There exist constants $\tau_1,\tau_2,\tau_3,\tau_4$ such that $0<\tau_1\le\lambda_{\min}(D)\le\lambda_{\max}(D)\le\tau_2$ and $0<\tau_3\le\lambda_{\min}(D_{11})\le\lambda_{\max}(D_{11})\le\tau_4$.

(A4)

$\lambda_n\to0$ as $n\to\infty$, $p_n=O(n^{1/2})$, and $c_n=O(n^{-1/2})$.

(A5)

Let $z_i$ be the transpose of the $i$th row vector of $X_{(1)}$, and assume $\lim_{n\to\infty}n^{-1/2}\max_{1\le i\le n}z_i^Tz_i=0$.

It is worth mentioning that conditions (A1) and (A2) are classical assumptions for M-estimation in the linear model, which can be found in many references, for example Bai, Rao and Wu (1992) [11] and Wu (2007) [12]. Condition (A3) is frequently used for sparse linear regression models and requires that the eigenvalues of the matrices $D$ and $D_{11}$ are bounded away from zero and from above. Condition (A4) is weaker than those in previous references: under (A4) we allow $p_n$ to grow at the order $n^{1/2}$, whereas Huber (1973) and Li, Peng and Zhu (2011) [9] required $p_n^2/n\to0$, Portnoy (1984) required $p_n\log p_n/n\to0$, and Mammen (1989) required $p_n^{3/2}\log p_n/n\to0$. Compared with these results, our sparsity condition is clearly much weaker. Condition (A5) is the same as in Huang, Horowitz and Ma (2008) [13] and is used to prove the asymptotic properties of the nonzero part of the M-estimator.

Theorem 2.1

(Consistency of estimator)

If conditions (A1)–(A4) hold, there exists a non-concave penalized M-estimator $\hat{\beta}_n$ such that

$$\|\hat{\beta}_n-\beta_0\|=O_P\big((p_n/n)^{1/2}\big).$$

Remark 2.1

From Theorem 2.1 we see that there exists a global M-estimator $\hat{\beta}_n$ if we choose the tuning parameter $\lambda_n$ appropriately; moreover, this M-estimator is $(n/p_n)^{1/2}$-consistent. This convergence rate is the same as that obtained by Huber (1973) and Li, Peng and Zhu (2011).

Theorem 2.2

(The sparse model)

If conditions (A1)–(A4) hold and $\lambda_{\min}(D)>\lambda_{\max}\big(\frac{1}{n}\sum_{i=1}^nJ_iJ_i^T\big)$, where $J_i$ consists of the last $p_n-k_n$ components of $x_i$, then for the non-concave penalized M-estimator $\hat{\beta}_n$ we have

$$P(\hat{\beta}_{n(2)}=0)\to1.$$

Remark 2.2

By Theorem 2.2, under suitable conditions the components of the global M-estimator corresponding to the zero coefficients are estimated as exactly zero with high probability when $n$ is large enough. This also shows that the fitted model is sparse.

Theorem 2.3

(Oracle property)

If conditions (A1)–(A5) hold and $\lambda_{\min}(D)>\lambda_{\max}\big(\frac{1}{n}\sum_{i=1}^nJ_iJ_i^T\big)$, then, with probability converging to one, the non-concave penalized M-estimator $\hat{\beta}_n=(\hat{\beta}_{n(1)}^T,\hat{\beta}_{n(2)}^T)^T$ has the following properties:

  1. (Consistency of model selection) $\hat{\beta}_{n(2)}=0$;

  2. (Asymptotic normality)
    $$\sqrt{n}\,s_n^{-1}u^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)=\sum_{i=1}^nn^{-1/2}s_n^{-1}\gamma^{-1}u^TD_{11}^{-1}z_i\varphi(\varepsilon_i)+o_P(1)\xrightarrow{d}N(0,1),$$
    where $s_n^2=\sigma^2\gamma^{-2}u^TD_{11}^{-1}u$, $u$ is any $k_n$-dimensional vector with $\|u\|\le1$, and $z_i$ is the transpose of the $i$th row vector of the $n\times k_n$ matrix $X_{(1)}$.

Remark 2.3

By Theorem 2.3, the M-estimator enjoys the oracle property: it selects the covariates with nonzero coefficients correctly with probability converging to one, and the estimators of the nonzero coefficients have the same asymptotic distribution that they would have if the zero coefficients were known in advance.

Remark 2.4

Fan and Peng (2004) [14] showed that the non-concave penalized likelihood estimator is consistent under the condition $p_n^4/n\to0$ and enjoys asymptotic normality under the condition $p_n^5/n\to0$. By Theorems 2.1–2.3, the corresponding conditions we impose are considerably weaker.

Proofs of main results

The proof of Theorem 2.1

Let $\alpha_n=(p_n/n)^{1/2}+p_n^{1/2}c_n$. For any $p_n$-dimensional vector $u$ with $\|u\|=C$, it suffices to prove that for a large enough positive constant $C$

$$\liminf_{n\to\infty}P\Big\{\inf_{\|u\|=C}Q_n(\beta_0+\alpha_nu)>Q_n(\beta_0)\Big\}\ge1-\varepsilon,$$

for any $\varepsilon>0$; that is, there exists at least one local minimizer $\hat{\beta}_n$ in the closed ball $\{\beta_0+\alpha_nu:\|u\|\le C\}$ such that $\|\hat{\beta}_n-\beta_0\|=O_P(\alpha_n)$.

Firstly, by the triangle inequality we get

$$\begin{aligned}Q_n(\beta_0+\alpha_nu)-Q_n(\beta_0)&=\frac{1}{n}\sum_{i=1}^n\big[\rho\big(y_i-x_i^T(\beta_0+\alpha_nu)\big)-\rho(y_i-x_i^T\beta_0)\big]+\sum_{j=1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\big(|\beta_{0j}+\alpha_nu_j|-|\beta_{0j}|\big)\\&\ge\frac{1}{n}\sum_{i=1}^n\big[\rho\big(y_i-x_i^T(\beta_0+\alpha_nu)\big)-\rho(y_i-x_i^T\beta_0)\big]-\alpha_n\sum_{j=1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|u_j|:=T_1+T_2,\end{aligned}$$

where $T_1=\frac{1}{n}\sum_{i=1}^n[\rho(y_i-x_i^T(\beta_0+\alpha_nu))-\rho(y_i-x_i^T\beta_0)]$ and $T_2=-\alpha_n\sum_{j=1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|u_j|$. Noticing that

$$\begin{aligned}T_1&=\frac{1}{n}\sum_{i=1}^n\big[\rho\big(y_i-x_i^T(\beta_0+\alpha_nu)\big)-\rho(y_i-x_i^T\beta_0)\big]=\frac{1}{n}\sum_{i=1}^n\big[\rho(\varepsilon_i-\alpha_nx_i^Tu)-\rho(\varepsilon_i)\big]\\&=\frac{1}{n}\sum_{i=1}^n\int_0^{\alpha_nx_i^Tu}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt-\frac{1}{n}\alpha_n\sum_{i=1}^n\varphi(\varepsilon_i)x_i^Tu:=T_{11}+T_{12},\end{aligned} \tag{3.1}$$

where $T_{11}=\frac{1}{n}\sum_{i=1}^n\int_0^{\alpha_nx_i^Tu}[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)]\,dt$ and $T_{12}=-\frac{1}{n}\alpha_n\sum_{i=1}^n\varphi(\varepsilon_i)x_i^Tu$. Combining the Von Bahr–Esseen inequality with the fact that $|T_{12}|\le\frac{1}{n}\alpha_n\|u\|\,\|\sum_{i=1}^n\varphi(\varepsilon_i)x_i\|$, we obtain

$$E\Big[\Big\|\sum_{i=1}^n\varphi(\varepsilon_i)x_i\Big\|^2\Big]\le n\sum_{i=1}^nE\big[\|\varphi(\varepsilon_i)x_i\|^2\big]=n\sum_{i=1}^nE\big[\varphi^2(\varepsilon_i)\big]x_i^Tx_i\le\tau_2n^2p_n\sigma^2,$$

hence

$$|T_{12}|=O_P(\alpha_np_n^{1/2})\|u\|=O_P\big((p_n^2/n)^{1/2}\big)\|u\|. \tag{3.2}$$

Secondly, for $T_{11}$, write $T_{11}=\sum_{i=1}^nA_{in}$, where $A_{in}=\frac{1}{n}\int_0^{\alpha_nx_i^Tu}[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)]\,dt$, so

$$T_{11}=\sum_{i=1}^n\big[A_{in}-E(A_{in})\big]+\sum_{i=1}^nE(A_{in}):=T_{111}+T_{112}.$$

We can easily obtain $E(T_{111})=0$. From the Von Bahr–Esseen inequality, the Schwarz inequality and conditions (A2) and (A3), it follows that

$$\begin{aligned}\operatorname{var}(T_{111})=\operatorname{var}\Big(\sum_{i=1}^nA_{in}\Big)&\le\frac{1}{n}\sum_{i=1}^nE\Big(\int_0^{\alpha_nx_i^Tu}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt\Big)^2\\&\le\frac{1}{n}\sum_{i=1}^n|\alpha_nx_i^Tu|\,\Big|\int_0^{\alpha_nx_i^Tu}E\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]^2\,dt\Big|\\&=\frac{1}{n}\sum_{i=1}^no_P(1)(\alpha_nx_i^Tu)^2=\frac{1}{n}o_P(1)\alpha_n^2\sum_{i=1}^nu^Tx_ix_i^Tu\\&=o_P(1)\alpha_n^2u^TDu\le\lambda_{\max}(D)o_P(1)\alpha_n^2\|u\|^2=o_P(\alpha_n^2)\|u\|^2,\end{aligned}$$

so together with the Markov inequality this yields

$$P\big(|T_{111}|>C_1\alpha_n\|u\|\big)\le\frac{\operatorname{var}(T_{111})}{C_1^2\alpha_n^2\|u\|^2}\le\frac{o_P(\alpha_n^2)\|u\|^2}{C_1^2\alpha_n^2\|u\|^2}\to0\quad(n\to\infty),$$

hence

$$T_{111}=o_P(\alpha_n)\|u\|. \tag{3.3}$$

As for T112,

$$\begin{aligned}T_{112}=\sum_{i=1}^nE(A_{in})&=\frac{1}{n}\sum_{i=1}^n\int_0^{\alpha_nx_i^Tu}\big[\gamma t+o(|t|)\big]\,dt=\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{2}\gamma\alpha_n^2u^Tx_ix_i^Tu+o_P(1)\alpha_n^2u^Tx_ix_i^Tu\Big)\\&=\frac{1}{2}\gamma\alpha_n^2u^TDu+o_P(1)\alpha_n^2u^TDu\ge\Big[\frac{1}{2}\gamma\lambda_{\min}(D)+o_P(1)\Big]\alpha_n^2\|u\|^2.\end{aligned} \tag{3.4}$$

Finally, considering T2, we can easily obtain

$$T_2\ge-(p_n)^{1/2}\alpha_n\max\big\{|p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|,1\le j\le k_n\big\}\|u\|\ge-(p_n)^{1/2}\alpha_nc_n\|u\|\ge-\alpha_n^2\|u\|. \tag{3.5}$$

Together with (3.1)–(3.5), this shows that we can choose a large enough constant $C$ such that $T_{111}$ and $T_2$ are dominated by $T_{112}$, from which it follows that there exists at least one local minimizer $\hat{\beta}_n$ in the closed ball $\{\beta_0+\alpha_nu:\|u\|\le C\}$ such that $\|\hat{\beta}_n-\beta_0\|=O_P(\alpha_n)$. □

The proof of Theorem 2.2

From Theorem 2.1, as long as we choose a large enough constant $C$ and appropriate $\alpha_n=(p_n/n)^{1/2}+p_n^{1/2}c_n$, the estimator $\hat{\beta}_n$ lies in the ball $\{\beta_0+\alpha_nu:\|u\|\le C\}$ with probability converging to one. For any $p_n$-dimensional vector $\beta_n$, write $\beta_n=(\beta_{n(1)}^T,\beta_{n(2)}^T)^T$ with $\beta_{n(1)}=\beta_{0(1)}+\alpha_nu_{(1)}$ and $\beta_{n(2)}=\beta_{0(2)}+\alpha_nu_{(2)}=\alpha_nu_{(2)}$, where $\beta_0=(\beta_{0(1)}^T,\beta_{0(2)}^T)^T$ and $\|u\|^2=\|u_{(1)}\|^2+\|u_{(2)}\|^2\le C^2$. Meanwhile let

$$V_n(u_{(1)},u_{(2)})=Q_n\big((\beta_{n(1)}^T,\beta_{n(2)}^T)^T\big)-Q_n\big((\beta_{0(1)}^T,0^T)^T\big),$$

so that by minimizing $V_n(u_{(1)},u_{(2)})$ over $\|u_{(1)}\|^2+\|u_{(2)}\|^2\le C^2$ we obtain the estimator $\hat{\beta}_n=(\hat{\beta}_{n(1)}^T,\hat{\beta}_{n(2)}^T)^T$. In the following we prove that, as long as $\|u\|\le C$ and $\|u_{(2)}\|>0$,

$$P\big(V_n(u_{(1)},u_{(2)})-V_n(u_{(1)},0)>0\big)\to1\quad(n\to\infty)$$

holds for any $p_n$-dimensional vector $u=(u_{(1)}^T,u_{(2)}^T)^T$. It is easy to see that

$$\begin{aligned}V_n(u_{(1)},u_{(2)})-V_n(u_{(1)},0)&=Q_n\big((\beta_{n(1)}^T,\beta_{n(2)}^T)^T\big)-Q_n\big((\beta_{n(1)}^T,0^T)^T\big)\\&=\frac{1}{n}\sum_{i=1}^n\big[\rho(\varepsilon_i-\alpha_nH_i^Tu_{(1)}-\alpha_nJ_i^Tu_{(2)})-\rho(\varepsilon_i-\alpha_nH_i^Tu_{(1)})\big]+\sum_{j=k_n+1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|\alpha_nu_j|\\&=\frac{1}{n}\sum_{i=1}^n\int_{\alpha_nH_i^Tu_{(1)}}^{\alpha_nH_i^Tu_{(1)}+\alpha_nJ_i^Tu_{(2)}}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt-\frac{1}{n}\alpha_n\sum_{i=1}^n\varphi(\varepsilon_i)J_i^Tu_{(2)}+\sum_{j=k_n+1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|\alpha_nu_j|\\&:=W_1+W_2+W_3,\end{aligned}$$

where $H_i$ and $J_i$ are the $k_n$- and $(p_n-k_n)$-dimensional subvectors of $x_i$, respectively, such that $x_i=(H_i^T,J_i^T)^T$. Similar to the proof of Theorem 2.1, we get

$$\begin{aligned}W_1&=\frac{1}{n}\sum_{i=1}^n\int_{\alpha_nH_i^Tu_{(1)}}^{\alpha_nH_i^Tu_{(1)}+\alpha_nJ_i^Tu_{(2)}}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt\\&=\frac{1}{2n}\sum_{i=1}^n\gamma\alpha_n^2u^Tx_ix_i^Tu-\frac{1}{2n}\sum_{i=1}^n\gamma\alpha_n^2u_{(2)}^TJ_iJ_i^Tu_{(2)}+o_P(1)\alpha_n^2\|u\|^2+o_P(1)\alpha_n\|u\|\\&\ge\frac{1}{2}\gamma\alpha_n^2\Big[\lambda_{\min}(D)-\lambda_{\max}\Big(\frac{1}{n}\sum_{i=1}^nJ_iJ_i^T\Big)\Big]\|u\|^2+o_P(1)\alpha_n^2\|u\|^2+o_P(1)\alpha_n\|u\|,\end{aligned} \tag{3.6}$$
$$|W_2|=\Big|\frac{1}{n}\alpha_n\sum_{i=1}^n\varphi(\varepsilon_i)J_i^Tu_{(2)}\Big|=O_P\big((p_n^2/n)^{1/2}\big)\|u\|, \tag{3.7}$$

and

$$|W_3|=\Big|\sum_{j=k_n+1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|\alpha_nu_j|\Big|\le(p_n)^{1/2}\alpha_n\max\big\{|p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|,k_n+1\le j\le p_n\big\}\|u\|\le(p_n)^{1/2}\alpha_nc_n\|u\|\le\alpha_n^2\|u\|. \tag{3.8}$$

By (3.6)–(3.8) and the condition $\lambda_{\min}(D)>\lambda_{\max}\big(\frac{1}{n}\sum_{i=1}^nJ_iJ_i^T\big)$, it follows that

$$V_n(u_{(1)},u_{(2)})-V_n(u_{(1)},0)\ge\frac{1}{2}\gamma\alpha_n^2\Big[\lambda_{\min}(D)-\lambda_{\max}\Big(\frac{1}{n}\sum_{i=1}^nJ_iJ_i^T\Big)\Big]\|u\|^2+o_P(1)\alpha_n^2\|u\|^2+o_P(1)\alpha_n\|u\|-O_P\big((p_n^2/n)^{1/2}\big)\|u\|-O_P(\alpha_n^2)\|u\|>0,$$

which shows that, as long as $\|u\|\le C$ and $\|u_{(2)}\|>0$,

$$P\big(V_n(u_{(1)},u_{(2)})-V_n(u_{(1)},0)>0\big)\to1\quad(n\to\infty)$$

holds for any $p_n$-dimensional vector $u=(u_{(1)}^T,u_{(2)}^T)^T$. □

The proof of Theorem 2.3

Conclusion (1) follows immediately from Theorem 2.2, so we only need to prove conclusion (2). By Theorem 2.1, $\hat{\beta}_n$ is a consistent estimator of $\beta_0$, and by Theorem 2.2, $\hat{\beta}_{n(2)}=0$ with probability converging to one. Therefore $\hat{\beta}_{n(1)}$ satisfies

$$\frac{\partial Q_n(\beta_n)}{\partial\beta_{n(1)}}\Big|_{\beta_{n(1)}=\hat{\beta}_{n(1)}}=0,$$

that is,

$$-\frac{1}{n}\sum_{i=1}^nH_i\varphi\big(y_i-H_i^T\hat{\beta}_{n(1)}\big)+W_{(1)}=0,$$

where

$$W=\big(p'_{\lambda_n}(|\tilde{\beta}_{n1}|)\operatorname{sgn}(\hat{\beta}_{n1}),p'_{\lambda_n}(|\tilde{\beta}_{n2}|)\operatorname{sgn}(\hat{\beta}_{n2}),\ldots,p'_{\lambda_n}(|\tilde{\beta}_{np_n}|)\operatorname{sgn}(\hat{\beta}_{np_n})\big)^T,$$
and $W_{(1)}$ denotes its first $k_n$ components.

In the following we give the Taylor expansion of the first term on the left-hand side:

$$-\frac{1}{n}\sum_{i=1}^n\Big\{H_i\varphi\big(y_i-H_i^T\beta_{0(1)}\big)-\big[\varphi'\big(y_i-H_i^T\beta_{0(1)}\big)H_iH_i^T+o_P(1)\big]\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)\Big\}+W_{(1)}=0.$$

Noticing that $y_i=H_i^T\beta_{0(1)}+\varepsilon_i$, we have

$$-\frac{1}{n}\sum_{i=1}^nH_i\varphi(\varepsilon_i)+\frac{1}{n}\sum_{i=1}^n\big[\varphi'(\varepsilon_i)H_iH_i^T+o_P(1)\big]\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)+W_{(1)}=0,$$

which shows that

$$\frac{1}{n}\gamma\sum_{i=1}^nH_iH_i^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)=\frac{1}{n}\sum_{i=1}^nH_i\varphi(\varepsilon_i)-W_{(1)}+\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)o_P(1)+\frac{1}{n}\sum_{i=1}^n\big(\gamma-\varphi'(\varepsilon_i)\big)H_iH_i^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big).$$

Then, as long as $\|u\|\le1$,

$$u^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)=n^{-1}\gamma^{-1}u^TD_{11}^{-1}\sum_{i=1}^nH_i\varphi(\varepsilon_i)+n^{-1}\gamma^{-1}u^TD_{11}^{-1}\sum_{i=1}^n\big(\gamma-\varphi'(\varepsilon_i)\big)H_iH_i^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)-\gamma^{-1}u^TD_{11}^{-1}W_{(1)}+o_P(\alpha_n)$$

holds for any $k_n$-dimensional vector $u$. For the third term on the right-hand side, we obtain

$$\big|\gamma^{-1}u^TD_{11}^{-1}W_{(1)}\big|\le\frac{1}{\gamma\lambda_{\min}(D_{11})}\|W_{(1)}\|\le\frac{1}{\gamma\lambda_{\min}(D_{11})}p_n^{1/2}c_n\le\frac{\alpha_n}{\gamma\lambda_{\min}(D_{11})}=o_P(1)\quad(n\to\infty). \tag{3.9}$$

Now we deal with the second term on the right-hand side. Theorem 2.1 and condition (A3) yield

$$\begin{aligned}\Big|n^{-1}\gamma^{-1}u^TD_{11}^{-1}\sum_{i=1}^n\big(\gamma-\varphi'(\varepsilon_i)\big)H_iH_i^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)\Big|&\le\frac{1}{n\gamma\lambda_{\min}(D_{11})}\Big\|\sum_{i=1}^n\big(\gamma-\varphi'(\varepsilon_i)\big)H_iH_i^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)\Big\|\\&\le\frac{1}{n\gamma\lambda_{\min}(D_{11})}\Big\|\sum_{i=1}^n\big(\gamma-\varphi'(\varepsilon_i)\big)H_iH_i^T\Big\|\,\big\|\hat{\beta}_{n(1)}-\beta_{0(1)}\big\|\\&\le\frac{O_P(1)}{n\gamma\lambda_{\min}(D_{11})}\big\|\hat{\beta}_{n(1)}-\beta_{0(1)}\big\|=O_P\big(p_n^{1/2}n^{-3/2}\big),\end{aligned} \tag{3.10}$$

where the third inequality holds by Lemma 3 of Mammen (1989). Combining (3.9) and (3.10), we have

$$u^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)=n^{-1}\gamma^{-1}u^TD_{11}^{-1}\sum_{i=1}^nH_i\varphi(\varepsilon_i)+O_P(\alpha_n)+O_P\big(p_n^{1/2}n^{-3/2}\big),$$

that is,

$$n^{1/2}u^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)=n^{-1/2}\gamma^{-1}u^TD_{11}^{-1}\sum_{i=1}^nH_i\varphi(\varepsilon_i)+o_P(1).$$

Denote $s_n^2=\sigma^2\gamma^{-2}u^TD_{11}^{-1}u$ and $F_{in}=n^{-1/2}s_n^{-1}\gamma^{-1}u^TD_{11}^{-1}z_i$, where $z_i$ is the transpose of the $i$th row vector of $X_{(1)}$ (so that $z_i=H_i$); then $n^{1/2}s_n^{-1}u^T(\hat{\beta}_{n(1)}-\beta_{0(1)})=\sum_{i=1}^nF_{in}\varphi(\varepsilon_i)+o_P(1)$. It follows from (A5) that

$$\sum_{i=1}^nF_{in}^2=\sum_{i=1}^n\big(n^{-1/2}s_n^{-1}\gamma^{-1}u^TD_{11}^{-1}z_i\big)\big(n^{-1/2}s_n^{-1}\gamma^{-1}z_i^TD_{11}^{-1}u\big)=\sum_{i=1}^nn^{-1}s_n^{-2}\gamma^{-2}u^TD_{11}^{-1}z_iz_i^TD_{11}^{-1}u=s_n^{-2}\gamma^{-2}u^TD_{11}^{-1}u=\sigma^{-2}.$$

Applying the Lindeberg–Feller central limit theorem together with the Slutsky theorem, we see that

$$\sqrt{n}\,s_n^{-1}u^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)\xrightarrow{d}N(0,1).$$

 □

Simulation results

In this section we evaluate the performance of the M-estimator proposed in (1.1) by simulation studies.

We begin with the data. The data are simulated from the model $Y=X\beta+\varepsilon$, where the nonzero coefficients are $\beta_{0(1)}=(2,2.5,3,1)^T$ and $\varepsilon$ follows the $N(0,1)$, $t_5$ and mixed normal $0.9N(0,1)+0.1N(0,9)$ distributions, respectively. The design matrix $X$ is generated from a $p$-dimensional multivariate normal distribution with mean zero and covariance matrix whose $(i,j)$th entry is $\rho^{|i-j|}$, where we set $\rho=0.5$.
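
A minimal R sketch of this data-generating process is given below; the function name gen_data and the use of a Cholesky factor to draw the correlated design are our own implementation choices.

```r
gen_data <- function(n, p, rho = 0.5,
                     error = c("normal", "t5", "mixed")) {
  error <- match.arg(error)
  # AR(1)-type covariance: Sigma[i, j] = rho^|i - j|
  Sigma <- rho^abs(outer(1:p, 1:p, "-"))
  X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
  # First four coefficients are nonzero, the rest are zero
  beta0 <- c(2, 2.5, 3, 1, rep(0, p - 4))
  eps <- switch(error,
    normal = rnorm(n),
    t5     = rt(n, df = 5),
    mixed  = ifelse(runif(n) < 0.9, rnorm(n, 0, 1), rnorm(n, 0, 3)))
  list(X = X, y = as.vector(X %*% beta0) + eps, beta0 = beta0)
}

# Example: n = 200 with p = floor(2 * sqrt(n)) = 28, as in Table 1
dat <- gen_data(200, 28, error = "normal")
```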

Next we consider the loss function. Possible choices include the LAD, OLS and Huber loss functions; in this paper we choose the LAD loss and the Huber loss.

As for the penalty, for $p'_{\lambda_n}(|\tilde{\beta}_{nj}|)$ in (1.1) we use the SCAD penalty:

$$p_{\lambda_n}(|\beta|)=\begin{cases}\lambda_n|\beta|,&0\le|\beta|\le\lambda_n,\\ \big(2a\lambda_n|\beta|-\beta^2-\lambda_n^2\big)/\big(2(a-1)\big),&\lambda_n<|\beta|<a\lambda_n,\\ (a+1)\lambda_n^2/2,&|\beta|\ge a\lambda_n,\end{cases}$$

so that $p'_{\lambda_n}(|\tilde{\beta}_{nj}|)=\lambda_nI(|\tilde{\beta}_{nj}|\le\lambda_n)+\frac{a\lambda_n-|\tilde{\beta}_{nj}|}{a-1}I(\lambda_n<|\tilde{\beta}_{nj}|\le a\lambda_n)$. Following the proposal of Fan and Li (2001), we take $a=3.7$, so that only the tuning parameter $\lambda_n$ remains to be chosen, for example by generalized cross validation.
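
For reference, a minimal R implementation of this derivative (with the convention $a=3.7$) might look as follows; scad_deriv is our own name for the helper.

```r
# Derivative of the SCAD penalty evaluated at |beta_tilde|
scad_deriv <- function(beta_tilde, lambda, a = 3.7) {
  t <- abs(beta_tilde)
  lambda * (t <= lambda) +
    pmax(a * lambda - t, 0) / (a - 1) * (t > lambda)
}

# Usage: the LLA weights in (1.1) are scad_deriv(beta_tilde, lambda)
# evaluated at a pilot (unpenalized) estimate beta_tilde.
```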

Regarding the simulation algorithm: for the proposed LLA method, the weighted penalty terms are incorporated into the regression by attaching them to the corresponding covariates as extra observations, and the resulting problem is solved with the quantile regression (quantreg) package in R; see the sketch below. For the Lasso method, we use the lars package.
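
The sketch below illustrates one standard way to carry out this step, reusing gen_data and scad_deriv from the sketches above: because each weighted penalty term $w_j|\beta_j|$ can be written as an extra absolute residual $|0-nw_j\beta_j|$, the penalized LAD problem can be passed to quantreg::rq as an ordinary median regression on an augmented data set. The helper name lla_lad and the exact augmentation are our own illustrative choices, not necessarily the authors' code.

```r
library(quantreg)

# Minimize (1/n) sum |y - X beta| + sum w_j |beta_j| via augmented LAD:
# multiplying by n, each penalty term becomes |0 - (n * w_j) * beta_j|,
# i.e. one pseudo-observation with response 0 and design row n * w_j * e_j.
lla_lad <- function(y, X, w) {
  n <- length(y); p <- ncol(X)
  X_aug <- rbind(X, diag(n * w, nrow = p))
  y_aug <- c(y, rep(0, p))
  fit <- rq(y_aug ~ X_aug - 1, tau = 0.5)   # median (LAD) regression
  as.vector(coef(fit))
}

# One LLA step: pilot LAD fit -> SCAD weights -> weighted LAD
beta_tilde <- as.vector(coef(rq(dat$y ~ dat$X - 1, tau = 0.5)))
w <- scad_deriv(beta_tilde, lambda = 0.1)   # lambda chosen by BIC in practice
beta_hat <- lla_lad(dat$y, dat$X, w)
```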

Now we address the selection of the tuning parameter, for which we apply the BIC criterion

$$\mathrm{BIC}(\lambda_n)=\ln\Big(\frac{1}{n}\sum_{i=1}^n\rho\big(y_i-x_i^T\hat{\beta}\big)\Big)+\mathrm{DF}_{\lambda_n}\ln(n)/n,$$

where $\mathrm{DF}_{\lambda_n}$ is the generalized degrees of freedom used by Fan and Li (2001).
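
A direct R translation of this criterion is sketched below, reusing the helpers from the previous sketches; as a simplifying assumption on our part, the number of nonzero coefficients is used as a proxy for $\mathrm{DF}_{\lambda_n}$ rather than the authors' exact definition.

```r
bic_lambda <- function(y, X, beta_hat, rho = abs,
                       df = sum(abs(beta_hat) > 1e-8)) {  # proxy for DF_lambda
  n <- length(y)
  log(mean(rho(y - as.vector(X %*% beta_hat)))) + df * log(n) / n
}

# Example: grid search over lambda, keeping the fit with smallest BIC
lambdas <- seq(0.02, 0.5, by = 0.02)
bics <- sapply(lambdas, function(l) {
  b <- lla_lad(dat$y, dat$X, scad_deriv(beta_tilde, l))
  bic_lambda(dat$y, dat$X, b)
})
best_lambda <- lambdas[which.min(bics)]
```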

Finally, we describe the evaluation measures. In order to evaluate the performance of the estimators, we report five measures, EE, PE, C, IC and CP, each computed over 500 replications. EE is the median of $\|\hat{\beta}-\beta_0\|^2$ and evaluates estimation accuracy, and PE is the prediction error, defined as the median of $n^{-1}\|Y-X\hat{\beta}\|^2$. The other three measures quantify model-selection consistency: C and IC are the average numbers of correctly and incorrectly selected zero covariates, respectively, and CP is the proportion of correctly identified zero variables among all zero variables.
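
The following R sketch computes these measures for a single replication (the medians and averages over 500 replications are taken outside this function); the function name eval_fit and the small tolerance used to declare a coefficient zero are our own choices.

```r
eval_fit <- function(beta_hat, beta0, y, X, tol = 1e-8) {
  true_zero <- which(beta0 == 0)
  est_zero  <- which(abs(beta_hat) < tol)
  list(
    EE = sum((beta_hat - beta0)^2),                           # squared estimation error
    PE = sum((y - as.vector(X %*% beta_hat))^2) / length(y),  # prediction error
    C  = sum(est_zero %in% true_zero),    # truly-zero coefficients estimated as zero
    IC = sum(!(est_zero %in% true_zero)), # nonzero coefficients wrongly set to zero
    CP = mean(true_zero %in% est_zero)    # proportion of zero variables recovered
  )
}

# Example on one replication
eval_fit(beta_hat, dat$beta0, dat$y, dat$X)
```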

In the following we compare the performance of the proposed LLA method, the Lasso method and the Oracle estimator. We set $n=200,500,700$, respectively, and $p=[2\sqrt{n}]$.

From Table 1 we notice that the indices EE, C, IC and CP of our proposed LLA method perform well when $\varepsilon\sim N(0,1)$. In particular, for the index CP, LLA clearly outperforms Lasso. A likely reason is that we impose different penalties on important and unimportant variables, while Lasso imposes the same penalty on all variables. Moreover, as the sample size increases, the ability of the LLA method to correctly identify unimportant variables also increases. When the sample size is 700 and the number of explanatory variables is 53, an average of 48.9617 of the 49 unimportant (zero) variables are estimated to be zero, an accuracy of 99.92%.

Table 1.

Simulation results for the LAD loss function and $\varepsilon\sim N(0,1)$ ($p$ is the number of covariates and $m=p-4$ the number of zero coefficients)

Setting Method EE PE C IC CP
n = 200 Oracle 10.8544 3.3916 24.0000 0 100%
p = 28 Lasso 10.5726 3.3035 10.8480 0 45.20%
m = 24 LLA 10.9153 3.3947 23.8540 0 99.39%
n = 500 Oracle 19.9085 5.4118 41.0000 0 100%
p = 45 Lasso 19.5952 5.2928 18.9920 0 46.32%
m = 41 LLA 19.9233 5.4045 40.9140 0 99.79%
n = 700 Oracle 24.3006 6.3847 49.0000 0 100%
p = 53 Lasso 24.0315 6.2994 23.1009 0 47.14%
m = 49 LLA 24.3666 6.4077 48.9617 0 99.92%

An interesting fact emerges from Table 2: when the error term follows $t_5$, the accuracy with which the proposed LLA method correctly excludes unimportant variables is slightly higher than in the standard normal case. The reason is that LLA is particularly suitable when the error term is heavy tailed, although its estimation and prediction accuracy is then slightly worse than that of Lasso. As the sample size increases, the LLA and Oracle estimates perform equally well in selecting the important variables and controlling the complexity of the model.

Table 2.

Simulation results for the LAD loss function and $\varepsilon\sim t_5$

Setting Method EE PE C IC CP
n = 200 Oracle 10.5634 4.2892 24.0000 0 100%
p = 28 Lasso 10.2810 4.1649 11.7700 0 49.04%
m = 24 LLA 10.6448 4.2725 23.8780 0 99.49%
n = 500 Oracle 19.4296 6.8240 41.0000 0 100%
p = 45 Lasso 19.1157 6.7042 18.9580 0 46.24%
m = 41 LLA 19.4665 6.8335 40.9560 0 99.89%
n = 700 Oracle 23.7784 8.0637 49.0000 0 100%
p = 53 Lasso 23.4389 7.9551 22.8800 0 46.69%
m = 49 LLA 23.7808 8.0919 48.9740 0 99.94%

As can be seen from Table 3, when the error term is set to a mixed normal distribution, the ability of the proposed method to correctly select zero variables remains high. For small sample sizes, the Lasso method is better at selecting the important variables.

Table 3.

Simulation results for the LAD loss function and $\varepsilon\sim0.9N(0,1)+0.1N(0,9)$

Setting Method EE PE C IC CP
n = 200 Oracle 10.4815 4.4830 24.0000 0 100%
p = 28 Lasso 10.2030 4.4063 11.6360 0 48.48%
m = 24 LLA 10.5826 4.4529 23.9240 0 99.68%
n = 500 Oracle 19.2539 7.1997 41.0000 0 100%
p = 45 Lasso 18.9670 7.0960 19.3840 0 47.28%
m = 41 LLA 19.2950 7.1173 40.9520 0 99.88%
n = 700 Oracle 23.6354 8.5657 49.0000 0 100%
p = 53 Lasso 23.2424 8.4609 23.0580 0 47.06%
m = 49 LLA 23.6566 8.3699 48.9300 0 99.86%

As shown in Tables 4–6, where the Huber loss function is used, the proposed LLA method behaves well both in variable selection and in robustness. Comparing Table 1 with Table 4, when the data contain outliers it is preferable to choose the LAD loss. Moreover, when the error term follows a mixed normal distribution, the LLA method behaves better than the Lasso method; this is practically relevant because real data often follow a mixed (contaminated) normal distribution.

Table 4.

Simulation results for the Huber loss function and $\varepsilon\sim N(0,1)$

Setting Method EE PE C IC CP
n = 200 Oracle 10.8300 3.3696 24.0000 0 100%
p = 28 Lasso 9.6422 3.5569 20.0920 0 83.72%
m = 24 LLA 10.9088 3.3784 22.7200 0 94.67%
n = 500 Oracle 19.9141 5.4034 41.0000 0 100%
p = 45 Lasso 18.0691 5.6068 38.0300 0 92.76%
m = 41 LLA 19.8884 5.3937 40.5160 0 98.82%
n = 700 Oracle 24.3265 6.3761 49.0000 0 100%
p = 53 Lasso 22.4030 6.5988 46.2440 0 94.38%
m = 49 LLA 24.3596 6.3882 48.6620 0 99.31%

Table 5.

Simulation results for the Huber loss function and $\varepsilon\sim t_5$

Setting Method EE PE C IC CP
n = 200 Oracle 10.5572 4.2666 24.0000 0 100%
p = 28 Lasso 9.2590 4.4065 18.4680 0.0020 76.95%
m = 24 LLA 10.6099 4.2429 22.8100 0 95.04%
n = 500 Oracle 19.4395 6.8118 41.0000 0 100%
p = 45 Lasso 17.4385 6.9993 36.2080 0 88.31%
m = 41 LLA 19.4471 6.8247 40.5440 0 98.89%
n = 700 Oracle 23.8089 8.0487 49.0000 0 100%
p = 53 Lasso 21.6534 8.2558 44.4980 0 90.81%
m = 49 LLA 23.8220 8.0807 48.6940 0 99.38%

Table 6.

Simulation results for the Huber loss function and $\varepsilon\sim0.9N(0,1)+0.1N(0,9)$

Setting Method EE PE C IC CP
n = 200 Oracle 10.4829 4.4694 24.0000 0 100%
p = 28 Lasso 9.1630 4.6333 18.0680 0 75.28%
m = 24 LLA 10.5706 4.4827 22.7880 0 94.95%
n = 500 Oracle 19.2618 7.1860 41.0000 0 100%
p = 45 Lasso 17.3780 7.3190 35.3500 0 86.22%
m = 41 LLA 19.2962 7.2029 40.5900 0 99.00%
n = 700 Oracle 23.6356 8.5563 49.0000 0 100%
p = 53 Lasso 21.5202 8.7148 43.4420 0 88.66%
m = 49 LLA 23.6275 8.5822 48.7120 0 99.41%

Conclusion

In this paper we studied the M-estimation method for the high-dimensional linear regression model and discussed the properties of the M-estimator when the penalty term is handled by a local linear approximation. We showed that the proposed estimator possesses good properties under certain assumptions, and the numerical simulations, carried out with a suitable algorithm, demonstrate the robustness of the method.

Authors’ contributions

All authors contributed equally to the writing of this paper. All authors read and approved the final manuscript.

Funding

The work was supported by the NSFC (71803001, 61703001), the NSF of Anhui Province (1708085MA17, 1508085QA13), the Key NSF of Education Bureau of Anhui Province (KJ2018A0437) and the Support Plan of Excellent Youth Talents in Colleges and Universities in Anhui Province (gxyq2017011).

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Kai Wang, Email: wangkai050318@163.com.

Yanling Zhu, Email: zhuyanling99@126.com.

References

  1. Huber P. Robust estimation of a location parameter. Ann. Math. Stat. 1964;35:73–101. doi: 10.1214/aoms/1177703732.
  2. Huber P. Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Stat. 1973;1:799–821. doi: 10.1214/aos/1176342503.
  3. Huber P. Robust Statistics. New York: Wiley; 1981.
  4. Portnoy S. Asymptotic behavior of M-estimators of p regression parameters when p^2/n is large, I: consistency. Ann. Stat. 1984;12:1298–1309. doi: 10.1214/aos/1176346793.
  5. Welsh A. On M-processes and M-estimation. Ann. Stat. 1989;17:337–361. doi: 10.1214/aos/1176347021.
  6. Mammen E. Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Ann. Stat. 1989;17:382–400. doi: 10.1214/aos/1176347023.
  7. Bai Z., Wu Y. Limiting behavior of M-estimators of regression coefficients in high dimensional linear models I. Scale-dependent case. J. Multivar. Anal. 1994;51:211–239. doi: 10.1006/jmva.1994.1059.
  8. He X., Shao Q. On parameters of increasing dimensions. J. Multivar. Anal. 2000;73:120–135. doi: 10.1006/jmva.1999.1873.
  9. Li G., Peng H., Zhu L. Nonconcave penalized M-estimation with a diverging number of parameters. Stat. Sin. 2011;21:391–419.
  10. Zou H., Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 2008;36:1509–1566. doi: 10.1214/009053607000000802.
  11. Bai Z., Rao C., Wu Y. M-estimation of multivariate linear regression parameters under a convex discrepancy function. Stat. Sin. 1992;2:237–254.
  12. Wu W. M-estimation of linear models with dependent errors. Ann. Stat. 2007;35:495–521. doi: 10.1214/009053606000001406.
  13. Huang J., Horowitz J., Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Stat. 2008;36:587–613. doi: 10.1214/009053607000000875.
  14. Fan J., Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Stat. 2004;32:928–961. doi: 10.1214/009053604000000256.
