Author manuscript; available in PMC 2021 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2019 May 7;115(530):794–809. doi: 10.1080/01621459.2019.1585251

Confidence Intervals for Sparse Penalized Regression with Random Designs

Guan Yu 1,*, Liang Yin 1,*, Shu Lu 1, Yufeng Liu 1
PMCID: PMC7716883  NIHMSID: NIHMS1572321  PMID: 33281249

Abstract

With the abundance of large data, sparse penalized regression techniques are commonly used in data analysis due to the advantage of simultaneous variable selection and estimation. A number of convex as well as non-convex penalties have been proposed in the literature to achieve sparse estimates. Despite intense work in this area, how to perform valid inference for sparse penalized regression with a general penalty remains an active research problem. In this paper, by making use of state-of-the-art optimization tools in stochastic variational inequality theory, we propose a unified framework to construct confidence intervals for sparse penalized regression with a wide range of penalties, including convex and non-convex penalties. We study the inference for parameters under the population version of the penalized regression as well as parameters of the underlying linear model. Theoretical convergence properties of the proposed method are obtained. Several simulated and real data examples are presented to demonstrate the validity and effectiveness of the proposed inference procedure.

Keywords: confidence interval, non-convex penalty, penalized regression, random design, variational inequality

1. Introduction

With the advantage of simultaneous variable selection and estimation, sparse penalized regression techniques have been widely used. By introducing bias into the estimators, sparse penalized regression can often select a simpler model and produce estimators with smaller mean squared errors than unpenalized regression. One well-known representative is the L1-penalized technique LASSO (Donoho and Johnstone, 1994; Tibshirani, 1996). LASSO has become a popular variable selection method due to its good selection performance and computational efficiency. Many other extensions with different penalties have been studied in the literature; see for example Fan and Li (2001); Zou and Hastie (2005); Candes and Tao (2007); Liu and Wu (2007); Lv and Fan (2009); Zhang (2010).

For computational implementation of these methods, there is a large literature on efficient algorithms. The LARS algorithm by Efron et al. (2004) and the coordinate-descent algorithm by Wu and Lange (2008) are two popular examples. Mazumder et al. (2011) proposed the SparseNet algorithm to deal with non-convex penalties. In terms of inference, much less progress has been made, especially for estimators from non-convex penalized regression. For the LASSO, one common approach is to first perform model selection and then carry out inference based on some pivot distributions conditional on the selected model; see, for example, Lee et al. (2016); Lockhart et al. (2014). This approach does not sufficiently account for the stochastic errors in the model selection step. Another popular approach is to achieve valid inference by adjusting the bias introduced by the L1 regularization term. Papers along this line include Javanmard and Montanari (2014); Van de Geer et al. (2014); Zhang and Zhang (2014). Recently, Lu et al. (2017) suggested using a variational inequality formulation to establish an asymptotic distribution of LASSO estimators that can be used to construct confidence intervals (CI) for the population LASSO parameters as well as the true model parameters. Other methods for high dimensional inference include Voorman et al. (2014); Ning et al. (2017); Zhao et al. (2017).
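As a concrete illustration of the coordinate-descent idea mentioned above, the following is a minimal sketch (not the authors' implementation) of cyclic coordinate descent for the LASSO objective (1/(2N))||y − Xb||² + λ||b||₁; the function names and iteration count are our own choices.

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator: the one-dimensional LASSO solution."""
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2N))||y - X b||^2 + lam * ||b||_1.
    Each coordinate update is the exact 1-D minimizer given the others."""
    N, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / N   # x_j' x_j / N for each column
    r = y - X @ b                        # current residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]                 # remove j's contribution
            zj = X[:, j] @ r / N                   # partial residual correlation
            b[j] = soft_threshold(zj, lam) / col_sq[j]
            r = r - X[:, j] * b[j]                 # restore residual
    return b
```

With a strong signal on the first coordinate and small noise, the fitted vector is sparse and the active coefficient is shrunk toward zero by roughly λ, which is the bias the debiasing literature cited above tries to correct.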

For sparse penalized regression with penalties more complex than the LASSO, not much work has been done on inference. In particular, with a non-convex penalty, the regression problem may have multiple local optimal solutions. This brings up the question of whether one can use a local solution to construct meaningful confidence intervals. The goal of this paper is to provide a unified framework to perform valid inference for sparse penalized regression with general penalties, based on a local solution. Our assumptions about the penalties are in line with the desired properties for regularized penalty functions given in Fan and Li (2001), namely, sparsity, unbiasedness, and continuity. For example, this framework can be applied to the adaptive LASSO penalty (Zou, 2006), the non-convex log penalty (Friedman, 2012), SCAD (Fan and Li, 2001), MCP (Zhang, 2010), the transformed $l_1$ penalty (Nikolova, 2000), and so on.

In this paper, we consider a general random-design penalized regression problem

$$\min_{\beta_0,\beta}\ \mathbb{E}\Big[Y-\beta_0-\sum_{i=1}^p\beta_iX_i\Big]^2+\sum_{j=1}^pP_{\lambda_j}(|\beta_j|),\tag{1}$$

where $X=(X_1,\dots,X_p)^T\in\mathbb{R}^p$ is an explanatory random vector with mean 0, and $Y$ is a response random variable. Here $\beta_0\in\mathbb{R}$ and $\beta=(\beta_1,\dots,\beta_p)\in\mathbb{R}^p$ are the regression parameters. For $j=1,2,\dots,p$, $P_{\lambda_j}(|\cdot|)$ is a general penalty for $\beta_j$ with the regularization parameter $\lambda_j$. This general penalty covers many convex and non-convex penalties.

The solution of (1) can be estimated by the solution of the corresponding sample average approximation (SAA) problem

$$\min_{\beta_0,\beta}\ \frac{1}{N}\big\|y-\beta_0\mathbf{1}_N-X\beta\big\|_2^2+\sum_{j=1}^pP_{\lambda_j}(|\beta_j|),\tag{2}$$

where

$$y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_N\end{bmatrix},\quad X=\begin{bmatrix}x_{11}&x_{12}&\cdots&x_{1p}\\x_{21}&x_{22}&\cdots&x_{2p}\\\vdots&\vdots&&\vdots\\x_{N1}&x_{N2}&\cdots&x_{Np}\end{bmatrix}=\begin{bmatrix}x_1^T\\x_2^T\\\vdots\\x_N^T\end{bmatrix},\quad \mathbf{1}_N=\begin{bmatrix}1\\1\\\vdots\\1\end{bmatrix}\in\mathbb{R}^N,$$

and (x_1, y_1), ⋯, (x_N, y_N) are independent samples of (X, Y). For each i = 1, ⋯, N and j = 1, ⋯, p, $x_{ij}\in\mathbb{R}$, $y_i\in\mathbb{R}$, and $x_i\in\mathbb{R}^{p\times 1}$. We refer to a local solution to the population penalized problem (1) as a population penalized parameter, and denote it as $(\tilde\beta_0,\tilde\beta)$. We refer to a local solution to the SAA problem (2) as a penalized estimator, and denote it as $(\hat\beta_0,\hat\beta)$.
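To make the SAA objective (2) concrete, here is a small sketch that evaluates it for a candidate $(\beta_0,\beta)$ and a separable penalty supplied as a function; all names are illustrative, and the plain L1 penalty is used only as an example.

```python
import numpy as np

def saa_objective(b0, b, X, y, penalty, lam):
    """Sample average approximation (2): (1/N)||y - b0*1 - X b||^2
    plus the separable penalty sum_j P_{lam_j}(|b_j|)."""
    N = X.shape[0]
    fit = np.sum((y - b0 - X @ b) ** 2) / N
    return fit + sum(penalty(abs(bj), lj) for bj, lj in zip(b, lam))

# Example penalty: the plain L1 penalty P_lam(|b|) = lam * |b|
l1 = lambda t, lam: lam * t
```

Any of the penalties listed in Section 2.2 (adaptive LASSO, log, transformed l1, SCAD, MCP) could be passed in place of `l1`.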

The population penalized parameter $(\tilde\beta_0,\tilde\beta)$ is closely related to the traditional least squares parameter. When $P_{\lambda_j}(|\beta_j|)=0$ for $j=1,\dots,p$, the problem (1) becomes the following population least squares problem:

$$\min_{\beta_0,\beta}\ \mathbb{E}\Big[Y-\beta_0-\sum_{i=1}^p\beta_iX_i\Big]^2,\tag{3}$$

which has a unique minimizer $(\mathbb{E}[XX^T])^{-1}\mathbb{E}[XY]$ when $\mathbb{E}[XX^T]$ is invertible. If additionally X and Y are related by the following linear model

$$Y=\beta_0^{true}+X^T\beta^{true}+\varepsilon\tag{4}$$

with $\mathbb{E}[\varepsilon|X]=0$, then the solution to the population least squares problem (3) is exactly $(\beta_0^{true},\beta^{true})$, which we refer to as the true model parameter. When $P_{\lambda_j}(|\beta_j|)>0$, $(\tilde\beta_0,\tilde\beta)$ is not exactly $(\beta_0^{true},\beta^{true})$, but there is a relation between $(\tilde\beta_0,\tilde\beta)$ and $(\beta_0^{true},\beta^{true})$ that will be described in Section 3.4.

The idea of our proposed method is to use the penalized estimator $(\hat\beta_0,\hat\beta)$ to derive confidence intervals for the population penalized parameter $(\tilde\beta_0,\tilde\beta)$, and then exploit the relation between $(\tilde\beta_0,\tilde\beta)$ and $(\beta_0^{true},\beta^{true})$ to derive confidence intervals for the true model parameter $(\beta_0^{true},\beta^{true})$ in the linear model (4). Therefore, our proposed method can construct confidence intervals for both $(\tilde\beta_0,\tilde\beta)$ and $(\beta_0^{true},\beta^{true})$. Note that valid inference for $(\tilde\beta_0,\tilde\beta)$ is also useful for problems such as cost-effective linear regression, which takes into account the cost of collecting variables. For each sample, suppose we need to spend $c_j$ dollars to collect the value of the jth variable $X_j$, where j = 1, 2, ..., p. For this special linear regression problem, in order to find a relatively cheap linear model with good prediction performance, we need to estimate the regression coefficient vector that minimizes an objective function balancing the expected prediction accuracy against the data collection cost, that is,

$$(\tilde\beta_0,\tilde\beta)\in\operatorname*{arg\,min}_{\beta_0,\beta}\ \mathbb{E}\Big(Y-\beta_0-\sum_{j=1}^pX_j\beta_j\Big)^2+\lambda\sum_{j=1}^pc_jP(|\beta_j|),$$

where P(x) is a continuously differentiable non-convex function approximating the indicator function I(x), which equals 1 if x > 0 and 0 if x = 0. The parameter λ can be selected according to the budget. In this example, the population penalized parameter becomes a reasonable target of inference, and our proposed method can deliver asymptotically exact confidence intervals for it. On the other hand, even when a model for the relation between X and Y is not available, the confidence intervals for $(\tilde\beta_0,\tilde\beta)$ still provide a measure of the randomness of the penalized estimators.
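As an illustration of the cost-aware objective above, the sketch below evaluates its SAA version with one possible smooth surrogate for the indicator, P(x) = x/(x + a); this choice of P, the parameter a, and all names are our own assumptions, not the paper's specification.

```python
import numpy as np

def indicator_approx(x, a=1e-2):
    """Smooth surrogate for I(x > 0): equals 0 at x = 0 and tends to 1
    for x >> a. (An illustrative choice, not the paper's specific P.)"""
    return x / (x + a)

def cost_objective(b0, b, X, y, costs, lam, a=1e-2):
    """SAA of the cost-aware objective: average squared prediction error
    plus lam * sum_j c_j * P(|b_j|), the approximate collection cost."""
    N = X.shape[0]
    fit = np.sum((y - b0 - X @ b) ** 2) / N
    cost = lam * np.sum(costs * indicator_approx(np.abs(b), a))
    return fit + cost
```

A coefficient set exactly to zero contributes no cost, so minimizing this objective trades prediction accuracy against the budget, as described in the text.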

The main techniques to construct confidence intervals for $(\tilde\beta_0,\tilde\beta)$ and $(\beta_0^{true},\beta^{true})$ consist of three steps. First, we transform problems (1) and (2) into their corresponding variational inequality and normal map formulations and obtain an asymptotic distribution of a solution to the normal map formulation of (2). Next, by finding reliable estimates for quantities that describe the asymptotic distribution, we provide methods to compute confidence intervals for $(\tilde\beta_0,\tilde\beta)$ based on a solution to the normal map formulation of (2). Finally, we establish the connection between $(\tilde\beta_0,\tilde\beta)$ and $(\beta_0^{true},\beta^{true})$, from which we obtain the bias-corrected estimator of $(\beta_0^{true},\beta^{true})$, its asymptotic distribution, and confidence intervals. The methodology in this paper is developed for a fixed dimension p, based on a local solution $(\hat\beta_0,\hat\beta)$ to (2). The confidence intervals we obtain for $(\tilde\beta_0,\tilde\beta)$ are for the local solution to (1) close to $(\hat\beta_0,\hat\beta)$. On the other hand, for any local solution $(\hat\beta_0,\hat\beta)$ of (2), we can always obtain confidence intervals for the true model parameter $(\beta_0^{true},\beta^{true})$. Indeed, under the setting considered in Section 3.4, a local solution $(\hat\beta_0,\hat\beta)$ almost surely converges to $(\beta_0^{true},\beta^{true})$.

Although our method and the method proposed in Lu et al. (2017) use similar techniques, there are important new contributions. First, we propose a unified framework to construct confidence intervals for a large class of penalties, including the method using the LASSO penalty (Lu et al. (2017)) as a special case. Second, for non-convex penalties, the construction of the confidence intervals and the theoretical studies are more involved. We propose a new transformation of the original optimization problem to its corresponding variational inequality and normal map formulations. Special technical conditions and theoretical results for general penalties are studied. Third, the proposed method based on non-convex penalties can deliver better confidence intervals than methods using convex penalties. For example, in our numerical studies, we compare the method using the MCP penalty with a = 2 against the method using the MCP penalty with a = 2000 (in which case the MCP penalty is very close to the LASSO penalty). Our numerical results indicate that the confidence intervals for a = 2 are generally shorter than those for a = 2000 with similar coverage rates. This is possibly due to the smaller bias imposed by the MCP penalty with a small a.

The rest of this paper is organized as follows. In Section 2, we show some background on variational inequality, and present the problem transformations. Section 3 discusses how to obtain the confidence intervals for the population penalized parameters as well as the true model parameters in the linear model. Some theoretical results about convergence properties are shown in this section. In Section 4, we present numerical results to illustrate the performance of the proposed method. Section 5 contains some discussion. The technical details of variational inequalities and proofs are shown in the supplementary materials.

Throughout this paper, we use $N(0,\Sigma)$ to denote a normal random vector with mean zero and covariance matrix $\Sigma$, and $Y_n\Rightarrow Y$ to represent weak convergence of a sequence of random variables $\{Y_n\}$ to $Y$. The inner product between two vectors x and y is denoted as $\langle x,y\rangle$. For a convex set S and a vector $z\in\mathbb{R}^n$, we use $\Pi_S(z)$ to denote the Euclidean projection of z onto S. A function $f:\mathbb{R}^n\to\mathbb{R}^m$ is said to be B-differentiable at a point $x_0\in\mathbb{R}^n$ if there exists a positively homogeneous function $df(x_0):\mathbb{R}^n\to\mathbb{R}^m$ such that $f(x_0+v)=f(x_0)+df(x_0)(v)+o(\|v\|)$. The function $df(x_0)$ is called the B-derivative of f at $x_0$.

2. Background and problem transformations

In this section, we first introduce some background on variational inequalities and normal maps. Then we introduce how to transform the problems (1) and (2) to their corresponding variational inequality and normal map formulations. Some assumptions for our theoretical analysis are also given in this section.

2.1. Background on variational inequalities and normal maps

We start with the definition of a variational inequality. Given a function $f:\mathbb{R}^n\to\mathbb{R}^n$ and a closed, convex set S in $\mathbb{R}^n$, the variational inequality associated with (f, S) is the problem of finding $x\in S$ such that

$$0\in f(x)+N_S(x),\tag{5}$$

where $N_S(x)$ is the normal cone to S at x, defined as

$$N_S(x)=\{v\in\mathbb{R}^n\mid \langle v,s-x\rangle\le 0\ \text{for each}\ s\in S\}.$$

Variational inequalities are closely related to optimization problems. Consider the problem of minimizing an objective function $F:\mathbb{R}^n\to\mathbb{R}$ over a closed and convex set $S\subseteq\mathbb{R}^n$. A well-known fact is that if $x\in S$ is a local solution to this minimization problem and F is differentiable at x, then x satisfies the variational inequality

$$0\in\nabla F(x)+N_S(x),$$

where $\nabla F:\mathbb{R}^n\to\mathbb{R}^n$ is the gradient of the function F. Conversely, if x satisfies the above variational inequality and F is a convex function, then x is a global minimizer of F over the set S.

Besides the above connection with the original minimization problem, the variational inequality can be equivalently formulated as an equation using a concept called the normal map. The normal map induced by f and S is a function $f_S:\mathbb{R}^n\to\mathbb{R}^n$ given by

$$f_S(z)=f(\Pi_S(z))+(z-\Pi_S(z))\quad\text{for each}\ z\in\mathbb{R}^n,$$

where $\Pi_S(z)$ denotes the Euclidean projection of z onto S. For any solution $x\in S$ to the variational inequality (5), the point $z=x-f(x)$ satisfies $\Pi_S(z)=x$ and

$$f_S(z)=0.\tag{6}$$

Conversely, for any solution z to (6), the point $x=\Pi_S(z)$ is a solution to (5) and satisfies $z=x-f(x)$. Equation (6) is called the normal map formulation of (5).

To understand the above relations, we consider the example of minimizing $F(x)=\frac12\|x-x_0\|_2^2$ for a fixed point $x_0\in\mathbb{R}^n$ over the convex set S. We know that the solution is the projection of $x_0$ onto S, denoted as $\Pi_S(x_0)$. For the function F(x), the gradient is $\nabla F(x)=x-x_0$. Thus, the variational inequality formulation (5) of this minimization problem is the problem of finding $x\in S$ such that $x_0-x\in N_S(x)$. That is, we need to find $x\in S$ such that $\langle x_0-x,s-x\rangle\le 0$ for each $s\in S$. We can show that the solution is $\Pi_S(x_0)$. On the other hand, the normal map induced by $\nabla F(x)$ and S is $f_S(z)=\nabla F(\Pi_S(z))+z-\Pi_S(z)=\Pi_S(z)-x_0+z-\Pi_S(z)=z-x_0$. Thus, the normal map formulation (6) of this minimization problem is $z-x_0=0$. The solution to this equation is $x_0$, and the point $x=\Pi_S(x_0)$ is the solution to the original minimization problem. More details about variational inequalities and normal maps can be found in the supplementary materials. In Sections 2.2 and 2.3 below, we show how to transform problems (1) and (2) to their corresponding normal map formulations.
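The projection example can be checked numerically. The sketch below takes S to be a box (our own choice for illustration), forms the normal map $f_S(z)=f(\Pi_S(z))+z-\Pi_S(z)$, and verifies that $z=x_0$ solves $f_S(z)=0$ while $\Pi_S(x_0)$ solves the original problem.

```python
import numpy as np

def proj_box(z, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box S = [lo, hi]^n."""
    return np.clip(z, lo, hi)

def normal_map(z, f, proj):
    """Normal map f_S(z) = f(Pi_S(z)) + z - Pi_S(z)."""
    x = proj(z)
    return f(x) + (z - x)

x0 = np.array([2.0, 0.5, -3.0])
f = lambda x: x - x0     # gradient of (1/2)||x - x0||^2
z = x0                   # the normal-map solution, as derived in the text
x = proj_box(z)          # the solution of the original problem
```

Here `x` equals `[1.0, 0.5, -1.0]`: the unconstrained coordinates are kept and the others are clipped to the box boundary, exactly as the projection characterization predicts.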

2.2. Transformations of the population penalized regression

In this subsection, we transform the optimization problem (1) into a normal map formulation. Before discussing details about the transformation, we introduce conditions on the penalties Pλi(·). In this subsection, as well as in Sections 2.3 and 3.1-3.3, λ = (λ1, ⋯, λp) > 0 is fixed.

Assumption 1.

  • (a)

    For each i = 1, 2, ⋯, p, $P_{\lambda_i}(\cdot)$ is nonnegative, nondecreasing, and continuously differentiable on $[0,+\infty)$ with $P'_{\lambda_i}(0)>0$.

  • (b)

    For any local solution $(\tilde\beta_0,\tilde\beta)$ to (1), the second derivative of $P_{\lambda_i}(t_i)$ is Lipschitz continuous in a neighborhood of $t_i=|\tilde\beta_i|$ for every i = 1, ⋯, p.

Many well-known penalty functions satisfy Assumption 1(a). We list five penalty functions as examples.

  • (a)

    The adaptive LASSO penalty (Zou, 2006), defined as $P_{\lambda_i}(|\beta_i|)=\lambda_i|\beta_i|$, where $\lambda_i$ is the weight for the ith coordinate.

  • (b)

    The log penalty (Friedman, 2012), defined as $P_{\lambda_i}(|\beta_i|)=\frac{\lambda_i}{\log(1+a)}\log(a|\beta_i|+1)$, where a > 0.

  • (c)

    The transformed $l_1$ penalty (Nikolova, 2000), defined as $P_{\lambda_i}(|\beta_i|)=\frac{\lambda_i(a+1)|\beta_i|}{a+|\beta_i|}$, where a > 0.

  • (d)
    The SCAD penalty (Fan and Li, 2001), defined as $P_\lambda(0)=0$ and
    $$P'_{\lambda_i}(|\beta_i|)=\lambda_i\Big\{1_{\{|\beta_i|\le\lambda_i\}}+\frac{(a\lambda_i-|\beta_i|)_+}{(a-1)\lambda_i}1_{\{|\beta_i|>\lambda_i\}}\Big\},\quad\text{where}\ a>2.\tag{7}$$
  • (e)
    The MCP penalty (Zhang, 2010), defined as
    $$P_{\lambda_i}(|\beta_i|)=\lambda_i\Big(|\beta_i|-\frac{\beta_i^2}{2a\lambda_i}\Big)1_{\{|\beta_i|<a\lambda_i\}}+\frac{a\lambda_i^2}{2}1_{\{|\beta_i|\ge a\lambda_i\}},\quad\text{where}\ a>0.\tag{8}$$

We can check that penalties (a), (b), and (c) satisfy Assumption 1(b). The SCAD and MCP penalties satisfy this assumption almost everywhere. Take the SCAD penalty, for example. It corresponds to a quadratic spline with two knots, at which it is not twice continuously differentiable. Assumption 1(b) requires that no local solution to (1) is located at these two knots for any i. This is a reasonable assumption, since the set of points at which twice continuous differentiability fails has measure zero.
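For concreteness, the first derivatives of the SCAD and MCP penalties in (7) and (8) can be coded directly; this is an illustrative sketch with our own function names and default values for a. Both derivatives are positive at 0, consistent with Assumption 1(a), and vanish beyond a threshold, which is why these penalties are flat far from the origin.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (7) on t >= 0; requires a > 2."""
    return lam * np.where(t <= lam, 1.0,
                          np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))

def mcp_deriv(t, lam, a=2.0):
    """Derivative of the MCP penalty (8) on t >= 0; requires a > 0."""
    return np.where(t < a * lam, lam - t / a, 0.0)
```

For instance, with λ = 1 the SCAD derivative equals λ on [0, λ], decays linearly on (λ, aλ], and is 0 beyond aλ; MCP decays linearly from λ at 0 to 0 at aλ.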

In the assumption below, part (a) ensures that the objective function of (1) is finite valued, and part (b) will be used in proving convergence results.

Assumption 2.

  • (a)

    The expectations $\mathbb{E}[X_1^2],\dots,\mathbb{E}[X_p^2]$ and $\mathbb{E}[Y^2]$ are finite.

  • (b)

    The expectations $\mathbb{E}[X_1^4],\dots,\mathbb{E}[X_p^4]$ and $\mathbb{E}[Y^4]$ are finite.

Next, we transform problem (1) into a normal map formulation in three steps. In the first step, we introduce an equivalent problem, in which a new variable $t\in\mathbb{R}^p$ is added to eliminate the non-smooth term $\sum_{i=1}^pP_{\lambda_i}(|\beta_i|)$ from the objective function of (1). The new problem is as follows:

$$\min_{\beta_0,\beta,t}\ \mathbb{E}\Big[Y-\beta_0-\sum_{i=1}^p\beta_iX_i\Big]^2+\sum_{i=1}^pP_{\lambda_i}(t_i)+m\big(\|t\|_2^2-\|\beta\|_2^2\big)\quad\text{s.t.}\ t_i-\beta_i\ge 0,\ t_i+\beta_i\ge 0,\ i=1,\dots,p,\tag{9}$$

where m is a positive constant. If we define $S_i\subseteq\mathbb{R}^2$ as

$$S_i=\{(\beta_i,t_i)\mid t_i-\beta_i\ge 0,\ t_i+\beta_i\ge 0\},\quad i=1,\dots,p,\tag{10}$$

and write

$$(\beta_0,\beta,t)=(\beta_0,\beta_1,t_1,\beta_2,t_2,\dots,\beta_p,t_p),\tag{11}$$

then we can treat the feasible set of (9), denoted by S, as a Cartesian product

$$S=\mathbb{R}\times\prod_{i=1}^pS_i.\tag{12}$$

We will use the two ways of ordering of (β0, β, t) in (11) interchangeably for notational convenience.

Note that the above transformation is different from the method in Lu et al. (2017) for the LASSO penalty. The term $m(\|t\|_2^2-\|\beta\|_2^2)$ is added to the objective function of (9) in order to ensure $t_i=|\beta_i|$ in any optimal solution to (9), so that there is a one-to-one correspondence between the optimal solutions to (1) and (9). This is necessary and important when the penalty functions are not strictly increasing on $[0,+\infty)$. For instance, some non-convex penalties such as SCAD and MCP are flat on intervals of the form $[d_i,+\infty)$.
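The projection onto each set $S_i=\{(\beta_i,t_i): t_i\ge|\beta_i|\}$, which is needed whenever $\Pi_S$ is evaluated, has a simple closed form since $S_i$ is the epigraph of the absolute value in the plane. A minimal sketch (our own implementation):

```python
import numpy as np

def proj_Si(beta_i, t_i):
    """Euclidean projection onto S_i = {(b, t) : t >= |b|} in (10)."""
    if t_i >= abs(beta_i):
        return beta_i, t_i          # already feasible: keep the point
    if t_i <= -abs(beta_i):
        return 0.0, 0.0             # in the polar cone: project to the origin
    s = (abs(beta_i) + t_i) / 2.0   # coordinate along the nearest boundary ray
    return np.sign(beta_i) * s, s
```

Since S in (12) is the product of $\mathbb{R}$ with the $S_i$'s, projecting onto S amounts to keeping the $\beta_0$ coordinate and applying `proj_Si` to each pair $(\beta_i,t_i)$.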

In the second step, we transform (9) into a variational inequality. To this end, we need to write down the gradient of its objective function. Define a function $F:\mathbb{R}\times\mathbb{R}^p\times\mathbb{R}^p\times\mathbb{R}^p\times\mathbb{R}\to\mathbb{R}^{2p+1}$ as

$$F(\beta_0,\beta,t,X,Y)=\begin{bmatrix}-2\big(Y-\beta_0-\sum_{i=1}^p\beta_iX_i\big)\\-2\big(Y-\beta_0-\sum_{i=1}^p\beta_iX_i\big)X-2m\beta\\\big(P'_{\lambda_i}(t_i)+2mt_i\big)_{i=1}^p\end{bmatrix}.\tag{13}$$

Furthermore, define a function $f_0:\mathbb{R}\times\mathbb{R}^p\times\mathbb{R}^p\to\mathbb{R}^{2p+1}$ as

$$f_0(\beta_0,\beta,t)=\mathbb{E}[F(\beta_0,\beta,t,X,Y)].\tag{14}$$

The function $f_0$ is well defined and finite valued under Assumption 2(a). If $P_{\lambda_i}(t_i)$ is twice differentiable at $t_i$ for every i = 1, ⋯, p, then we can write down the derivative of F with respect to $(\beta_0,\beta,t)$ as

$$d_1F(\beta_0,\beta,t,X,Y)=\begin{bmatrix}2&2X^T&0\\2X&2XX^T-2mI_p&0\\0&0&\operatorname{diag}\big(P''_{\lambda_i}(t_i)+2m\big)_{i=1}^p\end{bmatrix},\tag{15}$$

where $\operatorname{diag}(P''_{\lambda_i}(t_i)+2m)_{i=1}^p$ represents the diagonal matrix whose ith diagonal element is $P''_{\lambda_i}(t_i)+2m$, and $I_p$ is the p × p identity matrix. Moreover, since X has mean zero, the Jacobian matrix of $f_0$ is

$$L(t)=\mathbb{E}[d_1F(\beta_0,\beta,t,X,Y)]=\begin{bmatrix}2&0&0\\0&2\mathbb{E}[XX^T]-2mI_p&0\\0&0&\operatorname{diag}\big(P''_{\lambda_i}(t_i)+2m\big)_{i=1}^p\end{bmatrix}.\tag{16}$$

The lemma below shows that there is a one-to-one correspondence between the (local or global) optimal solutions to (1) and (9).

Lemma 1.

Suppose Assumptions 1(a) and 2(a) hold. Then the objective function of (9) is finite valued on $\mathbb{R}^{2p+1}$, and its gradient at each $(\beta_0,\beta,t)\in\mathbb{R}^{2p+1}$ is $f_0(\beta_0,\beta,t)$. If $(\tilde\beta_0,\tilde\beta,\tilde t)$ is a (local) optimal solution to (9), then $\tilde t_i=|\tilde\beta_i|$ for all $i=1,\dots,p$, and $(\tilde\beta_0,\tilde\beta)$ is a (local) optimal solution to (1). Conversely, if $(\tilde\beta_0,\tilde\beta)$ is a (local) optimal solution to (1), then $(\tilde\beta_0,\tilde\beta,\tilde t)$ is a (local) optimal solution to (9), where $\tilde t_i=|\tilde\beta_i|$ for all $i=1,\dots,p$.

If Assumption 1(b) holds additionally, then the Hessian matrix of the objective function of (9) at (β˜0,β˜,t˜) is L(t˜).

In view of Lemma 1, we can transform (9) to the following variational inequality:

$$-f_0(\beta_0,\beta,t)\in N_S(\beta_0,\beta,t).\tag{17}$$

In the last step, we state the normal map formulation for (17). Let (f0)S be the normal map induced by f0 and S. Then the normal map formulation for (17) is

$$(f_0)_S(z)=0,\quad z\in\mathbb{R}^{2p+1}.\tag{18}$$

For the rest of the paper, let $(\tilde\beta_0,\tilde\beta,\tilde t)$ be a local solution to (9). Then $(\tilde\beta_0,\tilde\beta,\tilde t)$ is also a solution to (17). Therefore, the point $z_0\in\mathbb{R}^{2p+1}$ defined as

$$z_0=(\tilde\beta_0,\tilde\beta,\tilde t)-f_0(\tilde\beta_0,\tilde\beta,\tilde t)\tag{19}$$

is a solution to (18) and satisfies $\Pi_S(z_0)=(\tilde\beta_0,\tilde\beta,\tilde t)$.

Let $\Sigma_0$ be the covariance matrix of $F(\tilde\beta_0,\tilde\beta,\tilde t,X,Y)$. We can check that $\Sigma_0$ is well defined if Assumption 2(b) holds. Let $\Sigma_{01}$ be the upper left (p + 1) × (p + 1) submatrix of $\Sigma_0$. Since the last p elements of $F(\tilde\beta_0,\tilde\beta,\tilde t,X,Y)$ are not random, we have $\Sigma_0=\begin{bmatrix}\Sigma_{01}&0\\0&0\end{bmatrix}$. In our theoretical analysis in Section 3, the B-derivative of the normal map $(f_0)_S$ at $z_0$ plays an important role in the construction of the confidence intervals. To study the properties of this B-derivative, we need the following assumption.

Assumption 3.

Let $(\tilde\beta_0,\tilde\beta)$ be a local solution to (1), and define $\tilde t\in\mathbb{R}^p$ and $\tilde q\in\mathbb{R}^p$ by

$$\tilde t_i=|\tilde\beta_i|\quad\text{and}\quad \tilde q_i=\mathbb{E}\Big[-2\Big(Y-\tilde\beta_0-\sum_{j=1}^p\tilde\beta_jX_j\Big)X_i\Big]\quad\text{for each}\ i=1,\dots,p.$$

Let $J$ be a subset of $\{1,\dots,p\}$ defined as

$$J=\Big\{i\in\{1,\dots,p\}\ \Big|\ \tilde\beta_i\neq 0\ \text{or}\ \big(\tilde\beta_i=0\ \text{and}\ |\tilde q_i|=|P'_{\lambda_i}(\tilde t_i)|\big)\Big\},$$

and denote $L(\tilde t)$ in (16) by L. Let $Q_1$ be the submatrix of L consisting of the intersections of the columns and rows of L with indices in $\{1\}\cup\{i+1:i\in J\}$, and let $Q_2$ be the submatrix of L consisting of the intersections of the columns and rows of L with indices in $\{i+p+1:i\in J\}$. Define the matrix Q as

$$Q=Q_1+\begin{bmatrix}0&0\\0&Q_2\end{bmatrix}.\tag{20}$$

Assume that Q is nonsingular.

In the above assumption, $Q_1$ is a submatrix of the upper left (p + 1) × (p + 1) submatrix of L, and $Q_2$ is a submatrix of the lower right p × p submatrix of L. If $(\tilde\beta_0,\tilde\beta)$ is a solution to the optimization problem (1), then for every $i\in\{1,\dots,p\}$ we have either $\tilde\beta_i\neq 0$ and $|\tilde q_i|=|P'_{\lambda_i}(\tilde t_i)|$, or $\tilde\beta_i=0$ and $|\tilde q_i|\le|P'_{\lambda_i}(\tilde t_i)|$. Since $\tilde\beta_i=0$ and $|\tilde q_i|<|P'_{\lambda_i}(\tilde t_i)|$ for some coefficients, |J| is generally not equal to p. Furthermore, since the dimension p is fixed in our theoretical analysis, the matrix Q can be nonsingular in many cases. The nonsingularity of Q is a standard assumption to guarantee that $(\tilde\beta_0,\tilde\beta)$ is a locally unique optimal solution.

As shown in Robinson (1995), the B-derivative of the normal map (f0)S at z0 is the same as the normal map LK induced by the linear function defined by the matrix L and the critical cone K to S associated with z0, defined as

$$K=\{w\in T_S(\Pi_S(z_0))\mid \langle z_0-\Pi_S(z_0),w\rangle=0\}=\{w\in T_S(\tilde\beta_0,\tilde\beta,\tilde t)\mid \langle f_0(\tilde\beta_0,\tilde\beta,\tilde t),w\rangle=0\},\tag{21}$$

where for each $x\in S$,

$$T_S(x)=\big\{w\in\mathbb{R}^n\ \big|\ \exists\,\{x^k\}\subseteq S\ \text{and}\ \{\tau_k\}\ \text{such that}\ x^k\to x,\ \tau_k\downarrow 0,\ \text{and}\ (x^k-x)/\tau_k\to w\big\}$$

is the tangent cone to S at x. To be specific, the normal map $L_K$ is defined as $L_K(z)=L\Pi_K(z)+z-\Pi_K(z)$ for any $z\in\mathbb{R}^{2p+1}$. The tangent cone $T_S(x)$ contains all the directions along which x can be approached by a sequence of points in S converging to x. Lemma 2 below shows that $L_K$ is a global homeomorphism from $\mathbb{R}^{2p+1}$ to $\mathbb{R}^{2p+1}$ (a continuous bijective function from $\mathbb{R}^{2p+1}$ to $\mathbb{R}^{2p+1}$ whose inverse function is also continuous). In the proof of Lemma 2, we give the explicit expression of the critical cone K.

Lemma 2.

Suppose that Assumptions 1, 2(a) and 3 hold. Then the normal map $L_K$ is a global homeomorphism from $\mathbb{R}^{2p+1}$ to $\mathbb{R}^{2p+1}$, and there is a neighborhood of $(\tilde\beta_0,\tilde\beta,\tilde t)$ in which it is the unique local solution to (9).

Combining Lemmas 1 and 2, we conclude that the assumptions in Lemma 2 guarantee that $(\tilde\beta_0,\tilde\beta)$ is the unique local solution to (1) in a neighborhood of itself.

2.3. Transformations of the SAA problem

We follow the same steps as in Subsection 2.2 to formulate the SAA problem (2) as a normal map equation. First, by introducing the variable $t\in\mathbb{R}^p$, we transform the SAA problem (2) into the following equivalent problem:

$$\min_{(\beta_0,\beta,t)\in S}\ \frac{1}{N}\sum_{i=1}^N\Big[y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Big]^2+\sum_{i=1}^pP_{\lambda_i}(t_i)+m\big(\|t\|_2^2-\|\beta\|_2^2\big).\tag{22}$$

Second, we rewrite (22) as a variational inequality

$$0\in f_N(\beta_0,\beta,t)+N_S(\beta_0,\beta,t),\tag{23}$$

where $f_N(\beta_0,\beta,t)=N^{-1}\sum_{i=1}^NF(\beta_0,\beta,t,x_i,y_i)$. If $P_{\lambda_i}(t_i)$ is twice differentiable at $t_i$ for every i = 1, ⋯, p, then the Jacobian matrix of $f_N$ is given by

$$L_N(t)=df_N(\beta_0,\beta,t)=\begin{bmatrix}2&\frac{2}{N}\sum_{i=1}^Nx_i^T&0\\\frac{2}{N}\sum_{i=1}^Nx_i&\frac{2}{N}\sum_{i=1}^Nx_ix_i^T-2mI_p&0\\0&0&\operatorname{diag}\big(P''_{\lambda_i}(t_i)+2m\big)_{i=1}^p\end{bmatrix}.\tag{24}$$

Third, denoting the normal map induced by fN and S by (fN)S, we obtain the normal map formulation of (23) as

(fN)S(z)=0. (25)

Let $(\hat\beta_0,\hat\beta,\hat t)$ be a local solution to (22). Then $(\hat\beta_0,\hat\beta,\hat t)$ is also a solution to (23), so the point $z_N\in\mathbb{R}^{2p+1}$ defined as

$$z_N=(\hat\beta_0,\hat\beta,\hat t)-f_N(\hat\beta_0,\hat\beta,\hat t)\tag{26}$$

is a solution to (25) and satisfies $\Pi_S(z_N)=(\hat\beta_0,\hat\beta,\hat t)$.
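The map F in (13), its sample average $f_N$, and the point $z_N$ in (26) can be computed directly from data once a local solution is available. A minimal sketch under the ordering $(\beta_0,\beta,t)$, with the penalty derivative supplied as a function (all names are our own):

```python
import numpy as np

def F_map(b0, b, t, x, y, pen_deriv, m=1.0):
    """Per-sample map F in (13): the gradient of the smooth part of (9)."""
    r = y - b0 - x @ b
    return np.concatenate(([-2.0 * r],
                           -2.0 * r * x - 2.0 * m * b,
                           pen_deriv(t) + 2.0 * m * t))

def f_N(b0, b, t, X, Y, pen_deriv, m=1.0):
    """SAA map f_N: the average of F over the sample, as in (23)."""
    return np.mean([F_map(b0, b, t, xi, yi, pen_deriv, m)
                    for xi, yi in zip(X, Y)], axis=0)

def z_from_solution(b0, b, t, X, Y, pen_deriv, m=1.0):
    """z_N = (b0, b, t) - f_N(b0, b, t), as in (26)."""
    point = np.concatenate(([b0], b, t))
    return point - f_N(b0, b, t, X, Y, pen_deriv, m)
```

At a local solution of (22), projecting the returned point back onto S recovers $(\hat\beta_0,\hat\beta,\hat t)$, which is how $z_N$ is used in the inference procedure.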

In fact, under Assumptions 1, 2 and 3, $z_N$ is a locally unique solution to (25) when N is large enough, and it converges to a solution $z_0$ of (18). This result will be shown in Subsection 3.1. Correspondingly, $(\hat\beta_0,\hat\beta,\hat t)$ is a locally unique solution to (22) and converges to a local solution $(\tilde\beta_0,\tilde\beta,\tilde t)$ of (9). Let $\Sigma_N$ be the sample covariance matrix of $\{F(\hat\beta_0,\hat\beta,\hat t,x_i,y_i)\}_{i=1}^N$ and $\Sigma_{N1}$ be the upper left (p + 1) × (p + 1) submatrix of $\Sigma_N$; then $\Sigma_N=\begin{bmatrix}\Sigma_{N1}&0\\0&0\end{bmatrix}$. Lemma 3 of Lu et al. (2017) shows that $\Sigma_N$ converges to $\Sigma_0$ almost surely as $N\to\infty$ for the LASSO penalty. We can similarly prove the same convergence result for a general penalty under Assumptions 1–4. Assumption 4 is stated as follows.

Assumption 4.

  • (a)
    For each $h\in\mathbb{R}^{2p+1}$ and $(\beta_0,\beta,t)\in\mathbb{R}^{2p+1}$, let
    $$M_{\beta_0,\beta,t}(h)=\mathbb{E}\big[\exp\{\langle h,\,F(\beta_0,\beta,t,X,Y)-f_0(\beta_0,\beta,t)\rangle\}\big]$$
    be the moment generating function of the random variable $F(\beta_0,\beta,t,X,Y)-f_0(\beta_0,\beta,t)$. Let C be a compact set in $\mathbb{R}^{2p+1}$ that contains $(\tilde\beta_0,\tilde\beta,\tilde t)$ in its interior, and on which the second derivative of $P_{\lambda_i}(t_i)$ is Lipschitz continuous for each i = 1, ⋯, p. Assume the following conditions.
    1. There exists a constant $\zeta>0$ such that $M_{\beta_0,\beta,t}(h)\le\exp\{\zeta^2\|h\|^2/2\}$ for each $h\in\mathbb{R}^{2p+1}$ and $(\beta_0,\beta,t)\in C$.
    2. There exists a nonnegative random variable κ(X, Y) such that
      $$\|F(\beta_0,\beta,t,X,Y)-F(\beta_0',\beta',t',X,Y)\|\le\kappa(X,Y)\,\|(\beta_0,\beta,t)-(\beta_0',\beta',t')\|$$
      for all $(\beta_0,\beta,t)$ and $(\beta_0',\beta',t')$ in C and almost every (X, Y).
    3. The moment generating function of κ is finite valued in a neighborhood of zero.
  • (b)

    The same conditions as in (a) for d1F(β0, β, t, X, Y) instead of F(β0, β, t, X, Y). Accordingly, use E[d1F(β0, β, t, X, Y)] to replace f0(β0, β, t) in the conditions.

  • (c)

    The same conditions as in (a) for F(β0, β, t, X, Y)F(β0, β, t, X, Y)^T. Accordingly, use E[F(β0, β, t, X, Y)F(β0, β, t, X, Y)^T] to replace f0(β0, β, t) in the conditions.

Assumption 4(a) imposes conditions on the random variable F(β0, β, t, X, Y) as well as on the penalty terms. It holds if (X, Y) is a bounded random variable and Assumption 1(b) holds. Assumption 4(a) is used to ensure that the SAA function $f_N$ converges to $f_0$ in probability at an exponential rate. We state the result in the following lemma.

Lemma 3.

Suppose that Assumptions 1, 2 and 4(a) hold. Then there exist positive real numbers δ1, μ1, M1 and σ1 such that the following holds for each ϵ>0 and each sufficiently large N:

$$\mathrm{Prob}\Big\{\sup_{(\beta_0,\beta,t)\in C}\|f_N(\beta_0,\beta,t)-f_0(\beta_0,\beta,t)\|\ge\epsilon\Big\}\le\delta_1\exp\{-N\mu_1\}+\frac{M_1}{\epsilon^{2p+1}}\exp\{-N\epsilon^2\sigma_1\}.\tag{27}$$

Parts (b) and (c) of Assumption 4 impose the same type of assumptions on different random variables. Assumption 4(a-b) is needed to construct a reliable estimate for an unknown quantity in the asymptotic distribution in Theorem 1. Assumption 4(c) is only needed when the matrix Σ01 is singular.

3. Construction of confidence intervals using stochastic variational inequality techniques

In this section, we show the proposed method and some related theoretical results to construct confidence intervals using stochastic variational inequality techniques. We first develop the limiting distribution of SAA solutions in Section 3.1. Then, in Section 3.2, we show how to estimate the unknown quantities in the limiting distribution. The construction of the confidence intervals for the population penalized parameters and the true model parameters in the underlying linear model will be studied in Section 3.3 and Section 3.4, respectively. To present our proposed inference method clearly, we outline the procedures to construct confidence intervals for the true model parameters in Table 1. The extension to the high dimensional case is provided in Section 3.5.

Table 1:

Construction of the 100(1 − α)% confidence intervals of $\beta_0^{true}$ and $\beta^{true}$

Step 1. Find the penalized estimates $\hat\beta_0$ and $\hat\beta$ by solving the SAA problem (2). The tuning parameters are chosen by the Generalized Information Criterion (GIC).
Step 2. Calculate the solution of the normal map formulation (25),
$z_N=(\hat\beta_0,\hat\beta,\hat t)-f_N(\hat\beta_0,\hat\beta,\hat t)$, where $\hat t_i=|\hat\beta_i|$, $i=1,\dots,p$.
Step 3. Calculate $(\hat\beta_0^{true},\hat\beta^{true})=G^*(z_N)$, where $G^*$ is the function defined in (59).
Step 4. If $\hat h_i$ is very close to 0 for every i ∈ {0, 1, 2, ..., p}, we consider Case I to construct individual confidence intervals approximately. Otherwise, we consider Case II.
  Case I: the 100(1 − α)% confidence interval of $\beta_i^{true}$ is $[\hat\beta_i^{true}-\Phi^{-1}(1-\alpha/2)\cdot\tau_i,\ \hat\beta_i^{true}+\Phi^{-1}(1-\alpha/2)\cdot\tau_i]$, where $\tau_i=\sqrt{H_N\Sigma_NH_N^T/N}$ and $H_N$ is defined in Theorem 5;
  Case II: we first use simulation to estimate the $100(1-\alpha/2)\%$ percentile of $\hat R_{i+1}(Z)$, where $\hat R_{i+1}$ is defined by (56) and $Z\sim N(0,I_{p+1})$. The estimated percentile is denoted as $\eta_i$. The 100(1 − α)% confidence interval of $\beta_i^{true}$ is $[\hat\beta_i^{true}-\eta_i/\sqrt{N},\ \hat\beta_i^{true}+\eta_i/\sqrt{N}]$.
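For Case I of Table 1, once the plug-in standard error $\tau_i$ has been obtained, the interval is a standard normal-quantile construction. A minimal sketch of that final step (the computation of $\tau_i$ itself, which needs $H_N$ and $\Sigma_N$, is not shown):

```python
from statistics import NormalDist

def case1_ci(beta_hat, tau, alpha=0.05):
    """Case I interval from Table 1: beta_hat +/- Phi^{-1}(1 - alpha/2) * tau,
    where tau is the plug-in standard error of the bias-corrected estimate."""
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # standard normal quantile
    return beta_hat - q * tau, beta_hat + q * tau
```

For example, `case1_ci(1.0, 0.1)` returns the familiar interval 1.0 ± 1.96 × 0.1 at the 95% level.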

3.1. Convergence and distribution of SAA solutions

Based on Lemma 2 and the relation between (9) and (18), Assumptions 1–3 guarantee that $z_0$ defined in (19) is a locally unique solution to (18). Furthermore, we show in Theorem 1 below that for sufficiently large N, (25) has a unique solution $z_N$ in a neighborhood of $z_0$, and that $z_N$ converges almost surely to $z_0$. This theorem also provides results on asymptotic distributions and convergence rates.

Theorem 1.

Suppose that Assumptions 1, 2 and 3 hold. Then, with probability 1, there exist neighborhoods $Z$ of $z_0$ and $C_0$ of $(\tilde\beta_0,\tilde\beta,\tilde t)$ such that for sufficiently large N, the equation (25) has a unique solution $z_N$ in $Z$, and the variational inequality (23) has a unique solution in $C_0$ given by $(\hat\beta_0,\hat\beta,\hat t)=\Pi_S(z_N)$. Moreover,

$$\lim_{N\to\infty}z_N=z_0\ \ a.s.,\qquad \lim_{N\to\infty}(\hat\beta_0,\hat\beta,\hat t)=(\tilde\beta_0,\tilde\beta,\tilde t)\ \ a.s.,\tag{28}$$
$$\sqrt{N}(z_N-z_0)\Rightarrow(L_K)^{-1}\big(N(0,\Sigma_0)\big),\tag{29}$$

and

$$\sqrt{N}\,L_K(z_N-z_0)\Rightarrow N(0,\Sigma_0).\tag{30}$$

In addition, if Assumption 4(a)-(b) holds, then there exist positive real numbers $\epsilon_0,\delta_0,\mu_0,M_0$ and $\sigma_0$ such that for each $\epsilon\in(0,\epsilon_0]$ and each sufficiently large N,

$$\mathrm{Prob}\big\{\|(\hat\beta_0,\hat\beta,\hat t)-(\tilde\beta_0,\tilde\beta,\tilde t)\|<\epsilon\big\}\ge\mathrm{Prob}\big\{\|z_N-z_0\|<\epsilon\big\}\ge 1-\delta_0\exp\{-N\mu_0\}-\frac{M_0}{\epsilon^{2p+1}}\exp\{-N\epsilon^2\sigma_0\}.\tag{31}$$

In Theorem 1, $L_K$ is the normal map induced by the linear function $L(\tilde t)$ in (16) and the critical cone K defined in (21). We use $L_K^{-1}$ to denote its inverse function. The functions $\Pi_K$, $L_K$ and $L_K^{-1}$ are linear if K is a subspace, and piecewise linear otherwise. Compared to Theorem 1 of Lu et al. (2017), which considers the LASSO penalty, Theorem 1 here handles general penalties that satisfy Assumption 1. The results in Theorem 1 are used in the construction of confidence intervals for the population penalized parameter as well as the true model parameter, as shown in the following sections.

3.2. Estimators of Σ0 and LK

In order to use (29) and (30) to obtain computable confidence regions and intervals, we need to find reliable estimators of $\Sigma_0$ and $L_K$, as we discuss in this subsection. One can show that $\Sigma_N$ converges to $\Sigma_0$ almost surely under Assumptions 1–4; see the remarks below (26). Therefore, we use $\Sigma_N$ to estimate $\Sigma_0$. Our main task in this subsection is to introduce an estimator of the normal map $L_K$, knowing that $L_K$ is exactly $d(f_0)_S(z_0)$ (Robinson, 1995), the B-derivative of $(f_0)_S$ at $z_0$. Let $d\Pi_S(z)$ be the B-derivative of the Euclidean projector $\Pi_S$ at z. Since S is a polyhedral convex set in $\mathbb{R}^{2p+1}$, $\Pi_S$ coincides with a different affine function on each (2p + 1)-cell in the normal manifold of S (see Table 1 in the supplementary materials for definitions of the normal manifold and cells). The B-derivative $d\Pi_S(z)$ is a linear function for points z in the interior of each such cell, and is piecewise linear for z on the boundary. Moreover, $d\Pi_S(z)$ is not continuous with respect to z at points z on the boundary of any (2p + 1)-cell. Therefore, the function $d(f_0)_S(z)$ is generally not continuous with respect to z at such points, which can be seen from the chain rule of B-differentiability:

$$d(f_0)_S(z)(h)=L(t)\,d\Pi_S(z)(h)+h-d\Pi_S(z)(h)\quad\text{for each}\ z\in\mathbb{R}^{2p+1},\ h\in\mathbb{R}^{2p+1}.$$

If d(f0)S(·) is not continuous at z0, then d(f0)S(zN) is not guaranteed to converge to d(f0)S(z0) even though zN converges to z0. To introduce the estimators of LK, we will consider two cases based on the location of z0.

For each i = 1, ⋯, p, denote the 9 cells in the normal manifold of Si as Cij, j = 0, 1, ⋯, 8 (see Figure 1). According to (10), we derive the constraints defining each Cij, which are listed in Table 1 in the supplementary materials. That table also lists the critical cones Kij to Si associated with a point in the relative interior of Cij. Each (2p + 1)-cell in the normal manifold of S can then be written as ℝ × Πi=1p Ciγ(i), where γ(i) = 0, ⋯, 8 for each i = 1, ⋯, p. From (19), Assumption 1(a) and Lemma 2, we notice that ((z0)2i, (z0)2i+1) can only appear in the relative interior of Ci3, Ci4, Ci6, Ci7 or Ci8 for each i. Consequently, dΠS(z) is not continuous at z0 if and only if ((z0)2i, (z0)2i+1) is in the relative interior of Ci3 or Ci4 for some index i. The two cases are defined below, where the first case corresponds to the situation in which the random variable (LK)−1(N(0, Σ0)) is normally distributed, and the second case is for situations in which LK is a piecewise linear function.

Figure 1:


The normal manifold of Si (left) and Ei0, ⋯, Ei8 (right).

  • Case I: In this case, ((z0)2i, (z0)2i+1) is in the relative interior of Ci6, Ci7 or Ci8 for all i ∈ {1, ⋯, p}, and the normal map LK and the B-derivative dΠS(z0) are linear functions. Since d(f0)S(z) is continuous at z0 in this case, we can use dΠS(zN) and d(fN)S(zN) as the estimators of dΠS(z0) and LK respectively.

  • Case II: In this case, ((z0)2i, (z0)2i+1) is in the relative interior of Ci3 or Ci4 for some index i ∈ {1, ⋯, p}, and LK and dΠS(z0) are piecewise linear functions. Since d(f0)S(z) is generally not continuous at z0 in this case, we have to derive an estimator of LK other than d(fN)S(zN).

In both cases, d(fN)S(zN) is an invertible linear map with high probability (Lu, 2014b, Proposition 3.5). While it is reasonable to expect Case I to occur more often than Case II in practice, one cannot identify Case I in advance since z0 is unknown. To derive an estimator of LK, we first give the expression of dΠS(z), and then construct an asymptotically exact approximation of dΠS(z0). According to (12), we have

dΠS(z)(h) = (βˇ0, dΠS1(β1, t1)(βˇ1, tˇ1), ⋯, dΠSp(βp, tp)(βˇp, tˇp)), (32)

for each z = (β0, β, t) and h = (βˇ0, βˇ, tˇ). We denote dΠSi(βi, ti) in the relative interior of each Cij by a function ψj: ℝ² → ℝ² (j = 0, 1, ⋯, 8). Define four matrices

A1 = [1, 0; 0, 1],  A2 = [1/2, 1/2; 1/2, 1/2],  A3 = [1/2, −1/2; −1/2, 1/2],  A4 = [0, 0; 0, 0],

with rows separated by semicolons.

Table 2 in the supplementary materials shows the expression of each ψj using these matrices. Denote the relative interior of ℝ × Πi=1p Ciγ(i) by ri(ℝ × Πi=1p Ciγ(i)). For all z ∈ ri(ℝ × Πi=1p Ciγ(i)), we can write dΠS(z) as

Ψγ(z)(h) = (βˇ0, ψγ(1)(βˇ1, tˇ1), ⋯, ψγ(p)(βˇp, tˇp))  for each h = (βˇ0, βˇ, tˇ), (33)

where γ(z) = (γ(1), ⋯, γ(p)) is such that z ∈ ri(ℝ × Πi=1p Ciγ(i)).

Table 2:

Coverage rates and average lengths of 95% individual CIs for population penalized parameters (β˜0,β˜) for different MCP penalties from 500 replications with sample size N = 300 generated in Example 1.

a = 2 a = 2000

λ = 0.5 λ = 1 λ = 2 λ = 0.5 λ = 1 λ = 2
β˜ CR Len β˜ CR Len β˜ CR Len β˜ CR Len β˜ CR Len β˜ CR Len
β0 0 0.99 0.26 0 0.99 0.27 0 0.98 0.39 0 0.98 0.28 0 0.98 0.32 0 0.98 0.46
β1 3 0.97 0.30 3.13 0.97 0.36 3.37 0.95 0.88 2.83 0.97 0.32 2.67 0.98 0.38 2.33 0.96 0.56
β2 1.5 0.97 0.30 1.25 0.97 0.49 0.51 0.96 0.84 1.36 0.98 0.33 1.22 0.98 0.38 0.94 0.98 0.56
β3 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.04 0 1.00 0.01 0 1.00 0.00
β4 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.04 0 1.00 0.01 0 1.00 0.00
β5 2 0.98 0.26 2.02 0.98 0.30 1.47 0.98 0.56 1.78 0.99 0.29 1.56 0.98 0.34 1.11 0.98 0.52
β6 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.02 0 1.00 0.00 0 1.00 0.00
β7 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.01 0 1.00 0.00 0 1.00 0.00
β8 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00

Next, we construct an estimator of dΠS(z0). We divide the plane (βi, ti) into 9 pieces Ei0, ⋯, Ei8 (see Figure 1). The constraints that define each of these sets Ei0, ⋯, Ei8 are listed in Table 3 in the supplementary materials. The function g(N) in that table can be any combination of finitely many terms of the form aN^b with a > 0 and b ∈ (0, 1/2), among other choices. For more details, see Lu and Budhiraja (2013). Each partition ℝ × Πi=1p Eiγ(i) is related to the (2p + 1)-cell ℝ × Πi=1p Ciγ(i). Let

γ^(z) = (γ(1), ⋯, γ(p))  such that z ∈ ℝ × Πi=1p Eiγ(i).

Given a sample size N and a fixed z, we define a function ΛN(z): ℝ^{2p+1} → ℝ^{2p+1} as

ΛN(z)(h) = Ψγ^(z)(h),  for each h ∈ ℝ^{2p+1}. (34)

According to Theorem 3.1 of Lu (2014a), ΛN(zN) converges to dΠS(z0) in probability under Assumptions 1–4.

Table 3:

Coverage rates and average lengths of 95% individual CIs for true model parameters (β0true,βtrue) for different MCP penalties from 500 replications with sample size N = 300 generated in Example 1.

a = 2 a = 2000

λ = 0.5 λ = 1 λ = 2 λ = 0.5 λ = 1 λ = 2
True CR Len CR Len CR Len CR Len CR Len CR Len
β0true 0 0.96 0.23 0.96 0.23 0.95 0.34 0.96 0.24 0.95 0.28 0.95 0.40
β1true 3 0.95 0.26 0.96 0.27 0.99 0.42 0.96 0.28 0.99 0.33 1.00 0.49
β2true 1.5 0.95 0.29 0.96 0.30 1.00 0.51 0.96 0.31 0.97 0.36 1.00 0.53
β3true 0 0.95 0.28 0.96 0.29 0.97 0.42 0.97 0.30 0.97 0.35 0.98 0.50
β4true 0 0.97 0.28 0.97 0.29 0.97 0.42 0.96 0.30 0.97 0.35 0.98 0.50
β5true 2 0.96 0.28 0.96 0.29 0.99 0.44 0.97 0.30 1.00 0.36 1.00 0.54
β6true 0 0.95 0.28 0.95 0.28 0.98 0.42 0.95 0.30 0.97 0.34 0.98 0.49
β7true 0 0.96 0.28 0.96 0.29 0.98 0.42 0.96 0.30 0.97 0.35 0.99 0.50
β8true 0 0.96 0.25 0.97 0.26 0.99 0.38 0.97 0.27 0.99 0.32 1.00 0.45

Based on (24), (26) and (34), we define a function ΦN(zN): ℝ^{2p+1} → ℝ^{2p+1} as

ΦN(zN)(h) = LN(t^)∘ΛN(zN)(h) + h − ΛN(zN)(h) (35)

for each h ∈ ℝ^{2p+1}. The following theorem shows that d(fN)S(zN) is a consistent estimator of LK for Case I, and ΦN(zN) is a consistent estimator of LK for both Case I and Case II.

Theorem 2.

  • (a)
    Suppose that Assumptions 1, 2 and 3 hold. If z0 satisfies the conditions for Case I, then dΠS(zN) defined in (33) converges to dΠS(z0) almost surely, and
    d(fN)S(zN) = LN(t^)∘dΠS(zN) + I − dΠS(zN) (36)
    converges to LK almost surely.
  • (b)

    Suppose that Assumptions 1, 2, 3 and 4(a-b) hold. Then ΦN(zN) converges to LK in probability.

The two functions d(fN)S(zN) and ΦN(zN) are generally different when ((zN)2i, (zN)2i+1) belongs to Ei0, Ei1, Ei2, Ei3 or Ei4 for some i, in which case ΦN(zN) is a piecewise linear function. In contrast, d(fN)S(zN) is a piecewise linear function only when ((zN)2i, (zN)2i+1) belongs to Ci3 or Ci4 for some i.

Under Assumptions 1–4, we can show that the weak convergence in (30) still holds after LK is substituted by ΦN(zN). Consequently, if Σ01 is nonsingular, then we have

√N [(ΣN1)^{−1/2}, 0; 0, Ip] (ΦN(zN))(zN − z0) ⇒ N(0, Ip+1) × 0. (37)

If Σ01 is singular, we decompose ΣN1 as ΣN1 = UN^T ΔN UN, where UN is an orthogonal (p + 1) × (p + 1) matrix and ΔN is a diagonal matrix with monotonically decreasing elements. Let l be the number of positive eigenvalues of Σ01 counted with their algebraic multiplicities, let DN be the upper-left submatrix of ΔN whose diagonal elements are at least 1/g(N), and let lN be the number of rows in DN. Furthermore, let (UN)1 be the submatrix of UN that consists of its first lN rows, and let (UN)2 consist of the remaining rows of UN. We present the weak convergence results in the following theorem, which generalizes Theorem 3 in Lu et al. (2017) to cover all penalties satisfying Assumption 1.

Theorem 3.

Suppose that Assumptions 1, 2, 3 and 4(a-b) hold. Then

√N ΦN(zN)(zN − z0) ⇒ N(0, Σ0). (38)

If Σ01 is nonsingular, then

N [(ΦN(zN))(zN − z0)]^T [(ΣN1)^{−1}, 0; 0, Ip] [(ΦN(zN))(zN − z0)] ⇒ χ²_{p+1}, (39)

and

N [(ΦN(zN))(zN − z0)]^T [0, 0; 0, Ip] [(ΦN(zN))(zN − z0)] → 0. (40)

If Σ01 is singular and Assumption 4(c) holds, then Prob{lN = l} → 1 as N → ∞,

N [(ΦN(zN))(zN − z0)]^T [(UN)1^T DN^{−1} (UN)1, 0; 0, 0] [(ΦN(zN))(zN − z0)] ⇒ χ²_l, (41)

and

N [(ΦN(zN))(zN − z0)]^T [0, 0; 0, Ip] [(ΦN(zN))(zN − z0)] → 0. (42)

We can treat (39) and (40) as a special case of (41) and (42). In fact, if z0 satisfies Case I, then Theorem 3 still holds if ΦN(zN) is replaced by d(fN)S(zN).

3.3. Confidence intervals for the population penalized parameters

In this subsection, we describe how to obtain confidence intervals for (β˜0, β˜) from the asymptotic distribution of zN. First, we investigate the relationship between a solution to the normal map formulation (18) and the corresponding solution to (1). Let q˜ be as defined in Assumption 3 and q˜0 = E[2(Y − β˜0 − Σi=1p β˜iXi)]. From (13), (14) and (19), we have z0 = (β˜0, β˜, t˜) − (q˜0, q˜ − 2mβ˜, (P′λi(t˜i) + 2mt˜i)i=1p). In the supplementary materials, it is shown in (B.1) that q˜0 = 0, which implies β˜0 = (z0)1. Thus, confidence intervals of β˜0 are exactly those of (z0)1. On the other hand, using the fact that (β˜i, t˜i) = ΠSi((z0)2i, (z0)2i+1) for each i = 1, ⋯, p, we have the following relationship between β˜i and ((z0)2i, (z0)2i+1):

β˜i = Γ(V+, V−) = { (1/2)V+ if V+ > 0 and V− ≥ 0;  0 if V+ ≤ 0 and V− ≥ 0;  (1/2)V− if V+ ≤ 0 and V− < 0 }, (43)

where V+ = (z0)2i + (z0)2i+1 and V− = (z0)2i − (z0)2i+1. The above three cases in (43) include all the possible situations for the location of ((z0)2i, (z0)2i+1). This map Γ can be used to obtain confidence intervals for β˜i (i = 1, ⋯, p) after we calculate confidence intervals for ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1). For a fixed i, we denote the (1 − α/2)100% confidence intervals for ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) as [L+i, U+i] and [L−i, U−i] respectively. Then a (1 − α)100% (conservative) confidence interval for β˜i is given by

[Γ(L+i, L−i), Γ(U+i, U−i)]. (44)
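To make the interval rule concrete, the map Γ in (43) and the conservative construction (44) can be sketched in a few lines of Python (a minimal sketch with our own function names; the three branches follow the cases of (43), and the final branch also absorbs the configuration V+ > 0, V− < 0, which the text notes cannot occur for z0):

```python
def gamma(v_plus, v_minus):
    """The map Gamma in (43), with V+ = (z0)_{2i} + (z0)_{2i+1} and
    V- = (z0)_{2i} - (z0)_{2i+1}; returns the population parameter."""
    if v_plus > 0 and v_minus >= 0:
        return 0.5 * v_plus
    if v_plus <= 0 and v_minus >= 0:
        return 0.0
    return 0.5 * v_minus  # remaining case: v_minus < 0


def conservative_ci(l_plus, u_plus, l_minus, u_minus):
    """The conservative interval (44): apply Gamma to the endpoint pairs
    of the (1 - alpha/2)100% intervals for V+ and V-."""
    return gamma(l_plus, l_minus), gamma(u_plus, u_minus)
```

Since Γ is nondecreasing in both arguments, applying it to the lower and upper endpoint pairs yields a valid conservative interval.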

Next, we show how to find confidence intervals for (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1). Under Assumptions 1–4, from Theorem 3 we can express the asymptotically exact (1 − α)100% confidence region for z0 as

{z ∈ ℝ^{2p+1} | N[ΦN(zN)(zN − z)]^T [(UN)1^T DN^{−1} (UN)1, 0; 0, 0] [ΦN(zN)(zN − z)] ≤ χ²_{lN}(α) and N[ΦN(zN)(zN − z)]^T [0, 0; 0, Ip] [ΦN(zN)(zN − z)] = 0}, (45)

where χ²_{lN}(α) is the critical value associated with significance level α of a χ² distribution with lN degrees of freedom. If ΦN(zN) is a linear map, then the set in (45) is an ellipsoid in a subspace of ℝ^{2p+1}. Otherwise it is the union of portions of different ellipsoids. To obtain simultaneous confidence intervals, we find the maximal and minimal values of (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) over the set in (45) by solving optimization problems.

For individual confidence intervals, first we notice that ΦN(zN) is a global homeomorphism with probability tending to 1 as N → ∞ (see the proof of Theorem 2 in the supplementary materials). If ΦN(zN) is a global homeomorphism, we can use

(ΦN(zN))1(N(0,ΣN)) (46)

to approximate the distribution of √N(zN − z0) as in (29). When ΦN(zN) is a linear map, the distribution in (46) is normal. Therefore (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) also follow normal distributions, from which we can construct individual confidence intervals. When ΦN(zN) is not a linear map, we simulate data based on the distribution in (46), and find empirical individual confidence intervals for (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) by taking the (α/2)100% and (1 − α/2)100% percentiles of the simulated data as the lower and upper bounds respectively.
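The percentile step itself is elementary once draws from the distribution in (46) are available (generating the draws requires inverting ΦN(zN), which is problem specific); a minimal numpy sketch, where `samples` holds simulated values of the scalar quantity of interest, e.g. (z0)1:

```python
import numpy as np


def empirical_ci(samples, alpha=0.05):
    """Empirical individual CI: the (alpha/2)100% and (1 - alpha/2)100%
    percentiles of simulated values of the quantity of interest."""
    lower = np.percentile(samples, 100 * alpha / 2)
    upper = np.percentile(samples, 100 * (1 - alpha / 2))
    return lower, upper
```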

3.4. Confidence intervals for true model parameters in the underlying linear model

In this subsection, we develop a method to compute confidence intervals for (β0true,βtrue), based on a relation between a population penalized parameter (β˜0,β˜) and the true model parameter (β0true,βtrue). Suppose the underlying linear model is

Y=β0true+XTβtrue+ε, (47)

where β0true and βtrue = (β1true, ⋯, βptrue) ∈ ℝ^p are the true model parameters. Let ttrue ∈ ℝ^p be defined by titrue = |βitrue|. Denote the covariance matrix of X as Σ, and assume the random error ε has mean zero and variance σ². Moreover, ε is independent of Xi for all i = 1, ⋯, p. For simplicity we assume E(Xi) = 0 for each i = 1, ⋯, p. Consequently we have E(Y) = β0true and Σ = E(XX^T). We assume that Σ is nonsingular, and therefore we do not need Assumption 3 in this subsection.

In developing the theoretical results of this subsection, we will let λ = (λ1, ⋯, λp) converge to 0. Due to this change, the assumptions stated in Section 2.2 need to be changed accordingly. We will replace Assumption 1 by Assumption 1’, and keep Assumption 2. We will not need Assumption 4 until the end of this section.

Assumption 1’(a).

For each i = 1, 2, ⋯, p, P0(t) = 0 for all t ≥ 0. Moreover, for each positive λi in a neighborhood of 0, Pλi(·) is nonnegative, nondecreasing and continuously differentiable on [0, +∞) with Pλi(0) = 0 and P′λi(0) > 0.

Assumption 1’(b).

For each i = 1, ⋯, p, there exist neighborhoods Ti of titrue in ℝ+ and Λi of 0 in ℝ+, such that P″λi(ti), the second derivative of Pλi(·) with respect to ti, exists for each λi ∈ Λi and ti ∈ Ti. Moreover, P′λi(ti) and P″λi(ti) are Lipschitz continuous in (λi, ti) on Ti × Λi.

Assumption 1’(c).

For each i = 1, ⋯, p, there exists a neighborhood Ji of titrue in ℝ+, such that the mixed partial derivatives ∂²Pλi/∂λi∂ti(0, ti) and ∂³Pλi/∂λi∂ti²(0, ti) exist for each ti ∈ Ti, with

lim_{λi→0} sup_{ti∈Ti} |P′λi(ti)/λi − ∂²Pλi/∂λi∂ti(0, ti)| + |P″λi(ti)/λi − ∂³Pλi/∂λi∂ti²(0, ti)| = 0.

Besides the convex LASSO penalty, we can check that many non-convex penalty functions such as SCAD, MCP, the log-penalty, and the transformed l1 penalty satisfy Assumption 1’(a). We can further check that the LASSO penalty, the log-penalty, and the transformed l1 penalty also satisfy Assumptions 1’(b) and 1’(c). For the SCAD and MCP penalties, Assumptions 1’(b) and 1’(c) are satisfied almost everywhere except when titrue = 0 for some i. Assumptions 1’(b) and 1’(c) are used to guarantee that the SAA function fN almost surely converges to the true function f0 in the space of continuously differentiable functions on a neighborhood of (β0true, βtrue, ttrue), and that √N(fN − f0) weakly converges to a random function in that space. This assumption is needed for the techniques based on stochastic variational inequalities to be applicable. It is possible to weaken this assumption by developing techniques for a broader class of problems in which the SAA function fN (or equivalently, the first order derivative of the penalty function) is not necessarily continuously differentiable, and we will investigate this in future work.

Under Assumption 1’(a), with λ = 0 the problem (1) becomes the least squares problem (3), which has a unique solution (β0true, βtrue) in view of the linear model (47). Let ttrue ∈ ℝ^p be defined by titrue = |βitrue|. Then (9) with λ = 0 has a unique solution (β0true, βtrue, ttrue). By the equivalence between (9) and (17), (β0true, βtrue, ttrue) is also the unique solution to

−f0(β0, β, t) ∈ NS(β0, β, t),

where f0(β0, β, t) is as defined in (14) but with λ = 0:

f0(β0, β, t) = [−2E[Y − β0 − Σi=1p βiXi];  −2E[(Y − β0 − Σi=1p βiXi)X] − 2mβ;  2mt].

Let z0* be as defined in (19) with λ = 0 and (β0true, βtrue, ttrue) replacing (β˜0, β˜, t˜). The following lemma presents the relation between (β0true, βtrue) and z0*.

Lemma 4.

Suppose that Assumptions 1’(a) and 2 hold. Then we have

(β0true,βtrue)=G*(z0*) (48)

where G* ∈ ℝ^{(p+1)×(2p+1)} is defined as

G* = [1, 0, 0;  0, (1 + 2m)^{−1}Ip, 0]. (49)

Lemma 4 above indicates that an estimator of the true parameter (β0true, βtrue) is G*(zN), where zN is defined in (26). In the following theorem, we show the asymptotic distribution of the estimator G*(zN). Before stating this theorem, we define a matrix L* ∈ ℝ^{(2p+1)×(2p+1)} and a cone K* ⊂ ℝ^{2p+1} as follows:

L* = df0(β0true, βtrue, ttrue) = [2, 0, 0;  0, 2Σ − 2mIp, 0;  0, 0, 2mIp], (50)

and

K* = {w ∈ TS(β0true, βtrue, ttrue) | ⟨f0(β0true, βtrue, ttrue), w⟩ = 0}. (51)

Note that

f0(β0true, βtrue, ttrue) = [0; −2mβtrue; 2mttrue].

This implies that

K* = ℝ × Πi=1p Ki*, (52)

where

Ki* = {(βi, ti) | βi = ti} if βitrue > 0;  {(βi, ti) | βi = −ti} if βitrue < 0;  Si if βitrue = 0. (53)

As the setting considered in this subsection is different from previous sections, L* and K* here are different from L and K defined in Assumption 3 and (21). The previous L and K are associated with (β˜0,β˜,t˜), a solution to the population problem with a fixed positive λ, while L* and K* are associated with (β0true,βtrue,ttrue) and λ = 0.

Theorem 4.

Suppose that Assumptions 1’(a-c) and 2 hold. Let m > 0 be sufficiently small so that Σ − mIp is nonsingular, let L* and K* be defined as above, let Σ0* be the covariance matrix of the random vector F(β0true, βtrue, ttrue, X, Y) defined in (13), and let Σ0*1 be the upper left (p + 1) × (p + 1) submatrix of Σ0*. Moreover, let the λi’s be chosen to satisfy lim_{N→∞} √Nλi = ci for some constants ci ≥ 0, let zN be defined in (26), and define (β^0true, β^true) = G*(zN) and h ∈ ℝ^p by hi = ci ∂²Pλi/∂λi∂ti(0, titrue). Then (β^0true, β^true) is a consistent estimator of (β0true, βtrue) and

√N((β^0true, β^true) − (β0true, βtrue)) ⇒ G*∘(L*K*)^{−1}(N(0, Σ0*1), h). (54)

Note that the distribution of G*∘(L*K*)^{−1}(N(0, Σ0*1), h) can be normal or non-normal. When the true parameter βitrue ≠ 0 for each i, we can show that K* is a subspace of ℝ^{2p+1} and (L*K*)^{−1} is a linear function. Therefore, the limiting distribution of the true parameter estimator G*(zN) is normal in this case. However, if the true parameter βitrue = 0 for some i, the limiting distribution can be normal or non-normal.

Theorem 5.

Suppose that the assumptions in Theorem 4 hold. If hi = 0 for each i, then (β^0true,β^true) is a consistent estimator of (β0true,βtrue) and

G*∘(L*K*)^{−1}(N(0, Σ0*1), h) = [1/2, 0; 0, (1/2)Σ^{−1}](N(0, Σ0*1)) = N(0, [σ², 0; 0, σ²Σ^{−1}]).

Furthermore, let Θ^ be a consistent estimate of Σ^{−1}. Define HN = [1/2, 0; 0, (1/2)Θ^]. Then,

√N(β^itrue − βitrue) / [(HN ΣN1 HN^T)i+1,i+1]^{1/2} ⇒ N(0, 1),  for all i = 0, 1, ⋯, p. (55)

Since hi = ci ∂²Pλi/∂λi∂ti(0, titrue), we know that hi = 0 if ci = 0. Therefore, if the λi’s are chosen to be o(1/√N) in the penalty function, the limiting distribution will be a multivariate normal distribution. In this normal case, Theorem 5 above provides a method to compute asymptotically exact individual confidence intervals for (β0true, βtrue).
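In the normal case, the individual intervals from (55) take the familiar z-interval form. A sketch follows, assuming `beta_hat` stacks the p + 1 estimates, `H_N` and `Sigma_N1` are the matrices appearing in (55), and `N` is the sample size (function and argument names are ours):

```python
import numpy as np
from statistics import NormalDist


def theorem5_cis(beta_hat, H_N, Sigma_N1, N, alpha=0.05):
    """Individual CIs from (55):
    beta_hat_i +/- z_{1-alpha/2} * sqrt((H_N Sigma_N1 H_N^T)_{ii} / N)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * np.sqrt(np.diag(H_N @ Sigma_N1 @ H_N.T) / N)
    return np.column_stack((beta_hat - half_width, beta_hat + half_width))
```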

When hi ≠ 0 for some i, the asymptotic distribution in Theorem 4 does not necessarily reduce to a normal distribution. To see this, consider the following example. Let p = 2, β0true = 0, β1true = 0 and β2true = a0 for some constant a0 > 0. Let λ = 1/√N, so c1 = c2 = 1. Let Σ = [2, 0; 0, 2], m = 1, and ∂²Pλ1/∂λ1∂t1(0, t1true) = ∂²Pλ2/∂λ2∂t2(0, t2true) = 1 (which is satisfied by the LASSO penalty function P(λ, t) = λt). It follows that h1 = h2 = 1. Let q0, q1, q2 ∈ ℝ. To find (L*K*)^{−1}(q0, q1, q2, h1, h2) we consider the following problem

min_{(β0,β1,t1,β2,t2)∈K*}  β0² + β1² + t1² + β2² + t2² − q0β0 − q1β1 − h1t1 − q2β2 − h2t2,

whose solution satisfies (q0, q1, q2, h1, h2) ∈ L*(β0, β1, t1, β2, t2) + NK*(β0, β1, t1, β2, t2). Here K* = ℝ × S1 × {(β2, t2) | β2 = t2}. The solution to the above problem is given by

β0 = q0/2,  β1 = { q1/2 if −1 ≤ q1 ≤ 1;  (q1 + 1)/4 if q1 ≥ 1;  (q1 − 1)/4 if q1 ≤ −1 },  t1 = |β1|,  β2 = t2 = (q2 + 1)/4.

As a result, (L*K*)^{−1}(q0, q1, q2, h1, h2) = (β0, β1, t1, β2, t2) is a piecewise affine function of (q0, q1, q2, h1, h2) with three pieces. Furthermore, since G*(·) is a linear transformation, we conclude that the asymptotic distribution is non-normal in this case.
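The piecewise formula for β1 can be double-checked numerically by minimizing the objective over a fine grid of the cone S1 = {(β1, t1) : t1 ≥ |β1|}; a small verification sketch (grid search is used here only as a check, not as part of the method):

```python
import numpy as np


def beta1_closed_form(q1):
    """Closed-form beta_1 from the worked example (h1 = 1)."""
    if -1.0 <= q1 <= 1.0:
        return q1 / 2.0
    if q1 > 1.0:
        return (q1 + 1.0) / 4.0
    return (q1 - 1.0) / 4.0


def beta1_grid_search(q1):
    """Minimize beta^2 + t^2 - q1*beta - t over the cone t >= |beta|."""
    beta = np.linspace(-3.0, 3.0, 601)   # step 0.01
    t = np.linspace(0.0, 3.0, 301)       # step 0.01
    B, T = np.meshgrid(beta, t)
    obj = B**2 + T**2 - q1 * B - T
    obj[T < np.abs(B)] = np.inf          # enforce the cone constraint
    i, j = np.unravel_index(np.argmin(obj), obj.shape)
    return B[i, j]
```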

Next, we show how to estimate (L*K*)^{−1} in the situation considered in Theorem 4. We need to make an assumption analogous to Assumption 4.

Assumption 4’(a).

The same conditions as in Assumption 4(a), with (β0true, βtrue, ttrue) replacing (β˜0, β˜, t˜).

Assumption 4’(b).

The same conditions as in Assumption 4(b), with (β0true, βtrue, ttrue) replacing (β˜0, β˜, t˜).

To estimate (L*K*)^{−1}, we can also use ΦN(zN). Under Assumption 4’, to show that (ΦN(zN))^{−1} is a consistent estimator of (L*K*)^{−1}, the key is to show that (31) holds (with (β0true, βtrue, ttrue) replacing (β˜0, β˜, t˜)). To show this, one needs to show that fN converges to f0 in probability at an exponential rate in the space of continuously differentiable functions. For the first p + 1 components, this follows from Assumption 4’. For the last p components, note that the norm of the function P′λi/λi in the space of continuously differentiable functions is bounded due to Assumption 1’(c). In addition, since lim_{N→∞} √Nλi = ci for each i, we have √N‖P′λi‖ ≤ ρ for some constant ρ. As a result, for each ϵ > 0, we have ‖P′λi‖ ≤ ρ/√N < ϵ/2 for sufficiently large N. Therefore, fN converges to f0 in probability at an exponential rate in the space of continuously differentiable functions, and (ΦN(zN))^{−1} converges to (L*K*)^{−1} in probability.

For the normal case, equation (55) justifies using the diagonal elements of HN ΣN1 HN^T, divided by N, as the estimated variances of (β^0true, β^true). For the non-normal case, first we define two functions R and R^ from ℝ^{2p+1} to ℝ^{p+1} as

R = G*∘(L*K*)^{−1}∘[(Σ0*1)^{1/2}, 0; 0, diag(hi)i=1p]  and  R^ = G*∘(ΦN(zN))^{−1}∘[(ΣN1)^{1/2}, 0; 0, diag(h^i)i=1p], (56)

where h^i = √Nλi ∂²Pλi/∂λi∂ti(0, |β^itrue|). Denote the ith component functions of R and R^ as Ri and R^i respectively for each i. Let f: ℝ^{2p+1} → ℝ be a continuous function and let Z be a (2p + 1)-dimensional random variable with Z ~ N(0, Ip+1) × 1, whose last p coordinates are identically 1. Define ar(f) ∈ (0, ∞) as

ar(f) = inf{c ≥ 0 | Prob{−c ≤ f(Z) − r ≤ c} ≥ 1 − α}. (57)

Suppose that Prob{f(Z) = b} = 0 for all b. Then for any given r and α ∈ (0, 1), ar(f) as defined in (57) is the smallest value that satisfies

Prob{−ar(f) ≤ f(Z) − r ≤ ar(f)} = 1 − α.

Since the map G* has full row rank, L*K* is a global homeomorphism, and Σ0*1 is nonsingular, if hi ≠ 0 for each i, then the matrix representation of each piece of the map R has full row rank as well. Therefore, Prob{Ri(Z) = b} = 0 for all b. The following theorem provides a way to compute individual confidence intervals for (β0true, βtrue) in the general case where hi ≠ 0 for each i.

Theorem 6.

Suppose that the assumptions in Theorem 4 and Assumptions 4’(a-b) hold, and hi ≠ 0 for each i. Let α ∈ (0, 1) and ar(·) be as in (57). Then for every r and all i = 0, 1, ⋯, p, we have

lim_{N→∞} Prob{|√N(β^itrue − βitrue) − r| ≤ ar(R^i+1)} = 1 − α, (58)

where R and R^ are defined in (56).

From (58), one can compute the empirical (1 − α) percentile confidence intervals for (β0true, βtrue) by simulating data from R^(Z). The constant r can be used to control the centers of the confidence intervals for all βitrue simultaneously, which may affect the interval lengths. A reasonable choice of r is 0 if the empirical distribution of R^(Z) is approximately symmetric about 0. The results of Theorems 5 and 6 are applicable to a wide range of general penalty functions, which covers the LASSO as a special case (Lu et al., 2017). The procedure to construct confidence intervals of the true regression coefficients is summarized in Table 1.
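A Monte Carlo sketch of the interval in (58), under our reading that Z ~ N(0, Ip+1) × 1 has standard normal first p + 1 coordinates and last p coordinates identically 1; the component function R^i+1 is problem specific and is passed in as a callable (all names are ours):

```python
import numpy as np


def theorem6_ci(R_hat_i, beta_hat_i, N, p, r=0.0, alpha=0.05,
                n_draws=20000, seed=0):
    """Empirical CI from (58): a_r is the (1 - alpha) quantile of
    |R_hat_i(Z) - r|, and the interval for beta_i^true is
    [beta_hat_i - (r + a_r)/sqrt(N), beta_hat_i - (r - a_r)/sqrt(N)]."""
    rng = np.random.default_rng(seed)
    Z = np.hstack([rng.standard_normal((n_draws, p + 1)),
                   np.ones((n_draws, p))])
    values = np.array([R_hat_i(z) for z in Z])
    a_r = np.quantile(np.abs(values - r), 1 - alpha)
    root_n = np.sqrt(N)
    return beta_hat_i - (r + a_r) / root_n, beta_hat_i - (r - a_r) / root_n
```

With r = 0 the interval is symmetric about β^itrue, matching the suggested default choice of r.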

3.5. Extension to the high dimensional case

In our previous theoretical analysis, we assume that the dimension p is fixed. It is interesting to study the extension of our proposed method to the high dimensional case where the dimension p is also allowed to go to infinity.

As shown in Lemma 4, since (β0true, βtrue) = G*(z0*), our proposed estimate for the true parameter (β0true, βtrue) is G*(zN). In fact, we can also show that (β0true, βtrue) = G(z0), where z0 is defined in (19) with λ > 0 and G is a map from ℝ^{2p+1} to ℝ^{p+1} defined as

G = (1/2)([1, 0; 0, Σ^{−1}]B + [1, 0; 0, 2I − (1 + 2m)Σ^{−1}]B∘ΠK), (59)

and the matrix B ∈ ℝ^{(p+1)×(2p+1)} is given by B = [Ip+1, 0]. Motivated by this result, we can also estimate the true parameter (β0true, βtrue) by G^(zN), where G^ is a map from ℝ^{2p+1} to ℝ^{p+1} defined as

G^ = (1/2)([1, 0; 0, Θ^]B + [1, 0; 0, 2I − (1 + 2m)Θ^]B∘dΠS(zN)), (60)

and Θ^ is a consistent estimate of Σ^{−1}. Theoretically, if lim_{N→∞} √Nλi = 0 for each i, we can show that G*(zN) and G^(zN) have the same asymptotic distribution.

According to the definition of G^ in (60) and the definition of zN in (26), we can show that G^(zN) = (β^0 + (1/N)1N^T(y − β^01N − Xβ^), β^ + (1/N)Θ^X^T(y − β^01N − Xβ^)). Therefore, the estimate of β is the sum of an initial estimate β^ (e.g., a LASSO, SCAD or MCP estimate) and a bias-correction term. Interestingly, although we have different motivations, G^(zN) turns out to be the same as the estimate proposed by Van de Geer et al. (2014). For the high dimensional case with p ≫ N, if we choose λ = O(√(log(p)/N)) converging to 0 as N → ∞, and use conditions to guarantee that: (a) ‖β^ − βtrue‖1 = Op(s0√(log(p)/N)), where s0 is the number of true nonzero regression coefficients; (b) s0 = o(√N/log(p)); and some sparsity assumptions about the precision matrix Σ^{−1}, we can show that the asymptotic distribution of G^(zN) is normal (Van de Geer et al., 2014). However, the theoretical analysis of the asymptotic distribution of G*(zN) and G^(zN) using stochastic variational inequality techniques for the p ≫ N case is challenging. Many fundamental results about variational inequalities (e.g., some results used in the proof of Theorem 1) need to be generalized to the high dimensional case.
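The closed form of G^(zN) above is straightforward to implement; a minimal numpy sketch, where `X` is the N × p design, `y` the response, (`beta0_hat`, `beta_hat`) an initial penalized fit, and `Theta_hat` the precision-matrix estimate (names are ours):

```python
import numpy as np


def g_hat(X, y, beta0_hat, beta_hat, Theta_hat):
    """Bias-corrected estimate G_hat(z_N): the initial fit plus a
    correction term built from the residuals."""
    N = X.shape[0]
    resid = y - beta0_hat - X @ beta_hat
    b0 = beta0_hat + resid.mean()                 # (1/N) 1_N^T resid
    b = beta_hat + Theta_hat @ (X.T @ resid) / N  # (1/N) Theta X^T resid
    return b0, b
```

As a sanity check, taking Theta_hat = (X^TX/N)^{−1} makes the corrected slope coincide with the least squares fit of y − β^0 on X.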

Although we assume that the dimension p is fixed in the theoretical analysis using stochastic variational inequality techniques, our proposed method is applicable to large p, small N data in practice. For high dimensional data, we can use the results shown in Theorem 6 to construct confidence intervals. In Section 4, we will use Example 3 to study the performance of our method for high dimensional data.

4. Numerical examples

In this section, we use the MCP methods in (8) to illustrate the performance of the techniques proposed in Section 3. For all examples in this section, we choose 1/g(N) = 0.001N^{−1/3} and m = 1/2 in (9). We use the mixed integer quadratically constrained program (MIQCP) solver in the optimization modeling language GAMS (Brooke et al., 1998) to obtain accurate solutions to (2).

For all simulated examples, we generate the data using the following linear model:

Y=XTβtrue+σϵ, (61)

where βtrue ∈ ℝ^p, X is a p-dimensional normal random variable with mean 0 and covariance Σ, and ε is a standard normal random error independent of X. We set the noise level σ = 1. Under the model (61), the population penalized regression problem (1) can be written as

min_{β0,β} (βtrue − β)^T Σ(βtrue − β) + β0² + Σj=1p Pλj(|βj|). (62)

We compute confidence intervals for the population penalized parameter (β˜0, β˜) and the true model parameter βtrue, which we refer to as the first and second types of confidence intervals respectively. To show their performance in the simulation study, we report the following two measures: the empirical coverage rate (the fraction of total replications in which the confidence intervals contain the corresponding population penalized parameters or true model parameters) and the average confidence interval length. For the second type of confidence interval, we compare our proposed method with the LDPE method (Van de Geer et al., 2014; Zhang and Zhang, 2014), the method introduced by Javanmard and Montanari (2014) (denoted as the JM method), and the method proposed by Lu et al. (2017) (denoted as SVI-Lasso). In terms of the tuning parameter λ, we study the performance of our proposed method with some fixed values as well as the value of λ chosen by the Generalized Information Criterion (GIC, Konishi and Kitagawa (2008)).
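The two reported measures can be computed as follows (a small sketch; `lower` and `upper` are arrays of interval endpoints across replications and `target` is the parameter the intervals should cover):

```python
import numpy as np


def coverage_and_length(lower, upper, target):
    """Empirical coverage rate and average length over replications
    for one parameter."""
    covered = (lower <= target) & (target <= upper)
    return covered.mean(), (upper - lower).mean()
```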

4.1. Example 1: Low dimensional setting with the auto-regressive covariance structure

For this example, we generate a training dataset with 500 replications of sample size N = 300, dimension p = 8, true model parameter βtrue = (3, 1.5, 0, 0, 2, 0, 0, 0), and true covariance matrix Σij = 0.5^{|i−j|}. We consider six MCP penalties with parameters (λ, a) taking the following values: λ = 0.5, 1 or 2, and a = 2 or 2000. When a = 2000, the MCP penalties are very close to the LASSO penalty. In each replication, after solving the SAA problem for every MCP penalty, we compute the two types of individual confidence intervals with the confidence level 0.95 (α = 0.05).

Tables 2 and 3 show the empirical coverage rates (CR) and average interval lengths (Len) for 95% individual confidence intervals of the 500 replications. In Table 2, the β˜ column contains the population penalized parameters for different MCP penalties, which are expected to be covered by the first type of confidence intervals. In Table 3, the “True” column contains the true model parameters βtrue, which are expected to be covered by the second type of confidence intervals. Note that the coverage rate is 100% for the first type of confidence interval when β˜i=0. This is due to the shrinkage effect of the projection Γ (43) from z0 to β˜, which causes the confidence intervals for β˜i to be the singleton {0}. When λ = 0.5 and a = 2, the population penalized parameters coincide with the true model parameters. The second type of confidence intervals are much longer than the first type for the inactive parameters β3, β4, β6, β7 and β8 as expected. In practice, which type of confidence intervals to use depends on the type of parameters of interest. The first type of confidence intervals can be used to assess the randomness of the penalized estimates with a fixed penalty. This type of inference is especially useful when the penalty conveys prior information on the parameters. In contrast, the second type of confidence intervals provide inference information for the underlying true parameters directly.

As a remark, the parameter λ controls the level of penalization and the parameter a in the MCP penalty controls the degree of non-convexity. As shown in Tables 2 and 3, when λ increases to 1 and 2, the differences between the population penalized parameters and the true model parameters become larger. On the other hand, as a gets large, such as a = 2000, the MCP penalty becomes close to the LASSO penalty. The lengths of the second type of confidence intervals for a = 2 are generally shorter than those for a = 2000 with similar coverage rates, as shown in Table 3. This may be due to the smaller bias imposed by the MCP penalty with a small a. In addition, as shown in Table 4, our proposed method with λ selected by GIC has very good performance. The comparison between our proposed methods and the LASSO-type methods indicates that our methods perform well for the inference of the true parameters in the linear model.

Table 4:

Coverage rates and average lengths of 95% individual CIs for true model parameters (β0true,βtrue) for different methods from 500 replications with sample size N = 300 generated in Example 1.

Our method (a = 2) Lasso type methods

λ = 0.5 λ = 1 λ = 2 GIC SVI-Lasso LDPE JM
True CR Len CR Len CR Len CR Len CR Len CR Len CR Len
β1true 3 0.95 0.26 0.96 0.27 0.99 0.42 0.95 0.26 0.95 0.26 0.94 0.26 0.88 0.25
β2true 1.5 0.95 0.29 0.96 0.30 1.00 0.51 0.95 0.29 0.95 0.29 0.94 0.28 0.89 0.28
β3true 0 0.95 0.28 0.96 0.29 0.97 0.42 0.95 0.28 0.95 0.28 0.95 0.28 0.99 0.28
β4true 0 0.97 0.28 0.97 0.29 0.97 0.42 0.97 0.28 0.97 0.28 0.96 0.28 0.97 0.28
β5true 2 0.96 0.28 0.96 0.29 0.99 0.44 0.96 0.28 0.97 0.29 0.96 0.28 0.92 0.28
β6true 0 0.95 0.28 0.95 0.28 0.98 0.42 0.95 0.28 0.94 0.28 0.95 0.28 0.98 0.28
β7true 0 0.96 0.28 0.96 0.29 0.98 0.42 0.96 0.28 0.95 0.28 0.95 0.28 0.99 0.28
β8true 0 0.96 0.25 0.97 0.26 0.99 0.38 0.97 0.25 0.96 0.26 0.97 0.26 0.98 0.25

4.2. Example 2: Low dimensional setting with the equi-correlation covariance structure

In this example, we consider the equi-correlation covariance structure where Σij = 0.5 for all ij and Σjj = 1 for all j. The other settings are the same as Example 1.

Table 5 shows the performance of the 95% individual confidence intervals of the population penalized parameters. The results shown in this table are very similar to the results of Example 1 shown in Table 2. As shown in Table 5, for each fixed λ, the proposed method using a = 2 delivers better performance than the method using a = 2000 in most cases, especially when λ is small. Table 6 shows the comparison of the individual confidence intervals of the true model parameters constructed by our method and the LASSO-type methods. Similar to Example 1, our proposed method using GIC performs well. In addition, for this example, our method (GIC), SVI-Lasso and LDPE deliver similar performance. All three of these methods perform better than the JM method.

Table 5:

Coverage rates and average lengths of 95% individual CIs for population penalized parameters (β˜0,β˜) for different MCP penalties from 500 replications with sample size N = 300 generated in Example 2.

a = 2 a = 2000

λ = 0.5 λ = 1 λ = 2 λ = 0.5 λ = 1 λ = 2
β˜ CR Len β˜ CR Len β˜ CR Len β˜ CR Len β˜ CR Len β˜ CR Len
β0 0 0.98 0.26 0 0.98 0.27 0 0.99 0.38 0 0.98 0.27 0 0.97 0.31 0 0.97 0.41
β1 3 0.99 0.31 3.10 0.98 0.36 3.57 0.91 0.95 2.88 0.98 0.33 2.75 0.97 0.38 2.50 0.97 0.52
β2 1.5 0.98 0.32 1.20 0.98 0.55 0.57 0.95 0.88 1.37 0.97 0.34 1.25 0.97 0.38 1.00 0.96 0.52
β3 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.05 0 1.00 0.02 0 1.00 0.01
β4 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.05 0 1.00 0.02 0 1.00 0.00
β5 2 0.97 0.32 2.10 0.99 0.38 1.57 0.98 0.93 1.88 0.97 0.34 1.75 0.97 0.38 1.50 0.98 0.52
β6 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.05 0 1.00 0.02 0 1.00 0.00
β7 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.05 0 1.00 0.02 0 1.00 0.00
β8 0 1.00 0.00 0 1.00 0.00 0 1.00 0.00 0 1.00 0.05 0 1.00 0.02 0 1.00 0.00

Table 6:

Coverage rates and average lengths of 95% individual CIs for the true model parameters (β0^true, β^true) for different methods from 500 replications with sample size N = 300 generated in Example 2.

Our method (a = 2) Lasso type methods

λ = 0.5 λ = 1 λ = 2 GIC SVI-Lasso LDPE JM
True CR Len CR Len CR Len CR Len CR Len CR Len CR Len
β1^true 3 0.97 0.30 0.97 0.31 1.00 0.46 0.97 0.30 0.97 0.30 0.97 0.30 0.91 0.29
β2^true 1.5 0.95 0.30 0.97 0.32 1.00 0.50 0.95 0.30 0.95 0.31 0.95 0.30 0.89 0.29
β3^true 0 0.95 0.30 0.96 0.31 0.99 0.43 0.95 0.30 0.96 0.30 0.96 0.30 0.98 0.29
β4^true 0 0.96 0.30 0.97 0.31 0.99 0.43 0.96 0.30 0.96 0.30 0.96 0.30 0.99 0.29
β5^true 2 0.92 0.30 0.94 0.31 0.99 0.46 0.92 0.30 0.92 0.31 0.92 0.30 0.88 0.29
β6^true 0 0.95 0.30 0.95 0.31 1.00 0.43 0.95 0.30 0.95 0.30 0.94 0.30 0.98 0.29
β7^true 0 0.95 0.30 0.95 0.31 0.99 0.43 0.95 0.30 0.95 0.30 0.95 0.30 0.96 0.29
β8^true 0 0.94 0.30 0.95 0.31 1.00 0.43 0.94 0.30 0.94 0.30 0.94 0.30 0.98 0.29

4.3. Example 3: High dimensional example

In this example, we consider a high dimensional case in which the dimension is much larger than the sample size. We choose p = 300 with β^true being a 300-dimensional vector: β_1^true = 3, β_2^true = β_100^true = β_200^true = β_300^true = 1.5, β_5^true = β_95^true = 2, β_10^true = 1, β_25^true = 0.5, and all other components are 0. The true covariance matrix is Σ_ij = 0.5^|i−j|. We generate a training dataset with 500 replications of sample size N = 100. For this high dimensional example, we consider three MCP penalties with parameters λ = 0.5, 1, or 2, and a = 3. In each replication, we use the nodewise LASSO regression introduced by Meinshausen and Bühlmann (2006) to compute the estimate Θ̂ of the precision matrix, and compute the individual confidence intervals of the true model parameters at the 0.95 confidence level. Define the active set A = {i : β_i^true ≠ 0} = {1, 2, 5, 10, 25, 95, 100, 200, 300} and A^c = {0, 1, 2, …, 300} \ A. In Table 7, for the different methods, we report the average coverage rate, median coverage rate, average length, and median length of the individual confidence intervals for the true model parameters in A and A^c, respectively:

Avgcov_A = |A|^{−1} Σ_{i∈A} CR_i,    Avgcov_{A^c} = |A^c|^{−1} Σ_{i∈A^c} CR_i,
Avglen_A = |A|^{−1} Σ_{i∈A} Len_i,   Avglen_{A^c} = |A^c|^{−1} Σ_{i∈A^c} Len_i,
Medcov_A = median_{i∈A} CR_i,        Medcov_{A^c} = median_{i∈A^c} CR_i,
Medlen_A = median_{i∈A} Len_i,       Medlen_{A^c} = median_{i∈A^c} Len_i,

where CR_i and Len_i denote the empirical coverage rate and the average interval length of the confidence interval for the parameter β_i^true over the 500 replications, respectively.
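The summary quantities above are straightforward to compute; the following sketch (the function name `summarize` is ours, and the toy values are hypothetical, not taken from the paper's simulations) shows the average and median coverage/length over an index set:

```python
import numpy as np

def summarize(CR, Len, A):
    """Average and median coverage rates and interval lengths over index set A.

    CR and Len map each parameter index i to the empirical coverage rate CR_i
    and the average interval length Len_i over the replications.
    """
    cr = np.array([CR[i] for i in A])
    ln = np.array([Len[i] for i in A])
    return {"Avgcov": cr.mean(), "Medcov": np.median(cr),
            "Avglen": ln.mean(), "Medlen": np.median(ln)}

# toy check with three parameters
CR = {1: 0.94, 2: 0.96, 5: 0.95}
Len = {1: 0.40, 2: 0.50, 5: 0.45}
out = summarize(CR, Len, [1, 2, 5])
```

The same call with the complement set A^c gives the second column block of Table 7.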

Table 7:

Average coverage rates and lengths of 95% individual confidence intervals for the true model parameters in the linear model with different methods computed from 500 replications with sample size N = 100 and dimension p = 300 generated in Example 3.

Our method (λ = 0.5) Our method (λ = 1) Our method (λ = 2)
Avgcov Medcov Avglen Medlen Avgcov Medcov Avglen Medlen Avgcov Medcov Avglen Medlen
A 92.82 92.60 0.46 0.45 94.64 95.00 0.66 0.65 95.24 95.40 1.26 1.25
Ac 93.26 93.40 0.39 0.39 93.51 93.60 0.56 0.56 93.73 94.00 1.08 1.08
Our method (GIC) LDPE JM
Avgcov Medcov Avglen Medlen Avgcov Medcov Avglen Medlen Avgcov Medcov Avglen Medlen

A 92.91 93.00 0.57 0.56 93.84 94.40 1.13 1.14 88.07 87.80 0.55 0.55
Ac 93.37 93.40 0.47 0.47 95.31 95.60 1.14 1.14 99.38 99.40 0.55 0.55

For our proposed methods, as λ increases to 1 and 2, both the average coverage rates and the average lengths increase. Compared with LDPE, our proposed method using GIC to choose the tuning parameter has much shorter average lengths, while its average coverage rates are only slightly lower. Although the JM method delivers similar average lengths to our proposed method (GIC), the average coverage rates of our proposed method are much closer to the nominal level of 95%. Overall, the results shown in Table 7 indicate that our proposed method still delivers comparable performance in the high dimensional case.

4.4. Example 4: ADNI data

In this real data example, we consider the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (http://www.loni.ucla.edu/ADNI). The main goal of ADNI was to test whether serial structural magnetic resonance imaging (MRI), fluorodeoxyglucose positron emission tomography (FDG-PET) images, and other biological markers such as cerebrospinal fluid (CSF) measurements could be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s Disease (AD). To that end, 800 adults aged between 55 and 90 were recruited from over 50 sites across the US and Canada. In our analysis, we use data from the 199 subjects who have complete baseline MRI, FDG-PET, and CSF data. Using the data processing method of Thung et al. (2014), we obtained 93 MRI features, 93 PET features, and 5 CSF features for each subject. The response variable is the Mini-Mental State Examination (MMSE) score (Folstein et al., 1975), which is often used to screen for cognitive impairment.

The data are standardized at the beginning of our analysis. For our proposed method, we use the MCP penalty with parameter a = 3 and choose the best tuning parameter λ by GIC. Table 8 shows the features selected by the different methods, where a feature is considered selected if its 95% confidence interval does not contain 0. The numbers of features selected by our method, SVI-Lasso, LDPE, and JM are 13, 12, 13, and 3, respectively. Among the 13 features selected by our proposed method, 11 are also selected by SVI-Lasso, 9 by LDPE, and 3 by JM. Table 9 shows the estimates and 95% individual confidence intervals of the 13 features selected by our proposed method. The results of our proposed method are comparable to those of SVI-Lasso and LDPE. As shown in Table 9, for most of the 13 features, the absolute values of the estimates delivered by the JM method are much smaller than the corresponding values of the other methods, and the 95% confidence intervals of the JM method also differ considerably from those of the other methods.
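The selection rule above (keep a feature if its 95% CI excludes 0) can be sketched as follows; the function name `select_features` and the toy interval values are our own illustration, not output from the paper's methods:

```python
import numpy as np

def select_features(lower, upper):
    """Return the indices i whose CI [lower_i, upper_i] does not contain 0."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return np.where((lower > 0) | (upper < 0))[0]

# toy example: the first two intervals exclude 0, the third does not
sel = select_features([-0.33, 0.06, -0.10], [-0.07, 0.41, 0.20])
```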

Table 8:

Selected features of different methods for the ADNI data.

Method Selected Features

Our method (GIC) 9, 19, 40, 59, 67, 80, 95, 130, 134, 147, 156, 168, 178
SVI-Lasso 9, 19, 40, 77, 80, 95, 130, 134, 147, 156, 168, 178
LDPE 9, 19, 40, 59, 77, 80, 83, 90, 111, 134, 147, 156, 168
JM 19, 40, 134

Table 9:

Estimates and 95% individual confidence intervals of the 13 features selected by our proposed method for the ADNI data.

Our method (GIC) SVI-Lasso LDPE JM
Est Ind CI Est Ind CI Est Ind CI Est Ind CI
β9^true −0.20 [−0.33, −0.07] −0.20 [−0.33, −0.07] −0.19 [−0.34, −0.04] −0.15 [−0.31, 0.01]
β19^true 0.24 [0.06, 0.41] 0.23 [0.07, 0.40] 0.24 [0.09, 0.40] 0.25 [0.08, 0.42]
β40^true −0.21 [−0.36, −0.06] −0.21 [−0.35, −0.06] −0.20 [−0.35, −0.05] −0.16 [−0.33, 0.00]
β59^true 0.15 [0.01, 0.29] 0.15 [0.00, 0.30] 0.16 [0.01, 0.30] 0.12 [−0.03, 0.28]
β67^true 0.13 [0.00, 0.27] 0.13 [0.00, 0.26] 0.12 [−0.02, 0.26] 0.11 [−0.03, 0.26]
β80^true 0.23 [0.03, 0.43] 0.23 [0.03, 0.42] 0.21 [0.04, 0.38] 0.15 [−0.01, 0.30]
β95^true 0.20 [0.00, 0.40] 0.21 [0.01, 0.41] 0.19 [−0.01, 0.39] 0.04 [−0.11, 0.20]
β130^true 0.21 [0.00, 0.42] 0.20 [0.01, 0.40] 0.18 [−0.03, 0.39] 0.04 [−0.12, 0.19]
β134^true 0.25 [0.08, 0.43] 0.24 [0.02, 0.45] 0.24 [0.06, 0.43] 0.23 [0.08, 0.38]
β147^true −0.22 [−0.41, −0.02] −0.22 [−0.41, −0.03] −0.21 [−0.40, −0.02] −0.03 [−0.18, 0.13]
β156^true −0.19 [−0.36, −0.02] −0.19 [−0.36, −0.02] −0.18 [−0.37, 0.00] −0.05 [−0.21, 0.11]
β168^true −0.24 [−0.43, −0.04] −0.24 [−0.43, −0.04] −0.22 [−0.43, −0.02] 0.01 [−0.14, 0.16]
β178^true −0.19 [−0.34, −0.03] −0.19 [−0.35, −0.04] −0.18 [−0.36, 0.00] −0.07 [−0.22, 0.08]

5. Discussion

In this paper, we propose a unified framework to construct confidence intervals for the population penalized parameters as well as the true model parameters for a large class of penalties. By transforming the population penalized regression problem (1) and its SAA problem (2) into the equivalent problems (9) and (22), respectively, we remove the non-smoothness in the objectives. We then obtain their normal map formulations (18) and (25), and derive the asymptotic distributions and the two types of confidence intervals. Our numerical results show that these methods are effective. When the objective functions in (1) and (2) are non-convex due to non-convex penalty functions, most existing algorithms are only guaranteed to find a locally optimal solution of the SAA problem, and our proposed methods then generate confidence intervals based on that local solution. In practice, we solve for an SAA solution (β̂0, β̂) and then use (26) to obtain a solution to (25). The first type of confidence intervals we compute are for a locally optimal solution of the population penalized regression problem (1). From any local solution of (2), we can always compute confidence intervals for the true model parameters, which are the second type of confidence intervals we compute.

Supplementary Material

Supp1

Acknowledgments

The authors thank the editors, the associate editor, and referees for their helpful comments and suggestions. This research was supported in part by US National Science Foundation grants DMS-1407241 (Liu, Lu and Yin), and DMS-1109099 (Lu and Yin).

References

1. Brooke A, Kendrick D, Meeraus A, and Raman R (1998), GAMS, A User’s Guide, Washington, DC: GAMS Development Corporation, available online at http://www.gams.com.
2. Candes EJ and Tao T (2007), “The Dantzig selector: statistical estimation when p is much larger than n,” The Annals of Statistics, 35, 2313–2351.
3. Donoho DL and Johnstone IM (1994), “Ideal spatial adaptation by wavelet shrinkage,” Biometrika, 81, 425–455.
4. Efron B, Hastie T, Johnstone I, and Tibshirani R (2004), “Least angle regression,” The Annals of Statistics, 32, 407–499.
5. Fan J and Li R (2001), “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, 96, 1348–1360.
6. Folstein MF, Folstein SE, and McHugh PR (1975), “Mini-mental state: a practical method for grading the cognitive state of patients for the clinician,” Journal of Psychiatric Research, 12, 189–198.
7. Friedman JH (2012), “Fast sparse regression and classification,” International Journal of Forecasting, 28, 722–738.
8. Javanmard A and Montanari A (2014), “Confidence intervals and hypothesis testing for high-dimensional regression,” Journal of Machine Learning Research, 15, 2869–2909.
9. Konishi S and Kitagawa G (2008), Information Criteria and Statistical Modeling, Springer Science & Business Media.
10. Lee JD, Sun DL, Sun Y, and Taylor JE (2016), “Exact post-selection inference, with application to the lasso,” The Annals of Statistics, 44, 907–927.
11. Liu Y and Wu Y (2007), “Variable selection via a combination of the L0 and L1 penalties,” Journal of Computational and Graphical Statistics, 16, 782–798.
12. Lockhart R, Taylor J, Tibshirani R, and Tibshirani R (2014), “A significance test for the lasso,” The Annals of Statistics, 42, 413–468.
13. Lu S (2014a), “A new method to build confidence regions for solutions of stochastic variational inequalities,” Optimization: A Journal of Mathematical Programming and Operations Research, 63, 1431–1443.
14. Lu S (2014b), “Symmetric confidence regions and confidence intervals for normal map formulations of stochastic variational inequalities,” SIAM Journal on Optimization, 24, 1458–1484.
15. Lu S and Budhiraja A (2013), “Confidence regions for stochastic variational inequalities,” Mathematics of Operations Research, 38, 545–568.
16. Lu S, Liu Y, Yin L, and Zhang K (2017), “Confidence intervals and regions for the lasso by using stochastic variational inequality techniques in optimization,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 589–611.
17. Lv J and Fan Y (2009), “A unified approach to model selection and sparse recovery using regularized least squares,” The Annals of Statistics, 37, 3498–3528.
18. Mazumder R, Friedman J, and Hastie T (2011), “SparseNet: Coordinate descent with non-convex penalties,” Journal of the American Statistical Association, 106, 1125–1138.
19. Meinshausen N and Bühlmann P (2006), “High-dimensional graphs and variable selection with the Lasso,” The Annals of Statistics, 34, 1436–1462.
20. Nikolova M (2000), “Local strong homogeneity of a regularized estimator,” SIAM Journal on Applied Mathematics, 61, 633–658.
21. Ning Y and Liu H (2017), “A general theory of hypothesis tests and confidence regions for sparse high dimensional models,” The Annals of Statistics, 45, 158–195.
22. Robinson SM (1995), “Sensitivity analysis of variational inequalities by normal-map techniques,” in Variational Inequalities and Network Equilibrium Problems, ed. Giannessi F and Maugeri A, New York: Plenum Press, pp. 257–269.
23. Thung K-H, Wee C-Y, Yap P-T, and Shen D, for the Alzheimer’s Disease Neuroimaging Initiative (2014), “Neurodegenerative disease diagnosis using incomplete multi-modality data via matrix shrinkage and completion,” NeuroImage, 91, 386–400.
24. Tibshirani R (1996), “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B, 58, 267–288.
25. Van de Geer S, Bühlmann P, Ritov Y, and Dezeure R (2014), “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, 42, 1166–1202.
26. Voorman A, Shojaie A, and Witten D (2014), “Inference in high dimensions with the penalized score test,” arXiv preprint arXiv:1401.2678.
27. Wu TT and Lange K (2008), “Coordinate descent algorithms for lasso penalized regression,” The Annals of Applied Statistics, 2, 224–244.
28. Zhang CH (2010), “Nearly unbiased variable selection under minimax concave penalty,” The Annals of Statistics, 38, 894–942.
29. Zhang CH and Zhang SS (2014), “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B, 76, 217–242.
30. Zhao S, Shojaie A, and Witten D (2017), “In defense of the indefensible: a very naive approach to high-dimensional inference,” arXiv preprint arXiv:1705.05543.
31. Zou H (2006), “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, 101, 1418–1429.
32. Zou H and Hastie T (2005), “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B, 67, 301–320.
