Abstract
With the abundance of large data, sparse penalized regression techniques are commonly used in data analysis because of their ability to perform variable selection and estimation simultaneously. A number of convex as well as non-convex penalties have been proposed in the literature to achieve sparse estimates. Despite intense work in this area, how to perform valid inference for sparse penalized regression with a general penalty remains an active research problem. In this paper, using state-of-the-art optimization tools from stochastic variational inequality theory, we propose a unified framework to construct confidence intervals for sparse penalized regression with a wide range of penalties, including convex and non-convex ones. We study inference for parameters under the population version of the penalized regression as well as parameters of the underlying linear model. Theoretical convergence properties of the proposed method are obtained. Several simulated and real data examples are presented to demonstrate the validity and effectiveness of the proposed inference procedure.
Keywords: confidence interval, non-convex penalty, penalized regression, random design, variational inequality
1. Introduction
With the advantage of simultaneous variable selection and estimation, sparse penalized regression techniques have been widely used. By introducing bias into the estimators, sparse penalized regression can often select a simpler model and produce estimators with smaller mean squared errors than unpenalized regression. One well-known representative is the L1 penalized technique LASSO (Donoho and Johnstone, 1994; Tibshirani, 1996). LASSO has become a popular variable selection method due to its good selection performance and computational efficiency. Many other extensions with different penalties have been studied in the literature; see, for example, Fan and Li (2001); Zou and Hastie (2005); Candes and Tao (2007); Liu and Wu (2007); Lv and Fan (2009); Zhang (2010).
For computational implementation of these methods, there is a large literature on efficient algorithms. The LARS algorithm of Efron et al. (2004) and the coordinate descent algorithm of Wu and Lange (2008) are two popular examples. Mazumder et al. (2011) proposed the SparseNet algorithm to deal with non-convex penalties. In terms of inference, much less progress has been made, especially for estimators from non-convex penalized regression. For the LASSO, one common approach is to first perform model selection and then carry out inference based on pivotal distributions conditional on the selected model; see, for example, Lee et al. (2016); Lockhart et al. (2014). This approach does not fully account for the stochastic errors in the model selection step. Another popular approach achieves valid inference by adjusting for the bias introduced by the L1 regularization term. Papers along this line include Javanmard and Montanari (2014); Van de Geer et al. (2014); Zhang and Zhang (2014). Recently, Lu et al. (2017) suggested using a variational inequality formulation to establish an asymptotic distribution of LASSO estimators that can be used to construct confidence intervals (CIs) for the population LASSO parameters as well as the true model parameters. Other work on high dimensional inference includes Voorman et al. (2014); Ning et al. (2017); Zhao et al. (2017).
For sparse penalized regression with penalties more complex than the LASSO, not much work has been done on inference. In particular, with a non-convex penalty, the regression problem may have multiple local optimal solutions. This raises the question of whether one can use a local solution to construct meaningful confidence intervals. The goal of this paper is to provide a unified framework for valid inference in sparse penalized regression with general penalties, based on a local solution. Our assumptions on the penalties are in line with the desired properties for regularized penalty functions given in Fan and Li (2001), namely sparsity, unbiasedness, and continuity. For example, this framework can be applied to the adaptive LASSO penalty (Zou, 2006), the non-convex log penalty (Friedman, 2012), SCAD (Fan and Li, 2001), MCP (Zhang, 2010), the transformed penalty (Nikolova, 2000), and so on.
In this paper, we consider a general random-design penalized regression problem
| min(β0, β) E[(Y − β0 − XTβ)2] + ∑j=1,⋯,p Pλj(|βj|) | (1) |

where X ∈ ℝp is an explanatory random vector with mean 0, and Y ∈ ℝ is a response random variable. Here (β0, β) ∈ ℝ1+p are the regression parameters. For j = 1, ⋯, p, Pλj(·) is a general penalty for βj with the regularization parameter λj. This general penalty covers many convex and non-convex penalties.
The solution of (1) can be estimated by the solution of the corresponding sample average approximation (SAA) problem

| min(β0, β) (1/N) ∑i=1,⋯,N (yi − β0 − xiTβ)2 + ∑j=1,⋯,p Pλj(|βj|) | (2) |

where (x1, y1), ⋯, (xN, yN) are independent samples of (X, Y). We refer to a local solution to the population penalized problem (1) as a population penalized parameter, and to a local solution to the SAA problem (2) as a penalized estimator.
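To illustrate how a penalized estimator solving an SAA problem of the form (2) can be computed in practice, the sketch below fits an MCP-penalized least squares model by proximal gradient descent. This is our own illustrative implementation, not the algorithm used in the paper; the function names and the crude step-size rule are assumptions, and the closed-form MCP proximal operator requires the step size to be smaller than the MCP parameter a.

```python
import numpy as np

def mcp_prox(u, lam, a, step):
    # Proximal operator of the MCP penalty applied coordinatewise
    # (closed form valid when step < a).
    out = u.copy()
    absu = np.abs(u)
    small = absu <= step * lam                  # thresholded to exactly zero
    mid = (~small) & (absu <= a * lam)          # shrunken, partially debiased
    out[small] = 0.0
    out[mid] = np.sign(u[mid]) * (absu[mid] - step * lam) / (1.0 - step / a)
    return out                                  # |u| > a*lam: left unchanged (no bias)

def mcp_penalized_fit(x, y, lam=0.5, a=2.0, n_iter=2000):
    """Proximal-gradient sketch for an MCP-penalized SAA problem.
    Returns (intercept, coefficients); the intercept is not penalized."""
    n, p = x.shape
    beta0, beta = 0.0, np.zeros(p)
    # crude upper bound on the gradient's Lipschitz constant
    step = 1.0 / (2 * np.linalg.norm(x, 2) ** 2 / n + 2)
    for _ in range(n_iter):
        r = x @ beta + beta0 - y
        beta0 -= step * 2 * r.mean()
        beta = mcp_prox(beta - step * 2 * x.T @ r / n, lam, a, step)
    return beta0, beta
```

A design note: because the MCP proximal map is the identity for coordinates with magnitude above aλ, large coefficients are left unshrunk at convergence, which is the debiasing behavior the paper attributes to non-convex penalties with small a.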
The population penalized parameter is closely related to the traditional least squares parameter. When all the penalty terms vanish, the problem (1) becomes the following population least squares problem:

| min(β0, β) E[(Y − β0 − XTβ)2] | (3) |

which has a unique minimizer with slope coefficients (E[XXT])−1 E[XY] when E[XXT] is invertible. If additionally X and Y are related by the following linear model

| Y = β0* + XTβ* + ε | (4) |

with E[ε|X] = 0, then the solution to the population least squares problem (3) is exactly (β0*, β*), which we refer to as the true model parameter. In general, the population penalized parameter is not exactly the true model parameter, but there is a relation between the two, which will be described in Section 3.4.
The idea of our proposed method is to use the penalized estimator to derive confidence intervals for the population penalized parameter, and then to exploit the relation between the population penalized parameter and the true model parameter to derive confidence intervals for the latter in the linear model (4). Therefore, our proposed method can construct confidence intervals for both quantities. Note that valid inference for the population penalized parameter is also useful in problems such as cost-effective linear regression, which takes into account the cost of collecting variables. For each sample, suppose we need to spend cj dollars to collect the value of the jth variable Xj, where j = 1, 2, ⋯, p. For this special linear regression problem, in order to find a relatively cheap linear model with good prediction performance, we need to estimate the regression coefficient vector that minimizes an objective function balancing the expected prediction accuracy and the data collection cost, that is,

min(β0, β) E[(Y − β0 − XTβ)2] + λ ∑j=1,⋯,p cj P(|βj|),

where P(x) is a continuously differentiable non-convex function approximating the indicator function that equals 1 if x > 0 and 0 if x = 0. The parameter λ can be selected according to the budget. In this example, the population penalized parameter becomes a reasonable target of inference, and our proposed method can deliver asymptotically exact confidence intervals for it. On the other hand, even when a model for the relation between X and Y is not available, the confidence intervals for the population penalized parameter still provide a measure of the randomness of the penalized estimators.
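To make the cost-effective regression example concrete, the sketch below evaluates one possible smooth non-convex surrogate for the indicator function and the resulting empirical objective. The particular surrogate x/(x + a) and all function names here are hypothetical illustrations chosen by us, not the specific P used in the paper.

```python
import numpy as np

def indicator_approx(x, a=0.05):
    # Smooth, non-convex surrogate for 1{x > 0} on x >= 0 (hypothetical choice):
    # equals 0 at x = 0 and approaches 1 as x grows; smaller a gives a sharper approximation.
    return x / (x + a)

def cost_penalized_objective(beta0, beta, X, y, costs, lam=1.0, a=0.05):
    # Empirical analogue of: prediction error + lam * approximate data-collection cost,
    # where costs[j] is the per-sample dollar cost c_j of collecting variable j.
    pred_err = np.mean((y - beta0 - X @ beta) ** 2)
    cost = np.sum(costs * indicator_approx(np.abs(beta), a))
    return pred_err + lam * cost
```

Note that with beta = 0 no variable is "collected", so the objective reduces to the mean squared response, matching the role of the indicator-type penalty in the example.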
Our main technique for constructing confidence intervals for the population penalized parameter and the true model parameter consists of three steps. First, we transform problems (1) and (2) into their corresponding variational inequality and normal map formulations and obtain an asymptotic distribution of a solution to the normal map formulation of (2). Next, by finding reliable estimates for the quantities that describe this asymptotic distribution, we provide methods to compute confidence intervals for the population penalized parameter based on a solution to the normal map formulation of (2). Finally, we establish the connection between the population penalized parameter and the true model parameter, from which we obtain a bias-corrected estimator of the latter, its asymptotic distribution, and confidence intervals. The methodology in this paper is developed for a fixed dimension p, based on a local solution to (2). The confidence intervals we obtain for the population penalized parameter are for the local solution to (1) close to the local solution of (2) that is used. On the other hand, for any local solution of (2), we can always obtain confidence intervals for the true model parameter. Indeed, under the setting we consider in Section 3.4, a local solution almost surely converges to the true model parameter.
Although our method and the method proposed in Lu et al. (2017) use similar techniques, there are important new contributions. First, we propose a unified framework to construct confidence intervals for a large class of penalties, including the LASSO penalty of Lu et al. (2017) as a special case. Second, for non-convex penalties, the construction of the confidence intervals and the theoretical analysis are more involved. We propose a new transformation of the original optimization problem into its corresponding variational inequality and normal map formulations, and we study special technical conditions and theoretical results for general penalties. Third, the proposed method based on non-convex penalties can deliver better confidence intervals than methods using convex penalties. For example, in our numerical studies, we compare the method using the MCP penalty with a = 2 against the method using the MCP penalty with a = 2000 (in which case the MCP penalty is very close to the LASSO penalty). Our numerical results indicate that the confidence intervals for a = 2 are generally shorter than those for a = 2000 with similar coverage rates, possibly due to the smaller bias imposed by the MCP penalty with a small a.
The rest of this paper is organized as follows. In Section 2, we review background on variational inequalities and present the problem transformations. Section 3 discusses how to obtain the confidence intervals for the population penalized parameters as well as the true model parameters in the linear model; theoretical convergence results are also given there. In Section 4, we present numerical results to illustrate the performance of the proposed method. Section 5 contains some discussion. Technical details on variational inequalities and proofs are given in the supplementary materials.
Throughout this paper, we use N(0, Σ) to denote a normal random vector with mean zero and covariance matrix Σ, and Yn ⇒ Y to represent weak convergence of a sequence of random variables {Yn} to Y. The inner product between two vectors x and y is denoted by 〈x, y〉. For a convex set S, we use ΠS to denote the Euclidean projection onto S. A function f is said to be B-differentiable at a point x0 if there exists a positively homogeneous function df(x0) such that f(x0 + h) = f(x0) + df(x0)(h) + o(‖h‖). The function df(x0) is called the B-derivative of f at x0.
2. Background and problem transformations
In this section, we first introduce some background on variational inequalities and normal maps. Then we introduce how to transform the problems (1) and (2) to their corresponding variational inequality and normal map formulations. Some assumptions for our theoretical analysis are also given in this section.
2.1. Background on variational inequalities and normal maps
We start with the definition of a variational inequality. Given a function f and a closed, convex set S in ℝn, the variational inequality associated with (f, S) is the problem of finding x ∈ S such that

| 0 ∈ f(x) + NS(x), | (5) |

where NS(x) is the normal cone to S at x defined as

NS(x) = {v ∈ ℝn : 〈v, s − x〉 ≤ 0 for each s ∈ S}.
Variational inequalities are closely related to optimization problems. Consider the problem of minimizing an objective function F over a closed and convex set S. A well-known fact is that if x is a local solution to this minimization problem and F is differentiable at x, then x satisfies the variational inequality

0 ∈ ∇F(x) + NS(x),

where ∇F is the gradient of the function F. Conversely, if x satisfies the above variational inequality and F is a convex function, then x is a global minimizer of F over the set S.
Besides the above connection with the original minimization problem, the variational inequality can be equivalently formulated as an equation, using a concept called the normal map. The normal map induced by f and S is the function fS given by

fS(z) = f(ΠS(z)) + z − ΠS(z),

where ΠS(z) denotes the Euclidean projection of z onto S. For any solution x to the variational inequality (5), the point z = x − f(x) satisfies ΠS(z) = x and

| fS(z) = 0. | (6) |

Conversely, for any solution z to (6), the point x = ΠS(z) is a solution to (5) and satisfies z = x − f(x). Equation (6) is called the normal map formulation of (5).
To understand the above relations, consider the example of minimizing F(x) = ‖x − x0‖2/2 for a fixed point x0 over the convex set S. We know that the solution is the projection of x0 onto the convex set S, denoted by ΠS(x0). For the function F(x), the gradient is ∇F(x) = x − x0. Thus, the variational inequality formulation (5) of this minimization problem is the problem of finding x ∈ S such that 0 ∈ x − x0 + NS(x). That is, we need to find x ∈ S such that 〈x0 − x, s − x〉 ≤ 0 for each s ∈ S. We can show that the solution is ΠS(x0). On the other hand, the normal map induced by ∇F and S is fS(z) = ΠS(z) − x0 + z − ΠS(z) = z − x0. Thus, the normal map formulation (6) of this minimization problem is z − x0 = 0. The solution to this equation is x0, and the point ΠS(x0) is the solution to the original minimization problem. More details about variational inequalities and normal maps can be found in the supplementary materials. In Sections 2.2 and 2.3 below, we show how to transform the problems (1) and (2) to their corresponding normal map formulations.
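The projection example above can be checked numerically. The sketch below uses the box S = [0, 1]2 as an assumed example set (the helper names are ours) and verifies that z = x0 solves the normal map equation while ΠS(x0) satisfies the variational inequality.

```python
import numpy as np

def proj_box(z, lo=0.0, hi=1.0):
    # Euclidean projection onto the box S = [lo, hi]^p
    return np.clip(z, lo, hi)

def normal_map(f, z, lo=0.0, hi=1.0):
    # Normal map induced by f and S: f_S(z) = f(Pi_S(z)) + z - Pi_S(z)
    x = proj_box(z, lo, hi)
    return f(x) + z - x

x0 = np.array([2.0, -0.5])     # fixed point outside the box
grad_F = lambda x: x - x0      # gradient of F(x) = ||x - x0||^2 / 2

# Here the normal map reduces to z - x0, so z = x0 solves f_S(z) = 0,
# and Pi_S(x0) = (1, 0) is the minimizer of F over S.
```

This mirrors the chain of equivalences in the text: the zero of the normal map is x0, and projecting it back onto S recovers the solution of the constrained minimization.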
2.2. Transformations of the population penalized regression
In this subsection, we transform the optimization problem (1) into a normal map formulation. Before discussing details about the transformation, we introduce conditions on the penalties Pλi(·). In this subsection, as well as in Sections 2.3 and 3.1-3.3, λ = (λ1, ⋯, λp) > 0 is fixed.
Assumption 1.
-
(a)
For each i = 1, 2, ⋯, p, Pλi(·) is nonnegative, nondecreasing and continuously differentiable on [0, ∞).
-
(b)
For any local solution to (1), the second derivative of Pλi(ti) is Lipschitz continuous in a neighborhood of the absolute value of the ith component of that solution, for every i = 1, ⋯, p.
Many well-known penalty functions satisfy Assumption 1(a). We list five penalty functions as examples.
-
(a)
The adaptive LASSO penalty (Zou, 2006), defined as Pλi(ti) = λi ti, where λi is the weight for the ith coordinate.
-
(b)
The log penalty (Friedman, 2012), which involves a tuning parameter a > 0.
-
(c)
The transformed penalty (Nikolova, 2000), which involves a tuning parameter a > 0.
-
(d)
The SCAD penalty (Fan and Li, 2001) defined as Pλ(0) = 0 and, for t > 0,
| P′λ(t) = λ { I(t ≤ λ) + (aλ − t)+ / ((a − 1)λ) I(t > λ) }, where a > 2 | (7) |
-
(e)
The MCP penalty (Zhang, 2010) defined as
| Pλ(t) = λt − t2/(2a) if 0 ≤ t ≤ aλ, and Pλ(t) = aλ2/2 if t > aλ, where a > 1 | (8) |
We can check that penalties (a), (b), and (c) satisfy Assumption 1(b). The SCAD and MCP penalties satisfy this assumption almost everywhere. Take the SCAD penalty for example. It corresponds to a quadratic spline with two knots, at which it is not twice continuously differentiable. Assumption 1(b) requires that, for each i, no local solution to (1) is located at these two knots. This is a reasonable assumption, since the set of points at which twice continuous differentiability fails has measure zero.
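As a concrete illustration of the spline structure just discussed, the following sketch evaluates the SCAD and MCP penalty values in their commonly used closed forms (the function names and default values of a are our choices; readers should defer to the paper's displayed definitions (7)-(8)). Both penalties are quadratic splines that are continuous at their knots and constant beyond t = aλ, which is exactly the flatness that motivates the modified transformation in Section 2.2.

```python
import numpy as np

def scad(t, lam, a=3.7):
    # SCAD penalty value in its standard form (t >= 0, a > 2):
    # linear up to lam, quadratic up to a*lam, constant afterwards.
    t = np.asarray(t, dtype=float)
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
            lam**2 * (a + 1) / 2,
        ),
    )

def mcp(t, lam, a=2.0):
    # MCP penalty value in its standard form (t >= 0, a > 1):
    # quadratic up to a*lam, constant afterwards.
    t = np.asarray(t, dtype=float)
    return np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2)
```

The flat region beyond aλ is why these penalties are only nondecreasing (not strictly increasing) on [0, ∞), the case Assumption 1(a) is designed to accommodate.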
In the assumption below, part (a) ensures that the objective function of (1) is finite valued, and part (b) will be used in proving convergence results.
Assumption 2.
-
(a)
The expectations are finite.
-
(b)
The expectations are finite.
Next, we transform the problem (1) into a normal map formulation in three steps. In the first step, we introduce an equivalent problem, in which a new variable t = (t1, ⋯, tp) is added to eliminate the non-smooth term from the objective function of (1). The new problem is as follows:
| (9) |
where m is a positive constant. If we define Si ⊂ ℝ2 as

| Si = {(βi, ti) : −ti ≤ βi ≤ ti} | (10) |

and write the variables in the interleaved order

| (β0, β, t) = (β0, β1, t1, β2, t2, ⋯, βp, tp), | (11) |

then we can treat the feasible set of (9), denoted by S, as a Cartesian product

| S = ℝ × S1 × S2 × ⋯ × Sp. | (12) |
We will use the two ways of ordering of (β0, β, t) in (11) interchangeably for notational convenience.
Note that the above transformation is different from the one used in Lu et al. (2017) for the LASSO penalty. An extra term is added to the objective function of (9) in order to ensure that ti = |βi| at any optimal solution to (9), so that there is a one-to-one correspondence between the optimal solutions to (1) and (9). This is necessary and important when the penalty functions are not strictly increasing on [0, ∞). For instance, some non-convex penalties such as SCAD and MCP are flat on intervals of the form [di, ∞).
In the second step, we transform (9) into a variational inequality. To this end, we need to write down the gradient of its objective function. Define a function F as
| (13) |
Furthermore, define a function f0 as
| (14) |
The function f0 is well defined and finite valued under Assumption 2(a). If Pλi (ti) is twice differentiable at ti for every i = 1, ⋯,p, then we can write down the derivative of F with respect to (β0, β, t) as
| (15) |
where the diagonal matrix has P″λi(ti) as its ith diagonal element, and Ip is the p × p identity matrix. Moreover, the Jacobian matrix of f0 is
| (16) |
The lemma below shows that there is a one-to-one correspondence between the (local or global) optimal solutions to (1) and (9).
Lemma 1.
Suppose Assumptions 1(a) and 2(a) hold. Then the objective function of (9) is finite valued, and its gradient at each point is given by (15). If (β̄0, β̄, t̄) is a (local) optimal solution to (9), then t̄i = |β̄i| for every i, and (β̄0, β̄) is a (local) optimal solution to (1). Conversely, if (β̄0, β̄) is a (local) optimal solution to (1), then (β̄0, β̄, t̄) is a (local) optimal solution to (9), where t̄i = |β̄i| for every i.
If Assumption 1(b) holds additionally, then the Hessian matrix of the objective function of (9) at this solution is the matrix in (16) evaluated there.
In view of Lemma 1, we can transform (9) to the following variational inequality:
| 0 ∈ f0(β0, β, t) + NS(β0, β, t). | (17) |
In the last step, we state the normal map formulation for (17). Let (f0)S be the normal map induced by f0 and S. Then the normal map formulation for (17) is
| (f0)S(z) = 0. | (18) |
For the rest of the paper, let x̄ = (β̄0, β̄, t̄) be a local solution to (9). Then x̄ is also a solution to (17). Therefore, the point z0 defined as

| z0 = x̄ − f0(x̄) | (19) |

is a solution to (18) and satisfies ΠS(z0) = x̄.
Let Σ0 be the covariance matrix of the random gradient appearing in the definition of f0, evaluated at the local solution to (9). We can check that Σ0 is well defined if Assumption 2(b) holds. Since the last p elements of that gradient are not random at the solution, all entries of Σ0 outside its upper left (p + 1) × (p + 1) submatrix are zero. In our theoretical analysis in Section 3, we find that the B-derivative of the normal map (f0)S at z0 plays an important role in the construction of the confidence intervals. To study the property of the B-derivative of (f0)S at z0, we need the following assumption.
Assumption 3.
Let be a local solution to (1), define by
Let be a subset of {1, ⋯,p} defined as
and denote in (16) by L. Let Q1 be the submatrix of L that consists of intersections of columns and rows of L with indices in and let Q2 be the submatrix of L that consists of intersections of columns and rows of L with indices in Define matrix Q as
| (20) |
Assume that Q is nonsingular.
In the above assumption, Q1 is a submatrix of the upper left (p + 1) × (p + 1) submatrix of L, and Q2 is a submatrix of the lower right p × p submatrix of L; the number of indices defining them is generally not equal to p. Since the dimension p is fixed in our theoretical analysis, the matrix Q can be nonsingular in many cases. The nonsingularity of Q is a standard assumption, and it guarantees that the local solution to (1) is locally unique.
As shown in Robinson (1995), the B-derivative of the normal map (f0)S at z0 coincides with the normal map LK induced by the linear function defined by the matrix L and the critical cone K to S associated with z0, defined as

| K = TS(ΠS(z0)) ∩ {z0 − ΠS(z0)}⊥, | (21) |

where TS(x) is the tangent cone to S at x. To be specific, the normal map LK is defined as LK(z) = L ΠK(z) + z − ΠK(z). The tangent cone TS(x) contains all the directions along which x can be approached by a sequence of points in S converging to x. Lemma 2 below shows that LK is a global homeomorphism of ℝ2p+1 (a continuous bijective function whose inverse function is also continuous). In the proof of Lemma 2, we give the explicit expression of the critical cone K.
Lemma 2.
Suppose that Assumptions 1, 2(a) and 3 hold. Then the normal map LK is a global homeomorphism of ℝ2p+1, and there is a neighborhood of the local solution in which it is the unique local solution to (9).
Combining Lemmas 1 and 2, we conclude that the assumptions in Lemma 2 guarantee the corresponding local solution to (1) to be unique in a neighborhood of it.
2.3. Transformations of the SAA problem
We follow the same steps as in Subsection 2.2 to formulate the SAA problem (2) as a normal map equation. First, by introducing the variable t = (t1, ⋯, tp), we transform the SAA problem (2) to the following equivalent problem:
| (22) |
Second, we rewrite (22) as a variational inequality
| (23) |
where fN is defined analogously to f0 with expectations replaced by sample averages. If Pλi(ti) is twice differentiable at ti for every i = 1, ⋯, p, then the Jacobian matrix of fN is given by
| (24) |
Third, denoting the normal map induced by fN and S by (fN)S, we obtain the normal map formulation of (23) as
| (25) |
Let x̂N be a local solution to (22). Then x̂N is also a solution to (23). So the point zN defined as

| zN = x̂N − fN(x̂N) | (26) |

is a solution to (25) and satisfies ΠS(zN) = x̂N.
In fact, under Assumptions 1, 2 and 3, zN is a locally unique solution to (25) when N is large enough, and it converges to a solution z0 of (18). This result will be shown in Subsection 3.1. Correspondingly, the local solution to (22) is locally unique and converges to a local solution to (9). Let ΣN be the sample covariance matrix corresponding to Σ0, with its upper left (p + 1) × (p + 1) submatrix defined analogously. Lemma 3 of Lu et al. (2017) shows that ΣN converges to Σ0 almost surely as N → ∞ for the LASSO penalty. We can similarly prove the same convergence result for a general penalty under Assumptions 1–4. Assumption 4 is stated as follows.
Assumption 4.
-
(a)For each
be the moment generating function of the random variable. Let there be a compact set in ℝ2p+1 that contains the solution in its interior, and on which the second derivative of Pλi(ti) is Lipschitz continuous for each i = 1, ⋯, p. Assume the following conditions.
- There exists a constant such that for each
- There exists a nonnegative random variable κ(X, Y) such that
for all points in that set and almost every (X, Y). - The moment generating function of κ(X, Y) is finite valued in a neighborhood of zero.
-
(b)
The same conditions as in (a) for d1F(β0, β, t, X, Y) instead of F(β0, β, t, X, Y). Accordingly, use E[d1F(β0, β, t, X, Y)] to replace f0(β0, β, t) in the conditions.
-
(c)
The same conditions as in (a) for F(β0, β, t, X, Y)F(β0, β, t, X, Y)T. Accordingly, use E[F(β0, β, t, X, Y)F(β0, β, t, X, Y)T] to replace f0(β0, β, t) in the conditions.
Assumption 4(a) imposes conditions on the random variable F(β0, β, t, X, Y) as well as on the penalty terms. It holds if (X, Y) is a bounded random vector and Assumption 1(b) holds. Assumption 4(a) is used to ensure that the SAA function fN converges to f0 in probability at an exponential rate. We state the result in the following lemma.
Lemma 3.
Suppose that Assumptions 1, 2 and 4(a) hold. Then there exist positive real numbers δ1, μ1, M1 and σ1 such that the following holds for each and each sufficiently large N:
| (27) |
Parts (b) and (c) of Assumption 4 impose the same type of conditions on different random variables. Assumptions 4(a) and 4(b) are needed to construct a reliable estimate for an unknown quantity in the asymptotic distribution in Theorem 1. Assumption 4(c) is only needed when the covariance matrix involved is singular.
3. Construction of confidence intervals using stochastic variational inequality techniques
In this section, we present the proposed method for constructing confidence intervals using stochastic variational inequality techniques, together with related theoretical results. We first develop the limiting distribution of SAA solutions in Section 3.1. Then, in Section 3.2, we show how to estimate the unknown quantities in the limiting distribution. The construction of the confidence intervals for the population penalized parameters and for the true model parameters in the underlying linear model is studied in Sections 3.3 and 3.4, respectively. To present the proposed inference method clearly, we outline the procedure for constructing confidence intervals for the true model parameters in Table 1. The extension to the high dimensional case is provided in Section 3.5.
Table 1:
Construction of the 100(1 − α)% confidence intervals of the true model parameters
| Step 1. Find the penalized estimates by solving the SAA problem (2), with the tuning parameters chosen by the Generalized Information Criterion (GIC). |
| Step 2. Calculate the solution of the normal map formulation (25), |
| Step 3. Calculate where G* is the function defined in (59). |
| Step 4. If, for every i ∈ {0, 1, 2, ⋯, p}, is very close to 0, we consider Case I to construct individual confidence intervals approximately. Otherwise, we consider Case II. |
| Case I: the 100(1 − α)% confidence interval of is [], where and HN is defined in Theorem 5; |
| Case II: we first use simulation to estimate the 100(1 − α/2)% percentile of , where is defined by (56) and . The estimated percentile is denoted as ηi. The 100(1 − α)% confidence interval of is []. |
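Step 4, Case I reduces to standard normal-theory intervals once an estimate of the asymptotic covariance is available. The sketch below is a minimal illustration of that final step only; `theta_hat` and `H_N` are hypothetical inputs standing in for the bias-corrected estimates and the covariance estimator HN of Theorem 5, which the paper obtains from the normal map machinery.

```python
import numpy as np
from statistics import NormalDist

def case1_intervals(theta_hat, H_N, N, alpha=0.05):
    # Normal-based 100(1 - alpha)% intervals:
    # theta_hat[i] +/- z_{1-alpha/2} * sqrt(H_N[i, i] / N)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * np.sqrt(np.diag(H_N) / N)
    return np.column_stack((theta_hat - half, theta_hat + half))
```

Case II would instead simulate the piecewise linear limit to obtain the percentile ηi; we omit that here since it depends on quantities defined later in the section.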
3.1. Convergence and distribution of SAA solutions
Based on Lemma 2 and the relation between (9) and (18), Assumptions 1–3 guarantee that z0 defined in (19) is a locally unique solution to (18). Furthermore, we show in Theorem 1 below that for sufficiently large N, (25) has a unique solution zN in a neighborhood of z0, and that zN converges almost surely to z0. This theorem also provides results on asymptotic distributions and convergence rates.
Theorem 1.
Suppose that Assumptions 1, 2 and 3 hold. Then, with probability 1, there exist neighborhoods such that for sufficiently large N, the equation (25) has a unique solution zN in , and the variational inequality (23) has a unique solution in given by Moreover,
| (28) |
| (29) |
and
| (30) |
In addition, if Assumption 4(a-b) holds, then there exist positive real numbers and σ0, such that for each and each sufficiently large N,
| (31) |
In Theorem 1, LK is the normal map induced by the linear function in (16) and the critical cone K defined in (21). We use LK−1 to denote its inverse function. The functions LK and LK−1 are linear if K is a subspace, and otherwise they are piecewise linear. Compared with Theorem 1 of Lu et al. (2017), which considers the LASSO penalty, Theorem 1 here handles general penalties that satisfy Assumption 1. The results of Theorem 1 are used in the construction of confidence intervals for the population penalized parameter as well as the true model parameter, as shown in the following sections.
3.2. Estimators of Σ0 and LK
In order to use (29) and (30) to obtain computable confidence regions and intervals, we need to find reliable estimators of Σ0 and LK, which we discuss in this subsection. One can show that ΣN converges to Σ0 almost surely under Assumptions 1–4; see the remarks below (26). Therefore, we use ΣN to estimate Σ0. Our main task in this subsection is to introduce an estimator of the normal map LK, knowing that LK is exactly d(f0)S(z0) (Robinson, 1995), the B-derivative of (f0)S at z0. Let dΠS(z) be the B-derivative of the Euclidean projector ΠS at z. Since S is a polyhedral convex set, ΠS coincides with a different affine function on each (2p + 1)-cell in the normal manifold of S (see Table 1 in the supplementary materials for definitions of the normal manifold and cells). The B-derivative dΠS(z) is a linear function for points z in the interior of each such cell, and is piecewise linear for z on the boundary. Moreover, dΠS(z) is not continuous with respect to z at points z on the boundary of any (2p + 1)-cell. Therefore, the function d(f0)S(z) is generally not continuous with respect to z at such points, which can be seen from the chain rule of B-differentiability:
If d(f0)S(·) is not continuous at z0, then d(f0)S(zN) is not guaranteed to converge to d(f0)S(z0) even though zN converges to z0. To introduce the estimators of LK, we will consider two cases based on the location of z0.
For each i = 1, ⋯, p, denote the 9 cells in the normal manifold of Si as in Figure 1. According to (10), we derive the constraints defining each cell, which are listed in Table 1 in the supplementary materials. That table also lists the critical cones to Si associated with a point in the relative interior of each cell. Each (2p + 1)-cell in the normal manifold of S can then be written as a product of such cells, where γ(i) = 0, ⋯, 8 for each i = 1, ⋯, p. From (19), Assumption 1(a) and Lemma 2, we notice that ((z0)2i, (z0)2i+1) can only appear in the relative interior of certain cells for each i. Consequently, dΠS is not continuous at z0 if and only if ((z0)2i, (z0)2i+1) lies in the relative interior of one of the lower-dimensional cells for some index i. The two cases are defined below; the first corresponds to the situation in which the limiting random variable is normally distributed, and the second to situations in which LK is a piecewise linear function.
Figure 1:
The normal manifold of Si (left) and (right).
Case I: In this case, ((z0)2i, (z0)2i+1) is in the relative interior of a full-dimensional cell for all i, and the normal map LK and the B-derivative dΠS(z0) are linear functions. Since d(f0)S(z) is continuous at z0 in this case, we can use dΠS(zN) and d(fN)S(zN) as the estimators of dΠS(z0) and LK, respectively.
Case II: In this case, ((z0)2i, (z0)2i+1) is in the relative interior of a lower-dimensional cell for some index i, and LK and dΠS(z0) are piecewise linear functions. Since d(f0)S(z) is generally not continuous at z0 in this case, we have to derive an estimator of LK other than d(fN)S(zN).
In both cases, d(fN)S(zN) is an invertible linear map with high probability (Lu, 2014b, Proposition 3.5). While it is reasonable to expect Case I to occur more often than Case II in practice, one cannot identify Case I in advance since z0 is unknown. To derive an estimator of LK, we first give the expression of dΠS(z), and then construct an asymptotically exact approximation of it. According to (12), we have
| (32) |
for each z. We denote the B-derivative dΠSi at points in the relative interior of each cell by a corresponding linear function, and define four matrices
Table 2 in the supplementary materials shows the expression of each such function using these matrices. For all z in the relative interior of a given cell, we can write dΠS(z) as
| (33) |
where such that
Table 2:
Coverage rates and average lengths of 95% individual CIs for population penalized parameters for different MCP penalties from 500 replications with sample size N = 300 generated in Example 1.
| a = 2 | a = 2000 | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| λ = 0.5 | λ = 1 | λ = 2 | λ = 0.5 | λ = 1 | λ = 2 | |||||||||||||
| CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | |||||||
| β0 | 0 | 0.99 | 0.26 | 0 | 0.99 | 0.27 | 0 | 0.98 | 0.39 | 0 | 0.98 | 0.28 | 0 | 0.98 | 0.32 | 0 | 0.98 | 0.46 |
| β1 | 3 | 0.97 | 0.30 | 3.13 | 0.97 | 0.36 | 3.37 | 0.95 | 0.88 | 2.83 | 0.97 | 0.32 | 2.67 | 0.98 | 0.38 | 2.33 | 0.96 | 0.56 |
| β2 | 1.5 | 0.97 | 0.30 | 1.25 | 0.97 | 0.49 | 0.51 | 0.96 | 0.84 | 1.36 | 0.98 | 0.33 | 1.22 | 0.98 | 0.38 | 0.94 | 0.98 | 0.56 |
| β3 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.04 | 0 | 1.00 | 0.01 | 0 | 1.00 | 0.00 |
| β4 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.04 | 0 | 1.00 | 0.01 | 0 | 1.00 | 0.00 |
| β5 | 2 | 0.98 | 0.26 | 2.02 | 0.98 | 0.30 | 1.47 | 0.98 | 0.56 | 1.78 | 0.99 | 0.29 | 1.56 | 0.98 | 0.34 | 1.11 | 0.98 | 0.52 |
| β6 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 |
| β7 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.01 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 |
| β8 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 |
Next, we construct an estimator of the B-derivative. We divide the plane (βi, ti) into 9 pieces (see Figure 1). The constraints that define each of these sets are listed in Table 3 in the supplementary materials. The function g(N) in that table can be any combination of finitely many terms of the form aNb with a > 0 and b ∈ (0, 1/2), among other choices. For more details, see Lu and Budhiraja (2013). Each piece is related to a cell in the normal manifold of Si.
Given a sample size N and a fixed z, we define a function as
| (34) |
According to Theorem 3.1 of Lu (2014a), the function defined in (34) converges in probability to its population counterpart under Assumptions 1–4.
Table 3:
Coverage rates and average lengths of 95% individual CIs for true model parameters () for different MCP penalties from 500 replications with sample size N = 300 generated in Example 1.
| | a = 2 | | | | | | a = 2000 | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | λ = 0.5 | | λ = 1 | | λ = 2 | | λ = 0.5 | | λ = 1 | | λ = 2 | | |
| True | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | |
| 0 | 0.96 | 0.23 | 0.96 | 0.23 | 0.95 | 0.34 | 0.96 | 0.24 | 0.95 | 0.28 | 0.95 | 0.40 | |
| 3 | 0.95 | 0.26 | 0.96 | 0.27 | 0.99 | 0.42 | 0.96 | 0.28 | 0.99 | 0.33 | 1.00 | 0.49 | |
| 1.5 | 0.95 | 0.29 | 0.96 | 0.30 | 1.00 | 0.51 | 0.96 | 0.31 | 0.97 | 0.36 | 1.00 | 0.53 | |
| 0 | 0.95 | 0.28 | 0.96 | 0.29 | 0.97 | 0.42 | 0.97 | 0.30 | 0.97 | 0.35 | 0.98 | 0.50 | |
| 0 | 0.97 | 0.28 | 0.97 | 0.29 | 0.97 | 0.42 | 0.96 | 0.30 | 0.97 | 0.35 | 0.98 | 0.50 | |
| 2 | 0.96 | 0.28 | 0.96 | 0.29 | 0.99 | 0.44 | 0.97 | 0.30 | 1.00 | 0.36 | 1.00 | 0.54 | |
| 0 | 0.95 | 0.28 | 0.95 | 0.28 | 0.98 | 0.42 | 0.95 | 0.30 | 0.97 | 0.34 | 0.98 | 0.49 | |
| 0 | 0.96 | 0.28 | 0.96 | 0.29 | 0.98 | 0.42 | 0.96 | 0.30 | 0.97 | 0.35 | 0.99 | 0.50 | |
| 0 | 0.96 | 0.25 | 0.97 | 0.26 | 0.99 | 0.38 | 0.97 | 0.27 | 0.99 | 0.32 | 1.00 | 0.45 | |
Based on (24), (26) and (34), we define a function as
| (35) |
for each z. The following theorem shows that d(fN)S(zN) is a consistent estimator of LK for Case I, and is a consistent estimator of LK for both Case I and Case II.
Theorem 2.
- (a) Suppose that Assumptions 1, 2 and 3 hold. If z0 satisfies the conditions for Case I, then defined in (33) converges to almost surely, and

| (36) |

converges to LK almost surely.
- (b) Suppose that Assumptions 1, 2, 3 and 4(a-b) hold. Then converges to LK in probability.
The two functions d(fN)S(zN) and are generally different when ((zN)2i, (zN)2i+1) belongs to for some i, in which case is a piecewise linear function. In contrast, d(fN)S(zN) is a piecewise linear function only when ((zN)2i, (zN)2i+1) belongs to for some i.
Under Assumptions 1–4, we can show that the weak convergence in (30) still holds after LK is substituted by . Consequently, if is nonsingular, then we have
| (37) |
If is singular, we consider its eigen-decomposition, in which UN is an orthogonal (p + 1) × (p + 1) matrix and ΔN is a diagonal matrix with monotonically decreasing diagonal elements. Let l be the number of positive eigenvalues of , counted with their algebraic multiplicities, let DN be the upper-left submatrix of ΔN whose diagonal elements are at least 1/g(N), and let lN be the number of rows in DN. Furthermore, let (UN)1 be the submatrix of UN that consists of its first lN rows, and let (UN)2 consist of the remaining rows of UN. We present the weak convergence results in the following theorem, which generalizes Theorem 3 in Lu et al. (2017) to cover all penalties satisfying Assumption 1.
Theorem 3.
Suppose that Assumptions 1, 2, 3 and 4(a-b) hold. Then
| (38) |
If is nonsingular, then
| (39) |
and
| (40) |
If is singular and Assumption 4(c) holds, then Prob{lN = l} → 1 as N → ∞,
| (41) |
and
| (42) |
We can treat (39) and (40) as a special case of (41) and (42). In fact, if z0 satisfies Case I, then Theorem 3 still holds if is replaced by d(fN)S(zN).
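The eigenvalue truncation that defines lN can be sketched numerically. The following fragment uses hypothetical helper names, and g(N) = N^(1/4) is just one admissible choice of the form aN^b with b ∈ (0, 1/2): it eigen-decomposes a symmetric matrix, orders the eigenvalues decreasingly, and keeps the eigenvector rows whose eigenvalues are at least 1/g(N).

```python
import numpy as np

def truncated_eigendecomposition(M, N, g=lambda n: n ** 0.25):
    """Split the eigen-decomposition of a symmetric PSD matrix M into the part
    with eigenvalues >= 1/g(N) (rows entering (U_N)_1) and the remainder."""
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]          # decreasing eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= 1.0 / g(N)               # threshold 1/g(N)
    l_N = int(keep.sum())
    U1 = eigvecs[:, keep].T                    # first l_N eigenvector rows
    U2 = eigvecs[:, ~keep].T                   # remaining rows
    return eigvals[keep], U1, U2, l_N
```

With a matrix whose smallest eigenvalue is numerically zero, the count l_N recovers the number of effectively positive eigenvalues, matching the event {lN = l} in the theorem.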
3.3. Confidence intervals for the population penalized parameters
In this subsection, we describe how to obtain confidence intervals for from the asymptotic distribution of zN. First, we investigate the relationship between a solution to the normal map formulation (18) and the corresponding solution to (1). Let be as defined in Assumption 3, and . From (13), (14) and (19), we have . In the supplementary materials, it is shown in (B.1) that , which implies . Thus, confidence intervals for are exactly those for . On the other hand, using the fact for each i = 1, ⋯,p, we have the following relationship between and ((z0)2i, (z0)2i+1):
| (43) |
where V+ = (z0)2i + (z0)2i+1 and V− = (z0)2i − (z0)2i+1. The three cases in (43) cover all the possible locations of ((z0)2i, (z0)2i+1). This map can be used to obtain confidence intervals for after we calculate confidence intervals for ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1). For a fixed i, we denote the (1 − α/2)100% confidence intervals for ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) as respectively. Then a (1 − α)100% (conservative) confidence interval for is given by
| (44) |
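The conservative combination in (44) can be sketched generically. Here h stands in for the penalty-specific piecewise map in (43) from (V+, V−) to the parameter; h is a hypothetical placeholder, and the corner evaluation is valid when h is monotone in each argument, with the (1 − α) level following from a Bonferroni argument over the two (1 − α/2) intervals.

```python
def conservative_ci(ci_sum, ci_diff, h):
    """Combine 1 - alpha/2 CIs for V+ = z_{2i} + z_{2i+1} and
    V- = z_{2i} - z_{2i+1} into a conservative 1 - alpha CI for beta_i.
    h must be componentwise monotone for the corner evaluation to be exact."""
    corners = [h(vp, vm) for vp in ci_sum for vm in ci_diff]
    return min(corners), max(corners)

# Illustration with a toy map recovering z_{2i} = (V+ + V-)/2.
ci = conservative_ci((1.0, 3.0), (0.0, 2.0), lambda vp, vm: (vp + vm) / 2)
```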
Next, we show how to find confidence intervals for (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1). Under Assumptions 1–4, from Theorem 3 we can express the asymptotically exact (1 − α)100% confidence region for z0 as
| (45) |
where is the critical value associated with significance level α of a χ2 distribution with lN degrees of freedom. If is a linear map, then the set in (45) is an ellipsoid in a subspace. Otherwise, it is a union of fractions of different ellipsoids. To obtain simultaneous confidence intervals, we find the maximal and minimal values of (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) over the set in (45) by solving optimization problems.
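When the region is a single ellipsoid {z : (z − ẑ)ᵀA(z − ẑ) ≤ c} with A positive definite, the extremes of a linear functional vᵀz over it have the closed form vᵀẑ ± √(c · vᵀA⁻¹v), so that case needs no iterative optimization; in the piecewise-linear case one would solve per piece and take overall extremes. A numpy sketch with a hypothetical helper name:

```python
import numpy as np

def linear_range_over_ellipsoid(v, z_hat, A, c):
    """Extremes of v'z over {z : (z - z_hat)' A (z - z_hat) <= c},
    via the Cauchy-Schwarz bound v'z = v'z_hat +/- sqrt(c * v' A^{-1} v)."""
    half_width = np.sqrt(c * (v @ np.linalg.solve(A, v)))
    center = v @ z_hat
    return center - half_width, center + half_width
```

For example, for the unit-weighted ellipsoid A = I with radius² c = 4, the range of the first coordinate around ẑ = 0 is (−2, 2).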
For individual confidence intervals, first we notice that is a global homeomorphism with probability 1 as (see the proof of Theorem 2 in the supplementary materials). If is a global homeomorphism, we can use
| (46) |
to approximate the distribution of as in (29). When is a linear map, the distribution in (46) is normal. Therefore (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) also follow normal distributions, from which we can construct individual confidence intervals. When is not a linear map, we simulate data based on the distribution in (46), and find empirical individual confidence intervals for (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) by taking percentiles of the data as the lower and upper bounds respectively.
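The empirical construction in the non-linear case can be sketched as a generic percentile routine (this is an illustration of the idea, not the paper's exact pipeline):

```python
import numpy as np

def percentile_ci(draws, alpha=0.05):
    """Empirical (1 - alpha) CI from simulated draws of the limiting
    distribution: the alpha/2 and 1 - alpha/2 sample percentiles serve as
    the lower and upper bounds."""
    lo, hi = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

As a sanity check, draws from a standard normal yield an interval close to (−1.96, 1.96) at α = 0.05.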
3.4. Confidence intervals for true model parameters in the underlying linear model
In this subsection, we develop a method to compute confidence intervals for , based on a relation between a population penalized parameter and the true model parameter . Suppose the underlying linear model is
| (47) |
where are the true model parameters. Denote the covariance matrix of X as Σ, and assume the random error ε has mean zero and variance σ2. Moreover, ε is independent of Xi for all i = 1, ⋯,p. For simplicity, we assume E(Xi) = 0 for each i = 1, ⋯,p. Consequently, we have . We assume that Σ is nonsingular, and therefore we do not need Assumption 3 in this subsection.
In developing the theoretical results of this subsection, we will let λ = (λ1, ⋯, λp) converge to 0. Due to this change, the assumptions stated in Section 2.2 need to be changed accordingly. We will replace Assumption 1 by Assumption 1’, and keep Assumption 2. We will not need Assumption 4 until the end of this section.
Assumption 1’(a).
For each i = 1, 2, ⋯,p, P0(t) = 0 for all t ≥ 0. Moreover, for each positive λi in a neighborhood of 0, Pλi(·) is nonnegative, nondecreasing and continuously differentiable on
Assumption 1’(b).
For each i = 1, ⋯,p, there exist neighborhoods the second derivative of (·) with respect to ti, exists for each Moreover, are Lipschitz continuous in
Assumption 1’(c).
For each i = 1, ⋯,p, there exists a neighborhood such that the mixed partial derivatives with
Besides the convex LASSO penalty, we can check that many non-convex penalty functions such as SCAD, MCP, the log-penalty, and the transformed penalty satisfy Assumption 1’(a). We can further check that the LASSO penalty, the log-penalty, and the transformed penalty also satisfy Assumptions 1’(b) and 1’(c). For the SCAD and MCP penalties, Assumptions 1’(b) and 1’(c) are satisfied almost everywhere except that for some i. Assumptions 1’(b) and 1’(c) are used to guarantee that the SAA function fN almost surely converges to the true function f0 in the space of continuously differentiable functions on a neighborhood, and that weakly converges to a random function in that space. These assumptions are needed for the techniques based on stochastic variational inequalities to be applicable. It is possible to weaken them by developing techniques for a broader class of problems in which the SAA function fN (or equivalently, the first-order derivative of the penalty function) is not necessarily continuously differentiable, and we will investigate this in future work.
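To make the MCP caveat concrete, here is the standard MCP definition with parameters λ and a (written from the usual parametrization; the paper's (8) may differ in constants): the first derivative is continuous, but its own derivative jumps from −1/a to 0 at t = aλ, which is exactly the kind of point excluded by Assumptions 1'(b) and 1'(c).

```python
def mcp(t, lam, a):
    """MCP penalty on t >= 0: a quadratic ramp up to t = a*lam, then flat."""
    return lam * t - t * t / (2 * a) if t <= a * lam else a * lam * lam / 2

def mcp_grad(t, lam, a):
    """First derivative: lam - t/a on [0, a*lam], 0 beyond.  It is continuous,
    but the second derivative jumps from -1/a to 0 at t = a*lam."""
    return lam - t / a if t <= a * lam else 0.0
```

For λ = 1 and a = 2, the penalty flattens at the value aλ²/2 = 1 once t exceeds aλ = 2, and the gradient hits 0 exactly there.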
Under Assumption 1’(a), with λ = 0 the problem (1) becomes the least squares problem (3), which has a unique solution in view of the linear model (47). Let be defined as if . Then (9) with λ = 0 has a unique solution . By the equivalence between (9) and (17), is also the unique solution to
where f0(β0, β, t) is as defined in (14) but with λ = 0:
Let be as defined in (19) with λ = 0 and The following lemma presents the relation between
Lemma 4.
Suppose that Assumptions 1’(a) and 2 hold. Then we have
| (48) |
where is defined as
| (49) |
Lemma 4 above indicates that G*(zN) is an estimator of the true parameter , where zN is defined in (26). In the following theorem, we show the asymptotic distribution of this estimator. Before stating the theorem, we define a matrix as follows:
| (50) |
and
| (51) |
Note that
This implies that
| (52) |
where
| (53) |
As the setting considered in this subsection differs from that of the previous sections, L* and K* here are different from L and K defined in Assumption 3 and (21). The previous L and K are associated with a solution to the population problem with a fixed positive λ, while L* and K* are associated with and λ = 0.
Theorem 4.
Suppose that Assumptions 1’(a-c) and 2 hold. Let m > 0 be sufficiently small so that is nonsingular, L* and K* be defined as above, be the covariance matrix of the random vector defined in (13), and be the upper left (p + 1) × (p + 1) submatrix of . Moreover, let λi’s be chosen to satisfy for some constant ci ≥ 0, zN be defined in (26), and define is a consistent estimator of and
| (54) |
Note that the distribution of can be normal or non-normal. When the true parameter for each i, we can show that K* is a subspace of is a linear function. Therefore, the limiting distribution of the true parameter estimator G*(zN) is normal in this case. However, if the true parameter for some i, the limiting distribution can be normal or non-normal.
Theorem 5.
Suppose that the assumptions in Theorem 4 hold. If hi = 0 for each i, then is a consistent estimator of and
Furthermore, let be a consistent estimate of Define Then,
| (55) |
Since , we know that . Therefore, if the λi's are chosen to be in the penalty function, the limiting distribution will be a multivariate normal distribution. In this normal case, Theorem 5 above provides a method to compute asymptotically exact individual confidence intervals for .
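In this normal case, an individual interval takes the familiar Wald form; as a sketch, the variance argument below stands in for a diagonal element of the estimated asymptotic covariance matrix, divided by N as in (55):

```python
import math
from statistics import NormalDist

def normal_ci(estimate, variance, N, alpha=0.05):
    """Wald-type individual CI in the normal case: half width is the
    standard-normal quantile times sqrt(asymptotic variance / N)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(variance / N)
    return estimate - half, estimate + half
```

For instance, with estimate 2.0, asymptotic variance 9.0 and N = 900, the half width is about 1.96 × 0.1.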
When for some i, the asymptotic distribution in Theorem 4 does not necessarily reduce to a normal distribution. To see this, consider the following example. Let p = 2, and for some constant (which is satisfied by the LASSO penalty function P(λ, t) = λt). It follows that h1 = h2 = 1. Let q0, q1, q2 ∈ ℝ. To find (L*K*)−1(q0, q1, q2, h1, h2), we consider the following problem
whose solution satisfies Here The solution to the above problem is given by
As a result, is a piecewise affine function of (q0,q1,q2,h1,h2) with three pieces. Furthermore, since G*(·) is a linear transformation, we conclude that the asymptotic distribution is non-normal in this case.
Next, we show how to estimate in the situation considered in Theorem 4. We need to make an assumption analogous to Assumption 4.
Assumption 4’(a).
The same conditions as in Assumption 4(a), with
Assumption 4’(b).
The same conditions as in Assumption 4(b), with
To estimate , we can also use . Under Assumption 4’, to show that is a consistent estimator of , the key is to show that (31) holds (with ). To show this, one needs to show that fN converges to f0 in probability at an exponential rate in the space of continuously differentiable functions. For the first p + 1 components, this follows from Assumption 4’. For the last p components, note that the norm of the function in the space of continuously differentiable functions is bounded due to Assumption 1’(c). In addition, since for each i, we have for some constant ρ. As a result, for each we have for sufficiently large N. Therefore, fN converges to f0 in probability at an exponential rate in the space of continuously differentiable functions, and converges to in probability.
For the normal case, equation (55) justifies using the diagonal elements of divided by N, as the estimated variances of . For the non-normal case, first we define two functions R and as
| (56) |
where . Denote the ith component functions of R and , respectively, for each i. Let be a continuous function and Z be a (2p + 1)-dimensional random variable with as
| (57) |
Suppose that Prob{f(Z) = b} = 0 for all . Then for any given , as defined in (57) is the smallest value that satisfies
Since the map G* has full row rank, is a global homeomorphism and is nonsingular. If hi ≠ 0 for each i, then the matrix representation of each piece of the map R has full row rank as well. Therefore, Prob . The following theorem provides a way to compute individual confidence intervals for in the general case where hi ≠ 0 for each i.
Theorem 6.
Suppose that assumptions in Theorem 4 and Assumptions 4’(a-b) hold, and hi ≠ 0 for each i. Let and ar(·) be as in (57). Then for every and all i = 0, 1, ⋯,p, we have
| (58) |
where R and are defined in (56).
From (58), one can compute the empirical (1 − α) percentile confidence intervals for by simulating data from . The constant r can be used to control the centers of the confidence intervals for all simultaneously, which may affect the interval lengths. A reasonable choice of r is 0 if the empirical distribution of is approximately symmetric about 0. The results of Theorems 5 and 6 are applicable to a wide range of general penalty functions, which cover the LASSO as a special case (Lu et al., 2017). The procedure to construct confidence intervals for the true regression coefficients is summarized in Table 1.
3.5. Extension to the high dimensional case
In our previous theoretical analysis, we assume that the dimension p is fixed. It is interesting to study the extension of our proposed method to the high dimensional case where the dimension p is also allowed to go to infinity.
As shown in Lemma 4, is our proposed estimate for the true parameter . In fact, we can also show that , where z0 is defined in (19) with λ > 0 and G is a map from defined as
| (59) |
and the matrix . Motivated by this result, we can also estimate the true parameter by , where is a map from to defined as
| (60) |
and is a consistent estimate of Σ−1. Theoretically, if for each i, we can show that G*(zN) and have the same asymptotic distribution.
According to the definition of in (60) and the definition of zN in (26), we can show that Therefore, the estimate of β is the sum of an initial estimate (e.g., the LASSO, SCAD or MCP estimate) and a bias-correction term. Interestingly, although we have different motivations, turns out to be the same as the estimate proposed by Van de Geer et al. (2014). For the high dimensional case with if we choose converging to 0 as and use conditions to guarantee that: (a) where s0 is the number of true nonzero regression coefficients; (b) and some sparsity assumptions about the precision matrix Σ−1, we can show that the asymptotic distribution of is normal (Van de Geer et al., 2014). However, the theoretical analysis of the asymptotic distribution of G*(zN) and using stochastic variational inequality techniques for the high dimensional case is challenging. Many fundamental results about variational inequalities (e.g., some results used in the proof of Theorem 1) need to be generalized to the high dimensional case.
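The bias-correction identity described above can be sketched directly; β̂init and Θ̂ below are placeholders for any initial penalized estimate and any precision-matrix estimate:

```python
import numpy as np

def debiased_estimate(X, y, beta_init, Theta_hat):
    """One-step bias correction: beta_init + Theta_hat X'(y - X beta_init)/N,
    the de-sparsified form that coincides with Van de Geer et al. (2014)."""
    N = X.shape[0]
    return beta_init + Theta_hat @ X.T @ (y - X @ beta_init) / N
```

A quick sanity check of the algebra: in the low dimensional full-rank case with Θ̂ = (XᵀX/N)⁻¹, the correction returns the ordinary least squares estimate regardless of the initial estimate.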
Although we assume that the dimension p is fixed in the theoretical analysis using stochastic variational inequality techniques, our proposed method is applicable to large p, small N data in practice. For high dimensional data, we can use the results shown in Theorem 6 to construct confidence intervals. In Section 4, we will use Example 3 to study the performance of our method on high dimensional data.
4. Numerical examples
In this section, we use the MCP method in (8) to illustrate the performance of the techniques proposed in Section 3. For all examples in this section, we choose in (9). We use the mixed integer quadratically constrained program (MIQCP) solver in the optimization modeling language GAMS (Brooke et al., 1998) to obtain accurate solutions to (2).
For all simulated examples, we generate the data using the following linear model:
| (61) |
where is a p-dimensional normal random variable with mean 0 and covariance Σ, and is a standard normal random error independent of X. We set the noise level σ = 1. Under model (61), the population penalized regression problem (1) can be written as
| (62) |
We compute confidence intervals for the population penalized parameter and the true model parameter βtrue, which we refer to as the first and second types of confidence intervals, respectively. To show their performance in the simulation study, we report the following two measures: the empirical coverage rate (the fraction of total replications in which the confidence intervals contain the corresponding population penalized parameters or true model parameters) and the average confidence interval length. For the second type of confidence interval, we compare our proposed method with the LDPE method (Van de Geer et al., 2014; Zhang and Zhang, 2014), the method introduced by Javanmard and Montanari (2014) (denoted as the JM method), and the method proposed by Lu et al. (2017) (denoted as SVI-Lasso). In terms of the tuning parameter λ, we study the performance of our proposed method with some fixed values as well as with the value of λ chosen by the Generalized Information Criterion (GIC; Konishi and Kitagawa, 2008).
4.1. Example 1: Low dimensional setting with the auto-regressive covariance structure
For this example, we generate a training dataset with 500 replications of sample size N = 300, dimension p = 8, true model parameter βtrue = (3, 1.5, 0, 0, 2, 0, 0, 0), and true covariance matrix . We consider six MCP penalties with parameters (λ, a) taking the following values: λ = 0.5, 1 or 2, and a = 2 or 2000. When a = 2000, the MCP penalties are very close to the LASSO penalty. In each replication, after solving the SAA problem for every MCP penalty, we compute the two types of individual confidence intervals at the confidence level 0.95 (α = 0.05).
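One replication of this setup can be sketched as follows. The AR(1) base ρ = 0.5 in Σij = ρ^|i−j| is our assumption (the covariance formula was typeset as math in the original and did not survive extraction), and the second helper mirrors the CR and Len summaries reported in the tables:

```python
import numpy as np

def simulate_example1(N=300, p=8, rho=0.5, sigma=1.0, seed=0):
    """One replication of model (61) with assumed AR(1) covariance
    Sigma_ij = rho^|i-j| and beta_true = (3, 1.5, 0, 0, 2, 0, 0, 0)."""
    rng = np.random.default_rng(seed)
    beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
    y = X @ beta_true + sigma * rng.standard_normal(N)
    return X, y, beta_true

def coverage_and_length(cis, target):
    """Empirical coverage rate and average length over replicated CIs."""
    cover = np.mean([(lo <= target <= hi) for lo, hi in cis])
    length = np.mean([hi - lo for lo, hi in cis])
    return cover, length
```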
Tables 2 and 3 show the empirical coverage rates (CR) and average interval lengths (Len) for 95% individual confidence intervals over the 500 replications. In Table 2, the column contains the population penalized parameters for different MCP penalties, which are expected to be covered by the first type of confidence intervals. In Table 3, the “True” column contains the true model parameters βtrue, which are expected to be covered by the second type of confidence intervals. Note that the coverage rate is 100% for the first type of confidence interval when . This is due to the shrinkage effect of the projection Γ in (43) from z0 to , which causes the confidence intervals for to be the singleton {0}. When λ = 0.5 and a = 2, the population penalized parameters coincide with the true model parameters. As expected, the second type of confidence intervals are much longer than the first type for the inactive parameters β3, β4, β6, β7 and β8. In practice, which type of confidence interval to use depends on the parameters of interest. The first type can be used to assess the randomness of the penalized estimates with a fixed penalty; this inference is especially useful when the penalty conveys prior information on the parameters. In contrast, the second type provides inference for the underlying true parameters directly.
As a remark, the parameter λ controls the level of penalization, and the parameter a in the MCP penalty controls the degree of non-convexity. As shown in Tables 2 and 3, when λ increases to 1 and 2, the differences between the population penalized parameters and the true model parameters become larger. On the other hand, as a gets large, such as a = 2000, the MCP penalty becomes close to the LASSO penalty. The lengths of the second type of confidence intervals for a = 2 are generally shorter than those for a = 2000 with similar coverage rates, as shown in Table 3. This may be due to the smaller bias imposed by the MCP penalty with a small a. In addition, as shown in Table 4, our proposed method with λ selected by GIC has very good performance. The comparison between our proposed methods and the LASSO-type methods indicates that our methods perform well for the inference of the true parameters in the linear model.
Table 4:
Coverage rates and average lengths of 95% individual CIs for true model parameters () for different methods from 500 replications with sample size N = 300 generated in Example 1.
| | Our method (a = 2) | | | | | | | | Lasso type methods | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | λ = 0.5 | | λ = 1 | | λ = 2 | | GIC | | SVI-Lasso | | LDPE | | JM | | |
| True | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | |
| 3 | 0.95 | 0.26 | 0.96 | 0.27 | 0.99 | 0.42 | 0.95 | 0.26 | 0.95 | 0.26 | 0.94 | 0.26 | 0.88 | 0.25 | |
| 1.5 | 0.95 | 0.29 | 0.96 | 0.30 | 1.00 | 0.51 | 0.95 | 0.29 | 0.95 | 0.29 | 0.94 | 0.28 | 0.89 | 0.28 | |
| 0 | 0.95 | 0.28 | 0.96 | 0.29 | 0.97 | 0.42 | 0.95 | 0.28 | 0.95 | 0.28 | 0.95 | 0.28 | 0.99 | 0.28 | |
| 0 | 0.97 | 0.28 | 0.97 | 0.29 | 0.97 | 0.42 | 0.97 | 0.28 | 0.97 | 0.28 | 0.96 | 0.28 | 0.97 | 0.28 | |
| 2 | 0.96 | 0.28 | 0.96 | 0.29 | 0.99 | 0.44 | 0.96 | 0.28 | 0.97 | 0.29 | 0.96 | 0.28 | 0.92 | 0.28 | |
| 0 | 0.95 | 0.28 | 0.95 | 0.28 | 0.98 | 0.42 | 0.95 | 0.28 | 0.94 | 0.28 | 0.95 | 0.28 | 0.98 | 0.28 | |
| 0 | 0.96 | 0.28 | 0.96 | 0.29 | 0.98 | 0.42 | 0.96 | 0.28 | 0.95 | 0.28 | 0.95 | 0.28 | 0.99 | 0.28 | |
| 0 | 0.96 | 0.25 | 0.97 | 0.26 | 0.99 | 0.38 | 0.97 | 0.25 | 0.96 | 0.26 | 0.97 | 0.26 | 0.98 | 0.25 | |
4.2. Example 2: Low dimensional setting with the equi-correlation covariance structure
In this example, we consider the equi-correlation covariance structure where Σij = 0.5 for all i ≠ j and Σjj = 1 for all j. The other settings are the same as Example 1.
Table 5 shows the performance of the 95% individual confidence intervals of the population penalized parameters. The results shown in this table are very similar to the results of Example 1 shown in Table 2. As shown in Table 5, for each fixed λ, the proposed method using a = 2 delivers better performance than the method using a = 2000 in most cases, especially when λ is small. Table 6 shows the comparison of the individual confidence intervals of the true model parameters constructed by our method and the LASSO-type methods. Similar to Example 1, our proposed method using GIC performs well. In addition, for this example, our method (GIC), SVI-Lasso and LDPE deliver similar performance. All three methods perform better than the JM method.
Table 5:
Coverage rates and average lengths of 95% individual CIs for population penalized parameters for different MCP penalties from 500 replications with sample size N = 300 generated in Example 2.
| | a = 2 | | | | | | | | | a = 2000 | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | λ = 0.5 | | | λ = 1 | | | λ = 2 | | | λ = 0.5 | | | λ = 1 | | | λ = 2 | | |
| | β* | CR | Len | β* | CR | Len | β* | CR | Len | β* | CR | Len | β* | CR | Len | β* | CR | Len |
| β0 | 0 | 0.98 | 0.26 | 0 | 0.98 | 0.27 | 0 | 0.99 | 0.38 | 0 | 0.98 | 0.27 | 0 | 0.97 | 0.31 | 0 | 0.97 | 0.41 |
| β1 | 3 | 0.99 | 0.31 | 3.10 | 0.98 | 0.36 | 3.57 | 0.91 | 0.95 | 2.88 | 0.98 | 0.33 | 2.75 | 0.97 | 0.38 | 2.50 | 0.97 | 0.52 |
| β2 | 1.5 | 0.98 | 0.32 | 1.20 | 0.98 | 0.55 | 0.57 | 0.95 | 0.88 | 1.37 | 0.97 | 0.34 | 1.25 | 0.97 | 0.38 | 1.00 | 0.96 | 0.52 |
| β3 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.01 |
| β4 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 |
| β5 | 2 | 0.97 | 0.32 | 2.10 | 0.99 | 0.38 | 1.57 | 0.98 | 0.93 | 1.88 | 0.97 | 0.34 | 1.75 | 0.97 | 0.38 | 1.50 | 0.98 | 0.52 |
| β6 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 |
| β7 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 |
| β8 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 |
Table 6:
Coverage rates and average lengths of 95% individual CIs for true model parameters for different methods from 500 replications with sample size N = 300 generated in Example 2.
| | Our method (a = 2) | | | | | | | | Lasso type methods | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | λ = 0.5 | | λ = 1 | | λ = 2 | | GIC | | SVI-Lasso | | LDPE | | JM | | |
| True | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | |
| 3 | 0.97 | 0.30 | 0.97 | 0.31 | 1.00 | 0.46 | 0.97 | 0.30 | 0.97 | 0.30 | 0.97 | 0.30 | 0.91 | 0.29 | |
| 1.5 | 0.95 | 0.30 | 0.97 | 0.32 | 1.00 | 0.50 | 0.95 | 0.30 | 0.95 | 0.31 | 0.95 | 0.30 | 0.89 | 0.29 | |
| 0 | 0.95 | 0.30 | 0.96 | 0.31 | 0.99 | 0.43 | 0.95 | 0.30 | 0.96 | 0.30 | 0.96 | 0.30 | 0.98 | 0.29 | |
| 0 | 0.96 | 0.30 | 0.97 | 0.31 | 0.99 | 0.43 | 0.96 | 0.30 | 0.96 | 0.30 | 0.96 | 0.30 | 0.99 | 0.29 | |
| 2 | 0.92 | 0.30 | 0.94 | 0.31 | 0.99 | 0.46 | 0.92 | 0.30 | 0.92 | 0.31 | 0.92 | 0.30 | 0.88 | 0.29 | |
| 0 | 0.95 | 0.30 | 0.95 | 0.31 | 1.00 | 0.43 | 0.95 | 0.30 | 0.95 | 0.30 | 0.94 | 0.30 | 0.98 | 0.29 | |
| 0 | 0.95 | 0.30 | 0.95 | 0.31 | 0.99 | 0.43 | 0.95 | 0.30 | 0.95 | 0.30 | 0.95 | 0.30 | 0.96 | 0.29 | |
| 0 | 0.94 | 0.30 | 0.95 | 0.31 | 1.00 | 0.43 | 0.94 | 0.30 | 0.94 | 0.30 | 0.94 | 0.30 | 0.98 | 0.29 | |
4.3. Example 3: High dimensional example
In this example, we consider a high dimensional case in which the dimension is much larger than the sample size. We choose p = 300 with βtrue being a 300-dimensional vector: , and all the other components are 0. The true covariance matrix is . We generate a training dataset with 500 replications of sample size N = 100. For this high dimensional example, we consider three MCP penalties with parameters λ = 0.5, 1 or 2, and a = 3. In each replication, we use the nodewise LASSO regression introduced by Meinshausen and Bühlmann (2006) to compute the estimate of the precision matrix, and we compute the individual confidence intervals of the true model parameters at the confidence level 0.95. Define the active set as . In Table 7, for different methods, we report the average coverage rate, median coverage rate, average length and median length of the individual confidence intervals for the true model parameters in , respectively:
where CRi and Leni denote the empirical coverage rate and average interval length of the confidence interval for the parameter for the 500 replications, respectively.
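The nodewise construction can be sketched with a plain coordinate-descent lasso (our own minimal solver, not the paper's GAMS-based code); the τj² scaling follows the de-sparsified-lasso convention of Van de Geer et al. (2014):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate descent for (1/2N)||y - Xb||^2 + lam * ||b||_1."""
    N, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / N
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r_j / N
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

def nodewise_precision(X, lam=0.1):
    """Nodewise lasso precision estimate: regress each column on the rest,
    set C_jj = 1, C_jk = -gamma_jk, and scale row j by 1/tau_j^2 with
    tau_j^2 = r_j'r_j / N + lam * ||gamma_j||_1."""
    N, p = X.shape
    C, tau2 = np.eye(p), np.zeros(p)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        g = lasso_cd(X[:, others], X[:, j], lam)
        C[j, others] = -g
        r = X[:, j] - X[:, others] @ g
        tau2[j] = r @ r / N + lam * np.abs(g).sum()
    return C / tau2[:, None]
```

With independent columns the estimate should be close to the identity, since the nodewise regressions then select (almost) nothing.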
Table 7:
Average coverage rates and lengths of 95% individual confidence intervals for the true model parameters in the linear model with different methods computed from 500 replications with sample size N = 100 and dimension p = 300 generated in Example 3.
| Our method (λ = 0.5) | | | | Our method (λ = 1) | | | | Our method (λ = 2) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avgcov | Medcov | Avglen | Medlen | Avgcov | Medcov | Avglen | Medlen | Avgcov | Medcov | Avglen | Medlen | |
| 92.82 | 92.60 | 0.46 | 0.45 | 94.64 | 95.00 | 0.66 | 0.65 | 95.24 | 95.40 | 1.26 | 1.25 | |
| 93.26 | 93.40 | 0.39 | 0.39 | 93.51 | 93.60 | 0.56 | 0.56 | 93.73 | 94.00 | 1.08 | 1.08 | |
| Our method (GIC) | | | | LDPE | | | | JM | | | | |
| Avgcov | Medcov | Avglen | Medlen | Avgcov | Medcov | Avglen | Medlen | Avgcov | Medcov | Avglen | Medlen | |
| 92.91 | 93.00 | 0.57 | 0.56 | 93.84 | 94.40 | 1.13 | 1.14 | 88.07 | 87.80 | 0.55 | 0.55 | |
| 93.37 | 93.40 | 0.47 | 0.47 | 95.31 | 95.60 | 1.14 | 1.14 | 99.38 | 99.40 | 0.55 | 0.55 | |
For our proposed methods, as λ increases to 1 and 2, both the average coverage rates and lengths increase. Compared with LDPE, our proposed method using GIC to choose the tuning parameter has much shorter average lengths while the average coverage rates are only slightly lower. Although the JM method delivers similar average lengths as our proposed method (GIC), the average coverage rates of our proposed method are much closer to the nominal level 95%. Overall, the results shown in Table 7 indicate that our proposed method still delivers comparable performance for the high dimensional case.
4.4. Example 4: ADNI data
In this real data example, we consider the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (http://www.loni.ucla.edu/ADNI). The main goal of ADNI was to test whether the serial structural magnetic resonance imaging (MRI), fluorodeoxyglucose positron emission tomography (FDG-PET) images and some other biological markers such as cerebrospinal fluid (CSF) could be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s Disease (AD). To that end, 800 adults with ages between 55 and 90 were recruited from over 50 sites across the US and Canada. In our analysis, we use data from 199 subjects who have complete baseline MRI, FDG-PET, and CSF data. Using the data processing method shown in Thung et al. (2014), we obtained 93 MRI features, 93 PET features, and 5 CSF features for each subject. The response variable is the Mini-Mental State Examination (MMSE) score (Folstein et al. (1975)) which is often used to screen for cognitive impairment.
The data are standardized at the beginning of our analysis. For our proposed method, we use the MCP penalty with the parameter a = 3 and choose the best tuning parameter λ by GIC. Table 8 shows the selected features of different methods, where the selected features are the features whose 95% confidence intervals do not contain 0. The numbers of features selected by our method, SVI-Lasso, LDPE, and JM are 13, 12, 13, and 3, respectively. Among the 13 features selected by our proposed method, 11 features are selected by the SVI-Lasso method, 9 features are selected by the LDPE method and 3 features are selected by the JM method. Table 9 shows the estimates and 95% individual confidence intervals of the 13 features selected by our proposed method. The results of our proposed method and the results of SVI-Lasso and LDPE methods are comparable. As shown in Table 9, for most features among the 13 features, the absolute values of the estimates delivered by the JM method are much smaller than the corresponding values of the other methods. The 95% confidence intervals of the JM method are also very different from the corresponding confidence intervals of the other methods.
Table 8:
Selected features of different methods for the ADNI data.
| Method | Selected Features |
|---|---|
| Our method (GIC) | 9, 19, 40, 59, 67, 80, 95, 130, 134, 147, 156, 168, 178 |
| SVI-Lasso | 9, 19, 40, 77, 80, 95, 130, 134, 147, 156, 168, 178 |
| LDPE | 9, 19, 40, 59, 77, 80, 83, 90, 111, 134, 147, 156, 168 |
| JM | 19, 40, 134 |
Table 9:
Estimates and 95% individual confidence intervals of the 13 features selected by our proposed method for the ADNI data.
| Feature | Our method (GIC) Est | Our method (GIC) Ind CI | SVI-Lasso Est | SVI-Lasso Ind CI | LDPE Est | LDPE Ind CI | JM Est | JM Ind CI |
|---|---|---|---|---|---|---|---|---|
| 9 | −0.20 | [−0.33, −0.07] | −0.20 | [−0.33, −0.07] | −0.19 | [−0.34, −0.04] | −0.15 | [−0.31, 0.01] |
| 19 | 0.24 | [0.06, 0.41] | 0.23 | [0.07, 0.40] | 0.24 | [0.09, 0.40] | 0.25 | [0.08, 0.42] |
| 40 | −0.21 | [−0.36, −0.06] | −0.21 | [−0.35, −0.06] | −0.20 | [−0.35, −0.05] | −0.16 | [−0.33, 0.00] |
| 59 | 0.15 | [0.01, 0.29] | 0.15 | [0.00, 0.30] | 0.16 | [0.01, 0.30] | 0.12 | [−0.03, 0.28] |
| 67 | 0.13 | [0.00, 0.27] | 0.13 | [0.00, 0.26] | 0.12 | [−0.02, 0.26] | 0.11 | [−0.03, 0.26] |
| 80 | 0.23 | [0.03, 0.43] | 0.23 | [0.03, 0.42] | 0.21 | [0.04, 0.38] | 0.15 | [−0.01, 0.30] |
| 95 | 0.20 | [0.00, 0.40] | 0.21 | [0.01, 0.41] | 0.19 | [−0.01, 0.39] | 0.04 | [−0.11, 0.20] |
| 130 | 0.21 | [0.00, 0.42] | 0.20 | [0.01, 0.40] | 0.18 | [−0.03, 0.39] | 0.04 | [−0.12, 0.19] |
| 134 | 0.25 | [0.08, 0.43] | 0.24 | [0.02, 0.45] | 0.24 | [0.06, 0.43] | 0.23 | [0.08, 0.38] |
| 147 | −0.22 | [−0.41, −0.02] | −0.22 | [−0.41, −0.03] | −0.21 | [−0.40, −0.02] | −0.03 | [−0.18, 0.13] |
| 156 | −0.19 | [−0.36, −0.02] | −0.19 | [−0.36, −0.02] | −0.18 | [−0.37, 0.00] | −0.05 | [−0.21, 0.11] |
| 168 | −0.24 | [−0.43, −0.04] | −0.24 | [−0.43, −0.04] | −0.22 | [−0.43, −0.02] | 0.01 | [−0.14, 0.16] |
| 178 | −0.19 | [−0.34, −0.03] | −0.19 | [−0.35, −0.04] | −0.18 | [−0.36, 0.00] | −0.07 | [−0.22, 0.08] |
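The selection rule used in Tables 8 and 9 (keep a feature when its 95% confidence interval excludes zero) is straightforward to apply once estimates and interval endpoints are available. The sketch below uses the first four rows of Table 9 as hypothetical input; the array names are ours.

```python
import numpy as np

# Estimates and 95% CI endpoints for the first four features of Table 9
# (our method with GIC); in practice these come from the SVI-based procedure.
est   = np.array([-0.20, 0.24, -0.21, 0.15])
lower = np.array([-0.33, 0.06, -0.36, 0.01])
upper = np.array([-0.07, 0.41, -0.06, 0.29])

# A feature is selected when its interval excludes zero, i.e. both
# endpoints lie strictly on the same side of zero.
selected = (lower > 0) | (upper < 0)
print(selected)  # all four intervals here exclude zero
```

Note that these are individual (per-coefficient) intervals, so the rule controls coverage feature by feature rather than simultaneously.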
5. Discussion
In this paper we propose a unified framework to construct confidence intervals for the population penalized parameters as well as the true model parameters for a large class of penalties. By transforming the population penalized regression problem (1) and its SAA problem (2) into the equivalent problems (9) and (22) respectively, we eliminate the non-smoothness in the objectives. We then obtain their normal map formulations (18) and (25), and derive the asymptotic distributions and the two types of confidence intervals. Our numerical results show that these methods are effective. When the objective functions in (1) and (2) are non-convex as a result of non-convex penalty functions, most existing algorithms are only guaranteed to find a local optimal solution of the SAA problem, and our proposed methods generate confidence intervals based on that local solution. In practice, we solve for an SAA solution and then use (26) to obtain a solution to (25). The first type of confidence intervals we compute are for a local optimal solution of the population penalized regression problem (1). From any local solution of (2), we can always compute confidence intervals for the true model parameters, which are the second type of confidence intervals we compute.
Acknowledgments
The authors thank the editors, the associate editor, and referees for their helpful comments and suggestions. This research was supported in part by US National Science Foundation grants DMS-1407241 (Liu, Lu and Yin), and DMS-1109099 (Lu and Yin).
References
- Brooke A, Kendrick D, Meeraus A, and Raman R (1998), GAMS, A User’s Guide, Washington, DC: GAMS Development Corporation, available online at http://www.gams.com.
- Candes EJ and Tao T (2007), “The Dantzig selector: statistical estimation when p is much larger than n,” The Annals of Statistics, 35, 2313–2351.
- Donoho DL and Johnstone IM (1994), “Ideal spatial adaptation by wavelet shrinkage,” Biometrika, 81, 425–455.
- Efron B, Hastie T, Johnstone I, and Tibshirani R (2004), “Least angle regression,” The Annals of Statistics, 32, 407–499.
- Fan J and Li R (2001), “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, 96, 1348–1360.
- Folstein MF, Folstein SE, and McHugh PR (1975), “Mini-mental state: a practical method for grading the cognitive state of patients for the clinician,” Journal of Psychiatric Research, 12, 189–198.
- Friedman JH (2012), “Fast sparse regression and classification,” International Journal of Forecasting, 28, 722–738.
- Javanmard A and Montanari A (2014), “Confidence intervals and hypothesis testing for high-dimensional regression,” Journal of Machine Learning Research, 15, 2869–2909.
- Konishi S and Kitagawa G (2008), Information Criteria and Statistical Modeling, Springer Science & Business Media.
- Lee JD, Sun DL, Sun Y, and Taylor JE (2016), “Exact post-selection inference, with application to the lasso,” The Annals of Statistics, 44, 907–927.
- Liu Y and Wu Y (2007), “Variable selection via a combination of the L0 and L1 penalties,” Journal of Computational and Graphical Statistics, 16, 782–798.
- Lockhart R, Taylor J, Tibshirani R, and Tibshirani R (2014), “A significance test for the lasso,” The Annals of Statistics, 42, 413–468.
- Lu S (2014a), “A new method to build confidence regions for solutions of stochastic variational inequalities,” Optimization: A Journal of Mathematical Programming and Operations Research, 63, 1431–1443.
- Lu S (2014b), “Symmetric confidence regions and confidence intervals for normal map formulations of stochastic variational inequalities,” SIAM Journal on Optimization, 24, 1458–1484.
- Lu S and Budhiraja A (2013), “Confidence regions for stochastic variational inequalities,” Mathematics of Operations Research, 38, 545–568.
- Lu S, Liu Y, Yin L, and Zhang K (2017), “Confidence intervals and regions for the lasso by using stochastic variational inequality techniques in optimization,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 589–611.
- Lv J and Fan Y (2009), “A unified approach to model selection and sparse recovery using regularized least squares,” The Annals of Statistics, 37, 3498–3528.
- Mazumder R, Friedman J, and Hastie T (2011), “SparseNet: Coordinate descent with non-convex penalties,” Journal of the American Statistical Association, 106, 1125–1138.
- Meinshausen N and Buhlmann P (2006), “High-dimensional graphs and variable selection with the Lasso,” The Annals of Statistics, 34, 1436–1462.
- Nikolova M (2000), “Local strong homogeneity of a regularized estimator,” SIAM Journal on Applied Mathematics, 61, 633–658.
- Ning Y and Liu H (2017), “A general theory of hypothesis tests and confidence regions for sparse high dimensional models,” The Annals of Statistics, 45, 158–195.
- Robinson SM (1995), “Sensitivity analysis of variational inequalities by normal-map techniques,” in Variational Inequalities and Network Equilibrium Problems, ed. Giannessi F and Maugeri A, New York: Plenum Press, pp. 257–269.
- Thung K-H, Wee C-Y, Yap P-T, Shen D, and the Alzheimer’s Disease Neuroimaging Initiative (2014), “Neurodegenerative disease diagnosis using incomplete multi-modality data via matrix shrinkage and completion,” NeuroImage, 91, 386–400.
- Tibshirani R (1996), “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B, 58, 267–288.
- Van de Geer S, Buhlmann P, Ritov Y, and Dezeure R (2014), “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, 42, 1166–1202.
- Voorman A, Shojaie A, and Witten D (2014), “Inference in high dimensions with the penalized score test,” arXiv preprint arXiv:1401.2678.
- Wu TT and Lange K (2008), “Coordinate descent algorithms for lasso penalized regression,” The Annals of Applied Statistics, 2, 224–244.
- Zhang CH (2010), “Nearly unbiased variable selection under minimax concave penalty,” The Annals of Statistics, 38, 894–942.
- Zhang CH and Zhang SS (2014), “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B, 76, 217–242.
- Zhao S, Shojaie A, and Witten D (2017), “In defense of the indefensible: a very naive approach to high-dimensional inference,” arXiv preprint arXiv:1705.05543.
- Zou H (2006), “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, 101, 1418–1429.
- Zou H and Hastie T (2005), “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B, 67, 301–320.