Abstract
With the abundance of large data, sparse penalized regression techniques are commonly used in data analysis because of their ability to perform variable selection and estimation simultaneously. A number of convex as well as non-convex penalties have been proposed in the literature to achieve sparse estimates. Despite intense work in this area, how to perform valid inference for sparse penalized regression with a general penalty remains an active research problem. In this paper, using state-of-the-art optimization tools from stochastic variational inequality theory, we propose a unified framework to construct confidence intervals for sparse penalized regression with a wide range of penalties, including convex and non-convex ones. We study inference for parameters under the population version of the penalized regression as well as parameters of the underlying linear model. Theoretical convergence properties of the proposed method are obtained. Several simulated and real data examples are presented to demonstrate the validity and effectiveness of the proposed inference procedure.
Keywords: confidence interval, non-convex penalty, penalized regression, random design, variational inequality
1. Introduction
With the advantage of simultaneous variable selection and estimation, sparse penalized regression techniques have been widely used. By introducing bias into the estimators, sparse penalized regression can often select a simpler model and produce estimators with smaller mean squared errors than unpenalized regression. One well-known representative is the L1 penalized technique LASSO (Donoho and Johnstone, 1994; Tibshirani, 1996). LASSO has become a popular variable selection method due to its good selection performance and computational efficiency. Many other extensions with different penalties have been studied in the literature; see, for example, Fan and Li (2001); Zou and Hastie (2005); Candes and Tao (2007); Liu and Wu (2007); Lv and Fan (2009); Zhang (2010).
For computational implementation of these methods, there is a large literature on efficient algorithms. The LARS algorithm of Efron et al. (2004) and the coordinate descent algorithm of Wu and Lange (2008) are two popular examples. Mazumder et al. (2011) proposed the SparseNet algorithm to deal with non-convex penalties. In terms of inference, much less progress has been made, especially for estimators from non-convex penalized regression. For the LASSO, one common approach is to first perform model selection and then carry out inference based on pivotal distributions conditional on the selected model; see, for example, Lee et al. (2016); Lockhart et al. (2014). This approach does not fully account for the stochastic errors in the model selection step. Another popular approach achieves valid inference by adjusting for the bias introduced by the L1 regularization term. Papers along this line include Javanmard and Montanari (2014); Van de Geer et al. (2014); Zhang and Zhang (2014). Recently, Lu et al. (2017) suggested using a variational inequality formulation to establish an asymptotic distribution of LASSO estimators that can be used to construct confidence intervals (CIs) for the population LASSO parameters as well as the true model parameters. Other work on high dimensional inference includes Voorman et al. (2014); Ning et al. (2017); Zhao et al. (2017).
For sparse penalized regression with penalties more complex than the LASSO, not much work has been done on inference. In particular, with a non-convex penalty, the regression problem may have multiple local optimal solutions. This raises the question of whether one can use a local solution to construct meaningful confidence intervals. The goal of this paper is to provide a unified framework for valid inference in sparse penalized regression with general penalties, based on a local solution. Our assumptions on the penalties are in line with the desired properties for regularized penalty functions given in Fan and Li (2001), namely sparsity, unbiasedness, and continuity. For example, this framework can be applied to the adaptive LASSO penalty (Zou, 2006), the non-convex log penalty (Friedman, 2012), SCAD (Fan and Li, 2001), MCP (Zhang, 2010), the transformed penalty (Nikolova, 2000), and so on.
In this paper, we consider a general random-design penalized regression problem
| min(β0, β) E[(Y − β0 − XTβ)2] + ∑j=1,⋯,p Pλj(|βj|) | (1) |

where X ∈ ℝp is an explanatory random vector with mean 0, and Y ∈ ℝ is a response random variable. Here (β0, β) ∈ ℝ1+p are the regression parameters. For j = 1, ⋯, p, Pλj(·) is a general penalty for βj with the regularization parameter λj. This general penalty covers many convex and non-convex penalties.
The solution of (1) can be estimated by the solution of the corresponding sample average approximation (SAA) problem

| min(β0, β) (1/N) ∑i=1,⋯,N (yi − β0 − xiTβ)2 + ∑j=1,⋯,p Pλj(|βj|) | (2) |

where (x1, y1), ⋯, (xN, yN) are independent samples of (X, Y). We refer to a local solution to the population penalized problem (1) as a population penalized parameter, and to a local solution to the SAA problem (2) as a penalized estimator.
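To illustrate how a penalized estimator solving an SAA problem of the form (2) can be computed in practice, the sketch below fits an MCP-penalized least squares model by proximal gradient descent. This is our own illustrative implementation, not the algorithm used in the paper; the function names and the crude step-size rule are assumptions, and the closed-form MCP proximal operator requires the step size to be smaller than the MCP parameter a.

```python
import numpy as np

def mcp_prox(u, lam, a, step):
    # Proximal operator of the MCP penalty applied coordinatewise
    # (closed form valid when step < a).
    out = u.copy()
    absu = np.abs(u)
    small = absu <= step * lam                  # thresholded to exactly zero
    mid = (~small) & (absu <= a * lam)          # shrunken, partially debiased
    out[small] = 0.0
    out[mid] = np.sign(u[mid]) * (absu[mid] - step * lam) / (1.0 - step / a)
    return out                                  # |u| > a*lam: left unchanged (no bias)

def mcp_penalized_fit(x, y, lam=0.5, a=2.0, n_iter=2000):
    """Proximal-gradient sketch for an MCP-penalized SAA problem.
    Returns (intercept, coefficients); the intercept is not penalized."""
    n, p = x.shape
    beta0, beta = 0.0, np.zeros(p)
    # crude upper bound on the gradient's Lipschitz constant
    step = 1.0 / (2 * np.linalg.norm(x, 2) ** 2 / n + 2)
    for _ in range(n_iter):
        r = x @ beta + beta0 - y
        beta0 -= step * 2 * r.mean()
        beta = mcp_prox(beta - step * 2 * x.T @ r / n, lam, a, step)
    return beta0, beta
```

A design note: because the MCP proximal map is the identity for coordinates with magnitude above aλ, large coefficients are left unshrunk at convergence, which is the debiasing behavior the paper attributes to non-convex penalties with small a.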
The population penalized parameter is closely related to the traditional least squares parameter. When all the penalty terms vanish, the problem (1) becomes the following population least squares problem:

| min(β0, β) E[(Y − β0 − XTβ)2] | (3) |

which has a unique minimizer with slope coefficients (E[XXT])−1 E[XY] when E[XXT] is invertible. If additionally X and Y are related by the following linear model

| Y = β0* + XTβ* + ε | (4) |

with E[ε|X] = 0, then the solution to the population least squares problem (3) is exactly (β0*, β*), which we refer to as the true model parameter. In general, the population penalized parameter is not exactly the true model parameter, but there is a relation between the two, which will be described in Section 3.4.
The idea of our proposed method is to use the penalized estimator to derive confidence intervals for the population penalized parameter, and then to exploit the relation between the population penalized parameter and the true model parameter to derive confidence intervals for the latter in the linear model (4). Therefore, our proposed method can construct confidence intervals for both quantities. Note that valid inference for the population penalized parameter is also useful in problems such as cost-effective linear regression, which takes into account the cost of collecting variables. For each sample, suppose we need to spend cj dollars to collect the value of the jth variable Xj, where j = 1, 2, ⋯, p. For this special linear regression problem, in order to find a relatively cheap linear model with good prediction performance, we need to estimate the regression coefficient vector that minimizes an objective function balancing the expected prediction accuracy and the data collection cost, that is,

min(β0, β) E[(Y − β0 − XTβ)2] + λ ∑j=1,⋯,p cj P(|βj|),

where P(x) is a continuously differentiable non-convex function approximating the indicator function that equals 1 if x > 0 and 0 if x = 0. The parameter λ can be selected according to the budget. In this example, the population penalized parameter becomes a reasonable target of inference, and our proposed method can deliver asymptotically exact confidence intervals for it. On the other hand, even when a model for the relation between X and Y is not available, the confidence intervals for the population penalized parameter still provide a measure of the randomness of the penalized estimators.
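To make the cost-effective regression example concrete, the sketch below evaluates one possible smooth non-convex surrogate for the indicator function and the resulting empirical objective. The particular surrogate x/(x + a) and all function names here are hypothetical illustrations chosen by us, not the specific P used in the paper.

```python
import numpy as np

def indicator_approx(x, a=0.05):
    # Smooth, non-convex surrogate for 1{x > 0} on x >= 0 (hypothetical choice):
    # equals 0 at x = 0 and approaches 1 as x grows; smaller a gives a sharper approximation.
    return x / (x + a)

def cost_penalized_objective(beta0, beta, X, y, costs, lam=1.0, a=0.05):
    # Empirical analogue of: prediction error + lam * approximate data-collection cost,
    # where costs[j] is the per-sample dollar cost c_j of collecting variable j.
    pred_err = np.mean((y - beta0 - X @ beta) ** 2)
    cost = np.sum(costs * indicator_approx(np.abs(beta), a))
    return pred_err + lam * cost
```

Note that with beta = 0 no variable is "collected", so the objective reduces to the mean squared response, matching the role of the indicator-type penalty in the example.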
Our main technique for constructing confidence intervals for the population penalized parameter and the true model parameter consists of three steps. First, we transform problems (1) and (2) into their corresponding variational inequality and normal map formulations and obtain an asymptotic distribution of a solution to the normal map formulation of (2). Next, by finding reliable estimates for the quantities that describe this asymptotic distribution, we provide methods to compute confidence intervals for the population penalized parameter based on a solution to the normal map formulation of (2). Finally, we establish the connection between the population penalized parameter and the true model parameter, from which we obtain a bias-corrected estimator of the latter, its asymptotic distribution, and confidence intervals. The methodology in this paper is developed for a fixed dimension p, based on a local solution to (2). The confidence intervals we obtain for the population penalized parameter are for the local solution to (1) close to the local solution of (2) that is used. On the other hand, for any local solution of (2), we can always obtain confidence intervals for the true model parameter. Indeed, under the setting we consider in Section 3.4, a local solution almost surely converges to the true model parameter.
Although our method and the method proposed in Lu et al. (2017) use similar techniques, there are important new contributions. First, we propose a unified framework to construct confidence intervals for a large class of penalties, including the LASSO penalty of Lu et al. (2017) as a special case. Second, for non-convex penalties, the construction of the confidence intervals and the theoretical analysis are more involved. We propose a new transformation of the original optimization problem into its corresponding variational inequality and normal map formulations, and we study special technical conditions and theoretical results for general penalties. Third, the proposed method based on non-convex penalties can deliver better confidence intervals than methods using convex penalties. For example, in our numerical studies, we compare the method using the MCP penalty with a = 2 against the method using the MCP penalty with a = 2000 (in which case the MCP penalty is very close to the LASSO penalty). Our numerical results indicate that the confidence intervals for a = 2 are generally shorter than those for a = 2000 with similar coverage rates, possibly due to the smaller bias imposed by the MCP penalty with a small a.
The rest of this paper is organized as follows. In Section 2, we review background on variational inequalities and present the problem transformations. Section 3 discusses how to obtain the confidence intervals for the population penalized parameters as well as the true model parameters in the linear model; theoretical convergence results are also given there. In Section 4, we present numerical results to illustrate the performance of the proposed method. Section 5 contains some discussion. Technical details on variational inequalities and proofs are given in the supplementary materials.
Throughout this paper, we use N(0, Σ) to denote a normal random vector with mean zero and covariance matrix Σ, and Yn ⇒ Y to represent weak convergence of a sequence of random variables {Yn} to Y. The inner product between two vectors x and y is denoted by 〈x, y〉. For a convex set S, we use ΠS to denote the Euclidean projection onto S. A function f is said to be B-differentiable at a point x0 if there exists a positively homogeneous function df(x0) such that f(x0 + h) = f(x0) + df(x0)(h) + o(‖h‖). The function df(x0) is called the B-derivative of f at x0.
2. Background and problem transformations
In this section, we first introduce some background on variational inequalities and normal maps. Then we introduce how to transform the problems (1) and (2) to their corresponding variational inequality and normal map formulations. Some assumptions for our theoretical analysis are also given in this section.
2.1. Background on variational inequalities and normal maps
We start with the definition of a variational inequality. Given a function f and a closed, convex set S in ℝn, the variational inequality associated with (f, S) is the problem of finding x ∈ S such that

| 0 ∈ f(x) + NS(x), | (5) |

where NS(x) is the normal cone to S at x defined as

NS(x) = {v ∈ ℝn : 〈v, s − x〉 ≤ 0 for each s ∈ S}.
Variational inequalities are closely related to optimization problems. Consider the problem of minimizing an objective function F over a closed and convex set S. A well-known fact is that if x is a local solution to this minimization problem and F is differentiable at x, then x satisfies the variational inequality

0 ∈ ∇F(x) + NS(x),

where ∇F is the gradient of the function F. Conversely, if x satisfies the above variational inequality and F is a convex function, then x is a global minimizer of F over the set S.
Besides the above connection with the original minimization problem, the variational inequality can be equivalently formulated as an equation, using a concept called the normal map. The normal map induced by f and S is the function fS given by

fS(z) = f(ΠS(z)) + z − ΠS(z),

where ΠS(z) denotes the Euclidean projection of z onto S. For any solution x to the variational inequality (5), the point z = x − f(x) satisfies ΠS(z) = x and

| fS(z) = 0. | (6) |

Conversely, for any solution z to (6), the point x = ΠS(z) is a solution to (5) and satisfies z = x − f(x). Equation (6) is called the normal map formulation of (5).
To understand the above relations, consider the example of minimizing F(x) = ‖x − x0‖2/2 for a fixed point x0 over the convex set S. We know that the solution is the projection of x0 onto the convex set S, denoted by ΠS(x0). For the function F(x), the gradient is ∇F(x) = x − x0. Thus, the variational inequality formulation (5) of this minimization problem is the problem of finding x ∈ S such that 0 ∈ x − x0 + NS(x). That is, we need to find x ∈ S such that 〈x0 − x, s − x〉 ≤ 0 for each s ∈ S. We can show that the solution is ΠS(x0). On the other hand, the normal map induced by ∇F and S is fS(z) = ΠS(z) − x0 + z − ΠS(z) = z − x0. Thus, the normal map formulation (6) of this minimization problem is z − x0 = 0. The solution to this equation is x0, and the point ΠS(x0) is the solution to the original minimization problem. More details about variational inequalities and normal maps can be found in the supplementary materials. In Sections 2.2 and 2.3 below, we show how to transform the problems (1) and (2) to their corresponding normal map formulations.
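The projection example above can be checked numerically. The sketch below uses the box S = [0, 1]2 as an assumed example set (the helper names are ours) and verifies that z = x0 solves the normal map equation while ΠS(x0) satisfies the variational inequality.

```python
import numpy as np

def proj_box(z, lo=0.0, hi=1.0):
    # Euclidean projection onto the box S = [lo, hi]^p
    return np.clip(z, lo, hi)

def normal_map(f, z, lo=0.0, hi=1.0):
    # Normal map induced by f and S: f_S(z) = f(Pi_S(z)) + z - Pi_S(z)
    x = proj_box(z, lo, hi)
    return f(x) + z - x

x0 = np.array([2.0, -0.5])     # fixed point outside the box
grad_F = lambda x: x - x0      # gradient of F(x) = ||x - x0||^2 / 2

# Here the normal map reduces to z - x0, so z = x0 solves f_S(z) = 0,
# and Pi_S(x0) = (1, 0) is the minimizer of F over S.
```

This mirrors the chain of equivalences in the text: the zero of the normal map is x0, and projecting it back onto S recovers the solution of the constrained minimization.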
2.2. Transformations of the population penalized regression
In this subsection, we transform the optimization problem (1) into a normal map formulation. Before discussing details about the transformation, we introduce conditions on the penalties Pλi(·). In this subsection, as well as in Sections 2.3 and 3.1-3.3, λ = (λ1, ⋯, λp) > 0 is fixed.
Assumption 1.
-
(a)
For each i = 1, 2, ⋯, p, Pλi(·) is nonnegative, nondecreasing and continuously differentiable on [0, ∞).
-
(b)
For any local solution to (1), the second derivative of Pλi(ti) is Lipschitz continuous in a neighborhood of the absolute value of the ith component of that solution, for every i = 1, ⋯, p.
Many well-known penalty functions satisfy Assumption 1(a). We list five penalty functions as examples.
-
(a)
The adaptive LASSO penalty (Zou, 2006), defined as Pλi(ti) = λi ti, where λi is the weight for the ith coordinate.
-
(b)
The log penalty (Friedman, 2012), which involves a tuning parameter a > 0.
-
(c)
The transformed penalty (Nikolova, 2000), which involves a tuning parameter a > 0.
-
(d)
The SCAD penalty (Fan and Li, 2001) defined as Pλ(0) = 0 and, for t > 0,
| P′λ(t) = λ { I(t ≤ λ) + (aλ − t)+ / ((a − 1)λ) I(t > λ) }, where a > 2 | (7) |
-
(e)
The MCP penalty (Zhang, 2010) defined as
| Pλ(t) = λt − t2/(2a) if 0 ≤ t ≤ aλ, and Pλ(t) = aλ2/2 if t > aλ, where a > 1 | (8) |
We can check that penalties (a), (b), and (c) satisfy Assumption 1(b). The SCAD and MCP penalties satisfy this assumption almost everywhere. Take the SCAD penalty for example. It corresponds to a quadratic spline with two knots, at which it is not twice continuously differentiable. Assumption 1(b) requires that, for each i, no local solution to (1) is located at these two knots. This is a reasonable assumption, since the set of points at which twice continuous differentiability fails has measure zero.
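As a concrete illustration of the spline structure just discussed, the following sketch evaluates the SCAD and MCP penalty values in their commonly used closed forms (the function names and default values of a are our choices; readers should defer to the paper's displayed definitions (7)-(8)). Both penalties are quadratic splines that are continuous at their knots and constant beyond t = aλ, which is exactly the flatness that motivates the modified transformation in Section 2.2.

```python
import numpy as np

def scad(t, lam, a=3.7):
    # SCAD penalty value in its standard form (t >= 0, a > 2):
    # linear up to lam, quadratic up to a*lam, constant afterwards.
    t = np.asarray(t, dtype=float)
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
            lam**2 * (a + 1) / 2,
        ),
    )

def mcp(t, lam, a=2.0):
    # MCP penalty value in its standard form (t >= 0, a > 1):
    # quadratic up to a*lam, constant afterwards.
    t = np.asarray(t, dtype=float)
    return np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2)
```

The flat region beyond aλ is why these penalties are only nondecreasing (not strictly increasing) on [0, ∞), the case Assumption 1(a) is designed to accommodate.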
In the assumption below, part (a) ensures that the objective function of (1) is finite valued, and part (b) will be used in proving convergence results.
Assumption 2.
-
(a)
The expectations are finite.
-
(b)
The expectations are finite.
Next, we transform the problem (1) into a normal map formulation in three steps. In the first step, we introduce an equivalent problem, in which a new variable t = (t1, ⋯, tp) is added to eliminate the non-smooth term from the objective function of (1). The new problem is as follows:
| (9) |
where m is a positive constant. If we define Si ⊂ ℝ2 as

| Si = {(βi, ti) : −ti ≤ βi ≤ ti} | (10) |

and write the variables in the interleaved order

| (β0, β, t) = (β0, β1, t1, β2, t2, ⋯, βp, tp), | (11) |

then we can treat the feasible set of (9), denoted by S, as a Cartesian product

| S = ℝ × S1 × S2 × ⋯ × Sp. | (12) |
We will use the two ways of ordering of (β0, β, t) in (11) interchangeably for notational convenience.
Note that the above transformation is different from the one used in Lu et al. (2017) for the LASSO penalty. An extra term is added to the objective function of (9) in order to ensure that ti = |βi| at any optimal solution to (9), so that there is a one-to-one correspondence between the optimal solutions to (1) and (9). This is necessary and important when the penalty functions are not strictly increasing on [0, ∞). For instance, some non-convex penalties such as SCAD and MCP are flat on intervals of the form [di, ∞).
In the second step, we transform (9) into a variational inequality. To this end, we need to write down the gradient of its objective function. Define a function F as
| (13) |
Furthermore, define a function f0 as
| (14) |
The function f0 is well defined and finite valued under Assumption 2(a). If Pλi (ti) is twice differentiable at ti for every i = 1, ⋯,p, then we can write down the derivative of F with respect to (β0, β, t) as
| (15) |
where the diagonal matrix has P″λi(ti) as its ith diagonal element, and Ip is the p × p identity matrix. Moreover, the Jacobian matrix of f0 is
| (16) |
The lemma below shows that there is a one-to-one correspondence between the (local or global) optimal solutions to (1) and (9).
Lemma 1.
Suppose Assumptions 1(a) and 2(a) hold. Then the objective function of (9) is finite valued, and its gradient at each point is given by (15). If (β̄0, β̄, t̄) is a (local) optimal solution to (9), then t̄i = |β̄i| for every i, and (β̄0, β̄) is a (local) optimal solution to (1). Conversely, if (β̄0, β̄) is a (local) optimal solution to (1), then (β̄0, β̄, t̄) is a (local) optimal solution to (9), where t̄i = |β̄i| for every i.
If Assumption 1(b) holds additionally, then the Hessian matrix of the objective function of (9) at this solution is the matrix in (16) evaluated there.
In view of Lemma 1, we can transform (9) to the following variational inequality:
| 0 ∈ f0(β0, β, t) + NS(β0, β, t). | (17) |
In the last step, we state the normal map formulation for (17). Let (f0)S be the normal map induced by f0 and S. Then the normal map formulation for (17) is
| (f0)S(z) = 0. | (18) |
For the rest of the paper, let x̄ = (β̄0, β̄, t̄) be a local solution to (9). Then x̄ is also a solution to (17). Therefore, the point z0 defined as

| z0 = x̄ − f0(x̄) | (19) |

is a solution to (18) and satisfies ΠS(z0) = x̄.
Let Σ0 be the covariance matrix of the random gradient appearing in the definition of f0, evaluated at the local solution to (9). We can check that Σ0 is well defined if Assumption 2(b) holds. Since the last p elements of that gradient are not random at the solution, all entries of Σ0 outside its upper left (p + 1) × (p + 1) submatrix are zero. In our theoretical analysis in Section 3, we find that the B-derivative of the normal map (f0)S at z0 plays an important role in the construction of the confidence intervals. To study the property of the B-derivative of (f0)S at z0, we need the following assumption.
Assumption 3.
Let be a local solution to (1), define by
Let be a subset of {1, ⋯,p} defined as
and denote in (16) by L. Let Q1 be the submatrix of L that consists of intersections of columns and rows of L with indices in and let Q2 be the submatrix of L that consists of intersections of columns and rows of L with indices in Define matrix Q as
| (20) |
Assume that Q is nonsingular.
In the above assumption, Q1 is a submatrix of the upper left (p + 1) × (p + 1) submatrix of L, and Q2 is a submatrix of the lower right p × p submatrix of L; the number of indices defining them is generally not equal to p. Since the dimension p is fixed in our theoretical analysis, the matrix Q can be nonsingular in many cases. The nonsingularity of Q is a standard assumption, and it guarantees that the local solution to (1) is locally unique.
As shown in Robinson (1995), the B-derivative of the normal map (f0)S at z0 coincides with the normal map LK induced by the linear function defined by the matrix L and the critical cone K to S associated with z0, defined as

| K = TS(ΠS(z0)) ∩ {z0 − ΠS(z0)}⊥, | (21) |

where TS(x) is the tangent cone to S at x. To be specific, the normal map LK is defined as LK(z) = L ΠK(z) + z − ΠK(z). The tangent cone TS(x) contains all the directions along which x can be approached by a sequence of points in S converging to x. Lemma 2 below shows that LK is a global homeomorphism of ℝ2p+1 (a continuous bijective function whose inverse function is also continuous). In the proof of Lemma 2, we give the explicit expression of the critical cone K.
Lemma 2.
Suppose that Assumptions 1, 2(a) and 3 hold. Then the normal map LK is a global homeomorphism of ℝ2p+1, and there is a neighborhood of the local solution in which it is the unique local solution to (9).
Combining Lemmas 1 and 2, we conclude that the assumptions in Lemma 2 guarantee the corresponding local solution to (1) to be unique in a neighborhood of it.
2.3. Transformations of the SAA problem
We follow the same steps as in Subsection 2.2 to formulate the SAA problem (2) as a normal map equation. First, by introducing the variable t = (t1, ⋯, tp), we transform the SAA problem (2) to the following equivalent problem:
| (22) |
Second, we rewrite (22) as a variational inequality
| (23) |
where fN is defined analogously to f0 with expectations replaced by sample averages. If Pλi(ti) is twice differentiable at ti for every i = 1, ⋯, p, then the Jacobian matrix of fN is given by
| (24) |
Third, denoting the normal map induced by fN and S by (fN)S, we obtain the normal map formulation of (23) as
| (25) |
Let x̂N be a local solution to (22). Then x̂N is also a solution to (23). So the point zN defined as

| zN = x̂N − fN(x̂N) | (26) |

is a solution to (25) and satisfies ΠS(zN) = x̂N.
In fact, under Assumptions 1, 2 and 3, zN is a locally unique solution to (25) when N is large enough, and it converges to a solution z0 of (18). This result will be shown in Subsection 3.1. Correspondingly, the local solution to (22) is locally unique and converges to a local solution to (9). Let ΣN be the sample covariance matrix corresponding to Σ0, with its upper left (p + 1) × (p + 1) submatrix defined analogously. Lemma 3 of Lu et al. (2017) shows that ΣN converges to Σ0 almost surely as N → ∞ for the LASSO penalty. We can similarly prove the same convergence result for a general penalty under Assumptions 1–4. Assumption 4 is stated as follows.
Assumption 4.
-
(a)For each
be the moment generating function of the random variable. Let there be a compact set in ℝ2p+1 that contains the solution in its interior, and on which the second derivative of Pλi(ti) is Lipschitz continuous for each i = 1, ⋯, p. Assume the following conditions.
- There exists a constant such that for each
- There exists a nonnegative random variable κ(X, Y) such that
for all points in that set and almost every (X, Y). - The moment generating function of κ(X, Y) is finite valued in a neighborhood of zero.
-
(b)
The same conditions as in (a) for d1F(β0, β, t, X, Y) instead of F(β0, β, t, X, Y). Accordingly, use E[d1F(β0, β, t, X, Y)] to replace f0(β0, β, t) in the conditions.
-
(c)
The same conditions as in (a) for F(β0, β, t, X, Y)F(β0, β, t, X, Y)T. Accordingly, use E[F(β0, β, t, X, Y)F(β0, β, t, X, Y)T] to replace f0(β0, β, t) in the conditions.
Assumption 4(a) imposes conditions on the random variable F(β0, β, t, X, Y) as well as on the penalty terms. It holds if (X, Y) is a bounded random vector and Assumption 1(b) holds. Assumption 4(a) is used to ensure that the SAA function fN converges to f0 in probability at an exponential rate. We state the result in the following lemma.
Lemma 3.
Suppose that Assumptions 1, 2 and 4(a) hold. Then there exist positive real numbers δ1, μ1, M1 and σ1 such that the following holds for each and each sufficiently large N:
| (27) |
Parts (b) and (c) of Assumption 4 impose the same type of conditions on different random variables. Assumptions 4(a) and 4(b) are needed to construct a reliable estimate for an unknown quantity in the asymptotic distribution in Theorem 1. Assumption 4(c) is only needed when the covariance matrix involved is singular.
3. Construction of confidence intervals using stochastic variational inequality techniques
In this section, we present the proposed method for constructing confidence intervals using stochastic variational inequality techniques, together with related theoretical results. We first develop the limiting distribution of SAA solutions in Section 3.1. Then, in Section 3.2, we show how to estimate the unknown quantities in the limiting distribution. The construction of the confidence intervals for the population penalized parameters and for the true model parameters in the underlying linear model is studied in Sections 3.3 and 3.4, respectively. To present the proposed inference method clearly, we outline the procedure for constructing confidence intervals for the true model parameters in Table 1. The extension to the high dimensional case is provided in Section 3.5.
Table 1:
Construction of the 100(1 − α)% confidence intervals of the true model parameters
| Step 1. Find the penalized estimates by solving the SAA problem (2), with the tuning parameters chosen by the Generalized Information Criterion (GIC). |
| Step 2. Calculate the solution of the normal map formulation (25), |
| Step 3. Calculate where G* is the function defined in (59). |
| Step 4. If, for every i ∈ {0, 1, 2, ⋯, p}, is very close to 0, we consider Case I to construct individual confidence intervals approximately. Otherwise, we consider Case II. |
| Case I: the 100(1 − α)% confidence interval of is [], where and HN is defined in Theorem 5; |
| Case II: we first use simulation to estimate the 100(1 − α/2)% percentile of , where is defined by (56) and . The estimated percentile is denoted as ηi. The 100(1 − α)% confidence interval of is []. |
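Step 4, Case I reduces to standard normal-theory intervals once an estimate of the asymptotic covariance is available. The sketch below is a minimal illustration of that final step only; `theta_hat` and `H_N` are hypothetical inputs standing in for the bias-corrected estimates and the covariance estimator HN of Theorem 5, which the paper obtains from the normal map machinery.

```python
import numpy as np
from statistics import NormalDist

def case1_intervals(theta_hat, H_N, N, alpha=0.05):
    # Normal-based 100(1 - alpha)% intervals:
    # theta_hat[i] +/- z_{1-alpha/2} * sqrt(H_N[i, i] / N)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * np.sqrt(np.diag(H_N) / N)
    return np.column_stack((theta_hat - half, theta_hat + half))
```

Case II would instead simulate the piecewise linear limit to obtain the percentile ηi; we omit that here since it depends on quantities defined later in the section.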
3.1. Convergence and distribution of SAA solutions
Based on Lemma 2 and the relation between (9) and (18), Assumptions 1–3 guarantee that z0 defined in (19) is a locally unique solution to (18). Furthermore, we show in Theorem 1 below that for sufficiently large N, (25) has a unique solution zN in a neighborhood of z0, and that zN converges almost surely to z0. This theorem also provides results on asymptotic distributions and convergence rates.
Theorem 1.
Suppose that Assumptions 1, 2 and 3 hold. Then, with probability 1, there exist neighborhoods such that for sufficiently large N, the equation (25) has a unique solution zN in , and the variational inequality (23) has a unique solution in given by Moreover,
| (28) |
| (29) |
and
| (30) |
In addition, if Assumption 4(a-b) holds, then there exist positive real numbers and σ0, such that for each and each sufficiently large N,
| (31) |
In Theorem 1, LK is the normal map induced by the linear function in (16) and the critical cone K defined in (21). We use LK−1 to denote its inverse function. The functions LK and LK−1 are linear if K is a subspace, and otherwise they are piecewise linear. Compared with Theorem 1 of Lu et al. (2017), which considers the LASSO penalty, Theorem 1 here handles general penalties that satisfy Assumption 1. The results of Theorem 1 are used in the construction of confidence intervals for the population penalized parameter as well as the true model parameter, as shown in the following sections.
3.2. Estimators of Σ0 and LK
In order to use (29) and (30) to obtain computable confidence regions and intervals, we need to find reliable estimators of Σ0 and LK, which we discuss in this subsection. One can show that ΣN converges to Σ0 almost surely under Assumptions 1–4; see the remarks below (26). Therefore, we use ΣN to estimate Σ0. Our main task in this subsection is to introduce an estimator of the normal map LK, knowing that LK is exactly d(f0)S(z0) (Robinson, 1995), the B-derivative of (f0)S at z0. Let dΠS(z) be the B-derivative of the Euclidean projector ΠS at z. Since S is a polyhedral convex set, ΠS coincides with a different affine function on each (2p + 1)-cell in the normal manifold of S (see Table 1 in the supplementary materials for definitions of the normal manifold and cells). The B-derivative dΠS(z) is a linear function for points z in the interior of each such cell, and is piecewise linear for z on the boundary. Moreover, dΠS(z) is not continuous with respect to z at points z on the boundary of any (2p + 1)-cell. Therefore, the function d(f0)S(z) is generally not continuous with respect to z at such points, which can be seen from the chain rule of B-differentiability:
If d(f0)S(·) is not continuous at z0, then d(f0)S(zN) is not guaranteed to converge to d(f0)S(z0) even though zN converges to z0. To introduce the estimators of LK, we will consider two cases based on the location of z0.
For each i = 1, ⋯, p, denote the 9 cells in the normal manifold of Si as in Figure 1. According to (10), we derive the constraints defining each cell, which are listed in Table 1 in the supplementary materials. That table also lists the critical cones to Si associated with a point in the relative interior of each cell. Each (2p + 1)-cell in the normal manifold of S can then be written as a product of such cells, where γ(i) = 0, ⋯, 8 for each i = 1, ⋯, p. From (19), Assumption 1(a) and Lemma 2, we notice that ((z0)2i, (z0)2i+1) can only appear in the relative interior of certain cells for each i. Consequently, dΠS is not continuous at z0 if and only if ((z0)2i, (z0)2i+1) lies in the relative interior of one of the lower-dimensional cells for some index i. The two cases are defined below; the first corresponds to the situation in which the limiting random variable is normally distributed, and the second to situations in which LK is a piecewise linear function.
Figure 1:
The normal manifold of Si (left) and (right).
Case I: In this case, ((z0)2i, (z0)2i+1) is in the relative interior of a full-dimensional cell for all i, and the normal map LK and the B-derivative dΠS(z0) are linear functions. Since d(f0)S(z) is continuous at z0 in this case, we can use dΠS(zN) and d(fN)S(zN) as the estimators of dΠS(z0) and LK, respectively.
Case II: In this case, ((z0)2i, (z0)2i+1) is in the relative interior of a lower-dimensional cell for some index i, and LK and dΠS(z0) are piecewise linear functions. Since d(f0)S(z) is generally not continuous at z0 in this case, we have to derive an estimator of LK other than d(fN)S(zN).
In both cases, d(fN)S(zN) is an invertible linear map with high probability (Lu, 2014b, Proposition 3.5). While it is reasonable to expect Case I to occur more often than Case II in practice, one cannot identify Case I in advance since z0 is unknown. To derive an estimator of LK, we first give the expression of dΠS(z), and then construct an asymptotically exact approximation of it. According to (12), we have
| (32) |
for each z. We denote the B-derivative dΠSi at points in the relative interior of each cell by a corresponding linear function, and define four matrices
Table 2 in the supplementary materials shows the expression of each such function using these matrices. For all z in the relative interior of a given cell, we can write dΠS(z) as
| (33) |
where such that
Table 2:
Coverage rates and average lengths of 95% individual CIs for population penalized parameters for different MCP penalties from 500 replications with sample size N = 300 generated in Example 1.
| a = 2 | a = 2000 | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| λ = 0.5 | λ = 1 | λ = 2 | λ = 0.5 | λ = 1 | λ = 2 | |||||||||||||
| CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | |||||||
| β0 | 0 | 0.99 | 0.26 | 0 | 0.99 | 0.27 | 0 | 0.98 | 0.39 | 0 | 0.98 | 0.28 | 0 | 0.98 | 0.32 | 0 | 0.98 | 0.46 |
| β1 | 3 | 0.97 | 0.30 | 3.13 | 0.97 | 0.36 | 3.37 | 0.95 | 0.88 | 2.83 | 0.97 | 0.32 | 2.67 | 0.98 | 0.38 | 2.33 | 0.96 | 0.56 |
| β2 | 1.5 | 0.97 | 0.30 | 1.25 | 0.97 | 0.49 | 0.51 | 0.96 | 0.84 | 1.36 | 0.98 | 0.33 | 1.22 | 0.98 | 0.38 | 0.94 | 0.98 | 0.56 |
| β3 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.04 | 0 | 1.00 | 0.01 | 0 | 1.00 | 0.00 |
| β4 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.04 | 0 | 1.00 | 0.01 | 0 | 1.00 | 0.00 |
| β5 | 2 | 0.98 | 0.26 | 2.02 | 0.98 | 0.30 | 1.47 | 0.98 | 0.56 | 1.78 | 0.99 | 0.29 | 1.56 | 0.98 | 0.34 | 1.11 | 0.98 | 0.52 |
| β6 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 |
| β7 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.01 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 |
| β8 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 |
Next, we construct an estimator of the B-derivative. We divide the plane (βi, ti) into 9 pieces (see Figure 1). The constraints that define each of these sets are listed in Table 3 in the supplementary materials. The function g(N) in that table can be any combination of finitely many terms of the form aNb with a > 0 and b ∈ (0, 1/2), among other choices. For more details, see Lu and Budhiraja (2013). Each piece is related to a cell in the normal manifold of Si.
Given a sample size N and a fixed z, we define a function as
| (34) |
According to Theorem 3.1 of Lu (2014a), the function defined in (34) converges in probability to its population counterpart under Assumptions 1–4.
Table 3:
Coverage rates and average lengths of 95% individual CIs for true model parameters () for different MCP penalties from 500 replications with sample size N = 300 generated in Example 1.
| | a = 2 | | | | | | a = 2000 | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | λ = 0.5 | | λ = 1 | | λ = 2 | | λ = 0.5 | | λ = 1 | | λ = 2 | | |
| True | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | |
| 0 | 0.96 | 0.23 | 0.96 | 0.23 | 0.95 | 0.34 | 0.96 | 0.24 | 0.95 | 0.28 | 0.95 | 0.40 | |
| 3 | 0.95 | 0.26 | 0.96 | 0.27 | 0.99 | 0.42 | 0.96 | 0.28 | 0.99 | 0.33 | 1.00 | 0.49 | |
| 1.5 | 0.95 | 0.29 | 0.96 | 0.30 | 1.00 | 0.51 | 0.96 | 0.31 | 0.97 | 0.36 | 1.00 | 0.53 | |
| 0 | 0.95 | 0.28 | 0.96 | 0.29 | 0.97 | 0.42 | 0.97 | 0.30 | 0.97 | 0.35 | 0.98 | 0.50 | |
| 0 | 0.97 | 0.28 | 0.97 | 0.29 | 0.97 | 0.42 | 0.96 | 0.30 | 0.97 | 0.35 | 0.98 | 0.50 | |
| 2 | 0.96 | 0.28 | 0.96 | 0.29 | 0.99 | 0.44 | 0.97 | 0.30 | 1.00 | 0.36 | 1.00 | 0.54 | |
| 0 | 0.95 | 0.28 | 0.95 | 0.28 | 0.98 | 0.42 | 0.95 | 0.30 | 0.97 | 0.34 | 0.98 | 0.49 | |
| 0 | 0.96 | 0.28 | 0.96 | 0.29 | 0.98 | 0.42 | 0.96 | 0.30 | 0.97 | 0.35 | 0.99 | 0.50 | |
| 0 | 0.96 | 0.25 | 0.97 | 0.26 | 0.99 | 0.38 | 0.97 | 0.27 | 0.99 | 0.32 | 1.00 | 0.45 | |
Based on (24), (26) and (34), we define a function as
| (35) |
for each z. The following theorem shows that d(fN)S(zN) is a consistent estimator of LK for Case I, and is a consistent estimator of LK for both Case I and Case II.
Theorem 2.
- (a) Suppose that Assumptions 1, 2 and 3 hold. If z0 satisfies the conditions for Case I, then defined in (33) converges to almost surely, and

| (36) |

converges to LK almost surely.
- (b) Suppose that Assumptions 1, 2, 3 and 4(a-b) hold. Then converges to LK in probability.
The two functions d(fN)S(zN) and are generally different when ((zN)2i, (zN)2i+1) belongs to for some i, in which case is a piecewise linear function. In contrast, d(fN)S(zN) is a piecewise linear function only when ((zN)2i, (zN)2i+1) belongs to for some i.
Under Assumptions 1–4, we can show that the weak convergence in (30) still holds after LK is substituted by . Consequently, if is nonsingular, then we have
| (37) |
If is singular, we consider its eigen-decomposition, in which UN is an orthogonal (p + 1) × (p + 1) matrix and ΔN is a diagonal matrix with monotonically decreasing diagonal elements. Let l be the number of positive eigenvalues of , counted with their algebraic multiplicities, let DN be the upper-left submatrix of ΔN whose diagonal elements are at least 1/g(N), and let lN be the number of rows in DN. Furthermore, let (UN)1 be the submatrix of UN that consists of its first lN rows, and let (UN)2 consist of the remaining rows of UN. We present the weak convergence results in the following theorem, which generalizes Theorem 3 in Lu et al. (2017) to cover all penalties satisfying Assumption 1.
Theorem 3.
Suppose that Assumptions 1, 2, 3 and 4(a-b) hold. Then
| (38) |
If is nonsingular, then
| (39) |
and
| (40) |
If is singular and Assumption 4(c) holds, then Prob{lN = l} → 1 as N → ∞,
| (41) |
and
| (42) |
We can treat (39) and (40) as a special case of (41) and (42). In fact, if z0 satisfies Case I, then Theorem 3 still holds if is replaced by d(fN)S(zN).
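The eigenvalue truncation that defines lN can be sketched numerically. The following fragment uses hypothetical helper names, and g(N) = N^(1/4) is just one admissible choice of the form aN^b with b ∈ (0, 1/2): it eigen-decomposes a symmetric matrix, orders the eigenvalues decreasingly, and keeps the eigenvector rows whose eigenvalues are at least 1/g(N).

```python
import numpy as np

def truncated_eigendecomposition(M, N, g=lambda n: n ** 0.25):
    """Split the eigen-decomposition of a symmetric PSD matrix M into the part
    with eigenvalues >= 1/g(N) (rows entering (U_N)_1) and the remainder."""
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]          # decreasing eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= 1.0 / g(N)               # threshold 1/g(N)
    l_N = int(keep.sum())
    U1 = eigvecs[:, keep].T                    # first l_N eigenvector rows
    U2 = eigvecs[:, ~keep].T                   # remaining rows
    return eigvals[keep], U1, U2, l_N
```

With a matrix whose smallest eigenvalue is numerically zero, the count l_N recovers the number of effectively positive eigenvalues, matching the event {lN = l} in the theorem.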
3.3. Confidence intervals for the population penalized parameters
In this subsection, we describe how to obtain confidence intervals for from the asymptotic distribution of zN. First, we investigate the relationship between a solution to the normal map formulation (18) and the corresponding solution to (1). Let be as defined in Assumption 3, and . From (13), (14) and (19), we have . In the supplementary materials, it is shown in (B.1) that , which implies . Thus, confidence intervals for are exactly those for . On the other hand, using the fact for each i = 1, ⋯,p, we have the following relationship between and ((z0)2i, (z0)2i+1):
| (43) |
where V+ = (z0)2i + (z0)2i+1 and V− = (z0)2i − (z0)2i+1. The three cases in (43) cover all the possible locations of ((z0)2i, (z0)2i+1). This map can be used to obtain confidence intervals for after we calculate confidence intervals for ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1). For a fixed i, we denote the (1 − α/2)100% confidence intervals for ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) as respectively. Then a (1 − α)100% (conservative) confidence interval for is given by
| (44) |
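The conservative combination in (44) can be sketched generically. Here h stands in for the penalty-specific piecewise map in (43) from (V+, V−) to the parameter; h is a hypothetical placeholder, and the corner evaluation is valid when h is monotone in each argument, with the (1 − α) level following from a Bonferroni argument over the two (1 − α/2) intervals.

```python
def conservative_ci(ci_sum, ci_diff, h):
    """Combine 1 - alpha/2 CIs for V+ = z_{2i} + z_{2i+1} and
    V- = z_{2i} - z_{2i+1} into a conservative 1 - alpha CI for beta_i.
    h must be componentwise monotone for the corner evaluation to be exact."""
    corners = [h(vp, vm) for vp in ci_sum for vm in ci_diff]
    return min(corners), max(corners)

# Illustration with a toy map recovering z_{2i} = (V+ + V-)/2.
ci = conservative_ci((1.0, 3.0), (0.0, 2.0), lambda vp, vm: (vp + vm) / 2)
```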
Next, we show how to find confidence intervals for (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1). Under Assumptions 1–4, from Theorem 3 we can express the asymptotically exact (1 − α)100% confidence region for z0 as
| (45) |
where is the critical value associated with significance level α of a χ2 distribution with lN degrees of freedom. If is a linear map, then the set in (45) is an ellipsoid in a subspace. Otherwise, it is a union of fractions of different ellipsoids. To obtain simultaneous confidence intervals, we find the maximal and minimal values of (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) over the set in (45) by solving optimization problems.
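When the region is a single ellipsoid {z : (z − ẑ)ᵀA(z − ẑ) ≤ c} with A positive definite, the extremes of a linear functional vᵀz over it have the closed form vᵀẑ ± √(c · vᵀA⁻¹v), so that case needs no iterative optimization; in the piecewise-linear case one would solve per piece and take overall extremes. A numpy sketch with a hypothetical helper name:

```python
import numpy as np

def linear_range_over_ellipsoid(v, z_hat, A, c):
    """Extremes of v'z over {z : (z - z_hat)' A (z - z_hat) <= c},
    via the Cauchy-Schwarz bound v'z = v'z_hat +/- sqrt(c * v' A^{-1} v)."""
    half_width = np.sqrt(c * (v @ np.linalg.solve(A, v)))
    center = v @ z_hat
    return center - half_width, center + half_width
```

For example, for the unit-weighted ellipsoid A = I with radius² c = 4, the range of the first coordinate around ẑ = 0 is (−2, 2).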
For individual confidence intervals, first we notice that is a global homeomorphism with probability 1 as (see the proof of Theorem 2 in the supplementary materials). If is a global homeomorphism, we can use
| (46) |
to approximate the distribution of as in (29). When is a linear map, the distribution in (46) is normal. Therefore (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) also follow normal distributions, from which we can construct individual confidence intervals. When is not a linear map, we simulate data based on the distribution in (46), and find empirical individual confidence intervals for (z0)1, ((z0)2i + (z0)2i+1) and ((z0)2i − (z0)2i+1) by taking percentiles of the data as the lower and upper bounds respectively.
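The empirical construction in the non-linear case can be sketched as a generic percentile routine (this is an illustration of the idea, not the paper's exact pipeline):

```python
import numpy as np

def percentile_ci(draws, alpha=0.05):
    """Empirical (1 - alpha) CI from simulated draws of the limiting
    distribution: the alpha/2 and 1 - alpha/2 sample percentiles serve as
    the lower and upper bounds."""
    lo, hi = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

As a sanity check, draws from a standard normal yield an interval close to (−1.96, 1.96) at α = 0.05.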
3.4. Confidence intervals for true model parameters in the underlying linear model
In this subsection, we develop a method to compute confidence intervals for , based on a relation between a population penalized parameter and the true model parameter . Suppose the underlying linear model is
| (47) |
where are the true model parameters. Denote the covariance matrix of X as Σ, and assume the random error ε has mean zero and variance σ2. Moreover, ε is independent of Xi for all i = 1, ⋯,p. For simplicity, we assume E(Xi) = 0 for each i = 1, ⋯,p. Consequently, we have . We assume that Σ is nonsingular, and therefore we do not need Assumption 3 in this subsection.
In developing the theoretical results of this subsection, we will let λ = (λ1, ⋯, λp) converge to 0. Due to this change, the assumptions stated in Section 2.2 need to be changed accordingly. We will replace Assumption 1 by Assumption 1’, and keep Assumption 2. We will not need Assumption 4 until the end of this section.
Assumption 1’(a).
For each i = 1, 2, ⋯,p, P0(t) = 0 for all t ≥ 0. Moreover, for each positive λi in a neighborhood of 0, Pλi(·) is nonnegative, nondecreasing and continuously differentiable on
Assumption 1’(b).
For each i = 1, ⋯,p, there exist neighborhoods the second derivative of (·) with respect to ti, exists for each Moreover, are Lipschitz continuous in
Assumption 1’(c).
For each i = 1, ⋯,p, there exists a neighborhood such that the mixed partial derivatives with
Besides the convex LASSO penalty, we can check that many non-convex penalty functions such as SCAD, MCP, the log-penalty, and the transformed penalty satisfy Assumption 1’(a). We can further check that the LASSO penalty, the log-penalty, and the transformed penalty also satisfy Assumptions 1’(b) and 1’(c). For the SCAD and MCP penalties, Assumptions 1’(b) and 1’(c) are satisfied almost everywhere except that for some i. Assumptions 1’(b) and 1’(c) are used to guarantee that the SAA function fN almost surely converges to the true function f0 in the space of continuously differentiable functions on a neighborhood, and that weakly converges to a random function in that space. These assumptions are needed for the techniques based on stochastic variational inequalities to be applicable. It is possible to weaken them by developing techniques for a broader class of problems in which the SAA function fN (or equivalently, the first-order derivative of the penalty function) is not necessarily continuously differentiable, and we will investigate this in future work.
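To make the MCP caveat concrete, here is the standard MCP definition with parameters λ and a (written from the usual parametrization; the paper's (8) may differ in constants): the first derivative is continuous, but its own derivative jumps from −1/a to 0 at t = aλ, which is exactly the kind of point excluded by Assumptions 1'(b) and 1'(c).

```python
def mcp(t, lam, a):
    """MCP penalty on t >= 0: a quadratic ramp up to t = a*lam, then flat."""
    return lam * t - t * t / (2 * a) if t <= a * lam else a * lam * lam / 2

def mcp_grad(t, lam, a):
    """First derivative: lam - t/a on [0, a*lam], 0 beyond.  It is continuous,
    but the second derivative jumps from -1/a to 0 at t = a*lam."""
    return lam - t / a if t <= a * lam else 0.0
```

For λ = 1 and a = 2, the penalty flattens at the value aλ²/2 = 1 once t exceeds aλ = 2, and the gradient hits 0 exactly there.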
Under Assumption 1’(a), with λ = 0 the problem (1) becomes the least squares problem (3), which has a unique solution in view of the linear model (47). Let be defined as if . Then (9) with λ = 0 has a unique solution . By the equivalence between (9) and (17), is also the unique solution to
where f0(β0, β, t) is as defined in (14) but with λ = 0:
Let be as defined in (19) with λ = 0 and The following lemma presents the relation between
Lemma 4.
Suppose that Assumptions 1’(a) and 2 hold. Then we have
| (48) |
where is defined as
| (49) |
Lemma 4 above indicates that G*(zN) is an estimator of the true parameter , where zN is defined in (26). In the following theorem, we show the asymptotic distribution of this estimator. Before stating the theorem, we define a matrix as follows:
| (50) |
and
| (51) |
Note that
This implies that
| (52) |
where
| (53) |
As the setting considered in this subsection differs from that of the previous sections, L* and K* here are different from L and K defined in Assumption 3 and (21). The previous L and K are associated with a solution to the population problem with a fixed positive λ, while L* and K* are associated with and λ = 0.
Theorem 4.
Suppose that Assumptions 1’(a-c) and 2 hold. Let m > 0 be sufficiently small so that is nonsingular, L* and K* be defined as above, be the covariance matrix of the random vector defined in (13), and be the upper left (p + 1) × (p + 1) submatrix of . Moreover, let λi’s be chosen to satisfy for some constant ci ≥ 0, zN be defined in (26), and define is a consistent estimator of and
| (54) |
Note that the distribution of can be normal or non-normal. When the true parameter for each i, we can show that K* is a subspace of is a linear function. Therefore, the limiting distribution of the true parameter estimator G*(zN) is normal in this case. However, if the true parameter for some i, the limiting distribution can be normal or non-normal.
Theorem 5.
Suppose that the assumptions in Theorem 4 hold. If hi = 0 for each i, then is a consistent estimator of and
Furthermore, let be a consistent estimate of Define Then,
| (55) |
Since , we know that . Therefore, if the λi's are chosen to be in the penalty function, the limiting distribution will be a multivariate normal distribution. In this normal case, Theorem 5 above provides a method to compute asymptotically exact individual confidence intervals for .
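In this normal case, an individual interval takes the familiar Wald form; as a sketch, the variance argument below stands in for a diagonal element of the estimated asymptotic covariance matrix, divided by N as in (55):

```python
import math
from statistics import NormalDist

def normal_ci(estimate, variance, N, alpha=0.05):
    """Wald-type individual CI in the normal case: half width is the
    standard-normal quantile times sqrt(asymptotic variance / N)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(variance / N)
    return estimate - half, estimate + half
```

For instance, with estimate 2.0, asymptotic variance 9.0 and N = 900, the half width is about 1.96 × 0.1.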
When for some i, the asymptotic distribution in Theorem 4 does not necessarily reduce to a normal distribution. To see this, consider the following example. Let p = 2, and for some constant (which is satisfied by the LASSO penalty function P(λ, t) = λt). It follows that h1 = h2 = 1. Let q0, q1, q2 ∈ ℝ. To find (L*K*)−1(q0, q1, q2, h1, h2), we consider the following problem
whose solution satisfies Here The solution to the above problem is given by
As a result, is a piecewise affine function of (q0,q1,q2,h1,h2) with three pieces. Furthermore, since G*(·) is a linear transformation, we conclude that the asymptotic distribution is non-normal in this case.
Next, we show how to estimate in the situation considered in Theorem 4. We need to make an assumption analogous to Assumption 4.
Assumption 4’(a).
The same conditions as in Assumption 4(a), with
Assumption 4’(b).
The same conditions as in Assumption 4(b), with
To estimate , we can also use . Under Assumption 4’, to show that is a consistent estimator of , the key is to show that (31) holds (with ). To show this, one needs to show that fN converges to f0 in probability at an exponential rate in the space of continuously differentiable functions. For the first p + 1 components, this follows from Assumption 4’. For the last p components, note that the norm of the function in the space of continuously differentiable functions is bounded due to Assumption 1’(c). In addition, since for each i, we have for some constant ρ. As a result, for each we have for sufficiently large N. Therefore, fN converges to f0 in probability at an exponential rate in the space of continuously differentiable functions, and converges to in probability.
For the normal case, equation (55) justifies using the diagonal elements of divided by N, as the estimated variances of . For the non-normal case, first we define two functions R and as
| (56) |
where . Denote the ith component functions of R and , respectively, for each i. Let be a continuous function and Z be a (2p + 1)-dimensional random variable with as
| (57) |
Suppose that Prob{f(Z) = b} = 0 for all . Then for any given , as defined in (57) is the smallest value that satisfies
Since the map G* has full row rank, is a global homeomorphism and is nonsingular. If hi ≠ 0 for each i, then the matrix representation of each piece of the map R has full row rank as well. Therefore, Prob . The following theorem provides a way to compute individual confidence intervals for in the general case where hi ≠ 0 for each i.
Theorem 6.
Suppose that assumptions in Theorem 4 and Assumptions 4’(a-b) hold, and hi ≠ 0 for each i. Let and ar(·) be as in (57). Then for every and all i = 0, 1, ⋯,p, we have
| (58) |
where R and are defined in (56).
From (58), one can compute the empirical (1 − α) percentile confidence intervals for by simulating data from . The constant r can be used to control the centers of the confidence intervals for all simultaneously, which may affect the interval lengths. A reasonable choice of r is 0 if the empirical distribution of is approximately symmetric about 0. The results of Theorems 5 and 6 are applicable to a wide range of general penalty functions, which cover the LASSO as a special case (Lu et al., 2017). The procedure to construct confidence intervals for the true regression coefficients is summarized in Table 1.
3.5. Extension to the high dimensional case
In our previous theoretical analysis, we assume that the dimension p is fixed. It is interesting to study the extension of our proposed method to the high dimensional case where the dimension p is also allowed to go to infinity.
As shown in Lemma 4, is our proposed estimate for the true parameter . In fact, we can also show that , where z0 is defined in (19) with λ > 0 and G is a map from defined as
| (59) |
and the matrix . Motivated by this result, we can also estimate the true parameter by , where is a map from to defined as
| (60) |
and is a consistent estimate of Σ−1. Theoretically, if for each i, we can show that G*(zN) and have the same asymptotic distribution.
According to the definition of in (60) and the definition of zN in (26), we can show that Therefore, the estimate of β is the sum of an initial estimate (e.g., the LASSO, SCAD or MCP estimate) and a bias-correction term. Interestingly, although we have different motivations, turns out to be the same as the estimate proposed by Van de Geer et al. (2014). For the high dimensional case with if we choose converging to 0 as and use conditions to guarantee that: (a) where s0 is the number of true nonzero regression coefficients; (b) and some sparsity assumptions about the precision matrix Σ−1, we can show that the asymptotic distribution of is normal (Van de Geer et al., 2014). However, the theoretical analysis of the asymptotic distribution of G*(zN) and using stochastic variational inequality techniques for the high dimensional case is challenging. Many fundamental results about variational inequalities (e.g., some results used in the proof of Theorem 1) need to be generalized to the high dimensional case.
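The bias-correction identity described above can be sketched directly; β̂init and Θ̂ below are placeholders for any initial penalized estimate and any precision-matrix estimate:

```python
import numpy as np

def debiased_estimate(X, y, beta_init, Theta_hat):
    """One-step bias correction: beta_init + Theta_hat X'(y - X beta_init)/N,
    the de-sparsified form that coincides with Van de Geer et al. (2014)."""
    N = X.shape[0]
    return beta_init + Theta_hat @ X.T @ (y - X @ beta_init) / N
```

A quick sanity check of the algebra: in the low dimensional full-rank case with Θ̂ = (XᵀX/N)⁻¹, the correction returns the ordinary least squares estimate regardless of the initial estimate.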
Although we assume that the dimension p is fixed in the theoretical analysis using stochastic variational inequality techniques, our proposed method is applicable to large p, small N data in practice. For high dimensional data, we can use the results shown in Theorem 6 to construct confidence intervals. In Section 4, we will use Example 3 to study the performance of our method on high dimensional data.
4. Numerical examples
In this section, we use the MCP method in (8) to illustrate the performance of the techniques proposed in Section 3. For all examples in this section, we choose in (9). We use the mixed integer quadratically constrained program (MIQCP) solver in the optimization modeling language GAMS (Brooke et al., 1998) to obtain accurate solutions to (2).
For all simulated examples, we generate the data using the following linear model:
| (61) |
where is a p-dimensional normal random variable with mean 0 and covariance Σ, and is a standard normal random error independent of X. We set the noise level σ = 1. Under model (61), the population penalized regression problem (1) can be written as
| (62) |
We compute confidence intervals for the population penalized parameter and the true model parameter βtrue, which we refer to as the first and second types of confidence intervals, respectively. To show their performance in the simulation study, we report the following two measures: the empirical coverage rate (the fraction of total replications in which the confidence intervals contain the corresponding population penalized parameters or true model parameters) and the average confidence interval length. For the second type of confidence interval, we compare our proposed method with the LDPE method (Van de Geer et al., 2014; Zhang and Zhang, 2014), the method introduced by Javanmard and Montanari (2014) (denoted as the JM method), and the method proposed by Lu et al. (2017) (denoted as SVI-Lasso). In terms of the tuning parameter λ, we study the performance of our proposed method with some fixed values as well as with the value of λ chosen by the Generalized Information Criterion (GIC; Konishi and Kitagawa, 2008).
4.1. Example 1: Low dimensional setting with the auto-regressive covariance structure
For this example, we generate a training dataset with 500 replications of sample size N = 300, dimension p = 8, true model parameter βtrue = (3, 1.5, 0, 0, 2, 0, 0, 0), and true covariance matrix . We consider six MCP penalties with parameters (λ, a) taking the following values: λ = 0.5, 1 or 2, and a = 2 or 2000. When a = 2000, the MCP penalties are very close to the LASSO penalty. In each replication, after solving the SAA problem for every MCP penalty, we compute the two types of individual confidence intervals at the confidence level 0.95 (α = 0.05).
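One replication of this setup can be sketched as follows. The AR(1) base ρ = 0.5 in Σij = ρ^|i−j| is our assumption (the covariance formula was typeset as math in the original and did not survive extraction), and the second helper mirrors the CR and Len summaries reported in the tables:

```python
import numpy as np

def simulate_example1(N=300, p=8, rho=0.5, sigma=1.0, seed=0):
    """One replication of model (61) with assumed AR(1) covariance
    Sigma_ij = rho^|i-j| and beta_true = (3, 1.5, 0, 0, 2, 0, 0, 0)."""
    rng = np.random.default_rng(seed)
    beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
    y = X @ beta_true + sigma * rng.standard_normal(N)
    return X, y, beta_true

def coverage_and_length(cis, target):
    """Empirical coverage rate and average length over replicated CIs."""
    cover = np.mean([(lo <= target <= hi) for lo, hi in cis])
    length = np.mean([hi - lo for lo, hi in cis])
    return cover, length
```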
Tables 2 and 3 show the empirical coverage rates (CR) and average interval lengths (Len) for 95% individual confidence intervals over the 500 replications. In Table 2, the column contains the population penalized parameters for different MCP penalties, which are expected to be covered by the first type of confidence intervals. In Table 3, the “True” column contains the true model parameters βtrue, which are expected to be covered by the second type of confidence intervals. Note that the coverage rate is 100% for the first type of confidence interval when . This is due to the shrinkage effect of the projection Γ in (43) from z0 to , which causes the confidence intervals for to be the singleton {0}. When λ = 0.5 and a = 2, the population penalized parameters coincide with the true model parameters. As expected, the second type of confidence intervals are much longer than the first type for the inactive parameters β3, β4, β6, β7 and β8. In practice, which type of confidence interval to use depends on the parameters of interest. The first type can be used to assess the randomness of the penalized estimates with a fixed penalty; this inference is especially useful when the penalty conveys prior information on the parameters. In contrast, the second type provides inference for the underlying true parameters directly.
As a remark, the parameter λ controls the level of penalization, and the parameter a in the MCP penalty controls the degree of non-convexity. As shown in Tables 2 and 3, when λ increases to 1 and 2, the differences between the population penalized parameters and the true model parameters become larger. On the other hand, as a gets large, such as a = 2000, the MCP penalty becomes close to the LASSO penalty. The lengths of the second type of confidence intervals for a = 2 are generally shorter than those for a = 2000 with similar coverage rates, as shown in Table 3. This may be due to the smaller bias imposed by the MCP penalty with a small a. In addition, as shown in Table 4, our proposed method with λ selected by GIC has very good performance. The comparison between our proposed methods and the LASSO-type methods indicates that our methods perform well for the inference of the true parameters in the linear model.
Table 4:
Coverage rates and average lengths of 95% individual CIs for true model parameters () for different methods from 500 replications with sample size N = 300 generated in Example 1.
| | Our method (a = 2) | | | | | | | | Lasso type methods | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | λ = 0.5 | | λ = 1 | | λ = 2 | | GIC | | SVI-Lasso | | LDPE | | JM | | |
| True | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | |
| 3 | 0.95 | 0.26 | 0.96 | 0.27 | 0.99 | 0.42 | 0.95 | 0.26 | 0.95 | 0.26 | 0.94 | 0.26 | 0.88 | 0.25 | |
| 1.5 | 0.95 | 0.29 | 0.96 | 0.30 | 1.00 | 0.51 | 0.95 | 0.29 | 0.95 | 0.29 | 0.94 | 0.28 | 0.89 | 0.28 | |
| 0 | 0.95 | 0.28 | 0.96 | 0.29 | 0.97 | 0.42 | 0.95 | 0.28 | 0.95 | 0.28 | 0.95 | 0.28 | 0.99 | 0.28 | |
| 0 | 0.97 | 0.28 | 0.97 | 0.29 | 0.97 | 0.42 | 0.97 | 0.28 | 0.97 | 0.28 | 0.96 | 0.28 | 0.97 | 0.28 | |
| 2 | 0.96 | 0.28 | 0.96 | 0.29 | 0.99 | 0.44 | 0.96 | 0.28 | 0.97 | 0.29 | 0.96 | 0.28 | 0.92 | 0.28 | |
| 0 | 0.95 | 0.28 | 0.95 | 0.28 | 0.98 | 0.42 | 0.95 | 0.28 | 0.94 | 0.28 | 0.95 | 0.28 | 0.98 | 0.28 | |
| 0 | 0.96 | 0.28 | 0.96 | 0.29 | 0.98 | 0.42 | 0.96 | 0.28 | 0.95 | 0.28 | 0.95 | 0.28 | 0.99 | 0.28 | |
| 0 | 0.96 | 0.25 | 0.97 | 0.26 | 0.99 | 0.38 | 0.97 | 0.25 | 0.96 | 0.26 | 0.97 | 0.26 | 0.98 | 0.25 | |
4.2. Example 2: Low dimensional setting with the equi-correlation covariance structure
In this example, we consider the equi-correlation covariance structure where Σij = 0.5 for all i ≠ j and Σjj = 1 for all j. The other settings are the same as Example 1.
Table 5 shows the performance of the 95% individual confidence intervals of the population penalized parameters. The results shown in this table are very similar to the results of Example 1 shown in Table 2. As shown in Table 5, for each fixed λ, the proposed method using a = 2 delivers better performance than the method using a = 2000 in most cases, especially when λ is small. Table 6 shows the comparison of the individual confidence intervals of the true model parameters constructed by our method and the LASSO-type methods. Similar to Example 1, our proposed method using GIC performs well. In addition, for this example, our method (GIC), SVI-Lasso and LDPE deliver similar performance. All three methods perform better than the JM method.
Table 5:
Coverage rates and average lengths of 95% individual CIs for population penalized parameters for different MCP penalties from 500 replications with sample size N = 300 generated in Example 2.
| | a = 2 | | | | | | | | | a = 2000 | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | λ = 0.5 | | | λ = 1 | | | λ = 2 | | | λ = 0.5 | | | λ = 1 | | | λ = 2 | | |
| | β* | CR | Len | β* | CR | Len | β* | CR | Len | β* | CR | Len | β* | CR | Len | β* | CR | Len |
| β0 | 0 | 0.98 | 0.26 | 0 | 0.98 | 0.27 | 0 | 0.99 | 0.38 | 0 | 0.98 | 0.27 | 0 | 0.97 | 0.31 | 0 | 0.97 | 0.41 |
| β1 | 3 | 0.99 | 0.31 | 3.10 | 0.98 | 0.36 | 3.57 | 0.91 | 0.95 | 2.88 | 0.98 | 0.33 | 2.75 | 0.97 | 0.38 | 2.50 | 0.97 | 0.52 |
| β2 | 1.5 | 0.98 | 0.32 | 1.20 | 0.98 | 0.55 | 0.57 | 0.95 | 0.88 | 1.37 | 0.97 | 0.34 | 1.25 | 0.97 | 0.38 | 1.00 | 0.96 | 0.52 |
| β3 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.01 |
| β4 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 |
| β5 | 2 | 0.97 | 0.32 | 2.10 | 0.99 | 0.38 | 1.57 | 0.98 | 0.93 | 1.88 | 0.97 | 0.34 | 1.75 | 0.97 | 0.38 | 1.50 | 0.98 | 0.52 |
| β6 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 |
| β7 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 |
| β8 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.00 | 0 | 1.00 | 0.05 | 0 | 1.00 | 0.02 | 0 | 1.00 | 0.00 |
Table 6:
Coverage rates and average lengths of 95% individual CIs for true model parameters for different methods from 500 replications with sample size N = 300 generated in Example 2.
| | Our method (a = 2) | | | | | | | | Lasso type methods | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | λ = 0.5 | | λ = 1 | | λ = 2 | | GIC | | SVI-Lasso | | LDPE | | JM | | |
| True | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | CR | Len | |
| 3 | 0.97 | 0.30 | 0.97 | 0.31 | 1.00 | 0.46 | 0.97 | 0.30 | 0.97 | 0.30 | 0.97 | 0.30 | 0.91 | 0.29 | |
| 1.5 | 0.95 | 0.30 | 0.97 | 0.32 | 1.00 | 0.50 | 0.95 | 0.30 | 0.95 | 0.31 | 0.95 | 0.30 | 0.89 | 0.29 | |
| 0 | 0.95 | 0.30 | 0.96 | 0.31 | 0.99 | 0.43 | 0.95 | 0.30 | 0.96 | 0.30 | 0.96 | 0.30 | 0.98 | 0.29 | |
| 0 | 0.96 | 0.30 | 0.97 | 0.31 | 0.99 | 0.43 | 0.96 | 0.30 | 0.96 | 0.30 | 0.96 | 0.30 | 0.99 | 0.29 | |
| 2 | 0.92 | 0.30 | 0.94 | 0.31 | 0.99 | 0.46 | 0.92 | 0.30 | 0.92 | 0.31 | 0.92 | 0.30 | 0.88 | 0.29 | |
| 0 | 0.95 | 0.30 | 0.95 | 0.31 | 1.00 | 0.43 | 0.95 | 0.30 | 0.95 | 0.30 | 0.94 | 0.30 | 0.98 | 0.29 | |
| 0 | 0.95 | 0.30 | 0.95 | 0.31 | 0.99 | 0.43 | 0.95 | 0.30 | 0.95 | 0.30 | 0.95 | 0.30 | 0.96 | 0.29 | |
| 0 | 0.94 | 0.30 | 0.95 | 0.31 | 1.00 | 0.43 | 0.94 | 0.30 | 0.94 | 0.30 | 0.94 | 0.30 | 0.98 | 0.29 | |
4.3. Example 3: High dimensional example
In this example, we consider a high dimensional case in which the dimension is much larger than the sample size. We choose p = 300 with βtrue being a 300-dimensional vector: , and all the other components are 0. The true covariance matrix is . We generate a training dataset with 500 replications of sample size N = 100. For this high dimensional example, we consider three MCP penalties with parameters λ = 0.5, 1 or 2, and a = 3. In each replication, we use the nodewise LASSO regression introduced by Meinshausen and Bühlmann (2006) to compute the estimate of the precision matrix, and we compute the individual confidence intervals of the true model parameters at the confidence level 0.95. Define the active set as . In Table 7, for different methods, we report the average coverage rate, median coverage rate, average length and median length of the individual confidence intervals for the true model parameters in , respectively:
where CRi and Leni denote the empirical coverage rate and average interval length of the confidence interval for the parameter for the 500 replications, respectively.
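The nodewise construction can be sketched with a plain coordinate-descent lasso (our own minimal solver, not the paper's GAMS-based code); the τj² scaling follows the de-sparsified-lasso convention of Van de Geer et al. (2014):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate descent for (1/2N)||y - Xb||^2 + lam * ||b||_1."""
    N, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / N
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r_j / N
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

def nodewise_precision(X, lam=0.1):
    """Nodewise lasso precision estimate: regress each column on the rest,
    set C_jj = 1, C_jk = -gamma_jk, and scale row j by 1/tau_j^2 with
    tau_j^2 = r_j'r_j / N + lam * ||gamma_j||_1."""
    N, p = X.shape
    C, tau2 = np.eye(p), np.zeros(p)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        g = lasso_cd(X[:, others], X[:, j], lam)
        C[j, others] = -g
        r = X[:, j] - X[:, others] @ g
        tau2[j] = r @ r / N + lam * np.abs(g).sum()
    return C / tau2[:, None]
```

With independent columns the estimate should be close to the identity, since the nodewise regressions then select (almost) nothing.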
Table 7:
Average coverage rates and lengths of 95% individual confidence intervals for the true model parameters in the linear model with different methods computed from 500 replications with sample size N = 100 and dimension p = 300 generated in Example 3.
| Our method (λ = 0.5) | | | | Our method (λ = 1) | | | | Our method (λ = 2) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avgcov | Medcov | Avglen | Medlen | Avgcov | Medcov | Avglen | Medlen | Avgcov | Medcov | Avglen | Medlen | |
| 92.82 | 92.60 | 0.46 | 0.45 | 94.64 | 95.00 | 0.66 | 0.65 | 95.24 | 95.40 | 1.26 | 1.25 | |
| 93.26 | 93.40 | 0.39 | 0.39 | 93.51 | 93.60 | 0.56 | 0.56 | 93.73 | 94.00 | 1.08 | 1.08 | |
| Our method (GIC) | | | | LDPE | | | | JM | | | | |
| Avgcov | Medcov | Avglen | Medlen | Avgcov | Medcov | Avglen | Medlen | Avgcov | Medcov | Avglen | Medlen | |
| 92.91 | 93.00 | 0.57 | 0.56 | 93.84 | 94.40 | 1.13 | 1.14 | 88.07 | 87.80 | 0.55 | 0.55 | |
| 93.37 | 93.40 | 0.47 | 0.47 | 95.31 | 95.60 | 1.14 | 1.14 | 99.38 | 99.40 | 0.55 | 0.55 | |
For our proposed methods, as λ increases to 1 and 2, both the average coverage rates and lengths increase. Compared with LDPE, our proposed method using GIC to choose the tuning parameter has much shorter average lengths while the average coverage rates are only slightly lower. Although the JM method delivers similar average lengths as our proposed method (GIC), the average coverage rates of our proposed method are much closer to the nominal level 95%. Overall, the results shown in Table 7 indicate that our proposed method still delivers comparable performance for the high dimensional case.
4.4. Example 4: ADNI data
In this real data example, we consider the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (http://www.loni.ucla.edu/ADNI). The main goal of ADNI was to test whether the serial structural magnetic resonance imaging (MRI), fluorodeoxyglucose positron emission tomography (FDG-PET) images and some other biological markers such as cerebrospinal fluid (CSF) could be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s Disease (AD). To that end, 800 adults with ages between 55 and 90 were recruited from over 50 sites across the US and Canada. In our analysis, we use data from 199 subjects who have complete baseline MRI, FDG-PET, and CSF data. Using the data processing method shown in Thung et al. (2014), we obtained 93 MRI features, 93 PET features, and 5 CSF features for each subject. The response variable is the Mini-Mental State Examination (MMSE) score (Folstein et al. (1975)) which is often used to screen for cognitive impairment.
The data are standardized at the beginning of our analysis. For our proposed method, we use the MCP penalty with the parameter a = 3 and choose the best tuning parameter λ by GIC. Table 8 shows the selected features of different methods, where the selected features are the features whose 95% confidence intervals do not contain 0. The numbers of features selected by our method, SVI-Lasso, LDPE, and JM are 13, 12, 13, and 3, respectively. Among the 13 features selected by our proposed method, 11 features are selected by the SVI-Lasso method, 9 features are selected by the LDPE method and 3 features are selected by the JM method. Table 9 shows the estimates and 95% individual confidence intervals of the 13 features selected by our proposed method. The results of our proposed method and the results of SVI-Lasso and LDPE methods are comparable. As shown in Table 9, for most features among the 13 features, the absolute values of the estimates delivered by the JM method are much smaller than the corresponding values of the other methods. The 95% confidence intervals of the JM method are also very different from the corresponding confidence intervals of the other methods.
Table 8:
Selected features of different methods for the ADNI data.
| Method | Selected Features |
|---|---|
| Our method (GIC) | 9, 19, 40, 59, 67, 80, 95, 130, 134, 147, 156, 168, 178 |
| SVI-Lasso | 9, 19, 40, 77, 80, 95, 130, 134, 147, 156, 168, 178 |
| LDPE | 9, 19, 40, 59, 77, 80, 83, 90, 111, 134, 147, 156, 168 |
| JM | 19, 40, 134 |
Table 9:
Estimates and 95% individual confidence intervals of the 13 features selected by our proposed method for the ADNI data.
| Feature | Our method (GIC) Est | Our method (GIC) Ind CI | SVI-Lasso Est | SVI-Lasso Ind CI | LDPE Est | LDPE Ind CI | JM Est | JM Ind CI |
|---|---|---|---|---|---|---|---|---|
| 9 | −0.20 | [−0.33, −0.07] | −0.20 | [−0.33, −0.07] | −0.19 | [−0.34, −0.04] | −0.15 | [−0.31, 0.01] |
| 19 | 0.24 | [0.06, 0.41] | 0.23 | [0.07, 0.40] | 0.24 | [0.09, 0.40] | 0.25 | [0.08, 0.42] |
| 40 | −0.21 | [−0.36, −0.06] | −0.21 | [−0.35, −0.06] | −0.20 | [−0.35, −0.05] | −0.16 | [−0.33, 0.00] |
| 59 | 0.15 | [0.01, 0.29] | 0.15 | [0.00, 0.30] | 0.16 | [0.01, 0.30] | 0.12 | [−0.03, 0.28] |
| 67 | 0.13 | [0.00, 0.27] | 0.13 | [0.00, 0.26] | 0.12 | [−0.02, 0.26] | 0.11 | [−0.03, 0.26] |
| 80 | 0.23 | [0.03, 0.43] | 0.23 | [0.03, 0.42] | 0.21 | [0.04, 0.38] | 0.15 | [−0.01, 0.30] |
| 95 | 0.20 | [0.00, 0.40] | 0.21 | [0.01, 0.41] | 0.19 | [−0.01, 0.39] | 0.04 | [−0.11, 0.20] |
| 130 | 0.21 | [0.00, 0.42] | 0.20 | [0.01, 0.40] | 0.18 | [−0.03, 0.39] | 0.04 | [−0.12, 0.19] |
| 134 | 0.25 | [0.08, 0.43] | 0.24 | [0.02, 0.45] | 0.24 | [0.06, 0.43] | 0.23 | [0.08, 0.38] |
| 147 | −0.22 | [−0.41, −0.02] | −0.22 | [−0.41, −0.03] | −0.21 | [−0.40, −0.02] | −0.03 | [−0.18, 0.13] |
| 156 | −0.19 | [−0.36, −0.02] | −0.19 | [−0.36, −0.02] | −0.18 | [−0.37, 0.00] | −0.05 | [−0.21, 0.11] |
| 168 | −0.24 | [−0.43, −0.04] | −0.24 | [−0.43, −0.04] | −0.22 | [−0.43, −0.02] | 0.01 | [−0.14, 0.16] |
| 178 | −0.19 | [−0.34, −0.03] | −0.19 | [−0.35, −0.04] | −0.18 | [−0.36, 0.00] | −0.07 | [−0.22, 0.08] |
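The selection rule used in Tables 8 and 9 (keep a feature when its 95% confidence interval excludes zero) is straightforward to apply once estimates and interval endpoints are available. The sketch below uses the first four rows of Table 9 as hypothetical input; the array names are ours.

```python
import numpy as np

# Estimates and 95% CI endpoints for the first four features of Table 9
# (our method with GIC); in practice these come from the SVI-based procedure.
est   = np.array([-0.20, 0.24, -0.21, 0.15])
lower = np.array([-0.33, 0.06, -0.36, 0.01])
upper = np.array([-0.07, 0.41, -0.06, 0.29])

# A feature is selected when its interval excludes zero, i.e. both
# endpoints lie strictly on the same side of zero.
selected = (lower > 0) | (upper < 0)
print(selected)  # all four intervals here exclude zero
```

Note that these are individual (per-coefficient) intervals, so the rule controls coverage feature by feature rather than simultaneously.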
5. Discussion
In this paper we propose a unified framework to construct confidence intervals for the population penalized parameters as well as the true model parameters for a large class of penalties. By transforming the population penalized regression problem (1) and its SAA problem (2) into the equivalent problems (9) and (22) respectively, we eliminate the non-smoothness in the objectives. We then obtain their normal map formulations (18) and (25), and derive the asymptotic distributions and the two types of confidence intervals. Our numerical results show that these methods are effective. When the objective functions in (1) and (2) are non-convex as a result of non-convex penalty functions, most existing algorithms are only guaranteed to find a local optimal solution of the SAA problem, and our proposed methods generate confidence intervals based on that local solution. In practice, we solve for an SAA solution and then use (26) to obtain a solution to (25). The first type of confidence intervals we compute are for a local optimal solution of the population penalized regression problem (1). From any local solution of (2), we can always compute confidence intervals for the true model parameters, which are the second type of confidence intervals we compute.
Acknowledgments
The authors thank the editors, the associate editor, and referees for their helpful comments and suggestions. This research was supported in part by US National Science Foundation grants DMS-1407241 (Liu, Lu and Yin), and DMS-1109099 (Lu and Yin).
References
- Brooke A, Kendrick D, Meeraus A, and Raman R (1998), GAMS, A User’s Guide, Washington, DC: GAMS Development Corporation, available online at http://www.gams.com.
- Candes EJ and Tao T (2007), “The Dantzig selector: statistical estimation when p is much larger than n,” The Annals of Statistics, 35, 2313–2351.
- Donoho DL and Johnstone IM (1994), “Ideal spatial adaptation by wavelet shrinkage,” Biometrika, 81, 425–455.
- Efron B, Hastie T, Johnstone I, and Tibshirani R (2004), “Least angle regression,” The Annals of Statistics, 32, 407–499.
- Fan J and Li R (2001), “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, 96, 1348–1360.
- Folstein MF, Folstein SE, and McHugh PR (1975), “Mini-mental state: a practical method for grading the cognitive state of patients for the clinician,” Journal of Psychiatric Research, 12, 189–198.
- Friedman JH (2012), “Fast sparse regression and classification,” International Journal of Forecasting, 28, 722–738.
- Javanmard A and Montanari A (2014), “Confidence intervals and hypothesis testing for high-dimensional regression,” Journal of Machine Learning Research, 15, 2869–2909.
- Konishi S and Kitagawa G (2008), Information Criteria and Statistical Modeling, Springer Science & Business Media.
- Lee JD, Sun DL, Sun Y, and Taylor JE (2016), “Exact post-selection inference, with application to the lasso,” The Annals of Statistics, 44, 907–927.
- Liu Y and Wu Y (2007), “Variable selection via a combination of the L0 and L1 penalties,” Journal of Computational and Graphical Statistics, 16, 782–798.
- Lockhart R, Taylor J, Tibshirani R, and Tibshirani R (2014), “A significance test for the lasso,” The Annals of Statistics, 42, 413–468.
- Lu S (2014a), “A new method to build confidence regions for solutions of stochastic variational inequalities,” Optimization: A Journal of Mathematical Programming and Operations Research, 63, 1431–1443.
- Lu S (2014b), “Symmetric confidence regions and confidence intervals for normal map formulations of stochastic variational inequalities,” SIAM Journal on Optimization, 24, 1458–1484.
- Lu S and Budhiraja A (2013), “Confidence regions for stochastic variational inequalities,” Mathematics of Operations Research, 38, 545–568.
- Lu S, Liu Y, Yin L, and Zhang K (2017), “Confidence intervals and regions for the lasso by using stochastic variational inequality techniques in optimization,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 589–611.
- Lv J and Fan Y (2009), “A unified approach to model selection and sparse recovery using regularized least squares,” The Annals of Statistics, 37, 3498–3528.
- Mazumder R, Friedman J, and Hastie T (2011), “SparseNet: Coordinate descent with non-convex penalties,” Journal of the American Statistical Association, 106, 1125–1138.
- Meinshausen N and Buhlmann P (2006), “High-dimensional graphs and variable selection with the Lasso,” The Annals of Statistics, 34, 1436–1462.
- Nikolova M (2000), “Local strong homogeneity of a regularized estimator,” SIAM Journal on Applied Mathematics, 61, 633–658.
- Ning Y and Liu H (2017), “A general theory of hypothesis tests and confidence regions for sparse high dimensional models,” The Annals of Statistics, 45, 158–195.
- Robinson SM (1995), “Sensitivity analysis of variational inequalities by normal-map techniques,” in Variational Inequalities and Network Equilibrium Problems, ed. Giannessi F and Maugeri A, New York: Plenum Press, pp. 257–269.
- Thung K-H, Wee C-Y, Yap P-T, Shen D, and the Alzheimer’s Disease Neuroimaging Initiative (2014), “Neurodegenerative disease diagnosis using incomplete multi-modality data via matrix shrinkage and completion,” NeuroImage, 91, 386–400.
- Tibshirani R (1996), “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B, 58, 267–288.
- Van de Geer S, Buhlmann P, Ritov Y, and Dezeure R (2014), “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, 42, 1166–1202.
- Voorman A, Shojaie A, and Witten D (2014), “Inference in high dimensions with the penalized score test,” arXiv preprint arXiv:1401.2678.
- Wu TT and Lange K (2008), “Coordinate descent algorithms for lasso penalized regression,” The Annals of Applied Statistics, 2, 224–244.
- Zhang CH (2010), “Nearly unbiased variable selection under minimax concave penalty,” The Annals of Statistics, 38, 894–942.
- Zhang CH and Zhang SS (2014), “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B, 76, 217–242.
- Zhao S, Shojaie A, and Witten D (2017), “In defense of the indefensible: a very naive approach to high-dimensional inference,” arXiv preprint arXiv:1705.05543.
- Zou H (2006), “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, 101, 1418–1429.
- Zou H and Hastie T (2005), “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B, 67, 301–320.