Author manuscript; published in final edited form as: J Am Stat Assoc. 2012 Jun 11;107(497):223–232. doi: 10.1080/01621459.2011.645783

Likelihood-based selection and sharp parameter estimation*

Xiaotong Shen 1, Wei Pan 2, Yunzhang Zhu 1
PMCID: PMC3378256  NIHMSID: NIHMS364033  PMID: 22736876

Summary

In high-dimensional data analysis, feature selection becomes one means for dimension reduction, which proceeds with parameter estimation. Concerning accuracy of selection and estimation, we study nonconvex constrained and regularized likelihoods in the presence of nuisance parameters. Theoretically, we show that constrained L0-likelihood and its computational surrogate are optimal in that they achieve feature selection consistency and sharp parameter estimation, under one necessary condition required for any method to be selection consistent and to achieve sharp parameter estimation. It permits up to exponentially many candidate features. Computationally, we develop difference convex methods to implement the computational surrogate through primal and dual subproblems. These results establish a central role of L0-constrained and regularized likelihoods in feature selection and parameter estimation involving selection. As applications of the general method and theory, we perform feature selection in linear regression and logistic regression, and estimate a precision matrix in Gaussian graphical models. In these situations, we gain new theoretical insight and obtain favorable numerical results. Finally, we discuss an application to predicting the metastasis status of breast cancer patients from their gene expression profiles.

Keywords: Coordinate descent; continuous but non-smooth minimization; general likelihood; graphical models; nonconvex; (p, n)-asymptotics

1 Introduction

Feature selection is essential to battle the inherent “curse of dimensionality” in high-dimensional analysis. It removes non-informative features to derive simpler models for interpretability, prediction and inference. In cancer studies, for instance, a patient’s gene expression is linked to her breast cancer metastasis status in order to identify cancer genes. In such a situation, our ability to identify cancer genes is as critical as a model’s predictive accuracy, and selection accuracy becomes extremely important for reproducible findings and generalizable conclusions. Towards accuracy of selection and parameter estimation, we address several core issues in high-dimensional likelihood-based selection.

Consider a selection problem with nuisance parameters, based on a random sample Y = (Y1, ···, Yn) with each Yi following probability density g(θ^0, y), where θ^0 = (β^0, η^0) is a true parameter vector, $\beta^0 \equiv (\beta_1^0,\ldots,\beta_p^0) = (\beta_{A_0}^0, 0_{A_0^c})$ and $\eta^0 \equiv (\eta_1^0,\ldots,\eta_q^0)$ are the parameters of interest and nuisance parameters respectively, $A_0 = \{j : \beta_j^0 \neq 0\}$ is the set of nonzero coefficients of β^0 with size |A_0| = p_0, and $0_{A_0^c}$ is a vector of 0’s, with c denoting set complement. Here we estimate (β^0, A_0), where p may greatly exceed n, and q = 0 is permitted.

For estimation and selection, a likelihood is regularized with regard to β, particularly when p > n. This leads to an information criterion:

$-L(\theta) + \lambda\sum_{j=1}^p I(\beta_j \neq 0),$ (1)

where $L(\theta) = \sum_{i=1}^n \log g(\theta, Y_i)$ is the log-likelihood based on Y, λ > 0 is a regularization parameter, and $\sum_{j=1}^p I(\beta_j \neq 0)$ is the L0-function penalizing an increase in a model’s size. In (1), when θ = β without nuisance parameters, λ = 1 gives Akaike’s information criterion and $\lambda = \frac{\log n}{2}$ gives the Bayesian information criterion [21], among others. In fact, essentially all selection rules can be cast into the framework of (1).

Regularization (1) has been of considerable interest for its interpretability and computational merits. Yet its constrained counterpart (2) has not received much attention, which is

$-L(\theta), \quad \text{subject to } \sum_{j=1}^p I(\beta_j \neq 0) \leq K,$ (2)

where K ≥ 0 is a tuning parameter corresponding to λ in (1). Minimizing (1) or (2) in θ gives a global minimizer leading to an estimate $\hat\beta = (\hat\beta_{\hat A}, 0_{\hat A^c})^T$, with Â the estimate of A_0, where η is unregularized and possibly profiled out. Note that (1) and (2) may not be equivalent in their global minimizers, unlike in a convex problem.

This article systematically investigates constrained and regularized likelihoods involving nuisance parameters, for estimating zero components of β0 as well as nonzero ones of θ0. This includes, but is not limited to, estimating nonzero entries of a precision matrix in graphical models.

There is a huge body of work on parameter estimation through L1-regularization in linear regression; see, for instance, [16] for a comprehensive review. For feature selection, consistency of the Lasso [26] has been extensively studied under the irrepresentable assumption; c.f., [15] [34]. Other methods such as the SCAD [6] have also been studied. Yet the L0-constrained or regularized likelihood remains largely unexplored. Despite progress, many open issues remain. First, what is the maximum number of candidate features allowed for a likelihood method to reconstruct the informative features? Results such as [13] seem to suggest that the capacity of handling exponentially many features may be attributed primarily to the exponential tail of a Gaussian distribution, which we show is not necessary. Second, can parameter estimation be enhanced through removal of the zero components of β? Third, can a selection method continue to perform well for the parameters of interest in the presence of a large number of nuisance parameters, as in covariance selection for the off-diagonal entries of a precision matrix?

This article intends to address the foregoing three issues. First, we establish finite-sample mis-selection error bounds for the constrained L0-likelihood as well as its computational surrogate, given (n, p0, p), where the surrogate, a likelihood based on a truncated L1-function (TLP) approximating the L0-function, permits efficient computation; see Section 2.1 for a definition. On this basis, we establish feature selection consistency for them as n, p → ∞, under one key condition that is necessary for any method to be selection consistent:

$C_{\min}(\theta^0) \geq d_0\frac{\log p}{n},$ (3)

where $C_{\min}(\theta^0) \equiv \inf_{\{\theta_A = ((\beta_A, 0_{A^c}), \eta):\, A \nsupseteq A_0,\, |A| \leq p_0\}} \frac{h^2(\theta_A, \theta^0)}{\max(|A_0\setminus A|, 1)}$, $d_0 > 0$ is a constant, |·| denotes the size of a set and \ denotes set difference, $h(\theta, \theta^0) = \big(\frac{1}{2}\int (g^{1/2}(\theta, y) - g^{1/2}(\theta^0, y))^2 d\mu(y)\big)^{1/2}$ is the Hellinger distance with respect to a dominating measure μ, and g(θ, y) is a probability density for Y1. As one consequence, exponentially many candidate features $p = \exp\big(\frac{n C_{\min}(\theta^0)}{d_0}\big)$ are permitted for selection consistency with a broad class of constrained likelihoods. This challenges the well-established result that the maximum number of candidate features permitted for selection consistency depends highly on a likelihood’s tail behavior, c.f., [4]. In fact, selection consistency continues to hold even if the error distribution does not have an exponential tail; see Proposition 1 for linear regression. Second, sharper parameter estimation results from accurate selection by the L0-likelihood and its surrogate as compared to estimation without such selection. For feature selection in linear regression and logistic regression, the optimal Hellinger risk of the oracle estimator, the maximum likelihood estimate (MLE) based on A0 as if the true A0 were known a priori, is recovered by these methods, which is of order $\frac{p_0}{n}$ and is uniform over a certain L0-band of θ^0 excluding the origin. This is in contrast to the minimax rate $\frac{u\log(p/u)}{n}$ with $u \geq p_0$ for estimation without feature selection in linear regression [17]. In other words, accurate selection by the L0-likelihood and its surrogate over the L0-band improves accuracy of estimation after non-informative features are removed, without introducing additional bias into estimation. Moreover, in estimating a precision matrix in Gaussian graphical models, the foregoing conclusions extend but with a different rate, $\frac{p_0\log p}{n}$, where the log p factor is due to estimation of 2p nuisance parameters as compared to logistic regression. Third, two difference of convex (DC) methods are employed for computation of (1) and (2), which relax nonconvex minimization through a sequence of convex problems.

Two disparate applications are considered, namely, feature selection in generalized linear models (GLMs) and estimation of a precision matrix in Gaussian graphical models. In GLMs, feature selection in nonlinear regression appears more challenging than in linear regression for a high-dimensional problem. In statistical modeling of a precision matrix in Gaussian graphical models, two major approaches have emerged to exploit matrix sparsity: likelihood selection and neighborhood selection. Papers based on these two approaches include [15] [12] [28] [2] [8] [19] [20] [18], among others. As suggested by [19], existing methods may not perform well when the dimension of a matrix exceeds the sample size n, although they give estimates better than the sample covariance matrix. In addition, theoretical aspects of the likelihood approach remain understudied. In these situations, the proposed method compares favorably against its competitors in simulations, and novel theoretical results provide insight into the selection process.

This article is organized as follows. Section 2 develops the proposed method for L0-regularized and constrained likelihoods. Section 3 presents main theoretical results for selection consistency and parameter estimation involving selection, followed by a necessary condition for selection consistency. Section 4 applies the general method and theory to feature selection in GLMs. Section 5 is devoted to estimation of a precision matrix in Gaussian graphical models. Section 6 presents an application to predict the metastasis status of breast cancer patients with their gene expression profiles. Section 7 contains technical proofs.

2 Method and computation

2.1 Method

In a high-dimensional situation, it is computationally infeasible to minimize a discontinuous cost function involving the L0-function in (1) and (2). As a surrogate, we seek a good approximation of the L0-function by the TLP, defined as $J(|z|) = \min\big(\frac{|z|}{\tau}, 1\big)$, with τ > 0 a tuning parameter controlling the degree of approximation; see Figure 1 for a display. This τ decides which individual coefficients are shrunk towards zero. The advantages of J(|z|) are fourfold, although J(|z|) has been considered in other contexts [9]:

Figure 1. Truncated L1 function $J_\tau(|\beta_j|)$ with τ = 1 in (a), and its DC decomposition into a difference of two convex functions $J_1$ and $J_2$ in (b).

  1. (Surrogate) It performs the model selection task of the L0-function, while providing a computationally efficient means. Note that the approximation error of the TLP function to the L0-function becomes zero when τ is tuned such that $\tau < \min\{|\beta_k^0| : k \in A_0\}$, seeking the sparsest solution by minimizing the number of nonzero coefficients.

  2. (Adaptive model selection through adaptive shrinkage) It performs adaptive model selection [23] through a computationally efficient means when λ is tuned. Moreover, it corrects the Lasso bias by combining shrinkage with thresholding.

  3. (Piecewise linearity) It is piecewise linear, gaining computational advantages.

  4. (Low resolutions) It discriminates small from large coefficients through thresholding. Consequently, it is capable of handling many low-resolution coefficients, through tuning τ.
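To make the surrogate concrete, the following R sketch (our illustration, not code from the paper) evaluates the TLP just defined and checks the DC decomposition J = J1 − J2 used later in Section 2.2, with τ = 1 as in Figure 1; the function names are ours.

    ## TLP surrogate J(|z|) = min(|z|/tau, 1) and its DC decomposition
    ## J = J1 - J2 with J1(|z|) = |z|/tau and J2(|z|) = max(|z|/tau - 1, 0).
    tlp  <- function(z, tau) pmin(abs(z) / tau, 1)
    tlp1 <- function(z, tau) abs(z) / tau
    tlp2 <- function(z, tau) pmax(abs(z) / tau - 1, 0)

    z <- seq(-3, 3, by = 0.01)
    stopifnot(all.equal(tlp(z, 1), tlp1(z, 1) - tlp2(z, 1)))  # decomposition is exact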

To treat nonconvex minimization, we replace the L0-function by its surrogate J(·) to construct an approximation of (2) and that of (1):

$-L(\theta), \quad \text{subject to } \sum_{j=1}^p J(|\beta_j|) \leq K,$ (4)
$S(\theta) = -L(\theta) + \lambda\sum_{j=1}^p J(|\beta_j|),$ (5)

where (5) is a dual problem of (4). To solve (5) and (4), we develop difference convex methods for the primal and dual problems, for efficient computation.

2.2 Unconstrained dual and constrained primal problems

Our DC method for the dual problem (5) begins with a DC decomposition of S(θ): S(θ) = S1(θ) − S2(β), where $S_1(\theta) = -L(\theta) + \lambda\sum_{j=1}^p J_1(|\beta_j|)$, $S_2(\beta) = \lambda\sum_{j=1}^p J_2(|\beta_j|)$, $J_1(|\beta_j|) = \frac{|\beta_j|}{\tau}$, and $J_2(|\beta_j|) = \max\big(\frac{|\beta_j|}{\tau} - 1, 0\big)$. Without loss of generality, assume that −L is convex in θ; otherwise, a DC decomposition of −L is required and can be treated similarly. Given this DC decomposition, a sequence of upper approximations of S(θ) is constructed iteratively, say, at iteration m, with ∇S2 a subgradient of S2 in |β|: $S^{(m)}(\theta) = S_1(\theta) - \big(S_2(\hat\beta^{(m-1)}) + (|\beta| - |\hat\beta^{(m-1)}|)^T \nabla S_2(|\hat\beta^{(m-1)}|)\big)$, by successively replacing S2(β) by its minorization, where |·| for a vector takes the absolute value in each component. After ignoring $S_2(\hat\beta^{(m-1)}) - \frac{\lambda}{\tau}\sum_{j=1}^p |\hat\beta_j^{(m-1)}|\, I(|\hat\beta_j^{(m-1)}| > \tau)$, which is independent of θ, the problem reduces to

$S^{(m)}(\theta) = -L(\theta) + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j|\, I(|\hat\beta_j^{(m-1)}| \leq \tau).$ (6)

Minimizing (6) in θ yields its minimizer θ̂(m). The process continues in m until termination occurs. Our unconstrained DC method is summarized as follows.

Algorithm 1

  • Step 1.

    (Initialization) Supply a good initial estimate θ̂(0), such as the minimizer of S1(θ).

  • Step 2.

    (Iteration) At iteration m, compute θ̂(m) by solving (6).

  • Step 3.

    (Termination) Terminate when S(θ̂(m−1)) − S(θ̂(m)) ≤ ε and no component of β̂(m) is at ±τ. Otherwise, add ε to those components whose absolute value is τ, and go to Step 2, where ε is the square root of the machine precision. Then the estimate is θ̂ = θ̂(m*), where m* is the smallest index meeting the termination criterion.

In Algorithm 1, (6) reduces to a general weighted Lasso problem: $-L(\theta) + \sum_{j=1}^p \lambda_j|\beta_j|$, with $\lambda_j = \frac{\lambda}{\tau} I(|\hat\beta_j^{(m-1)}| \leq \tau)$. Therefore any efficient weighted-Lasso routine is applicable.
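For illustration, one DC subproblem (6) for linear regression can be solved with the R package glmnet via its penalty.factor argument; this is a minimal sketch of the reduction just described, not the authors' implementation, and because glmnet rescales penalty factors internally the value of lambda below corresponds to λ/τ only up to a constant. The function name dc_step is ours, and the corner case where no coefficient remains penalized is not handled.

    library(glmnet)

    ## One weighted-Lasso subproblem (6): only coefficients with
    ## |beta_prev_j| <= tau remain penalized at this DC iteration.
    dc_step <- function(x, y, beta_prev, lambda, tau) {
      w <- as.numeric(abs(beta_prev) <= tau)   # weight 1 if still penalized, 0 otherwise
      fit <- glmnet(x, y, family = "gaussian",
                    penalty.factor = w, lambda = lambda / tau)
      as.numeric(coef(fit))[-1]                # drop the intercept
    }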

For (4), we decompose the nonconvex constraint into a difference of two convex functions to construct a sequence of approximating convex constraints. This amounts to solving the mth subproblem in a parallel fashion as in (6):

$\min_\theta -L(\theta), \quad \text{subject to } \frac{1}{\tau}\sum_{j=1}^p |\beta_j|\, I(|\hat\beta_j^{(m-1)}| \leq \tau) \leq K - \sum_{j=1}^p I(|\hat\beta_j^{(m-1)}| > \tau).$ (7)

This leads to a constrained DC algorithm—Algorithm 2 for solving (4) by replacing (5) in Algorithm 1 by (4).

Algorithms 1 and 2 are a generalization of those in [24] to a general likelihood, where all the computational properties there extend to the present situation, including equivalence of the DC solutions of the two algorithms and their convergence. Next we shall work with (5) due to its computational advantage. For instance, a coordinate descent method that works well for (5) breaks down for (4), c.f., [24].

3 Theory

This section presents a general theory for accuracy of reconstruction of the oracle estimator θ̂ml = (β̂ml, η̂ml) with $\hat\beta^{ml} = (\hat\beta_{A_0}^{ml}, 0_{A_0^c})$ given A0, which is the MLE computed as if the knowledge of A0 were available a priori. As direct consequences, feature selection consistency is studied, as well as optimal parameter estimation defined by the oracle estimator. In addition, a necessary condition for feature selection is established. A parallel theory for the regularized likelihood is similar and thus omitted.

3.1 Constrained L0-likelihood

In (2), assume that a global minimizer exists, denoted by $\hat\theta^{L_0} = (\hat\beta^{L_0}, \hat\eta^{L_0})$ with $\hat\beta^{L_0} = (\hat\beta^{L_0}_{\hat A^{L_0}}, 0_{(\hat A^{L_0})^c})$. Write β as $(\beta_A, 0_{A^c})$, with $\beta_A$ being $(\beta_1, \ldots, \beta_{|A|})^T$ for any subset A ⊂ {1, ···, p} of nonzero coefficients.

Before proceeding, we define a complexity measure for the size of a space $\mathcal{F}$. The bracketing Hellinger metric entropy of $\mathcal{F}$, denoted by the function H(·, $\mathcal{F}$), is defined as the logarithm of the cardinality of the u-bracketing of $\mathcal{F}$ of the smallest size. That is, if $S(u, m) = \{f_1^l, f_1^u, \ldots, f_m^l, f_m^u\} \subset L_2$ is a bracket covering satisfying $\max_{1\leq j\leq m}\|f_j^u - f_j^l\| \leq u$ and, for any f ∈ $\mathcal{F}$, there exists a j such that $f_j^l \leq f \leq f_j^u$ a.e. P, then H(u, $\mathcal{F}$) is log(min{m : S(u, m)}), where $\|f\|^2 = \int f^2(z)\,dz$. For more discussion of metric entropy of this type, see [14].

Assumption A (Size of parameter space)

For some constant c0 > 0 and any $\varepsilon^2/4 < t < \varepsilon \leq 1$, $H(t, \mathcal{B}_A) \leq c_0 (\log p)^2 |A| \log(2\varepsilon/t)$, with |A| ≤ p0, where $\mathcal{B}_A = \mathcal{F}_A \cap \{h(\theta, \theta^0) \leq 2\varepsilon\}$ is a local parameter space, and $\mathcal{F}_A = \{g^{1/2}(\theta, y) : \theta = (\beta, \eta), \beta = (\beta_A, 0_{A^c})\}$ is a collection of square-root densities.

Theorem 1 (Error bound and oracle properties)

Under Assumption A, if K = p0, then there exists a constant c2 > 0, say $c_2 = \frac{2}{27}\cdot\frac{1}{963}$, such that for any (n, p0, p),

$P(\hat\theta^{L_0} \neq \hat\theta^{ml}) \leq \exp\big(-c_2 n C_{\min}(\theta^0) + 2\log(p+1) + 3\big).$ (8)

Moreover, under (3) with $d_0 > \max\big(\frac{2}{c_2}, (2c_0)^{1/2} c_4^{-1}\log(2^{1/2}/c_3)\big)$, θ̂L0 reconstructs the oracle estimator θ̂ml with probability tending to one as n, p → ∞. Three oracle properties hold as n, p → ∞:

  1. (Selection consistency) Estimator ÂL0 is selection consistent, that is, $P(\hat A^{L_0} \neq A_0) \to 0$.

  2. (Optimal parameter estimation) For θ^0, $Eh^2(\hat\theta^{L_0}, \theta^0) = (1+o(1)) Eh^2(\hat\theta^{ml}, \theta^0) = O(\varepsilon^2_{n,p_0,p})$ and $h^2(\hat\theta^{L_0}, \theta^0) = O_p(\varepsilon^2_{n,p_0,p})$, provided that $Eh^2(\hat\theta^{ml}, \theta^0)$ does not tend to zero too fast, in that $\frac{c_2}{2} n C_{\min}(\theta^0) + \log Eh^2(\hat\theta^{ml}, \theta^0) \to \infty$, where $\varepsilon_{n,p_0,p}$ is any solution in ε of
    $\int_{2^{-8}\varepsilon^2}^{2^{1/2}\varepsilon} H^{1/2}(t/c_3, \mathcal{B}_{A_0})\, dt \leq c_4 n^{1/2}\varepsilon^2.$ (9)
  3. (Uniformity over an L0-band) The reconstruction holds uniformly over B0(u, l), namely, $\sup_{\theta^0\in B_0(u,l)} P(\hat\theta^{L_0} \neq \hat\theta^{ml}) \to 0$, where B0(u, l) is an L0-band, defined as $\{(\beta, \eta^0): p_0 = \sum_{j=1}^p I(\beta_j \neq 0) \leq u,\ C_{\min}(\theta) \geq l\}$ with 0 < u ≤ min(n, p) and $l = d_0\frac{\log p}{n}$. This implies feature selection consistency $\sup_{\theta^0\in B_0(u,l)} P(\hat A^{L_0} \neq A_0) \to 0$, and optimal parameter estimation $\frac{\sup_{\theta^0\in B_0(u,l)} Eh^2(\hat\theta^{L_0}, \theta^0)}{\sup_{\theta^0\in B_0(u,l)} Eh^2(\hat\theta^{ml}, \theta^0)} \to 1$, with $\sup_{\theta^0\in B_0(u,l)} Eh^2(\hat\theta^{ml}, \theta^0) = O(\varepsilon^2_{n,u,p})$, provided that $Eh^2(\hat\theta^{ml}, \theta^0)$ does not tend to zero too fast, in that $\frac{c_2}{2} n C_{\min}(\theta^0) + \log \sup_{\theta^0\in B_0(u,l)} Eh^2(\hat\theta^{ml}, \theta^0) \to \infty$.

The L0-method consistently reconstructs the oracle estimator when the degree of separation exceeds the minimal level, precisely under (3). As a result, selection consistency is established for the L0-method. This, combined with Theorem 3, suggests that the L0-method is optimal in feature selection against any method, matching the lower-bound requirement on the degree of separation with respect to (p, p0, n) up to a constant factor d0 > 0 in Theorem 3. Moreover, the optimality extends further to parameter estimation, where sharper parameter estimation is obtained from accurate L0 selection, achieving the optimal Hellinger risk of the oracle estimator asymptotically. By comparison, such a result is not expected for L1-regularization. As suggested in [17], selection consistency of the Lasso does not give sharper parameter estimation, where the rate of convergence of an L1-method in the L2 risk remains $\frac{p_0\log(p/p_0)}{n}$ in linear regression. This is because an L1-method is nonadaptive and overpenalizes large coefficients as a result of shrinking small coefficients towards zero. Similarly, for feature selection in logistic regression, the L0-method is expected to give better estimation precision than an L1-method, although a parallel result for an L1-method is not available. Finally, the uniform result in (C) is over an L0-band B0(u, l), which is not expected over an L0-ball B0(u, 0) in view of the result of Theorem 3.

3.2 Constrained truncated L1-likelihood

For constrained truncated L1-likelihood, one additional regularity condition–Assumption B is assumed, which is generally met with a smooth likelihood; see Section 4 for an example. It requires the Hellinger-distance to be smooth so that the TLP approximation to the L0-function becomes adequate through tuning τ.

Assumption B

For some constants d1, d2, d3 > 0,

$h^2(\theta, \theta^0) \leq d_1^{-1} h^2(\theta_{\tau+}, \theta^0) + d_3 p\tau^{d_2}, \qquad A_{\tau+} \equiv \{j : |\beta_j| \geq \tau\},$ (10)

where $\theta_{\tau+} = (\beta_1 I(|\beta_1| \geq \tau), \ldots, \beta_p I(|\beta_p| \geq \tau), \eta_1, \ldots, \eta_q)$.

Theorem 2 (Error bound and oracle properties)

Under Assumption A with $\mathcal{F}_A$ replaced by $\{g^{1/2}(\theta, y) : \theta = (\beta, \eta), \beta = (\beta_A, \beta_{A^c}), \|\beta_{A^c}\|_\infty \leq \tau\}$, say 0 ≤ τ ≤ c′ε for some constant c′, and Assumption B, if K = p0 and $\tau \leq \max\big(c', (d_1 C_{\min}(\theta^0)/2pd_3)^{1/d_2}\big)$, then there exists a constant c2 > 0 such that for any (n, p0, p),

$P(\hat\theta^{T} \neq \hat\theta^{ml}) \leq \exp\big(-c_2 n C_{\min}(\theta^0) + 2\log(p+1) + 3\big).$ (11)

Moreover, under (3) with a sufficiently large constant d0 > 0, θ̂T has the three oracle properties (A)–(C) of θ̂L0, provided that $\frac{c_2 d_1}{4} n C_{\min}(\theta^0) + \log Eh^2(\hat\theta^{ml}, \theta^0) \to \infty$. For (C), $\tau \leq \big(\frac{d_1 l}{2pd_3}\big)^{1/d_2}$ is required, as well as $\frac{c_2 d_1}{4} n C_{\min}(\theta^0) + \log \sup_{\theta^0\in B_0(u,l)} Eh^2(\hat\theta^{ml}, \theta^0) \to \infty$.

Remark

Constants in Theorem 2 can be made precise. For instance, $c_2 = \frac{4}{27}\cdot\frac{1}{1926}$ and $d_0 > \max\big(\frac{4}{c_2 d_1}, (2c_0)^{1/2} c_4^{-1}\log(2^{1/2}/c_3)\big)$.

Theorem 2 says that the oracle properties of the L0-function are attained by its computational surrogate when τ is sufficiently small.

3.3 Necessary condition for selection consistency

This section establishes the necessary condition (3) by quantifying the minimal value of d0 in (3) required for feature selection consistency.

Let K(θ1, θ2) = E log[g(θ1, Y)/g(θ2, Y)] be the Kullback-Leibler loss for θ1 versus θ2, where E is taken with regard to g(θ1, Y). Let $\gamma_{\min}(\theta^0) \equiv \min\{|\beta_k^0| : k \in A_0\} > 0$.

Assumption C

For a constant r > 0, $K(\theta_j, \theta_k) \leq r\gamma_{\min}^2$. Here $\{\theta_j = (\beta^j, \eta^0), j = 1, \ldots, p\}$ is a set of parameters, where $\beta^j = \sum_{k=1}^{p_0}\gamma_{\min} e_k - \gamma_{\min} e_j$ for j = 1, ···, p0, $\beta^j = \sum_{k=1}^{p_0}\gamma_{\min} e_k + \gamma_{\min} e_j$ for j = p0 + 1, ···, p, and $e_j = (\underbrace{0,\ldots,0}_{j-1}, 1, \underbrace{0,\ldots,0}_{p-j})^T$. Assume that $s \equiv \inf_{\theta^0} \frac{C_{\min}(\theta^0)}{\gamma_{\min}^2(\theta^0)} > 0$.

Theorem 3 (Necessary condition for feature selection consistency)

Under Assumption C, for any constant c* ∈ (0, 1), any (n, p0, p) with p0p/2, and any η0, we have

$\inf_{\hat A}\sup_{\{\beta^0 : C_{\min}(\theta^0) = R^*\}} P(\hat A \neq A_0) \geq c^*,$ (12)

with $R^* = \frac{s(1-c^*)\log p}{4rn}$. Moreover,

$\inf_{\hat A}\sup_{\theta^0\in B_0(u,l)} P(\hat A \neq A_0) \geq c^*, \quad \text{as } n, p \to \infty,$ (13)

where u ≤ min(p/2, n), $l = d_0\frac{\log p}{n}$, and $d_0 = \frac{(1-c^*)s}{4r}$.

Theorem 3 says that feature selection inconsistency occurs when $d_0 < \frac{s}{4r}$ in (3). Hence the minimal value $d_0 = \frac{s}{4r}$ yields a requirement for feature selection consistency in (3).

4 Generalized linear models

For GLMs, observations Yi = (Zi, Xi) are paired; response Zi is assumed to follow an exponential family with density function $g(z_i; \theta_i, \phi) = \exp\{[z_i\theta_i - b(\theta_i)]/a(\phi) + c(z_i, \phi)\}$, where θi is the natural parameter related to the mean $\mu_i = E(z_i) = b'(\theta_i)$, and φ is a dispersion parameter. With a link function g, a regression model becomes $\eta_i = g(\mu_i) = \beta^T x_i$. The penalized likelihood for estimating the regression coefficient vector β is $-L(\beta) + \sum_{j=1}^p J(|\beta_j|; \lambda, \tau)$, where $L(\beta) = \sum_{i=1}^n \{[z_i\theta_i - b(\theta_i)]/a(\phi) + c(z_i, \phi)\}$ is the log-likelihood, and $J(|\beta_j|; \lambda, \tau) = \frac{\lambda}{\tau}\min(|\beta_j|, \tau)$ is the TLP penalty.

For parameter estimation and feature selection, we apply Algorithm 1, where (6) becomes a series of weighted Lasso problems for GLMs, for which existing routines are applicable. In implementation, we use the function wtlassoglm() in the R package SIS.
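As a concrete sketch of Algorithm 1 for logistic regression, the DC loop below repeatedly solves (6) with glmnet, starting from a Lasso solution; this is our own illustration, not the authors' implementation, and it assumes glmnet is an acceptable weighted-Lasso solver whose internal rescaling of penalty.factor is tolerated up to a constant in λ. The function name tlp_logistic is hypothetical.

    library(glmnet)

    ## Sketch of Algorithm 1 for logistic regression: iterate the weighted
    ## Lasso subproblem (6) with weights I(|beta_j^{(m-1)}| <= tau).
    tlp_logistic <- function(x, y, lambda, tau, maxit = 20, eps = 1e-8) {
      beta <- as.numeric(coef(glmnet(x, y, family = "binomial",
                                     lambda = lambda)))[-1]     # Lasso initial estimate
      for (m in seq_len(maxit)) {
        w <- as.numeric(abs(beta) <= tau)
        fit <- glmnet(x, y, family = "binomial",
                      penalty.factor = w, lambda = lambda / tau)
        beta_new <- as.numeric(coef(fit))[-1]
        if (sum(abs(beta_new - beta)) < eps) { beta <- beta_new; break }
        beta <- beta_new
      }
      beta
    }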

Next we examine effectiveness of the proposed method through simulated examples in feature selection. In linear regression and logistic regression, the Lasso, SCAD [6], SCAD-OS, TLP and TLP-OS are compared in terms of predictive accuracy and identification of the true model, where SCAD-OS and TLP-OS are SCAD and TLP with only one iteration step in the DC iterative process, and SCAD-OS is proposed in [33]. The latter four methods use the Lasso as an initial estimate.

4.1 Simulations

For the simulations, predictors Xi are i.i.d. from N(0, V), where V is a p × p matrix whose ij-th element is $0.5^{|i-j|}$. In linear regression, $Z_i = \beta^T X_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$; i = 1, ···, n, with random error εi independent of Xi; in logistic regression, a binary response is generated from logit Pr(Zi = 1) = βᵀXi. In both cases, β = (β1, ···, βp)ᵀ with β1 = 1, β2 = 0.5 and β5 = 0.75, and βj = 0 for j ≠ 1, 2, 5. This set-up is similar to that considered in [33]; here we examine various situations with respect to (p, n). Each simulation is based on 1000 independent replications.
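For reference, the simulation design just described can be generated with a few lines of R; this sketch is ours (the function name sim_data is hypothetical) and uses MASS::mvrnorm for the AR(1)-correlated predictors.

    library(MASS)

    ## One replication of the Section 4.1 design: X ~ N(0, V) with
    ## V_ij = 0.5^|i-j| and beta = (1, 0.5, 0, 0, 0.75, 0, ..., 0).
    sim_data <- function(n, p, sigma = 1, logistic = FALSE) {
      V <- 0.5^abs(outer(1:p, 1:p, "-"))
      beta <- numeric(p); beta[c(1, 2, 5)] <- c(1, 0.5, 0.75)
      x <- mvrnorm(n, mu = rep(0, p), Sigma = V)
      eta <- drop(x %*% beta)
      z <- if (logistic) rbinom(n, 1, plogis(eta)) else eta + rnorm(n, sd = sigma)
      list(x = x, z = z, beta = beta, V = V)
    }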

For any given tuning parameter λ, all other methods use the Lasso estimate as an initial estimate. For each method, we choose its tuning parameter values by maximizing the log-likelihood based on a common tuning dataset with an equal sample size of the training data and independent of the training data. This is achieved through a grid search over 21 λ values returned by glmnet() for all the methods, and additionally over a grid of 10 τ values that are the 9th-, 19th-, 29th-, ···, 99th-percentiles of the final Lasso estimate for the TLP.

The model error (ME) is used to evaluate the predictive performance of β̂, defined as ME(β̂) = (β̂ − β^0)ᵀV(β̂ − β^0), which is the prediction error minus σ² in linear regression, corresponding to the test error over an independent test sample of infinite size. In our context, median MEs are reported over the 1000 simulation replications, due to possible skewness of the distribution of ME. In addition, the mean parameter estimates of the nonzero elements of β are reported, together with the mean true positive (TP) and mean false positive (FP) numbers: $\#TP = \sum_{j=1}^p I(\beta_j \neq 0, \hat\beta_j \neq 0)$ and $\#FP = \sum_{j=1}^p I(\beta_j = 0, \hat\beta_j \neq 0)$.
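The evaluation criteria above are straightforward to compute; the short R helpers below are our own sketch (with hypothetical function names), taking the estimate, the true coefficient vector and the predictor covariance V from the design above.

    ## Model error ME = (beta_hat - beta)' V (beta_hat - beta),
    ## plus the true/false positive counts of nonzero estimates.
    model_error <- function(beta_hat, beta, V) {
      d <- beta_hat - beta
      drop(t(d) %*% V %*% d)
    }
    tp_fp <- function(beta_hat, beta) {
      c(TP = sum(beta != 0 & beta_hat != 0),
        FP = sum(beta == 0 & beta_hat != 0))
    }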

For linear regression, simulation results are reported in Table 1 for the cases of p = 12, 500, 1000, n = 50, 100, and σ² = 1. As suggested by Table 1, the TLP performs best: it gives the smallest estimation and prediction error as measured by the ME and the smallest mean false positive number (FP), while maintaining a comparable mean number of true positives (TP) around 3. Most critically, as p increases, the TLP's performance remains much more stable than that of its competitors. On a relative basis, the TLP's advantage over its competitors grows in more difficult situations.

Table 1.

Median MEs, mean (SD in parentheses) estimates of the nonzero coefficients (β1, β2, β5), and mean true positive (#TP) and false positive (#FP) numbers of nonzero estimates, for linear regression, based on 1000 simulation replications.

n p Method ME β1 = 1 β2 = .5 β5 = .75 #TP #FP
50 12 Lasso .129 .91(.17) .41(.18) .60(.16) 2.98(0.14) 3.82(2.39)
SCAD-OS .109 1.02(.19) .40(.22) .68(.18) 2.92(0.27) 2.50(1.97)
SCAD .118 1.04(.20) .39(.24) .71(.18) 2.88(0.32) 2.30(1.90)
TLP-OS .088 1.01(.18) .41(.20) .68(.17) 2.94(0.25) 1.65(2.04)
TLP .090 1.01(.19) .41(.21) .69(.17) 2.92(0.27) 1.57(1.98)

50 500 Lasso .431 .76(.19) .29(.18) .41(.17) 2.90(0.30) 14.7(10.48)
SCAD-OS .327 1.01(.24) .25(.25) .52(.24) 2.70(0.47) 14.81(8.69)
SCAD .301 1.09(.26) .21(.27) .59(.26) 2.53(0.53) 12.25(7.63)
TLP-OS .150 1.02(.21) .39(.26) .67(.22) 2.75(0.45) 4.27(6.86)
TLP .143 1.02(.21) .39(.26) .68(.22) 2.75(0.45) 4.10(6.89)

50 1000 Lasso .501 .72(.19) .28(.18) .37(.18) 2.88(0.33) 17.20(11.49)
SCAD-OS .370 .99(.25) .26(.25) .51(.26) 2.67(0.49) 18.76(9.60)
SCAD .327 1.08(.26) .20(.28) .57(.29) 2.49(0.54) 15.19(8.41)
TLP-OS .182 1.01(.20) .40(.27) .66(.25) 2.72(0.47) 5.43(8.69)
TLP .175 1.02(.20) .40(.27) .66(.25) 2.72(0.47) 5.06(8.30)

100 12 Lasso .063 .94(.12) .44(.13) .65(.11) 3.00(0.00) 3.94(2.42)
SCAD-OS .042 1.01(.12) .45(.14) .72(.11) 2.99(0.08) 2.17(2.04)
SCAD .042 1.02(.13) .45(.15) .74(.11) 2.99(0.09) 2.06(2.01)
TLP-OS .037 1.01(.12) .45(.13) .71(.11) 3.00(0.06) 1.51(1.99)
TLP .036 1.01(.12) .45(.13) .72(.11) 3.00(0.07) 1.46(1.95)

100 500 Lasso .186 .84(.12) .36(.12) .52(.11) 3.00(0.04) 15.61(11.04)
SCAD-OS .118 1.06(.14) .32(.18) .66(.14) 2.94(0.24) 14.92(10.45)
SCAD .121 1.10(.15) .30(.21) .71(.12) 2.89(0.31) 14.20(10.00)
TLP-OS .036 1.01(.12) .47(.14) .72(.11) 2.99(0.12) 3.64(6.57)
TLP .035 1.01(.12) .46(.14) .72(.11) 2.99(0.12) 3.49(6.69)

100 1000 Lasso .211 .83(.13) .34(.13) .51(.12) 3.00(0.06) 18.10(12.50)
SCAD-OS .142 1.06(.15) .30(.19) .66(.15) 2.91(0.29) 19.70(13.35)
SCAD .147 1.10(.15) .27(.22) .72(.14) 2.83(0.38) 18.80(12.74)
TLP-OS .037 1.01(.13) .46(.15) .74(.12) 2.97(0.18) 3.93(7.04)
TLP .037 1.01(.13) .46(.15) .74(.12) 2.97(0.18) 3.80(6.94)

For logistic regression, simulation results are summarized in Table 2 for the cases of p = 12, 200, 500 and n = 100, 200. As expected, the TLP continues to outperform the other methods, with the smallest median MEs. It gives less biased estimates than the Lasso. The TLP's advantage over the other methods persists as p increases.

Table 2.

Median MEs, mean (SD in parentheses) estimates of the nonzero coefficients (β1, β2, β5), and mean true positive (#TP) and false positive (#FP) numbers of nonzero estimates, for logistic regression, based on 1000 simulation replications.

n p Method ME β1 = 1 β2 = .5 β5 = .75 #TP #FP
100 12 Lasso .388 .80(.27) .35(.25) .49(.27) 2.8(0.4) 3.8(2.2)
SCAD-OS .416 1.03(.37) .39(.36) .61(.40) 2.5(0.7) 1.6(1.9)
SCAD .472 1.10(.41) .38(.39) .67(.44) 2.4(0.7) 1.1(1.9)
TLP-OS .350 .98(.35) .36(.32) .59(.35) 2.7(0.5) 1.8(2.0)
TLP .355 .98(.35) .35(.32) .58(.35) 2.6(0.6) 1.8(2.0)

100 200 Lasso .947 .57(.25) .20(.19) .26(.21) 2.6(0.6) 11.7(7.1)
SCAD-OS .733 .96(.45) .23(.36) .40(.41) 2.0(0.7) 3.1(2.9)
SCAD .827 1.08(.53) .23(.42) .46(.53) 1.7(0.6) 1.1(1.4)
TLP-OS .649 .99(.42) .31(.36) .49(.46) 2.2(0.7) 3.8(5.2)
TLP .664 .99(.43) .30(.37) .48(.47) 2.2(0.7) 3.6(5.2)

100 500 Lasso 1.166 .48(.24) .18(.19) .19(.19) 2.4(0.7) 13.6(9.1)
SCAD-OS .867 .84(.48) .23(.35) .29(.37) 1.8(0.7) 3.9(3.6)
SCAD .847 1.00(.57) .25(.46) .34(.50) 1.6(0.6) 1.3(1.5)
TLP-OS .791 .93(.45) .30(.39) .38(.45) 2.0(0.7) 4.4(6.5)
TLP .811 .94(.46) .29(.40) .38(.45) 2.0(0.7) 4.1(6.4)

200 12 Lasso .203 .87(.20) .39(.20) .57(.20) 3.0(0.2) 4.3(2.4)
SCAD-OS .173 1.06(.25) .44(.28) .72(.26) 2.8(0.4) 1.6(2.2)
SCAD .202 1.08(.25) .45(.30) .77(.25) 2.8(0.5) 1.2(2.1)
TLP-OS .155 1.00(.24) .40(.24) .67(.24) 2.9(0.3) 1.8(2.1)
TLP .157 1.00(.24) .40(.24) .67(.24) 2.9(0.3) 1.8(2.1)

200 200 Lasso .540 .68(.18) .27(.17) .38(.17) 2.9(0.3) 14.1(8.9)
SCAD-OS .271 1.07(.26) .34(.32) .64(.32) 2.6(0.6) 3.2(3.3)
SCAD .262 1.12(.29) .30(.37) .68(.36) 2.3(0.6) 0.8(1.4)
TLP-OS .204 1.04(.25) .40(.29) .68(.27) 2.7(0.5) 3.3(5.5)
TLP .204 1.04(.26) .39(.30) .68(.28) 2.7(0.5) 3.2(5.8)

200 500 Lasso .651 .64(.17) .24(.16) .33(.15) 2.9(0.3) 18.0(10.5)
SCAD-OS .289 1.07(.27) .31(.32) .57(.31) 2.5(0.5) 4.1(4.0)
SCAD .262 1.13(.28) .29(.37) .66(.36) 2.3(0.6) 1.4(1.7)
TLP-OS .231 1.04(.27) .39(.30) .65(.29) 2.7(0.5) 4.1(6.8)
TLP .231 1.04(.27) .38(.30) .65(.30) 2.7(0.5) 3.8(6.8)

4.2 Theory for feature selection

This section establishes some theoretical results to provide insight into the performance of the proposed method in feature selection. Let Y = (Z, X), and $g(\beta, Z) = \frac{1}{\sqrt{2\pi}\sigma}\exp\big(-\frac{1}{2\sigma^2}(Z - \beta^T X)^2\big)$ and $g(\beta, Z) = p^Z(1-p)^{1-Z}$ in linear and logistic regression, respectively. Assume that $\beta^T x = \beta_A^T x_A$ belongs to a compact parameter space for any model of size |A| ≤ p0. In this case, selection does not involve nuisance parameters, so θ = β. Under (14), we establish feature selection consistency as well as optimal parameter estimation for the TLP:

$C_{\min}(\beta^0) = \min_{A: |A|\leq p_0,\, A \nsupseteq A_0} \frac{1}{\max(|A_0\setminus A|, 1)} (\beta^0_{A_0\setminus A})^T\big(\Sigma_{A_0\setminus A, A_0\setminus A} - \Sigma_{A_0\setminus A, A}\Sigma_{A,A}^{-1}\Sigma_{A, A_0\setminus A}\big)\beta^0_{A_0\setminus A} \geq d_0\frac{\log p}{n},$ (14)

where d0 > 0 is a constant independent of (n, p, p0), and ΣB denotes the sub-matrix, corresponding to a subset B of predictors, of the covariance matrix Σ with jk-th element Cov(Xj, Xk); Σ is independent of β^0. A simpler but stronger condition can be used to verify (14):

$\gamma_{\min}^2 \min_{A: |A|\leq p_0,\, A \nsupseteq A_0} c_{\min}\big(\Sigma_{A_0\setminus A, A_0\setminus A} - \Sigma_{A_0\setminus A, A}\Sigma_{A,A}^{-1}\Sigma_{A, A_0\setminus A}\big) \geq d_0\frac{\log p}{n},$ (15)

where $\gamma_{\min} = \gamma_{\min}(\beta^0) \equiv \min\{|\beta_k^0| : \beta_k^0 \neq 0\}$ is the resolution level of the true regression coefficients, $\min_{A: |A|\leq p_0,\, A\nsupseteq A_0} c_{\min}\big(\Sigma_{A_0\setminus A, A_0\setminus A} - \Sigma_{A_0\setminus A, A}\Sigma_{A,A}^{-1}\Sigma_{A, A_0\setminus A}\big) \geq \min_{B\supseteq A_0: |B|\leq 2p_0} c_{\min}(\Sigma_B)$, and cmin denotes the smallest eigenvalue. Note that (14) is necessary, up to the constant d0, for any method to be selection consistent if $\min_{A: |A|\leq p_0,\, A\nsupseteq A_0} c_{\min}\big(\Sigma_{A_0\setminus A, A_0\setminus A} - \Sigma_{A_0\setminus A, A}\Sigma_{A,A}^{-1}\Sigma_{A, A_0\setminus A}\big) > 0$.

Proposition 1

Under (14), the constrained MLE β̂T of (4) consistently reconstructs the oracle estimator β̂ml. As n, p → ∞, feature selection consistency is established for the TLP, as well as optimal parameter estimation $Eh^2(\hat\beta^{T}, \beta^0) = (1+o(1))Eh^2(\hat\beta^{ml}, \beta^0) = O\big(\frac{p_0}{n}\big)$ under the Hellinger distance h(·, ·). Moreover, the results hold uniformly over an L0-band $B_0(u, l) = \{\beta^0 : \sum_{j=1}^p I(\beta_j^0 \neq 0) \leq u,\ \gamma_{\min}^2(\beta^0)\min_{B\supseteq A_0: |B|\leq 2p_0} c_{\min}(\Sigma_B) \geq l\}$, with 0 < u ≤ min(n, p) and $l = d_0\sigma^2\frac{\log p}{n}$, that is, as n, p → ∞,

$\sup_{\beta^0\in B_0(u,l)} P(\hat\beta^{T} \neq \hat\beta^{ml}) \to 0, \qquad \frac{\sup_{\beta^0\in B_0(u,l)} Eh^2(\hat\beta^{T}, \beta^0)}{\sup_{\beta^0\in B_0(u,l)} Eh^2(\hat\beta^{ml}, \beta^0)} \to 1,$

with $\sup_{\beta^0\in B_0(u,l)} Eh^2(\hat\beta^{ml}, \beta^0) = d^*\frac{u}{n}$ for some constant d* > 0.

Various conditions have been proposed for studying feature selection consistency in linear regression. In particular, a condition on γmin is usually imposed, in addition to assumptions on the design matrix X such as the sparse Riesz condition in [32]. To compare (14) with existing assumptions for consistent selection, note that these assumptions imply a fixed-design version of (14) by necessity of consistent feature selection. For instance, as shown in [32], the sparse Riesz condition with a dimension restriction and $\gamma_{\min}^2 \geq c\frac{\log(p\vee u)}{n}$, required for the minimax concave penalty to be consistent, imply (15) with p replaced by p ∨ u, and thus (14), when p/u is bounded away from 1, where u ≥ p0. Moreover, the number of over-selected variables is proved to be bounded, but may not tend to zero, for the thresholded Lasso in Theorem 1.1 of [35], under a restricted eigenvalue condition [1] and a requirement on γmin. Finally, in linear regression, only a finite variance σ² is required for the proposed method, in contrast to the commonly used assumption of a sub-Gaussian distribution for εi.

In conclusion, the computational surrogate, the TLP method, indeed shares the desirable oracle properties of the L0-method, which is optimal against any selection method, for feature selection and parameter estimation.

5 Estimation of a precision matrix

Given n random samples from a p-dimensional normal distribution Y1, ···, Yn ~ N (μ, Σ), we estimate the inverse covariance matrix Ω = Σ−1 that is p × p positive definite, denoted by Ω ≻ 0. For estimation of (μ, Ω), the log-likelihood is proportional to

$\frac{n}{2}\log\det(\Omega) - \frac{1}{2}\sum_{i=1}^n (Y_i - \mu)^T\Omega(Y_i - \mu).$ (16)

The profile log-likelihood for Ω, after μ is maximized out, is proportional to $\frac{n}{2}\log\det(\Omega) - \frac{n}{2}\operatorname{tr}(S\Omega)$, where $\bar Y = n^{-1}\sum_{i=1}^n Y_i$ and $S = n^{-1}\sum_{i=1}^n (Y_i - \bar Y)(Y_i - \bar Y)^T$ are the sample mean and covariance matrix, and det and tr denote the determinant and trace. In (16), the number of unknown parameters p² in Ω can greatly exceed the sample size n, in the presence of the 2p nuisance parameters (μ, {Ωjj : j = 1, ···, p}), where Ωjk denotes the jk-th element of Ω. To avoid non-identifiability in estimation, we regularize the off-diagonal elements of Ω in (16) through a nonnegative penalty function J(·) for the $\frac{p(p-1)}{2}$ parameters of interest:

$S(\Omega) = -\frac{n}{2}\log\det(\Omega) + \frac{n}{2}\operatorname{tr}(S\Omega) + \sum_{j,k=1,\, j\neq k}^{p} J(|\Omega_{jk}|).$ (17)

In estimation, the TLP function $J(|\Omega_{jk}|) = \frac{\lambda}{\tau}\min(|\Omega_{jk}|, \tau)$ is employed for both parameter estimation and covariance selection in (17). Towards this end, we apply Algorithm 1 and solve (6) sequentially, which reduces to a series of weighted graphical Lasso problems, each solved by taking advantage of existing software. In implementation, we use the R package glasso [8] for (6).
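Each DC subproblem for (17) is a weighted graphical Lasso, which glasso can solve by passing a matrix of penalty parameters; the sketch below is our illustration of this reduction (the scaling of rho relative to λ/τ is schematic only), not the authors' code, and dc_step_precision is a hypothetical name.

    library(glasso)

    ## One DC subproblem for (17): penalize only off-diagonal entries whose
    ## previous estimate is small, |Omega_prev_jk| <= tau; S is the sample covariance.
    dc_step_precision <- function(S, Omega_prev, lambda, tau) {
      p <- ncol(S)
      rho <- matrix(lambda / tau, p, p) * (abs(Omega_prev) <= tau)
      diag(rho) <- 0                             # diagonal entries are left unpenalized
      fit <- glasso(S, rho = rho, penalize.diagonal = FALSE)
      fit$wi                                     # updated precision-matrix estimate
    }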

5.1 Simulations

Simulations are performed with a tridiagonal precision matrix, as in [7]. In particular, Σ is AR(1)-structured with ij-th element $\sigma_{ij} = \exp(-a|s_i - s_j|)$ for some a > 0, where s1 < s2 < ··· < sp are randomly chosen with $s_i - s_{i-1} \sim \text{Unif}(0.5, 1)$; i = 2, ···, p. The following situations are considered: (n, p) = (120, 30) or (n, p) = (120, 200), and a = 0.9 or a = 0.6, based on 100 replications.
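The AR(1)-structured covariance and its tridiagonal inverse can be generated directly; the following R sketch (ours, with a hypothetical function name) mirrors the construction described above.

    ## Sigma_ij = exp(-a|s_i - s_j|) with spacings s_i - s_{i-1} ~ Unif(0.5, 1);
    ## the precision matrix Omega = Sigma^{-1} is then tridiagonal.
    sim_precision_design <- function(p, a) {
      s <- cumsum(c(0, runif(p - 1, 0.5, 1)))
      Sigma <- exp(-a * abs(outer(s, s, "-")))
      list(Sigma = Sigma, Omega = solve(Sigma))
    }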

The competing methods are the Lasso, adaptive Lasso (ALasso), SCAD-OS, SCAD, TLP-OS and TLP. ALasso uses weight $\lambda/|\hat\beta_j^{(0)}|^{\gamma}$, where β̂(0) is an initial estimate and γ = 1/2, as in [7].

To measure the performance of an estimator Ω̂, we use the entropy loss and the quadratic loss: $\text{loss}_1(\Omega, \hat\Omega) = \operatorname{tr}(\Omega^{-1}\hat\Omega) - \log|\Omega^{-1}\hat\Omega| - p$ and $\text{loss}_2(\Omega, \hat\Omega) = \operatorname{tr}\big((\Omega^{-1}\hat\Omega - I)^2\big)$, as well as the true positive (TP) and false positive (FP) numbers: $\#TP = \sum_{i,j} I(\Omega_{ij} \neq 0, \hat\Omega_{ij} \neq 0)$ and $\#FP = \sum_{i,j} I(\Omega_{ij} = 0, \hat\Omega_{ij} \neq 0)$.
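The two losses and the TP/FP counts can be computed in a few lines; this helper is our own sketch, not part of the paper.

    ## Entropy loss, quadratic loss and TP/FP counts for an estimate Omega_hat
    ## of a precision matrix Omega.
    precision_losses <- function(Omega_hat, Omega) {
      p <- ncol(Omega)
      M <- solve(Omega) %*% Omega_hat
      loss1 <- sum(diag(M)) - as.numeric(determinant(M, logarithm = TRUE)$modulus) - p
      loss2 <- sum(diag((M - diag(p)) %*% (M - diag(p))))
      c(loss1 = loss1, loss2 = loss2,
        TP = sum(Omega != 0 & Omega_hat != 0),
        FP = sum(Omega == 0 & Omega_hat != 0))
    }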

For small p = 30, TLP and TLP-OS are always among the winners (Table 3). It is also confirmed that the one-step approximation to SCAD or TLP gives performance similar to that of the fully iterated SCAD or TLP, respectively. For large p, to save computing time, as advocated in [7], we only run SCAD-OS and TLP-OS. In this situation, the improvement of TLP-OS over the other methods is more substantial for large p = 200 than for small p = 30. Overall, the proposed method delivers higher performance in both low-dimensional and high-dimensional situations.

5.2 Theory for precision matrix

To perform the theoretical analysis, we specify a parameter space Θ in which Ω ≻ 0 with $0 < \max_{1\leq j\leq p}|\Omega_{jj}| \leq M_2$ and $c_{\min}(\Omega) \geq M_1 > 0$, for some constants M1, M2 > 0 independent of (n, p, p0). Let A = {(j, k) : j ≠ k, Ωjk ≠ 0} be the set of nonzero off-diagonal elements of Ω, where |A| = p0 is an even number by symmetry of Ω, and Ω depends on A. The results in Theorem 1 imply that the constrained MLE yields covariance selection consistency under one assumption:

$C_{\min}(\Omega^0) \geq d_0\frac{\log p}{n},$ (18)

which is indeed necessary, up to the constant d0, for any method to be covariance selection consistent when cmin(H) > 0. Here d0 > 0 is a constant independent of (n, p, p0), and $H = \Big(\frac{\partial^2(-\log\det(\Omega))}{\partial\Omega\,\partial\Omega}\Big)\Big|_{\Omega=\Omega^0}$ is the p² × p² Hessian matrix of −log det(Ω), whose $(\Omega_{jk}, \Omega_{j'k'})$ element is $\operatorname{tr}(\Sigma^0\Delta_{jk}\Sigma^0\Delta_{j'k'})$, c.f., [3], where Δjk is a p × p matrix whose jk-th element is 1 and all others are 0. A sufficient condition for (18) is

$C_{\min}(\Omega^0) \geq \gamma_{\min}^2\, c_{\min}(H),$ (19)

with $\gamma_{\min}(\Omega^0) \equiv \gamma_{\min} = \min\{|\Omega_{jk}^0| : \Omega_{jk}^0 \neq 0, j \neq k\}$.

Proposition 2

Under (18), the constrained MLE Ω̂T of (4) consistently reconstructs the oracle estimator Ω̂ml. As n, p → ∞, covariance selection consistency is established for the TLP, as well as optimal parameter estimation $Eh^2(\hat\Omega^{T}, \Omega^0) = (1+o(1))Eh^2(\hat\Omega^{ml}, \Omega^0) = O\big(\frac{p_0\log p}{n}\big)$, where $h^2(\Omega, \Omega^0) = 1 - \frac{(\det(\Omega)\det(\Omega^0))^{1/2}}{\det\big(\frac{\Omega+\Omega^0}{2}\big)}$ is the squared Hellinger distance for Ω versus Ω^0. Moreover, the above results hold uniformly over an L0-band $B_0(u, l) = \{\Omega^0 : \sum_{j,k=1,\, j\neq k}^p I(\Omega_{jk}^0 \neq 0) \leq u,\ \gamma_{\min}^2(\Omega^0)c_{\min}(H) \geq l\}$, with 0 < u ≤ min(n, p) and $l = d_0\frac{\log p}{n}$, that is, as n, p → ∞,

$\sup_{\Omega^0\in B_0(u,l)} P(\hat\Omega^{T} \neq \hat\Omega^{ml}) \to 0, \qquad \frac{\sup_{\Omega^0\in B_0(u,l)} Eh^2(\hat\Omega^{T}, \Omega^0)}{\sup_{\Omega^0\in B_0(u,l)} Eh^2(\hat\Omega^{ml}, \Omega^0)} \to 1,$

with $\sup_{\Omega^0\in B_0(u,l)} Eh^2(\hat\Omega^{ml}, \Omega^0) = d^*\frac{u\log p}{n}$ for some d* > 0.

In short, the TLP method is optimal against any method in covariance selection, permitting p up to exponentially large in the sample size, or $p^2 - p_0 = \exp\big(\frac{n\gamma_{\min}^2 c_{\min}(H)}{d_0}\big)$. Moreover, as a result of accurate selection by this method, parameter estimation can be sharply enhanced, at an order of $\frac{p_0\log p}{n}$ as measured by the Hellinger distance, after the zero off-diagonal elements are removed. Note that the log p factor is due to estimation of 2p nuisance parameters, as compared to the rate of $\frac{p_0}{n}$ in logistic regression. In view of the result in Lemma 1, this result seems to be consistent with the minimax rate $\frac{\log p}{n}$ under the L∞ matrix norm [20].

6 Metastasis status of breast cancer patients

We apply the penalized logistic regression methods to analyze a microarray gene expression dataset of [29], where our objectives are (1) to develop a model predicting the metastasis status, and (2) to identify cancer genes, for breast cancer patients. Among the 286 patients, metastasis was detected in 106 patients during follow-ups within 5 years after surgery. Their expression profiles were obtained from primary breast tumors with Affymetrix HG-133a GeneChips.

In [29], a 76-gene signature was developed based on a training set of 115 patients, which yielded a misclassification error rate of 64/171 = 37.4% when applied to the remaining samples. [30] compared the performance of a variety of classifiers using a subset of 245 genes drawn from 33 cancer-related pathways, based on a 10-fold cross-validation (CV). Their non-parametric pathway-based regression method yielded the smallest error rate, at 29%, while random forest, bagging and the support vector machine (SVM) had error rates of 33%, 35% and 42%, respectively.

In our analysis, we first performed a preliminary screening of the genes using a marginal t-test to select the top p genes with the most significant p-values, based on the training data for each fold of a 10-fold CV. The training data were then split into two parts, used to fit the penalized logistic models and to select the tuning parameters, respectively. The results are summarized in Table 4, including the total misclassification errors and average model sizes (i.e., numbers of nonzero estimates) based on the 10-fold CV. A final model is obtained by fitting the best model selected from the 10-fold CV to the entire data set.
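The marginal t-test screening step can be sketched as follows (our illustration, not the authors' code; expr is assumed to be a samples-by-genes expression matrix and z the 0/1 metastasis indicator).

    ## Keep the p_keep genes with the smallest two-sample t-test p-values.
    screen_genes <- function(expr, z, p_keep) {
      pvals <- apply(expr, 2, function(g) t.test(g[z == 1], g[z == 0])$p.value)
      order(pvals)[seq_len(p_keep)]              # column indices of selected genes
    }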

With regard to prediction, no large difference is seen among the various methods, with error rates ranging from 102/286 = 35.7% (TLP and TLP-OS with p = 200) to 118/286 = 41.3% (ALasso with p = 200). The TLP performed similarly to the TLP-OS; both were among the winners. In addition, the Lasso gave the least sparse models while the SCAD gave the sparsest models.

With regard to identifying cancer genes, the Lasso, TLP-OS and TLP yield the same model, identifying the largest number of cancer genes, whereas the SCAD and SCAD-OS give the sparsest models with at most 2 cancer genes, and ALasso yields only 10 cancer genes. Here cancer genes are defined according to the Cancer Gene Database [10].

In summary, the TLP and TLP-OS identify a good proportion of cancer genes and lead to a model giving a reasonably good predictive accuracy of the metastasis status. In this sense, they perform well with regard to the foregoing two objectives.

Table 3.

Average (SD in parentheses) entropy loss (loss1), quadratic loss (loss2), and true positive (TP) and false positive (FP) numbers of nonzero parameters based on 100 simulations, for estimating a precision matrix in Gaussian graphical models in Section 5.

Set-up Method loss1 loss2 #TP #FP
p = 30, a = 0.9 Lasso 1.55(.15) 2.96(.42) 88.0(.0) 314.0(41.6)
ALasso 1.02(.15) 1.99(.37) 88.0(.0) 95.5(30.2)
SCAD-OS 0.93(.16) 1.99(.44) 88.0(.0) 126.0(39.4)
SCAD 0.74(.16) 1.60(.42) 87.9(.5) 85.5(18.0)
TLP-OS 0.66(.18) 1.47(.47) 87.9(.5) 28.1(21.4)
TLP 0.63(.18) 1.39(.48) 87.8(.7) 22.4(17.0)

p = 30, a = 0.6 Lasso 1.69(.16) 3.28(.46) 88.0(.0) 342.5(35.5)
ALasso 1.01(.15) 1.97(.37) 88.0(.0) 103.9(17.4)
SCAD-OS 0.75(.14) 1.61(.36) 88.0(.0) 83.8(29.0)
SCAD 0.56(.12) 1.20(.30) 88.0(.2) 26.1(15.0)
TLP-OS 0.57(.14) 1.26(.37) 88.0(.0) 14.7(13.0)
TLP 0.54(.14) 1.18(.36) 88.0(.0) 7.3(10.6)

p = 200, a = 0.9 Lasso 20.16(.50) 34.50(1.85) 597.9(.4) 4847.8(614.7)
ALasso 10.62(.53) 19.64(1.20) 597.3(1.2) 936.8(37.9)
SCAD-OS 11.46(.60) 24.03(1.67) 597.7(.8) 2453.6(251.2)
TLP-OS 6.16(.77) 13.99(2.10) 593.6(3.0) 284.8(158.0)

p = 200, a = 0.6 Lasso 24.86(.54) 46.18(3.72) 598.0(.0) 6161.7(863.0)
ALasso 11.06(.48) 21.53(1.18) 598.0(.0) 1526.1(118.6)
SCAD-OS 9.43(.49) 20.87(1.77) 598.0(.0) 2754.8(523.6)
TLP-OS 4.45(.48) 9.89(1.29) 597.7(.7) 185.5(76.3)

Table 4.

Analysis results with various numbers (p) of predictors for the breast cancer data. The numbers of total classification errors (#Err), including false positives (#FP), and mean numbers of nonzero estimates (#Nonzero) from 10-fold CV, and the total numbers of nonzero estimates and cancer genes in the final models are shown.

p Method | 10-fold CV: #Err #FP #Nonzero | Final model: #Nonzero #Cancer genes
200 Lasso 107 17 40.1 62 13
ALasso 118 27 18.8 39 9
SCAD-OS 107 4 9.5 15 2
SCAD 107 1 4.7 2 0
TLP-OS 102 8 33.5 62 13
TLP 102 8 33.2 62 13

400 Lasso 107 19 46.9 95 26
ALasso 112 19 14.4 32 10
SCAD-OS 108 8 11.1 15 2
SCAD 106 0 4.1 2 0
TLP-OS 106 15 40.1 95 26
TLP 106 14 38.2 95 26

7 Appendix

Proof of Theorem 1

The proof uses a large deviation probability inequality of [27] to treat one-sided log-likelihood ratios with constraints. This enables us to obtain sharp results without a moment condition on both tails of the log-likelihood ratios.

When K = p0, |ÂL0| ≤ p0. If ÂL0 = A0, then β̂L0 = β̂ml. Let the class of candidate subsets be {A : A ⊉ A0, |A| ≤ p0} for feature selection. Note that A ⊂ {1, ···, p} can be partitioned into (A \ A0) ∪ (A0 ∩ A). Let $B_{kj} = \{\theta = (\beta_A, 0, \eta) : A \nsupseteq A_0, |A_0 \cap A| = k, |A \setminus A_0| = j, (p_0 - k)C_{\min}(\theta^0) \leq h^2(\theta, \theta^0)\}$; k = 0, ···, p0 − 1, j = 1, ···, p0 − k. Note that Bkj involves $\binom{p_0}{k}\binom{p-p_0}{j}$ different subsets A of sizes |A0 ∩ A| = k and |A \ A0| = j. By definition, $\{\theta = (\beta_A, 0, \eta) : A \nsupseteq A_0, C_{\min}(\theta^0) \leq h^2(\theta, \theta^0), |A| \leq p_0\} \subset \bigcup_{k=0}^{p_0-1}\bigcup_{j=1}^{p_0-k} B_{kj}$. Hence

$P(\hat\theta^{L_0} \neq \hat\theta^{ml}) \leq P^*\Big(\sup_{\theta=(\beta,\eta):\,\beta=(\beta_A,0),\, A\nsupseteq A_0,\, |A|\leq p_0}\big(L(\theta) - L(\hat\theta^{ml})\big) > 0\Big) \leq P^*\Big(\sup_{\theta=(\beta,\eta):\,\beta=(\beta_A,0),\, A\nsupseteq A_0,\, |A|\leq p_0}\big(L(\theta) - L(\theta^0)\big) > 0\Big) \leq \sum_{A\subset\{1,\ldots,p\}:\, A\nsupseteq A_0,\, |A|\leq p_0} P^*\Big(\sup_{\theta=(\beta,\eta):\,\beta=(\beta_A,0)}\big(L(\theta) - L(\theta^0)\big) \geq 0\Big) \equiv I,$

where P* is the outer measure and L(θ̂ml) ≥ L(θ^0) by definition.

For I, we apply Theorem 1 of [27] to bound each term. Towards this end, we verify the entropy condition (3.1) there for the local entropy over $\mathcal{B}_A$. Note that under Assumption A, $\varepsilon = \varepsilon_{n,p_0,p} = (2c_0)^{1/2}c_4^{-1}\log(2^{1/2}/c_3)\,(\log p)\big(\frac{p_0}{n}\big)^{1/2}$ satisfies the condition there with respect to ε > 0, that is,

$\sup_{\{0\leq |A|\leq p_0\}}\int_{2^{-8}\varepsilon^2}^{2^{1/2}\varepsilon} H^{1/2}(t/c_3, \mathcal{B}_A)\,dt \leq p_0^{1/2}\, 2^{1/2}\varepsilon\,\log(2/2^{1/2}c_3) \leq c_4 n^{1/2}\varepsilon^2$ (20)

for some constants c3 > 0 and c4 > 0, say c3 = 10 and $c_4 = (2/3)^{5/2}/512$. Moreover, by Theorem 2.6 of [25], $\binom{b}{a} \leq \frac{b^{b+1/2}}{\sqrt{2\pi}\,a^{a+1/2}(b-a)^{b-a+1/2}} \leq \exp\big((a+1/2)\log(b/a) + a\big)$ for any integers a < b. By (3), $C_{\min}(\theta^0) \geq \varepsilon^2_{n,p_0,p}$ implies (20), provided that $d_0 > (2c_0)^{1/2}c_4^{-1}\log(2^{1/2}/c_3)$. Using the facts about binomial coefficients that $\sum_{j=0}^{p_0-k}\binom{p-p_0}{j} \leq (p-p_0+1)^{p_0-k}$ and $\binom{p_0}{i} \leq p_0^i$, we obtain, by

Theorem 1 of [27], that for a constant c2 > 0, say $c_2 = \frac{4}{27}\cdot\frac{1}{1926}$, I is upper bounded by

$\sum_{k=0}^{p_0-1}\sum_{j=0}^{p_0-k} P^*\Big(\sup_{\theta\in B_{kj}}\big(L(\theta)-L(\theta^0)\big) \geq 0\Big) \leq 4\sum_{k=0}^{p_0-1}\binom{p_0}{k}\exp\big(-c_2 n(p_0-k)C_{\min}(\theta^0)\big)\sum_{j=0}^{p_0-k}\binom{p-p_0}{j} \leq 4\sum_{i=1}^{p_0}\exp\Big(-i\big(c_2 nC_{\min}(\theta^0) - \log(p-p_0+1) - \log p_0\big)\Big) \leq 4R\Big(\exp\big(-(c_2 nC_{\min}(\theta^0) - \log(p-p_0+1) - \log p_0)\big)\Big),$

where R(x) = x/(1 − x) is the exponentiated logistic function. Note, moreover, that I ≤ 1 and $\log(p-p_0+1) + \log p_0 \leq 2\log\frac{p+1}{2}$. Then

$I \leq 5\exp\Big(-c_2 nC_{\min}(\theta^0) + 2\log\frac{p+1}{2}\Big) \leq \exp\big(-c_2 nC_{\min}(\theta^0) + 2\log(p+1) + 3\big).$

Finally, (A) follows from P(ÂL0 ≠ A0) ≤ P(θ̂L0 ≠ θ̂ml), (8) and (3) with $d_0 > \frac{2}{c_2}$, as n, p → ∞. For (B), let G = {θ̂L0 ≠ θ̂ml}, so that $P(G) \leq 8\exp(-c_2 nC_{\min}/4)$ by (8) and (3). For the risk property, $Eh^2(\hat\theta^{L_0}, \theta^0) \leq Eh^2(\hat\theta^{ml}, \theta^0) + Eh^2(\hat\theta^{L_0}, \theta^0)I(G)$ is upper bounded by

$Eh^2(\hat\theta^{ml}, \theta^0) + 4\exp(-c_2 nC_{\min}/2) = (1+o(1))Eh^2(\hat\theta^{ml}, \theta^0),$

using the fact that h(θ̂L0, θ0) ≤ 1. Then (B) is established. Similarly (C) follows. This completes the proof.

Proof of Theorem 2

The proof is basically the same as that of Theorem 1, with the modification that A is replaced by $A_{\tau+}$. Now $B_{kj} = \{\theta_{\tau+} : A_{\tau+} \nsupseteq A_0, |A_0 \cap A_{\tau+}| = k, |A_{\tau+} \setminus A_0| = j, d_1(p_0-k)C_{\min}(\theta^0) - d_3 p\tau^{d_2} \leq h^2(\theta_{\tau+}, \theta^0)\}$; j = 1, ···, p0 − k. Then $\{\theta = (\beta_A, 0, \eta) : A \nsupseteq A_0, \sum_{j=1}^p J(|\beta_j|) \leq p_0, C_{\min}(\theta^0) \leq h^2(\theta, \theta^0)\} \subset \bigcup_{k=0}^{p_0-1}\bigcup_{j=1}^{p_0-k} B_{kj}$.

When K = p0, $\sum_{j=1}^p J(|\beta_j|) \leq p_0$, implying that $|\hat A_{\tau+}| \leq p_0$. If $|\hat A_{\tau+}| = p_0$, then $\sum_{j=1}^p |\hat\beta_j|\, I(|\hat\beta_j| < \tau) = 0$, implying that θ̂T = θ̂ml. Hence we focus our attention on the case of $\hat A_{\tau+} \nsupseteq A_0$. Note that, with θ = (β, η) and β = (βA, 0),

$P^*\Big(\sup_{\theta:\,A\nsupseteq A_0,\,\sum_{j=1}^p J(|\beta_j|)\leq p_0}\big(L(\theta)-L(\theta^0)\big) \geq 0\Big) \leq \sum_{k=0}^{p_0-1}\sum_{j=0}^{p_0-k} P^*\Big(\sup_{\theta\in B_{kj}}\big(L(\theta)-L(\theta^0)\big) \geq 0\Big) \leq 4\sum_{k=0}^{p_0-1}\sum_{j=0}^{p_0-k}\binom{p-p_0}{j}\binom{p_0}{k}\exp\big(-c_2 n(d_1 C_{\min}(\theta^0) - d_3 p\tau^{d_2})\big) \leq 5\exp\Big(-(c_2 d_1/2)nC_{\min}(\theta^0) + 2\log\frac{p+1}{2}\Big) \leq \exp\big(-(c_2 d_1/2)nC_{\min}(\theta^0) + 2\log(p+1) + 3\big),$

provided that $\tau \leq \big(d_1 C_{\min}(\theta^0)/(2pd_3)\big)^{1/d_2}$. The rest of the proof proceeds as in the proof of Theorem 1. This completes the proof.

Proof of Theorem 3

The main idea of the proof is the same as that for Theorem 1 of [24], which constructs an approximately least favorable situation for feature selection and uses Fano's Lemma. According to Fano's Lemma [11], for any mapping T = T(Y1, ···, Yn) taking values in S = {1, ···, |S|}, $|S|^{-1}\sum_{j=1}^{|S|} P_j\big(T(Y_1,\ldots,Y_n) = j\big) \leq \frac{|S|^{-2}\sum_{j,k\in S} nK(q_j, q_k) + \log 2}{\log(|S|-1)}$, where K(qj, qk) = ∫ qj log(qj/qk) is the Kullback-Leibler information for densities qj versus qk corresponding to Pj and Pk.

To construct an approximately least favorable set of parameters S for A0 versus $A_0^c$, define β to be $(\gamma_{\min}\mathbf{1}_{A_0}, 0_{A_0^c})$. Let $S = \{\theta_j = (\beta^j, \eta^0)\}_{j=0}^p$ be a collection of parameters with components equal to γmin or 0, satisfying, for any 1 ≤ j, j′ ≤ p, $\|\beta^j - \beta^{j'}\|^2 \leq 4\gamma_{\min}^2$, as defined in Assumption C. Then for any θj, θk ∈ S, $K(\theta_j, \theta_k) \leq r\gamma_{\min}^2 \leq \frac{r}{s}C_{\min}(\theta^0)$ by Assumption C.

By Fano's lemma, $|S|^{-1}\sum_{j\in S} P_j(T = j) \leq \frac{n(r/s)C_{\min}(\theta^0) + \log 2}{\log p}$, implying that

$\sup_{\{\theta: C_{\min}(\theta^0) = R^*\}} P(\hat A \neq A_0) \geq 1 - \frac{nrC_{\min}(\theta^0) + s\log 2}{s\log p},$

which is bounded below by c* with $R^* = \frac{s(1-c^*)\log p}{4rn}$. This yields (12). For (13), it follows that $R^* \geq l$ with $l = d_0\frac{\log p}{n}$ and $d_0 = \frac{(1-c^*)s}{4r}$, for any θ^0 ∈ B0(u, l). This completes the proof.

Proof of Proposition 1

We now verify Assumptions A–C. Note that

$h^2(\beta, \beta^0) = 2E\Big(1 - \exp\big(-\tfrac{1}{8}(\beta^T X - (\beta^0)^T X)^2\big)\Big)$

for linear regression, and h2(β, β0) is

$\frac{1}{2}E\Big[\big(\mu^{1/2}((\beta^0)^T X) - \mu^{1/2}(\beta^T X)\big)^2 + \big((1-\mu((\beta^0)^T X))^{1/2} - (1-\mu(\beta^T X))^{1/2}\big)^2\Big],$

for logistic regression, where $\mu(s) = (1 + \exp(-s))^{-1}$.

Assumption A follows from [14]. Note that $A = A_{\tau+} \cup A_{\tau-}$ with $A_{\tau-} \equiv A \setminus A_{\tau+}$, and $\big|\frac{\partial h^2(\beta, \beta^0)}{\partial\beta_j}\big| \leq \frac{1}{2}E|X_j|$ for 1 ≤ j ≤ p and β in the compact parameter space. Thus

$h^2(\beta, \beta^0) - h^2(\beta_{\tau+}, \beta^0) = \tau\Big|\sum_{j\in A_{\tau-}}\frac{\partial h^2(\beta, \beta^0)}{\partial\beta_j}\Big|_{\beta=\beta^*}\Big| \leq 2\tau\sum_{j\in A_{\tau-}} E|X_j| \leq 2\tau p\max_j \Sigma_{jj}.$

Then Assumption B is fulfilled with d1 = d2 = 1 and d3 = 2 maxj Σjj.

To simplify (3), we derive an inequality through some straightforward calculations: with $\tilde\beta \equiv (\beta_A, 0) - (0, \beta^0_{A_0})$,

$C_{\min}(\beta^0) \geq c_1\min_{\beta_A:\, A\nsupseteq A_0,\, |A|\leq p_0} |A_0\setminus A|^{-1} E\big(\beta_A^T X_A - (\beta^0_{A_0})^T X_{A_0}\big)^2 \geq c_1\min_{\beta_A:\, A\nsupseteq A_0,\, |A|\leq p_0} |A_0\setminus A|^{-1}\tilde\beta^T\Sigma_{A\cup A_0}\tilde\beta \geq c_1\gamma_{\min}^2\min_{B:\, |B|\leq 2p_0,\, A_0\subset B} c_{\min}(\Sigma_B)$

for some constant c1 > 0, because the derivatives of $1 - \exp(-\frac{1}{8}x^2)$ and $(1 + \exp(x))^{-1/2}$ are bounded away from zero under the compactness assumption. This leads to (14). By Theorem 2, the TLP has the properties (A)–(C) there, through tuning.

Finally, $K(\beta^j, \beta^k) \leq cE\big((\beta^j)^T X - (\beta^k)^T X\big)^2 \leq r\gamma_{\min}^2$ by the compactness assumption, where $r = c\max_{(A_j, A_k)} E\big((\beta^j)^T X - (\beta^k)^T X\big)^2/\gamma_{\min}^2$. By Theorem 3, (14), up to the constant d0 > 0, is necessary for any method to be feature selection consistent. This completes the proof.

Proof of Proposition 2

To obtain the desired results, Theorems 1–3 are applied. First a lower bound on Cmin(θ^0) is derived to simplify (3). Given the squared Hellinger distance $h^2(\theta, \theta^0) = 1 - \frac{(\det(\Omega)\det(\Omega^0))^{1/2}}{\det\big(\frac{\Omega+\Omega^0}{2}\big)}\, e^{-\frac{1}{4}(\mu-\mu^0)^T(\Omega+\Omega^0)(\mu-\mu^0)}$, by strong convexity of −log det(Ω), c.f., [3], for any θ ∈ Θ and a constant c* > 0 depending on M1,

$-2\log\big(1 - h^2(\theta, \theta^0)\big) \geq -\frac{1}{2}\big(\log\det(\Omega) + \log\det(\Omega^0)\big) + \log\det\Big(\frac{\Omega+\Omega^0}{2}\Big) \geq \frac{1}{8}\operatorname{tr}\big((\Omega^*)^{-1}(\Omega-\Omega^0)(\Omega^*)^{-1}(\Omega-\Omega^0)\big) \geq c^*|A_0\setminus A|\, c_{\min}(H)\gamma_{\min}^2,$

where A0 and A are as defined in Section 4.2, and Ω* is an intermediate value between Ω and Ω^0; see A.4.3 of [3] for such an expansion. Moreover, $C_{\min}(\theta^0) \geq \inf_{\Omega_A:\, A\nsupseteq A_0,\, |A|\leq p_0} -\log\big(1 - h^2(\Omega, \Omega^0)\big)$, yielding (18).

For Assumption A, note that $|\Omega_{jk}| \leq (\Omega_{jj}\Omega_{kk})^{1/2} \leq M_2$ for j ≠ k, because Ω ≻ 0 and det(Ω) is bounded away from zero. To calculate the bracketing Hellinger metric entropy, we apply Proposition 1 of [22]. Let ΩA be the submatrix of Ω consisting of its p0 nonzero off-diagonal elements. Note that the density g(θ, y) of Y1 is proportional to $h_0(\theta_A, y)\prod_{j\in A^c} h_j(\theta_j, y)$, where $h_0(\theta_A, y) = (\det(\Omega_A))^{1/2}\exp\big(-\frac{1}{2}(y_A - \mu_A)^T\Omega_A(y_A - \mu_A)\big)$, $h_j(\theta_j, y) = \Omega_{jj}^{1/2}\exp\big(-\frac{1}{2}(y_j - \mu_j)^2\Omega_{jj}\big)$, and yA and θA are the sub-vectors of y and θ corresponding to ΩA. Then for some constants kj > 0, j = 1, 2, and any Ω̄, Ω ∈ Θ,

$\|g^{1/2}(\bar\theta, y) - g^{1/2}(\theta, y)\| \leq k\|g(\bar\theta, y) - g(\theta, y)\| \leq k_1 k_2^{p-p_0}\Big(\|h_0(\bar\theta_A, y) - h_0(\theta_A, y)\| + \sum_{j\in A^c}\|h_j(\bar\theta_j, y) - h_j(\theta_j, y)\|\Big).$

This implies that $H(t, \mathcal{B}_A) \leq c_0\big(|A|\log(2\varepsilon p/t) + \log((p-|A|)p/t)\big)$ by [14] for some constant c0, which in turn yields Assumption A. For Assumption B, note that, for j ≠ k = 1, ···, p and any θ ∈ Θ,

$\frac{\partial h^2(\theta, \theta^0)}{\partial\Omega_{jk}} = \frac{1}{4}\big(1 - h^2(\theta, \theta^0)\big)\operatorname{tr}\Big(\Big(2\Big(\frac{\Omega+\Omega^0}{2}\Big)^{-1} - \Omega^{-1}\Big)\Delta_{jk}\Big),$

which is upper bounded by $\frac{1}{c_{\min}(\Omega) + c_{\min}(\Omega^0)} + \frac{1}{4c_{\min}(\Omega)} \leq \frac{2}{M_1}$ for j ≠ k = 1, ···, p. With $A = A_{\tau+} \cup A_{\tau-}$, $h^2(\theta, \theta^0) - h^2(\theta_{\tau+}, \theta^0) = \tau\big|\sum_{(j,k)\in A_{\tau-}}\frac{\partial h^2(\theta, \theta^0)}{\partial\Omega_{jk}}\big|_{\Omega=\Omega^*}\big|$. This implies Assumption B with d1 = d2 = 1 and $d_3 = \frac{2}{M_1}$. For Assumption C, note that the Kullback-Leibler loss for θ^0 versus θ is $\frac{n}{2}\Big(\log\frac{\det(\Omega^0)}{\det(\Omega)} + \operatorname{tr}\big(\Omega(\Omega^0)^{-1}\big) + (\mu-\mu^0)^T\Omega(\mu-\mu^0) - p\Big)$, which is upper bounded by a multiple of $h^2(\theta, \theta^0)$ because the likelihood ratios are uniformly bounded. An application of Taylor's expansion, as in the verification of Assumption A, yields that $-2\log(1 - h^2(\theta, \theta^0)) \leq r\gamma_{\min}^2$, where $r = c^*c_{\max}(H)$, leading to Assumption C.

The results then follow from Theorems 1–3 under (18), with $\varepsilon_{n,p_0,p} = \max\big((p_0\log p)^{1/2}, (p_0\log(p-p_0))^{1/2}\big)n^{-1/2} = \big(\frac{p_0\log p}{n}\big)^{1/2}$ obtained by solving (9). This completes the proof.

Footnotes

*

Research supported in part by NSF grant DMS-0906616, NIH grants 1R01GM081535, 2R01GM081535, HL65462 and R01HL105397. The authors would like to thank the editor, the associate editor and anonymous referees for helpful comments and suggestions.

References

  • 1. Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. Ann Statist. 2008;37:1705–1732.
  • 2. Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J Mach Learn Res. 2008;9:485–516.
  • 3. Boyd S, Vandenberghe L. Convex Optimization. Cambridge Univ. Press; 2004.
  • 4. Chen J, Chen Z. Extended Bayesian information criterion for model selection with large model space. Biometrika. 2008;95:759–771.
  • 5. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32:407–499.
  • 6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  • 7. Fan J, Feng Y, Wu Y. Network exploration via the adaptive Lasso and SCAD penalties. Ann Appl Statist. 2009;3:521–541.
  • 8. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441.
  • 9. Gasso G, Rakotomamonjy A, Canu S. Recovering sparse signals with nonconvex penalties and DC programming. 2009. Submitted.
  • 10. Higgins ME, Claremont M, Major JE, Sander C, Lash AE. CancerGenes: a gene selection resource for cancer genome projects. Nucleic Acids Research. 2007;35(suppl 1):D721–D726.
  • 11. Ibragimov IA, Has'minskii RZ. Statistical Estimation. Springer; New York: 1981.
  • 12. Li H, Gui J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006;7:302–317.
  • 13. Kim Y, Choi H, Oh HS. Smoothly clipped absolute deviation of high dimensions. J Amer Statist Assoc. 2008;103:1665–1673.
  • 14. Kolmogorov AN, Tihomirov VM. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat Nauk. 1959;14:3–86. [In Russian; English translation: Amer Math Soc Transl. 1961;2(17):277–364.]
  • 15. Meinshausen N, Buhlmann P. High dimensional graphs and variable selection with the lasso. Ann Statist. 2006;34:1436–1462.
  • 16. Negahban S, Wainwright M, Ravikumar P, Yu B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. 2010.
  • 17. Raskutti G, Wainwright M, Yu B. Minimax rates of estimation for high-dimensional linear regression over lq balls. Tech Report, UC Berkeley; 2009.
  • 18. Rocha G, Zhao P, Yu B. A path following algorithm for sparse pseudo-likelihood inverse covariance estimation. Technical Report 759, UC Berkeley; 2008.
  • 19. Rothman A, Bickel P, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic J Statist. 2008;2:494–515.
  • 20. Rothman A, Bickel P, Levina E, Zhu J. A new approach to Cholesky-based covariance regularization in high dimensions. Biometrika. 2009. To appear.
  • 21. Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6:461–464.
  • 22. Shen X, Wong WH. Convergence rate of sieve estimates. Ann Statist. 1994;22:580–615.
  • 23. Shen X, Ye J. Adaptive model selection. J Amer Statist Assoc. 2002;97:210–221.
  • 24. Shen X, Zhu Y, Pan W. Necessary and sufficient conditions towards feature selection consistency. Unpublished manuscript; 2010.
  • 25. Stanica P, Montgomery AP. Good lower and upper bounds on binomial coefficients. J Inequal Pure Appl Math. 2001;2:Article 30.
  • 26. Tibshirani R. Regression shrinkage and selection via the LASSO. J Roy Statist Soc Ser B. 1996;58:267–288.
  • 27. Wong WH, Shen X. Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann Statist. 1995;23:339–362.
  • 28. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
  • 29. Wang Y, Klijn JG, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671–679.
  • 30. Wei Z, Li H. Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics. 2007;8:265–284.
  • 31. Yuan M. High dimensional inverse covariance matrix estimation via linear programming. J Mach Learn Res. 2010;11:2261–2286.
  • 32. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942.
  • 33. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann Statist. 2008;36:1509–1566.
  • 34. Zhao P, Yu B. On model selection consistency of Lasso. J Mach Learn Res. 2006;7:2541–2563.
  • 35. Zhou S. Thresholded Lasso for high dimensional variable selection and statistical estimation. Technical report; 2010.
