Published in final edited form as: J Am Stat Assoc. 2010 Jun 1;105(490):727–739. doi: 10.1198/jasa.2010.tm09380

Grouping pursuit through a regularization solution surface *

Xiaotong Shen, Hsin-Cheng Huang
PMCID: PMC2913333  NIHMSID: NIHMS215171  PMID: 20689721

Summary

Extracting grouping structure, or identifying homogenous subgroups of predictors, in regression is crucial for high-dimensional data analysis. One low-dimensional structure in particular, grouping, when captured in a regression model, enhances predictive performance and facilitates a model's interpretability. Grouping pursuit extracts homogenous subgroups of predictors most responsible for outcomes of a response. This is the case in gene network analysis, where grouping reveals gene functionalities with regard to progression of a disease. To address challenges in grouping pursuit, we introduce a novel homotopy method for computing an entire solution surface through regularization involving a piecewise linear penalty. This nonconvex and overcomplete penalty permits adaptive grouping and nearly unbiased estimation, and is treated with a novel concept of grouped subdifferentials and difference convex programming for efficient computation. Finally, the proposed method not only achieves high performance as suggested by numerical analysis, but also has the desired optimality with regard to grouping pursuit and prediction, as shown by our theoretical results.

Keywords: Gene networks, large p but small n, nonconvex minimization, prediction, supervised clustering

1 Introduction

Essential to high-dimensional data analysis is seeking a certain lower-dimensional structure in knowledge discovery, as in web mining. Extracting one kind of lower-dimensional structure, grouping, remains largely unexplored in regression. In gene network analysis, a large amount of current genetic knowledge has been organized in terms of networks, for instance, the Kyoto Encyclopedia of Genes and Genomes (KEGG), a collection of manually drawn pathway maps representing the knowledge about molecular interactions and reactions. In situations such as this, extracting homogenous subnetworks from a network of dependent predictors, most responsible for predicting outcomes of a response, has been one key challenge of biomedical research. There, homogenous subnetworks of genes are usually estimated for understanding a disease's progression. The central issue this article addresses is automatic identification of homogenous subgroups in regression, which we call grouping pursuit.

Now consider a linear model in which response Yi depends on a vector of p predictors:

$$Y_i \equiv \mu(x_i) + \varepsilon_i, \qquad \mu(x) \equiv x^T\beta = \sum_{j=1}^{p} x_j\beta_j, \qquad E(\varepsilon_i) = 0,\ \operatorname{Var}(\varepsilon_i) = \sigma^2; \quad i = 1, \ldots, n,$$ (1)

where β ≡ (β1, …, βp)T is a vector of regression coefficients, xi is independent of εi, and μ(x) is in a generic form, including linear and nonlinear predictors expressed in terms of linear combinations of known bases. Our objective is to identify all possible homogenous subgroups of predictors, for optimal prediction of the outcome of Y. Here homogeneity means that regression coefficients are of similar (same) values, that is, βj1 ≈ ⋯ ≈ βjK within each group {j1, …, jK} ⊂ {1, …, p}. In (1), grouping pursuit estimates all distinct values of β as well as all corresponding subgroups of homogenous predictors.

Grouping pursuit seeks variance reduction of estimation while retaining roughly the same amount of bias, which is advantageous in high-dimensional analysis. First, it collapses predictors whose sample covariances with the residual are of similar values, for best predicting outcomes of Y; c.f., Theorem 4. Moreover, it goes beyond the notion of feature selection, in that it seeks not only a set of redundant predictors, or a single group of zero-coefficient predictors, but also additional homogenous subgroups for further variance reduction. As a result, it yields higher predictive performance. These aspects are confirmed by the numerical and theoretical results in Sections 4 and 5. Second, the price to be paid for adaptive grouping pursuit is estimation of tuning parameters, which is small as compared to its potential gain of a simpler model with higher predictive accuracy.

Grouping pursuit considered here is one kind of supervised clustering. Papers that investigate grouping pursuit include Tibshirani et al. (2005), where the Fused Lasso is proposed using an L1-penalty with respect to a certain serial order, and Bondell and Reich (2008), where the OSCAR penalty involves pairwise L∞-penalties for grouping variables in terms of absolute values, in addition to variable selection. Grouping pursuit differs dramatically from feature selection for grouped predictors, in the sense that the former only groups predictors without removing redundancy, whereas the latter removes redundancy by encouraging grouped predictors to stay together in selection; see Yuan and Lin (2006), and Zhao, Rocha and Yu (2009).

Our primary objective is achieving high accuracy in both grouping and prediction through a computationally efficient method, which seems to be difficult, if not impossible, with existing methods, especially those based on enumeration. To achieve our objective, we employ the regularized least squares method with a piecewise linear nonconvex penalty. The penalty to be introduced in (2) involves one thresholding parameter determining which pairs are to be shrunk towards a common group, which works jointly with one regularization parameter for shrinkage towards an unknown location. These two tuning parameters combine thresholding with shrinkage for achieving adaptive grouping, which is otherwise not possible with shrinkage alone. The penalty is overcomplete in that the number of individual penalty terms may be redundant with regard to certain grouping structures, and it is continuous but with three nondifferentiable points, leading to a significant computational advantage in addition to the desired optimality for grouping pursuit (Theorem 3 and Corollary 1).

Computationally, the proposed penalty poses great challenges in two respects: (a) potential discontinuities and (b) overcompleteness of the penalty, for which an effective treatment does not seem to exist in the literature; see Friedman et al. (2007) about computational challenges for a pathwise coordinate method in this type of situation. To meet the challenges, we design a novel homotopy algorithm to compute the regularization solution surface. The algorithm uses a novel concept of grouped subdifferentials to deal with overcompleteness when tracking the process of grouping, and difference convex (DC) programming to treat discontinuities due to nonconvex minimization. This, together with a model selection routine for estimators that may be discontinuous, permits adaptive grouping pursuit.

Theoretically, we derive a finite-sample probability error bound for our DC estimator, which we call DCE, computed from the homotopy algorithm for grouping pursuit. On this basis, we prove that DCE is consistent with regard to grouping pursuit as well as reconstructing the unbiased least squares estimate under the true grouping, roughly for nearly exponentially many predictors in n, as long as $\log p/n \to 0$; c.f., Theorem 3 for details.

For subnetwork analysis, we apply our proposed method to study predictability of a protein-protein interaction (PPI) network of genes on the time to breast cancer metastasis through gene expression profiles. In Section 5.2, 27 homogenous subnetworks are identified through a Laplacian network weight vector, which surround three tumor suppressor genes, TP53, BRCA1 and BRCA2, for metastasis. There, 17 disease genes that were identified in the study of Wang et al. (2005) belong to 5 groups containing 1, 1, 1, 1, and 13 disease genes, indicating gene functionalities with regard to breast cancer survivability.

This article is organized in seven sections. Section 2 introduces the proposed method and the homotopy algorithm. Section 3 is devoted to selection of tuning parameters. Section 4 presents a theory concerning optimal properties of DCE in grouping pursuit and prediction, followed by numerical examples and an application to breast cancer data in Section 5. Section 6 discusses the proposed method. Finally, the appendix contains technical proofs.

2 Grouping pursuit

In (1), let the true coefficient vector $\beta^0 = (\beta_1^0, \ldots, \beta_p^0)^T$ be $(\alpha_1^0\mathbf{1}_{|\mathcal{G}_1^0|}^T, \ldots, \alpha_{K_0}^0\mathbf{1}_{|\mathcal{G}_{K_0}^0|}^T)^T$, where $K_0$ is the number of distinct groups, $\alpha_1^0 < \cdots < \alpha_{K_0}^0$, and $\mathbf{1}_{|\mathcal{G}_1^0|}$ denotes a vector of 1's of length $|\mathcal{G}_1^0|$. Grouping pursuit, as defined earlier, estimates the true grouping $\mathcal{G}^0 = (\mathcal{G}_1^0, \ldots, \mathcal{G}_{K_0}^0)$ as well as $\alpha^0 = (\alpha_1^0, \ldots, \alpha_{K_0}^0)^T$. Without loss of generality, assume that the response and predictors are centered, that is, $Y^T\mathbf{1} = 0$ and $(x_{1j}, \ldots, x_{nj})\mathbf{1} = 0$; $j = 1, \ldots, p$.

Ideally, one may enumerate over all possible least squares regressions for identifying the best grouping. However, the total number of all possible groupings, which is the pth Bell number (Rota, 1964), is much larger than the number of all possible subsets in feature selection, and hence enumeration is computationally infeasible even for moderate p. For instance, the 10th Bell number is 115,975. To circumvent this difficulty, we develop an automatic nonconvex regularization method to obtain (1) accurate grouping, (2) the least squares estimate based on the true grouping, (3) an efficient homotopy algorithm, and (4) high predictive performance.
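To give a sense of this combinatorial growth, the following minimal sketch (an illustration, not part of the original analysis) computes Bell numbers via the standard Bell-triangle recurrence and contrasts them with the 2^p subsets arising in feature selection.

```python
def bell_numbers(n_max):
    """Return [B_0, B_1, ..., B_{n_max}] using the Bell triangle recurrence."""
    row = [1]          # triangle row corresponding to B_0
    bells = [1]
    for _ in range(n_max):
        new_row = [row[-1]]          # next row starts with the last entry of the previous row
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

if __name__ == "__main__":
    B = bell_numbers(10)
    print(B[10])     # 115975 possible groupings of p = 10 predictors
    print(2 ** 10)   # 1024 possible subsets in feature selection
```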

Our approach utilizes a penalty involving pairwise comparisons: {βjβj′ : 1 ≤ j < j′p}. When βjβj′ = 0, Xj and Xj′ are grouped. By transitivity, that is, βj1βj2 = 0 and βj2βj3 = 0 imply that βj1 = βj2 = βj3, we identify all homogenous groups through p(p − 1)/2 comparisons. Naturally, these comparisons can be conducted through penalized least squares with penalty Σj<j′ |βjβj′|. However, this convex penalty is not desirable for predictive performance, because it is not adaptive for discriminating large from small pairwise differences. As a result, overpenalizing large differences due to shrinking small differences towards zero impedes predictive performance. We thus introduce its nonconvex counterpart J(β) = Σj<j′ G(βjβj′) for adaptive grouping pursuit, where G(z) = λ2 if |z| > λ2 and G(z) = |z| otherwise, with λ2 > 0 being the thresholding parameter. For G(z), one locally convex and two locally concave points at z = 0, ±λ2 enable us to achieve computational advantage, as well as to realize sharp statistical properties. First, the piecewise linearity and the two locally concave points of G(z) yield an efficient method (Algorithm 1), and fast finite-step convergence of the surface algorithm (Algorithm 2). Second, they yield a sharp finite-sample error bound in Theorem 3. These aspects are unique for G(z), which may not be shared by other penalties such as SCAD (Fan and Li, 2001); see the discussion after Theorem 2. A function like G(z) was considered in other contexts such as wavelet denoising (Fan, 1997) and a combined L0 and L1 penalty via integer programming (Liu and Wu, 2007).
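As a concrete illustration of the penalty's shape, the short sketch below (our illustration, not code from the paper) evaluates G(z) = min(|z|, λ2) and highlights the three nondifferentiable points at z = 0 and z = ±λ2.

```python
import numpy as np

def G(z, lam2):
    """Truncated-L1 grouping penalty: |z| below the threshold lam2, flat at lam2 above it."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) > lam2, lam2, np.abs(z))

lam2 = 1.5
z = np.array([-3.0, -1.5, -0.2, 0.0, 0.2, 1.5, 3.0])
print(G(z, lam2))   # [1.5 1.5 0.2 0.  0.2 1.5 1.5]
# The kink at z = 0 is locally convex; the kinks at z = +/- lam2 are locally concave,
# which is what the homotopy algorithm later exploits for piecewise linear tracking.
```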

We now propose our penalized least squares criterion for automatic grouping pursuit:

$$S(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\big(Y_i - x_i^T\beta\big)^2 + \lambda_1 J(\beta), \qquad J(\beta) = \sum_{j<j'} G(\beta_j - \beta_{j'}),$$ (2)

where λ1 > 0 is the regularization parameter controlling the degree of grouping. For (2), no local or global minimizer can be attained at any of the nonsmooth locally concave points of J(β).

Lemma 1 Let h(·) be any differentiable function on ℝp and $\beta^* = (\beta_1^*, \ldots, \beta_p^*)^T$ be a local minimizer of f(β) = h(β) + λ1J(β) with J(·) given in (2). Then $|\beta_j^* - \beta_{j'}^*| \neq \lambda_2$ for j ≠ j′.

2.1 Grouped subdifferentials

We now introduce a novel concept of grouped subdifferentials for a convex function, which constitutes a basis of our homotopy algorithm for tracking the process of grouping.

A subgradient of a convex function f(β) at β is any vector b ∈ ℝp satisfying f(β*) ≥ f(β) + bT(β* − β) for any β* with ‖β* − β‖ sufficiently small, and it reduces to the derivative at a smooth point. The subdifferential of f(β) at β is the set of all such b's, which is either a singleton or a nonsingleton compact set. Let β̂ be a local minimizer of (2) and $(\mathcal{G}_1, \ldots, \mathcal{G}_K)$ be the corresponding grouping, where K is the number of distinct groups. The subgradient of |βj − βj′| with respect to βj at β = β̂ is bjj′ = Sign(β̂j − β̂j′) if |β̂j − β̂j′| > 0, and |bjj′| ≤ 1 otherwise. Here bjj′ is a singleton everywhere except at β̂j − β̂j′ = 0.

To proceed, write β̂ as $(\hat\alpha_1\mathbf{1}_{|\mathcal{G}_1|}^T, \ldots, \hat\alpha_K\mathbf{1}_{|\mathcal{G}_K|}^T)^T$, where the groups are indexed so that α̂1 < ⋯ < α̂K. Define g(j) ≡ k if β̂j = α̂k; j = 1, …, p, mapping indices from β̂ to α̂. Then the grouping $(\mathcal{G}_1, \ldots, \mathcal{G}_K)$ partitions the index set {1, …, p}, with $\mathcal{G}_k \equiv \{j : g(j) = k\}$; k = 1, …, K.

Ordinarily, group splitting can be tracked through certain transition conditions for {bjj′ : j ≠ j′}, c.f., Rosset and Zhu (2007). However, {bjj′ : j ≠ j′} are not estimable from data when an overcomplete penalty is used. To overcome this difficulty, we define the grouped subgradient of index $j \in \mathcal{G}_k$ at β = β̂ as $B_j \equiv \sum_{j' \in \mathcal{G}_k\setminus\{j\}} b_{jj'}$ if $|\mathcal{G}_k| > 1$, and $B_j \equiv 0$ if $|\mathcal{G}_k| = 1$; j = 1, …, p. Note that $\sum_{j \in \mathcal{G}_k} B_j = 0$; k = 1, …, K, because bjj′ = −bj′j; j ≠ j′. Moreover, we define the grouped subgradient of a subset $A \subset \mathcal{G}_k$ at β = β̂ as $B_A \equiv \sum_{j \in A} B_j = \sum_{(j,j') \in A\times(\mathcal{G}_k\setminus A)} b_{jj'}$. Then

$$|B_A| \le |A|\,\big(|\mathcal{G}_k| - |A|\big).$$ (3)

Subsequently, we work with {BA : A ⊂ {1, …, p}}, which can be uniquely determined; see Theorem 1.

2.2 Difference convex programming

This section treats the nondifferentiable nonconvex minimization (2) through DC programming, a principle for nonconvex minimization that relies on decomposing an objective function into a difference of two convex functions. The reader may consult An and Tao (1997) for DC programming. Through this DC method, we design in Section 2.3 a novel homotopy algorithm for a DC solution of (2), that is, a solution obtained through DC programming.

First, we decompose S(β) in (2) into a difference of two convex functions $S_1(\beta) = \frac{1}{2n}\sum_{i=1}^{n}(Y_i - x_i^T\beta)^2 + \lambda_1\sum_{j<j'}|\beta_j - \beta_{j'}|$ and $S_2(\beta) = \lambda_1\sum_{j<j'} G_2(\beta_j - \beta_{j'})$, through a DC decomposition of G(·) = G1(·) − G2(·) with G1(z) = |z| and G2(z) = (|z| − λ2)+, where z+ is the positive part of z. This DC decomposition is interpretable in that S2(·) corrects the estimation bias due to use of the convex penalty $\lambda_1\sum_{j<j'}|\beta_j - \beta_{j'}|$ for the nonconvex problem (2).
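The decomposition can be checked numerically; a minimal sketch (illustration only) verifies G(z) = G1(z) − G2(z) on a grid of z values and also evaluates the subgradient of G2 used in the affine minorization below.

```python
import numpy as np

lam2 = 1.5
z = np.linspace(-4, 4, 81)

G_vals = np.minimum(np.abs(z), lam2)        # nonconvex penalty G(z)
G1 = np.abs(z)                              # convex part G1(z) = |z|
G2 = np.maximum(np.abs(z) - lam2, 0.0)      # convex part G2(z) = (|z| - lam2)_+
assert np.allclose(G_vals, G1 - G2)         # DC decomposition G = G1 - G2

# Subgradient of G2, used in the affine minorization of S2 in (4):
grad_G2 = np.sign(z) * (np.abs(z) > lam2)
```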

Second, we construct a sequence of upper approximations by successively replacing S2(β) at iteration m = 0, 1, …, by its affine minorization based on iteration m − 1, leading to an upper convex approximating function at iteration m:

$$S_1(\beta) - S_2\big(\hat{\hat\beta}{}^{(m-1)}(\lambda_1,\lambda_2)\big) - \big(\beta - \hat{\hat\beta}{}^{(m-1)}(\lambda_1,\lambda_2)\big)^T\,\nabla S_2\big(\hat{\hat\beta}{}^{(m-1)}(\lambda_1,\lambda_2)\big),$$ (4)

where ∇ is the subgradient operator, $\hat{\hat\beta}{}^{(m-1)}(\lambda_1,\lambda_2)$ is the minimizer of (4) at iteration m − 1, and $\hat{\hat\beta}{}^{(-1)}(\lambda_1,\lambda_2) \equiv 0$. The last term in (4) becomes $\lambda_1\sum_{j=1}^{p}\big(\beta_j - \hat{\hat\beta}{}_j^{(m-1)}(\lambda_1,\lambda_2)\big)\times\sum_{j':\,j'\neq j}\nabla G_2\big(\hat{\hat\beta}{}_j^{(m-1)}(\lambda_1,\lambda_2) - \hat{\hat\beta}{}_{j'}^{(m-1)}(\lambda_1,\lambda_2)\big)$, with ∇G2(z) = Sign(z)I(|z| > λ2) being a subgradient of G2 at z.

Third, we utilize the grouped subdifferentials to track the entire solution surface iteratively. One technical difficulty is that $\hat{\hat\beta}{}^{(m)}(\lambda_1,\lambda_2)$ would have jumps in λ1 even if $\hat{\hat\beta}{}^{(0)}(\lambda_1,\lambda_2)$ were piecewise linear in λ1 given λ2, in view of (6) and (7) in Theorem 1. This is undesirable for tracking by continuity through homotopy. For grouping pursuit, we therefore replace $\hat{\hat\beta}{}^{(0)}(\lambda_1,\lambda_2)$ in (4) by $\hat{\hat\beta}{}^{(0)}(\lambda_0,\lambda_2)$, where rough tuning of λ0 suffices because a DC algorithm is not sensitive to an initial value (An and Tao, 1997). This choice leads to a piecewise linear and continuous minimizer β̂(1)(λ) of (4) in λ1 given (λ0, λ2), where λ = (λ0, λ1, λ2)T. Successively replacing $\hat{\hat\beta}{}^{(m-1)}(\lambda_1,\lambda_2)$ in (4) by β̂(m−1)(λ0, λ0, λ2) for m ∈ ℕ, we obtain a modified version of (4):

$$S^{(m)}(\beta) = S_1(\beta) - S_2\big(\hat\beta^{(m-1)}(\lambda_0,\lambda_0,\lambda_2)\big) - \big(\beta - \hat\beta^{(m-1)}(\lambda_0,\lambda_0,\lambda_2)\big)^T\,\nabla S_2\big(\hat\beta^{(m-1)}(\lambda_0,\lambda_0,\lambda_2)\big),$$ (5)

which yields its minimizer β̂(m)(λ) and the estimated grouping $\mathcal{G}^{(m)}(\lambda)$. As suggested by Theorems 1 and 2, β̂(m)(λ) converges in finitely many steps. Most importantly, the iterative scheme yields an estimator having the desired properties of a global minimizer, c.f., Theorem 3.

Given a grouping $\mathcal{G} = (\mathcal{G}_1, \ldots, \mathcal{G}_K)$, let $Z_{\mathcal{G}} = (z_{\mathcal{G}_1}, \ldots, z_{\mathcal{G}_K})$ be an n × K matrix with $z_{\mathcal{G}_k} = X_{\mathcal{G}_k}\mathbf{1}$, and let $X_{\mathcal{G}_k}$ be the design matrix spanned by the predictors in $\mathcal{G}_k$; k = 1, …, K.

Theorem 1 Assume that $Z_{\mathcal{G}^{(m)}(\lambda)}^T Z_{\mathcal{G}^{(m)}(\lambda)}$ is invertible. Then β̂(m)(λ) defined by (5) is piecewise linear in (Y, λ) and continuous in λ1. In addition,

$$\hat\alpha^{(m)}(\lambda) \equiv \big(\hat\alpha_1^{(m)}(\lambda), \ldots, \hat\alpha_{K^{(m)}(\lambda)}^{(m)}(\lambda)\big)^T = \big(Z_{\mathcal{G}^{(m)}(\lambda)}^T Z_{\mathcal{G}^{(m)}(\lambda)}\big)^{-1}\big(Z_{\mathcal{G}^{(m)}(\lambda)}^T Y - n\lambda_1\delta^{(m)}(\lambda)\big),$$ (6)

where $\delta^{(m)}(\lambda) \equiv \big(\delta_1^{(m)}(\lambda), \ldots, \delta_{K^{(m)}(\lambda)}^{(m)}(\lambda)\big)^T$, $\delta_k^{(m)}(\lambda) \equiv \sum_{j\in\mathcal{G}_k^{(m)}(\lambda)}\Delta_j^{(m)}(\lambda)$, and

$$\Delta_j^{(m)}(\lambda) \equiv \sum_{j':\,j'\nsim j}\Big\{\operatorname{Sign}\big(\hat\beta_j^{(m)}(\lambda) - \hat\beta_{j'}^{(m)}(\lambda)\big) - \nabla G_2\big(\hat\beta_j^{(m-1)}(\lambda_0,\lambda_0,\lambda_2) - \hat\beta_{j'}^{(m-1)}(\lambda_0,\lambda_0,\lambda_2)\big)\Big\}.$$ (7)

Moreover, for $j \in \mathcal{G}_k^{(m)}(\lambda)$ with $|\mathcal{G}_k^{(m)}(\lambda)| \ge 2$, and k = 1, …, K(m)(λ),

$$B_j^{(m)}(\lambda) = \frac{1}{n\lambda_1}\,x_j^T\Big(I - Z_{\mathcal{G}^{(m)}(\lambda)}\big(Z_{\mathcal{G}^{(m)}(\lambda)}^T Z_{\mathcal{G}^{(m)}(\lambda)}\big)^{-1}Z_{\mathcal{G}^{(m)}(\lambda)}^T\Big)Y + x_j^T Z_{\mathcal{G}^{(m)}(\lambda)}\big(Z_{\mathcal{G}^{(m)}(\lambda)}^T Z_{\mathcal{G}^{(m)}(\lambda)}\big)^{-1}\delta^{(m)}(\lambda) - \Delta_j^{(m)}(\lambda).$$ (8)

Theorem 1 reveals two important aspects of β̂(m)(λ) from (5). First, α̂(m)(λ) and $B_j^{(m)}(\lambda)$ are continuous and piecewise linear in λ1 and $\lambda_1^{-1}$, respectively, piecewise constant in λ2 with possible jumps, and piecewise linear in Y with possible jumps. In other words, $(\hat\alpha^{(m)}(\lambda), B_j^{(m)}(\lambda))$ is continuous in λ1 given (Y, λ0, λ2), but may contain jumps with respect to (Y, λ2). Second, J(·) shrinks β̂(m)(λ) towards the least squares estimate $(\hat\alpha_1^0, \ldots, \hat\alpha_{K^{(m)}(\lambda)}^0)^T$, with the amount of shrinkage controlled by λ1 > 0. This occurs only when a pairwise difference stays below the thresholding value λ2. As λ2 → 0, the bias due to penalization becomes ignorable, yielding a nearly unbiased estimate for grouping and parameter estimation.

2.3 Algorithms for difference convex solution surface

One efficient computational tool is a homotopy method (Allgower and Georg, 2003; Wu et al., 2009), which utilizes continuity of a solution in λ to compute the entire solution surface simultaneously. To our knowledge, homotopy methods for nonconvex problems have not yet received attention in the literature. This section develops a homotopy method for a regularization solution surface for nonconvex minimization (2) through DC programming. One major computational challenge is that the solution may be piecewise linear with jumps in (Y, λ), which is difficult to treat with homotopy. To overcome this difficulty, we design a DC algorithm to obtain an easily computed solution β̂(λ), which could be local or global. Note that a DC method guarantees a global solution when it is combined with the branch-and-bound method, c.f., Liu, Shen and Wong (2005). However, seeking a global minimizer of (2) is unnecessary, because the DC solution has the desired statistical properties for grouping (Theorem 3), and can be computed more efficiently (Theorem 2).

The main ingredients of our DC homotopy algorithm are (1) iterating entire DC solution surfaces, (2) utilizing the piecewise linear continuity of $(\hat\alpha^{(m)}(\lambda), B_j^{(m)}(\lambda))$ in $(\lambda_1, \lambda_1^{-1})$ for given (λ0, λ2, m), and (3) tracking transition points (joints of a piecewise linear function) through the grouped subdifferentials. This algorithm permits efficient computation of a DC solution of the nonconvex minimization (2) with an overcomplete penalty, which is otherwise difficult to treat. To compute β̂(m)(λ), we proceed as follows. First, we fix one evaluation point of (λ0, λ2), then move along the path from λ1 = ∞ towards λ1 = 0 given (λ0, λ2). By Theorem 1, β̂(m)(λ) is piecewise linear and continuous in λ1 given (λ0, λ2). Along this path, we compute the transition points at which the derivative of β̂(m)(λ) with respect to λ1 changes. Second, we move to other evaluation points of (λ0, λ2) and repeat the above process.

For given (λ0, λ2), transition in λ1 occurs when either of the following conditions is met:

  (A) Merging: Groups $\mathcal{G}_l^{(m)}(\lambda)$ and $\mathcal{G}_k^{(m)}(\lambda)$ are combined at λ when $\hat\alpha_l^{(m)}(\lambda) = \hat\alpha_k^{(m)}(\lambda)$;

  (B) Splitting: Group $\mathcal{G}_k^{(m)}(\lambda)$ is split into two disjoint sets A1 and A2 with $A_1 \cup A_2 = \mathcal{G}_k^{(m)}(\lambda)$ at λ when $|B_{A_1}^{(m)}(\lambda)| = |B_{A_2}^{(m)}(\lambda)| = |A_1|\,|A_2|$, according to (3).

In (A) and (B), at a transition point, two or more groups may be merged, or a single group may be split into two or more subgroups.

We now describe the basic idea for computing {β̂(m)(λ) : λ1 > 0} given (λ0, λ2, m). From (6), we track $(\mathcal{G}^{(m)}(\lambda), \Delta_j^{(m)}(\lambda))$ along the path from λ1 = ∞ towards λ1 = 0 for given (λ0, λ2). Let $\lambda_1^{\circ} > 0$ be the current transition point. Our algorithm successively identifies the next transition point λ1 along the path. For notational ease, we use $(\mathcal{G} = (\mathcal{G}_1, \ldots, \mathcal{G}_K), \Delta_j)$ to denote the current $(\mathcal{G}^{(m)}(\lambda), \Delta_j^{(m)}(\lambda))$ after the current transition; j = 1, …, p. Note that $(\mathcal{G}^{(m)}(\lambda), \Delta_j^{(m)}(\lambda))$; j = 1, …, p, remain unchanged before the next transition is reached.

For merging in (A), we compute potential merge points:

$$m_{kl} \equiv \frac{(e_k - e_l)^T(Z_{\mathcal{G}}^T Z_{\mathcal{G}})^{-1}Z_{\mathcal{G}}^T Y}{n\,(e_k - e_l)^T(Z_{\mathcal{G}}^T Z_{\mathcal{G}})^{-1}\delta}, \qquad \delta = (\delta_1, \ldots, \delta_K)^T,\ \ \delta_k = \sum_{j\in\mathcal{G}_k}\Delta_j;$$ (9)

1 ≤ k < l ≤ K, where ek is the kth column of IK. Then

$$\lambda_{1,A} = \max\big\{m_{kl} \in (0, \lambda_1^{\circ}] : 1 \le k < l \le K\big\}$$ (10)

is a potential transition point at which $\mathcal{G}_{k'}$ and $\mathcal{G}_{l'}$ are combined into one group, where (k′, l′) attains the maximum in (10). Define $\lambda_{1,A} = 0$ if $\{m_{kl} \in (0, \lambda_1^{\circ}] : 1 \le k < l \le K\} = \emptyset$, with ∅ denoting the empty set.
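As a concrete reading of (9)-(10), the sketch below (illustrative only; names such as ZG, delta, and lam1_current are our placeholders, not the paper's) computes all pairwise potential merge points m_kl for a current grouping and keeps the largest one in (0, λ1°].

```python
import numpy as np

def potential_merge_point(ZG, Y, delta, lam1_current):
    """Return (lambda_{1,A}, (k, l)) as in (9)-(10) for the current grouping.

    ZG    : n x K matrix whose kth column sums the predictors of group G_k
    Y     : length-n response vector
    delta : length-K vector, delta_k = sum over j in G_k of Delta_j
    """
    n, K = ZG.shape
    A = np.linalg.inv(ZG.T @ ZG)
    num = A @ (ZG.T @ Y)   # (Z_G^T Z_G)^{-1} Z_G^T Y
    den = A @ delta        # (Z_G^T Z_G)^{-1} delta
    best, pair = 0.0, None
    for k in range(K):
        for l in range(k + 1, K):
            d = n * (den[k] - den[l])
            if d == 0:
                continue
            m_kl = (num[k] - num[l]) / d            # potential merge point (9)
            if 0.0 < m_kl <= lam1_current and m_kl > best:
                best, pair = m_kl, (k, l)           # keep the largest point in (0, lam1_current]
    return best, pair   # best == 0.0 signals the empty set in (10)
```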

For splitting in (B), we utilize (3) and the grouped subdifferentials in (8) to compute, for each k = 1, …, K, the largest $\lambda_1 \in (\lambda_{1,A}, \lambda_1^{\circ}]$ and $A \subset \mathcal{G}_k$ with $|A| < |\mathcal{G}_k|/2$ such that

$$L_k^{+}(\lambda_1, A)\,L_k^{-}(\lambda_1, A) = 0,$$ (11)

where $L_k^{\pm}(\lambda_1, A) \equiv \sum_{j\in A}\big\{\frac{1}{n\lambda_1}\xi_j + \eta_j\big\} \mp |A|\,(|\mathcal{G}_k| - |A|)$, $\eta_j \equiv x_j^T Z_{\mathcal{G}}(Z_{\mathcal{G}}^T Z_{\mathcal{G}})^{-1}\delta - \Delta_j$, and $\xi_j \equiv x_j^T\big(I - Z_{\mathcal{G}}(Z_{\mathcal{G}}^T Z_{\mathcal{G}})^{-1}Z_{\mathcal{G}}^T\big)Y$. It follows from (8) that $B_j^{(m)}(\lambda) = \frac{1}{n\lambda_1}\xi_j + \eta_j$ before the next transition occurs, and hence that $L_k^{\pm}(\lambda_1, \mathcal{G}_k) = \sum_{j\in\mathcal{G}_k}B_j^{(m)}(\lambda) = 0$ for any λ1 ∈ ℝ. Unfortunately, solving (11) through enumeration over all subsets $A \subset \mathcal{G}_k$ is infeasible. In Algorithm 1 below, we develop an efficient strategy utilizing the piecewise linearity of $L_k^{\pm}(\lambda_1, A)$ in $\lambda_1^{-1}$ for computing the potential transition point $\lambda_{1,B} \equiv \max\{s_k \in (\lambda_{1,A}, \lambda_1^{\circ}] : k = 1, \ldots, K\}$, as well as its corresponding grouping, where $s_k$ is the solution of (11); k = 1, …, K. This strategy requires roughly O(p2 log p) operations, c.f., Proposition 1.

To describe our strategy for solving (11), let $A_{k,\ell}^{+}(\lambda_1)$ and $A_{k,\ell}^{-}(\lambda_1)$ be the two subsets of $\mathcal{G}_k$ of size ℓ corresponding to the ℓ largest and the ℓ smallest values of $D_k^{+}(\lambda_1) \equiv \{\frac{1}{n\lambda_1}\xi_j + \eta_j : \xi_j > 0, j \in \mathcal{G}_k\}$ and $D_k^{-}(\lambda_1) \equiv \{\frac{1}{n\lambda_1}\xi_j + \eta_j : \xi_j < 0, j \in \mathcal{G}_k\}$, respectively, for $\ell = 1, \ldots, |\mathcal{G}_k|$ and k = 1, …, K, where $A_{k,\ell}^{\pm}(\lambda_1) \equiv A_{k,\ell-1}^{\pm}(\lambda_1)$ if $|D_k^{\pm}(\lambda_1)| < \ell$. Since $L_k^{\pm}(\lambda_1, \mathcal{G}_k) = 0$, we can refine the search to $\{A_{k,\ell}^{\pm}(\lambda_1) : \ell = 1, \ldots, [|\mathcal{G}_k|/2],\ \lambda_1 \in (\lambda_{1,A}, \lambda_1^{\circ}]\}$ for solving (11). Note that $L_k^{+}(\lambda_1, A_{k,\ell}^{+}(\lambda_1)) \le 0$ and $L_k^{-}(\lambda_1, A_{k,\ell}^{-}(\lambda_1)) \ge 0$ by the definition of the $B_j^{(m)}(\lambda)$'s for $\lambda_1 \in (\lambda_{1,A}, \lambda_1^{\circ}]$. For k = 1, …, K and $\ell = 1, \ldots, [|\mathcal{G}_k|/2]$, we seek the first zero-crossing values of λ1 for $L_k^{\pm}(\lambda_1, A_{k,\ell}^{\pm}(\lambda_1))$, denoted $s_{k\ell}^{\pm}$, as λ1 decreases from $\lambda_1^{\circ}$. For each k = 1, …, K, we start with ℓ = 1 and compute

$$s_{k1}^{\pm} = \max\Big\{\frac{\xi_j}{n\big(\pm(|\mathcal{G}_k| - 1) - \eta_j\big)} \in (\lambda_{1,A}, \lambda_1^{\circ}] : j = 1, \ldots, p\Big\}.$$ (12)

For $\ell = 2, \ldots, [|\mathcal{G}_k|/2]$, we observe that the elements of $A_{k,\ell}^{\pm}(\lambda_1)$ need to be updated as λ1 decreases, due to rank changes in $\{\frac{1}{n\lambda_1}\xi_j + \eta_j : j \in \mathcal{G}_k\}$. These occur at the switching points:

$$h_{jj'} \equiv \frac{\xi_j - \xi_{j'}}{n\,(\eta_{j'} - \eta_j)}, \quad \text{at which}\quad \frac{1}{n\lambda_1}\xi_j + \eta_j = \frac{1}{n\lambda_1}\xi_{j'} + \eta_{j'}; \qquad 1 \le j < j' \le p.$$ (13)

On this basis, $s_{k\ell}^{\pm}$ can be computed. First, calculate the largest $\lambda_1 \in \{\lambda_{1,A}\}\cup\{h_{jj'} : 1 \le j < j' \le p\}$ at which $L_k^{+}(\lambda_1, A_{k,\ell}^{+}(\lambda_1)) \ge 0$ or $L_k^{-}(\lambda_1, A_{k,\ell}^{-}(\lambda_1)) \le 0$. Second, compute the exact crossing point of $L_k^{\pm}(\lambda_1, A_{k,\ell}^{\pm}(\lambda_1)) = 0$ through linear interpolation, using the fact that $L_k^{\pm}(\lambda_1, A_{k,\ell}^{\pm}(\lambda_1))$ is piecewise linear and continuous in $\lambda_1^{-1}$ with joints at the $h_{jj'}$, 1 ≤ j < j′ ≤ p.

Algorithm 1 computes the next transition point, as well as the updated $\mathcal{G}$ and $\Delta_j$'s.

Algorithm 1: Computation of next transition point

Given the current grouping $\mathcal{G}$ and $\Delta_j$'s with $Z_{\mathcal{G}}^T Z_{\mathcal{G}}$ invertible,

  • Step 1 (Potential transition for merging) Compute λ1,A as defined in (10) as well as the corresponding grouping.

  • Step 2 Compute $s_{k1}^{\pm}$; k = 1, …, K, as defined in (12), and $H \equiv \{\lambda_{1,A}\}\cup\{h_{jj'} : 1 \le j < j' \le p\}$ based on (13).

  • Step 3 (Splitting points) Starting with ℓ = 1, and for k = 1, …, K, we

    • compute the largest $\lambda_1 \in H$ such that $L_k^{+}(\lambda_1, A_{k,\ell}^{+}(\lambda_1)) \ge 0$, and the largest $\lambda_1 \in H$ such that $L_k^{-}(\lambda_1, A_{k,\ell}^{-}(\lambda_1)) \le 0$, where bisection or Fibonacci search (e.g., Gill, Murray and Wright, 1981) may be applied;

    • interpolate $L_k^{+}(\lambda_1, A_{k,\ell}^{+}(\lambda_1))$ (resp. $L_k^{-}(\lambda_1, A_{k,\ell}^{-}(\lambda_1))$) linearly to obtain $s_{k\ell}^{+}$ (resp. $s_{k\ell}^{-}$) satisfying $L_k^{+}(s_{k\ell}^{+}, A_{k,\ell}^{+}(s_{k\ell}^{+})) = 0$ (resp. $L_k^{-}(s_{k\ell}^{-}, A_{k,\ell}^{-}(s_{k\ell}^{-})) = 0$); set $s_{k\ell}^{+} = 0$ (resp. $s_{k\ell}^{-} = 0$) if $\max_{\lambda_1\in H} L_k^{+}(\lambda_1, A_{k,\ell}^{+}(\lambda_1)) < 0$ (resp. $\max_{\lambda_1\in H} L_k^{-}(\lambda_1, A_{k,\ell}^{-}(\lambda_1)) > 0$);

    • compute $s_{k\ell} = \max\{s_{k\ell}^{+}, s_{k\ell}^{-}\}$ and the corresponding index set;

    • if $s_{k,\ell-1} > s_{k,\ell}$ and $\ell \le [|\mathcal{G}_k|/2] - 1$, then go to Step 3 with ℓ replaced by ℓ + 1. Otherwise, set $s_{k,\ell+1} = \cdots = s_{k,[|\mathcal{G}_k|/2]} = 0$.

  • Step 4 (Potential transition for splitting) Compute $\lambda_{1,B} = \max\{s_{k\ell} \in [0, \lambda_1^{\circ}] : 1 \le k \le K,\ \ell = 1, \ldots, [|\mathcal{G}_k|/2]\}$, as well as the corresponding grouping.

  • Step 5 (Transition) Compute $\lambda_1 = \max\{\lambda_{1,A}, \lambda_{1,B}\}$. If $\lambda_1 = \lambda_{1,A} > 0$, two groups are merged at λ1. If $\lambda_1 = \lambda_{1,B} > 0$, a group is split into two at λ1. Update $\mathcal{G}$ and the $\Delta_j$'s. If $\lambda_1 = 0$, no further transition can be obtained.
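The control flow of Algorithm 1 can be summarized in the following schematic sketch (our paraphrase under the paper's notation; the helpers merge_point, split_point, and update_grouping are hypothetical placeholders, not routines from the paper).

```python
def next_transition(grouping, Delta, lam1_current,
                    merge_point, split_point, update_grouping):
    """Schematic of Algorithm 1: return the next transition point and updated grouping.

    merge_point(grouping, Delta, lam1_current)          -> (lambda_{1,A}, merge_info)   # Step 1
    split_point(grouping, Delta, lam1_A, lam1_current)  -> (lambda_{1,B}, split_info)   # Steps 2-4
    update_grouping(grouping, Delta, info)              -> (grouping, Delta)            # Step 5
    """
    lam1_A, merge_info = merge_point(grouping, Delta, lam1_current)
    lam1_B, split_info = split_point(grouping, Delta, lam1_A, lam1_current)
    lam1_next = max(lam1_A, lam1_B)
    if lam1_next == 0.0:
        return 0.0, grouping, Delta                  # no further transition (Step 5)
    info = merge_info if lam1_next == lam1_A else split_info
    grouping, Delta = update_grouping(grouping, Delta, info)
    return lam1_next, grouping, Delta
```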

Algorithm 2: Main algorithm

  • Step 1 (Parameter initialization): Specify the upper-bound parameter K* with K* ≤ min{n, p}, and the evaluation points of (λ0, λ2).

  • Step 2 (Initialization for DC iterations): Given (λ0, λ2), compute β̂(0)(λ0, +∞, λ2) and the corresponding $\mathcal{G}$ and Δj's by solving (5) with m = 0. Compute β̂(0)(λ) along the path from λ1 = ∞ to λ1 = 0 until $|\mathcal{G}^{(0)}(\lambda)| = K^*$ using Algorithm 1, while holding (λ0, λ2) fixed.

  • Step 3 (DC iterations): Starting from m = 1, compute β̂(m)(λ0, +∞, λ2) and the corresponding $\mathcal{G}$ and Δj's by solving (5); then successively compute β̂(m)(λ) along the path from λ1 = ∞ to λ1 = 0 until $|\mathcal{G}^{(m)}(\lambda)| = K^*$ using Algorithm 1.

  • Step 4 (Stopping rule): If S(β̂(m−1)(λ)) − S(β̂(m)(λ)) ≠ 0, then go to Step 3 with m replaced by m + 1. Otherwise, move to the next evaluation point of (λ0, λ2) and go to Step 2, until all the evaluation points have been computed.

Denote by m* the termination step of Algorithm 2, which may depend on (λ0, λ2). Our DC estimate (DCE) of β is β̂(λ) ≡ β̂(m*)(λ), with corresponding grouping $\mathcal{G}(\lambda)$. In practice, Algorithm 2 is applicable to unstandardized or standardized predictors.
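A compact reading of Algorithm 2's DC loop, for one evaluation point of (λ0, λ2), is sketched below (illustrative only; fit_path and objective stand in for the path computation of (5) via Algorithm 1 and for S(·) in (2), and are our placeholders).

```python
def dc_surface(lam0, lam2, K_star, fit_path, objective, max_iter=50):
    """Schematic of Algorithm 2 at one (lam0, lam2): repeat DC path fits until
    the objective S stops decreasing, then return the final solution path (DCE)."""
    path = fit_path(lam0, lam2, K_star, warm_start=None)       # m = 0 (Step 2)
    for m in range(1, max_iter + 1):                           # Step 3 (DC iterations)
        new_path = fit_path(lam0, lam2, K_star, warm_start=path)
        # Step 4: terminate when S(beta^(m-1)) - S(beta^(m)) = 0 at lambda_1 = lambda_0
        if objective(new_path, lam0) == objective(path, lam0):
            return new_path                                    # m* reached
        path = new_path
    return path
```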

Proposition 1 (Computational properties). For Algorithm 1, $L_k^{\pm}(\lambda_1, A_{k,\ell}^{\pm}(\lambda_1))$ is continuous, piecewise linear, and strictly monotone in λ1; k = 1, …, K, $\ell = 1, \ldots, [|\mathcal{G}_k|/2]$. Moreover, the computational complexities of Algorithms 1 and 2 are no greater than O(p2(log p + n)) and O(m*n*p2(log p + n)), respectively, where n* is the number of transition points.

In general, it is difficult to bound n* precisely. However, an application of a heuristic argument similar to that of Rosset and Zhu (2007, Section 3.2, p. 1019) suggests that n* is O(min{n, p}) on average for group combining and splitting.

Theorem 2 (Computation). Assume that $Z_{\mathcal{G}^{(m)}(\lambda)}^T Z_{\mathcal{G}^{(m)}(\lambda)}$ is invertible for m ≤ m*. Then the solution of Algorithm 2 is unique, and the sequence S(β̂(m)(λ0, λ0, λ2)) decreases strictly in m unless β̂(m)(λ0, λ0, λ2) = β̂(m−1)(λ0, λ0, λ2). In addition, Algorithm 2 terminates in finitely many steps, i.e., m* < ∞, with

$$\hat\beta^{(m)}(\lambda) = \hat\beta^{(m-1)}(\lambda)$$ (14)

for all λ over the evaluation region of λ and all mm*.

Two distinctive properties of β̂(m*)(λ) are revealed by (14), leading to fast convergence and a sharp consistency result for β̂(λ). First, the stopping rule of Algorithm 2, which is enforced at one fixed λ0, controls the entire surface for all (λ1, λ2) simultaneously, owing to the replacement of $\hat{\hat\beta}{}^{(m-1)}(\lambda_0,\lambda_2)$ by β̂(m−1)(λ0, λ0, λ2) for iteration m in (4). Second, the iteration process terminates finitely with β̂(m*)(λ) satisfying (14), because of the step-function form of ∇S2(β̂(m−1)(λ)) resulting from the locally concave points z = ±λ2 of G(z). Most critically, (14) is not expected for a penalty that is not piecewise linear with nondifferentiable but continuous points.

3 Estimation of tuning parameters and σ2

Selection of the tuning parameters λ = (λ0, λ1, λ2) is important for DCE. In (1), the predictive performance of the estimator β̂(λ) is measured by MSE(β̂(λ)), defined as $\frac{1}{n}E\,L(\hat\beta(\lambda), \beta^0)$, where $L(\hat\beta(\lambda), \beta^0) = \sum_{i=1}^{n}\big(\hat\mu(\lambda, x_i) - \mu^0(x_i)\big)^2$, μ̂(λ, x) = β̂T(λ)x, and μ0(x) = (β0)Tx.

One critical aspect of tuning is that DCE is a piecewise continuous estimator with jumps in Y. Therefore, the model selection routine used for tuning DCE must allow for estimators with discontinuities. For instance, cross-validation and the generalized degrees of freedom (GDF; Shen and Huang, 2006) are applicable, but Stein's unbiased risk estimator (Stein, 1981) is not suited because of its continuity requirement. The tuning parameters are then estimated by minimizing the chosen model selection criterion.

In practice, σ2 needs to be estimated when it is unknown. In the literature, there have been many proposals for the case of p < n; for instance, σ2 can be estimated by the residual sum of squares divided by (n − p). In general, estimation of σ2 in the case of p > n has not yet received much attention. In our case, we propose a simple estimator $\hat\sigma^2 = \frac{1}{n - K^*/2}\sum_{i=1}^{n}\big(Y_i - \hat\mu_i(\lambda^*)\big)^2$, where λ* = (λ0, λ̃1, ∞), λ̃1 is the smallest λ1 at which K(λ0, λ1, ∞) reaches the upper bound K*/2, and K* is defined in Step 2 of Algorithm 2. Note that when λ2 = ∞, μ̂i(λ*) is independent of λ0. The quality of estimation depends on the bias of μ̂i(λ*) as well as its variance. By choosing a tight value of K* ≥ K0, one may achieve good performance.
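In code, the variance estimator above amounts to a degrees-of-freedom-corrected residual variance; a minimal sketch (assuming fitted values mu_hat at λ* = (λ0, λ̃1, ∞) are already available) is:

```python
import numpy as np

def sigma2_hat(Y, mu_hat, K_star):
    """Estimate sigma^2 as in Section 3: RSS at lambda* divided by n - K*/2."""
    Y, mu_hat = np.asarray(Y, float), np.asarray(mu_hat, float)
    n = Y.size
    return np.sum((Y - mu_hat) ** 2) / (n - K_star / 2.0)
```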

4 Theory

This section derives a finite-sample probability error bound, based on which we prove that β̂(λ) is consistent with regard to grouping pursuit and predictive optimality simultaneously for the same set of values of λ. As a result, the true grouping $\mathcal{G}^0$ is reconstructed, as well as the unbiased least squares estimate $\hat\beta^{(ols)} \equiv (\hat\beta_1^{(ols)}, \ldots, \hat\beta_p^{(ols)})^T = (\hat\alpha_1^{(ols)}\mathbf{1}_{|\mathcal{G}_1^0|}^T, \ldots, \hat\alpha_{K_0}^{(ols)}\mathbf{1}_{|\mathcal{G}_{K_0}^0|}^T)^T$ given $\mathcal{G}^0$. Here $\hat\alpha^{(ols)} \equiv (\hat\alpha_1^{(ols)}, \ldots, \hat\alpha_{K_0}^{(ols)})^T = (Z_{\mathcal{G}^0}^T Z_{\mathcal{G}^0})^{-1}Z_{\mathcal{G}^0}^T Y$, with $Z_{\mathcal{G}^0}^T Z_{\mathcal{G}^0}$ being invertible.

Denote by $c_{\min}(\mathcal{G}) > 0$ the smallest eigenvalue of $Z_{\mathcal{G}}^T Z_{\mathcal{G}}/n$, where $Z_{\mathcal{G}}$ is the design matrix based on grouping $\mathcal{G}$. Denote by $\gamma_{\min} \equiv \min\{|\alpha_k^0 - \alpha_l^0| > 0 : 1 \le k < l \le K_0\}$ the resolution level, which may depend on (p, n) and measures the level of difficulty of grouping pursuit, with a small value of γmin being difficult. The following result is established for β̂(λ) from Algorithm 2.

Theorem 3 (Error bounds for grouping pursuit and consistency). Under the model assumptions of (1) with $\varepsilon_i \sim N(0, \sigma^2)$, assume that λ0 = λ1 and $(2K^* + 1)\lambda_1/\lambda_2 < \min_{|\mathcal{G}|\le (K^*)^2} c_{\min}(\mathcal{G})$, where $K_0 < K^* \le \min\{n, p\}$. Then for any n and p, we have

$$P\big(\mathcal{G}(\lambda) \neq \mathcal{G}^0\big) \le P\big(\hat\beta(\lambda) \neq \hat\beta^{(ols)}\big) \le \frac{K_0(K_0-1)}{2}\,\Phi\Big(\frac{n^{1/2}(3\lambda_2/2 - \gamma_{\min})}{2\sigma c_{\min}^{-1/2}(\mathcal{G}^0)}\Big) + p\,\Phi\Big(-\frac{n\lambda_1}{\sigma\max_{1\le j\le p}\|x_j\|}\Big),$$ (15)

where $\Phi(z) = \int_{-\infty}^{z}(2\pi)^{-1/2}\exp(-u^2/2)\,du$ is the cumulative distribution function of N(0, 1), and ‖xj‖ is the L2-norm of $x_j \in \mathbb{R}^n$. Moreover, as p, n → +∞, if

  (i) $\dfrac{n(\gamma_{\min} - 3\lambda_2/2)^2}{8\,c_{\min}(\mathcal{G}^0)\,\sigma^2} - 2\log K_0 \to \infty$ and $0 < \lambda_2 < \dfrac{2}{3}\gamma_{\min}$,

  (ii) $\dfrac{n\lambda_1^2}{2\sigma^2\max_{1\le j\le p}\|x_j\|^2/n} - \log p \to \infty$,

then $P(\mathcal{G}(\lambda) \neq \mathcal{G}^0) \le P(\hat\beta(\lambda) \neq \hat\beta^{(ols)}) \to 0$. In other words, $\mathcal{G}(\lambda) = \mathcal{G}^0$ and β̂(λ) = β̂(ols) with probability tending to 1.

Corollary 1 (Predictive performance) Under the assumptions of Theorem 3, $L(\hat\beta(\lambda), \beta^0)/L(\hat\beta^{(ols)}, \beta^0) \to 1$, and $L(\hat\beta(\lambda), \beta^0) = O_p(K_0/n)$, as p, n → ∞.

Theorem 3 and Corollary 1 say that DCE consistently identifies the true grouping $\mathcal{G}^0$ and reconstructs the unbiased least squares estimator β̂(ols) based on $\mathcal{G}^0$ as p, n → ∞. They also confirm the assertion made in the Introduction of consistency with nearly exponentially many predictors in n. Specifically, consistency occurs when (a) $p = O(\exp(n\lambda_1^2))$ (Condition (ii)), or log p/n → 0, (b) $\lambda_1(2K^*+1) < \lambda_2\min_{|\mathcal{G}|\le (K^*)^2} c_{\min}(\mathcal{G})$, and (c) $0 < \lambda_2 < \frac{2}{3}\gamma_{\min}$ and $n\,c_{\min}^{-1}(\mathcal{G}^0)(\gamma_{\min} - 3\lambda_2/2)^2 \ge 16\sigma^2\log K_0$ (Condition (i)), provided that $\max_{1\le j\le p}\|x_j\|^2/n$ is bounded. Note that $c_{\min}(\mathcal{G}^0)$ may tend to zero as p, n → ∞ even if the number of true groups K0 is independent of (p, n). To understand conditions (a)-(c), we examine the simplest case in which $\min_{|\mathcal{G}|\le (K^*)^2} c_{\min}(\mathcal{G})$, K0 and K* are independent of (p, n). Then (a)-(c) reduce to $p = O(\exp(n\lambda_1^2))$, $n^{1/2}\lambda_1 \to \infty$, and $\lambda_2 > c_1\lambda_1$ but $\lambda_2 \le \frac{2}{3}\gamma_{\min} - d_n$ for some constant c1 > 0 and some sequence dn > 0 with $n^{1/2}d_n \to \infty$, where the resolution level γmin is required to be not too low in that $n^{1/2}\gamma_{\min} \to \infty$. Interestingly, there is a trade-off between p and the resolution level γmin. Note that $p = O(\exp(n^{2\delta}))$ when λ is tuned as λ1 = c2λ2 = c3γmin, given that $\gamma_{\min} = O(n^{-1/2+\delta})$, for some positive constants c2, c3 > 0 and 0 < δ ≤ 1/2. Depending on the value of δ or γmin, p can be nearly exponentially many for high-resolution regression functions with δ = 1/2, whereas p can only be of order $O(\exp(n^{2\delta}))$ for low-resolution functions when δ is close to 0. The resolution level for DCE can be as low as nearly $O(n^{-1/2})$, which, to our knowledge, compares favorably with existing penalties for feature selection.

We now describe the characteristics of grouping, in particular, how predictors are grouped. Denote by $\rho_j(\lambda) \equiv x_j^T\big(Y - X\hat\beta(\lambda)\big)$ the sample covariance between xj and the residual.

Theorem 4 (Grouping). Let $\Delta_j(\lambda) = \Delta_j^{(m^*)}(\lambda)$ be defined as in Theorem 1, where Δj(λ) = Δj′(λ) if j, j′ ∈ $\mathcal{G}_k(\lambda)$. Let $E_k(\lambda) \equiv [\Delta_j(\lambda) - (|\mathcal{G}_k(\lambda)| - 1),\ \Delta_j(\lambda) + (|\mathcal{G}_k(\lambda)| - 1)]$ be an interval (or a point) for any $j \in \mathcal{G}_k(\lambda)$. Then E1(λ), …, EK(λ)(λ) are disjoint. Finally, j belongs to $\mathcal{G}_k(\lambda)$ if and only if $\frac{1}{n\lambda_1}\rho_j(\lambda) \in E_k(\lambda)$; k = 1, …, K(λ).

Theorem 4 says that predictors are grouped according to whether their (scaled) sample covariance values fall into the same interval, where the disjoint intervals E1(λ), …, EK(λ)(λ) characterize the grouping. As λ varies, group splitting or combining may take place when these intervals split or combine.

5 Numerical examples

This section examines effectiveness of the proposed method on three simulated examples and one real application to gene network analysis.

For a fair comparison, we compare DCE with the estimator obtained from its convex counterpart, an "ultra-fused" version of the fused Lasso based on the convex penalty Σj<j′ |βj − βj′|. This allows us to understand the role λ2 plays in grouping. In addition, we examine the Lasso to investigate the connection between grouping pursuit and feature selection, confirming our intuition in the foregoing discussion. For reference, the least squares estimators based on the full model and on the true grouping are reported as well, in addition to the average number of iterations in Algorithm 2. Finally, we compare DCE with OSCAR using two examples from Bondell and Reich (2008) in Example 3.

5.1 Benchmarks

We perform simulations in several scenarios, including correlated predictors, different noise levels, and situations of "small p but large n" and "small n but large p". Note that a decrease in the value of σ2 effectively corresponds to an increase in the sample size in this setting. We therefore fix n = 50 and vary σ2 in Examples 1 and 2, and use different sample sizes in Example 3.

For estimating λ for DCE, we generate an independent tuning set $(x_i^*, y_i^*)_{i=1}^{n}$ of size n in each example. Specifically, the estimated λ, denoted by λ̂, is obtained by minimizing the tuning error $n^{-1}\sum_{i=1}^{n}\big(y_i^* - \hat\mu(\lambda, x_i^*)\big)^2$ over the tuning set with respect to λ over the path of λ1 > 0, $\lambda_0 \in \{i\bar\lambda_1/10 : i = 1, \ldots, 10\}$ and λ2 ∈ {0.5, 1, 1.5, 2, 2.5, 3, 5, 10, ∞} in Examples 1-3, where $\bar\lambda_1$ is the largest transition point corresponding to λ2 = ∞. The predictive performance is evaluated by MSE(β̂(λ̂)) as defined in Section 3. The accuracy of grouping is measured by the percentage of matches between estimated and true pairs of indices (j, j′), for j ≠ j′. Similarly, the tuning parameters of the Lasso and of the convex counterpart of DCE are estimated over λ1 > 0 based on the same tuning set for a fair comparison. All numerical analyses are conducted in R 2.9.1.
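The tuning step amounts to a grid search over (λ0, λ1, λ2) minimizing the tuning-set prediction error; a schematic sketch (our illustration; fit_dce is a hypothetical wrapper around Algorithm 2) follows.

```python
import numpy as np

def tune_dce(X_tune, y_tune, fit_dce, lam0_grid, lam2_grid):
    """Pick (lam0, lam1, lam2) minimizing n^{-1} sum (y* - mu_hat(lambda, x*))^2.

    fit_dce(lam0, lam2) is assumed to return a list of (lam1, beta_hat) pairs
    along the lambda_1 path produced by Algorithm 2.
    """
    best = (np.inf, None)
    for lam0 in lam0_grid:
        for lam2 in lam2_grid:
            for lam1, beta_hat in fit_dce(lam0, lam2):
                err = np.mean((y_tune - X_tune @ beta_hat) ** 2)   # tuning error
                if err < best[0]:
                    best = (err, (lam0, lam1, lam2))
    return best[1]   # estimated lambda-hat
```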

Example 1 (Sparse Grouping)

This example was used previously for feature selection in Zou and Hastie (2005). A random sample {(xi, Yi) : i = 1, …, n} is obtained with n = 50, where Yi follows (1) with εi ∼ N(0, σ2), σ2 = 2, 1, .5, and xi is sampled from N(0, Σ) with p = 20, where Σ has diagonal elements 1 and off-diagonal elements 0.5. Here β = (0, …, 0, 2, …, 2, 0, …, 0, 2, …, 2)T, consisting of four consecutive blocks of length 5.

As suggested in Table 1, DCE outperforms its convex counterpart and the least squares estimate based on the estimated grouping across the three levels of σ2. This is due primarily to the nonconvex penalty, which corrects the estimation bias incurred by its convex counterpart. This is evident from the fact that the convex counterpart of DCE (m = 0) performs worse than the least squares estimates based on the estimated grouping. Furthermore, grouping indeed offers additional improvement in predictive performance in view of the results for the Lasso. Most importantly, DCE reconstructs the least squares estimate based on the true grouping well, confirming the asymptotic results in Theorem 3 and Corollary 1. The reconstruction is nearly perfect when σ2 = .5 but is less so as σ2 increases towards 2, indicating that the noise level does impact the accuracy of reconstruction. Overall, grouping identification and reconstruction appear to be accurate, agreeing with the theoretical results.

Table 1.

MSEs as well as estimated standard errors (in parentheses) of grouping pursuit for various methods based on 100 simulation replications in Example 1. Here Full denotes the least squares estimate based on the full model; True (Grouping) and True (Variable) denote the least squares estimates based on the true grouping and the true variables, respectively; Lasso is the Lasso estimate; Convex is our convex counterpart based on iteration m = 0; and DCE is our estimate.

 n    σ    Full            True (Grouping)  True (Variable)  Lasso           Convex          DCE             Ave # Iter   Ave match
 50   2.0  1.607 (0.0446)  0.235 (0.0150)   0.845 (0.0310)   1.318 (0.0449)  1.418 (0.0483)  0.837 (0.0721)  4.31 (0.17)  0.633
 50   1.0  0.402 (0.0111)  0.059 (0.0037)   0.211 (0.0077)   0.330 (0.0112)  0.362 (0.0128)  0.070 (0.0050)  4.08 (0.14)  0.716
 50   0.5  0.100 (0.0028)  0.015 (0.0009)   0.053 (0.0019)   0.083 (0.0029)  0.091 (0.0032)  0.019 (0.0016)  3.71 (0.10)  0.733
 100  2.0  0.833 (0.0244)  0.129 (0.0083)   0.456 (0.0176)   0.658 (0.0229)  0.699 (0.0234)  0.183 (0.0157)  4.32 (0.16)  0.788
 100  1.0  0.208 (0.0061)  0.032 (0.0021)   0.114 (0.0044)   0.164 (0.0058)  0.175 (0.0059)  0.040 (0.0035)  3.94 (0.12)  0.867
 100  0.5  0.052 (0.0015)  0.008 (0.0005)   0.028 (0.0011)   0.041 (0.0014)  0.044 (0.0015)  0.010 (0.0008)  3.16 (0.06)  0.887
 200  2.0  0.411 (0.0123)  0.060 (0.0041)   0.223 (0.0092)   0.335 (0.0117)  0.364 (0.0129)  0.080 (0.0057)  4.03 (0.13)  0.926
 200  1.0  0.103 (0.0031)  0.015 (0.0010)   0.056 (0.0023)   0.084 (0.0029)  0.091 (0.0032)  0.020 (0.0014)  3.33 (0.06)  0.950
 200  0.5  0.026 (0.0008)  0.004 (0.0003)   0.014 (0.0006)   0.022 (0.0008)  0.023 (0.0008)  0.005 (0.0003)  2.98 (0.01)  0.960

Figure 1 displays the solution paths in λ1 for various values of λ2 with λ0 = .2. Clearly, β̂(λ) is continuous in λ1 given (λ0, λ2) and has jumps in λ2 given (λ0, λ1), as discussed earlier. Figure 2 shows four two-dimensional DCE solution surfaces for (β̂1(λ), β̂2(λ), β̂19(λ), β̂20(λ)) with respect to λ. In Figure 2, the four estimates are close to their corresponding least squares estimates when either λ1 or λ2 becomes small. On the other hand, they tend to be close to each other when both λ1 and λ2 become large. Note that some jumps in λ2 are visible for β̂1(λ) around λ = (0.2, 2.5, 2.5).

Figure 1.

Plots of β̂(λ) as a function of λ1 for various λ2 values with λ0 = .2 and σ = 1.2 in Example 1. Different components of β̂(λ) are represented by different line types and colors.

Figure 2.

Image plots of the regularization solution surfaces of four components β̂1(λ), β̂2(λ), β̂19(λ), and β̂20(λ) as a function of (λ1, λ2) for λ0 = 0.2 and σ = 1.2 in Example 1.

Example 2 (Large p but small n)

A random sample {(xi, Yi) : i = 1, …, n} with n = 50 is obtained, where Yi follows (1) with εi ∼ N(0, σ2), σ = .41, .58, p = 50, 100, and xi is sampled from N(0, Σ) with the (j, k)th element of Σ being 0.5|j−k|. Here β = (3, …, 3, 1.5, …, 1.5, 1, …, 1, 2, …, 2, 0, …, 0)T, consisting of four consecutive blocks of length 5 followed by p − 20 zeros.

In this "large p but small n" example, DCE outperforms its convex counterpart with regard to predictive performance in all cases, but the amounts of improvement vary. Here the Lasso performs slightly better in some cases, which is expected because of the large group of zero-coefficient predictors. Interestingly, the average number of iterations for Algorithm 2 is about 3, as compared to 4 in Example 1. Finally, the matching proportion for grouping is reasonably high.

Example 3 (Small p but large n)

Consider Examples 4 and 5 of Bondell and Reich (2008), which are low-dimensional. Their Example 4 is the same as our Example 1 except that n = 100, p = 40, σ2 = 152, and β = (0, …, 0, 2, …, 2, 0, …, 0, 2, …, 2)T with four consecutive blocks of length 10. Their Example 5 has a similar setting, but with n = 50, p = 40, σ2 = 152, and β = (3, …, 3, 0, …, 0)T with 15 threes followed by 25 zeros, where xi ∼ N(0, V), and V is a block-diagonal matrix with diagonal blocks $\mathbf{1}_5\mathbf{1}_5^T$, $\mathbf{1}_5\mathbf{1}_5^T$, $\mathbf{1}_5\mathbf{1}_5^T$ and $I_{p-15}$. In addition, independent noise distributed as N(0, .16) is added to the first 15 components of xi to generate three equally important groups with pairwise correlations around 0.85.

Overall, DCE performs comparably with OSCAR in these noisy situations with σ2 = 152, performing better in Example 4 and worse in Example 5 of Bondell and Reich (2008). This is mainly because DCE does not select variables beyond grouping. This aspect is also evident from Table 2, where DCE performs worse than the Lasso in some cases. Most noticeably, DCE performs slightly worse than its convex counterpart when σ is very large. This is expected because an adaptive method tends to perform worse than its non-adaptive counterpart in a noisy situation.

Table 2.

MSEs as well as estimated standard errors (in parentheses) for various methods based on 100 simulation replications in Example 2. Here Full denotes the least squares estimate based on the full model; True (Grouping) and True (Variable) denote the least squares estimates based on the true grouping and the true variables, respectively; Lasso is the Lasso estimate; Convex is our convex counterpart based on iteration m = 0; and DCE is our estimate.

 p    n    σ     Full           True (Grouping)  True (Variable)  Lasso          Convex         DCE            Ave # Iter   Ave match
 50   50   0.58  0.327 (0.007)  0.041 (0.002)    0.142 (0.004)    0.227 (0.006)  0.276 (0.006)  0.123 (0.010)  3.01 (0.07)  0.799
 50   100  0.41  0.089 (0.002)  0.010 (0.001)    0.037 (0.001)    0.056 (0.002)  0.075 (0.002)  0.011 (0.001)  3.12 (0.03)  0.862
 100  50   0.58  0.325 (0.007)  0.037 (0.002)    0.136 (0.004)    0.311 (0.007)  0.327 (0.007)  0.325 (0.007)  2.17 (0.04)  0.778
 100  100  0.41  0.165 (0.003)  0.010 (0.001)    0.034 (0.001)    0.073 (0.002)  0.116 (0.003)  0.110 (0.003)  2.06 (0.02)  0.722

5.2 Breast cancer metastasis and gene network

Mapping the pathways giving rise to metastasis is important in breast cancer research. Recent studies suggest that gene expression profiles are useful in identifying gene subnetworks correlated with metastasis. Here we apply our proposed method to understand the functionality of subnetworks of genes for predicting the time to metastasis, which may provide novel hypotheses and confirm the existing theory for pathways involved in tumor progression.

The breast cancer metastasis data (Wang et al., 2005; Chuang et al., 2007) contain gene expression levels of 8141 genes for 286 patients, 107 of whom were detected to develop metastasis within a five year follow-up after surgery. To utilize the present gene network knowledge, we explore the PPI network previously constructed in Chuang et al. (2007).

For breast cancer metastasis, three tumor suppressor genes, TP53, BRCA1 and BRCA2, are known to be crucial in preventing uncontrolled cell proliferation and in repairing chromosomal damage. Certain mutations of these genes increase the risk of breast cancer, c.f., Soussi (2003). In our analysis, we construct a subnetwork of that of Chuang et al. (2007), consisting of the genes TP53, BRCA1 and BRCA2, as well as genes regulated by them. This leads to 294 expressed genes for the 107 patients who developed metastasis.

For subnetwork analysis, consider a vector of p predictors, each corresponding to one node in an undirected graph with edges connecting the nodes. Also available is a vector of network weights w ≡ (w1, …, wp)T, indicating the relative importance of the predictors; the weight vector reflects the biological importance of a "hub" gene. Given predictors $(\tilde{x}_i = (\tilde{x}_{i1}, \ldots, \tilde{x}_{ip})^T)_{i=1}^{n}$,

$$Y_i \equiv \tilde\mu(\tilde{x}_i) + \varepsilon_i, \qquad \tilde\mu(\tilde{x}) \equiv \tilde{x}^T\tilde\beta, \qquad E(\varepsilon_i) = 0,\ \operatorname{Var}(\varepsilon_i) = \sigma^2; \quad i = 1, \ldots, n,$$ (16)

where β̃ ≡ (β̃1, …, β̃p)T is a vector of regression coefficients, and x̃i is independent of εi. In (16), we aim to identify all possible homogenous subnetworks of predictors with respect to w; that is, $w_{j_1}^{-1}\tilde\beta_{j_1} \approx \cdots \approx w_{j_K}^{-1}\tilde\beta_{j_K}$ within each group {j1, …, jK} ⊂ {1, …, p}. Model (16) reduces to (1) by letting $x_i = (w_1\tilde{x}_{i1}, \ldots, w_p\tilde{x}_{ip})^T$ and $\beta = (w_1^{-1}\tilde\beta_1, \ldots, w_p^{-1}\tilde\beta_p)^T$. Note that the existence of a path between nodes j and j′ in the undirected graph indicates whether predictors xj and xj′ can be grouped. However, our network is a complete graph in this application.

For data analysis, the 107 patients are divided randomly into two groups of 70 and 37 patients, for model building and validation, respectively. For model building, an estimated MSE based on the generalized degrees of freedom, obtained through the data perturbation method, c.f., Shen and Huang (2007), is minimized over a set of grid points, namely the path of λ1 > 0, $\lambda_0 \in \{i\bar\lambda_1/10 : i = 1, \ldots, 10\}$, and λ2 ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, ∞}, where $\bar\lambda_1$ is the largest transition point corresponding to λ2 = ∞, to obtain the optimal tuning parameters. Moreover, we take Yi to be the log time to metastasis (in months) and x̃i to be the expression levels with dimension p = 294, together with a Laplacian weight vector $\{w_j = d_j^{-1/2} : j = 1, \ldots, 294\}$, where dj is the number of nodes directly connected to node j, c.f., Li and Li (2008).
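The reduction of (16) to (1) is a simple rescaling of predictors and coefficients by the network weights; a minimal sketch under these assumptions (using the Laplacian weights wj = dj^(-1/2) described above) is:

```python
import numpy as np

def laplacian_weights(degrees):
    """w_j = d_j^{-1/2} for node degrees d_j (assumed positive)."""
    return 1.0 / np.sqrt(np.asarray(degrees, dtype=float))

def to_model_1(X_tilde, w):
    """Rescale predictors so that model (16) takes the form of model (1):
    x_ij = w_j * x~_ij; afterwards beta_j corresponds to beta~_j / w_j."""
    return np.asarray(X_tilde, float) * np.asarray(w, float)   # broadcasts column-wise

# Usage: grouping pursuit is run on X = to_model_1(X_tilde, w); grouped coefficients
# beta_hat translate back to the original scale via beta~_hat = w * beta_hat.
```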

Table 4 summarizes the estimated grouped regression coefficients for the 27 estimated groups, with the corresponding subnetworks displayed in different colors in Figure 3. Interestingly, TP53 and BRCA1 are grouped similarly but differ from BRCA2, as evident from the corresponding color intensities in Figure 3, indicating the different roles they play in the process of metastasis. To make sense of the estimated grouping, we examine the disease genes causing metastasis that were identified in Wang et al. (2005). Among the 17 disease genes expressed in our network, 1, 1, 1, 1, and 13 genes belong to the 2nd, 13th, 14th, 17th, and 18th groups, respectively. In fact, three disease genes form singleton groups, one disease gene belongs to a small group, and 13 of them are in a large group, indicating that genes work in groups according to their functionalities with regard to survivability of breast cancer. In our analysis, the mean squared prediction error is .81 based on the test set of 37 observations, which yields an MSE of .3. This is reasonably good relative to the estimated σ2 = .51.

Table 3.

Median MSEs based on 100 simulation replications in Example 3. Here Full and True denote the least squares estimates based on the full model and the true model; Lasso, E-NET, OSCAR, Convex, and DCE denote the Lasso estimate, the elastic net estimate, the OSCAR estimate, our convex counterpart based on iteration m = 0, and our estimate, respectively. The results for OSCAR and E-NET in Examples 4 and 5 of Bondell and Reich (2008) with σ = 15 are taken from Table 1 there.

 Bondell & Reich  σ   Full   True (Group)  True (Variable)  Lasso  E-NET  OSCAR  Conv  DCE   Ave # Iter  Ave match
 Ex 4             15  95.1   6.2           47.4             45.4   34.4   25.9   21.4  22.0  2           0.516
 Ex 5             15  174.8  11.6          67.4             64.7   40.7   51.8   67.6  70.0  4           0.535
 Ex 5             10  77.7   5.1           30.0             33.1   —      —      35.3  38.0  4           0.575
 Ex 5             5   19.4   1.3           7.5              10.7   —      —      10.4  6.0   4.5         0.626
 Ex 5             1   0.86   0.05          0.30             0.47   —      —      0.43  0.06  3           0.703

Figure 3.

Plot of the PPI subnetwork for the metastasis data, described by an undirected graph with 294 nodes and 326 edges. The three regulating genes TP53, BRCA1 and BRCA2 are represented by large nodes. The 27 estimated groups are colored over the color spectrum, with dark colors corresponding to small estimated group regression coefficients. The plot is produced in Cytoscape.

6 Discussion

This article proposes a novel grouping pursuit method for high-dimensional least squares regression. The proposed method is placed in the framework of likelihood estimation and model selection. It offers a general treatment of a continuous but nondifferentiable nonconvex likelihood problem, where the global penalized maximum likelihood estimate is difficult to obtain. Remarkably, our DC treatment yields the desired statistical properties expected of the global penalized maximum likelihood estimate. This is mainly because the DC score equation defined through subdifferentials equals that of the original nonconvex problem when the termination criterion is met, c.f., Theorem 2. In this process, the continuous but nondifferentiable penalty is essential. At present, the proposed method is not designed for feature selection. To generalize, one may replace J(β) by $\sum_{j=1}^{p} G(\beta_j) + \sum_{j<j'} G(\beta_j - \beta_{j'})$ in (2). Moreover, estimation of the tuning parameters needs to be further investigated with regard to the accuracy of selection. Further research is therefore necessary.

7 Technical proofs

Proof of Lemma 1

We prove by contradiction. Without loss of generality, assume that $\beta^* = (\beta_1^*, \ldots, \beta_p^*)^T$ is a local minimizer of f(β) = h(β) + λ1J(β) attained at the locally nonsmooth concave point λ2 for the pair (1, 2), in that $\beta_1^* - \beta_2^* = \lambda_2$. Let $f_1(\beta_1) = f(\beta_1, \beta_2^*, \ldots, \beta_p^*)$, $h_1(\beta_1) = h(\beta_1, \beta_2^*, \ldots, \beta_p^*)$ and $J_1(\beta_1) = J(\beta_1, \beta_2^*, \ldots, \beta_p^*)$. Denote by b the right derivative of $J_1(\beta_1)$ at $\beta_1^*$; by assumption, its left derivative at $\beta_1^*$ must be b + λ1. Consider the derivative of $h_1(\beta_1)$ at $\beta_1^*$. Note that $f_1(\beta_1)$ achieves a local minimum at $\beta_1^*$, implying that its right derivative at $\beta_1^*$ is no smaller than 0 and its left derivative is no larger than 0. Hence the right derivative of $h_1(\beta_1)$ at $\beta_1^*$ is no smaller than −b while the left derivative is no larger than −b − λ1. This contradicts the fact that $h_1(\beta_1)$ is differentiable in β1, for any λ2 > 0. This completes the proof.

Proof of Theorem 1

To prove continuity in λ1, note that the derivative of (5) with respect to β is continuous in λ1, because $\sum_{j':\,j'\neq j}\nabla G_2\big(\hat\beta_j^{(m-1)}(\lambda_0,\lambda_0,\lambda_2) - \hat\beta_{j'}^{(m-1)}(\lambda_0,\lambda_0,\lambda_2)\big)$ does not depend on λ1. By convexity of (5) in β and uniqueness of β̂(m)(λ), β̂(m)(λ) is continuous in λ1 for each m.

Next we derive an expression for α̂(m)(λ). If K(m)(λ) = 1 and $|\mathcal{G}_k^{(m)}(\lambda)| \ge 2$, then the result follows from (5). Now consider the case K(m)(λ) ≥ 2. For any j = 1, …, p, write $\sum_{j'\neq j} b_{jj'}^{(m)}(\lambda) = \sum_{j':\,j'\sim j} b_{jj'}^{(m)}(\lambda) + \sum_{j':\,j'\nsim j} b_{jj'}^{(m)}(\lambda)$ as $B_j^{(m)}(\lambda) + \sum_{k:\,k\neq g(\lambda,j)} |\mathcal{G}_k^{(m)}(\lambda)|\,\operatorname{Sign}\big(\hat\alpha_{g(\lambda,j)}^{(m)}(\lambda) - \hat\alpha_k^{(m)}(\lambda)\big)$, where j′ ∼ j if j′ and j are in the same group and j′ ≁ j otherwise. Differentiating (5) with respect to β, we obtain

$$-x_j^T\big(Y - Z_{\mathcal{G}^{(m)}(\lambda)}\hat\alpha^{(m)}(\lambda)\big) + n\lambda_1\big(\Delta_j^{(m)}(\lambda) + B_j^{(m)}(\lambda)\big) = 0, \qquad j = 1, \ldots, p,$$ (17)

which is an optimality condition (Rockafellar and Wets, 2003). For k = 1, …, K(m)(λ), invoking the sum-to-zero constraint $\sum_{j\in\mathcal{G}_k^{(m)}(\lambda)}B_j^{(m)}(\lambda) = 0$, we have $-z_{\mathcal{G}_k^{(m)}(\lambda)}^T\big(Y - Z_{\mathcal{G}^{(m)}(\lambda)}\hat\alpha^{(m)}(\lambda)\big) + n\lambda_1\delta_k^{(m)}(\lambda) = 0$, implying (6). Then (8) follows from (17).

Proof of Proposition 1

By definition, $L_k^{\pm}(\lambda_1, A_{k,\ell}^{\pm}(\lambda_1))$ is piecewise linear and continuous in λ1, and is strictly monotone because $\sum_{j\in A_{k,\ell}^{+}(\lambda_1)}\xi_j > 0$ and $\sum_{j\in A_{k,\ell}^{-}(\lambda_1)}\xi_j < 0$.

Let $g_k = |\mathcal{G}_k| - 1$. For Algorithm 1, the complexities of Steps 1 and 2 are O(np2). In Step 3, the sorting needed to compute the search points is no greater than $O(g_k^2\log g_k)$ for group $\mathcal{G}_k$. Hence the complexity of Step 3 is O(p2 log p), using the fact that $\sum_{k=1}^{K} g_k < p$ and that the complexity of the search in Step 3 is no greater than log p for the bisection (Fibonacci) search. Then the complexity of Algorithm 2 is O(m*n*p2(log p + n)). This completes the proof.

Proof of Theorem 2

Uniqueness of the solution follows from strict convexity of S(m)(β) in β for each m, under the assumption that $Z_{\mathcal{G}^{(m)}(\lambda)}^T Z_{\mathcal{G}^{(m)}(\lambda)}$ is invertible.

Our plan is to prove the result for λ1 = λ0 with λ = (λ0, λ0, λ2)T; controlling the iterations at this point then implies the desired result for all λ. In what follows, we set λ1 = λ0 unless indicated otherwise. For convergence of Algorithm 2, it follows from (2) and (5) that, for m ∈ ℕ, 0 ≤ S(β̂(m)(λ)) = S(m+1)(β̂(m)(λ)) ≤ S(m)(β̂(m)(λ)) ≤ S(m)(β̂(m−1)(λ)) = S(β̂(m−1)(λ)). This implies that limm→∞ S(β̂(m)(λ)) exists, thus leading to convergence. To study the number of steps to termination, note that S(m)(β̂(m−1)(λ)) − S(m)(β̂(m)(λ)) can be written as

$$\frac{1}{2n}\big\|X\big(\hat\beta^{(m)}(\lambda) - \hat\beta^{(m-1)}(\lambda)\big)\big\|^2 + \frac{1}{n}\big(Y - X\hat\beta^{(m)}(\lambda)\big)^T X\big(\hat\beta^{(m)}(\lambda) - \hat\beta^{(m-1)}(\lambda)\big) + \lambda_1\Big\{\sum_{j=1}^{p}\big(\hat\beta_j^{(m)}(\lambda) - \hat\beta_j^{(m-1)}(\lambda)\big)\sum_{j':\,j'\neq j}\nabla G_2\big(\hat\beta_j^{(m-1)}(\lambda) - \hat\beta_{j'}^{(m-1)}(\lambda)\big)\Big\} + \lambda_1\Big\{\sum_{j<j'}\big(|\hat\beta_j^{(m-1)}(\lambda) - \hat\beta_{j'}^{(m-1)}(\lambda)| - |\hat\beta_j^{(m)}(\lambda) - \hat\beta_{j'}^{(m)}(\lambda)|\big)\Big\},$$

which can be simplified, using the following equality from (17), $x_j^T\big(Y - X\hat\beta^{(m)}(\lambda)\big) = n\lambda_1\sum_{j':\,j'\neq j}\big\{b_{jj'}^{(m)}(\lambda) - \nabla G_2\big(\hat\beta_j^{(m-1)}(\lambda) - \hat\beta_{j'}^{(m-1)}(\lambda)\big)\big\}$, as $\frac{1}{2n}\|X(\hat\beta^{(m)}(\lambda) - \hat\beta^{(m-1)}(\lambda))\|^2 + \lambda_1\sum_{j<j'}\big\{|z_{jj'}^{(m-1)}(\lambda)| - |z_{jj'}^{(m)}(\lambda)| - b_{jj'}^{(m)}(\lambda)\big(z_{jj'}^{(m-1)}(\lambda) - z_{jj'}^{(m)}(\lambda)\big)\big\}$, where $z_{jj'}^{(m)}(\lambda) = \hat\beta_j^{(m)}(\lambda) - \hat\beta_{j'}^{(m)}(\lambda)$. By convexity of |z|, $|z_{jj'}^{(m-1)}(\lambda)| - |z_{jj'}^{(m)}(\lambda)| - b_{jj'}^{(m)}(\lambda)\big(z_{jj'}^{(m-1)}(\lambda) - z_{jj'}^{(m)}(\lambda)\big) \ge 0$, implying that S(β̂(m−1)(λ)) − S(β̂(m)(λ)) ≥ S(m)(β̂(m−1)(λ)) − S(m)(β̂(m)(λ)) is bounded below by $\frac{1}{2n}\|X(\hat\beta^{(m)}(\lambda) - \hat\beta^{(m-1)}(\lambda))\|^2$. That is, $\frac{1}{2n}\big(\hat\alpha^{(m)}(\lambda) - \hat\alpha^{(m-1)}(\lambda)\big)^T Z_{\mathcal{G}^{(m)}(\lambda)}^T Z_{\mathcal{G}^{(m)}(\lambda)}\big(\hat\alpha^{(m)}(\lambda) - \hat\alpha^{(m-1)}(\lambda)\big)$ is greater than zero unless α̂(m)(λ) = α̂(m−1)(λ).

Finally, finite-step convergence follows from the strict decrease of S(β̂(m)(λ)) in m and the finitely many possible values of ∇S2(β̂(m−1)(λ)) in (5). For (14), note that at termination, ∇S2(β̂(m−1)(λ)) remains unchanged for m ≥ m*, and so does the cost function (5) for m ≥ m*. This implies termination for all λ = (λ0, λ1, λ2)T in (14). This completes the proof.

Proof of Theorem 3

Define event

$$F \equiv \Big\{\min_{k<l}\big|\hat\alpha_k^{(ols)} - \hat\alpha_l^{(ols)}\big| > 3\lambda_2/2\Big\} \cap \bigcap_{k:\,|\mathcal{G}_k^0|>1}\Big\{\max_{j\in\mathcal{G}_k^0}\big|x_j^T\big(Y - X\hat\beta^{(ols)}\big)\big| \le n\lambda_1\big(|\mathcal{G}_k^0| - 1\big)\Big\}.$$

By (17) with m = m* and (14), for k = 1, …, K, β̂(λ) = β̂(m*)(λ) satisfies

$$\begin{cases}\Big(\sum_{j\in\mathcal{G}_k} x_j\Big)^T\big(Y - X\beta\big) - n\lambda_1\Big(\sum_{j\in\mathcal{G}_k}\Delta_j(\beta)\Big) = 0;\\[4pt] \big|x_j^T\big(Y - X\beta\big) - n\lambda_1\Delta_j(\beta)\big| \le n\lambda_1\big(|\mathcal{G}_k| - 1\big); \quad j \in \mathcal{G}_k,\ |\mathcal{G}_k| > 1,\end{cases}$$ (18)

for some partition $(\mathcal{G}_1, \ldots, \mathcal{G}_K)$ of {1, …, p} with K ≤ min{n, p}, where $\Delta_j(\beta) \equiv \sum_{j':\,j'\nsim j}\big\{\operatorname{Sign}(\beta_j - \beta_{j'}) - \nabla G_2(\beta_j - \beta_{j'})\big\}$; j = 1, …, p.

Note that the first event in F, together with the grouped subdifferentials, yields that $\sum_{j\in\mathcal{G}_k^0}\Delta_j(\hat\beta^{(ols)}) = 0$; k = 1, …, K0. This, together with the least squares property that $\big(\sum_{j\in\mathcal{G}_k^0} x_j\big)^T\big(Y - X\hat\beta^{(ols)}\big) = 0$, implies that the first equation of (18) is fulfilled with β = β̂(ols). Moreover, the events in F imply the second equation of (18) with β = β̂(ols). Consequently, β̂(ols) is a solution of (18) on F.

It remains to show that (18) yields the unique minimizer on F. Define

$$\tilde G(z) = \begin{cases} G(z), & \text{if } |z| \le \lambda_2(1-\nu) \text{ or } |z| \ge \lambda_2(1+\nu),\\[2pt] -\dfrac{1}{4\lambda_2\nu}(z-\lambda_2)^2 + \dfrac{1}{2}(z-\lambda_2) + \lambda_2\Big(1-\dfrac{\nu}{4}\Big), & \text{if } |z-\lambda_2| < \lambda_2\nu,\\[2pt] -\dfrac{1}{4\lambda_2\nu}(z+\lambda_2)^2 - \dfrac{1}{2}(z+\lambda_2) + \lambda_2\Big(1-\dfrac{\nu}{4}\Big), & \text{if } |z+\lambda_2| < \nu\lambda_2,\end{cases}$$

for ν = 1/2. Given any grouping $\mathcal{G}$ with $|\mathcal{G}| \le K^*$, $\tilde S(\beta)$ is a function of $\alpha_{\mathcal{G}}$ with $\beta = (\alpha_1\mathbf{1}_{|\mathcal{G}_1|}^T, \ldots, \alpha_K\mathbf{1}_{|\mathcal{G}_K|}^T)^T$. Then $\tilde S(\beta) = \frac{1}{2n}\sum_{i=1}^{n}(Y_i - x_i^T\beta)^2 + \lambda_1\sum_{j<j'}\tilde G(\beta_j - \beta_{j'})$ is strictly convex in $\alpha_{\mathcal{G}} \in \mathbb{R}^{|\mathcal{G}|}$ when $\frac{1}{n}Z_{\mathcal{G}}^T Z_{\mathcal{G}} > \frac{\lambda_1}{\lambda_2}\big\{(|\mathcal{G}|+1)I_{|\mathcal{G}|} - \mathbf{1}_{|\mathcal{G}|}\mathbf{1}_{|\mathcal{G}|}^T\big\}$, which occurs when $c_{\min}(\mathcal{G}) > \frac{\lambda_1}{\lambda_2}(|\mathcal{G}|+1)$. To prove that β̂(ols) is the unique minimizer of $\tilde S(\beta)$, suppose β̃ is another minimizer of $\tilde S(\beta)$ with $\mathcal{G}'$ the corresponding grouping and $|\mathcal{G}'| < K^*$. Because $\min_{|\mathcal{G}|\le (K^*)^2} c_{\min}(\mathcal{G}) > \frac{\lambda_1}{\lambda_2}(K^*+1)$ and $(K^*)^2 \le n$, it follows that $\tilde S(\beta)$ is strictly convex in $\alpha_{\mathcal{G}^0\vee\mathcal{G}'} \in \mathbb{R}^{|\mathcal{G}^0\vee\mathcal{G}'|}$, implying β̃ = β̂(ols), where $\mathcal{G}^0\vee\mathcal{G}'$ is the coarsest common refinement of $\mathcal{G}^0$ and $\mathcal{G}'$ with $|\mathcal{G}^0\vee\mathcal{G}'| \le \min\{n, p\}$.

To prove that β̂(ols) is the unique minimizer of S(β) on F, let $\mathcal{G}^* = \mathcal{G}^0\vee\mathcal{G}$ with $|\mathcal{G}^*| \le K^*$, and let $\hat\alpha_{\mathcal{G}^*}^{(ols)}$ be the estimate corresponding to β̂(ols). By the mean value theorem, $\big\|\nabla_{\alpha_{\mathcal{G}^*}}\tilde S(\beta) - \nabla_{\alpha_{\mathcal{G}^*}}\tilde S(\beta)\big|_{\alpha_{\mathcal{G}^*}=\hat\alpha_{\mathcal{G}^*}^{(ols)}}\big\|$ is lower bounded by

$$\Big\{\min_{|\mathcal{G}|\le (K^*)^2} c_{\min}(\mathcal{G}) - \frac{\lambda_1}{\lambda_2}(K^*+1)\Big\}\,\big\|\alpha_{\mathcal{G}^*} - \hat\alpha_{\mathcal{G}^*}^{(ols)}\big\| > 0.$$ (19)

Note that $\tilde S(\beta) = S(\beta)$ over $E = \{\beta : \big||\alpha_k - \alpha_l| - \lambda_2\big| > \lambda_2/2,\ 1 \le k < l \le |\mathcal{G}|\}$. Moreover, by construction, $\sup_{\alpha_{\mathcal{G}^*}}\big\|\nabla_{\alpha_{\mathcal{G}^*}}S(\beta) - \nabla_{\alpha_{\mathcal{G}^*}}\tilde S(\beta)\big\| \le \lambda_1 K^*/2$ on F, which implies, together with (19), that for any β ≠ β̂(ols) in Ec, $\big\|\nabla_{\alpha_{\mathcal{G}^*}}S(\beta)\big\|$ is bounded below by

$$\big\|\nabla_{\alpha_{\mathcal{G}^*}}\tilde S(\beta) - \nabla_{\alpha_{\mathcal{G}^*}}\tilde S(\beta)\big|_{\alpha_{\mathcal{G}^*}=\hat\alpha_{\mathcal{G}^*}^{(ols)}}\big\| - \big\|\nabla_{\alpha_{\mathcal{G}^*}}S(\beta) - \nabla_{\alpha_{\mathcal{G}^*}}\tilde S(\beta)\big\| \ \ge\ \Big(\min_{|\mathcal{G}|\le (K^*)^2} c_{\min}(\mathcal{G}) - \frac{\lambda_1}{\lambda_2}(K^*+1)\Big)\big\|\alpha_{\mathcal{G}^*} - \hat\alpha_{\mathcal{G}^*}^{(ols)}\big\| - \frac{\lambda_1 K^*}{2},$$

which is further lower bounded by $\big(\min_{|\mathcal{G}|\le (K^*)^2} c_{\min}(\mathcal{G}) - \frac{\lambda_1}{\lambda_2}(2K^*+1)\big)\frac{\lambda_2}{2} > 0$, because $\min_{|\mathcal{G}|\le (K^*)^2} c_{\min}(\mathcal{G}) > \frac{\lambda_1}{\lambda_2}(2K^*+1)$. This, together with Lemma 1, implies that S(β) has no local minimum in Ec on F, and hence it has a unique local minimum on F. On the other hand, β̂(λ) is a local minimizer of S(β) on F. Consequently, β̂(ols) = β̂(λ) on F.

Note that $\hat\alpha_k^{(ols)} - \hat\alpha_l^{(ols)} \sim N\big(\alpha_k^0 - \alpha_l^0,\ \operatorname{Var}(\hat\alpha_k^{(ols)} - \hat\alpha_l^{(ols)})\big)$ with $\operatorname{Var}(\hat\alpha_k^{(ols)} - \hat\alpha_l^{(ols)}) \le 4c_{\min}^{-1}(\mathcal{G}^0)\sigma^2/n$, and $x_j^T\big(Y - X\hat\beta^{(ols)}\big) \sim N\big(0,\ \sigma^2\big\|\big(I - Z_{\mathcal{G}^0}(Z_{\mathcal{G}^0}^T Z_{\mathcal{G}^0})^{-1}Z_{\mathcal{G}^0}^T\big)x_j\big\|^2\big)$ with $\big\|\big(I - Z_{\mathcal{G}^0}(Z_{\mathcal{G}^0}^T Z_{\mathcal{G}^0})^{-1}Z_{\mathcal{G}^0}^T\big)x_j\big\|^2 \le \|x_j\|^2$. It follows that $P(\mathcal{G}(\lambda) \neq \mathcal{G}^0) \le P(\hat\beta(\lambda) \neq \hat\beta^{(ols)}) \le P(F^c)$, which is upper bounded by

$$\sum_{k<l} P\big(|\alpha_k^0 - \alpha_l^0| - |q_{kl}^T\varepsilon| \le 3\lambda_2/2\big) + \sum_{k=1}^{K_0}\sum_{j\in\mathcal{G}_k^0} P\big(|x_j^T(Y - X\hat\beta^{(ols)})| > n\lambda_1(|\mathcal{G}_k^0| - 1)\big) \le \frac{K_0(K_0-1)}{2}\,\Phi\Big(\frac{n^{1/2}(3\lambda_2/2 - \gamma_{\min})}{2\sigma c_{\min}^{-1/2}(\mathcal{G}^0)}\Big) + p\,\Phi\Big(-\frac{n\lambda_1}{\sigma\max_{1\le j\le p}\|x_j\|}\Big),$$

where $q_{kl} = Z_{\mathcal{G}^0}\big(Z_{\mathcal{G}^0}^T Z_{\mathcal{G}^0}\big)^{-1}(e_k - e_l)$ and ek is the kth column of $I_{K_0}$. Using the inequality $\Phi(-|z|) \le \sqrt{2/\pi}\,|z|^{-1}\exp(-z^2/2)$, we obtain the desired bound.

Proof of Corollary 1

It is a direct consequence of Theorem 3 and the least squares property. The proof is thus omitted.

Proof of Theorem 4

By Theorem 2, β̂(λ) = β̂(m*)(λ). From (3), for any $j \in \mathcal{G}_k(\lambda)$; k = 1, …, K(λ), we have $|B_j(\lambda)| \le |\mathcal{G}_k(\lambda)| - 1$. Note further that for $j \in \mathcal{G}_k(\lambda)$, Δj(λ) can be rewritten as $\Delta_j(\lambda) = \sum_{k':\,k'\neq k}\big\{|\mathcal{G}_{k'}(\lambda)|\big(\operatorname{Sign}(\hat\alpha_k(\lambda) - \hat\alpha_{k'}(\lambda)) - \nabla G_2(\hat\alpha_k(\hat\lambda^{(0)}) - \hat\alpha_{k'}(\hat\lambda^{(0)}))\big)\big\}$. By (17), $\big|\rho_j(\lambda) - n\lambda_1\Delta_j(\lambda)\big| \le n\lambda_1|B_j(\lambda)|$, implying $\frac{1}{n\lambda_1}\rho_j(\lambda) \in E_k(\lambda)$.

For the disjointness of the Ek(λ)'s, assume, without loss of generality, that α̂1(λ) < ⋯ < α̂K(λ)(λ). For any $j \in \mathcal{G}_k(\lambda)$, $j' \in \mathcal{G}_{k'}(\lambda)$, and k < k′, $|\Delta_j(\lambda) - \Delta_{j'}(\lambda)| \ge \big(|\mathcal{G}_k(\lambda)| + |\mathcal{G}_{k'}(\lambda)|\big)(k' - k) > \big(|\mathcal{G}_{k'}(\lambda)| - 1\big) + \big(|\mathcal{G}_k(\lambda)| - 1\big)$, implying disjointness.

Table 4.

Estimated group coefficients and group sizes for the breast cancer data in Section 5.2.

 k      1       2       3       4       5       6       7       8       9
 α̂k    −0.634  −0.507  −0.449  −0.416  −0.381  −0.370  −0.244  −0.225  −0.205
 |G_k|  1       1       1       2       1       1       2       1       4

 k      10      11      12      13      14      15      16      17      18
 α̂k    −0.116  −0.059  −0.041  −0.018  −0.017  −0.017  0.006   0.039   0.048
 |G_k|  1       2       1       1       1       3       1       6       237

 k      19      20      21      22      23      24      25      26      27
 α̂k    0.060   0.105   0.110   0.127   0.140   0.247   0.301   0.361   0.392
 |G_k|  2       10      3       5       2       1       1       1       2

Footnotes

*

Xiaotong Shen is Professor, School of Statistics, University of Minnesota, 224 Church Street S.E., Minneapolis, MN 55455 (E-mail: xshen@stat.umn.edu). His research is supported in part by National Science Foundation Grant DMS-0906616 and National Institute of Health Grant 1R01GM081535. Hsin-Cheng Huang is Research Fellow, Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan (E-mail: hchuang@stat.sinica.edu.tw). He is supported in part by Grant NSC 97-2118-M-001-001-MY3. The authors thank the editor, the associate editor and three referees for their helpful comments and suggestions.

References

  • 1. An HLT, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J Global Optim. 1997;11:253–85.
  • 2. Allgower EL, Georg K. Introduction to Numerical Continuation Methods. SIAM; 2003.
  • 3. Bondell HD, Reich BJ. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008;64:115–23. doi: 10.1111/j.1541-0420.2007.00843.x.
  • 4. Chuang HY, Lee EJ, et al. Network-based classification of breast cancer metastasis. Molecular Systems Biology. 2007;3:140. doi: 10.1038/msb4100180.
  • 5. Efron B. The estimation of prediction error: covariance penalties and cross-validation. J Amer Statist Assoc. 2004;99:619–32.
  • 6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–60.
  • 7. Fan J. Comments on "Wavelets in statistics: a review" by A. Antoniadis. J Italian Statist Assoc. 1997;6:131–138.
  • 8. Friedman J, Hastie T, Hofling H, Tibshirani R. Pathwise coordinate optimization. Ann Applied Statist. 2007;1:302–332.
  • 9. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–82. doi: 10.1093/bioinformatics/btn081.
  • 10. Liu S, Shen X, Wong W. Computational developments of ψ-learning. Proc 5th SIAM Intern Conf on Data Mining; Newport, CA; April 2005. pp. 1–12.
  • 11. Liu Y, Wu Y. Variable selection via a combination of the L0 and L1 penalties. J Comput Graph Statist. 2007;16:782–798.
  • 12. Gill PE, Murray W, Wright MH. Practical Optimization. Academic Press; London: 1981.
  • 13. Rockafellar RT, Wets RJ. Variational Analysis. Springer-Verlag; 2003.
  • 14. Rosset S, Zhu J. Piecewise linear regularized solution paths. Ann Statist. 2007;35:1012–30.
  • 15. Rota GC. The number of partitions of a set. American Mathematical Monthly. 1964;71:498–504.
  • 16. Shen X, Huang HC. Optimal model assessment, selection and combination. J Amer Statist Assoc. 2006;101:554–68.
  • 17. Stein C. Estimation of the mean of a multivariate normal distribution. Ann Statist. 1981;9:1135–51.
  • 18. Soussi T. Focus on the p53 gene and cancer: advances in TP53 mutation research. Human Mutation. 2003;21:173–5. doi: 10.1002/humu.10191.
  • 19. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J Royal Statist Soc, Ser B. 2005;67:91–108.
  • 20. Wang Y, Klijn JG, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671–79. doi: 10.1016/S0140-6736(05)17947-1.
  • 21. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J Royal Statist Soc, Ser B. 2006;68:49–67.
  • 22. Wu S, Shen X, Geyer C. Adaptive regularization through entire solution surface. Biometrika. 2009;96:513–527. doi: 10.1093/biomet/asp038.
  • 23. Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. Ann Statist. 2009;37:3468–3497.
  • 24. Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Statist Soc, Ser B. 2005;67:301–20.
