Abstract
High-dimensional regression/classification continues to be an important and challenging problem, especially when features are highly correlated. Feature selection, combined with additional structure information on the features, has been considered promising for improving regression/classification performance. Graph-guided fused lasso (GFlasso) has recently been proposed to facilitate feature selection and graph structure exploitation when features exhibit certain graph structures. However, the formulation in GFlasso relies on pairwise sample correlations to perform feature grouping, which can introduce additional estimation bias. In this paper, we propose three new feature grouping and selection methods to resolve this issue. The first method employs a convex function to penalize the pairwise l∞ norm of connected regression/classification coefficients, achieving simultaneous feature grouping and selection. The second method improves on the first by utilizing a non-convex function to reduce the estimation bias. The third extends the second using a truncated l1 regularization to further reduce the estimation bias. The proposed methods combine feature grouping and feature selection to enhance estimation accuracy. We employ the alternating direction method of multipliers (ADMM) and difference of convex functions (DC) programming to solve the proposed formulations. Our experimental results on synthetic data and two real datasets demonstrate the effectiveness of the proposed methods.
Keywords: Feature grouping, feature selection, undirected graph, l1 regularization, regression, classification
1. INTRODUCTION
High-dimensional regression/classification is challenging due to the curse of dimensionality. Lasso [18] and its various extensions, which can simultaneously perform feature selection and regression/classification, have received increasing attention in this setting. However, in the presence of highly correlated features, lasso tends to select only one of those features, resulting in suboptimal performance [27]. Several methods have been proposed in the literature to address this issue. Shen and Ye [15] introduce an adaptive model selection procedure that corrects the estimation bias through a data-driven penalty based on generalized degrees of freedom. The Elastic Net [27] uses an additional l2 regularizer to encourage highly correlated features to stay together. However, these methods do not incorporate prior knowledge into the regression/classification process, which is critical in many applications. For example, many biological studies have suggested that genes tend to work in groups according to their biological functions, and that there are regulatory relationships between genes [10]. This biological knowledge can be represented as a graph, where the nodes represent the genes and the edges represent the regulatory relationships between genes. Therefore, we want to study how estimation accuracy can be improved using dependency information encoded as a graph.
Given feature grouping information, the group lasso [1, 7, 22, 11] yields a solution with grouped sparsity using an l1/l2 penalty. The original group lasso does not consider overlaps between groups. Zhao et al. [24] extend the group lasso to the case of overlapping groups. Jacob et al. [7] introduce a new penalty function leading to a grouped sparse solution with overlapping groups. Yuan et al. [21] propose an efficient method to solve the overlapping group lasso. Other extensions of the group lasso with tree-structured regularization include [11, 8]. Prior work has demonstrated the benefit of using feature grouping information for high-dimensional regression/classification. However, these methods need the feature groups to be pre-specified. In other words, they only utilize the grouping information to obtain solutions with grouped sparsity, but lack the capability of identifying groups.
There are also a number of existing methods for feature grouping. The fused lasso [19] introduces an l1 regularization method for estimating subgroups in a certain serial order, but the features must be pre-ordered before the fused lasso can be applied. A study of parameter estimation for the fused lasso can be found in [12]. Shen et al. [13] propose a non-convex method to select all possible homogeneous subgroups, but it fails to obtain sparse solutions. OSCAR [2] employs an l1 regularizer and a pairwise l∞ regularizer to perform feature selection and automatic feature grouping. Li and Li [10] suggest a grouping penalty using a Laplacian matrix to force the coefficients to be similar, which can be considered a graph version of the Elastic Net. When the Laplacian matrix is an identity matrix, Laplacian lasso [10, 5] is identical to the Elastic Net. GFlasso employs an l1 regularization over a graph, which penalizes the difference |βi − sign(rij)βj|, to encourage the coefficients βi, βj for features i, j connected by an edge in the graph to be similar when rij > 0, but dissimilar when rij < 0, where rij is the sample correlation between the two features [9]. Although these grouping penalties can improve the performance, they introduce additional estimation bias due to the strict convexity of the penalties or due to possible graph misspecification. For example, additional bias may occur when the signs of the coefficients for two features connected by an edge in the graph are different in Laplacian lasso [10, 5], or when the sign of rij is inaccurate in GFlasso [9].
In this paper, we focus on the simultaneous estimation of grouping and sparseness structures over a given undirected graph. When two features are connected by an edge in the graph, they tend to be grouped: the absolute values of the model coefficients for these two features should be similar or identical. We propose one convex and two non-convex penalties to encourage both sparsity and equality of absolute values of coefficients for connected features. The convex penalty includes a pairwise l∞ regularizer over the graph. The first non-convex penalty improves the convex penalty by penalizing the difference of absolute values of coefficients for connected features. The second non-convex penalty extends the first using a truncated l1 regularization to further reduce the estimation bias. These penalties are designed to resolve the aforementioned issues of Laplacian lasso and GFlasso. The non-convex penalties shrink only small differences in absolute values, so that estimation bias can be reduced; several recent works analyze their theoretical properties [26, 14]. Through ADMM and DC programming, we develop computational methods to solve the proposed formulations. The proposed methods combine the benefits of feature selection and feature grouping to improve regression/classification performance. Due to the equality of absolute values of coefficients, the complexity of the learned model can be reduced. We have performed experiments on synthetic data and two real datasets. The results demonstrate the effectiveness of the proposed methods.
The rest of the paper is organized as follows. We introduce the proposed convex method in Section 2, and the two proposed non-convex methods in Section 3. Experimental results are given in Section 4. We conclude the paper in Section 5.
2. A CONVEX FORMULATION
Consider a linear model in which the response y depends on a vector of p features:

y = Xβ + ε,    (1)

where β ∈ ℝp is a vector of coefficients, X ∈ ℝn×p is the data matrix, and ε is random noise. Given an undirected graph, we try to build a prediction model (regression or classification) incorporating the graph structure information to estimate the nonzero coefficients of β and to identify the feature groups when the number of features p is larger than the sample size n. Let (N, E) be the given undirected graph, where N = {1, 2, …, p} is the set of nodes and E is the set of edges. Node i corresponds to feature xi. If nodes i and j are connected by an edge in E, then features xi and xj tend to be grouped. The formulation of graph OSCAR (GOSCAR) is given by
minβ (1/2)||y − Xβ||² + λ1||β||1 + λ2 Σ(i,j)∈E max{|βi|, |βj|},    (2)
where λ1, λ2 are regularization parameters. We use a pairwise l∞ regularizer to encourage the coefficients to be equal [2], but we only put grouping constraints on the nodes connected by an edge in the given graph. The l1 regularizer encourages sparseness. The pairwise l∞ regularizer puts more penalty on the larger coefficients. Note that max{|βi|, |βj|} can be decomposed as

max{|βi|, |βj|} = (1/2)(|βi + βj| + |βi − βj|),

which can be represented by |uᵀβ| + |vᵀβ|, where u, v are sparse vectors, each with only two non-zero entries: ui = uj = 1/2 and vi = 1/2, vj = −1/2. Thus Eq. (2) can be rewritten in matrix form as
minβ (1/2)||y − Xβ||² + λ1||β||1 + λ2||Tβ||1,    (3)
where T is a sparse matrix constructed from the edge set E.
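For illustration, T can be assembled directly from the edge list. The sketch below (Python, not the authors' MATLAB code) assumes the decomposition above, i.e., each edge contributes two rows with entries ±1/2 so that the grouping term equals λ2||Tβ||1; the helper name build_T is hypothetical.

```python
import numpy as np
from scipy import sparse

def build_T(edges, p):
    """Assemble the sparse matrix T of Eq. (3) from an edge list.

    Each edge (i, j) contributes two rows with entries +/-1/2, so that the
    l1 norm of the two corresponding entries of T @ beta equals
    max{|beta_i|, |beta_j|}.
    """
    rows, cols, vals = [], [], []
    for r, (i, j) in enumerate(edges):
        rows += [2 * r, 2 * r]          # row 2r encodes (beta_i + beta_j) / 2
        cols += [i, j]
        vals += [0.5, 0.5]
        rows += [2 * r + 1, 2 * r + 1]  # row 2r+1 encodes (beta_i - beta_j) / 2
        cols += [i, j]
        vals += [0.5, -0.5]
    return sparse.csr_matrix((vals, (rows, cols)), shape=(2 * len(edges), p))

# quick check: ||T beta||_1 equals the pairwise max penalty
edges = [(0, 1), (1, 2)]
beta = np.array([3.0, -1.0, 2.0])
T = build_T(edges, p=3)
assert np.isclose(np.abs(T @ beta).sum(),
                  sum(max(abs(beta[i]), abs(beta[j])) for i, j in edges))
```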
The proposed formulation is closely related to OSCAR [2]. The penalty of OSCAR is λ1||β||1 + λ2 Σi<j max{|βi|, |βj|}. The l1 regularizer leads to a sparse solution, and the l∞ regularizer encourages the coefficients to be equal. OSCAR can be efficiently solved by accelerated gradient methods, whose key projection can be solved by a simple iterative group merging algorithm [25]. However, OSCAR assumes each node is connected to all the other nodes, which is not sufficient for many applications. Note that OSCAR is a special case of GOSCAR when the graph is complete. GOSCAR, incorporating an arbitrary undirected graph, is much more challenging to solve.
2.1 Algorithm
We propose to solve GOSCAR using the alternating direction method of multipliers (ADMM) [3]. ADMM decomposes a large global problem into a series of smaller local subproblems and coordinates the local solutions to identify the globally optimal solution. ADMM attempts to combine the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization [3]. The problem solved by ADMM takes the form of

minx,z f(x) + g(z)   s.t.   Ax + Bz = c.

ADMM uses a variant of the augmented Lagrangian method and reformulates the problem as follows:

Lρ(x, z, μ) = f(x) + g(z) + μᵀ(Ax + Bz − c) + (ρ/2)||Ax + Bz − c||²,

with μ being the augmented Lagrangian multiplier, and ρ being the non-negative dual update step length. ADMM solves this problem by iteratively minimizing Lρ(x, z, μ) over x, z, and μ. The update rule for ADMM is given by

xk+1 = argminx Lρ(x, zk, μk),
zk+1 = argminz Lρ(xk+1, z, μk),
μk+1 = μk + ρ(Axk+1 + Bzk+1 − c).
Consider the unconstrained optimization problem in Eq. (3), which is equivalent to the following constrained optimization problem:
minβ,q,p (1/2)||y − Xβ||² + λ1||q||1 + λ2||p||1   s.t.   β = q, Tβ = p,    (4)
where q, p are slack variables. Eq. (4) can then be solved by ADMM. The augmented Lagrangian is

Lρ(β, q, p, μ, υ) = (1/2)||y − Xβ||² + λ1||q||1 + λ2||p||1 + μᵀ(β − q) + υᵀ(Tβ − p) + (ρ/2)(||β − q||² + ||Tβ − p||²),

where μ, υ are augmented Lagrangian multipliers.
Update β
In the (k + 1)-th iteration, βk+1 can be updated by minimizing Lρ with q, p, μ, υ fixed:
βk+1 = argminβ (1/2)||y − Xβ||² + (μk)ᵀβ + (υk)ᵀTβ + (ρ/2)(||β − qk||² + ||Tβ − pk||²).    (5)

The above optimization problem is quadratic. The optimal solution is given by βk+1 = F−1bk, where

F = XᵀX + ρ(I + TᵀT),   bk = Xᵀy − μk − Tᵀυk + ρ(qk + Tᵀpk).
The computation of βk+1 involves solving a linear system, which is the most time-consuming part of the whole algorithm. To compute βk+1 efficiently, we compute the Cholesky factorization of F at the beginning of the algorithm:

F = RᵀR.

Note that F is a constant and positive definite matrix. Using the Cholesky factorization, we only need to solve the following two linear systems at each iteration:

Rᵀẑ = bk,   Rβk+1 = ẑ.    (6)
Since R is an upper triangular matrix, solving these two linear systems is very efficient.
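A minimal sketch of this update, assuming dense NumPy arrays and the form of F and bk reconstructed above (the helper names are hypothetical):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def factorize(X, T, rho):
    """Upper-triangular Cholesky factor R of F = X'X + rho * (I + T'T)."""
    p = X.shape[1]
    F = X.T @ X + rho * (np.eye(p) + T.T @ T)
    return cholesky(F, lower=False)           # F = R' R

def solve_beta(R, b):
    """Solve F beta = b via the two triangular systems of Eq. (6)."""
    z = solve_triangular(R, b, trans='T')     # R' z = b
    return solve_triangular(R, z)             # R beta = z
```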
Update q
qk+1 can be obtained by solving

qk+1 = argminq λ1||q||1 + (μk)ᵀ(βk+1 − q) + (ρ/2)||βk+1 − q||²,

which is equivalent to the following problem:

qk+1 = argminq (1/2)||q − (βk+1 + μk/ρ)||² + (λ1/ρ)||q||1.    (7)

Eq. (7) has a closed-form solution, known as soft-thresholding:

qk+1 = Sλ1/ρ(βk+1 + μk/ρ),    (8)

where the soft-thresholding operator Sλ(·) is applied elementwise as Sλ(x)i = sign(xi) max(|xi| − λ, 0).
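The soft-thresholding operator is a one-liner; a sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft-thresholding: sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# e.g. the q-update of Eq. (8): q = soft_threshold(beta + mu / rho, lam1 / rho)
```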
Update p
Similar to updating q, pk+1 can also be obtained by soft-thresholding:
pk+1 = Sλ2/ρ(Tβk+1 + υk/ρ).    (9)
Update μ, υ
μk+1 = μk + ρ(βk+1 − qk+1),   υk+1 = υk + ρ(Tβk+1 − pk+1).    (10)
A summary of GOSCAR is shown in Algorithm 1.
Algorithm 1.
The GOSCAR algorithm
In Algorithm 1, the Cholesky factorization only needs to be computed once, and each iteration involves solving one linear system and two soft-thresholding operations. The time complexity of the soft-thresholding operation in Eq. (8) is O(p). The other one in Eq. (9) involves a matrix-vector multiplication. Due to the sparsity of T, its time complexity is O(ne), where ne is the number of edges. Solving the linear system involves computing bk and solving Eq. (6), whose total time complexity is O(p(p + n) + ne). Thus the time complexity of each iteration is O(p(p + n) + ne).
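Putting the updates together, the following sketch mirrors the structure of Algorithm 1 (a simplified reconstruction, not the authors' MATLAB implementation; the stopping rule, the default ρ, and the optional linear_term hook used later for the non-convex subproblems are assumptions). It reuses the hypothetical helpers build_T, factorize, solve_beta, and soft_threshold from the earlier sketches.

```python
import numpy as np

def goscar_admm(X, y, edges, lam1, lam2, rho=1.0, max_iter=500, tol=1e-4,
                linear_term=None):
    """ADMM sketch for Eq. (4); `linear_term` adds an extra -c'beta term to the
    objective so the same routine can serve as the ncFGS/ncTFGS subproblem solver."""
    n, p = X.shape
    T = np.asarray(build_T(edges, p).todense())
    R = factorize(X, T, rho)                      # Cholesky of F, computed once
    c = np.zeros(p) if linear_term is None else linear_term
    beta = np.zeros(p)
    q, mu = np.zeros(p), np.zeros(p)
    pvar, nu = np.zeros(T.shape[0]), np.zeros(T.shape[0])
    Xty = X.T @ y
    for _ in range(max_iter):
        b = Xty + c - mu - T.T @ nu + rho * (q + T.T @ pvar)
        beta_new = solve_beta(R, b)                                 # Eq. (6)
        q = soft_threshold(beta_new + mu / rho, lam1 / rho)         # Eq. (8)
        pvar = soft_threshold(T @ beta_new + nu / rho, lam2 / rho)  # Eq. (9)
        mu = mu + rho * (beta_new - q)                              # Eq. (10)
        nu = nu + rho * (T @ beta_new - pvar)
        if np.linalg.norm(beta_new - beta) <= tol * max(1.0, np.linalg.norm(beta)):
            return beta_new
        beta = beta_new
    return beta
```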
3. TWO NON-CONVEX FORMULATIONS
The grouping penalty of GOSCAR overcomes the limitation of Laplacian lasso that different signs of the coefficients can introduce additional penalty. However, under the l∞ regularizer, even if |βi| and |βj| are close to each other, the penalty on this pair may still be large due to the property of the max operator, resulting in the coefficient βi or βj being over-penalized. The additional penalty results in biased estimation, especially for large coefficients, as in the lasso case [18]. Another related grouping penalty is that of GFlasso, |βi − sign(rij)βj|, where rij is the pairwise sample correlation. GFlasso relies on the pairwise sample correlation to decide whether βi and βj are enforced to be close or not. When the pairwise sample correlation wrongly estimates the sign between βi and βj, an additional penalty on βi and βj occurs, introducing estimation bias. This motivates our non-convex grouping penalty, ||βi| − |βj||, which shrinks only small differences in absolute values. As a result, estimation bias is reduced compared to these convex grouping penalties. The proposed non-convex methods perform well even when the graph is wrongly specified, unlike GFlasso. Note that the proposed non-convex grouping penalty does not assume the sign of an edge is given; it relies only on the graph structure.
3.1 Non-Convex Formulation I: ncFGS
The proposed non-convex formulation (ncFGS) solves the following optimization problem:
minβ (1/2)||y − Xβ||² + λ1||β||1 + λ2 Σ(i,j)∈E ||βi| − |βj||,    (11)
where the grouping penalty Σ(i,j)∈E ||βi| − |βj|| controls only the magnitudes of the differences of coefficients over the graph, ignoring their signs. Through the l1 regularizer and the grouping penalty, simultaneous feature grouping and selection are performed, where only large coefficients as well as pairwise differences are shrunk.
A computational method for the non-convex optimization in Eq. (11) is through DC programming. We will first give a brief review of DC programming.
A particular DC program on ℝp takes the form of

minβ f(β) = f1(β) − f2(β),

with f1(β) and f2(β) being convex on ℝp. Algorithms to solve DC programs based on duality and local optimality conditions have been introduced in [17]. Due to their local characteristic and the non-convexity of DC programming, these algorithms cannot guarantee that the computed solution is globally optimal. In general, these DC algorithms converge to a local solution, but some researchers have observed that they converge quite often to a global one [16].
To apply DC programming to our problem, we need to decompose the objective function into the difference of two convex functions. We propose to use

f1(β) = (1/2)||y − Xβ||² + λ1||β||1 + λ2 Σ(i,j)∈E (|βi + βj| + |βi − βj|),
f2(β) = λ2 Σ(i,j)∈E (|βi| + |βj|).

The above DC decomposition is based on the following identity: ||βi| − |βj|| = |βi + βj| + |βi − βj| − (|βi| + |βj|). Note that both f1(β) and f2(β) are convex functions.
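The identity is easy to verify numerically; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    bi, bj = rng.normal(size=2)
    lhs = abs(abs(bi) - abs(bj))
    rhs = abs(bi + bj) + abs(bi - bj) - (abs(bi) + abs(bj))
    assert np.isclose(lhs, rhs)
```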
Denote f2(βk) + 〈β − βk, ∂f2(βk)〉 as the affine minorization of f2(β), where 〈·, ·〉 is the inner product. Then DC programming solves Eq. (11) by iteratively solving the following sub-problem:

βk+1 = argminβ f1(β) − 〈β − βk, ∂f2(βk)〉.    (12)
Since 〈βk, ∂f2(βk)〉 is constant, Eq. (12) can be rewritten as
βk+1 = argminβ f1(β) − 〈β, ∂f2(βk)〉.    (13)
Let ck = ∂f2(βk). Note that

(ck)i = λ2 di (1(βik > 0) − 1(βik < 0)),    (14)

where di is the degree of node i, and 1(·) is the indicator function. Hence, the formulation in Eq. (13) is
minβ (1/2)||y − Xβ||² − (ck)ᵀβ + λ1||β||1 + λ2 Σ(i,j)∈E (|βi + βj| + |βi − βj|),    (15)
which is convex. Note that the only differences between the problems in Eq. (2) and Eq. (15) are the linear term (ck)ᵀβ and the second regularization parameter. Similar to GOSCAR, we can solve Eq. (15) using ADMM, which is equivalent to the following optimization problem:
minβ,q,p (1/2)||y − Xβ||² − (ck)ᵀβ + λ1||q||1 + 2λ2||p||1   s.t.   β = q, Tβ = p.    (16)
There is an additional linear term (ck)ᵀβ in updating β compared to Algorithm 1. Hence, we can use Algorithm 1 to solve Eq. (15) with a small change in updating β:

βs+1 = F−1(bs + ck),

where s represents the iteration number in Algorithm 1.
The key steps of ncFGS are shown in Algorithm 2.
Algorithm 2.
The ncFGS algorithm
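A sketch of the corresponding DC outer loop, reusing the hypothetical goscar_admm helper above; the subgradient formula follows Eq. (14), and the 2λ2 rescaling accounts for |βi + βj| + |βi − βj| = 2 max{|βi|, |βj|}. The iteration cap and tolerance are arbitrary choices, not the authors' settings.

```python
import numpy as np

def ncfgs(X, y, edges, lam1, lam2, max_outer=20, tol=1e-5):
    """DC programming sketch for ncFGS (Eq. 11)."""
    p = X.shape[1]
    degree = np.zeros(p)
    for i, j in edges:                         # node degrees d_i
        degree[i] += 1
        degree[j] += 1
    beta = np.zeros(p)
    for _ in range(max_outer):
        c = lam2 * degree * np.sign(beta)      # subgradient of f2 at beta^k, Eq. (14)
        # convex subproblem Eq. (15): a GOSCAR-type problem with an extra
        # linear term -c'beta and grouping parameter 2 * lam2
        beta_new = goscar_admm(X, y, edges, lam1, 2 * lam2, linear_term=c)
        if np.linalg.norm(beta_new - beta) <= tol * max(1.0, np.linalg.norm(beta)):
            return beta_new
        beta = beta_new
    return beta
```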
3.2 Non-Convex Formulation II: ncTFGS
It is known that the bias of lasso is due to the looseness of convex relaxation of l0 regularization. The truncated l1 regularizer, a non-convex regularizer close to the l0 regularizer, has been proposed to resolve the bias issue [23]. The truncated l1 regularizer can recover the exact set of nonzero coefficients under a weaker condition, and has a smaller upper error bound than lasso [23]. Therefore, we propose a truncated grouping penalty to further reduce the estimation bias. The proposed formulation based on the truncated grouping penalty is
minβ (1/2)||y − Xβ||² + fT(β),    (17)

where

fT(β) = λ1 Σi Jτ(|βi|) + λ2 Σ(i,j)∈E Jτ(||βi| − |βj||),

and Jτ(z) = min(z, τ)/τ is the truncated l1 regularizer, a surrogate of the l0 function; τ is a non-negative tuning parameter. Figure 1 shows the difference between the l0 norm, the l1 norm, and Jτ(|x|). When τ → 0, Jτ(|x|) is equivalent to the l0 norm, given by the number of nonzero entries of a vector. When τ ≥ |x|, τJτ(|x|) is equivalent to the l1 norm of x.
Figure 1.

Example of the l0 norm (left), the l1 norm (middle), and Jτ(|x|) (right).
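With Jτ(z) = min(z, τ)/τ as above, a short sketch illustrating the two limiting behaviors (illustrative only):

```python
import numpy as np

def J_tau(z, tau):
    """Truncated l1 surrogate of the l0 function: min(z, tau) / tau, for z >= 0."""
    return np.minimum(z, tau) / tau

x = np.array([0.0, 0.05, 0.5, 2.0])
# tau >= |x_i| for all i: tau * J_tau behaves like the l1 norm
assert np.allclose(5.0 * J_tau(np.abs(x), 5.0), np.abs(x))
# tau -> 0: J_tau counts the nonzero entries (the l0 "norm")
assert np.allclose(J_tau(np.abs(x), 1e-8), (x != 0).astype(float))
```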
Note that Jτ(||βi| − |βj||) can be decomposed as

Jτ(||βi| − |βj||) = (1/τ)(|βi + βj| + |βi − βj|) − (1/τ) max{|βi + βj| + |βi − βj| − τ, |βi| + |βj|},

and a DC decomposition of Jτ(|βi|) is

Jτ(|βi|) = (1/τ)|βi| − (1/τ) max{|βi| − τ, 0}.

Hence, the DC decomposition of fT(β) can be written as fT(β) = fT,1(β) − fT,2(β), where

fT,1(β) = (λ1/τ)||β||1 + (λ2/τ) Σ(i,j)∈E (|βi + βj| + |βi − βj|),
fT,2(β) = (λ1/τ) Σi max{|βi| − τ, 0} + (λ2/τ) Σ(i,j)∈E max{|βi + βj| + |βi − βj| − τ, |βi| + |βj|}.
Let cTk be the subgradient of fT,2 in the (k + 1)-th iteration:

cTk = ∂fT,2(βk).    (18)
Then the subproblem of ncTFGS is

minβ (1/2)||y − Xβ||² − (cTk)ᵀβ + (λ1/τ)||β||1 + (λ2/τ) Σ(i,j)∈E (|βi + βj| + |βi − βj|),    (19)
which can be solved using Algorithm 1 as in ncFGS.
The key steps of ncTFGS are summarized in Algorithm 3.
Algorithm 3.
The ncTFGS algorithm
ncTFGS is an extension of ncFGS. When τ ≥ |βi|, ∀i, ncTFGS with regularization parameters τλ1 and τλ2 is identical to ncFGS (see Figure 3). ncFGS and ncTFGS have the same time complexity. The subproblems of ncFGS and ncTFGS are solved by Algorithm 1. In our experiments, we observed that ncFGS and ncTFGS usually converge in fewer than 10 iterations.
Figure 3.

MSEs (left), s0 (middle), and s (right) of ncFGS and ncTFGS on dataset 1 for fixed λ1 and λ2. The regularization parameters for ncTFGS are τλ1 and τλ2. τ ranges from 0.04 to 4.
4. NUMERICAL RESULTS
We examine the performance of the proposed methods and compare them against lasso, GFlasso, and OSCAR on synthetic datasets and two real datasets: FDG-PET images and Breast Cancer. The experiments are performed on a PC with a dual-core Intel 3.0GHz CPU and 4GB memory. The code is written in MATLAB. The algorithms and their associated penalties are:
Lasso: λ1||β||1;
OSCAR: λ1||β||1 + λ2 Σi<j max{|βi|, |βj|};
GFlasso: λ1||β||1 + λ2 Σ(i,j)∈E |βi − sign(rij)βj|;
GOSCAR: λ1||β||1 + λ2 Σ(i,j)∈E max{|βi|, |βj|};
ncFGS: λ1||β||1 + λ2 Σ(i,j)∈E ||βi| − |βj||;
ncTFGS: λ1 Σi Jτ(|βi|) + λ2 Σ(i,j)∈E Jτ(||βi| − |βj||).
4.1 Efficiency
To evaluate the efficiency of the proposed methods, we conduct experiments on a synthetic dataset with a sample size of 100 and dimensions varying from 100 to 3000. The regression model is y = Xβ + ε, where X ~ N(0, Ip×p), βi ~ N(0, 1), and εi ~ N(0, 0.01²). The graph is randomly generated. The number of edges ne varies from 100 to 3000. The regularization parameters are set as λ1 = λ2 = 0.8 max{|βi|} with ne fixed. Since the graph size affects the penalty, λ1 and λ2 are scaled according to the number of edges to avoid trivial solutions with dimension p fixed. The average computational time based on 30 repetitions is reported in Figure 2. As can be seen in Figure 2, GOSCAR can achieve 1e-4 precision in less than 10s when the dimension and the number of edges are 1000. The computational time of ncTFGS is about 7 times higher than that of GOSCAR in this experiment. The computational time of ncFGS is the same as that of ncTFGS when τ = 100, and very close to that of ncTFGS when τ = 0.15. We can also observe that the proposed methods scale very well with the number of edges. The computational time of the proposed methods increases less than 4 times when the number of edges increases from 100 to 3000. This is not surprising because the time complexity of each iteration in Algorithm 1 is linear with respect to ne, and the sparsity of T makes the algorithm much more efficient. Increasing the dimension is more costly than increasing the number of edges, as the complexity of each iteration is quadratic with respect to p.
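For concreteness, a synthetic instance of this kind might be generated as follows; the uniform edge-sampling scheme is an assumption, since the paper only states that the graph is randomly generated.

```python
import numpy as np

def make_synthetic(n=100, p=1000, n_edges=1000, noise_std=0.01, seed=0):
    """Random regression instance y = X beta + eps with a random edge set."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))                # X ~ N(0, I)
    beta = rng.normal(size=p)                  # beta_i ~ N(0, 1)
    y = X @ beta + noise_std * rng.normal(size=n)
    edges = set()
    while len(edges) < n_edges:                # sample distinct node pairs
        i, j = rng.integers(0, p, size=2)
        if i != j:
            edges.add((min(i, j), max(i, j)))
    return X, y, beta, sorted(edges)
```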
Figure 2.
Comparison of GOSCAR, ncFGS, ncTFGS (τ = 0.15), and ncTFGS (τ = 100) in terms of computation time with different dimensions, precisions, and numbers of edges (in seconds and in logarithmic scale).
4.2 Simulations
We use five synthetic problems that have been commonly used in the sparse learning literature [2, 10] to compare the performance of different methods. The data is generated from the regression model y = Xβ + ε, εi ~ N(0, σ²). The five problems are given by:

1. n = 100, p = 40, and σ = 2, 5, 10. The true parameter β is fixed. X ~ N(0, Sp×p) with sii = 1, ∀i, and sij = 0.5 for i ≠ j.
2. n = 50, p = 40, and σ = 2, 5, 10. The features are generated in groups, with X = [x1, …, x40].
3. Consider a regulatory gene network [10], where the entire network consists of nTF subnetworks, each with one transcription factor (TF) and its 10 regulatory target genes. The data for each subnetwork are generated with sii = 1, s1i = si1 = 0.7 for i ≠ 1, and sij = 0 for i ≠ j, i, j ≠ 1. Then n = 100, p = 110, and σ = 5.
4. Same as 3 except for the true parameter values.
5. Same as 3 except for the true parameter values.
We assume that the features in the same group are connected in the graph, and those in different groups are not connected. We use the mean squared error (MSE) to measure the performance of the estimation of β. For feature grouping and selection, we introduce two separate metrics to measure the accuracy of feature grouping and the accuracy of feature selection. Denote Ii, i = 0, 1, 2, …, K as the index sets of the different groups, where I0 is the index set of the zero coefficients. The feature selection metric s0 measures the accuracy of feature selection, and the feature grouping metric s summarizes the per-group scores si, where si measures the grouping accuracy of group i under the assumption that the absolute values of entries in the same group should be the same, but different from those in different groups. It is clear that 0 ≤ s0, si, s ≤ 1.
For each dataset, we generate n samples for training, as well as n samples for testing. To make the synthetic datasets more challenging, we first randomly select ⌊n/2⌋ coefficients and change their signs, as well as those of the corresponding features. Denote β̃ and X̃ as the coefficients and features after changing signs. Then β̃i = −βi and x̃i = −xi if the i-th coefficient is selected; otherwise, β̃i = βi and x̃i = xi, so that X̃β̃ = Xβ. We apply the different approaches to X̃. The covariance matrix of X is used in GFlasso to simulate the graph misspecification. The results for β converted from β̃ are reported.
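The sign-flip construction can be written compactly; a sketch (hypothetical helper, not the authors' code):

```python
import numpy as np

def flip_signs(X, beta, n_flip, seed=0):
    """Flip the signs of n_flip randomly chosen coefficients and of the
    corresponding feature columns, so that X_tilde @ beta_tilde == X @ beta."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(beta.size, size=n_flip, replace=False)
    beta_t, X_t = beta.copy(), X.copy()
    beta_t[idx] *= -1
    X_t[:, idx] *= -1
    return X_t, beta_t
```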
Figure 3 shows that ncFGS obtains the same results as ncTFGS on dataset 1 with σ = 2 when τ is larger than |βi|. The regularization parameters are τλ1 and τλ2 for ncTFGS, and λ1 and λ2 for ncFGS. Figure 4 shows the average nonzero coefficients obtained on dataset 1 with σ = 2. As can be seen in Figure 4, GOSCAR, ncFGS, and ncTFGS are able to utilize the graph information, and achieve good parameter estimation. Although GFlasso can use the graph information, it performs worse than GOSCAR, ncFGS, and ncTFGS due to the graph misspecification.
Figure 4.
The average nonzero coefficients obtained on dataset 1 with σ = 2: (a) Lasso; (b) GFlasso; (c) OSCAR; (d) GOSCAR; (e) ncFGS; (f) ncTFGS.
The performance in terms of MSEs averaged over 30 simulations is shown in Table 1. As indicated in Table 1, among existing methods (Lasso, GFlasso, OSCAR), GFlasso is the best, except in the two cases where OSCAR is better. GOSCAR is better than the best existing method in all cases except for two, and ncFGS and ncTFGS outperform all the other methods.
Table 1.
Comparison of performance in terms of MSEs and estimated standard deviations (in parentheses) for different methods based on 30 simulations on different synthetic datasets.
| Datasets | Lasso | OSCAR | GFlasso | GOSCAR | ncFGS | ncTFGS |
|---|---|---|---|---|---|---|
| Data1 (σ = 2) | 1.807 (0.331) | 1.441 (0.318) | 1.080 (0.276) | 0.315 (0.157) | 0.123 (0.075) | 0.116 (0.075) |
| Data1 (σ = 5) | 5.294 (0.983) | 5.328 (1.080) | 3.480 (1.072) | 1.262 (0.764) | 0.356 (0.395) | 0.356 (0.395) |
| Data1 (σ = 10) | 12.628 (3.931) | 13.880 (4.031) | 13.411 (4.540) | 6.061 (4.022) | 1.963 (1.600) | 1.958 (1.593) |
| Data2 (σ = 2) | 1.308 (0.435) | 1.084 (0.439) | 0.623 (0.250) | 0.291 (0.208) | 0.226 (0.175) | 0.223 (0.135) |
| Data2 (σ = 5) | 4.907 (1.496) | 4.868 (1.625) | 2.538 (0.656) | 0.781 (0.598) | 0.721 (0.532) | 0.705 (0.535) |
| Data2 (σ = 10) | 18.175 (6.611) | 18.353 (6.611) | 6.930 (2.858) | 4.601 (2.623) | 4.232 (2.561) | 4.196 (2.577) |
| Data3 (σ = 5) | 5.163 (1.708) | 4.503 (1.677) | 4.236 (1.476) | 3.336 (1.725) | 0.349 (0.282) | 0.348 (0.283) |
| Data4 (σ = 5) | 7.664 (2.502) | 7.167 (2.492) | 7.516 (2.441) | 7.527 (2.434) | 5.097 (0.780) | 4.943 (0.764) |
| Data5 (σ = 5) | 9.893 (1.965) | 7.907 (2.194) | 9.622 (2.025) | 9.810 (2.068) | 7.684 (1.1191) | 7.601 (1.038) |
Table 2 shows the results in terms of accuracy of feature grouping and selection. Since Lasso does not perform feature grouping, we only report the results of the other five methods: OSCAR, GFlasso, GOSCAR, ncFGS, and ncTFGS. Table 2 shows that ncFGS and ncTFGS achieve higher accuracy than the other methods.
Table 2.
Accuracy of feature grouping and selection based on 30 simulations for five feature grouping methods: the first row for each dataset corresponds to the accuracy of feature selection; the second row corresponds to the accuracy of feature grouping. The numbers in parentheses are the standard deviations.
| Datasets | Metric | OSCAR | GFlasso | GOSCAR | ncFGS | ncTFGS |
|---|---|---|---|---|---|---|
| Data1 (σ = 2) | s0 | 0.675 (0.098) | 0.553 (0.064) | 0.513 (0.036) | 0.983 (0.063) | 1.000 (0.000) |
| | s | 0.708 (0.021) | 0.709 (0.017) | 0.702 (0.009) | 0.994 (0.022) | 1.000 (0.000) |
| Data1 (σ = 5) | s0 | 0.565 (0.084) | 0.502 (0.009) | 0.585 (0.085) | 1.000 (0.000) | 1.000 (0.000) |
| | s | 0.691 (0.011) | 0.709 (0.016) | 0.708 (0.017) | 1.000 (0.000) | 1.000 (0.000) |
| Data1 (σ = 10) | s0 | 0.532 (0.069) | 0.568 (0.088) | 0.577 (0.061) | 0.983 (0.063) | 1.000 (0.000) |
| | s | 0.675 (0.031) | 0.725 (0.022) | 0.708 (0.020) | 0.994 (0.022) | 0.999 (0.001) |
| Data2 (σ = 2) | s0 | 0.739 (0.108) | 0.544 (0.272) | 1.000 (0.000) | 0.958 (0.159) | 0.958 (0.159) |
| | s | 0.625 (0.052) | 0.823 (0.029) | 0.837 (0.014) | 0.831 (0.052) | 0.846 (0.041) |
| Data2 (σ = 5) | s0 | 0.763 (0.114) | 0.717 (0.275) | 0.999 (0.005) | 0.979 (0.114) | 0.975 (0.115) |
| | s | 0.650 (0.066) | 0.741 (0.062) | 0.833 (0.011) | 0.845 (0.030) | 0.842 (0.037) |
| Data2 (σ = 10) | s0 | 0.726 (0.101) | 0.818 (0.149) | 0.993 (0.024) | 1.000 (0.000) | 1.000 (0.000) |
| | s | 0.597 (0.058) | 0.680 (0.049) | 0.829 (0.025) | 0.851 (0.015) | 0.856 (0.014) |
| Data3 (σ = 5) | s0 | 0.886 (0.135) | 0.736 (0.103) | 0.382 (0.084) | 0.992 (0.026) | 0.996 (0.014) |
| | s | 0.841 (0.056) | 0.739 (0.041) | 0.689 (0.013) | 0.995 (0.017) | 0.978 (0.028) |
| Data4 (σ = 5) | s0 | 0.875 (0.033) | 0.881 (0.026) | 0.882 (0.037) | 0.796 (0.245) | 0.950 (0.012) |
| | s | 0.834 (0.030) | 0.805 (0.035) | 0.805 (0.036) | 0.895 (0.114) | 0.890 (0.074) |
| Data5 (σ = 5) | s0 | 0.760 (0.203) | 0.802 (0.153) | 0.861 (0.051) | 0.881 (0.174) | 0.894 (0.132) |
| | s | 0.858 (0.031) | 0.821 (0.037) | 0.805 (0.037) | 0.920 (0.056) | 0.919 (0.057) |
Table 3 shows the comparison of feature selection alone (λ2 = 0), feature grouping alone (λ1 = 0), and simultaneous feature grouping and selection using ncTFGS. From Table 3, we can observe that simultaneous feature grouping and selection outperforms either feature grouping or feature selection, demonstrating the benefit of joint feature grouping and selection in the proposed non-convex method.
Table 3.
Comparison of feature selection alone (FS), feature grouping alone (FG), and simultaneous feature grouping and feature selection (Both). The average results based on 30 replications of three datasets with σ = 5: Data3 (top), Data4 (middle), and Data5 (bottom) are reported. The numbers in parentheses are the standard deviations.
| Dataset | Meth. | MSE | s0 | s |
|---|---|---|---|---|
| Data3 (σ = 5) | FG | 2.774 (0.967) | 0.252 (0.156) | 0.696 (0.006) |
| | FS | 6.005 (1.410) | 0.945 (0.012) | 0.773 (0.037) |
| | Both | 0.348 (0.283) | 0.996 (0.014) | 0.978 (0.028) |
| Data4 (σ = 5) | FG | 9.493 (1.810) | 0.613 (0.115) | 0.770 (0.038) |
| | FS | 6.437 (1.803) | 0.947 (0.016) | 0.782 (0.046) |
| | Both | 4.944 (0.764) | 0.951 (0.166) | 0.890 (0.074) |
| Data5 (σ = 5) | FG | 10.830 (2.161) | 0.434 (0.043) | 0.847 (0.014) |
| | FS | 10.276 (1.438) | 0.891 (0.018) | 0.768 (0.026) |
| | Both | 7.601 (1.038) | 0.894 (0.132) | 0.919 (0.057) |
4.3 Real Data
We conduct experiments on two real datasets: FDG-PET and Breast Cancer. The metrics used to measure the performance of the different algorithms include accuracy (acc.), sensitivity (sen.), specificity (spe.), degrees of freedom (dof.), and the number of nonzero coefficients (nonzero coeff.). The dof. of lasso is the number of nonzero coefficients [18]. For the algorithms capable of feature grouping, we use the same definition of dof. as in [2], which is the number of estimated groups.
4.3.1 FDG-PET
In this experiment, we use FDG-PET 3D images from 74 Alzheimer’s disease (AD), 172 mild cognitive impairment (MCI), and 81 normal control (NC) subjects downloaded from the Alzheimer’s disease neuroimaging initiative (ADNI) database. The different regions of the whole brain volume can be represented by 116 anatomical volumes of interest (AVOIs), defined by Automated Anatomical Labeling (AAL) [20]. We then extract data from each of the 116 AVOIs and compute the average value of each AVOI for each subject.
In our study, we compare the different methods in distinguishing AD and NC subjects, which is a two-class classification problem over a dataset with 155 samples and 116 features. The dataset is randomly split into two subsets: a training set consisting of 104 samples and a testing set consisting of the remaining 51 samples. The tuning parameters are selected by 5-fold cross validation. Sparse inverse covariance estimation (SICE) has been recognized as an effective tool for identifying the structure of the inverse covariance matrix. We use the SICE method developed in [6] to model the connectivity of brain regions. Figure 5 shows sample subgraphs of the graph built by SICE, which consists of 115 nodes and 265 edges.
Figure 5.
Subgraphs of the graph built by SICE on FDG-PET dataset, which consists of 265 edges.
The results based on 20 replications are shown in Table 4. From Table 4, we can see that ncTFGS achieves more accurate classification than the other methods while using fewer degrees of freedom. ncFGS and GOSCAR achieve similar classification accuracy, while ncFGS selects more features than GOSCAR.
Table 4.
Comparison of classification accuracy, sensitivity, specificity, degrees of freedom, and the number of nonzero coefficients averaged over 20 replications for different methods on FDG-PET dataset.
| Metrics | Lasso | OSCAR | GFlasso | GOSCAR | ncFGS | ncTFGS |
|---|---|---|---|---|---|---|
| acc. | 0.886 (0.026) | 0.891 (0.026) | 0.901 (0.029) | 0.909 (0.026) | 0.909 (0.031) | 0.920 (0.024) |
| sen. | 0.870 (0.041) | 0.876 (0.038) | 0.904 (0.038) | 0.909 (0.041) | 0.915 (0.043) | 0.933 (0.047) |
| spe. | 0.913 (0.0446) | 0.917 (0.050) | 0.902 (0.046) | 0.915 (0.046) | 0.908 (0.047) | 0.915 (0.052) |
| dof. | 22.150 | 29.150 | 24.350 | 21.300 | 31.250 | 18.250 |
| nonzero coeff. | 22.150 | 38.000 | 41.900 | 24.250 | 37.350 | 19.350 |
Figure 6 shows the comparison of accuracy with either λ1 or λ2 fixed. The λ1 and λ2 values range from 1e-4 to 100. As we can see, the performance of ncTFGS is slightly better than that of the other competitors. Since the regularization parameters of the subproblems in ncTFGS are λ1/τ and λ2/τ, the solution of ncTFGS is sparser than those of the other competitors when λ1 and λ2 are large and τ is small (τ = 0.15 in this case).
Figure 6.

Comparison of accuracies for various methods with λ1 fixed (left) and λ2 fixed (right) on the FDG-PET dataset.
4.3.2 Breast Cancer
We conduct experiments on the Breast Cancer dataset, which consists of gene expression data for 8141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic). The network described in [4] is used as the input graph in this experiment. Figure 7 shows a subgraph of the used graph consisting of 80 nodes. We restrict our analysis to the 566 genes that are most correlated with the output and are also connected in the graph. Two thirds of the data are randomly chosen as training data, and the remaining one third is used as testing data. The tuning parameters are selected by 5-fold cross validation. Table 5 shows the results averaged over 30 replications. As indicated in Table 5, GOSCAR, ncFGS, and ncTFGS outperform the other three methods, and ncTFGS achieves the best performance.
Figure 7.
A subgraph of the network in Breast Cancer dataset [4]. The subgraph consists of 80 nodes.
Table 5.
Comparison of classification accuracy, sensitivity, specificity, degrees of freedom, and the number of nonzero coefficients averaged over 30 replications for various methods on Breast Cancer dataset.
| Metrics | Lasso | OSCAR | GFlasso | GOSCAR | ncFGS | ncTFGS |
|---|---|---|---|---|---|---|
| acc. | 0.739 (0.054) | 0.755 (0.055) | 0.771 (0.050) | 0.783 (0.042) | 0.779 (0.041) | 0.790 (0.036) |
| sen. | 0.707 (0.056) | 0.720 (0.060) | 0.749 (0.060) | 0.755 (0.050) | 0.755 (0.055) | 0.762 (0.044) |
| spe. | 0.794 (0.071) | 0.810 (0.068) | 0.805 (0.056) | 0.827 (0.061) | 0.819 (0.058) | 0.834 (0.060) |
| dof. | 239.267 | 165.633 | 108.633 | 70.267 | 57.233 | 45.600 |
| nonzero coeff. | 239.267 | 243.867 | 144.867 | 140.667 | 79.833 | 116.567 |
5. CONCLUSION
In this paper, we consider simultaneous feature grouping and selection over a given undirected graph. We propose one convex and two non-convex penalties to encourage both sparsity and equality of absolute values of coefficients for features connected in the graph. We employ ADMM and DC programming to solve the proposed formulations. Numerical experiments on synthetic and real data demonstrate the effectiveness of the proposed methods. Our results also demonstrate the benefit of simultaneous feature grouping and feature selection through the proposed non-convex methods. In this paper, we focus on undirected graphs. A possible future direction is to extend the formulations to directed graphs. In addition, we plan to study the generalization performance of the proposed formulations.
Acknowledgments
This work was supported in part by NSF (IIS-0953662, MCB-1026710, CCF-1025177, DMS-0906616) and NIH (R01LM010730, 2R01GM081535-01, R01HL105397).
Contributor Information
Sen Yang, Email: senyang@asu.edu.
Lei Yuan, Email: lei.yuan@asu.edu.
Ying-Cheng Lai, Email: ying-cheng.lai@asu.edu.
Xiaotong Shen, Email: xshen@stat.umn.edu.
Peter Wonka, Email: peter.wonka@asu.edu.
Jieping Ye, Email: jieping.ye@asu.edu.
References
1. Bach F, Lanckriet G, Jordan M. Multiple kernel learning, conic duality, and the SMO algorithm. ICML, 2004.
2. Bondell H, Reich B. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 2008;64(1):115–123.
3. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. 2011.
4. Chuang H, Lee E, Liu Y, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Molecular Systems Biology, 2007;3(1).
5. Fei H, Quanz B, Huan J. Regularization and feature selection for networked features. CIKM, 2010:1893–1896.
6. Huang S, Li J, Sun L, Liu J, Wu T, Chen K, Fleisher A, Reiman E, Ye J. Learning brain connectivity of Alzheimer’s disease from neuroimaging data. NIPS, 2009;22:808–816.
7. Jacob L, Obozinski G, Vert J. Group lasso with overlap and graph lasso. ICML, 2009:433–440.
8. Jenatton R, Mairal J, Obozinski G, Bach F. Proximal methods for sparse hierarchical dictionary learning. ICML, 2010.
9. Kim S, Xing E. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 2009;5(8):e1000587.
10. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 2008;24(9):1175–1182.
11. Liu J, Ye J. Moreau-Yosida regularization for grouped tree structure learning. NIPS, 2010.
12. Rinaldo A. Properties and refinements of the fused lasso. The Annals of Statistics, 2009;37(5B):2922–2952.
13. Shen X, Huang H. Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association, 2010;105(490):727–739.
14. Shen X, Huang H, Pan W. Simultaneous supervised clustering and feature selection over a graph. Submitted.
15. Shen X, Ye J. Adaptive model selection. Journal of the American Statistical Association, 2002;97(457):210–221.
16. Tao P, An L. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 1997;22(1):289–355.
17. Tao P, El Bernoussi S. Duality in DC (difference of convex functions) optimization. Subgradient methods. Trends in Mathematical Optimization, 1988;84:277–293.
18. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 1996:267–288.
19. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 2005;67(1):91–108.
20. Tzourio-Mazoyer N, Landeau B, Papathanassiou D, Crivello F, Etard O, Delcroix N, Mazoyer B, Joliot M. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage, 2002;15(1):273–289.
21. Yuan L, Liu J, Ye J. Efficient methods for overlapping group lasso. NIPS, 2011.
22. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 2006;68(1):49–67.
23. Zhang T. Multi-stage convex relaxation for feature selection. 2011.
24. Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 2009;37(6A):3468–3497.
25. Zhong L, Kwok J. Efficient sparse modeling with automatic feature grouping. ICML, 2011.
26. Zhu Y, Shen X, Pan W. Simultaneous grouping pursuit and feature selection in regression over an undirected graph. Submitted.
27. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 2005;67(2):301–320.