Published in final edited form as: Adv Neural Inf Process Syst. 2013 Oct;14:2979–3010.

Multi-Stage Multi-Task Feature Learning*

Pinghua Gong , Jieping Ye , Changshui Zhang

Abstract

Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness in approximating an ℓ0-type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with state-of-the-art multi-task sparse feature learning algorithms.

1 Introduction

Multi-task learning (MTL) exploits the relationships among multiple related tasks to improve the generalization performance. It has been applied successfully to many applications such as speech classification [16], handwritten character recognition [14, 17] and medical diagnosis [2]. One common assumption in multi-task learning is that all tasks should share some common structures including the prior or parameters of Bayesian models [18, 21, 24], a similarity metric matrix [16], a classification weight vector [6], a low rank subspace [4, 13] and a common set of shared features [1, 8, 10, 11, 12, 14, 20].

In this paper, we focus on multi-task feature learning, in which we learn the features specific to each task as well as the common features shared among tasks. Although many multi-task feature learning algorithms have been proposed, most of them assume that the relevant features are shared by all tasks. This is too restrictive in real-world applications [9]. To overcome this limitation, Jalali et al. (2010) [9] proposed an ℓ1 + ℓ1,∞ regularized formulation, called the dirty model, to leverage the common features shared among tasks. The dirty model allows a certain feature to be shared by some tasks but not all tasks. Jalali et al. (2010) also presented a theoretical analysis under the incoherence condition [5, 15], which is more restrictive than RIP [3, 27]. The ℓ1 + ℓ1,∞ regularizer is a convex relaxation of the ℓ0-type one, which, however, is too loose to well approximate the ℓ0-type regularizer and usually achieves suboptimal performance (requiring restrictive conditions or obtaining a suboptimal error bound) [23, 26, 27]. To remedy this shortcoming, we propose to use a non-convex regularizer for multi-task feature learning in this paper.

Contributions

We propose to employ a capped-ℓ1,ℓ1 regularized formulation (non-convex) to learn the features specific to each task as well as the common features shared among tasks. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm using concave duality [26]. Although the MSMTFL algorithm may not obtain a globally optimal solution, we theoretically show that this solution achieves good performance. Specifically, we present a detailed theoretical analysis of the parameter estimation error bound for the MSMTFL algorithm. Our analysis shows that, under the sparse eigenvalue condition, which is weaker than the incoherence condition in Jalali et al. (2010) [9], MSMTFL improves the error bound during the multi-stage iteration, i.e., the error bound at the current iteration improves upon the one at the last iteration. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of the MSMTFL algorithm in comparison with state-of-the-art algorithms.

Notations

Scalars and vectors are denoted by lower case letters and bold face lower case letters, respectively. Matrices and sets are denoted by capital letters and calligraphic capital letters, respectively. The ℓ1 norm, Euclidean norm, ℓ∞ norm and Frobenius norm are denoted by ∥ · ∥1, ∥ · ∥, ∥ · ∥∞ and ∥ · ∥F, respectively. | · | denotes the absolute value of a scalar or the number of elements in a set, depending on the context. We define the ℓp,q norm of a matrix X as $\|X\|_{p,q} = \big(\sum_i \big(\sum_j |x_{ij}|^q\big)^{p/q}\big)^{1/p}$. We define $\mathbb{N}_n$ as {1, …, n} and N(μ, σ²) as a normal distribution with mean μ and variance σ². For a d × m matrix W and sets $\mathcal{I}_i \subseteq \mathbb{N}_d \times \{i\}$, $\mathcal{I} \subseteq \mathbb{N}_d \times \mathbb{N}_m$, we let $w_{\mathcal{I}_i}$ be a d × 1 vector with the j-th entry being $w_{ji}$ if $(j,i) \in \mathcal{I}_i$, and 0 otherwise. We also let $W_{\mathcal{I}}$ be a d × m matrix with the (j, i)-th entry being $w_{ji}$ if $(j,i) \in \mathcal{I}$, and 0 otherwise.
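The ℓ2,1 norm used throughout the analysis below follows this definition with p = 2, q = 1, i.e., the ℓ2 norm of the vector of row-wise ℓ1 norms. A small NumPy sketch (ours, not part of the original paper) makes the definition concrete:

```python
import numpy as np

def lpq_norm(X, p, q):
    """l_{p,q} norm as defined in the Notations: take the l_q norm of each
    row of X first, then the l_p norm of the resulting vector of row norms."""
    row_norms = np.sum(np.abs(X) ** q, axis=1) ** (1.0 / q)
    return np.sum(row_norms ** p) ** (1.0 / p)

W = np.array([[1.0, -2.0],
              [0.0,  0.0],
              [3.0,  4.0]])
# l_{2,1} norm: the row l_1 norms are [3, 0, 7], so the result is sqrt(9 + 49) ~= 7.62
print(lpq_norm(W, p=2, q=1))
```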

2 The Proposed Formulation

Assume we are given m learning tasks associated with training data {(X1, y1), …, (Xm, ym)}, where $X_i \in \mathbb{R}^{n_i \times d}$ is the data matrix of the i-th task with each row as a sample; $y_i \in \mathbb{R}^{n_i}$ is the response of the i-th task; d is the data dimensionality; $n_i$ is the number of samples for the i-th task. We consider learning a weight matrix $W = [w_1, \ldots, w_m] \in \mathbb{R}^{d \times m}$ consisting of the weight vectors of m linear predictive models: $y_i \approx f_i(X_i) = X_i w_i$, $i \in \mathbb{N}_m$. In this paper, we propose a non-convex multi-task feature learning formulation to learn these m models simultaneously, based on the capped-ℓ1,ℓ1 regularization. Specifically, we first impose the ℓ1 penalty on each row of W, obtaining a column vector. Then, we impose the capped-ℓ1 penalty [26, 27] on that vector. Formally, we formulate our proposed model as follows:

$\min_W \left\{ l(W) + \lambda \sum_{j=1}^{d} \min\left(\|w^j\|_1, \theta\right) \right\}$, (1)

where l(W) is an empirical loss function of W; λ (> 0) is a parameter balancing the empirical loss and the regularization; θ (> 0) is a thresholding parameter; $w^j$ is the j-th row of the matrix W. In this paper, we focus on the quadratic loss function: $l(W) = \sum_{i=1}^{m} \frac{1}{m n_i} \|X_i w_i - y_i\|^2$.
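To make Eq. (1) concrete, the following NumPy sketch (ours, not from the paper; the function name is illustrative) evaluates the capped-ℓ1,ℓ1 regularized objective for a given W:

```python
import numpy as np

def capped_l1l1_objective(W, Xs, ys, lam, theta):
    """Objective of Eq. (1): quadratic loss plus the capped-l1,l1 penalty.

    W     : (d, m) weight matrix, one column per task
    Xs    : list of m data matrices, Xs[i] has shape (n_i, d)
    ys    : list of m response vectors, ys[i] has shape (n_i,)
    lam   : regularization parameter lambda > 0
    theta : thresholding parameter theta > 0
    """
    m = len(Xs)
    # quadratic loss l(W) = sum_i ||X_i w_i - y_i||^2 / (m * n_i)
    loss = sum(np.sum((Xs[i] @ W[:, i] - ys[i]) ** 2) / (m * Xs[i].shape[0])
               for i in range(m))
    # capped-l1,l1 penalty: lambda * sum_j min(||w^j||_1, theta)
    row_l1 = np.sum(np.abs(W), axis=1)
    return loss + lam * np.sum(np.minimum(row_l1, theta))
```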

[Algorithm 1 (MSMTFL) appears here as a figure in the original manuscript. As described in the surrounding text, each stage ℓ solves a row-weighted ℓ1 subproblem (Eq. (2)) and then updates the per-row penalty weights from the current solution; stage ℓ = 1 reduces to Lasso.]

Intuitively, due to the capped-ℓ1,ℓ1 penalty, the optimal solution of Eq. (1), denoted as W*, has many zero rows. For a nonzero row $(w^*)^k$, some entries may be zero, due to the ℓ1-norm imposed on each row of W. Thus, under the formulation in Eq. (1), a certain feature can be shared by some tasks but not all tasks. Therefore, the proposed formulation can leverage the common features shared among tasks.

The formulation in Eq. (1) is non-convex and difficult to solve. To this end, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm (see Algorithm 1). Note that if we terminate the algorithm at ℓ = 1, the MSMTFL algorithm is equivalent to the ℓ1-regularized multi-task feature learning algorithm (Lasso). Thus, the solution obtained by MSMTFL can be considered as a refinement of that of Lasso. Although Algorithm 1 may not find a globally optimal solution, the solution has good performance. Specifically, we will theoretically show that the solution obtained by Algorithm 1 improves the parameter estimation error bound during the multi-stage iteration. Moreover, empirical studies also demonstrate the effectiveness of our proposed MSMTFL algorithm. We provide more details about intuitive interpretations, convergence analysis and reproducibility of the proposed algorithm in the full version [7].
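Since Algorithm 1 itself only appears as a figure above, the following Python sketch shows one plausible rendering of the multi-stage scheme described in the text: at each stage a per-row weighted ℓ1 problem is solved, and the row weights are then reset to λ for rows whose ℓ1 norm is below θ and to 0 otherwise (consistent with the weights $\hat{\lambda}_{ji} = \lambda I(\|\hat{w}^j\|_1 < \theta)$ used in Lemma 4). The ISTA inner solver, stage count and function names are our own choices, not the authors' code.

```python
import numpy as np

def weighted_lasso_ista(X, y, lam_vec, scale, n_iter=500):
    """Solve  min_w  scale * ||X w - y||^2 + sum_j lam_vec[j] * |w_j|  by ISTA."""
    d = X.shape[1]
    w = np.zeros(d)
    step = 1.0 / (2.0 * scale * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = 2.0 * scale * X.T @ (X @ w - y)
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam_vec, 0.0)  # soft-threshold
    return w

def msmtfl(Xs, ys, lam, theta, n_stages=5):
    """Multi-stage multi-task feature learning (sketch).

    Stage 1 uses uniform row weights (plain Lasso); afterwards, rows whose
    l1 norm reaches theta are no longer penalized at the next stage.
    """
    m, d = len(Xs), Xs[0].shape[1]
    lam_rows = np.full(d, float(lam))      # per-row weights, initially all lambda
    W = np.zeros((d, m))
    for _ in range(n_stages):
        for i in range(m):
            scale = 1.0 / (m * Xs[i].shape[0])
            W[:, i] = weighted_lasso_ista(Xs[i], ys[i], lam_rows, scale)
        row_l1 = np.sum(np.abs(W), axis=1)
        lam_rows = lam * (row_l1 < theta)  # lambda_j = lambda * I(||w^j||_1 < theta)
    return W
```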

3 Theoretical Analysis

In this section, we theoretically analyze the parameter estimation performance of the solution obtained by the MSMTFL algorithm. To simplify the notations in the theoretical analysis, we assume that the number of samples is the same for all tasks. However, our theoretical analysis can be easily extended to the case where the tasks have different sample sizes.

We first present a sub-Gaussian noise assumption, which is very common in the sparse regularization literature [23, 25, 26, 27].

Assumption 1 Let $\bar{W} = [\bar{w}_1, \ldots, \bar{w}_m] \in \mathbb{R}^{d \times m}$ be the underlying sparse weight matrix and $y_i = X_i \bar{w}_i + \delta_i$, $\mathbb{E}y_i = X_i \bar{w}_i$, where $\delta_i \in \mathbb{R}^n$ is a random vector with all entries $\delta_{ji}$ ($j \in \mathbb{N}_n$, $i \in \mathbb{N}_m$) being independent sub-Gaussians: there exists σ > 0 such that for all $j \in \mathbb{N}_n$, $i \in \mathbb{N}_m$, $t \in \mathbb{R}$: $\mathbb{E}_{\delta_{ji}} \exp(t\delta_{ji}) \le \exp(\sigma^2 t^2 / 2)$.

Remark 1 We call a random variable satisfying the condition in Assumption 1 sub-Gaussian, since its moment generating function is upper bounded by that of a zero-mean Gaussian random variable. That is, if x ~ N(0, σ²), then $\mathbb{E}\exp(tx) = \int_{-\infty}^{\infty} \exp(tx)\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right)dx = \exp\left(\frac{\sigma^2 t^2}{2}\right)\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\sigma^2 t)^2}{2\sigma^2}\right)dx = \exp\left(\frac{\sigma^2 t^2}{2}\right) \ge \mathbb{E}_{\delta_{ji}}\exp(t\delta_{ji})$.

Remark 2 Based on Hoeffding's Lemma, for any random variable x ∊ [a, b] with $\mathbb{E}x = 0$, we have $\mathbb{E}\exp(tx) \le \exp(t^2(b-a)^2/8)$. Therefore, both zero-mean Gaussian and zero-mean bounded random variables are sub-Gaussians. Thus, the sub-Gaussian noise assumption is more general than the Gaussian noise assumption commonly used in the literature [9, 11].
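As a concrete instance of Remark 2 (our worked example, not in the original text), consider a Rademacher variable x taking the values +1 and −1 with probability 1/2 each, so that a = −1, b = 1 and $\mathbb{E}x = 0$. Directly,

$\mathbb{E}\exp(tx) = \tfrac{1}{2}e^{t} + \tfrac{1}{2}e^{-t} = \cosh(t) = \sum_{k \ge 0}\frac{t^{2k}}{(2k)!} \le \sum_{k \ge 0}\frac{(t^2/2)^k}{k!} = \exp\left(\frac{t^2}{2}\right)$,

which coincides with Hoeffding's bound $\exp(t^2(b-a)^2/8) = \exp(t^2/2)$; hence x satisfies Assumption 1 with σ = 1.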

We next introduce the following sparse eigenvalue concept, which is also common in the sparse regularization literature [22, 23, 25, 26, 27].

Definition 1 Given 1 ≤ k ≤ d, we define

$\rho_i^+(k) = \sup_{w}\left\{\frac{\|X_i w\|^2}{n\|w\|^2} : \|w\|_0 \le k\right\}$, $\rho_{\max}^+(k) = \max_{i \in \mathbb{N}_m} \rho_i^+(k)$,
$\rho_i^-(k) = \inf_{w}\left\{\frac{\|X_i w\|^2}{n\|w\|^2} : \|w\|_0 \le k\right\}$, $\rho_{\min}^-(k) = \min_{i \in \mathbb{N}_m} \rho_i^-(k)$.

Remark 3 $\rho_i^+(k)$ ($\rho_i^-(k)$) is in fact the maximum (minimum) eigenvalue of $(X_i)_{\mathcal{S}}^T (X_i)_{\mathcal{S}}/n$, where $\mathcal{S}$ is a set satisfying $|\mathcal{S}| \le k$ and $(X_i)_{\mathcal{S}}$ is a submatrix composed of the columns of $X_i$ indexed by $\mathcal{S}$. In the MTL setting, we need to exploit the relations of $\rho_i^+(k)$ ($\rho_i^-(k)$) among multiple tasks.
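To make Definition 1 and Remark 3 concrete, here is a brute-force NumPy sketch (ours, only practical for small d) that computes $\rho_i^+(k)$ and $\rho_i^-(k)$ for a single task by enumerating column subsets:

```python
import itertools
import numpy as np

def sparse_eigenvalues(X, k):
    """Brute-force rho^+(k) and rho^-(k) for one task's design matrix X (n x d):
    the largest / smallest eigenvalues of (X_S^T X_S) / n over all column
    subsets S with |S| <= k. Exponential in d, so for illustration only."""
    n, d = X.shape
    rho_plus, rho_minus = 0.0, np.inf
    for size in range(1, k + 1):
        for S in itertools.combinations(range(d), size):
            X_sub = X[:, list(S)]
            G = X_sub.T @ X_sub / n           # |S| x |S| Gram matrix
            evals = np.linalg.eigvalsh(G)     # eigenvalues in ascending order
            rho_plus = max(rho_plus, evals[-1])
            rho_minus = min(rho_minus, evals[0])
    return rho_plus, rho_minus

# rho_max^+(k) and rho_min^-(k) across tasks X_1, ..., X_m:
# rho_max_plus = max(sparse_eigenvalues(Xi, k)[0] for Xi in Xs)
# rho_min_minus = min(sparse_eigenvalues(Xi, k)[1] for Xi in Xs)
```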

We present our parameter estimation error bound on MSMTFL in the following theorem:

Theorem 1 Let Assumption 1 hold. Define $\mathcal{F}_i = \{(j,i) : \bar{w}_{ji} \ne 0\}$ and $\mathcal{F} = \bigcup_{i \in \mathbb{N}_m} \mathcal{F}_i$. Denote r as the number of nonzero rows of $\bar{W}$. We assume that

$\forall (j,i) \in \mathcal{F}: \|\bar{w}^j\|_1 \ge 2\theta$, (3)
and $\frac{\rho_i^+(s)}{\rho_i^-(2r + 2s)} \le 1 + \frac{s}{2r}$, (4)

where s is some integer satisfying s ≥ r. If we choose λ and θ such that for some s ≥ r:

$\lambda \ge 12\sigma\sqrt{\frac{2\rho_{\max}^+(1)\ln(2dm/\eta)}{n}}$, (5)
$\theta \ge \frac{11 m \lambda}{\rho_{\min}^-(2r + s)}$, (6)

then the following parameter estimation error bound holds with probability larger than 1 − η:

$\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1} \le 0.8^{\ell/2}\frac{9.1 m \lambda \sqrt{r}}{\rho_{\min}^-(2r+s)} + \frac{39.5 m \sigma\sqrt{\rho_{\max}^+(r)\big(7.4r + 2.7\ln(2/\eta)\big)/n}}{\rho_{\min}^-(2r+s)}$, (7)

where $\hat{W}^{(\ell)}$ is a solution of Eq. (2) (the stage-ℓ subproblem of Algorithm 1).

Remark 4 Eq. (3) assumes that the ℓ1-norm of each nonzero row of $\bar{W}$ is bounded away from zero. This requires the true nonzero coefficients to be large enough to be distinguishable from the noise. Eq. (4) is called the sparse eigenvalue condition [27], which requires the eigenvalue ratio $\rho_i^+(s)/\rho_i^-(s)$ to grow sub-linearly with respect to s. Such a condition is very common in the analysis of sparse regularization [22, 25] and it is slightly weaker than the RIP condition [3, 27].

Remark 5 When ℓ = 1 (which corresponds to Lasso), the first term of the right-hand side of Eq. (7) dominates the error bound in the order of

$\|\hat{W}^{Lasso} - \bar{W}\|_{2,1} = O\left(m\sqrt{\frac{r\ln(dm/\eta)}{n}}\right)$, (8)

since λ satisfies the condition in Eq. (5). Note that the first term of the right-hand side of Eq. (7) shrinks exponentially as ℓ increases. When ℓ is sufficiently large, of the order $O(\ln(m\sqrt{r/n}) + \ln\ln(dm))$, this term tends to zero and we obtain the following parameter estimation error bound:

$\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1} = O\left(m\sqrt{\frac{r}{n}} + \sqrt{\frac{\ln(1/\eta)}{n}}\right)$. (9)

Jalali et al. (2010) [9] gave an ℓ∞,∞-norm error bound $\|\hat{W}^{Dirty} - \bar{W}\|_{\infty,\infty} = O\left(\sqrt{\ln(dm/\eta)/n}\right)$ as well as a sign consistency result between $\hat{W}$ and $\bar{W}$. A direct comparison between these two bounds is difficult due to the use of different norms. On the other hand, the worst-case estimate of the ℓ2,1-norm error bound of the algorithm in Jalali et al. (2010) [9] is of the same order as Eq. (8), that is: $\|\hat{W}^{Dirty} - \bar{W}\|_{2,1} = O\left(m\sqrt{r\ln(dm/\eta)/n}\right)$. When dm is large and the ground truth matrix has a large number of zero rows (i.e., r is a small constant), the bound in Eq. (9) is significantly better than the ones for Lasso and the dirty model.

Remark 6 Jalali et al. (2010) [9] presented an ,-norm parameter estimation error bound and hence a sign consistency result can be obtained. The results are derived under the incoherence condition which is more restrictive than the RIP condition and hence more restrictive than the sparse eigenvalue condition in Eq. (4). From the viewpoint of the parameter estimation error, our proposed algorithm can achieve a better bound under weaker conditions. Please refer to [19, 25, 27] for more details about the incoherence condition, the RIP condition, the sparse eigenvalue condition and their relationships.

Remark 7 The capped-ℓ1 regularized formulation in Zhang (2010) [26] is a special case of our formulation when m = 1. However, extending the analysis from the single-task to the multi-task setting is nontrivial. Different from previous work on multi-stage sparse learning, which focuses on a single task [26, 27], we study a more general multi-stage framework in the multi-task setting. We need to exploit the relationships among tasks, by using the relations of the sparse eigenvalues $\rho_i^+(k)$ ($\rho_i^-(k)$) across tasks and by treating the ℓ1-norm of each row of the weight matrix as a whole. Moreover, we simultaneously exploit the relations of each column and each row of the matrix.

4 Proof Sketch

We first provide several important lemmas (please refer to the full version [7] or supplementary materials for detailed proofs) and then complete the proof of Theorem 1 based on these lemmas.

Lemma 1 Let $\Upsilon = [\upsilon_1, \ldots, \upsilon_m]$ with $\upsilon_i = [\upsilon_{1i}, \ldots, \upsilon_{di}]^T = \frac{1}{n}X_i^T(X_i\bar{w}_i - y_i)$ ($i \in \mathbb{N}_m$). Define $\mathcal{H} \supseteq \mathcal{F}$ such that $(j,i) \in \mathcal{H}$ for all $i \in \mathbb{N}_m$, provided there exists some $(j,g) \in \mathcal{F}$ ($\mathcal{H}$ is the set consisting of the indices of all entries in the nonzero rows of $\bar{W}$). Under the conditions of Assumption 1 and the notations of Theorem 1, the following bounds hold with probability larger than 1 − η:

$\|\Upsilon\|_{\infty,\infty} \le \sigma\sqrt{\frac{2\rho_{\max}^+(1)\ln(2dm/\eta)}{n}}$, (10)
$\|\Upsilon_{\mathcal{H}}\|_F^2 \le \frac{m\sigma^2\rho_{\max}^+(r)\big(7.4r + 2.7\ln(2/\eta)\big)}{n}$. (11)

Lemma 1 gives bounds on the residual correlation Υ with respect to $\bar{W}$. We note that Eq. (10) and Eq. (11) are closely related to the assumption on λ in Eq. (5) and to the second term of the right-hand side of Eq. (7) (the error bound), respectively. This lemma provides a fundamental basis for the proof of Theorem 1.

Lemma 2 Use the notations of Lemma 1 and consider $\mathcal{G}_i \subseteq \mathbb{N}_d \times \{i\}$ such that $\mathcal{F}_i \cap \mathcal{G}_i = \emptyset$ ($i \in \mathbb{N}_m$). Let $\hat{W} = \hat{W}^{(\ell)}$ be a solution of Eq. (2) and $\Delta\hat{W} = \hat{W} - \bar{W}$. Denote $\hat{\lambda}_i = \hat{\lambda}_i^{(\ell-1)} = [\hat{\lambda}_{1i}^{(\ell-1)}, \ldots, \hat{\lambda}_{di}^{(\ell-1)}]^T$. Let $\hat{\lambda}_{\mathcal{G}_i} = \min_{(j,i) \in \mathcal{G}_i}\hat{\lambda}_{ji}$, $\hat{\lambda}_{\mathcal{G}} = \min_{i \in \mathbb{N}_m}\hat{\lambda}_{\mathcal{G}_i}$ and $\hat{\lambda}_{0i} = \max_{j}\hat{\lambda}_{ji}$, $\hat{\lambda}_0 = \max_{i}\hat{\lambda}_{0i}$. If $2\|\upsilon_i\|_\infty \le \hat{\lambda}_{\mathcal{G}_i}$, then the following inequality holds at any stage ℓ ≥ 1:

$\sum_{i=1}^{m}\sum_{(j,i)\in\mathcal{G}_i}|\hat{w}_{ji}^{(\ell)}| \le \frac{2\|\Upsilon\|_{\infty,\infty} + \hat{\lambda}_0}{\hat{\lambda}_{\mathcal{G}} - 2\|\Upsilon\|_{\infty,\infty}}\sum_{i=1}^{m}\sum_{(j,i)\in\mathcal{G}_i^c}|\Delta\hat{w}_{ji}^{(\ell)}|$.

Denote $\mathcal{G} = \bigcup_{i\in\mathbb{N}_m}\mathcal{G}_i$, $\mathcal{F} = \bigcup_{i\in\mathbb{N}_m}\mathcal{F}_i$ and notice that $\mathcal{F} \cap \mathcal{G} = \emptyset$ implies $\Delta\hat{W}_{\mathcal{G}}^{(\ell)} = \hat{W}_{\mathcal{G}}^{(\ell)}$. Lemma 2 says that $\|\Delta\hat{W}_{\mathcal{G}}^{(\ell)}\|_{1,1} = \|\hat{W}_{\mathcal{G}}^{(\ell)}\|_{1,1}$ is upper bounded in terms of $\|\Delta\hat{W}_{\mathcal{G}^c}^{(\ell)}\|_{1,1}$, which indicates that the error of the estimated coefficients located outside of $\mathcal{F}$ should be small enough. This provides an intuitive explanation of why the parameter estimation error of our algorithm can be small.

Lemma 3 Using the notations of Lemma 2, we denote $\mathcal{G} = \mathcal{G}^{(\ell)} = \mathcal{H}^c \cap \{(j,i) : \hat{\lambda}_{ji}^{(\ell-1)} = \lambda\} = \bigcup_{i\in\mathbb{N}_m}\mathcal{G}_i$, with $\mathcal{H}$ defined as in Lemma 1 and $\mathcal{G}_i \subseteq \mathbb{N}_d \times \{i\}$. Let $\mathcal{J}_i$ be the indices of the largest s coefficients (in absolute value) of $\hat{w}_{\mathcal{G}_i}$, $\mathcal{I}_i = \mathcal{G}_i^c \cup \mathcal{J}_i$, $\mathcal{I} = \bigcup_{i\in\mathbb{N}_m}\mathcal{I}_i$ and $\mathcal{F} = \bigcup_{i\in\mathbb{N}_m}\mathcal{F}_i$. Then, the following inequalities hold at any stage ℓ ≥ 1:

$\|\Delta\hat{W}^{(\ell)}\|_{2,1} \le \left(1 + 1.5\sqrt{\frac{2r}{s}}\right)\frac{\sqrt{8m\left(4\|\Upsilon_{\mathcal{G}^{(\ell)c}}\|_F^2 + \sum_{(j,i)\in\mathcal{F}}\big(\hat{\lambda}_{ji}^{(\ell-1)}\big)^2\right)}}{\rho_{\min}^-(2r+s)}$, (12)
$\|\Delta\hat{W}^{(\ell)}\|_{2,1} \le \frac{9.1 m \lambda\sqrt{r}}{\rho_{\min}^-(2r+s)}$. (13)

Lemma 3 is established based on Lemma 2, by considering the relationship between Eq. (5) and Eq. (10), and the specific definition of $\mathcal{G} = \mathcal{G}^{(\ell)}$. Eq. (12) provides a parameter estimation error bound in terms of the ℓ2,1-norm, via $\|\Upsilon_{\mathcal{G}^{(\ell)c}}\|_F^2$ and the regularization parameters $\hat{\lambda}_{ji}^{(\ell-1)}$ (see the definition of $\hat{\lambda}_i$ ($\hat{\lambda}_{ji}^{(\ell-1)}$) in Lemma 2). This is the result directly used in the proof of Theorem 1. Eq. (13) states that the error bound is upper bounded in terms of λ; its right-hand side constitutes the shrinkage part of the error bound in Eq. (7).

Lemma 4 Let $\hat{\lambda}_{ji} = \lambda I\big(\|\hat{w}^j\|_1 < \theta\big)$, $j \in \mathbb{N}_d$, $i \in \mathbb{N}_m$, for some $\hat{W} \in \mathbb{R}^{d\times m}$. $\mathcal{H} \supseteq \mathcal{F}$ is defined in Lemma 1. Then under the condition of Eq. (3), we have:

$\sum_{(j,i)\in\mathcal{F}}\hat{\lambda}_{ji}^2 \le \sum_{(j,i)\in\mathcal{H}}\hat{\lambda}_{ji}^2 \le \frac{m\lambda^2\|\bar{W}_{\mathcal{H}} - \hat{W}_{\mathcal{H}}\|_{2,1}^2}{\theta^2}$.

Lemma 4 establishes an upper bound on $\sum_{(j,i)\in\mathcal{F}}\hat{\lambda}_{ji}^2$ in terms of $\|\bar{W}_{\mathcal{H}} - \hat{W}_{\mathcal{H}}\|_{2,1}^2$, which is critical for building the recursive relationship between $\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1}$ and $\|\hat{W}^{(\ell-1)} - \bar{W}\|_{2,1}$ in the proof of Theorem 1. This recursive relation is crucial for the shrinkage part of the error bound in Eq. (7).

4.1 Proof of Theorem 1

Proof For notational simplicity, we denote the right-hand side of Eq. (11) as:

$u = \frac{m\sigma^2\rho_{\max}^+(r)\big(7.4r + 2.7\ln(2/\eta)\big)}{n}$. (14)

Based on $\mathcal{H} \subseteq \mathcal{G}^{(\ell)c}$, Lemma 1 and Eq. (5), the following holds with probability larger than 1 − η:

$\|\Upsilon_{\mathcal{G}^{(\ell)c}}\|_F^2 = \|\Upsilon_{\mathcal{H}}\|_F^2 + \|\Upsilon_{\mathcal{G}^{(\ell)c}\setminus\mathcal{H}}\|_F^2 \le u + |\mathcal{G}^{(\ell)c}\setminus\mathcal{H}|\,\|\Upsilon\|_{\infty,\infty}^2 \le u + \frac{\lambda^2|\mathcal{G}^{(\ell)c}\setminus\mathcal{H}|}{144} \le u + \frac{1}{144}\cdot\frac{m\lambda^2}{\theta^2}\big\|\hat{W}^{(\ell-1)}_{\mathcal{G}^{(\ell)c}\setminus\mathcal{H}} - \bar{W}_{\mathcal{G}^{(\ell)c}\setminus\mathcal{H}}\big\|_{2,1}^2$, (15)

where the last inequality follows from the fact that, for all $(j,i) \in \mathcal{G}^{(\ell)c}\setminus\mathcal{H}$, we have $\|(\hat{w}^{(\ell-1)})^j\|_1 \ge \theta$ and $\bar{w}^j = 0$, so that $\|(\hat{w}^{(\ell-1)})^j - \bar{w}^j\|_1^2 \ge \theta^2$, which implies $|\mathcal{G}^{(\ell)c}\setminus\mathcal{H}| \le \frac{m}{\theta^2}\big\|\hat{W}^{(\ell-1)}_{\mathcal{G}^{(\ell)c}\setminus\mathcal{H}} - \bar{W}_{\mathcal{G}^{(\ell)c}\setminus\mathcal{H}}\big\|_{2,1}^2$. According to Eq. (12), we have:

$\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1}^2 = \|\Delta\hat{W}^{(\ell)}\|_{2,1}^2 \le 8m\left(1 + 1.5\sqrt{\frac{2r}{s}}\right)^2\frac{4\|\Upsilon_{\mathcal{G}^{(\ell)c}}\|_F^2 + \sum_{(j,i)\in\mathcal{F}}\big(\hat{\lambda}_{ji}^{(\ell-1)}\big)^2}{\big(\rho_{\min}^-(2r+s)\big)^2} \le \frac{78m\left(4u + \frac{37}{36}\cdot\frac{m\lambda^2}{\theta^2}\|\hat{W}^{(\ell-1)} - \bar{W}\|_{2,1}^2\right)}{\big(\rho_{\min}^-(2r+s)\big)^2} \le \frac{312mu}{\big(\rho_{\min}^-(2r+s)\big)^2} + 0.8\|\hat{W}^{(\ell-1)} - \bar{W}\|_{2,1}^2 \le \cdots \le 0.8^{\ell}\|\hat{W}^{(0)} - \bar{W}\|_{2,1}^2 + \frac{312mu}{\big(\rho_{\min}^-(2r+s)\big)^2}\cdot\frac{1 - 0.8^{\ell}}{1 - 0.8} \le 0.8^{\ell}\frac{9.1^2 m^2\lambda^2 r}{\big(\rho_{\min}^-(2r+s)\big)^2} + \frac{1560mu}{\big(\rho_{\min}^-(2r+s)\big)^2}$.

In the above derivation, the first inequality is due to Eq. (12); the second inequality is due to the assumption s ≥ r in Theorem 1, Eq. (15) and Lemma 4; the third inequality is due to Eq. (6); the last inequality follows from Eq. (13) and $1 - 0.8^{\ell} \le 1$ (ℓ ≥ 1). Thus, using the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ (a, b ≥ 0), we obtain:

$\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1} \le 0.8^{\ell/2}\frac{9.1 m\lambda\sqrt{r}}{\rho_{\min}^-(2r+s)} + \frac{39.5\sqrt{mu}}{\rho_{\min}^-(2r+s)}$.

Substituting Eq. (14) into the above inequality, we verify Theorem 1.
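For readers tracking the numerical constants, the following arithmetic (our verification, not spelled out in the original) is all that is used: for s ≥ r, $8\big(1 + 1.5\sqrt{2r/s}\big)^2 \le 8\big(1 + 1.5\sqrt{2}\big)^2 \approx 77.9 \le 78$; with $\theta \ge 11m\lambda/\rho_{\min}^-(2r+s)$ from Eq. (6), the coefficient of $\|\hat{W}^{(\ell-1)} - \bar{W}\|_{2,1}^2$ satisfies $78\cdot\frac{37}{36}\cdot\frac{m^2\lambda^2}{\theta^2(\rho_{\min}^-(2r+s))^2} \le \frac{78\cdot 37}{36\cdot 121} \approx 0.66 \le 0.8$; the geometric series gives $312/(1 - 0.8) = 1560$; and $\sqrt{1560} \approx 39.5$, which is the constant in the second term of Eq. (7).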

5 Experiments

We compare our proposed MSMTFL algorithm with three competing multi-task feature learning algorithms: ℓ1-norm multi-task feature learning algorithm (Lasso), ℓ1,2-norm multi-task feature learning algorithm (L1,2) [14] and dirty model multi-task feature learning algorithm (DirtyMTL) [9]. In our experiments, we employ the quadratic loss function for all the compared algorithms.

5.1 Synthetic Data Experiments

We generate the synthetic data as follows: the number of tasks is m and each task has n samples of dimensionality d; each element of the data matrix $X_i \in \mathbb{R}^{n\times d}$ ($i \in \mathbb{N}_m$) for the i-th task is sampled i.i.d. from the Gaussian distribution N(0, 1), and we then normalize all columns to length 1; each entry of the underlying true weight matrix $\bar{W} \in \mathbb{R}^{d\times m}$ is sampled i.i.d. from the uniform distribution on the interval [−10, 10]; we randomly set 90% of the rows of $\bar{W}$ to zero vectors and 80% of the elements in the remaining nonzero rows to zero; each entry of the noise $\delta_i \in \mathbb{R}^n$ is sampled i.i.d. from the Gaussian distribution N(0, σ²); the responses are computed as $y_i = X_i\bar{w}_i + \delta_i$ ($i \in \mathbb{N}_m$).
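This generation protocol is straightforward to reproduce; a NumPy sketch (ours; the default sizes m, n, d and σ are illustrative placeholders, since the exact values used in the figures are not restated in this excerpt):

```python
import numpy as np

def make_synthetic(m=20, n=30, d=200, sigma=0.01, seed=0):
    """Synthetic multi-task data following the protocol of Section 5.1."""
    rng = np.random.default_rng(seed)
    # ground-truth weights: uniform in [-10, 10], then 90% of the rows zeroed,
    # and roughly 80% of the entries in the remaining rows zeroed (Bernoulli
    # mask here; the paper zeroes exactly 80%)
    W_bar = rng.uniform(-10.0, 10.0, size=(d, m))
    zero_rows = rng.choice(d, size=int(0.9 * d), replace=False)
    W_bar[zero_rows, :] = 0.0
    nonzero_rows = np.setdiff1d(np.arange(d), zero_rows)
    keep = rng.random((len(nonzero_rows), m)) >= 0.8
    W_bar[nonzero_rows, :] *= keep
    Xs, ys = [], []
    for i in range(m):
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=0)      # normalize columns to length 1
        y = X @ W_bar[:, i] + sigma * rng.standard_normal(n)
        Xs.append(X)
        ys.append(y)
    return Xs, ys, W_bar
```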

We first report the averaged parameter estimation error $\|\hat{W} - \bar{W}\|_{2,1}$ vs. stage (ℓ) plots for MSMTFL (Figure 1). We observe that the error decreases as ℓ increases, which shows the advantage of our proposed algorithm over Lasso. This is consistent with the theoretical result in Theorem 1. Moreover, the parameter estimation error decreases quickly and converges within a few stages.

Figure 1. Averaged parameter estimation error $\|\hat{W} - \bar{W}\|_{2,1}$ vs. stage (ℓ) for MSMTFL on the synthetic data set (averaged over 10 runs). Here we set $\lambda = \alpha\sqrt{\ln(dm)/n}$ and θ = 50mλ. Note that ℓ = 1 corresponds to Lasso; the results show the stage-wise improvement over Lasso.

We then report the averaged parameter estimation error $\|\hat{W} - \bar{W}\|_{2,1}$ of the four algorithms under different parameter settings (Figure 2). For a fair comparison, we compare the smallest estimation errors of the four algorithms over all the parameter settings [25, 26]. As expected, the parameter estimation error of the MSMTFL algorithm is the smallest among the four algorithms. This empirical result demonstrates the effectiveness of the MSMTFL algorithm. We also have the following observations: (a) When λ is large enough, all four algorithms tend to have the same parameter estimation error. This is reasonable, because the solutions $\hat{W}$ obtained by the four algorithms are all zero matrices when λ is very large. (b) The performance of the MSMTFL algorithm is similar for different θ's once λ exceeds a certain value.

Figure 2. Averaged parameter estimation error $\|\hat{W} - \bar{W}\|_{2,1}$ vs. λ on the synthetic data set (averaged over 10 runs). MSMTFL has the smallest parameter estimation error among the four algorithms. Both DirtyMTL and MSMTFL have two parameters; we set λ_s/λ_b = 1, 0.5, 0.2, 0.1 for DirtyMTL (1/m ≤ λ_s/λ_b ≤ 1 was adopted in Jalali et al. (2010) [9]) and θ/λ = 50m, 10m, 2m, 0.4m for MSMTFL.

5.2 Real-World Data Experiments

We conduct experiments on two real-world data sets: the MRI and Isolet data sets. (1) The MRI data set is collected from the ADNI database, which contains 675 patients' MRI data preprocessed using FreeSurfer. The MRI data include 306 features, and the response (target) is the Mini Mental State Examination (MMSE) score at 6 different time points: M06, M12, M18, M24, M36, and M48. We remove the samples which fail the MRI quality controls or have missing entries. Thus, we have 6 tasks, each corresponding to a time point; the sample sizes of the 6 tasks are 648, 642, 293, 569, 389 and 87, respectively. (2) The Isolet data set is collected from 150 speakers who speak the name of each English letter of the alphabet twice; thus, there are 52 samples from each speaker. The speakers are grouped into 5 subsets of 30 similar speakers each, named Isolet1, Isolet2, Isolet3, Isolet4, and Isolet5. Thus, we naturally have 5 tasks, each corresponding to a subset. The 5 tasks have 1560, 1560, 1560, 1558, and 1559 samples, respectively (three samples are historically missing); each sample includes 617 features and the response is the English letter label (1–26).

In the experiments, we treat the MMSE score and the letter labels as the regression targets for the MRI data set and the Isolet data set, respectively. For both data sets, we randomly extract the training samples from each task with different training ratios (15%, 20% and 25%) and use the rest of the samples to form the test set. We evaluate the four multi-task feature learning algorithms in terms of the normalized mean squared error (nMSE) and the averaged mean squared error (aMSE), which are commonly used in multi-task learning problems [28, 29]. For each training ratio, both nMSE and aMSE are averaged over 10 random splittings into training and test sets, and the standard deviation is also shown. All parameters of the four algorithms are tuned via 3-fold cross-validation.

Table 1 and Figure 3 show the experimental results in terms of averaged nMSE (aMSE) and the standard deviation. From these results, we observe that: (a) Our proposed MSMTFL algorithm outperforms all the competing feature learning algorithms on both data sets, with the smallest regression errors (nMSE and aMSE) as well as the smallest standard deviations. (b) On the MRI data set, the MSMTFL algorithm performs well even in the case of a small training ratio; the performance for the 15% training ratio is comparable to that for the 25% training ratio. (c) On the Isolet data set, when the training ratio increases from 15% to 25%, the performance of the MSMTFL algorithm improves and the superiority of the MSMTFL algorithm over the other three algorithms becomes more significant. These results demonstrate the effectiveness of the proposed algorithm.

Table 1.

Comparison of four multi-task feature learning algorithms on the MRI data set in terms of averaged nMSE and aMSE (standard deviation), which are averaged over 10 random splittings.

measure training ratio Lasso L1,2 DirtyMTL MSMTFL
nMSE 0.15 0.6651(0.0280) 0.6633(0.0470) 0.6224(0.0265) 0.5539(0.0154)
0.20 0.6254(0.0212) 0.6489(0.0275) 0.6140(0.0185) 0.5542(0.0139)
0.25 0.6105(0.0186) 0.6577(0.0194) 0.6136(0.0180) 0.5507(0.0142)

aMSE 0.15 0.0189(0.0008) 0.0187(0.0010) 0.0172(0.0006) 0.0159(0.0004)
0.20 0.0179(0.0006) 0.0184(0.0005) 0.0171(0.0005) 0.0161(0.0004)
0.25 0.0172(0.0009) 0.0183(0.0006) 0.0167(0.0008) 0.0157(0.0006)

Figure 3. Averaged test error (nMSE and aMSE) vs. training ratio on the Isolet data set. The results are averaged over 10 random splittings.

6 Conclusions

In this paper, we propose a non-convex multi-task feature learning formulation based on the capped-ℓ1,ℓ1 regularization. The proposed formulation learns the features specific to each task as well as the common features shared among tasks. We propose to solve the non-convex optimization problem with a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm, using concave duality. We also present a detailed theoretical analysis of the parameter estimation error bound for the MSMTFL algorithm. The analysis shows that our MSMTFL algorithm achieves good performance under the sparse eigenvalue condition, which is weaker than the incoherence condition. Experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our proposed MSMTFL algorithm in comparison with state-of-the-art multi-task feature learning algorithms. In our future work, we will focus on a general non-convex regularization framework for multi-task feature learning settings (involving different loss functions and non-convex regularization terms) and derive theoretical bounds.

Acknowledgements

This work is supported in part by 973 Program (2013CB329503), NSFC (Grant No. 91120301, 60835002 and 61075004), NIH (R01 LM010730) and NSF (IIS-0953662, CCF-1025177).

Footnotes

* This work was completed when the first author visited Arizona State University.

References

  • [1] Argyriou A, Evgeniou T, Pontil M. Convex multi-task feature learning. Machine Learning. 2008;73(3):243–272.
  • [2] Bi J, Xiong T, Yu S, Dundar M, Rao R. An improved multi-task learning approach with applications in medical diagnosis. Machine Learning and Knowledge Discovery in Databases. 2008:117–132.
  • [3] Candes E, Tao T. Decoding by linear programming. IEEE Transactions on Information Theory. 2005;51(12):4203–4215.
  • [4] Chen J, Liu J, Ye J. Learning incoherent sparse and low-rank patterns from multiple tasks. SIGKDD. 2010:1179–1188. doi: 10.1145/2086737.2086742.
  • [5] Donoho D, Elad M, Temlyakov V. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory. 2006;52(1):6–18.
  • [6] Evgeniou T, Pontil M. Regularized multi-task learning. SIGKDD. 2004:109–117.
  • [7] Gong P, Ye J, Zhang C. Multi-stage multi-task feature learning. arXiv:1210.5806. 2012.
  • [8] Gong P, Ye J, Zhang C. Robust multi-task feature learning. SIGKDD. 2012:895–903. doi: 10.1145/2339530.2339672.
  • [9] Jalali A, Ravikumar P, Sanghavi S, Ruan C. A dirty model for multi-task learning. NIPS. 2010:964–972.
  • [10] Kim S, Xing E. Tree-guided group lasso for multi-task regression with structured sparsity. ICML. 2009:543–550.
  • [11] Lounici K, Pontil M, Tsybakov A, Van De Geer S. Taking advantage of sparsity in multi-task learning. COLT. 2009:73–82.
  • [12] Negahban S, Wainwright M. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. NIPS. 2008:1161–1168.
  • [13] Negahban S, Wainwright M. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics. 2011;39(2):1069–1097.
  • [14] Obozinski G, Taskar B, Jordan M. Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep. 2006.
  • [15] Obozinski G, Wainwright M, Jordan M. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics. 2011;39(1):1–47.
  • [16] Parameswaran S, Weinberger K. Large margin multi-task metric learning. NIPS. 2010:1867–1875.
  • [17] Quadrianto N, Smola A, Caetano T, Vishwanathan S, Petterson J. Multitask learning without label correspondences. NIPS. 2010:1957–1965.
  • [18] Schwaighofer A, Tresp V, Yu K. Learning Gaussian process kernels via hierarchical Bayes. NIPS. 2005:1209–1216.
  • [19] Van De Geer S, Bühlmann P. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics. 2009;3:1360–1392.
  • [20] Yang X, Kim S, Xing E. Heterogeneous multitask learning with joint sparsity constraints. NIPS. 2009:2151–2159.
  • [21] Yu K, Tresp V, Schwaighofer A. Learning Gaussian processes from multiple tasks. ICML. 2005:1012–1019.
  • [22] Zhang C, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics. 2008;36(4):1567–1594.
  • [23] Zhang C, Zhang T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science. 2012.
  • [24] Zhang J, Ghahramani Z, Yang Y. Learning multiple related tasks using latent independent component analysis. NIPS. 2006:1585–1592.
  • [25] Zhang T. Some sharp performance bounds for least squares regression with ℓ1 regularization. The Annals of Statistics. 2009;37:2109–2144.
  • [26] Zhang T. Analysis of multi-stage convex relaxation for sparse regularization. JMLR. 2010;11:1081–1107.
  • [27] Zhang T. Multi-stage convex relaxation for feature selection. Bernoulli. 2012.
  • [28] Zhang Y, Yeung D. Multi-task learning using generalized t process. AISTATS. 2010.
  • [29] Zhou J, Chen J, Ye J. Clustered multi-task learning via alternating structure optimization. NIPS. 2011:702–710.
