Published in final edited form as: Adv Neural Inf Process Syst. 2012;2012:296–304.

Multi-task Vector Field Learning

Binbin Lin 1, Sen Yang 2, Chiyuan Zhang 1, Jieping Ye 2, Xiaofei He 1
PMCID: PMC4201856  NIHMSID: NIHMS497483  PMID: 25332642

Abstract

Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL), which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible and is therefore expected to span a low dimensional subspace. (3) The vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and provide a convex relaxation method to solve the original non-convex problem. Experimental results on synthetic and real data demonstrate the effectiveness of the proposed approach.

1 Introduction

In many applications, labeled data are expensive and time consuming to obtain, while unlabeled data are abundant. The problem of using unlabeled data to improve generalization performance is often referred to as semi-supervised learning (SSL). It is well known that, in order to make semi-supervised learning work, some assumptions on the dependency between the predictor function and the marginal distribution of the data are needed. The manifold assumption [15, 5], which has been widely adopted in the last decade, states that the data are concentrated on a low dimensional manifold and that the predictor function varies smoothly along this manifold.

Multi-task learning was proposed to enhance generalization performance by learning multiple related tasks simultaneously. The abundant literature on multi-task learning demonstrates that learning performance indeed improves when the tasks are related [4, 6, 7]. The key step in MTL is to find the shared information among tasks. Evgeniou et al. [12] proposed a regularized MTL framework which assumes all tasks are related and close to each other. Ando and Zhang [2] proposed a structural learning framework, which assumes that the predictors of different tasks share a common structure on the underlying predictor space; an alternating structure optimization (ASO) method was proposed for linear predictors, where the task parameters are assumed to share a low dimensional subspace. Agarwal et al. [1] generalized the idea of sharing a subspace by assuming that all task parameters lie on a manifold.

In this paper, we consider semi-supervised multi-task learning (SSMTL). Although many SSL methods have been proposed in the literature [10], these methods are often not directly amenable to MTL extensions [18]. Liu et al. [18] proposed an SSMTL framework which encourages related models to have similar parameters; however, it requires that related tasks share similar representations [9]. Wang et al. [19] proposed another SSMTL method under the assumption that the tasks are clustered [4, 14], where the cluster structure is characterized by the task parameters of linear predictor functions. For linear predictors, these task parameters are simply the constant gradients of the predictor functions, which form a first-order differential structure. For general nonlinear predictor functions, we show it is more natural to capture the shared differential structure using vector fields.

In this paper, we propose a novel SSMTL formulation using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. In this way, a vector field naturally characterizes the differential structure of functions while also providing a natural way to exploit the geometric structure of data; these are the two most important aspects of SSMTL. Based on this idea, we develop the multi-task vector field learning (MTVFL) method, which learns the prediction functions and the vector fields simultaneously. The vector fields we learn are forced to be close to the gradient fields of the predictor functions. Within each task, the vector field is required to be as parallel as possible; we say that a vector field is parallel if its vectors are parallel along the geodesics on the manifold. In the extreme case when the manifold is a linear (or an affine) space, the geodesics are straight lines and the space spanned by these parallel vectors is simply a one-dimensional subspace. Thus, when the manifold is flat (i.e., has zero curvature) or the curvature is small, these parallel vectors are expected to concentrate on a low dimensional subspace. As an example, we can see from Fig. 1 that the parallel field on the plane spans a one-dimensional subspace and the parallel field on the Swiss roll spans a two-dimensional subspace. For the multi-task case, we further assume that the vector fields from all tasks share a low dimensional subspace. In essence, we use a first-order differential structure to characterize the shared structure of tasks and a second-order differential structure to characterize the task-specific parts. We formalize our idea in a regularization framework and provide a convex relaxation method to solve the original non-convex problem. We have performed experiments using both synthetic and real data; the results demonstrate the effectiveness of our proposed approach.

Figure 1. Examples of parallel fields. The parallel field on $\mathbb{R}^2$ spans a one-dimensional subspace and the parallel field on the Swiss roll spans a two-dimensional subspace.

2 Multi-task Learning: A Vector Field Approach

In this section, we first introduce vector fields and then present multi-task learning via exploring shared structure using vector fields.

2.1 Multi-task Learning Setting and Vector Fields

We first introduce notation and symbols. We are given $m$ tasks, with $n_l$ samples $x_i^l$, $i = 1, \ldots, n_l$, for the $l$-th task. The total number of samples is $n = \sum_l n_l$. For the $l$-th task, we assume the data $\{x_i^l\}$ lie on a $d_l$-dimensional manifold $\mathcal{M}^l$. All of these data manifolds are embedded in the same $D$-dimensional ambient space $\mathbb{R}^D$. It is worth noting that the dimensions of different data manifolds are not required to be the same. Without loss of generality, we assume the first $n'_l$ ($n'_l < n_l$) samples are labeled, with $y_j^l \in \mathbb{R}$ for regression and $y_j^l \in \{-1, 1\}$ for classification, $j = 1, \ldots, n'_l$. The total number of labeled samples is $n' = \sum_l n'_l$. For the $l$-th task, we denote the regression or classification function by $f_l$. The goal of semi-supervised multi-task learning is to learn the function values on the unlabeled data, i.e., $f_l(x_i^l)$, $n'_l + 1 \le i \le n_l$.

Given the $l$-th task, we first construct a nearest neighbor graph by either $\varepsilon$-neighborhood or $k$ nearest neighbors. Let $x_i^l \sim x_j^l$ denote that $x_i^l$ and $x_j^l$ are neighbors. Let $w_{ij}^l$ denote the weight measuring the similarity between $x_i^l$ and $x_j^l$; it can be approximated by the heat kernel weight or the simple 0-1 weight. For each point $x_i^l$, we estimate its tangent space $T_{x_i^l}\mathcal{M}^l$ by performing PCA on its neighborhood. We choose the eigenvectors corresponding to the $d_l$ largest eigenvalues as the basis, since the tangent space $T_{x_i^l}\mathcal{M}^l$ has the same dimension as the manifold $\mathcal{M}^l$. Let $T_i^l \in \mathbb{R}^{D \times d_l}$ be the matrix whose columns constitute an orthonormal basis for $T_{x_i^l}\mathcal{M}^l$. It is easy to show that $P_i^l = T_i^l T_i^{lT}$ is the unique orthogonal projection from $\mathbb{R}^D$ onto the tangent space $T_{x_i^l}\mathcal{M}^l$ [13]. That is, for any vector $a \in \mathbb{R}^D$, we have $P_i^l a \in T_{x_i^l}\mathcal{M}^l$ and $(a - P_i^l a) \perp P_i^l a$.
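A minimal sketch of this tangent-space estimation step, using local PCA on k-nearest-neighbor patches; the function name and neighborhood convention are our own assumptions, not the authors' implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_tangent_spaces(X, d, k=10):
    """Estimate an orthonormal tangent basis T_i (D x d) at every point by
    PCA on its k-nearest-neighbor patch.  X is an (n, D) data matrix."""
    n, D = X.shape
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)                     # idx[i, 0] is the point itself
    T = np.zeros((n, D, d))
    for i in range(n):
        patch = X[idx[i]] - X[idx[i]].mean(axis=0)  # center the local neighborhood
        # top-d right singular vectors = principal directions of the patch
        _, _, Vt = np.linalg.svd(patch, full_matrices=False)
        T[i] = Vt[:d].T                             # columns: orthonormal tangent basis
    return T

# The orthogonal projection onto the tangent space at x_i is P_i = T[i] @ T[i].T,
# so for any a in R^D, P_i @ a lies in the estimated tangent space.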

We now formally define the vector field and show how to represent it in the discrete case.

Definition 2.1 ([16])

A vector field $X$ on the manifold $\mathcal{M}$ is a continuous map $X : \mathcal{M} \to T\mathcal{M}$, where $T\mathcal{M}$ is the set of tangent spaces, written as $p \mapsto X_p$, with the property that for each $p \in \mathcal{M}$, $X_p$ is an element of $T_p\mathcal{M}$.

We can think of a vector field on the manifold in the same way as we think of a vector field in Euclidean space: an arrow with a given magnitude and direction attached to each point, chosen to be tangent to the manifold. A vector field $V$ on the manifold is called a gradient field if there exists a function $f$ on the manifold such that $\nabla f = V$, where $\nabla$ is the covariant derivative on the manifold. Gradient fields are therefore a particular kind of vector field; they play a critical role in connecting vector fields and functions.

Let $V_l$ be a vector field on the manifold $\mathcal{M}^l$. For each point $x_i^l$, let $V_{x_i^l}$ denote the value of the vector field $V_l$ at $x_i^l$. Recalling the definition of a vector field, $V_{x_i^l}$ should be a vector in the tangent space $T_{x_i^l}\mathcal{M}^l$. Therefore, we can represent it in the coordinates of the tangent space as $V_{x_i^l} = T_i^l v_i^l$, where $v_i^l \in \mathbb{R}^{d_l}$ is the local representation of $V_{x_i^l}$ with respect to $T_i^l$. Let $f_l$ be a function on the manifold $\mathcal{M}^l$. Abusing notation slightly, we also use $f_l$ to denote the vector $f_l = (f_l(x_1^l), \ldots, f_l(x_{n_l}^l))^T$ and use $V_l$ to denote the vector $V_l = (v_1^{lT}, \ldots, v_{n_l}^{lT})^T \in \mathbb{R}^{d_l n_l}$. That is, $V_l$ is a $d_l n_l$-dimensional column vector which concatenates all the $v_i^l$'s for a fixed $l$. Then for each task, we aim to compute the vector $f_l$ and the vector $V_l$.

2.2 Multi-task Vector Field Learning

In this section, we introduce multi-task vector field learning (MTVFL).

Many existing MTL methods capture task relatedness by sharing task parameters. For linear predictors, the task parameters are simply the constant gradient vectors of the predictor functions. For general nonlinear predictor functions, we show it is natural to capture the shared differential structure using vector fields. Let $f$ denote the vector $(f_1^T, \ldots, f_m^T)^T$ and $V$ denote the vector $(V_1^T, \ldots, V_m^T)^T = (v_1^{1T}, \ldots, v_{n_m}^{mT})^T$. We propose to learn $f$ and $V$ simultaneously:

  • The vector field $V_l$ should be close to the gradient field $\nabla f_l$ of $f_l$, which can be formulated as follows:
    $$\min_{f,V} R_1(f,V) = \sum_{l=1}^m R_1(f_l, V_l) := \sum_{l=1}^m \int_{\mathcal{M}^l} \|\nabla f_l - V_l\|^2 \quad (1)$$
  • The vector field $V_l$ should be as parallel as possible:
    $$\min_{V} R_2(V) = \sum_{l=1}^m R_2(V_l) := \sum_{l=1}^m \int_{\mathcal{M}^l} \|\nabla V_l\|_{HS}^2 \quad (2)$$

    where $\nabla$ is the covariant derivative on the manifold and $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt tensor norm [11]. $\nabla V_l$ measures the change of the vector field; therefore, minimizing $\int_{\mathcal{M}^l}\|\nabla V_l\|_{HS}^2$ enforces the vector field $V_l$ to be parallel.

  • All vector fields share an $h$-dimensional subspace, where $h$ is a predefined parameter:
    $$T_i^l v_i^l = u_i^l + \Theta^T w_i^l, \quad \text{s.t.} \quad \Theta\Theta^T = I_{h \times h}. \quad (3)$$

Since these vector fields are assumed to share a low dimensional subspace, the residual vector $u_i^l$ is expected to be small. We define another term $R_3$ to control the complexity as follows:

$$R_3(v_i^l, w_i^l, \Theta) = \sum_{l=1}^m \sum_{i=1}^{n_l} \left( \alpha\|u_i^l\|^2 + \beta\|T_i^l v_i^l\|^2 \right) \quad (4)$$
$$= \sum_{l=1}^m \sum_{i=1}^{n_l} \left( \alpha\|T_i^l v_i^l - \Theta^T w_i^l\|^2 + \beta\|T_i^l v_i^l\|^2 \right) \quad (5)$$

Note that $\alpha$ and $\beta$ are pre-specified coefficients indicating the importance of the corresponding regularization components. Since we would like the vector field to be parallel, the vector norms are not expected to be too small; and since we assume the vector fields share a low dimensional subspace, the residual vector $u_i^l$ is expected to be small. In practice we suggest using a small $\beta$ and a large $\alpha$. By setting $\beta = 0$, $R_3$ reduces to the regularization term proposed in ASO if we also replace the tangent vectors by the task parameters. Therefore, this formulation is a generalization of ASO.

It can be verified that $w_i^l = \Theta T_i^l v_i^l = \arg\min_{w_i^l} R_3(v_i^l, w_i^l, \Theta)$. Thus we have $u_i^l = T_i^l v_i^l - \Theta^T w_i^l = (I - \Theta^T\Theta) T_i^l v_i^l$. Therefore, we can rewrite $R_3$ as follows:

$$R_3(V,\Theta) = \sum_{l=1}^m \sum_{i=1}^{n_l} \left( \alpha\|u_i^l\|^2 + \beta\|T_i^l v_i^l\|^2 \right) = \sum_{l=1}^m \sum_{i=1}^{n_l} \left( \alpha\|(I-\Theta^T\Theta)T_i^l v_i^l\|^2 + \beta\|T_i^l v_i^l\|^2 \right) = \alpha V^T A_\Theta V + \beta V^T H V \quad (6)$$

where $H$ is a block diagonal matrix with diagonal blocks $T_i^{lT} T_i^l$, and $A_\Theta$ is another block diagonal matrix with diagonal blocks $T_i^{lT}(I - \Theta^T\Theta)^T(I - \Theta^T\Theta)T_i^l = T_i^{lT}(I - \Theta^T\Theta)T_i^l$. Therefore, the proposed formulation solves the following optimization problem:

$$\arg\min_{f,V,\Theta} E(f,V,\Theta) = R_0(f) + \lambda_1 R_1(f,V) + \lambda_2 R_2(V) + \lambda_3 R_3(V,\Theta), \quad \text{s.t. } \Theta\Theta^T = I_{h\times h} \quad (7)$$

where $R_0(f)$ is the loss function. For simplicity, we use the quadratic loss function $R_0(f) = \sum_{l=1}^m \sum_{i=1}^{n'_l} (f_l(x_i^l) - y_i^l)^2$.
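To make the construction of $H$ and $A_\Theta$ in Eq. (6) concrete, here is a minimal sketch for a single task; the function name, input shapes, and use of dense matrices are our own assumptions, not the authors' implementation.

import numpy as np
from scipy.linalg import block_diag

def build_H_and_A(T, Theta):
    """T: list of (D, d_l) tangent bases T_i^l for one task; Theta: (h, D)
    with Theta @ Theta.T = I.  Returns the block diagonal matrices H and
    A_Theta whose diagonal blocks are given below Eq. (6)."""
    D = Theta.shape[1]
    R = np.eye(D) - Theta.T @ Theta          # I - Theta^T Theta (symmetric, idempotent)
    H_blocks = [Ti.T @ Ti for Ti in T]       # T_i^T T_i (identity for orthonormal T_i)
    A_blocks = [Ti.T @ R @ Ti for Ti in T]   # T_i^T (I - Theta^T Theta) T_i
    return block_diag(*H_blocks), block_diag(*A_blocks)

# With the stacked coordinate vector V of this task, R_3 contributes
# alpha * V @ A_Theta @ V + beta * V @ H @ V  to the objective.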

2.3 Objective Function in the Matrix Form

To simplify Eq. (7), in this section we rewrite our objective function in the matrix form.

Using the discrete methods in [17], we have the following discrete form equations:

$$R_1(f_l, V_l) = \sum_i \sum_{j \sim i} w_{ij}^l \left( (x_j^l - x_i^l)^T T_i^l v_i^l - f_j^l + f_i^l \right)^2 \quad (8)$$
$$R_2(V_l) = \sum_{i \sim j} w_{ij}^l \left\| P_i^l T_j^l v_j^l - T_i^l v_i^l \right\|^2 \quad (9)$$

Interestingly, with some algebraic transformations, we have the following matrix forms for our objective functions:

$$R_1(f_l, V_l) = 2f_l^T L_l f_l + V_l^T G_l V_l - 2V_l^T C_l f_l \quad (10)$$

where $L_l$ is the graph Laplacian matrix, $G_l$ is a $d_l n_l \times d_l n_l$ block diagonal matrix, and $C_l = [C_1^{lT}, \ldots, C_{n_l}^{lT}]^T$ is a $d_l n_l \times n_l$ block matrix. Denoting the $i$-th $d_l \times d_l$ diagonal block of $G_l$ by $G_{ii}^l$ and the $i$-th $d_l \times n_l$ block of $C_l$ by $C_i^l$, we have

$$G_{ii}^l = \sum_{j \sim i} w_{ij}^l\, T_i^{lT}(x_j^l - x_i^l)(x_j^l - x_i^l)^T T_i^l, \qquad C_i^l = \sum_{j \sim i} w_{ij}^l\, T_i^{lT}(x_j^l - x_i^l)\, s_{ij}^{lT} \quad (11)$$

where $s_{ij}^l \in \mathbb{R}^{n_l}$ is a selection vector whose elements are all zero except for the $i$-th element being $-1$ and the $j$-th element being $1$. $R_2$ becomes

$$R_2(V_l) = V_l^T B_l V_l \quad (12)$$

where $B_l$ is a $d_l n_l \times d_l n_l$ sparse block matrix. If we index each $d_l \times d_l$ block by $B_{ij}^l$, then we have

$$B_{ii}^l = \sum_{j \sim i} w_{ij}^l \left( Q_{ij}^l Q_{ij}^{lT} + I \right) \quad (13)$$
$$B_{ij}^l = \begin{cases} -2 w_{ij}^l Q_{ij}^l, & \text{if } x_i \sim x_j \\ 0, & \text{otherwise} \end{cases} \quad (14)$$

where $Q_{ij}^l = T_i^{lT} T_j^l$. It is worth noting that both $R_1$ and $R_2$ depend on the tangent spaces $T_i^l$.

Thus we can further write R1(f, V) and R2(V) as follows

$$R_1(f,V) = \sum_{l=1}^m R_1(f_l, V_l) = 2f^T L f + V^T G V - 2V^T C f \quad (15)$$
$$R_2(V) = \sum_{l=1}^m R_2(V_l) = V^T B V \quad (16)$$

where L, G and B are block diagonal matrices with the corresponding l-th block matrix being Ll, Gl and Bl, respectively. C is a column block matrix with the l-th block matrix being Cl.
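As an illustration of Eqs. (10)-(16), the following sketch assembles the per-task matrices $L_l$, $G_l$, $C_l$ and $B_l$ with dense NumPy arrays (a practical implementation would use sparse matrices); the global $L$, $G$, $B$, $C$ are then the block diagonal/stacked matrices described above. The interface and names are assumptions on our part.

import numpy as np

def build_task_matrices(X, T, W):
    """X: (n, D) data of one task; T: (n, D, d) tangent bases; W: (n, n)
    symmetric weight matrix with W[i, j] > 0 iff x_i ~ x_j.
    Returns L (n x n), G, B (dn x dn) and C (dn x n) as in Eqs. (10)-(14)."""
    n, D, d = T.shape
    L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
    G = np.zeros((d * n, d * n))
    B = np.zeros((d * n, d * n))
    C = np.zeros((d * n, n))
    for i in range(n):
        ri = slice(d * i, d * (i + 1))
        for j in np.nonzero(W[i])[0]:
            w = W[i, j]
            te = T[i].T @ (X[j] - X[i])                  # edge vector in T_i coordinates
            G[ri, ri] += w * np.outer(te, te)            # Eq. (11), block G_ii
            C[ri, i] += -w * te                          # s_ij has -1 at position i ...
            C[ri, j] += w * te                           # ... and +1 at position j
            Q = T[i].T @ T[j]                            # Q_ij = T_i^T T_j
            B[ri, ri] += w * (Q @ Q.T + np.eye(d))       # Eq. (13)
            B[ri, slice(d * j, d * (j + 1))] += -2 * w * Q   # Eq. (14)
    return L, G, C, B

# Then R_1(f_l, V_l) = 2 f @ L @ f + V @ G @ V - 2 V @ C @ f  and  R_2(V_l) = V @ B @ V.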

Let $\tilde{I}$ denote the $n \times n$ diagonal matrix with $\tilde{I}_{ii} = 1$ if the $i$-th data point is labeled and $\tilde{I}_{ii} = 0$ otherwise, and let $y \in \mathbb{R}^n$ be the column vector whose $i$-th element is the label of the $i$-th data point if it is labeled and $0$ otherwise. Then $R_0(f) = \frac{1}{n'}(f-y)^T\tilde{I}(f-y)$. Finally, we obtain the following matrix form of the objective function in Eq. (7), with the constraint $\Theta\Theta^T = I_{h\times h}$:

$$E(f,V,\Theta) = R_0(f) + \lambda_1 R_1(f,V) + \lambda_2 R_2(V) + \lambda_3 R_3(V,\Theta)$$
$$= \frac{1}{n'}(f-y)^T\tilde{I}(f-y) + \lambda_1\left(2f^T L f + V^T G V - 2V^T C f\right) + \lambda_2 V^T B V + \lambda_3 V^T(\alpha A_\Theta + \beta H)V$$
$$= \frac{1}{n'}(f-y)^T\tilde{I}(f-y) + 2\lambda_1 f^T L f + V^T\left(\lambda_1 G + \lambda_2 B + \lambda_3(\alpha A_\Theta + \beta H)\right)V - 2\lambda_1 V^T C f$$

It is worth noting that the matrices $L$, $G$, $B$ and $C$ depend only on the data, while only the matrix $A_\Theta$ depends on $\Theta$.

3 Optimization

In this section, we discuss how to solve the following optimization problem:

$$\arg\min_{f,V,\Theta} E(f,V,\Theta), \quad \text{s.t. } \Theta\Theta^T = I_{h\times h} \quad (17)$$

We use the alternating optimization to solve this problem.

  • Optimization of f and V. For a fixed $\Theta$, the optimal $f$ and $V$ can be obtained via solving
    $$\arg\min_{f,V} E(f,V,\Theta) \quad (18)$$
  • Optimization of Θ. For a fixed $V$, the optimal $\Theta$ can be obtained via solving
    $$\arg\min_{\Theta} R_3(V,\Theta), \quad \text{s.t. } \Theta\Theta^T = I_{h\times h} \quad (19)$$
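The overall procedure can be summarized by the following loop; solve_f_V and update_Theta are hypothetical helper names standing in for the subproblem solvers described in Sections 3.1 and 3.2.

def multi_task_vector_field_learning(solve_f_V, update_Theta, Theta0, n_iters=20):
    """Alternate between the two subproblems of Eqs. (18)-(19):
    solve_f_V(Theta) -> (f, V)  solves the linear system for a fixed Theta;
    update_Theta(V)  -> Theta   updates the shared subspace for a fixed V."""
    Theta = Theta0
    for _ in range(n_iters):
        f, V = solve_f_V(Theta)     # Section 3.1
        Theta = update_Theta(V)     # Section 3.2
    return f, V, Theta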

3.1 Optimization of f and V for a Given Θ

When $\Theta$ is fixed, the objective function is similar to that of the single-task case. However, there are some differences worth mentioning. First, when constructing the nearest neighbor graph, data points from different tasks are disconnected; therefore, tangent spaces are estimated independently for each task. Second, we do not require the tangent spaces of different tasks to have the same dimension.

We note that

$$\frac{\partial E}{\partial f} = 2\left(\frac{1}{n'}\tilde{I} + 2\lambda_1 L\right)f - 2\lambda_1 C^T V - \frac{2}{n'}y \quad (20)$$
$$\frac{\partial E}{\partial V} = -2\lambda_1 C f + 2\left(\lambda_1 G + \lambda_2 B + \lambda_3(\alpha A_\Theta + \beta H)\right)V \quad (21)$$

Requiring the derivatives to vanish, we obtain the following linear system:

$$\begin{pmatrix} \frac{1}{n'}\tilde{I} + 2\lambda_1 L & -\lambda_1 C^T \\ -\lambda_1 C & \lambda_1 G + \lambda_2 B + \lambda_3(\alpha A_\Theta + \beta H) \end{pmatrix} \begin{pmatrix} f \\ V \end{pmatrix} = \begin{pmatrix} \frac{1}{n'}y \\ 0 \end{pmatrix} \quad (22)$$

Except for the matrix $A_\Theta$, all other matrices can be computed in advance and do not change during the iterative process.
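A sketch of this step with SciPy's sparse solver, assuming all matrices in Eq. (22) have already been assembled; the function name, argument list, and use of a direct solver are our own choices.

import numpy as np
from scipy.sparse import bmat
from scipy.sparse.linalg import spsolve

def solve_f_V(I_label, y, L, C, G, B, A_theta, H, lam1, lam2, lam3, alpha, beta, n_labeled):
    """Solve the linear system of Eq. (22) for f and V with Theta fixed.
    I_label is the 0/1 diagonal label indicator, y the zero-padded label vector."""
    n = L.shape[0]
    top_left = I_label / n_labeled + 2 * lam1 * L
    bottom_right = lam1 * G + lam2 * B + lam3 * (alpha * A_theta + beta * H)
    M = bmat([[top_left, -lam1 * C.T],
              [-lam1 * C, bottom_right]], format="csr")
    rhs = np.concatenate([np.asarray(y) / n_labeled, np.zeros(bottom_right.shape[0])])
    sol = spsolve(M, rhs)
    return sol[:n], sol[n:]          # f, then the stacked vector field coordinates V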

3.2 Optimization of Θ for a Given V

Since the functions $R_0(f)$, $R_1(f, V)$ and $R_2(V)$ do not depend on the variable $\Theta$, we only need to optimize $R_3(V, \Theta)$ subject to $\Theta\Theta^T = I_{h\times h}$.

Recalling Eq. (6), we rewrite $R_3(V, \Theta)$ as follows:

$$\hat{\Theta} = \arg\min_\Theta \sum_{l=1}^m \sum_{i=1}^{n_l} \alpha\left( \left\|(I-\Theta^T\Theta)T_i^l v_i^l\right\|^2 + \frac{\beta}{\alpha}\left\|T_i^l v_i^l\right\|^2 \right) = \arg\min_\Theta \alpha\,\mathrm{tr}\!\left(V^T\left(\left(1+\tfrac{\beta}{\alpha}\right)I - \Theta^T\Theta\right)V\right) = \arg\max_\Theta \mathrm{tr}\!\left(\Theta V V^T \Theta^T\right) \quad (23)$$

where $V = (T_1^1 v_1^1, \ldots, T_{n_m}^m v_{n_m}^m)$ is a $D \times n$ matrix with each column being a tangent vector (a slight abuse of the notation $V$). The optimal $\hat{\Theta}$ can be obtained using the singular value decomposition (SVD). Let $V = Z_1 \Sigma Z_2^T$ be the SVD of $V$, where the singular values in $\Sigma$ are arranged in decreasing order. Then the rows of $\hat{\Theta}$ are given by the first $h$ columns of $Z_1$.
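In code, this update is a single thin SVD; the sketch below assumes the ambient-space tangent vectors $T_i^l v_i^l$ from all tasks have been collected into a list (names are ours).

import numpy as np

def update_Theta(tangent_vectors, h):
    """Eq. (23): stack the tangent vectors T_i^l v_i^l as the columns of a
    D x n matrix and take its top-h left singular vectors as the rows of Theta."""
    V_mat = np.column_stack(tangent_vectors)               # D x n
    Z1, _, _ = np.linalg.svd(V_mat, full_matrices=False)   # singular values in decreasing order
    return Z1[:, :h].T                                     # Theta is h x D, Theta @ Theta.T = I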

3.3 Convex Relaxation

The orthogonality constraint in Eq. (23) is non-convex. Next, we propose to convert Eq. (23) into a convex formulation by relaxing its feasible domain into a convex set.

Let $\eta = \beta/\alpha$. It can be verified that the following equality holds: $(1 + \eta)I - \Theta^T\Theta = \eta(1 + \eta)(\eta I + \Theta^T\Theta)^{-1}$. Then we can rewrite $R_3(V, \Theta)$ as $R_3(V, \Theta) = \alpha\eta(1 + \eta)\,\mathrm{tr}\!\left(V^T(\eta I + \Theta^T\Theta)^{-1}V\right)$. Let $\mathcal{M}_e$ be defined as $\mathcal{M}_e = \{M : M = \Theta^T\Theta, \Theta\Theta^T = I, \Theta \in \mathbb{R}^{h\times D}\}$. The convex hull [8] of $\mathcal{M}_e$ can be expressed as the convex set $\mathcal{M}_c = \{M : \mathrm{tr}(M) = h, M \preceq I, M \in \mathbb{S}_+^D\}$, and each element of $\mathcal{M}_e$ is referred to as an extreme point of $\mathcal{M}_c$.

To convert the non-convex problem in Eq. (23) into a convex formulation, we replace $\Theta^T\Theta$ with $M$ and relax the feasible domain into a convex set based on the relationship between $\mathcal{M}_e$ and $\mathcal{M}_c$ presented above; this results in the following optimization problem:

$$\arg\min_{M} R_3(V, M), \quad \text{s.t. } \mathrm{tr}(M) = h, \ M \preceq I, \ M \in \mathbb{S}_+^D \quad (24)$$

where $R_3(V, M)$ is defined as $R_3(V, M) = \alpha\eta(1 + \eta)\,\mathrm{tr}\!\left(V^T(\eta I + M)^{-1}V\right)$. It follows from [3, Theorem 3.1] that the relaxed $R_3$ is jointly convex in $V$ and $M$. After we obtain the optimal $M$, the optimal $\Theta$ can be approximated using the first $h$ eigenvectors (corresponding to the $h$ largest eigenvalues) of the optimal $M$.
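The relaxed objective and the final recovery of Θ from an optimal M can be sketched as follows; solving the convex problem in Eq. (24) itself would be delegated to a standard convex solver, which we do not show, and the helper names are ours.

import numpy as np

def relaxed_R3(V_mat, M, alpha, eta):
    """R_3(V, M) = alpha * eta * (1 + eta) * tr(V^T (eta*I + M)^{-1} V), Eq. (24).
    V_mat is the D x n matrix of tangent vectors; M is a D x D matrix in M_c."""
    D = M.shape[0]
    S = np.linalg.solve(eta * np.eye(D) + M, V_mat)      # (eta*I + M)^{-1} V
    return alpha * eta * (1 + eta) * np.sum(V_mat * S)   # equals tr(V^T S)

def theta_from_M(M, h):
    """Approximate Theta by the eigenvectors of M with the h largest eigenvalues."""
    _, U = np.linalg.eigh(M)        # eigenvalues returned in ascending order
    return U[:, -h:].T              # rows of Theta: top-h eigenvectors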

4 Experiments

In this section, we evaluate our method on one synthetic data set and one real data set. We compare the proposed Multi-Task Vector Field Learning (MTVFL) algorithm against the following methods: (a) Single Task Vector Field Learning (STVFL, or PFR), (b) Alternating Structure Optimization (ASO), and (c) its nonlinear version, Kernelized Alternating Structure Optimization (KASO). The kernel constructed in KASO uses both labeled and unlabeled data, so it can be viewed as a semi-supervised MTL method.

4.1 Synthetic Data

We first construct a synthetic data set to evaluate our method in comparison with the semi-supervised single task learning method (STVFL). We generate two data sets, a Swiss roll and a Swiss roll with a hole, both embedded in 3-dimensional Euclidean space. The Swiss roll is generated by the equations $x = t_1 \cos t_1$, $y = t_2$, $z = t_1 \sin t_1$, where $t_1 \in [3\pi/2, 9\pi/2]$ and $t_2 \in [0, 21]$. The Swiss roll with a hole excludes points with $t_1 \in [9, 12]$ and $t_2 \in [9, 14]$. The ground truth function is $f(x, y, z) = t_1$. This test is a semi-supervised multi-task regression problem: we randomly select a number of labeled data points in each task and try to predict the values on the remaining unlabeled data.

Each data set has 400 points. We construct a nearest neighbor graph for each task. The number of nearest neighbors is set to 5 and the manifold dimension is set to 2, as both are 2-dimensional manifolds. The shared subspace dimension is set to 2. The regularization parameters are chosen via cross-validation. We perform 100 independent trials with randomly selected labeled sets. The performance is measured by the mean squared error (MSE). We also tried ASO and KASO; however, they perform poorly since the data are highly nonlinear. The MSE averaged over the two tasks is presented in Fig. 2. We observe that MTVFL consistently outperforms STVFL, which demonstrates the effectiveness of SSMTL.

Figure 2. (a) Performance of MTVFL and STVFL; (b) the singular value distribution.

We also show the singular value distribution of the ground truth gradient fields. Given the ground truth $f$, we can compute the gradient field $V$ by taking the derivative of $R_1(f, V)$ with respect to $V$. Requiring the derivative to vanish, we get the equation $GV = Cf$. After obtaining $V$, the gradient vector $V_{x_i^l}$ at each point is given by $V_{x_i^l} = T_i^l v_i^l$. We then perform PCA on these vectors; the singular values of their covariance matrix are shown in Fig. 2(b). As can be seen, the number of dominant singular values is 2, which indicates that the ground truth gradient fields concentrate on a 2-dimensional subspace.
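The singular value analysis above can be reproduced with a short sketch, assuming the per-task matrices $G$, $C$, the ground truth vector $f$, and the tangent bases are available; the helper name and input format are our own assumptions.

import numpy as np

def gradient_field_spectrum(tasks):
    """tasks: list of (G, C, f_true, T) per task, with T of shape (n, D, d).
    Solves G V = C f for the ground-truth gradient field, maps the local
    coordinates back to ambient vectors T_i v_i, and returns the singular
    values of the covariance matrix of all these vectors."""
    ambient = []
    for G, C, f_true, T in tasks:
        n, D, d = T.shape
        V = np.linalg.lstsq(G, C @ f_true, rcond=None)[0]            # stacked local coords
        ambient.extend(T[i] @ V[d * i:d * (i + 1)] for i in range(n))
    A = np.vstack(ambient)
    A = A - A.mean(axis=0)
    return np.linalg.svd(A.T @ A / len(A), compute_uv=False)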

4.2 Landmine Detection

We use the landmine data set studied in [20]. There are 29 data sets in total, collected from various real landmine fields. Each example is represented by a 9-dimensional feature vector with a binary label, which is either 1 for a landmine or 0 for clutter. The problem of landmine detection is to predict the labels of unlabeled objects. Among the 29 data sets, sets 1–15 correspond to relatively highly foliated regions and sets 16–29 correspond to bare earth or desert regions. Following [20], we choose data sets 1–10 and 16–24 to form 19 tasks.

The basic setup of all the algorithms is as follows. First, we construct a nearest neighbor graph for each task. The number of nearest neighbors is set to 10 and the manifold dimension is set to 4 empirically; these two parameters are the same for all 19 tasks. The shared subspace dimension is set to 5 for both MTVFL and ASO, and the shared subspace dimension of KASO is set to 10. All the regularization parameters of the four algorithms are chosen via cross-validation. Note that KASO needs to construct a kernel matrix; we use a Gaussian kernel in KASO and the Gaussian width is chosen by searching within [0.01, 10].

We perform 100 independent trials with randomly selected labeled sets. We measure the performance by AUC, the area under the Receiver Operating Characteristic (ROC) curve; a large AUC value indicates good classification performance. Since the data have severely unbalanced labels, following [20], we ensure that there is at least one "1" and one "0" labeled sample in the training set of each task. The AUC averaged over the 19 tasks is presented in Fig. 3(a). As can be seen, MTVFL consistently outperforms the other three algorithms. When the number of labeled data increases, KASO outperforms STVFL. ASO does not improve much as the amount of labeled data increases, probably because the data have severely unbalanced labels and the ground truth predictor function is nonlinear. We also show the singular value distribution of the ground truth gradient fields in Fig. 3(b). The computation of the singular values is the same as in Section 4.1. As can be seen from Fig. 3(b), the number of dominant singular values is 5. The first 5 singular values account for 91.34% of the total sum, which indicates that the ground truth gradient fields concentrate on a 5-dimensional subspace.

Figure 3. (a) Performance of various MTL algorithms; (b) the singular value distribution.

5 Conclusion

In this paper, we propose a new semi-supervised multi-task learning formulation using vector fields. We show that vector fields can naturally capture the shared differential structure among tasks as well as the structure of the data manifolds, both of which are crucial for semi-supervised multi-task learning. Our experimental results on synthetic and real data demonstrate the effectiveness of the proposed method. This work suggests several interesting future directions. One is the relation between learning on task parameters and learning on vector fields; ultimately, both are learning functions. Another direction is to apply other assumptions made in the multi-task learning community, e.g., the cluster assumption, to vector field learning.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 61125203 and 61233011, the National Basic Research Program of China (973 Program) under Grant 2012CB316404, NIH (R01 LM010730) and NSF (IIS-0953662, CCF-1025177).

Footnotes

1. The data set is available at http://www.ee.duke.edu/~lcarin/LandmineData.zip.

Contributor Information

Binbin Lin, Email: binbinlinzju@gmail.com.

Sen Yang, Email: senyang@asu.edu.

Chiyuan Zhang, Email: chiyuan.zhang.zju@gmail.com.

Jieping Ye, Email: jieping.ye@asu.edu.

Xiaofei He, Email: xiaofeihe@gmail.com.

References

  • 1. Agarwal A, Daumé H III, Gerber S. Learning multiple tasks using manifold regularization. Advances in Neural Information Processing Systems. 2010;23:46–54.
  • 2. Ando RK, Zhang T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research. 2005;6:1817–1853.
  • 3. Argyriou A, Micchelli CA, Pontil M, Ying Y. A spectral regularization framework for multi-task structure learning. Advances in Neural Information Processing Systems. 2008;20:25–32.
  • 4. Bakker B, Heskes T. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research. 2003;4:83–99.
  • 5. Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research. 2006;7:2399–2434.
  • 6. Ben-David S, Gehrke J, Schuller R. A theoretical framework for learning from a pool of disparate data sources. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002:443–449.
  • 7. Ben-David S, Schuller R. Exploiting task relatedness for multiple task learning. Conference on Learning Theory. 2003:567–580.
  • 8. Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004.
  • 9. Carlson A, Betteridge J, Wang RC, Hruschka ER Jr, Mitchell TM. Coupled semi-supervised learning for information extraction. Proceedings of the Third ACM International Conference on Web Search and Data Mining. 2010:101–110.
  • 10. Chapelle O, Schölkopf B, Zien A, editors. Semi-Supervised Learning. MIT Press; 2006.
  • 11. Defant A, Floret K. Tensor Norms and Operator Ideals. North-Holland Mathematics Studies. North-Holland, Amsterdam; 1993.
  • 12. Evgeniou T, Micchelli CA, Pontil M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research. 2005;6:615–637.
  • 13. Golub GH, Van Loan CF. Matrix Computations. 3rd ed. Johns Hopkins University Press; 1996.
  • 14. Jacob L, Bach F, Vert JP. Clustered multi-task learning: A convex formulation. Advances in Neural Information Processing Systems. 2009;21:745–752.
  • 15. Lafferty J, Wasserman L. Statistical analysis of semi-supervised regression. Advances in Neural Information Processing Systems. 2007;20:801–808.
  • 16. Lee JM. Introduction to Smooth Manifolds. 2nd ed. Springer Verlag; New York: 2003.
  • 17. Lin B, Zhang C, He X. Semi-supervised regression via parallel field regularization. Advances in Neural Information Processing Systems. 2011;24:433–441.
  • 18. Liu Q, Liao X, Carin L. Semi-supervised multitask learning. Advances in Neural Information Processing Systems. 2008;20:937–944.
  • 19. Wang F, Wang X, Li T. Semi-supervised multi-task learning with task regularizations. Proceedings of the 2009 Ninth IEEE International Conference on Data Mining. IEEE Computer Society; 2009:562–568.
  • 20. Xue Y, Liao X, Carin L, Krishnapuram B. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research. 2007;8:35–63.
