Abstract
A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local “expert”, and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the “experts” allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results.
Keywords: classification, incomplete data, expert, Dirichlet process, variational Bayesian, multitask learning
1. Introduction
In many applications one must deal with data that have been collected incompletely. For example, in censuses and surveys, some participants may not respond to certain questions (Rubin, 1987); in email spam filtering, server information may be unavailable for emails from external sources (Dick et al., 2008); in medical studies, measurements on some subjects may be partially lost at certain stages of the treatment (Ibrahim, 1990); in DNA analysis, gene-expression microarrays may be incomplete due to insufficient resolution, image corruption, or simply dust or scratches on the slide (Wang et al., 2006); in sensing applications, a subset of sensors may be absent or fail to operate at certain regions (Williams and Carin, 2005). Unlike in semi-supervised learning (Ando and Zhang, 2005) where missing labels (responses) must be addressed, features (inputs) are partially missing in the aforementioned incomplete-data problems. Since most data analysis procedures (for example, regression and classification) are designed for complete data, and cannot be directly applied to incomplete data, the appropriate handling of missing data is challenging.
Traditionally, data are often “completed” by ad hoc editing, such as case deletion and single imputation, where feature vectors with missing values are simply discarded or completed with specific values in the initial stage of analysis, before the main inference (for example, mean imputation and regression imputation; see Schafer and Graham, 2002). Although analysis procedures designed for complete data become applicable after these edits, the shortcomings are clear. For case deletion, discarding information is generally inefficient, especially when data are scarce, and the remaining complete data may be statistically unrepresentative. More importantly, even if the incomplete-data problem is eliminated by ignoring data with missing features in the training phase, it remains unavoidable in the test stage, since test data cannot be discarded simply because a portion of their features are missing. For single imputation, the main concern is that the uncertainty of the missing features is ignored by imputing fixed values.
The work of Rubin (1976) developed a theoretical framework for incomplete-data problems, where widely-cited terminology for missing patterns was first defined. It was proven that ignoring the missing mechanism is appropriate (Rubin, 1976) under the missing at random (MAR) assumption, meaning that the missing mechanism is conditionally independent of the missing features given the observed data. As elaborated later, given the MAR assumption (Dick et al., 2008; Ibrahim, 1990; Williams and Carin, 2005), incomplete data can generally be handled by full maximum likelihood and Bayesian approaches; however, when the missing mechanism does depend on the missing values (missing not at random or MNAR), a problem-specific model is necessary to describe the missing mechanism, and no general approach exists. In this paper, we address missing features under the MAR assumption. Previous work in this setting may be placed into two groups, depending on whether the missing data are handled before algorithm learning or within the algorithm.
For the former, an extra step is required to estimate p(xm|xo), conditional distributions of missing values given observed ones, with this step distinct from the main inference algorithm. After p(xm|xo) is learned, various imputation methods may be performed. As a Monte Carlo approach, Bayesian multiple imputation (MI) (Rubin, 1987) is widely used, where multiple (M > 1) samples from p(xm|xo) are imputed to form M “complete” data sets, with the complete-data algorithm applied on each, and results of those imputed data sets combined to yield a final result. The MI method “completes” data sets so that algorithms designed for complete data become applicable. Furthermore, Rubin (1987) showed that MI does not require as many samples as Monte Carlo methods usually do. With a mild Gaussian mixture model (GMM) assumption for the joint distribution of observed and missing data, Williams et al. (2007) managed to analytically integrate out missing values over p(xm|xo) and performed essentially infinite imputations. Since explicit imputations are avoided, this method is more efficient than the MI method, as suggested by empirical results (Williams et al., 2007). Other examples of these two-step methods include Williams and Carin (2005), Smola et al. (2005) and Shivaswamy et al. (2006).
The other class of methods explicitly addresses missing values during the model-learning procedure. The work proposed by Chechik et al. (2008) represents a special case, in which no model is assumed for structurally absent values; the margin for the support vector machine (SVM) is re-scaled according to the observed features for each instance. Empirical results (Chechik et al., 2008) show that this procedure is comparable to several single-imputation methods when values are missing at random. Another recent work (Dick et al., 2008) handles the missing features inside the procedure of learning an SVM, without constraining the distribution of missing features to any specific class. The main concern is that this method can only handle missing features in the training data; however, in many applications one cannot control whether missing values occur in the training or test data.
A widely employed approach for handling missing values within the algorithm involves maximum likelihood (ML) estimation via expectation maximization (EM) (Dempster et al., 1977). Besides the latent variables (e.g., mixture component indicators), the missing features are also integrated out in the E-step so that the likelihood is maximized with respect to model parameters in the M-step. The main difficulty is that the integral in the E-step is analytically tractable only when an assumption is made on the distribution of the missing features. For example, the intractable integral is avoided by requiring the features to be discrete (Ibrahim, 1990), or assuming a Gaussian mixture model (GMM) for the features (Ghahramani and Jordan, 1994; Liao et al., 2007). The discreteness requirement is often too restrictive, while the GMM assumption is mild since it is well known that a GMM can approximate arbitrary continuous distributions.
In Liao et al. (2007) the authors proposed a quadratically gated mixture of experts (QGME) where the GMM is used to form the gating network, statistically partitioning the feature space into quadratic subregions. In each subregion, one linear classifier works as a local “expert”. As a mixture of experts (Jacobs et al., 1991), the QGME is capable of addressing a classification problem with a nonlinear decision boundary in terms of multiple local experts; the simple form of this model makes it straightforward to handle incomplete data without completing kernel functions (Graepel, 2002; Williams and Carin, 2005). However, as in many mixture-of-expert models (Jacobs et al., 1991; Waterhouse and Robinson, 1994; Xu et al., 1995), the number of local experts in the QGME must be specified initially, and thus a model-selection stage is in general necessary. Moreover, since the expectation-maximization method renders a point (single) solution that maximizes the likelihood, over-fitting may occur when data are scarce relative to the model complexity.
In this paper, we first extend the finite QGME (Liao et al., 2007) to an infinite QGME (iQGME), with theoretically an infinite number of experts realized via a Dirichlet process (DP) (Ferguson, 1973) prior; this yields a fully Bayesian solution, rather than a point estimate. In this manner model selection is avoided and the uncertainty on the number of experts is captured in the posterior density function.
The Dirichlet process (Ferguson, 1973) has been an active topic in many applications since the mid-1990s, for example, density estimation (Escobar and West, 1995; MacEachern and Müller, 1998; Dunson et al., 2007) and regression/curve fitting (Müller et al., 1996; Rasmussen and Ghahramani, 2002; Meeds and Osindero, 2006; Shahbaba and Neal, 2009; Rodríguez et al., 2009; Hannah et al., 2010). The latter group is relevant to the classification problems of interest in this paper. The work in Müller et al. (1996) jointly modeled inputs and responses as a Dirichlet process mixture of multivariate normals, while Rodríguez et al. (2009) extended this model to simultaneously estimate multiple curves using a dependent DP. In Rasmussen and Ghahramani (2002) and Meeds and Osindero (2006), two approaches to constructing infinite mixtures of Gaussian process (GP) experts were proposed; they differ in that Meeds and Osindero (2006) specified the gating network using a multivariate Gaussian mixture rather than an input-dependent Dirichlet process. In Shahbaba and Neal (2009) another form of infinite mixture of experts was proposed, where the experts are specified by a multinomial logit (MNL) model (also called softmax) and the gating network is a Gaussian mixture model with independent covariates. Further, Hannah et al. (2010) generalized existing DP-based nonparametric regression models to accommodate different types of covariates and responses, and gave theoretical guarantees for this class of models.
Our focus in this paper is on developing classification models that handle incomplete inputs/covariates efficiently using the Dirichlet process. Some of the above Dirichlet process regression models are potentially capable of handling incomplete inputs/features; however, none of them actually deals with such problems. In Müller et al. (1996), although the joint multivariate normal assumption over inputs and responses endows the approach with the potential to handle missing features and/or missing responses naturally, a good estimate of the joint distribution does not guarantee a good estimate of the classification boundary. Rather than assuming a full joint Gaussian distribution, the models proposed in Meeds and Osindero (2006) and Shahbaba and Neal (2009) use explicit classifiers to model the conditional distribution of responses given covariates. These two models are closely related to the iQGME proposed here. The independence assumption on the covariates in Shahbaba and Neal (2009) leads to efficient computation but is not appealing for handling missing features. With Gaussian process experts (Meeds and Osindero, 2006), the inference for missing features is not analytical for fast inference algorithms such as variational Bayes (Beal, 2003) and EM, and the computation could be prohibitive for large data sets. The iQGME seeks a balance between ease of inference, computational burden, and the ability to handle missing features. For high-dimensional data sets, we develop a variant of our model based on mixtures of factor analyzers (MFA) (Ghahramani and Hinton, 1996; Ghahramani and Beal, 2000), where a low-rank assumption is made for the covariance matrices of the high-dimensional inputs in each cluster.
In addition to challenges with incomplete data, one must often address an insufficient quantity of labeled data. In Williams et al. (2007) the authors employed semi-supervised learning (Zhu, 2005) to address this challenge, using the contextual information in the unlabeled data to augment the limited labeled data, all done in the presence of missing/incomplete data. Another form of context one may employ to address limited labeled data is multi-task learning (MTL) (Caruana, 1997; Ando and Zhang, 2005), which allows the learning of multiple tasks simultaneously to improve generalization performance. The work of Caruana (1997) provided an overview of MTL and demonstrated it on multiple problems. In recent research, a hierarchical statistical structure has been favored for such models, where information is transferred via a common prior within a hierarchical Bayesian model (Yu et al., 2003; Zhang et al., 2006). Specifically, information may be transferred among related tasks (Xue et al., 2007) when the Dirichlet process (DP) (Ferguson, 1973) is introduced as a common prior. To the best of our knowledge, there is no previous example of addressing incomplete data in a multi-task setting, this problem constituting an important aspect of this paper.
The main contributions of this paper may be summarized as follows. The problem of missing data in classifier design is addressed by extending the QGME (Liao et al., 2007) to a fully Bayesian setting, with the number of local experts inferred automatically via a DP prior. The algorithm is further extended to a multi-task setting, again using a non-parametric Bayesian model, simultaneously learning J missing-data classification problems with appropriate sharing (which may be global or local). Throughout, efficient inference is implemented via the variational Bayesian (VB) method (Beal, 2003). To quantify the accuracy of the VB results, we also perform comparative studies based on Gibbs sampling.
The remainder of the paper is organized as follows. In Section 2 we extend the finite QGME (Liao et al., 2007) to an infinite QGME via a Dirichlet process prior. The incomplete-data problem is defined and discussed in Section 3. Extension to the multi-task learning case is considered in Section 4, and variational Bayesian inference is developed in Section 5. Experimental results for synthetic data and multiple real data sets are presented in Section 6, followed in Section 7 by conclusions and a discussion of future research directions.
2. Infinite Quadratically Gated Mixture of Experts
In this section, we first provide a brief review of the quadratically gated mixture of experts (QGME) (Liao et al., 2007) and Dirichlet process (DP) (Ferguson, 1973), and then extend the number of experts to be infinite via DP.
2.1 Quadratically Gated Mixture of Experts
Consider a binary classification problem with real-valued P-dimensional column feature vectors xi and corresponding class labels yi ∈ {1, −1}. We assume binary labels for simplicity, while the proposed method may be directly extended to cases with more than two classes. Latent variables ti are introduced as “soft labels” associated with yi, as in probit models (Albert and Chib, 1993), where yi = 1 if ti > 0 and yi = −1 if ti ≤ 0. The finite quadratically gated mixture of experts (QGME) (Liao et al., 2007) is defined as
ti | zi = h ~ N(wh^T x̂i, 1),    (1)

xi | zi = h ~ N_P(μh, Λh^{-1}),    (2)

zi ~ Σ_{h=1}^{K} πh δh,    (3)

with Σ_{h=1}^{K} πh = 1, and where δh is a point measure concentrated at h (with probability one, a draw from δh will be h). The (P + 1) × K matrix W has columns wh, where each wh contains the weights of a local linear classifier, and the x̂i are feature vectors with an intercept, that is, x̂i = [xi^T, 1]^T. A total of K groups of wh are introduced to parameterize the K experts. With probability πh the indicator for the ith data point satisfies zi = h, which means the hth local expert is selected, and xi is distributed according to a P-variate Gaussian distribution with mean μh and precision Λh.
It can be seen that the QGME is highly related to the mixture of experts (ME) (Jacobs et al., 1991) and the hierarchical mixture of experts (HME) (Jordan and Jacobs, 1994) if we write the conditional distribution of labels as
p(yi | xi) = Σ_{h=1}^{K} p(zi = h | xi) p(yi | xi, zi = h),    (4)

where

p(yi | xi, zi = h) = ∫ p(yi | ti) N(ti | wh^T x̂i, 1) dti = Φ(yi wh^T x̂i),    (5)

p(zi = h | xi) = πh N_P(xi | μh, Λh^{-1}) / Σ_{k=1}^{K} πk N_P(xi | μk, Λk^{-1}),    (6)

with Φ(·) denoting the standard normal cumulative distribution function.
From (4), as a special case of the ME, the QGME is capable of handling nonlinear problems with the linear experts characterized in (5). However, unlike other ME models, the QGME probabilistically partitions the feature space through a mixture of K Gaussian distributions for xi, as in (6). This assumption on the distribution of xi is mild, since it is well known that a Gaussian mixture model (GMM) is general enough to approximate any continuous distribution. In the QGME, xi as well as yi are treated as random variables (a generative model), and we consider the joint probability p(yi, xi) instead of the conditional probability p(yi|xi) for fixed xi, as in most ME models (which are typically discriminative). Previous work comparing discriminative and generative models may be found in Ng and Jordan (2002) and Liang and Jordan (2008). In the QGME, the GMM on the inputs xi plays two important roles: (i) it serves as the gating network, and (ii) it enables analytic incorporation of incomplete data during classifier inference (as discussed further below).
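To make the gating-plus-experts construction concrete, the following minimal sketch evaluates the predictive rule in (4)-(6) for a finite QGME with known parameters. The function name qgme_predict_proba and the array layout are ours for illustration only; this is not the inference algorithm of the paper, merely the forward (prediction) computation under fixed parameters.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def qgme_predict_proba(x, pi, mu, Lambda, W):
    """P(y = 1 | x) for a finite QGME with fixed parameters (illustrative sketch).

    pi     : (K,)       mixture weights
    mu     : (K, P)     Gaussian means
    Lambda : (K, P, P)  Gaussian precisions
    W      : (K, P+1)   local linear-classifier weights (last entry = intercept)
    """
    K = len(pi)
    x_hat = np.append(x, 1.0)  # feature vector augmented with an intercept
    # Gating: posterior responsibility of each expert given x, as in (6)
    dens = np.array([multivariate_normal.pdf(x, mean=mu[h], cov=np.linalg.inv(Lambda[h]))
                     for h in range(K)])
    gate = pi * dens
    gate = gate / gate.sum()
    # Experts: probit classifiers as in (5); mixture prediction as in (4)
    expert = norm.cdf(W @ x_hat)  # Phi(w_h^T x_hat), one value per expert
    return float(gate @ expert)
```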
The QGME (Liao et al., 2007) is inferred via the expectation-maximization (EM) method, which renders a point-estimate solution for an initially specified model (1)–(3), with a fixed number K of local experts. Since learning the correct model requires model selection, and moreover in many applications there may exist no such fixed “correct” model, in the work reported here we infer the full posterior for a QGME model with the number of experts data-driven. The objective can be achieved by imposing a nonparametric Dirichlet process (DP) prior.
2.2 Dirichlet Process
The Dirichlet process (DP) (Ferguson, 1973) is a random measure defined on measures of random variables, denoted as DP(αG0), with a real scaling parameter α ≥ 0 and a base measure G0. Assuming that a measure is drawn G ~ DP(αG0), the base measure G0 reflects the prior expectation of G, and the scaling parameter α controls how much G is allowed to deviate from G0. In the limit α → ∞, G goes to G0; in the limit α → 0, G reduces to a delta function at a random point in the support of G0.
The stick-breaking construction (Sethuraman, 1994) provides an explicit form of a draw from a DP prior. Specifically, it has been proven that a draw G may be constructed as
G = Σ_{h=1}^{∞} πh δ_{θh*},    (7)

with 0 ≤ πh ≤ 1 and Σ_{h=1}^{∞} πh = 1, and

πh = Vh Π_{l=1}^{h−1} (1 − Vl),   Vh ~ Beta(1, α),   θh* ~ G0.

From (7), it is clear that G is discrete (with probability one) with an infinite set of weights πh at atoms θh*. Since the weights πh decrease stochastically with h, the summation in (7) may be truncated with N terms, yielding an N-level truncated approximation to a draw from the Dirichlet process (Ishwaran and James, 2001).
Assuming that underlying variables θi are drawn i.i.d. from G, the associated data χi ~ F(θi) will naturally cluster, with the θi taking distinct values θh*, where the function F(θ) represents an arbitrary parametric model for the observed data, with hidden parameters θ. Therefore, the number of clusters is automatically determined by the data and could be “infinite” in principle. Since the θi take the distinct values θh* with probabilities πh, this clustering is a statistical procedure instead of a hard partition, and thus we only have a belief on the number of clusters, which is affected by the scaling parameter α. As the value of α influences the prior belief on the clustering, a gamma hyper-prior is usually employed on α.
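As a concrete illustration of the truncated stick-breaking construction and of how α shapes the prior belief on the number of clusters, the short sketch below draws N-level truncated weights assuming Beta(1, α) stick variables, as in (7); the function name stick_breaking_weights is ours and the numbers are purely illustrative.

```python
import numpy as np

def stick_breaking_weights(alpha, N, rng=np.random.default_rng(0)):
    """N-level truncated stick-breaking weights for DP(alpha * G0); sketch only."""
    V = rng.beta(1.0, alpha, size=N)
    V[-1] = 1.0  # truncation: all remaining stick mass is assigned to the last atom
    pieces = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * pieces  # pi_h = V_h * prod_{l<h} (1 - V_l)

# A larger alpha spreads mass over more atoms, i.e., more clusters a priori
for alpha in (0.5, 5.0):
    pi = stick_breaking_weights(alpha, N=20)
    print(alpha, (pi > 0.01).sum())  # number of atoms with non-negligible weight
```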
2.3 Infinite QGME via DP
Consider a classification task with a training data set D = {(xi, yi): i = 1, …, n}, where xi ∈ ℝP and yi ∈ {−1, 1}. With soft labels ti introduced as in Section 2.1, the infinite QGME (iQGME) model is achieved via a DP prior imposed on the measure G of (μi, Λi, wi), the hidden variables characterizing the density function of each data point (xi, ti). For simplicity, the same symbols are used to denote parameters associated with each data point and the distinct values, with subscripts i and h indexing data points and unique values, respectively:

ti | xi, wi ~ N(wi^T x̂i, 1),   xi | μi, Λi ~ N_P(μi, Λi^{-1}),   (μi, Λi, wi) | G ~ G,   G ~ DP(αG0),    (8)
where the base measure G0 is factorized as the product of a normal-Wishart prior for (μh, Λh) and a normal prior for wh, for the sake of conjugacy. As discussed in Section 2.2, data samples cluster automatically, and the same mean μh, covariance matrix Λh and regression coefficients (expert) wh are shared for a given cluster h. Using the stick-breaking construction, we elaborate (8) as follows for i = 1, …, n and h = 1, …, ∞:
- Data generation: ti | zi = h ~ N(wh^T x̂i, 1), with yi = 1 if ti > 0 and yi = −1 if ti ≤ 0; xi | zi = h ~ N_P(μh, Λh^{-1}).
- Drawing indicators: zi ~ Σ_{h=1}^{∞} πh δh, with πh = Vh Π_{l=1}^{h−1} (1 − Vl) and Vh ~ Beta(1, α).
- Drawing parameters from G0: (μh, Λh) ~ N_P(μh | m0, (u0Λh)^{-1}) Wishart(Λh | B0, ν0) and wh ~ N_{P+1}(ζ, diag(λ)^{-1}).
Furthermore, to achieve a more robust algorithm, we assign diffuse hyper-priors on several crucial parameters. As discussed in Section 2.2, the scaling parameter α reflects our prior belief on the number of clusters. For the sake of conjugacy, a diffuse Gamma prior is usually assumed for α, as suggested by West et al. (1994). In addition, the parameters ζ, λ characterizing the prior of the distinct local classifiers wh are another set of important parameters, since we focus on classification tasks. Normal-Gamma priors are the conjugate priors for the mean and precision of a normal density. Therefore,

α ~ Ga(τ10, τ20),   ζp | λp ~ N(0, (γ0λp)^{-1}),   λp ~ Ga(a0, b0),   p = 1, …, P + 1,

where τ10, τ20, a0, b0 are usually set to be much less than one and of about the same magnitude, so that the constructed Gamma distributions, with means about one and large variances, are diffuse; γ0 is usually set to be around one.
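As a sanity check on the generative process above, the following sketch draws synthetic labeled data from a truncated version of the iQGME prior. For brevity it uses fixed, hand-picked base-measure parameters (unit covariances, loose Gaussian draws for μh and wh) rather than the normal-Wishart and normal-gamma hyper-priors; the function sample_iqgme and all settings are illustrative assumptions, not the paper's.

```python
import numpy as np

def sample_iqgme(n=200, P=2, N=20, alpha=1.0, rng=np.random.default_rng(1)):
    """Draw (X, y) from a truncated iQGME-style prior; illustrative, simplified base measure."""
    # Truncated stick-breaking weights
    V = rng.beta(1.0, alpha, size=N)
    V[-1] = 1.0
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    # Cluster-specific parameters (simplified stand-in for draws from G0)
    mu = rng.normal(0.0, 3.0, size=(N, P))       # Gaussian means
    W = rng.normal(0.0, 1.0, size=(N, P + 1))    # local linear experts, last entry = intercept
    # Data generation
    z = rng.choice(N, size=n, p=pi)              # expert indicators
    X = mu[z] + rng.normal(size=(n, P))          # identity covariance for simplicity
    t = np.sum(W[z, :P] * X, axis=1) + W[z, P] + rng.normal(size=n)  # soft labels
    y = np.where(t > 0, 1, -1)                   # probit link: y_i = sign(t_i)
    return X, y, z

X, y, z = sample_iqgme()
```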
The graphical representation of the iQGME for single-task learning is shown in Figure 1. We notice that a possible variant with sparse local classifiers could be obtained if we impose zero mean for the local classifiers wh, that is, ζ = 0, and retain the Gamma hyper-prior for the precision λ, as in the relevance vector machine (RVM) (Tipping, 2000), which employs a corresponding Student-t sparseness prior on the weights. Although this sparseness prior is useful for seeking relevant features in many applications, imposing the same sparse pattern for all the local experts is not desirable.
Figure 1.
Graphical representation of the iQGME for single-task learning. All circles denote random variables, with shaded ones indicating observed data and unshaded ones representing hidden variables. Diamonds denote fixed hyper-parameters, boxes represent independent replicates with the numbers at the lower right corners indicating the numbers of i.i.d. copies, and arrows indicate the dependence between variables (pointing from parents to children).
2.4 Variant for High-Dimensional Problems
For the classification problem, we assume access to a training data set D = {(xi, yi): i = 1, …, n}, where the feature vectors xi ∈ ℝP and the labels yi ∈ {−1, 1}. We have assumed that the feature vectors of objects in cluster h are generated from a P-variate normal distribution with mean μh and covariance matrix Σh = Λh^{-1}, that is,

xi | zi = h ~ N_P(μh, Σh).    (9)
It is well known that each covariance matrix has P(P + 1)/2 parameters to be estimated. Without any further assumption, the estimation of these parameters could be computationally prohibitive for large P, especially when the number of available training data n is small, which is common for classification applications. By imposing an approximately low-rank constraint on the covariances, as in well-studied mixtures of factor analyzers (MFA) models (Ghahramani and Hinton, 1996; Ghahramani and Beal, 2000), the number of unknowns could be significantly reduced. Specifically, assume a vector of standard normal latent factors si ∈ ℝT×1 for data xi, a factor loading matrix Ah ∈ ℝP×T for cluster h, and Gaussian residues εi with diagonal covariance matrix ψhIP, then
xi = Ah si + μh + εi.

Marginalizing si with si ~ N(0, IT), we recover (9), with Σh = Ah Ah^T + ψh IP. The number of free parameters is significantly reduced if T ≪ P.
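A quick back-of-the-envelope sketch of the saving implied by the factor-analysis form Σh = Ah Ah^T + ψh IP: with P features and T factors, the constrained covariance has roughly PT + 1 free parameters per cluster instead of P(P + 1)/2. The numbers below are illustrative only.

```python
import numpy as np

P, T = 500, 10                        # feature dimension, number of factors (illustrative)
rng = np.random.default_rng(0)
A = rng.normal(size=(P, T))           # factor loading matrix for one cluster
psi = 0.1                             # isotropic residual variance
Sigma = A @ A.T + psi * np.eye(P)     # low-rank-plus-diagonal covariance

full_params = P * (P + 1) // 2        # free parameters of an unconstrained covariance
mfa_params = P * T + 1                # loadings plus one residual variance
print(full_params, mfa_params)        # 125250 vs 5001
```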
In this paper, we modify the MFA model for classification applications with scarce samples. First, we consider a common loading matrix A for all the clusters, and introduce a binary vector bh for each cluster to select which columns of A are used, that is,

xi = A (bh ∘ d ∘ si) + μh + εi,   for zi = h,

where each column of A, Al ~ N(0, P^{-1} IP), si ~ N(0, IL), d is a vector responsible for scale, and ∘ is the component-wise (Hadamard) product. For d we employ the prior dl ~ N(0, βl^{-1}),
with βl ~ Ga(c0, d0). Furthermore, we let the algorithm infer the intrinsic number of factors by imposing a low-rank belief for each cluster through the prior of bh, that is,

bhl ~ Bernoulli(πhl),   πhl ~ Beta(a0/L, b0(L − 1)/L),   l = 1, …, L,

where L is a large number, which defines the largest possible dimensionality the algorithm may infer. Through the choice of a0 and b0 we impose our prior belief about the intrinsic dimensionality of cluster h (upon integrating out the draw πh, the number of non-zero components of bh is drawn from Binomial[L, a0/(a0 + b0(L − 1))]). As a result, both the number of clusters and the dimensionality of each cluster are inferred by this variant of iQGME.
With this form of iQGME, we could build local linear classifiers in either the original feature space or the (low-dimensional) space of latent factors si. For the sake of computational simplicity, we choose to classify in the low-dimensional factor space.
3. Incomplete Data Problem
In the above discussion it was assumed that all components of the feature vectors were available (no missing data). In this section, we consider the situation for which the feature vectors xi are partially observed. We partition each feature vector xi into observed and missing parts, xi = (xi^{oi}, xi^{mi}), where xi^{oi} denotes the subvector of observed features and xi^{mi} represents the subvector of missing features, with oi and mi denoting the sets of indices for observed and missing features, respectively. Each xi has its own observed set oi and missing set mi, which may be different for each i. Following a generic notation (Schafer and Graham, 2002), we refer to R as the missingness. For an arbitrary missing pattern, R may be defined as a missing-data indicator matrix, whose (i, p) entry indicates whether the pth feature of the ith data point is observed.
We use ξ to denote parameters characterizing the distribution of R, which is usually called the missing mechanism. In the classification context, the joint distribution of class labels, observed features and the missingness R may be given by integrating out the missing features xm,
p(y, xo, R | θ, ξ) = ∫ p(y, xo, xm | θ) p(R | xo, xm, ξ) dxm.    (10)
To handle such a problem analytically, assumptions must be made on the distribution of R. If the missing mechanism is conditionally independent of missing values xm given the observed data, that is, p(R|x, ξ) = p(R|xo, ξ), the missing data are defined to be missing at random (MAR) (Rubin, 1976). Consequently, (10) reduces to
p(y, xo, R | θ, ξ) = p(R | xo, ξ) ∫ p(y, xo, xm | θ) dxm = p(R | xo, ξ) p(y, xo | θ).    (11)
According to (11), the likelihood is factorizable under the assumption of MAR. As long as the prior factorizes as p(θ, ξ) = p(θ)p(ξ), the posterior p(θ, ξ | y, xo, R) ∝ [p(y, xo | θ)p(θ)] [p(R | xo, ξ)p(ξ)] is also factorizable. For the purpose of inferring the model parameters θ, no explicit specification is then necessary for the distribution of the missingness. As an important special case of MAR, data are missing completely at random (MCAR) if we can further assume that p(R|x, ξ) = p(R|ξ), which means the distribution of the missingness is independent of the observed values xo as well. When the missing mechanism depends on the missing values xm, the data are termed missing not at random (MNAR). From (10), an explicit form then has to be assumed for the distribution of the missingness, and both accuracy and computational efficiency become concerns.
When missingness is not totally controlled, as in most realistic applications, we cannot tell from the data alone whether the MCAR or MAR assumption is valid. Since the MCAR or MAR assumption is unlikely to be precisely satisfied in practice, inference based on these assumptions may lead to a bias. However, as demonstrated in many cases, it is believed that for realistic problems departures from MAR are usually not large enough to significantly impact the analysis (Collins et al., 2001). On the other hand, without the MAR assumption, one must explicitly specify a model for the missingness R, which is a difficult task in most cases. As a result, the data are typically assumed to be either MCAR or MAR in the literature, unless significant correlations between the missing values and the distribution of the missingness are suspected.
In this work we make the MAR assumption, and thus expression (11) applies. In the iQGME framework, the joint likelihood may be further expanded as
p(yi, xi^{oi} | θ) = Σ_{h=1}^{∞} πh ∫ [ ∫ p(yi | ti) N(ti | wh^T x̂i, 1) dti ] N_P(xi | μh, Λh^{-1}) dxi^{mi}.    (12)
The solution to such a problem with incomplete data xm is analytical, since the distributions of t and x are assumed to be a Gaussian and a Gaussian mixture model, respectively. Naturally, the missing features may be regarded as hidden variables to be inferred, and the graphical representation of the iQGME with incomplete data remains the same as in Figure 1, except that the node representing the features is now only partially observed. As elaborated below, the important but mild assumption that the features are distributed as a GMM enables us to analytically infer the variational distributions associated with the missing values within the variational Bayesian inference procedure.
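The analytic tractability rests on the standard conditional property of the multivariate Gaussian: within a given mixture component, the missing subvector given the observed subvector is again Gaussian, with mean and covariance available in closed form. The sketch below computes this conditional for a single Gaussian component; it is a textbook identity, shown here only to make the mechanism explicit (names are ours).

```python
import numpy as np

def conditional_gaussian(mu, Sigma, x_obs, obs_idx, mis_idx):
    """Mean and covariance of x_m | x_o under x ~ N(mu, Sigma); standard identity, sketch only."""
    mu_o, mu_m = mu[obs_idx], mu[mis_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    K = S_mo @ np.linalg.inv(S_oo)            # regression of missing on observed coordinates
    cond_mean = mu_m + K @ (x_obs - mu_o)
    cond_cov = S_mm - K @ S_mo.T
    return cond_mean, cond_cov
```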
As in many models (Williams et al., 2007), estimating the distribution of the missing values first and learning the classifier at a second step gives the flexibility of selecting the classifier for the second step. However, (12) suggests that the classifier and the data distribution are coupled, provided that partial data are missing and thus have to be integrated out. Therefore, a joint estimation of missing features and classifiers (searching in the space of (θ1, θ2)) is more desirable than a two-step process (searching in the space of θ1 for the distribution of the data, and then in the space of θ2 for the classifier).
4. Extension to Multi-Task Learning
Assume we have J data sets, with the jth represented as Dj = {(xji, yji): i = 1, …, nj}; our goal is to design a classifier for each data set, with the design of each classifier termed a “task”. One may learn separate classifiers for each of the J data sets (single-task learning) by ignoring connections between the data sets, or a single classifier may be learned based on the union of all data (pooling) by ignoring differences between the data sets. More appropriately, in a hierarchical Bayesian framework J task-dependent classifiers may be learned jointly, with information borrowed via a higher-level prior (multi-task learning). In some previous research all tasks are assumed to be equally related to each other (Yu et al., 2003; Zhang et al., 2006), or related tasks share exactly the same task-dependent classifier (Xue et al., 2007). With multiple local experts, the proposed iQGME model for a particular task is relatively flexible, enabling the borrowing of information across the J tasks (two data sets may share parts of the respective classifiers, without requiring sharing of all classifier components).
As discussed in Section 2.2, a DP prior encourages clustering (each cluster corresponds to a local expert). Now considering multiple tasks, a hierarchical Dirichlet process (HDP) (Teh et al., 2006) may be considered to solve the problem of sharing clusters (local experts) across multiple tasks. Assume a random measure Gj is associated with each task j, where each Gj is an independent draw from a Dirichlet process DP(αG0) with a base measure G0 drawn from an upper-level Dirichlet process DP(βH), that is,

Gj | α, G0 ~ DP(αG0)  for j = 1, …, J,   with G0 | β, H ~ DP(βH).
As a draw from a Dirichlet process, G0 is discrete with probability one and has a stick-breaking representation as in (7). With such a base measure, the task-dependent DPs reuse the atoms defined in G0, yielding the desired sharing of atoms among tasks.
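A schematic sketch of this sharing mechanism, under truncation at both levels: the global atoms receive stick-breaking weights once, and each task then re-weights those same atoms (here via a Dirichlet distribution centered on the global weights, which is the finite-truncation analogue of drawing Gj ~ DP(αG0) when G0 has finitely many atoms). Function and variable names are ours; this is an illustration, not the paper's inference.

```python
import numpy as np

def hdp_task_weights(beta_conc, alpha, S, J, rng=np.random.default_rng(2)):
    """Truncated two-level stick-breaking: J task-level weight vectors over S shared global atoms."""
    # Upper level: global weights eta over S atoms (the atoms themselves are shared by all tasks)
    U = rng.beta(1.0, beta_conc, size=S)
    U[-1] = 1.0
    eta = U * np.concatenate(([1.0], np.cumprod(1.0 - U[:-1])))
    # Lower level: each task re-weights the SAME atoms; Dirichlet(alpha * eta) approximates DP(alpha * G0)
    task_w = rng.dirichlet(alpha * eta, size=J)
    return eta, task_w

eta, task_w = hdp_task_weights(beta_conc=2.0, alpha=5.0, S=30, J=3)
```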
With the task-dependent iQGME defined in (8), we consider all J tasks jointly:

tji | xji, wji ~ N(wji^T x̂ji, 1),   xji | μji, Λji ~ N_P(μji, Λji^{-1}),   (μji, Λji, wji) | Gj ~ Gj,   Gj ~ DP(αG0),   G0 ~ DP(βH).
In this form of borrowing information, experts with associated means and precision matrices are shared across tasks as distinct atoms. Since means and precision matrices statistically define local regions in feature space, sharing is encouraged locally. We explicitly write the stick-breaking representations for Gj and G0, with zji and cjh introduced as the indicators for each data point and each distinct atom of Gj, respectively. By factorizing the base measure H as a product of a normal-Wishart prior for (μs, Λs) and a normal prior for ws, the hierarchical model of the multitask iQGME via the HDP is represented as
- Data generation: tji | zji = h, cjh = s ~ N(ws^T x̂ji, 1), with yji = 1 if tji > 0 and yji = −1 if tji ≤ 0; xji | zji = h, cjh = s ~ N_P(μs, Λs^{-1}).
- Drawing lower-level indicators: zji ~ Σ_{h=1}^{∞} πjh δh, with πjh = Vjh Π_{l=1}^{h−1} (1 − Vjl) and Vjh ~ Beta(1, α).
- Drawing upper-level indicators: cjh ~ Σ_{s=1}^{∞} ηs δs, with ηs = Us Π_{l=1}^{s−1} (1 − Ul) and Us ~ Beta(1, β).
- Drawing parameters from H: (μs, Λs) ~ N_P(μs | m0, (u0Λs)^{-1}) Wishart(Λs | B0, ν0) and ws ~ N_{P+1}(ζ, diag(λ)^{-1}),

where j = 1, …, J and i = 1, …, nj index tasks and data points in each task, respectively; h = 1, …, ∞ and s = 1, …, ∞ index atoms for the task-dependent Gj and the globally shared base G0, respectively. Hyper-priors are imposed similarly to the single-task case:

α ~ Ga(τ10, τ20),   β ~ Ga(τ30, τ40),   ζp | λp ~ N(0, (γ0λp)^{-1}),   λp ~ Ga(a0, b0).
The graphical representation of the iQGME for multi-task learning via the HDP is shown in Figure 2.
Figure 2.
Graphical representation of the iQGME for multi-task learning via the hierarchical Dirichlet process (HDP). Refer to Figure 1 for additional information.
5. Variational Bayesian Inference
We initially present the inference formalism for single-task learning, and then discuss the (relatively modest) extensions required for the multi-task case.
5.1 Basic Construction
For simplicity we denote the collection of hidden variables and model parameters as Θ and the specified hyper-parameters as Ψ. In a Bayesian framework we are interested in p(Θ | D, Ψ), the joint posterior distribution of the unknowns given the observed data and hyper-parameters. From Bayes’ rule,

p(Θ | D, Ψ) = p(D | Θ) p(Θ | Ψ) / p(D | Ψ),

where p(D | Ψ) = ∫ p(D | Θ) p(Θ | Ψ) dΘ is the marginal likelihood, which often involves multi-dimensional integrals. Since these integrals are nonanalytical in most cases, the computation of the marginal likelihood is the principal challenge in Bayesian inference. These integrals are circumvented if only a point estimate Θ̂ is pursued, as in the expectation-maximization algorithm (Dempster et al., 1977). Markov chain Monte Carlo (MCMC) sampling methods (Gelfand et al., 1990; Neal, 1993) provide one class of approximations for the full posterior, based on samples from a Markov chain whose stationary distribution is the posterior of interest. Since a Markov chain is theoretically guaranteed to converge to the true posterior as long as the chain is long enough, MCMC samples constitute an unbiased estimate of the posterior. Most previous applications with a Dirichlet process prior (Ishwaran and James, 2001; West et al., 1994), including the related papers reviewed in Section 1, have been implemented with various MCMC methods. The main concerns with MCMC methods are the computational cost of collecting a sufficient number of samples, and the difficulty of diagnosing convergence.
As an efficient alternative, the variational Bayesian (VB) method (Beal, 2003) approximates the true posterior p(Θ | D, Ψ) with a variational distribution q(Θ) having free variational parameters. The problem of computing the posterior is reformulated as the optimization problem of minimizing the Kullback-Leibler (KL) divergence between q(Θ) and p(Θ | D, Ψ), which is equivalent to maximizing a lower bound on log p(D | Ψ), the log marginal likelihood. This optimization problem can be solved iteratively under two assumptions on q(Θ): (i) q(Θ) is factorized; (ii) the factorized components of q(Θ) come from the same exponential families as the corresponding priors. Since the lower bound cannot in general achieve the true log marginal likelihood, the approximation given by the variational Bayesian method is biased. Another issue concerning the VB algorithm is that the solution may be trapped at local optima, since the optimization problem is not convex. The main advantages of VB include the ease of convergence diagnosis and computational efficiency. Because VB solves an optimization problem, the objective function, the lower bound on the log marginal likelihood, is a natural criterion for convergence diagnosis. Therefore, VB is a good alternative to MCMC when conjugacy is achieved and computational efficiency is desired. In recent publications (Blei and Jordan, 2006; Kurihara et al., 2007), discussions on the implementation of variational Bayesian inference are given for Dirichlet process mixtures.
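The overall structure of such a coordinate-ascent VB scheme can be summarized by a short schematic loop: re-estimate each factor in turn and monitor the lower bound until its relative change falls below a threshold. The functions update_factors and elbo below are placeholders, not the paper's update equations; this is only a sketch of the control flow.

```python
def run_vb(update_factors, elbo, q_init, tol=1e-6, max_iter=500):
    """Generic coordinate-ascent VB loop; `update_factors` and `elbo` are problem-specific placeholders."""
    q = q_init
    bound_old = -float("inf")
    for _ in range(max_iter):
        q = update_factors(q)   # re-estimate each variational factor given the others
        bound = elbo(q)         # lower bound on the log marginal likelihood
        if abs(bound - bound_old) < tol * abs(bound_old):  # relative-change convergence criterion
            break
        bound_old = bound
    return q, bound
```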
We implement variational Bayesian inference throughout this paper, with comparisons made to Gibbs sampling. Since it is desirable to maintain the dependencies among random variables (e.g., those shown in the graphical model of Figure 1) in the variational distribution q(Θ), one typically breaks only those dependencies that make the computation difficult. In the subsequent inference for the iQGME, we retain some dependencies unbroken. Following Blei and Jordan (2006), we employ stick-breaking representations with a truncation level N as variational distributions to approximate the infinite-dimensional random measures G.
We detail the variational Bayesian inference for the case of incomplete data. The inference for the complete-data case is similar, except that all feature vectors are fully observed and thus the step of learning missing values is skipped. To avoid repetition, a thorough procedure for the complete-data case is not included, with differences from the incomplete-data case indicated.
5.2 Single-task Learning
For single-task iQGME the unknowns are Θ = {t, xm, z, V, α, μ, Λ, W, ζ, λ}, with hyper-parameters Ψ = {m0, u0, B0, ν0, τ10, τ20, γ0, a0, b0}. We specify the factorized variational distributions as

q(Θ) = [Π_{i=1}^{n} q(ti) q(xi^{mi}, zi)] [Π_{h=1}^{N} q(Vh) q(μh, Λh) q(wh)] [Π_{p=1}^{P+1} q(ζp, λp)] q(α),

where
- qti(ti) is a truncated normal distribution, which means the density function of ti is assumed to be normal, with a variational mean parameter and unit variance, restricted to those ti satisfying yiti > 0.
- q(xi^{mi}, zi) = q(xi^{mi} | zi) qzi(zi), where qzi(zi) is a multinomial distribution with probabilities ρi over N possible outcomes, zi ~ Mult(1, ρi1, …, ρiN), i = 1, …, n. Given the associated indicators zi, since the features are assumed to be distributed as a multivariate Gaussian, the distributions of the missing values q(xi^{mi} | zi = h) are still Gaussian, according to the conditional properties of multivariate Gaussian distributions. We retain the dependency between xi^{mi} and zi in the variational distribution since the inference is still tractable; for complete data, a variational distribution for the missing values xi^{mi} is of course not needed.
- qVh(Vh) is a Beta distribution. Recall that we have a truncation level of N, which implies that the mixture proportions πh(V) are equal to zero for h > N. Therefore, qVh(Vh) = δ1 for h = N, and qVh(Vh) = δ0 for h > N. For h < N, Vh has a variational Beta posterior.
- qμh,Λh (μh, Λh) is a normal-Wishart distribution,
- qwh(wh) is a normal distribution,
- qζp,λp (ζp, λp) is a normal-gamma distribution,
- qα(α) is a Gamma distribution,
Given the specifications of the variational distributions, a mean-field variational algorithm (Beal, 2003) is developed for the iQGME model. All update equations, and the derivations for the variational distributions associated with the missing values and indicators, are included in the Appendix; similar derivations for the other random variables are found elsewhere (Xue et al., 2007; Williams et al., 2007). Each variational parameter is re-estimated iteratively, conditioned on the current estimate of the others, until the lower bound on the log marginal likelihood converges. Although the algorithm yields a bound for any initialization of the variational parameters, different initializations may lead to different bounds. To alleviate this local-maxima problem, one may perform multiple independent runs with random initializations, and choose the run that produces the highest bound on the marginal likelihood. We elaborate on our initializations in the experiment section.
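One representative closed-form quantity in these updates is the posterior mean of the truncated-normal soft label ti (first bullet above): for a normal with unit variance truncated to yiti > 0, the expectation is available via the standard normal pdf and cdf. The sketch below states this textbook identity; the symbol mu_i stands for whatever mean parameter the update supplies and is our notation, not the paper's.

```python
from scipy.stats import norm

def soft_label_mean(mu_i, y_i):
    """E[t_i] under N(mu_i, 1) truncated to y_i * t_i > 0 (standard truncated-normal identity)."""
    # y_i = +1 (t_i > 0):  E[t_i] = mu_i + phi(mu_i) / Phi(mu_i)
    # y_i = -1 (t_i <= 0): E[t_i] = mu_i - phi(mu_i) / Phi(-mu_i)
    return mu_i + y_i * norm.pdf(mu_i) / norm.cdf(y_i * mu_i)
```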
For simplicity, we omit the subscripts on the variational distributions and henceforth use q to denote any variational distributions. In the following derivations and update equations, we use generic notation 〈f〉q(·) to denote Eq(·)[f], the expectation of a function f with respect to variational distributions q(·). The subscript q(·) is dropped when it shares the same arguments with f.
5.3 Multi-task Learning
For multi-task learning much of the inference is highly related to that of single-task learning, as discussed above; in the following we focus only on differences. In the multi-task learning model, the latent variables are Θ = {t, xm, z, V, α, c, U, β, μ, Λ, W, ζ, λ}, and hyper-parameters are Ψ = {m0, u0, B0, ν0, τ10, τ20, τ30, τ40, γ0, a0, b0}. We specify the factorized variational distributions as

q(Θ) = [Π_{j=1}^{J} (Π_{i=1}^{nj} q(tji) q(xji^{mji}) q(zji)) (Π_{h=1}^{N} q(Vjh) q(cjh))] [Π_{s=1}^{S} q(Us) q(μs, Λs) q(ws)] [Π_{p=1}^{P+1} q(ζp, λp)] q(α) q(β),

where the variational distributions of (tji, Vjh, α, μs, Λs, ws, ζp, λp) are assumed to be the same as in the single-task learning, while the variational distributions of hidden variables newly introduced for the upper-level Dirichlet process are specified as
- q(cjh) for each indicator cjh is a multinomial distribution with probabilities σjh,
- q(Us) for each weight Us is a Beta distribution. Here we have a truncation level of S for the upper-level DP, which implies that the mixture proportions ηs(U) are equal to zero for s > S. Therefore, q(Us) = δ1 for s = S, and q(Us) = δ0 for s > S. For s < S, Us has a variational Beta posterior.
- q(β) for the scaling parameter β is a Gamma distribution,
We also note that, with the higher level of hierarchy, the dependency between the missing values and the associated indicator zji has to be broken so that the inference remains tractable. The variational distribution of zji is still assumed to be multinomial, while xji^{mji} is assumed to be normally distributed but no longer dependent on zji. All update equations are included in the Appendix.
5.4 Prediction
For a new feature vector with observed part x★^{o★}, the prediction of the associated class label y★ is given by integrating out the missing values.
We marginalize the hidden variables over their variational distributions to compute the predictive probability of the class label
where
The resulting expectation is a multivariate Student-t distribution (Attias, 2000). However, for the incomplete-data situation, the integral over the missing values is tractable only when the two terms in the integral are both normal. To retain the form of normal distributions, we use the posterior means of μh, Λh and wh to approximate these variables:
where
For complete data the integral of missing features is absent, so we take advantage of the full variational posteriors for prediction.
5.5 Computational Complexity
Given the truncation level (or the number of clusters) N, the data dimensionality P, and the number of data points n, we compare the iQGME to closely related DP regression models (Meeds and Osindero, 2006; Shahbaba and Neal, 2009) in terms of time and memory complexity. The inference of the iQGME with complete data requires inversion of two P × P matrices (the covariance matrices for the inputs and the local expert) associated with each cluster. Therefore, the time and memory complexity are O(2NP^3) and O(2NP^2), respectively. With incomplete data, since the missing pattern is unique to each data point, the time and memory complexity increase with the number of data points, that is, O(nNP^3) and O(nNP^2), respectively. The mixture of Gaussian process experts (Meeds and Osindero, 2006) requires O(NP^3 + n^3/N) computations for each MCMC iteration if the N experts equally divide the data, and the memory complexity is O(NP^2 + n^2/N). In the model proposed by Shahbaba and Neal (2009), no matrix inversion is needed since the covariates are assumed to be independent; the time and memory complexity are both O(NP).
From the aspect of computational complexity, the model in Meeds and Osindero (2006) is restricted by the increase of both dimensionality and data size, while the model proposed in Shahbaba and Neal (2009) is more efficient. Although the proposed model requires more computation per MCMC/VB iteration than the latter, we are able to handle missing values naturally, and much more efficiently than the former. Considering the usual number of iterations required by VB (several dozen) and MCMC (thousands or even tens of thousands), our model is even more efficient.
6. Experimental Results
In all the following experiments the hyper-parameters are set as follows: a0 = 0.01, b0 = 0.01, γ0 = 0.1, τ10 = 0.05, τ20 = 0.05, τ30 = 0.05, τ40 = 0.05, u0 = 0.1, ν0 = P + 2, and m0 and B0 are set according to sample mean and sample precision, respectively. These parameters have not been optimized for any particular data set (which are all different in form), and the results are relatively insensitive to “reasonable” settings. The truncation levels for the variational distributions are set to be N = 20 and S = 50. We have found the results insensitive to the truncation level, for values larger than those considered here.
Because of the local-maxima issue associated with VB, initialization of the inferred VB hyper-parameters is often important. We initialize most variational hyper-parameters using the corresponding prior hyper-parameters, which are data-independent. The precision/covariance matrices, such as Bh, are simply initialized as identity matrices. However, for several other hyper-parameters, we may obtain good starting points from the data. Specifically, the variational mean of the soft label ti is initialized with the associated label yi. A K-means clustering algorithm is run on the feature vectors, and the cluster means and assignments are used to initialize the variational means of the Gaussian means mh and the indicator probabilities ρi, respectively. As an alternative, one may randomly initialize mh and ρi multiple times, and select the solution that produces the highest lower bound on the log marginal likelihood. The two approaches work almost equivalently for low-dimensional problems; however, for problems with moderate to high dimensionality, it can be fairly difficult to obtain a satisfactory initialization from only a few random trials.
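The data-driven part of this initialization might be coded as below, using scikit-learn's KMeans for the clustering step; the variable names (m_h for the initial Gaussian means, rho for the indicator probabilities, t_mean for the soft labels) are ours, the constant eps is an arbitrary smoothing choice, and the sketch assumes complete feature vectors (with missing entries one would first fill them, e.g., with unconditional means) and N no larger than the number of training points.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_variational(X, y, N=20, eps=1e-2):
    """Data-driven initialization sketch: K-means for cluster means/responsibilities, labels for soft labels."""
    n = X.shape[0]
    km = KMeans(n_clusters=N, n_init=10, random_state=0).fit(X)
    m_h = km.cluster_centers_           # initial variational means of the Gaussian means
    rho = np.full((n, N), eps)          # soft indicator probabilities
    rho[np.arange(n), km.labels_] = 1.0
    rho = rho / rho.sum(axis=1, keepdims=True)
    t_mean = y.astype(float)            # initialize the soft labels with the observed labels
    return m_h, rho, t_mean
```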
6.1 Synthetic Data
We first demonstrate the proposed iQGME single-task learning model on a synthetic data set, for illustrative purposes. The data are generated according to a GMM model with the following parameters:
The class boundary within each Gaussian component is given by a line x2 = wk x1 + bk for k = 1, 2, 3, where w1 = 0.75, b1 = 2.25, w2 = −0.58, b2 = 0.58, and w3 = 0.75, b3 = −3.75. The simulated data are shown in Figure 3(a), where black dots and dashed ellipses represent the true means and covariance matrices of the Gaussian components, respectively.
Figure 3.

Synthetic three-Gaussian single-task data with inferred components. (a) Data in feature space with true labels and true Gaussian components indicated; (b) inferred posterior expectation of weights on components, with standard deviations depicted as error bars; (c) ground truth with posterior means of dominant components indicated (the linear classifiers and Gaussian ellipses are inferred from the data).
The inferred mean mixture weights with standard deviations are depicted in Figure 3(b), and it is observed that three dominant mixture components (local “experts”) are inferred. The dominant components (those with mean weight larger than 0.005) are characterized by Gaussian means, covariance matrices and local experts, as depicted in Figure 3(c). From Figure 3(c), the nonlinear classification is manifested by using three dominant local linear classifiers, with a GMM defining the effective regions stochastically.
An important point is that we are not selecting a “correct” number of mixture components, as in most mixture-of-experts models, including the finite QGME model (Liao et al., 2007). Instead, there exists uncertainty on the number of components in our posterior belief. Since this uncertainty is not inferred directly, we obtain samples of the number of dominant components by calculating πh based on Vh sampled from their probability density functions (prior or variational posterior); the resulting probability mass functions, estimated by histograms, are shown in Figure 4(a). As discussed, the scaling parameter α is closely related to the number of clusters, so we depict the prior and the variational posterior on α in Figure 4(b).
Figure 4.

Synthetic three-Gaussian single-task data: (a) prior and posterior beliefs on the number of dominant components; (b) prior and posterior beliefs on α.
The predictions in feature space are presented in Figure 5, where the prediction in sub-figure (a) is given by integrating over the full posteriors of the local experts and the parameters (means and covariance matrices) of the Gaussian components, while the prediction in sub-figure (b) is given by the posterior means. We examine these two cases since analytical integrals over the full posteriors may sometimes be unavailable in practice (for example, for cases with incomplete data, as discussed in Section 5). From Figures 5(a) and 5(b), we observe that the two predictions are fairly similar, except that (a) allows more uncertainty in regions with scarce data. The reason is that the posteriors are often peaked, and thus the posterior means are usually representative. As an example, we plot the broad common prior imposed for the local experts in Figure 5(c) and the peaked variational posteriors for the three dominant experts in Figure 5(d). Based on Figure 5, we suggest the use of full posteriors for prediction whenever the integrals are analytical, that is, for experiments with complete data; the figure also empirically justifies the use of posterior means as an approximation. These results have been computed using VB inference, with MCMC-based results presented below as a comparison.
Figure 5.
Synthetic three-Gaussian single-task data: (a) prediction in feature space using the full posteriors; (b) prediction in feature space using the posterior means; (c) a common broad prior on local experts; (d) variational posteriors on local experts.
6.2 Benchmark Data
To further evaluate the proposed iQGME, we compare it with other models, using benchmark data sets available from the UCI machine learning repository (Newman et al., 1998). Specifically, we consider Wisconsin Diagnostic Breast Cancer (WDBC) and the Johns Hopkins University Ionosphere database (Ionosphere) data sets, which have been studied in the literature (Williams et al., 2007; Liao et al., 2007). These two data sets are summarized in Table 1.
Table 1.
Details of Ionosphere and WDBC data sets
| Data set | Dimension | Number of positive instances | Number of negative instances |
|---|---|---|---|
| Ionosphere | 34 | 126 | 225 |
| WDBC | 30 | 212 | 357 |
The models we compare to include:
-
State-of-the-art classification algorithms: Support Vector Machines (SVM) (Vapnik, 1995) and Relevance Vector Machines (RVM) (Tipping, 2000). We consider both linear models (Linear) and non-linear models with radial basis function (RBF) for both algorithms. For each data set, the kernel parameter is selected for one training/test/validation separation, and then fixed for all the other experimental settings. The RVM models are implemented with Tipping’s Matlab code available at http://www.miketipping.com/index.php?page=rvm
Since those SVM and RVM algorithms are not directly applicable to problems with missing features, we use two methods to impute the missing values before the implementation. One is using the mean of observed values (unconditional mean) for the given feature, referred to as Uncond; the other is using the posterior mean conditional on observed features (conditional mean), referred to as Cond (Schafer and Graham, 2002).
Classifiers handling missing values: the finite QGME inferred by expectation-maximization (EM) (Liao et al., 2007), referred to as QGME-EM, and a two-stage algorithm (Williams et al., 2007) where the parameters of the GMM for the covariates are estimated first given the observed features, and then a marginalized linear logistic regression (LR) classifier is learned, referred to as LR-Integration. Results are cited from Liao et al. (2007) and Williams et al. (2007), respectively.
In order to simulate the missing-at-random setting, we randomly remove a fraction of the feature values according to a uniform distribution, and assume the rest are observed. Any instance with all feature values missing is deleted. After that, we randomly split each data set into training and test subsets, requiring that each subset contains at least one instance from each of the classes. Note that the random pattern of missing features and the random partition into training and test subsets are independent of each other. By performing multiple trials we consider the general (average) performance over various data settings. For convenient comparison with Williams et al. (2007) and Liao et al. (2007), the performance of the algorithms is evaluated in terms of the area under the receiver operating characteristic (ROC) curve (AUC) (Hanley and McNeil, 1982).
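A sketch of this protocol, with our own helper name make_missing and NaN used as the missing-value marker; the AUC evaluation uses scikit-learn's roc_auc_score. The classifier training itself is omitted.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def make_missing(X, frac, rng=np.random.default_rng(0)):
    """Remove a fraction of feature values uniformly at random (NaN marks a missing entry)."""
    mask = rng.random(X.shape) < frac
    X_miss = X.astype(float).copy()
    X_miss[mask] = np.nan
    keep = ~np.all(np.isnan(X_miss), axis=1)   # drop instances that lost all of their features
    return X_miss[keep], keep

# After training a classifier on the (incomplete) training split and scoring the test split:
# auc = roc_auc_score(y_test, scores)          # area under the ROC curve
```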
The results on the Ionosphere and WDBC data sets are summarized in Figures 6 and 7, respectively, where we consider 25% and 50% of the feature values missing. Given a portion of missing values, each curve is a function of the fraction of data used in training. For a given size of training data, we perform ten independent trials for the SVM and RVM models and the proposed iQGME.
Figure 6.
Results on Ionosphere data set for (a)(b) 25%, and (c)(d) 50% of the feature values missing. For legibility, we only report the standard deviation for the proposed iQGME-VB algorithm as error bars, and present the compared algorithms in two figures for each case. The results of the finite QGME solved with an expectation-maximization method are cited from Liao et al. (2007), and those of LR-Integration are cited from Williams et al. (2007). Since the performance of the QGME-EM is affected by the choice of number of experts K, the overall best results among K = 1,3,5,10,20 are cited for comparison in each case (no such selection of K is required for the proposed iQGME-VB algorithm).
Figure 7.
Results on WDBC data set for cases when (a)(b) 25%, and (c)(d) 50% of the feature values are missing. Refer to Figure 6 for additional information.
From both Figures 6 and 7, the proposed iQGME-VB is robust for all the experimental settings, and its overall performance is the best among all algorithms considered. Although the RVM-RBF-Cond and the SVM-RBF-Cond perform well for the Ionosphere data set, especially when the training data is limited, their performance on the WDBC data set is not as good. The kernel methods benefit from the introduction of the RBF kernel for the Ionosphere data set; however, their performance is inferior for the WDBC data set. We also note that the one-step iQGME and the finite QGME outperform the two-step LR-Integration. The proposed iQGME consistently performs better than the finite QGME (where, for the latter, in all cases we show results for the best/optimized choice of the number of experts K), which reveals the advantage of retaining the uncertainty on the model structure (number of experts) and model parameters. As shown in Figure 7, the advantage of considering the uncertainty on the model parameters is fairly pronounced for the WDBC data set, especially when training examples are relatively scarce and thus the point-estimate EM method suffers from over-fitting. A more detailed examination of the model uncertainty is shown in Figures 8 and 9.
Figure 8.
The comparison on the Ionosphere data set between QGME-EM with different preset number of clusters K and the proposed iQGME-VB, when (a)(b)(c) 25%, and (d)(e)(f) 50% of the features are missing. In each row, 10%, 50%, and 90% of samples are used for training, respectively. Results of QGME-EM are cited from Liao et al. (2007).
Figure 9.

Number of clusters for the Ionosphere data set inferred by iQGME-VB. (a) Prior and inferred posterior on the number of clusters for one trial given 10% of the samples for training; number of clusters for the cases when (b) 25% and (c) 50% of the features are missing. The most probable number of clusters is used for each trial to generate (b) and (c) (e.g., the most probable number of clusters is two for the trial shown in (a)). In (b) and (c), the distribution of the number of clusters over the ten trials for each missing fraction and training fraction is presented as a box-plot, where the red line represents the median; the bottom and top of the blue box are the 25th and 75th percentiles, respectively; the bottom and top black lines are the ends of the whiskers, which may be the minimum and maximum, respectively; points beyond 1.5 times the interquartile range (the length of the blue box) are outliers, indicated by a red ‘+’.
In Figure 8, the influence of the preset value of K on the QGME-EM model is examined on the Ionosphere data. We observe that with different fractions of missing values and training samples, the values of K that achieve the best performance may differ; as K grows large (e.g., 20 here), the performance degrades due to over-fitting. In contrast, we do not need to set the number of clusters for the proposed iQGME-VB model. As long as the truncation level N is large enough (N = 20 for all the experiments), the number of clusters is inferred by the algorithm. We give an example of the posterior on the number of clusters inferred by the proposed iQGME-VB model, and report the statistics of the most probable number of experts for each missing fraction and training fraction in Figure 9, which suggests that the number of clusters may vary significantly even across trials with the same fraction of feature values missing and the same fraction of samples for training. Therefore, it may not be appropriate to set a fixed value for the number of clusters for all the experimental settings, as one has to do for the QGME-EM.
Although our main purpose is classification, one may also be interested in how well the algorithm estimates the missing values while pursuing that purpose. In Figure 10, we show the ratio of correctly estimated missing values for the Ionosphere data set with 25% of the feature values missing, where two criteria are considered: the true value lies within one standard deviation (red circles) or two standard deviations (blue squares) of the posterior mean. This figure suggests that the algorithm estimates most of the missing values within a reasonable range of the true values when the training size is large enough; even when the estimates are less satisfactory (as for limited training data), the classification results remain relatively robust, as shown in Figure 6.
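This coverage check can be made concrete as follows: compute the fraction of masked entries whose true values fall within k posterior standard deviations of the corresponding posterior means (k = 1 or 2 in Figure 10). The function and variable names below are ours, not code from the paper.

```python
import numpy as np

def coverage_ratio(true_vals, post_means, post_stds, k=1.0):
    """Fraction of missing entries whose true value lies within k posterior
    standard deviations of the posterior mean (k = 1 or 2 in Figure 10)."""
    true_vals, post_means, post_stds = map(np.asarray, (true_vals, post_means, post_stds))
    return float(np.mean(np.abs(true_vals - post_means) <= k * post_stds))

# Toy usage with synthetic posterior summaries for the masked entries.
rng = np.random.default_rng(0)
truth = rng.normal(size=200)
means = truth + rng.normal(scale=0.3, size=200)   # imperfect reconstructions
stds = np.full(200, 0.5)
print(coverage_ratio(truth, means, stds, k=1), coverage_ratio(truth, means, stds, k=2))
```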
Figure 10.
Ratio of missing values whose true values lie within one standard deviation (red circles) or two standard deviations (blue squares) of the posterior means, for the Ionosphere data set with 25% of the feature values missing. One trial for each training size is considered.
We discussed the advantages and disadvantages of inference with MCMC and VB in Section 5.1. Here we take the Ionosphere data with 25% of the features missing as an example to compare the two inference techniques, as shown in Figure 11. They achieve similar performance for the particular iQGME model proposed in this paper. The time consumed per iteration is also comparable, and increases almost linearly with the training size, as discussed in Section 5.5. The VB inference takes slightly longer per iteration, probably due to the extra computation of the lower bound on the log marginal likelihood, which serves as the convergence criterion. The significant difference lies in the number of iterations required. In the experiment, even with a very strict threshold (10^−6) on the relative change of the lower bound, the VB algorithm converges in about 50 iterations in most cases, except when training data are very scarce (10%). For the MCMC inference, we discard the samples from the first 1000 iterations (burn-in) and collect the next 500 samples to represent the posterior. This is far from enough to claim convergence; however, we consider it a fair computational comparison, as the two methods yield similar results under this setting. Given that the VB algorithm takes only about 1/30 of the CPU time while achieving performance similar to MCMC, in the following examples we present only results based on VB inference. Nevertheless, in all the examples below we also performed Gibbs sampling, and the consistency of the inferences and the computational costs relative to VB were as summarized here (i.e., in all cases the VB and MCMC inferences agreed closely, and VB provided considerable computational acceleration).
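The two stopping rules described above can be sketched with the following skeleton; `vb_update` and `gibbs_step` are placeholders for one full round of the respective updates (they are not part of the paper's code), and only the schedule taken from the text is fixed: a relative lower-bound change below 10^−6 for VB, and 1000 burn-in plus 500 collected samples for MCMC.

```python
def run_vb(state, vb_update, tol=1e-6, max_iter=500):
    """Iterate variational updates until the relative change of the lower
    bound falls below `tol` (the criterion used in the experiments)."""
    prev_bound = None
    for it in range(1, max_iter + 1):
        state, bound = vb_update(state)          # one full round of VB updates
        if prev_bound is not None and abs(bound - prev_bound) <= tol * abs(prev_bound):
            return state, it
        prev_bound = bound
    return state, max_iter

def run_gibbs(state, gibbs_step, burn_in=1000, n_collect=500):
    """Discard `burn_in` iterations, then keep the next `n_collect` samples."""
    samples = []
    for it in range(burn_in + n_collect):
        state = gibbs_step(state)                # one Gibbs sweep
        if it >= burn_in:
            samples.append(state)
    return samples

# Toy usage: a dummy VB update whose lower bound converges geometrically.
dummy_update = lambda s: (s + 1, -10.0 - 0.5 ** (s + 1))
_, n_iters = run_vb(0, dummy_update)
print(n_iters)
```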
Figure 11.
Comparison between the VB- and MCMC-inferred iQGME on the Ionosphere data with 25% of the features missing, in terms of (a) performance, (b) time consumed per iteration, and (c) number of iterations. For the VB inference, we set a threshold (10^−6) on the relative change of the lower bound between two consecutive iterations as the convergence criterion; for the MCMC inference, we discard the samples from the first 1000 iterations (burn-in) and collect the next 500 samples to represent the posterior.
6.3 Unexploded Ordnance Data
We now consider an unexploded ordnance (UXO) detection problem (Zhang et al., 2003), in which two types of sensors are used to collect data, but one of them may be absent for particular targets. Specifically, one sensor is a magnetometer (MAG) and the other an electromagnetic induction (EMI) sensor; these sensors are deployed separately to interrogate buried targets, and for some targets both sensors are deployed while for others only one sensor is deployed. This is a real sensing problem for which missing data occur naturally. The data considered were made available to the authors by the US Army (and were collected from a real former bombing range in the US); the data are available to other researchers upon request. The total number of targets is 146, of which 79 are UXO and the rest are non-UXO (i.e., non-explosives). A six-dimensional feature vector is extracted from the raw signals to represent each target, with the first three components corresponding to MAG features and the rest to EMI features (details on feature extraction are provided in Zhang et al., 2003). Figure 12 shows the missing patterns for this data set.
Figure 12.
Missing pattern for the unexploded ordnance data set, where black and white indicate observed and missing, respectively.
We compare the proposed iQGME-VB algorithm with the SVM, RVM, and LR-Integration, as detailed in Section 6.2. To evaluate the overall performance of the classifiers, we randomly partition the data into training and test subsets and vary the training size. Results are shown in Figure 13, where only the mean performance is reported for legibility. From this figure, the proposed iQGME-VB method is robust across all the experimental settings under both performance criteria.
Figure 13.
Mean performance over 100 random training/test partitions for each training fraction on the unexploded ordnance data set, in terms of (a) area under the ROC curve, and (b) classification accuracy.
6.4 Sepsis Classification Data
In Sections 6.2 and 6.3, we demonstrated the proposed iQGME-VB on data sets of low to moderate dimensionality. A high-dimensional data set with naturally missing values is considered in this subsection. These data were made available to the authors by the National Center for Genomic Research in the US, and will be made available upon request. This is another example for which missing data are a natural consequence of the sensing modality. There are 121 patients infected by sepsis, 90 of whom survive (label −1) and 31 of whom die (label 1). For each patient, we have 521 metabolic features and 100 protein features. The purpose is to predict whether a patient infected by sepsis will die, given his/her features. The missing pattern of the feature values is shown in Figure 14(a), where black indicates observed (this missingness is a natural consequence of the sensing device).
Figure 14.
Sepsis data set. (a) Missing pattern, where black and white indicate observed and missing, respectively; (b) mean performance over 100 random training/test partitions for each training fraction.
As the data are in a 621-dimensional feature space, with only 121 samples available, we use the MFA-based variant of the iQGME (Section 2.4). To impose the low-rank belief for each cluster, we set c0 = d0 = 1, and the largest possible dimensionality for clusters is set to be L = 50.
We compare to the same algorithms considered in Section 6.3, except the LR-Integration algorithm, since it is not capable of handling such a high-dimensional data set. The mean AUC over ten random partitions is reported in Figure 14(b). Here we report the SVM and RVM results on the original data, since they are able to classify the data in the original 621-dimensional space after missing values are imputed; we also examined SVM and RVM results in a lower-dimensional latent space, obtained by first performing factor analysis on the data, and these results were very similar to the SVM/RVM results in the original 621-dimensional space. From Figure 14(b), our method provides an improvement by handling missing values analytically within the model inference and by performing dimensionality reduction jointly with learning the local classifiers.
6.5 Multi-Task Learning with Landmine Detection Data
We now consider a multi-task-learning example. In an application of landmine detection (available at http://www.ee.duke.edu/~lcarin/LandmineData.zip), data collected from 19 landmine fields are treated as 19 subtasks. Among them, subtasks 1–10 correspond to regions that are relatively highly foliated and subtasks 11–19 correspond to regions that are bare earth or desert. In all subtasks, each target is characterized by a 9-dimensional feature vector x with corresponding binary label y (1 for landmines and −1 for clutter). The number of landmines and clutter in each task is summarized in Figure 15. The feature vectors are extracted from images measured with airborne radar systems. A detailed description of this landmine data set has been presented elsewhere (Xue et al., 2007).
Figure 15.
Number of landmines and clutter in each task for the landmine-detection data set (Xue et al., 2007).
Although our main objective is to simultaneously learn classifiers for multiple tasks with incomplete data, we first demonstrate the proposed iQGME-based multi-task learning (MTL) model on the complete data, comparing it to two multi-task learning algorithms designed for the situation with all the features observed. One is based on task-specific logistic regression (LR) models, with the DP as a hierarchical prior across all the tasks (Xue et al., 2007); the other assumes an underlying structure, which is shared by all the tasks (Ando and Zhang, 2005). For the LR-MTL algorithm, we cite results on complete data from Xue et al. (2007), and implement the authors’ Matlab code with default hyper-parameters on the cases with incomplete data. The Matlab implementation for the Structure-MTL algorithm is included in the “Transfer Learning Toolkit for Matlab” available at http://multitask.cs.berkeley.edu/. The dimension of the underlying structure is a user-set parameter, and it should be smaller than the feature dimension in the original space. As the dimension of the landmine detection data is 9, we set the hidden dimension as 5. We also tried 6, 7, and 8, and did not observe big differences in performance. Single-task learning (STL) iQGME and LR models are also included for comparison.
Each task is divided randomly into training and test subsets. Since the number of elements in the two classes is highly unbalanced, as shown in Figure 15, we impose that there is at least one instance from each class in each subset. Following Xue et al. (2007), the size of the training subset in each task varies from 20 to 300 in increments of 20, and 100 independent trials are performed for each training size. The average AUC (Hanley and McNeil, 1982) over all 19 tasks is calculated as the performance measure for each trial at a given training size. Results are reported in Figure 16.
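For concreteness, the evaluation protocol just described can be sketched as follows, with the AUC computed through its equivalence to the Mann-Whitney statistic (consistent with Hanley and McNeil, 1982); `train_and_score` is a placeholder for the classifier under study, and the function names are ours.

```python
import numpy as np

def auc(labels, scores):
    """AUC via the probability that a positive outscores a negative
    (the Mann-Whitney formulation; ties count one half)."""
    labels, scores = np.asarray(labels), np.asarray(scores, dtype=float)
    pos = scores[labels == 1][:, None]
    neg = scores[labels == -1][None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def average_auc(tasks, train_and_score, train_size, n_trials=100, seed=0):
    """Mean (over trials) of the AUC averaged over all tasks.
    tasks: list of (X, y) pairs with y in {-1, +1}; each random partition is
    redrawn until both subsets contain at least one instance of each class."""
    rng = np.random.default_rng(seed)
    trial_means = []
    for _ in range(n_trials):
        per_task = []
        for X, y in tasks:
            while True:
                idx = rng.permutation(len(y))
                tr, te = idx[:train_size], idx[train_size:]
                if np.unique(y[tr]).size == 2 and np.unique(y[te]).size == 2:
                    break
            per_task.append(auc(y[te], train_and_score(X[tr], y[tr], X[te])))
        trial_means.append(np.mean(per_task))
    return np.mean(trial_means), np.std(trial_means)

# Training sizes considered in the text: 20, 40, ..., 300 samples per task.
train_sizes = list(range(20, 301, 20))
```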
Figure 16.
Average AUC over 19 tasks of landmine detection with complete data. Error bars reflect the standard deviation across 100 random partitions of training and test subsets. Results of logistic regression based algorithms are cited from Xue et al. (2007), where LR-MTL and LR-STL respectively correspond to SMTL-2 and STL in Figure 3 of Xue et al. (2007).
The first observation from Figure 16 is that we obtain a significant performance improvement for single-task learning by using iQGME-VB instead of the linear logistic regression model (Xue et al., 2007). We also notice that the multi-task algorithm based on iQGME-VB further improves the performance when the training data are scarce, and yields overall results comparable to those of LR-MTL. The Structure-MTL does not perform well on this data set. We suspect that a hidden structure in such a 9-dimensional space does not necessarily exist. Another possible reason is that the minimization of empirical risk is sensitive to highly unbalanced labels, as in this data set.
It is also interesting to explore the similarity among tasks. The similarity defined by different algorithms may differ. In Xue et al. (2007), two tasks are defined to be similar if they share the same linear classifier. In contrast, because it models the joint distribution of covariates and the response, the iQGME-MTL requires both the data distributions and the classification boundaries to be similar for two tasks to be deemed similar. Another difference is that two tasks may be partially similar, since sharing between tasks is encouraged at the cluster level rather than at the task level (Xue et al., 2007, employ task-level clustering). We generate the similarity matrices between tasks as follows: in each random trial, there are in total S higher-level items shared among tasks. For each task, we find the task-specific probability mass function (pmf) over all the higher-level items. Using these pmfs as the characteristics of the tasks in the current trial, we calculate the pair-wise Kullback-Leibler (KL) distances and convert them to similarity measures through a minus exponential function. Results of multiple trials are summed and normalized, as shown in Figure 17. The similarity structure among tasks becomes clearer when more training data are available. As discovered by Xue et al. (2007), we also find two large clusters corresponding to the two different vegetation conditions of the landmine fields (tasks 1–10 and tasks 11–19). Further sub-structure among tasks is also uncovered by the iQGME-MTL model, which may suggest other, unknown differences among the landmine fields.
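The similarity construction described above might be sketched as follows; the symmetrization of the KL distance and the unit scale inside the exponential are our assumptions, since the exact formula is not reproduced in the text.

```python
import numpy as np

def task_similarity(pmfs, eps=1e-12):
    """Pairwise task similarity from each task's pmf over the S shared
    higher-level items: symmetrized KL distance mapped through exp(-d).
    pmfs : (T, S) array, one probability mass function per row."""
    p = np.asarray(pmfs, dtype=float) + eps       # avoid log(0)
    p /= p.sum(axis=1, keepdims=True)
    T = p.shape[0]
    sim = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            kl_ij = np.sum(p[i] * np.log(p[i] / p[j]))
            kl_ji = np.sum(p[j] * np.log(p[j] / p[i]))
            sim[i, j] = np.exp(-0.5 * (kl_ij + kl_ji))
    return sim

# Toy usage: three tasks; the first two use similar higher-level items.
pmfs = np.array([[0.6, 0.3, 0.1, 0.0],
                 [0.5, 0.4, 0.1, 0.0],
                 [0.0, 0.1, 0.2, 0.7]])
print(np.round(task_similarity(pmfs), 3))
# Matrices from multiple trials would then be summed and normalized,
# as described in the text.
```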
Figure 17.
Similarity between tasks in the landmine detection problem with complete data, given (a) 20, (b) 100, and (c) 300 training samples from each task. The size of each green block represents the value of the corresponding matrix element.
Having obtained competitive results on the landmine-detection data set with complete data, we next evaluate the iQGME-based algorithms on incomplete data, simulated by randomly removing a portion of the feature values in each task, as in Section 6.2. We consider three different portions of missing values: 25%, 50%, and 75%. As in the experiments above on benchmark data sets, we perform ten independent random trials for each setting of missing fraction and training size.
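A minimal sketch of this missing-data simulation, assuming each feature value is removed independently with the stated probability (the function and variable names are ours):

```python
import numpy as np

def remove_features(X, missing_fraction, seed=0):
    """Randomly mask a fraction of the feature values; returns a copy of X
    with np.nan at the removed entries plus the boolean mask (True = missing)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < missing_fraction
    X_miss = np.asarray(X, dtype=float).copy()
    X_miss[mask] = np.nan
    return X_miss, mask

# Example: the three missing fractions considered in the experiments.
X = np.arange(20.0).reshape(4, 5)
for frac in (0.25, 0.50, 0.75):
    X_miss, mask = remove_features(X, frac, seed=1)
```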
To the best of our knowledge, there exists no previous work in the literature on multi-task learning with missing data. As presented in Figure 18, we use the LR-MTL (Xue et al., 2007) and the Structure-MTL (Ando and Zhang, 2005) with missing values imputed as baseline algorithms. Results of the two-step LR with integration (Williams et al., 2007) and the LR-STL with single imputations are also included for comparison. Imputations using both unconditional means and conditional means are considered. From Figure 18, iQGME-STL consistently performs best among the single-task learning methods, and even outperforms LR-MTL-Uncond when the training set is relatively large. Imputation using conditional means yields consistently better results for the LR-based models on this data set. The iQGME-MTL outperforms the baselines and all the single-task learning methods overall. Furthermore, the improvement of iQGME-MTL is more pronounced when more features are missing. These observations underscore the advantage of handling missing data in a principled manner while learning multiple tasks simultaneously.
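The two single-imputation baselines can be illustrated under a simple Gaussian assumption: unconditional-mean imputation fills each missing entry with the feature mean, while conditional-mean imputation fills a row's missing block with E[x_m | x_o] under an estimated mean and covariance. The estimation of (mu, Sigma) from incomplete data (e.g., from complete cases or an EM fit) is omitted, the function names are ours, and the baselines' exact implementations may differ.

```python
import numpy as np

def unconditional_mean_impute(X, mu):
    """Fill each missing (NaN) entry with the corresponding feature mean."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    X[miss] = np.broadcast_to(mu, X.shape)[miss]
    return X

def conditional_mean_impute(X, mu, Sigma):
    """Fill the missing block of each row with the Gaussian conditional mean
    E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)."""
    X = np.array(X, dtype=float)
    for i in range(X.shape[0]):
        m = np.isnan(X[i])
        o = ~m
        if m.any() and o.any():
            Soo = Sigma[np.ix_(o, o)]
            Smo = Sigma[np.ix_(m, o)]
            X[i, m] = mu[m] + Smo @ np.linalg.solve(Soo, X[i, o] - mu[o])
        elif m.all():
            X[i, m] = mu[m]       # nothing observed: fall back to the mean
    return X

# Toy usage with a known mean and covariance.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]])
X = np.array([[np.nan, 1.0, 0.5], [0.2, np.nan, np.nan]])
print(unconditional_mean_impute(X, mu))
print(conditional_mean_impute(X, mu, Sigma))
```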
Figure 18.
Average AUC over 19 tasks of landmine detection for the cases when (a) 25%, (b) 50%, and (c) 75% of the features are missing. Mean values of performance across 10 random partitions of training and test subsets are reported. Error bars are omitted for legibility.
The task-similarity matrices for the incomplete-data cases are shown in Figure 19. When a small fraction (e.g., 25%) of the feature values are missing and the training data are rich (e.g., 300 samples from each task), the similarity pattern among tasks resembles that of the complete-data case. As the fraction of missing values grows, the tasks appear more different from each other in terms of their usage of the higher-level items. Considering that the missing pattern for each task is unique, it is plausible that tasks look quite different from each other when a large fraction of the feature values are missing. However, the fact that tasks tend to use different subsets of higher-level items does not mean that joint learning is equivalent to learning the tasks separately (STL), since the parameters of the common base measures are inferred from all the tasks.
Figure 19.
Similarity between tasks in the landmine detection problem with incomplete data. Rows 1, 2, and 3 correspond to the cases with 25%, 50%, and 75% of the features missing, respectively; columns 1, 2, and 3 correspond to the cases with 20, 100, and 300 training samples from each task, respectively.
6.6 Multi-Task Learning with Handwritten Letters Data
The final example corresponds to multi-task learning of classifiers for handwritten letters; this data set is included in the “Transfer Learning Toolkit for Matlab” available at http://multitask.cs.berkeley.edu/. The objective of each task is to distinguish two letters that are easily confused. The number of samples for all the letters considered in the eight tasks is summarized in Table 2. Each sample is a 16 × 8 image, as shown in Figure 20. We use the 128 pixel values of each sample directly as its feature vector.
Table 2.
Handwritten letters classification data set.
| Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 |
|---|---|---|---|---|---|---|---|
| ‘c’: 2107 | ‘g’: 2460 | ‘m’: 1596 | ‘a’: 4016 | ‘i’: 4895 | ‘a’: 4016 | ‘f’: 918 | ‘h’: 858 |
| ‘e’: 4928 | ‘y’: 1218 | ‘n’: 5004 | ‘g’: 2460 | ‘j’: 188 | ‘o’: 3880 | ‘t’: 2131 | ‘n’: 5004 |
Figure 20.
Sample images of the handwritten letters. The two images in each column represent the two classes in the corresponding task described in Table 2.
We compare the proposed iQGME-MTL algorithm to the LR-MTL (Xue et al., 2007) and the Structure-MTL (Ando and Zhang, 2005) introduced in Section 6.5. For the non-parametric Bayesian methods (iQGME-MTL and LR-MTL), we use the same parameter settings as before. The dimension of the underlying structure for the Structure-MTL is set to 50 in the results shown in Figure 21; we also tried 10, 20, 40, 60, 80, and 100, and did not observe significant differences. From Figure 21, the iQGME-MTL performs significantly better than the baselines on this data set for all the missing fractions and training fractions under consideration. As expected, the Structure-MTL yields results comparable to those of the LR-MTL on this data set.
Figure 21.
Average AUC over eight tasks of handwriting letters classification for the cases when (a) none, (b) 25%, (c) 50%, and (d) 75% of the features are missing. Mean values of performance with one standard deviation across 10 random partitions of training and test subsets are reported.
7. Conclusion and Future Work
In this paper we have introduced three new concepts, summarized as follows. First, we have employed non-parametric Bayesian techniques to develop a mixture-of-experts algorithm for classifier design, which employs a set of localized (in feature space) linear classifiers as experts. The Dirichlet process is employed to allow the model to infer automatically the proper number of experts and their characteristics; in fact, since a Bayesian formulation is employed, a full posterior distribution is manifested on the properties of the local experts, including their number. Secondly, the classifier is endowed with the ability to naturally address missing data, without the need for an imputation step. Finally, the whole framework has been placed within the context of multi-task learning, allowing one to jointly infer classifiers for multiple data sets with missing data. The multi-task-learning component has also been implemented with the general tools associated with the Dirichlet process, with specific implementations manifested via the hierarchical Dirichlet process. Because of the hierarchical form of the model, in terms of a sequence of distributions in the conjugate-exponential family, all inference has been performed efficiently via variational Bayesian (VB) analysis. The VB results have been compared to those computed via Gibbs sampling; the VB results have been found to be consistent with those inferred via Gibbs sampling, while requiring a small fraction of the computational cost. Results have been presented for single-task and multi-task learning on various data sets with the same hyper-parameter settings (no model-parameter tuning), and encouraging algorithm performance has been demonstrated.
Concerning future research, we note that the use of multi-task learning provides an important class of contextual information, and therefore is particularly useful when one has limited labeled data and when the data are incomplete (missing features). Another form of context that has received significant recent attention is semi-supervised learning (Zhu, 2005). There has been recent work on integrating multi-task learning with semi-supervised learning (Liu et al., 2007). An important new research direction includes extending semi-supervised multi-task learning to realistic problems for which the data are incomplete.
Appendix A
The update equations of single-task learning with incomplete data are summarized as follows:
- The required expectations involving ti may be derived from the properties of truncated normal distributions, where φ(·) and Φ(·) denote the probability density function and the cumulative distribution function of the standard normal distribution, respectively; a numerical sketch of these expectations is given after this list. A related derivation for a GMM with incomplete data may be found in Williams et al. (2007), where no classifier terms appear. Writing the intercept explicitly, the conditional distribution of the missing features given the observed features is also a normal distribution, so the variational posterior factorizes into the product of a factor independent of the missing features and the variational posterior of the missing features; for complete data, no such factorization is necessary.
- q(Vh|vh1, vh2): Similar updates may be found in Blei and Jordan (2006), except that we place a prior belief on α here instead of fixing it to a constant.
- q(μh, Λh|mh, uh, Bh, νh): Similar updates may be found in Williams et al. (2007).
- q(ζp, λp|φp, γ, ap, bp), 〈λp〉 = ap/bp: Similar updates may be found in Xue et al. (2007).
- q(α|τ1, τ2), 〈α〉 = τ1/τ2: Similar updates may be found in any VB-inferred DP model with a Gamma prior on α (Xue et al., 2007).
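As a point of reference for the truncated-normal expectations mentioned in the first item above, the standard mean of a unit-variance normal truncated at zero, as used in probit-style augmentation (Albert and Chib, 1993), is sketched below; the explicit parameterization in the paper's updates is not reproduced here, so the unit variance and the variable names are assumptions.

```python
from scipy.stats import norm

def truncated_normal_mean(m, y):
    """Mean of t ~ N(m, 1) truncated to t > 0 if y = +1, or to t <= 0 if y = -1:
        E[t] = m + phi(m) / Phi(m)    for y = +1,
        E[t] = m - phi(m) / Phi(-m)   for y = -1,
    with phi and Phi the standard normal pdf and cdf, as in the appendix."""
    if y == 1:
        return m + norm.pdf(m) / norm.cdf(m)
    return m - norm.pdf(m) / norm.cdf(-m)

print(truncated_normal_mean(0.3, 1), truncated_normal_mean(0.3, -1))
```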
The update equations of multi-task learning with incomplete data are summarized as follows:
- q(zji|ρji)
- q(V|v)
- q(α|τ1, τ2), 〈α〉 = τ1/τ2.
- q(c|σ)
- q(Us|κs1, κs2)
- q(β|τ3, τ4), 〈β〉 = τ3/τ4.
- q(μs, Λs|ms, us, Bs, νs)
- q(λp|ap, bp), 〈λp〉 = ap/bp.
Contributor Information
Chunping Wang, Email: CW36@EE.DUKE.EDU, Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, USA.
Xuejun Liao, Email: XJLIAO@EE.DUKE.EDU, Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, USA.
Lawrence Carin, Email: LCARIN@EE.DUKE.EDU, Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, USA.
David B. Dunson, Email: DUNSON@STAT.DUKE.EDU, Department of Statistical Science, Duke University, Durham, NC 27708-0291, USA
References
- Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88:669–679.
- Ando RK, Zhang T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research. 2005;6:1817–1853.
- Attias H. A variational Bayesian framework for graphical models. Advances in Neural Information Processing Systems (NIPS); 2000.
- Beal MJ. Variational Algorithms for Approximate Bayesian Inference. PhD dissertation, Gatsby Computational Neuroscience Unit, University College London; 2003.
- Blei DM, Jordan MI. Variational inference for Dirichlet process mixtures. Bayesian Analysis. 2006;1(1):121–144.
- Caruana R. Multitask learning. Machine Learning. 1997;28:41–75.
- Chechik G, Heitz G, Elidan G, Abbeel P, Koller D. Max-margin classification of data with absent features. Journal of Machine Learning Research. 2008;9:1–21.
- Collins LM, Schafer JL, Kam CM. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods. 2001;6(4):330–351.
- Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B. 1977;39:1–38.
- Dick U, Haider P, Scheffer T. Learning from incomplete data with infinite imputations. International Conference on Machine Learning (ICML); 2008.
- Dunson DB, Pillai N, Park J-H. Bayesian density regression. Journal of the Royal Statistical Society: Series B. 2007;69.
- Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90:577–588.
- Ferguson T. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1:209–230.
- Gelfand AE, Hills SE, Racine-Poon A, Smith AFM. Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association. 1990;85:972–985.
- Ghahramani Z, Beal MJ. Variational inference for Bayesian mixtures of factor analysers. In: Advances in Neural Information Processing Systems (NIPS), vol. 12. MIT Press; 2000. pp. 449–455.
- Ghahramani Z, Hinton GE. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto; 1996.
- Ghahramani Z, Jordan MI. Learning from incomplete data. Technical report, Massachusetts Institute of Technology; 1994.
- Graepel T. Kernel matrix completion by semidefinite programming. Proceedings of the International Conference on Artificial Neural Networks; 2002. pp. 694–699.
- Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
- Hannah L, Blei D, Powell W. Dirichlet process mixtures of generalized linear models. Artificial Intelligence and Statistics (AISTATS); 2010. pp. 313–320.
- Ibrahim J. Incomplete data in generalized linear models. Journal of the American Statistical Association. 1990;85:765–769.
- Ishwaran H, James LF. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;96:161–173.
- Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixtures of local experts. Neural Computation. 1991;3:79–87. doi: 10.1162/neco.1991.3.1.79.
- Jordan MI, Jacobs RA. Hierarchical mixtures of experts and the EM algorithm. Neural Computation. 1994;6:181–214.
- Kurihara K, Welling M, Teh YW. Collapsed variational Dirichlet process mixture models. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI); 2007. pp. 2796–2801.
- Liang P, Jordan MI. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. Proceedings of the International Conference on Machine Learning (ICML); 2008. pp. 584–591.
- Liao X, Li H, Carin L. Quadratically gated mixture of experts for incomplete data classification. Proceedings of the International Conference on Machine Learning (ICML); 2007. pp. 553–560.
- Liu Q, Liao X, Carin L. Semi-supervised multitask learning. Neural Information Processing Systems; 2007.
- MacEachern SN, Müller P. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics. 1998;7.
- Meeds E, Osindero S. An alternative infinite mixture of Gaussian process experts. In: NIPS, vol. 18. MIT Press; 2006. pp. 883–890.
- Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83:67–79.
- Neal RM. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto; 1993.
- Newman DJ, Hettich S, Blake CL, Merz CJ. UCI repository of machine learning databases; 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
- Ng AY, Jordan MI. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems (NIPS); 2002.
- Rasmussen CE, Ghahramani Z. Infinite mixtures of Gaussian process experts. In: NIPS, vol. 14. MIT Press; 2002.
- Rodríguez A, Dunson DB, Gelfand AE. Bayesian nonparametric functional data analysis through density estimation. Biometrika. 2009;96. doi: 10.1093/biomet/asn054.
- Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
- Rubin DB. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Inc.; 1987.
- Schafer JL, Graham JW. Missing data: Our view of the state of the art. Psychological Methods. 2002;7:147–177.
- Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;1:639–650.
- Shahbaba B, Neal R. Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research. 2009;10:1829–1850.
- Shivaswamy PK, Bhattacharyya C, Smola AJ. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research. 2006;7:1283–1314.
- Smola A, Vishwanathan S, Hofmann T. Kernel methods for missing variables. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics; 2005.
- Teh YW, Beal MJ, Jordan MI, Blei DM. Hierarchical Dirichlet processes. Journal of the American Statistical Association. 2006;101:1566–1581.
- Tipping ME. The relevance vector machine. In: Leen TK, Solla SA, Müller KR, editors. Advances in Neural Information Processing Systems (NIPS), vol. 12. MIT Press; 2000. pp. 652–658.
- Vapnik VN. The Nature of Statistical Learning Theory. Springer; 1995.
- Wang X, Li A, Jiang Z, Feng H. Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinformatics. 2006;7:32. doi: 10.1186/1471-2105-7-32.
- Waterhouse SR, Robinson AJ. Classification using hierarchical mixtures of experts. Proceedings of the IEEE Workshop on Neural Networks for Signal Processing IV; 1994. pp. 177–186.
- West M, Müller P, Escobar MD. Hierarchical priors and mixture models, with application in regression and density estimation. In: Freeman PR, Smith AF, editors. Aspects of Uncertainty. John Wiley; 1994. pp. 363–386.
- Williams D, Carin L. Analytical kernel matrix completion with incomplete multi-view data. Proceedings of the International Conference on Machine Learning (ICML) Workshop on Learning with Multiple Views; 2005. pp. 80–86.
- Williams D, Liao X, Xue Y, Carin L, Krishnapuram B. On classification with incomplete data. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2007;29(3):427–436. doi: 10.1109/TPAMI.2007.52.
- Xu L, Jordan MI, Hinton GE. An alternative model for mixtures of experts. Advances in Neural Information Processing Systems (NIPS). 1995;7:633–640.
- Xue Y, Liao X, Carin L, Krishnapuram B. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research. 2007;8:35–63.
- Yu K, Schwaighofer A, Tresp V, Ma W-Y, Zhang H. Collaborative ensemble learning: Combining collaborative and content-based information filtering via hierarchical Bayes. Proceedings of the Conference on Uncertainty in Artificial Intelligence; 2003. pp. 616–623.
- Zhang J, Ghahramani Z, Yang Y. Learning multiple related tasks using latent independent component analysis. Advances in Neural Information Processing Systems; 2006.
- Zhang Y, Collins LM, Yu H, Baum C, Carin L. Sensing of unexploded ordnance with magnetometer and induction data: theory and signal processing. IEEE Transactions on Geoscience and Remote Sensing. 2003;41(5):1005–1015.
- Zhu X. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison; 2005.