Abstract
Background
Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data.
Results
In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach to semi-supervised training of MMs; and (iii) a co-training based approach to semi-supervised training of MMs, both of which make use of unlabeled data.
Conclusions
The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance, and in some cases outperform, the co-training based semi-supervised MMs.
Background
The problem of predicting subcellular protein localization is important in cell biology, because it can provide valuable information for predicting protein function and protein-protein interactions. Furthermore, the location of proteins in their designated subcellular compartments is essential for the proper functioning of the cell. Abnormal subcellular localization has been correlated with diseases such as cancer [1].
Many supervised machine learning methods have been successfully applied to the problem of predicting the subcellular localization of a protein, which can be formulated as a sequence classification problem [2], where the amino acid sequence of a protein is used to classify it in localization classes. For example, Park and Kanehisa [3] trained Support Vector Machine (SVM) classifiers using as features, frequencies of occurrence of pairs of amino acids, with 0 to 3 gaps between them. Emanuelsson et al. [4] developed a Neural Network-based approach using only information available in the N-terminal sequence. Höglund et al. [5] integrated information from the N-terminal sequence, amino acid composition, and protein sequence motifs in an SVM-based approach. Ong and Zien [6] trained multiclass SVMs and used an automated combination of protein motif kernels, with motifs of length up to 5 extracted from the whole sequence, and from different subsequences of it, i.e., the first 15 and 60 amino acids, and the last 15 amino acids of the sequence. Scott et al. [7] developed a Bayesian network that predicts the subcellular localization of a target protein using its features, e.g., InterPro motifs and the subcellular localization of its interacting partners. Yuan [8] trained kth order Markov chain models, with k ranging from 1 to 8, and used an approximation technique to estimate the probability of each element in a sequence given the k contiguous previous elements.
The accuracy of classifiers obtained using supervised learning algorithms depends in part on the quantity of labeled data that is available. Recent advances in sequencing technologies have resulted in an exponential increase in the rate at which DNA and protein sequence data are being acquired [9]. Because annotating the sequences with their subcellular localization requires costly experiments and manual curation effort, reliable annotations are available for only a small fraction of protein sequences. However, even the unlabeled data can provide valuable information, i.e., they contain information about the joint probability distribution over sequence elements. Consequently, there is a significant interest in semi-supervised algorithms [10] that can exploit large amounts of unlabeled data together with limited amounts of labeled data in training classifiers to predict protein subcellular localization.
Formally, the semi-supervised learning problem can be defined as follows: Given training data D = (DL, DU) of labeled and unlabeled examples, where DL = {(xl, yl)}, l = 1,…,|DL|, and DU = {xu}, u = 1,…,|DU|, with xl ∈ R^d, yl ∈ Y, xu ∈ R^d, and |DL| ≪ |DU|; a hypothesis space H; and a performance criterion P, a learning algorithm L outputs a classifier h ∈ H that optimizes P. If |DL| = 0, the problem reduces to unsupervised learning; if |DU| = 0, it reduces to supervised learning. The input x can represent sequences over a finite alphabet X, x ∈ X*. During classification, the task of the classifier h is to accurately assign a new example xtest to a class label y ∈ Y.
Recently, a variety of methods for semi-supervised learning have been developed in the literature (see [11], [10] for reviews). Such methods have been successfully applied in many areas including text classification [12], [13], [14], natural language processing [15], [16], [17], image annotation [18], and, more recently, bioinformatics and computational biology [19], [20], [21]. However, semi-supervised learning methods have not been widely applied to the subcellular localization prediction problem.
One notable exception is the work of Xu et al. (2009) [22]. The authors applied Co-Forest, an algorithm proposed by Li and Zhou [23], to exploit unlabeled data in order to improve predictive accuracy on the protein subcellular localization prediction task. Co-Forest extends the co-training approach of Blum and Mitchell [13], which uses only two classifiers, to an ensemble of N classifiers, namely a Random Forest [24]. Co-Forest works as follows: let HN = {h1,…,hN} denote an ensemble of N classifiers. For each classifier hi ∈ HN, the concomitant ensemble of hi is defined as the set of classifiers in HN except hi, i.e., HN \ {hi}. An ensemble HN of N random trees is initially trained on the labeled data DL. At each subsequent iteration, for each classifier hi ∈ HN, its concomitant ensemble examines the unlabeled examples in DU. An unlabeled example, along with the label predicted by the concomitant ensemble, is added to the newly labeled set of hi if the number of classifiers in the concomitant ensemble that predict a particular label exceeds a predefined threshold. The classifier hi is then re-trained on DL together with its newly labeled set. The process is repeated until no tree in the Random Forest changes from one iteration to another [23].
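The voting step of the concomitant ensemble can be illustrated with a minimal sketch. The function name `concomitant_vote`, the toy rule-based "trees", and the threshold value are hypothetical and are not taken from [22] or [23]; the sketch only shows how one classifier is left out of the vote that labels an example on its behalf.

```python
from collections import Counter

def concomitant_vote(ensemble, i, x, threshold):
    """Label x by the vote of every classifier except ensemble[i].

    Returns the majority label if the number of agreeing classifiers
    exceeds `threshold`, otherwise None (the example stays unlabeled).
    """
    votes = Counter(h(x) for j, h in enumerate(ensemble) if j != i)
    label, n_votes = votes.most_common(1)[0]
    return label if n_votes > threshold else None

# Hypothetical stand-ins for random trees: simple hand-written rules.
ensemble = [
    lambda s: "in" if "KK" in s else "out",
    lambda s: "in" if s.startswith("M") else "out",
    lambda s: "in" if len(s) > 5 else "out",
    lambda s: "out",
]

x = "MKKLAV"
label_for_h3 = concomitant_vote(ensemble, 3, x, threshold=2)  # h0, h1, h2 vote
label_for_h0 = concomitant_vote(ensemble, 0, x, threshold=2)  # h1, h2, h3 vote
```

In this toy run the three classifiers other than the last one agree, so the example would be added to that classifier's newly labeled set, whereas the vote taken on behalf of the first classifier does not reach the threshold.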
In this paper, we present a novel semi-supervised approach to the problem of predicting protein subcellular localization. Specifically, we use abstraction augmented Markov models (AAMMs), which are variants of Markov models, to incorporate information available in the unlabeled data. AAMMs model the dependency of each element in a sequence on abstractions of k preceding elements [25]. The abstractions are organized into an abstraction hierarchy that groups together k-grams that induce similar conditional probabilities of the next letter in the sequence. An AAMM corresponds to a generative model for sequence data expressed in terms of random variables whose values correspond to abstractions over k-grams, in addition to the MM random variables [25]. AAMMs provide a simple way to incorporate unlabeled data into the model: first, the abstraction hierarchy is constructed using the entire training set including the unlabeled data. Next, the labeled data is used to estimate the parameters of a set of AAMMs (one for each class) based on the resulting abstraction hierarchy.
Thus, in effect, AAMMs: (i) exploit the relatively large amount of unlabeled data to discover abstractions that transform the sequence data x and, hence, effectively reduce the number of parameters used to specify the probability p(x); and (ii) use the resulting representation to estimate the posterior probability p(y|x). Hence, we hypothesize that AAMMs are likely to yield more robust estimates of p(y|x) than MMs when the amount of labeled data is much smaller compared to the amount of unlabeled data.
To test our hypothesis on the protein subcellular localization prediction task, we compare AAMMs that use both labeled and unlabeled data with AAMMs that use only labeled data, with the standard MMs, which cannot make use of unlabeled data, and also with MMs that can incorporate unlabeled data through an expectation maximization approach (EM-MM) and a co-training approach. The results of our experiments show that AAMMs can make effective use of unlabeled data and significantly outperform EM-MMs when the amount of labeled data is very small and relatively large amounts of unlabeled data are readily available. Here, because of the small amounts of labeled data available for estimating parameters, the ability of AAMMs to minimize overfitting (through parameter smoothing) turns out to be especially useful. The results also show that AAMMs are competitive with, and in some cases significantly outperform, two MMs co-trained on different views of the data.
Experiments and results
We present results of experiments on three protein subcellular localization data sets: psortNeg, plant, and non-plant data sets (see Data sets Section for details).
Experimental design
Our experiments on the protein subcellular localization prediction task are designed to explore the following questions: (i) How does the performance of semi-supervised AAMMs, which use both labeled and unlabeled data, compare to that of MMs trained only on labeled data? (ii) How do AAMMs compare with MMs when both use unlabeled data? (iii) How effective are AAMMs at exploiting unlabeled data to improve classification accuracy when the amount of labeled data is limited? Specifically, how does the performance of an AAMM trained using both labeled and unlabeled data compare to that of an AAMM trained using only labeled data when both take advantage of abstraction? To answer the first and second questions, we compared AAMMs trained using an abstraction hierarchy constructed from both labeled and unlabeled data with the standard MMs, which cannot make use of unlabeled data, and with MMs that can incorporate unlabeled data through an expectation maximization approach (EM) [26]. To answer the third question, we compared AAMMs trained using an abstraction hierarchy constructed from both labeled and unlabeled data with AAMMs trained using an abstraction hierarchy constructed only from labeled data.
In the first set of experiments, we trained semi-supervised AAMMs and supervised MMs for psortNeg, plant, and non-plant data sets. We ran experiments with 1%, 10%, and 25% of the training data being used as labeled examples, and the rest being treated as unlabeled examples (by ignoring the class). To obtain the subsets of 1%, 10%, and 25% of labeled examples, we sampled examples, using a uniform distribution, from the training set. Semi-supervised AAMMs are trained for values of m that range from 1 to N, where m is the cardinality of the set of abstractions Am used as “features” in the classification model, and N is the number of unique k-grams. We learned a single abstraction hierarchy (AH) from both labeled and unlabeled training data and used it to train an AAMM for each class (from the labeled sequences). An MM is trained on the same fraction of labeled data as its AAMM counterpart.
In the second set of experiments, we trained AAMMs, MMs, and EM-MMs for all three data sets. In the case of AAMMs, we trained classifiers for m = 1500. We chose m = 1500 because this partitioning of the set of k-grams produces classifiers that use a substantially smaller number of “features” than MMs (which use ≈ 8000 k-grams), while the model compression is not so stringent that important information in the data is lost through abstraction. When it is necessary to distinguish between AAMM training procedures, we denote by AAMM(l+u) an AAMM trained using an AH constructed from both labeled and unlabeled data, and by AAMM(l) an AAMM trained using an AH constructed only from labeled data. EM-MMs are trained on the same fractions of labeled and unlabeled data as their AAMM(l+u) counterparts, and AAMM(l) and MMs are trained on the same fraction of labeled data as their AAMM(l+u) counterparts.
Here, we fixed the number of unlabeled examples and varied the number of labeled examples. Specifically, we performed experiments with 1%, 5%, 10%, 15%, 20%, 25%, 35%, and 50% of the training data being used as labeled examples, and 50% being treated as unlabeled examples (by ignoring the class label). Note that the unlabeled subset of the training data is the same across all the experiments; the labeled subset of the training data is successively augmented to increase the amount of labeled data that is provided to the learner.
In the third set of experiments, we compared AAMMs with EM-MMs using a fixed number of labeled examples and a variable number of unlabeled examples. We performed experiments with (i) 1% of training data being treated as labeled, while 1%, 10%, 25%, 50%, 75%, 90%, and 99% being treated as unlabeled; (ii) 10% of training data being treated as labeled, while 1%, 10%, 25%, 50%, 75%, and 90% being treated as unlabeled; (iii) 25% of training data being treated as labeled, while 1%, 10%, 25%, 50%, and 75% being treated as unlabeled. As before, to obtain the subsets of labeled and unlabeled examples, we sampled from the training set using a uniform distribution. In all experiments, the class distribution in each labeled subset is the same as that in the entire training set.
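One way to draw labeled and unlabeled subsets that preserve the class distribution is sketched below. The function name `stratified_split`, the choice to keep the two subsets disjoint, the stratification of the unlabeled subset, and the rule of keeping at least one labeled example per class are all assumptions made for illustration; the text above does not specify these details.

```python
import random
from collections import defaultdict

def stratified_split(labels, labeled_frac, unlabeled_frac, seed=0):
    """Return (labeled_idx, unlabeled_idx), sampled uniformly within each class.

    Assumption: the two index sets are disjoint, and every class keeps
    at least one labeled example.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    labeled, unlabeled = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # uniform sampling within the class
        n_l = max(1, round(labeled_frac * len(idx)))
        n_u = round(unlabeled_frac * len(idx))
        labeled += idx[:n_l]
        unlabeled += idx[n_l:n_l + n_u]       # class labels of these are ignored
    return sorted(labeled), sorted(unlabeled)

# Toy class labels (hypothetical): 80 examples of class "a", 20 of class "b".
labels = ["a"] * 80 + ["b"] * 20
labeled_idx, unlabeled_idx = stratified_split(labels, 0.10, 0.50)
```

With these toy labels, 10% labeled and 50% unlabeled yields 8 + 2 labeled and 40 + 10 unlabeled indices, matching the 80/20 class ratio of the full set.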
In the fourth set of experiments, we compare the semi-supervised AAMM with the co-training procedure as described in [13]. The co-training procedure can be used with any learning algorithm for training two classifiers h1 and h2. In this study, we used two Markov models. Inspired by the work of Ong and Zien [6], instead of considering features extracted from the entire protein sequence, we considered two views on different subsequences. Specifically, the first view corresponds to features extracted from the first 60 amino acids of each sequence, whereas the second view corresponds to features extracted from the last 15 amino acids of each sequence. We trained each Markov model on a different view. Furthermore, in this experiment, we trained AAMMs on the two subsequences of the first 60 and the last 15 amino acids of each sequence, rather than the entire sequence.
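The two views can be obtained by simple slicing, as in the following sketch. The function name `two_views` and the toy sequence are hypothetical; how sequences shorter than 60 or 15 residues were handled is not stated in the text, and here such sequences are simply used in full.

```python
def two_views(sequence, n_first=60, n_last=15):
    """Return the N-terminal view (first 60 residues) and the C-terminal view (last 15)."""
    return sequence[:n_first], sequence[-n_last:]

# Hypothetical 80-residue toy sequence, not taken from the data sets.
toy = "M" + "A" * 69 + "K" * 10
view1, view2 = two_views(toy)
```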
Results
For all of the experiments, we report the average classification accuracy obtained in a 5-fold cross-validation experiment. All models are trained using 3-grams extracted from the data. For psortNeg, plant, and non-plant data sets the number of 3-grams is 7970, 7965, and 7999, respectively. Although the number of all unique k-grams is exponential in k, for large values of k, many of the k-grams may not appear in the data (consequently, the counts for such k-grams are zero). Note that the number of unique k-grams is bounded by the cardinality of the multiset of k-grams extracted from D.
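The relationship between the number of unique k-grams, the size of the multiset of k-grams, and the bound of 20^3 = 8000 possible 3-grams can be checked with a short sketch. The function name `kgram_counts` and the toy sequences are hypothetical.

```python
from collections import Counter

def kgram_counts(sequences, k=3):
    """Count every overlapping k-gram across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy sequences (hypothetical, not taken from the data sets).
toy = ["MKVLAAGIV", "MKVLSAGIV"]
counts = kgram_counts(toy, k=3)
n_unique = len(counts)            # distinct 3-grams observed in the data
n_total = sum(counts.values())    # cardinality of the multiset of 3-grams
upper_bound = 20 ** 3             # all possible 3-grams over 20 amino acids
```

The number of distinct k-grams can never exceed either the multiset size or the number of possible k-grams, which is why the reported counts (7970, 7965, and 7999) all fall below 8000.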
We define the relative reduction in classification error between two classifiers to be the difference in error divided by the larger of the two error rates. To test the statistical significance of results, we used the 5-fold cross-validated paired t test for the difference in two classification accuracies [27]. The null hypothesis that the two learning algorithms M1 and M2 have the same accuracy on the same test set can be rejected if |t(M1, M2)| > t4,0.975 = 2.776 (p < 0.05), where t4,0.975 is the 0.975 quantile of the t distribution with 4 degrees of freedom. We abbreviate |t(M1, M2)| by |t| in what follows.
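Both quantities can be computed in a few lines. The sketch below assumes the usual form of the k-fold cross-validated paired t statistic (mean per-fold difference times sqrt(k), divided by the sample standard deviation of the differences); the function names and the per-fold accuracies are hypothetical, while the accuracies 38.72% and 30.53% are the plant (1% labeled) values reported in this section.

```python
import math

def relative_error_reduction(acc_a, acc_b):
    """Difference in error divided by the larger error rate (accuracies in percent)."""
    err_a, err_b = 100.0 - acc_a, 100.0 - acc_b
    return abs(err_a - err_b) / max(err_a, err_b)

def cv_paired_t(acc_m1, acc_m2):
    """Paired t statistic over per-fold accuracy differences (k - 1 degrees of freedom).

    Assumes the per-fold differences are not all identical (non-zero variance).
    """
    diffs = [a - b for a, b in zip(acc_m1, acc_m2)]
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean * math.sqrt(k) / math.sqrt(var)

reduction = relative_error_reduction(38.72, 30.53)   # about 0.12

# Hypothetical per-fold accuracies for two classifiers over 5 folds.
folds_m1 = [60.0, 62.0, 61.0, 63.0, 59.0]
folds_m2 = [55.0, 58.0, 54.0, 57.0, 56.0]
t = cv_paired_t(folds_m1, folds_m2)
reject_null = abs(t) > 2.776                          # critical value for 4 d.o.f.
```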
AAMMs can provide more accurate models compared to MMs on the protein subcellular localization prediction task when the amount of labeled data is small compared to that of unlabeled data.
Figure 1 shows results of the comparison of AAMMs with MMs for 1%, 10%, and 25% of labeled data, for non-plant, plant, and psortNeg data sets. Note that the x axis of all subfigures shows the number of abstractions m on a logarithmic scale. When only 1% and 10% of the training data are labeled (Figure 1, first and second rows), AAMMs significantly outperform MMs for many choices of m, on all three data sets. For example, on the 1% plant data set, with m = 200, the accuracy of AAMM is 38.72%, whereas that of MM is 30.53%, which represents a 12% reduction in classification error, and |t| = 3.16 (the largest values of |t| are 27.58 for m = 4905, 21.91 for m = 2070, and 27.34 for m = 535 on non-plant, plant, and psortNeg, respectively). On the 10% plant data set, with m = 560, AAMM achieves an accuracy of 47.97%, compared to 37.87% for MM, and |t| = 10.01. This represents a 16% reduction in classification error. When we increased the fraction of labeled data to 25%, AAMMs still have a higher performance than MMs for many choices of m on non-plant and plant data sets, but become comparable in performance with MMs on the psortNeg data set.
AAMMs trained using abstraction hierarchies constructed from both labeled and unlabeled protein subcellular localization data significantly outperform AAMMs trained using abstraction hierarchies constructed only from labeled protein subcellular localization data.
Figure 2 shows results of experiments that compare AAMM(l+u) with AAMM(l), MM, and EM-MM on non-plant, plant, and psortNeg data sets. The x axis indicates the number of labeled examples in each data set. The number of unlabeled examples is kept fixed and is equal to the rightmost number of labeled examples on the x axis of each plot.
As can be seen in the figure, AAMM(l+u) significantly outperforms AAMM(l) on all three data sets when small fractions of labeled data are available. For example, with 110 labeled sequences on non-plant (i.e., 5% of labeled data), AAMM(l+u) achieves 63% accuracy while AAMM(l) achieves 52%, which gives a 23% reduction in classification error (|t| = 7.2). Strikingly, on the same data set, with only 22 labeled sequences (i.e., 1% of labeled data), AAMM(l+u) achieves 59% accuracy as compared to 43% obtained by AAMM(l), which gives a 28% reduction in classification error (|t| = 9.73). Hence, AAMM(l+u) is able to incorporate information available in the unlabeled data (i.e., joint probability distributions of contiguous amino acids in a sequence) to learn more robust abstraction hierarchies than AAMM(l) when the labeled training set is limited in size (thereby reducing the risk of overfitting).
Furthermore, AAMM(l+u) decreases the need for large amounts of labeled data. Specifically, on non-plant, AAMM(l+u) achieves 63% accuracy with 110 labeled examples, which AAMM(l) matches only with 438 labeled examples (≈ 4 times more labeled data). However, when the fraction of labeled data is large, and hence good estimates of model parameters can be obtained from such data, there is not much need for unlabeled data. For example, AAMM(l+u) becomes similar in performance to AAMM(l) on non-plant using 35% and 50% of labeled data (the null hypothesis is not rejected, |t| = 1.38 and |t| = 0.26, respectively).
As expected, the performance of AAMM(l+u) increases with the amount of labeled data. For example, on psortNeg with 12 labeled sequences (i.e., 1% of labeled data), AAMM(l+u) achieves 32% accuracy, while AAMM(l+u) with 289 labeled sequences (i.e., 25% of labeled data) achieves 58% accuracy, which corresponds to a 38% reduction in classification error.
AAMMs are able to incorporate information available in the unlabeled protein subcellular localization data, and hence, produce more robust classifiers than MMs and EM-MMs, when the fraction of labeled protein subcellular localization data is small.
Again, as can be seen in Figure 2, AAMM(l+u) is superior in performance to MM, especially when small amounts of labeled data are available. For example, on plant, with 75 labeled sequences (i.e., 10% of labeled data), MM achieves 39% accuracy as compared to 44% obtained using AAMM(l+u) (|t| = 3.07). On non-plant, with 219 labeled sequences (i.e., 10% of labeled data), MM achieves 51% accuracy whereas AAMM(l+u) achieves 64% (|t| = 14). AAMM(l+u) not only incorporates information available in the unlabeled data (see previous comparison), but also performs parameter smoothing. Thus, AAMM(l+u) provides more robust estimates of model parameters than MMs, and hence helps reduce overfitting when the labeled training set is limited in size.
Both AAMM(l+u) and EM-MM make use of information available in the unlabeled data (i.e., both improve the performance of their counterpart classifiers trained only from labeled data) on all three data sets, although the improvement is not very large on psortNeg (Figure 2). However, AAMM(l+u) uses the joint distribution over amino acids (independent of the class variable) to learn a more robust abstraction hierarchy (i.e., a finer partitioning of the set of k-grams), especially when the amount of labeled data is small, so that better estimates of parameters can be obtained. On the other hand, EM-MM uses the joint distribution over amino acids after an initial classifier has made predictions on the unlabeled data. When small amounts of labeled data are available, the predictions made by the initial classifier may not be reliable.
AAMM(l+u) significantly outperforms EM-MMs on non-plant, plant, and psortNeg data sets when the fraction of labeled data is small (see Figure 2). For example, with only 22 labeled sequences on non-plant (i.e., 1% of labeled data), AAMM(l+u) achieves 59% accuracy while EM-MM achieves 42%, which gives a 29% reduction in classification error (|t| = 8.83). Similarly, with only 8 labeled sequences on plant (i.e., 1% of labeled data), AAMM(l+u) achieves 34% accuracy as compared to 28% obtained by EM-MM, which gives an 8% reduction in classification error (|t| = 4.66). As the amount of labeled data increases, EM-MM significantly outperforms AAMM(l+u). For example, with 767 labeled sequences on non-plant (i.e., 35% of labeled data), EM-MM achieves 69% accuracy while AAMM(l+u) achieves 67% (|t| = 4.87).
Note that EM may decrease rather than increase the accuracy of classifiers if the generative model assumptions are not satisfied (see Figure 2, plant data set). A weighted EM (i.e., weighting unlabeled sequences less) [12] helped improve the performance of EM-MMs (data not shown). A similar approach could be considered in AAMMs when learning the abstraction hierarchies.
Figure 3 shows results of comparison of AAMMs with EM-MMs on non-plant, plant, and psortNeg data sets, respectively, while varying the amount of unlabeled data for three different fractions of labeled data (i.e., 1%, 10%, and 25% of labeled data) that are kept fixed. The x axis indicates the number of unlabeled examples in each data set.
As can be seen in Figure 3, the improvement in performance of AAMMs over EM-MMs is rather dramatic when the amount of labeled data is quite small. For example, when only 1% of labeled data is used regardless of the amount of unlabeled data, AAMMs consistently significantly outperform EM-MMs on non-plant and plant data sets (the largest and smallest t values on non-plant are 10.96 and 5.66, respectively). However, the difference in performance between AAMMs and EM-MMs diminishes as more and more labeled data become available (and eventually levels off). When the amount of labeled data is increased (e.g., 25% of labeled data), EM-MMs often significantly outperform AAMMs (Figures 3(a) and 3(c)). For example, on non-plant with 25% of unlabeled data, EM-MM achieves 68% accuracy, whereas AAMM achieves 66% (|t| = 7).
The classification accuracy of AAMMs typically increases with the amount of unlabeled data when the subset of labeled data is fixed (see Figure 3). For example, on non-plant, AAMM with 22 labeled sequences (i.e., 1% of labeled data) and 219 unlabeled sequences (i.e., 10% of unlabeled data) achieves an accuracy of 56%, as compared to 49% obtained by AAMM with 22 labeled sequences (i.e., 1% of labeled data) and 22 unlabeled sequences (i.e., 1% of unlabeled data), a 14% reduction in classification error.
AAMMs are comparable in performance with, and in some cases outperform, the co-training procedure, which uses MMs trained on different views of the protein subcellular localization data.
Figure 4 shows results of experiments that compare AAMMs with co-training MMs on non-plant, plant, and psortNeg data sets, where we fixed the number of unlabeled examples (to 50%) and varied the number of labeled examples (from 1% to 50% as before). The x axis indicates the number of labeled examples in each data set. The number of unlabeled examples is kept fixed and is equal to the rightmost number of labeled examples on the x axis of each plot.
As can be seen in the figure, on the non-plant data set, AAMMs trained on the first 60 and last 15 amino acids of each protein sequence significantly outperform two co-trained MMs, one trained on the first 60 amino acids of each sequence and the other trained on the last 15 amino acids of each sequence. For example, with 22 labeled sequences (i.e., 1% of labeled data), AAMM achieves 59% accuracy while co-training MMs achieve 56% (|t| = 7.14). With 548 labeled sequences (i.e., 25% of labeled data), the accuracy of AAMM is 61%, whereas that of co-training MMs is 57% (|t| = 7.12). Under the definition of relative error reduction given above, these accuracies correspond to approximately 7% and 9% reductions in classification error, respectively.
However, on plant and psortNeg data sets, AAMMs are comparable in performance with co-training MMs. For example, on the plant data set using 188 labeled sequences (i.e., 25% of labeled data), the accuracy of AAMM is 57%, whereas that of co-training MMs is 58% (the null hypothesis is not rejected, |t| = 0.61).
Summary and discussion
Identifying the subcellular localization of proteins is an important problem with broad applications in computational biology, e.g., rational drug design. Computational tools for identifying protein subcellular localization that can exploit large amounts of unlabeled data together with limited amounts of labeled data are especially important because of the high cost and effort involved in labeling the data.
In this study, we presented an abstraction-based approach to semi-supervised learning of classifiers for the protein subcellular localization prediction task. Our approach utilizes abstraction augmented Markov models [25], which extend higher order Markov models by adding new variables corresponding to abstractions of k-grams (i.e., substrings of a fixed length k). AAMMs are probabilistic generative models that have the ability to incorporate information available in the unlabeled data: initially, an abstraction hierarchy over the k-grams is constructed from both labeled and unlabeled data, independent of the class variable. The labeled data is used to estimate the model parameters, based on the resulting abstraction hierarchy.
In this paper we compare AAMMs with MMs and EM-MMs and co-trained MMs. The results of our experiments on the subcellular localization prediction task show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance, and in some cases outperform, the co-training based semi-supervised MMs.
Related work on semi-supervised learning
A variety of approaches to semi-supervised learning have been studied in the literature (see [11], [10] for reviews). Most of the existing semi-supervised learning algorithms, including those based on co-training [13], Expectation Maximization (EM) [12], Transductive Support Vector Machines (TSVM) [14], cluster kernels [28], and manifold based approaches [29,30], essentially involve different means of transferring labels from labeled to unlabeled samples in the process of learning a classifier that can generalize well on new unseen data.
EM-based methods provide a way to estimate the parameters of a generative model from incomplete data [26], i.e., samples that contain missing values for some of the variables. Semi-supervised learning is a special case of such inference where it is the class labels that are missing for a subset of the data [12]. Specifically, the parameters of the model are estimated initially from the labeled fraction of the training data, DL, and the resulting model is used to predict p(y|x) for each of the unlabeled samples in DU. The parameters are re-estimated using the entire training data D and this process is repeated until the estimates converge. Co-training [13] is a variant of this approach where unlabeled data are labeled with two different classifiers trained on different subsets of the features in x.
Several semi-supervised learning algorithms based on discriminative approaches to classification have been investigated. TSVM [14] can be seen as a discriminative counterpart of EM. TSVM starts by training an SVM on the labeled data and uses the trained SVM to label the unlabeled data. The algorithm iteratively attempts to maximize the margin of separation between the sets of samples labeled by the SVM (by considering at each iteration, alternative labels for pairs of originally unlabeled samples that have been assigned different labels by the SVM). A similar outcome can be achieved by adding an additional regularization term for unlabeled data to the objective function optimized by SVM [10]. Similar approaches for exploiting unlabeled data in training discriminative classifiers include [31], [32], [33], [34].
An alternative approach to exploiting unlabeled data relies on the manifold assumption: high-dimensional data lie on a lower dimensional manifold, making it possible to propagate labels from labeled samples to unlabeled samples based on some measure of closeness of the data points on the manifold. The manifold can be approximated by a weighted graph in which the nodes correspond to data samples and the weights on the links between nodes correspond to the pairwise similarity of the corresponding data points [35]. A number of techniques for label propagation have been proposed [29], [30]. Note that graph Laplacian based techniques can be interpreted as a more general type of regularization where not only the L2 norm of the hypothesis is penalized but also the L2 norm of the hypothesis gradient.
In contrast to the approaches reviewed above, we present a novel abstraction-based approach to semi-supervised learning of sequence classifiers. We compared the semi-supervised AAMMs with the semi-supervised variants of Markov models trained using expectation maximization [12], and using co-training [13], [10].
Expectation Maximization applied to Markov models
EM applied to MMs (EM-MMs) involves an iterative process of E- and M-steps. Specifically, an initial Markov model is learned only from the labeled sequences DL using Equations (3), (4), and (5) (initialization step). The current model is used to assign probabilistic labels to the (originally) unlabeled sequences DU, i.e., to calculate the probability p(y|xu) that each class y generated the unlabeled sequence xu, u = 1,…,|DU|, using Equation (6) (E-step). Next, a new model is learned from the originally labeled sequences combined with the newly probabilistically labeled sequences, which were originally unlabeled, using Equations (3), (4), and (5) (M-step) (see the Methods section for Equations (3), (4), (5), and (6)). The E- and M-steps are repeated until the model does not change from one iteration to another [12].
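The EM loop can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact procedure of Equations (3) to (6), which are not reproduced here: the function names, the additive (pseudocount) smoothing, the omission of the probability of the first k letters, and the fixed number of iterations (instead of running until the model stops changing) are all choices made for the sketch.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def train(weighted, classes, k, pseudo=1.0):
    """M-step: class-weighted counts of (k-gram, next letter) and class priors.
    `weighted` is a list of (sequence, {class: weight})."""
    counts = {c: defaultdict(float) for c in classes}
    totals = {c: defaultdict(float) for c in classes}
    prior = {c: pseudo for c in classes}
    for seq, weights in weighted:
        for c in classes:
            w = weights.get(c, 0.0)
            prior[c] += w
            for i in range(k, len(seq)):
                ctx = seq[i - k:i]
                counts[c][ctx, seq[i]] += w
                totals[c][ctx] += w
    return counts, totals, prior

def posterior(model, seq, k, pseudo=1.0):
    """E-step for one sequence: p(y | x) under the current model."""
    counts, totals, prior = model
    logp = {}
    for c in prior:
        lp = math.log(prior[c])          # unnormalized prior; normalized below
        for i in range(k, len(seq)):
            ctx = seq[i - k:i]
            num = counts[c].get((ctx, seq[i]), 0.0) + pseudo
            den = totals[c].get(ctx, 0.0) + pseudo * len(ALPHABET)
            lp += math.log(num / den)
        logp[c] = lp
    top = max(logp.values())
    z = sum(math.exp(v - top) for v in logp.values())
    return {c: math.exp(v - top) / z for c, v in logp.items()}

def em_mm(labeled, unlabeled, classes, k=3, n_iter=10):
    hard = [(x, {y: 1.0}) for x, y in labeled]
    model = train(hard, classes, k)                           # initialization from DL only
    for _ in range(n_iter):                                   # assumption: fixed iteration count
        soft = [(x, posterior(model, x, k)) for x in unlabeled]   # E-step
        model = train(hard + soft, classes, k)                    # M-step
    return model

# Toy data (hypothetical).
labeled = [("AAAAAAAA", "a"), ("CCCCCCCC", "b")]
unlabeled = ["AAAAAAAA", "CCCCCCCC"]
model = em_mm(labeled, unlabeled, ["a", "b"], k=3)
post = posterior(model, "AAAAAAAA", 3)
```

The sketch also makes the weakness discussed in the Results visible: the soft labels produced in the E-step are only as good as the initial model trained on DL, so with very few labeled sequences the M-step may reinforce unreliable predictions.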
Co-training of Markov models
Let DL be a set of labeled examples, and DU a set of unlabeled examples. A set DU′ is obtained by sampling u examples from DU (we used u = 75 examples in experiments). Each example x has two views, i.e., it can be encoded with two different sets of features, x(1) and x(2). First, use DL and the x(1) encoding to train a classifier h1, and DL and the x(2) encoding to train another classifier h2. Second, classify the examples in DU′ using h1 and h2 separately. Select h1’s and h2’s top kj most confident predictions from each class j = 1,…,|C| (with the kj chosen to reflect the underlying class distribution), add them to DL, and remove them from DU′. Sample 2 Σj kj examples from DU and move them to DU′ to replenish it. This process is repeated for a fixed number of iterations, or until all unlabeled data are used up [13], [10] (in experiments, we iterated until all unlabeled data were used). In co-training, the idea is that the two classifiers teach one another by re-training each classifier on the data enriched with predicted examples that the other classifier is most confident about.
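The co-training loop can be sketched as below. Several parts are assumptions made for illustration: the classifier is a toy letter-frequency model standing in for the Markov models used in the experiments, a prediction is accepted only when the class is also the classifier's top prediction, the pool is replenished with twice the total of the per-class quotas after each round (following the replenishment step of Blum and Mitchell's procedure [13]), and the loop stops early if no prediction is accepted. All function and variable names are hypothetical.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def fit(examples):
    """Toy stand-in classifier: per-class letter counts (not a Markov model)."""
    model = {}
    for view, y in examples:
        model.setdefault(y, Counter()).update(view)
    return model

def predict(model, view):
    """Class posteriors from add-one smoothed letter frequencies."""
    logp = {}
    for c, cnt in model.items():
        total = sum(cnt.values()) + len(ALPHABET)
        logp[c] = sum(math.log((cnt[a] + 1) / total) for a in view)
    top = max(logp.values())
    z = sum(math.exp(v - top) for v in logp.values())
    return {c: math.exp(v - top) / z for c, v in logp.items()}

def co_train(labeled, unlabeled, view1, view2, k_per_class, pool_size=75, seed=0):
    rng = random.Random(seed)
    labeled, rest = list(labeled), list(unlabeled)
    rng.shuffle(rest)
    pool, rest = rest[:pool_size], rest[pool_size:]      # the pool plays the role of DU'
    refill = 2 * sum(k_per_class.values())               # assumption, see lead-in
    while pool:
        chosen = {}                                      # pool index -> predicted label
        for view in (view1, view2):                      # one classifier per view
            model = fit([(view(x), y) for x, y in labeled])
            scored = [(predict(model, view(x)), i) for i, x in enumerate(pool)]
            for c, k_c in k_per_class.items():
                ranked = sorted(scored, key=lambda s: s[0].get(c, 0.0), reverse=True)
                for probs, i in ranked[:k_c]:
                    if max(probs, key=probs.get) == c:   # keep only top predictions
                        chosen.setdefault(i, c)
        if not chosen:                                   # guard against an endless loop
            break
        labeled += [(pool[i], c) for i, c in chosen.items()]
        pool = [x for i, x in enumerate(pool) if i not in chosen]
        pool, rest = pool + rest[:refill], rest[refill:]
    return labeled

# Toy data (hypothetical): class "a" is A-rich, class "b" is C-rich.
seed_labeled = [("A" * 20, "a"), ("C" * 20, "b")]
toy_unlabeled = ["A" * 19 + "C"] * 3 + ["C" * 19 + "A"] * 3
final = co_train(seed_labeled, toy_unlabeled,
                 lambda x: x[:60], lambda x: x[-15:],
                 {"a": 1, "b": 1}, pool_size=4)
```

In this sketch both classifiers are re-trained on the shared, growing labeled set at the start of every round, which is how each one benefits from the examples the other was most confident about.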
Semi-supervised abstraction augmented Markov models - our approach
Our abstraction-based approach to learning classifiers for the protein subcellular localization prediction task exploits large amounts of unlabeled data together with small amounts of labeled data to construct more robust abstraction hierarchies over the values of the parents of each node in a Markov model. Two values (k-grams) are clustered together if they induce similar conditional distributions of the next node, independent of the class. When the data are scarce, the estimates of joint probabilities are not reliable. However, the unlabeled data contain information about the joint probability distribution over sequence elements, and can help improve the statistical estimates of parameters. The abstraction hierarchy is subsequently used to learn a Markov model with abstract values of the parents.
It is worth mentioning that part of the AAMM is the representation of the clustering. Specifically, after the abstraction hierarchy is learned, for a given choice of the size m of an m-cut that defines an AAMM, an array of indices of size equal to the number of unique k-grams specifies the mapping between k-grams and abstractions (the space complexity is $|X|^k$, where X is the alphabet). However, the number of parameters of an AAMM (for a given class) based on such an abstraction hierarchy and an m-cut is $m|X|$, as opposed to $|X|^k|X|$ in the case of MMs, where $m \ll |X|^k$.
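To make the size comparison concrete, the following short sketch evaluates the three quantities for illustrative values. The function names and the chosen numbers (20 amino acids, 3-grams, a cut of size 100) are assumptions for the example and are not settings reported in the experiments.

```python
def mm_param_count(alphabet_size: int, k: int) -> int:
    """Transition parameters per class in a k-th order MM: |X|^k * |X|."""
    return alphabet_size ** k * alphabet_size


def aamm_param_count(alphabet_size: int, m: int) -> int:
    """Transition parameters per class in an AAMM with an m-cut: m * |X|."""
    return m * alphabet_size


def mapping_size(alphabet_size: int, k: int) -> int:
    """Entries in the k-gram -> abstraction index array: |X|^k."""
    return alphabet_size ** k


# Illustrative values only: 20 amino acids, 3-grams, a hypothetical 100-cut.
print(mm_param_count(20, 3))      # 160000
print(aamm_param_count(20, 100))  # 2000
print(mapping_size(20, 3))        # 8000
```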
While AAMMs reduce the complexity of the learned model, some information is lost due to abstraction. It is of interest to incorporate into AAMMs some means of gracefully trading off the complexity of the model against its predictive accuracy. One way to do this is to augment the algorithm, e.g., by designing an MDL-based scoring function to guide a top-down search for an optimal cut [36].
AAMMs not only significantly outperform MMs but also are simpler than MMs, and hence easier to interpret from a biological standpoint: the set of k-grams in an abstraction can be seen as a sequence profile (e.g., Position Specific Scoring Matrix).
The results of our experiments show that AAMMs can make effective use of unlabeled data and that AAMMs significantly outperform EM-MMs when the amount of labeled data is very small, and relatively large amounts of unlabeled data are readily available. Here, because of the small amounts of labeled data available, the ability of AAMMs to minimize overfitting (through parameter smoothing) turns out to be especially beneficial. In comparing semi-supervised AAMMs with the previous semi-supervised work on the protein subcellular localization prediction task, we found that AAMMs are competitive with, and in some cases outperform, co-training of MMs.
The results presented here demonstrate the effectiveness of an abstraction-based approach to exploiting unlabeled data in a semi-supervised setting on the protein subcellular localization prediction task. Such an approach can in principle be combined with existing semi-supervised learning techniques, including those that use EM, co-training, or the manifold assumption (propagation of labels from labeled to unlabeled samples based on some similarity measure between samples).
Our current implementation of AAMM constructs an abstraction hierarchy over the values of the k predecessors of a sequence element by grouping them together if they induce similar conditional distributions over that element of the sequence. It would be interesting to explore alternative approaches to building abstraction hierarchies, e.g., probabilistic suffix trees (PSTs) [37].
Methods
In this section, we briefly describe the data sets used in experiments, provide some background on Markov models for sequence classification, and then present our novel AAMM-based approach to semi-supervised learning.
Data sets
The first and second data sets used in our experiments, plant and non-plant [38], were first introduced in [4]. The plant data set contains 940 examples belonging to one of the following four classes: chloroplast (141), mitochondrial (368), secretory pathway/signal peptide (269) and other (consisting of 54 examples with label nuclear and 108 examples with label cytosolic). The non-plant data set contains 2738 examples, each in one of the following three classes: mitochondrial (371), secretory pathway/signal peptide (715) and other (consisting of 1214 examples labeled nuclear and 438 examples labeled cytosolic).
The third data set used in our experiments, PSORTdb v.2.0 [39] Gram-negative sequences, introduced in [40], contains experimentally verified localization sites. We refer to this data set as psortNeg. We use all proteins that belong to exactly one of the following five classes: cytoplasm (278), cytoplasmic membrane (309), periplasm (276), outer membrane (391) and extracellular (190). The total number of examples (proteins) in this data set is 1444.
Markov models
Markov models (MMs) are probabilistic generative models that assume a mixture model as the underlying model that generated the sequence data. Each mixture component corresponds to a class cj ∈ C = {c1,…,c|C|}. A sequence is generated according to a set of parameters, denoted by θ, that define the model.
Let x = x0…xn−1 be a sequence over a finite alphabet X, x ∈ X*, and let y denote x’s class (note that if x was generated by the jth mixture component, then y = cj). Let Xi, for i = 0,…, n − 1, denote the random variables corresponding to the sequence elements xi in x. In a kth order MM, the sequence elements satisfy the Markov property: $X_i \perp \{X_0,\dots,X_{i-k-1}\} \mid \{X_{i-k},\dots,X_{i-1}\}$. That is, Xi is conditionally independent of X0,…, Xi−k−1 given Xi−k,…, Xi−1 for i = k,…, n − 1. Xi−k,…, Xi−1 are called parents of Xi. Figure 5 shows the dependency of Xi on Xi−k,…,Xi−1 in a kth order MM. Hence,
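$$p(x_i \mid x_0 \dots x_{i-1}, c_j; \theta) = p(x_i \mid x_{i-k} \dots x_{i-1}, c_j; \theta). \tag{1}$$
The probability of x given its class cj, p(x|cj;θ), can be written as follows:
$$p(x \mid c_j; \theta) = p(x_0 \dots x_{k-1} \mid c_j; \theta) \prod_{i=k}^{n-1} p(x_i \mid x_{i-k} \dots x_{i-1}, c_j; \theta). \tag{2}$$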
Let Si−1 denote the parents Xi−k…Xi−1 of Xi. The values of Si−1 represent instantiations of Xi−k…Xi−1, which are substrings of length k (i.e., k-grams) over the alphabet X. Let S denote the set of k-grams over X, s denote a k-gram in S, and σ a symbol in X. The cardinality of S is $|X|^k$ and is denoted by N.
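The set of parameters θ of an MM is: $\theta = \{\theta_{\sigma|s,c_j} : \sigma \in X, s \in S, c_j \in C;\ \theta_{s|c_j} : s \in S, c_j \in C;\ \theta_{c_j} : c_j \in C\}$, where $\theta_{\sigma|s,c_j} = p(\sigma \mid s, c_j; \theta)$, $\theta_{s|c_j} = p(s \mid c_j; \theta)$, and $\theta_{c_j} = p(c_j; \theta)$.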
Learning Markov models
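Given a labeled training set $D_L = \{(x_l, y_l)\}_{l=1,\dots,|D_L|}$, learning a Markov model reduces to estimating the set of parameters θ from DL, using the maximum likelihood estimation [41]. The estimate $\hat{\theta}_{\sigma|s,c_j}$ of $p(\sigma \mid s, c_j; \theta)$ is obtained from DL as follows:
$$\hat{\theta}_{\sigma|s,c_j} = \frac{\sum_{l=1}^{|D_L|} \#[s\sigma, x_l]\, p(y_l = c_j \mid x_l)}{\sum_{\sigma' \in X} \sum_{l=1}^{|D_L|} \#[s\sigma', x_l]\, p(y_l = c_j \mid x_l)} \tag{3}$$
where #[sσ, xl] is the number of times the symbol σ “follows” the k-gram s in the sequence xl, and p(yl = cj|xl) ∈ {0,1} is obtained based on the sequence label.
The estimate $\hat{\theta}_{s|c_j}$ of $p(s \mid c_j; \theta)$ is obtained from DL as follows:
$$\hat{\theta}_{s|c_j} = \frac{\sum_{l=1}^{|D_L|} \#[s, x_l]\, p(y_l = c_j \mid x_l)}{\sum_{s' \in S} \sum_{l=1}^{|D_L|} \#[s', x_l]\, p(y_l = c_j \mid x_l)} \tag{4}$$
where #[s, xl] is the number of times s occurs in xl.
The class prior probabilities $p(c_j; \theta)$ are estimated as follows:
$$\hat{\theta}_{c_j} = \frac{\sum_{l=1}^{|D_L|} p(y_l = c_j \mid x_l)}{|D_L|} \tag{5}$$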
We used Laplace correction to obtain smoothed estimates.
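The estimation step can be illustrated with the following sketch, which counts k-gram occurrences and k-gram/next-symbol pairs per class and applies add-one (Laplace) smoothing. The function name `train_mm` and the returned layout are hypothetical, and applying the correction to all three groups of estimates is an assumption, since the text does not say which estimates were smoothed.

```python
from collections import defaultdict
from itertools import product


def train_mm(labeled, alphabet, k, classes):
    """Estimate k-th order MM parameters with add-one (Laplace) smoothing.

    labeled: list of (sequence, class_label) pairs, each sequence of length >= k.
    Returns (prior, start, trans):
      prior[c]       ~ p(c)
      start[c][s]    ~ p(s | c) for a k-gram s
      trans[c][s][a] ~ p(a | s, c) for a symbol a following s
    """
    kgrams = ["".join(t) for t in product(alphabet, repeat=k)]
    n_follow = {c: defaultdict(float) for c in classes}   # counts #[s sigma]
    n_kgram = {c: defaultdict(float) for c in classes}    # counts #[s]
    n_class = defaultdict(float)
    for x, y in labeled:
        n_class[y] += 1
        for i in range(len(x) - k + 1):
            s = x[i:i + k]
            n_kgram[y][s] += 1
            if i + k < len(x):                            # s has a successor
                n_follow[y][(s, x[i + k])] += 1
    prior, start, trans = {}, {}, {}
    for c in classes:
        prior[c] = (n_class[c] + 1) / (len(labeled) + len(classes))
        total_s = sum(n_kgram[c].values())
        start[c] = {s: (n_kgram[c][s] + 1) / (total_s + len(kgrams))
                    for s in kgrams}
        trans[c] = {}
        for s in kgrams:
            row = sum(n_follow[c][(s, a)] for a in alphabet)
            trans[c][s] = {a: (n_follow[c][(s, a)] + 1) / (row + len(alphabet))
                           for a in alphabet}
    return prior, start, trans


prior, start, trans = train_mm([("ABAB", "c1"), ("BBBA", "c2")], "AB", 1, ["c1", "c2"])
print(trans["c1"]["A"]["B"])  # 0.75
```

In the toy call, B follows A twice in the single c1 sequence and A never follows A, so the smoothed estimate is (2 + 1) / (2 + 2) = 0.75.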
Using Markov models for classification
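Classification of a new sequence x requires computation of the conditional probability $p(c_j \mid x; \hat{\theta})$. Applying Bayes rule:
$$p(c_j \mid x; \hat{\theta}) = \frac{p(x \mid c_j; \hat{\theta})\, p(c_j; \hat{\theta})}{\sum_{c \in C} p(x \mid c; \hat{\theta})\, p(c; \hat{\theta})} \tag{6}$$
The class with the highest posterior probability, $\arg\max_{c_j \in C} p(c_j \mid x; \hat{\theta})$, is assigned to x.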
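A sketch of this classification rule is shown below. It works in log space to avoid numerical underflow on long sequences, which is an implementation choice of the sketch and not something stated in the text; the function names and the hand-set toy parameters are likewise hypothetical.

```python
import math


def log_likelihood(x, k, start_c, trans_c):
    """log p(x | c): initial k-gram term plus one term per later symbol."""
    ll = math.log(start_c[x[:k]])
    for i in range(k, len(x)):
        ll += math.log(trans_c[x[i - k:i]][x[i]])
    return ll


def posterior(x, k, prior, start, trans):
    """Class posteriors p(c | x) via Bayes rule, normalized with log-sum-exp."""
    joint = {c: math.log(prior[c]) + log_likelihood(x, k, start[c], trans[c])
             for c in prior}
    top = max(joint.values())
    log_z = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {c: math.exp(v - log_z) for c, v in joint.items()}


def classify(x, k, prior, start, trans):
    """Return the class with the highest posterior probability."""
    post = posterior(x, k, prior, start, trans)
    return max(post, key=post.get)


# Hand-set toy parameters (first order, alphabet {A, B}); not taken from the paper.
prior = {"c1": 0.5, "c2": 0.5}
start = {"c1": {"A": 0.5, "B": 0.5}, "c2": {"A": 0.5, "B": 0.5}}
trans = {
    "c1": {"A": {"A": 0.1, "B": 0.9}, "B": {"A": 0.9, "B": 0.1}},  # alternating
    "c2": {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}},  # repeating
}
print(classify("ABABAB", 1, prior, start, trans))  # c1
print(classify("AAAABB", 1, prior, start, trans))  # c2
```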
Semi-supervised AAMM
We first provide the AAMM definitions and then describe how to learn semi-supervised AAMMs.
AAMMs
AAMMs effectively reduce the number of parameters of a kth order MM (which is exponential in k) by learning an abstraction hierarchy (AH) over the set of k-grams S.
Definition 1 (Abstraction Hierarchy)An abstraction hierarchy T over a set of k-grams S is a rooted tree such that: (1) the root of T denotes S; (2) the leaves of T correspond to singleton sets containing individual k-grams in S; (3) the children of each internal node (say a) correspond to a partition of the set of k-grams denoted by a. Thus, a denotes an abstraction or grouping of “similar” k-grams.
Note that each internal node (or abstraction a) contains the subset of k-grams at the leaves of the subtree rooted at a. Figure 6(a) shows an example of an AH T on a set S = {s1,…,s9} of 2-grams over an alphabet of size 3.
Definition 2 (m-Cut)An m-cut γm through an abstraction hierarchy T is a subset of m nodes of T such that for any leaf s ∈ S, either s ∈ γm or s is a descendant of some node in γm. The set of abstractions A at any given m-cut γm forms a partition of S.
Specifically, an m-cut γm partitions the set S of k-grams into m (m ≤ N = |S|) non-overlapping subsets A = {a1 : S1,…,am : Sm}, where ai denotes the i-th abstraction and Si denotes the subset of k-grams that are grouped together into the i-th abstraction based on some similarity measure. Note that S1 ∪…∪ Sm = S and ∀ 1 ≤ i < j ≤ m, Si ∩ Sj = ∅. In Figure 6(a), the subset of nodes {a15, a6, a14} represents a 3-cut γ3 through T.
AAMMs extend the graphical structure of MMs by introducing new variables Ai that represent abstractions over the values of Si−1, for i = k,…, n − 1 (Figure 6(b)). Each Ai takes values in the set of abstractions A = {a1,…,am} corresponding to an m-cut, γm. We model the fact that Ai is an abstraction of Si−1 by defining p(Ai = ai|Si−1 = si−1) = 1 if si−1 ∈ ai, and 0 otherwise, where si−1 ∈ S and ai ∈ A represent instantiations of variables Si−1 and Ai, respectively. Furthermore, in AAMMs, the node Xi directly depends on Ai instead of being directly dependent on Si−1, as in the standard MMs. Hence, the probability of x given its class, p(x|cj;θ), can be written as follows:
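$$p(x \mid c_j; \theta) = p(x_0 \dots x_{k-1} \mid c_j; \theta) \prod_{i=k}^{n-1} p(x_i \mid a_i, c_j; \theta). \tag{7}$$
The set of parameters θ of an AAMM is: $\theta = \{\theta_{\sigma|a,c_j} : \sigma \in X, a \in A, c_j \in C;\ \theta_{s|c_j} : s \in S, c_j \in C;\ \theta_{c_j} : c_j \in C\}$, where $\theta_{\sigma|a,c_j} = p(\sigma \mid a, c_j; \theta)$, $\theta_{s|c_j} = p(s \mid c_j; \theta)$, and $\theta_{c_j} = p(c_j; \theta)$.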
Learning semi-supervised AAMMs
In what follows we show how to learn AAMMs from both labeled and unlabeled data. This involves: learning abstraction hierarchies from both labeled and unlabeled data; and learning model parameters from labeled data using the resulting abstraction hierarchy.
Learning abstraction hierarchies
The algorithm for learning AHs over a set S of k-grams starts by initializing the set of abstractions A such that each abstraction aj ∈ A corresponds to a k-gram sj ∈ S, j = 1,…,N. The leaves of the AH T are initialized with elements of S. The algorithm recursively merges pairs of abstractions that are most “similar” to each other and terminates with an abstraction hierarchy after N − 1 steps. We store T in a last-in-first-out (LIFO) stack. For a given choice of the size m of an m-cut through T, the set of abstractions that define an AAMM can be extracted by discarding m − 1 elements from the top of the stack.
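The extraction of an m-cut from the stack can be sketched as follows. The representation is an assumption made for illustration: each stack entry is the pair of abstractions (as frozensets of k-grams) that were merged, pushed in merge order, so the top of the stack is the merge that produced the root; `m_cut` is a hypothetical name.

```python
def m_cut(merge_stack, m):
    """Return the abstractions of an m-cut as a set of frozensets of k-grams.

    merge_stack: non-empty list of (left, right) pairs of frozensets in the
    order the merges were performed; the last entry produced the root.
    Undoing the m - 1 most recent merges leaves exactly m abstractions.
    """
    stack = list(merge_stack)             # work on a copy, keep T intact
    left, right = stack[-1]
    cut = {left | right}                  # 1-cut: the root, i.e. all of S
    for _ in range(m - 1):
        left, right = stack.pop()         # most recent remaining merge
        cut.remove(left | right)          # replace the merged node ...
        cut.update((left, right))         # ... by its two children
    return cut


# Toy hierarchy over four hypothetical k-grams s1..s4.
leaves = [frozenset([g]) for g in ("s1", "s2", "s3", "s4")]
toy_stack = [
    (leaves[0], leaves[1]),
    (leaves[2], leaves[3]),
    (leaves[0] | leaves[1], leaves[2] | leaves[3]),
]
print(m_cut(toy_stack, 3))  # the abstractions {s1, s2}, {s3} and {s4}
```

Because merges are undone in the reverse of the order in which they were made, the node being split is always present in the current cut.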
We consider two k-grams to be “similar” if they occur within similar contexts. In our case, we define the context of a k-gram s ∈ S to be the conditional probability distribution of the next letter in the sequence given the k-gram, p(Xi|s), independent of the class variable. Hence, this can be estimated from both labeled sequences DL and unlabeled sequences DU as follows:
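$$\hat{p}(\sigma \mid s) = \frac{\sum_{l=1}^{|D_L|} \#[s\sigma, x_l] + \sum_{u=1}^{|D_U|} \#[s\sigma, x_u]}{\sum_{\sigma' \in X} \left( \sum_{l=1}^{|D_L|} \#[s\sigma', x_l] + \sum_{u=1}^{|D_U|} \#[s\sigma', x_u] \right)}, \quad \sigma \in X, \tag{8}$$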
where #[sσ, xl] and #[sσ, xu] represent the number of times the symbol σ “follows” the k-gram s in the sequence xl, and xu, respectively.
Since an abstraction is a set of k-grams, the context of an abstraction a = {s1,…, s|a|} is obtained by a weighted aggregation of the contexts of its k-grams. That is,
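$$\hat{p}(X_i \mid a) = \sum_{t=1}^{|a|} \frac{\#s_t}{\sum_{t'=1}^{|a|} \#s_{t'}}\, \hat{p}(X_i \mid s_t), \tag{9}$$
where $\#s_t = \sum_{l=1}^{|D_L|} \#[s_t, x_l] + \sum_{u=1}^{|D_U|} \#[s_t, x_u]$.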
We identify the most “similar” abstractions as those that have the smallest weighted Jensen-Shannon divergence between their contexts. JS divergence [42] provides a natural way to compute the distance between two probability distributions that represent contexts of two abstractions. Specifically, we define the distance between two abstractions a′ and a″ in D as follows:
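$$d_D(a', a'') = \left(\hat{p}(a') + \hat{p}(a'')\right) JS_{\pi_1,\pi_2}\!\left(\hat{p}(X_i \mid a'),\, \hat{p}(X_i \mid a'')\right),$$
where $\pi_1 = \dfrac{\hat{p}(a')}{\hat{p}(a') + \hat{p}(a'')}$ and $\pi_2 = \dfrac{\hat{p}(a'')}{\hat{p}(a') + \hat{p}(a'')}$.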
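The hierarchy-learning procedure can be sketched as a greedy agglomerative clustering over pooled, class-independent counts. Several points are assumptions of the sketch rather than details confirmed by the text: the names `contexts`, `distance`, and `learn_hierarchy` are hypothetical; the distance is taken to be the weighted Jensen-Shannon divergence scaled by the total count of the two abstractions; raw counts are used in place of probabilities, which only rescales all distances by a common constant; no smoothing is applied; and k-grams that are never followed by a symbol are left out. The exhaustive pairwise search is meant for small examples only.

```python
import math
from collections import Counter
from itertools import combinations


def contexts(sequences, k):
    """Counts of next symbols per k-gram, pooled over all sequences (no labels)."""
    counts = {}
    for x in sequences:
        for i in range(len(x) - k):
            counts.setdefault(x[i:i + k], Counter())[x[i + k]] += 1
    return counts


def entropy(counter):
    """Shannon entropy (bits) of the distribution given by normalized counts."""
    n = sum(counter.values())
    return -sum(c / n * math.log2(c / n) for c in counter.values() if c)


def distance(ca, cb):
    """Total count times the count-weighted JS divergence of two contexts."""
    na, nb = sum(ca.values()), sum(cb.values())
    return (na + nb) * entropy(ca + cb) - na * entropy(ca) - nb * entropy(cb)


def learn_hierarchy(sequences, k):
    """Merge the closest pair of abstractions until one remains.

    Returns the merges in the order performed (a stack whose top is the root).
    """
    clusters = {frozenset([s]): c for s, c in contexts(sequences, k).items()}
    stack = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        # Summed counts give the count-weighted average of the two contexts.
        clusters[a | b] = clusters.pop(a) + clusters.pop(b)
        stack.append((a, b))
    return stack


# Labeled and unlabeled sequences are simply pooled; A and C share the context "B".
toy_stack = learn_hierarchy(["ABABAB", "CBCBCB"], 1)
print(toy_stack[0])
```

In the toy call, the 1-grams A and C are always followed by B, so their contexts coincide and they are merged first.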
Learning AAMM parameters
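Given a labeled training set $D_L = \{(x_l, y_l)\}_{l=1,\dots,|D_L|}$, learning an AAMM reduces to estimating the set of parameters θ from DL, denoted by $\hat{\theta}$. This can be done as follows: use Equation (3) to obtain the estimates $\hat{\theta}_{\sigma|s,c_j}$ of $p(\sigma \mid s, c_j; \theta)$ for any k-gram s ∈ S (note that these estimates correspond to the estimates $\hat{\theta}_{\sigma|a,c_j}$ when a = {s}, i.e., the leaf level in the AH T). The estimates $\hat{\theta}_{\sigma|a,c_j}$ of $p(\sigma \mid a, c_j; \theta)$, when a = {s1,…,s|a|}, are a weighted aggregation of the estimates of a’s constituent k-grams, i.e.,
$$\hat{\theta}_{\sigma|a,c_j} = \sum_{t=1}^{|a|} \frac{\#[s_t, c_j]}{\sum_{t'=1}^{|a|} \#[s_{t'}, c_j]}\, \hat{\theta}_{\sigma|s_t,c_j}, \tag{10}$$
where $\#[s_t, c_j] = \sum_{l=1}^{|D_L|} \#[s_t, x_l]\, p(y_l = c_j \mid x_l)$. Use Equations (4) and (5) to obtain the estimates $\hat{\theta}_{s|c_j}$ of $p(s \mid c_j; \theta)$ and $\hat{\theta}_{c_j}$ of $p(c_j; \theta)$, respectively.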
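The weighted aggregation for a single class can be sketched as follows. The function name, the argument layout, and the fallback to equal weights when none of an abstraction's k-grams was observed in the class are assumptions made for this illustration.

```python
def aamm_transitions(theta, kgram_weight, cut):
    """Aggregate leaf-level estimates into abstraction-level ones for one class.

    theta[s][sigma] : leaf-level estimate of p(sigma | s, c)
    kgram_weight[s] : class-conditional count of the k-gram s
    cut             : iterable of abstractions (sets of k-grams)
    Returns {abstraction: {sigma: estimate of p(sigma | a, c)}}.
    """
    symbols = list(next(iter(theta.values())))
    out = {}
    for a in cut:
        total = sum(kgram_weight[s] for s in a)
        # Assumed fallback: equal weights if no k-gram of a was seen in the class.
        weight = {s: (kgram_weight[s] / total if total else 1.0 / len(a))
                  for s in a}
        out[frozenset(a)] = {sigma: sum(weight[s] * theta[s][sigma] for s in a)
                             for sigma in symbols}
    return out


# Toy leaf-level estimates for two hypothetical 2-grams of one class.
toy_theta = {"AA": {"A": 0.8, "B": 0.2}, "AB": {"A": 0.4, "B": 0.6}}
toy_weight = {"AA": 3, "AB": 1}
agg = aamm_transitions(toy_theta, toy_weight, [{"AA", "AB"}])
```

With weights 3/4 and 1/4, the aggregated estimate for the symbol A is 0.75 · 0.8 + 0.25 · 0.4 = 0.7, and the aggregated row still sums to one because it is a convex combination of the leaf-level rows.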
Using AAMMs for classification
Given a new sequence x = x0,…,xn−1 and an m-cut γm through T, $p(x \mid c_j; \hat{\theta})$ can be computed as follows: initialize $p(x \mid c_j; \hat{\theta})$ by $\hat{\theta}_{x_0 \dots x_{k-1}|c_j}$; parse the sequence from left to right. For each k-gram xi−k…xi−1 find the abstraction aw ∈ γm it belongs to and retrieve the parameters associated with aw. Successively multiply by $\hat{\theta}_{x_i|a_w,c_j}$ for i = k,…, n − 1 to obtain $p(x \mid c_j; \hat{\theta})$.
As in MMs, apply Bayes rule to obtain $p(c_j \mid x; \hat{\theta})$ and assign the class with the highest posterior probability to x.
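The left-to-right pass can be sketched as below, with the k-gram to abstraction mapping stored as a dictionary. The names, the model layout, and the toy parameters are hypothetical; the sketch accumulates log probabilities instead of multiplying, and it compares the joint scores log p(cj) + log p(x|cj), which have the same maximizer as the posterior because p(x) does not depend on the class.

```python
import math


def aamm_log_joint(x, k, prior_c, start_c, trans_c, abstraction_of):
    """log p(c) + log p(x | c) under an AAMM.

    start_c[s]        : p(s | c) for the initial k-gram
    trans_c[a][sigma] : p(sigma | a, c) for abstraction id a
    abstraction_of[s] : id of the abstraction in the m-cut that contains s
    """
    total = math.log(prior_c) + math.log(start_c[x[:k]])
    for i in range(k, len(x)):
        a = abstraction_of[x[i - k:i]]        # look up the abstraction a_w
        total += math.log(trans_c[a][x[i]])
    return total


def aamm_classify(x, k, model, abstraction_of):
    """Return the class with the highest joint (hence posterior) score."""
    scores = {c: aamm_log_joint(x, k, p["prior"], p["start"], p["trans"],
                                abstraction_of)
              for c, p in model.items()}
    return max(scores, key=scores.get)


# Toy 2-cut over 1-grams: A and C share abstraction 0, B is abstraction 1.
toy_abstraction_of = {"A": 0, "C": 0, "B": 1}
uniform = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
toy_model = {
    "c1": {"prior": 0.5, "start": uniform,
           "trans": {0: {"A": 0.1, "B": 0.8, "C": 0.1},
                     1: {"A": 0.45, "B": 0.1, "C": 0.45}}},
    "c2": {"prior": 0.5, "start": uniform,
           "trans": {0: {"A": 0.45, "B": 0.1, "C": 0.45},
                     1: {"A": 0.1, "B": 0.8, "C": 0.1}}},
}
print(aamm_classify("ABCBAB", 1, toy_model, toy_abstraction_of))  # c1
print(aamm_classify("AACBBB", 1, toy_model, toy_abstraction_of))  # c2
```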
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors read and approved the final manuscript.
Contributor Information
Cornelia Caragea, Email: cornelia@cs.iastate.edu.
Doina Caragea, Email: dcaragea@ksu.edu.
Adrian Silvescu, Email: silvescu@cs.iastate.edu.
Vasant Honavar, Email: honavar@cs.iastate.edu.
Acknowledgements
This research was funded in part by an NSF grant IIS 0711356 to Vasant Honavar and Doina Caragea.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.
References
- Alberts B, Bray D, et al., editors. Molecular Biology of the Cell. New York and London: Garland Publishing; 1994.
- Baldi P, Brunak S. Bioinformatics: the Machine Learning Approach. MIT Press; 2001.
- Park K, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003;19(13):1656–1663. doi: 10.1093/bioinformatics/btg222.
- Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000;300:1005–1016. doi: 10.1006/jmbi.2000.3903.
- Höglund A, Donnes P, Blum T, Adolph HW, Kohlbacher O. MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs, and amino acid composition. Bioinformatics. 2006;22(10):1158–1165. doi: 10.1093/bioinformatics/btl002.
- Ong CS, Zien A. Proceedings of the 8th Workshop on Algorithms in Bioinformatics (WABI). Lecture Notes in Bioinformatics. Springer; 2008. An Automated Combination of Kernels for Predicting Protein Subcellular Localization. pp. 186–197.
- Scott MS, Calafell SJ, Thomas DY, Hallett MT. Refining Protein Subcellular Localization. PLoS Comput Biol. 2005;1(6):e66. doi: 10.1371/journal.pcbi.0010066.
- Yuan Y. Prediction of Protein Subcellular Locations using Markov Chain Models. FEBS Letters. 1999;451:23–26. doi: 10.1016/S0014-5793(99)00506-2.
- Ansorge W. Next-generation DNA sequencing techniques. New Biotechnology. 2009;25(4):195–203. doi: 10.1016/j.nbt.2008.12.009.
- Zhu X, Goldberg A. Introduction to Semi-Supervised Learning. Morgan & Claypool; 2009.
- Chapelle O, Schölkopf B, Zien A, editors. Semi-Supervised Learning. MIT Press; 2006.
- Nigam K, McCallum AK, Thrun S, Mitchell T. Text Classification from Labeled and Unlabeled Documents using EM. In Machine Learning. 1999. pp. 103–134.
- Blum A, Mitchell T. Proc. of COLT’98. New York, NY, USA: ACM; 1998. Combining labeled and unlabeled data with co-training. pp. 92–100.
- Joachims T. Transductive Inference for Text Classification using Support Vector Machines. In Proc. of the ICML’99. 1999. pp. 200–209.
- Niu ZY, Ji DH, Tan CL. Word sense disambiguation using label propagation based semi-supervised learning. In Proc. of the ACL. 2005.
- Goldberg A, Zhu X. Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization. In HLT-NAACL 2006 Workshop on Textgraphs. 2006.
- Qi Y, Kuksa P, Collobert R, Sadamasa K, Kavukcuoglu K, Weston J. Proc. of ICDM. Washington, DC, USA; 2009. Semi-Supervised Sequence Labeling with Self-Learned Features. pp. 428–437.
- Camps-Valls G, Bandos Marsheva TV, Zhou D. Semi-supervised graph-based hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing. 2007;45:3044–3054. doi: 10.1109/TGRS.2007.895416.
- Käll L, Canterbury J, Weston J, Noble W, MacCoss M. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods. 2007;4(11):923–925. doi: 10.1038/nmeth1113.
- Lafferty J, Zhu X, Liu Y. Kernel conditional random fields: Representation and clique selection. In The 21st ICML. 2004.
- Kuksa P, Huang PH, Pavlovic V. Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics. 2009;10(Suppl 4):S2. doi: 10.1186/1471-2105-10-S4-S2.
- Xu Q, Hu DH, Xue H, Yu W, Yang Q. Semi-supervised protein subcellular localization. BMC Bioinformatics. 2009;10(Suppl 1):S47. doi: 10.1186/1471-2105-10-S1-S47.
- Li M, Zhou ZH. Improve Computer-Aided Diagnosis with Machine Learning Techniques Using Undiagnosed Samples. 2007.
- Breiman L. Random Forests. Machine Learning. 2001;45:5–32. doi: 10.1023/A:1010933404324.
- Caragea C, Silvescu A, Caragea D, Honavar V. Abstraction Augmented Markov Models. NIPS Workshop on “Machine Learning in Comp. Biol.”. 2009.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
- Dietterich TG. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation. 1998;10:1895–1923. doi: 10.1162/089976698300017197.
- Weston J, Leslie CS, Zhou D, Elisseeff A, Noble WS. Semi-supervised Protein Classification Using Cluster Kernels. In NIPS. 2004.
- Bengio Y, Delalleau O, Le Roux N. In: Semi-Supervised Learning. Chapelle O, Schölkopf B, Zien A, editors. MIT Press; 2006. Label Propagation and Quadratic Criterion. pp. 193–217.
- Jebara T, Wang J, Chang SF. ICML ’09: Proc. of the 26th Annual ICML. ACM; 2009. Graph construction and b-matching for semi-supervised learning. pp. 441–448.
- Lawrence ND, Jordan MI. In: NIPS-17. Saul L, Weiss Y, Bottou L, editors. 2005. Semi-supervised learning via Gaussian processes.
- Szummer M, Jaakkola T. Information regularization with partially labeled data. In Advances in Neural Information Processing Systems 15. 2002.
- Grandvalet Y, Bengio Y. Advances in Neural Information Processing Systems 17. MIT Press; 2005. Semi-supervised Learning by Entropy Minimization. pp. 529–536.
- Jaakkola T, Meila M, Jebara T. Maximum entropy discrimination. In Neural Information Processing Systems. 1999;12.
- Belkin M, Niyogi P, Sindhwani V. Manifold Regularization: a Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research. 2006;7:2399–2434.
- Zhang J, Kang DK, Silvescu A, Honavar V. Learning Accurate and Concise Naive Bayes Classifiers from Attribute Value Taxonomies and Data. Knowledge and Information Systems. 2006;9(2):157–179. doi: 10.1007/s10115-005-0211-z.
- Ron D, Singer Y, Tishby N. The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. In Machine Learning. 1996. pp. 117–149.
- TargetP. http://www.cbs.dtu.dk/services/TargetP/datasets/datasets.php
- PSORTdb v.2.0. http://www.psort.org/dataset/datasetv2.html
- Gardy JL, et al. PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. NAR. 2003;31(13):3613–17. doi: 10.1093/nar/gkg602.
- Casella G, Berger RL. Statistical Inference. Duxbury; 2002.
- Lin J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory. 1991;37:145–151. doi: 10.1109/18.61115.