Abstract
Motivation
Many biomedical projects would benefit from reducing the time and expense of in vitro experimentation by using computer models for in silico predictions. These models may help determine which expensive biological data are most useful to acquire next. Active Learning techniques for choosing the most informative data enable biologists and computer scientists to optimize experimental data choices for rapid discovery of biological function. To explore design choices that affect this desirable behavior, five novel and five existing Active Learning techniques, together with three control methods, were tested on 57 previously unknown p53 cancer rescue mutants for their ability to build classifiers that predict protein function. The best of these techniques, Maximum Curiosity, improved the baseline accuracy of 56% to 77%. This paper shows that Active Learning is a useful tool for biomedical research, and provides a case study of interest to others facing similar discovery challenges.
1 Introduction and Background
Ideally, an accurate classifier would be built in the shortest time possible using the least amount of expensive biological data. To achieve this goal, strategies are needed that select and assay only the most informative data points first. Active Learning methods iteratively determine the most informative new data points. p53 cancer rescue mutants present an ideal test case for Active Learning while also supporting useful cancer research.
Over 6 million people worldwide die of cancer each year (Parkin et al., 2002). The central tumor suppressor protein p53 is an important part of cancer prevention mechanisms in healthy human cells. The p53 protein induces cell growth arrest or apoptosis (programmed cell death) in response to cellular stresses (Vogelstein et al., 2000; Prives and Hall, 1999). Close to half of all human cancers contain inactivating p53 mutations. Despite progress, the cure rate of cancers remains around 60% (http://www.cancer.org/). Resistance of human cancers to standard treatments correlates with mutations of p53 (Soussi and Beroud, 2001; Seemann et al., 2004).
Three quarters of p53 mutations result in full-length protein with a single amino acid change. Several hundred clinically important amino acid changes affect p53 (Olivier et al., 2002; Hamroun et al., 2006; Bullock and Fersht, 2001; Sigal and Rotter, 2000).
These full-length p53 cancer mutants provide an exciting opportunity to specifically target cancers. Restoring normal function to mutant p53 would trigger apoptosis in affected cells, thus shrinking or killing the tumor. One strategy is to seek small molecule drugs that stabilize mutant p53 in a native-like conformation (Bullock and Fersht, 2001; Brachmann, 2004; Wang and Rastinejad, 2003; Bykov et al., 2003). While this strategy remains to be realized, some p53 cancer mutants are rescued in vitro by intragenic second-site cancer suppressor mutations (Baroni et al. 2004). In these mutants a second p53 mutation restores active wild-type p53 function.
Studies of p53 second-site suppressor mutations indicate that a large percentage of p53 cancer mutants can be rescued (Brachmann et al., 1998; Baroni et al., 2004; Danziger et al., 2006). In particular, changes of amino acids 235, 239 and 240, alone or combined, result in the rescue of 16 out of 30 of the most common p53 cancer mutants tested (Baroni et al., 2004). Thus, intragenic second-site suppressor mutations identify p53 cancer mutants that are likely to be amenable to functional rescue; uncover regions of the p53 core domain that, upon alteration, lead to functional rescue; and, combined with structural and other experimental studies, help to elucidate the basic mechanisms of p53 functional rescue.
Unfortunately, in vitro testing of all possible mutation combinations to determine their cancer rescue effects is infeasible due to time and expense. Therefore, it would be very desirable to have a computer model to run in silico experiments on virtual mutants. Such a model could narrow down the list of likely cancer rescue mutants to a number that reasonably could be assayed in the laboratory. To reach the desired predictive accuracy, such a classifier would need a larger training set of known mutants than was provided by the initial experimental screens (Baroni et al., 2004; Danziger et al., 2006). Which expensive data points should be acquired next in order to rapidly discover biological function?
To this aim, this paper explores different Active Learning methods and addresses the following questions:
How well do different Active Learning methods guide the exploration of p53 cancer rescue mutants?
Is one Active Learning method better than others?
1.1 Active Learning
Active Learning refers to iterative machine learning techniques for building a classifier by choosing the most informative examples from a space of unlabeled examples (Saar-Tsechansky and Provost, 2001; Cohn et al., 1996; Roy and McCallum, 2001; Jones et al., 2003). This strategy constructs a classifier using few examples, and is useful in biological problems where data is expensive. In the case of p53 cancer mutants, Active Learning is intended to select mutants that both quickly improve the classifier and speed the search for previously unknown cancer rescue mutations.
To illustrate the process, we give an example of a Type I Active Learning method below. Suppose we have a training set of 204 mutants with known activities, and we ask whether the laboratory should next assay the activity of mutant R280T+N239Y or of R282W+N239Y. Each mutant is added to the training set twice, once assumed active and once assumed inactive, resulting in four new sets:
t1: original set plus R280T+N239Y as active;
t2: original set plus R280T+N239Y as inactive;
t3: original set plus R282W+N239Y as active;
t4: original set plus R282W+N239Y as inactive.
A new classifier is built from each new training set and evaluated based on the true positive (tp), false positive (fp), false negative (fn), and true negative (tn) receiver operator characteristic (ROC) statistics. For stringency we use overlap exclusion cross-validation (OECV, Danziger et al. 2006), which excludes from the training set all mutants that share more than one mutation with the mutant being tested (i.e., no cancer/rescue mutation pair is shared by tested and training mutants). Finally, the mutant that yielded the best classifier, under either assumed activity, would be chosen.
At each step, typically a small number of mutants are selected, assayed, and added to the training set. The new training set begins the next step of in silico prediction and in vitro experimentation.
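As an illustration only, the loop just described can be written as a short Python sketch; the `select`, `assay`, and `evaluate` callables are hypothetical placeholders standing in for the method-specific mutant scoring (Sections 2.1.1-2.1.10), the in vitro yeast assay, and the cross-validated evaluation, and do not represent the code actually used in this study.

```python
# Schematic of the iterative Active Learning loop described above.
# select, assay and evaluate are hypothetical callables supplied by the user:
# select(known, unknown, k) -> list of k mutants judged most informative
# assay(mutant)            -> "active" or "inactive" (the expensive wet-lab step)
# evaluate(known)          -> a cross-validated quality score of the training set

def active_learning_loop(known, unknown, select, assay, evaluate,
                         per_iteration=3, iterations=10):
    """known: list of (mutant, activity) pairs; unknown: list of mutants."""
    for i in range(iterations):
        if not unknown:
            break
        chosen = select(known, unknown, per_iteration)   # e.g. Maximum Curiosity
        for mutant in chosen:
            activity = assay(mutant)                     # in vitro yeast assay
            known.append((mutant, activity))
            unknown.remove(mutant)
        print(f"iteration {i + 1}: score = {evaluate(known):.3f}, "
              f"{len(unknown)} mutants still unclassified")
    return known
```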
The Active Learning methods tested here fall into four categories, called here Type I, Type II, Type III, and Type IV. Types I and II are novel methods that extend and adapt the original method of Danziger et al. (2006). Types III and IV provide a comparison to existing methods adapted from the machine learning literature.
1.1.1 Type I
Type I methods consider each unclassified example, assumed first as active and then as inactive. Type I methods select the example that yielded the most improved cross-validated classifier accuracy under either assumed mutant activity. The Type I methods here are called Maximum Curiosity, Composite Classifier, and Improved Composite Classifier (Sections 2.1.1, 2.1.2 and 2.1.3).
A score function, Score(t), evaluates each candidate training set t; the methods differ in their choice of score function. The mutant that gives the highest Score(t) is chosen. In the example, if Score(t2) were the highest, then Type I methods would choose R280T+N239Y.
1.1.2 Type II
Type II methods are identical to Type I until the final step. Type II methods choose the mutant with the highest sum of scores across active and inactive. The Type II methods here are Additive Curiosity and Additive Bayesian Surprise (Sections 2.1.4 and 2.1.5). In the example, if Score(t1)+Score(t2) < Score(t3)+Score(t4), then Type II methods would choose R282W+N239Y.
1.1.3 Type III
Type III methods choose the unclassified examples that lie closest to the decision boundary. The Type III methods are called Minimum Marginal Hyperplane and Maximum Entropy (Sections 2.1.6 and 2.1.7). These methods are described further in Jing et al. (2005); Liu (2004); and Park (2004).
1.1.4 Type IV
Type IV methods are similar to Type III methods, but choose examples furthest from the decision boundary rather than closest. Examples furthest from the decision boundary should be those that the classifier is most likely to predict correctly. The Type IV methods here are Maximum Marginal Hyperplane, Minimum Entropy, and Entropic Tradeoff (Sections 2.1.8, 2.1.9, and 2.1.10).
1.2 Motivation for in silico p53 mutation evaluation
PCR mutagenesis followed by p53 functional assays in yeast (Baroni et al., 2004) provides several experimental advantages, such as the ability to combine PCR mutagenesis with repair of a gapped plasmid directly in yeast, and the immediate observation of phenotypes. This experimental strategy rapidly provides an initial training set of positive and negative examples.
For the broader cancer rescue mutant discovery task, however, certain inherent limitations of PCR mutagenesis cannot be overcome. These include the limited number of amino acid changes available per codon, which means that some amino acid changes are essentially inaccessible, and the limited number of coordinated simultaneous mutations, which means that many mutant combinations are unlikely to be seen. Therefore, we began to apply computational strategies to the problem of discovering novel intragenic second-site suppressor mutations, with the long-term goal of a complete functional census of p53 cancer rescue mutations.
1.3 Relation to previous p53 classifier work
Danziger et al., (2006), developed a structure-based p53 classifier that was used to predict a previously unknown set of putative p53 cancer rescue mutants. The principal technical challenge was to extract structure-based features from atomic level molecular models in a way that was useful to feature vector based learning methods. Briefly, 1D, 2D, 3D, and 4D features were extracted, filtered, and concatenated. 1D features came from the mutation type and location in the p53 core domain. 2D features came from steric and electrostatic properties measured at points on a cartographic projection of the molecular surface. 3D features came from spatial displacements of mutant residues relative to wild-type. 4D features came from the time course of protein unfolding in a simulated heat bath, plus other computational estimates of protein thermostability.
The key finding was that classifiers built from features extracted from atomic level molecular models out-performed classifiers built from features extracted from string-based representations of the same mutant amino acid changes, when compared head-to-head on the same mutant data set. Structure is closer to protein function than is sequence, so that result was satisfying but not unexpected.
Danziger et al., (2006) provided a proof in principle that such a structure-based classifier could guide biological discovery, by exhibiting one Active Learning method that out-performed random selection. Several variants and extensions had theoretically desirable properties. Existing machine learning methods were attractive as well. Which should be used, in practice, to guide the upcoming expensive and time-consuming biological experimentation?
2 Methods
To explain the Active Learning methods used in this paper it is helpful to introduce some notation. T is the total set of all p53 mutants under consideration. During each active learning iteration, i, T is broken up into three groups: (1) TK,i, p53 mutants with known activities; (2) TU,i, p53 mutants with unknown activities; and (3) TC,i, p53 mutants from TU,i chosen to be assayed.
For the first experiment, the initial training set TK,1 contained 204 p53 mutants of known activity, and the initial test set TU,1 contained 57 putative cancer rescue mutants of unknown activity. Subsequently, after all mutants were predicted and assayed, the total mutant set T was divided into different initial training and test sets to explore how Active Learning methods behaved under different conditions.
2.1 Active Learning Implementation
At the beginning of each iteration, three mutants are chosen as TC,i by the Active Learning methods described in this section.
2.1.1. Type I: Maximum Curiosity
Maximum Curiosity scores each potential new training set, t, by its cross-validated correlation coefficient (r), calculated as in (1).
$$r = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}} \tag{1}$$
The r for each potential new training set is used to determine which unclassified mutants resulted in the largest increase of r.
Maximum Curiosity optimistically assumes that the highest r for each mutant, m, occurs when that mutant is correctly paired with its true activity, as per (2).
$$\mathrm{Score}(m) = \max_{a \in \{\mathrm{active},\, \mathrm{inactive}\}} r\big(T_{K,i} \cup \{m(a)\}\big) \tag{2}$$
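A small Python sketch of this scoring rule is given below; it assumes that the correlation coefficient in (1) is the standard Matthews correlation computed from the confusion matrix, and the `cross_validate` callable is a hypothetical stand-in for overlap exclusion cross-validation returning (tp, fp, fn, tn).

```python
from math import sqrt

def correlation_coefficient(tp, fp, fn, tn):
    """Correlation coefficient of a confusion matrix; the standard Matthews
    form is assumed here for equation (1)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def maximum_curiosity_score(mutant, known, cross_validate):
    """Score a candidate mutant as the best cross-validated r obtained when it
    is added to the training set under either assumed activity (equation (2)).
    cross_validate is a user-supplied callable returning (tp, fp, fn, tn)."""
    scores = []
    for assumed_activity in ("active", "inactive"):
        candidate_set = known + [(mutant, assumed_activity)]
        scores.append(correlation_coefficient(*cross_validate(candidate_set)))
    return max(scores)
```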
2.1.2. Type I: Composite Classifier
Composite Classifier scores identically to Maximum Curiosity, but the classifier is constructed using the methods discussed in Danziger et al. (2006). This was done mainly to provide a direct comparison to our previously published results. Briefly, the Composite Classifier breaks the 1D, 2D, 3D and 4D attributes up into four separate component sets; 3,000 attributes are selected by Mutual Information (MI, Section 2.3) from the 2D set, and the four sets are combined into all 15 possible non-empty combinations. Each combination is used to build a support vector machine component classifier, which is given a vote weighted by its cross-validated accuracy in a composite naïve Bayes classifier.
2.1.3. Type I: Improved Composite Classifier
The Improved Composite Classifier differs from the original Composite Classifier in how it selects attributes from the 9778 in the 2D component classifier. In this method, parts of the surface not in a promiscuous binding domain (defined in Friedler et al., 2005) are compressed resulting in 4,826 attributes. 400 features are then selected from those 4,826 attributes using Mutual Information.
2.1.4. Type II: Additive Curiosity
Additive Curiosity scores much like Maximum Curiosity, except that in the final step curiosity is calculated by adding the scores of the two candidate training sets, as in (3). In this way the mutant chosen may be the one most beneficial to the classifier regardless of its revealed activity.
$$\mathrm{Score}(m) = \sum_{a \in \{\mathrm{active},\, \mathrm{inactive}\}} r\big(T_{K,i} \cup \{m(a)\}\big) \tag{3}$$
2.1.5. Type II: Additive Bayesian Surprise
Additive Bayesian Surprise (Itti and Baldi, 2006) calculates the scores by summing the Kullback-Leibler (KL) distance between the a priori probability and the a posteriori probability (4, 5) across assumed active and inactive mutants.
$$KL(P_{\mathrm{prior}} \,\|\, P_{\mathrm{post}}) = \sum_{x} P_{\mathrm{prior}}(x) \log \frac{P_{\mathrm{prior}}(x)}{P_{\mathrm{post}}(x)} \tag{4}$$
In this implementation, the prior probability is the cross-validated accuracy of the training set (TK,i) and the posterior is that of the training set with the unclassified mutant (TK,i + m(activity)).
$$\mathrm{Score}(m) = \sum_{a \in \{\mathrm{active},\, \mathrm{inactive}\}} KL\big(P(T_{K,i}) \,\|\, P(T_{K,i} \cup \{m(a)\})\big) \tag{5}$$
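One plausible way to compute this score, treating the prior and posterior cross-validated accuracies as Bernoulli parameters (an interpretive assumption, not a detail stated in the original description), is sketched below; `cv_accuracy` is a hypothetical cross-validation callable.

```python
from math import log

def bernoulli_kl(p, q, eps=1e-9):
    """KL divergence between two Bernoulli distributions with parameters p and q."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def additive_bayesian_surprise(mutant, known, cv_accuracy):
    """Sum of KL distances between the prior accuracy (training set alone) and
    the posterior accuracy (training set plus the mutant assumed active or
    inactive). cv_accuracy is a user-supplied cross-validation callable."""
    prior = cv_accuracy(known)
    surprise = 0.0
    for assumed_activity in ("active", "inactive"):
        posterior = cv_accuracy(known + [(mutant, assumed_activity)])
        surprise += bernoulli_kl(prior, posterior)
    return surprise
```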
2.1.6. Type III: Minimum Marginal Hyperplane
Minimum Marginal Hyperplane scores training sets based on how far new unclassified mutants are from the boundary (support vector machine hyperplane) separating active from inactive mutants. In its simplest, linear form, a support vector machine creates a hyperplane and stores it as an associated normal vector (w). New mutants, described by an attribute vector (D), are evaluated by (6).
$$\mu = \vec{w} \cdot \vec{D} \tag{6}$$
If μ > b, where b is some threshold, then the new mutant is assigned one class, otherwise it is assigned the other. The margin for a new example is the difference between μ and b. It may be helpful to think of the margin as the distance from the new example to the hyperplane. The Minimum Marginal Hyperplane algorithm assumes that the unclassified mutants closest to the dividing hyperplane will be the most informative to the classifier once the true class is known. See Platt (1998) for more details on support vector machines.
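A brief sketch of this selection rule follows, using scikit-learn only for illustration; it is not the toolkit used in this study, and the helper is a hypothetical name.

```python
import numpy as np
from sklearn.svm import SVC

def minimum_marginal_hyperplane(X_known, y_known, X_unknown, k=3):
    """Return indices of the k unclassified examples closest to a linear SVM
    decision boundary (smallest absolute margin); a sketch of the selection
    rule described above, not the exact implementation used here."""
    clf = SVC(kernel="linear").fit(X_known, y_known)
    margin = np.abs(clf.decision_function(X_unknown))  # ~ |w . D - b|, up to scaling
    return np.argsort(margin)[:k]
```

Maximum Marginal Hyperplane (Section 2.1.8) is the same ranking read from the other end, i.e. the largest margins rather than the smallest.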
2.1.7. Type III: Maximum Entropy
Maximum Entropy scores each training set by using the information theory concept of entropy (H). Entropy is calculated from the probability of class membership for each unclassified mutant, estimated by a support vector machine with logistic regression fitted to its outputs (Witten and Frank, 2005).
Formally, given the attributes for an unclassified mutant (D) and the model (M) constructed from the training set, H is calculated as shown in (7) (Jing et al., 2005).
$$H(D) = -\sum_{c} P(c \mid D, M) \log_2 P(c \mid D, M) \tag{7}$$
By choosing mutants with the highest H, Maximum Entropy assumes that the most informative mutants are those that the classifier is most uncertain about.
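The sketch below illustrates this entropy-based ranking, using Platt-scaled SVM class probabilities from scikit-learn as a stand-in for the SVM and logistic-regression coupling described above; it is an approximation for illustration, not the exact implementation.

```python
import numpy as np
from sklearn.svm import SVC

def maximum_entropy_selection(X_known, y_known, X_unknown, k=3):
    """Return indices of the k unclassified examples with the highest
    predictive entropy H of equation (7)."""
    clf = SVC(kernel="linear", probability=True).fit(X_known, y_known)
    p = clf.predict_proba(X_unknown)
    H = -np.sum(p * np.log2(np.clip(p, 1e-12, 1.0)), axis=1)  # entropy per example
    return np.argsort(H)[::-1][:k]                            # most uncertain first
```

Minimum Entropy (Section 2.1.9) simply takes the opposite end of the same ranking.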
2.1.8. Type IV: Maximum Marginal Hyperplane
Maximum Marginal Hyperplane scores as in Section 2.1.6 but chooses mutants that are furthest from the dividing hyperplane.
2.1.9. Type IV: Minimum Entropy
Minimum Entropy scores as in Section 2.1.7 but chooses mutants that have the lowest possible H.
2.1.10. Type IV: Entropic Tradeoff
Entropic Tradeoff scores using entropy (H) as in Sections 2.1.7 and 2.1.9, giving highest scores to training sets that include mutants with both high and low H.
If a classifier is both to learn quickly and predict accurately as it proceeds, the mutants chosen should be a mix of highly informative and easily predicted. Here, the three mutants chosen are:
The mutant with the highest H.
The mutant with the lowest H that is predicted inactive.
The mutant with the lowest H that is predicted active.
If the classifier runs out of mutants that are predicted active or inactive, the algorithm chooses one mutant with the highest H and two mutants with the lowest H.
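A minimal sketch of this three-mutant selection rule is shown below; the `entropy` and `predicted_active` dictionaries are hypothetical inputs produced by the entropy calculation of Section 2.1.7 and the classifier's predictions.

```python
def entropic_tradeoff_selection(entropy, predicted_active):
    """Pick the highest-entropy mutant, the lowest-entropy mutant predicted
    inactive, and the lowest-entropy mutant predicted active. If one predicted
    class is exhausted, fill the remaining slots with the lowest-H mutants."""
    by_low_entropy = sorted(entropy, key=entropy.get)   # lowest H first
    chosen = [max(entropy, key=entropy.get)]            # highest H
    for want_active in (False, True):
        for mutant in by_low_entropy:
            if mutant not in chosen and predicted_active[mutant] == want_active:
                chosen.append(mutant)
                break
    for mutant in by_low_entropy:                       # fallback fill with low H
        if len(chosen) >= 3:
            break
        if mutant not in chosen:
            chosen.append(mutant)
    return chosen
```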
2.2 In Vitro Experimentation
All p53 mutants were evaluated using a well-established functional yeast assay for wild-type p53. The assay findings have proven to correlate well with subsequent confirmatory studies in mammalian experiments (Baroni et al., 2004; Brachmann et al., 1998; Brachmann et al., 1996; Kobayashi et al., 2003; Qian et al., 2002). p53 mutants are stably expressed from a CEN-based plasmid (one copy per cell) using the constitutive ADH1 promoter, thus resulting in very similar protein levels of p53 mutants. p53-dependent transcriptional activity results in expression of the URA3 reporter gene due to an upstream consensus p53 DNA binding site. Yeast expressing URA3 will grow on yeast plates lacking uracil.
All p53 cancer mutations not tested with N235K, N239Y and N235K+N239Y in our previous studies (Baroni et al., 2004; Danziger et al., 2006) were cloned into yeast expression plasmids containing both the suppressor mutation(s) and a p53 cancer mutation marked by a unique restriction enzyme site. p53 cancer mutations upstream of N235K, N239Y or N235K+N239Y were cloned using EcoR V and Xba I. p53 cancer mutations downstream of the suppressor mutations were cloned using Nsi I and Sac I. Correct cloning for all constructs was confirmed by loss of the previously present unique restriction enzyme site of a p53 cancer mutation (Baroni et al., 2004).
2.3 Attribute Selection
Attributes are generated from model p53 mutants constructed in silico from chain b of the wild-type p53 crystal structure (Cho et al., 1994) using Amber molecular modeling software (Case et al., 2004). These attributes include a 1D sequence perspective, a 2D steric and electrostatic surface map perspective, a 3D distance map perspective and a 4D stability perspective (Danziger et al., 2006).
Compressing, normalizing and then concatenating attributes from these perspectives yields 5,867 attributes per mutant to describe fewer than 300 mutants. To improve speed and generalization, attributes were selected using the Conditional Mutual Information Maximization (CMIM) algorithm. By selecting X attributes prior to learning and then cross-validating using overlap exclusion cross-validation (OECV, see Danziger et al., 2006) for varying X, we determined that 550 of the 5,867 attributes was the optimal number.
CMIM is a method for determining which attributes are most informative for classifying an example. It is based on the Mutual Information (MI) criterion. MI quantifies the change in information, I, by measuring the entropy (H) of the class A before and after attribute B is known. Conceptually, MI measures how much less random the class A is once B is known. MI selects attributes with high I values.
MI is of limited utility in that highly correlated attributes will have very similar scores. For example, given a 2D surface map, two points immediately next to each other would very likely provide almost exactly the same information and therefore have almost exactly the same I value. For an effective classifier, attributes should be as independent as possible (Witten and Frank, 2005).
CMIM solves this problem by answering the question “How much information does attribute C provide about class A given that attribute B is already known?” Formally, this is shown in equations (8) and (9).
$$I(A; B) = H(A) - H(A \mid B) \tag{8}$$
$$I(A; C \mid B) = H(A \mid B) - H(A \mid B, C) \tag{9}$$
Given the joint probabilities P(e1, …, en) of events e1 through en occurring simultaneously, the I provided by CMIM may be implemented as shown in equation (10).
$$I(A; C \mid B) = \sum_{a,\, b,\, c} P(a, b, c) \log_2 \frac{P(b)\, P(a, b, c)}{P(a, b)\, P(b, c)} \tag{10}$$
The first attribute chosen is that with the highest I value as calculated using MI. All following attributes are scored by calculating the CMIM I value with respect to all attributes already chosen.
More formally, let Xn be the vector of values for all mutants at attribute n, and let ν be the sorted score vector for each of F attributes. Start by initializing ν using MI (11) and iteratively update the scores as attributes are chosen (12).
$$\nu[n] = I(A; X_n), \qquad n = 1, \ldots, F \tag{11}$$
$$\nu[n] \leftarrow \min\big(\nu[n],\; I(A; X_n \mid X_m)\big) \quad \text{for each newly chosen attribute } X_m \tag{12}$$
The CMIM algorithm was implemented using optimizations described in Fleuret (2004), which give fast and efficient execution.
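For discrete attributes, the greedy selection defined by (11) and (12) can be sketched directly; the version below is a plain, unoptimised illustration (and assumes the reconstructed forms of (11) and (12) above), whereas the binary-feature optimizations of Fleuret (2004) are what make the algorithm practical at the scale used here.

```python
import numpy as np

def _entropy(y):
    """Empirical Shannon entropy of a discrete label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def _cond_entropy(y, cond):
    """H(y | cond), where cond is a 2-D array of discrete conditioning columns."""
    h = 0.0
    for v in np.unique(cond, axis=0):
        mask = np.all(cond == v, axis=1)
        h += mask.mean() * _entropy(y[mask])
    return h

def cmim_select(X, y, n_features):
    """Greedy CMIM selection over discrete attributes: scores start at the
    marginal MI I(y; X_n) and, after each pick, each remaining score is
    lowered to the minimum conditional information given any chosen attribute."""
    n = X.shape[1]
    h_y = _entropy(y)
    score = np.array([h_y - _cond_entropy(y, X[:, [j]]) for j in range(n)])  # MI
    chosen = []
    for _ in range(n_features):
        best = int(np.argmax(score))
        chosen.append(best)
        score[best] = -np.inf
        h_y_given_best = _cond_entropy(y, X[:, [best]])
        for j in range(n):
            if np.isfinite(score[j]):
                cmi = h_y_given_best - _cond_entropy(y, X[:, [best, j]])
                score[j] = min(score[j], cmi)
    return chosen
```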
2.4 Control
To test if Active Learning methods are indeed useful, three different control methods were used.
The first control, Non-iterated Prediction, tested if Active Learning is more effective than no active learning at all. It predicted all unclassified mutants (TU,1) using the initial training set (TK,1).
The second control, Predict All Inactive, illustrates the natural skew of the data set by predicting every mutant in TU,1 to be the most common class, inactive. Because mutants that rescue p53 functionality are rare, a classifier could be highly accurate but utterly useless by always predicting mutants to be inactive. In examining the results from this control, it is informative to consider the correlation coefficient as well as the accuracy.
The final control, Random (30 Trials), selects three mutants at random from TU,1 during each iteration, repeated over 30 trials. This tests whether choosing the “most informative” examples is truly more effective than selecting examples at random.
3 Results
This section shows how well each of the Active Learning techniques predicted a set of 57 unclassified putative p53 cancer rescue mutants using a training set of 204 mutants. The results for the Active Learning method that performed best, Maximum Curiosity, are presented in Table 1; the in vitro assay results are presented in Fig. 1; and the summary prediction statistics for all Active Learning methods are presented in Table 2.
Table 1.
p53 Cancer Rescue Mutants Predicted Using Maximum Curiosity.
| Cancer Mutation | N235K | N239Y | N235K+N239Y |
|---|---|---|---|
| C135Y | I | I | I |
| C141Y | (A) | I | A |
| P151S | I | I | I |
| V157F | (A) | (A) | A |
| R158L | (A) | (A) | A |
| V173L | (I) | (A) | A |
| R175H | I | I | I |
| C176F | I | I | I |
| H179R | I | I | I |
| H179Y | I | I | I |
| Y205C | (A) | (I) | A |
| Y220C | (A) | (A) | A |
| G245C | (I) | (A) | A |
| G245D | I | I | I |
| G245S | (I) | (A) | A |
| R248Q | I | I | I |
| R248L | I | I | I |
| R248W | I | I | I |
| R249M | (I) | (A) | A |
| R249S | I | I | I |
| V272M | (A) | (A) | A |
| R273C | (I) | (I) | I |
| R273H | (I) | (I) | A |
| R273L | (I) | (I) | I |
| P278L | I | I | I |
| R280T | I | I | I |
| R282W | I | I | I |
| E285K | (I) | NA | NA |
| E286K | (A) | (I) | I |
“A” indicates active and “I” inactive as determined by the yeast assays. Italicized yeast phenotypes in parentheses are for p53 mutants that were part of the training set. NA means not assayed. Mutants highlighted in blue were predicted correctly by the Maximum Curiosity classifier while those in yellow were predicted incorrectly.
Fig. 1.

Yeast growth results for p53 mutants tested in this experiment. p53-dependent transcriptional activity results in URA3 expression and growth at 37°C on yeast plates lacking uracil.
Table 2.
Classifier Accuracy Predicting 3 Mutants at a Time (204 Predicts 57)
| Type | Method | Accuracy | Correlation Coefficient | Student-T |
|---|---|---|---|---|
| I | Maximum Curiosity | 77.19% +/- 5.61% | .5255 | 0.00% |
| I | Composite Classifier | 70.18% +/- 6.11% | .4447 | 100.0% |
| I | Improved Composite Classifier | 71.93% +/- 6.00% | .4637 | 100.0% |
| II | Additive Curiosity | 73.68% +/- 5.88% | .3857 | 99.81% |
| II | Additive Bayesian Surprise | 73.68% +/- 5.88% | .4342 | 99.81% |
| III | Minimum Marginal Hyperplane | 64.91% +/- 6.38% | .2845 | 100.0% |
| III | Maximum Entropy | 64.91% +/- 6.38% | .2845 | 100.0% |
| IV | Maximum Marginal Hyperplane* | 78.95% +/- 5.45% | .3699 | 90.42% |
| IV | Minimum Entropy* | 77.19% +/- 5.61% | .3406 | 0.00% |
| IV | Entropic Tradeoff* | 80.70 % +/- 5.27% | .4860 | 99.89% |
| c | Non-iterated Prediction | 56.14% +/- 6.63% | .2530 | 100.0% |
| c | Predict All Inactive | 80.70% +/- 5.27% | .0000 | 99.89% |
| c | Random (30 trials) | 74.39% +/- 3.87% | .3550 +/- .0992 | 99.24% +/- 2.89% |
The bold and italicized Accuracy and Correlation Coefficient highlight the best and second best scores on TC,i, respectively (excluding Predict All Inactive). All accuracies were calculated by treating each prediction as a separate test for accuracy, and all accuracies show the standard error, except for Random (30 trials) which shows the standard deviation. The Student-T percentage was calculated from the mean and standard deviation to determine statistical difference from Maximum Curiosity. Type “C” means a control type of active learning. Methods marked * were developed after the double-blind predictions.
4 Analysis
When exploring a mutation sequence space of medical importance, such as the cancer rescue mutants in this paper, there are two things that are desired greatly by the biological researcher: (1) to improve the classifier as quickly as possible (Section 4.1); and (2) to identify as many novel functionally active mutants as possible (Section 4.2). For the most effective discovery, the Active Learning method should perform well in both criteria (Section 4.3).
The Type I methods Composite Classifier and Improved Composite Classifier from Danziger et al. (2006) were omitted from the analysis in this section because they are computationally very expensive and theoretically very similar to Maximum Curiosity.
4.1 The Quickest
For most Active Learning, the goal is to train a classifier as quickly as possible by using the smallest number of examples to reach maximum accuracy. Here we define the forward prediction accuracy as the classifier accuracy predicting all remaining unclassified mutants (TU,i) during each training iteration. To evaluate the forward prediction accuracy of each Active Learning method when starting with different-sized training sets, two additional partitions of the 261 mutants in T were constructed. One partition used the 123 mutants known in Danziger et al. (2006) as the training set TK,1 to predict the other 138 mutants as TU,1. The other partition assigned all 25 single amino acid mutants as TK,1 to predict the other 236 mutants as TU,1. The results of these trials are presented in Table 3.
Table 3.
Active Learning Across Varying Data Sets
| Type | Method | Accuracy (25 Predicts 236) | Area Under The Curve (25 Predicts 236) | Accuracy (123 Predicts 138) | Area Under The Curve (123 Predicts 138) | Accuracy (204 Predicts 57) | Area Under The Curve (204 Predicts 57) |
|---|---|---|---|---|---|---|---|
| I | Maximum Curiosity | .7274 +/- .0046 | .7700 +/- .0094 | .7410 +/- .0077 | .7891 +/- .0140 | .7246 +/- .0187 | .7722 +/- .0254 |
| II | Additive Curiosity | .7018 +/- .0047 | .7364 +/- .0087 | .6835 +/- .0082 | .7387 +/- .0158 | .7316 +/- .0186 | .7900 +/- .0291 |
| II | Additive Bayesian Surprise | .6674 +/- .0049 | .7068 +/- .0090 | .7095 +/- .0080 | .7637 +/- .0160 | .7456 +/- .0182 | .8054 +/- .0291 |
| III | Minimum Marginal Hyperplane | .7250 +/- .0046 | .7780 +/- .0124 | .7475 +/- .0076 | .8246 +/- .0207 | .7193 +/- .0188 | .8141 +/- .0425 |
| III | Maximum Entropy | .7507 +/- .0045 | .8118 +/- .0122 | .7348 +/- .0078 | .8089 +/- .0199 | .7544 +/- .0180 | .8354 +/- .0359 |
| IV | Maximum Marginal Hyperplane | .6440 +/- .0049 | .6621 +/- .0049 | .6432 +/- .0084 | .6803 +/- .0166 | .6122 +/- .0204 | .6192 +/- .0178 |
| IV | Minimum Entropy | .6156 +/- .0050 | .5959 +/- .0060 | .6392 +/- .0084 | .6530 +/- .0130 | .6158 +/- .0204 | .5902 +/- .0222 |
| IV | Entropic Tradeoff | .6965 +/- .0047 | .7139 +/- .0058 | .7058 +/- .0080 | .7423 +/- .0122 | .7456 +/- .0182 | .8044 +/- .0272 |
| c | Random (30 trials) | .6700 +/- .0141 | .6922 +/- .0237 | .6931 +/- .0231 | .7326 +/- .0231 | .6950 +/- .0331 | .7392 +/- .0372 |
“25 Predicts 236” uses the 25 mutants with single point mutations as the initial training set and iteratively predicts the remaining 236 mutants in sets of 3. Similarly, “123 Predicts 138” uses the 123 mutants known in Danziger et al. (2006) to predict the remaining 138 mutants. “204 Predicts 57” is the data set discussed in Section 3. Accuracy is the forward prediction accuracy discussed in Section 4.1, weighted by how many mutants are predicted in each iteration. Area Under The Curve is the average forward prediction accuracy over all iterations. The bold and italicized Accuracy and Area Under The Curve highlight the best and second best scoring classifiers (respectively) for each data set. All errors show the standard error, except for Random (30 trials) which shows the standard deviation. Type “C” means a control type of Active Learning.
To better illustrate Active Learning behavior, the forward prediction accuracy for each Active Learning method using 25 mutants as TK,1 is plotted in Fig. 2. In this figure, the Type III methods Minimum Marginal Hyperplane and Maximum Entropy achieve the highest accuracy most rapidly. This is not surprising, as Type III Active Learning methods use the decision boundary to choose the hardest mutants to predict during each iteration. Conversely, as expected, the Type IV methods Maximum Marginal Hyperplane and Minimum Entropy failed to achieve high accuracy rapidly. Interestingly, the Type I method Maximum Curiosity learns nearly as quickly as the Type III methods.
Fig. 2.

Accuracy predicting the remaining unclassified mutants at each iteration using an initial set of 25 known mutants. A scree test on the one standard deviation error bars from Random (30 trials) was used to truncate the graph at 61 of 79 iterations.
4.2 The Most Accurate and the Positive Predictive Value
The long term goals of this p53 cancer rescue mutant study involve iteratively and correctly identifying new active p53 mutants to be verified using in vitro assays. Active Learning, here choosing 3 unclassified mutants at a time, is essentially a scaled down version of this larger project. Therefore, the 3-pt. Classifier Accuracy, the accuracy predicting the three mutants chosen at the beginning of each Active Learning iteration, Tc,i (Table 4), is an indicator of how this study will progress toward accurate classifiers.
Table 4.
3-pt. Classifier Accuracy
| Type | Method | 25 Predicts 236 | 123 Predicts 138 | 204 Predicts 57 | Average |
|---|---|---|---|---|---|
| I | Maximum Curiosity | 0.6525 | 0.7391 | 0.7719 | 0.7211 |
| II | Additive Curiosity | 0.6313 | 0.7101 | 0.7368 | 0.6927 |
| II | Additive Bayesian Surprise | 0.6398 | 0.7246 | 0.7368 | 0.7004 |
| III | Minimum Marginal Hyperplane | 0.6483 | 0.6522 | 0.6491 | 0.6498 |
| III | Maximum Entropy | 0.6652 | 0.6667 | 0.6491 | 0.6603 |
| IV | Maximum Marginal Hyperplane | 0.6992 | 0.6956 | 0.8070 | 0.7339 |
| IV | Minimum Entropy | 0.70339 | 0.7681 | 0.7895 | 0.7536 |
| IV | Entropic Tradeoff | 0.7119 | 0.7681 | 0.7719 | 0.7506 |
| c | Random (30 trials) | 0.6900 | 0.7374 | 0.7439 | 0.7237 |
The classifier accuracies predicting the three mutants chosen during each iteration. The bold and italicized scores highlight the highest scoring and second highest scoring Active Learning methods in each column, respectively.
However, for a classifier to find new cancer rescue mutants effectively, functionally active mutants must be predicted as accurately as possible. That is to say, true positives (tp) are more important than true negatives (tn) for the classifier to be useful. Therefore, a good way to evaluate a classifier is to use the Positive Predictive Value (PPV), shown in (13), as well as accuracy.
$$PPV = \frac{tp}{tp + fp} \tag{13}$$
Table 5 shows the 3-pt. PPV (Section 4.2) from predicting the three mutants selected at the beginning of each iteration.
Table 5.
3-pt. PPV
| Type | Method | 25 Predicts 236 | 123 Predicts 138 | 204 Predicts 57 | Average |
|---|---|---|---|---|---|
| I | Maximum Curiosity | 0.4875 | 0.4687 | 0.4545 | 0.4702 |
| II | Additive Curiosity | 0.4533 | 0.4250 | 0.4000 | 0.4261 |
| II | Additive Bayesian Surprise | 0.4658 | 0.4500 | 0.4090 | 0.4416 |
| III | Minimum Marginal Hyperplane | 0.4789 | 0.3158 | 0.3200 | 0.3716 |
| III | Maximum Entropy | 0.5116 | 0.3333 | 0.3200 | 0.3883 |
| IV | Maximum Marginal Hyperplane | 0.5672 | 0.3824 | 0.5000 | 0.4832 |
| IV | Minimum Entropy | 0.5676 | 0.5333 | 0.4167 | 0.5059 |
| IV | Entropic Tradeoff | 0.5857 | 0.5333 | 0.4286 | 0.5159 |
| c | Random (30 trials) | 0.5377 | 0.4682 | 0.3993 | 0.4684 |
The PPV calculated on the three mutants chosen during each iteration. The bold and italicized scores highlight the highest scoring and second highest scoring Active Learning methods in each column, respectively.
To a certain extent, the values shown in Table 4 and Table 5 represent an unfair comparison. Since each Active Learning method selects mutants in a different order, all of the mutants except the first three, Tc,1, are predicted using different training sets. To partially correct for this, Table 6 shows the Unclassified PPV, calculated by predicting all unclassified mutants, TU,i that do not appear in the training set for any method at iteration i.
Table 6.
Unclassified PPV
| Type | Method | 25 Predicts 236 | 123 Predicts 138 | 204 Predicts 57 | Average |
|---|---|---|---|---|---|
| I | Maximum Curiosity | 0.4595 | 0.3918 | 0.3556 | 0.4023 |
| II | Additive Curiosity | 0.4946 | 0.3669 | 0.3261 | 0.3959 |
| II | Additive Bayesian Surprise | 0.4962 | 0.3745 | 0.3261 | 0.3989 |
| III | Minimum Marginal Hyperplane | 0.4450 | 0.3964 | 0.3333 | 0.3916 |
| III | Maximum Entropy | 0.4895 | 0.4120 | 0.3298 | 0.4104 |
| IV | Maximum Marginal Hyperplane | 0.4854 | 0.3148 | 0.3529 | 0.3844 |
| IV | Minimum Entropy | 0.4816 | 0.3667 | 0.3253 | 0.3912 |
| IV | Entropic Tradeoff | 0.4258 | 0.3599 | 0.3086 | 0.3648 |
The PPV was determined for all unclassified mutants at each iteration that did not appear in any training set at that iteration. The bold and italicized scores highlight the highest scoring and second highest scoring Active Learning methods, respectively. Random (30 trials) was omitted from this scoring method because 30 additional classifiers would prematurely eliminate too many mutants from consideration.
Across all of these measures of Active Learning usefulness, the three Type IV methods and the Type I method Maximum Curiosity tend to do best.
4.3 Overall Best Methods
If some Active Learning methods perform better at learning quickly, but other methods perform well at finding active mutants, which method is best? In this context, there is no clear theoretical framework for quantifying speed versus accuracy. However, there are (at least) three reasonable metrics that combine the accuracy presented in Section 4.1 (acc1) with any of the measures presented in Section 4.2 (acc2).
- Distance From an Ideal Classifier: Assume an ideal classifier with the maximum possible accuracies, i.e. acc1 = acc2 = 1. The Euclidean distance from this ideal classifier is then given by (14) and shown in Fig 3.

  $$d = \sqrt{(1 - acc_1)^2 + (1 - acc_2)^2} \tag{14}$$

- Maximum Area: If acc1 and acc2 are assumed to be orthogonal measures, like the width and height of a rectangle, then a good measure is the resulting area, given by (15) and shown in Fig 4.

  $$A = acc_1 \times acc_2 \tag{15}$$

- Average Accuracy: If acc1 and acc2 are assumed to be interchangeable in terms of usefulness, then the best classifier can be found by adding (equivalently, averaging) them, as given by (16) and shown in Fig 5; a small computational sketch of all three metrics follows this list.

  $$\overline{acc} = \frac{acc_1 + acc_2}{2} \tag{16}$$
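Under the assumption that equations (14)-(16) take the forms reconstructed above, the three metrics can be computed as in the small worked sketch below; the example values are the Maximum Curiosity "204 Predicts 57" accuracies, acc1 from Table 3 and acc2 from Table 4, and are included only to show the calculation.

```python
from math import sqrt

def combined_metrics(acc1, acc2):
    """Distance from an ideal classifier (14), area (15), and average (16)."""
    distance = sqrt((1 - acc1) ** 2 + (1 - acc2) ** 2)   # smaller is better
    area = acc1 * acc2                                   # larger is better
    average = (acc1 + acc2) / 2                          # larger is better
    return distance, area, average

# Maximum Curiosity, "204 Predicts 57": acc1 from Table 3, acc2 from Table 4.
print(combined_metrics(0.7246, 0.7719))
```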
Fig 3.

The Active Learning methods, ordered left to right: Type I - Maximum Curiosity, Type IV - Entropic Tradeoff, Type C - Random (30 trials), Type IV - Minimum Entropy, Type III - Maximum Entropy, Type IV - Maximum Marginal Hyperplane, Type II - Additive Bayesian Surprise, Type III - Minimum Marginal Hyperplane, and Type II - Additive Curiosity, evaluated using the Distance From an Ideal Classifier metric. Shorter distances are better.
Fig 4.

The Active Learning methods as described in Fig 3 evaluated using the Maximum Area metric. Larger areas are better.
Fig 5.

The Active Learning methods as described in Fig 3 evaluated using the Average Accuracy metric. Larger scores are better.
To merge Figs. 3, 4 and 5, each Active Learning method was ordered and assigned points based on how well it performed for each metric relative to the other methods. For example, if there were eight methods, the highest rated method scored 7 points, the second highest 6, and so on, until the lowest rated method received 0 points. Table 7 ranks each Active Learning method based on the average number of points per category, revealing the top 3 Active Learning Methods to be Maximum Curiosity, Entropic Tradeoff, and Random (30 trials).
Table 7.
Overall Average Rank
| Rank | Method | Average Score |
|---|---|---|
| 1 | Maximum Curiosity | 6.11 |
| 2 | Entropic Tradeoff | 5.56 |
| 3 | Random (30 trials) | 5.50 |
| 4 | Minimum Entropy | 4.44 |
| 5 | Maximum Marginal Hyperplane | 3.22 |
| 6 | Maximum Entropy | 3.22 |
| 7 | Additive Bayesian Surprise | 2.89 |
| 8 | Minimum Marginal Hyperplane | 2.33 |
| 9 | Additive Curiosity | 1.89 |
4.3.1. Maximum Curiosity
Maximum Curiosity was the overall best ranked Active Learning method, usually placing at least third best on every scoring metric. However, it is computationally slow, requiring a separate cross-validation for each unclassified mutant.
4.3.2. Entropic Tradeoff
As per Figs 4 and 5, Type IV methods, including Entropic Tradeoff, accurately predict the three mutants chosen each iteration while performing no worse than random at learning quickly. Entropic Tradeoff's primary advantages are computational speed and that it can be tuned by adjusting the ratio of high entropy to low entropy mutants chosen.
4.3.3. Random (30 trials)
Perhaps the most surprising result is that picking mutants at random did so well. We hypothesize that other methods tend to get stuck in local minima, wasting iterations disproving false hypotheses. Previous research has shown that Random Active Learning is only initially more accurate (Tong et al., 2001). Therefore, 261 mutants may be too few data points to reveal this disadvantage.
5 Conclusion
This paper demonstrated the use of computer models and Active Learning to guide the exploration of p53 cancer rescue mutants. Follow-up studies will proceed by identifying interesting clusters of putative p53 cancer rescue mutants. The p53 classifier will iteratively identify interesting mutants, which biologists will synthesize and test, until cancer rescue mutants have been explored for the top 100 p53 mutants found in human cancers.
It is expected that random Active Learning will become much less useful as the experiments progress into larger mutant spaces. Therefore, further experiments will use Maximum Curiosity or Entropic Tradeoff, depending on the computational load of processing the pool of mutants currently under consideration.
Ultimately, this research will help others studying mutant protein function using crystal structures. The techniques described here will help reveal mutants with a desired function from a larger pool of candidates. This would be useful for any similar experimental program exploring a large sequence space of expensive biological data.
Acknowledgments
Thanks to Pierre Baldi, Richard Chamberlin, Jonathan Chen, Jianlin Cheng, Melanie Cocco, Richard Colman, John Coroneus, Lawrence Dearth, Vinh Hoang, Qiang Lu, Hartmut Luecke, Ray Luo, Gabe Moothart, Hiroto Saigo, Don Senear and Josh Swamidass. Funding provided by the NIH, NSF, UCI Medical Scientist Training Program, UCI Office of Research and Graduate Studies and UCI Institute for Genomics and Bioinformatics.
References
- Baroni TE, Wang T, Qian H, Dearth LR, Truong LN, Zeng J, Denes AE, Chen SW, Brachmann RK. A global suppressor motif for p53 cancer mutants. Proc Natl Acad Sci U S A. 2004;101:4930–5. doi: 10.1073/pnas.0401162101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blagosklonny MV. p53 from complexity to simplicity: mutant p53 stabilization, gain-of-function, and dominant-negative effect. Faseb J. 2000;14:1901–7. doi: 10.1096/fj.99-1078rev. [DOI] [PubMed] [Google Scholar]
- Brachmann RK. p53 mutants: the achilles' heel of human cancers? Cell Cycle. 2004;3:1030–4. [PubMed] [Google Scholar]
- Brachmann RK, Vidal M, Boeke JD. Dominant-negative p53 mutations selected in yeast hit cancer hot spots. Proc Natl Acad Sci U S A. 1996;93:4091–5. doi: 10.1073/pnas.93.9.4091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brachmann RK, Yu K, Eby Y, Pavletich NP, Boeke JD. Genetic selection of intragenic suppressor mutations that reverse the effect of common p53 cancer mutations. EMBO J. 1998;17:1847–59. doi: 10.1093/emboj/17.7.1847. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bullock AN, Fersht AR. Rescuing the function of mutant p53. Nat Rev Cancer. 2001;1:68–76. doi: 10.1038/35094077. [DOI] [PubMed] [Google Scholar]
- Bykov VJ, Selivanova G, Wiman KG. Small molecules that reactivate mutant p53. Eur J Cancer. 2003;39:1828–34. doi: 10.1016/s0959-8049(03)00454-4. [DOI] [PubMed] [Google Scholar]
- Case DA, Darden TA, Cheatham TE, III, Simmerling CL, Wang J, Duke RE, Luo R, Merz KM, Wang B, Pearlman DA, Crowley M, Brozell S, Tsui V, Gohlke H, Mongan J, Hornak V, Cui G, Beroza P, Schafmeister C, Caldwell JW, Ross WS, Kollman PA. AMBER 8. University of California, San Francisco; 2004. [Google Scholar]
- Cho Y, Gorina S, Jeffrey PD, Pavletich NP. Crystal structure of a p53 tumor suppres-sor-DNA complex: understanding tumorigenic mutations. Science. 1994;265:346. doi: 10.1126/science.8023157. [DOI] [PubMed] [Google Scholar]
- Cohn DA, Ghahramani Z, Jordan MI. Active Learning with Statistical Models. Journal of Artificial Intelligence Research. 1996;4:129–145. [Google Scholar]
- Danziger SA, Swamidass SJ, Zeng J, Dearth LR, Lu Q, Chen JH, Hoang VP, Saigo H, Luo R, Baldi PF, Brachmann RK, Lathrop RH. Functional census of mutation sequence spaces: the example of p53 cancer rescue mutants. IEEE Transactions on Computational Biology and Bioinformatics. 2006;3:114–125. doi: 10.1109/TCBB.2006.22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Erster S, Moll UM. Stress-induced p53 runs a transcription-independent death program. Biochem Biophys Res Commun. 2005;331:843–50. doi: 10.1016/j.bbrc.2005.03.187. [DOI] [PubMed] [Google Scholar]
- Fleuret F. Fast Binary Feature Selection with Conditional Mutual Information. Journal of Machine Learning Research. 2004;5:1521–1555. [Google Scholar]
- Friedler A, Veprintsev DB, Rutherford T, von Glos KI, Fersht AR. Binding of RAD51 and Other Peptide Sequences to a Promiscuous, Highly Electrostatic, Binding Site in p53. The Journal of Biological Chemistry. 2005;280(9):8051–8059. doi: 10.1074/jbc.M411176200. [DOI] [PubMed] [Google Scholar]
- Greenblatt MS, Bennett WP, Hollstein M, Harris CC. Mutations in the p53 tumor suppressor gene: clues to cancer etiology and molecular pathogenesis. Cancer Res. 1994;54:4855–78. [PubMed] [Google Scholar]
- Hamroun D, Kato S, Ishioka C, Claustres M, Beroud C, Soussi T. The UMD TP53 database and website: update and revisions. Hum Mutat. 2006;27:14–20. doi: 10.1002/humu.20269. [DOI] [PubMed] [Google Scholar]
- Itti L, Baldi P. Bayesian Surprise Attracts Human Attention. Advances in Neural Information Processing Systems. 2006;18 [Google Scholar]
- Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000;100:57–70. doi: 10.1016/s0092-8674(00)81683-9. [DOI] [PubMed] [Google Scholar]
- Ho J, Benchimol S. Transcriptional repression mediated by the p53 tumour suppressor. Cell Death Differ. 2003;10:404–8. doi: 10.1038/sj.cdd.4401191. [DOI] [PubMed] [Google Scholar]
- Hollstein M, Sidransky D, Vogelstein B, Harris CC. p53 mutations in human cancers. Science. 1991;253:49–53. doi: 10.1126/science.1905840. [DOI] [PubMed] [Google Scholar]
- Jing F, Li M, Zhang H, Zhang B. A Unified Framework for Image Retrieval Using Keyword and Visual Features. IEEE Transactions on Image Processing. 2005;14(7):979–989. [DOI] [PubMed] [Google Scholar]
- Jones R, Ghani R, Mitchell T, Riloff E. Active learning for information extraction with multiple view feature sets. ECML-03 Workshop on Adaptive Text Extraction and Mining 2003 [Google Scholar]
- Ko LJ, Prives C. p53: puzzle and paradigm. Genes Dev. 1996;10:1054–72. doi: 10.1101/gad.10.9.1054. [DOI] [PubMed] [Google Scholar]
- Kobayashi T, Wang T, Qian H, Brachmann RK. Genetic strategies in Saccharomyces cerevisiae to study human tumor suppressor genes. Methods Mol Biol. 2003;223:73–86. doi: 10.1385/1-59259-329-1:73. [DOI] [PubMed] [Google Scholar]
- Liu Y. Active Learning with Support Vector Machine Applied to Gene Expression Data for Cancer Classification. Journal of Chemical Informatics and Computer Science. 2004;44:1936–1941. doi: 10.1021/ci049810a. [DOI] [PubMed] [Google Scholar]
- Muthurajan UM, Bao Y, Forsberg LJ, Edayathumangalam RS, Dyer PN, White CL, Luger K. Crystal structures of histone Sin mutant nucleo-somes reveal altered protein-DNA interactions. The EMBO journal. 2004;23:260–271. doi: 10.1038/sj.emboj.7600046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Olivier M, Eeles R, Hollstein M, Khan MA, Harris CC, Hainaut P. The IARC TP53 database: new online mutation analysis and recommendations to users. Hum Mutat. 2002;19:607–14. doi: 10.1002/humu.10081. [DOI] [PubMed] [Google Scholar]
- Park J. Convergence and Application of Online Active Sampling Using Orthogonal Pillar Vectors. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004;28(9):1197–1207. [DOI] [PubMed] [Google Scholar]
- Parkin D, Bray F, Ferlay J, Pisani P. Global cancer statistics, 2002. CA Cancer J Clin. 2005;55(2):74–108. doi: 10.3322/canjclin.55.2.74. [DOI] [PubMed] [Google Scholar]
- Platt JC. Sequential Minimum Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research Technical Report MSR-TR-98-14 1998 [Google Scholar]
- Prives C, Hall PA. The p53 pathway. J Pathol. 1999;187:112–26. doi: 10.1002/(SICI)1096-9896(199901)187:1<112::AID-PATH250>3.0.CO;2-3. [DOI] [PubMed] [Google Scholar]
- Saar-Tsechansky M, Provost F. Active Learning for Class Probability Estimation and Ranking. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence; Seattle, Washington. 2001. pp. 911–920. [Google Scholar]
- Qian H, Wang T, Naumovski L, Lopez CD, Brachmann RK. Groups of p53 target genes involved in specific p53 downstream effects cluster into different classes of DNA binding sites. Oncogene. 2002;21:7901–11. doi: 10.1038/sj.onc.1205974. [DOI] [PubMed] [Google Scholar]
- Roy N, McCallum A. Toward optimal active learning through sampling estimation of error reduction. Proc. 18th International Conf. on Machine Learning; 2001. pp. 441–448. [Google Scholar]
- Seemann S, Maurici D, Olivier M, de Fromentel CC, Hainaut P. The tumor suppressor gene TP53: implications for cancer management and therapy. Crit Rev Clin Lab Sci. 2004;41:551–583. doi: 10.1080/10408360490504952. [DOI] [PubMed] [Google Scholar]
- Sigal A, Rotter V. Oncogenic mutations of the p53 tumor suppressor: the demons of the guardian of the genome. Cancer Res. 2000;60:6788–93. [PubMed] [Google Scholar]
- Soussi T, Beroud C. Assessing TP53 status in human tumours to evaluate clinical outcome. Nat Rev Cancer. 2001;1:233–240. doi: 10.1038/35106009. [DOI] [PubMed] [Google Scholar]
- Vogelstein B, Kinzler KW. Cancer genes and the pathways they control. Nat Med. 2004;10:789–99. doi: 10.1038/nm1087. [DOI] [PubMed] [Google Scholar]
- Vogelstein B, Lane D, Levine AJ. Surfing the p53 network. Nature. 2000;408:307–10. doi: 10.1038/35042675. [DOI] [PubMed] [Google Scholar]
- Wang W, Rastinejad F, El-Deiry WS. Restoring p53-dependent tumor suppression. Cancer Biol Ther. 2003;2:S55–63. [PubMed] [Google Scholar]
- Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z, Liu J, Zhao XD, Chew JL, Lee YL, Kuznetsov VA, Sung WK, D ML, Lim B, Liu ET, Yu Q, Ng HH, Ruan Y. A global map of p53 transcription-factor binding sites in the human genome. Cell. 2006;124:207–19. doi: 10.1016/j.cell.2005.10.043. [DOI] [PubMed] [Google Scholar]
- Witten I, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Morgan Kaufmann Series in Data Management Systems. San Francisco: Morgan Kaufmann; 2005. [Google Scholar]
