Miller et al. 10.1073/pnas.0505482102.
Fig. 5. The similarity between the localization annotations of a pair of proteins was quantified by the same method used for the Gene Ontology annotations, where we embedded the vocabulary used by Huh et al. (1) in a tree that reflects the similarity between the various terms:
1. Huh, W. K., Falvo, J. V., Gerke, L. C., Carroll, A. S., Howson, R. W., Weissman, J. S. & O'Shea, E. K. (2003) Nature 425, 686–691.
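The tree-based similarity mentioned in this legend (and defined under the Gene Ontology features in the Supporting Methods) can be illustrated with a minimal sketch. The tree, the protein names, and the annotations below are hypothetical and are not those used in the study; the sketch also assumes that p(a) counts a protein as annotated with a term if it carries that term or any of its descendants.

```python
import math

# Hypothetical localization tree (child -> parent); not the tree used in the study.
PARENT = {
    "cell": None,
    "endomembrane": "cell",
    "ER": "endomembrane",
    "Golgi": "endomembrane",
    "nucleus": "cell",
}

def ancestors(term):
    """Return the term itself followed by all of its ancestors up to the root."""
    out = []
    while term is not None:
        out.append(term)
        term = PARENT[term]
    return out

def term_frequencies(annotations):
    """p(a): fraction of proteins annotated with a or with a descendant of a (assumption)."""
    counts = {t: 0 for t in PARENT}
    for terms in annotations.values():
        covered = set()
        for t in terms:
            covered.update(ancestors(t))
        for t in covered:
            counts[t] += 1
    n = len(annotations)
    return {t: c / n for t, c in counts.items()}

def similarity(terms_x, terms_y, p):
    """Maximum of -log p(a) over the common ancestors a of any pair of terms."""
    best = 0.0
    for tx in terms_x:
        for ty in terms_y:
            for a in set(ancestors(tx)) & set(ancestors(ty)):
                best = max(best, -math.log(p[a]))
    return best

# Hypothetical annotations for four proteins.
annotations = {"P1": ["ER"], "P2": ["ER"], "P3": ["Golgi"], "P4": ["nucleus"]}
p = term_frequencies(annotations)
same_term = similarity(["ER"], ["ER"], p)        # shares the specific term "ER"
sibling = similarity(["ER"], ["Golgi"], p)       # shares only "endomembrane"
unrelated = similarity(["ER"], ["nucleus"], p)   # shares only the root
```

In this toy example, two ER proteins score higher than an ER/Golgi pair, which in turn scores higher than an ER/nucleus pair, because the deepest shared term becomes progressively more common.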
Supporting Methods
Yeast Strains, Plasmids, and Stop Codon Removal rePCR.
The vectors pY-Cub-PLV and pX-HA-NubG (a gift of S. Thaminy, University of Zürich) were used as templates to amplify the Cub-PLV and HA2-NubG sequences. p415-Cub-PLV-WBP1 was generated by isolating the CYC1 promoter from p415CYC1 (1) with SacI and XbaI and ligating in the WBP1 promoter from pY-Cub-PLV. The Cub-PLV sequence was inserted into the vector by homologous recombination in yeast. To facilitate insertion of the ORFs as PCR products with common 20-bp sequences on their 5' and 3' ends (2), we inserted these common sequences, separated by an NcoI site, upstream of the Cub-PLV sequence. p414-HA2-NubG-ADH1 was constructed in p414ADH1 similarly, except that the promoter was not substituted.
Using the set of yeast ORFs with common 20-bp flanking sequences on the 5' and 3' ends (2) as templates, we carried out PCRs to remove the stop codons. Primers were (forward): 5'-AATTCCAGCTGACCACCATG-3' (the first methionine codon is italicized) and (reverse): 5'-GATCCCCGGGAATTGCCATGATA-3' (the stop codon mutating sequence is italicized). Reactions were performed with Taq polymerase for 12 cycles of 94°C for 15 sec, 61°C for 1 min, and 72°C for 1 min/kbp of ORF. PCRs were analyzed by agarose gel, and junctions for 96 of the ORF-HA2-NubG fusions and all of the ORF-Cub-PLV fusions were sequenced.
Construction of an Array of Yeast Expressing ORF-HA2-NubG Fusions.
The p414-HA2-NubG-ADH1 vector was linearized by NcoI and gel-purified. Ten nanograms of linearized vector was cotransformed with 3 μl of PCR product. Transformations were done as described in ref. 3, except that they were performed in 96-well plates with adjusted volumes. The transformants were taken directly from the transformation reaction and split into two aliquots, one recovered in yeast extract/peptone dextrose (YEPD) media (4) and one recovered in SD-Trp media (4). Transformants were pinned onto 16 SD-Trp + Ade Omnitrays in a 96-spot format and incubated at 30°C until the spots had grown (2–3 days). Yeast were replica-pinned from a 96-spot format to a 384-spot format to generate four plates. The same ORF-HA2-NubG fusion was positioned in adjacent spots such that every ORF was present twice. The array was grown on SD-Trp media and pinned to YEPD before screens. Fusions of NubG-HA onto the amino termini of the set of 705 proteins were also generated but behaved promiscuously in the assay and were not used for screens.
Testing of Cub-PLV Expression and Orientation By Using NubI.
ORF-Cub-PLV fusions were constructed by homologous recombination, and up to eight individual transformants were selected for each of the 705 proteins. Single colonies were inoculated into SD-Leu in sixty-four 96-well plates, grown for 2 days, and then condensed to 16 plates of solid media in a 384-spot format. Plates were tested against NubI fusions (Alg5 and Ste14) and an empty vector. Only Cub-PLV fusions that grew with either or both of the NubI fusions, but not with NubG or empty vector-containing yeast, were used in screens.
Features Used by the Support Vector Machine (SVM).
The following section expands the description of the input parameters for the SVM.
i. The number of interactions in which the Cub-PLV participates.
The 270 proteins included as Cub-PLV fusions in the final data set each found at least one interaction, and up to 57 interactions were found by a single protein, Gpi8. On average, a protein identified between seven and eight interactions as a Cub-PLV fusion. The fewer interactions a protein found, the more specificity it displayed in the assay. Hence, proteins with fewer interactions are expected to have a higher relative percentage of true interactions. Proteins that interact with large numbers of partners may well still contain some "true" interactions, but these will be more difficult to discern against a backdrop of likely false-positive interactions. The proteins used as Cub-PLV fusions to identify the 56 positive training example interactions averaged 15 interactions per Cub-PLV protein, demonstrating that proteins that identify corroborated interactions had, as a group, a higher average number of partners in the assay.
ii. The number of interactions in which the NubG participates.
Each protein present as a NubG fusion in the data set was found to interact at least once, and one protein (Pho88) was found to interact 107 times. The average number of positives for a NubG fusion was 4 across the whole data set, whereas for the NubG proteins involved in the 56 positive examples, the average number of partners was ≈17 per protein. Thus, the specificity for the positive example interactions is lower overall than for the rest of the interactions.
iii. A Boolean feature indicating that both spots for a given NubG were found by the Cub-PLV in either repetition.
This Boolean feature indicates whether both of the positions for a particular NubG fusion grew when mated to yeast carrying the Cub-PLV fusion. Seven hundred five (36%) of the 1,949 nonreciprocal, non-self interactions were identified by both of the positions in at least one screen replicate. For the positive example interactions, 30 (54%) had both positions positive. Therefore, a higher proportion of the corroborated interactions were found for both positions of the NubG partner.
iv. A Boolean feature indicating whether repeated screens by using the same Cub-PLV found this NubG.
Similar to the previous parameter, this Boolean feature identifies reproducible interactions in which both screens of the Cub-PLV protein found the interaction. This repeated observation of the positive is potentially a more informative reproducibility measure than the two-position reproducibility because some positions of the NubG array may be missing, such that the protein can be identified only once per screen. Out of the total set of interactions, 554 (28%) were observed in both screenings of the Cub-PLV protein. For the positive training set interactions, 32 (57%) of the interactions were found both times, again demonstrating that on average the corroborated interactions were more reproducible than the whole data set.
v. An integer (1–4), indicating the total number of times that this interaction was observed.
This feature combines the above two measures of reproducibility and quantifies the observations of an interaction. If both spots were observed in both screens, then the interaction was observed a total of four times. Overall, there were 202 (10%) interactions observed four times, 205 (11%) observed three times, 396 (20%) observed twice, and 1,146 (59%) observed only once. For the positive example interactions, 15 (27%) were observed four times, 10 (18%) were observed three times, 11 (20%) were observed twice, and 20 (36%) were observed only once. The trend among the corroborated interactions is that they were observed more often than the interactions in the full data set and were thus more reproducible.
vi. A Boolean feature indicating whether a reciprocal interaction is observed (i.e., X-Cub-PLV finds Y-NubG and Y-Cub-PLV finds X-NubG).
The 38 (2%) interactions that are observed both when X is tested against Y and when Y is tested against X are distinguished by this Boolean input feature. Among the positive training set, there were five (9%) reciprocal interactions. The observation of a reciprocal interaction indicates that the proteins are functional when expressed as either kind of fusion. However, the enrichment within the positive example set suggests that they also represent higher confidence interactions.
vii. A Boolean feature to indicate whether the reciprocal interaction was tested.
The NubG fusions present in the array were tested in every screen. However, only 365 Cub-PLVs interacted with a NubI control and, of these, 10 were discarded for self-activation. Therefore, 355 proteins were tested against the array as Cub-PLV fusions. The other 350 proteins could be identified in only a single orientation (as NubG fusions). There were 844 (43% of the 1,949 total) interactions that potentially could have been observed in both orientations, but only 38 (4.5%) of the 844 interactions were observed in both orientations. For the positive examples, 24 (43%) of the interactions were tested in both orientations, and 5 of these (9% of the 56 positive examples) successfully interacted in both directions.
viii. An integer (1–8), indicating the total number of times that this interaction was observed in this orientation or its reciprocal.
This feature adds discriminatory values for the reciprocal interactions found more than four times (the rest of the data have the same values as in parameter v). It also more accurately accounts for the number of times that the reciprocal interactions were observed, which is underrepresented in parameter v. Within the set of positive training examples are the following: two of the four interactions observed eight times, one of the four interactions found seven times, one of the nine interactions found six times, and one of the eight interactions found five times. Thus, 20% of the interactions identified more than four times in the assay are corroborated by other experimental evidence.
ix. The strength of growth of the yeast in the positive colonies.
The expectation in measuring growth of a positive in an interaction assay based on reporter gene activation is that stronger interactions result in more growth of the yeast. There is not necessarily a direct relationship between the affinity of an interaction and growth (5). Additionally, neither the stronger interactions nor the better growing colonies necessarily represent more biologically relevant interactions. We expect that stochastic reporter gene activation would result in weak growth. This expectation is based on the rarity of stochastic events, which lead to a small fraction of the cells becoming His+ because of a genomic mutation or rearrangement. Alternatively, a low level of transcription of the reporter gene independent of an interaction may allow some cells to grow to a limited extent. These events might be excluded by an assessment of the growth of the yeast. The average growth score (in arbitrary units) is 2.1 for the entire interaction set, compared with 2.8 for the positive training examples.
x. The relative strength of growth of the yeast in the positive colonies to the controls.
A relative growth score of 1 to 5, based on comparison with the growth of positive controls and other positives, is assigned. This feature is the same as above except that the scoring is performed by reference to the other positives for an individual Cub-PLV protein, especially the NubI positive control. If all of the positives for the Cub-PLV grow weakly in comparison to other searches, the previous feature would not account for this bias. The average is 2.1 for the whole data set and 3.0 for the positive examples (arbitrary units).
xi–xvii. The mutual clustering coefficients for this interaction calculated by the Jaccard index, the meet/min coefficient, the geometric coefficient, and the hypergeometric coefficient.
Goldberg and Roth (6) quantified the tendency of the neighbors of interacting proteins to interact by using these statistics; they showed that the higher the clustering coefficient, the higher the confidence that can be assigned to the interaction. We included statistics relative to the network generated in this study, as well as calculations of the coefficients for the interactions in the context of the entire yeast protein interaction network derived from the BIND database. All calculation methods were included as different features for the SVM to use to classify an interaction. The average coefficients for the first three methods were slightly larger in the positive training set relative to the rest of the data set.
xviii. The essentiality phenotypes of the pair of proteins involved in an interaction.
A score of 1 is listed if both proteins are encoded by essential genes, 0 if one is essential and one nonessential, and 1 if both proteins are encoded by nonessential genes as reported on the Saccharomyces Genome Deletion Project web site (www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html). This feature is meant to address the idea that cells mutant for genes that encode interacting proteins have a similar phenotype. The percentage of interactions that agree in essentiality phenotype of the corresponding mutants is 75% for the entire data set but only 60% for the positive examples.
xix. The differences in the logarithm of the protein expression levels (7).
This feature is based on the expectation that two proteins that function together in a complex may be present at equivalent levels in the cell. Therefore, the closer the agreement in protein levels for two interacting proteins, the more confidence can potentially be assigned to the interaction. The logarithmic values of the expression levels are used to minimize small differences that are within standard error for the original measurements. The average logarithmic expression level difference between the interacting proteins was equivalent in the whole data set and the positive examples.
xx. The difference in the codon enrichment correlation (CEC) between the two proteins (7).
The CEC is the correlation of the codon composition of a gene to the codon makeup of experimentally validated genes. Lower numbers indicate more questionable ORFs, and Ghaemmaghami et al. (7) failed to detect measurable expression of many of the lowest CEC proteins. Inclusion of this parameter was motivated by the idea that proteins that function together and are expressed at similar levels might have comparable codon compositions. The average difference in CEC across the whole data set is 0.18, whereas the average CEC difference for the positive training set interactions is 0.12. Thus, the corroborated interactions are slightly enriched for interactions that occur between proteins with similar codon compositions.
xxi–xxiii. The similarity of the Gene Ontology (GO) annotations for the two proteins involved in the interaction.
We quantified the similarity between the annotations of two proteins as follows: we assign an annotation a the score −log p(a), where p(a) is the fraction of proteins in our data set that have annotation a. The similarity of two annotations is then the maximum score among their common ancestors in the GO hierarchy. This result quantifies the information contained in those shared GO annotations: if the two proteins share only annotations that are very common, this finding is not very informative and yields a low score. A more detailed description can be found in ref. 8. We have found that when comparing the positive training examples to the negative training examples, the GO process annotation is the most informative feature (more so than the localization and molecular function).
xxiv–xxxvii. The Pearson Correlation Coefficients (PCC) of transcript expression levels of the two proteins involved in the interaction for 14 microarray experiments.
Data were obtained from the Saccharomyces Genome Database (http://www.yeastgenome.org) for 14 microarray experiments examining gene expression changes caused by: xxiv, sporulation (9); xxv, exposure to DNA damaging agents (10); xxvi, exposure to different concentrations of α-factor, and xxvii, a time course in the presence of α-factor (11); xxviii, exposure to calcineurin (12); xxix, the cell cycle (13); xxx, the diauxic shift (14); xxxi, adaptive evolution (15); xxxii, histone depletion (16); xxxiii, phosphate metabolic changes (17); xxxiv, ploidy differences (18); xxxv, environmental changes (19); xxxvi, swr1 htz1 ino80 mutation (20); and xxxvii, the unfolded protein response (21).
xxxviii. The transcriptional module (22) relatedness of the transcripts encoding the two proteins in the interaction.
Ihmels et al. (22) assigned a percent similarity for the relatedness of each transcript present in 1 of their 86 modules to the founding transcript for that module. For example, one transcript may be 50% similar in expression changes across a large number of microarray experiments to the founding transcript. Where two transcripts for related proteins were present in the same module, the product of their percent similarities is used to score them. Where two transcripts are present in multiple modules, the sum of the percent similarity products for the transcripts across the modules is used.
xxxix. The weighted GFP colocalization (23) of the two proteins.
In total, 655 (33%) interactions occurred between proteins assigned to identical localizations. There are additional potentially overlapping localizations, such as between the ER and the Golgi apparatus, which may be properties of physiologically relevant interactions, because the vast majority of Golgi-destined membrane proteins will first be inserted into the ER. Among the positive training interactions, 29 (52%) have identical assigned localization, suggesting that the SVM may select for interactions enriched in colocalization relative to the whole data set.
SVM-Based Ranking of Interactions.
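The ranking takes the features listed above as input, and several of them are simple functions of the interaction network. As an illustration, the four mutual clustering coefficients (features xi–xvii) might be computed as in the following sketch. It assumes the usual set-based definitions of these coefficients in terms of the neighbor sets of the two proteins, and that both neighbor sets are nonempty; the function name, argument names, and example proteins are hypothetical.

```python
import math

def mutual_clustering(neighbors_v, neighbors_w, total):
    """Return four mutual clustering coefficients for proteins v and w.

    neighbors_v, neighbors_w: interaction partners of each protein (nonempty).
    total: number of proteins in the network (needed for the hypergeometric score).
    """
    nv, nw = set(neighbors_v), set(neighbors_w)
    shared = len(nv & nw)
    small = min(len(nv), len(nw))
    # Probability of sharing at least `shared` neighbors if the |nw| neighbors
    # of w were drawn at random from all `total` proteins.
    tail = sum(
        math.comb(len(nv), i) * math.comb(total - len(nv), len(nw) - i)
        for i in range(shared, small + 1)
    ) / math.comb(total, len(nw))
    return {
        "jaccard": shared / len(nv | nw),
        "meet_min": shared / small,
        "geometric": shared ** 2 / (len(nv) * len(nw)),
        "hypergeometric": -math.log(tail),
    }

# Hypothetical example: v has 3 partners, w has 4, and they share 2,
# in a network of 10 proteins.
coeffs = mutual_clustering({"a", "b", "c"}, {"b", "c", "d", "e"}, total=10)
```

Each coefficient grows as the two proteins share more of their partners; the hypergeometric version additionally accounts for how surprising the overlap is given the size of the network.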
Labeling of interactions.
Positive examples of interacting proteins were chosen as interactions that were confirmed by other experiments or by the presence of interacting yeast paralogs (see text). Choosing a set of negative examples turned out to be more challenging. The established method for choosing noninteracting protein pairs is to take pairs of proteins whose pattern of localization should preclude them from interacting, as advocated in ref. 24. In related work where we focused on predicting interactions on the basis of sequence information, we found that choosing negative examples by the lack of colocalization can bias the results. Therefore, we chose instead to use random sets of putative interactions as "negative" examples. The assumption is that a random set of putative interactions will contain a sizable fraction of noninteracting proteins because of the false-positive rate of the assay. Combining results of multiple randomly selected sets of negative examples provides a set of high-confidence predictions similar to that of an SVM that used putative interactions between noncolocalized proteins as negative examples (A.B.-H. and J.P.M., unpublished observations).
To create a stable ranking that does not depend on a particular drawing of negative examples, we use predictions averaged over 100 drawings of the negative examples and assign to each interaction a score that is the fraction of times it was predicted to be a true interaction. The averaging is performed only on draws in which a particular example did not participate as a negative example. We observed that a large fraction (≈50%) of the interactions are consistently classified as noninteracting by this procedure.
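The averaging over random drawings of negative examples can be sketched as follows. This is a minimal illustration and not the code used in the study: the classifier is passed in as a function, and a nearest-centroid stand-in (not an SVM) is used so that the example runs with the standard library alone. The function names, the two-dimensional feature vectors, and the number of negatives per draw are all hypothetical.

```python
import random

def centroid_classifier(pos_rows, neg_rows):
    """Stand-in classifier (NOT an SVM): predict the class with the nearer centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    cp, cn = centroid(pos_rows), centroid(neg_rows)
    return lambda x: dist2(x, cp) < dist2(x, cn)

def averaged_scores(features, positives, n_neg, n_draws=100,
                    fit=centroid_classifier, seed=0):
    """Score each uncorroborated pair by the fraction of draws in which it is
    predicted to interact, counting only draws where it was not a negative example."""
    rng = random.Random(seed)
    unlabeled = [k for k in features if k not in positives]
    votes = {k: 0 for k in unlabeled}
    trials = {k: 0 for k in unlabeled}
    pos_rows = [features[k] for k in positives]
    for _ in range(n_draws):
        negatives = set(rng.sample(unlabeled, n_neg))  # one random drawing
        predict = fit(pos_rows, [features[k] for k in negatives])
        for k in unlabeled:
            if k in negatives:
                continue  # skip examples used as negatives in this draw
            trials[k] += 1
            votes[k] += predict(features[k])
    return {k: votes[k] / trials[k] for k in unlabeled if trials[k]}

# Hypothetical feature vectors: p1, p2 are corroborated; u1..u5 are not.
features = {
    "p1": [3.0, 3.0], "p2": [3.2, 2.8],
    "u1": [3.1, 3.1],  # resembles the corroborated pairs
    "u2": [0.1, 0.2], "u3": [0.0, 0.1], "u4": [0.3, 0.0], "u5": [0.2, 0.3],
}
scores = averaged_scores(features, positives={"p1", "p2"}, n_neg=2)
```

Sorting the pairs by `scores` gives a ranking that does not depend on any single drawing; in this toy data set the pair that resembles the corroborated examples is ranked above the others.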
Data cleaning.
Because a set of randomly chosen negative examples will invariably contain many true interactions, we performed a "cleaning" stage before using the set to train the SVM: we purged from the set of negative examples those examples that were misclassified when training a linear SVM (note that this criterion does not use cross-validation). We then trained an SVM on the resulting data by using a Gaussian kernel whose width was chosen with cross-validation.
Missing data.
Some examples may have missing data: e.g., missing GO annotations or no localization data. Because SVMs do not provide a way of dealing with missing data, we rank a putative interaction by using a classifier that is trained only on the data that are available for that particular interaction.
The machine learning experiments reported here were performed by using PyML, a machine learning environment available at http://pyml.sourceforge.net.
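The data-cleaning and missing-data steps described above can be sketched as follows. This is an illustration under stated assumptions and not the PyML code used in the study: a nearest-centroid classifier stands in for the linear SVM, missing feature values are encoded as `None`, and training examples that lack any feature required for a given interaction are skipped. All names and values are hypothetical.

```python
def fit_centroid(pos_rows, neg_rows):
    """Stand-in linear classifier (NOT an SVM): the nearer class centroid wins."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    cp, cn = centroid(pos_rows), centroid(neg_rows)
    return lambda x: dist2(x, cp) < dist2(x, cn)

def clean_negatives(pos_rows, neg_rows, fit=fit_centroid):
    """Purge negatives that the classifier trained on this same data labels as
    positive (training-set predictions, no cross-validation)."""
    predict = fit(pos_rows, neg_rows)
    return [x for x in neg_rows if not predict(x)]

def predict_with_available(example, pos_rows, neg_rows, fit=fit_centroid):
    """Classify `example` with a classifier trained only on the features it has."""
    cols = [i for i, v in enumerate(example) if v is not None]

    def project(row):
        return [row[i] for i in cols]

    def usable(rows):
        # Assumption: training rows missing any required feature are skipped.
        return [project(r) for r in rows if all(r[i] is not None for i in cols)]

    predict = fit(usable(pos_rows), usable(neg_rows))
    return predict(project(example))

# Hypothetical data: the third "negative" resembles the positives.
pos = [[3.0, 3.0], [3.2, 2.8]]
neg = [[0.0, 0.0], [0.2, 0.1], [3.1, 3.0]]
cleaned = clean_negatives(pos, neg)

# Interactions with one missing feature each.
high_first = predict_with_available([3.0, None], pos, cleaned)
low_second = predict_with_available([None, 0.1], pos, cleaned)
```

Training a separate classifier per pattern of available features avoids imputing values, at the cost of fitting more than one model.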
1. Mumberg, D., Muller, R. & Funk, M. (1995) Gene 156, 119–122.
2. Hudson, J. R., Jr., Dawson, E. P., Rushing, K. L., Jackson, C. H., Lockshon, D., Conover, D., Lanciault, C., Harris, J. R., Simmons, S. J., Rothstein, R. & Fields, S. (1997) Genome Res. 7, 1169–1173.
3. Agatep, R., Kirkpatrick, R. D., Parchaliuk, D. L., Woods, R. A. & Gietz, R. D. (1998) in Technical Tips Online, 1:51:P01525. Available at http://tto.trends.com.
4. Sherman, F., Fink, G. & Hicks, J. B. (1986) Methods in Yeast Genetics: A Laboratory Manual (Cold Spring Harbor Lab. Press, Plainview, NY).
5. Estojak, J., Brent, R. & Golemis, E. A. (1995) Mol. Cell. Biol. 15, 5820–5829.
6. Goldberg, D. S. & Roth, F. P. (2003) Proc. Natl. Acad. Sci. USA 100, 4372–4376.
7. Ghaemmaghami, S., Huh, W. K., Bower, K., Howson, R. W., Belle, A., Dephoure, N., O'Shea, E. K. & Weissman, J. S. (2003) Nature 425, 737–741.
8. Ben-Hur, A. & Noble, W. S. (2005) Bioinformatics 21, Suppl. 1, i38–i46.
9. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. & Herskowitz, I. (1998) Science 282, 699–705.
10. Gasch, A. P., Huang, M., Metzner, S., Botstein, D., Elledge, S. J. & Brown, P. O. (2001) Mol. Biol. Cell 12, 2987–3003.
11. Roberts, C. J., Nelson, B., Marton, M. J., Stoughton, R., Meyer, M. R., Bennett, H. A., He, Y. D., Dai, H., Walker, W. L., Hughes, T. R., et al. (2000) Science 287, 873–880.
12. Yoshimoto, H., Saltsman, K., Gasch, A. P., Li, H. X., Ogawa, N., Botstein, D., Brown, P. O. & Cyert, M. S. (2002) J. Biol. Chem. 277, 31079–31088.
13. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998) Mol. Biol. Cell 9, 3273–3297.
14. DeRisi, J. L., Iyer, V. R. & Brown, P. O. (1997) Science 278, 680–686.
15. Ferea, T. L., Botstein, D., Brown, P. O. & Rosenzweig, R. F. (1999) Proc. Natl. Acad. Sci. USA 96, 9721–9726.
16. Wyrick, J. J., Holstege, F. C., Jennings, E. G., Causton, H. C., Shore, D., Grunstein, M., Lander, E. S. & Young, R. A. (1999) Nature 402, 418–421.
17. Ogawa, N., DeRisi, J. & Brown, P. O. (2000) Mol. Biol. Cell 11, 4309–4321.
18. Galitski, T., Saldanha, A. J., Styles, C. A., Lander, E. S. & Fink, G. R. (1999) Science 285, 251–254.
19. Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000) Mol. Biol. Cell 11, 4241–4257.
20. Mizuguchi, G., Shen, X., Landry, J., Wu, W. H., Sen, S. & Wu, C. (2004) Science 303, 343–348.
21. Travers, K. J., Patil, C. K., Wodicka, L., Lockhart, D. J., Weissman, J. S. & Walter, P. (2000) Cell 101, 249–258.
22. Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y. & Barkai, N. (2002) Nat. Genet. 31, 370–377.
23. Huh, W. K., Falvo, J. V., Gerke, L. C., Carroll, A. S., Howson, R. W., Weissman, J. S. & O'Shea, E. K. (2003) Nature 425, 686–691.
24. Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N. J., Chung, S., Emili, A., Snyder, M., Greenblatt, J. F. & Gerstein, M. (2003) Science 302, 449–453.