Abstract
With the completion of the first phase of the European yeast genome sequencing project, the complete DNA sequence of chromosome III of Saccharomyces cerevisiae has become available (Oliver, S. G., et al., 1992, Nature 357, 38-46). We have tested the predictive power of computer sequence analysis on the 176 probable protein products of this chromosome, after exclusion of six problem cases. When the results of database similarity searches are pooled with prior knowledge, a likely function can be assigned to 42% of the proteins, and a predicted three-dimensional structure to a third of these (14% of the total). The function of the other 58% remains to be determined. Of these, about one-third have one or more probable transmembrane segments. Among the most interesting proteins with predicted functions are a new member of the type X polymerase family, a transcription factor with an N-terminal DNA-binding domain related to GAL4, a protein with a "fork head" DNA-binding domain previously known only in Drosophila and in mammals, and a putative methyltransferase. Our analysis increased the number of known significant sequence similarities on chromosome III by 13, to a total of 67. Although the nearly 40% success rate in identifying unknown protein function by sequence analysis is surprisingly high, the information gap between known protein sequences and unknown function is expected to widen and become a major bottleneck of genome projects in the near future. Based on the experience gained in this test study, we suggest that the development of an automated computer workbench for protein sequence analysis must be an important item in genome projects.
Selected References
These references are in PubMed. This may not be the complete list of references from this article.
- Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2.
- Bairoch A., Boeckmann B. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 1991 Apr 25;19 (Suppl):2247–2249. doi: 10.1093/nar/19.suppl.2247.
- Bairoch A. PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 1991 Apr 25;19 (Suppl):2241–2245. doi: 10.1093/nar/19.suppl.2241.
- Barker W. C., George D. G., Hunt L. T., Garavelli J. S. The PIR protein sequence database. Nucleic Acids Res. 1991 Apr 25;19 (Suppl):2231–2236. doi: 10.1093/nar/19.suppl.2231.
- Bernstein F. C., Koetzle T. F., Williams G. J., Meyer E. F., Jr, Brice M. D., Rodgers J. R., Kennard O., Shimanouchi T., Tasumi M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol. 1977 May 25;112(3):535–542. doi: 10.1016/s0022-2836(77)80200-3.
- Bork P., Ouzounis C., Sander C., Scharf M., Schneider R., Sonnhammer E. What's in a genome? Nature. 1992 Jul 23;358(6384):287–287. doi: 10.1038/358287a0.
- Brendel V., Bucher P., Nourbakhsh I. R., Blaisdell B. E., Karlin S. Methods and algorithms for statistical analysis of protein sequences. Proc Natl Acad Sci U S A. 1992 Mar 15;89(6):2002–2006. doi: 10.1073/pnas.89.6.2002.
- Date T., Yamamoto S., Tanihara K., Nishimoto Y., Matsukage A. Aspartic acid residues at positions 190 and 192 of rat DNA polymerase beta are involved in primer binding. Biochemistry. 1991 May 28;30(21):5286–5292. doi: 10.1021/bi00235a023.
- Doolittle R. F. Proteins. Sci Am. 1985 Oct;253(4):88–99. doi: 10.1038/scientificamerican1085-88.
- Franco L., Jiménez A., Demolder J., Molemans F., Fiers W., Contreras R. The nucleotide sequence of a third cyclophilin-homologous gene from Saccharomyces cerevisiae. Yeast. 1991 Dec;7(9):971–979. doi: 10.1002/yea.320070909.
- Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982 Dec 15;162(3):705–708. doi: 10.1016/0022-2836(82)90398-9.
- Hardwick K. G., Pelham H. R. ERS1 a seven transmembrane domain protein from Saccharomyces cerevisiae. Nucleic Acids Res. 1990 Apr 25;18(8):2177–2177. doi: 10.1093/nar/18.8.2177.
- Holm L., Sander C. Database algorithm for generating protein backbone and side-chain co-ordinates from a C alpha trace application to model building and detection of co-ordinate errors. J Mol Biol. 1991 Mar 5;218(1):183–194. doi: 10.1016/0022-2836(91)90883-8.
- Ingrosso D., Fowler A. V., Bleibaum J., Clarke S. Sequence of the D-aspartyl/L-isoaspartyl protein methyltransferase from human erythrocytes. Common sequence motifs for protein, DNA, RNA, and small molecule S-adenosylmethionine-dependent methyltransferases. J Biol Chem. 1989 Nov 25;264(33):20131–20139.
- Ito J., Braithwaite D. K. Compilation and alignment of DNA polymerase sequences. Nucleic Acids Res. 1991 Aug 11;19(15):4045–4057. doi: 10.1093/nar/19.15.4045.
- Karlin S., Altschul S. F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 1990 Mar;87(6):2264–2268. doi: 10.1073/pnas.87.6.2264.
- Karlin S., Brendel V. Chance and statistical significance in protein and DNA sequence analysis. Science. 1992 Jul 3;257(5066):39–49. doi: 10.1126/science.1621093.
- Klimasauskas S., Timinskas A., Menkevicius S., Butkienè D., Butkus V., Janulaitis A. Sequence motifs characteristic of DNA[cytosine-N4]methyltransferases: similarity to adenine and cytosine-C5 DNA-methylases. Nucleic Acids Res. 1989 Dec 11;17(23):9823–9832. doi: 10.1093/nar/17.23.9823.
- Kondo K., Inouye M. TIP 1, a cold shock-inducible gene of Saccharomyces cerevisiae. J Biol Chem. 1991 Sep 15;266(26):17537–17544.
- Kyte J., Doolittle R. F. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982 May 5;157(1):105–132. doi: 10.1016/0022-2836(82)90515-0.
- Lasters I., Wodak S. J., Alard P., van Cutsem E. Structural principles of parallel beta-barrels in proteins. Proc Natl Acad Sci U S A. 1988 May;85(10):3338–3342. doi: 10.1073/pnas.85.10.3338.
- Li C., Lai C. F., Sigman D. S., Gaynor R. B. Cloning of a cellular factor, interleukin binding factor, that binds to NFAT-like motifs in the human immunodeficiency virus long terminal repeat. Proc Natl Acad Sci U S A. 1991 Sep 1;88(17):7739–7743. doi: 10.1073/pnas.88.17.7739.
- Lupas A., Van Dyke M., Stock J. Predicting coiled coils from protein sequences. Science. 1991 May 24;252(5009):1162–1164. doi: 10.1126/science.252.5009.1162.
- Marguet D., Guo X. J., Lauquin G. J. Yeast gene SRP1 (serine-rich protein). Intragenic repeat structure and identification of a family of SRP1-related DNA sequences. J Mol Biol. 1988 Aug 5;202(3):455–470. doi: 10.1016/0022-2836(88)90278-1.
- Marmorstein R., Carey M., Ptashne M., Harrison S. C. DNA recognition by GAL4: structure of a protein-DNA complex. Nature. 1992 Apr 2;356(6368):408–414. doi: 10.1038/356408a0.
- Oliver S. G., van der Aart Q. J., Agostoni-Carbone M. L., Aigle M., Alberghina L., Alexandraki D., Antoine G., Anwar R., Ballesta J. P., Benit P. The complete DNA sequence of yeast chromosome III. Nature. 1992 May 7;357(6373):38–46. doi: 10.1038/357038a0.
- Ringe D., Petsko G. A. Cystic fibrosis. A transport problem? Nature. 1990 Jul 26;346(6282):312–313. doi: 10.1038/346312a0.
- Sander C., Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins. 1991;9(1):56–68. doi: 10.1002/prot.340090107.
- Skala J., Purnelle B., Goffeau A. The complete sequence of a 10.8 kb segment distal of SUF2 on the right arm of chromosome III from Saccharomyces cerevisiae reveals seven open reading frames including the RVS161, ADP1 and PGK genes. Yeast. 1992 May;8(5):409–417. doi: 10.1002/yea.320080508.
- Smith T. F., Waterman M. S. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5.
- Sor F., Chéret G., Fabre F., Faye G., Fukuhara H. Sequence of the HMR region on chromosome III of Saccharomyces cerevisiae. Yeast. 1992 Mar;8(3):215–222. doi: 10.1002/yea.320080307.
- Taha M. K., So M., Seifert H. S., Billyard E., Marchal C. Pilin expression in Neisseria gonorrhoeae is under both positive and negative transcriptional control. EMBO J. 1988 Dec 20;7(13):4367–4378. doi: 10.1002/j.1460-2075.1988.tb03335.x.
- Thierry A., Fairhead C., Dujon B. The complete sequence of the 8.2 kb segment left of MAT on chromosome III reveals five ORFs, including a gene for a yeast ribokinase. Yeast. 1990 Nov-Dec;6(6):521–534. doi: 10.1002/yea.320060609.
- Warmington J. R., Waring R. B., Newlon C. S., Indge K. J., Oliver S. G. Nucleotide sequence characterization of Ty 1-17, a class II transposon from yeast. Nucleic Acids Res. 1985 Sep 25;13(18):6679–6693. doi: 10.1093/nar/13.18.6679.
- Waterman M. S., Eggert M. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J Mol Biol. 1987 Oct 20;197(4):723–728. doi: 10.1016/0022-2836(87)90478-5.
- Weigel D., Jäckle H. The fork head domain: a novel DNA binding motif of eukaryotic transcription factors? Cell. 1990 Nov 2;63(3):455–456. doi: 10.1016/0092-8674(90)90439-l.
- Wiersma P. A., Schmiemann M. G., Condie J. A., Crosby W. L., Moloney M. M. Isolation, expression and phylogenetic inheritance of an acetolactate synthase gene from Brassica napus. Mol Gen Genet. 1989 Nov;219(3):413–420. doi: 10.1007/BF00259614.
- van der Voorn L., Ploegh H. L. The WD-40 repeat. FEBS Lett. 1992 Jul 28;307(2):131–134. doi: 10.1016/0014-5793(92)80751-2.
