Abstract
Identification and functional characterization of the genes in the human genome remain a major challenge. A principal source of publicly available information used for this purpose is the National Center for Biotechnology Information database of expressed sequence tags (dbEST), which contains over 4 million human ESTs. To extract the information buried in these data more effectively, we have developed a semiautomated method to mine dbEST for uncharacterized human genes. Starting from a single input protein sequence, the method compiles a family of related proteins from all species. The entire family is then used to mine the human EST database for new gene candidates. Evaluating putative new gene candidates in the context of a family of characterized proteins provides a framework for inferring the structure and function of the new genes. When applied to a test data set of 28 families within the major facilitator superfamily (MFS) of membrane transporters, our protocol found 73 previously characterized human MFS genes and 43 new MFS gene candidates. Development of this approach provided insights into the problems and pitfalls of automated data mining using public databases.
Keywords: Major facilitator superfamily, transporters, superfamily analysis, expressed sequence tags, data mining
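The two-stage search described in the abstract (compile a protein family from one seed sequence, then use the whole family to query the ESTs) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, a toy shared k-mer score stands in for a real sequence-similarity search, the ESTs are treated as already-translated peptide fragments, and the rule that separates known genes from new candidates (near-identity to a characterized human protein) is a placeholder rather than the criterion used in the study.

```python
from typing import Dict, Set, Tuple


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(a: str, b: str, k: int = 3) -> float:
    """Toy similarity score: shared k-mers divided by the smaller k-mer set.

    This is only a stand-in for a real similarity search.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def compile_family(seed: str, proteins: Dict[str, str],
                   threshold: float = 0.5) -> Dict[str, str]:
    """Grow a family from one seed by repeatedly searching with each new member."""
    family: Dict[str, str] = {}
    frontier = [seed]
    while frontier:
        query = frontier.pop()
        for name, seq in proteins.items():
            if name not in family and shared_kmer_fraction(query, seq) >= threshold:
                family[name] = seq
                frontier.append(seq)  # new members become queries in turn
    return family


def mine_ests(family: Dict[str, str], ests: Dict[str, str],
              known_human: Dict[str, str], hit_threshold: float = 0.5,
              identity_threshold: float = 0.9) -> Tuple[Set[str], Set[str]]:
    """Query the ESTs with every family member and split the hits.

    Hits that are near-identical to a characterized human protein are
    reported as known; the remaining hits are new gene candidates.
    (Assumed rule, for illustration only.)
    """
    known: Set[str] = set()
    candidates: Set[str] = set()
    for est_id, peptide in ests.items():
        if not any(shared_kmer_fraction(peptide, s) >= hit_threshold
                   for s in family.values()):
            continue  # no family member matches this EST
        if any(shared_kmer_fraction(peptide, s) >= identity_threshold
               for s in known_human.values()):
            known.add(est_id)
        else:
            candidates.add(est_id)
    return known, candidates


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    proteins = {"P1": "MKTLLVAGGS", "P2": "KTLLVAGGSW", "P3": "QQQQWWWWEE"}
    ests = {"EST_A": "TLLVAGGS", "EST_B": "KTLLVAGGT", "EST_C": "PPPPHHHH"}
    known_human = {"H1": "MKTLLVAGGS"}

    family = compile_family("MKTLLVAG", proteins)
    known, candidates = mine_ests(family, ests, known_human)
    print(sorted(family), sorted(known), sorted(candidates))
```

Querying with every family member rather than the seed alone is the point of the design: an EST that is too divergent from the seed can still be recovered through a closer relative in the family.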