Abstract
Assigning functions to proteins of unknown function is of considerable interest to proteomics researchers, as the genes encoding them are conserved across various species. Here, we describe HypoDB, a database of hypothetical genes and proteins in six eukaryotes. The data were collected and organized by the number of entries in each chromosome, with a few annotations per entry. HypoDB contains gene and protein sequences, chromosome number and location, and secondary and tertiary structure data.
Availability
The database is available for free at http://www.trimslabs.com/database/hypodb/index.html
Background
Data on hypothetical proteins expressed in many eukaryotes would help researchers search for potential proteins of interest with unknown functions [1]. Many genes encoding such hypothetical proteins are conserved across various species, as comparative genome analysis reveals [2–4]. To predict a function for each protein coding region, a comparative sequence analysis against all functionally elucidated sequences in protein sequence databases provides the information needed for sequence retrieval, functional prediction and identification of homologous sequences [5, 6]. Multiple sequence alignments can then give functional insights into the cellular process or biological function involved [9, 10]. A hypothetical protein with one or more significant structural homologs is predicted to have similar molecular properties [7, 8]. Conserved hypothetical proteins are found in both prokaryotes and eukaryotes, and their function can be predicted by domain homology searches, secondary and tertiary structure predictions, and gene annotations. Hence, data on hypothetical proteins from the NCBI database were collected and organized into a database using HTML and JavaScript. The database contains gene and protein sequences, chromosome number and location, secondary and tertiary structure information, ProFunc server data, results from primary analysis tools (mol. wt, ionization constant etc.), expression levels of the sequences and related data.
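As an illustration of the kind of primary analysis mentioned above, the molecular weight of a protein can be estimated from its sequence by summing the average masses of its residues and adding one water molecule for the free termini. The JavaScript sketch below is not taken from HypoDB. The names `RESIDUE_MASS`, `WATER_MASS` and `molecularWeight` are assumptions introduced here, and the values are the commonly tabulated average residue masses in daltons.

```javascript
// Average residue masses in daltons (free amino acid mass minus one water).
// Only the 20 standard amino acids are covered in this sketch.
const RESIDUE_MASS = {
  G: 57.0519, A: 71.0788, S: 87.0782, P: 97.1167, V: 99.1326,
  T: 101.1051, C: 103.1388, L: 113.1594, I: 113.1594, N: 114.1038,
  D: 115.0886, Q: 128.1307, K: 128.1741, E: 129.1155, M: 131.1926,
  H: 137.1411, F: 147.1766, R: 156.1875, Y: 163.1760, W: 186.2132,
};

// Mass of the water molecule added back for the N- and C-terminal groups.
const WATER_MASS = 18.01528;

// Estimate the average molecular weight of a protein sequence given in
// one-letter code. Hypothetical helper, not part of HypoDB.
function molecularWeight(sequence) {
  if (sequence.length === 0) {
    return 0; // an empty sequence has no mass
  }
  let total = WATER_MASS;
  for (const residue of sequence.toUpperCase()) {
    const mass = RESIDUE_MASS[residue];
    if (mass === undefined) {
      // Ambiguity codes such as X or B are rejected rather than guessed.
      throw new Error(`Unsupported residue: ${residue}`);
    }
    total += mass;
  }
  return total;
}
```

Calling `molecularWeight` on a protein sequence string returns the estimate in daltons. Rejecting ambiguous residue codes instead of substituting an average mass is a design choice of this sketch, so that an incomplete sequence is not silently given a misleading weight.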
Methodology
Construction of database
HypoDB is built from HTML pages constructed with JavaScript and is available online at http://www.trimslabs.com/database/hypodb/index.html. Data were collected from the NCBI GenBank and SWISS-PROT databases. HypoDB includes hypothetical proteins of 8 organisms; the complete list of organisms, with their scientific and common names, is given in Table 1. The entries are stored as records and organized to simplify finding relevant data for the proteins of a given organism. Each record holds hypothetical gene and protein sequences, and the data are categorized by the number of hypothetical genes and proteins in each chromosome of six eukaryotes. When accessed, a record returns the nucleotide and protein sequences together with annotations such as accession numbers, source organism and chromosome number. An example entry from human chromosome 1, LOC100131311, is given in Table 2.
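The record organization described above can be sketched in JavaScript as follows. This is a minimal illustration and not the actual HypoDB code: the field names, the helper functions `buildIndex` and `findRecords`, the accession values and the sequences are all hypothetical placeholders. Only the locus identifier LOC100131311 and its assignment to human chromosome 1 come from the text.

```javascript
// Hypothetical record layout. Accessions and sequences are placeholders,
// not real database values.
const records = [
  {
    locus: "LOC100131311",           // example entry named in the text
    organism: "Homo sapiens",
    chromosome: "1",
    accession: "PLACEHOLDER_ACCESSION_1",
    nucleotideSequence: "ATGGCCTAA", // placeholder sequence
    proteinSequence: "MA",           // placeholder sequence
  },
  {
    locus: "EXAMPLE_LOCUS_2",        // invented entry for illustration
    organism: "Homo sapiens",
    chromosome: "2",
    accession: "PLACEHOLDER_ACCESSION_2",
    nucleotideSequence: "ATGAAATAA", // placeholder sequence
    proteinSequence: "MK",           // placeholder sequence
  },
];

// Build a two-level index: organism -> chromosome -> list of records.
function buildIndex(recordList) {
  const index = new Map();
  for (const record of recordList) {
    if (!index.has(record.organism)) {
      index.set(record.organism, new Map());
    }
    const byChromosome = index.get(record.organism);
    if (!byChromosome.has(record.chromosome)) {
      byChromosome.set(record.chromosome, []);
    }
    byChromosome.get(record.chromosome).push(record);
  }
  return index;
}

// Return all records for one chromosome of one organism,
// or an empty array when nothing is stored for that combination.
function findRecords(index, organism, chromosome) {
  const byChromosome = index.get(organism);
  if (!byChromosome) {
    return [];
  }
  return byChromosome.get(chromosome) || [];
}
```

With this layout, `findRecords(buildIndex(records), "Homo sapiens", "1")` returns the records stored for human chromosome 1, and the length of each chromosome's list gives the per-chromosome entry count by which the data are categorized.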
Utility
The database is useful to researchers working in functional proteomics and genomics. Data on hypothetical genes and proteins support a prominent research area: annotating genes of interest and predicting their functional regions. Given the advances in bioinformatics, predicting function and assigning functionally important sites within a protein sequence helps to identify mutations that might have resulted in the unknown function of a particular gene. This database of hypothetical genes and proteins would therefore be a useful source for studying or predicting the functional regions of a protein. The data are segregated by the number of entries in each chromosome of six eukaryotes and are easy to access.
Supplementary material
Footnotes
Citation: Adinarayana et al., Bioinformation 6(3): 128-130 (2011)
References
- 1. WF Doolittle. Trends Genet. 1998;14:307. doi: 10.1016/s0168-9525(98)01494-2.
- 2. E Lorbach, et al. Biol Chem. 1998;379:1355. doi: 10.1515/bchm.1998.379.11.1355.
- 3. CA Wilson, et al. J Mol Biol. 2000;297:233. doi: 10.1006/jmbi.2000.3550.
- 4. TI Zarembinski, et al. Proc Natl Acad Sci U S A. 1998;95:15189.
- 5. E Eisenstein, et al. Curr Opin Biotechnol. 2000;11:25. doi: 10.1016/s0958-1669(99)00063-4.
- 6. I Uchiyama. Nucleic Acids Res. 2003;31:58. doi: 10.1093/nar/gkg109.
- 7. K Kinoshita, H Nakamura. Protein Sci. 2003;12:1589. doi: 10.1110/ps.0368703.
- 8. Z Liu, et al. BMC Genomics. 2008;9:509. doi: 10.1186/1471-2164-9-509.
- 9. C Discala, et al. Nucleic Acids Res. 2000;28:8. doi: 10.1093/nar/28.8.e33.
- 10. A Ebihara, et al. Protein Sci. 2006;15:1494. doi: 10.1110/ps.062131106.