Nucleic Acids Research. 2000 Jan 1;28(1):331–332. doi: 10.1093/nar/28.1.331

HUGE: a database for human large proteins identified in the Kazusa cDNA sequencing project

Reiko Kikuno, Takahiro Nagase, Mikita Suyama, Mina Waki, Makoto Hirosawa, Osamu Ohara
PMCID: PMC102416  PMID: 10592264

Abstract

HUGE is a database for large human proteins newly identified in the Kazusa cDNA project, the aim of which is to predict the primary structure of proteins from the sequences of large human cDNAs (>4 kb). In particular, cDNA clones capable of coding for large proteins (>50 kDa) are the current targets of the project. HUGE contains >1100 cDNA sequences and detailed information obtained through analysis of the sequences of the cDNAs and the predicted proteins. Besides an increase in the number of cDNA entries, the amount of experimental data for expression profiling has increased substantially and data on chromosomal locations have been newly added. All of the protein-coding regions were examined by GeneMark analysis, and the results of a motif/domain search of each predicted protein sequence against the Pfam database have been newly added. HUGE is available through the WWW at http://www.kazusa.or.jp/huge

INTRODUCTION

Kazusa DNA Research Institute has been conducting a cDNA sequencing project with the aim of predicting the primary structure of unidentified human proteins. In particular, we have been interested in long cDNA clones which direct the synthesis of large proteins (>50 kDa) (1). More than 1100 human cDNA sequences (average size 4.9 kb) have been published to date. The HUGE database was constructed last year (2) to offer clues for assigning functions to the predicted proteins by providing an overview of the characteristics of each cDNA and protein sequence. At that time HUGE included basic and essential information obtained through analysis of the sequences of the cDNAs and the predicted proteins, but gene expression patterns were available for only a limited number of HUGE entries, and information on chromosomal locations was not available. Since then HUGE has grown not only through an increase in the number of cDNA sequences available for retrieval but also through the addition of results from expression profiling and radiation hybrid (RH) mapping, together with details of the experimental conditions used. Furthermore, GeneMark analysis (3; M.Hirosawa, K.Ishikawa, T.Nagase and O.Ohara, manuscript in preparation) was applied to predict the protein-coding regions in order to eliminate, as far as possible, errors in their assignment. The results of a motif/domain search against the Pfam database (4) using HMMER 2.1.1 (5) have been newly added to provide further functional implications for the protein sequences thus predicted. Since we make the cDNA clones publicly available for research purposes, we have added information on cloning vectors and improved the tools for retrieving the HUGE entries of interest, in an effort to make this database more useful and user-friendly.

GENE/PROTEIN CHARACTERISTIC TABLE

As mentioned in our previous report (2), KIAA numbers have been used as primary gene identifiers in HUGE. To date, there are >1100 entries in HUGE starting with KIAA0001, and each entry has its own gene/protein characteristic table. The contents of the tables are divided into a header part and four sections describing features of the DNA sequence, features of the protein sequence, information on the expression profile and information obtained in chromosomal mapping.

The header part indicates the accession number of the cDNA sequence in GenBank/EMBL/DDBJ, the alias name of the gene, the clone name, information on the cloning vector used, and the biological source of the cDNA library from which the clone was isolated.
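The table layout described above (a header part plus four sections) could be modelled roughly as follows. This is only an illustrative sketch: the class names, attribute names and the use of free-form dictionaries for the sections are assumptions made for this example and do not reflect how HUGE stores its data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Header:
    """Header part of a gene/protein characteristic table (names are hypothetical)."""
    accession: str        # accession number in GenBank/EMBL/DDBJ
    alias: str            # alias name of the gene
    clone_name: str
    cloning_vector: str
    library_source: str   # biological source of the cDNA library


@dataclass
class CharacteristicTable:
    """One HUGE entry: a KIAA number, a header and four sections."""
    kiaa: str
    header: Header
    dna_features: Dict[str, Any] = field(default_factory=dict)
    protein_features: Dict[str, Any] = field(default_factory=dict)
    expression_profile: Dict[str, Any] = field(default_factory=dict)
    chromosomal_mapping: Dict[str, Any] = field(default_factory=dict)

    def sections(self) -> Dict[str, Dict[str, Any]]:
        """Return the four sections in the order given in the text."""
        return {
            "dna_features": self.dna_features,
            "protein_features": self.protein_features,
            "expression_profile": self.expression_profile,
            "chromosomal_mapping": self.chromosomal_mapping,
        }
```

Keeping the sections as open dictionaries reflects that their contents differ from entry to entry, for example in the expression assay that was used.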

The first section, features of the DNA sequence, describes the cloned cDNA sequence, showing its length, the physical map, the restriction map together with a list of the commercially available restriction enzymes used (6), and the results of GeneMark analysis. In the physical map, the open reading frame and untranslated regions are indicated by solid and open boxes, respectively, and the position of the first ATG codon is indicated by a triangle. Alu and other repetitive sequences detected by RepeatMasker (A.F.A.Smit and P.Green, unpublished results) are indicated by dotted and hatched boxes, respectively. GeneMark analysis was applied to predict the protein-coding region of the cDNA sequence. The significant protein-coding region(s) are shown in linked graphics, and warnings about N-terminal truncation and about spurious interruption of the coding region are provided, if detected. For the clones for which GeneMark analysis issued a warning, we performed additional experiments using the reverse transcription-coupled polymerase chain reaction (RT–PCR) method to detect cloning artifacts (e.g., a frameshift or nonsense mutation, or retention of an intron), and we corrected the cDNA sequence. The results of GeneMark analysis for the corrected sequence are also shown, which may be helpful in determining whether further revision is needed.

The second section, features of the protein sequence, describes the predicted protein sequence, showing its length and the results of the homology search, the motif/domain search and the prediction of transmembrane regions. The protein sequence was predicted from the cloned cDNA sequence or, in cases where we corrected the cloned DNA sequence, from the revised sequence obtained by direct sequencing of the main RT–PCR products as mentioned above. The procedures for homology searches by FASTA (7) against the OWL database (8) and other HUGE entries, for the motif search against the PROSITE database (9) and for the prediction of membrane-spanning regions by SOSUI (10) are the same as those described previously (2). A motif/domain search against Pfam using HMMER 2.1.1 has been newly added.

The third section, information on the expression profile, describes the experimental results on the pattern of expression at the mRNA level as determined by northern blot analysis (for KIAA0001–KIAA0280), by RT–PCR assay (for KIAA0294–KIAA0710), or by semi-quantitative RT–PCR together with enzyme-linked immunosorbent assay (RT–PCR ELISA; after KIAA0711). In the case of northern blot analysis and RT–PCR assay, the expression levels are shown directly in the photographs, whereas in the case of RT–PCR ELISA, the results are converted into color codes using the digit-color conversion panel shown in this section. The experimental conditions used, such as the primer sequences and PCR conditions, can also be browsed in the case of RT–PCR and RT–PCR ELISA.
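The correspondence between KIAA numbers and expression assays given above can be written down as a small lookup. The sketch below is illustrative only: the function names are hypothetical, KIAA identifiers are assumed to be "KIAA" followed by four digits, and "after KIAA0711" is assumed to include KIAA0711 itself. Numbers that fall outside the ranges stated in the text return `None` rather than a guess.

```python
import re
from typing import Optional

# Ranges exactly as stated in the text. The last range is open-ended;
# treating KIAA0711 itself as part of it is an assumption of this sketch.
ASSAY_RANGES = [
    (1, 280, "northern blot"),
    (294, 710, "RT-PCR"),
    (711, None, "RT-PCR ELISA"),
]


def kiaa_number(identifier: str) -> int:
    """Extract the numeric part of an identifier such as 'KIAA0001'."""
    match = re.fullmatch(r"KIAA(\d{4})", identifier)
    if match is None:
        raise ValueError(f"not a KIAA identifier: {identifier!r}")
    return int(match.group(1))


def expression_assay(identifier: str) -> Optional[str]:
    """Return the assay named in the text for this entry, or None if unstated."""
    number = kiaa_number(identifier)
    for low, high, assay in ASSAY_RANGES:
        if number >= low and (high is None or number <= high):
            return assay
    return None
```

For instance, `expression_assay("KIAA0500")` falls in the second range and returns `"RT-PCR"`.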

The last section, information obtained in chromosomal mapping, describes the chromosomal locations as determined using human–rodent hybrid panels GeneBridge 4, Stanford G3 (Research Genetics Inc., USA) or CCR Coriell2 (Coriell Cell Repositories, USA) and the experimental conditions used. If mapping data for the cDNA clones were available in the UniGene database (11), we fetched the chromosomal numbers from UniGene, but only when the primer sets used in the determination of chromosomal location in the UniGene database were consistent with the sequences of the genes we determined.
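As an illustration of the kind of consistency check described above, the sketch below tests whether a primer pair is compatible with a cDNA sequence. The text does not specify the criterion that was actually used; the rule assumed here (exact match of the forward primer, exact match of the reverse complement of the reverse primer further downstream) and all function names are assumptions of this example.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def primers_consistent(cdna: str, forward: str, reverse: str) -> bool:
    """Check that both primers match the cDNA exactly and in the right order.

    The forward primer must occur on the given strand, and the reverse
    primer must anneal (as its reverse complement) downstream of it
    without overlapping the forward primer site.
    """
    cdna = cdna.upper()
    forward = forward.upper()
    fwd_pos = cdna.find(forward)
    rev_pos = cdna.find(reverse_complement(reverse))
    if fwd_pos == -1 or rev_pos == -1:
        return False
    return fwd_pos + len(forward) <= rev_pos
```

A real comparison would probably need to tolerate mismatches and repeated primer sites; exact matching is used here only to keep the example short.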

TOOLS TO ACCESS GENE/PROTEIN CHARACTERISTIC TABLE OF INTEREST

HUGE provides two tools that allow users to easily obtain the KIAA number(s) of interest to them: a homology search and a keyword search. Each KIAA number selected is linked to its gene/protein characteristic table. The homology search uses FASTA to detect sequences in HUGE that are similar to the user’s query sequence (either a nucleotide sequence or an amino acid sequence). The keyword search selects HUGE entries that contain the query keywords by searching for the words in a keyword table constructed for this purpose. A ‘NOT’ button is provided to find the entries that do not contain the input keyword. The set of HUGE entries thus found can be refined further by adding another keyword combined with an ‘AND’ or ‘OR’ operation. This procedure can be continued step by step. The default setting is to search all fields of the keyword table for the input keywords, but users can also specify a field by means of a pull-down menu to make the keyword search more specific. The keyword table contains the following fields: KIAA number, clone name, accession number in GenBank/EMBL/DDBJ, size of the cDNA/protein, motifs in PROSITE or Pfam, results of homology search against HUGE or OWL, number of predicted transmembrane segments, presence of a GeneMark warning, expression pattern and chromosomal locations.
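The step-by-step keyword search described above can be pictured as set operations over a keyword table. The following sketch is not the HUGE implementation: the table contents are invented toy data, the KIAA numbers are made up, the field names are hypothetical, and case-insensitive substring matching is assumed.

```python
from typing import Dict, Optional, Set

# Invented toy keyword table: entry identifier -> field -> text.
# None of these values come from HUGE.
TOY_TABLE: Dict[str, Dict[str, str]] = {
    "KIAA9001": {"motif": "protein kinase", "chromosome": "1", "expression": "brain"},
    "KIAA9002": {"motif": "zinc finger", "chromosome": "1", "expression": "liver"},
    "KIAA9003": {"motif": "protein kinase", "chromosome": "X", "expression": "liver"},
}


def search(table: Dict[str, Dict[str, str]], keyword: str,
           field: Optional[str] = None, negate: bool = False) -> Set[str]:
    """Select entries containing the keyword (or lacking it, if negate is set).

    With field=None all fields are searched, mirroring the default setting.
    """
    needle = keyword.lower()
    hits = set()
    for entry_id, fields in table.items():
        values = fields.values() if field is None else [fields.get(field, "")]
        found = any(needle in value.lower() for value in values)
        if found != negate:
            hits.add(entry_id)
    return hits


def refine(table: Dict[str, Dict[str, str]], current: Set[str], operation: str,
           keyword: str, field: Optional[str] = None, negate: bool = False) -> Set[str]:
    """Combine an earlier result with a new keyword using 'AND' or 'OR'."""
    hits = search(table, keyword, field, negate)
    if operation == "AND":
        return current & hits
    if operation == "OR":
        return current | hits
    raise ValueError(f"unknown operation: {operation!r}")
```

A two-step query on the toy data would be `refine(TOY_TABLE, search(TOY_TABLE, "kinase"), "AND", "liver")`, which first selects the two kinase entries and then keeps the one whose expression field mentions liver.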

SUPPLEMENTARY MATERIAL

Helpful URL sites for browsing the HUGE database.

Description http://www.kazusa.or.jp/huge/description.html

Keyword search http://www.kazusa.or.jp/huge/keyword/keyword.html

Help for the keyword search http://www.kazusa.or.jp/huge/keyword/help.html

Homology search http://www.kazusa.or.jp/huge/fasta/fasta.html

Bibliography http://www.kazusa.or.jp/biblio/biblio.html

Clone request http://www.kazusa.or.jp/clone.req

ACKNOWLEDGEMENTS

We thank Takatsugu Hirokawa, Seah Boon-Chieng and Shigeki Mitaku for allowing us to use the SOSUI program for prediction of transmembrane helical regions. This work was supported by the Kazusa DNA Research Institute Foundation.

REFERENCES

1. Ohara,O., Nagase,T., Ishikawa,K.-I., Nakajima,D., Ohira,M., Seki,N. and Nomura,N. (1997) DNA Res., 4, 53–59.
2. Suyama,M., Nagase,T. and Ohara,O. (1999) Nucleic Acids Res., 27, 338–339.
3. Hirosawa,M., Isono,K., Hayes,W. and Borodovsky,M. (1997) DNA Seq., 8, 17–29.
4. Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L.L. (1999) Nucleic Acids Res., 27, 260–262. Updated article in this issue: Nucleic Acids Res. (2000), 28, 263–266.
5. Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
6. Roberts,R.J. and Macelis,D. (1999) Nucleic Acids Res., 27, 312–313. Updated article in this issue: Nucleic Acids Res. (2000), 28, 306–307.
7. Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444–2448.
8. Bleasby,A.J., Akrigg,D. and Attwood,T.K. (1994) Nucleic Acids Res., 22, 3574–3577.
9. Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) Nucleic Acids Res., 27, 215–219.
10. Hirokawa,T., Boon-Chieng,S. and Mitaku,S. (1998) Bioinformatics, 14, 378–379.
11. Schuler,G. (1997) J. Mol. Med., 75, 694–698.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
