HOWDY: an integrated database system for human genome research

Mika Hirakawa

Nucleic Acids Res. 2002 Jan 1;30(1):152–157. doi: 10.1093/nar/30.1.152

PMCID: PMC99130 PMID: 11752279

Abstract

HOWDY is an integrated database system for accessing and analyzing human genomic information (http://www-alis.tokyo.jst.go.jp/HOWDY/). HOWDY stores information about relationships between genetic objects and the data extracted from a number of databases. HOWDY consists of an Internet accessible user interface that allows thorough searching of the human genomic databases using the gene symbols and their aliases. It also permits flexible editing of the sequence data. The database can be searched using simple words and the search can be restricted to a specific cytogenetic location. Linear maps displaying markers and genes on contig sequences are available, from which an object can be chosen. Any search starting point identifies all the information matching the query. HOWDY provides a convenient search environment of human genomic data for scientists unsure which database is most appropriate for their search.

INTRODUCTION

Although the sequence of the human genome is almost complete and much of the sequence is publicly available, creating an entire genomic database has been a challenge since the genome project began. The integration of global genomic databases is difficult because much data has already accumulated in various databases with distinct formats and criteria. Some databases contain the latest data, but others are an amalgamation of old and new data in which obsolete entries have been abandoned. For some entries, a gene symbol is assigned, even though that particular gene may already have another official gene name. Homology search results sometimes contain hits to gene sequences with obsolete names. In order to obtain good quality data, it is essential that the researcher screens as many sources of information as possible.

HOWDY is a database system designed to enable exhaustive searching of human genome information automatically. All that is required of the user is a query word or a cytogenetic location and an indication of which database might contain the desired data. HOWDY explores any identical word hit among the data extracted from the human genomic databases defined as the search targets. Moreover, the results can be used to initiate a further search. This particular feature allows the user to follow up leads that might otherwise not have been discovered.

IMPLEMENTATION

HOWDY is accessed via the Internet using a web browser. Although a few features use frames and JavaScript, the main database search and results displays can be accessed by Netscape 2.0 or Internet Explorer 3.0 or above. HOWDY uses Sybase (http://www.sybase.com/), a commercially available database management system. Because the databases that HOWDY interrogates differ in their frequency and method of data release, using Sybase saves time by registering only new data into the HOWDY database and reduces the conflict between daily releases and regular releases. Scripts are implemented using Perl (http://www.perl.com/), in conjunction with CGI modules and GD (http://stein.cshl.org/WWW/software/).

DATA STRUCTURE

HOWDY is based on object oriented modeling, although a relational database system is used for data management and storage. HOWDY deals with two kinds of objects: Database Objects (DBO) and Biological Objects (BO) (Fig. 1). DBO and BO are interfaces with which to access and process the data stored in Sybase. The mirror files are constructed in the server by scripts designed to extract, register and update the data. The HOWDY database saves all the data identified by the search, except for the sequence data.

Figure 1. Overview of the HOWDY database and DBO. Integrated databases are shown in Table 1.

DBO include the 13 types of data that are accessed and searched by HOWDY. These are listed in Table 1.

Table 1. Common properties of HOWDY objects and their origins.

Objects	Source databases	Name property	Title property
Gene	GDB	Name (Primary)	Name (multiple words)
	LocusLink	OFFICIAL│PREFERRED GENE SYMBOL	OFFICIAL│PREFERRED GENE NAME
	HUGO NC	Active gene symbol	Full name
Nucleic Acid Sequence	GenBank	ACCESSION	DEFINITION
	EMBL	AC	DE
	DDBJ	ACCESSION	DEFINITION
RefSeq (NM)	Reference Sequences	ACCESSION	DEFINITION
RefSeq (NP)	Reference Sequences	ACCESSION	DEFINITION
UniGene	UniGene	UniGene Cluster ID	TITLE
dbSNP	dbSNP	NCBI Assay ID	Undefined
GDB (Amplimer)	GDB	Name (Primary)	Name (multiple words)
Contig	Reference Sequences	ACCESSION	Undefined
ContigMap	HumanGenome Sequencing	Map Name	DEFINITION
SWISS-PROT	SWISS-PROT	Entry name	Protein name
OMIM	OMIM	OMIM no.	TITLE
ALIS HGS	ALIS HGS	ALIS HGS ID	Undefined
JSNP	JSNP	JSNP ID	Undefined

DBO use the data from the public databases, which are integrated into HOWDY as items. Each of the items is stored in HOWDY in the same format as the database from which it originated, making daily updates of the data more manageable. For each database item, the major identifier is defined as the name property (shown in Table 1). The descriptive item of the data is defined as the title property and other items which identify the data are defined as the alias property. These properties are common to all HOWDY objects. Any data specific to the object and other useful information are defined as unique properties.
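The common property layout described above can be pictured with a small sketch. The Python below is illustrative only: HOWDY itself is implemented in Perl over Sybase, and the class name `HowdyObject`, its field names and the example values are assumptions made for this illustration, not part of the published system.

```python
from dataclasses import dataclass, field


@dataclass
class HowdyObject:
    """Hypothetical container for the properties shared by all HOWDY objects."""
    object_type: str                                       # a row of Table 1, e.g. "Gene"
    name: str                                              # name property: the major identifier
    title: str = ""                                        # title property: the descriptive item
    aliases: list[str] = field(default_factory=list)       # alias property
    unique: dict[str, str] = field(default_factory=dict)   # object-specific data

    def common_terms(self) -> set[str]:
        """Return every identifier through which this entry can be found."""
        return {self.name, *self.aliases}


# Illustrative values only; not taken from the HOWDY database.
entry = HowdyObject(
    object_type="Gene",
    name="GENE1",
    title="example gene 1",
    aliases=["G1", "EXG1"],
    unique={"cytogenetic_location": "1p36"},
)
print(sorted(entry.common_terms()))
```

Keeping the name, title and alias properties in the same place for every object type is what allows one search routine to serve all 13 DBO, while the `unique` mapping holds whatever is specific to a single source.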

Some database entries have references to other databases which can be used to access related data. The HOWDY database saves these references and the Objects exploit them as cross-link properties to create dynamic links between related information. Such cross-link properties are only stored as object properties; they are not used to generate daily indices storing the binary links between the databases, because building such indices would consume too much time and too many resources.

All the definitions for the Objects are stored in a configuration file, which can be edited or to which further definitions can be added. This allows HOWDY to be quickly updated should new types of genetic data become available or to be adapted for a different species or biological phenomenon.

Operations for the objects are defined as methods using Perl scripts. Common properties are dealt with by the same method except when searching via the dynamic links. When a search is performed and the items retrieved as a search result, any dynamic links are automatically discovered by the method scripts. Each cross-link property contains the item and any search information regarding where the item itself might be found in another object. If the dynamic link method finds items shared between two objects defined by the cross-link properties, a dynamic link is established. At the same time, a reverse link is also searched dynamically. The reverse link is established when an item is found in the cross-link property of another object. In addition, if the cross-link properties are reciprocally kept in two objects, it is a reciprocal link.
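The distinction between links, reverse links and reciprocal links can be made concrete with a short sketch. This is not the HOWDY method code, which is written in Perl; the record layout, the function names and the identifiers below are hypothetical and serve only to illustrate the three cases described above.

```python
# Hypothetical records: (object type, name) -> cross-link property,
# given as {target object type: set of item names}. Values are illustrative.
RECORDS = {
    ("Gene", "GENE1"): {"RefSeq (NM)": {"NM_0001"}, "OMIM": {"100001"}},
    ("RefSeq (NM)", "NM_0001"): {"Gene": {"GENE1"}},
    ("OMIM", "100001"): {},
    ("dbSNP", "ss0001"): {"Gene": {"GENE1"}},
}


def forward_links(key):
    """Items named in this entry's cross-link property that exist in the target."""
    return {
        (target, item)
        for target, items in RECORDS[key].items()
        for item in items
        if (target, item) in RECORDS
    }


def reverse_links(key):
    """Entries whose own cross-link property names this entry."""
    obj_type, name = key
    return {
        other
        for other, xrefs in RECORDS.items()
        if name in xrefs.get(obj_type, set())
    }


def classify_links(key):
    """Label each linked entry as 'link', 'reverse' or 'reciprocal'."""
    fwd, rev = forward_links(key), reverse_links(key)
    labels = {}
    for other in fwd | rev:
        if other in fwd and other in rev:
            labels[other] = "reciprocal"
        elif other in fwd:
            labels[other] = "link"
        else:
            labels[other] = "reverse"
    return labels


for linked, label in sorted(classify_links(("Gene", "GENE1")).items()):
    print(linked, label)
```

In this toy data the RefSeq entry and the gene name each other, so the link is reciprocal; the OMIM entry is reached only from the gene, and the dbSNP entry is found only because it names the gene in its own cross-link property.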

Creating links dynamically is one of the most important features of HOWDY, because the information regarding how the links are discovered is as important as what is actually linked (Table 2). Not all databases contain links to other databases or keep them maintained. Dynamic links, together with their origins, allow the user to find interesting data, often with unexpected results.

Table 2. Link relationships with classes in HOWDY.

Class	Link	Reverse link
Gene	Nucleic Acid Sequence	Nucleic Acid Sequence
	RefSeq (NM)	RefSeq (NM)
	RefSeq (NP)	RefSeq (NP)
	UniGene	UniGene
	GDB (Amplimer)	dbSNP
	OMIM	GDB (Amplimer)
	PubMed	Contig
		SWISS-PROT
		OMIM
		ALIS HGS
		JSNP
Nucleic Acid Sequence	Gene	Gene
	GDB (Amplimer)	RefSeq (NM)
	OMIM	RefSeq (NP)
	SWISS-PROT	UniGene
	PubMed	dbSNP
		GDB (Amplimer)
		Contig
		SWISS-PROT
		OMIM
		ALIS HGS
		JSNP
RefSeq (NM)	Gene	Gene
	Nucleic Acid Sequence	RefSeq (NP)
	RefSeq (NP)
	SWISS-PROT
	OMIM
	PubMed
RefSeq (NP)	Gene	Gene
	Nucleic Acid Sequence	RefSeq (NM)
	RefSeq (NM)
	SWISS-PROT
	OMIM
	PubMed
UniGene	Gene	Gene
	Nucleic Acid Sequence
	SWISS-PROT
	dbSTS
	dbEST
dbSNP	Gene
	Nucleic Acid Sequence
	dbSTS
	dbEST
GDB (Amplimer)	Gene	Gene
	Nucleic Acid Sequence	Nucleic Acid Sequence
		ALIS HGS
Contig	Gene	Contig map
	Nucleic Acid Sequence
	OMIM
	dbSTS
Contig map	Contig
SWISS-PROT	Gene	Nucleic Acid Sequence
	Nucleic Acid Sequence	RefSeq (NM)
	OMIM	RefSeq (NP)
	PubMed	UniGene
	PDB
	PIR
	PROSITE
	MGD
OMIM	Gene	Gene
	MGD	Nucleic Acid Sequence
		RefSeq (NM)
		RefSeq (NP)
		Contig
		SWISS-PROT
ALIS HGS	Gene
	Nucleic Acid Sequence
	GDB (Amplimer)
JSNP	Gene
	Nucleic Acid Sequence

Databases underlined are linked only.

BO are dynamically generated by combining the properties of DBO when they are retrieved. A BO consists of a set of programs that allow it to perform a series of tasks. It can import the entries/data retrieved from the DBO into its own format. It can create dynamic links by using the acquired entries as queries for forward links and as targets of reverse links. It can also sort and merge redundant data. Currently, there are three BO: Gene Class [defined by HUGO nomenclature database (http://www.gene.ucl.ac.uk/nomenclature/), GDB (1) and LocusLink (2)], SNP Class [defined by dbSNP (3) and JSNP (4) (http://snp.ims.u-tokyo.ac.jp/)], and Protein Class [defined by RefSeq (2) and SWISS-PROT (5)].

The Gene Class is created from the gene symbol from HUGO, the primary name of GDB and the official symbol of LocusLink. If at least two of them are identical and considered to be a description of the same gene, then an item of the Gene Class is created (Fig. 2). The properties of BO are the same as DBO and have a definitive order in which to adopt data. The gene symbol, which is the name of the Gene Class, is adopted in the following order of precedence: HUGO symbol, LocusLink official symbol and GDB primary name. The unique properties are chosen from the DBO categories and defined. The cross-link categories in a BO tend towards redundancy because some data are common; therefore, the redundancy is removed before and after the dynamic link search. A BO takes precedence when both a BO and a DBO are found while searching. The advantage of defining the BO is that the most efficient search gains a small amount of the most appropriate information because the user can retrieve data from three databases without redundancy.

Figure 2. HUGO, LocusLink and GDB create Gene as a BO. The gene name is chosen from these name properties in the order shown by the circled numbers. The alias property is made by merging the aliases and removing redundancy.
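The rule for creating a Gene Class item can be sketched as follows. This is a minimal illustration under stated assumptions, not the HOWDY implementation: the function `build_gene`, the (symbol, aliases) input pairs and the example symbols are hypothetical, and the caller is assumed to pass only entries already considered to describe the same gene.

```python
def build_gene(hugo=None, locuslink=None, gdb=None):
    """Return a merged Gene item, or None when fewer than two symbols agree.

    Each argument is a hypothetical (symbol, aliases) pair, or None if the
    source has no entry for the gene.
    """
    # Keep the sources in the order of precedence stated in the text:
    # HUGO symbol, LocusLink official symbol, GDB primary name.
    sources = [s for s in (hugo, locuslink, gdb) if s is not None]
    symbols = [symbol for symbol, _ in sources]

    # A Gene item requires at least two identical symbols.
    if not any(symbols.count(symbol) >= 2 for symbol in symbols):
        return None

    name = symbols[0]  # highest-precedence source that is present

    # Merge aliases, dropping duplicates and the chosen name itself.
    aliases = []
    for symbol, source_aliases in sources:
        for candidate in (symbol, *source_aliases):
            if candidate != name and candidate not in aliases:
                aliases.append(candidate)
    return {"name": name, "aliases": aliases}


# Illustrative symbols only.
gene = build_gene(
    hugo=("ABC1", ["X1"]),
    locuslink=("ABC1", ["X1", "Y2"]),
    gdb=("ABC1L", []),
)
print(gene)
```

Here the GDB primary name differs from the agreed symbol, so it survives as an alias rather than being discarded, which keeps it searchable through the merged item.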

As of July 18, 2001, 11 296 active HUGO symbols, 23 101 GDB names and 35 892 Gene symbols, including aliases in LocusLink, are registered in the HOWDY database (Fig. 3). Genes whose positions are unknown are not included in these figures. There are 39 188 gene symbols that can be targeted by a Gene Class search. This will give the user 10% more hits using a Gene Class search than would be expected, for example, by a search only accessing LocusLink.

Figure 3. Numbers of gene symbols in HUGO, GDB, LocusLink and HOWDY Gene counted by chromosome (July 18, 2001). Except for the HUGO class, numbers of alias symbols are included. The HUGO class includes active gene symbols only.

WEB INTERFACE

Search interface

The search interface for the Internet service provides two kinds of direct search: by keyword and by cytogenetic location.

Searching by keyword allows the user to input free text that might be found in the common properties of each object. The search can be limited by choosing different targets. These targets are shown as options to be checked on the search page and are as follows: Gene [HUGO nomenclature, GDB (Gene) (1) and/or LocusLink (2)], Nucleic Acid Sequence (6–8), RefSeq (NM) (2), RefSeq (NP) (2), UniGene (9), dbSNP (3), GDB (Amplimer) (1), Contig (2), ContigMap (9), SWISS-PROT (5), OMIM (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) (10), ALIS HGS (http://www-alis.tokyo.jst.go.jp/HGS/) and JSNP (4).

The most important factor when searching using HOWDY is the choice of target. When a search is performed, any data matching the keyword is retrieved sequentially by tracing dynamic links amongst the data. Thus, the user should choose the target most likely to have a match to the query in its common categories. For example, if the user wants to search JSNP data by using a GenBank accession number, _Nucleic Acid Sequence_ should be checked and the JSNP data will be retrieved as cross-link data.

The option menu to the left of the text field allows the user to specify the target for the search. For example, if _Name_ is selected from the menu, the query will generate a search among the Name property, as shown in Table 1. Selecting _Keyword_ can be expected to retrieve all data included in the Name and Title property (Table 3).

Table 3. The target of text search specified by field property.

Type	Search field
Primary Name	Name Property
Name	Name Property, ALIAS, OTHER IDs
Keyword	Name Property, ALIAS, OTHER IDs, Title property, KEYWORD
All	All fields (this search takes a long time)
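The widening scope of the four search types in Table 3 can be illustrated with a short sketch. The field names follow Table 3, but the dictionary layout, the function names and the example entry are assumptions made for illustration; whole-word, case-insensitive matching is likewise a simplification.

```python
# Search type -> fields examined, following Table 3 (layout is hypothetical).
SEARCH_FIELDS = {
    "Primary Name": ["name"],
    "Name": ["name", "alias", "other_ids"],
    "Keyword": ["name", "alias", "other_ids", "title", "keyword"],
}


def fields_for(search_type, entry):
    """Return the entry fields examined for a given search type."""
    if search_type == "All":
        return list(entry)  # every field of the entry
    return SEARCH_FIELDS[search_type]


def matches(entry, query, search_type="Keyword"):
    """True if the query equals any word in the selected fields."""
    query = query.lower()
    for field_name in fields_for(search_type, entry):
        for value in entry.get(field_name, []):
            if query in value.lower().split():
                return True
    return False


# Illustrative entry only.
entry = {
    "name": ["GENE1"],
    "alias": ["G1"],
    "other_ids": [],
    "title": ["example kinase 1"],
    "keyword": [],
    "comment": ["mentions phosphatase"],
}
print(matches(entry, "kinase", "Name"), matches(entry, "kinase", "Keyword"))
```

A word that appears only in the title is missed by a _Name_ search but found by a _Keyword_ search, and a word that appears only outside the common properties is found by _All_ alone.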

A wild card (*) search is available, and when more than one word is entered in the text field, entries matching any of the words are returned. The returned results can be further limited by using another query and selecting the _and_ option above the input box or the _!=_ (NOT operator) in the box on the left. A cytogenetic location can be chosen which will restrict the search to a specific band. Text and cytogenetic location cannot be used at the same time.
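The query semantics described above (wild cards, several words combined as alternatives, and refinement with _and_ or _!=_) can be sketched in a few lines. This is an illustration only: the function names and the example titles are hypothetical, and Python's `fnmatchcase` is used as a stand-in for the wild card handling, so it also treats `?` and `[...]` as special characters, which HOWDY may not.

```python
from fnmatch import fnmatchcase


def word_hit(text, pattern):
    """True if any whitespace-separated word of text matches the pattern."""
    pattern = pattern.lower()
    return any(fnmatchcase(word, pattern) for word in text.lower().split())


def search(entries, query):
    """Several query words are alternatives: a hit on any one is enough."""
    patterns = query.split()
    return [e for e in entries if any(word_hit(e, p) for p in patterns)]


def refine(results, query, mode="and"):
    """Narrow earlier results: 'and' keeps further hits, 'not' drops them."""
    hits = search(results, query)
    if mode == "and":
        return hits
    return [e for e in results if e not in hits]


# Illustrative titles only.
titles = [
    "protein kinase C alpha",
    "protein phosphatase 2",
    "zinc finger protein 9",
]
first_pass = search(titles, "kin* zinc")
print(first_pass)
print(refine(search(titles, "protein"), "kinase", mode="not"))
```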

Results page

The returned results are displayed as a list of entries found in the target objects and each set of results can be individually selected. The list page also has another search field that allows the user to narrow the results further or perform a new search.

Choosing the link to a specific result entry takes the user to a page with further details. At the top of this detail page, the name, title and alias data are displayed. Cross-links are found in the tables with hyperlinks to other HOWDY objects and icons with which to access an original database directly. The entries in the link table can be sorted by the reciprocity of links, alphabetical order of names or locations on chromosomes by selecting the interactive title of the appropriate column. Hyperlinks to HOWDY objects (shown in the _Name_ column) further search HOWDY, with that hyperlink word, and these results are displayed in the same format.

Continually following one link after another allows the user to identify relationships between data that have no explicit connection but that are linked indirectly. The results consist of data with at least one word from the query in its targeted properties, which will sometimes identify unexpected connections. The objective of HOWDY is to provide users with opportunities and clues to reach as much relevant data as possible.

Map browser

Chromosome illustrations on the home page represent integrated maps for each chromosome. These maps include NCBI Integrated Radiation Hybrid Map, STSs distributed by NCBI e-PCR files, genes and their positions as predicted by HOWDY by comparing the contig and transcript sequences from NCBI RefSeq. The maps can be used to select GDB amplimers, contigs and genes. These maps are dynamically generated by Perl modules (CGI.pm, GD.pm) from the data stored in the HOWDY database.

The individual chromosome maps displayed on each results page are linked to a list containing the objects that map to the same cytogenetic location. Sequence structure maps are also available from the results pages.

Sequence editor

Another important feature of HOWDY is the acquisition and analysis of sequence data. If the search result contains many sequences or if the description of sequence entries contains inconsistent data and no further information is available, comparison and analysis of the sequences with program tools is an effective way to resolve discrepancies. This sequence editor provides a convenient interface to capture sequence data for further analysis. The entrance to the page is the _Get sequence_ icon, which is at the top of the detailed results page and on the cross-link table of sequence data. When the sequence entries required are in the cross-link table, the target entry must be selected before the sequence page can be accessed.

There are three ways to obtain the sequence: displaying the sequences in FASTA format, saving the sequences to the local computer and editing sequences. Editing sequences allows the user to clip and join sequence data at any point to create a new sequence file. Sequences from different entries can be used. The editor displays a map of any annotated features and identifies the positions of these features. The interface is interactive and if the user checks a particular feature, that sequence is clipped for further use. The clipped sequence can be used directly, joined with other sequences to generate a new sequence or placed in a file.
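Clipping, joining and FASTA output can be sketched as below. The helper names, the use of 1-based inclusive coordinates and the example sequences are assumptions for illustration; the paper does not specify how the editor represents feature positions internally.

```python
def clip(sequence, start, end):
    """Return the subsequence for 1-based, inclusive feature coordinates."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("feature coordinates outside the sequence")
    return sequence[start - 1:end]


def join(*fragments):
    """Concatenate clipped fragments, possibly taken from different entries."""
    return "".join(fragments)


def to_fasta(identifier, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


# Illustrative sequences only.
entry_a = "ATGGCCATTGTAATGGGCCGC"
entry_b = "TTTAAACCCGGG"
new_sequence = join(clip(entry_a, 1, 6), clip(entry_b, 4, 9))
print(to_fasta("new_seq", new_sequence, width=10), end="")
```

The two fragments come from different entries, mirroring the editor's ability to build a new sequence file from clipped features of several records.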

The resulting sequence can be displayed and saved in FASTA or XML format. The FASTA format is the most versatile and is accessible to a large number of biological analysis programs. The XML format is becoming more popular and it is able to store more complex and detailed information. We have developed a prototype tool for sequence annotation and map generation using an XML format file saved from the HOWDY sequence editor.

CONCLUSIONS

The HOWDY database has a simple and flexible structure which makes it easy to add and to change data and databases. The acronym HOWDY is derived from Human Organized Whole genome Database. The data growth from the Human Genome Project is so rapid that biologists often struggle to obtain all of the data in which they are interested. HOWDY can identify as much relevant data as possible, allowing the user to select only what is required and then to compile the data in the most useful way. HOWDY has been available to the public since July 2000. A number of new features have been released as the database evolves, such as the maps and the sequence editor, and it is now accessed 2000 times a month, on average. To date, HOWDY has only been used to access Homo sapiens information. However, this versatile system could be applied to any organism or biological phenomenon.

SUPPLEMENTARY MATERIAL

A table of HOWDY web service images is available as Supplementary Material at NAR Online.

ACKNOWLEDGEMENTS

I would like to thank Katsuji Matsumura and Kensaku Imai for their initial contribution. Thanks also go to Fujitsu Ltd for their programming support and database maintenance. This work is a part of ALIS (Advanced Life Science Information Systems, http://www-alis.tokyo.jst.go.jp/), supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) and its predecessor, the Science and Technology Agency (STA).

REFERENCES

1. Cuticchia,A.J. (2000) Future vision of the GDB human genome database. Hum. Mutat., 15, 62–67.
2. Pruitt,K. and Maglott,D. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 29, 137–140.
3. Sherry,S.T., Ward,M.-H., Kholodov,M., Baker,J., Phan,L., Smigielski,E.M. and Sirotkin,K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.
4. Hirakawa,M., Tanaka,T., Hashimoto,Y., Kuroda,M., Takagi,T. and Nakamura,Y. (2002) JSNP: a database of common gene variations in the Japanese population. Nucleic Acids Res., 30, 158–162.
5. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
6. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2000) GenBank. Nucleic Acids Res., 28, 15–18. Updated article in this issue: Nucleic Acids Res. (2002), 30, 17–20.
7. Stoesser,G., Baker,W., van den Broek,A., Camon,E., Garcia-Pastor,M., Kanz,C., Kulikova,T., Lombard,V., Lopez,R., Parkinson,H., Redaschi,N., Sterk,P., Stoehr,P. and Tuli,M.A. (2001) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 29, 17–21. Updated article in this issue: Nucleic Acids Res. (2002), 30, 21–26.
8. Tateno,Y., Miyazaki,S., Ota,M., Sugawara,H. and Gojobori,T. (2000) DNA Data Bank of Japan (DDBJ) in collaboration with mass sequencing teams. Nucleic Acids Res., 28, 24–26. Updated article in this issue: Nucleic Acids Res. (2002), 30, 27–30.
9. Wheeler,D.L., Deanna,M., Church,D.M., Lash,A.E., Leipe,D.D., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Tatusova,T.A., Wagner,L. and Rapp,B.A. (2001) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 29, 11–16. Updated article in this issue: Nucleic Acids Res. (2002), 30, 13–16.
10. McKusick,V.A. (1998) Mendelian Inheritance in Man. Catalogs of Human Genes and Genetic Disorders, 12th edn. Johns Hopkins University Press, Baltimore, MD.
