Nucleic Acids Research. 2000 Jan 1;28(1):120–122. doi: 10.1093/nar/28.1.120

XREFdb: cross-referencing the genetics and genes of mammals and model organisms

R Ploger, J Zhang, D Bassett, R Reeves, P Hieter, M Boguski, F Spencer
PMCID: PMC102470  PMID: 10592198

Abstract

XREFdb supports the investigation of protein function in the context of information available through work in multiple organisms. In addition to facilitating the association of functional data among known genes from multiple organisms, XREFdb has developed strategies that provide access to information associated with as-yet unstudied genes. The database organizes protein similarity and genetic map positional information from diverse sources in the public domain to facilitate investigator evaluation of potential functional significance. XREFdb is found at URL www.ncbi.nlm.nih.gov/XREFdb

INTRODUCTION

Many useful bioinformatics tools have been developed that provide summaries and linkages between protein sequences found in the repertoires of divergent organisms. These tools provide a framework for rationally designed assays to determine protein function, for evaluation of conservation and divergence, and for exploiting the strengths of different experimental systems to better understand physiological pathways. The wealth of new protein information emerging from the increasing breadth of publicly available sequence data allows investigators to take advantage of an increasing number of experimental systems. The development of genome maps densely populated with mapped gene sequences provides a ready supply of candidate determinants of mapped phenotypes. Implementation of strategies for organization of these data will determine the efficiency with which investigators are able to navigate among the varied and rich sources of data available to them.

XREFdb currently offers Web-interactive tools that provide three distinct strategies for identifying functional information associated with protein similarity and map position. These are (i) a similarity search service that reports predicted protein alignments, mammalian map position, and associated phenotype data, (ii) a summary table relating potential budding yeast/human orthologues with linkage to map and protein function information, and (iii) a mouse map revealing the locations of gene-related nucleic acid similarity throughout the mouse genome. This map contains numerous gene locations that lack direct functional annotation at the present time, but which exhibit highly significant similarity to budding yeast genes at the protein level. Current international efforts to provide genome-wide functional annotation on the budding yeast protein repertoire, as well as a vigorous experimental community dedicated to understanding the biology of this simple eukaryote, hold promise for the continued accumulation of dense functional annotation.

PROTEIN-BASED SIMILARITY SEARCH

XREFdb provides a TBLASTN (1) search feature that accepts protein query sequences and compares them with a current six-frame translation of the Database of Expressed Sequence Tags (dbEST; 2,3) maintained at the NCBI. In the past, XREFdb has been organized in an accountholder format, with periodic whole database updates (4). However, a new search service within XREFdb is designed to provide users with current information on-the-fly. A search may require several minutes to perform, and thus the option to wait or to receive email notification of search completion is provided. Each search is stored within XREFdb for 30 days from the last date of query, and can be viewed at any time by searching XREFdb for the results set, or by directly accessing the results page using its URL.
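The 30-day retention rule can be illustrated with a short sketch. This is not XREFdb's implementation; the function names are hypothetical, and the choice to treat the 30th day after the last query as the first day on which the results are no longer available is an assumption, since the text does not say whether the boundary day is included.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # retention window stated in the text


def expiry_date(last_query: date) -> date:
    """First date on which a stored search is no longer retained (assumed boundary)."""
    return last_query + timedelta(days=RETENTION_DAYS)


def is_retained(last_query: date, today: date) -> bool:
    """True while fewer than RETENTION_DAYS days have passed since the last query."""
    return today < expiry_date(last_query)


# Each new query of a stored search moves last_query forward, restarting the window.
print(is_retained(date(2000, 1, 1), date(2000, 1, 30)))
print(is_retained(date(2000, 1, 1), date(2000, 1, 31)))
```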

XREFdb protein query match results are reported in a table that contains standard BLAST data (GenBank accession, definition line, BLAST score, e-value and linkage to each alignment) as well as extension data for mammalian EST matches. This includes UniGene (5) cluster membership, which allows a user to quickly evaluate the complexity of the search results on inspection. To facilitate analysis, a display option is provided that limits the results table to matches from a particular organism, or expands the results table to show all information stored. In addition, map data are given in columns adjoining the UniGene cluster identifier, providing a tabular summary of the distribution of mapped human, mouse and rat clusters within each genome. Human map data are provided as cytogenetic localizations, in order to allow linkage to phenotype descriptions available in the Online Mendelian Inheritance in Man database (OMIM; 6). Where cytogenetic data are not extracted directly from UniGene records, a calculated approximate cytogenetic location is obtained by XREFdb using neighboring records in NCBI's GeneMap '99 (7) that are associated with experimentally determined cytological location.
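As a rough illustration of how cluster membership summarizes the complexity of a result set, the sketch below groups EST matches by UniGene cluster and optionally restricts the view to one organism. The row layout, the function name, and all identifiers (EST_A, Hs.0001 and so on) are placeholders invented for this example, not real records or the actual XREFdb table format.

```python
from collections import defaultdict


def clusters_by_organism(rows, organism=None):
    """Group EST accessions by (organism, UniGene cluster).

    rows: iterable of (accession, organism, cluster) tuples, where cluster is
    None for matches that carry no UniGene assignment.
    organism: if given, only rows from that organism are kept.
    """
    grouped = defaultdict(list)
    for accession, org, cluster in rows:
        if organism is not None and org != organism:
            continue
        key = cluster if cluster is not None else "unclustered"
        grouped[(org, key)].append(accession)
    return dict(grouped)


# Placeholder rows; none of these identifiers refer to real database entries.
rows = [
    ("EST_A", "human", "Hs.0001"),
    ("EST_B", "human", "Hs.0001"),
    ("EST_C", "human", "Hs.0002"),
    ("EST_D", "mouse", "Mm.0001"),
    ("EST_E", "zebrafish", None),
]

human = clusters_by_organism(rows, organism="human")
print(len(human), "human clusters from",
      sum(len(v) for v in human.values()), "EST matches")
```

Three human EST matches that collapse into two clusters suggest two candidate genes rather than three, which is the kind of at-a-glance evaluation the cluster column is meant to support.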

UniGene cluster identification numbers are frequently updated by the NCBI, and form a critical link to the map and phenotype data provided by XREFdb. In order to ensure that a search result contains all possible current information, the UniGene linkages are generated anew each time a search is run, as are dependent data such as map position. Occasionally, a UniGene link may fail to call up a UniGene record during the 30-day storage period for an XREFdb results table. In that case, the search can simply be re-run to repopulate the results table with current data. Some search result rows will contain a LocusLink identifier, providing access to an information summary page maintained by the NCBI that organizes many fact sources for well-annotated genes. LocusLink identifiers remain stable with time.

The extended search report provided by XREFdb is useful in several experimental scenarios. dbEST, the target of XREFdb searches, contains a wealth of sequence representing unstudied as well as studied genes. The associated information for mammalian matches allows the results table to be rapidly evaluated for the complexity of the gene spectrum returned, for the presence of map positions of interest, and for candidate orphan phenotypes mapping in regions coincident with novel human genes. The presence of non-mammalian ESTs presents a method for identification of similar proteins from any of the organisms represented in dbEST, and is a resource within the results table that can be expanded upon in the future.

YEAST/HUMAN SUMMARY

XREFdb also maintains a table containing search results from a batch TBLASTN query consisting of all predicted budding yeast proteins versus a six-frame translation of a human-only subset of dbEST. It is organized to present only the most significant human EST match returned. This match is evaluated for its power in the reverse direction; i.e. a six-frame translation of each human EST is used as a BLASTX query versus a database containing all predicted yeast proteins. This analysis is a simplified application of the logic proposed in a study by Tatusov et al. (8) in which clusters of orthologous groups (COGs) were defined using data from many organisms. If protein-based searches in both directions match the same pair of records in the most significant alignment, the match is defined as symmetrical. Presence of match symmetry strongly suggests an orthologous relationship between the matched pair, with the proviso that the human protein repertoire is still incompletely represented in dbEST, which leaves open the possibility that a more similar human protein could emerge in the future. The search results presented in the table are updated periodically, with date of last search provided for reference.
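The symmetry test described above amounts to a reciprocal best-hit check, which can be sketched as follows. This is a minimal illustration, not the XREFdb pipeline: it assumes that "most significant" means the smallest e-value, that BLAST output has already been parsed into dictionaries, and the identifiers (ORF_A, EST_1 and so on) are placeholders.

```python
def best_hit(hits):
    """Return the subject ID with the smallest e-value, or None if there are no hits.

    hits: list of (subject_id, e_value) tuples. Ties go to the first one listed.
    """
    if not hits:
        return None
    return min(hits, key=lambda h: h[1])[0]


def symmetrical_matches(forward, reverse):
    """Find pairs that are each other's most significant match.

    forward: yeast ORF -> list of (human EST, e-value), e.g. from TBLASTN
    reverse: human EST -> list of (yeast ORF, e-value), e.g. from BLASTX
    """
    pairs = set()
    for orf, hits in forward.items():
        est = best_hit(hits)
        if est is not None and best_hit(reverse.get(est, [])) == orf:
            pairs.add((orf, est))
    return pairs


# Placeholder search results (hypothetical identifiers and e-values).
forward = {
    "ORF_A": [("EST_1", 1e-40), ("EST_2", 1e-10)],
    "ORF_B": [("EST_1", 1e-20)],
}
reverse = {
    "EST_1": [("ORF_A", 1e-38), ("ORF_B", 1e-18)],
    "EST_2": [("ORF_A", 1e-9)],
}

print(symmetrical_matches(forward, reverse))
```

Here ORF_B's best human match is EST_1, but EST_1's best yeast match is ORF_A, so only the ORF_A/EST_1 pair is reported as symmetrical.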

The Yeast/Human summary table may be queried by formal yeast ORF name, obtainable via a link to the Saccharomyces Genome Database (SGD; 9) or the Yeast Proteome Database (YPD; 10) if necessary. A single ORF name query returns a summary of the stored match, which includes links to NCBI’s UniGene record representing the human gene, and to phenotype information associated with the yeast protein as curated within YPD. The UniGene identifiers in the table are automatically updated to keep pace with any changes that may occur. If desired, the entire results set can be downloaded as a tab-delimited text file.

XREF MOUSE MAP

We have also developed a map of the mouse genome currently containing 571 loci from 378 mammalian cDNA probes. The map was derived using polymorphic restriction fragment length variants (RFLVs) in Southern blot data, using the Jackson Laboratory BSS backcross panel (11) DNAs for placement in the mouse map. To identify related loci, hybridizations and washes were performed at moderate stringency.

The choice of a Southern blot strategy also accommodates the use of probes identified from a human, mouse, rat subset of dbEST. The probes represent a highly conserved set of proteins, most containing EST sequences with highly significant similarity (p-value < 1e-15) to budding yeast genes. Southern blot mapping has the potential to reveal all hybridizing locations in the genome for each probe, as long as the parental mouse maps contain polymorphic restriction fragments representing each locus. In practice, 53% of probes revealed one or more strongly hybridizing bands that were non-polymorphic. This number overestimates unmapped loci, as many of the non-polymorphic bands will derive from loci successfully mapped by adjacent polymorphic fragments. The mean number of loci obtained per probe in the project is 1.5, with many probes (27%) identifying more than one mouse locus.

To provide additional information on the mouse map, each probe was simultaneously hybridized against the NIGMS somatic cell hybrid mapping panel (No. 2, versions 2 or 3) from Coriell Cell Institute, which comprises 24 cell lines, each containing a unique human chromosome on a mouse or hamster background. The presence of a hybridizing human sequence on a particular chromosome was recognized as the appearance of a human-specific band in a given cell line. The mouse and human map data are summarized in a table organized by mouse locus. The table also displays columns listing the budding yeast protein most similar to the mapped mammalian EST, GenBank accession number of the EST sequence that identified the probe used for mapping, and associated human chromosomal map data. The XREF Synteny Map table may be efficiently searched using browser text ‘Find’ functions. The map data may be viewed as single mouse chromosomes, or as a complete genome-wide table.
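The chromosome assignment from a monochromosomal hybrid panel reduces to a lookup, sketched below. The cell line names (line_1 and so on) are placeholders rather than actual panel identifiers, and the function name is hypothetical; the only facts taken from the text are that the panel has 24 lines and that each carries a unique human chromosome.

```python
# The 24 human chromosomes, one per cell line in the panel.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]

# Placeholder panel: hypothetical cell line name -> human chromosome it carries.
panel = {f"line_{c}": c for c in CHROMOSOMES}


def assign_human_chromosomes(panel, positive_lines):
    """Return the human chromosomes implied by a set of positive cell lines.

    positive_lines: cell lines in which a human-specific band was observed.
    Raises KeyError if a line is not part of the panel.
    """
    unknown = set(positive_lines) - set(panel)
    if unknown:
        raise KeyError(f"cell lines not in panel: {sorted(unknown)}")
    found = {panel[line] for line in positive_lines}
    return sorted(found, key=CHROMOSOMES.index)


# A probe giving human-specific bands in three lines maps to three chromosomes.
print(assign_human_chromosomes(panel, ["line_X", "line_2", "line_10"]))
```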

The simultaneous generation of human and mouse data allowed the development of a mouse/human synteny view, with information enrichment for both mouse and human resources. Each mouse locus identifier is actively linked to a page containing a summary of map information associated with the probe. For example, a single probe derived from the full cloned insert of the cDNA containing GenBank accession T87073 yielded two mouse loci, D9Xrf346 and D1Xrf347. The same probe also hybridized a sequence contained on human chromosome 1. Mouse chromosome 1 contains a region of conserved synteny with human chromosome 1 in a region surrounding mouse D1Xrf347, strongly suggesting the presence of a human gene locus at approximately 1q41–42. Conversely, this conservation of synteny between human and mouse suggests that the mouse locus D1Xrf347 is likely to be expressed. D9Xrf346 is interpreted to be a likely pseudogene or an expressed family member that maps outside of known conservation of synteny groups.
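The reasoning in this example can be sketched in code under strong simplifying assumptions. The sketch compares whole chromosomes, whereas the argument above depends on conserved synteny within a specific region; the synteny table contains only the mouse chromosome 1/human chromosome 1 pairing stated in the text, with all other entries deliberately left out rather than guessed. The function names are hypothetical, and the locus pattern assumes names of the form D, mouse chromosome, Xrf, serial number, as in the two loci cited.

```python
import re

# Assumed locus name layout: D + mouse chromosome + "Xrf" + serial number.
LOCUS_PATTERN = re.compile(r"^D(\d{1,2}|X|Y)Xrf(\d+)$")


def mouse_chromosome(locus):
    """Extract the mouse chromosome from a locus name such as D1Xrf347."""
    match = LOCUS_PATTERN.match(locus)
    if match is None:
        raise ValueError(f"unrecognized locus name: {locus}")
    return match.group(1)


def supported_by_synteny(mouse_loci, human_hits, synteny):
    """Flag mouse loci whose chromosome has recorded synteny with a human hit.

    human_hits: human chromosomes carrying a hybridizing sequence.
    synteny: mouse chromosome -> set of human chromosomes (chromosome-level
    simplification; a real analysis would compare mapped intervals).
    """
    hits = set(human_hits)
    return {
        locus: bool(synteny.get(mouse_chromosome(locus), set()) & hits)
        for locus in mouse_loci
    }


# Only the pairing stated in the text is recorded here.
synteny = {"1": {"1"}}
result = supported_by_synteny(["D9Xrf346", "D1Xrf347"], {"1"}, synteny)
print(result)
```

A locus flagged False is not necessarily a pseudogene; as in the text, it may equally be an expressed family member lying outside known conserved synteny groups.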

The XREF synteny map identifies the positions of highly conserved mouse sequences, from several perspectives. The probes were chosen for significance of their predicted protein alignments with budding yeast gene products. Moreover, the inclusive mapping strategy employed reveals the locations of multiple related genomic sites hybridized by an expressed nucleic acid sequence. Finally, both mouse and human hybridization targets are related in this analysis, enriching the annotation on the mouse locus and providing a location for provisional human gene loci based on inference from the known (and confirmed) mouse/human synteny map.

CONCLUSION

XREFdb seeks to provide efficient summary views cross-referencing the genes and genetics of model organisms and mammals. The wealth of new databanks of genomic proportions has created a challenging environment in which to identify the most useful relationships among the many possible. Organization of protein similarity data with the map positions of genes and phenotypes facilitates evaluation of these relationships and their biological significance.

ACKNOWLEDGEMENTS

We gratefully acknowledge grant support from the NHGRI (HG00971) to F.S., M.B., R.R. and P.H., and thank Greg Schuler and members of the NCBI for advice and data tables.

REFERENCES

1. Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
2. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) Nature Genet., 4, 332–333.
3. Boguski,M.S. (1995) Trends Biochem. Sci., 20, 295–296.
4. Bassett,D.E.,Jr, Boguski,M.S., Spencer,F., Reeves,R., Kim,S., Weaver,T. and Hieter,P. (1997) Nature Genet., 15, 339–344.
5. Schuler,G. (1997) J. Mol. Med., 75, 694–698.
6. Online Mendelian Inheritance in Man, OMIM (TM). Center for Medical Genetics, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), 1999.
7. Deloukas,P., Schuler,G.D., Gyapay,G., Beasley,E.M., Soderlund,C., Rodriguez-Tome,P., Hui,L., Matise,T.C., McKusick,K.B., Beckman,J.S. et al. (1998) Science, 282, 744–746.
8. Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) Science, 278, 631–637.
9. Chervitz,S.A., Hester,E.T., Ball,C.A., Kolinski,K., Dwight,S.S., Harris,M.A., Juvik,G., Malekian,A., Roberts,S., Roe,T. et al. (1999) Nucleic Acids Res., 27, 74–78. Updated article in this issue: Nucleic Acids Res. (2000), 28, 77–80.
10. Hodges,P.E., McKee,A.H.Z., Davis,B.P., Payne,W.E. and Garrels,J.I. (1999) Nucleic Acids Res., 27, 69–73. Updated article in this issue: Nucleic Acids Res. (2000), 28, 73–76.
11. Rowe,L.B., Nadeau,J.H., Turner,R., Frankel,W.N., Letts,V.A., Eppig,J.T., Ko,M.S., Thurston,S.J. and Birkenmeier,E.H. (1994) Mamm. Genome, 5, 253–274.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
