Abstract
The Mouse Genome Database (MGD) is the community database resource for the laboratory mouse, a key model organism for interpreting the human genome and for understanding human biology and disease (http://www.informatics.jax.org). MGD strives to provide a highly curated, highly integrated information resource that not only includes the consensus view of current knowledge about the mouse, but also provides comparative genomic information particularly for human and rat genomes. MGD includes extensive information about mouse genes, supporting all gene attribute assertions with experimental data, statements of evidence and citation. Detailed information about alleles and mouse mutants includes genotype, molecular variant and phenotype descriptions. Extensive collaboration with other data providers such as NCBI, RIKEN and SWISS-PROT provides standardization of gene:sequence associations and robust interconnections between large information systems based on shared sequence curation. Recent integration of large datasets of mouse full-length cDNAs and radiation-hybrid mapped ESTs, the continued development and use of extensive structured vocabularies and the expansion of the representation of phenotypes highlight this year’s developments.
INTRODUCTION
The Mouse Genome Database (MGD; http://www.informatics.jax.org) is the public community resource representing the genetics, genomics and phenotypes of the laboratory mouse. MGD focuses on an integrated representation of genotype (sequence) to phenotype information including highly curated information about genes and gene products (1) (Table 1). Primary foci of integration are through representations of relationships between genes, sequences and phenotypes. The annotation pipeline includes extensive curation of the scientific literature. All annotations in MGD are supported with experimental evidence and citations. MGD provides official nomenclature for mouse genes and works closely with human and rat genome annotation groups to curate relationships between these genomes and to standardize the representation of genes and gene families.
Table 1. Snapshot of data content in MGD: September 15, 2001.
MGD data statistics | September 2001 |
---|---|
Number of references | 67 371 |
Number of genes (a) | 35 190 |
Number of markers (including genes) | 53 859 |
Number of genes with sequence data | 31 095 |
Number of markers mapped | 36 601 |
Number of mouse/human curated orthology reports | 6123 |
Number of genes with links to SWISS-PROT | 12 636 |
Number of genes with GO annotations | 5927 |
Number of genes with annotated alleles | 1688 |
Number of annotated alleles | 3330 |
Number of mouse nucleotide sequences curated in MGI system (includes ESTs) | 537 481 |
(a) See text for caveats on this number.
MGD provides information about alleles and targeted mutations, homology data for mammalian orthologs and detailed mapping data at both the gene and genomic levels. Extensive curation of sequence to gene associations provides the fundamental dataset against which computational annotation systems are calibrated and tested. MGD provides graphical views of the comparative genomic data from the gene, chromosomal region, genome or species perspective. Extensive experimental mapping data, including genetic and physical maps, are available; these include data that conflict with current consensus map positions.
MGD is part of the Mouse Genome Informatics (MGI) project effort based at The Jackson Laboratory (http://www.jax.org) and collaborates closely with the Gene Expression Database (GXD) (2), the Mouse Genome Sequencing (MGS) project and the Mouse Tumor Biology (MTB) database (3) to provide an integrated information resource for the laboratory mouse. MGD is a founding member of the Gene Ontology (GO) consortium (4) and contributes particularly to the development of mammalian components of the GO vocabularies. MGD curators collaborate extensively with SWISS-PROT (5) and with the LocusLink project at NCBI (6) to evaluate and update mouse gene:sequence associations.
IMPROVEMENTS DURING 2001
Expanded allele and mutant phenotype data
For each allele or mutant, MGD provides data describing the type of allele (e.g. ENU-induced, transgene targeted), the mouse strain on which it arose, the phenotypic characteristics of homozygous and heterozygous carriers and its relationship with human genes and disease. New data available include molecular details of the allelic change, ES cell lines and cell line strain (for targeted alleles), promoter details (where relevant) and expanded citations. In addition, phenotypic alleles are linked to GXD whenever expression data are available for mice carrying the allele.
Controlled vocabularies for phenotypes
MGD continues to develop and implement controlled and structured vocabularies to standardize the annotation of information for mouse genes and genomic features. A recent effort to develop and use standard vocabularies for describing mouse normal and mutant phenotypes will improve searching, classifying and analyzing phenotype data. The current structured vocabulary consists of several thousand terms, each associated with precise definitions and a citation as a source of the information. Terms are organized hierarchically, from general to specific, allowing annotation to reflect the state of knowledge about particular mutants (e.g. a newly discovered mutation may be observed to have a hearing defect; a better studied mutation may assign the hearing defect to degeneration of the organ of Corti). Phenotype vocabulary terms are being used to annotate the phenotypes of mice carrying heterozygous or homozygous mutant alleles on particular genetic backgrounds and these data are presented as part of the MGD allele and mutant phenotype reports.
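The benefit of a general-to-specific hierarchy can be illustrated with a minimal sketch. It is not MGD's implementation: the term names, allele labels and the single-parent structure are hypothetical and chosen only to mirror the hearing-defect example above. A query on a general term retrieves alleles annotated to that term or to any more specific term beneath it.

```python
# Hypothetical parent links (child -> parent). Term names are illustrative
# only and are not taken from the actual MGD phenotype vocabulary.
PARENT = {
    "hearing defect": "sensory phenotype",
    "organ of Corti degeneration": "hearing defect",
    "vision defect": "sensory phenotype",
}

# Hypothetical annotations: allele label -> most specific term known so far.
ANNOTATIONS = {
    "allele-A": "hearing defect",               # newly discovered mutation
    "allele-B": "organ of Corti degeneration",  # better-studied mutation
    "allele-C": "vision defect",
}


def ancestors(term):
    """Return the term followed by all of its ancestors, most specific first."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def alleles_for(query_term):
    """Find alleles annotated to query_term or to any more specific term."""
    return sorted(
        allele
        for allele, term in ANNOTATIONS.items()
        if query_term in ancestors(term)
    )


print(alleles_for("hearing defect"))
```

In this sketch each allele is annotated at the depth that current knowledge supports, and no re-annotation of well-studied alleles is needed for them to be found by a broader query.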
PhenoSlim. ‘PhenoSlim’, a subset comprising the broadest, high-level terms of the full phenotype vocabulary, includes approximately 100 terms and is being used to develop the initial phenotype query capability in MGD.
Curated orthology assertions and gene family summaries
MGD provides gene family pages that summarize information about mouse, human and rat orthologs. Each summary report includes official gene symbols, a representative sequence for each gene in each species and links to MGD gene reports, human LocusLink records and, in the near future, links to the Rat Genome Database (7) gene detail pages. An example of the claudin gene family pages can be viewed at http://www.informatics.jax.org/mgihome/nomen/genefamilies/claudin.shtml. These curated representations of gene families incorporate the combined evaluations of mouse, HUGO and rat scientific curators with the input of the scientific research community to evaluate and clarify the gene family relationships. These collaborative efforts are often initiated by experts in the research community and often result in nomenclature revisions to reflect new understanding about the members of the gene family.
NEW LARGE INTEGRATED (SEQUENCE) DATASETS
MGI integration and release of RIKEN data
In collaboration with the Genome Exploration Research Group of the RIKEN Institute of Physical and Chemical Research in Japan, the MGI group (MGD and GXD) released information regarding 21 076 mouse cDNA clones in February, 2001. These clones were isolated and sequenced at the RIKEN Institute as part of their Mouse Genome Encyclopedia, a genomics project centered on the production of full-length mouse cDNAs (8). The RIKEN Institute hosted a meeting (Functional Annotation of Mouse, FANTOM meeting) to provide first-pass functional annotation of these cDNAs. An overview of these clone annotations appears in the summary publication for this meeting (9).
RIKEN clone annotations are fully integrated into MGI and are accessible through standard query forms. Each clone from this data set is associated with a gene in MGI. At the time of this load, these cDNAs represented a total of 15 294 genes, 12 905 of which are new genes to MGI. New mouse genes discovered in the RIKEN clone set were assigned official nomenclature following a defined syntax. For novel genes represented by multiple RIKEN clones that show sequence overlap (clusters), nomenclature was derived from the clone identifier of one representative clone for each cluster. The exact gene count for this clone set reflects the current state of annotation for these clones and is an overestimate due to some unresolved redundancy. As these new mouse genes become better characterized, revised nomenclature and other biological data are being incorporated.
Mouse T31 radiation hybrid data load
A set of expressed sequences has been assigned to mouse chromosomes using the mouse T31 radiation hybrid (RH) mapping panel (http://www.jax.org/resources/documents/cmdata/rhmap/) and these data have been automatically loaded into MGD. Sequence identifiers from the load that were not previously associated with MGD genes form the basis for new gene designations. This load resulted in the designation of 10 015 new genes. New genes created by the load are assigned to a chromosome from the RH mapping data and a Marker Mapping Experiment record is created in MGD. The load provided chromosomal assignments for over 250 genes that were previously unmapped. MGD curators continue to update relationships between the sequences and genes using a variety of sequence annotation tools. Additional positional information for these RH-mapped markers is available from The Jackson Laboratory Mouse Radiation Hybrid Database web site (http://www.jax.org/resources/documents/cmdata/rhmap/).
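The core decision in such a load can be sketched as follows. This is a simplified illustration rather than the actual MGD load software: the function name, record layout and all identifiers are hypothetical. Each RH-mapped sequence either matches an existing sequence-to-gene association, in which case the gene receives the chromosome assignment, or becomes the basis for a new gene designation.

```python
def partition_rh_records(records, seq_to_gene):
    """Split RH-mapped records into known genes and new gene designations.

    records: iterable of (sequence_id, chromosome) pairs from the RH panel.
    seq_to_gene: dict of existing sequence_id -> gene_id associations.
    Returns (existing, new), both mapping an identifier to a chromosome.
    """
    existing = {}  # gene_id -> chromosome, for genes already in the database
    new = {}       # sequence_id -> chromosome, basis for new gene designations
    for seq_id, chromosome in records:
        gene_id = seq_to_gene.get(seq_id)
        if gene_id is not None:
            existing[gene_id] = chromosome
        else:
            new[seq_id] = chromosome
    return existing, new


# Hypothetical input: three mapped sequences, one already tied to a gene.
rh_records = [("SEQ001", "2"), ("SEQ002", "11"), ("SEQ003", "X")]
known = {"SEQ002": "GENE:0001"}

existing, new = partition_rh_records(rh_records, known)
print(existing)
print(new)
```

A real load would also create the mapping experiment record and handle several sequences that belong to one gene; both are omitted here.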
ADDITIONAL ENHANCEMENTS
Availability of MGI:GO files in various formats
MGI gene-to-GO annotations are updated daily (10). Several files listing MGI genes/markers with their GO associations are publicly available. These files are updated each time MGI submits a new gene association file to the GO web site (http://www.geneontology.org) and can be accessed on the MGI FTP server (ftp://www.informatics.jax.org/pub/informatics/reports/gene_association.mgi). A file of all the GO terms used by MGI in the annotation of genes and gene products is also available.
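For computational users, a downloaded gene association file can be read with a few lines of Python. The sketch below assumes the tab-delimited GO gene association layout, in which lines beginning with "!" are comments and the first nine columns are database, object ID, symbol, qualifier, GO ID, reference, evidence code, with/from and aspect. That column order should be checked against the file actually retrieved. The sample rows are invented placeholders, not real MGI annotations.

```python
import csv
import io
from collections import defaultdict

# Assumed order of the first nine columns of a GO gene association file.
COLUMNS = ["db", "object_id", "symbol", "qualifier", "go_id",
           "reference", "evidence", "with_from", "aspect"]


def read_associations(handle):
    """Yield one dict per annotation line, skipping blank and '!' lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue
        yield dict(zip(COLUMNS, row))


def go_terms_by_symbol(handle):
    """Group GO IDs by gene symbol."""
    grouped = defaultdict(set)
    for record in read_associations(handle):
        grouped[record["symbol"]].add(record["go_id"])
    return dict(grouped)


# Invented placeholder rows; identifiers do not refer to real annotations.
SAMPLE = (
    "!example header comment\n"
    "MGI\tMGI:0000001\tGeneA\t\tGO:0000001\tMGI:MGI:0000010\tIDA\t\tP\n"
    "MGI\tMGI:0000001\tGeneA\t\tGO:0000002\tMGI:MGI:0000011\tISS\t\tF\n"
    "MGI\tMGI:0000002\tGeneB\t\tGO:0000001\tMGI:MGI:0000012\tIMP\t\tP\n"
)

result = go_terms_by_symbol(io.StringIO(SAMPLE))
print(result)
```

To process a real download, pass an open file handle instead of the `io.StringIO` wrapper.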
Enhancements to the allele query form
The allele query form was extended to include query capabilities on allele data reflecting the newly available aspects of molecular and source information. The query form, with links to help and documentation, can be accessed at http://www.informatics.jax.org/searches/allele_form.shtml.
OTHER INFORMATION
User input
MGD encourages user input into its gene and allele annotation efforts. On each gene detail and allele detail page, a clickable button (‘Your Input Welcome’) brings the user to a web-based form for submitting updates to the information being viewed.
Electronic data submission
We encourage contributions of electronic data sets from the scientific community. Any type of data that MGI databases maintain can be submitted as an electronic contribution. Each electronic submission receives a permanent database accession ID and is assigned a citation ID with an abstract if appropriate. For information on submitting data, see http://www.informatics.jax.org/mgihome/submissions/submissions_menu.shtml.
New allele and mutant phenotype submissions form. A new allele and mutant phenotype submissions form is available for researchers to reserve or publish information about new alleles and mutants. This web-based form is accessed at http://www.informatics.jax.org/mgihome/nomen/allmut_form.shtml. Submitters provide the allele designation, information about the molecular mutation, method of allele generation, mode of inheritance, strain and background information, summary of phenotype presentation and information for a citation of the work that credits the contribution to them.
Community outreach and user support
MGD provides extensive user support through online documentation and easy email or phone access to User Support Staff. User Support can be reached via the World Wide Web (http://www.informatics.jax.org/mgihome/support/support.shtml), by email (mgi-help@informatics.jax.org), by telephone (+1 207 288 6445) or by fax (+1 207 288 6132).
The User Support team also manages MGI-LIST, an extremely active list with over 2100 researchers subscribed (http://www.informatics.jax.org/mgihome/lists/lists.shtml).
IMPLEMENTATION
MGD is implemented in the Sybase relational database system, version 11.5.2. The web interface comprises a set of static HTML forms and other supporting documents. A large set of CGI scripts, written in Python, mediate the user’s interaction with the database. For computational users, direct SQL access can be requested through User Support. User-requested special SQL reports and a number of widely used data files (generated daily) are available on the FTP site (ftp://ftp.informatics.jax.org).
CITING MGD
The following citation format is suggested when referring to specific datasets within MGD: Mouse Genome Database (MGD), Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine (http://www.informatics.jax.org). [Include the date (month, year) when you retrieved the data cited.] Please cite this paper when referencing MGD.
ACKNOWLEDGEMENTS
MGD is supported by NIH/NHGRI grant HG00330. GO development and annotation efforts for MGI are supported by NIH/NHGRI grant HG02273.
REFERENCES
1. Blake,J.A., Eppig,J.T., Richardson,J.E., Bult,C.J., Kadin,J.A. and the Mouse Genome Database Group (2001) The Mouse Genome Database (MGD): integration nexus for the laboratory mouse. Nucleic Acids Res., 29, 91–94.
2. Ringwald,M., Eppig,J.T., Begley,D.A., Corradi,J.P., McCright,I.J., Hayamizu,T.F., Hill,D.P., Kadin,J.A. and Richardson,J.E. (2001) The mouse gene expression database. Nucleic Acids Res., 29, 98–101.
3. Bult,C.J., Krupke,D.M., Naf,D., Sundberg,J.P. and Eppig,J.T. (2001) Web-based access to mouse models of human cancers: the mouse tumor biology (MTB) database. Nucleic Acids Res., 29, 95–97.
4. The Gene Ontology Consortium (2001) Creating the Gene Ontology Resource: design and implementation. Genome Res., 11, 1425–1433.
5. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
6. Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 29, 137–140.
7. Twigger,S., Lu,J., Shimoyama,M., Chen,D., Pasko,D., Long,H., Ginster,J., Chen,C.-F., Nigam,R., Kwitek,A. et al. (2002) The Rat Genome Database. Nucleic Acids Res., 30, 125–128.
8. Bono,H., Kasukawa,T., Furuno,M., Hayashizaki,Y. and Okazaki,Y. (2002) FANTOM DB: database of Functional Annotation of RIKEN Mouse cDNA Clones. Nucleic Acids Res., 30, 116–118.
9. The RIKEN Genome Exploration Group Phase II Team and the FANTOM Consortium (2001) Functional annotation of a full-length mouse cDNA collection. Nature, 409, 684–690.
10. Hill,D.P., Davis,A.P., Richardson,J.E., Corradi,J.P., Ringwald,M., Eppig,J.T. and Blake,J.A. (2001) Biological annotation of mammalian systems: implementing gene ontologies in mouse genome informatics. Genomics, 74, 121–128.