Abstract
The Prostate Gene Database (PGDB: http://www.ucsf.edu/pgdb) is a curated and integrated database of genes or genomic loci related to the human prostate and prostatic diseases. Currently, PGDB covers genes involved in a number of molecular and genetic events of the prostate including gene amplification, mutation, gross deletion, methylation, polymorphism, linkage and over-expression, as published in the literature. Genes that are specifically expressed in prostate, as evidenced by analysis of data from expressed sequence tags (ESTs) and serial analysis of gene expression (SAGE), are also included. There are a total of 165 unique entries in the database. Users can either browse or query the PGDB through a web interface. For each gene, in addition to basic gene information and rich cross-references to other databases, inclusive and relevant literature references are provided to support the inclusion of the gene in the database. Detailed expression data calculated from the UniGene and SAGEmap databases are also presented.
INTRODUCTION
The prostate is a male sex gland and a common site of urological disorders. Prostate cancer is the most commonly diagnosed malignancy in Western men and is the second leading cause of cancer death in males in Western countries (1). Benign prostatic hyperplasia (BPH) is the most common benign neoplasm in the aging male population. Symptomatic BPH is found in 69% of men aged 61–70 years (2), while infections and inflammations of the prostate affect relatively younger men. A number of molecular and genetic events, occurring in the process of prostate diseases, such as gene amplification, mutation, methylation, altered expression and polymorphism, involving a large number of genes, have been documented in bibliographic databases such as MEDLINE with thousands of records. The information thus accumulated is critical to our understanding of the molecular mechanisms underlying prostate diseases, and is heavily used by both scientific researchers and clinicians.
Due to the sheer volume of the information and the limitations of the traditional query tools used to retrieve such free-text information, efficient retrieval and digestion of biomedical data have been extremely difficult. For example, a typical question scientists may ask when searching the MEDLINE database is: ‘What genes have been found mutated in human prostate cancer?’ To answer the question, they may search MEDLINE using the query ‘prostate cancer’ AND mutation AND human, which returned 714 records as of July 24, 2002, of which fewer than half are relevant to the question and many are redundant. Another problem hindering efficient retrieval of gene-related information from literature databases is the non-standardized terminology that scientists use for gene names. For example, the CDKN2A gene, commonly known as p16, appears in the literature under many aliases, including ARF, P16, CMM2, INK4, MTS1, TP16, CDK4I, CDKN2, INK4A, p14ARF and p16INK4.
To alleviate some of these problems, a project was initiated to construct an integrated database that catalogs gene-related facts about the prostate, both those accumulated in the biomedical literature during the past thirty years and those that may appear in the future, and that delivers filtered and highly relevant data to both scientists and clinicians. Others have built databases of genes specific to the prostate, such as the Prostate Expression Database (PEDB), which stores curated expressed sequence tags (ESTs) produced from cDNA libraries derived from the prostate (3) and is a valuable resource for understanding the transcriptome of the prostate. So far, however, no attempt has been made to curate and catalog bibliographic knowledge about genes involved in normal and diseased prostate.
OVERVIEW OF THE DATABASE
The Prostate Gene Database (PGDB) stores factual data about genes related to the human prostate and prostatic diseases, supported by literature references. A gene is included in PGDB if it satisfies one of two criteria. First, the gene must have been reported in the literature, in normal or diseased prostate, to be involved in one of the molecular events currently covered: gene amplification, mutation, gross deletion, methylation, polymorphism, linkage (including linkage disequilibrium) and over-expression. Second, genes or UniGene clusters from the UniGene database (4) are included in PGDB if they are exclusively expressed in prostate libraries as shown by EST or serial analysis of gene expression (SAGE) (5) experiments.
DATA SOURCES AND DATA CURATIONS
PGDB uses data from the following databases.
MEDLINE citation database through PubMed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
UniGene at http://www.ncbi.nlm.nih.gov/UniGene/
SAGEmap at http://www.ncbi.nlm.nih.gov/SAGE/
dbSNP at http://www.ncbi.nlm.nih.gov/SNP/
LocusLink at http://www.ncbi.nlm.nih.gov/LocusLink/
Gene Ontology at http://www.geneontology.org
NCBI's Gene Expression Omnibus (GEO) at http://www.ncbi.nlm.nih.gov/geo/
Bibliographic abstracts were retrieved from MEDLINE using queries (6) configured for each molecular or genetic event except for gene over-expression. A typical query consists of three key words: ‘prostate’, ‘human’ and the word for the event. For example, the query for the event of gene mutation in prostate was [(‘prostate’[MeSH Terms] OR prostate[Text Word]) AND (‘mutation’[MeSH Terms] OR mutation[Text Word])]. Separate queries were used for each molecular event because this allowed each query to be assigned as a task to a trained scientist whose expertise overlaps the assigned molecular events. In addition, separate queries are easier to build and are likely to give better coverage than complex queries. MEDLINE records that are of review type or without abstracts were excluded. Each MEDLINE abstract returned by a query was first carefully examined by the assigned scientist to identify genes involved in any of the pre-defined molecular or genetic events in the prostate, and then reviewed by a second scientist to ensure accuracy. Genes involved in the molecular event of over-expression were retrieved from the OMIM database (7) using ‘prostate’ as the query word. Gene names, disease names, the types of molecular events in which the genes were involved and the tracking numbers of the abstracts were recorded by the curators. Once a non-redundant list of genes had been extracted from the literature data, further curation was performed automatically using Perl scripts developed in our laboratory to extract additional information from other databases, generate cross-references and analyze expression data as stated below.
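The per-event query pattern quoted above can be sketched as a small helper. This is only an illustration in Python, not the scripts used for PGDB: the function name `event_query` and the list `SINGLE_WORD_EVENTS` are hypothetical, straight double quotes stand in for the typeset quotation marks, and the sketch reproduces the two-term example exactly as printed, so the ‘human’ key word is not added.

```python
def event_query(event: str, organ: str = "prostate") -> str:
    """Build a PubMed-style query pairing an organ with one molecular event.

    Hypothetical helper: it mirrors the shape of the mutation query quoted
    in the text and is not the authors' actual code.
    """
    def term(word: str) -> str:
        # Match the word either as a MeSH heading or as free text.
        return f'("{word}"[MeSH Terms] OR {word}[Text Word])'

    return f"{term(organ)} AND {term(event)}"


# Illustrative single-word event terms only; the exact event words used by
# the curators are not given in the text.
SINGLE_WORD_EVENTS = ["mutation", "methylation", "polymorphism", "linkage"]

if __name__ == "__main__":
    for event in SINGLE_WORD_EVENTS:
        print(event_query(event))
```

Keeping one query per event, as the text argues, means each generated string can be handed to a different curator without any change to the others.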
EXPRESSION ANALYSIS
For each gene collected in PGDB, levels of expression were analyzed using both SAGE and EST data and pooled by tissue type. We retrieved raw expression data and library information from the UniGene and SAGEmap databases. For expression derived from ESTs, the number of ESTs for each gene in each library was first normalized to the number of ESTs per million and then pooled by tissue to obtain the average level of expression in each tissue. When calculating expression from SAGE data, only tag-to-gene mappings defined as reliable by the SAGEmap database (5) were used. For each gene, the tag frequency in each library was also normalized to the number of tags per million. To deal with the problem of multiple tag assignments, if one SAGE tag was mapped to n genes, the tag frequency for each gene in each library was divided by n. If one gene had more than one tag mapped to it, then the tag frequency for the gene was the sum of the frequencies of all its tags. By analysis of the expression data, a list of prostate-specifically expressed genes was also generated and was defined as follows: for EST expression, a UniGene cluster must have at least five member ESTs, all of which were derived from prostate libraries; for SAGE expression, a gene to be defined as prostate-specific must have a tag count of 5, all of which were derived from prostate libraries.
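The normalization and pooling rules described above can be illustrated with a short Python sketch. It is not the laboratory's Perl code: all function names and input shapes are hypothetical, libraries with no ESTs for the gene are assumed to count as zero when averaging, and the threshold of five is treated as a minimum for both EST and SAGE counts, which is an assumption for the SAGE criterion.

```python
from collections import defaultdict


def per_million(count: float, library_total: int) -> float:
    """Normalize a raw count to counts per million for one library."""
    return count * 1_000_000 / library_total


def est_expression_by_tissue(gene_counts, library_sizes, library_tissue):
    """Average ESTs per million for one gene, pooled by tissue.

    gene_counts:    {library: ESTs observed for the gene}
    library_sizes:  {library: total ESTs in the library}
    library_tissue: {library: tissue name}
    Assumption: libraries without ESTs for the gene contribute zero.
    """
    pooled = defaultdict(list)
    for library, total in library_sizes.items():
        value = per_million(gene_counts.get(library, 0), total)
        pooled[library_tissue[library]].append(value)
    return {tissue: sum(v) / len(v) for tissue, v in pooled.items()}


def sage_gene_frequency(gene, tag_counts, tag_to_genes, library_total):
    """Tags per million for one gene in one SAGE library.

    A tag mapped to n genes contributes 1/n of its frequency to each gene;
    a gene with several tags receives the sum over its tags.
    """
    total = 0.0
    for tag, genes in tag_to_genes.items():
        if gene in genes:
            freq = per_million(tag_counts.get(tag, 0), library_total)
            total += freq / len(genes)
    return total


def is_prostate_specific(counts_by_tissue, minimum=5):
    """True if at least `minimum` counts exist and all come from prostate."""
    total = sum(counts_by_tissue.values())
    return total >= minimum and counts_by_tissue.get("prostate", 0) == total
```

For example, a tag counted 4 times in a library of 100 000 tags and mapped to two genes contributes 20 tags per million to each of them.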
DATABASE ACCESS
The PGDB is freely accessible at the URL http://www.ucsf.edu/pgdb. Users can query the database using gene names, symbols, aliases and identification numbers such as LocusLink ID or UniGene ID, or users can browse the database by categories. Genes in the database have been listed by molecular events and diseases with the number of genes in each category displayed. The PGDB also provides a number of cross-references to other databases, such as LocusLink, UniGene, Gene Ontology and PubMed.
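Querying by symbol or alias amounts to resolving a user-supplied name to an official gene symbol. The following Python sketch shows one way this could work, using the CDKN2A aliases quoted in the Introduction as sample data. It does not describe how the PGDB web interface is implemented: `SAMPLE_RECORDS`, `build_alias_index` and `resolve` are hypothetical names, and case-insensitive matching is an assumption.

```python
# Sample data only: the CDKN2A aliases listed in the Introduction.
SAMPLE_RECORDS = {
    "CDKN2A": [
        "ARF", "P16", "CMM2", "INK4", "MTS1", "TP16",
        "CDK4I", "CDKN2", "INK4A", "p14ARF", "p16INK4",
    ],
}


def build_alias_index(records):
    """Map every upper-cased symbol or alias to the official symbols it names."""
    index = {}
    for symbol, aliases in records.items():
        for name in [symbol, *aliases]:
            # A set is used because one alias may refer to several genes.
            index.setdefault(name.upper(), set()).add(symbol)
    return index


def resolve(name, index):
    """Return the sorted official symbols matching a name, ignoring case."""
    return sorted(index.get(name.strip().upper(), ()))


if __name__ == "__main__":
    alias_index = build_alias_index(SAMPLE_RECORDS)
    print(resolve("p16", alias_index))
    print(resolve("ink4a", alias_index))
```

Returning a list rather than a single symbol keeps ambiguous aliases visible to the user instead of silently choosing one gene.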
CURRENT STATUS AND FUTURE DEVELOPMENTS
We have manually scanned more than 5000 MEDLINE records, dated from 1970 to July 2002, for genes implicated in molecular and genetic events that occurred in normal and diseased prostate. These events currently cover gene amplification, mutation, gross deletion, methylation, polymorphism, linkage (including linkage disequilibrium) and over-expression. The over-expressed genes in PGDB were derived mainly from the OMIM database. Most of them are novel mRNA transcripts identified by different cloning strategies in the prostate and are putative prostate-specific genes. The PGDB currently does not include genes supported solely by evidence of altered expression in the prostate because, unlike for other molecular events, no clear criteria can be used to define the abnormal expression status of genes reported in the literature. For completeness, however, genes that are of high significance to prostatic diseases and have well-documented expression alterations will be selectively added to PGDB in future updates.
The current release (v.1.0) of PGDB contains 165 unique genes (Table 1), of which 129 are supported by evidence from 386 unique MEDLINE records; the rest are supported by SAGE and EST expression data and consist mainly of EST clusters from UniGene. The PGDB will continue to grow in both content and functionality and will be updated every 3 months to include any new data added to the literature and expression databases since the last update. In the next update, we plan to catalog genes whose expression levels, derived from both the SAGE and EST databases, differ significantly between normal and cancerous prostate. In the future, we plan to cover more molecular events such as metabolism and apoptosis. More detailed curation is also planned, to include such details as the position and type of mutations and the genotype frequencies of polymorphisms.
Table 1. Summary of genes included in PGDB.
Category | Number of genes | Supporting references |
---|---|---|
Category by molecular event | | |
Amplification | | |
Prostate cancer | 16 | 45 |
Mutation | | |
BPH [a] | 3 | 5 |
Prostate cancer | 47 | 187 |
Gross deletion | | |
Prostate cancer | 24 | 33 |
Methylation | | |
BPH | 1 | 2 |
Prostate cancer | 27 | 70 |
Polymorphism | | |
BPH | 3 | 3 |
Prostate cancer | 26 | 85 |
Prostatitis | 1 | 3 |
Over-expression | | |
Normal prostate | 3 | 3 |
Prostate cancer | 9 | 10 |
Linkage | | |
Prostate cancer | 5 | 23 |
Others [b] | | |
Normal prostate | 2 | 2 |
Prostate cancer | 8 | 10 |
Category by expression | | |
Prostate-specific gene by EST | 32 | – |
Prostate-specific gene by SAGE | 4 | – |
[a] Benign prostatic hyperplasia.
[b] Prostate-related genes as identified by other methods such as data mining (8).
In summary, the PGDB provides a non-redundant catalog of genes involved in the prostate with inclusive and highly relevant supporting evidence from published literature. Use of the PGDB will result in considerable reduction in the time and effort required for scientists and clinicians to survey the literature on genes and their involvement in prostate diseases.
ACKNOWLEDGEMENTS
This project was supported in part by National Institutes of Health Grants RO1AG-16870, RO1DK47517, REAP award and VA Merit Review (R.D.).
REFERENCES
1. Landis,S.H., Murray,T., Bolden,S. and Wingo,P.A. (1999) Cancer statistics, 1999. CA Cancer J. Clin., 49, 8–31, 1.
2. Guess,H.A., Arrighi,H.M., Metter,E.J. and Fozard,J.L. (1990) Cumulative prevalence of prostatism matches the autopsy prevalence of benign prostatic hyperplasia. Prostate, 17, 241–246.
3. Hawkins,V., Doll,D., Bumgarner,R., Smith,T., Abajian,C., Hood,L. and Nelson,P.S. (1999) PEDB: the Prostate Expression Database. Nucleic Acids Res., 27, 204–208.
4. Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.
5. Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051–1060.
6. Levine,A.E. and Steffen,D.L. (2001) OrCGDB: a database of genes involved in oral cancer. Nucleic Acids Res., 29, 300–302.
7. Hamosh,A., Scott,A.F., Amberger,J., Bocchini,C., Valle,D. and McKusick,V.A. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 30, 52–55.
8. Walker,M.G., Volkmuth,W., Sprinzak,E., Hodgson,D. and Klingler,T. (1999) Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. Genome Res., 9, 1198–1203.