Neoplasia (New York, N.Y.). 1999 Jun;1(2):101–106. doi: 10.1038/sj.neo.7900002

The Cancer Genome Anatomy Project: EST Sequencing and the Genetics of Cancer Progression1

David B Krizman, Lukas Wagner, Alex Lash, Robert L Strausberg, Michael R Emmert-Buck
PMCID: PMC1508126  PMID: 10933042

Abstract

As tumor progression proceeds from the normal cellular state to a preneoplastic condition and finally to the fully invasive form, the molecular characteristics of the cell change as well. These characteristics can be considered a molecular fingerprint of the cell at each stage of progression and, analogous to fingerprinting a criminal, can be used as markers of the progression process. Based on this premise, the Cancer Genome Anatomy Project (CGAP) was initiated with the broad goal of comprehensively characterizing normal, premalignant, and malignant tumor cells at the molecular level, thereby making it possible to identify all major cellular mechanisms leading to tumor initiation and progression ([Strausberg, R.L., Dahl, C.A., and Klausner, R.D. (1997). “New opportunities for uncovering the molecular basis of cancer.” Nat. Genet., 16: 415–516.], www.ncbi.nlm.nih.gov/ncicgap/). Determining the genetic fingerprints of cancer progression is expected to allow for 1) correlation of disease progression with therapeutic outcome; 2) improved evaluation of disease treatment; 3) stimulation of novel approaches to prevention, detection, and therapy; and 4) enhanced diagnostic tools for clinical applications. Whereas a comprehensive molecular analysis of cancer progression may take years, the initial, short-term goals are already being realized and are proving very fruitful.

Initial Goals of the Cancer Genome Anatomy Project

The first of the initial CGAP goals is to establish a Tumor Gene Index (TGI) to serve as a catalogue of all genes expressed in the cancer progression process, with special reference to tumor type and stage of progression (http://www.ncbi.nlm.nih.gov/ncicgap/EST/cgaplb.cgi; [1]). The TGI is being established by constructing cDNA libraries from pathological tissue, followed by high-throughput library sequencing. This general approach was used successfully to develop the Expressed Sequence Tag database (dbEST) by the National Center for Biotechnology Information (NCBI) and the I.M.A.G.E. consortium ([2,3]; http://www.ncbi.nlm.nih.gov/dbEST/index.html). The TGI uses the existing infrastructure of dbEST and amounts to a catalogue of all the genes that are expressed across the entire spectrum of cancer progression, with special attention to prostate, breast, ovarian, lung, and colon cancers. A secondary goal in developing the TGI was to identify the remaining members of the unique human gene set, represented by the UniGene set of genes ([4,5]; http://www.ncbi.nlm.nih.gov/UniGene/index.html). In addition, as EST mapping proceeds and every UniGene cluster is eventually placed on the human transcript map, region-specific catalogues of genes will exist that can be matched to genomic “hotspots” correlating with specific cancer types and stages ([6–10]; http://www.ncbi.nlm.nih.gov/genemap/, http://www.ncbi.nlm.nih.gov/CCAP/NG/).

To develop the TGI and build on existing databases in parallel, two types of CGAP-specific cDNA libraries are currently in use. Standard bulk tissue cDNA libraries from RNA derived from large tumor tissues function primarily to create a general picture of those genes expressed in the tumor process in addition to driving the process of gene discovery to aid the UniGene effort. More than 80 bulk tissue cDNA libraries (normalized as well as non-normalized) from a wide range of tumor types and histologies have been sequenced, and to date more than 340,000 ESTs have been deposited in the TGI. In addition, more than 11,000 novel genes have been discovered thus far to supplement the UniGene set. Although these results have proven extremely useful, a serious drawback to the sequencing of bulk tissue cDNA libraries is the lack of gene expression information in the context of tumor biology. This is primarily due to cellular heterogeneity found in bulk tissue.

Histological examination reveals that the prostate is a complex organ comprising multiple cell types. Only 10% of the cells are epithelial in origin, whereas the remaining 90% of the organ consists of inflammatory, fibroblastic, endothelial, and nervous cells. Yet it is the epithelium that gives rise to life-threatening prostate cancer [11,12]. Given this information, it is easy to understand why any attempt to determine a prostate epithelial-specific gene profile by sequencing a bulk tissue cDNA library from a normal prostate gland would fail. Furthermore, bulk tumor tissue will undoubtedly contain inflammatory, structural, and endothelial cells regardless of the percentage of tumor cells in the tissue as determined histologically. It was this realization that led to the development of laser capture microdissection (LCM) [13,14].

LCM is a process by which one is able to procure selected groups of cells, or even individual cells, from a heterogeneous population of cells in standard pathology preparations. The second type of cDNA library used in CGAP is constructed from RNA obtained by LCM [15,16]. These libraries make it possible, for the first time, to perform large-scale, in vivo gene expression profiling from a specific cell type. To begin addressing the issue of gene expression and profiling in the process of prostate cancer progression, a total of 15 cDNA libraries have been constructed from normal epithelium, prostatic intraepithelial neoplasia (PIN) lesions, invasive tumor cells, and metastatic prostate lesions. Many of these libraries were constructed from cells dissected from the same patient and pathology preparation. More than 30,000 clones have been sequenced from these libraries, representing 5186 UniGene clusters. Not only are these sequences useful for prostate tissue-specific and prostate cancer stage-specific expression analysis, they are also useful for gene discovery, as evidenced by the establishment of more than 400 UniGene clusters. Thus these libraries have the potential to reveal weakly expressed, tissue-specific, and cell-specific transcripts not easily found in bulk tissue libraries.

CGAP Bioinformatics

With the recent surge in genetic information available to the cancer researcher, it is apparent that useful bioinformatics packages are needed to make sense of it. The CGAP Website has been actively pursuing this endeavor by trying to deliver tools that allow the individual investigator to tease out interesting gene expression data from all of the cDNA libraries that currently exist in the TGI. One such function is Digital Differential Display (DDD). This utility uses the Fisher exact test [17] to compare one library to another, a pool of libraries to a single library, or a pool of libraries against another pool. In addition, all cDNA libraries that exist in dbEST can be used, not just those from CGAP. This allows for flexibility in designing an experiment in silico, and many questions can be asked using this function.
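The kind of comparison DDD performs can be illustrated with a minimal Python sketch. It is not the DDD implementation: the function names `fisher_exact_p` and `pool`, the choice of a two-sided test, and all EST counts below are assumptions made for illustration. Each library is reduced to a pair (ESTs assigned to the gene, total ESTs sequenced), and a pool is the sum of its libraries.

```python
from math import comb

def fisher_exact_p(gene_a, total_a, gene_b, total_b):
    """Two-sided Fisher exact P value for one gene's EST counts in two pools.

    gene_a, gene_b: ESTs assigned to the gene in pool A and pool B.
    total_a, total_b: all ESTs sequenced in pool A and pool B.
    """
    k = gene_a + gene_b  # ESTs for this gene across both pools

    def weight(x):
        # Ways for x of the gene's k ESTs to come from pool A
        # (hypergeometric numerator, kept as an exact integer).
        return comb(total_a, x) * comb(total_b, k - x)

    observed = weight(gene_a)
    lo, hi = max(0, k - total_b), min(k, total_a)
    # Sum every possible table that is no more likely than the observed one.
    tail = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return tail / comb(total_a + total_b, k)

def pool(libraries):
    """Merge libraries, each given as (gene_count, total_ests), into one pool."""
    return (sum(g for g, _ in libraries), sum(t for _, t in libraries))

# Hypothetical counts, for illustration only (not taken from dbEST).
tissue_pool = pool([(12, 4000), (9, 3500)])
control_pool = pool([(1, 40000), (0, 30000)])
p_value = fisher_exact_p(tissue_pool[0], tissue_pool[1],
                         control_pool[0], control_pool[1])
print(f"P = {p_value:.3g}")
```

Exact integer arithmetic is used for the table weights so that equally likely tables compare as equal, which floating-point probabilities would not guarantee.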

For example, one may obtain a list of tissue-specific genes for the prostate by constructing several pools for the DDD to analyze (http://www.ncbi.nlm.nih.gov/ncicgap/ddd.html). This working example is entitled “Compare Stages of Prostate Cancer” and can be viewed at the URL listed above. Detailed instructions for using DDD can also be found at this site. To find tissue-specific genes, a control pool should consist of libraries specific to several tissues different from each other and from the tissue of interest, and the pools of interest should contain libraries that are as narrowly focussed as possible. In addition, the control pool should be large, consisting of several diverse libraries with many sequences. Choosing libraries too similar to each other for the control pool (for instance, several different libraries constructed from brain tissue) would simply identify genes not expressed in brain tissue and not a superset of genes specific to prostate tissue. A similar difficulty would arise were the control pool to contain libraries with few ESTs: one would obtain genes not expressed in the small control pool, of which the tissue-specific genes would be a small fraction. Thus, we choose the libraries “Normalized infant brain 1NIB” (45472 ESTs) and “Soares fetal liver spleen 1NFLS S1” (29545 ESTs) for our control pool (Figure 1). Because the pools are easily edited, one may test whether the results are independent of the choice of control pool by modifying the control pool at a later stage in the analysis.

Figure 1. DDD page showing the choice of pools used for this analysis.

Next, one chooses libraries for three prostate-specific pools; this is preferable to grouping the diverse libraries in a single pool because differences between the pools indicate the extent to which any gene is specific to normal, neoplastic, or preneoplastic tissue. We choose library “NCI_CGAP_Pr1, Microdissected, normal prostate epithelium” to exemplify normal tissue, “NCI_CGAP_Pr2, Microdissected, low grade prostatic intraepithelial neoplasia” to exemplify preneoplastic tissue, and “NCI_CGAP_Pr3, Microdissected, invasive prostate tumor” to exemplify neoplasia.

Because one of the pools is a diverse control, all prostate-specific genes expressed in one of the tissue-specific pools are listed; furthermore, differences between the pools are also listed. Note that comparing only prostate-specific normal and cancerous libraries would produce very few significant differences. Although comparing these as separate pools against a control pool also displays differences between them that are not statistically significant, the genes involved are nonetheless candidates for a role in prostate cancer progression and may prove very useful to the cancer biologist as leads for experimental follow-up. Statistical significance is set at the P ≤ .05 level for the Fisher exact test [17]. To see whether the differences found are due to idiosyncrasies of the libraries chosen, we can expand the pools by adding the following: for normal libraries, “NCI_CGAP_Pr9 Microdissected, normal prostate epithelium,” and “NCI_CGAP_Pr25, Cell line, normal prostate epithelial cell line”; for precancerous libraries, “NCI_CGAP_Pr4, Microdissected, high grade prostatic intraepithelial neoplasia” and “NCI_CGAP_Pr4.1, Microdissected, high grade prostatic intraepithelial neoplasia high grade”; and for cancerous libraries, “NCI-CGAP_Pr24, Cell line, invasive prostate tumor cell line (HPV immortalized),” “NCI_CGAP_Pr10, Microdissected, invasive prostate tumor,” and “NCI_CGAP_Pr8, Microdissected, invasive prostate tumor.”

There are two genes with significant differences between states in the small and large pools: prostate specific antigen (PSA) and beta-microseminoprotein (prostate secreted) (MSMB) (Figure 2). Although not statistically significant according to the Fisher test, we note that kallikrein also has different expression levels in normal and precancerous tissues, as do several ribosomal proteins. MSMB and kallikrein have been implicated in prostate cancer [18–21]. This suggests that UniGene clusters with similar expression profiles would be potential candidates for the molecular fingerprinting of the stages of prostate cancer. Note that UniGene cluster identifiers, although superficially very convenient as referents, are not guaranteed to be stable for archival purposes. This is because clusters may split or merge together with the addition of new sequences. Thus, it is safest to store the list of accession numbers in a cluster of interest.
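The advice to store accession numbers rather than cluster identifiers can be sketched as a small archival record. This is only one possible layout: the helper names `cluster_record` and `overlap`, the JSON format, and the placeholder accession strings are all assumptions, not part of any CGAP or UniGene tool.

```python
import json

def cluster_record(label, accessions):
    """Build an archival record that stores the accession list itself."""
    return {"label": label, "accessions": sorted(set(accessions))}

def overlap(record, current_members):
    """Fraction of archived accessions still found in a present-day cluster."""
    archived = set(record["accessions"])
    return len(archived & set(current_members)) / len(archived)

# Placeholder strings; a real record would hold sequence accession numbers.
record = cluster_record(
    "MSMB candidate cluster",
    ["ACC_PLACEHOLDER_2", "ACC_PLACEHOLDER_1", "ACC_PLACEHOLDER_2"],
)
archived_text = json.dumps(record, indent=2)   # text suitable for long-term storage
restored = json.loads(archived_text)
print(overlap(restored, ["ACC_PLACEHOLDER_1"]))
```

An overlap well below 1 against a later cluster would suggest that the original cluster has since split, which is the situation the stored list is meant to make detectable.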

Figure 2. Result page from the DDD analysis indicating 7 statistically significant prostate-specific transcripts. Many more transcripts were found, and those can be found on the DDD website as described in the text.

The Fisher exact test, which is used to assess whether a difference in expression levels is statistically significant, is known to be conservative. It is therefore useful to have an independent tool to examine differences in expression level. The gene expression comparison utility (http://www.ncbi.nlm.nih.gov/ncicgap/EST/cgapqr.cgi) (Figure 3) does not attempt to assess statistical significance but can be used to identify which libraries have contributed sequences to a gene of interest. One difficulty particularly relevant in seeking novel ESTs is that the Fisher exact test will not find a significant difference in expression levels for small clusters. The exact definition of a small cluster depends on the total number of sequences in the pools being compared but, for instance, clusters of size 1 are never statistically significant. Thus, the Fisher exact test and the DDD interface to the test will tend to identify larger clusters and thus already characterized genes.
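The effect of cluster size can be made concrete with a short worked calculation. It rests on two simplifying assumptions that the text does not state: the two pools contain the same number n of ESTs, and the test is two-sided. Consider the most extreme case, in which all k ESTs of a cluster come from one pool.

```latex
% Probability that all k ESTs of a cluster fall in pool A (equal pools of n ESTs each)
P(\text{all } k \text{ in A})
  = \frac{\binom{n}{k}}{\binom{2n}{k}}
  = \prod_{i=0}^{k-1} \frac{n-i}{2n-i}
  \le 2^{-k}

% The mirror-image table (all k in pool B) is equally likely, so the two-sided P value is
p_{\min}(k) = 2\,\frac{\binom{n}{k}}{\binom{2n}{k}} \approx 2^{\,1-k} \qquad (n \gg k)

% Examples
p_{\min}(1) = 1, \qquad
p_{\min}(5) \approx 0.0625 > 0.05, \qquad
p_{\min}(6) \approx 0.031 < 0.05
```

Under these assumptions a cluster needs at least 6 ESTs, all from one pool, before it can reach the P ≤ .05 level, and a cluster of size 1 yields P = 1. With pools of different sizes the threshold shifts, consistent with the statement that the definition of a small cluster depends on the pool totals.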

Figure 3. Display of a gene expression profile analysis from the following microdissected cDNA libraries: Lib.281 (NCI_CGAP_Pr1, microdissected normal epithelium), Lib.282 (NCI_CGAP_Pr2, microdissected preneoplastic lesion), and Lib.283 (NCI_CGAP_Pr3, microdissected invasive tumor).

Two recent studies have demonstrated the usefulness of these prostate cDNA libraries [22,23]. A combination of computer-based analysis and laboratory analysis identified a number of genes from the prostate libraries within the TGI that show patterns of prostate-specific expression. The investigators suggest that the procedure they used can easily be applied to the discovery of genes expressed in other organs or tumors.

Conclusions and Future CGAP Goals

The DDD and gene expression comparison utilities are tools that currently exist to analyze CGAP-generated data. The CGAP website has historically been dynamic and is in continuous flux according to the data present in the TGI; thus all utilities are subject to continual improvements and upgrades. The example outlined in this article focused on prostate cancer. The immediate CGAP goal is to complete construction and sequencing of analogous cDNA libraries from microdissected cells representing all stages of ovarian, lung, colon, and breast cancers. Thus, analysis of the gene expression profiles of these first 5 cancers will undoubtedly render unprecedented bioinformation to the cancer community. More tumors will likely be added to this list once these 5 are completed.

A future goal for the analysis of gene expression in cancer progression is the development and use of serial analysis of gene expression (SAGE) cDNA libraries from cancer tissue [24]. A number of these libraries have recently been constructed and sequenced by CGAP, and utilities to analyze these data are starting to emerge (http://www.ncbi.nlm.nih.gov/SAGE/). Because sequencing SAGE libraries yields a larger amount of data, computer-generated gene expression analyses based on them can attain greater statistical significance. However, because these libraries were generated from bulk tumor tissue and not microdissected cells, direct comparison of tumor to preneoplastic or normal cellular states cannot be made. Thus, an ideal gene expression analysis of cancer progression might be the application of SAGE technology to microdissected cells.

In addition to expanding and improving the usefulness of gene expression profiles generated from sequencing cDNA libraries, CGAP has recently committed to the study of cancer progression at the genomic level by establishing the Genetic Annotation Initiative (GAI) and the Cancer Chromosome Aberration Project (cCAP). The goal of the GAI is to discover and catalogue single nucleotide polymorphisms (SNPs) in cDNA sequence that correlate with cancer initiation and progression, whereas the goal of cCAP is to develop a set of tools that will allow for the expedient definition and detailed characterization of chromosomal alterations associated with cancer initiation and progression.

Expansion to model organisms is beginning to take shape within CGAP as well. A mouse TGI will be established in the near future and will mirror the current human TGI in that both bulk tumor tissue and microdissected cells will be used to generate cDNA libraries for high-throughput sequencing. As with the human TGI, the 5 cancers of focus for the mouse are prostate, breast, lung, colon, and ovarian cancers.

In conclusion, the CGAP encompasses an entire approach to understanding cancer at the molecular level. Even in its infancy it shows great promise for uncovering important gene expression changes involved in cancer initiation and progression. An example for discovering such changes has been outlined here. In the near future one could envision that as the TGI grows linearly, possibilities for bioinformatics could expand exponentially. With addition of the new CGAP initiatives discussed here, the National Cancer Institute optimistically looks forward to uncovering the molecular changes that lead to cancer initiation and progression.

Abbreviations

CGAP: Cancer Genome Anatomy Project

TGI: Tumor Gene Index

DDD: Digital Differential Display

Footnotes

1 This is a US government work. There are no restrictions on its use.

References

  • 1. Strausberg RL, Dahl CA, Klausner RD. New opportunities for uncovering the molecular basis of cancer. Nat Genet. 1997;16:415–516. doi: 10.1038/ng0497supp-415.
  • 2. Boguski MS, Lowe TM, Tolstoshev CM. dbEST-database for “expressed sequence tags.” Nat Genet. 1993;4:332–333. doi: 10.1038/ng0893-332.
  • 3. Lennon G, Auffray C, Polymeropoulos M, Soares MB. The I.M.A.G.E. Consortium: An integrated molecular analysis of genomes and their expression. Genomics. 1996;33:151–152. doi: 10.1006/geno.1996.0177.
  • 4. Boguski M, Schuler G. ESTablishing a human transcript map. Nat Genet. 1995;10:369–371. doi: 10.1038/ng0895-369.
  • 5. Schuler G. Pieces of the puzzle: Expressed sequence tags and the catalog of human genes. J Mol Med. 1997;75:694–698. doi: 10.1007/s001090050155.
  • 6. Hudson TJ, Stein LD, Gerety SS, Ma J, Castle AB, Silva J, Slonim DK, Baptista R, Kruglyak L, Xu SH, et al. An STS-based map of the human genome. Science. 1995;270:1945–1954. doi: 10.1126/science.270.5244.1945.
  • 7. Gyapay G, Schmitt K, Fizames C, Jones H, Vega-Czarny N, Spillett D, Muselet D, Prud'Homme JF, Dib C, Auffray C, Morissette J, Weissenbach J, Goodfellow PN. A radiation hybrid map of the human genome. Hum Mol Genet. 1996;5:339–346. doi: 10.1093/hmg/5.3.339.
  • 8. Schuler GD, Boguski MS, Stewart EA, Stein LD, Gyapay G, Rice K, White RE, Rodriguez-Tome P, Aggarwal A, Bajorek E, Bentolila S, Birren BB, Butler A, Castle AB, Chiannilkulchai N, Chu A, Clee C, Cowles S, Day PJ, Dibling T, Drouot N, Dunham I, Duprat S, East C, Hudson TJ, et al. A gene map of the human genome. Science. 1996;274:540–546.
  • 9. Deloukas P, Schuler GD, Gyapay G, Beasley EM, Soderlund C, Rodriguez-Tomé P, Hui L, Matise TC, McKusick KB, Beckmann JS, Bentolila S, Bihoreau M-T, Birren BB, et al. A physical map of 30,000 human genes. Science. 1998;282:744–746. doi: 10.1126/science.282.5389.744.
  • 10. Mitelman F, Mertens F, Johansson B. A breakpoint map of recurrent chromosomal rearrangements in human neoplasia. Nat Genet. 1997;15:417–474. doi: 10.1038/ng0497supp-417.
  • 11. Murphy GP, Busch C, Abrahamsson PA, Epstein JI, McNeal JE, Miller GJ, Mostofi FK, Nagle RB, Nordling S, Parkinson C, et al. Histopathology of localized prostate cancer. Consensus Conference on Diagnosis and Prognostic Parameters in Localized Prostate Cancer. Scand J Urol Nephrol. 1994;162(Suppl):7–42.
  • 12. Epstein JI. Pathologic evaluation of prostatic carcinoma: critical information for the oncologist. Oncology. 1996;10:527–534.
  • 13. Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, Goldstein SR, Weiss RA, Liotta LA. Laser Capture Microdissection. Science. 1996;274:998–1001. doi: 10.1126/science.274.5289.998.
  • 14. Bonner RF, Emmert-Buck M, Cole K, Pohida T, Chuaqui R, Goldstein S, Liotta LA. Laser Capture Microdissection: Molecular analysis of tissue. Science. 1997;278:1481–1483. doi: 10.1126/science.278.5342.1481.
  • 15. Krizman DB, Chuaqui RF, Meltzer PS, Trent JM, Duray PH, Linehan WM, Liotta LA, Emmert-Buck MR. Construction of a representative cDNA library from prostatic intraepithelial neoplasia (PIN). Cancer Res. 1996;56:5380–5383.
  • 16. Peterson LA, Brown M, Carlisle AJ, Kohn E, Liotta LA, Emmert-Buck MR, Krizman DB. An improved method for microdissected cDNA libraries using papillary serous ovarian carcinoma cells. Cancer Res. 1998;58:5326–5328.
  • 17. Siegel S. Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill; 1956.
  • 18. Heeb MJ, Espana F. Alpha2-macroglobulin and C1-inactivator are plasma inhibitors of human glandular kallikrein. Blood Cells Mol Dis. 1998;24:412–419. doi: 10.1006/bcmd.1998.0209.
  • 19. Saedi MS, Hill TM, Kuus-Reichel K, Kumar A, Payne J, Mikolajczyk SD, Wolfert RL, Rittenhouse HG. The precursor form of the human kallikrein 2, a kallikrein homologous to prostate-specific antigen, is present in human sera and is increased in prostate cancer and benign prostatic hyperplasia. Clin Chem. 1998;44:2115–2119.
  • 20. Tsurusaki T, Koji T, Sakai H, Kanetake H, Nakane PK, Saito Y. Cellular expression of beta-microseminoprotein (beta-MSP) mRNA and its protein in untreated prostate cancer. Prostate. 1998;35:109–116. doi: 10.1002/(sici)1097-0045(19980501)35:2<109::aid-pros4>3.0.co;2-e.
  • 21. Hyakutake H, Sakai H, Yogi Y, Tsuda R, Minami Y, Yushita Y, Kanetake H, Nakazono I, Saito Y. Beta-microseminoprotein immuno-reactivity as a new prognostic indicator of prostatic carcinoma. Prostate. 1993;22:347–355. doi: 10.1002/pros.2990220409.
  • 22. Vasmatzis G, Essand M, Brinkmann U, Lee B, Pastan I. Discovery of three genes specifically expressed in human prostate by expressed sequence tag database analysis. Proc Natl Acad Sci USA. 1998;95:300–304. doi: 10.1073/pnas.95.1.300.
  • 23. Brinkmann U, Vasmatzis G, Lee B, Yerushalmi N, Essand M, Pastan I. PAGE-1, an X chromosome-linked GAGE-like gene that is expressed in normal and neoplastic prostate, testis, and uterus. Proc Natl Acad Sci USA. 1998;95:10757–10762. doi: 10.1073/pnas.95.18.10757.
  • 24. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270:484–487. doi: 10.1126/science.270.5235.484.

Articles from Neoplasia (New York, N.Y.) are provided here courtesy of Neoplasia Press
