Proceedings of the National Academy of Sciences of the United States of America. 2003 Sep 8;100(19):10585–10587. doi: 10.1073/pnas.2034937100

Extending the utility of gene profiling data by bridging microarray platforms

Gregory Z Ferl, John M Timmerman, Owen N Witte
PMCID: PMC196845  PMID: 12963810

Gene expression profiling studies using cDNA and oligonucleotide microarrays have yielded troves of information about the biology, classification, and prognostication of human cancers, with much of it currently accessible in public databases. A simple but elegant statistical model now offers the potential to unite data sets across these two technology platforms.

Since the start of the new millennium, the scientific community has witnessed an explosion in the use of DNA microarray technology to study gene expression (1). A casual inspection of PubMed citations reveals that, since the year 2000, the number of scientific papers related to microarrays has dramatically increased with each passing year, with well over 1,000 publications in the first half of 2003. However, this powerful technology is still in its infancy. There is no single, standardized array platform on which to conduct experiments, and the lack of a standard data analysis protocol further confounds experimental results (2, 3). In a recent issue of PNAS, Wright et al. (4) proposed a statistical model that can be used to translate experimental results across microarray platforms. The method is used to assign tumor samples to one of two subgroups based on gene expression delineated by using cDNA arrays, and the validity of this predictor is confirmed in publicly available data from a set of similar tumor specimens studied independently by using oligonucleotide arrays.

There are many types of microarrays, but the literature is dominated by two distinct platform technologies, cDNA and oligonucleotide arrays (5). Both methods share the feature of a solid support “chip” to which hundreds of thousands of gene fragments are attached. Major differences between these array platforms include the immobilized probe used to detect specific mRNA transcripts and the number of different biological samples used within a single chip experiment.

cDNA microarrays probe the biological sample by using DNA sequences that are typically several hundred base pairs in length. The probes are derived from cDNA libraries and are fixed to a solid support by using a variety of methods, including covalent bonding (5). cDNA microarrays are two-channel arrays, with both a reference and experimental sample analyzed on the same chip. Samples are labeled with fluorescent dyes, one with Cy5 (red) and the other with Cy3 (green). The chip scanner measures the amount of red and green label and outputs the ratio of Cy5 to Cy3. Typical cDNA microarray experiments compare a normal cell or tissue sample to an abnormal (e.g., cancerous) sample (2).
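As a minimal sketch of the two-channel readout described above, the snippet below computes the Cy5/Cy3 ratio for a few spots, together with its base-2 logarithm. The log transform, the function name, and the intensity values are illustrative assumptions and are not taken from the studies discussed here.

```python
import math

def cy5_cy3_ratio(cy5, cy3):
    """Ratio of red (Cy5) to green (Cy3) intensity for one spot."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return cy5 / cy3

# Hypothetical background-corrected intensities: (Cy5, Cy3) per spot.
spots = {
    "gene_a": (1200.0, 300.0),
    "gene_b": (250.0, 1000.0),
    "gene_c": (500.0, 500.0),
}

for name, (cy5, cy3) in spots.items():
    ratio = cy5_cy3_ratio(cy5, cy3)
    # Assumption: a log2 transform is applied so that over- and
    # under-expression relative to the reference are symmetric around 0.
    print(name, ratio, math.log2(ratio))
```

Because both samples share one chip, the ratio is a relative measurement: it says how the experimental sample compares with the reference, not how much transcript is present in absolute terms.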

Oligonucleotide microarrays contain probes, usually 25 nt in length, that are synthesized directly onto the surface of the chip. Although less common, long oligonucleotide microarrays are also used with probes that are 40–80 bp in length (6). Each probe set is divided into multiple probe pairs, consisting of perfect match (PM) oligonucleotides and corresponding mismatch (MM) oligonucleotides, which are used as a control for nonspecific binding. The raw hybridization intensity value is given as the difference between the level of PM and MM binding. These one-channel arrays give an absolute measurement of mRNA binding that can be directly compared with the results of other oligonucleotide microarray experiments.
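The PM/MM readout described above can be sketched as follows. Averaging the PM minus MM differences over the probe pairs of a probe set is an assumption made here for illustration; the function name and the intensity values are hypothetical.

```python
from statistics import mean

def probe_set_signal(pm, mm):
    """Average PM - MM difference over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    # Each MM probe serves as the nonspecific-binding control for its PM partner.
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical intensities for one probe set with four probe pairs.
pm_intensities = [820.0, 760.0, 905.0, 690.0]
mm_intensities = [210.0, 180.0, 260.0, 150.0]

print(probe_set_signal(pm_intensities, mm_intensities))
```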

A growing number of gene profiling data sets acquired by using various platform technologies are now publicly available, prompting questions about how best these data sets can be queried to seek validation of hypotheses generated in earlier, separate studies.

Microarray data comparisons between investigators should be valuable in allowing observations from similar sets of specimens to be confirmed or refuted independently by other laboratories. Historically, however, such comparisons have been troubled not only by the use of different microarray platforms, but also by variances among the sets of included genes, computational methods, and cell/tissue samples.

A clear example of the utility of microarrays in dissecting tumor behavior has come from the study of diffuse large B cell lymphoma (DLBCL), the most common form of non-Hodgkin's lymphoma. Within this single histologic subtype of the disease, biologic heterogeneity has long been known. Approximately 40% of DLBCL patients are cured by combination chemotherapy, whereas most of the others succumb to their disease within 5 years of diagnosis. Although a combination of clinical parameters, the International Prognostic Index (IPI), has been used to stratify patients' risk of treatment failure (7), these parameters are felt to be largely surrogates for unknown biologic features of the patients' tumors.

Alizadeh et al. (8) attempted to sub-classify DLBCL specimens by using tissue-specific microarrays encompassing 17,856 cDNA clones selected to represent normal and malignant lymphocytes. When hierarchical clustering was used, it was found that half of the specimens had a gene expression profile (“signature”) reminiscent of B lymphocytes responding to antigen in the germinal centers of lymphoid organs, and these cases were dubbed the “germinal center B cell-like” (GCB) subgroup. The majority of remaining cases expressed genes characteristic of in vitro-activated peripheral blood B cells, and were termed “activated B cell-like” (ABC). These cell-type-specific gene signatures were found to be associated with substantially different patient outcomes; cases displaying the GCB expression profile had improved overall survival relative to ABC cases. Importantly, the cDNA microarray data provided prognostic information beyond that of clinical factors, as it could further subdivide cases preclassified according to the IPI.

Shipp et al. (9) studied a separate set of DLBCL tumors by using short oligonucleotide microarrays, and used a statistical algorithm called “supervised learning” to identify patterns of gene expression associated with subgroups of cases preclassified as having favorable or unfavorable prognosis. A set of 13 genes was found to provide prognostic information beyond that attained by using the standard IPI. To validate this 13-gene predictor model, Shipp et al. (9) then queried the public database of Alizadeh et al. (8), finding that expression of the 3 of 13 genes shared by the tissue-specific cDNA array correlated with outcome in this alternative set of cases. They next sought to determine whether the cell-type-specific gene signatures defined by cDNA arrays could discriminate their own cases. First, they identified the subset of 90 genes from the GCB and ABC signatures that were also represented on their own oligonucleotide arrays. Expression data for these 90 genes were able to sort the DLBCL cases of Alizadeh et al. according to their previously noted GCB and ABC subgroups, which, as before, were associated with different clinical outcomes. These same 90 genes were then used to recluster their own DLBCL cases. As expected, these cases also sorted according to the GCB and ABC subgroups. However, the cases reanalyzed this way did not differ in their clinical outcome.

To address this discrepancy, and to provide a general tool for comparing gene expression data across microarray platforms, Wright et al. (4) have introduced a statistical model based on a linear predictor score (LPS) applied to hierarchical clustering results (8, 10). The LPS reduces the gene expression data of a tumor sample to a single number by taking a weighted sum of the expression levels of the 27 genes most capable of discriminating between GCB and ABC DLBCL. Each term of the LPS is a gene expression level multiplied by a weighting factor (a_j), which is equal to the t statistic (11) calculated for that gene from a statistical comparison of the two groups. The LPS values across the tissue-specific cDNA arrays of each group (GCB and ABC) exhibit an approximately normal distribution. The resulting statistical model consists of two overlapping groups of LPS values (Fig. 1). This model was validated by splitting the set of 274 DLBCL samples into two groups, a training set and a validation set. Based on the training set, the DLBCL group assignments of the validation set were in agreement with earlier hierarchical clustering results, which have upheld the GCB and ABC subgroupings as well as identified a third subgroup (group 3) with intermediate phenotype and prognosis (12). Additionally, the LPS-based statistical model was able to classify many of the group 3 cases as either GCB or ABC.
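The score described in this paragraph can be written as LPS(X) = sum over genes j of a_j * X_j, where X_j is the expression level of gene j in sample X. The sketch below is one hedged way to compute it. The unequal-variance (Welch) form of the two-sample t statistic is an assumption, since the text does not say which variant Wright et al. used, and the gene names and expression values are made up for illustration.

```python
import math
from statistics import mean, variance

def t_statistic(group_a, group_b):
    """Two-sample t statistic (Welch form, an assumption) for one gene."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def linear_predictor_score(expression, weights):
    """LPS: weighted sum of a sample's expression over the predictor genes."""
    return sum(weights[gene] * expression[gene] for gene in weights)

# Hypothetical training data: expression of each gene in the two subgroups.
training = {
    "gene_1": ([2.1, 2.4, 1.9, 2.6], [0.3, 0.5, 0.1, 0.4]),
    "gene_2": ([0.2, 0.1, 0.4, 0.3], [1.8, 2.2, 1.9, 2.0]),
}

# One weight a_j per gene, taken from the t statistic between the two groups.
weights = {gene: t_statistic(a, b) for gene, (a, b) in training.items()}

# A new sample is reduced to a single number.
sample = {"gene_1": 2.0, "gene_2": 0.3}
print(linear_predictor_score(sample, weights))
```

With this sign convention, genes expressed more highly in the first group receive positive weights and genes expressed more highly in the second group receive negative weights, so samples from the two groups are pushed toward opposite ends of the LPS axis.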

Fig. 1.

Schematic representation of how gene expression results can be compared across microarray platforms. The Venn diagram illustrates the overlap among all clones on the tissue-specific cDNA array and oligonucleotide array probe sets. The red circle in the center of the Venn diagram represents the 27 genes on the cDNA array used to classify the DLBCL samples as GCB, ABC, or other. The intersection among the three sections of the Venn diagram (the right portion of the red circle) represents the 14 significant genes that are found on the oligonucleotide array. General methodology: (1) Discriminating genes are identified by using cDNA data. (2) The LPS is calculated by using the adjusted oligonucleotide expression values; the weights (a_j) are the same as those used when classifying the cDNA array data. Expression values for each of the 14 genes are adjusted by shifting and scaling so that their mean and variance match those of the same gene on the cDNA array. (3) The distributions of LPS values from the two groups (from clustered cDNA array data) are used (4) to calculate the probability that the tumor sample falls into either group 1 or group 2. The equation used to calculate the probability assumes that the sample must fall into either group 1 or group 2. Samples for which neither probability reaches 90% are assumed to fall into group 3.
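The shift-and-scale adjustment mentioned in the caption can be sketched as below. This is an illustrative reading of that step, not the authors' code: the function name is hypothetical, the population standard deviation is used for simplicity, and the numbers are invented.

```python
from statistics import mean, pstdev

def match_mean_and_variance(values, target_mean, target_sd):
    """Rescale one gene's values so their mean and SD equal the targets."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("cannot rescale a gene with zero variance")
    # Standardize, then map onto the target (cDNA) scale for this gene.
    return [(v - mu) / sd * target_sd + target_mean for v in values]

# Hypothetical oligonucleotide values for one gene across samples, and the
# mean and SD observed for the same gene in the cDNA data set.
oligo_values = [150.0, 420.0, 310.0, 95.0, 530.0]
adjusted = match_mean_and_variance(oligo_values, target_mean=0.4, target_sd=1.1)

print(adjusted)
print(mean(adjusted), pstdev(adjusted))
```

After this adjustment, the weights derived from the cDNA data can be applied to the oligonucleotide values, because both are expressed on the same per-gene scale.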

The LPS-based statistical model was then used to reanalyze the oligonucleotide microarray data from Shipp et al. (9), which had partially contradicted the results of Alizadeh et al. (8). Fig. 1 illustrates the general concept behind the statistical model used to classify the tumor samples of Shipp et al. Fourteen of the 27 genes used in the model creation and validation process were found on the oligonucleotide array and were used to calculate LPS values for each chip, by using the same weighting factors calculated from the cDNA array hierarchical clustering results (8). By using the distribution curves of the LPS values from the cDNA array data set (green curves in Fig. 1), the probability of each oligonucleotide-derived LPS falling into either the GCB or ABC subgroups was calculated by using a standard application of Bayes' rule. This particular application of Bayes' rule requires the assumption that each sample must fall into either group 1 or group 2; the tumor sample may be placed in group 3 only after the probability of falling into either group 1 or 2 is calculated. A cutoff of 90% was used to assign each sample to GCB, ABC, or group 3. This method of analyzing the oligonucleotide data was validated by comparing the results of the analysis to patient survival data. The LPS-based statistical model was again able to separate the tumor samples into three groups (GCB, ABC, or group 3), with each group corresponding to distinct clinical outcomes, as previously seen (8, 12).
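The classification step described above can be sketched as follows, assuming that the LPS values of each group follow a normal distribution and that the two groups have equal prior probability. The equal priors, the function names, and the group means and standard deviations are assumptions made for illustration; they are not values from Wright et al.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def prob_group1(lps, mu1, sigma1, mu2, sigma2):
    """Bayes' rule with equal priors: P(group 1 | LPS), given only two groups."""
    p1 = normal_pdf(lps, mu1, sigma1)
    p2 = normal_pdf(lps, mu2, sigma2)
    return p1 / (p1 + p2)

def classify(lps, mu1, sigma1, mu2, sigma2, cutoff=0.9):
    """Assign group 1 or 2 if its probability reaches the cutoff, else group 3."""
    p = prob_group1(lps, mu1, sigma1, mu2, sigma2)
    if p >= cutoff:
        return "group 1"
    if 1.0 - p >= cutoff:
        return "group 2"
    return "group 3"

# Hypothetical LPS distributions estimated from the clustered cDNA data.
mu1, sigma1 = -1.0, 1.0
mu2, sigma2 = 1.0, 1.0

for lps in (-3.0, 0.0, 3.0):
    print(lps, prob_group1(lps, mu1, sigma1, mu2, sigma2),
          classify(lps, mu1, sigma1, mu2, sigma2))
```

Group 3 never enters the probability calculation itself. A sample lands there only because neither of the two-group probabilities reaches the cutoff, which mirrors the order of operations described in the text.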

As illustrated here, the assignment of biological samples to gene expression subgroups can be validated with independent data sets regardless of the specific DNA microarray platforms used by implementing the LPS-based model of Wright et al. The model is capable of extracting information that might otherwise be obscured from a second data set. It will be important to determine whether the model can work in a reciprocal fashion by building the “predictor” based on oligonucleotide data and then applying it to cDNA array data. Such rapid, statistical hypothesis validations would not be possible without free public access to gene expression databases. We look forward to further analyses of these and other data sets to better elucidate the molecular taxonomy and pathology of human cancers (13).

What now are the implications of recent molecular profiling studies for lymphoma treatment? At this time, technologies for measuring thousands of genes in each patient's tumor sample remain too complex and expensive for general application. However, it is easier to imagine that routine diagnostic and prognostic tools could soon be supplemented by procedures for measuring a smaller number of genes (<100). Quantitative real-time PCR analysis, now available in many pathology departments, can be used to measure even a single gene associated with the GCB cell signature (bcl-6) and provide powerful prognostic information (14).

Unfortunately, prognostic tools alone are unlikely to improve patient outcomes. Sophisticated risk assessment methodologies do not allow therapy to be suitably tailored unless appropriate alternative therapies are available for high-risk, poor-prognosis patients. Without the benefit of disease-specific molecular targets, clinicians have attempted to improve outcomes principally by escalating doses of conventional chemotherapeutic agents, followed by stem cell support. However, the benefit of this approach in improving patients' overall survival has been difficult to prove (15, 16). The current promise thus lies in building a therapeutic armamentarium that can interrupt the tumor-specific signaling pathways revealed via microarray technologies. Genes that are overexpressed in poor-prognosis cases may not necessarily be appropriate targets for therapeutic intervention. Rather, one may need to look upstream or downstream in the associated signaling pathways for the most vulnerable targets, which are usually enzymes and cell surface receptors. Accordingly, the molecular profiling studies described above have already confirmed the activation of two novel targets in high-risk DLBCL, protein kinase Cβ (9) and NF-κB (12). Phase II clinical trials using the protein kinase Cβ inhibitor LY317615 and the proteasome inhibitor PS-341 (Bortezomib; Velcade), which targets NF-κB by impeding degradation of its negative regulator I-κB, are now underway in patients with DLBCL. It will be important to perform molecular profiling of the pretreatment tumor samples in these studies to determine whether overexpression of these targets correlates with the clinical response to these agents. There is every reason to believe that gene profiling studies in other cancers will similarly yield clues to useful therapeutic targets.

For now, cross-validation of gene expression results among data sets generated in particular types of cancer by using methods such as those described by Wright et al. should help to define the genes most relevant for disease classification, prognostics, and therapeutic targeting. For instance, data sets obtained by profiling breast cancers with either cDNA or oligonucleotide platforms (6, 17, 18) may now be more easily pooled to reveal novel associations between gene expression and clinical variables. Opportunities for such advances will continue to be greatest when there is public access to gene expression data, as has been the case in lymphoma studies. As microarrays grow in size to encompass a greater and greater proportion of the human transcriptome, expression profiles emerging from tissue samples acquired by using cDNA versus oligonucleotide platforms will likely converge. However, computational methods to validate or test new hypotheses among individual or pooled data sets compiled by using disparate platforms will continue to improve the mileage per microarray for years to come.

Acknowledgments

We thank Joseph J. DiStefano III, David Elashoff, Sharon Hori, Caius Radu, Sven deVos, and David Betting for helpful discussions. O.N.W. is an Investigator of the Howard Hughes Medical Institute. G.Z.F. is supported by National Institutes of Health Tumor Immunology Grant 5-T32-CA009120-28.

See companion article on page 9991 in issue 17 of volume 100.

References

  1. Chipping Forecast II (2002) Nat. Genet. Suppl. 32, 461–552.
  2. Yang, Y. H. & Speed, T. (2002) Nat. Rev. Genet. 3, 579–588.
  3. Quackenbush, J. (2001) Nat. Rev. Genet. 2, 418–427.
  4. Wright, G., Tan, B., Rosenwald, A., Hurt, E. H., Wiestner, A. & Staudt, L. M. (2003) Proc. Natl. Acad. Sci. USA 100, 9991–9996.
  5. Barrett, J. C. & Kawasaki, E. S. (2003) Drug Discovery Today 8, 134–141.
  6. van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., et al. (2002) Nature 415, 530–536.
  7. The International Non-Hodgkin's Lymphoma Prognostic Factors Project (1993) N. Engl. J. Med. 329, 987–994.
  8. Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., et al. (2000) Nature 403, 503–511.
  9. Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G. S., et al. (2002) Nat. Med. 8, 68–74.
  10. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863–14868.
  11. Cui, X. & Churchill, G. A. (2003) Genome Biol. 4, 210.1–210.10.
  12. Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B., Giltnane, J. M., et al. (2002) N. Engl. J. Med. 346, 1937–1947.
  13. Lakhani, S. R. & Ashworth, A. (2001) Nat. Rev. Cancer 1, 151–157.
  14. Lossos, I. S., Jones, C. D., Warnke, R., Natkunam, Y., Kaizer, H., Zehnder, J. L., Tibshirani, R. & Levy, R. (2001) Blood 98, 945–951.
  15. Shipp, M. A., Abeloff, M. D., Antman, K. H., Carroll, G., Hagenbeek, A., Loeffler, M., Montserrat, E., Radford, J. A., Salles, G., Schmitz, N., et al. (1999) J. Clin. Oncol. 17, 423–429.
  16. Coiffier, B. (2003) J. Clin. Oncol. 21, 2457–2459.
  17. Perou, C. M., Sørlie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Rees, C. A., Pollack, J. R., Ross, D. T., Johnsen, H., Akslen, L. A., et al. (2000) Nature 406, 747–752.
  18. West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J. A., Jr., Marks, J. R. & Nevins, J. R. (2001) Proc. Natl. Acad. Sci. USA 98, 11462–11467.
