Proceedings of the National Academy of Sciences of the United States of America. 2005 Jan 25;102(6):2052–2057. doi: 10.1073/pnas.0408105102

Assessment of tumor characteristic gene expression in cell lines using a tissue similarity index (TSI)

Rickard Sandberg 1,*, Ingemar Ernberg 1
PMCID: PMC548538  PMID: 15671165

Abstract

The gene expression profiles of 60 cell lines, derived from nine different tissues, were compared with their corresponding in vivo tumors and tissues. Cell lines expressed few tissue-specific (2%) or tumor-specific (5%) genes when analyzed group-wise. A tissue similarity index (TSI) was designed based upon singular value decomposition that measured in vivo tumor characteristic gene expression in each cell line independently. Only 34 of the 60 cell lines received their highest TSI toward their tumor of origin. In addition, we identified the most appropriate cell lines to be used as model systems for different in vivo tumors. Seven cell lines were identified as being of another origin than the originally presumed one. The proposed TSI will likely become an important tool for the selection of the most appropriate cell lines in pharmaceutical screening programs and experimental and biomedical research.

Keywords: DNA microarray, in vitro, singular value decomposition, tissue-specific expression, tumor gene expression


Cell lines derived from tumors and tissues have been instrumental for our understanding of biology at the molecular level and are widely used in experimental research. Conclusions on the corresponding tissues are drawn from their behavior in a wide variety of tests. There are general differences between the environment of cells growing in vitro and that of a heterogeneous tissue. The cell lines are rapidly dividing, and general differences in gene expression include an up-regulation of genes involved in proliferation (1, 2). Although cell lines differ from both normal and tumor tissues, the low availability of tissue samples and the limited possibility to manipulate in animals make cell lines a necessity also for future molecular cell biology research and drug development.

New anticancer drugs are often developed with the help of cell lines, despite the notion that monolayer cultures are more sensitive to chemotherapy than in vivo tumors (3, 4). Furthermore, the drug sensitivity of tumors is related to their tissue of origin. Tumors of the testis, breast, and ovaries are normally responsive to chemotherapy, whereas tumors originating in the colon, kidney, or liver are often more resistant. The variation in chemotherapy sensitivity determined by tissue of origin is reduced in cell lines (5). Large screening programs are using cell line panels to predict the chemotherapeutic efficiency of compounds to different tumor types (6), the key assumption being that the tumor cell lines are good experimental models for the tumors they derive from. Despite the frequent use of cell lines, there are few systematic studies investigating how well cell lines represent or correspond to the tumor tissue identity.

For many tumors, metastatic lesions are superior for establishing cell cultures (7). Herein lies a potential risk of “transmitting” a misclassification of the metastatic tumor to the derived cell line. Identification of the correct origin of cell lines will be crucial for their use as model systems for their corresponding tumors. Through decades of cell line usage, there are other potential threats to the preservation of the original tumor characteristics, such as loss of tissue-specific gene expression and even cross-contamination (8, 9). A new tool that can unambiguously assess and authenticate the origins of the cell line would be extremely valuable for a variety of uses in biomedical research and clinical medicine.

The NCI60 cell lines originate from nine different tissues of origin and have been extensively characterized by using a variety of methods, such as karyotyping (10), gene expression arrays (2, 11), and protein expression arrays (12). Hierarchical clustering of the NCI60 cell lines based on their gene expression patterns showed that cell lines for six of the nine tissues of origin were clustered into independent terminal branches with few exceptions (2). The melanoma-derived cell lines had the most tissue-specific gene expression pattern, with many genes involved in melanocyte biology being up-regulated (2). By induction, the genes up-regulated in other cell lines were concluded to reflect their tissues of origin (2).

In this study, we compared the NCI60 cell lines with their corresponding tumors and normal tissues. We not only identified the tissue-specific gene expression in both the tumors and the corresponding cell lines, but also determined the proportion of tumor type- and tissue-specific gene expression that was still maintained in the cell lines. We here demonstrate that cell lines in general lose the tissue-specific up-regulation of genes. We also noted a large variation in the expression of tumor- and tissue-specific genes within cell lines originating from the same tumor type, and therefore developed an index, called the tissue similarity index (TSI), that directly measures the similarity in gene expression between a cell line and the different tumor or tissue types. TSI is defined as the distance between the singular-value-decomposed gene expression pattern of a cell line and the average pattern of several samples representing a particular tumor type. By using the TSI for the NCI60 cell lines, it was clear that the different cell lines of presumably identical tumor origin differ widely in their expression of tumor characteristic genes and consequently in their appropriateness as model systems for the tumor types. Correctly assessing the gene expression similarities of cell lines and tumor tissues using the TSI may significantly assist the choice of cell lines for experimental research and drug screening programs.

Materials and Methods

Data. We compiled gene expression data measured by Affymetrix Hu6800 arrays for the NCI60 cell lines (13) and a panel of tumors and normal tissue samples (14). The NCI cell lines were derived from nine different tumor types (breast, CNS, colon, kidney, leukemia, lung, melanoma, ovary, and prostate). The tumor panel contained multiple replicates of all these tumor types, and normal tissue samples were present for seven of the tumors, excluding leukemias and melanomas. The data sets were normalized separately, by using robust multichip average (RMA) (15, 16) implemented in rmaexpress 0.2 for the NCI60 CEL files and a global scaling algorithm for the tumors and normal samples. The global scaling algorithm was calculated from the positive average difference values, excluding the top and bottom 2% average difference values.
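The global scaling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `global_scale`, the target level of 1000, and the choice to rescale each array to a common trimmed mean are all hypothetical. Only the use of positive average difference values and the exclusion of the top and bottom 2% come from the text.

```python
import numpy as np

def global_scale(values, target=1000.0, trim=0.02):
    """Scale one array so the trimmed mean of its positive values equals `target`.

    `target` is a hypothetical reference level; the text does not state one.
    """
    values = np.asarray(values, dtype=float)
    positive = np.sort(values[values > 0])
    k = int(np.floor(trim * positive.size))        # values dropped at each end
    trimmed = positive[k:positive.size - k]
    return values * (target / trimmed.mean())

# Toy array of average difference values (made-up numbers, some negative).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=5.0, sigma=1.0, size=1000) - 20.0
scaled = global_scale(raw)
```

Because the scaling factor is positive, the set of positive values and their order are unchanged, so the trimmed mean of the scaled array equals the target.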

Significance Analysis of Microarrays (SAM). We used SAM (17), available as an Excel add-in (version 1.21), to identify the number of differentially expressed genes, given a set false discovery rate (FDR). In this analysis, we manually set the FDR to zero for each independent comparison, which must be considered a conservative criterion for differential expression.

We compared cell lines of each specific tissue of origin with the cell lines of the other eight origins and used SAM to identify the up-regulated genes determined by tissue of origin. We repeated the procedure for each tissue of origin independently. Within the tumor samples, we identified genes with a statistically significant differential expression in one tumor type as compared with the other eight tumor types (tumor type-specific gene expression). Similarly, we identified tissue-specific gene expression, i.e., genes with a statistically significant differential overexpression in one tissue as compared with the other six tissues (because no normal counterparts of melanomas and leukemias were present in the material).

TSI. The TSI used singular-value-decomposition (SVD) to capture tumor characteristic gene expression patterns of different tumor tissues and to reduce the dimensionality of the gene expression data. SVD is a standard method in linear algebra, and the mathematical details of SVD for gene expression analysis have been described in detail elsewhere (18–20). In brief, SVD decomposes a gene expression matrix X into three matrices, X = USV^T. The left singular vectors (hereafter called Eigenarrays) (19) are the columns of matrix U, the diagonal elements of S are the singular values, and the rows of V^T are the right singular vectors. Before SVD calculation, we preprocessed the expression data for each gene independently to an average expression level of zero and a standard deviation of one (21). We projected the gene expression pattern of each tumor sample independently into the “SVD space,” by measuring its correlation to the first 16 Eigenarrays. In the 16-dimensional SVD space, we defined a centroid for each tumor type as the geometric center of the samples of that particular tumor type. These centroids are thought of as idealized gene expression centers for each tumor, respectively. Most of the tumor types formed distinct clusters in SVD space, and the distance between each sample and its tumor type centroid was smaller than to the centroids of other tumor types. We evaluated each cell line by performing the identical projections into SVD space and then measured its similarities to the tumor types as its correlation to the respective centroids. The correlations to the tumor type centroids are the TSI scores and range from 1 to –1.
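The procedure described above can be sketched in a few lines of NumPy. This is an illustrative reading of the text, not the authors' implementation: all function names are hypothetical, the data are simulated, Pearson correlation is assumed wherever the text says "correlation", and the cell line panel is assumed to be standardized per gene across its own samples (the text only says the projections are "identical"). The toy example uses 5 dimensions instead of 16 because it has only three tumor types.

```python
import numpy as np

def standardize_genes(X):
    """Scale every gene (row) to mean 0 and standard deviation 1 across samples."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def correlate_columns(A, B):
    """Pearson correlation of every column of A with every column of B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    A = A / np.linalg.norm(A, axis=0)
    B = B / np.linalg.norm(B, axis=0)
    return A.T @ B

def fit_tsi(tumors, labels, n_dims=16):
    """Return the leading Eigenarrays and one centroid per tumor type."""
    Z = standardize_genes(tumors)                      # genes x samples
    U, _, _ = np.linalg.svd(Z, full_matrices=False)    # Z = U @ diag(s) @ Vt
    eigenarrays = U[:, :n_dims]                        # left singular vectors
    coords = correlate_columns(Z, eigenarrays)         # samples x n_dims
    labels = np.asarray(labels)
    centroids = {str(t): coords[labels == t].mean(axis=0)
                 for t in np.unique(labels)}
    return eigenarrays, centroids

def tsi_scores(samples, eigenarrays, centroids):
    """TSI of each sample (column) toward each tumor type centroid."""
    coords = correlate_columns(standardize_genes(samples), eigenarrays)
    return [{t: float(np.corrcoef(p, c)[0, 1]) for t, c in centroids.items()}
            for p in coords]

# Simulated data: each of three made-up types over-expresses its own 40 genes.
rng = np.random.default_rng(0)
types = ["A", "B", "C"]
n_genes = 300

def simulate(per_type):
    labels = [t for t in types for _ in range(per_type)]
    X = rng.normal(size=(n_genes, len(labels)))
    for j, t in enumerate(labels):
        start = 40 * types.index(t)
        X[start:start + 40, j] += 4.0
    return X, labels

tumors, tumor_labels = simulate(10)        # stands in for the tumor panel
cell_lines, presumed = simulate(3)         # stands in for the cell line panel
eigenarrays, centroids = fit_tsi(tumors, tumor_labels, n_dims=5)
scores = tsi_scores(cell_lines, eigenarrays, centroids)
best = [max(s, key=s.get) for s in scores] # tumor type with the highest TSI
```

Each entry of `scores` is a dictionary of TSI values, one per tumor type, for one simulated cell line; `best` lists the tumor type with the highest TSI, which can be compared with the presumed origin.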

Results

We compiled gene expression data on tumor cell lines, corresponding tumors, and normal tissues measured by Affymetrix Hu6800 arrays (13, 14). The number of samples for each tumor type and tissue of origin is listed in Table 1. The tumors were mostly biopsy specimens with >50% malignant cell content but otherwise unselected and not exposed to any treatment (14).

Table 1. Number of samples per tissue origin or tumor type.

Sample | Cell lines* | Normal tissue† | Tumor tissue†
Breast | 8 | 5 | 11‡
CNS | 6 | 8 | 20§
Colon | 7 | 11 | 11
Kidney | 8 | 13 | 11
Leukemia | 6 | - | 30¶
Lung | 9 | 7 | 11
Melanoma | 8 | - | 10
Ovary | 6 | 3 | 11
Prostate | 2 | 9 | 10
Sum | 60 | 56 | 125

*Data from ref. 13.
†Data from ref. 14.
‡Breast subtypes: ER+ (n = 5), ER- (n = 6).
§CNS tumor types: glioblastoma (n = 10), medulloblastoma (n = 10).
¶Leukemia types: ALL-B (n = 10), ALL-T (n = 10), AML (n = 10).

Tissue- and Tumor-Specific Gene Expression in Cell Lines. Using SAM (17), we identified the number of up-regulated genes in normal tissues, tumor samples, and cell lines for each tissue of origin, respectively (Table 2). The number of up-regulated genes identified as tissue of origin determined was in general lower within the cell lines than in the normal and tumor tissues. Melanoma cell lines had the highest number of up-regulated genes (175 genes) whereas, for cell lines of breast and prostate tumor origin, we could not identify any statistically significant tissue of origin determined up-regulation of genes using the present criteria. For all normal tissues and tumor types, we could identify statistically significant up-regulated genes (Table 2).

Table 2. Overlap in up-regulated genes.

Data | Breast | CNS | Colon | Kidney | Leukemia | Lung | Melanoma | Ovary | Prostate | Total
Cell lines | 0 | 19 | 23 | 38 | 40 | 3 | 175 | 5 | 0 | 303
Normal tissue | 46 | 249 | 73 | 110 | - | 39 | - | 7 | 75 | 599
Tumor tissue | 8 | 539 | 30 | 34 | 591 | 16 | 42 | 11 | 61 | 1332
Intersection: cell lines and normal tissue | 0 (0%) | 3 (1%) | 5 (7%) | 3 (3%) | - | 0 (0%) | - | 0 (0%) | 0 (0%) | 11 (2%)
Intersection: cell lines and tumor tissue | 0 (0%) | 7 (1%) | 6 (20%) | 2 (6%) | 30 (5%) | 0 (0%) | 23 (55%) | 0 (0%) | 0 (0%) | 68 (5%)
Intersection: tumor tissue and normal tissue | 3 (7%) | 146 (59%) | 17 (23%) | 12 (11%) | - | 9 (23%) | - | 0 (0%) | 28 (37%) | 215 (36%)

The number of up-regulated genes is listed for each tissue origin (columns) and for cell lines and normal and tumor tissues (rows). The intersections of the gene lists of up-regulated genes are presented as the number of genes and as the percentage in parenthesis. We clarify with the following example for breast samples (breast column): We identified 0 up-regulated genes in cell lines, 46 in normal tissue and 8 in tumor tissue. Therefore, there was no overlap between genes identified in cell lines and normal or tumor tissue. Three of the genes however were up-regulated in both normal and tumor breast tissue, which represents 7% (3/46) of the genes identified in normal tissue.
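The overlap calculation in the breast example can be written as a short sketch. The function name `overlap` and the gene identifiers are made up for illustration; only the list sizes (46, 8, and 0 genes, with 3 shared between normal and tumor tissue) come from Table 2.

```python
def overlap(reference, other):
    """Number of shared genes and their percentage of the reference list."""
    reference, other = set(reference), set(other)
    shared = reference & other
    percent = 100.0 * len(shared) / len(reference) if reference else 0.0
    return len(shared), percent

# Hypothetical gene identifiers sized to match the breast column of Table 2.
normal_breast = {f"gene{i}" for i in range(46)}                           # 46 genes
tumor_breast = {"gene0", "gene1", "gene2"} | {f"t{i}" for i in range(5)}  # 8 genes
cell_line_breast = set()                                                  # 0 genes

shared, percent = overlap(normal_breast, tumor_breast)
print(shared, round(percent))        # 3 shared genes, 7% of the normal tissue list
print(overlap(normal_breast, cell_line_breast))
```

The percentage is always taken relative to the first list, so swapping the arguments answers a different question (the share of the second list that is recovered).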

Most importantly, the question is whether the genes up-regulated in the cell lines from a particular tissue of origin are the same genes as those up-regulated in the tumor that it was originally derived from. We compared the lists of up-regulated genes and identified their overlap (Table 2). In total, the overlap between tumors and the normal tissue was higher (36%; 215 genes) than between the cell lines and their corresponding tumors (5%; 68 genes) and normal tissues (2%; 11 genes).

The percentages of overlap in genes up-regulated by the cell lines, tumors, and normal tissues are summarized in Fig. 1. Depending on tissue of origin, we noted a large variation in the overlap between cell lines and their corresponding tumor and normal tissue. Melanoma cell lines showed the highest overlap (55%; 23 of 42 genes) with their corresponding tumors (Fig. 1a). It has previously been shown that melanoma cell lines still have up-regulation of many melanocyte-specific genes (2). The colon cell lines had up-regulation of 20% of the tumor-specific genes, whereas kidney- and leukemia-derived cell lines had up-regulation of only 6% and 5% of the tumor-specific genes, respectively (Fig. 1a). There was no overlap of up-regulated genes between cell lines and tumors for breast, prostate, ovary, and lung tissue of origin. Changing the criteria for statistical significance seems unlikely to increase the overlap (Fig. 5, which is published as supporting information on the PNAS web site).

Fig. 1.

Comparing tissue-specific up-regulation of genes in cell lines and their corresponding tumor and normal tissues (derived from Table 2). The percentages were derived from Table 2 by dividing the number of genes in the intersection (of cell lines and tissue and tumor respectively) by the total number of genes up-regulated in the respective tissues and tumors. (a) Percentage of up-regulated genes in normal (filled bars) and tumor tissues (open bars), respectively, that was also up-regulated in the corresponding cell lines. (b) Percentage of up-regulated genes in the cell lines that were also up-regulated by the normal (filled bars) and tumor tissues (open bars) respectively.

Moreover, we calculated the percentage of genes with tissue of origin determined up-regulation in the cell lines that were also up-regulated in the corresponding tumors and normal tissues, respectively (Fig. 1b). Leukemia cell lines had the highest percentage of genes overlapping (30 of 40 up-regulated genes), indicating that, in those cell lines, 75% of the up-regulated genes were also up-regulated in the tumors and that few genes (25%) have been significantly up-regulated upon in vitro establishment of these cell lines. By contrast, in melanoma cell lines, only 13% of the up-regulated genes were overlapping with the tumor-specific genes (Fig. 1b). Thus, even though melanoma-derived cell lines still had up-regulation of the highest number of tumor type-specific genes, these same cell lines also had many up-regulated genes, which were not up-regulated in the tumors.
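The two directions of comparison (Fig. 1a and Fig. 1b) use the same intersection but different denominators. The sketch below recomputes both percentages from the counts in Table 2; the dictionary layout and the function name are illustrative choices, not part of the original analysis.

```python
# Counts taken from Table 2: up-regulated genes in cell lines, in tumors,
# and in both (the intersection).
table2 = {
    "Melanoma": {"cell_lines": 175, "tumor": 42, "shared": 23},
    "Leukemia": {"cell_lines": 40, "tumor": 591, "shared": 30},
    "Colon": {"cell_lines": 23, "tumor": 30, "shared": 6},
    "Kidney": {"cell_lines": 38, "tumor": 34, "shared": 2},
}

def overlap_percentages(counts):
    """Return (percent of tumor genes kept, percent of cell line genes shared)."""
    of_tumor = 100.0 * counts["shared"] / counts["tumor"]            # Fig. 1a
    of_cell_lines = 100.0 * counts["shared"] / counts["cell_lines"]  # Fig. 1b
    return round(of_tumor), round(of_cell_lines)

for tissue, counts in table2.items():
    print(tissue, overlap_percentages(counts))
```

For melanoma this gives 55% of the tumor-specific genes but only 13% of the cell line genes, and for leukemia 5% and 75%, matching the values quoted in the text.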

TSI Results. Because cell lines from the same tissue of origin, as a group, expressed only a small percentage of their tissue-specific gene expression, we investigated them individually and detected a large individual variation in their expression of tissue-specific genes. We therefore developed a TSI to measure how well each cell line captured the characteristic gene expression of the tumor or tissue they derived from. We used SVD (18, 19) to capture the characteristic gene expression of the nine tumor types in a reduced dimensionality (the SVD space). We defined a centroid for each tumor type as the geometric center of the tumor samples in SVD space. Thereafter, we projected each cell line independently into the identical SVD space and assessed its TSI, defined as the correlation to the nine tumor type centroids (Fig. 2). If the gene expression of a cell line is similar to the gene expression of a particular tumor type, the correlation will be higher to that tumor type centroid and lower to the other tumor centroids.

Fig. 2.

Illustrating the TSI methodology. SVD was performed on a prefiltered gene expression matrix of the tumor samples. Then, each tumor sample was projected into SVD space by measuring its correlation to the 16 Eigenarrays (EA) with largest singular values. We calculated each tumor type-specific centroid, respectively, as the geometric center of the tumor samples in SVD space. Thereafter, we identically projected each cell line into the SVD space. Finally, a TSI score was calculated measuring the correlation of each cell line in SVD space to each tumor type centroid.

We investigated the performance of the TSI on the tumor samples and the cell lines as a function of the number of SVD dimensions and prefiltering of genes (Fig. 3). For good performance, the number of dimensions in the SVD space should at least exceed the number of distinct tumor types (nine in this study) (Fig. 3). Filtering the genes toward those that covaried with tumor and tissue types improved the accuracy of the TSI (Fig. 3). Accordingly, we set the dimensionality of SVD space to 16 and restricted the input to the 450 genes that covaried with tumor types (Fig. 3).

Fig. 3.

Investigating how the TSI depends upon the number of SVD dimensions used for projection and on the prefiltering of genes. (a) Displaying the percentage of tumor samples that had their maximum TSI toward their own tumor type (y axis) as a function of the dimensionality of SVD space (x axis). We used five different sets of genes as an input to the SVD: All 7070, all genes on the chip; Tumor Top 450, the 50 most differentially expressed genes per tumor (calculated by SAM); Normal Top 450, the 50 most differentially expressed genes for each of the seven normal tissues; Variation Filtering (3), variation filtering (max – min > 300; max/min > 3); and, finally, Variation Filtering (7), a stricter variation filtering (max – min > 700; max/min > 7). (b) Displaying the percentage of cell lines that had their maximum TSI score toward their own presumed tissue of origin (y axis) as a function of the dimensionality of SVD space (x axis). We used the same five gene sets as in a.
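The two variation filters named in the caption can be sketched as a boolean mask over genes. The function name `variation_filter` and the `floor` parameter are assumptions: the text does not say how non-positive expression values were handled before taking the ratio, so this sketch clips values at a hypothetical floor of 20. The expression values are made up.

```python
import numpy as np

def variation_filter(X, min_range, min_fold, floor=20.0):
    """Mask of genes (rows) with max - min > min_range and max/min > min_fold.

    `floor` is a hypothetical lower clip that keeps the ratio defined.
    """
    X = np.maximum(np.asarray(X, dtype=float), floor)
    hi, lo = X.max(axis=1), X.min(axis=1)
    return (hi - lo > min_range) & (hi / lo > min_fold)

# Four made-up genes measured in three samples.
X = [
    [100, 150, 120],     # small range: fails both filters
    [200, 900, 400],     # range 700, fold 4.5: passes only the looser filter
    [1000, 1500, 1200],  # range 500 but fold 1.5: fails both filters
    [50, 2000, 300],     # range 1950, fold 40: passes both filters
]
loose = variation_filter(X, min_range=300, min_fold=3)    # Variation Filtering (3)
strict = variation_filter(X, min_range=700, min_fold=7)   # Variation Filtering (7)
```

Both conditions must hold, so a gene with a large absolute range but a small fold change (third row) is still excluded.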

Using these parameters, we calculated the TSI for all 60 cell lines toward the nine tumor types. In total, 34 of the 60 cell lines had their highest TSI score toward their own corresponding tumor, whereas the remaining 26 cell lines did not show their highest gene expression similarity with their presumed tumor tissue of origin. The tissue of origin distribution of the 34 cell lines that were most similar to their corresponding tumor was uneven: melanomas (9 of 10), leukemia (5 of 6), breast (5 of 6), CNS (4 of 6), colon (4 of 7), ovary (2 of 6), lung (3 of 9), kidney (2 of 8), and prostate (0 of 2). The TSIs for the 60 cell lines toward all nine tumor types are presented in Fig. 4 and also listed in Table 3, which is published as supporting information on the PNAS web site. To make the discussion easier, we have divided the TSI scores into high (>0.6), medium (0.4–0.6), and low (<0.4) categories.
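The score categories and the "highest TSI" rule can be expressed as a small helper. The function names are hypothetical, and treating the boundary values 0.4 and 0.6 as medium is an assumption, because the text gives the cut-offs only as >0.6, 0.4–0.6, and <0.4. The individual scores in the first three calls are ones reported in the Results; the dictionary passed to `best_match` contains made-up values for an imaginary cell line.

```python
def tsi_category(score):
    """Map a TSI score to the high / medium / low categories used in the text."""
    if score > 0.6:
        return "high"
    if score >= 0.4:      # assumption: boundary values count as medium
        return "medium"
    return "low"

def best_match(scores):
    """Tumor type with the highest TSI for one cell line, and its category."""
    tumor = max(scores, key=scores.get)
    return tumor, tsi_category(scores[tumor])

print(tsi_category(0.86))    # UACC257 toward melanoma
print(tsi_category(0.46))    # UO31 toward renal tumors
print(tsi_category(-0.16))   # LOXIMVI toward melanoma

# Made-up scores for a hypothetical cell line presumed to be of lung origin.
example = {"lung": 0.21, "renal": 0.70, "CNS": 0.05}
print(best_match(example))
```

A cell line whose best match differs from its presumed origin and falls in the high category is the situation the text flags as a possible misclassification.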

Fig. 4.

TSI scores for 60 cell lines compared with nine tumor types. Each graph (a–j) independently presents the TSI scores for the 60 cell lines to a particular tumor type. Within each graph, the TSI scores to the left (•) are those of the cell lines presumed to originate from that tumor tissue, and displayed to the right (×) are the TSI scores for all other cell lines originating from other tumor tissues. Most tumors have cell lines that received high TSI scores, with the exception of the renal and prostate cell lines. The gray area indicates the region of medium TSI scores, and scores above the gray area are consequently high. The best cell lines for the different tumors were labeled. Seven cell lines received high TSI for another tumor than their presumed origin. These were also labeled in the right parts of each graph. The tumor types are as follows: a, melanoma; b, leukemia; c, breast ER-negative; d, breast ER-positive; e, CNS; f, colon; g, ovary; h, lung; i, renal; and j, prostate.

TSI on the NCI60 Cell Lines. Melanoma cell lines (Fig. 4a). All melanoma cell lines, except LOXIMVI, had their highest TSI score for melanomas. Three cell lines had high TSI scores for melanomas (UACC257, 0.86; SKMEL28, 0.8; and MALME3M, 0.76), and these cell lines had gene expression profiles most similar to the melanomas. The melanoma cell line LOXIMVI had a low TSI score for melanomas (–0.16). It has previously been shown that this cell line lacks melanin expression (22). Interestingly, LOXIMVI received a high TSI score for leukemia (Fig. 4b). This cell line may have been misclassified and may be of hematopoietic origin.

Leukemia cell lines (Fig. 4b). Four of the six cell lines derived from leukemias received high TSI scores for leukemias (HL60, CCRF-CEM, MOLT4, and K562) whereas one cell line received a medium TSI score (SR). Leukemia cell line RPMI8226 was not identified as a leukemia (TSI score of 0.06), but this result is not surprising because RPMI8226 is derived from a multiple myeloma whereas the leukemia samples were from acute lymphoblastic leukemia (n = 20) and acute myeloid leukemia (n = 10).

Breast cell lines (Fig. 4 c and d). The breast tumor samples were heterogeneous, and we therefore separated the breast tumor samples into estrogen receptor (ER)-positive and -negative breast tumors. This separation was defined by their expression of associated keratins 7, 8, 18, and 19 (23). Originally, the NCI60 cell line panel contained eight breast cell lines. However, two cell lines (MDA-MB-435 and MDN) were recently reclassified to be of melanoma origin (2, 24). These two cell lines were correctly identified as of melanoma origin by using the TSI (scores of 0.74 and 0.6). This evidence demonstrated the predictability and usefulness of TSI. Three breast cell lines had high to medium TSI scores for ER-positive breast tumors (MCF7, BT549, and T47D), whereas breast cell line MDAMB231 had a high TSI score for ER-negative breast tumors. NCI/ADR-RES had low TSI for all tumors. Possible explanations for a low TSI score toward all tumor types are given in Discussion.

CNS cell lines (Fig. 4e). Four of the CNS-derived cell lines were identified as most similar to CNS tumors, two of which had high TSI scores (SNB19 and U251). Two cell lines (SF268 and SF539) got low TSI scores for CNS tumors (0.16 and 0.04). These cell lines expressed few neuronal markers.

Colon cell lines (Fig. 4f). Four of the colon cell lines (HCC-2998, COLO205, HCT15, and KM12) had high TSI for colon cancer. The remaining three colon cell lines (HT29, HCT116, and SW620) had low TSI for colon cancer.

Ovarian cell lines (Fig. 4g). Two of the six ovarian cell lines had highest TSI scores for ovarian tumors (OVCAR3 and OVCAR5). OVCAR8 had a high TSI score for ER-negative breast tumor. The same cell line was recently shown by karyotyping to be related to the breast cell line NCI/ADR-RES (10).

Lung cell lines (Fig. 4h). HOP92 had the highest TSI score for lung tumors (0.68) whereas NCIH226 and H460 had medium TSI scores for lung tumors. Three of the cell lines had high TSI scores for other tumors, suggesting a different origin: NCIH522 had a TSI score of 0.76 for CNS tumor and A549ATCC had a TSI score of 0.70 for renal tumor. Tumors of both CNS and renal origin often metastasize to the lung. Furthermore, HOP62 had a high TSI score for leukemia (TSI 0.7).

Renal cell lines (Fig. 4i). Only two of eight renal cell lines had a medium score for renal tumors (UO31, 0.46; ACHN, 0.49). Of the remaining renal cell lines, two had high TSI scores for leukemia (SN12C and TK10).

Prostate cell lines (Fig. 4j). There are only two prostate cell lines in the NCI60 panel and both these cell lines received low TSI scores for prostate tumors (0.21 and –0.06) as well as for other tumors.

Discussion

Cell lines are routinely used in experimental research as model systems for normal or pathological tissues (25). They are also often used for applied purposes, such as aiding the isolation of specific tumor antigens/peptides or drug screening. Despite their successful use in basic research, the use of cell lines as model systems for tumors is controversial (8, 9, 25–28). Quantitative assessment of the gene expression similarities of cell lines and their corresponding tumors and tissues of origin has been missing, making it difficult to evaluate how representative individual cell lines actually are.

Here, we provide a simple index of how well each cell line reflects the gene expression of the tumors of nine tissues of origin. Thus, we identified the cell lines that had gene expression profiles that closely resembled those of their corresponding in vivo tumor. Moreover, we identified cell lines that did not show any similarities toward their presumed tumor origin; these cell lines are not appropriate as model systems for the tumors. When a cell line receives a high TSI score toward a particular tumor, it indicates that the genes characteristic for the tumors were similarly expressed in the cell line. Therefore, all cell lines that receive a high TSI score toward their origin were authenticated by using this method. Thirty-four cell lines of the NCI60 panel had their highest TSI score toward their presumed origin. The remaining 26 cell lines, however, did not receive their highest TSI scores toward their presumed tumor of origin. There are different possible explanations for a cell line that receives a low TSI for its presumed tumor origin. It is possible that these cell lines were derived from a subtype of the tumor not represented in the tumor biopsies. It is also possible that these cell lines have lost the differentiated phenotype of their tumors of origin or that the tumor, from which the cell line was derived, arose from a progenitor cell that lacked the gene expression associated with differentiated cells from that tissue. Furthermore, it cannot be excluded that the original classification might not be correct due to metastasis or cultivation problems (9, 25). Follow-up studies with extended material of a specific tumor type will clarify this issue. A low TSI score for a cell line toward its presumed origin is a warning sign that motivates caution in the use of these cell lines. For prostate and renal tumors, no cell line received high TSI. A possible explanation is that the tumor samples were too heterogeneous to allow any generalization in gene expression.

Using the TSI, we correctly identified the recent reclassification of two cell lines originally considered of breast origin (MDA-MB-435 and MDN) as of melanoma origin (2, 24). In addition, seven other cell lines had high TSI scores for a tumor origin different from the postulated one, raising the issue of incorrect classification of these as well. Three lung cancer cell lines (NCIH522, A549ATCC, and HOP62) should be investigated for their origin, because TSI indicates a risk that they may have been derived from metastatic tumors of, e.g., kidney, brain, and leukemia to the lung. In addition, two renal cell lines (SN12C and TK10) and a melanoma cell line (LOXIMVI) received high TSI scores for leukemia, raising the issue of incorrect origin. The ovarian cell line OVCAR8, with a karyotype that relates to the breast cell line NCI/ADR-RES (10), had a high TSI score for ER-negative breast tumor. Nine cell lines had a low TSI score for all nine tumors of origin. Again, it could be due to a modification of the differentiated phenotype in vitro due to the changed environment or growth conditions, or it could be that the true origins of these cell lines were different from the nine tumors represented in our study.

The NCI60 cell lines have been successfully used as model systems in several applications, including screening for novel chemotherapeutic compounds. The pharmaceutical industry is currently extending this use to screening for other tumor-specific drugs. The present methodology could therefore be used to ensure that the panel of cell lines used in such screening have gene expression profiles that resemble those of their intended tumors. It will be of interest to find out whether removing the cell lines of questionable origin and low TSI scores toward their tumors will significantly improve the prediction accuracy of screening efforts (11, 13). In addition, efforts to derive candidate tumor-specific antigens or peptides could be made more efficient by optimizing the cell line panel used.

We have presented a gene expression-based TSI that measures how each cell line relates to its corresponding tumors of origin. The TSI uses SVD to capture the fundamental patterns within the gene expression data and to project the gene expression profile of a sample into an SVD space with reduced dimensionality. SVD has previously been successfully applied to gene expression data (18–20). However, other closely related dimension reduction techniques such as principal components analysis and partial least squares (29) could also be used. The strength of these methods is their ability to capture biologically meaningful patterns within the global gene expression data and to sort out noise and possible experimental artifacts (19). We used correlation to measure the similarity between each cell line and the tumor centroids, but other metrics such as Pearson correlation and Euclidean distance could also be applied.

We envision that the TSI presented in this study could provide guidance for choosing the appropriate cell lines for experimental research and drug development, because the tissues of origin are often a decisive factor for successful research. As gene expression becomes a more widely used diagnostic tool, it may also be relevant in choice of therapy. Furthermore, the TSI methodology presented here is a general method that could be extended to analyses of other datasets where model systems are examined based upon the expression of genes or proteins.


Acknowledgments

We thank Anna Birgersdotter for fruitful discussions. The research was funded by the Swedish Knowledge Foundation, the Swedish Cancer Society, and the Swedish Research Council.

Author contributions: R.S. and I.E. designed research; R.S. performed research; R.S. analyzed data; and R.S. and I.E. wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: TSI, tissue similarity index; SVD, singular value decomposition; SAM, significance analysis of microarrays; ER, estrogen receptor.

References

  • 1. Perou, C. M., Jeffrey, S. S., van de Rijn, M., Rees, C. A., Eisen, M. B., Ross, D. T., Pergamenschikov, A., Williams, C. F., Zhu, S. X., Lee, J. C., et al. (1999) Proc. Natl. Acad. Sci. USA 96, 9212–9217.
  • 2. Ross, D. T., Scherf, U., Eisen, M. B., Perou, C. M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S. S., Van de Rijn, M., Waltham, M., et al. (2000) Nat. Genet. 24, 227–235.
  • 3. Hoffman, R. M. (1991) Cancer Cells 3, 86–92.
  • 4. Hoffman, R. M. (1991) J. Clin. Lab. Anal. 5, 133–143.
  • 5. Stein, W. D., Litman, T., Fojo, T. & Bates, S. E. (2004) Cancer Res. 64, 2805–2816.
  • 6. Monks, A., Scudiero, D., Skehan, P., Shoemaker, R., Paull, K., Vistica, D., Hose, C., Langley, J., Cronise, P., Vaigro-Wolff, A., et al. (1991) J. Natl. Cancer Inst. 83, 757–766.
  • 7. Hsu, M.-Y., Elder, D. A. & Herlyn, M. (1999) in Cancer Cell Lines, Human Cell Culture, eds. Masters, J. R. W. & Palsson, B. (Kluwer, Dordrecht, The Netherlands), Part 1, Vol. 1.
  • 8. Masters, J. R. (2002) Nat. Rev. Cancer 2, 315–319.
  • 9. Drexler, H. G., Dirks, W. G., Matsuo, Y. & MacLeod, R. A. (2003) Leukemia 17, 416–426.
  • 10. Roschke, A. V., Tonon, G., Gehlhaus, K. S., McTyre, N., Bussey, K. J., Lababidi, S., Scudiero, D. A., Weinstein, J. N. & Kirsch, I. R. (2003) Cancer Res. 63, 8634–8647.
  • 11. Scherf, U., Ross, D. T., Waltham, M., Smith, L. H., Lee, J. K., Tanabe, L., Kohn, K. W., Reinhold, W. C., Myers, T. G., Andrews, D. T., et al. (2000) Nat. Genet. 24, 236–244.
  • 12. Nishizuka, S., Charboneau, L., Young, L., Major, S., Reinhold, W. C., Waltham, M., Kouros-Mehr, H., Bussey, K. J., Lee, J. K., Espina, V., et al. (2003) Proc. Natl. Acad. Sci. USA 100, 14229–14234.
  • 13. Staunton, J. E., Slonim, D. K., Coller, H. A., Tamayo, P., Angelo, M. J., Park, J., Scherf, U., Lee, J. K., Reinhold, W. O., Weinstein, J. N., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 10787–10792.
  • 14. Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 15149–15154.
  • 15. Bolstad, B. M., Irizarry, R. A., Astrand, M. & Speed, T. P. (2003) Bioinformatics 19, 185–193.
  • 16. Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B. & Speed, T. P. (2003) Nucleic Acids Res. 31, e15.
  • 17. Tusher, V. G., Tibshirani, R. & Chu, G. (2001) Proc. Natl. Acad. Sci. USA 98, 5116–5121.
  • 18. Holter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J. R. & Fedoroff, N. V. (2000) Proc. Natl. Acad. Sci. USA 97, 8409–8414.
  • 19. Alter, O., Brown, P. O. & Botstein, D. (2000) Proc. Natl. Acad. Sci. USA 97, 10101–10106.
  • 20. Wall, M. E., Rechtsteiner, A. & Rocha, L. M. (2003) in A Practical Approach to Microarray Data Analysis, eds. Berrar, D. P., Dubitzky, W. & Granzow, M. (Kluwer, Norwell, MA), Ch. 5.
  • 21. Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. & Church, G. M. (1999) Nat. Genet. 22, 281–285.
  • 22. Stinson, S. F., Alley, M. C., Kopp, W. C., Fiebig, H. H., Mullendore, L. A., Pittman, A. F., Kenney, S., Keller, J. & Boyd, M. R. (1992) Anticancer Res. 12, 1035–1053.
  • 23. Abd El-Rehim, D. M., Pinder, S. E., Paish, C. E., Bell, J., Blamey, R., Robertson, J. F., Nicholson, R. I. & Ellis, I. O. (2004) J. Pathol. 203, 661–671.
  • 24. Ellison, G., Klinowska, T., Westwood, R. F., Docter, E., French, T. & Fox, J. C. (2002) Mol. Pathol. 55, 294–299.
  • 25. Masters, J. R. (2000) Nat. Rev. Mol. Cell Biol. 1, 233–236.
  • 26. Arlett, C. F. (2001) Lancet Oncol. 2, 467.
  • 27. Clayton, D. F. & Darnell, J. E., Jr. (1983) Mol. Cell. Biol. 3, 1552–1561.
  • 28. Drexler, H. G., Uphoff, C. C., Dirks, W. G. & MacLeod, R. A. (2002) Leuk. Res. 26, 329–333.
  • 29. Nguyen, D. V. & Rocke, D. M. (2002) Bioinformatics 18, 39–50.

