Supporting information for Garber et al. (November 13, 2001) Proc. Natl. Acad. Sci. USA, 10.1073/pnas.241500798.

Materials and Methods

The complete primary data set, including expanded versions of the figures, can be found at http://genome-www.stanford.edu/lung_cancer/adeno.

cDNA Microarrays.

Microarrays used in the analysis of lung tumors contained 23,100 cDNA clones that map to 17,108 unique genes as defined by Unigene cluster ID. A total of 1,350 clones did not map to a Unigene cluster ID. mRNA was isolated from grossly dissected human lung tissue (fetal lung mRNA was obtained from CLONTECH) and reverse-transcribed in the presence of dUTP conjugated with Cy5. An equal quantity of a common reference mRNA pool (1), composed of 11 cell lines, was similarly labeled with Cy3. The two samples were mixed and hybridized to each microarray. The quantities of tumor and reference cDNA hybridized to each spot on the array, as determined by fluorescence, were used to calculate the ratio of Cy5-labeled sample to Cy3-labeled reference. Because all tumors were compared with the same common reference, the tumors can be compared indirectly with one another. Similarly, the relative amount of a gene expressed in one tumor can be compared with the amount of the same gene expressed in the other lung tumors and tissue samples.
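The indirect comparison through a common reference can be illustrated with a short worked example. This is a minimal sketch only: the function name and the fluorescence intensities are hypothetical and are not taken from the data set.

```python
def sample_to_reference_ratio(cy5_sample, cy3_reference):
    """Ratio of Cy5 (sample) to Cy3 (reference) fluorescence for one spot."""
    return cy5_sample / cy3_reference

# Hypothetical intensities for the same clone on two different arrays.
ratio_tumor_a = sample_to_reference_ratio(1200.0, 400.0)
ratio_tumor_b = sample_to_reference_ratio(600.0, 800.0)

# Both ratios share the same reference pool, so their quotient estimates
# the expression of the gene in tumor A relative to tumor B.
relative_a_to_b = ratio_tumor_a / ratio_tumor_b
```

Raw intensities on different arrays are not directly comparable, which is why the comparison is made on the ratios rather than on the Cy5 signal alone.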

Gene List Selection.

All 23,100 cDNA clones on the microarray were initially screened for low intensity signal in the reference (signal intensity/background = 1.5). Clones were then filtered by requiring 92% good data, which allowed missing or poorly measured "gray" values in only five of 73 measurements for any given gene. Using these two filters, we obtained a list of well-measured genes and used this list as the starting point to select a gene list as previously described (ref. 1; see http://genome-www.stanford.edu/molecularportraits/mandm.shtml). Euclidean distance rather than correlation was used in this study.
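The good-data filter can be sketched as follows. The representation is an assumption made for illustration: each clone is a list of 73 measurements in which a missing or poorly measured value is stored as None, and the function name is hypothetical.

```python
def passes_good_data_filter(values, min_fraction=0.92):
    """Return True if at least min_fraction of the measurements are present."""
    present = sum(1 for v in values if v is not None)
    return present / len(values) >= min_fraction

# Hypothetical clones with 73 measurements each.
clone_five_missing = [0.1] * 68 + [None] * 5   # 68/73 is about 93.2% good data
clone_six_missing = [0.1] * 67 + [None] * 6    # 67/73 is about 91.8% good data

keep_five = passes_good_data_filter(clone_five_missing)
keep_six = passes_good_data_filter(clone_six_missing)
```

With 73 measurements, a 92% threshold keeps a clone with five missing values and rejects one with six, which matches the limit stated above.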

We focused our analysis of lung tumors on a gene list containing 918 cDNA clones representing 835 unique genes. To select these genes, we calculated an "effect" for each gene in the well-measured list of genes described above. An effect was defined as the distance between two samples over the entire list of genes including the gene of interest minus the distance between the same two samples without the gene of interest. The effect for each gene was calculated for all pairwise combinations of tumor samples. A score was given to each gene in the list that was defined as the average effect of the gene across the 11 tumor pairs (see text) minus the average effect of the gene for all pairwise combinations excluding the 11 tumor pairs. A gene with a high score represents low variation for the 11 paired samples and high variation in expression for all pairwise combinations excluding the 11 tumor pairs. Genes with a score higher than one standard deviation from the average score were selected for further analysis. Using a gene list allowed us to concentrate on the genes that best represented differences between the tumors, rather than differences between different samples from the same patient.
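The effect calculation can be sketched as follows, using Euclidean distance as stated above. This is an illustration under assumptions, not the code used in the study: the function names, the sample names, and the toy expression values are hypothetical. The sketch returns the two average effects from which the score is formed, and does not implement the final one-standard-deviation cutoff.

```python
import math
from itertools import combinations

def distance(a, b, skip=None):
    """Euclidean distance between two samples, optionally omitting one gene index."""
    return math.sqrt(sum((x - y) ** 2
                         for i, (x, y) in enumerate(zip(a, b)) if i != skip))

def effect(a, b, gene):
    """Distance over all genes minus the distance without the gene of interest."""
    return distance(a, b) - distance(a, b, skip=gene)

def average_effects(samples, paired, gene):
    """Mean effect of one gene over the paired samples and over all other pairs.

    samples: dict mapping sample name to a list of expression values.
    paired:  set of frozensets, each naming two samples from the same patient.
    """
    within, other = [], []
    for s, t in combinations(sorted(samples), 2):
        e = effect(samples[s], samples[t], gene)
        (within if frozenset((s, t)) in paired else other).append(e)
    return sum(within) / len(within), sum(other) / len(other)

# Toy data: gene 0 differs between patients, gene 1 differs within a patient.
toy_samples = {
    "p1_a": [1.0, 0.0],
    "p1_b": [1.0, 0.1],
    "p2_a": [-1.0, 0.0],
}
toy_paired = {frozenset(("p1_a", "p1_b"))}

paired_mean, other_mean = average_effects(toy_samples, toy_paired, gene=0)
```

In the toy data, gene 0 has no effect on the distance between the two samples from patient 1 but a large effect on the distances to patient 2, which is the pattern the gene list is meant to capture.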

Hierarchical Clustering Analysis.

CLUSTER and TREEVIEW software were obtained from M. Eisen at http://www.microarrays.org. Both genes and arrays were median-centered. The two gene filters used to generate the gene list were also used for hierarchical clustering. Complete linkage clustering for AC group 3 resulted in a higher correlation to both the mean and median centroid than average linkage clustering. Results of the cluster analysis were nearly identical whether or not the weighting was reduced accordingly for clones duplicated on the microarray.
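Median centering of genes (rows) and arrays (columns) can be sketched as follows. This is a minimal illustration on a small complete matrix. The order of the two steps, the single pass, and the function names are assumptions, and the sketch does not reproduce the CLUSTER implementation or its handling of missing values.

```python
from statistics import median

def median_center_rows(matrix):
    """Subtract each gene's (row's) median from its values."""
    centered = []
    for row in matrix:
        m = median(row)
        centered.append([v - m for v in row])
    return centered

def median_center_columns(matrix):
    """Subtract each array's (column's) median from its values."""
    col_medians = [median(col) for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, col_medians)] for row in matrix]

# Hypothetical expression values: 3 genes (rows) by 3 arrays (columns).
data = [[1.0, 2.0, 6.0],
        [0.0, 4.0, 5.0],
        [3.0, 3.0, 9.0]]

# One pass of gene centering followed by array centering. After the second
# step every column median is zero, but row medians may drift from zero.
centered = median_center_columns(median_center_rows(data))
column_medians = [median(col) for col in zip(*centered)]
```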

We clustered only AC tumors along with the six normal tissue samples by using the same gene list and cluster parameters that were used to cluster the entire lung dataset. Results showed that AC group 3 tumors, as defined in Fig. 1, clustered together on a separate branch with average or complete linkage clustering (data not shown). This analysis excluded patients 59 squamous and 139 large cell, which clustered with AC group 3 but were not diagnosed morphologically as AC. These observations suggested that AC group 3 was robust and distinct from the other AC tumors in the absence of SCC, LCLC, and SCLC.

Nonparametric t Test.

Genes characteristic of the different AC subgroups were selected from the 918 cDNA clones used to analyze the lung dataset. Missing values were estimated by using the KNNimpute algorithm (2). Gene selection was based on a nonparametric t test using a P value cutoff of 0.001 and 100,000 column permutations. The complete list of genes for each of the four categories discussed in the text may be obtained from Table 2. A smaller subset of genes, shown in Fig. 5, was subjectively chosen from these lists to represent the AC subgroups.
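A permutation-based t test for a single gene can be sketched as follows. The details are assumptions, because the text does not specify them: the sketch uses a Welch t statistic, a two-sided comparison, and random reassignment of sample (column) labels between two groups. The function names and expression values are hypothetical, and far fewer permutations are used than the 100,000 reported above.

```python
import math
import random
from statistics import mean, variance

def t_statistic(x, y):
    """Welch t statistic for two groups (each needs at least two values)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    diff = mean(x) - mean(y)
    if se == 0:
        # Degenerate case: no spread within either group.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided P value obtained by shuffling sample labels between groups."""
    rng = random.Random(seed)
    observed = abs(t_statistic(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted = abs(t_statistic(pooled[:len(x)], pooled[len(x):]))
        if permuted >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical expression values of one gene in two tumor subgroups.
group_1 = [2.1, 2.4, 1.9, 2.6, 2.2, 2.5]
group_2 = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
p_value = permutation_p_value(group_1, group_2)
```

A gene would be retained when its P value falls below the chosen cutoff. Reaching a cutoff as small as 0.001 requires both enough samples and enough permutations to resolve P values of that size.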

Kaplan–Meier Analysis.

Patient survival analysis was performed by using SAS software. The P value was obtained by the Mantel–Haenszel log-rank test.
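The Kaplan–Meier (product-limit) estimate of survival can be sketched as follows. This is an illustration only and does not reproduce the SAS procedure or the log-rank test. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    Returns a list of (time, survival probability) at each observed death time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Patients who died or were censored at t leave the risk set.
        at_risk -= leaving
    return curve

# Hypothetical follow-up times (months) and event indicators.
followup = [5, 8, 8, 12, 15]
observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(followup, observed)
```

For these hypothetical data the estimated survival steps down to approximately 0.8, 0.6, and 0.3 at months 5, 8, and 12. Censored patients lower the number at risk without lowering the survival estimate.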

References:

1. Perou, C. M., Sorlie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Rees, C. A., Pollack, J. R., Ross, D. T., Johnsen, H., Akslen, L. A., et al. (2000) Nature (London) 406, 747–752.

2. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. & Altman, R. B. (2001) Bioinformatics 17, 520–525.