Abstract
The very high dimensional space of gene expression measurements obtained by DNA microarrays impedes the detection of underlying patterns in gene expression data and the identification of discriminatory genes. In this paper we show the use of projection methods such as principal components analysis (PCA) to obtain a direct link between patterns in the genes and patterns in samples. This feature is useful in the initial interactive pattern exploration of gene expression data and data-driven learning of the nature and types of samples. Using oligonucleotide microarray measurements of 40 samples from different normal human tissues, we show that distinct patterns are obtained when the genes are projected on a two-dimensional plane spanned by the loadings of the two major principal components. These patterns define the particular genes associated with a sample class (i.e., tissue). When used separately from the other genes, these class-specific (i.e., tissue-specific) genes in turn define distinct tissue patterns in the projection space spanned by the scores of the two major principal components. In this study, PCA projection facilitated discriminatory gene selection for different tissues and identified tissue-specific gene expression signatures for liver, skeletal muscle, and brain samples. Furthermore, it allowed the classification of nine new samples belonging to these three types using the linear combination of the expression levels of the tissue-specific genes determined from the first set of samples. The application of the technique to other published data sets is also discussed.
[Online supplementary material available at www.genome.org.]
DNA microarrays are presently used extensively for genome-wide gene expression measurements. Large-scale transcriptional studies have catalyzed new discoveries and are generating important new insights into the behavior and functioning of cells (Spellman et al. 1998; Perou et al. 1999; Alizadeh et al. 2000; Hughes et al. 2000). Class discovery tools have played a key role in this process. Class discovery methods are exploratory analysis tools used to organize, learn from, and discover patterns in the data. Of the various multivariable techniques available, clustering of genes and samples has been the most common tool used for the analysis of microarray data (Eisen et al. 1998; Spellman et al. 1998; Perou et al. 1999; Tamayo et al. 1999; Alizadeh et al. 2000; Hughes et al. 2000). Before proceeding to cluster, it is often advantageous to visualize the data to develop an understanding of underlying structure. This initial exploration is useful in revealing patterns and providing clues for further analysis.
Principal component analysis (PCA) is a linear projection method that defines a new dimensional space that captures the maximum information present in the initial data set by minimizing the error between the original data set and the reduced dimensional data set. Each principal direction of the projection space, or principal component (PC), is defined such as to be orthonormal to all others and to maximize the information in the data that has not already been captured by the previous (lower) dimensions. In this way, as the number of PCs progressively increases, a larger fraction of the total information content is accounted for. PCA is a linear projection in the sense that the variables of the projection space (PCs) are linear combinations of the original variables (i.e., the gene expressions). The coefficients of this linear combination are called loadings and the actual values of the projection of the samples are called scores. PCA is obtained from a singular value decomposition of the data, and the loadings are the entries in the singular vector and are associated with genes. The scores are contained in the matrix obtained from a multiplication of the original data matrix with the singular vectors and are associated with samples. Standard formulas are available for the determination of the projection variables, loadings, and captured variability (Dillon and Goldstein 1984), and many applications of PCA have been reported in a variety of different contexts (Kamimura 1997; Rannar et al. 1998; Alter et al. 2000; Holter et al. 2000).
In this paper we use PCA to analyze a set of microarray measurements on normal human tissues. Initial projection onto a lower dimensional space allows for better visualization of the entire data set. The loadings are subsequently used to select relevant genes while considering the impact of the removal of irrelevant genes on the patterns observed in the projection of the samples. This is an alternate approach to the problem of selection of relevant genes in the analysis of microarray data (Golub et al. 1999) and may be used to obtain a subset of genes that best describe the data. The observation of clear gene-expression patterns after the removal of irrelevant genes points to a high degree of structure in the measurements. Exploration of these gene expression patterns further revealed tissue-specific gene expression signatures. These signatures were further supported by the analysis of additional tissue samples that had not been used in the initial pattern-discovery step.
RESULTS
The data set used in this study comprised expression measurements of 7070 genes made in 40 normal human tissue samples using Affymetrix GeneChips. The data were generated at the Brigham and Women's Hospital (BWH) in Boston (Hsiao et al. 2001). Samples from several human tissues were analyzed; here we use the samples from brain, kidney, liver, lung, esophagus, skeletal muscle, breast, stomach, colon, blood, spleen, prostate, testes, vulva, proliferative endometrium, myometrium, placenta, cervix, and ovary.
PCA Loadings Can Be Used to Filter Irrelevant Genes
The data from the 40 human tissues were first projected using PCA, which may be used with or without scaling (mean-centering, or autoscaling, among others). Here, we did not scale the data, and comparisons with mean-centered results are provided in Discussion. The first and second PCs account for ∼70% of the information present in the entire data set. The score plot of the 40 samples using the entire gene expression set is shown in Figure 1A. Plotted in Figure 1B are the loadings for each of the 7070 genes for the first and second PCs. The loading plot reveals a large number of genes clustered around the origin, implying that they only marginally impact the projection onto the first and second PC. Because the relative magnitude of the loading is a measure of the importance of the corresponding gene in defining the PC, a small magnitude implies that the corresponding gene expression does not materially impact that particular PC. On this basis, a filter that eliminates genes with loadings below a threshold in all of the first five PCs was implemented. The decisions that went into the choice of the threshold are shown in Figure 1E. The threshold was varied over a large range, and at each threshold value a record was maintained of the number of genes retained for analysis and the distortions in the score plot due to the elimination of genes. As the threshold value was gradually increased, the samples were re-projected using the subset of genes passing the filter. The distortion from the original score plot was measured in terms of the squared difference, defined as the sum of the squares of the difference between the 40 original score values and the 40 score values produced with the filtered gene set (this is defined mathematically in Methods). In essence, this squared difference measures the error between the original projections and the new sample projections (or the distortion of the original pattern) as more and more genes are removed. 
When the threshold value exceeded 0.001, a large fraction of the genes were filtered out, precipitating large distortions in the patterns on the score plot. This criterion eliminated all but 425 genes with loadings in at least one of the first five PCs that exceeded the threshold value. A projection of the samples using only these 425 genes reveals an almost identical pattern on the score plot with the one obtained when all 7070 genes were used (Fig. 1C). This suggests that the dramatic reduction from the initial 7070 genes to the 425 finally retained resulted in a minimal information loss relevant to the description of the samples in the reduced space. Thus, a PCA framework may be used to evaluate the effect of gene removal on expression patterns observed in the reduced dimensional space.
Identification of Tissue-Specific Gene Expression Patterns: Correspondence between Score and Loading Plots
Three linear structures can be identified in the loading plot of the 425 genes selected by the above analysis, each structure comprising a set of genes arranged along a particular angle in Figure 1D. These linear structures suggest a certain degree of organization in gene expression, reflected in the linear relationships between the loadings of the first and second PCs of the genes clustered in these structures. An obvious question is whether there is any correlation among the genes that define these structures. Figure 2 shows the results of a systematic exploration of the patterns depicted in Figure 1D. Plotted in Figure 2A is a histogram of the angles defined by the X-axis and the points representing the loadings of the first two PCs for the 425 consequential genes identified above. This histogram defines three clusters, one for each of the three structures identified in Figure 1D. The first, termed structure A, comprises genes with angles between 1.452 and 1.469 radians. The second, structure B, is centered around the second peak, with angles between −1.222 and −1.205 radians, and the third, structure C, is a set of genes with angles between −0.328 and 0.054 radians. The list of genes so selected was further refined by clustering the genes on the basis of their distance from the origin, to prevent the inclusion of genes that may have the same angle but are far removed from the structures in Figure 1D (the clustering results are discussed and provided in the Supplementary Materials available online at www.genome.org). The final list of selected genes is provided in Table 1.
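The angle-based selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy loadings, and the use of `np.arctan2` to measure the angle from the PC1 axis are all assumptions, and the band limits are chosen for the toy data rather than taken from the paper.

```python
import numpy as np

def loading_angles(L):
    """Angle (radians) between the PC1 axis and each gene's (PC1, PC2) loading point."""
    return np.arctan2(L[:, 1], L[:, 0])

def genes_in_band(L, low, high):
    """Indices of genes whose loading angle lies in [low, high]."""
    theta = loading_angles(L)
    return np.flatnonzero((theta >= low) & (theta <= high))

# Toy loadings (hypothetical): three genes on a steep ray, two near the PC1 axis.
L = np.array([[0.01, 0.10],
              [0.02, 0.20],
              [0.03, 0.30],
              [0.20, 0.01],
              [0.30, -0.02]])

# Histogram of angles, analogous in spirit to the one used to spot the structures.
counts, edges = np.histogram(loading_angles(L), bins=20)

steep = genes_in_band(L, 1.4, 1.5)      # genes along the steep ray
shallow = genes_in_band(L, -0.1, 0.1)   # genes near the PC1 axis
print(steep, shallow)
```

Genes that share an angle but lie at very different distances from the origin would still need the distance-based refinement mentioned in the text; that step is not shown here.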
Table 1.
Gene ID | Ratio of means | Loading | Gene description |
---|---|---|---|
Liver-specific signature | | PC1 | |
M36803 | 213.5 | 0.3293 | hemopexin |
J02843 | 337.8 | 0.3284 | cytochrome P450IIE1 (ethanol-inducible) |
X53595 | 344.5 | 0.318 | β-2-glycoprotein I (apolipoprotein H) |
HG2841-HT2970 | 197.3 | 0.3175 | albumin 5 |
HG2841-HT2969 | 161.5 | 0.3042 | albumin, 3 |
M13149 | 131.5 | 0.2592 | histidine-rich glycoprotein |
M10050 | 291.6 | 0.2533 | liver fatty acid binding protein (FABP) |
X03168 | 2313.7 | 0.2242 | S-protein |
D14446 | 148.2 | 0.2113 | HFREP-1 |
M16961 | 161.2 | 0.2067 | α-2 HS-glycoprotein α and β chain |
X51441 | 342.2 | 0.1958 | serum amyloid A (SAA) protein clone pAS3-α |
HG1827-HT1856 | 284.2 | 0.1956 | cytochrome P450, subfamily IIC |
L00190 | 254.4 | 0.1614 | D29832, M21642 and others |
M58600 | 1225.6 | 0.1523 | heparin cofactor II (HCF2) |
M21642 | 183.9 | 0.1265 | (dysfunctional) antithrombin III (ATIII) Utah |
M19828 | 1577.6 | 0.1064 | apolipoprotein B-100 (apoB) |
M11567 | 3034.8 | 0.1059 | angiogenin and three Alu repetitive sequences |
X14690 | 222.4 | 0.1045 | plasma inter-α-trypsin inhibitor heavy chain H(3) |
M21642 | 128.9 | 0.096 | (dysfunctional) antithrombin III (ATIII) Utah |
M20786 | 248.8 | 0.0929 | α-2-plasmin inhibitor |
M11321 | 317.2 | 0.0881 | group-specific component vitamin D-binding protein |
U08006 | 146.8 | 0.0855 | complement 8 α subunit (C8A) |
J03474 | 132.6 | 0.0778 | transcription factor SP1 |
S48983 | 358.8 | 0.0771 | SAA4 (serum amyloid A) |
Muscle-specific signature | | PC1 | |
X00371 | 545 | 0.3348 | myoglobin |
M33772 | 1527.7 | 0.3083 | fast skeletal muscle troponin C |
Z20656 | 2992.5 | 0.287 | cardiac α-myosin heavy chain |
M21494 | 410.4 | 0.2863 | muscle creatine kinase (CKMM) |
U96094 | 363.6 | 0.279 | sarcolipin (SLN) |
J04760 | 701.8 | 0.2658 | slow-twitch skeletal troponin I (TNN1) |
M83308 | 5723.7 | 0.2651 | mitochondrial cytochrome-c oxidase subunit VIa (COX6A) |
X06825 | 452.3 | 0.2444 | skeletal β-tropomyosin |
L21715 | 851.7 | 0.2257 | troponin I fast-twitch isoform |
M21665 | 488.5 | 0.2184 | β-myosin heavy chain |
M19309 | 1149.9 | 0.2099 | slow skeletal muscle troponin T, clone H22h |
X90568 | 3169.9 | 0.2077 | titin protein (clone hh1-hh4) |
S73840 | 350.5 | 0.2022 | type Hx myosin heavy chain |
M20543 | 993.2 | 0.1917 | skeletal α-actin |
X16504 | 1016.3 | 0.168 | X51957 and others |
M20642 | 747.2 | 0.15 | alkali myosin light chain 1 |
U35637 | 386.9 | 0.1345 | nebulin/U35637 |
M29458 | 564.4 | 0.1056 | carbonic anhydrase III |
M86407 | 759.1 | 0.0813 | α actinin 3 (ACTN3) |
Brain-specific signature | | PC2 | |
S72043 | 90.4306 | 0.4026 | GIF (growth inhibitory factor) |
M13577 | 686.2963 | 0.3566 | myelin basic protein (MBP) |
S40719 | 20.5566 | 0.2755 | glial fibrillary acidic protein |
HG1877-HT1917 | 82.2133 | 0.1778 | myelin basic protein |
X99076 | 49.5985 | 0.1633 | NRGN |
U48437 | 23.3006 | 0.1404 | amyloid precursor-like protein 1 |
J04615 | 5.9926 | 0.1292 | lupus autoantigen (small nuclear ribonucleoprotein snRNP SM-D) |
D21267 | 184.849 | 0.1252 | highly expressed protein |
L07807 | 30.2311 | 0.1162 | dynamin |
HG3437-HT3628 | 27.4526 | 0.1159 | myelin proteolipid protein |
L10373 | 18.2544 | 0.1123 | (clone CCG-B7) sequence |
M16364 | 9.3301 | 0.1071 | creatine kinase-B |
M98539 | 3.7109 | 0.0912 | prostaglandin D2 synthase |
U44839 | 3.1469 | 0.089 | putative ubiquitin C-terminal hydrolase (UHX1) |
D63851 | 10.9002 | 0.0863 | unc-18 homolog |
Y09836 | 16.17 | 0.0838 | unknown protein |
M37457 | 9.0757 | 0.0805 | Na+, K+, ATPase catalytic subunit alpha-III isoform |
M25667 | 27.35 | 0.0779 | neuronal growth protein 43 (GAP-43) |
D78577 | 6.3676 | 0.0779 | DNA for 14-3-3 protein eta chain |
L20814 | 68.1413 | 0.0735 | glutamate receptor 2 (HBGR2) |
J04046 | 6.4909 | 0.0729 | calmodulin |
X04741 | 137.5351 | 0.0719 | protein product (PGP) 95 |
L37033 | 6.0028 | 0.071 | FK-506 binding protein homolog (FKBP38) |
M11749 | 11.5785 | 0.0669 | Thy-1 glycoprotein |
D82343 | 140.3644 | 0.0649 | AMY |
S82024 | 47.6237 | 0.06 | SCG10 (neuron-specific growth-associated protein/stathmin homolog) |
D49958 | 29.5755 | 0.0571 | membrane glycoprotein M6 |
M65066 | 15.0292 | 0.0541 | cAMP-dependent protein kinase regulatory subunit RI-β |
D87465 | 9.7149 | 0.0532 | KIAA0275 |
X86809 | 4.3215 | 0.0524 | major astrocytic phosphoprotein PEA-15 |
The genes are sorted by their loadings on the projection space (PC) that separates the specific tissue. Also provided is the ratio of the mean of the gene expression in the specific tissue samples to the mean of the gene expression in all the other tissues. Genes with large values of the ratio tend to have large PC loadings. In the case of the brain-specific signature, only the top 30 genes as ranked by their loadings on PC2 are provided. A complete list of genes is in Supplementary Materials.
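The ratio of means reported in Table 1 is simple to compute. The sketch below uses a small made-up expression matrix; the function name and the data are illustrative assumptions, not values from the study.

```python
import numpy as np

def ratio_of_means(X, in_tissue):
    """Per-gene mean expression in the tissue of interest divided by
    the mean expression in all other samples (rows = samples, columns = genes)."""
    in_tissue = np.asarray(in_tissue, dtype=bool)
    return X[in_tissue].mean(axis=0) / X[~in_tissue].mean(axis=0)

# Hypothetical data: 4 samples x 2 genes; the first two samples are the tissue of interest.
X = np.array([[200.0, 10.0],
              [400.0, 12.0],
              [2.0, 9.0],
              [4.0, 13.0]])
ratios = ratio_of_means(X, [True, True, False, False])
print(ratios)
```

In this toy case the first gene is strongly tissue-specific and the second is expressed equally everywhere, which is the contrast the ratio column is meant to expose.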
Although the identity of some genes in the above groups is suggestive of the type of tissue they represent (e.g., the genes in structure A contain an excess of genes related to the liver, such as albumins and apolipoproteins), the nature of each gene group is revealed when score plots are constructed using only the genes that are specific to the structures of Figure 1D or 2A. Thus, using only the 24 genes of structure A to project all the samples yields a score plot (Fig. 2B) that dramatically separates the two liver samples in the data set from all the remaining tissue samples. Similarly, projecting the expression data of the 19 genes in structure B separates the three skeletal muscle tissue samples from the remaining tissues along the first PC (Fig. 2C) and, finally, projection of the samples using the 86 genes of structure C separates all six brain samples from the remaining tissues (Fig. 2D).
Inspection of the genes in structure C revealed two broad classes of genes. One class, with low expression levels, was largely related to ribosomal proteins and function; the other class, with larger and more variable expression, comprised primarily brain-tissue-related genes. The loadings of these genes on the second PC support this observation: genes with high expression levels in the brain samples also had a high loading magnitude on the second PC, as shown in Table 1. This is also true of the genes in the other structures. This fact may be used for class discovery and data-driven learning and is a result of the observed correspondence between the score plot and the loading plot. Given the observed separation of the six brain samples on the second PC in Figure 2D, a learning approach for samples with unidentified characteristics would have consisted of the following steps: Select a set of genes with high loadings on the dominant PC, examine their function, and generate hypotheses as to the nature of the samples. This is a class-discovery approach, in contrast to a classification methodology, which relies on a priori labeling of the samples (Golub et al. 1999; Brown et al. 2000). Here, the methodology allows one to probe the nature of the sample and simultaneously identify the genes that contribute to the differentiation of the sample(s) from the others.
The genes that were not part of these structures were also analyzed by projecting the samples using these genes; however, no clustering of samples or any noteworthy separation was observed.
Validation of Gene Expression Patterns Using New Samples
Additional samples (three each) from liver, muscle, and brain were collected in a subsequent experiment, profiled transcriptionally, and analyzed by applying the above projection methods. Figure 2B shows the projections of the gene expression data of the new liver samples using the loadings obtained from the projection of the genes in structure A (this discriminated the two liver samples from the remaining tissues in the initial data set). All three liver samples are clearly separated along the first PC from the nonliver tissues in the initial data set, underscoring the tissue-specific nature of these genes and hinting at the construction of a liver axis along the first PC. The genes distinguishing liver from nonliver tissues include albumin and those associated with the coagulation pathway (e.g., factor IX, antithrombin III, and heparin cofactor), complement pathway (e.g., C8), lipid process (e.g., apolipoproteins), bile metabolism (e.g., fatty acid binding protein 1), xenobiotic metabolism (e.g., cytochrome P450), and iron homeostasis (e.g., hemopexin), a result that is to be expected based on the known biology of the liver. An examination of the 24 genes in this structure revealed that 33% of all gene pairs had correlation coefficients >0.88 for these five liver samples. This value of the coefficient is significant at the 95% confidence level. Thus, a subset of these genes is expressed proportionately to each other in the liver tissue. For instance, it is known that apolipoprotein H binds to negatively charged heparin, and that heparin cofactor and antithrombin III are serine protease inhibitors that inhibit the coagulation pathway (McNally et al. 1994; Vander et al. 1994).
The loadings of the 19 genes in structure B were similarly used to project the three new skeletal muscle samples; the results are shown in Figure 2C. Similar to the liver samples, the first PC clearly separates the new skeletal muscle samples and acts like a muscle axis. The genes include those associated with the cytoskeleton (e.g., actin α1, actinin α3, and nebulin), contraction (e.g., tropomyosin, troponin, myosin), glucose metabolism (e.g., enolase 3β), CO2 metabolism (e.g., carbonic anhydrase III), and energy transduction (e.g., creatine kinase). In particular, actinin α3 is known to have expression limited to skeletal muscle (North et al. 1999), and carbonic anhydrase III is strictly present at high levels in skeletal muscle and at much lower levels in cardiac and smooth muscle (Lloyd et al. 1986). About 74% of all gene pairs, after discounting ones with the same genes, had a correlation coefficient >0.811, the 95% confidence level with the given number of samples. This rather striking degree of linear correlation implies that these genes are expressed proportionately in skeletal muscle samples and may be coordinately regulated. For example, whereas both actin and myosin provide force for muscle contraction, troponin, a regulatory protein, prevents actin and myosin interaction in resting muscle tissue. Tropomyosin, an actin filament-binding protein, is required for the interaction of actin and troponin. It is also known that titin maintains resting tension in skeletal muscle (Vander et al. 1994).
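The pairwise-correlation summary used for the liver and muscle gene sets can be illustrated as follows. This is a sketch under stated assumptions: Pearson correlation across samples is assumed, the function name and the six-sample toy matrix are hypothetical, and the 0.811 cutoff is the value quoted in the text.

```python
import numpy as np

def fraction_correlated_pairs(X, threshold):
    """Fraction of distinct gene pairs whose Pearson correlation across
    samples exceeds the threshold (rows = samples, columns = genes)."""
    r = np.corrcoef(X, rowvar=False)       # genes x genes correlation matrix
    upper = np.triu_indices_from(r, k=1)   # each pair once, self-pairs excluded
    return float(np.mean(r[upper] > threshold))

# Hypothetical data: 6 samples x 3 genes. Gene 1 is a linear function of gene 0;
# gene 2 is only loosely related to either.
g0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([g0, 2.0 * g0 + 1.0, [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]])
fraction = fraction_correlated_pairs(X, 0.811)
print(fraction)
```

Only one of the three gene pairs clears the cutoff here, so the reported fraction is one third.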
Finally, the 86 genes in structure C were used to project the new brain samples, and as Figure 2D shows, the new brain samples are clearly separated from the other nonbrain samples and fall in the same region as the brain samples of the initial set. The genes include those associated with myelin structure (e.g., myelin basic protein), astrocytic differentiation (e.g., glial fibrillary acidic protein), synaptic reorganization (e.g., calmodulin, neurogranin, and GAP-43), and neurotransmission (e.g., glutamate receptor). Of note, many genes with no known functions are also reported here to be specific for the brain samples.
The use of projection methods to analyze the effect of these genes on the samples also led to the automatic construction of a reduced-dimension classifier space for the liver, muscle, and brain tissues. As shown here, new samples may be projected onto this space and the score value used to classify the tissue sample.
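Projecting new samples onto loadings learned from an initial set can be sketched as below. Everything here is illustrative: the data are synthetic, the five-gene set stands in for a tissue-specific signature, and the midpoint cutoff is a hypothetical decision rule, since the text does not specify how score values were turned into class calls.

```python
import numpy as np

def first_pc_loadings(X):
    """Unit-length loadings of the first PC from an SVD of the unscaled matrix X."""
    _, _, Lt = np.linalg.svd(X, full_matrices=False)
    l1 = Lt[0]
    # The sign of a singular vector is arbitrary; make the loadings sum positive.
    return l1 if l1.sum() >= 0 else -l1

rng = np.random.default_rng(0)
n_genes = 5                                          # stand-in for a tissue-specific gene set
liver = rng.uniform(900, 1100, size=(2, n_genes))    # tissue of interest, high expression
other = rng.uniform(0, 10, size=(38, n_genes))       # remaining tissues, low expression
X_train = np.vstack([liver, other])

loadings = first_pc_loadings(X_train)
train_scores = X_train @ loadings
# Hypothetical rule: midpoint between the two group means along the first PC.
cutoff = 0.5 * (train_scores[:2].mean() + train_scores[2:].mean())

X_new = np.vstack([rng.uniform(800, 1200, size=(3, n_genes)),  # new liver-like samples
                   rng.uniform(0, 10, size=(3, n_genes))])     # new non-liver samples
is_liver = X_new @ loadings > cutoff
print(is_liver)
```

The key point is that the new samples are never used to compute the loadings; they are only multiplied by loadings fixed from the first set.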
Application to Other Data Sets
Figure 3 shows the result of the application of the current methodology to the gene expression data on lymphoid malignancies (Alizadeh et al. 2000). The expression profiles of 62 samples of diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), and chronic lymphocytic leukemia (CLL) were measured on 17,856 cDNA clones. A simple projection reveals the presence of two clusters and one intervening group of samples. Querying the nature of these samples reveals an almost perfect segmentation of the samples in a PC space that comprises a mere 35% of the information in the data. Implementing the thresholding procedure allows for the identification of 401 consequential genes, which maintain the patterns in the data with minimal distortion. No outstanding structures suggest themselves in the loading plot. The observation of linear structures is a unique characteristic of each data set and will not necessarily occur in all cases. In this particular case, just the thresholding procedure is sufficient to allow for segmentation of the samples and identification of consequential genes.
DISCUSSION
We have shown the utility of PCA as an initial step in the analysis of microarray data to extract and examine gene expression patterns. Previous work has applied a similar approach (singular value decomposition) to construct linear combinations of gene expressions (called characteristic modes, or eigengenes) from microarray measurements of time-series samples (Alter et al. 2000; Holter et al. 2000). Here, we extend the application of PCA to the analysis of non-time-series data and to the data-driven learning and sample classification problem. The reason for the broad applicability of PCA lies in its strong, yet flexible, mathematical structure and the correspondence between the score plot and the loading plot. This latter feature is exploited in the interactive methodology presented for the elimination of redundant variables or genes. This method is general and may be applied to any data set.
Our methodology facilitated the identification of strong underlying structures in the data. The identification of such structures is uniquely dependent on the data and is not generally guaranteed. For example, the expression data on leukemia samples (Golub et al. 1999) were similarly analyzed; however, no evident patterns presented themselves, although diffuse structures containing some discriminatory information could be observed at higher, less informative PCs (data not shown). This may be due to the fact that PCA attempts to maximize the variation that it captures in the data. In cases where the discriminatory information is not the most important type of variation (perhaps due to the presence of a large number of nondiscriminatory genes), the above analysis will not yield discriminatory patterns between two classes of tissues or samples. When discriminatory genes are preselected by applying a t-test on preclassified samples and used for projection, clear separations are obtained between the acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) classes.
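Preselecting discriminatory genes with a t-test before projection can be sketched as follows. The text does not state which form of the t-test was used, so Welch's unequal-variance statistic is an assumption here, as are the function name, the synthetic data, and the choice to keep the top three genes.

```python
import numpy as np

def welch_t(X, labels):
    """Per-gene Welch t statistic between two preclassified groups
    (rows = samples, columns = genes)."""
    labels = np.asarray(labels, dtype=bool)
    a, b = X[labels], X[~labels]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

rng = np.random.default_rng(1)
labels = np.array([True] * 10 + [False] * 10)   # two hypothetical sample classes
X = rng.normal(size=(20, 50))                   # synthetic expression values
X[labels, :3] += 5.0                            # plant three discriminatory genes

# Rank genes by absolute t statistic and keep the strongest three.
top = np.argsort(-np.abs(welch_t(X, labels)))[:3]

# Project the samples using only the preselected genes.
_, _, Lt = np.linalg.svd(X[:, top], full_matrices=False)
scores_pc1 = X[:, top] @ Lt[0]
print(sorted(top.tolist()))
```

Because the class labels are used to choose the genes, this is a supervised step, unlike the label-free exploration described in the rest of the paper.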
Several genes in the tissue-specific signatures identified here are justifiable with respect to known biology regarding the particular tissue. In the case of the liver and muscle samples, coordinate expression of some of these genes may also be biologically explained. Elucidation of the function and role of the other genes observed in these tissue-specific signatures must await further experiments.
In the current study, the data was not mean-centered. Mean-centering is geometrically equivalent to shifting the origin of the PCA coordinate system to the centroid of the data, a procedure which may or may not yield different results. For the purposes of comparison, the data was mean-centered and then analyzed as described above. The structures for the liver and muscle samples were identified in the first and second PC, whereas the identification of the brain structure required the inclusion of the third PC. The list of genes identified overlapped strongly with the one presented here. This raises our confidence in the significance of the genes identified but also underscores the fact that different processing methods will give rise to a slightly different list of genes; it may be best to adopt several processing methods and choose a common subset of genes.
Projection methods shift the focus of analysis from individual genes to the combined quantitative effect of several consequential genes. Here, due to the strong structures observed in the data, such a combination led to the construction of reduced dimension classifiers for the liver, muscle, and brain tissues. If the sole objective of the analysis is to yield a classifier, then other projection methods, such as Fisher discriminant analysis (Stephanopoulos et al. 2002), are more appropriate and rigorous. If the objective is data exploration, PCA is the better choice, because few a priori assumptions, such as sample class type, are made. Overall, due to their data reduction properties and their flexibility in dealing with large data sets, projection methods are an important class of tools for the analysis of microarray data.
METHODS
Data Treatment
Each array from the BWH data was scaled to a target intensity of 100. All negative expression values were reduced to zero for the purpose of analysis. For treatment of the lymphoma data, see Alizadeh et al. (2000). In the lymphoma data set, genes that had missing values for the 62 experiments were removed from the analysis. This gave an initial starting number of 854 cDNA clones.
Principal Components Analysis
Singular value decomposition is used to calculate the principal components of a data matrix (Dillon and Goldstein 1984). Any data matrix X with S samples (tissues) on the rows and V variables (genes) on the columns may be decomposed as follows:
$$X = U\,T\,L' \tag{1}$$
where T is a diagonal matrix whose diagonal entries are the singular values of X. The singular values of X are the square roots of the nonzero eigenvalues of the square matrix X′X, as well as of XX′ (X′ being the transpose of X). The columns of U and L contain the eigenvectors of XX′ and X′X, respectively. R, the maximum number of independent dimensions, is determined by the rank of the matrix X.
The loadings of the genes, or their coefficients in the linear combination that forms the principal component, are given by the column vectors of matrix L. The magnitude of a gene loading is a measure of its importance in defining the principal component. The scores of the samples, or the projections of the samples on the principal components, are given by
$$Y = X\,L \tag{2}$$
The amount of information in the data that the first r principal components capture may be quantified as
$$\frac{\sum_{i=1}^{r} SV_i^{2}}{\sum_{i=1}^{R} SV_i^{2}} \tag{3}$$
where $SV_i$ is the ith singular value.
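The decomposition, loadings, and scores described in this section can be computed directly with a numerical SVD. The sketch below is illustrative only: it uses synthetic data in place of the expression matrix, and it assumes that captured information is measured by squared singular values, which is the usual variance convention.

```python
import numpy as np

def pca_svd(X):
    """Decompose X (samples x genes) by SVD, without any scaling.

    Returns the loadings (genes x R), the scores (samples x R), the singular
    values, and the cumulative fraction of information captured by the first r PCs.
    """
    _, sv, Lt = np.linalg.svd(X, full_matrices=False)
    L = Lt.T                                       # columns hold the gene loadings of each PC
    Y = X @ L                                      # scores: projections of the samples on the PCs
    captured = np.cumsum(sv ** 2) / np.sum(sv ** 2)
    return L, Y, sv, captured

rng = np.random.default_rng(0)
X = rng.random((40, 200))                          # synthetic stand-in: 40 samples, 200 genes
L, Y, sv, captured = pca_svd(X)
print(captured[:2])                                # fraction captured by the first one and two PCs
```

With `full_matrices=False`, NumPy returns only as many singular vectors as the smaller matrix dimension, which matches the rank-limited number of independent dimensions described in the text.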
The filter on the loadings was implemented by dividing each loading by the sum of the magnitudes of all the other loadings for that PC and then by rejecting all genes with a loading less than the threshold value. The distortion of patterns in the score plot due to the removal of genes in this thresholding procedure was measured by the sum of the squares of the difference between the 40 original score values and the 40 score values produced with the filtered gene set. Mathematically,
$$SD = \sum_{i}\sum_{s=1}^{40}\left(y_{s,i,o} - y_{s,i,f}\right)^{2} \tag{4}$$
where SD is the squared difference, $y_{s,i,o}$ is the score value of the sth sample on the ith PC in the projection using all 7070 genes, and $y_{s,i,f}$ is the score value of the sth sample on the ith PC obtained when a filtered gene set is used.
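The loading filter and the squared-difference measure can be sketched together. This is a minimal illustration on synthetic data, not the authors' implementation: the function names are hypothetical, the squared difference is summed over the first two PCs as an assumption, and the sign alignment of the re-projected scores is an added step needed because singular vectors are defined only up to sign.

```python
import numpy as np

def loadings_and_scores(X, n_pcs):
    """Loadings (genes x n_pcs) and scores (samples x n_pcs) from an SVD of X."""
    _, _, Lt = np.linalg.svd(X, full_matrices=False)
    L = Lt[:n_pcs].T
    return L, X @ L

def filter_genes(L, threshold):
    """Mask of genes whose normalized loading magnitude reaches the
    threshold on at least one of the PCs in L."""
    mag = np.abs(L)
    others = mag.sum(axis=0, keepdims=True) - mag   # magnitudes of all other genes on that PC
    return (mag / others >= threshold).any(axis=1)

def squared_difference(X, keep, n_pcs=2):
    """Sum of squared differences between full-gene and filtered-gene scores."""
    _, y_o = loadings_and_scores(X, n_pcs)
    _, y_f = loadings_and_scores(X[:, keep], n_pcs)
    # Align the arbitrary sign of each re-computed PC with the original (assumption).
    signs = np.sign((y_o * y_f).sum(axis=0))
    signs[signs == 0] = 1.0
    return float(((y_o - y_f * signs) ** 2).sum())

rng = np.random.default_rng(0)
gene_scale = 1000.0 * rng.random(300) ** 4          # synthetic, strongly skewed gene scales
X = rng.random((40, 300)) * gene_scale              # 40 samples x 300 genes

L5, _ = loadings_and_scores(X, 5)                   # filter on the first five PCs
kept = {}
for threshold in (0.0, 0.001, 0.01):
    keep = filter_genes(L5, threshold)
    kept[threshold] = int(keep.sum())
    print(threshold, kept[threshold], squared_difference(X, keep))
```

Sweeping the threshold in this way gives the two quantities tracked in the threshold selection: the number of genes retained and the distortion of the score plot.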
Acknowledgments
We thank the anonymous reviewers for their constructive suggestions for this paper. This work was supported by a grant from the Engineering Research Program of the Office of Basic Energy Science at the Department of Energy (DE-FG02-94ER-14487 and DE-FG02-99ER-15015) and an NIH grant (1-RO1-DK58533-01).
Footnotes
E-MAIL gregstep@mit.edu; FAX (617) 253-3122.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.225302.
REFERENCES
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JG, Sabet H, Tran T, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–511. doi: 10.1038/35000501.
- Alter O, Brown PO, Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci. 2000;97:10101–10106. doi: 10.1073/pnas.97.18.10101.
- Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci. 2000;97:262–267. doi: 10.1073/pnas.97.1.262.
- Dillon WR, Goldstein M. Multivariate Analysis. New York: John Wiley & Sons; 1984. pp. 23–52.
- Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531.
- Holter NS, Mitra M, Maritan A, Cieplak M, Banavar JR, Fedoroff NV. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc Natl Acad Sci. 2000;97:8409–8414. doi: 10.1073/pnas.150242097.
- Hsiao L, Dangond F, Yoshida T, Hong R, Jensen RV, Misra J, Dillon W, Lee K, Clark K, Haverty P, et al. A compendium of gene expression in normal human tissues reveals tissue-selective genes and distinct expression patterns of housekeeping genes. Physiol Genomics. 2001;7:97–104. doi: 10.1152/physiolgenomics.00040.2001.
- Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai HY, He Y DD, et al. Functional discovery via a compendium of expression profiles. Cell. 2000;102:109–126. doi: 10.1016/s0092-8674(00)00015-5.
- Kamimura RT. "Application of multivariate statistics to fermentation database mining." Ph.D. thesis. Cambridge: Massachusetts Institute of Technology; 1997.
- Lloyd J, McMillan S, Hopkinson D, Edwards YH. Nucleotide sequence and derived amino acid sequence of a cDNA encoding human muscle carbonic anhydrase. Gene. 1986;41:233–239. doi: 10.1016/0378-1119(86)90103-4.
- McNally T, Cotterell SE, Mackie IJ, Isenberg DA, Machin SJ. The interaction of β(2) glycoprotein-I and heparin and its effect on β(2) glycoprotein-I antiphospholipid antibody cofactor function in plasma. Thromb Haemost. 1994;72:578–581.
- North KN, Yang N, Wattanasirichaigoon D, Mills M, Easteal S, Beggs AH. A common nonsense mutation results in α-actinin-3 deficiency in the general population. Nat Genet. 1999;21:353–354. doi: 10.1038/7675.
- Perou CM, Jeffrey SS, Van de Rijn M, Rees CA, Eisen MB, Ross DT, Pergamenschikov A, Williams CF, Zhu SX, Lee JCF, et al. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci. 1999;96:9212–9217. doi: 10.1073/pnas.96.16.9212.
- Rannar S, MacGregor JF, Wold S. Adaptive batch monitoring using hierarchical PCA. Chemomet Intell Lab Sys. 1998;41:73–81.
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998;9:3273–3297. doi: 10.1091/mbc.9.12.3273.
- Stephanopoulos G, Hwang D, Schmitt WA, Misra J, Stephanopoulos G. Mapping physiological states from microarray expression measurements. Bioinformatics. 2002 (in press).
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci. 1999;96:2907–2912. doi: 10.1073/pnas.96.6.2907.
- Vander AJ, Sherman JH, Luciano DH. Human Physiology. New York: McGraw-Hill; 1994. pp. 308–312 and pp. 454–457.