Beroukhim et al. 10.1073/pnas.0710052104.
Fig. 4. Fewer significant regions are identified in an analysis restricted to primary GBMs. The results of the original GISTIC analysis of glioma (displayed as in Fig. 2b) are presented alongside a similar analysis of only the primary GBMs in the data set. All of the regions that are significant among primary GBMs are also significant in the larger data set including secondary GBMs and lower-grade gliomas. Some events, such as 8q gain and 19q loss, are significant in the larger data set but not among only primary GBMs. This loss of significance may be due to either a decreased prevalence of these events among primary GBMs or decreased power to detect low-prevalence aberrations in a smaller tumor set.
Fig. 5. GISTIC applied to different glioma data sets generates nearly identical results. The results of the original GISTIC analysis (displayed as in Fig. 2b) are presented alongside similar analyses of 178 tumors on 100K SNP arrays (1) and 37 tumors on 16K CGH arrays (2). Only minor differences in results are seen; these are due to differences in the distribution of glioma subtypes within each data set (a high proportion of grade III gliomas among the 178 tumors and of secondary GBMs in the CGH analysis) and to stochastic fluctuation. As expected, significant aberrations tend to reach higher levels of significance (lower q values) as the number of samples increases.
1. Kotliarov Y, Steed ME, Christopher N, Walling J, Su Q, Center A, Heiss J, Rosenblum M, Mikkelsen T, Zenklusen JC, Fine HA (2006) Cancer Res 66:9428-9436.
2. Maher EA, Brennan C, Wen PY, Durso L, Ligon KL, Richardson A, Khatry D, Feng B, Sinha R, Louis DN, et al. (2006) Cancer Res 66:11502-11513.
Fig. 6. Comparison between GISTIC analyses of glioma and lung cancer reveals distinct profiles. The results of the original GISTIC analysis of glioma (displayed as in Fig. 2b) are presented alongside a similar analysis of 81 lung cancer samples using 100K SNP arrays (1). The overall pattern is strikingly different, although both tumor types exhibit similar amplifications of chr7 (including EGFR) and deletions of chr9p (CDKN2A/B) and chr13 (RB1). A more detailed analysis of the lung cancer genome using GISTIC is the subject of a forthcoming manuscript (2).
1. Zhao X, Weir BA, LaFramboise T, Lin M, Beroukhim R, Garraway L, Beheshti J, Lee JC, Naoki K, Richards WG, et al. (2005) Cancer Res 65:5561-5570.
2. Weir BA, Woo MS, Getz G, Perner S, Ding L, Beroukhim R, Lin WM, Province MA, Kraja A, Johnson LA, et al. (2007) Nature, in press.
Fig. 7. Broad amplification of chromosome 7 vs. focal amplification of EGFR. (a) A histogram of the copy numbers (displayed as log2 ratios) across samples at the EGFR locus shows tumors divide into three classes: log2 ratios <0.1 (unamplified, associated with 7norm), between 0.1 and 0.9 (low-level amplifications, associated with 7gain), and >0.9 (high-level amplifications, or 7gainEGFRamp). No samples had log2 ratios between 0.7 and 1.3, suggesting a qualitative difference between 7gain and 7gainEGFRamp. Note that the values 0.1 and 0.9 coincide with qamp and qhi_amp (see SI Text). (b) Copy-number profiles (displayed as log2 ratios, blue line) across chr7 (Mb coordinates on left) are displayed for representative samples with 7norm, 7gain, and 7gainEGFRamp. The presence of low-level amplification at the EGFR locus does not imply a focal amplification. In fact, 42 of 44 cases exhibited low-level amplification across most of the chromosome. The high level of EGFR amplification seen in 7gainEGFRamp, however, is always focal and never extends over most of the chromosome.
Fig. 8. MET/HGF+ cell lines activate MET and AKT in an HGF-dependent manner but do not activate EGFR. (a) Treatment with SU11274 reduces MET and AKT activation in MET/HGF+ cells. Whole-cell lysates from MET/HGF+ (see Fig. 3) Hs683 and LN18 cells were obtained after 24-h serum starvation in the presence of the indicated concentrations of SU11274. (b) MET and AKT activation in MET/HGF+ cells is HGF-dependent. Whole-cell lysates were obtained after 24-h serum starvation in the presence of (as indicated) anti-HGF antibodies (5 mg/ml) or SU11274 (2.5 mM), or with HGF (50 mg/ml) added for the final 10 min. (c) Neither the presence of 7gain nor high levels of EGFR expression are associated with activation of EGFR. Cell lines were characterized as having high EGFR expression if their median-normalized, median absolute deviation-scaled RNA expression levels (using only concordant EGFR probesets on Affymetrix U133 arrays) were greater than zero. None of these cell lines have focal amplification of EGFR. Immunoblots to the indicated epitopes were performed on whole-cell lysates prepared after 24-h serum starvation. EGFR-dependent lung cancer cells (H3255) (1) were included as positive controls. (d) These cells also do not exhibit decreased viability when treated with the EGFR inhibitor erlotinib. Viability was measured by using WST dye after exposure to inhibitor at the indicated concentrations for 96 h.
1. Tracy S, Mukohara T, Hansen M, Meyerson M, Johnson BE, Janne PA (2004) Cancer Res 64:7241-7244.
Fig. 9. Flow chart representing the components of the four stages of the GISTIC algorithm. Each step in GISTIC is represented by a block or bullet and is described in a subsection of SI Text.
Fig. 10. Evolution of the data as they progress through GISTIC. (a) Raw signal intensities (log2 scaled) are displayed across the genome (y axis) for 141 tumors and 33 normal samples (x axis; sample characteristics indicated on top). Copy-number aberrations are difficult to distinguish at this step, before division by normal controls. (b) Raw genotyping data are displayed for these same samples. Large regions of homozygosity (seen as stripes lacking the usual frequency of yellow heterozygous markers) likely represent LOH events. (c) Batch effect correction removes artifactual copy-number changes associated with date of data generation. (Upper) Inspection of the dates on which data were obtained (batches, indicated by color bar at the top) shows that high-level changes in signal intensity are restricted to a single batch in some markers (arrows). If not corrected, these signal intensity changes will be seen as recurrent copy-number changes. (Lower) After batch correction, these artifacts are removed. (d) Normalized signal intensities displayed for the tumor samples only reveal copy-number aberrations, including losses (blue) and amplifications (red). Although systematic errors have been minimized by batch correction and selection of appropriate normalization controls (see SI Text), substantial random errors persist. The last row of sample characteristics (top) indicates samples with low tumor purity (in green; see SI Text), which are removed from subsequent steps. (e) Histograms of normalized data in the quality-control step enable identification of data sets in which contamination with normal cells obscures the signal contributed by tumor cells. (Upper) Histograms depicting normalized signal intensity distributions that would be expected from the indicated tumor purities. A pure tumor would be expected to display separate peaks corresponding to the different copy-number levels in the tumor.
The width of each peak will vary according to the level of noise in the array, and the distance between peaks represents the amount of signal contributing to copy-number estimates. As the proportion of tumor cells decreases, so does this signal, leading to a smaller distance between peaks. At low proportions of tumor, peaks associated with different copy-number levels become indistinguishable, indicating that the signal is obscured by noise. (Lower) Actual histograms (gray) and smoothed versions (dark lines) representative of the patterns seen among the 141 gliomas analyzed. These roughly correspond to the expected distributions seen in Upper. (f) Segmented signal intensity data for the 105 glioma samples with high tumor purity reveal the copy-number aberrations with much lower levels of random noise. (g) Segments with log2 signal intensity ratios >0.1 are considered amplified and are displayed here. (h) We identified loss and retention of heterozygosity events (blue and yellow, respectively) among the 105 tumors with high tumor purity by comparing the observed frequency of heterozygous SNP markers to the expected frequency in each region of the genome (1). (i) The frequency of amplification and average level of amplification across the genome are displayed in panels to the right of the amplification data (from g). High scores for either one of these indicates a high likelihood that amplifications in that region of the genome are not solely chance events. GISTIC uses a G score (far right panel) that integrates both of these measures to identify aberrations that are associated with cancer. (j) Comparison of observed G scores to similar scores generated after permuting the marker labels allows us to determine the statistical significance of aberrations in each region (displayed on the right as FDR q values to account for multiple hypotheses; see SI Text). Regions with q values <0.25 (green line) are considered significantly aberrant. 
(k) "Peel-off" method identifies independent peaks within a statistically significant region. Upper Left shows a chromosome from an idealized set of tumors, with amplified regions in orange; q values associated with these regions are shown in Upper Right. For every chromosome in which some of these q values attain statistical significance (the significance threshold is denoted by the green line), the "peel-off" algorithm identifies the region with minimal q value (red line) as the primary peak. All aberrations involving the primary peak (marked by red stars) are then removed (faded orange, Lower Left), and G scores and FDR q values are recalculated (Lower Right) using only the remaining aberrations. If these reach significance, the region with minimal q value is selected as the secondary peak. The process iterates until no statistically significant regions remain. (l) In the case of chr7 amplification, this "peel-off" algorithm enables us to identify separate peaks associated with EGFR and MET. The original amplification data for chr7 are displayed in Top along with the associated G scores. The entire chromosome is associated with G scores that are greater than the significance threshold, but a clear peak is observed at the EGFR locus. When we remove all amplicons that cover this peak region, we find a second peak that crosses the significance threshold at the MET locus (Middle). When amplicons covering this second peak are removed, the remaining amplicons do not reach statistical significance (Bottom).
1. Beroukhim R, Lin M, Park Y, Hao K, Zhao X, Garraway LA, Fox EA, Hochberg EP, Mellinghoff IK, Hofer MD, et al. (2006) PLoS Comput Biol 2:e41.
Fig. 11. Copy-number changes tend either to be focal or near the size of a chromosome arm. The distribution of sizes of all amplifications and deletions in the data set is displayed. The majority of events fall into one of two peaks: either focal events covering <10% of a chromosome arm or broad events covering >90% of a chromosome arm.
Fig. 12. LOH is usually, but not always, associated with deletions. (a) The statistical significance of deletions (blue) and LOH (purple) are displayed as in Fig. 2b. All significant regions of LOH are also significant regions of deletion, with two exceptions: (i) a small region containing EGFR gives the appearance of LOH in highly amplified samples due to allelic imbalance and (ii) chromosome 17p, containing TP53. (b) TP53 primarily undergoes copy-neutral LOH. Upper displays loss of heterozygosity (LOH, blue) and retention of heterozygosity (yellow) along chromosome 17 for eight gliomas (labeled A-H) with LOH at the TP53 locus. Lower displays signal intensities (red, high; blue, low; white, neutral) and copy-number calls (red bar, amplified; blue bar, deleted) for those gliomas. LOH at TP53 is associated with neutral copy numbers in gliomas A-G. Across our data set, copy loss (as in glioma H) is seen in only three of 23 gliomas observed to have LOH at the TP53 locus.
SI Text
We describe a general method for Genomic Identification of Significant Targets in Cancer (GISTIC). GISTIC can be divided into four stages (SI Fig. 9):
(i) Characterization of chromosomal aberrations on a per-tumor basis
(ii) Aggregation of data from different tumors to differentiate between driver and passenger aberrations.
(iii) Identification of peak regions most likely to contain the oncogene and tumor suppressor gene (TSG) targets.
(iv) Classification of tumors on the basis of their driver aberrations.
Stage 2 contains the 2 central features of the algorithm: that it scores each genomic marker according to an integrated measure of the prevalence and amplitude of copy-number changes (and only prevalence in the case of LOH), and that it assesses the statistical significance of each score by comparison to the results expected from the background aberration rate alone.
In the following section we provide an overview of the motivations and methods behind each of the 4 stages. Detailed descriptions of each stage, to allow reproduction of the results, are included in the following 4 sections, with subsections dedicated for each block or bullet in SI Fig. 9. For clarity, the first time a parameter is described, the value that we used appears in parentheses. The evolution of the data as it progresses through the algorithm is visualized in SI Fig. 10.
Overview of the Method
With data describing chromosomal aberrations in large tumor sets, the aberrations that drive tumorigenesis and the oncogenes and TSGs they most likely target can be identified if the following 4 issues are addressed. (i) The aberrations in each of the tumors must be accurately mapped. (ii) Driver aberrations that rise above the background rate of random passenger aberrations must be identified. (iii) For each driver aberration, the loci most likely to contain the targeted oncogenes or TSGs must be identified. (iv) Tumors must be classified as to whether they are aberrant at the predicted driver loci, so that the effects of those aberrations can be studied. GISTIC represents an example of such an approach, in which these 4 issues are addressed in the 4 stages of the algorithm.
Stage 1
In this stage, chromosomal aberrations are mapped in each tumor. Here, the goal is to maximize the accuracy with which these aberrations are identified, by (i) minimizing systematic error, (ii) minimizing random error, and (iii) discarding poor-quality datasets. Chromosomal regions with high signal intensities are designated as amplified, regions with low signal intensities are designated as deleted, and regions with an excess of homozygous SNP markers are designated as having lost heterozygosity.
Systematic errors arise when datasets from different samples are generated under slightly different experimental conditions. A primary example is batch effect, in which data generated on different days varies slightly. We limit batch effects on our copy-number assessments by using a batch effect correction module, in which we identify and correct markers that show consistent signal within batches but large variations between batches. Other experimental variables, such as day of manufacture of the array, or slight variations in PCR conditions, can also lead to systematic errors even between samples within a batch. Many array comparative genomic hybridization platforms minimize these by using 2-color systems in which tumor and control DNA are hybridized simultaneously to the same array. In single-color systems such as the Affymetrix SNP arrays that we use prominently in this study, these systematic errors can be minimized by selecting appropriate controls for each tumor, such that the controls share similar variations in their noise profiles across the genome.
Several methods exist to reduce the effects of random noise in copy-number datasets, most often by identifying regions of copy-number change and averaging the signal intensities for all markers within them (1). Examples include segmentation algorithms such as Circular Binary Segmentation (2) and Gain and Loss Analysis of DNA (GLAD) (3), Hidden Markov Model-based approaches (4, 5), and clustering methods (6). Each has advantages and disadvantages that may vary with the noise characteristics of the dataset. We used GLAD due to its high sensitivity for identifying copy-number changes (1). However, this high level of sensitivity occasionally leads GLAD to report non-existent copy-number changes in very small segments (fewer than four markers). We therefore filter these out.
In poor-quality datasets, the signal intensity variations due to copy-number changes are obscured by noise. We therefore identify high-quality datasets as having separate peaks, corresponding to different copy numbers, in histograms of the signal intensity data. Poor-quality samples, particularly those with extensive contamination with normal DNA, generate insufficient signal to distinguish separate peaks, and are discarded. Likewise, duplicate samples from the same individual, identified by similar SNP genotypes, are eliminated.
Stage 2
This stage contains the two core features of GISTIC (Fig. 1). First, we score each genomic marker for the sources of evidence that it is in a region affected by driver aberrations (the G-score). Here, we treat amplifications, deletions, and LOH events separately, allowing for the possibility that a region could be significantly amplified and deleted simultaneously (for instance if an oncogene and TSG neighbor each other, with some samples amplified and others deleted). In the cases of amplifications and deletions, we assume that both the prevalence and average amplitude of these events independently indicate the likelihood with which a region is affected by such driver aberrations. Therefore, we use a simple integrated score of the prevalence of the copy-number change times the average (log2-transformed) amplitude. In the case of LOH, amplitudes do not apply and we therefore score each marker only by the prevalence of events.
Second, we compare these G-scores to the distribution of scores expected if only random aberrations were observed. This distribution can be determined by rescoring the genome after permuting marker locations within each sample; we instead derive a semiexact estimate. The comparison of actual scores to those generated by our null model of random aberrations allows us to calculate the statistical significance of each G-score (represented by False Discovery Rate q-values (7)), representing the likelihood that the observed data could have been generated by chance alone.
Regions of the genome that are too frequently or highly aberrant to be explained by chance alone are selected as likely to harbor driver aberrations.
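The scoring and the permutation-based null described above can be illustrated with a minimal Python sketch. It is not the GISTIC implementation: the function names and the matrix layout are hypothetical, the 0.1 amplification threshold is taken from the legend of SI Fig. 10g, and the null distribution is built by brute-force permutation of marker positions rather than by the semiexact estimate used in the analysis. The sketch stops at empirical p-values and does not convert them to FDR q-values.

```python
import bisect
import random

def g_scores(seg, threshold=0.1):
    """Per-marker G-score: prevalence times mean log2 amplitude of amplified samples.

    seg[j][i] is the segmented log2 ratio of sample j at marker i
    (hypothetical layout chosen for this sketch).
    """
    n_samples = len(seg)
    scores = []
    for column in zip(*seg):
        amplified = [c for c in column if c > threshold]
        if not amplified:
            scores.append(0.0)
            continue
        prevalence = len(amplified) / n_samples
        mean_amplitude = sum(amplified) / len(amplified)
        scores.append(prevalence * mean_amplitude)
    return scores

def permutation_p_values(seg, n_perm=200, threshold=0.1, seed=0):
    """Empirical p-values obtained by shuffling marker positions within each sample."""
    rng = random.Random(seed)
    observed = g_scores(seg, threshold)
    null = []
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in seg]
        null.extend(g_scores(shuffled, threshold))
    null.sort()
    p_values = []
    for g in observed:
        n_at_least = len(null) - bisect.bisect_left(null, g)
        # Add-one correction keeps the empirical p-value above zero.
        p_values.append((n_at_least + 1) / (len(null) + 1))
    return p_values
```

A marker amplified in every sample receives a high G-score and a small empirical p-value, whereas a marker that is never amplified scores zero and has a p-value of 1.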
Stage 3
In this stage, GISTIC identifies the most likely locations of the oncogene or TSG targets of the driver aberrations identified in stage 2. This stage is designed with 4 considerations in mind: (i) these gene targets are most likely to lie in the regions most frequently aberrant to the highest degree (similar to the minimal common region of aberration, but with high-amplitude aberrations given greater weight); (ii) occasional random aberrations may occur near, but not overlapping, real oncogenes or TSGs, distracting us from their true locations; (iii) a single region may contain more than one independently targeted gene; and (iv) some aberrations may exert their effects through broad-based changes across much of the length of the aberration. This latter consideration is suggested (but not proven) by the high prevalence of broad aberrations that consistently affect large regions of the genome (near the size of a chromosome arm) (SI Fig. 11).
Given (i), for each region found in stage 2 to contain likely driver aberrations, we select the "peak" regions with maximal G-scores and (an equivalent statement) minimal q-values as most likely to contain the oncogene or TSG targets. In each case, we allow for (ii) (the possibility that random aberrations are skewing the location of the peak) by leaving each sample out in turn and recalculating the peak boundaries; only the widest boundaries are taken. We also allow for (iii) (that a single region may contain two or more independent gene targets) by applying a "peel-off" method designed to identify aberrations that overlap but are independently statistically significant. Finally, we allow for (iv) by determining, for each peak region, whether the aberrations at this locus are primarily focal or broad, or whether both focal and broad aberrations are independently significant.
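The "peel-off" idea can be sketched as follows. This is an illustration under simplifying assumptions, not the GISTIC code: aberrations are represented as hypothetical (sample, start, end, amplitude) tuples over marker indices, and a fixed G-score cutoff (`g_cutoff`) stands in for the q-value significance threshold, which the full method recalculates after every removal.

```python
def g_profile(segments, n_markers, n_samples):
    """G-score per marker: summed amplitude of covering segments / number of samples."""
    g = [0.0] * n_markers
    for _sample, start, end, amplitude in segments:
        for i in range(start, end + 1):
            g[i] += amplitude / n_samples
    return g

def peel_off(segments, n_markers, n_samples, g_cutoff):
    """Return peak marker indices found by iteratively removing covering segments.

    g_cutoff is a hypothetical stand-in for the q-value threshold.
    """
    peaks = []
    remaining = list(segments)
    while remaining:
        g = g_profile(remaining, n_markers, n_samples)
        best = max(range(n_markers), key=lambda i: g[i])
        if g[best] < g_cutoff:
            break  # nothing left that reaches the cutoff
        peaks.append(best)
        # Peel off every aberration that overlaps the current peak.
        remaining = [s for s in remaining if not (s[1] <= best <= s[2])]
    return peaks
```

For example, with three samples carrying a focal event over markers 2-3, two samples carrying a focal event over markers 7-8, and one sample with a broad low-level gain across all ten markers, the first pass selects marker 2; after the segments covering it (including the broad gain) are removed, the second pass still finds marker 7 above the cutoff.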
Stage 4
To determine the effects of driver aberrations identified in stage 2, we must classify tumors as to whether they have these aberrations. Because the peak regions are most likely to contain the oncogene or TSG targets of these aberrations, GISTIC first classifies each tumor according to its copy-number status at the peak regions. For broad aberrations, which may be specifically disrupting a large region of the genome, GISTIC also classifies each tumor as to whether it is aberrant across most of the length of the region.
Detailed Description
Required Inputs.
The inputs to GISTIC are the following files (details regarding software availability and exact file formats can be found at www.broad.mit.edu/cancer/pub/GISTIC).
(a) A .snp file that represents either the signal intensities or log2 ratio for each of the genomic markers (in our case, SNPs, although non-polymorphic loci interrogated by comparative genomic hybridization methods may also be used) across a set of samples.
(b) A .loh file representing the inferred loss of heterozygosity (LOH) status, either as discrete calls or probabilities (similar format to .snp files).
(c) A sample info file which denotes for each array its array name, sample name, tumor type, ploidy, paired normal, batch, gender and platform. Additional information for each sample can be supplied for visualization and correlation purposes.
(d) A genome info file with the location of each marker.
(e) A cytogenetic info file with cytoband locations.
(f) A copy number variation file with the locations of germline copy-number polymorphisms.
(g) A transcript database with gene locations.
(h) An optional list of known target gene symbols for visualization purposes.
(i) An optional list of general cancer gene symbols for reporting purposes.
(j) A parameter file with values for the various parameters used in the algorithm.
Stage 1: Characterization of Chromosomal Aberrations on a Per-Tumor Basis
In this stage, we systematically characterize on a genome-wide basis the amplifications, deletions, and loss-of-heterozygosity (LOH) events affecting each tumor. We aim to reduce inaccuracies in these determinations due to systematic artifacts, random error, and poor-quality data. In the case of copy-number determinations, systematic errors are controlled by correcting for batch effect, selecting appropriate germline datasets for normalization, and controlling for germline copy-number polymorphisms. The effects of random noise are minimized by use of a segmentation algorithm and application of a threshold for calling amplification or deletion that is rarely attained by fluctuations in segmented copy-number values in normal samples. Duplicate samples from one individual and samples with poor-quality data (i.e., copy-number changes were not reliably distinguishable) are eliminated.
The initial steps are aimed at controlling for systematic errors that can lead to false amplifications and deletions at a single genomic location across tumors. Even when these artifacts occur at a very low frequency, when we consider the hundreds of thousands of markers that may be present in the dataset, we are likely to encounter a few artifacts whose consistency across multiple tumors will lead them to appear even more significant than real changes associated with tumorigenesis. Therefore, controlling for these systematic errors is an essential step in a high-resolution genome-wide approach. We and others (7-10) have found that slight variations in experimental conditions between successive arrays can lead to these systematic changes in signal intensities. We therefore control for these experimental variations in two steps: (i) correcting for variations due to batch effect, where batches are defined by the date and core facility in which the data were generated; and (ii) selecting for normalization a set of normal samples that are most similar to the tumor sample according to their baseline signal intensity variations across the genome.
Source Data
GISTIC can be applied to any dataset representing copy-number or LOH data measured across the genome. As an example of its application, we used 100K SNP array data from 141 gliomas, along with a set of normal controls. Here, probe-level signal intensity data were normalized to a baseline array with median intensity, using the invariant set normalization method (11). The signal intensity of each SNP was then obtained using a model-based (PM/MM) method (12). Genotyping calls were made by Affymetrix Genotyping Tools Version 2.0.
SI Fig. 10 a and b shows the raw signal and genotyping calls as heatmaps.
Data Preprocessing
The noise in signal intensities is dominated by a multiplicative component. Hence, to make the noise constant across signal intensity, we log2-transform the data using a floor value of 1 to avoid small or negative numbers. Next, we bring the samples to the same scale by subtracting the median value across all markers for each sample.
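A minimal sketch of this preprocessing step, assuming the raw intensities are held as one list per sample (the function name and data layout are hypothetical):

```python
import math
from statistics import median

def preprocess(raw, floor=1.0):
    """Log2-transform raw intensities with a floor, then median-centre each sample.

    raw[j][i] is the raw signal intensity of sample j at marker i.
    """
    centred = []
    for row in raw:
        # The floor avoids taking the logarithm of small or negative numbers.
        logged = [math.log2(max(x, floor)) for x in row]
        sample_median = median(logged)
        centred.append([v - sample_median for v in logged])
    return centred
```

After this step every sample has a median of zero on the log2 scale, so profiles from different arrays can be compared directly.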
Batch Effect Correction
In this step, we assume that signal intensity variations due solely to batch effect are likely to be marked by their consistency within a batch, and variance from other batches. We therefore compare, for each marker independently, the distribution of signal intensities from all tumor and normal samples in a given batch to that of the tumor and normal samples from all other batches, using a variance-thresholded t test with minimal variance of s2min ( = 0.16 in our case), which represents the typical level of noise per marker and can be estimated using replicate datasets. For markers and batches where the t test yields an asymptotic p-value less than Pbatch_effect_cutoff ( = 0.001), a constant is added to the signal intensities of that marker in all samples in each variant batch, to yield the same mean signal intensity as all non-variant batches. Batches with fewer than Nmin ( = 5) samples are not modified in this manner.
Among our data, 4.9% ± SD 9.4% of loci were modified in each batch in this manner. The locations of these loci varied widely between batches, such that 63% of loci were corrected in at least one of the 14 batches (SI Fig. 10c). However, in a majority of cases these corrections were small, with the signal intensity difference averaging 4.2% ± SD 2.5% of the unperturbed signal intensity. Thus the benefit from batch correction largely derives from correcting the small number of markers with more pronounced batch effects.
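The per-marker batch test and correction can be sketched as below. This is an illustrative reading of the description, not the original module. Two details are assumptions of the sketch: the variance threshold is applied by flooring each group's sample variance at the minimal variance, and the asymptotic p-value is taken from a two-sided normal approximation to the t statistic. All names are hypothetical.

```python
import math
from statistics import mean, variance

def batch_correct(values, batches, s2_min=0.16, p_cutoff=0.001, n_min=5):
    """Correct one marker. values[k] is the log2 intensity of sample k,
    batches[k] its batch label. Returns a corrected copy of values."""
    by_batch = {}
    for idx, label in enumerate(batches):
        by_batch.setdefault(label, []).append(idx)

    variant = {}
    for label, idxs in by_batch.items():
        others = [i for i in range(len(values)) if batches[i] != label]
        if len(idxs) < n_min or len(others) < 2:
            continue  # small batches are left unmodified
        inside = [values[i] for i in idxs]
        outside = [values[i] for i in others]
        # Assumption: floor each group's variance at the minimal variance.
        v_in = max(variance(inside), s2_min)
        v_out = max(variance(outside), s2_min)
        t = (mean(inside) - mean(outside)) / math.sqrt(
            v_in / len(inside) + v_out / len(outside))
        # Assumption: two-sided p-value from the normal approximation.
        p = math.erfc(abs(t) / math.sqrt(2))
        if p < p_cutoff:
            variant[label] = idxs

    corrected = list(values)
    reference = [values[i] for i in range(len(values)) if batches[i] not in variant]
    if reference:
        ref_mean = mean(reference)
        for idxs in variant.values():
            # Add a constant so the variant batch matches the non-variant mean.
            shift = ref_mean - mean(values[i] for i in idxs)
            for i in idxs:
                corrected[i] += shift
    return corrected
```

In a toy marker with four batches of six samples, one of which is offset by 3 log2 units, only the offset batch is flagged and shifted back to the mean of the others; the remaining batches are returned unchanged.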
Selection of Germ-Line Samples for Normalization
To obtain copy number estimates for a sample, we first calculate log2 tumor-to-normal copy-number ratios at each marker. These are calculated by subtracting the average of log2-transformed signal intensities from a set of normal controls from the log2-transformed signal intensity of the tumor. When examining the normal controls, we observed that subsets of samples exhibit systematic deviations in signal intensity across megabases of the genome. Replicate datasets representing the same sample often display different patterns of systematic deviation (data not shown). This suggests that some of the variation in signal intensities between samples is due to experimental factors that enter during data generation. Some of these experimental factors have been previously modeled (8, 9); it is possible that many have not. We chose to correct the systematic effects due to these factors, including those that have not been modeled, simply by selecting the set of normal controls that share similar noise profiles to the sample being normalized. To this end, we identify the Nclose ( = 5) normal samples that are closest to the sample being normalized, as measured by Euclidean distance between the log2-transformed signal profiles (again ensuring these profiles are at the same scale between samples by subtracting the median value across all markers for each sample), and use these for normalization.
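The selection of the closest normal controls can be sketched as follows; the function names are hypothetical and the profiles are assumed to be log2-transformed already.

```python
import math
from statistics import median

def centre(profile):
    """Subtract the sample median so that profiles share the same scale."""
    m = median(profile)
    return [v - m for v in profile]

def normalize_to_closest(tumor, normals, n_close=5):
    """Log2 tumor-to-normal ratios using the n_close nearest normal profiles.

    tumor and each entry of normals are log2 signal profiles over the same markers.
    """
    t = centre(tumor)
    centred_normals = [centre(n) for n in normals]
    # Rank normals by Euclidean distance to the tumor profile.
    ranked = sorted(centred_normals, key=lambda n: math.dist(t, n))
    chosen = ranked[:n_close]
    baseline = [sum(column) / len(chosen) for column in zip(*chosen)]
    return [tv - b for tv, b in zip(t, baseline)]
```

Normals whose genome-wide noise profile differs strongly from the tumor's are ranked last and do not contribute to the baseline.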
SI Fig. 10d shows the normalized data as a heatmap.
Merging Platforms
In some cases each sample is profiled using more than one platform, interrogating different sets of loci. For example, the 100K SNP arrays we used consist of independent 50K Xba and Hind array platforms. In this step, we merge the data from these platforms, interlacing markers according to position on the genome. In principle, this step can be used to merge data from different technologies such as array CGH or BAC arrays. However, care must be taken if the different platforms have highly variant dynamic range or noise characteristics.
A more general problem is that of merging datasets in which each sample was assayed using a separate set of platforms. We do not address this issue here.
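For the simpler case in which the same sample is profiled on two platforms, interlacing markers by genomic position amounts to a sorted merge. The sketch below assumes each platform supplies (chromosome, position, value) tuples already sorted by position, with chromosomes given as integers; this layout is hypothetical.

```python
import heapq

def merge_platforms(*platforms):
    """Interlace markers from several platforms by (chromosome, position).

    Each platform is a list of (chromosome, position, value) tuples that is
    already sorted by genomic position.
    """
    return list(heapq.merge(*platforms, key=lambda m: (m[0], m[1])))
```

Because `heapq.merge` only assumes that each input is sorted, the same call works for more than two platforms.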
Quality Control
In this step, samples with poor-quality data are removed. Copy-number profiles can suffer from either of two features that will make them non-informative. First, extreme levels of noise can lead to the inability to distinguish copy-number changes; second, high levels of contamination with normal cells (even in samples that appear highly enriched for tumor) can dampen the signal intensity differences between copy-number changes to the extent that copy-number changes are not robustly resolved. These two features work in tandem: a larger amount of contaminating normal DNA may be tolerated if the signal-to-noise ratio is high and small changes in signal can be robustly detected.
We assessed both noise level and normal contamination simultaneously by generating histograms of the log2 ratios collected at each autosomal marker locus (SI Fig. 10e). We first smooth the log2 ratios by taking the mean value across a running window of Hwindow ( = 501) markers. The histogram is generated using a bin size Hbin ( = 0.01), and smoothed by convolving with a Gaussian distribution that has a standard deviation of Hsigma ( = 0.05).
For a dataset from a homogeneous tumor sample, we expect to identify separate peaks in this histogram corresponding to the separate copy numbers in the tumor's genome (SI Fig. 10e). As more contaminating normal DNA is mixed with the tumor's, however, the observed signal in all cases will approach normal levels and the peaks will tend to coalesce. Conversely, as the noise level increases, so will the width of each peak, resulting in a single broad hump from which separate peaks cannot be resolved. Therefore, each tumor whose smoothed histogram has only a single peak is marked as having failed quality control.
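The histogram-based check can be sketched as follows. The running mean, bin size, and Gaussian width follow the parameters given above; the way peaks are counted (local maxima reaching at least a fraction `min_height_frac` of the tallest peak) and the truncation of the Gaussian kernel at four standard deviations are assumptions of this sketch, as are all function names.

```python
import math

def running_mean(values, window):
    """Mean over a sliding window, evaluated wherever the window fits."""
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        out.append(total / window)
    return out

def smoothed_histogram(values, bin_size=0.01, sigma=0.05):
    """Histogram of values, convolved with a Gaussian truncated at 4 sigma."""
    lo = min(values) - 4 * sigma
    n_bins = int((max(values) + 4 * sigma - lo) / bin_size) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - lo) / bin_size)] += 1
    half = int(4 * sigma / bin_size)
    kernel = [math.exp(-0.5 * (k * bin_size / sigma) ** 2)
              for k in range(-half, half + 1)]
    smooth = []
    for b in range(n_bins):
        s = 0.0
        for k, w in enumerate(kernel):
            idx = b + k - half
            if 0 <= idx < n_bins:
                s += w * counts[idx]
        smooth.append(s)
    return smooth

def count_peaks(smooth, min_height_frac=0.05):
    """Count local maxima that reach min_height_frac of the tallest bin (assumption)."""
    top = max(smooth)
    peaks = 0
    for b in range(1, len(smooth) - 1):
        is_local_max = smooth[b] > smooth[b - 1] and smooth[b] >= smooth[b + 1]
        if is_local_max and smooth[b] >= min_height_frac * top:
            peaks += 1
    return peaks

def passes_qc(log2_ratios, window=501, bin_size=0.01, sigma=0.05):
    """A sample passes if its smoothed histogram shows more than one peak."""
    smoothed = running_mean(log2_ratios, window)
    return count_peaks(smoothed_histogram(smoothed, bin_size, sigma)) >= 2
```

In this sketch a profile with two well-separated copy-number levels passes, whereas one whose levels differ by less than the Gaussian width collapses into a single peak and fails, mirroring the effect of heavy contamination with normal DNA.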
Some datasets may not display separate peaks despite high-quality data from highly enriched tumors because copy-number changes are not extensive enough in the tumor to be visible as separate peaks in the histogram. This is likely to be true particularly among tumor types with predominantly diploid genomes. In the case of glioma, however, almost all tumors appear to suffer significant levels of aneuploidy. Eight samples were analyzed after tumor purity was assured by obtaining DNA after needle dissection. In all eight, separate peaks indicative of abnormal copy numbers could be observed. However, in four of these cases where DNA was obtained without needle dissection, separate peaks were not observed. This suggests that the majority of tumor samples will have detectable copy-number alterations by histogram analysis if sufficiently pure.
Based on these analyses, measurable copy-number differences were resolved in the histograms of 105 samples; all of these were included in further analysis. The clinical characteristics of these included samples are similar to those of the overall tumor set (SI Table 2), suggesting the selection process is not biased. The copy-number profiles of these 105 samples (segregated on the left in Fig. 2a) are similar to those that were removed (on the right), but with greater amplitude of variation. Nevertheless, we removed the samples on the right because the low amplitudes with which their copy numbers vary make their classification (aberrant vs. not aberrant) at any locus less reliable.
Removal of Duplicates
Due to inaccuracies in sample tracking, large tumor sets can contain duplicate samples from the same individual. This can bias downstream analyses of the frequencies and common boundaries of chromosomal aberrations. The use of SNP arrays for genome analysis also provides genotype data that allow duplicates to be eliminated by identifying samples with similar genotypes. Here, we score by genotype each of the mtot SNPs that are assayed in each sample: A = 1, AB = 2, and B = 3. For every pair of tumor samples, we calculate the Euclidean distance between all minf SNPs that are informative (i.e., not "No Call") in both samples, and divide by the square root of minf. Pairs for which this normalized Euclidean distance is less than a threshold qdup ( = 0.4 on the basis of experience with known replicates) are identified as coming from the same individual. For such duplicates, only the tumor with higher-quality data (represented by more distinct peaks in its quality control histogram, above) is retained.
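A minimal Python sketch of this duplicate check is given below. It assumes that the normalizing factor is the square root of the number of informative SNPs (so that the distance is a root-mean-square genotype difference) and that "No Call" is encoded as 0; the function names and the encoding are hypothetical.

```python
import numpy as np

NO_CALL = 0  # hypothetical code for "No Call"; A = 1, AB = 2, B = 3 as in the text

def normalized_genotype_distance(g1, g2):
    """Euclidean distance over SNPs informative in both samples, divided by sqrt(m_inf)."""
    g1 = np.asarray(g1)
    g2 = np.asarray(g2)
    informative = (g1 != NO_CALL) & (g2 != NO_CALL)
    m_inf = int(informative.sum())
    if m_inf == 0:
        return float("nan")  # no shared informative SNPs, so no decision is possible
    diff = g1[informative].astype(float) - g2[informative].astype(float)
    return float(np.sqrt(np.sum(diff ** 2)) / np.sqrt(m_inf))

def find_duplicate_pairs(genotypes, q_dup=0.4):
    """Return index pairs of samples whose normalized distance is below q_dup."""
    genotypes = np.asarray(genotypes)
    pairs = []
    for a in range(len(genotypes)):
        for b in range(a + 1, len(genotypes)):
            # A NaN distance compares as False and is therefore never flagged.
            if normalized_genotype_distance(genotypes[a], genotypes[b]) < q_dup:
                pairs.append((a, b))
    return pairs
```

Choosing which member of a flagged pair to retain (the one with more distinct histogram peaks) is left outside this sketch.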
Copy-Number Assessment
Copy-number determinations are most reliable when data from neighboring markers with the same underlying copy number are combined to reduce the effects of noise. Several methods for such noise reduction have been reported (1-4, 6). We chose to use the segmentation method Gain and Loss Analysis of DNA (GLAD) (3). The inputs to the segmentation algorithm are the log2 ratios rij for each marker i and sample j. We denote the segmented (and smoothed) data by cij. (N.B. we do not use the GLAD postprocessing clustering steps, but only the initial steps aimed at segmenting and smoothing the data.) GLAD tends to misidentify outliers as separate segments (1). To correct this, we join any segment with fewer than Nshort ( = 5) markers to the neighboring segment with the closest cij, and assign the new segment a new cij reflecting the median rij across all markers in the combined segment. This step is performed recursively until no segments with fewer than Nshort markers are left.
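The joining of short segments can be illustrated with the following Python sketch for a single sample. It is not GLAD itself: segment start positions are taken as given, each initial segment value is assumed to be the median of its raw log2 ratios, and the order in which short segments are processed (shortest first) is an assumption, since the text does not specify it.

```python
import numpy as np

def merge_short_segments(r, breakpoints, n_short=5):
    """Join segments shorter than n_short markers to the neighbour with the closest value.

    r           : raw log2 ratios of one sample, ordered along the genome
    breakpoints : sorted start indices of the segments (the first must be 0)
    Returns a list of (start, end_exclusive, value) tuples.
    """
    r = np.asarray(r, dtype=float)
    starts = list(breakpoints)
    ends = starts[1:] + [len(r)]
    # Assumption: each segment value is the median of its raw log2 ratios.
    segs = [[s, e, float(np.median(r[s:e]))] for s, e in zip(starts, ends)]
    while len(segs) > 1:
        lengths = [e - s for s, e, _ in segs]
        k = int(np.argmin(lengths))  # shortest segment first (assumed order)
        if lengths[k] >= n_short:
            break  # no short segments are left
        neighbours = [i for i in (k - 1, k + 1) if 0 <= i < len(segs)]
        j = min(neighbours, key=lambda i: abs(segs[i][2] - segs[k][2]))
        lo, hi = min(j, k), max(j, k)
        s, e = segs[lo][0], segs[hi][1]
        # The combined segment takes the median raw log2 ratio of all its markers.
        segs[lo:hi + 1] = [[s, e, float(np.median(r[s:e]))]]
    return [tuple(seg) for seg in segs]
```

For example, a two-marker outlier segment lying between two long segments is absorbed by whichever neighbor has the closer value, and the merged segment is then re-summarized by its median.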
SI Fig. 10f shows the segmented data for the 105 samples that passed quality control.
Copy-Number Variation (CNV) Control
To eliminate copy-number variations derived from polymorphic germline events, markers from regions with known germline copy number variations as listed in http://projects.tcag.ca/variation (13-16) are removed.
Identification of Copy-Number Aberrations
To call regions amplified or deleted we first need to set log2 ratio thresholds: qamp and qdel. Markers with cij > qamp are called amplified and ones with cij < qdel are called deleted. We set these thresholds based on the empirical distribution of values in normal samples. First, all samples are brought to the same baseline signal intensity by subtracting the median cij value across all markers from each sample, to generate new cij values. The thresholds are then set such that only Fnormal ( = 0.5%) of markers on autosomes pass each threshold among normal samples. In the case of the glioma dataset, this yielded qamp = 0.10 and qdel = -0.10.
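The threshold-setting step can be sketched as follows, under the assumptions that the input matrix holds only autosomal markers and that the normal samples are pooled before the empirical quantiles are taken. The function names are illustrative.

```python
import numpy as np

def center_samples(c):
    """Subtract each sample's median across markers; c has shape (markers, samples)."""
    c = np.asarray(c, dtype=float)
    return c - np.median(c, axis=0, keepdims=True)

def thresholds_from_normals(c_normal, f_normal=0.005):
    """Choose q_amp and q_del so that a fraction f_normal of normal markers passes each."""
    values = center_samples(c_normal).ravel()  # pooled over normal samples (assumption)
    q_amp = float(np.quantile(values, 1.0 - f_normal))
    q_del = float(np.quantile(values, f_normal))
    return q_amp, q_del
```

Tumor samples would be centered with the same `center_samples` step before their values are compared against the two thresholds.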
To test the reliability of these calls, in the case of 13 tumors we obtained DNA from separate aliquots of the same tumor, where both aliquots produced data with measurable copy number differences on histogram analysis (see above). A median of 90.2% of copy-number calls were identical between the separate aliquots of each tumor. This is a conservative estimate of the reliability of these calls, as some of the difference between aliquots reflects real differences due to tumor heterogeneity.
SI Fig. 10g displays the log2 signal intensities in the regions of the genome for which cij > qamp.
LOH Assessment
When using paired tumor and normal genotyping data, the LOH status at all loci (including non-informative loci) is inferred by applying an HMM that takes into account the LOH calls at informative loci (17). When using genotyping data from unpaired tumors (as is our case), the LOH status at all loci is inferred on the basis of extent of regional homozygosity, taking into account the haplotype structure of the genome (17).
SI Fig. 10h displays the LOH identified among the 105 tumors with high tumor purity.
At this point, each sample has been assessed for amplifications, deletions, and LOH, and in Stage 2 we will distinguish between likely driver and passenger aberrations.
Stage 2: Aggregation of Data from Different Tumors to Differentiate Between Driver and Passenger Aberrations
To determine which of the aberrations identified in Stage 1 are likely to represent driver events, we aggregate the data from all tumors used in the analysis to generate summary scores for amplifications, deletions, and LOH. The statistical significance of each score is determined by comparison to the distribution of scores obtained by all permutations of the data (using a semiexact approximation), with correction for multiple hypothesis testing.
Scoring the Copy-Number Genome
We assume there are two sources of evidence that a copy-number aberration is not a chance event: f, the frequency of the aberration across the sample set, and $\bar{c}$, its average amplitude among the samples that carry it. Therefore, the scores we generate for amplification and deletion events at each marker i reflect both sources of evidence:
$G_i^{amp} = f_i^{amp}\,\bar{c}_i^{amp}$, and
$G_i^{del} = f_i^{del}\,|\bar{c}_i^{del}|$. [1]
We wanted the score to represent the negative log of the likelihood of observing the contributing aberrations by chance alone. We found that log2 ratios approximate these negative log likelihoods, both for amplifications and deletions, as estimated by the overall frequency of aberrations, as a function of amplitude, across our glioma dataset (data not shown).
Note that the scores in Eq. 1 can also be represented as a sum across the n samples in the set:
$G_i^{amp} = \frac{1}{n}\sum_{j=1}^{n} c_{ij}\,I(c_{ij} > q_{amp})$, and
$G_i^{del} = \frac{1}{n}\sum_{j=1}^{n} |c_{ij}|\,I(c_{ij} < q_{del})$, [2]
where $I(\cdot)$ equals 1 when its argument is true and 0 otherwise.
Using the sum representations we can calculate the scores, for example, by first replacing every cij ≤ qamp with 0 and then summing over j and dividing by n.
As LOH is not associated with an amplitude, the LOH score represents only the frequency of LOH across the sample set, which can also be rewritten as a sum:
$G_i^{LOH} = f_i^{LOH} = \frac{1}{n}\sum_{j=1}^{n} I(\text{LOH at marker } i \text{ in sample } j)$. [3]
SI Fig. 10i displays the G scores associated with amplifications in our tumor set, along with the frequency and average amplitude components of these scores and the per-tumor amplification data (after replacing values ≤ qamp with 0) that gave rise to them.
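As an illustration of the scoring step, the sketch below computes per-marker scores as frequency times mean amplitude, which is the same as averaging the thresholded values over all samples. Treating the frequency as a fraction of samples and scoring deletions by the magnitude of their log2 ratios are assumptions made here; the function names are hypothetical.

```python
import numpy as np

def g_scores(c, q_amp=0.10, q_del=-0.10):
    """Per-marker amplification and deletion scores; c has shape (markers, samples)."""
    c = np.asarray(c, dtype=float)
    amp = np.where(c > q_amp, c, 0.0)    # values not called amplified are set to 0
    dele = np.where(c < q_del, -c, 0.0)  # deletions scored by magnitude (assumption)
    # Mean over samples = (fraction of aberrant samples) x (their mean amplitude).
    return amp.mean(axis=1), dele.mean(axis=1)

def g_loh(loh_calls):
    """LOH score: fraction of samples with LOH at each marker (boolean or 0/1 input)."""
    return np.asarray(loh_calls, dtype=float).mean(axis=1)
```

For a marker amplified in 2 of 4 samples with log2 ratios 0.5 and 1.5, the frequency is 0.5 and the mean amplitude is 1.0, so the amplification score is 0.5 under this convention.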
Null Hypothesis Generation: An Analytic Derivation of the Null Distribution
To assess which of these peaks are statistically significant, we identify those G scores which rise above the null distribution of values one would expect to obtain from random passenger aberrations alone. Since passenger aberrations could occur anywhere in the genome, one may model this null distribution by recalculating the G scores across all combinations of permutations of the marker labels within each sample. Note that by assuming, in these permutations, that all observed aberrations (including driver aberrations) are passengers, we generate a conservative, high estimate of the background aberration rate.
Although one can simulate the null distribution by performing each of these permutations in turn, we in fact derive a semiexact estimate of this null distribution. For amplifications and deletions separately, we replace the log2 ratios in each marker not called aberrant with zero:
$\tilde{c}_{ij}^{amp} = c_{ij}$ if $c_{ij} > q_{amp}$, and 0 otherwise, and
$\tilde{c}_{ij}^{del} = |c_{ij}|$ if $c_{ij} < q_{del}$, and 0 otherwise. [4]
As noted above, the G scores for each marker can be calculated by summing the corresponding $\tilde{c}_{ij}$ across all samples. Under the null hypothesis, the arrangement of $\tilde{c}_{ij}$ values is independent between samples and therefore the distribution of the sum of $\tilde{c}_{ij}$ across the samples is the same for all markers and equals the convolution of the distributions of $\tilde{c}_{ij}$ values in each sample. We approximate these distributions by generating histograms for each sample, $h_j^{amp}$ and $h_j^{del}$, using a bin size of Cbin ( = 0.001). Note that as Cbin approaches zero the approximation becomes exact. For LOH the histograms have two values: the fraction of markers that do not have LOH and the fraction that do. The final distribution for Gamp is given by
$h^{amp} = h_1^{amp} * h_2^{amp} * \cdots * h_n^{amp}$,
where $*$ denotes convolution, and similarly for Gdel and GLOH.
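The convolution of per-sample histograms can be sketched in a few lines of Python. The sketch assumes non-negative thresholded inputs (deletions expressed as magnitudes) and returns the distribution of the sum over samples on a grid of multiples of the bin size; if scores are defined as means over samples, the grid is simply rescaled by the number of samples. The function names are illustrative.

```python
import numpy as np

def sample_histogram(values, c_bin=0.001):
    """Probability mass of one sample's thresholded values on the grid 0, c_bin, 2*c_bin, ..."""
    idx = np.rint(np.asarray(values, dtype=float) / c_bin).astype(int)
    return np.bincount(idx) / idx.size

def null_distribution(c_tilde, c_bin=0.001):
    """Distribution of the per-marker sum over samples when marker positions are random.

    c_tilde has shape (markers, samples) and holds non-negative thresholded values.
    Entry k of the result is the probability that the sum equals k * c_bin.
    """
    c_tilde = np.asarray(c_tilde, dtype=float)
    pmf = np.array([1.0])  # distribution of an empty sum
    for j in range(c_tilde.shape[1]):
        pmf = np.convolve(pmf, sample_histogram(c_tilde[:, j], c_bin))
    return pmf
```

With two samples that are each aberrant (value 1) at one of four markers and a bin size of 1, the result is [0.5625, 0.375, 0.0625], the distribution of the number of aberrant samples at a randomly placed marker.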
Significance Testing
We next assign statistical significance to the observed G scores using the null distribution calculated in the previous step. The p-value for an observed G score is simply the sum of the tail of the null distribution from the observed score and above. Next, to correct for multiple hypothesis testing we apply the Benjamini-Hochberg FDR procedure (7) to obtain q-values. These corrected probabilities are an upper bound for the expected fraction of false positives. Note that these q values are conservative since we treat all markers as independent hypotheses when in fact close markers are highly positively correlated.
Regions with q values of less than 0.25 are marked as significantly aberrant (Fig. 2).
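The significance step can be illustrated with the sketch below: a tail sum over a discretized null distribution followed by the Benjamini-Hochberg adjustment. It assumes the null distribution is given as probability masses on a grid of multiples of the bin size, on the same scale as the observed scores; the function names are hypothetical.

```python
import numpy as np

def tail_p_values(g_obs, null_pmf, c_bin=0.001):
    """p-value = null probability of a score at or above the observed one."""
    null_pmf = np.asarray(null_pmf, dtype=float)
    # tail[k] = sum of null_pmf[k:]; a trailing 0 covers scores beyond the grid.
    tail = np.append(np.cumsum(null_pmf[::-1])[::-1], 0.0)
    idx = np.rint(np.asarray(g_obs, dtype=float) / c_bin).astype(int)
    idx = np.clip(idx, 0, len(tail) - 1)
    return np.minimum(tail[idx], 1.0)

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values (q values), treating markers as independent."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

def significant_markers(g_obs, null_pmf, c_bin=0.001, q_cut=0.25):
    """Boolean mask of markers whose q value falls below the cutoff."""
    return benjamini_hochberg(tail_p_values(g_obs, null_pmf, c_bin)) < q_cut
```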
SI Fig. 10j displays the q-values associated with amplifications in our tumor set, along with the G scores they are associated with and the per-tumor amplification profiles on which these scores are based.
Stage 3: Identification of Peak Regions Most Likely to Contain the Oncogene and Tumor Suppressor Gene Targets
In this stage, we consider for each significantly aberrant region which are the most likely oncogene and TSG targets. We consider the possibility that the region may encompass two or more independently aberrant genes. We also consider the possibility that a random "passenger" mutation occurring in a single sample near, but not overlapping, the oncogene or TSG genes will distract us from those genes.
Identification of Minimal Targeted Loci
If the driver aberrations within a region are selected due to their effects on a single gene, we would expect that gene to lie in the region where the largest number of tumors are aberrant to the highest degree. This locus equates with the locus with the minimal q value (and maximal G score). Therefore, within each region found to have a q value less than 0.25, we identify the peak region with minimal q value as the primary target. This peak region might contain many markers, as long as they have exactly the same q value. Usually these are neighboring markers that lie on the same copy-number or LOH segment in every sample in the dataset.
Identification of Independent Peak Regions: "Peel-Off" Algorithm
It is possible that two or more peaks within a significant region are independently aberrant, but due to overlap between some aberrations associated with each peak, the entire region appears statistically significant. To recapture all of these independent peak regions, we implement an iterative "peel-off" algorithm (SI Fig. 10k). Here, for each chromosome that has a region with a q value less than 0.25, we remove from the data all aberrations overlapping the region with minimal q value on the chromosome (the primary peak). An aberration is removed by setting to zero, in that sample, all consecutive markers that exceed the amplification or deletion threshold. We then recalculate G scores and q values, taking a conservative approach in which p and q values are still based on the original null distribution, which includes all aberrations. If any part of the chromosome continues to have a q value less than 0.25, we repeat the procedure by identifying the region with the minimal q value as a separate peak and peeling off the aberrations that overlap it. These iterations continue until no q values less than 0.25 are obtained in the chromosome. Note that this method greedily assigns an aberration that overlaps two or more peaks to the most significant locus.
In the glioma dataset, the "peel-off" algorithm identified 2 peak regions (corresponding to EGFR and MET) that are independently amplified within chromosome 7, although all of chromosome 7 constitutes a single region of significant amplification (SI Table 3 and SI Fig. 10l).
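A simplified sketch of the peel-off iteration for one chromosome is given below. It works on thresholded values (zero where a marker is not called aberrant), sums them over samples as the score, and takes the mapping from score to q value as a user-supplied function `g_to_q` that stands for the fixed, original null distribution. That function, the use of the contiguous run of maximal-score markers as the peak region, and all names are assumptions made for illustration.

```python
import numpy as np

def peel_off(c_tilde, g_to_q, q_cut=0.25):
    """Return (start, end) marker ranges of independent peaks on one chromosome.

    c_tilde : (markers, samples) thresholded values, 0 where not aberrant
    g_to_q  : hypothetical callable mapping a score to a q value under the
              original null distribution (kept fixed across iterations)
    """
    data = np.array(c_tilde, dtype=float)  # copy, so the input is not modified
    n_markers, n_samples = data.shape
    peaks = []
    while True:
        g = data.sum(axis=1)
        g_max = g.max()
        if g_max <= 0 or g_to_q(g_max) >= q_cut:
            break
        # Peak region: contiguous run of markers sharing the maximal score.
        start = end = int(np.flatnonzero(g == g_max)[0])
        while end + 1 < n_markers and g[end + 1] == g_max:
            end += 1
        peaks.append((start, end))
        for j in range(n_samples):
            col = data[:, j]  # view into data, so edits below change data
            if not col[start:end + 1].any():
                continue  # this sample has no aberration overlapping the peak
            lo, hi = start, end
            # Extend to the full run of consecutive aberrant markers in this sample.
            while col[lo] != 0 and lo > 0 and col[lo - 1] != 0:
                lo -= 1
            while col[hi] != 0 and hi + 1 < n_markers and col[hi + 1] != 0:
                hi += 1
            col[lo:hi + 1] = 0.0  # peel the aberration off
    return peaks
```

Because a broad aberration that spans two peaks is removed together with the first (most significant) peak, the second peak is reported only if the remaining aberrations alone keep it below the q cutoff, which mirrors the greedy assignment described above.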
Determination of Boundaries for Each Peak Region
For each independent peak, the boundaries of the region of minimal q value encompass the region with the greatest evidence for containing the oncogenes or TSGs, as that region is most aberrant in the largest number of samples. These particular boundaries, however, may be shifted from the oncogenes or TSGs due to the presence of a nearby random passenger mutation or by errors in the boundaries determined by the segmentation analysis in a single sample. Therefore, to ensure robustness of the boundaries that we identify, GISTIC recalculates the boundaries of each peak region after leaving out each sample in turn, and takes the maximum upper and minimum lower boundary of the peak of the score among all iterations. Note that this procedure uses only the data that correspond to the "peeled-off" segments that are associated with the analyzed peak. All genes that lie wholly or partially within these boundaries are considered candidate oncogenes or TSGs. If no gene is within these boundaries, the nearest gene is considered the likeliest candidate.
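The leave-one-out boundary calculation can be sketched as follows. The input is assumed to hold only the thresholded values of the aberrations associated with the peak being analyzed, and the maximal-score region in each iteration is assumed to be contiguous; the function name is illustrative.

```python
import numpy as np

def leave_one_out_boundaries(c_tilde):
    """Widest peak boundaries seen when each sample is left out in turn.

    c_tilde : (markers, samples) thresholded values of the aberrations
              associated with one peak (0 where not aberrant)
    Returns (lower, upper) marker indices, both inclusive.
    """
    data = np.asarray(c_tilde, dtype=float)
    lower, upper = [], []
    for j in range(data.shape[1]):
        g = np.delete(data, j, axis=1).sum(axis=1)  # scores without sample j
        at_peak = np.flatnonzero(np.isclose(g, g.max()))
        lower.append(at_peak[0])
        upper.append(at_peak[-1])
    return int(min(lower)), int(max(upper))
```

In the small example used to check this sketch, the full data place the peak at markers 3 to 4, but leaving out one sample extends it to marker 5, so the reported boundaries are 3 to 5.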
Broad vs. Focal Aberrations
Examination of the glioma genome (Fig. 2) reveals broad regions undergoing significant amplification or deletion in addition to focal events. The finding that some significant focal events lie within significant broad regions, whereas others do not, suggests the possibility that overlapping broad and focal aberrations may target different genes (see main text). Therefore, for each peak region we determine whether it is subject to significant broad or focal aberrations or both.
Any region that is statistically significant over more than half a chromosome arm harbors significant broad aberrations. In addition, for each peak, the G score required to attain significance (Gsig) is subtracted from the maximal G score, and the width of the region attaining this score is assessed. If this region does not cover more than half a chromosome arm, the peak harbors significant focal aberrations. For peaks that rise to G scores less than twice Gsig, the width of the region at half the maximal G score is used to determine whether the peak is due primarily to broad or focal events. The result in glioma is the identification of 16 significant broad events and 16 significant focal events (SI Table 3).
Here, we use a cutoff of one-half of a chromosome arm to define broad aberrations because most copy-number aberrations in the glioma dataset were either much larger or much smaller than this threshold (SI Fig. 11).
Stage 4: Classification of Tumors on the Basis of Their Driver Aberrations
To study the effects of driver aberrations, tumors must be classified according to whether they have them. For each tumor we determine whether it is aberrant at each peak region and, in the case of copy-number aberrations, whether it has a high- or low-level copy-number change. In the case of statistically significant broad regions, we classify tumors as to whether they are aberrant across most of the region.
Tumor Classification per Peak Regions
Samples are classified according to whether they have the appropriate aberration at each peak region. For instance, for each peak region of amplification, samples that were called amplified in Stage 1 are classified as aberrant; likewise for peak regions of deletion and LOH. In cases where these peaks comprise more than one marker, any sample that was called aberrant in the majority of these markers is classified as aberrant. In most cases, these calls are identical between markers within the peak region of minimal q value, as changes in any one sample will lead to changes in the G score and therefore the q value.
For peak regions of copy-number change, samples are also classified as to the amplitude of that change at each locus. The signal intensity distribution at EGFR (SI Fig. 7a) suggests a qualitative difference between samples with low-level amplification (qamp < cij < 0.9) and samples with high-level amplification (cij > 0.9, corresponding to at least 3.7 copies in a diploid cell). Therefore, we classify each tumor according to whether it has a low- or high-level amplification at each peak region of amplification, using cutoffs of qamp and qhi_amp ( = 0.9). To similarly distinguish between low-level (e.g., hemizygous) and high-level deletions, we applied cutoffs of qdel and qlo_del ( = -1.3, corresponding to less than 0.9 copies in a diploid cell).
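The copy-number equivalents quoted above follow from the relation copies = 2 x 2^(log2 ratio), assuming a pure tumor sample measured against a diploid (two-copy) reference. The sketch below applies that relation and the four cutoffs; the function names and category labels are illustrative, and values exactly equal to a cutoff are assigned to the lower-level category as an arbitrary choice.

```python
def copies_in_diploid_cell(log2_ratio):
    """Copy number implied by a log2 ratio relative to two copies (pure tumor assumed)."""
    return 2.0 * 2.0 ** log2_ratio

def classify_amplitude(c, q_amp=0.10, q_del=-0.10, q_hi_amp=0.9, q_lo_del=-1.3):
    """Classify one segmented log2 ratio at a peak region."""
    if c > q_hi_amp:
        return "high-level amplification"
    if c > q_amp:
        return "low-level amplification"
    if c < q_lo_del:
        return "high-level deletion"
    if c < q_del:
        return "low-level deletion"
    return "not aberrant"
```

Under these assumptions a log2 ratio of 0.9 corresponds to about 3.7 copies and a log2 ratio of -1.3 to about 0.8 copies, consistent with the values given in the text.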
Tumor Classification per Broad Regions
Samples are also classified as to whether they have each of the broad aberrations identified in Stage 3, using the boundaries of the broad region as determined in Stage 3. Any sample that in Stage 1 is called with the appropriate aberration (e.g., amplified in a significantly amplified region) in more than half of the markers within this broad region is classified as having a broad aberration in the region.
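The "more than half of the markers" rule, which is used for broad regions and also for peak regions spanning several markers, can be sketched as below. The boolean call matrix and the function name are assumptions for illustration.

```python
import numpy as np

def has_aberration(calls, start, end):
    """Per-sample flag: aberrant in more than half of the markers in [start, end].

    calls : boolean array of shape (markers, samples) holding the Stage 1 calls
            of the relevant type (amplified, deleted, or LOH)
    """
    region = np.asarray(calls, dtype=bool)[start:end + 1]
    return region.mean(axis=0) > 0.5
```

A sample that is aberrant in exactly half of the markers is not classified as carrying the aberration under this strict-majority reading.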
Having classified every tumor as to its status at every targeted locus and broad region of aberrancy, the GISTIC algorithm is complete.
Output from GISTIC
The results of the algorithm are contained in the following files:
(a) Display files in .pdf, .eps and .fig formats showing the variation in G scores and associated q values for all markers along the genome.
(b) An all lesions file that describes all of the significant aberrations and peak regions, and the status of each sample at each focal and broad region.
(c) A segmented_data file that represents the cij values after batch correction, normalization, segmentation analysis, and removal of copy-number polymorphisms.
(d) A gene table which lists the genes that overlap with each of the peak regions. Genes that are listed as known targets or generally related to cancer are highlighted (if such lists are provided).
(e) A histograms file (.pdf) with a histogram plot for each sample and a mark indicating whether the sample has passed the histogram quality control step.
SI Note 1: LOH Analysis
The G scores and corresponding significance levels for LOH (SI Fig. 12) yield a similar pattern to deletions, with 2 exceptions: (i) High-level amplifications of EGFR on chr7 are scored as LOH because they give rise to an allelic imbalance that obscures the minor allele; and (ii) chr17p (containing the TSG TP53) appears to primarily undergo copy-neutral LOH, with multiple samples exhibiting regional homozygosity despite retaining two copies of the chromosome (SI Fig. 12). Other than these cases, the similar pattern between LOH and deletions indicates that the reduction to homozygosity that represents LOH is usually due to hemizygous deletion of one allele. However, the ability to map deletions is superior to LOH, due to 2 factors: (i) LOH is obscured by low levels of contaminating normal DNA that are tolerated by deletion mapping, and (ii) the resolution of LOH analysis is poorer than for deletions. This latter factor is true when paired normal samples are used to map LOH (because most SNP markers are homozygous in the normal sample and therefore uninformative as to LOH status of the tumor) or when paired normal samples are not used (given the necessary reduction in resolution this implies) (17). For these reasons, we placed more emphasis on the results for deletions except in the primarily copy-neutral case of LOH at chr17p.
SI Note 2: Minimal Common Region Analysis of 141 Gliomas
As a comparison to the GISTIC method, we performed an analysis of the minimal common regions of copy-number variation in our 100K SNP array data from 141 gliomas. Here GLAD (3) was used to segment the raw log2 ratios generated from the signal intensity (after brightness correction (11) and model-based expression (12)) of the tumor divided by the mean signal intensity of all normal controls at each SNP locus. Segments for which the median log2 ratio across all SNPs was greater than 0.1 or less than -0.1 were called amplified or deleted, respectively. For each region found to be amplified or deleted in over 5% of samples, the minimal common regions of amplification or deletion were identified as potentially harboring oncogenes or tumor suppressor genes. This approach yielded 144 minimal common regions of amplification or deletion, harboring 5 of the known oncogenes and tumor suppressor genes in glioma. These results are similar to prior analyses of the glioma genome in terms of the number of regions selected and sensitivity to known oncogenes and tumor suppressor genes (SI Table 2).
The GISTIC analysis of the same dataset appears to provide superior specificity (identifying only 27 peak regions for copy-number aberrations) and sensitivity (identifying 9 of the known oncogenes and tumor suppressor genes in glioma) (see main text and Table 1). Three factors may contribute to the high level of specificity of GISTIC: (i) When using very high-resolution datasets, even systematic errors occurring in a small fraction of markers and tumors can give rise to large numbers of artifactual aberrations across the dataset. GISTIC minimizes these in multiple preprocessing steps. (ii) Without controlling for the background aberration rate, random events may be identified as interesting candidates. GISTIC uses a statistical test to eliminate these. (iii) Within a region that is frequently aberrant, multiple loci often share the same, maximal frequency of aberration, leading them all to be considered minimal common regions of amplification or deletion. GISTIC prioritizes those loci with the highest average amplitude of change.
SI Note 3: Comparative Outlier Analysis
To identify genes responsible for the functional effects of 7gain, we applied a 'comparative outlier analysis' in which we identified genes on the chromosome whose expression is an extreme outlier in at least 10% of the tumors with 7gain compared to tumors with 7norm (SI Table 4). Specifically, for each probeset 'PRBST' matching a gene on chr7, primary GBMs were classified according to their copy-number status at the gene locus (defined as the mean of segmented values across the minimal set of SNP markers that contain the gene) as 7norm (if qdel < cij < qamp), 7gain (if qamp < cij < 0.9), or 7gainPRBSTamp (if cij > 0.9). All expression values were normalized by subtracting the median and scaling by the median absolute deviation of 7norm samples. The outlier score represents the top 10th percentile (i.e., the 90th percentile) of these transformed expression values among the 7gain samples.
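For a single probeset, the outlier score can be sketched as follows. The sketch assumes an unscaled median absolute deviation, interprets the "top 10th percentile" as the 90th percentile, and does not handle the degenerate case of a zero deviation; the function name is hypothetical.

```python
import numpy as np

def outlier_score(expr_gain, expr_norm, percentile=90):
    """Outlier score of one probeset: high percentile of robustly scaled 7gain expression.

    expr_gain : expression values in tumors classified as 7gain
    expr_norm : expression values in tumors classified as 7norm
    """
    norm = np.asarray(expr_norm, dtype=float)
    med = np.median(norm)
    mad = np.median(np.abs(norm - med))  # unscaled MAD (assumption); must be non-zero
    z = (np.asarray(expr_gain, dtype=float) - med) / mad
    return float(np.percentile(z, percentile))
```

A gene that is strongly overexpressed in only a fifth of the 7gain tumors still receives a high score, whereas a gene that is slightly elevated in all of them does not, which is the behavior the analysis is designed to capture.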
The assumption behind this analysis is that broad aberrations, because they affect large numbers of genes, may have (i) polygenic effects, and (ii) heterogeneous effects across tumors (sometimes affecting one set of genes and other times affecting a different set). Therefore, we did not look for genes that are consistently up-regulated in 7gain, but rather genes that are overexpressed in some samples with 7gain, compared to the distribution expected from 7norm.
The results are striking. Although this outlier analysis was not restricted to potential oncogenes, the four top-scoring genes (out of 568 mapping to the chromosome; SI Table 3) are all likely candidates: MET (a known glioma oncogene) and its ligand HGF (see main text), PDAP1, an enhancer of the glioma oncogene PDGFRA (18), and HOXA9, an oncogene in acute myelogenous leukemia (19, 20). All of these candidates merit follow-up studies.
1. Lai WR, Johnson MD, Kucherlapati R, Park PJ (2005) Bioinformatics, bti611.
2. Olshen AB, Venkatraman ES, Lucito R, Wigler M (2004) Biostatistics 5:557-572.
3. Hupe P, Stransky N, Thiery J-P, Radvanyi F, Barillot E (2004) Bioinformatics 20:3413-3422.
4. Fridlyand J, et al. (2004) J Multivariate Anal 90:132-153.
5. Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen T-H, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M (2004) Cancer Res 64:3060-3071.
6. Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R (2005) Biostatistics 6:45-58.
7. Benjamini Y, Hochberg Y (1995) J R Stat Soc Ser B 57:289-300.
8. Ishikawa S, Komura D, Tsuji S, Nishimura K, Yamamoto S, Panda B, Huang J, Fukayama M, Jones KW, Aburatani H (2005) Biochem Biophys Res Commun 333:1309-1314.
9. Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S (2005) Cancer Res 65:6071-6079.
10. Carvalho B, Bengtsson H, Speed TP, Irizarry RA (2006) Biostatistics.
11. Li C, Hung Wong W (2001) Genome Biol 2:research0032.
12. Li C, Wong WH (2001) Proc Natl Acad Sci USA 98:31-36.
13. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C (2004) Nat Genet 36:949-951.
14. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK (2006) Nat Genet 38:75-81.
15. Hinds DA, Kloek AP, Jen M, Chen X, Frazer KA (2006) Nat Genet 38:82.
16. McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S, Gabriel SB, Lee C, Daly MJ, et al. (2006) Nat Genet 38:86.
17. Beroukhim R, Lin M, Park Y, Hao K, Zhao X, Garraway LA, Fox EA, Hochberg EP, Mellinghoff IK, Hofer MD, et al. (2006) PLoS Comput Biol 2:e41.
18. Fischer WH, Schubert D (1996) J Neurochem 66:2213-2216.
19. Borrow J, Shearman AM, Stanton VP, Jr, Becher R, Collins T, Williams AJ, Dube I, Katz F, Kwong YL, Morris C, et al. (1996) Nat Genet 12:159-167.
20. Nakamura T, Largaespada DA, Lee MP, Johnson LA, Ohyashiki K, Toyama K, Chen SJ, Willman CL, Chen IM, Feinberg AP, et al. (1996) Nat Genet 12:154-158.