Abstract
Motivation:
Existing methods for estimating copy number variations in array comparative genomic hybridization (aCGH) data are limited to estimations of the gain/loss of chromosome regions for single sample analysis. We propose the linear-median method for estimating shared copy numbers in DNA sequences across multiple samples, demonstrate its operating characteristics through simulations and applications to real cancer data, and compare it to two existing methods.
Results:
Our proposed linear-median method has the power to estimate common changes that appear at isolated single probe positions or very short regions. Such changes are hard to detect by current methods. This new method shows a higher rate of true positives and a lower rate of false positives. The linear-median method is non-parametric and hence is more robust in estimating copy number. Additionally the linear-median method is easily computable for practical aCGH data sets compared to other copy number estimation methods.
Keywords: array CGH, copy number alterations, common copy number alterations regions
1. Introduction
During cell division, a cell replicates its genome by synthesizing a new copy of each chromosome, using the original DNA as a template. The expected copy number is 2, but the actual copy number may be less than or greater than 2 when alterations occur during the replication process. Research has suggested that such abnormalities in the number of DNA copies in a cell are associated with the development and progression of disease, including cancer.1 Laboratory research to estimate the altered copy numbers in a DNA sequence often uses aCGH. The technology used to produce aCGH data, however, may result in data that contain uncontrollable noise.2 The use of appropriate statistical methods to normalize the data and produce meaningful estimates of copy number variation in a DNA sequence is integral to this research. Developing improved statistical methods for this application is the focus of this paper.
Different statistical methods have been suggested for use with aCGH data to estimate copy numbers in DNA sequences. Methods to analyze copy numbers in terms of identifying the locations of gains or losses of chromosome regions have been developed. Assuming that there is a connection between copy number changes in a cancer cell and the development/progression of the cancer, there must exist some common change regions in DNA sequences collected from different patients with the same cancer diagnosis. Techniques for analyzing shared copy number regions have been developed.3,4 For detecting copy number regions in a single sample, Olshen et al5 and Venkatraman et al6 developed a widely used method, the faster circular binary segmentation (CBS) method. In this paper, we propose a new method, the linear-median method, for estimating shared copy number alterations in DNA sequences collected from the same type of cancer cells. The linear-median method is able to optimally use the information available across independent DNA sequences.
This paper is organized as follows. In Section 2.1, we discuss current existing statistical models used to assess aCGH data and describe a new model for analyzing multiple independent aCGH data sets. We introduce the linear-median method in Section 2.2. In Section 3.1, we present three simulation studies. We study how much extra information on copy number aberration can be obtained by using the linear-median method compared to the comparative genomic hybridization minimal common region (cghMCR) method and the CBS algorithm. We present an application of the linear-median method to real data in Section 3.2. Supporting figures and tables are available online as Supplementary Material.
2. Methods
2.1. Modeling DNA copy number alterations in aCGH data
aCGH employs the comparative hybridization of genomic DNA that is differentially labeled according to its source in a cancer cell versus a normal cell. The ratio of the hybridization intensities along the chromosomes provides a measure of the relative copy number of sequences in the genomes that hybridize to each location on the chromosomes. Estimating copy numbers and identifying the locations of gains and losses in a DNA sequence are two main challenges in the analysis of aCGH data. We label the normal genomic sequences as the “reference” sample and the genomic sequences from cancer cells as the “test” sample. Let Tp denote the “test” copy number at probe position p and Rp denote the “reference” copy number at probe position p.
We briefly describe two current methods for modeling aCGH data. Let us denote by Yp the aCGH data (the logarithm intensity ratio) observed at probe position p.
Model 1:
$$Y_p = \log_2\left(\frac{T_p}{R_p}\right) + \varepsilon_p \qquad (1)$$
where ɛp are i.i.d. with normal distribution N(0, σ²). This Gaussian model forms the basis of many models for aCGH data.4,6–8
Model 2:
$$Y_p = \log_2\left(\frac{T_p + \varepsilon_p}{R_p + \eta_p}\right) \qquad (2)$$
where ɛp and ηp are i.i.d. with a normal distribution N(0, σ²).9,10
In practice, Rp is assumed to be 2. Given the logarithm intensity ratio observations, {Yp}, we want to estimate the true copy number at position p or to estimate if the copy number at p is greater/less than 2.
Models 1 and 2 assume very different probability structures to describe the system. The variance of the log intensity ratios given by Model 1 is a constant, whereas the variance of the log intensity ratios given by Model 2 is a function of Tp.
We consider which of the two models is a more appropriate model for the analysis of aCGH data. Although Model 1 looks simpler, it is not an appropriate model for aCGH data. The main reason for this is that aCGH data provide the ratio of the copy number variations, not the ratio of the copy numbers. Furthermore, empirical studies show that the standard error of the logarithm of the intensity ratios increases as the copy number increases. Additionally, the distribution of the logarithm of intensity ratios is skewed.9 Thus, the distribution of ɛp should not be assumed to be normal if Model 1 is adopted.
Compared to Model 1, Model 2 is a more appropriate model for aCGH data, as it takes into account the ratio of the copy number variations. However, this model can be improved further. Because a normal distribution is unbounded, sufficiently negative values of ɛp or ηp make the argument of the logarithm non-positive, so that log2 [(Tp + ɛp)/(2 + ηp)] is undefined. Theoretically, this will cause problems for statistical inference methods based on such an assumption.
In Model 2, the errors ɛp and ηp play the role of measurement errors. Given the fact that the aCGH technique is maturing, it might be reasonable to suggest that both ɛp and ηp follow a uniform distribution U(−a, a), where a can assume any value between 0 and 2, depending on the nature of the underlying aCGH technique. If a takes a value close to 2, this may mean that the underlying aCGH technique is not very accurate, possibly leading to a very large variation in the observations of the intensity ratios. If a takes a value close to 0, we may assume that the underlying aCGH technique is very accurate and that there is less variation in the observations of the intensity ratios. For explicit technical considerations see wikipedia.2 For our purpose, we restrict a to be less than 2. We apply this restriction to real data analysis in Section 3.2. The output of the real data analysis shows the restriction is acceptable.
Therefore, we consider a third model:
$$X_p = \frac{T_p + \varepsilon_p}{R_p + \eta_p} \qquad (3)$$
where ɛp and ηp are independent and have uniform distribution U(−a, a) with constant a ∈ (0, 2), and Xp is the observed intensity ratio at probe position p.
To allow the model to be more flexible, we can assume that the uniform distributions for ɛp and ηp are not necessarily the same.
Model 3 is used to model one aCGH profile from one sample/patient. However, if there is a group of independent samples of aCGH data (eg, multiple patients) and their data share copy number change regions, we can extend Model 3 to such data.
Consider the following scenario. A group of n patients suffer from a common cancer. For each patient a sample of aCGH data is collected from a cancer cell. Let Xi,p be the observed intensity ratio for the ith sample at probe position p. We use tp to denote the theoretical true value of the shared copy number at probe position p for the “test” and let Ti,p be the true copy number for the ith patient at probe position p. Ti,p is not necessarily equal to tp because, for different patients, the copy number at position p might be affected by different uncontrollable random factors. We use Tp to denote the observed copy number for “test” at position p. Tp is a random variable and Ti,p is a sample from Tp. Let Ri,p be the true copy number for the ith “reference” at position p. In this paper, we always assign Ri,p = 2 because the true copy number for the reference (normal) genome is 2. (For the purpose of this study we ignore some special cases.)
For multiple independent aCGH data, the extended model can be considered as
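$$X_{i,p} = \frac{T_{i,p} + \varepsilon_{i,p}}{R_{i,p} + \eta_{i,p}}, \quad p = 1, \ldots, M, \; i = 1, \ldots, n \qquad (4)$$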
where M is the total number of probe positions; n is the number of independent samples in the group; ɛi,p and ηi,p are mutually independent random variables; Ti,p has distribution P(Ti,p = tp) = π and P(Ti,p = 2) = 1–π, if tp ≠ 2, ie, if at probe position p the shared true copy number is not 2, then the copy number given by the ith sample at the probe position will follow a Bernoulli distribution with mean π; ɛi,p and ηi,p will have uniform distributions U(−a,a), as defined in Model 3. (Different uniform distributions are allowed for ɛi,p and ηi,p; however, such applications are beyond the scope of this paper.)
Model 4 provides a flexible way to model multiple independent aCGH data in two respects:
The probability distributions of ɛi,p and ηi,p are allowed to be different. This means that the probability distribution of the measurement errors for the “test” and “reference” are allowed to be different.
The true shared copy number at position p is no longer a constant. Tp is a random variable. This means that the copy number (if it were observable) at position p could be different from patient to patient.
Hereafter, we consider multiple independent aCGH data and assume Model 4 as the basis for developing a method to estimate the shared copy number tp, p = 1, ⋯, M.
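As an illustration of Model 4, the following minimal sketch draws an n × M matrix of linear-format intensity ratios. The function name `simulate_model4` and its arguments are hypothetical, the same U(−a, a) distribution is assumed for both error terms, and Ri,p is fixed at 2 as in the text.

```python
import random


def simulate_model4(t, n, a, pi, seed=0):
    """Draw intensity ratios X[i][p] under Model 4 (illustrative sketch).

    t  : list of shared copy numbers t_p, one per probe position
    n  : number of independent samples
    a  : half-width of the uniform errors, assumed to satisfy 0 < a < 2
    pi : probability that a sample carries the shared copy number t_p
    """
    if not 0 < a < 2:
        raise ValueError("a must lie in (0, 2)")
    rng = random.Random(seed)
    x = []
    for _ in range(n):
        row = []
        for tp in t:
            # T_{i,p} equals t_p with probability pi, otherwise the neutral value 2.
            true_copy = tp if rng.random() < pi else 2
            eps = rng.uniform(-a, a)  # measurement error on the test
            eta = rng.uniform(-a, a)  # measurement error on the reference
            row.append((true_copy + eps) / (2 + eta))  # R_{i,p} = 2
        x.append(row)
    return x
```

Because a < 2, the denominator 2 + η stays strictly positive, so every simulated ratio is well defined.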
2.2. The linear-median method
Currently, all raw data used for copy number analysis are presented in the format of a log2 intensity of the ratios of the test to the reference. From the current literature, we know that a linear format refers to using the intensity of the ratios of the test to the reference, and a nonlinear format refers to using a log2 intensity of the ratios of the test to the reference, as the log2(ratio) is not linearly related to the copy number. The variance of a linear format tends to be larger than the variance of a nonlinear format when the relative copy number is far away from 1.11 This may explain why the nonlinear format is widely used.
It is expected that the log2 of the true relative copy number, ie, log2 (tp/Rp), can be well estimated using the observations of the log2 intensity of the ratios of the test to the reference, ie, log2 [(Ti,p + ɛi,p)/(Ri,p + ηi,p)], through the sample mean. Unfortunately, this is generally not true. A simple reason for this is that, in general,
$$E\left[\log_2\frac{T_p + \varepsilon_p}{R_p + \eta_p}\right] \neq \log_2\frac{t_p}{R_p}.$$
Further, the probability distribution of log2 [(Tp + ɛp)/(Rp + ηp)] is not symmetric. Therefore, the sample mean of log2 [(Ti,p + ɛi,p)/(Ri,p + ηi,p)] might be biased from E [log2((Tp + ɛp)/(Rp + ηp))] for smaller samples. Figure 1 shows a histogram of simulated data drawn from the population log2 [(1 + ɛ)/(2 + η)], with ɛ and η i.i.d. uniformly distributed U(−1.8, 1.8) (the function will not be defined if 1 + ɛ ≤ 0).
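The bias of the sample mean in the nonlinear format can be checked numerically. The sketch below is only an illustration under assumed settings (t = 3, a = 1.5, π = 1, chosen so that the logarithm is always defined); it does not reproduce Figure 1, and the function names are hypothetical.

```python
import math
import random
import statistics


def log2_ratio_sample(t, a, size, seed=0):
    """Simulate log2[(t + eps)/(2 + eta)] with eps, eta i.i.d. U(-a, a).

    Requires t - a > 0 and a < 2 so that the logarithm is defined.
    """
    rng = random.Random(seed)
    return [
        math.log2((t + rng.uniform(-a, a)) / (2 + rng.uniform(-a, a)))
        for _ in range(size)
    ]


def summarize(t=3, a=1.5, size=100_000):
    """Compare the sample mean and median with the target log2(t/2)."""
    y = log2_ratio_sample(t, a, size)
    return {
        "target": math.log2(t / 2),
        "mean": statistics.fmean(y),
        "median": statistics.median(y),
    }
```

Under these assumed settings the sample mean sits noticeably above log2(t/2), whereas the sample median stays close to it, which is consistent with the motivation for working with medians.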
For the estimating procedure we propose, we will use linear format data rather than nonlinear format data to estimate the shared copy number at probe position p, 1 ≤ p ≤ M.
As defined in Model 4, Xi,p is a random variable of the intensity of the ratios of the test to the reference given by the ith sample at probe position p, 1 ≤ p ≤ M, and satisfies the model
$$X_{i,p} = \frac{T_{i,p} + \varepsilon_{i,p}}{R_{i,p} + \eta_{i,p}}$$
where i denotes the ith sample/patient; ɛi,p and ηi,p are i.i.d. with uniform distribution U(−a, a); Ti,p and Ri,p are the test intensity and reference intensity, respectively, for probe p for the ith sample.
As stated in Section 2.1, we always assign Ri,p = 2, which is the information given by the “reference” genome. The true shared copy number tp at position p needs to be estimated. The estimate of tp is denoted by t̂p, 1 ≤ p ≤ M.
Let xi,p be the observed values of Xi,p, i = 1, 2, ⋯, n, p = 1, ⋯, M. Herein, we assume that parameter a is unknown but has a value within (0, 2) and that parameter π (defined in Model 4) is known or can be estimated from empirical knowledge.
The estimation of tp, p = 1, ⋯, M, consists of three steps:
Step 1 Calculate the median of {xi,p, i = 1, 2, ⋯, n} for each p, denoted by Mp.
Step 2 Calculate 2(Mp − 1 + π)/π for each p.
Step 3 Determine the estimate t̂p of tp, p = 1, ⋯, M,
where [c] denotes the integer part of the real number c.
We call this 3-step method the “linear-median method”. “Linear” indicates that the data (the intensity of the ratios of the test to the reference) are in a linear format. “Median” indicates that the median of the data is employed by this method.
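The three steps can be sketched as follows. The function name `linear_median` is hypothetical, and the Step 3 rule used here (the integer part of the Step 2 value plus 0.5, ie, rounding to the nearest integer) is an assumption made for illustration only.

```python
import math
import statistics


def linear_median(x, pi):
    """Estimate shared copy numbers from linear-format ratios (sketch).

    x  : list of n samples, each a list of M intensity ratios
    pi : assumed known proportion of samples carrying the shared change
    """
    if not 0 < pi <= 1:
        raise ValueError("pi must lie in (0, 1]")
    estimates = []
    for column in zip(*x):  # iterate over probe positions
        m_p = statistics.median(column)        # Step 1: per-probe median
        c = 2 * (m_p - 1 + pi) / pi            # Step 2: rescale the median
        estimates.append(math.floor(c + 0.5))  # Step 3: assumed rounding rule
    return estimates
```

For example, with π = 1 the Step 2 value reduces to 2Mp, so a per-probe median of 1.5 gives an estimate of 3.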
Next, we explain theoretically why copy numbers can be accurately estimated by this 3-step method.
Let Xp be the intensity of the ratios of the test to the reference at probe position p,
$$X_p = \frac{T_p + \varepsilon_p}{2 + \eta_p}$$
where ɛp and ηp are i.i.d. with uniform distribution U[−a, a]; and Tp is a random variable independent of ɛp and ηp, and has distribution P(Tp = tp) = π and P(Tp = 2) = 1 − π, if the shared copy number tp ≠ 2. As explained in Section 2.1, we assume 0 < a < 2.
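Following the definition of Xp and assuming the independence of Tp + ɛp and ηp, we have
$$E(X_p) = E(T_p + \varepsilon_p)\,E\left(\frac{1}{2 + \eta_p}\right) = \left[\pi t_p + 2(1 - \pi)\right]\frac{1}{2a}\ln\frac{2 + a}{2 - a}.$$
Thus
$$t_p = \frac{1}{\pi}\left[\frac{2a\,E(X_p)}{\ln\frac{2 + a}{2 - a}} - 2(1 - \pi)\right] \qquad (5)$$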
Equation (5) gives the exact relationship between tp and E(Xp). For each probe position p, if the mean of the intensity of the ratios of the test to the reference is known, and the system parameters a and π are known, the shared copy number at the probe position can be correctly identified.
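The relationship between tp and E(Xp) can be checked numerically. Under the stated assumptions, E(Tp) = πtp + 2(1 − π), E(ɛp) = 0, and integrating 1/(2 + u) over (−a, a) gives E[1/(2 + ηp)] = ln[(2 + a)/(2 − a)]/(2a). The sketch below, with hypothetical function names, inverts that product for tp and compares it with a Monte Carlo estimate of the mean.

```python
import math
import random


def inverse_reference_mean(a):
    """Closed form of E[1/(2 + eta)] for eta ~ U(-a, a), 0 < a < 2."""
    return math.log((2 + a) / (2 - a)) / (2 * a)


def shared_copy_number_from_mean(mean_x, a, pi):
    """Solve E(X_p) = (pi*t_p + 2*(1 - pi)) * E[1/(2 + eta)] for t_p."""
    return (mean_x / inverse_reference_mean(a) - 2 * (1 - pi)) / pi


def monte_carlo_mean(t, a, pi, size=200_000, seed=1):
    """Monte Carlo estimate of E(X_p) under the two-point model for T_p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(size):
        true_copy = t if rng.random() < pi else 2
        total += (true_copy + rng.uniform(-a, a)) / (2 + rng.uniform(-a, a))
    return total / size
```

A very large simulated sample is needed here for the mean to be stable, which illustrates why the sample mean is a poor basis for estimation with realistic group sizes.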
However, E(Xp) is unknown in practice and the probability distribution of Xp is not usually symmetric. It is inappropriate to estimate E(Xp) by using the sample mean X̄p when the sample size is not appropriately large. Therefore, it is difficult to evaluate tp directly from (5) in practice.
To overcome this difficulty, we suggest the following way to evaluate tp:
$$t_p \approx \frac{2\left(m_{X_p} - 1 + \pi\right)}{\pi}$$
where mXp is the median of Xp. It is technically possible to directly evaluate the ratio
$$\frac{a\,E(X_p)}{m_{X_p}\,\ln\frac{2 + a}{2 - a}} \qquad (6)$$
and prove that the ratio is close to 1, for any a ∈ (0, 2) and any π ∈ (0, 1).
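We use the Monte Carlo method to indirectly show that the value of (6) is close to 1 for a = 0.1, 0.2, ⋯, 1.9 and π = 0.1, 0.2, ⋯, 1 (see Appendix A and Supplementary Tables 1 and 2 in the online materials for details). Therefore, replacing mXp by the sample median Mp yields the quantity 2(Mp − 1 + π)/π computed in Step 2, which approximates tp.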
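A Monte Carlo check in the same spirit can be sketched as follows. It does not reproduce the grid of Appendix A. It compares the simulated median of Xp with (πtp + 2(1 − π))/2, which is the value the median must approximate for 2(mXp − 1 + π)/π to recover tp. The function name and default settings are hypothetical.

```python
import random
import statistics


def median_to_target_ratio(t, a, pi, size=100_000, seed=2):
    """Ratio of the simulated median of X_p to (pi*t + 2*(1 - pi))/2."""
    rng = random.Random(seed)
    xs = []
    for _ in range(size):
        true_copy = t if rng.random() < pi else 2
        xs.append((true_copy + rng.uniform(-a, a)) / (2 + rng.uniform(-a, a)))
    target = (pi * t + 2 * (1 - pi)) / 2
    return statistics.median(xs) / target
```

For π = 1 the ratio should be essentially 1 for any a in (0, 2): P(Xp ≤ tp/2) = P(2ɛp ≤ tpηp) = 1/2 because ɛp and ηp are independent and symmetric about 0. For π < 1 the function can be used to explore how far the ratio departs from 1 for given a and tp.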
3. Implementation and results
3.1. Simulation studies
The linear-median method is designed for estimating shared copy number aberrations and mainly focuses on the information across the sample for each probe position. Therefore, this method ignores the dependency within each individual sample. Our focus is two-fold: i) to determine the extent of information of shared copy number aberrations that can be detected, regardless of the impact of dependency, and ii) to assess the differences in detection outcomes obtained from the linear-median method versus other methods.
In a recent review of methods for detecting “recurrent” copy number alterations, Rueda and Diaz-Uriarte evaluated the CGHregions method, Master HMMs, cghMCR, GISTIC, MSA, RAE, and others.12 In this subsection, we compare the linear-median method to the cghMCR method and the CBS method.
We present three simulation studies to highlight the performance of our proposed linear-median method.
Example 1: A sequence of integers
2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 4 |
5 | 5 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
1 | 1 | 1 | 1 | 1 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
serves as a sequence of the true shared copy number tp, p = 1, 2, ⋯, 100, obtained from the experimental sample, ie, the “test”. To simplify, we assume π = 1. Thus, for example, t1 = 2 means that the true shared copy number shown by the “test” at probe position 1 is 2; t11 = 3 means that the true shared copy number shown by the “test” at probe position 11 is 3, ie, a gain of one copy.
We simulated a group of independent realizations {Xi,p} from model (tp + ɛi,p)/(2+ηi,p), p = 1, 2, ⋯, 100 and i = 1, 2, ⋯, n, where ɛi,p and ηi,p are i.i.d. with uniform distribution U[−a, a].
Subsequently, we generated 1000 replicates. For the kth replicate, k = 1, 2, ⋯, 1000, let d(k) be the proportion of the 100 probe positions at which t̂p ≠ tp; d(k) is used to measure the error rate in the estimation of tp. The mean and standard error of {d(k)} are presented in Table 1.
Table 1. Mean (standard error) of the error rate d(k), by error half-width a and number of samples n.
a | n = 25 | n = 50 | n = 75 |
---|---|---|---|
0.5 | 0.00267 (0.00505932) | 0.00021 (0.00143456) | 0.00003 (0.00054717) |
0.8 | 0.03080 (0.01634096) | 0.00578 (0.00725564) | 0.00566 (0.01330000) |
1 | 0.06900 (0.02388243) | 0.01822 (0.01335702) | 0.00771 (0.00880531) |
1.5 | 0.19759 (0.03871163) | 0.08208 (0.02652880) | 0.04332 (0.02063500) |
1.9 | 0.30161 (0.04409140) | 0.15426 (0.03687835) | 0.09367 (0.02802528) |
Table 1 shows that the error rate increases with a. This is expected, because a larger value of a corresponds to a larger measurement error in the data. However, the error rate decreases as the number of independent samples in the group increases. In general, the mean error rate of the linear-median method is reasonably low: it was below 10% for all three sample sizes when a ≤ 1, and for every value of a when n = 75.
Although the underlying model involves the parameter a, Example 1 shows, in general, that the impact of the value of a on the estimation of the copy number is not significant in terms of the mean of d(k), except for large values of a (>1). (Further demonstrations are presented in the Supplementary Material.) In summary, the value of a ∈ (0, 2) has minimal effect on the estimation of the shared copy number when the sample size is reasonably large. As a result, the linear-median method can be employed without knowing the value of a, as long as a ∈ (0, 2).
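The error-rate calculation of Example 1 can be sketched as below. This is an illustration rather than the code behind Table 1: the function names are hypothetical, π = 1 is assumed as in the example, the rounding rule in Step 3 is an assumption, and the spread is reported as a sample standard deviation.

```python
import math
import random
import statistics


def one_replicate_error(t, n, a, rng):
    """Return d(k): the fraction of probes with t_hat != t_p (pi = 1)."""
    wrong = 0
    for tp in t:
        xs = [(tp + rng.uniform(-a, a)) / (2 + rng.uniform(-a, a))
              for _ in range(n)]
        # With pi = 1, Step 2 reduces to 2 * median; rounding rule is assumed.
        t_hat = math.floor(2 * statistics.median(xs) + 0.5)
        wrong += (t_hat != tp)
    return wrong / len(t)


def mean_error(t, n, a, replicates=200, seed=0):
    """Mean and sample standard deviation of d(k) over the replicates."""
    rng = random.Random(seed)
    d = [one_replicate_error(t, n, a, rng) for _ in range(replicates)]
    return statistics.fmean(d), statistics.stdev(d)
```

Calling `mean_error` for several values of a and n gives a small-scale analogue of Table 1; exact values will differ with the seed, the number of replicates, and the assumed rounding rule.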
Example 2: In Table 1 of their review of 15 estimation methods, Rueda and Diaz-Uriarte indicate that only the cghMCR method both uses an input of the log2 ratio and produces estimations of the differences in the states of two successive probes.12 The cghMCR method is designed to identify the minimal common copy number alteration regions among a group of independent samples; thus it is analogous to the linear-median method and is an appropriate method to compare to the linear-median method. Using segmented data (ie, smoothed data), the cghMCR algorithm first identifies altered segments within each subject (those above the 97th or below the 3rd percentile of the data) and then joins adjacent segments separated by a user-defined parameter. The R package for the cghMCR method is available at the following URL: http://www.bioconductor.org/packages/2.6/bioc/html/cghMCR.html. See the work of Aguirre et al for explicit details and a complete review of the cghMCR method.3
We use simulated data to compare the performance of the linear-median method to that of the cghMCR method. The data were simulated assuming no dependence between the intensity ratios across probe positions, which is a very simple situation.
Consider a sequence of true shared copy number {tp} plotted in Figure 2.
The sequence tp consists of four abnormal shared copy number regions, corresponding to copy numbers 1, 3, 4 and 5. Some of the abnormal shared copy regions are very short, involving only 1 or 4 probe positions. Using this example, we compare the linear-median method to the cghMCR method in terms of each method’s capability of correctly assessing the information of gains/losses in shared copy numbers.
We simulated data from the following model
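$$X_{i,p} = \frac{t_p\,B(1, \pi) + 2\left[1 - B(1, \pi)\right] + \varepsilon_{i,p}}{2 + \eta_{i,p}} \qquad (7)$$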
i = 1, 2, ⋯, n, where ɛi,p and ηi,p are i.i.d. with uniform distribution U(−a, a). Let B(1, π) be a random variable with a Bernoulli distribution such that E[B(1, π)] = π. We considered 3 × 5 × 3 different combinations for (a, π, n), where a = 0.5, 1, 1.5, π = 0.2, 0.4, 0.6, 0.8, 1 and n = 20, 50, 100.
We applied the linear-median method and the cghMCR method to each group of independent samples with size n for different pairs of parameters (a, π), respectively. Then, for each triplet (a, π, n), we calculated the true positive (TP) rates and the false positive (FP) rates produced by each method. TP rate = P(the method shows “copy number changed” | copy number is changed). FP rate = P(the method shows “copy number changed” | copy number is not changed). The linear-median method is able to provide an estimate of the shared copy number at each probe position. Therefore, when we say that a correct detection of the shared copy number was produced by the linear-median method at position p, it means that t̂p = tp. In contrast, the cghMCR method provides information on only the shared copy number gain/loss at each probe position. It does not provide information on how many copy numbers were gained/lost. Therefore, when we say that a correct detection was produced by the cghMCR method at position p, it means only that a gain/loss was correctly identified at position p.
Finally, we carried out 250 replicates for the case where n = 20; 100 replicates for the case where n = 50, and 50 replicates for the case where n = 100. The resulting TP and FP rates, means, and standard errors obtained from both methods are shown in Supplementary Tables 3–5.
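The TP and FP rates defined above can be computed as in the following sketch. It implements only the “changed versus not changed” conditional probabilities; the stricter criterion t̂p = tp used for the linear-median method is not enforced here, and the function name is hypothetical.

```python
def tp_fp_rates(t_true, called_changed):
    """Return (TP rate, FP rate) for one set of calls.

    t_true         : true shared copy numbers, one per probe position
    called_changed : booleans, True where the method reports a change
    """
    changed = [t != 2 for t in t_true]
    n_changed = sum(changed)
    n_neutral = len(changed) - n_changed
    tp = sum(1 for c, k in zip(changed, called_changed) if c and k)
    fp = sum(1 for c, k in zip(changed, called_changed) if (not c) and k)
    # A rate is undefined when its conditioning set is empty.
    tp_rate = tp / n_changed if n_changed else float("nan")
    fp_rate = fp / n_neutral if n_neutral else float("nan")
    return tp_rate, fp_rate
```

For a method that returns copy number estimates, the calls can be derived as `[t_hat != 2 for t_hat in estimates]` before passing them to the function.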
In terms of the TP rates, the linear-median method worked reasonably well in each case and performed vastly better than the cghMCR method, which showed poor performance, especially when a was larger and π was smaller. In this particular example of a true shared copy number sequence, the cghMCR method tended to give a lower FP value, ie, it did not call as many gains/losses, and hence was very conservative. Compared to the cghMCR method, the linear-median method gave a lower FP value when a was not close to 2 or π was greater than 0.5. In summary, two advantages of using the linear-median method include:
The ability to estimate the actual shared copy number at each position p. The estimation accuracy of the linear-median method is very high, as reflected by the values of the TP and FP rates.
Better power in identifying shorter altered regions. For example, considering the data simulated from (7) with a = 1.5, π = 1 and n = 20, we can compare the means of the estimated copy numbers given by both methods. Since a = 1.5, the variance for U(−a, a) is relatively large and the simulated data involve a lot of random noise. By choosing π = 1, there is no variation on the true copy numbers shared across the independent samples. Technically, one expects that the linear-median method and the cghMCR method will perform at the same level. However, it turns out that the linear-median method dominates the cghMCR method. At almost every probe position, the sample mean and median of the estimated shared copy number given by the linear-median method was the same as the true shared copy number. In contrast, the cghMCR method did not accurately identify the gain/loss regions (see Supplementary Figures 1–3).
This simulation example (Example 2) illustrates that the cghMCR method performs very poorly in high-noise scenarios, for example, a = 1.5, and the cghMCR method is not robust for large values of a. We believe this is due to the fact that the cghMCR method performs segmentation and calling functions independently of one another, whereas the linear-median method borrows strength from all the samples.
Example 3: In this example we consider data Xi,p, simulated from the following model:
$$X_{i,p} = \frac{t_p + \varepsilon_{i,p}}{2 + \eta_{i,p}}, \quad p = 1, \ldots, 300,$$
where ɛi,p and ηi,p are i.i.d uniformly distributed in [–1, 1], i = 1, 2, ⋯ 60. In this example we continue to assume π = 1. The abnormal copy number regions are [101, 150] for tp = 3; [151, 152] for tp = 4; [153, 200] for tp = 3; [201, 202] and [205, 300] for tp = 1. Segments of [101, 150], [153, 200] and [205, 300] are relatively longer. Segments of [151, 152] and [201, 202] are relatively shorter.
In this example, we compare the linear-median method to the circular binary segmentation (CBS) method, which was developed by Olshen et al.6 An R package description for the CBS method is available at the following URL: http://bioconductor.org/packages/2.6/bioc/manuals/DNAcopy/man/DNAcopy.pdf. The CBS method is employed to find segments along the chromosome that share constant DNA copy numbers. Technically, it is inappropriate to directly compare the analytical results obtained by these two methods because the CBS method is designed for application to a single sample of data, whereas the linear-median method is applicable to a group of independent samples.
To apply the CBS algorithm to observations {xi,p}, i = 1, ⋯, 60, p = 1, ⋯, 300, we make the following adjustment. We calculate log2(xi,p) for all i and p, since the CBS method is designed for data in a nonlinear format. Then, for each fixed p, we calculate the median of {log2(xi,p)}, forming a new sequence. Finally, we apply the CBS method to this sequence. We justify this comparison with the following argument: If there are common copy number alteration regions among the group of independent samples, the new sequence must contain the information on shared common regions. We consider the new sequence as if it were a single sample of data from a “patient”. Thus, if the information of a shared common region is strong enough, the CBS method should be able to detect the region based on the data of the new sequence. We used the default parameters in our application of the R package to the simulation data in this example.
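The adjustment described above, which turns the group of samples into a single sequence, can be sketched as follows. Only the per-probe median of the log2 ratios is computed; the segmentation step itself is left to the CBS software and is not reproduced. The function name is hypothetical.

```python
import math
import statistics


def median_log2_profile(x):
    """Collapse n samples into one sequence of per-probe median log2 ratios.

    x : list of n samples, each a list of M positive linear-format ratios
    """
    return [
        statistics.median(math.log2(v) for v in column)
        for column in zip(*x)  # one column per probe position
    ]
```

The resulting list has one value per probe position and plays the role of the single “patient” profile that is then segmented.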
Figure 3 shows the plot of the medians of {log2(xi,p)} and the estimate of log2(tp/2) (in red), obtained by the CBS method (top panel), and the plot of the estimation of tp obtained by the linear-median method (bottom panel). We see that the linear-median method is able to detect all the changes in the copy number.
Comparing the plots in Figure 3, both approaches, the linear-median method and the CBS method, were able to detect all the longer regions of alterations. However, the shorter altered regions [151, 152] and [201, 202], and the narrow gap [203, 204] between altered segments, were missed by the CBS method. This indicates that the linear-median method has more power than the CBS method to detect shorter segments of alterations or narrow gaps between segments.
3.2. Application to real data
We applied the linear-median method to a subset of aCGH data from 39 well-studied lung cancer cell lines. The data, originally published by Coe et al13 and Garnis et al14 are available for downloading from http://sigma.bccrc.ca/. For this study, we used data from only the subgroup with the largest sample size, that of non-small cell adenocarcinoma (NA), which included 18 samples.
As both the linear-median method and the cghMCR method are designed for application to multiple aCGH data, the sample size is a critical issue. Data with more independent samples are able to provide more information on the commonalities across all samples.
Accurately identifying the locations of copy number aberrations has many important medical applications. As far as we know, the cghMCR method is one of the methods used to estimate the shared copy number for multiple aCGH data. Many other methods give an estimation of only the probability of gain/loss at each probe position.4,13
Information on the exact shared copy number(s) at each probe position is not available for the data we have analyzed (the NA data). Therefore, based on only the analytic outputs of the linear-median method and the cghMCR method, it is difficult for us to claim which method is better in terms of the accuracy of estimating the true copy numbers. As a result, we compared the similarities between the analytic outputs of the two methods and determined which method provides more information on the changes in the copy numbers in the NA data. As a reference for this comparison, we used the probability of gain/loss at each probe position that was reported by Shah et al.4
The total number of probe positions in the NA data (chromosome 9) is 1249. Recalling Model 4 in Section 2.1, in order to estimate the shared copy numbers in a “test” DNA sequence, we need to know the parameter π. This type of information is also required for the cghMCR method. The value of π might be estimated based on the researcher’s empirical knowledge. For the NA data, empirical knowledge on the value of π is not available. Therefore, we applied the cghMCR method and the linear-median method to the data for different values of π, 0.2, 0.4, 0.6, 0.8 and 1. Then we compared the results from both methods and also compared those results to findings reported by Shah et al.4 We expected to find little difference in the results obtained from the three methods. Shah et al found a loss of the shared copy number in a significant portion of the NA data (see Figure 7 in their paper).4 However, for π = 0.4, 0.6, 0.8 or 1, both the cghMCR method and the linear-median method provided high proportions of neutral states, ie, where the shared copy number equals 2. Therefore, it is reasonable to use π = 0.2 when analyzing the NA data. We limit our report of the analytic results to the case where π = 0.2.
Combining all the results given by the linear-median method and the cghMCR method for π = 0.2, 0.4, 0.6, 0.8 and 1, we were able to identify a common trend in the outputs of the two methods for all probe positions as the value π moves from 1 to 0.2 (data not shown). For the NA data, both the linear-median method and the cghMCR method give neutral states to all probe positions when π is assigned as 1, with the exception of a few probe positions identified as gain/loss by the linear-median method. In our empirical study of the NA data, if a probe position is more likely to lose copy number(s), then the shared copy number estimation given by both methods will decrease as π moves from 1 to 0.2; if a probe position is more likely to gain copy number(s), then the shared copy number estimation given by both methods will increase as π moves from 1 to 0.2. One important phenomenon we observed from the outputs of the two methods is that once a probe position has been identified as having a shared copy number change when π = π0, the observation remains the same for any π > π0. Comparing the results of the two methods, we found that the estimation of the shared copy number at each probe position given by the cghMCR method is reluctant to change as the value of π decreases. In contrast, the linear-median method can show changes in the estimated shared copy number as π decreases. This may reflect the later detection of an aberration by the cghMCR method compared to the linear-median method when the true shared copy number at a probe position is gained/lost, and as the value of π decreases. Based on our analysis of the NA data, the linear-median method was able to report the estimated shared copy number at each probe position, whereas the cghMCR method reported only the state of the shared copy number, ie, whether there was a gain, loss or no change (neutral state), in the shared copy number.
To simplify the comparison between the results given by the two methods, we report only the gain, loss, or neutral states of the shared copy number for the linear-median method. A plot of the states for both methods is given in Figure 4. In the plot, we use “1”, “0” and “−1” to indicate a shared copy number gain, neutrality, or loss, respectively. We summarize the results as follows.
From probe positions 1 to 500 and 1235 to 1249, the cghMCR method and the linear-median method provide similar results, except for some isolated probe positions. This is what we expected to find, because our simulation studies demonstrated that the linear-median method can identify those isolated regions.
From probe positions 501 to 1234, the results obtained from the linear-median method and the cghMCR method are quite different. The cghMCR method classifies all the probe positions as neutral, whereas the linear-median method identifies gains/losses at these probe positions. One possible explanation for the large difference between the two sets of results in this probe region is that the π used in the estimation for this region may be too high, and a lower value of π may be needed to estimate copy numbers accurately in this interval. These results suggest that the parameter π might vary over sequences of NA data. If this is true, then detecting the change in π will be an interesting challenge for future studies.
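The three-state coding used to compare the two methods (1 for gain, 0 for neutral, −1 for loss) can be sketched as follows. This is a minimal illustration that assumes the neutral copy number is 2 and uses a hypothetical vector of estimates; copy_number_state is a name introduced for this example only.

```r
# Map estimated shared copy numbers to states:
#   1 = gain, 0 = neutral, -1 = loss (neutral copy number assumed to be 2).
copy_number_state <- function(cn, neutral = 2) {
  as.integer(sign(cn - neutral))
}

# Hypothetical estimates at six probe positions
est <- c(2, 1, 3, 0, 2, 5)
states <- copy_number_state(est)
print(states)
```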
Information on the true shared copy numbers for the NA data is not available; hence, we cannot be certain which method would best estimate the shared copy number variations in these data. However, through our comparison of the two methods, and taking into account the results given by Shah et al,4 we can claim that the linear-median method has some capability to estimate shared copy numbers in DNA sequences reasonably well. As shown in our simulation studies, the linear-median method can readily identify isolated probe positions with shared copy number changes or short shared altered segments. These changes are often missed by the cghMCR approach.
The 1249 probe sets we studied target the shared copy number status of 1262 genes on chromosome 9. To classify these genes into one of three general categories, we searched the OMIM database (http://www.ncbi.nlm.nih.gov/omim). The three categories we used were “not related to/unknown cancer phenotype (NR/U)”, “cancer-related phenotype, except for lung cancer (CR)”, and “lung cancer-related phenotype (LCR)”. The results are presented in Tables 2 and 3. Identifying altered regions where important cancer-related genes are located aids the biological interpretation of our findings and works as an empirical form of validation. Detailed locations of the genes categorized as NR/U, CR and LCR are presented in Supplementary Appendix B. From Tables 2 and 3 we can see that the linear-median method reports more CR and LCR genes with copy number losses/gains than the cghMCR method.
Table 2.
State | NR/U (LM) | NR/U (cghMCR) | CR (LM) | CR (cghMCR) | LCR (LM) | LCR (cghMCR) | Total (LM) | Total (cghMCR) |
---|---|---|---|---|---|---|---|---|
Losses | 670 | 346 | 89 | 33 | 9 | 4 | 768 | 383 |
Neutral | 342 | 758 | 35 | 103 | 3 | 9 | 380 | 870 |
Gains | 100 | 8 | 13 | 1 | 1 | 0 | 114 | 9 |
Total | 1112 | 1112 | 137 | 137 | 13 | 13 | | |
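As a consistency check on Table 2, the counts for each method should sum to the same category totals (1112, 137 and 13) and to the 1262 genes studied. A short sketch of that check follows, with the counts typed in from the table; the object names are chosen for illustration.

```r
# Counts from Table 2: rows are states, columns are gene categories.
states <- c("Losses", "Neutral", "Gains")
categories <- c("NR/U", "CR", "LCR")

lm_counts <- matrix(c(670, 89, 9,
                      342, 35, 3,
                      100, 13, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(states, categories))

mcr_counts <- matrix(c(346, 33, 4,
                       758, 103, 9,
                       8, 1, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(states, categories))

# Category totals per method (should agree between methods)
print(colSums(lm_counts))
print(colSums(mcr_counts))

# State totals per method (the "Total" columns of Table 2)
print(rowSums(lm_counts))
print(rowSums(mcr_counts))
```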
Table 3.
State | LM | cghMCR |
---|---|---|
Loss | PSIP1, CDKN2A, TUSC1, IGFBPL1, TLE1, FRMD3, DAPK1, MIRLET7A1, PTPN3 | PSIP1, CDKN2A, TUSC1, IGFBPL1 |
Neutral | PHF19, DAB2IP, RPL12 | PHF19, DAB2IP, RPL12, TLE1, FRMD3, DAPK1, MIRLET7A1, PTPN3, GAS1 |
Gain | GAS1 | |
We were able to find additional information of interest from the output of the linear-median method. Focusing on the probe positions at which the shared copy number estimated by the linear-median method was <1 or >3 when π = 0.2, we identified 145 such probe positions out of 1249 (see Figure 5). Among those 145 probe positions, 22 showed an estimated copy number ≥4 or ≤−1. These results provide a stronger warning of copy number aberrations, one that was not obtained from the cghMCR method.
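The two screening rules described above (estimates <1 or >3, and the subset with estimates ≥4 or ≤−1) can be applied to a vector of estimated shared copy numbers along these lines. The function name flag_extreme and the input vector are hypothetical and serve only to illustrate the thresholds; they are not the NA data.

```r
# Flag probe positions whose estimated shared copy number is far from 2.
flag_extreme <- function(cn) {
  list(
    outside_1_3 = which(cn < 1 | cn > 3),    # estimate < 1 or > 3
    ge4_or_le_minus1 = which(cn >= 4 | cn <= -1)  # estimate >= 4 or <= -1
  )
}

# Hypothetical estimates at seven probe positions
cn <- c(2, 0, 4, 3, -1, 1, 5)
flags <- flag_extreme(cn)
print(flags)
```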
4. Conclusion
We developed a new model for aCGH data analysis, the linear-median method, which estimates shared copy numbers in DNA sequences. Using simulated data, we found the linear-median method to be more powerful than the cghMCR method in terms of achieving a higher rate of true positives and a lower rate of false positives. In addition to estimating the common gain/loss of chromosome regions, the linear-median method estimates the number of DNA copies. In other words, analytic results produced by the linear-median method allow us to extract additional information on the tested DNA sequences. In particular, the linear-median method has the power to estimate common changes that appear at isolated single probe positions or very short regions. The only drawback of the linear-median method is that it ignores the dependency information in samples. However, based on our application of the proposed method to real data, we find that most information on shared copy number aberrations can be captured by the linear-median method using only the information across independent samples.
Supplementary Material
Appendix A
Table S1.
π = 1 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
0.9988320 (1.204417e-05) | 0.9999922 (6.722244e-06) | 1.0006107 (5.334313e-06) | 0.9996400 (1.261637e-05) | 1.0010851 (1.167334e-05) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
0.9996429 (1.231912e-06) | 1.0007002 (5.472384e-06) | 0.9996422 (5.472957e-06) | 0.9995458 (4.414939e-06) | |
π= 0.9 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
1.0234726 (7.944477e-04) | 0.9999251 (3.122031e-06) | 0.9945141 (7.239544e-05) | 0.9874093 (2.153089e-04) | 0.9815765 (3.306295e-04) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
0.9784618 (4.723169e-04) | 0.9754699 (5.685863e-04) | 0.9728850 (5.824456e-04) | 0.9690245 (5.801032e-04) | |
π = 0.8 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
1.0387630 (2.886933e-03) | 0.9996188 (9.836780e-06) | 0.9892445 (2.590746e-04) | 0.9765723 (8.382101e-04) | 0.9673282 (1.415877e-03) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
0.9586696 (1.799211e-03) | 0.9522338 (2.116493e-03) | 0.9460126 (2.245196e-03) | 0.9428445 (2.510819e-03) | |
π = 0.7 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
1.0490667 (5.548408e-03) | 1.0001018 (1.432224e-05) | 0.9855943 (5.197429e-04) | 0.9663545 (1.709809e-03) | 0.9524253 (2.958586e-03) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
0.9407424 (3.912353e-03) | 0.930110 (4.538055e-03) | 0.9227174 (5.050342e-03) | 0.9165458 (5.479345e-03) | |
π = 0.6 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
1.0488169 (7.753325e-03) | 1.0010726 (6.911221e-06) | 0.9854699 (7.413494e-04) | 0.9623178 (2.893487e-03) | 0.9414670 (4.946303e-03) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
0.9257995 (6.682264e-03) | 0.9128190 (8.140030e-03) | 0.9026656 (9.115055e-03) | 0.8949812 (1.010583e-02) |
Table S2.
π = 0.5 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
0.9976367 (3.009497e-03) | 1.0008558 (3.751868e-06) | 1.0051801 (1.132037e-03) | 1.0075940 (8.828197e-03) | 1.0563732 (3.143084e-02) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
1.0996647 (5.681301e-02) | 0.9949510 (3.069008e-02) | 1.0348440 (5.681301e-02) | 1.2778189 (3.069008e-02) | |
π = 0.4 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
0.9689243 (2.197164e-03) | 1.0004657 (8.269312e-06) | 1.0224327 (1.544194e-03) | 1.0774329 (1.112485e-02) | 1.1533009 (3.060863e-02) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
1.2460811 (5.815739e-02) | 1.3484609 (9.379170e-02) | 1.4610660 (1.286891e-01) | 1.5757287 (1.732840e-01) | |
π = 0.3 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
0.9690245 (1.446965e-03) | 1.0008585 (3.876559e-06) | 1.0242324 (1.226318e-03) | 1.0860159 (6.909656e-03) | 1.1647982 (1.727091e-02) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
1.2629289 (2.912726e-02) | 1.3679820 (4.057614e-02) | 1.4846238 (5.020234e-02) | 1.5987995 (5.951541e-02) | |
π = 0.2 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
0.9737273 (6.463392e-04) | 1.0001785 (3.456922e-06) | 1.0231539 (5.653691e-04) | 1.0743057 (3.026114e-03) | 1.1446959 (6.303903e-03) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
1.2239605 (9.262918e-03) | 1.3104238 (1.097795e-02) | 1.3991247 (1.232603e-02) | 1.4862704 (1.325612e-02) | |
π = 0.1 | ||||
tp = 1 | tp = 2 | tp = 3 | tp = 4 | tp = 5 |
0.9836251 (1.448035e-04) | 0.9996371 (1.181245e-05) | 1.0143460 (1.537200e-04) | 1.0458579 (6.429722e-04) | 1.0869769 (1.166530e-03) |
tp = 6 | tp = 7 | tp = 8 | tp = 9 | |
1.1335024 (1.374409e-03) | 1.1808455 (1.518496e-03) | 1.2294815 (1.587512e-03) | 1.2743215 (1.669519e-03) |
Appendix B
Probe positions from 1 to 295: A total of 200 genes are found in this region; 28 of them (14%) are related to a cancer phenotype, while 3 (1.5%) are related to a lung cancer phenotype. All LCR genes are located in chromosomal regions identified as losses by both methods (LM and cghMCR). The LCR genes located in this region are PSIP1, CDKN2A, and TUSC1. PSIP1 and CDKN2A, a well-known lung cancer suppressor,1 are both located in a region frequently found deleted in lung cancer patients.2 In addition, TUSC1 is found mutated and silent in nonsmall cell lung carcinoma cell lines.3
Probe positions from 296 to 331: A total of 12 NR/U genes are found in this region.
Probe positions from 332 to 341: Only 3 genes are located in this region with one of them being classified as CR (ACO1). Both methods identify the region where this gene is located as loss.
Probe positions from 342 to 375: A total of 113 genes are located in this region, with 14 of them being classified as CR.
Probe positions from 376 to 500: A total of 171 genes are located in this region. Four of them are CR and only one (IGFBPL1, classified as loss by both methods) is classified as LCR. IGFBPL1 has already been shown to be downregulated in lung tumor samples.4
Probe positions from 501 to 1234: A total of 744 genes are located in this region, 90 of them being classified as CR, and 9 as LCR. The cghMCR method does not identify any region containing LCR genes as altered. On the other hand, the LM method identifies five of the LCR genes in chromosomal regions of loss (TLE1, FRMD3, DAPK1, MIRLET7A1, PTPN3), which are consequently expected to have lower expression in lung tumor samples. In fact, TLE1 is frequently found altered in squamous cell carcinomas and adenocarcinomas,5 while FRMD3 expression is usually silenced in primary nonsmall cell lung carcinomas.6 Likewise, mouse lung carcinoma clones characterized by highly aggressive metastatic behavior did not express Dapk1.7 Also, MIRLET7A1 and PTPN3 expression is downregulated in lung cancer.8,9 The LM method identifies one gene located in a gain region (GAS1), which is therefore expected to be overexpressed in lung cancer samples. Surprisingly, Gas1 expression is known for its capacity to suppress metastasis in lung;10 we therefore hypothesize that this gene might be regulated epigenetically or that it is a false positive identified by the LM method. Again, the cghMCR method identifies this region as neutral. In addition, 3 genes are found by both methods in neutral regions (PHF19, DAB2IP, RPL12), and we therefore believe that they are regulated by epigenetic factors. In fact, PHF19 mRNA is known to be overexpressed in lung cancers,9 and methylation of the promoter of DAB2IP is associated with the lung cancer phenotype.11 Likewise, RPL12 splice variants are frequently found in human lung carcinoma cells.12
Probe positions from 1235 to 1249: A total of 17 genes are located in this region, with only one of them (ABL1) being classified as CR and identified as a gain by both methods.
Appendix C
R code for the linear-median function
x is an n × T matrix, the elements of x are aCGH observations in linear format
n denotes the number of independent samples
T denotes the size of each individual sample
At any probe position p, if the true shared copy number is not 2, the probability of having copy number changed is “prob”
Function “Linear_Median” gives the estimate of shared copy number at each probe position.
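Linear_Median = function(x, n, T, prob) {
  medianx = numeric(T)
  for (i in 1:T) {
    medianx[i] = median(x[, i])  # median over the n samples (rows) at probe position i
  }
  justx = 2 * (medianx - 1 + prob) / prob  # rescale the median ratio to the copy number scale
  xx = floor(justx)
  for (i in 1:T) {
    if (justx[i] >= xx[i] + 0.5)  # round halves up to the nearest integer
      xx[i] = xx[i] + 1
  }
  xx
}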
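A vectorized sketch of the same calculation, with a small worked example, is given below. It assumes, as in the description above, that x has the n samples in rows and the T probe positions in columns. The function name linear_median_sketch and the toy ratios are hypothetical; the ratios are constructed so that, with prob = 0.6, the column medians 1.0, 0.7 and 1.3 correspond to shared copy numbers 2, 1 and 3 under the relation m = 1 − π + πc/2 implied by the estimator.

```r
# Vectorized sketch of the linear-median estimate (illustrative only).
# x: n x T matrix of linear-scale aCGH ratios (samples in rows).
# prob: assumed probability that a sample carries the shared change.
linear_median_sketch <- function(x, prob) {
  med <- apply(x, 2, median)           # median across samples per probe
  raw <- 2 * (med - 1 + prob) / prob   # rescale to the copy number scale
  floor(raw + 0.5)                     # round to nearest integer, halves up
}

# Toy data: 5 samples, 3 probe positions (filled column by column)
x <- matrix(c(0.98, 1.00, 1.02, 0.99, 1.01,
              0.68, 0.70, 0.72, 0.69, 0.71,
              1.28, 1.30, 1.32, 1.29, 1.31),
            nrow = 5, ncol = 3)

est <- linear_median_sketch(x, prob = 0.6)
print(est)
```

Using floor(raw + 0.5) reproduces the explicit rounding loop in Linear_Median in a single step.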
Table S3.
n= 20 | L-M | cgh MCR | L-M | cgh MCR | L-M | cgh MCR |
---|---|---|---|---|---|---|
π | α = 0.5 | α = 1 | α = 1.5 | |||
0.2 | ||||||
TP | 0.6382 (0.0496) | 0.0714 (0.1101) | 0.7568 (0.0406) | 0.0024 (0.0188) | 0.8096 (0.0414) | 0 (0) |
FP | 0.3785 (0.0384) | 0.0040 (0.0154) | 0.6549 (0.0384) | 2.83e-04 (0.0041) | 0.7657 (0.0357) | 0 (0) |
0.4 | ||||||
TP | 0.7849 (0.0429) | 0.6760 (0.1830) | 0.7696 (0.0413) | 0.0308 (0.0779) | 0.7616 (0.0453) | 0 (0) |
FP | 0.0861 (0.0248) | 0.0415 (0.0302) | 0.3827 (0.0402) | 0.0011 (0.0081) | 0.5611 (0.0408) | 0 (0) |
0.6 | ||||||
TP | 0.9503 (0.0227) | 0.9075 (0.0224) | 0.8708 (0.0359) | 0.2759 (0.1129) | 0.8013 (0.0410) | 0 (0) |
FP | 0.0122 (0.0090) | 2.58e-05 (0.0004) | 0.2000 (0.0310) | 0.0023 (0.0114) | 0.3905 (0.0419) | 0 (0) |
0.8 | ||||||
TP | 0.9966 (0.0060) | 0.9030 (0.0206) | 0.9451 (0.0204) | 0.3877 (0.1308) | 0.8677 (0.0331) | 0 (0) |
FP | 0.0013 (0.0028) | 0 (0) | 0.0238 (0.0917) | 0 (0) | 0.2617 (0.0358) | 0 (0) |
1 | ||||||
TP | 1 (0) | 0.9490 (0.0147) | 0.9817 (0.0147) | 0.6542 (0.1561) | 0.9237 (0.0287) | 0 (0) |
FP | 7.74e-05 (0.0007) | 0 (0) | 0.04026 (0.0154) | 0 (0) | 0.1667 (0.0314) | 0 (0) |
Table S4.
n= 50 | L-M | cgh MCR | L-M | cgh MCR | L-M | cgh MCR |
---|---|---|---|---|---|---|
π | α = 0.5 | α = 1 | α = 1.5 | |||
0.2 | ||||||
TP | 0.6309 (0.0547) | 0.02442 (0.0626) | 0.7147 (0.0499) | 0 (0) | 0.7521 (0.0455) | 0 (0) |
FP | 0.1712 (0.0346) | 6.45e-04 (0.0065) | 0.4866 (0.0488) | 0 (0) | 0.6425 (0.0437) | 0 (0) |
0.4 | ||||||
TP | 0.8895 (0.0347) | 0.6643 (0.1542) | 0.8574 (0.0357) | 0.0019 (0.0109) | 0.7975 (0.0420) | 0 (0) |
FP | 0.0089 (0.0070) | 0.0439 (0.0297) | 0.1737 (0.0365) | 0 (0) | 0.3603 (0.0416) | 0 (0) |
0.6 | ||||||
TP | 0.9949 (0.0072) | 0.9046 (0.0149) | 0.9581 (0.0212) | 0.2762 (0.0842) | 0.8926 (0.0358) | 0 (0) |
FP | 6.45e-05 (0.0006) | 0 (0) | 0.0482 (0.0189) | 0 (0) | 0.1814 (0.0364) | 0 (0) |
0.8 | ||||||
TP | 1 (0) | 0.8962 (0.0118) | 0.9912 (0.0100) | 0.3384 (0.0416) | 0.9545 (0.0209) | 0 (0) |
FP | 0 (0) | 0 (0) | 0.0100 (0.0082) | 0 (0) | 0.0826 (0.0238) | 0 (0) |
1 | ||||||
TP | 1 (0) | 0.9207 (0.0154) | 0.9992 (0.0029) | 0.4155 (0.1679) | 0.9848 (0.0107) | 0 (0) |
FP | 0 (0) | 0 (0) | 0.0023 (0.0038) | 0 (0) | 0.0348 (0.0153) | 0 (0) |
Table S5.
n= 100 | L-M | cgh MCR | L-M | cgh MCR | L-M | cgh MCR |
---|---|---|---|---|---|---|
π | α = 0.5 | α = 1 | α = 1.5 | |||
0.2 | ||||||
TP | 0.6771 (0.0539) | 0.0048 (0.0203) | 0.7438 (0.0505) | 0 (0) | 0.7335 (0.0461) | 0 (0) |
FP | 0.0561 (0.0187) | 0 (0) | 0.3266 (0.0412) | 0 (0) | 0.5146 (0.0381) | 0 (0) |
0.4 | ||||||
TP | 0.9566 (0.0233) | 0.6650 (0.1317) | 0.9299 (0.02653) | 0.0004 (0.0030) | 0.8718 (0.0341) | 0 (0) |
FP | 0.0003 (0.0013) | 0.0455 (0.0270) | 0.0578 (0.0196) | 0 (0) | 0.2012 (0.0340) | 0 (0) |
0.6 | ||||||
TP | 0.9998 (0.0015) | 0.9033 (0.0108) | 0.9920 (0.01010) | 0.2804 (0.0706) | 0.9556 (0.0239) | 0 (0) |
FP | 0 (0) | 0 (0) | 0.0065 (0.0075) | 0 (0) | 0.0621 (0.0224) | 0 (0) |
0.8 | ||||||
TP | 1 (0) | 0.8956 (0.0087) | 0.9998 (0.0015) | 0.3345 (0.0193) | 0.9922 (0.0099) | 0 (0) |
FP | 0 (0) | 0 (0) | 0.0003 (0.0013) | 0 (0) | 0.0167 (0.0125) | 0 (0) |
1 | ||||||
TP | 1 (0) | 0.8971 (0.0177) | 0.9998 (0.0015) | 0.2263 (0.1156) | 0.9983 (0.0044) | 0 (0) |
FP | 0 (0) | 0 (0) | 0.0003 (0.0013) | 0 (0) | 0.0043 (0.0056) | 0 (0) |
References
- 1. Kamb A, Gruis NA, Weaver-Feldhaus J, et al. A cell cycle regulator potentially involved in genesis of many tumor types. Science. 1994;264:436–40. doi: 10.1126/science.8153634.
- 2. Singh DP, Kimura A, Chylack LT, Jr, Shinohara T. Lens epithelium-derived growth factor (LEDGF/p75) and p52 are derived from a single gene by alternative splicing. Gene. 2000;242:265–73. doi: 10.1016/s0378-1119(99)00506-5.
- 3. Shan Z, Parker T, Wiest JS. Identifying novel homozygous deletions by microsatellite analysis and characterization of tumor suppressor candidate 1 gene, TUSC1, on chromosome 9p in human lung cancer. Oncogene. 2004;23:6612–20. doi: 10.1038/sj.onc.1207857.
- 4. Cai Z, Chen HT, Boyle B, Rupp F, Funk WD, Dedera DA. Identification of a novel insulin-like growth factor binding protein gene homologue with tumor suppressor like properties. Biochem Biophys Res Commun. 2005;331:261–6. doi: 10.1016/j.bbrc.2005.03.163.
- 5. Allen T, van Tuyl M, Iyengar P, et al. Grg1 acts as a lung-specific oncogene in a transgenic mouse model. Cancer Res. 2006;66:1294–301. doi: 10.1158/0008-5472.CAN-05-1634.
- 6. Haase D, Meister M, Muley T, et al. FRMD3, a novel putative tumour suppressor in NSCLC. Oncogene. 2007;26:4464–8. doi: 10.1038/sj.onc.1210225.
- 7. Inbal B, Cohen O, Polak-Charcon S, et al. DAP kinase links the control of apoptosis to metastasis. Nature. 1997;390:180–4. doi: 10.1038/36599.
- 8. Johnson SM, Grosshans H, Shingara J, Byrom M, Jarvis R, Cheng A, et al. RAS Is Regulated by the let-7 MicroRNA Family. Cell. 2005;120:635–647. doi: 10.1016/j.cell.2005.01.014.
- 9. Gobeil S, Zhu X, Doillon CJ, Green MR. A genome-wide shRNA screen identifies GAS1 as a novel melanoma metastasis suppressor gene. Genes Dev. 2008;22:2932–40. doi: 10.1101/gad.1714608.
- 10. Wang Z, Shen D, Parsons DW, et al. Mutational analysis of the tyrosine phosphatome in colorectal cancers. Science. 2004;304:1164–6. doi: 10.1126/science.1096096.
- 11. Yano M, Toyooka S, Tsukuda K, et al. Aberrant promoter methylation of human DAB2 interactive protein (hDAB2IP) gene in lung cancers. Int J Cancer. 2005;113:59–66. doi: 10.1002/ijc.20531.
- 12. Cuccurese M, Russo G, Russo A, Pietropaolo C. Alternative splicing and nonsense-mediated mRNA decay regulate mammalian ribosomal gene expression. Nucleic Acids Research. 2005;33:5965–77. doi: 10.1093/nar/gki905.
Acknowledgments
V. Baladandayuthapani was partially supported by US National Science Foundation grant IIS 0914861. K.-A. Do was partially supported by the University of Texas SPORE grants in Prostate Cancer P50 CA140388, Breast Cancer P50 CA116199, Brain Cancer P50 CA127001, and the Cancer Center Support Grant P30 CA016672. We would also like to acknowledge LeeAnn Chastain (UTMDACC) for her editorial contributions to the manuscript.
Footnotes
Disclosure
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
References
- 1. Cappuzzo F, Hirsch FR, Rossi E, et al. Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. J Nat Cancer Inst. 2005;97:643–55. doi: 10.1093/jnci/dji112.
- 2. http://en.wikipedia.org/wiki/Array_comparative_genomic_hybridization
- 3. Aguirre AJ, Brennan C, Bailey G, et al. High-resolution characterization of the pancreatic adenocarcinoma genome. Proc Nat Acad Sci U S A. 2004;101:9067–72. doi: 10.1073/pnas.0402932101.
- 4. Shah SP, Xuan X, deLeeuw RJ, et al. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics. 2006;22:e431–9. doi: 10.1093/bioinformatics/btl238.
- 5. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657–63. doi: 10.1093/bioinformatics/btl646.
- 6. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–72. doi: 10.1093/biostatistics/kxh008.
- 7. Molinaro AM, van der Laan MJ, Moore DH. Comparative Genomic Hybridization Array Analysis. U.C. Berkeley Division of Biostatistics Working Paper Series. 2002. Working Paper 106. http://www.bepress.com/ucbbiostat/paper106.
- 8. Guha S, Li Y, Neuberg D. Bayesian hidden Markov modeling of array CGH data. J Am Stat Assoc. 2008;103:485–97. doi: 10.1198/016214507000000923.
- 9. Pinkel D, Albertson DG. Comparative genomic hybridization. Ann Rev Genom Hum Genet. 2005;6:331–54. doi: 10.1146/annurev.genom.6.080604.162140.
- 10. Pinkel D, Albertson DG. Array comparative genomic hybridization and its application in cancer. Nat Genet. 2005;37(Suppl):S11–7. doi: 10.1038/ng1569.
- 11. Pinkel D, Davis R, Albertson D. Detection of gene dosage abnormalities using comparative genomic hybridization. 2005. http://cancer.ucsf.edu/array/nccls_pinkel.pdf.
- 12. Rueda OM, Diaz-Uriarte R. Finding recurrent copy number alteration regions: a review of methods. Current Bioinformatics. 2010;5:1–17.
- 13. Coe BP, Lockwood WW, Girard L, et al. Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br J Cancer. 2006;94:1927–35. doi: 10.1038/sj.bjc.6603167.
- 14. Garnis C, Lockwood WW, Vucic E, et al. High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int J Cancer. 2006;118:1556–64. doi: 10.1002/ijc.21491.