Summary
Copy number variants are an important type of genetic structural variation in germline DNA, ranging from common to rare in a population. Both rare and common copy number variants have been reported to be associated with complex diseases, so it is important to identify both simultaneously based on a large set of population samples. We develop a proportion adaptive segment selection procedure that automatically adjusts to the unknown proportions of the carriers of the segment variants. We characterize the detection boundary that separates the region where a segment variant is detectable by some method from the region where it cannot be detected. Although the detection boundaries are very different for the rare and common segment variants, it is shown that the proposed procedure can reliably identify both whenever they are detectable. Compared with methods for single-sample analysis, this procedure gains power by pooling information from multiple samples. The method is applied to analyze neuroblastoma samples and identifies a large number of copy number variants that are missed by single-sample methods.
Keywords: DNA copy number variant, Information pooling, Population structural variant
1. INTRODUCTION
A copy number variant is a type of DNA structural variation that results in the genome having an abnormal number of copies of certain DNA segments. Copy number variants correspond to relatively large regions of the genome that have been deleted or duplicated on certain chromosomes (Zhang et al., 2009). Copy number variants can be inherited or caused by de novo mutations and have been shown to be associated with complex diseases such as cancer (Diskin et al., 2009). Such associations can involve both rare and common variants. Since recent genome-wide association studies have shown that common variants explain only a small fraction of the heritability of complex phenotypes, genetic studies of rare variants, including rare copy number variants, have become even more important.
An important problem is to identify all the copy number variants in the human genome, including both the rare and the common ones, in order to have a complete variant catalog for future association and population genetics analyses. While efficient procedures have been developed for identifying variants in a long sequence of genome-wide observations, these procedures mostly focus on identification based on data from a single sample. Important examples include the optimal likelihood ratio selection method (Jeng et al., 2010), the hidden Markov model-based method (Wang et al., 2007) and change-point based methods (Olshen et al., 2004). To identify the recurrent copy number variants that appear in multiple samples, some type of post-processing is often used. This type of procedure first identifies copy number variants based on individual samples and then selects regions with highly recurrent variants. One problem with such an approach is that the power for identifying the recurrent variants does not improve with the sample size. The locations of a recurrent copy number variant mostly overlap across samples, so identification power can be improved if information from multiple samples can be efficiently pooled. In addition, most variants from the germline constitutional genome span fewer than 20 single nucleotide polymorphisms (Zhang et al., 2009) on a typical Illumina 660K chip. Many of these short variants cannot be identified even by the optimal method based on data from a single sample (Jeng et al., 2010). Efficiently pooling information from multiple samples can greatly benefit the discovery of short copy number variants that are missed in single-sample analysis.
Methods for simultaneously detecting rare and common copy number variants based on a large set of population samples have not been fully developed. Zhang et al. (2010) introduced a method for detecting simultaneous change-points in multiple sequences that is only effective for detecting the common variants. Siegmund et al. (2010) extended their method by introducing a prior variant frequency that needs to be specified. No rigorous power studies were given in these papers. For common variants, the power of identification can be increased by summing the test statistics over all the samples (Zhang et al., 2010). This approach, however, fails for rare copy number variant identification, because the signal from the few carriers is greatly diluted. For rare copy number variants, methods based on outliers of the test statistics over all the samples can be more efficient (Siegmund et al., 2010). There is a great need for a unified and theoretically justified approach that can identify both rare and common copy number variants simultaneously.
In this paper, we propose a proportion adaptive segment selection procedure, which is optimally adaptive to the unknown proportions of the carriers of the segment variants. At its core is an efficient scanning algorithm based on a test statistic that is sensitive to both the rare and common segment variants. To study the theoretical properties of this procedure, we first characterize the detection boundary that separates the region where a segment variant is detectable by some method from the region where it cannot be detected by any method. The results show that the detection boundaries are very different for the rare and common segment variants. Despite the significant differences, it is shown that this adaptive procedure can simultaneously identify both the rare and common segment variants whenever they are detectable. This procedure automatically adapts to the unknown proportions of the carriers of the segment variants.
Compared with single sample analysis, the proposed adaptive procedure significantly gains power by pooling information from multiple samples; compared with other information pooling methods, this procedure provides a unified approach to identify a wide range of copy number variants. In addition to DNA copy number analysis, the proposed method can also be used for other applications, where the detection of recurrent signal segments is of interest. One example is to detect linear objects in images with multiple looks where information pooling also sheds light on the discovery of common and subtle objects.
2. STATISTICAL MODEL AND METHOD
2·1. Statistical Model for Multi-Sample Copy Number Variation Analysis
Suppose there are N linear sequences, or samples, of noisy data and that each sequence has T observations. Let Xit be the observed data for the ith sample at the tth location. If there are no signal variations, Xit scatters around 0 for any i and t. Suppose that at certain non-overlapping segments or subintervals I1, …, Iq, some samples have elevated or dropped means relative to the baseline and others do not. We call the samples that carry the variation the carriers. Denote the collection of the non-overlapping segments by 𝕀 = {I1, …, Iq}, the carrier proportion at segment Ik in the population by πk, and the magnitude of the segment for sample i by Aik. Then an observation for sample i ∈ {1, …, N} at location t ∈ {1, …, T} can be modeled as
Xit = Σ_{k=1}^q Aik 1(t ∈ Ik) + σi εit,  εit ~ N(0, 1), (1)
with
Aik ~ (1 − πk)δ0 + πk N(μk, τk), independently for i = 1, …, N, (2)
where δ0 is a point mass at 0, μk ≠ 0, and τk ≥ 0. The noise variance σi2 for sample i can be easily estimated when T is large and the signal segments are sparse in the linear sequence of data for sample i. For example, the robust median absolute deviation estimate can be applied. Without loss of generality, we assume σi = 1 in the theoretical analysis. All of the other parameters Ik, πk, μk, τk (k = 1, …, q) are unknown. From this model, if t is not in any signal segment, Xit is Gaussian noise following N(0, 1). If t is in a signal segment Ik, then
Xit ~ (1 − πk)N(0, 1) + πk N(μk, 1 + τk). (3)
This Gaussian mixture is both heterogeneous and heteroscedastic. The term τk in the variance of the second component represents the additional variability introduced by the different magnitudes of signal segments in the population.
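To make the model concrete, a minimal simulation of (1)–(2) in Python is sketched below; the sample size, segment locations, and parameter values are arbitrary illustrative choices, not values used elsewhere in the paper.

```python
import numpy as np

def simulate_cnv(N, T, segments, pi, mu, tau, rng=None):
    """Draw X_it from model (1)-(2) with sigma_i = 1: Gaussian noise plus, for
    each segment I_k, a carrier-specific magnitude A_ik that is 0 with
    probability 1 - pi_k and N(mu_k, tau_k) with probability pi_k."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((N, T))                  # baseline noise, sigma_i = 1
    for (start, end), pk, mk, tk in zip(segments, pi, mu, tau):
        carriers = rng.random(N) < pk                # which samples carry I_k
        A = rng.normal(mk, np.sqrt(tk), size=N)      # magnitudes, variance tau_k
        X[carriers, start:end] += A[carriers, None]  # add signal on I_k only
    return X

# one rare, strong segment and one common, weak segment (illustrative values)
X = simulate_cnv(N=200, T=500, segments=[(50, 60), (300, 310)],
                 pi=[0.02, 0.5], mu=[1.5, 0.2], tau=[0.5, 0.5], rng=0)
```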
Our goal is twofold: to detect the existence of recurrent segment variants across samples, and to identify the locations of the segments. Precisely, we wish to first test
H0: q = 0 versus H1: q ≥ 1, (4)
and, if H0 is rejected, detect each Ik ∈ 𝕀. The segment variants can be classified as rare or common based on their carrier proportions in the population. Specifically, we say that
Ik is a rare segment variant if πk ≤ N−1/2, (5)
Ik is a common segment variant if πk > N−1/2. (6)
The separation boundary N−1/2 is frequently seen in large-sample theory. See, for example, Cai et al. (2011) for a similar classification of recurrent signals. For the common variants with πk > N−1/2, classical large-sample theory implies that methods based on the sample mean are efficient. For the rare variants with πk ≤ N−1/2, classical results cannot be applied and new theoretical and methodological developments are needed.
2·2. The Proportion Adaptive Segment Selection Procedure
We now introduce the proportion adaptive segment selection procedure, which performs an efficient scan over long linear sequences of data based on a test statistic that is sensitive to both the rare and common segment variants. The procedure utilizes the short-segment structure of the signals by only considering intervals of length at most L, where L ≪ T. Denote the set of these intervals by 𝒥(L). The choice of L should satisfy the condition
s̄ ≤ L ≪ T, (7)
where s̄ is the length of the longest signal segments. This condition is easily satisfied when signal segments are short, as often seen in copy number variants.
For any interval J ∈ 𝒥(L), we calculate the standardized sum of observations in J for each sample i as
XJ,i = Σ_{t∈J} Xit / |J|1/2, (8)
where |J| denotes the length of the interval J. By (1) and the assumption σi = 1, XJ,i ~ N(0, 1) under H0. When J overlaps with some signal segment, XJ,i follows a heterogeneous and heteroscedastic Gaussian mixture according to (3). Specifically, when J = Ik for some Ik ∈ 𝕀,
XIk,i ~ (1 − πk)N(0, 1) + πk N(μk|Ik|1/2, 1 + τk). (9)
The mean of the second component involves both the jump size μk and the length of the segment variant at Ik.
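Computationally, the statistics (8) for all O(TL) candidate intervals can be obtained from prefix sums in a single pass; a sketch, assuming the data have already been standardized so that σi = 1:

```python
import numpy as np

def scan_statistics(X, L):
    """Return X_{J,i} of (8) for every interval J = [a, a+ell) with ell <= L,
    as a dict mapping (a, ell) -> length-N vector of standardized sums."""
    N, T = X.shape
    S = np.concatenate([np.zeros((N, 1)), np.cumsum(X, axis=1)], axis=1)
    stats = {}
    for ell in range(1, L + 1):
        # window sums of length ell, scaled by |J|^{1/2}; N(0,1) under H0
        Z = (S[:, ell:] - S[:, :-ell]) / np.sqrt(ell)
        for a in range(T - ell + 1):
            stats[(a, ell)] = Z[:, a]
    return stats

X = np.arange(12, dtype=float).reshape(2, 6)   # toy data: 2 samples, 6 positions
stats = scan_statistics(X, L=2)
```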
Based on the XJ,i statistic, we pool information from multiple samples by calculating the extreme value of the standardized ordered p-values of XJ,i (i = 1, …, N). The p-values of XJ,i are two-sided, and the ordered p-values are denoted by pJ,(1) ≤ ⋯ ≤ pJ,(N). The standardized ordered p-values are defined as
WJ,(i) = (i/N − pJ,(i)) / {pJ,(i)(1 − pJ,(i))/N}1/2, i = 1, …, N. (10)
Since the p-values are uniformly distributed under H0, the WJ,(i) form a standardized uniform empirical process whose extreme value has a well-studied distribution (Shorack & Wellner, 2009, pages 596–600). We use the extreme value VN(J) = max{WJ,(i) : α0 ≤ i ≤ N/2}, for some small α0, as our test statistic. If J overlaps with some signal segment, the distribution of the test statistic shifts to the positive side. An interval is selected if its test statistic passes a certain threshold and achieves a local maximum. Since the signal intervals are not known, this procedure examines all the overlapping intervals of length ≤ L and chooses the intervals that achieve a local maximum of the extreme values {VN(J) : J ∈ 𝒥(L)}. The detailed algorithm is given as follows.
Step 1. For each long sequence of data {Xit: t = 1, …, T}, standardize the data by subtracting the sample median and dividing by the median absolute deviation estimate of σi.
Step 2. Set the maximum interval length L and denote by 𝒥(L) the collection of the intervals with length less than or equal to L.
Step 3. For any given interval J ∈ 𝒥(L), calculate XJ,i as in (8) and the two-sided p-values of XJ,i as pJ,i = pr{|N(0, 1)| > |XJ,i|} (i = 1, …, N).
Step 4. Order the p-values as pJ,(1) ≤ ⋯ ≤ pJ,(N) and calculate the standardized empirical process of the p-values, WJ,(i), as in (10).
Step 5. Calculate the test statistic for each interval J ∈ 𝒥(L) as
VN(J) = max{WJ,(i) : α0 ≤ i ≤ N/2} (11)
for some small α0 > 1, and pick the candidate set
𝕀̂* = {J ∈ 𝒥(L) : VN(J) > λT,N} (12)
for some threshold λT,N. If 𝕀̂* is not empty, we reject the null hypothesis, set j = 1, and proceed to the following steps.
Step 6. Let Îj = arg max{VN(J) : J ∈ 𝕀̂*}, and update 𝕀̂* = 𝕀̂* \ {J ∈ 𝕀̂* : J ∩ Îj ≠ ∅}.
Step 7. Repeat Step 6 with j = j + 1 until 𝕀̂* is empty.
Step 8. Define the collection of selected intervals as 𝕀̂ = {Î1, Î2, …}, and identify the signal segments as all the elements in 𝕀̂.
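Steps 3–5 for a single interval can be sketched as follows; the normalization used for WJ,(i) here is the standard form for a uniform empirical process and is an assumption on the exact expression, and the p-values are clipped away from 0 and 1 for numerical stability:

```python
import numpy as np
from math import erfc, sqrt

def v_statistic(XJ, alpha0=4):
    """Steps 3-5 for one interval J: two-sided p-values of the X_{J,i},
    standardized ordered p-values W_{J,(i)}, and the maximum over
    alpha0 <= i <= N/2 as the test statistic V_N(J)."""
    N = len(XJ)
    # Step 3: two-sided p-values pr{|N(0,1)| > |X_{J,i}|}
    p = np.array([erfc(abs(x) / sqrt(2)) for x in XJ])
    p = np.clip(np.sort(p), 1e-15, 1 - 1e-15)        # Step 4: order (and clip)
    i = np.arange(1, N + 1)
    W = (i / N - p) / np.sqrt(p * (1 - p) / N)       # standardized process
    return W[alpha0 - 1:N // 2].max()                # Step 5: V_N(J)

rng = np.random.default_rng(1)
null = rng.standard_normal(400)        # X_{J,i} under H0
signal = null.copy()
signal[:20] += 4.0                     # 5% carriers with a strong shift
v0, v1 = v_statistic(null), v_statistic(signal)   # v1 exceeds v0 on this draw
```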
After the test statistic VN(J) is calculated for each interval J ∈ 𝒥(L), a threshold λT,N is set based on the distribution of VN(J) under H0. Since all the intervals in 𝒥(L) are considered, the threshold λT,N needs to adjust for multiple testing, so that the family-wise Type I error is controlled at a desired level. Section 3·1 provides a detailed discussion on setting λT,N theoretically or by simulations.
Steps 6–7 find all the local peaks in the candidate set 𝕀̂*. Intuitively, if the signal segments are well separated, the test statistic VN(Ik) of a signal segment Ik is larger than those of other intervals overlapping Ik, so that the local peaks provide good estimates of the signal segments.
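The peak-finding in Steps 6–7 is a greedy loop over the candidate set: take the interval with the largest statistic, discard every candidate overlapping it, and repeat. A schematic version, with intervals coded as (start, length) pairs and the statistics supplied as a dict:

```python
def select_peaks(V, lam):
    """Greedy Steps 6-7: V maps (start, length) -> V_N(J); candidates are the
    intervals with V_N(J) > lam, and each round keeps the current argmax and
    removes all remaining candidates that overlap it."""
    cand = {J: v for J, v in V.items() if v > lam}   # candidate set of Step 5
    selected = []
    while cand:
        J = max(cand, key=cand.get)                  # local peak
        selected.append(J)
        a, ell = J
        cand = {K: v for K, v in cand.items()
                if K[0] + K[1] <= a or K[0] >= a + ell}   # keep non-overlapping
    return selected

V = {(10, 5): 9.0, (11, 5): 8.5, (40, 3): 6.2, (80, 4): 3.0}
peaks = select_peaks(V, lam=5.0)   # -> [(10, 5), (40, 3)]
```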
Remark 1: The tuning parameter α0 in (11) is used to stabilize the procedure and to better control the family-wise error with finite samples. This parameter excludes the endpoints of the standardized uniform empirical process, where the process diverges. The rationale is that the convergence of the extreme values without truncation is much slower than the convergence of a truncated version. By choosing α0 > 1, VN(J) can be more stable in finite samples, which also leads to a smaller threshold on VN(J) to control the over-selection, and higher power for detecting signals with small intensity.
Remark 2: The test statistic VN(J) is closely related to some other test statistics based on standardized uniform empirical process, such as the Anderson & Darling (1952) statistic and higher criticism (Donoho & Jin, 2004). However, the setting is different here and the adaptivity of VN(J) to rare and common segment variants is an interesting new discovery.
The fact that the test statistic VN(J) is able to capture the signal information of both the rare and common variants can be illustrated by simulation. In this example, the sample size N = 1600, so that the separating value for the carrier proportion is N−1/2 = 2·5%. The data are generated from model (1)–(2) with T = 10,000 and σi = 1. Two locations are randomly selected for segment variants with length |Ik| = 10 for k = 1, 2. The first location has a rare variant with π1 = 1% and μ1 = 1; the second location has a common variant with π2 = 50% and μ2 = 0·1. Fix τk = 0·7 for k = 1, 2. Figure 1 shows WI1,(i) and WI2,(i) (i = α0, …, N) with α0 = 4. The samples are ordered by their p-values of XIk,i, and the WIk,(i) statistics are plotted in the same order. Evidently, the signal information shows up at different places in the two plots. In the left plot, WI1,(i) reaches its peak at the left end, where the small number of large signals are located; in the right plot, WI2,(i) reaches its peak to the right of the left end, where the information from many small signals accumulates. According to (11), the test statistics VN(I1) and VN(I2) capture the peaks in these two cases respectively. This is essentially the reason for the strong detection power of the proportion adaptive segment selection for both the rare and common segment variants.
Fig. 1.
Illustrations of WI1,(i) and WI2,(i) (i = α0,…, N). Left: mean ± median absolute deviation of WI1,(i) over 100 replications for a rare segment variant with relatively large magnitude. Right: mean ± median absolute deviation of WI2,(i) over 100 replications for a common segment variant with small magnitude.
3. OPTIMAL ADAPTIVITY OF THE PROPORTION ADAPTIVE SEGMENT SELECTION
3·1. Family-Wise Error Control
In this section, we show that under H0, the proportion adaptive segment selection procedure with a theoretical threshold asymptotically controls the family-wise error. The theoretical threshold is constructed based on the limiting distribution of VN(J) as N → ∞ under H0. Define
aN = (2 log log N)1/2, bN = 2 log log N + 2−1 log log log N − 2−1 log(4π). (13)
Then aNVN(J) − bN converges to a non-degenerate random variable (Shorack & Wellner, 2009, page 600). Since all TL intervals in 𝒥(L) are considered, the theoretical threshold is defined as
λT,N = {bN + C0 log(TL)}/aN (14)
for some C0 > 1. The following theorem shows that the procedure asymptotically controls the family-wise error for any fixed T.
THEOREM 1. Assume model (1)–(2). The candidate set constructed in (12) with λT,N defined in (14) is empty with high probability under H0. More specifically, we have
prH0(𝕀̂* ≠ ∅) ≤ C1(TL)1−C0 for any fixed T and all sufficiently large N, where C1 > 0 is a constant and C0 > 1 is defined in (14).
The proof of Theorem 1 is given in the Appendix. The proof for the family-wise error when T increases with N is more involved due to the calculation of the convergence rate for the extreme value of the standardized empirical process. A detailed proof is outside the scope of this paper. The convergence is slow in general, mainly due to the end points of the interval [0, 1] as shown in Shorack & Wellner (2009), §16.1. By choosing α0 > 1, the test statistic VN(J) is more stable and its family-wise error better controlled under H0. While the precise choice of α0 in practice is difficult, simulation results in Table 1 can provide useful guidance.
Table 1.
Means and standard deviations (in parentheses) of the data-driven thresholds for VN over 100 replications.
| | α0 = 1 | α0 = 2 | α0 = 4 | α0 = 7 | α0 = 10 | α0 = 13 |
|---|---|---|---|---|---|---|
| #O = 5 | 47·7 (10·9) | 12·3 (1·6) | 6·9 (0·4) | 5·6 (0·3) | 5·2 (0·2) | 5·0 (0·2) |
| #O = 2 | 75·5 (26·7) | 15·2 (2·3) | 7·7 (0·7) | 6·0 (0·3) | 5·5 (0·3) | 5·3 (0·3) |
| #O = 0 | 178·9 (163·3) | 24·2 (8·9) | 9·8 (2·0) | 7·1 (1·0) | 6·3 (0·6) | 5·9 (0·5) |
#O: number of over-selected intervals; α0: number of observations left out.
Although Theorem 1 shows that the Type I error can be controlled when T and N are sufficiently large, in finite-sample situations the convergence of VN(J) as N → ∞ can be slow for small α0. In addition, it is difficult to choose the constants C0 and C1 in setting the threshold. In simulations and real data analysis, we suggest using simulations to determine a data-driven threshold that controls the number of over-selections. Section 4·1 presents the details.
3·2. Detection Power of the Proportion Adaptive Segment Selection
Under the alternative hypothesis, the proposed procedure asymptotically selects either the true signal segments or short intervals overlapping the true segments, whenever the signal segments are detectable. We characterize the detection boundary that separates the region where a segment variant is detectable by some method from the region where it cannot be detected by any method. If a method can reliably detect a segment variant whenever it is detectable, we say that the method is optimal. If a method applies a unified approach to optimally detect both rare and common segment variants without using the information of their carrier proportions and other unknown parameters, we say the method is optimally adaptive to the carrier proportions and the other unknown parameters in the model.
The segment variants can be classified into the rare and common groups by (5) and (6). We calibrate πk as
πk = N−βk, 0 ≤ βk < 1. (15)
If 1/2 ≤ βk < 1, Ik corresponds to a rare variant, and if 0 ≤ βk < 1/2, Ik corresponds to a common variant. Extremely rare variants with carrier proportion of the order N−1 are not considered here. For a fixed Ik and a sample i, the sufficient statistic XIk,i defined in (8) follows a Gaussian mixture distribution as in (9). Since the mean of the non-null component has absolute value |μk||Ik|1/2, we calibrate |μk||Ik|1/2 for rare and common variants respectively as
|μk||Ik|1/2 = (2rk log N)1/2, rk > 0, if 1/2 ≤ βk < 1, (16)
|μk||Ik|1/2 = N−rk, rk > 0, if 0 ≤ βk < 1/2. (17)
For a rare variant, the carrier proportion is so small that the signal intensity, which is represented by |μk||Ik|1/2, must be sufficiently large to make the variant detectable; whereas for a common variant, the carrier proportion is so large that a small signal intensity can be amplified to make the detection successful. This is the reason for the different calibrations of |μk||Ik|1/2. These calibrations are similar to those of signal intensity in Cai et al. (2011), where detection of Gaussian mixtures is considered.
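To connect these calibrations to concrete parameter values, the exponents βk and rk can be read off by inverting (15)–(17); a small illustrative helper, with the regime split at βk = 1/2 following (5)–(6):

```python
import numpy as np

def calibrate(N, pi_k, mu_k, len_k):
    """Map (pi_k, mu_k, |I_k|) to the exponents (beta_k, r_k) of (15)-(17):
    pi_k = N^{-beta_k}; the intensity |mu_k| |I_k|^{1/2} equals
    (2 r_k log N)^{1/2} in the rare regime (beta_k >= 1/2) and N^{-r_k}
    in the common regime (beta_k < 1/2)."""
    beta = -np.log(pi_k) / np.log(N)
    intensity = abs(mu_k) * np.sqrt(len_k)
    if beta >= 0.5:                       # rare: (2 r log N)^{1/2} scale
        r = intensity ** 2 / (2 * np.log(N))
    else:                                 # common: N^{-r} scale
        r = -np.log(intensity) / np.log(N)
    return beta, r

# the rare and common examples of Fig. 1: N = 1600, |I_k| = 10
beta1, r1 = calibrate(1600, 0.01, 1.0, 10)   # rare, strong
beta2, r2 = calibrate(1600, 0.50, 0.1, 10)   # common, weak
```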
We find the detection boundary for the rare and common variants, respectively. Let σk = (1 + τk)1/2.
If 1/2 ≤ βk < 1, define
| (18) |
and if 0 ≤ βk < 1/2, define
| (19) |
The following proposition shows that ρ+(βk, τk) and ρ− (βk, τk) are the separating lines between the detectable and undetectable regions for the rare and common segment variants.
PROPOSITION 1. Assume model (1)–(2) and (7). Suppose I1 is known and π1 and μ1|I1|1/2 are calibrated as in (15), (16), and (17). If r1 > ρ+(β1, τ1) for 1/2 ≤ β1 < 1, or if r1 < ρ−(β1, τ1) for 0 ≤ β1 < 1/2, there exists a consistent test of π1 = 0 versus π1 > 0, for which the sum of the Type I and Type II error probabilities tends to 0 as N → ∞. If r1 < ρ+(β1, τ1) for 1/2 ≤ β1 < 1, or if r1 > ρ−(β1, τ1) for 0 ≤ β1 < 1/2, a consistent test does not exist.
Proposition 1 is an extension of Theorems 2.1–2.4 in Cai et al. (2011), where, based on a random sample {Y1, …, YN}, the problem is to test H0*: Yi ~ N(0, 1) versus H1*: Yi ~ (1 − ∊)N(0, 1) + ∊N(A, σ2),
where ∊, A, and σ2 are unknown parameters. Here the information in I1 for sample i is summarized by the sufficient statistic XI1,i, and, for I1, the sufficient statistics of the N samples, XI1,1, …, XI1,N, are treated as a random sample. It is easy to see that XI1,i ~ N(0, 1) under the null hypothesis π1 = 0 and XI1,i ~ (1 − π1)N(0, 1) + π1N(μ1|I1|1/2, 1 + τ1) under the alternative π1 > 0. Therefore, the mixture model of XI1,i under the alternative is a special case of the mixture model of Yi under H1* with σ2 > 1. So the detection boundary for the segment variant at I1 can be derived in a similar way as in Cai et al. (2011). We omit the proof here.
The detection boundary can be used as a benchmark to evaluate theoretically the performance of a method. Our problem is more difficult than detecting Gaussian mixtures at a fixed interval, because the locations I1, …, Iq are unknown. The proportion adaptive segment selection procedure first pools information across samples for all intervals in 𝒥(L) and then searches through these intervals to detect segment variants. The following theorem states that, as long as the length T of the sequence is not too large compared to the sample size N, any segment variant in the detectable region is included with high probability in the candidate set 𝕀̂* of the proportion adaptive selection, implying that the proportion adaptive segment selection procedure with the theoretical threshold is an asymptotically optimal procedure for detecting segment variants. Furthermore, the implementation of the proportion adaptive segment selection does not require knowledge of {q, πk, Ik, μk, τk: k = 1, …, q}. Therefore, the procedure is asymptotically optimally adaptive to all the unknown parameters in the model.
THEOREM 2. Assume model (1)–(2) and (7). Suppose q ≥ 1 and, for any Ik ∈ 𝕀, calibrate πk and μk|Ik|1/2 as in (15), (16), and (17). In addition, assume NC ≫ log T for any C > 0 and α0 = o(NC) for any C > 0. Then, if rk > ρ+(βk, τk) for 1/2 ≤ βk < 1 or if rk < ρ−(βk, τk) for 0 ≤ βk < 1/2, we have
| (20) |
where
| (21) |
Clearly, the convergence rate is larger for smaller βk, which corresponds to a larger carrier proportion. The event in (20) implies that, for each k, either Ik itself or a short interval overlapping with Ik is selected in the final collection 𝕀̂. In applications, follow-up studies rarely look only at the selected intervals, but rather examine small regions covering the selected intervals to verify the exact locations of signal segments.
In order to see the power gain of the proportion adaptive segment selection by pooling information from multiple samples, we consider the situation where only one sequence of data with length T is available. In this situation, a theoretically optimal likelihood ratio selection has been developed in Jeng et al. (2010). For the likelihood ratio selection to successfully detect a signal segment I1, the condition on μ1|I1|1/2 is |μ1||I1|1/2 ≥ (1 + ∊n)(2 log T)1/2 for some ∊n = o(1). This is in general stronger than the condition on μ1|I1|1/2 in Theorem 2, which is (2r1 log N)1/2 with r1 > ρ+(β1, τ1) for 1/2 ≤ β1 < 1, or N−r1 with r1 < ρ−(β1, τ1) for 0 ≤ β1 < 1/2, when T is much larger than N. In high-throughput copy number variation data analysis, T is usually above 500,000 and N mostly under 1000. Significant power gains can be achieved especially for detecting common variants.
4. SIMULATION STUDIES
4·1. Choice of α0
The parameter α0 in (11) of the algorithm determines how many end points of the empirical process of the p-values are left out from the test statistic VN. By choosing α0 > 1, VN can be more stable with finite samples, which also leads to a smaller threshold on VN to control over-selection, and higher power for detecting small-intensity signals. To demonstrate this, Table 1 shows the mean and standard deviation of the data-driven threshold based on 100 replications of simulated data. In each replication, we generate 400 sequences, and each sequence has 5000 observations generated from the standard normal distribution. We apply our procedure with L = 6. The data-driven threshold is defined as the smallest threshold that guarantees no more than a pre-specified number of intervals in 𝕀̂ that do not overlap with any of the segments in 𝕀. The value of α0 has a great effect on the threshold and therefore on the power of the test statistic VN. Since the α0 − 1 samples with the most extreme p-values are left out from the test statistic, the proposed procedure cannot be very effective in identifying extremely rare copy number variants.
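The data-driven threshold can be obtained by Monte Carlo under the global null, scanning pure-noise data and recording the largest pooled statistic observed; a scaled-down sketch in which the pooling statistic is passed in as a function (the simple max-|z| pooling used in the last line is only a fast stand-in for VN(J)):

```python
import numpy as np

def null_threshold(stat_fn, N, T, L, n_rep=10, rng=None):
    """Simulate n_rep pure-noise data sets, scan all intervals of length <= L,
    and return the largest pooled statistic seen; selecting only intervals
    above this value yields (approximately) no over-selection under H0."""
    rng = np.random.default_rng(rng)
    maxima = []
    for _ in range(n_rep):
        X = rng.standard_normal((N, T))
        S = np.concatenate([np.zeros((N, 1)), np.cumsum(X, axis=1)], axis=1)
        m = -np.inf
        for ell in range(1, L + 1):
            Z = (S[:, ell:] - S[:, :-ell]) / np.sqrt(ell)  # statistics (8)
            for a in range(Z.shape[1]):
                m = max(m, stat_fn(Z[:, a]))
        maxima.append(m)
    return max(maxima)

# stand-in pooling statistic; in the actual procedure stat_fn would be V_N(J)
thr = null_threshold(lambda z: np.abs(z).max(), N=20, T=200, L=3, n_rep=5, rng=0)
```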
4·2. Improvement Over Single-Sample Method
In this section, simulation studies are carried out to investigate the numerical performance of the proportion adaptive segment selection and to compare it with other methods. In the following simulations, we set N = 400, T = 5000, and σi = 1 for each sample i.
We begin by considering testing H0 against H1. The power gain of information pooling is shown by comparing the performance of the proportion adaptive segment selection with the single-sample method in Jeng et al. (2010), which scans through all the intervals of length at most L for each sample and calculates the sufficient statistic for each interval as in (8). The single-sample method rejects H0 if the extreme value of all the sufficient statistics has absolute value greater than {2 log(NTL)}1/2. This is derived from the distribution theory of the extreme value under H0. The single-sample method does not utilize the information that the locations of a recurrent variant across samples are mostly overlapping.
To assess the Type I error, we generate each Xit ~ N(0, 1) and calculate the empirical Type I error of the proportion adaptive segment selection, with L = 6 and α0 = 10, and of the single-sample method. The empirical Type I error is defined as the percentage of replications in which some interval is selected under H0. We observe empirical Type I errors of 0·081 and 0·097, respectively, both with standard error 0·009. To assess the power of detecting the segments, one segment I is randomly selected and each Xit in that segment is generated from model (1)–(2) with |I| = |I1| = 5, τ = τ1 = 1, π = π1 = 0·1, and μ = μ1 = 0·5, 0·7, 0·9 and 1·1. The empirical power based on 100 replications is defined as the percentage of replications in which some interval in 𝕀̂ overlaps with the segment I. Table 2 clearly shows that the proportion adaptive selection outperforms the single-sample method, resulting in much higher power for a wide range of the μ values.
Table 2.
Empirical power (%) and standard error (in parentheses) of the proportion adaptive segment selection and the single-sample method over 100 replications.
| | μ = 0·5 | μ = 0·7 | μ = 0·9 | μ = 1·1 |
|---|---|---|---|---|
| PASS | 21 (4·0) | 54 (4·9) | 89 (3·3) | 100 (0·0) |
| LRS | 22 (4·2) | 29 (4·7) | 38 (4·9) | 59 (4·9) |
PASS: proportion adaptive segment selection; LRS: single-sample method of Jeng et al. (2010).
The estimated standard errors of the empirical power over 100 replications are also included in Table 2. To estimate the standard error of a median, we generate 500 bootstrap samples from the 100 replication results, then calculate the median of each bootstrap sample; the standard error is the standard deviation of the 500 bootstrap medians. The standard errors are in general small for all the simulations in Sections 4·2–4·4.
4·3. Effects of Segment Length and Signal Variance
Further simulations are provided to demonstrate the effect of segment length and signal variance on the proportion adaptive segment selection. We randomly select three locations for segment variants with different segment length, |I1| = 4, |I2| = 9, and |I3| = 16. The other parameters are set as μk = 1, πk = 0·05, and τk =2·5, 1·5 and 0·0 for k = 1, 2, 3. Table 3 shows the estimation accuracy and the control of over-selection for the proportion adaptive selection with L = 20 and α0 = 10. The estimation accuracy for signal segment Ik is demonstrated by the dissimilarity measure
Dk = 1 − max{|J ∩ Ik| / |J ∪ Ik| : J ∈ 𝕀̂}, where 𝕀̂ denotes the collection of intervals selected by the proportion adaptive segment selection. Clearly, Dk ∈ [0, 1] and smaller values of Dk correspond to greater overlap of Ik with some interval in 𝕀̂. Table 3 shows that longer segment length and/or smaller signal variance result in better identification of the segments by the proportion adaptive segment selection.
Table 3.
Medians and standard errors (in parentheses) of the dissimilarity measure Dk (k = 1, 2, 3) and the number of over-selections for the proportion adaptive segment selection over 100 replications. The lengths of the intervals are |I1| = 4, |I2| = 9 and |I3| = 16.
| τ | D1 | D2 | D3 | #O |
|---|---|---|---|---|
| 2·5 | 1 (0·3) | 0·18 (0·02) | 0·10 (0·01) | 2 (0·2) |
| 1·5 | 1 (0·1) | 0·10 (0·02) | 0·06 (0·01) | 2 (0·0) |
| 0·0 | 1 (0·0) | 0·05 (0·02) | 0·03 (0·01) | 2 (0·3) |
#O: number of over-selected intervals.
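The dissimilarity Dk can be computed as one minus the best Jaccard-type overlap between Ik and the selected intervals; a sketch with intervals as (start, length) pairs (the Jaccard form is an assumption here, chosen to be consistent with Dk ∈ [0, 1] and smaller values meaning greater overlap):

```python
def dissimilarity(Ik, selected):
    """D_k = 1 - max over selected J of |J ∩ I_k| / |J ∪ I_k|, with intervals
    given as (start, length) pairs; returns 1.0 when nothing overlaps I_k."""
    a, la = Ik
    best = 0.0
    for b, lb in selected:
        inter = max(0, min(a + la, b + lb) - max(a, b))   # overlap length
        best = max(best, inter / (la + lb - inter))       # Jaccard overlap
    return 1.0 - best

D = dissimilarity((100, 9), [(98, 10), (400, 5)])   # partial overlap -> small D
```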
4·4. Simultaneous Discovery of Rare and Common Segment Variants
We demonstrate the adaptivity property of the proportion adaptive segment identification by comparing its performance with two recently published methods. One method, developed by Zhang et al. (2010), pools information through the sum of observations over all the samples and is expected to work well for common signals with weak signal intensity. Another method, developed by Siegmund et al. (2010) with the prior probability of a carrier fixed at p0 = 0·01, pools information through the outliers of observations over all the samples and is more efficient for detecting rare signals with strong signal intensity. Since the methods of Zhang et al. (2010) and Siegmund et al. (2010) control the genome-wide Type I error at a given level and our method does not provide a fixed-level Type I error control, the parameters in these methods are tuned to obtain a comparable over-selection of intervals. Specifically, we choose α0 = 10 and a data-driven threshold at 5·52 to keep the number of over-selected intervals fewer than two.
The simulations are repeated 100 times. The parameters used are N = 400, T = 5000, q = 5, L = 6 and |Ik| = 5, τk = 0 for k = 1, …, q. We consider two different scenarios. The first scenario considers rare segment variants with a large signal intensity, where μ is fixed at 1 and the values of πk vary from 0·04 to 0·08. The second scenario considers common segment variants with a small signal intensity, where μ is fixed at 0·3 and the values of πk vary from 0·4 to 0·8. We compare the performances of these methods in terms of over-selection and empirical power in Table 4. The numbers of over-selections are comparable for all these methods. The method of Zhang et al. (2010) performs best for identifying the common segment variants, while that of Siegmund et al. (2010) with p0 = 0·01 performs best for rare ones. The performance of the proportion adaptive segment selection lies between these two methods, demonstrating its adaptivity and good power for identifying both rare and common segment variants simultaneously. Finally, we compare with the combined method that applies Zhang et al. (2010) and Siegmund et al. (2010) with p0 = 0·01 together, with a median number of over-selections of two. The proportion adaptive procedure results in slightly lower power than the combined method.
Table 4.
Empirical power (%) and median of the number of over-selection #O for proportion adaptive segment selection and methods of Zhang et al. (2010) and Siegmund et al. (2010) over 100 replications. The standard errors appear in parentheses.
| Rare and strong signal | ||||||
|---|---|---|---|---|---|---|
| μ = 1 | π1 = 0 · 04 | π2 = 0 · 05 | π3 = 0 · 06 | π4 = 0 · 07 | π5 = 0 · 08 | #O |
| PASS | 35 (4·8) | 46 (4·8) | 66 (4·7) | 86 (3·6) | 91 (2·8) | 2·5 (0·4) |
| MSCP1 | 20 (3·9) | 34 (4·7) | 60 (5·1) | 73 (4·5) | 86 (3·3) | 1 (0·0) |
| MSCP001 | 46 (4·9) | 56 (4·9) | 77 (4·5) | 85 (3·5) | 95 (2·2) | 2 (0·4) |
| COMB | 45 (5·1) | 57 (4·8) | 70 (4·3) | 82 (3·8) | 93 (2·5) | 2 (0·1) |
| Common and weak signal | ||||||
| μ = 0 · 3 | π1 = 0 · 4 | π2 = 0 · 5 | π3 = 0 · 6 | π4 = 0 · 7 | π5 = 0 · 8 | #O |
| PASS | 8 (2·6) | 19 (3·8) | 21 (4·1) | 41 (5·0) | 54 (5·2) | 2 (0·4) |
| MSCP1 | 11 (3·3) | 25 (4·4) | 36 (4·7) | 58 (4·9) | 69 (4·6) | 1 (0·1) |
| MSCP001 | 11 (3·3) | 9 (2·9) | 12 (3·1) | 24 (4·1) | 24 (4·2) | 2 (0·2) |
| COMB | 12 (3·3) | 21 (4·0) | 26 (4·4) | 45 (5·0) | 57 (5·2) | 2 (0·5) |
PASS: proportion adaptive segment selection with α0 = 10; MSCP1: method of Zhang et al. (2010); MSCP001: method of Siegmund et al. (2010) with p0 = 0·01; COMB: combination of MSCP1 and MSCP001.
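The simulation design above can be sketched in a few lines of numpy. This is a minimal illustration of the data-generating mechanism only, not of any of the selection procedures; the exact interval positions are an assumption, since the text does not fix them, and `simulate_scenario` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scenario(N=400, T=5000, mu=1.0,
                      pis=(0.04, 0.05, 0.06, 0.07, 0.08),
                      seg_len=5, rng=rng):
    """Generate N samples of length T with q planted segment variants.

    Each variant k occupies a disjoint interval of seg_len markers; each
    sample carries variant k independently with probability pis[k], and
    carriers receive an added mean mu on that interval (noise is N(0, 1)).
    Interval locations are spread evenly along the sequence (an assumption).
    """
    X = rng.standard_normal((N, T))
    starts = np.linspace(0, T - seg_len, len(pis), dtype=int)
    for s, pi_k in zip(starts, pis):
        carriers = rng.random(N) < pi_k          # which samples carry variant k
        X[carriers, s:s + seg_len] += mu         # shift the carriers' means
    return X, starts

# Rare/strong scenario: mu = 1, pi_k from 0.04 to 0.08
X_rare, starts = simulate_scenario(mu=1.0, pis=(0.04, 0.05, 0.06, 0.07, 0.08))
# Common/weak scenario: mu = 0.3, pi_k from 0.4 to 0.8
X_common, _ = simulate_scenario(mu=0.3, pis=(0.4, 0.5, 0.6, 0.7, 0.8))
```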
5. APPLICATION TO NEUROBLASTOMA SAMPLES
We apply the proportion adaptive segment selection to a sample of 674 neuroblastoma cases collected as part of a large-scale genome-wide association study of neuroblastoma (Diskin et al., 2009). For each sample, over 600,000 single nucleotide polymorphism markers were genotyped on the Illumina genotyping platform and the log R-ratio data were obtained. To account for possible wave effects or local effects, we performed processing similar to that in Siegmund et al. (2010) to obtain normalized data, subtracting the sample median and regressing on the first principal component. In our analysis, we considered only data from chromosome 1, which includes T = 40,929 log R-ratios.
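The two normalization steps described above could be sketched as follows. This is only an illustration under stated assumptions: the text does not specify whether the principal component is computed over markers or samples, so the orientation below, and the helper name `normalize_lrr`, are assumptions rather than the authors' exact procedure.

```python
import numpy as np

def normalize_lrr(X):
    """Normalize a samples-by-markers matrix of log R-ratios.

    Sketch of the two preprocessing steps in the text:
    (1) subtract each sample's median;
    (2) remove each sample's projection onto the first principal
        component of the centered matrix, intended to absorb shared
        wave/local effects.
    """
    X = X - np.median(X, axis=1, keepdims=True)        # per-sample median centering
    # First principal direction over markers via SVD of the column-centered matrix
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    pc1 = Vt[0]                                        # unit-norm direction of length T
    coefs = X @ pc1                                    # regression coefficient per sample
    return X - np.outer(coefs, pc1)                    # residuals after removing pc1
```

Because `pc1` has unit norm, the returned residuals are exactly orthogonal to the first principal direction.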
We choose L = 20 and α0 = 4 to allow selection of copy number variants with four or more carriers. We then use simulations to determine the threshold for VN that controls the number of over-selections at zero. Specifically, 674 samples with 40,929 observations each are simulated from the standard normal distribution 50 times; the mean and standard deviation of the simulated thresholds are 12·57 and 2·98. With the threshold set at 12·57, the proportion adaptive procedure selects 335 copy number variants of three or more markers, including 171 copy number variants with three markers, 100 with four markers, and 11 with ten or more markers. The median size of the identified copy number variants is 4,165 bp, with a range of 462 to 1,038,000 bp. Figure 2 shows the likelihood ratio statistics and data plots for six identified copy number variants with different characteristics. The first two plots show two common copy number variants detected in these 674 neuroblastoma samples: the first, with eight markers, overlaps the seven-marker copy number variant that was shown to be associated with the risk of neuroblastoma in Diskin et al. (2009); the second, with three markers, is very common and is also validated by Redon et al. (2006). The third and fourth plots show two rare copy number variants detected by the proportion adaptive segment selection, where only a few samples show large likelihood ratio statistics; these two copy number variants were also validated by Redon et al. (2006). These results indicate that the proportion adaptive segment selection can indeed detect both rare and common copy number variants.
Fig. 2.
Examples of the copy number variants identified. The top two panels show the likelihood ratio statistics for each sample; the bottom two panels show heatmaps of the absolute values of the observed log R-ratios for the markers within and around the identified copy number variants (vertical white lines).
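The Monte Carlo calibration described above, where pure-noise data are simulated repeatedly and the threshold is read off the null distribution of the maximum statistic, can be sketched as follows. The scan statistic below (sum of squared standardized interval means, pooled over samples) is a simplified stand-in, not the paper's VN, and `null_threshold` is a hypothetical helper name.

```python
import numpy as np

def null_threshold(n_samples, n_markers, max_len=20, n_reps=50, seed=0):
    """Estimate a null reference threshold by Monte Carlo.

    For each replication, simulate N(0, 1) noise and record the maximum,
    over all intervals of length <= max_len, of the sum over samples of
    squared standardized interval means. Returns the mean and standard
    deviation of the per-replication maxima, mirroring the calibration
    recipe in the text (with a simplified statistic).
    """
    rng = np.random.default_rng(seed)
    maxima = []
    for _ in range(n_reps):
        X = rng.standard_normal((n_samples, n_markers))
        # Cumulative sums make every interval mean an O(1) lookup
        cs = np.concatenate([np.zeros((n_samples, 1)), X.cumsum(axis=1)], axis=1)
        best = -np.inf
        for L in range(1, max_len + 1):
            Z = (cs[:, L:] - cs[:, :-L]) / np.sqrt(L)   # standardized interval means
            best = max(best, (Z ** 2).sum(axis=0).max())  # pool samples, max over starts
        maxima.append(best)
    maxima = np.array(maxima)
    return maxima.mean(), maxima.std()
```

With the paper's dimensions (674 samples, 40,929 markers, 50 replications) this same recipe would be run at full scale; the toy statistic here only illustrates the structure of the calibration.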
Since the identification of short copy number variants is more susceptible to local wave effects or other data artifacts, we interpret the copy number variants of three or four markers with caution and focus the following comparison on the 64 identified copy number variants of 5 or more markers. Among these 64 copy number variants, 30 overlap with copy number variants in the database of genomic variants (http://projects.tcag.ca/variation/project.html) (Zhang et al., 2006). This database includes only the relatively common copy number variants identified in healthy individuals. To further demonstrate the power of the proportion adaptive selection, we also performed single-sample copy number identification using the optimal likelihood ratio selection procedure of Jeng et al. (2010). Among the 64 copy number variants, 20 did not reach the theoretical threshold of {2 log(TL)}^{1/2} = 5·22 in any of the 674 samples, indicating a loss of power when detecting copy number variants based on single-sample analysis. Of these 20 copy number variants missed by the single-sample analysis, 10 overlap with copy number variants in the genomic variants database (Zhang et al., 2006) and were reported in Redon et al. (2006). As an example, the fourth and fifth panels of Figure 2 show the likelihood ratio statistics and the observed log R-ratios for two copy number variants identified by the proportion adaptive selection, where no sample passes the theoretical threshold for single-sample analysis.
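The single-sample threshold quoted above is a direct computation from the chromosome 1 dimensions given in the text (T = 40,929 markers, L = 20):

```python
import math

# Theoretical single-sample threshold {2 log(TL)}^{1/2} from the text,
# with T = 40,929 markers on chromosome 1 and maximum segment length L = 20.
T, L = 40929, 20
threshold = math.sqrt(2 * math.log(T * L))
print(round(threshold, 2))  # prints 5.22
```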
ACKNOWLEDGMENT
This research is supported by NIH grants and an NSF grant. We thank the reviewer and the associate editor for many insightful comments.
APPENDIX - PROOFS
Proof of Theorem 1
Based on results for the extreme values of a standardized uniform empirical process, we have for any t free of N,
as N → ∞, which implies
for any fixed t and sufficiently large N, where C1 is a constant. This combined with the choice of λT,N implies that
Therefore,
where the third inequality uses the fact that e^{-x} ≥ 1 − x. The result follows from the condition C0 > 1.
Proof of Theorem 2
By the calibration of πk and the condition on α0, the carrier proportion remains N^{−βk} when the (α0 − 1) smallest p-values are excluded in (11). By the construction of , it is enough to show that for any with rk > ρ+(βk, τk) for 1/2 ≤ βk < 1 or rk < ρ−(βk, τk) for 0 ≤ βk < 1/2,
| (A1) |
Defining the standardized empirical process as
where is the survival function of a standard normal random variable and
we can rewrite VN(Ik) defined in (11) as
| (A2) |
For any fixed t,
| (A3) |
The key step of the proof is to find a t value such that WN,Ik(t) > λT,N with large probability. Define
| (A4) |
When τk < 1, 1/2 ≤ βk < 1 and rk > ρ+(βk, τk), we have and βk ≥ 1 − m(τk) if and only if
Then,
| (A5) |
Applying calibrations of μk|Ik|1/2 for 1/2 ≤ βk < 1 and 0 ≤ βk < 1/2 respectively, we have,
Combining the above with (A4), (A5) and the fact that
we have
| (A6) |
Then it can be shown that for some C > 0, given rk > ρ+(βk, τk) for 1/2 ≤ βk < 1 or rk < ρ−(βk, τk) for 0 ≤ βk < 1/2. Further, by the condition N^C ≫ log T for any C > 0, we have
| (A7) |
By Chebyshev’s inequality, (A3), and (A7),
Applying (A4) and condition rk > ρ+(βk, τk) for 1/2 ≤ βk < 1 or rk < ρ−(βk, τk) for 0 ≤ βk < 1/2 to the above, we have
| (A8) |
where C(rk, βk, τk) is as in (21). By combining (A8) and the fact that
(A1) is verified.
Contributor Information
X. Jessie Jeng, Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695, USA.
T. Tony Cai, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA.
Hongzhe Li, Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, 19104, USA.
References
- Anderson TW, Darling DA. Asymptotic theory of certain goodness-of-fit criteria based on stochastic processes. Annals of Mathematical Statistics. 1952;23:193–212.
- Cai TT, Jeng XJ, Jin J. Optimal detection of heterogeneous and heteroscedastic mixtures. Journal of the Royal Statistical Society: Series B. 2011;73:629–662.
- Diskin SJ, Hou C, Glessner JT, Attiyeh EF, Laudenslager M, Bosse K, Cole K, Moss Y, Wood A, Lynch JE, Pecor K, Diamond M, Winter C, Wang K, Kim C, Geiger EA, Mcgrady PW, Blakemore AIF, London WB, Shaikh TH, Bradfield J, Grant SFA, Li H, Devoto M, Rappaport ER, Hakonarson H, Maris JM. Copy number variation at 1q21.1 associated with neuroblastoma. Nature. 2009;459:987–991. doi: 10.1038/nature08035.
- Donoho D, Jin J. Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics. 2004;32:962–994.
- Jeng XJ, Cai TT, Li H. Optimal sparse segment identification with application in copy number variation analysis. Journal of the American Statistical Association. 2010;105:1156–1166. doi: 10.1198/jasa.2010.tm10083.
- Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–572. doi: 10.1093/biostatistics/kxh008.
- Redon R, Ishikawa S, Fitch K, Feuk L, Perry G, Andrews T, Fiegler H, Shapero M, Carson A, Chen W, Cho E, Dallaire S, Freeman J, Gonzlez J, Gratacs M, Huang J, Kalaitzopoulos D, Komura D, Macdonald J, Marshall C, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville M, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad D, Estivill X, Tyler-Smith C, Carter N, Aburatani H, Lee C, Jones K, Scherer S, Hurles M. Global variation in copy number in the human genome. Nature. 2006;444:444–454. doi: 10.1038/nature05329.
- Shorack GR, Wellner JA. Empirical Processes with Applications to Statistics. Philadelphia: SIAM; 2009.
- Siegmund DO, Yakir B, Zhang NR. Detecting simultaneous variant intervals in aligned sequences. Annals of Applied Statistics. 2010;5:645–668.
- Wang K, Li M, Hadley D, Liu R, Glessner J, Grant S, Hakonarson H, Bucan M. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research. 2007;17:1665–1674. doi: 10.1101/gr.6861907.
- Zhang F, Gu W, Hurles M, Lupski J. Copy number variation in human health, disease and evolution. Annual Review of Genomics and Human Genetics. 2009;10:451–481. doi: 10.1146/annurev.genom.9.081307.164217.
- Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenetic and Genome Research. 2006;115:205–214. doi: 10.1159/000095916.
- Zhang NR, Siegmund DO, Ji H, Li J. Detecting simultaneous change-points in multiple sequences. Biometrika. 2010;97:631–645. doi: 10.1093/biomet/asq025.