Author manuscript; available in PMC: 2014 Jul 1.
Published in final edited form as: Stat Sin. 2013 Jul 1;23(4):1553–1572. doi: 10.5705/ss.2012.018s

Multiple Change-Point Detection via a Screening and Ranking Algorithm

Ning Hao 1, Yue Selena Niu 2, Heping Zhang 3
PMCID: PMC3902887  NIHMSID: NIHMS423888  PMID: 24489450

Abstract

Let Y1, …, Yn be a sequence whose underlying mean is a step function with an unknown number of steps and unknown change points. The detection of the change points, namely the positions where the mean changes, is an important problem in such fields as engineering, economics, climatology and bioscience. This problem has attracted a lot of attention in statistics, and a variety of solutions have been proposed and implemented. However, there is scant literature on the theoretical properties of those algorithms. Here, we investigate a recently developed algorithm called the Screening and Ranking Algorithm (SaRa). We characterize the theoretical properties of the SaRa and show its superiority over other commonly used algorithms. In particular, we develop a false discovery rate approach to the multiple change-point problem and show a strong sure coverage property for the SaRa.

Key words and phrases: Change-point detection, copy number variation, false discovery rate, high dimensional data, screening and ranking algorithm

1. Introduction

Studies of change-point detection date back to the 1950s. In the past half century, the topic has attracted a great deal of attention in such fields as statistics, engineering, economics, climatology and bioscience. Specifically, given a sequence of ordered or time dependent random variables, denoted by Y1, …, Yn, a change point is a position or time at which the structure of this sequence changes; the goal of change-point detection is to estimate the locations of change points and provide an assessment of accuracy. Thus, in climate data series, a change point is a time at which the climate changes dramatically, a threshold which is of interest to climatologists; in financial econometrics, change-point analysis can help identify the directions in the market or economy; in engineering, for a continuous production process, it is important to find out if there is a point where the quality of the products begins to deteriorate. A recent development is its application in genetics to the detection of DNA copy number variation. The DNA copy number of a region is the number of copies of the genomic DNA. Copy number variation (CNV) usually refers to deletion or duplication of a region of DNA sequences. Recent studies have shown that CNVs account for an abundance of genetic variation and may influence phenotypic differences. It is then a fundamental problem to identify the CNVs in genetics; see Zhang (2010) for a thorough introduction to the application of change-point models in CNV detection. Finding CNVs in the massive data produced by modern DNA array technologies is a recent development in change-point problems. The main challenge here is finding multiple change points accurately and efficiently in an expansive sequence whose length is typically in the hundreds of thousands in SNP genotyping data.

A normal mean, multiple change-point model has played an essential role in the statistical analysis of the CNV problem (Olshen et al. (2004), Huang et al. (2005), Zhang and Siegmund (2007), Tibshirani and Wang (2008), Jeng et al. (2010)). Let

$$y_i = \theta_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2), \qquad i = 1, \ldots, n, \tag{1.1}$$

and assume that

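$$\theta_1 = \theta_2 = \cdots = \theta_{\tau_1} \neq \theta_{\tau_1+1} = \cdots = \theta_{\tau_2} \neq \theta_{\tau_2+1} = \cdots = \theta_{\tau_J} \neq \theta_{\tau_J+1} = \cdots = \theta_n,$$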

where τ = (τ1, …, τJ )T is the location vector of the change points. We call model (1.1) with piecewise constant mean θ a normal mean change-point model.
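As a hedged illustration of model (1.1), the following minimal sketch draws one sequence with a piecewise constant mean. The function name, the argument format (a list of change points plus one mean level per segment), and the example values are assumptions made for this sketch, not part of the original study.

```python
import random

def simulate_step_sequence(n, change_points, levels, sigma=1.0, seed=0):
    """Draw y_i = theta_i + eps_i with eps_i i.i.d. N(0, sigma^2).

    change_points: sorted 1-based positions tau_1 < ... < tau_J; the mean
    switches level right after each tau_k (hypothetical input format).
    levels: the J + 1 segment means, in order.
    """
    if len(levels) != len(change_points) + 1:
        raise ValueError("need exactly one more level than change points")
    rng = random.Random(seed)
    bounds = [0] + list(change_points) + [n]
    theta = []
    for k, level in enumerate(levels):
        # segment k covers positions bounds[k] + 1, ..., bounds[k + 1]
        theta.extend([level] * (bounds[k + 1] - bounds[k]))
    y = [t + rng.gauss(0.0, sigma) for t in theta]
    return theta, y

# Illustrative values only: two change points at positions 4 and 9.
theta, y = simulate_step_sequence(n=12, change_points=[4, 9],
                                  levels=[0.0, 2.0, 0.0], sigma=0.5)
```

Here theta[3] and theta[4] (positions 4 and 5) differ, matching the convention that the mean at position τ differs from the mean at position τ + 1.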

In CNV detection, this model plays an essential role in Olshen et al. (2004), Huang et al. (2005), and Zhang and Siegmund (2007). Some authors have considered more restricted models: Tibshirani and Wang (2008) assume that θ itself is sparse; Jeng et al. (2010) assume, in addition, that the nonzero segments of θ are short. While these additional assumptions are reasonable, the model considered here is more general. Specifically, we only assume that n is large and any two change points are “not too close to each other;” this will be clarified later. We take J ≪ n, so while the data are high dimensional in terms of the potential locations of the change points, the number of true change points is limited. Thus, (1.1) is a high-dimensional sparse model with a sequential structure.

For the multiple change-point problem, the number of change points and their locations are to be estimated. Popular multiple change-point detection tools include exhaustive search (Yao (1988), Yao and Au (1989)), binary segmentation (Vostrikova (1981), Olshen et al. (2004)), ℓ1 penalization (Huang et al. (2005), Tibshirani and Wang (2008)), and the hidden Markov model approach (Wang et al. (2007), Lai et al. (2008)), among others. See Lai and Xing (2011) for multiple change-point models for the exponential family and for a comprehensive review of related methods.

Recently, Niu and Zhang (2010) proposed the Screening and Ranking algorithm (SaRa) as an alternative approach for change-point detection. They observed that when determining whether there is a change at the jth position, the information at positions far from j is rarely useful. Therefore, it is more efficient to concentrate on a local neighborhood first, say the interval (j − h, j + h), when making decisions about j and nearby points. Because it avoids complex optimization and iterative algorithms, the SaRa is simple to implement with complexity O(n), which makes it suitable for analyzing high-throughput data. Besides computational efficiency, Niu and Zhang (2010) showed that, under mild conditions, the SaRa satisfies a sure coverage property that implies that the SaRa can estimate J and τ consistently. The SaRa, to the best of our knowledge, is the only algorithm known to combine both computational simplicity and consistency. In this paper, we derive deeper and broader theoretical properties for the SaRa that are useful for our understanding of its performance.

First of all, we propose a novel false discovery rate (FDR) approach to change-point detection. The multiple change-point problem can be naturally stated as a multiple testing problem. Although many change-point detection tools have been proposed recently, few authors have examined the issue of FDR. Tibshirani and Wang (2008), and Efron and Zhang (2011) studied the FDR on a normal mean change-point model with sparse mean vector θ, and applied their theories to the CNV problem. Our theory is different. We do not assume the sparsity of θ and focus on the FDR of change-point locations. Specifically, we concentrate on β = (β1, …, βn−1)T, where βj = θj+1 − θj, whose support corresponds to the set of true change-point locations. We demonstrate how to establish a well-defined multiple testing framework for the change-point problem (1.1) and assess the FDR for the SaRa estimator. In addition, we show how the FDR control procedure helps us select tuning parameters in the SaRa procedure.

Secondly, we characterize the convergence rate of the location estimator τ̂ for the SaRa. The sure coverage property (Niu and Zhang (2010)) states that the SaRa estimator for the location vector satisfies, with probability tending to one, ||τ̂ − τ|| < h, where h is a tuning parameter and is of order O(log n) under some reasonable conditions. Here we give a sharper convergence rate and show that ||τ̂ − τ|| = O_P(1), where the convergence rate is the same as the best result known for the single change-point case (Csörgö and Horváth (1997)). Moreover, we show that our assumptions cannot be weakened further except for the constant 32 in condition (4.8), implying that the SaRa is a nearly optimal procedure.

The SaRa can be easily generalized to solve more general change-point problems. For example, we derive sure coverage properties for some non-normal cases, although a comprehensive study of non-normal data requires further effort.

The rest of this paper is organized as follows. In Section 2, we briefly recall the SaRa procedure and introduce an FDR control approach to multiple change-point detection based on the SaRa. The numerical studies of the FDR approach are demonstrated in Section 3. In Section 4, a strong sure coverage property of the SaRa is verified to illustrate the optimality of the SaRa, which is followed by some final remarks in Section 5. All proofs are in the Appendix.

2. False discovery rate for change-point detection

From the hypothesis testing perspective, the change-point problem can be viewed as a multiple testing problem by testing every data point as a potential change point. The false discovery rate (FDR) approach to multiple testing problems has been studied extensively since the seminal paper of Benjamini and Hochberg (1995). However, the multiple testing problem derived from change-point detection lies beyond the classical framework: it has a distinctive correlation structure as well as a sequential structure. We illustrate our approach to this problem.

2.1 Change-point detection as a multiple testing problem

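Let Y = (Y1, …, Yn)T be a sequence of independent random variables with probability distribution functions F1, …, Fn, respectively. The multiple change-point problem can be stated as the hypothesis testing problem

$$H_0: F_1 = F_2 = \cdots = F_n \quad \text{vs} \quad H_1: F_1 = F_2 = \cdots = F_{\tau_1} \neq F_{\tau_1+1} = \cdots = F_{\tau_2} \neq F_{\tau_2+1} = \cdots = F_{\tau_J} \neq F_{\tau_J+1} = \cdots = F_n, \tag{2.1}$$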

where J is an unknown number of change points and 0 < τ1 < · · · < τJ < n are their unknown locations. The purpose is to make a decision between two hypotheses and, if the alternative is supported, to estimate the number of change points J and the location vector τ = (τ1, …, τJ )T. Although this single hypothesis test has been employed to formulate the multiple change-point problem in classical works, it does not address the accuracy of estimation of the change points here.

The testing problem (2.1) is naturally decomposed into the sequence of hypotheses

$$H_0(j): j \text{ is not a change point} \quad \text{vs} \quad H_1(j): j \text{ is a change point}, \tag{2.2}$$

where j = 1, 2, …, n − 1.

For the normal mean model where Fj ~ N(θj, σ²), H0(j) and H1(j) correspond to βj ≡ θj+1 − θj = 0 and βj ≠ 0, respectively. One can use zj ≡ yj+1 − yj as the statistic for testing H0(j) against H1(j). However, its power may be limited by the fact that it does not fully utilize the sparse and sequential structures. Observing that the change points are apart from each other in many applications, we reformulate the hypotheses.

First, suppose that the minimal distance between two change points is at least h and modify (2.2) to

$$H_0(j): F_{j+1-h} = \cdots = F_{j+h} \quad \text{vs} \quad H_1(j): F_{j+1-h} = \cdots = F_j \neq F_{j+1} = \cdots = F_{j+h}. \tag{2.3}$$

Second, it is impossible to recover the true change-point locations exactly in any reasonable asymptotic setting; see Section 4 for details on this aspect. For an FDR theory, we cannot treat the testing problems in (2.3) independently in terms of “true or false” as in the classical framework, so a relaxed version of FDR is introduced.

Definition 1

Suppose that τ̂ = (τ̂1, …, τ̂Ĵ)T is the change-point location estimator from some procedure such that min_{1≤i≤Ĵ} {τ̂i − τ̂i−1} ≥ 2h. Then H0(j) is rejected for all j ∈ τ̂. We define τ̂i as a true positive if there exists a true change point τj such that |τj − τ̂i| < h; otherwise, τ̂i is a false positive. The false discovery proportion (FDP) is the number of false positives divided by Ĵ. The FDR is defined as E(FDP).
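The counting rule in Definition 1 can be sketched in a few lines. The function name and the example positions below are illustrative assumptions; the convention that the FDP equals 0 when nothing is rejected is the usual one in the FDR literature and is also an assumption here, since the definition above does not cover that case.

```python
def false_discovery_proportion(estimates, truth, h):
    """Count true/false positives under Definition 1.

    An estimate is a true positive when some true change point lies
    strictly within distance h of it; otherwise it is a false positive.
    """
    true_pos = sum(1 for est in estimates
                   if any(abs(est - tau) < h for tau in truth))
    false_pos = len(estimates) - true_pos
    # Assumed convention: FDP = 0 when there are no rejections.
    fdp = false_pos / len(estimates) if estimates else 0.0
    return true_pos, false_pos, fdp

# Illustrative values: two estimates fall near true change points, one does not.
tp, fp, fdp = false_discovery_proportion([98, 205, 400], [100, 200], h=10)
```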

For such methods as CBS and ℓ1 penalization, it is not straightforward to evaluate the quantity τ̂i − τi, or the FDR. In the next two subsections, we recall the SaRa and establish an FDR theory for the normal mean change-point model via the SaRa.

2.2 The screening and ranking algorithm

The SaRa was proposed by Niu and Zhang (2010) to detect change points in the normal mean model (1.1). For a position j, they considered the locally defined statistic,

$$D_h(j) = \left( \sum_{k=j+1}^{j+h} Y_k - \sum_{k=j-h+1}^{j} Y_k \right) \Big/ h, \tag{2.4}$$

where h ≪ n. Intuitively, if j is a local maximizer of |Dh(·)| and |Dh(j)| is quite large, it is likely that there is a change point at or around j. Therefore, it is reasonable to consider all local maximizers of |Dh(·)| and search for change points among them. First, the SaRa calculates Dh(·) for all positions and finds all h-local maximizers of |Dh(·)|. Here, j is an h′-local maximizer of |Dh(·)| if |Dh(j)| ≥ |Dh(k)| for all k ∈ (j − h′, j + h′). One can take h′ different from h, although h′ = h was used as the default setting in Niu and Zhang (2010). Second, the SaRa estimator is obtained by a thresholding rule, |Dh(·)| > λ, that is applied to all local maxima. Thus

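$$\hat{\mathcal{J}}_{h,\lambda} = \{\hat\tau_i : \hat\tau_i \text{ is a local maximizer of } |D_h(\cdot)|, \text{ and } |D_h(\hat\tau_i)| > \lambda\}$$

is the SaRa location estimator. Let τ̂ = (τ̂1, …, τ̂Ĵ)T, where τ̂1 < τ̂2 < · · · < τ̂Ĵ and Ĵ = |$\hat{\mathcal{J}}_{h,\lambda}$|.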
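The two steps just described (screening for h-local maximizers of |Dh(·)|, then thresholding at λ) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, positions are 1-based, and only positions h ≤ j ≤ n − h, where both windows fit, are scored. Cumulative sums give all Dh(j) in O(n); the neighbourhood scan below is written for clarity and costs O(nh).

```python
from itertools import accumulate

def local_diagnostic(y, h):
    """Return {j: D_h(j)} for 1-based positions j = h, ..., n - h."""
    n = len(y)
    csum = [0.0] + list(accumulate(y))  # csum[k] = y_1 + ... + y_k
    return {j: ((csum[j + h] - csum[j]) - (csum[j] - csum[j - h])) / h
            for j in range(h, n - h + 1)}

def sara(y, h, lam):
    """Keep h-local maximizers of |D_h| whose value exceeds lam."""
    d = local_diagnostic(y, h)
    selected = []
    for j, dj in d.items():
        # open neighbourhood (j - h, j + h), restricted to scored positions
        neighbours = (abs(d[k]) for k in range(j - h + 1, j + h) if k in d)
        if abs(dj) > lam and abs(dj) >= max(neighbours):
            selected.append(j)
    return selected

# Noise-free toy sequence with a single jump after position 10.
toy = [0.0] * 10 + [1.0] * 10
found = sara(toy, h=3, lam=0.5)
```

On this noise-free toy sequence the only position passing both steps is the true change point; with noisy data, ties and boundary handling would need more care than this sketch provides.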

An important observation is that the local statistic Dh(j) employed in the SaRa is the natural test statistic for the multiple testing problem (2.3) when the Fj ’s are normal. If we define ηj(h)=E(Dh(j)), then H0 and H1 correspond to ηj(h)=0 and ηj(h) is a local maximum, respectively. We illustrate how to combine the SaRa with our extended FDR criterion to establish an FDR theory below.

2.3 A false discovery rate approach via the SaRa

First, we restate the SaRa procedure from the viewpoint of hypothesis testing for the general setting (2.3). Let D(j) be a test statistic that can be used to test (2.3), and p(j) be the corresponding P-value when available. Without loss of generality, we assume that larger values of D tend to support the alternative hypothesis. The SaRa is as follows. In Step 1, calculate the test statistic D(j), or the P-value p(j) for each j. Then pick out all local maximizers of D(j) or, equivalently, local minimizers of p(j). Here, for technical reasons, we temporarily use the h′-local extremizer for h′ = 2h. For example, we call j* a local minimizer of p(·) if

$$p(j^*) \le p(j) \quad \text{for all } j \in (j^* - 2h, j^* + 2h). \tag{2.5}$$

We denote the set of the local extremizers by $\mathcal{LM}$. In Step 2, the SaRa estimator $\hat{\mathcal{J}}$ = {τ̂1 < · · · < τ̂Ĵ} ⊂ $\mathcal{LM}$ for the locations of change points is defined by a thresholding rule

$$D(j) > \lambda \quad \text{or} \quad p(j) < p^*.$$

We can rank the elements in $\mathcal{LM}$ by their D values or P-values and get a solution path.

From now on, we use the thresholding rule p(j) < p* to conform with the literature on FDR theory, although calculating D may be more convenient in practice. An advantage of the SaRa is that we can easily control the minimal distance min_i {τ̂i − τ̂i−1} of the SaRa estimator τ̂ by choosing a proper neighborhood when defining the local extremizer. Under (2.5), for any j1, j2 ∈ $\mathcal{LM}$ with j1 ≠ j2, we have |j1 − j2| > 2h. Because the P-values are derived from local statistics, p(j1) and p(j2) are independent. Therefore, we obtain a sequence of independent P-values {p(j) | j ∈ $\mathcal{LM}$}. When the null distribution, F0, of the P-value at a local minimizer is known or can be estimated accurately, the standard FDR control procedure can be applied to the P-value sequence directly. Specifically, suppose F0 is known and consider the set of modified P-values

$$\{F_0^{-1}(p(j)) \mid j \in \mathcal{LM}\} = \{\tilde p_{(1)} < \tilde p_{(2)} < \cdots < \tilde p_{(m)}\},$$

where m = |$\mathcal{LM}$|. Under the null hypothesis H0(j), $F_0^{-1}(p(j)) \sim \mathrm{Uniform}(0, 1)$. By the Benjamini–Hochberg procedure (Benjamini and Hochberg (1995)), for a target FDR q*, let k be the largest i for which $\tilde p_{(i)} \le \frac{i}{m} q^*$, and then reject all H0(j) corresponding to $\tilde p_{(1)}, \ldots, \tilde p_{(k)}$. In other words, to determine the SaRa estimator from the set $\mathcal{LM}$, a thresholding rule

$$p(j) \le p^* = F_0(\tilde p_{(k)}) \tag{2.6}$$

is used in order to control the FDR at target rate q*. Following the results in Benjamini and Hochberg (1995), and Benjamini and Yekutieli (2001) directly, we have our result.

Theorem 1

If the SaRa procedure with the thresholding rule (2.6) is applied, then the FDR of Definition 1 is controlled at a level less than or equal to $\frac{n-J}{n} q^* \le q^*$.
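The Benjamini–Hochberg step-up rule that underlies the threshold (2.6) can be sketched as below. This is the generic procedure applied to any list of (already calibrated) P-values; the function name and the example values are illustrative assumptions.

```python
def benjamini_hochberg(pvalues, q):
    """Return the indices rejected by the BH step-up rule at target FDR q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank  # remember the largest rank satisfying the bound
    return sorted(order[:k])

# Step-up behaviour: the smallest P-value misses its own bound (0.025),
# but the second one meets its bound (0.05), so both are rejected.
rejected = benjamini_hochberg([0.04, 0.03], q=0.05)
```

The example shows why the rule searches for the largest qualifying rank rather than stopping at the first failure.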

By definition, F0 is the distribution of p(j) under H0(j) provided p(j) is a local minimizer. Given a change-point model, F0 depends only on the bandwidth h in (2.4) and (2.5). Although it is usually unknown, the distribution F0 can be estimated in many situations. For the normal mean change-point model, we can generate a long sequence of i.i.d. N(0, σ²) random variables and find the empirical distribution F̂0. In practice, σ² is unknown but can be estimated accurately because J ≪ n. Without normality, we can use permutation methods. Let ν be the index set {1, 2, …, n} and π be a permutation of ν. Take yπ(ν) = (yπ(1), …, yπ(n))T. If θ is sparse, the empirical distribution of the local minima of p(j) can be calculated from a sequence of permutations yπ1(ν), …, yπN(ν). Otherwise, we can estimate θ by local regression and apply permutation to the residuals.
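For the normal case, the simulation-based estimate of the null distribution can be sketched as follows: generate an i.i.d. standard normal sequence, compute the two-sided normal P-value of Dh(j) at every position, keep the values at 2h-local minimizers as in (2.5), and form their empirical distribution function. The sequence length, bandwidth, seed, and function names are assumptions chosen for illustration; a practical run would use a much longer sequence.

```python
import math
import random
from bisect import bisect_right
from itertools import accumulate

def null_local_min_pvalues(n, h, seed=0):
    """P-values at 2h-local minimizers for one i.i.d. N(0, 1) sequence."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    csum = [0.0] + list(accumulate(y))
    # |D_h(j)| divided by its null standard deviation sqrt(2 / h)
    z = {j: abs(csum[j + h] - 2.0 * csum[j] + csum[j - h]) / math.sqrt(2.0 * h)
         for j in range(h, n - h + 1)}
    pvals = []
    for j, zj in z.items():
        window = (z[k] for k in range(j - 2 * h + 1, j + 2 * h) if k in z)
        if zj >= max(window):  # largest |D_h| means smallest P-value
            pvals.append(math.erfc(zj / math.sqrt(2.0)))  # 2 * (1 - Phi(z))
    return sorted(pvals)

def empirical_cdf(sample):
    """Return x -> fraction of the sorted sample that is <= x."""
    return lambda x: bisect_right(sample, x) / len(sample)

null_p = null_local_min_pvalues(n=5000, h=10)
F0_hat = empirical_cdf(null_p)
```

Because each retained value is the smallest P-value in its neighbourhood, the simulated values concentrate near zero, which is why they cannot be treated as uniform without calibration.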

3. Numerical study

The contemporary genome-wide SNP genotyping array techniques, which can measure half a million SNPs along the whole genome, offer a more sensitive approach to CNV detection, compared to the aCGH techniques. For each subject, the SNP genotyping data usually consist of a sequence of the measurements of Log R ratios. The segments with concentrated high or low Log R ratios correspond to gains or losses of copy numbers. See Peiffer et al. (2006) for more details. The SaRa was established to analyze such data. The numerical properties of the SaRa have been investigated intensively by simulation and for data examples in Niu and Zhang (2010), so we focus on the FDR theory in this section.

3.1. An application to SNP genotyping data

We illustrate the SaRa using SNP genotyping data for a father-mother-offspring trio produced by the Illumina 550K platform, which is available at http://www.openbioinformatics.org/penncnv/.

For each subject, the Log R ratios along Chromosomes 3, 11, and 20 are included in the data set. There are 37768, 27272, and 14296 SNPs on Chromosomes 3, 11, and 20, respectively. In Figure 3.1, we plot the sequence of Log R ratios along Chromosome 11 for the father. Since there are 27272 points, it is very difficult to eyeball the changes. In Figure 3.2, we zoom in on a short interval where a CNV was detected by the SaRa.

Figure 3.1. Plot of Log R ratios along Chromosome 11 of the subject father.

Figure 3.2. The CNVs found by the SaRa.

Before we applied the SaRa, we found the Log R ratios to be approximately normal. We chose the bandwidth parameter h = 7 and threshold p* by controlling the FDR to 5%, 10%, and 15%, respectively. Specifically, we calculated the local statistic Dh(·) and $p(j) = 2\left(1 - \Phi\left(\frac{|D_7(j)|}{\hat\sigma\sqrt{2/h}}\right)\right)$, where σ̂ can be easily estimated because of the sparsity of the change points. Note that under the normality assumption the distribution F0 is independent of σ². Therefore, F0 can be approximated empirically by the distribution of the local minimizers in a long i.i.d. standard normal sequence. Then we calculated the corrected P-values and applied the Benjamini–Hochberg procedure to determine the threshold. The numbers of change points detected by the SaRa on Chromosome 11 for the father, mother and offspring are listed in Table 3.1. Because the CNVs are believed to be short, spanning dozens of SNPs for our data (Zhang et al. (2009), Jeng et al. (2010)), the intervals that are flanked by two relatively near change points are more likely to be CNVs. On the other hand, it is reasonable to treat isolated change points as false positives. In Table 3.1, we report the number of changes detected by the SaRa, requiring each suggested CNV to be flanked by two change points within 200 SNPs. Therefore, the number of suggested CNVs in Table 3.1 is smaller than the number of adjacent change points. In particular, on Chromosome 11 of the father, the SaRa suggested two short CNVs as plotted in Figure 3.2. The CNV in Figure 3.2(a) was also detected by PennCNV, although the one in Figure 3.2(b) was not detected before.
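The two-sided P-value used above follows from the fact that, with no change point in the window, Dh(j) is normal with mean 0 and variance 2σ²/h. A minimal sketch is given below; the function name is an assumption, and the standard deviation is passed in as an argument because the text does not specify which estimator of σ was used.

```python
import math

def two_sided_pvalue(d, h, sigma):
    """p = 2 * (1 - Phi(|d| / (sigma * sqrt(2 / h)))) for a local statistic d."""
    z = abs(d) / (sigma * math.sqrt(2.0 / h))
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)) for z >= 0
    return math.erfc(z / math.sqrt(2.0))

# Illustrative check with h = 7: a statistic 1.96 null standard deviations
# away from zero has a P-value close to 0.05.
p_example = two_sided_pvalue(1.96 * math.sqrt(2.0 / 7.0), h=7, sigma=1.0)
```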

Table 3.1.

Number of change points and CNVs detected by the SaRa on Chromosome 11 of the father, mother, and offspring. (A CNV is suggested only when two adjacent change points are within 200 SNPs).

             q = 0.05              q = 0.10              q = 0.15
Source       Change points  CNV    Change points  CNV    Change points  CNV
father       2              1      9              2      9              2
mother       4              1      5              1      5              1
offspring    3              1      3              1      4              1

3.2 A simulation example on the FDR control

We tested our theory on FDR control through an example. Consider model (1.1) with n = 30000 and σ = 1. We set J = 50 and drew 50 change points uniformly between 1 and 30000 that were multiples of 5, to get τ = (650, 855, …, 29630)T. Here L = min(τj+1 − τj) = 15. We designed the mean vector θ by letting θi = 0 when τ2j < i ≤ τ2j+1, and θi = δ otherwise, where δ = 1.5 or 3.
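A design of this kind can be sketched as follows. The random draw below will not reproduce the particular change points reported above; the seed, the function names, and the convention that the first segment has mean 0 are assumptions for illustration.

```python
import random

def draw_change_points(n, count, step=5, seed=0):
    """Draw distinct change points among the multiples of `step` below n."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(step, n, step), count))

def alternating_mean(n, change_points, delta):
    """Mean vector that toggles between 0 and delta after each change point."""
    theta, level, nxt = [], 0.0, 0
    cps = sorted(change_points)
    for i in range(1, n + 1):          # 1-based positions
        theta.append(level)
        if nxt < len(cps) and i == cps[nxt]:
            level = delta - level      # switch level after position tau
            nxt += 1
    return theta

tau = draw_change_points(30000, 50)
theta = alternating_mean(30000, tau, delta=1.5)
min_gap = min(b - a for a, b in zip(tau, tau[1:]))
```

The realised minimal spacing min_gap plays the role of L and should be compared with the bandwidth h before interpreting any simulation results.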

We set the SaRa with h = 10, 20, and 30, and chose threshold p* by controlling FDR via the Benjamini-Hochberg procedure BH(q) with q = 0.05, 0.10, 0.15. We counted τ̂k as a false positive if there was no τj such that

$$|\hat\tau_k - \tau_j| < 10. \tag{3.1}$$

Otherwise, τ̂k was counted as a true positive. We present our results in Table 3.2. We see that the average false discovery proportion was close to the target FDR, suggesting that our methodology worked well. In particular, when the signal was strong (δ = 3), the SaRa with bandwidth h = 10 achieved the best performance. The SaRa with bandwidths h = 20 and 30 performed well also. However, because there were two true change points at positions 11070 and 11085, it was difficult to detect both using a large bandwidth. When the signal was weak (δ = 1.5), the SaRa with h = 10 was less powerful. Although the FDR could be controlled, only a small portion of the true change points could be detected. The SaRa with larger bandwidths was more powerful. When the bandwidth was too large (h = 30), the FDP was somewhat greater than the target FDR. There were two reasons. First, our assumption h < L was not satisfied for large h; second, the rule in (3.1) for counting true positives was too strict.

Table 3.2.

The average estimated number of change points Ĵ, true positives (TP), and false discovery proportion (FDP). The results were based on 100 replications.

                   q = 0.05                  q = 0.10                  q = 0.15
Setting            Ĵ       TP      FDP       Ĵ       TP      FDP       Ĵ       TP      FDP
δ = 1.5, h = 10    3.700   3.520   0.4%      20.860  19.130  7.6%      27.690  23.640  13.6%
δ = 1.5, h = 20    45.730  43.600  4.5%      50.710  45.600  9.9%      54.620  46.560  14.5%
δ = 1.5, h = 30    50.580  47.130  6.7%      53.800  47.380  11.7%     56.740  47.460  16.1%
δ = 3, h = 10      51.500  49.920  3.0%      53.680  49.970  6.7%      57.040  49.980  12.1%
δ = 3, h = 20      50.380  49.070  2.5%      52.820  49.070  7.0%      55.000  49.070  10.6%
δ = 3, h = 30      50.770  48.650  4.1%      53.000  48.650  8.0%      55.490  48.650  12.1%

4. Optimality of the screening and ranking algorithm

We start this section with a brief summary of results on change-point detection when the number of change points is at most 2.

Among all change-point problems, the simplest case is the detection of the mean change for a sequence of independent Gaussian random variables with common variance. This has been investigated by Page (1955, 1957), Chernoff and Zacks (1964), Gardner (1969), and Sen and Srivastava (1975), among others. If Yi ~ N(θi, σ²) for i = 1, …, n, the change-point problem can be stated as the hypothesis test:

$$H_0: \theta_1 = \theta_2 = \cdots = \theta_n \quad \text{vs} \quad H_1: \theta_1 = \cdots = \theta_j \neq \theta_{j+1} = \cdots = \theta_n.$$

For simplicity, we assume σ2 is known and set to 1. We refer to Csörgö and Horváth (1997), Chen and Gupta (2000), and the references therein for the unknown variance and non-Gaussian cases. Among all methods, the likelihood ratio test is popular. Consider the change point at j, and let Λj denote the likelihood ratio statistic. Then,

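$$-2 \log \Lambda_j = (\bar Y_{j+} - \bar Y_{j-})^2 \big/ \left[ j^{-1} + (n-j)^{-1} \right], \tag{4.1}$$

where $\bar Y_{j-} = \sum_{k=1}^{j} Y_k / j$ and $\bar Y_{j+} = \sum_{k=j+1}^{n} Y_k / (n-j)$.

When j is unknown, a commonly used test statistic is $T_1 = \max_{1 \le j \le n-1} (-2 \log \Lambda_j)$, and

$$\hat j = \operatorname*{argmax}_{1 \le j \le n-1} (-2 \log \Lambda_j) \tag{4.2}$$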

serves as the location estimator.
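The statistic (4.1) and the estimator (4.2) can be computed for all j in one pass with cumulative sums. The sketch below assumes σ² = 1, as in the text, and uses hypothetical names; ties are resolved in favour of the smallest j, which is a choice made for this sketch.

```python
from itertools import accumulate

def lr_change_point(y):
    """Return (j_hat, T1): the maximizer and maximum of -2 log Lambda_j."""
    n = len(y)
    csum = [0.0] + list(accumulate(y))  # csum[k] = y_1 + ... + y_k
    best_j, best_stat = None, -1.0
    for j in range(1, n):
        mean_left = csum[j] / j
        mean_right = (csum[n] - csum[j]) / (n - j)
        stat = (mean_right - mean_left) ** 2 / (1.0 / j + 1.0 / (n - j))
        if stat > best_stat:  # strict inequality keeps the first maximizer
            best_j, best_stat = j, stat
    return best_j, best_stat

# Noise-free toy sequence: six zeros followed by four twos.
j_hat, t1 = lr_change_point([0.0] * 6 + [2.0] * 4)
```

For the toy sequence the maximum is attained at the true change point j = 6, where the statistic equals 4 / (1/6 + 1/4) = 9.6.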

The distribution of the statistic T1 is quite complicated. However, as in Csörgö and Horváth (1997), under the null hypothesis the limiting distribution of T1(n) satisfies

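$$\lim_{n \to \infty} P\left\{ a_n \sqrt{T_1(n)} - b_n \le t \right\} = \exp(-2e^{-t}) \quad \text{for all } t, \tag{4.3}$$

where $a_n = \sqrt{2 \log\log n}$ and $b_n = 2 \log\log n + \frac{1}{2} \log\log\log n - \frac{1}{2} \log \pi$.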
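As a hedged illustration of how (4.3) yields an approximate critical value, the sketch below solves exp(−2e^(−t)) = 1 − α for t and maps it back to the scale of T1. It assumes the normalization is applied to the square root of T1, with a_n = (2 log log n)^(1/2) and b_n = 2 log log n + (1/2) log log log n − (1/2) log π; the function name is hypothetical. Convergence to this type of extreme-value limit is typically slow, so the value should be read as a rough guide rather than an exact threshold.

```python
import math

def lr_critical_value(n, alpha):
    """Approximate level-alpha critical value for T1 from the limit (4.3)."""
    loglog = math.log(math.log(n))
    a_n = math.sqrt(2.0 * loglog)
    b_n = 2.0 * loglog + 0.5 * math.log(loglog) - 0.5 * math.log(math.pi)
    # Solve exp(-2 * exp(-t)) = 1 - alpha for t.
    t = -math.log(-0.5 * math.log(1.0 - alpha))
    return ((t + b_n) / a_n) ** 2

cv = lr_critical_value(n=30000, alpha=0.05)
```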

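If the experiment is designed in a way such that, at the true change-point position j(n) with the mean change of δ(n) = θj − θj+1, either of the following two conditions holds,

$$0 < \frac{j(n)}{n} \to t < 1, \quad \delta(n) \to 0, \quad \text{with } \lim_{n \to \infty} \frac{n \delta^2}{\log\log n} = \infty; \tag{4.4}$$
$$\frac{j(n)}{n} \to 0, \quad \delta(n) \to 0, \quad \text{with } \lim_{n \to \infty} \frac{j(n) \delta^2}{\log\log n} = \infty, \tag{4.5}$$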

it is shown in Csörgö and Horváth (1997) that

$$\delta^2 |\hat j - j| = O_P(1). \tag{4.6}$$

In the less challenging case that δ(n) → c > 0, it can be shown that |ĵ − j| = O_P(1), implying |ĵ/n − j/n| = O_P(1/n). If all data points are collected from a fixed interval (e.g. [0, 1] or a chromosome), this implies a convergence rate of 1/n. From (4.3), it is easy to see that (4.4) and (4.5) cannot be relaxed further. Otherwise, the signal is too weak to be detectable.

The second simplest case for the change-point problem has two change points with an epidemic alternative (Yao (1993), Arias-Castro et al. (2005)). Specifically, test

$$H_0: \theta_1 = \cdots = \theta_n = \theta \quad \text{vs} \quad H_1: \theta_1 = \cdots = \theta_l = \theta_{r+1} = \cdots = \theta_n = \theta, \quad \theta_{l+1} = \cdots = \theta_r = \theta + \delta. \tag{4.7}$$

Arias-Castro et al. (2005) showed that no method can detect the segment reliably if δ²(r − l) < 2 log n, which offers a benchmark for necessary conditions to solve the general multiple change-point problems. In the next two subsections, we derive the theoretical properties of the SaRa.

4.1 Normal case

Consider the normal mean change-point model (1.1) and define

$$L = \min_{1 \le j \le J+1} (\tau_j - \tau_{j-1}), \qquad \delta = \min_{1 \le j \le J} |\theta_{\tau_j+1} - \theta_{\tau_j}| / \sigma,$$

where J, τ, θ and σ depend on n. Here L is the minimal distance among all change points, and δ measures the ratio of the minimum jump size to the standard deviation at change points. The key quantity that reflects the strength of the signal is S² = δ²L. When the signal is too weak, it is not distinguishable from the noise and cannot be recovered by any method. Here we consider the setting

$$S^2 = \delta^2 L > 32 \log n. \tag{4.8}$$

In the conventional setting of multiple change-point analysis, it is usually assumed that J is constant and τ/n converges to a constant vector t as n → ∞, which implies S² = cn ≫ log n. However, in some applications, L is quite small compared to n. Hence it is useful to consider the more flexible condition in (4.8).

Theorem 2

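Under (4.8), there exist h = h(n) and λ = λ(n) such that $\hat{\mathcal{J}}$ = $\hat{\mathcal{J}}_{h,\lambda}$ = {τ̂1, · · ·, τ̂Ĵ} satisfies

$$\lim_{n \to \infty} P(\{\hat J = J\}) = 1.$$

Moreover, conditional on Ĵ = J,

$$\delta^2 (\hat\tau_i - \tau_i) = O_P(1). \tag{4.9}$$

In particular, taking h = L/2 and λ = δσ/2, we have

$$P\Big( \{\hat J = J\} \cap \bigcap_i \{ |\hat\tau_i - \tau_i| < h \} \Big) > 1 - 4\sqrt{\frac{2}{\pi}}\, S^{-1} \exp\{\log n - S^2/32\}.$$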
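To get a feel for the nonasymptotic statement, the sketch below evaluates the tuning parameters h = L/2 and λ = δσ/2 together with the lower bound on the coverage probability, reading the leading constant as 4(2/π)^(1/2). That reading of the constant, the function name, and the example values are assumptions made for this illustration.

```python
import math

def theorem2_bound(n, delta, L, sigma=1.0):
    """Tuning parameters and coverage lower bound under the stated reading."""
    s2 = delta ** 2 * L                      # signal strength S^2
    tail = (4.0 * math.sqrt(2.0 / math.pi) / math.sqrt(s2)
            * math.exp(math.log(n) - s2 / 32.0))
    return {
        "h": L // 2,
        "lambda": delta * sigma / 2.0,
        "S2": s2,
        "condition_4_8": s2 > 32.0 * math.log(n),
        "lower_bound": 1.0 - tail,
    }

strong = theorem2_bound(n=30000, delta=3.0, L=100)  # S^2 = 900
weak = theorem2_bound(n=30000, delta=3.0, L=15)     # S^2 = 135
```

With n = 30000, condition (4.8) requires S² to exceed 32 log n, roughly 330. The first example satisfies it and the bound is essentially one; the second, which uses the spacing and jump size of the strong-signal simulation in Section 3.2, does not, and the bound becomes negative and hence uninformative.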

Remark 1

In Niu and Zhang (2010), it was shown that

$$\lim_{n \to \infty} P\Big( \{\hat J = J\} \cap \bigcap_i \{ |\hat\tau_i - \tau_i| < h \} \Big) = 1. \tag{4.10}$$

This is called the sure coverage property since it basically says that the true change-point locations τi are covered by the neighborhoods (τ̂i − h, τ̂i + h) with probability tending to one.

Remark 2

The main improvement from Theorem 2 is the convergence rate (4.9), the same as the convergence rate of the likelihood ratio estimator in the single change-point case.

Remark 3

The choices of h and λ in Theorem 2 may not be optimal in general, especially when S² ≫ log n. Basically, it is enough to use an h that is slightly greater than 16 log n/δ².

Corollary 1

If there is only one change point at position τ ≤ n/2 with δ²τ > 32 log n, then there exists an h such that τ̂ = argmax_j |Dh(j)| satisfies δ²(τ̂ − τ) = O_P(1).

Note that in this corollary, the convergence rate of the location estimator τ̂ is the same as for the likelihood ratio estimator. The assumption δ²τ > 32 log n is slightly stronger than (4.5), because the likelihood ratio test statistic −2 log Λj involves all the data points while the local statistic Dh(j) uses only 2h data points, where h is usually O(log n/δ²).

Remark 4

The assumption (4.8) cannot be relaxed further except for the constant 32. For example, in the epidemic case (4.7) even if we know that there exists, at most, one segment whose mean is shifted from zero, Arias-Castro et al. (2005) suggested that (4.8) is required except for the constant. In our setting, the sparsity of θ and number of change points are not restricted. Therefore, we do not expect that the 32 can be significantly improved. We conclude that the SaRa is a nearly optimal procedure.

4.2 Beyond normality

Suppose

$$y_i = \theta_i + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{4.11}$$

where the mean vector θ = (θ1, …, θn)T is a piecewise constant with change points at 0 < τ1 < · · · < τJ < n, and the noise is not necessarily normal but satisfies the following.

  • (C0)
    The noises are i.i.d. with E(εi) = 0 and Var(εi) = σ² < ∞. Moreover, the density function of εi is symmetric. With
    $$L = \min_{1 \le j \le J+1} (\tau_j - \tau_{j-1}), \qquad \delta = \min_{1 \le j \le J} |\theta_{\tau_j+1} - \theta_{\tau_j}| / \sigma,$$

    and, for ease of presentation, σ² = 1, we consider three cases:

  • (C1)

    εi ~ N(0, 1), S² = δ²L > 32 log n.

  • (C2)

    $E\left(\exp\frac{|\varepsilon_i|}{a}\right) \le b, \qquad \frac{\delta^2 L}{8ab^2 + 2a\delta} \gg \log n.$

  • (C3)

    E(|εi|^t) = m_t < ∞ for some t > 2, δ^t L^(t−1) ≫ n and δ²L ≫ log n.

The case (C1) has been characterized in Theorem 2. We give results for the other two cases.

Theorem 3

If (C0) and one of (C2) and (C3) hold, there exist h = h(n) and λ = λ(n) such that $\hat{\mathcal{J}}$ = $\hat{\mathcal{J}}_{h,\lambda}$ = {τ̂1, · · ·, τ̂Ĵ} satisfies

$$\lim_{n \to \infty} P\Big( \{\hat J = J\} \cap \bigcap_i \{ |\hat\tau_i - \tau_i| < h \} \Big) = 1.$$

In particular, taking h = L/2 and λ = δ/2, we have, under (C0) and (C2),

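$$P\Big( \{\hat J = J\} \cap \bigcap_i \{ |\hat\tau_i - \tau_i| < h \} \Big) > 1 - 2 \exp\left\{ \log n - \frac{\delta^2 L}{64a^2 b + 8a\delta} \right\},$$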

and, under (C0) and (C3),

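$$P\Big( \{\hat J = J\} \cap \bigcap_i \{ |\hat\tau_i - \tau_i| < h \} \Big) > 1 - \left[ C_1 \frac{n}{\delta^t L^{t-1}} + \exp\{\log n - C_2 \delta^2 L\} \right],$$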

where $C_1 = \frac{1}{2} m_t \left(4 + \frac{8}{t}\right)^t$ and $C_2 = \frac{1}{8} (t+2)^{-2} e^{-t}$.

5. Discussion

Change-point detection is a classic problem with emerging applications across a spectrum of fields from finance to engineering and bioscience. In this work, we focused on the detection of multiple change points, arising from the need to identify CNVs from genomic data. We invoked the recently developed algorithm SaRa of Niu and Zhang (2010), evaluating it as the only algorithm that is shown to reach optimal computational complexity and to possess the sure coverage property. We introduced a concept of false discovery proportion to address the specific setting in the detection of multiple change points, and established FDR theory for the dependent and sequence data pertinent to CNV. We proved a stronger sure coverage property and showed that the convergence rate is optimal, and the same as the existing one for the detection of a single change point. When we applied our procedure to a well-known data set, we confirmed a known CNV and detected a new one, highlighting the potential of the method to detect CNVs that are difficult to detect by existing methods.

Acknowledgments

Financial support from the University of Arizona Internal grant and National Institute on Drug Abuse grant R01–DA016570 is gratefully acknowledged.

Appendix. Proofs

In this appendix, we prove Theorems 2 and 3, and Corollary 1. Although Corollary 1 is implied directly by Theorem 2, providing its proof might help readers understand the technique.

A direct proof of Corollary 1

First note that the condition τ ≤ n/2 does not put any restrictions on the position of the change point, by symmetry. The only condition we need is that the position τ is not too close to the boundary, which is guaranteed by δ²τ > 32 log n.

Without loss of generality, we assume σ² = 1 and θτ+1 − θτ > 0. Fix an integer h such that 16 log n/δ² < h < τ. By definition,

$$D_h(\tau) = \frac{1}{h} \left( \sum_{k=\tau+1}^{\tau+h} y_k - \sum_{k=\tau-h+1}^{\tau} y_k \right) = \delta + \frac{1}{h} \left( \sum_{k=\tau+1}^{\tau+h} \varepsilon_k - \sum_{k=\tau-h+1}^{\tau} \varepsilon_k \right) \sim N\left(\delta, \frac{2}{h}\right).$$

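We consider the local statistic Dh at points to the right of τ and study the behavior of the estimator τ̂ = argmax_{j ≥ τ} Dh(j). Let W_m = ε_{τ+h+m} − 2ε_{τ+m} + ε_{τ−h+m}, m = 1, 2, …, h. Note the following.

  1. For m ≥ 2h, $D_h(\tau+m) \sim N(0, \frac{2}{h})$ is independent of Dh(τ).

  2. For h < m < 2h, $D_h(\tau+m) \sim N(0, \frac{2}{h})$ and $\mathrm{corr}(D_h(\tau+m), D_h(\tau)) = -\frac{2h-m}{2h}$.

  3. For m ≤ h, $D_h(\tau+m) \sim N\left(\frac{h-m}{h}\delta, \frac{2}{h}\right)$ and $D_h(\tau+m) - D_h(\tau) = -\frac{m}{h}\delta + \frac{1}{h} \sum_{\ell=1}^{m} W_\ell$.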
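The variance and correlation statements in items 1 and 2 can be checked exactly by writing each local statistic as a linear combination of unit-variance noise terms and taking inner products of the coefficient vectors. The sketch below uses exact rational arithmetic; the function names and the particular values of h, τ, and the sequence length are assumptions chosen for illustration.

```python
from fractions import Fraction

def weights(center, h, length):
    """Coefficients w[k] with D_h(center) = sum_k w[k] * y_k (1-based k)."""
    w = [Fraction(0)] * (length + 1)     # index 0 is unused
    for k in range(center - h + 1, center + 1):
        w[k] = Fraction(-1, h)           # left window enters with minus sign
    for k in range(center + 1, center + h + 1):
        w[k] = Fraction(1, h)            # right window enters with plus sign
    return w

def variance(center, h, length):
    return sum(a * a for a in weights(center, h, length))

def correlation(c1, c2, h, length):
    w1, w2 = weights(c1, h, length), weights(c2, h, length)
    cov = sum(a * b for a, b in zip(w1, w2))
    return cov / variance(c1, h, length)  # both variances equal 2 / h

h, tau, length = 8, 20, 60
corr_overlap = correlation(tau, tau + 11, h, length)   # h < m < 2h
corr_far = correlation(tau, tau + 2 * h, h, length)    # m = 2h
```

With h = 8 and m = 11 the overlap between the right window of Dh(τ) and the left window of Dh(τ + m) has 2h − m = 5 points, giving a correlation of −5/16, in agreement with item 2.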

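To show lim_{M→∞} lim_{n→∞} P(δ²|τ̂ − τ| < M) = 1, it suffices to verify that

$$\lim_{M \to \infty} \lim_{n \to \infty} P\Big( \max_{m > M/\delta^2} \{ D_h(\tau+m) - D_h(\tau) \} < 0 \Big) = 1. \tag{6.1}$$

Note that in (6.1), h > 16 log n/δ² ≫ M/δ² as n → ∞. Therefore, we can bound the probabilities

$$P_1 = P\Big( \max_{m > h} \{ D_h(\tau+m) - D_h(\tau) \} < 0 \Big), \qquad P_2 = P\Big( \max_{h \ge m > M/\delta^2} \{ D_h(\tau+m) - D_h(\tau) \} < 0 \Big).$$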

By 1) and 2), we have

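$$D_h(\tau+m) - D_h(\tau) \sim N\left(-\delta, \frac{4}{h}\right), \quad m \ge 2h, \qquad D_h(\tau+m) - D_h(\tau) \sim N\left(-\delta, \frac{6 - m/h}{h}\right), \quad h < m < 2h.$$

Therefore,

$$\begin{aligned}
1 - P_1 &= P\left(\max_{m > h} \{D_h(\tau+m) - D_h(\tau)\} > 0\right) \\
&< \sum_{m=h+1}^{2h-1} P\left(D_h(\tau+m) - D_h(\tau) > 0\right) + \sum_{m \ge 2h} P\left(D_h(\tau+m) - D_h(\tau) > 0\right) \\
&< (h-1)\,\Phi\left(-\delta\sqrt{h/5}\right) + (n - m - 2h + 1)\,\Phi\left(-\delta\sqrt{h}/2\right) \\
&< n\,\Phi\left(-\delta\sqrt{h/5}\right) < \sqrt{\frac{5}{2\pi}}\,\frac{1}{\delta\sqrt{h}} \exp\left\{\log n - \frac{\delta^2 h}{10}\right\}.
\end{aligned} \tag{6.2}$$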

In the last step, we have used the bound on the Gaussian tail probability

$$\Phi(-t) = 1 - \Phi(t) < \frac{1}{\sqrt{2\pi}}\, t^{-1} e^{-\frac{1}{2}t^2}, \qquad t > 0.$$

Obviously, $P_1 \to 1$ as $n \to \infty$ when $\delta^2 h > 10 \log n$.
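The tail bound and the closed form in (6.2) can be checked numerically. This is a rough sketch using only the Python standard library; the helper names `normal_cdf`, `gaussian_tail_bound`, and `bound_6_2`, and the trial values of `n`, `delta`, and `h`, are hypothetical and chosen for illustration.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function Phi(x), via erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_tail_bound(t):
    """Upper bound (2*pi)^(-1/2) * exp(-t^2 / 2) / t on Phi(-t), for t > 0."""
    return math.exp(-0.5 * t * t) / (t * math.sqrt(2.0 * math.pi))


def bound_6_2(n, delta, h):
    """Right-hand side of (6.2), an upper bound on 1 - P_1."""
    prefactor = math.sqrt(5.0 / (2.0 * math.pi)) / (delta * math.sqrt(h))
    return prefactor * math.exp(math.log(n) - delta ** 2 * h / 10.0)


# Hypothetical setting: delta = 1 and h slightly above 16 * log(n) / delta^2.
n, delta = 10 ** 5, 1.0
h = math.ceil(17.0 * math.log(n) / delta ** 2)
t = delta * math.sqrt(h / 5.0)
print(n * normal_cdf(-t), bound_6_2(n, delta, h))
```

Since $\delta^2 h > 16 \log n$ exceeds the threshold $10 \log n$, the exponent $\log n - \delta^2 h/10$ is below $-0.6 \log n$, so the bound vanishes as $n$ grows.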

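To bound the second part, the Bonferroni inequality is not sharp enough. First, by 3),

$$P_2 = P\left(\max_{h \ge m > M/\delta^2} \left\{-\frac{m}{h}\delta + \frac{1}{h}\sum_{\ell=1}^{m} W_\ell\right\} < 0\right),$$

where $W_1, \ldots, W_h$ are i.i.d. $N(0, 6)$. If $B(t)$ is a standard Brownian motion, observe that

$$\begin{aligned}
P_2 &= P\left(\max_{h \ge m > M/\delta^2} \left\{\frac{1}{m}\sum_{\ell=1}^{m} \frac{W_\ell}{\sqrt{6}} - \frac{\delta}{\sqrt{6}}\right\} < 0\right) \ge P\left(\max_{t \ge M/\delta^2} \left\{\frac{B(t)}{t} - \frac{\delta}{\sqrt{6}}\right\} < 0\right) \\
&= P\left(\max_{t \ge M/\delta^2} \left\{\frac{t B(1/t)}{t} - \frac{\delta}{\sqrt{6}}\right\} < 0\right) = P\left(\max_{t \le \delta^2/M} \left\{B(t) - \frac{\delta}{\sqrt{6}}\right\} < 0\right) \\
&= P\left(\max_{t \le \delta^2/M} \left\{\delta B(t/\delta^2) - \frac{\delta}{\sqrt{6}}\right\} < 0\right) = P\left(\max_{t \le 1/M} \left\{B(t) - \frac{1}{\sqrt{6}}\right\} < 0\right) \\
&= P\left(\max_{t \le 1/M} B(t) < \frac{1}{\sqrt{6}}\right).
\end{aligned}$$

Therefore, $\lim_{M\to\infty} \lim_{n\to\infty} P_2 = 1$. By symmetry, the behavior of $\hat\tau = \arg\max_{j \le \tau} D_h(j)$ is exactly the same. The proof is finished.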
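The limit of the last Brownian-motion probability can be made explicit. As a supplementary step that is not spelled out in the proof, the reflection principle gives $P(\max_{s \le T} B(s) < a) = 2\Phi(a/\sqrt{T}) - 1$ for $a > 0$. Taking $T = 1/M$ and $a = 1/\sqrt{6}$ yields

$$P\left(\max_{t \le 1/M} B(t) < \frac{1}{\sqrt{6}}\right) = 2\,\Phi\left(\sqrt{M/6}\right) - 1 \longrightarrow 1 \quad \text{as } M \to \infty.$$

For instance, $M = 24$ gives $2\Phi(2) - 1 \approx 0.954$, so the lower bound on $P_2$ is already close to one for moderate $M$.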

Proofs of Theorems 2 and 3

The only difference between Theorems 2 and 3 is that they assume different conditions on the noise distribution. We give a unified proof of the nonasymptotic result for both. We first give notation and two lemmas.

For $L$ an even number, fix $h = L/2$ and $\lambda = \delta/2$. Recall that we set $\sigma^2 = 1$. We say a point $j$ is a flat point if there is no change point in the neighborhood $[j - h + 1, j + h]$, so that $H_0(j)$ is true at (2.3). Let $\mathcal{F}_h = \{j : H_0(j) \text{ is true}\}$ be the set of all flat points and $\mathcal{J} = \{\tau_i : i = 1, \ldots, J\}$ be the set of all change points. Consider the event $A_\tau = \{|D(\tau, h)| > \lambda\}$ for a change point $\tau \in \mathcal{J}$ and the event $B_j = \{|D(j, h)| < \lambda\}$ for a flat point $j \in \mathcal{F}_h$. Let

$$E_n = \left(\bigcap_{\tau \in \mathcal{J}} A_\tau\right) \cap \left(\bigcap_{j \in \mathcal{F}_h} B_j\right).$$

Lemma 1

On $E_n$, $\hat{J} = J$ and $|\tau_i - \hat\tau_i| < h$ for all $i$.

This lemma, shown in Niu and Zhang (2010), says that $E_n$ implies $\tau_i \in (\hat\tau_i - h, \hat\tau_i + h)$. The next lemma gives the nonasymptotic bounds behind Theorems 2 and 3.
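To make the statement of Lemma 1 concrete, here is a minimal sketch of a screening step in the spirit of SaRa. It assumes, as a reading of Niu and Zhang (2010) that this appendix does not restate, that the estimated change points are the $h$-local maximizers of $|D_h|$ whose value exceeds $\lambda$. The function name `sara_screen` and the test signal (a bounded deterministic perturbation standing in for noise) are hypothetical.

```python
import math


def sara_screen(y, h, lam):
    """Return 1-based h-local maximizers of |D_h| whose value exceeds lam.

    A position j is kept if |D_h(j)| > lam and |D_h(j)| >= |D_h(i)| for every
    i in the open window (j - h, j + h). Exact ties would all be reported.
    """
    n = len(y)
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)  # prefix[k] = y_1 + ... + y_k
    abs_d = {
        j: abs((prefix[j + h] - 2.0 * prefix[j] + prefix[j - h]) / h)
        for j in range(h, n - h + 1)
    }
    estimates = []
    for j, value in abs_d.items():
        if value <= lam:
            continue  # screened out by the threshold
        window = range(max(h, j - h + 1), min(n - h, j + h - 1) + 1)
        if all(value >= abs_d[i] for i in window):
            estimates.append(j)
    return estimates


# Hypothetical signal: jumps of size delta = 4 at positions 40 and 80,
# plus a small bounded perturbation 0.3 * sin(k) in place of random noise.
n, h, delta = 120, 10, 4.0
change_points = [40, 80]
y = [(delta if 40 < k <= 80 else 0.0) + 0.3 * math.sin(k) for k in range(1, n + 1)]
estimates = sara_screen(y, h, lam=delta / 2)
print(estimates)
```

In this example the perturbation is small enough that every $A_\tau$ and $B_j$ holds, so the number of estimates equals the number of change points and each estimate lies within $h$ of a true change point, as Lemma 1 asserts.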

Lemma 2

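If (C0) and one of (C1), (C2), and (C3) hold, then $P(E_n) \to 1$ as $n \to \infty$. In particular, under (C1),

$$P(E_n) > 1 - \frac{4\sqrt{2}}{\sqrt{\pi}}\, S^{-1} \exp\left[\log n - S^2/32\right]; \tag{6.3}$$

under (C2),

$$P(E_n) > 1 - 2\exp\left\{\log n - \frac{\delta^2 L}{64 a^2 b + 8 a \delta}\right\}; \tag{6.4}$$

under (C3),

$$P(E_n) > 1 - \frac{C_1 n}{\delta^t L^{t-1}} - \exp\left\{\log n - C_2 \delta^2 L\right\}, \tag{6.5}$$

where $C_1 = \frac{1}{2} m_t \left(4 + \frac{8}{t}\right)^t$ and $C_2 = \frac{1}{8}(t+2)^{-2} e^{-t}$.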

Proof of Lemma 2

Under (C1), we have

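$$D_h(j) \sim N\left(0, \frac{2}{h}\right) \text{ for } j \in \mathcal{F}_h, \qquad D_h(\tau) \sim N\left(\delta_\tau, \frac{2}{h}\right) \text{ for } \tau \in \mathcal{J},$$

where $\delta_\tau = \theta_{\tau+1} - \theta_\tau$.

For a fixed flat point $j$,

$$P(B_j^c) = P\left(|D_h(j)| > \frac{\delta}{2}\right) = 2\left(1 - \Phi\left(\frac{\delta\sqrt{h}}{2\sqrt{2}}\right)\right) = 2\left(1 - \Phi\left(\frac{\delta\sqrt{L}}{4}\right)\right) = 2\left(1 - \Phi\left(\frac{S}{4}\right)\right),$$

where $\Phi$ is the standard normal distribution function. Similarly, for a change point $\tau$, since $|\delta_\tau| > \delta$ by definition, we have

$$P(A_\tau^c) = P\left(|D_h(\tau)| < \frac{\delta}{2}\right) < 1 - \Phi\left(\frac{S}{4}\right).$$

By the Bonferroni inequality and the bound on the Gaussian tail probability, we have

$$P(E_n^c) < \sum_{\tau \in \mathcal{J}} P(A_\tau^c) + \sum_{j \in \mathcal{F}_h} P(B_j^c) < 2n\left(1 - \Phi\left(\frac{S}{4}\right)\right) < \frac{4\sqrt{2}}{\sqrt{\pi}}\, S^{-1} \exp\left\{\log n - \frac{S^2}{32}\right\}.$$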

Therefore, (6.3) holds.
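The two bounds in the last display can be compared numerically. The sketch below takes $S = \delta\sqrt{L}$, as the chain of equalities for $P(B_j^c)$ indicates; the function names and the trial values of `n` and `S` are hypothetical.

```python
import math


def union_bound_c1(n, S):
    """Bonferroni bound 2n * (1 - Phi(S / 4)) on P(E_n^c) under (C1)."""
    upper_tail = 0.5 * math.erfc(S / (4.0 * math.sqrt(2.0)))  # 1 - Phi(S / 4)
    return 2.0 * n * upper_tail


def closed_form_c1(n, S):
    """Closed-form bound on P(E_n^c) corresponding to (6.3)."""
    prefactor = 4.0 * math.sqrt(2.0) / (math.sqrt(math.pi) * S)
    return prefactor * math.exp(math.log(n) - S * S / 32.0)


n = 10 ** 4
for S in (20.0, 25.0, 30.0):
    print(S, union_bound_c1(n, S), closed_form_c1(n, S))
```

The closed form is slightly looser than the union bound, and both fall rapidly once $S^2/32$ exceeds $\log n$.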

Under (C0) and (C2), the probabilities $P(A_\tau^c)$ and $P(B_j^c)$ can be bounded using Lemma 2.2.11 in van der Vaart and Wellner (1996).

Note that $E\exp(|\varepsilon_i|/a) \le b$ implies $E|\varepsilon_i|^m < b\, m!\, a^m = m!\, a^{m-2}\, \frac{2a^2 b}{2}$.
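The implication follows term by term from the exponential series; the short derivation below is a supplementary step, and the labels $M$ and $v_i$ are introduced here only for illustration. For $x \ge 0$ and any integer $m \ge 1$, $x^m/(m!\,a^m) \le e^{x/a}$, so taking $x = |\varepsilon_i|$ and expectations gives

$$E|\varepsilon_i|^m \le m!\, a^m\, E\exp(|\varepsilon_i|/a) \le b\, m!\, a^m = m!\, a^{m-2}\, \frac{2a^2 b}{2}.$$

This is a Bernstein-type moment condition with scale $M = a$ and per-term variance proxy $v_i = 2a^2 b$, which is where the terms $2a^2 b \cdot 2h$ and $a\delta h/2$ in the exponent of the next bound come from.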

For a flat point j,

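$$h \cdot D_h(j) = \sum_{i=j-h+1}^{j} Y_i - \sum_{i=j+1}^{j+h} Y_i = \sum_{i=j-h+1}^{j} \varepsilon_i - \sum_{i=j+1}^{j+h} \varepsilon_i.$$

Because $\varepsilon_i$ is symmetric, $h \cdot D_h(j)$ is a sum of $2h$ i.i.d. random variables. By Bernstein's inequality,

$$P(B_j^c) = P\left(|D_h(j)| > \frac{\delta}{2}\right) = P\left(|h \cdot D_h(j)| > \frac{\delta h}{2}\right) \le 2\exp\left\{-\frac{1}{2}\,\frac{\delta^2 h^2/4}{2a^2 b \cdot 2h + a\delta h/2}\right\} = 2\exp\left\{-\frac{\delta^2 L}{64 a^2 b + 8 a \delta}\right\}.$$

$P(A_\tau^c)$ can be bounded by the same probability. By the Bonferroni inequality, we have

$$P(E_n^c) < 2n\exp\left\{-\frac{\delta^2 L}{64 a^2 b + 8 a \delta}\right\} = 2\exp\left\{\log n - \frac{\delta^2 L}{64 a^2 b + 8 a \delta}\right\}.$$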

Under (C0) and (C3), the inequality (6.5) can be obtained via results on large deviations of sums of independent random variables. For example, applying Corollary 1.8 in Nagaev (1979),

$$P(B_j^c) = P\left(|h \cdot D_h(j)| > \frac{\delta h}{2}\right) \le C_1 \delta^{-t} L^{1-t} + \exp\left\{-C_2 \delta^2 L\right\},$$

where $C_1 = \frac{1}{2} m_t \left(4 + \frac{8}{t}\right)^t$ and $C_2 = \frac{1}{8}(t+2)^{-2} e^{-t}$. The same is true for $P(A_\tau^c)$, $\tau \in \mathcal{J}$. By the Bonferroni inequality, we have

$$P(E_n^c) < n\left(C_1 \delta^{-t} L^{1-t} + \exp\left\{-C_2 \delta^2 L\right\}\right) = \frac{C_1 n}{\delta^t L^{t-1}} + \exp\left\{\log n - C_2 \delta^2 L\right\}.$$

Theorem 3 and part of Theorem 2 are straightforward corollaries of Lemmas 1 and 2. The remaining part of Theorem 2, $\delta^2(\hat\tau_i - \tau_i) = O_P(1)$, can be shown in the same way as in the proof of Corollary 1. In fact, from the discussion above, we know $|\hat\tau_i - \tau_i| < h$ on the event $E_n$, which holds with probability close to 1. To sharpen our result to $\delta^2(\hat\tau_i - \tau_i) = O_P(1)$, it suffices to repeat the procedure used to bound $P_2$ in Corollary 1.

Contributor Information

Ning Hao, Email: nhao@math.arizona.edu, Department of Mathematics, The University of Arizona, Tucson AZ, 85721, USA.

Yue Selena Niu, Email: yueniu@math.arizona.edu, Department of Mathematics, The University of Arizona, Tucson AZ, 85721, USA.

Heping Zhang, Email: heping.zhang@yale.edu, Department of Epidemiology and Public Health, Yale University, New Haven CT, 06520, USA.

Bibliography

  1. Arias-Castro E, Donoho DL, Huo X. Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans Inform Theory. 2005;51.
  2. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B. 1995;57:289–300.
  3. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics. 2001;29:1165–1188.
  4. Chen J, Gupta A. Parametric Statistical Change Point Analysis. Birkhäuser; 2000.
  5. Chernoff H, Zacks S. Estimating the current mean of a normal distribution which is subjected to changes in time. The Annals of Mathematical Statistics. 1964;35:999–1018.
  6. Csörgö M, Horváth L. Limit Theorems in Change-Point Analysis. Wiley; New York: 1997.
  7. Efron B, Zhang NR. False discovery rates and copy number variation. Biometrika. 2011;98:251–271.
  8. Gardner LAJ. On detecting changes in the mean of normal variates. The Annals of Mathematical Statistics. 1969;40:116–126.
  9. Huang T, Wu B, Lizardi P, Zhao H. Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics. 2005;21:3811–3817. doi: 10.1093/bioinformatics/bti646.
  10. Jeng XJ, Cai TT, Li H. Optimal sparse segment identification with application in copy number variation analysis. Journal of the American Statistical Association. 2010;105:1056–1066. doi: 10.1198/jasa.2010.tm10083.
  11. Lai TL, Xing H. A simple bayesian approach to multiple change-points. Statistica Sinica. 2011;21:539–569.
  12. Lai TL, Xing H, Zhang N. Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatistics. 2008:290–307. doi: 10.1093/biostatistics/kxm031.
  13. Nagaev SV. Large deviations of sums of independent random variables. The Annals of Probability. 1979;7:745–789.
  14. Niu SY, Zhang H. The screening and ranking algorithm to detect DNA copy number variations. Annals of Applied Statistics. 2010 doi: 10.1214/12-AOAS539SUPP. to appear.
  15. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–72. doi: 10.1093/biostatistics/kxh008.
  16. Page ES. A test for a change in a parameter occurring at an unknown point. Biometrika. 1955;42:523–527.
  17. Page ES. On problems in which a change in a parameter occurs at an unknown point. Biometrika. 1957;44:248–252.
  18. Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, Haden K, Li J, Shaw CA, Belmont J, Cheung SWW, Shen RM, Barker DL, Gunderson KL. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Research. 2006;16:1136–1148. doi: 10.1101/gr.5402306.
  19. Sen A, Srivastava MS. On tests for detecting change in mean. The Annals of Statistics. 1975;3:98–108.
  20. Tibshirani R, Wang P. Spatial smoothing and hot spot detection for cgh data using the fused lasso. Biostatistics. 2008;9:18–29. doi: 10.1093/biostatistics/kxm013.
  21. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. Springer; 1996.
  22. Vostrikova LY. Detecting “disorder” in multidimensional random processes. Soviet Mathematics Doklady. 1981:55–59.
  23. Wang K, et al. PennCNV: An integrated hidden markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research. 2007;17:1665–1674. doi: 10.1101/gr.6861907.
  24. Yao Q. Tests for change-points with epidemic alternatives. Biometrika. 1993;80:179–191.
  25. Yao YC. Estimating the number of change-points via Schwarz’ criterion. Statistics and Probability Letters. 1988;6:181–189.
  26. Yao YC, Au ST. Least-squares estimation of a step function. Sankhya A. 1989;51:370–381.
  27. Zhang F, Gu W, Hurles ME, Lupski JR. Copy number variation in human health, disease, and evolution. Annual Review of Genomics and Human Genetics. 2009;10:451–481. doi: 10.1146/annurev.genom.9.081307.164217.
  28. Zhang NR. DNA copy number profiling in normal and tumor genomes. In: Frontiers in Computational and Systems Biology. 2010; pp. 259–281.
  29. Zhang NR, Siegmund DO. A modified bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics. 2007;63:22–32. doi: 10.1111/j.1541-0420.2006.00662.x.
