Multiple Change-Point Detection via a Screening and Ranking Algorithm

Ning Hao; Yue Selena Niu; Heping Zhang

doi:10.5705/ss.2012.018s

Author manuscript; available in PMC: 2014 Jul 1.

Published in final edited form as: Stat Sin. 2013 Jul 1;23(4):1553–1572. doi: 10.5705/ss.2012.018s


PMCID: PMC3902887 NIHMSID: NIHMS423888 PMID: 24489450

Abstract

Let Y₁, …, Y_n be a sequence whose underlying mean is a step function with an unknown number of the steps and unknown change points. The detection of the change points, namely the positions where the mean changes, is an important problem in such fields as engineering, economics, climatology and bioscience. This problem has attracted a lot of attention in statistics, and a variety of solutions have been proposed and implemented. However, there is scant literature on the theoretical properties of those algorithms. Here, we investigate a recently developed algorithm called the Screening and Ranking Algorithm (SaRa). We characterize the theoretical properties of SaRa and show its superiority over other commonly used algorithms. In particular, we develop a false discovery rate approach to the multiple change-point problem and show a strong sure coverage property for the SaRa.

Key words and phrases: Change-point detection, copy number variation, false discovery rate, high dimensional data, screening and ranking algorithm

1. Introduction

Studies of change-point detection date back to the 1950s. In the past half century, the topic has attracted a great deal of attention in such fields as statistics, engineering, economics, climatology and bioscience. Specifically, given a sequence of ordered or time dependent random variables, denoted by Y₁, …, Y_n, a change point is a position or time at which the structure of this sequence changes; the goal of change-point detection is to estimate the locations of change points and provide an assessment of accuracy. Thus, in climate data series, a change point is a time at which the climate changes dramatically, a threshold which is of interest to climatologists; in financial econometrics, change-point analysis can help identify the directions in the market or economy; in engineering, for a continuous production process, it is important to find out if there is a point where the quality of the products begins to deteriorate. A recent development is its application in genetics to DNA copy number variation detection. The DNA copy number of a region is the number of copies of the genomic DNA. Copy number variation (CNV) usually refers to deletion or duplication of a region of DNA sequences. Recent studies have shown that CNVs account for an abundance of genetic variation and may influence phenotypic differences. It is then a fundamental problem to identify the CNVs in genetics; see Zhang (2010) for a thorough introduction to the application of change-point models in CNV detection. Finding CNVs in the massive data produced by modern DNA array technologies is a recent development in change-point problems. The main challenge here is finding multiple change points accurately and efficiently in an expansive sequence whose length is typically in the hundreds of thousands in SNP genotyping data.

A normal mean, multiple change-point model has played an essential role in the statistical analysis of the CNV problem (Olshen et al. (2004), Huang et al. (2005), Zhang and Siegmund (2007), Tibshirani and Wang (2008), Jeng et al. (2010)). Let

y_{i} = θ_{i} + ε_{i}, ε_{i} \overset{iid}{\sim} N (0, σ^{2}), i = 1, \dots, n,

(1.1)

and assume that

θ_{1} = θ_{2} = \dots = θ_{τ_{1}} \neq θ_{τ_{1} + 1} = \dots = θ_{τ_{2}} \neq θ_{τ_{2} + 1} = \dots \dots = θ_{τ_{J}} \neq θ_{τ_{J} + 1} = \dots = θ_{n},

where τ = (τ₁, …, τ_J )^T is the location vector of the change points. We call model (1.1) with piecewise constant mean θ a normal mean change-point model.
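As a concrete illustration of model (1.1), the following minimal Python sketch draws a sequence with a piecewise constant mean. The function name `simulate_step_sequence`, its arguments, and the toy values are our own choices for illustration and do not come from the original analysis.

```python
import random

def simulate_step_sequence(n, change_points, levels, sigma=1.0, seed=0):
    """Draw y_i = theta_i + eps_i, i = 1..n, with piecewise-constant theta.

    change_points: sorted positions tau_1 < ... < tau_J (1-based, as in the text).
    levels: the J + 1 segment means; levels[k] is used on positions
            tau_k + 1, ..., tau_{k+1}, with tau_0 = 0 and tau_{J+1} = n.
    Returns (theta, y) as 0-based Python lists, so theta[i - 1] is theta_i.
    """
    if len(levels) != len(change_points) + 1:
        raise ValueError("need exactly one more level than change points")
    rng = random.Random(seed)
    bounds = [0] + list(change_points) + [n]
    theta = []
    for k, level in enumerate(levels):
        theta.extend([level] * (bounds[k + 1] - bounds[k]))
    y = [t + rng.gauss(0.0, sigma) for t in theta]
    return theta, y

# Toy example: n = 12 with change points at tau = (4, 9).
theta, y = simulate_step_sequence(n=12, change_points=[4, 9], levels=[0.0, 2.0, 0.0])
```

Here θ₄ ≠ θ₅ and θ₉ ≠ θ₁₀, so the support of β is exactly {4, 9}.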

In CNV detection, this model plays an essential role in Olshen et al. (2004), Huang et al. (2005), and Zhang and Siegmund (2007). Some authors have considered more restricted models: Tibshirani and Wang (2008) assume that θ itself is sparse; Jeng et al. (2010) assume, in addition, that the nonzero segments of θ are short. While these additional assumptions are reasonable, the model considered here is more general. Specifically, we only assume that n is large and any two change points are “not too close to each other;” this will be clarified later. We take J ≪ n so while the data are high dimensional in terms of the potential locations of the change points, the number of true change-points is limited. Thus, (1.1) is a high-dimensional sparse model with a sequential structure.

For the multiple change-point problem, the number of change points and their locations are to be estimated. Popular multiple change-point detection tools include exhaustive search (Yao (1988), Yao and Au (1989)), binary segmentation (Vostrikova (1981), Olshen et al. (2004)), ℓ₁ penalization (Huang et al. (2005), Tibshirani and Wang (2008)), and the hidden Markov model approach (Wang et al. (2007), Lai et al. (2008)), among others. See Lai and Xing (2011) for multiple change-point models for the exponential family and for a comprehensive review of related methods.

Recently, Niu and Zhang (2010) proposed the Screening and Ranking algorithm (SaRa) as an alternative approach for change-point detection. They observed that when determining whether there is a change at the jth position, the information at positions far from j is rarely useful. Therefore, it is more efficient to concentrate on a local neighborhood first, say the interval (j − h, j + h), when making decisions about j and nearby points. Avoiding complex optimization or iterative algorithms, the SaRa is simple to implement with complexity O(n), which makes the SaRa suitable for analyzing high throughput data. Besides computational efficiency, Niu and Zhang (2010) showed that, under mild conditions, the SaRa satisfies a sure coverage property that implies that the SaRa can estimate J and τ consistently. The SaRa, to the best of our knowledge, is the only algorithm known to combine computational simplicity and consistency. In this paper, we derive deeper and broader theoretical properties for the SaRa that are useful for our understanding of its performance.

First of all, we propose a novel false discovery rate (FDR) approach to change-point detection. The multiple change-point problem can be naturally stated as a multiple testing problem. Although many change-point detection tools have been proposed recently, few authors have examined the issue of FDR. Tibshirani and Wang (2008), and Efron and Zhang (2011) studied the FDR on a normal mean change-point model with sparse mean vector θ, and applied their theories to the CNV problem. Our theory is different. We do not assume the sparsity of θ and focus on the FDR of change-point locations. Specifically, we concentrate on β = (β₁, …, β_n₋₁)^T where β_j = θ_j₊₁ − θ_j and whose support corresponds to the set of true change-point locations. We demonstrate how to establish a well-defined multiple testing framework for change-point problem (1.1) and assess the FDR for the SaRa estimator. In addition, we show how the FDR control procedure helps us select tuning parameters in the SaRa procedure.

Secondly, we characterize the convergence rate of the location estimator τ̂ for the SaRa. The sure coverage property (Niu and Zhang (2010)) states that the SaRa estimator for the location vector satisfies, with probability tending to one, ||τ̂ − τ||_∞ < h, where h is a tuning parameter and is of order O(log n) under some reasonable conditions. Here we give a sharper convergence rate and show that ||τ̂ − τ||_∞ = O_P (1), where the convergence rate is the same as the best result known for the single change-point case (Csörgö and Horváth (1997)). Moreover, we show that our assumptions cannot be weakened further except for the constant 32 in condition (4.8), implying that the SaRa is a nearly optimal procedure.

The SaRa can be easily generalized to solve more general change-point problems. For example, we derive sure coverage properties for some non-normal cases, although a comprehensive study of non-normal data requires further effort.

The rest of this paper is organized as follows. In Section 2, we briefly recall the SaRa procedure and introduce an FDR control approach to multiple change-point detection based on the SaRa. The numerical studies of the FDR approach are demonstrated in Section 3. In Section 4, a strong sure coverage property of the SaRa is verified to illustrate the optimality of the SaRa, which is followed by some final remarks in Section 5. All proofs are in the Appendix.

2. False discovery rate for change-point detection

From the hypothesis testing perspective, the change-point problem can be viewed as a multiple testing problem by testing every data point as a potential change point. The false discovery rate (FDR) approach to multiple testing problems has been studied extensively since the seminal paper of Benjamini and Hochberg (1995). However, the multiple testing problem derived from change-point detection goes beyond the classical framework: it has a distinctive correlation structure as well as a sequential structure. We illustrate our approach to this problem.

2.1 Change-point detection as a multiple testing problem

Let Y = (Y₁, …, Y_n)^T be a sequence of independent random variables with probability distribution function F₁, …, F_n, respectively. The multiple change-point problem can be stated as the hypothesis testing problem

\begin{array}{l} H_{0} : F_{1} = F_{2} = \dots = F_{n} vs \\ H_{1} : F_{1} = F_{2} = \dots = F_{τ_{1}} \neq F_{τ_{1} + 1} = \dots = F_{τ_{2}} \neq F_{τ_{2} + 1} = \dots F_{τ_{J}} \neq F_{τ_{J} + 1} = \dots = F_{n}, \end{array}

(2.1)

where J is an unknown number of change points and 0 < τ₁ < · · · < τ_J < n are their unknown locations. The purpose is to make a decision between two hypotheses and, if the alternative is supported, to estimate the number of change points J and the location vector τ = (τ₁, …, τ_J )^T. Although this single hypothesis test has been employed to formulate the multiple change-point problem in classical works, it does not address the accuracy of estimation of the change points here.

The testing problem (2.1) is naturally decomposed to the sequence of hypotheses

H_{0} (j) : j is not a change point; vs H_{1} (j) : j is a change point,

(2.2)

where j = 1, 2, …, n − 1.

For the normal mean model where F_j ~ N(θ_j, σ²), H₀(j) and H₁(j) correspond to β_j ≡ θ_j₊₁ − θ_j = 0 and β_j ≠ 0, respectively. One can use z_j ≡ y_j₊₁ − y_j as the statistic for testing H₀(j) against H₁(j). However, its power may be limited by the fact that it does not fully utilize the sparse and sequential structures. Observing that the change points are apart from each other in many applications, we reformulate the hypotheses.

First, suppose that the minimal distance between two change points is at least h and modify (2.2) to

\begin{array}{l} H_{0} (j) : F_{j + 1 - h} = \dots = F_{j + h} vs \\ H_{1} (j) : F_{j + 1 - h} = \dots = F_{j} \neq F_{j + 1} = \dots = F_{j + h} . \end{array}

(2.3)

Second, it is impossible to recover the true change-point locations exactly in any reasonable asymptotic setting; see Section 4 for details on this aspect. For an FDR theory, we cannot treat the testing problems in (2.3) independently in terms of “true or false” as in the classical framework, so a relaxed version of FDR is introduced.

Definition 1

Suppose that τ̂ = (τ̂₁, …, τ̂_Ĵ)^T is the change-point location estimator from some procedure such that $min_{1 \leq i \leq \hat{J}} ({\hat{τ}}_{i} - {\hat{τ}}_{i - 1}) \geq 2 h$ . Then H₀(j) is rejected for all j ∈ τ̂. We define τ̂_i as a true positive if there exists a true change point τ_{i′} such that |τ_{i′} − τ̂_i| < h; otherwise, τ̂_i is a false positive. The false discovery proportion (FDP) is the number of false positives divided by Ĵ. The FDR is defined as E(FDP).
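Definition 1 translates directly into a short computation. The sketch below is only an illustration; the function name `false_discovery_proportion` is ours, and the convention FDP = 0 when nothing is rejected is an assumption borrowed from the usual multiple testing literature rather than something stated in the definition.

```python
def false_discovery_proportion(estimates, true_points, h):
    """FDP in the sense of Definition 1.

    estimates: estimated change-point locations (tau-hat).
    true_points: true change-point locations (tau).
    An estimate is a true positive if some true point lies strictly within h of it.
    """
    if not estimates:
        return 0.0  # assumed convention when there are no rejections
    false_positives = sum(
        1
        for t_hat in estimates
        if not any(abs(t - t_hat) < h for t in true_points)
    )
    return false_positives / len(estimates)

# 100 is within h = 10 of the true point 103; 205 and 400 match nothing.
fdp = false_discovery_proportion([100, 205, 400], [103, 300], h=10)
```

Averaging this quantity over replications gives a Monte Carlo estimate of the FDR.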

For such methods as CBS and ℓ₁ penalization, it is not straightforward to evaluate the quantity τ̂_i − τ_i, or the FDR. In the next two subsections, we recall the SaRa and establish an FDR theory for the normal mean change-point model via the SaRa.

2.2 The screening and ranking algorithm

The SaRa was proposed by Niu and Zhang (2010) to detect change points in the normal mean model (1.1). For a position j, they considered the locally defined statistic,

D_{h} (j) = (\sum_{k = j + 1}^{j + h} Y_{k} - \sum_{k = j - h + 1}^{j} Y_{k}) / h,

(2.4)

where h ≪ n. Intuitively, if j is a local maximizer of |D_h(·)| and |D_h(j)| is quite large, it is likely that there is a change point at or around j. Therefore, it is reasonable to consider all local maximizers of |D_h(·)| and search for change points among them. First, the SaRa calculates D_h(·) for all positions and finds all h-local maximizers of |D_h(·)|. Here, j is an h′-local maximizer of |D_h(·)| if |D_h(j)| ≥ |D_h(k)| for all k ∈ (j − h′, j + h′). One can take h′ different from h, although h′ = h was used as the default setting in Niu and Zhang (2010). Second, the SaRa estimator is obtained by applying a thresholding rule |D_h(·)| > λ to all local maxima. Thus

J_{h, λ} = {{\hat{τ}}_{i} : {\hat{τ}}_{i} is a local maximizer of ∣ D_{h} (\cdot) ∣, and ∣ D_{h} ({\hat{τ}}_{i}) ∣ > λ}

is the SaRa location estimator. Let τ̂ = (τ̂₁, …, τ̂_Ĵ)^T, where τ̂₁ < τ̂₂ < · · · < τ̂_Ĵ and Ĵ = |J_{h, λ}|.
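The screening and thresholding steps can be sketched in a few lines of Python. This is a minimal illustration under our own naming (`local_diagnostic`, `sara`), not the implementation of Niu and Zhang (2010). Positions are 1-based as in the text, and D_h(j) is only computed where both windows fit inside the sequence. Cumulative sums make the computation of D_h linear in n; the naive local-maximum scan used here costs O(nh′), and a sliding-window maximum would be needed to keep the whole procedure linear.

```python
from itertools import accumulate

def local_diagnostic(y, h):
    """Return {j: D_h(j)} for 1-based positions j with h <= j <= n - h, as in (2.4)."""
    n = len(y)
    csum = [0.0] + list(accumulate(y))      # csum[k] = y_1 + ... + y_k
    d = {}
    for j in range(h, n - h + 1):
        right = csum[j + h] - csum[j]       # y_{j+1} + ... + y_{j+h}
        left = csum[j] - csum[j - h]        # y_{j-h+1} + ... + y_j
        d[j] = (right - left) / h
    return d

def sara(y, h, lam, h_local=None):
    """Return the h'-local maximizers j of |D_h| that satisfy |D_h(j)| > lam."""
    h_local = h if h_local is None else h_local   # default h' = h
    d = local_diagnostic(y, h)
    estimates = []
    for j, dj in d.items():
        window = range(j - h_local + 1, j + h_local)   # integers in (j - h', j + h')
        is_local_max = all(abs(dj) >= abs(d[k]) for k in window if k in d)
        if is_local_max and abs(dj) > lam:
            estimates.append(j)
    return estimates

# Noiseless toy sequence with a single jump of size 3 after position 20.
y_toy = [0.0] * 20 + [3.0] * 20
tau_hat = sara(y_toy, h=5, lam=1.5)
```

In this noiseless toy case |D₅(j)| peaks at j = 20 with value 3, so the single change point is recovered exactly; with noise, the estimate is only expected to fall near the true location.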

An important observation is that the local statistic D_h(j) employed in the SaRa is the natural test statistic for the multiple testing problem (2.3) when the F_j ’s are normal. If we define $η_{j}^{(h)} = E (∣ D_{h} (j) ∣)$ , then H₀ and H₁ correspond to $η_{j}^{(h)} = 0$ and $η_{j}^{(h)}$ is a local maximum, respectively. We illustrate how to combine the SaRa with our extended FDR criterion to establish an FDR theory below.

2.3 A false discovery rate approach via the SaRa

First, we restate the SaRa procedure from the viewpoint of hypothesis testing for the general setting (2.3). Let D(j) be a test statistic that can be used to test (2.3), and p(j) be the corresponding P-value when available. Without loss of generality, we assume that larger values of D tend to support the alternative hypothesis. The SaRa is as follows. In Step 1, calculate the test statistic D(j), or the P-value p(j) for each j. Then pick out all local maximizers of D(j) or, equivalently, local minimizers of p(j). Here, for technical reasons, we temporarily use the h′-local extremizer for h′ = 2h. For example, we call j^* a local minimizer of p(·) if

p (j^{*}) \leq p (j) for all j \in (j^{*} - 2 h, j^{*} + 2 h) .

(2.5)

We denote the set of the local extremizers by LM. In Step 2, the SaRa estimator {τ̂₁ < · · · < τ̂_Ĵ} ⊂ LM for the locations of change points is defined by a thresholding rule

D (j) > λ or p (j) < p^{*} .

We can rank the elements in LM by their D values or P-values and get a solution path.

From now on, we use the thresholding rule p(j) < p^* to conform with the literature on FDR theory, although calculating D may be more convenient in practice. An advantage of the SaRa is that we can easily control the minimal distance min_i{τ̂_i − τ̂_i₋₁} of the SaRa estimator τ̂ by choosing a proper neighborhood when defining the local extremizer. Under (2.5), for any j₁, j₂ ∈ LM with j₁ ≠ j₂, we have |j₁ − j₂| > 2h. Because the P-values are derived from local statistics, p(j₁) and p(j₂) are independent. Therefore, we obtain a sequence of independent P-values {p(j) | j ∈ LM}. When the null distribution, F₀, of the P-value at a local minimizer is known or can be estimated accurately, the standard FDR control procedure can be applied to the P-value sequence directly. Specifically, suppose F₀ is known and consider the set of modified P-values

{F_{0}^{- 1} (p (j)) ∣ j \in L M} = {{\tilde{p}}_{(1)} < {\tilde{p}}_{(2)} < \dots < {\tilde{p}}_{(m)}},

where m = |LM|. Under the null hypothesis H₀(j), $F_{0}^{- 1} (p (j)) ~ Uniform (0, 1)$ . By the Benjamini–Hochberg procedure (Benjamini and Hochberg (1995)), for a target FDR q^*, let k be the largest i for which ${\tilde{p}}_{(i)} \leq \frac{i}{m} q^{*}$ , and then reject all H₀(j) corresponding to p̃₍₁₎, …, p̃₍k₎. In other words, to determine the SaRa estimator from the set LM, a thresholding rule

p (j) \leq p^{*} = F_{0} ({\tilde{p}}_{(k)})

(2.6)

is used in order to control the FDR at target rate q^*. Following the results in Benjamini and Hochberg (1995), and Benjamini and Yekutieli (2001) directly, we have our result.

Theorem 1

If the SaRa procedure with the thresholding rule (2.6) is applied, then the FDR of Definition 1 is controlled at a level less than or equal to $\frac{n - J}{n} q^{*} \approx q^{*}$ .
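The step-up rule used in (2.6) is the standard Benjamini–Hochberg procedure applied to the modified P-values. A minimal sketch follows; the function name `bh_cutoff` and the example P-values are hypothetical and serve only to show the mechanics.

```python
def bh_cutoff(p_tilde, q):
    """Benjamini-Hochberg step-up rule.

    p_tilde: modified P-values, assumed independent and uniform under the null.
    q: target FDR.
    Returns the largest ordered value p_(k) with p_(k) <= (k / m) * q,
    or None when no hypothesis is rejected.
    """
    m = len(p_tilde)
    cutoff = None
    for i, p in enumerate(sorted(p_tilde), start=1):
        if p <= i / m * q:
            cutoff = p          # keep the largest i that satisfies the bound
    return cutoff

# Hypothetical modified P-values at m = 10 local minimizers.
p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
cutoff = bh_cutoff(p_example, q=0.05)
rejected = [p for p in p_example if cutoff is not None and p <= cutoff]
```

With these values only the two smallest P-values fall below their step-up bounds (0.005 and 0.010), so two local minimizers would be declared change points.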

By definition, F₀ is the distribution of p(j) under H₀(j) provided p(j) is a local minimizer. Given a change-point model, F₀ depends only on the bandwidth h in (2.4) and (2.5). Although it is usually unknown, the distribution F₀ can be estimated in many situations. For the normal mean change-point model, we can generate a long sequence of i.i.d. N(0, σ²) random variables and find the empirical distribution F̂₀. In practice, σ² is unknown but can be estimated accurately because J ≪ n. Without normality, we can use permutation methods. Let ν be the index set {1, 2, …, n} and π be a permutation of ν. Take $y_{π (ν)} = {(y_{π (1)}, \dots, y_{π (n)})}^{T}$ . If θ is sparse, the empirical distribution of the local minima of p(j) can be calculated from a sequence of permutations $y_{π_{1} (ν)}, \dots, y_{π_{N} (ν)}$ . Otherwise, we can estimate θ by local regression and apply permutation to the residuals.
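For the normal case, the Monte Carlo estimate of F₀ can be sketched as follows. This is an illustration under several assumptions of ours: the function names are hypothetical, σ = 1 is used (the P-value does not depend on σ once D_h is standardized), windows are truncated at the ends of the sequence, and the corrected P-value is taken to be the empirical null distribution function evaluated at p(j), which is our reading of the modified P-values in Section 2.3.

```python
import math
import random
from bisect import bisect_right
from itertools import accumulate

def null_local_min_pvalues(n, h, seed=0):
    """P-values at the 2h-local minimizers of p(.) in an i.i.d. N(0, 1) sequence."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    csum = [0.0] + list(accumulate(y))
    scale = math.sqrt(2.0 / h)              # standard deviation of D_h(j) when sigma = 1
    # z[i] = |D_h(j)| / sqrt(2 / h) for j = h + i
    z = [abs(csum[j + h] - 2.0 * csum[j] + csum[j - h]) / h / scale
         for j in range(h, n - h + 1)]
    pvals = []
    for i, zi in enumerate(z):
        lo, hi = max(0, i - 2 * h + 1), min(len(z), i + 2 * h)
        if zi >= max(z[lo:hi]):             # local maximizer of |D_h| = local minimizer of p
            pvals.append(math.erfc(zi / math.sqrt(2.0)))   # equals 2 * (1 - Phi(zi))
    return sorted(pvals)

def empirical_cdf(sorted_sample):
    """Return x -> fraction of the sample that is <= x."""
    m = len(sorted_sample)
    return lambda x: bisect_right(sorted_sample, x) / m

null_sample = null_local_min_pvalues(n=5000, h=7, seed=1)
F0_hat = empirical_cdf(null_sample)
corrected = F0_hat(0.01)    # corrected P-value for an observed local-minimum p(j) = 0.01
```

A longer null sequence gives a smoother estimate; the length 5000 here is chosen only to keep the example fast.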

3. Numerical study

The contemporary genome-wide SNP genotyping array techniques, which can measure half a million SNPs along the whole genome, offer a more sensitive approach to CNV detection, compared to the aCGH techniques. For each subject, the SNP genotyping data usually consist of a sequence of the measurements of Log R ratios. The segments with concentrated high or low Log R ratios correspond to gains or losses of copy numbers. See Peiffer et al. (2006) for more details. The SaRa was developed to analyze such data. The numerical properties of the SaRa have been investigated intensively by simulation and for data examples in Niu and Zhang (2010), so we focus on the FDR theory in this section.

3.1. An application to SNP genotyping data

We illustrate the SaRa using SNP genotyping data for a father-mother-offspring trio produced by the Illumina 550K platform, which are available at http://www.openbioinformatics.org/penncnv/.

For each subject, the Log R ratios along Chromosomes 3, 11, and 20 are included in the data set. There are 37768, 27272, and 14296 SNPs on Chromosomes 3, 11, and 20, respectively. In Figure 3.1, we plot the sequence of Log R ratios along Chromosome 11 for the father. Since there are 27272 points, it is very difficult to eyeball the changes. In Figure 3.2, we zoom in on a short interval where a CNV was detected by the SaRa.

Figure 3.1 — Plot of Log R ratios along Chromosome 11 of the subject father.

Figure 3.2 — The CNVs found by the SaRa.

Before we applied the SaRa, we found the Log R ratios to be approximately normal. We chose the bandwidth parameter h = 7 and threshold p^* by controlling the FDR at 5%, 10%, and 15%, respectively. Specifically, we calculated the local statistic D_h(·) and $p (j) = 2 (1 - Φ (\frac{∣ D_{7} (j) ∣}{\hat{σ} \sqrt{2 / h}}))$ , where σ̂ can be easily estimated because of the sparsity of the change points. Note that under the normality assumption the distribution F₀ is independent of σ². Therefore, F₀ can be approximated empirically by the distribution of the local minimizers in a long i.i.d. standard normal sequence. Then we calculated the corrected P-values and applied the Benjamini-Hochberg procedure to determine the threshold. The numbers of change points detected by the SaRa on Chromosome 11 for the father, mother and offspring are listed in Table 3.1. Because the CNVs are believed to be short, spanning dozens of SNPs for our data (Zhang et al. (2009), Jeng et al. (2010)), the intervals that are flanked by two relatively near change points are more likely to be CNVs. On the other hand, it is reasonable to treat isolated change points as false positives. In Table 3.1, we report the number of changes detected by the SaRa, requiring each suggested CNV to be flanked by two change points within 200 SNPs. Therefore, the number of suggested CNVs in Table 3.1 is smaller than the number of adjacent change points. In particular, on Chromosome 11 of the father, the SaRa suggested two short CNVs as plotted in Figure 3.2. The CNV in Figure 3.2(a) was also detected by PennCNV, although the one in Figure 3.2(b) was not detected before.
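The P-value formula above is straightforward to evaluate. In the sketch below, `two_sided_pvalue` follows the displayed formula, while `robust_sigma` is only one possible way to estimate σ when change points are sparse: it uses the median absolute deviation of first differences, which is our assumption and not necessarily the estimator used for Table 3.1.

```python
import math
import statistics

def robust_sigma(y):
    """Estimate sigma from first differences (an assumed choice of estimator).

    Away from change points, y_{i+1} - y_i ~ N(0, 2 sigma^2), and the median
    absolute deviation is barely affected by the few differences that
    straddle a change point.
    """
    diffs = [b - a for a, b in zip(y, y[1:])]
    med = statistics.median(diffs)
    mad = statistics.median(abs(d - med) for d in diffs)
    return mad / (0.6745 * math.sqrt(2.0))   # 0.6745 is roughly the N(0,1) upper quartile

def two_sided_pvalue(d, h, sigma):
    """p(j) = 2 * (1 - Phi(|D_h(j)| / (sigma * sqrt(2 / h))))."""
    z = abs(d) / (sigma * math.sqrt(2.0 / h))
    return math.erfc(z / math.sqrt(2.0))     # erfc(z / sqrt(2)) = 2 * (1 - Phi(z))

# With h = 7 and sigma = 1, a local statistic equal to 1.96 * sqrt(2 / 7)
# corresponds to a two-sided P-value of about 0.05.
p = two_sided_pvalue(1.96 * math.sqrt(2.0 / 7.0), h=7, sigma=1.0)
```

These raw P-values are the ones that enter the local-minimizer screening before any correction by F₀.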

Table 3.1.

Number of change points and CNVs detected by the SaRa on Chromosome 11 of the father, mother, and offspring. (A CNV is suggested only when two adjacent change points are within 200 SNPs).

Source	q=0.05		q=0.10		q=0.15

	Change Points	CNV	Change Points	CNV	Change Points	CNV
father	2	1	9	2	9	2
mother	4	1	5	1	5	1
offspring	3	1	3	1	4	1


3.2 A simulation example on the FDR control

We tested our theory on FDR control through an example. Consider model (1.1) with n = 30000 and σ = 1. We set J = 50 and drew 50 change points uniformly between 1 and 30000 that were multiples of 5, to get τ = (650, 855, …, 29630)^T. Here $L = min_{j} (τ_{j + 1} - τ_{j}) = 15$ . We designed the mean vector θ by letting θ_i = 0 when $τ_{2 j} \leq i \leq τ_{2 j + 1}$ , and θ_i = δ otherwise, where δ = 1.5 or 3.

We set the SaRa with h = 10, 20, and 30, and chose threshold p^* by controlling FDR via the Benjamini-Hochberg procedure BH(q) with q = 0.05, 0.10, 0.15. We counted τ̂_k as a false positive if there was no τ_j such that

∣ {\hat{τ}}_{k} - τ_{j} ∣ < 10.

(3.1)

Otherwise, τ̂_k was counted as a true positive. We present our results in Table 3.2. The average false discovery proportion was close to the target FDR, suggesting that our methodology worked well. In particular, when the signal was strong (δ = 3), the SaRa with bandwidth h = 10 achieved the best performance. The SaRa with bandwidths h = 20 and 30 also performed well. However, because there were two true change points at positions 11070 and 11085, it was difficult to detect both using a large bandwidth. When the signal was weak (δ = 1.5), the SaRa with h = 10 was less powerful: although the FDR was controlled, only a small portion of the true change points was detected. The SaRa with larger bandwidths was more powerful. When the bandwidth was too large (h = 30), the FDP was somewhat greater than the target FDR. There were two reasons. First, our assumption h < L was not satisfied for large h; second, the rule in (3.1) for counting true positives was too strict.

Table 3.2.

The average estimated number of change points Ĵ, true positives (TP), and false discovery proportion (FDP). The results were based on 100 replications.

	q=0.05			q=0.10			q=0.15
	Ĵ	TP	FDP	Ĵ	TP	FDP	Ĵ	TP	FDP
δ = 1.5, h = 10	3.700	3.520	0.4%	20.860	19.130	7.6%	27.690	23.640	13.6%
δ = 1.5, h = 20	45.730	43.600	4.5%	50.710	45.600	9.9%	54.620	46.560	14.5%
δ = 1.5, h = 30	50.580	47.130	6.7%	53.800	47.380	11.7%	56.740	47.460	16.1%
δ = 3, h = 10	51.500	49.920	3.0%	53.680	49.970	6.7%	57.040	49.980	12.1%
δ = 3, h = 20	50.380	49.070	2.5%	52.820	49.070	7.0%	55.000	49.070	10.6%
δ = 3, h = 30	50.770	48.650	4.1%	53.000	48.650	8.0%	55.490	48.650	12.1%


4. Optimality of the screening and ranking algorithm

We start this section with a brief summary of results on change-point detection when the number of change points is at most 2.

Among all change-point problems, the simplest case is the detection of the mean change for a sequence of independent Gaussian random variables with common variance. This has been investigated by Page (1955, 1957), Chernoff and Zacks (1964), Gardner (1969), and Sen and Srivastava (1975), among others. If Y_i ~ N(θ_i, σ²) for i = 1,…, n, the change-point problem can be stated as the hypothesis test:

H_{0} : θ_{1} = θ_{2} = \dots = θ_{n}; vs H_{1} : θ_{1} = \dots = θ_{j} \neq θ_{j + 1} = \dots = θ_{n} .

For simplicity, we assume σ² is known and set to 1. We refer to Csörgö and Horváth (1997), Chen and Gupta (2000), and the references therein for the unknown variance and non-Gaussian cases. Among all methods, the likelihood ratio test is popular. Consider the change point at j, and let Λ_j denote the likelihood ratio statistic. Then,

- 2 log Λ_{j} = {({\bar{Y}}_{j +} - {\bar{Y}}_{j -})}^{2} / [j^{- 1} + {(n - j)}^{- 1}],

(4.1)

where ${\bar{Y}}_{j -} = \sum_{k = 1}^{j} Y_{k} / j$ and ${\bar{Y}}_{j +} = \sum_{k = j + 1}^{n} Y_{k} / (n - j)$ .

When j is unknown, a commonly used test statistic is $T_{1} = max_{1 \leq j \leq n - 1} (- 2 log Λ_{j})$ , and

\hat{j} = \underset{1 \leq j \leq n - 1}{argmax} (- 2 log Λ_{j})

(4.2)

serves as the location estimator.
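For reference, the statistic (4.1) and the estimator (4.2) can be computed in one pass with cumulative sums. The sketch below assumes σ² = 1, as in the text; the function name `lr_change_point` is ours.

```python
from itertools import accumulate

def lr_change_point(y):
    """Return (j_hat, T_1) with T_1 = max_j (-2 log Lambda_j), assuming sigma^2 = 1.

    Positions are 1-based: j_hat = j means the mean changes between y_j and y_{j+1}.
    """
    n = len(y)
    csum = [0.0] + list(accumulate(y))       # csum[k] = y_1 + ... + y_k
    best_j, best_stat = None, float("-inf")
    for j in range(1, n):
        mean_left = csum[j] / j
        mean_right = (csum[n] - csum[j]) / (n - j)
        stat = (mean_right - mean_left) ** 2 / (1.0 / j + 1.0 / (n - j))
        if stat > best_stat:
            best_j, best_stat = j, stat
    return best_j, best_stat

# Noiseless toy sequence: the mean jumps from 0 to 2 after position 6.
j_hat, t1 = lr_change_point([0.0] * 6 + [2.0] * 4)
```

In this toy case the maximum is attained at j = 6, where −2 log Λ₆ = 4 / (1/6 + 1/4) = 9.6. Unlike D_h(j), every evaluation of (4.1) uses all n observations.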

The distribution of the statistic T₁ is quite complicated. However, as in Csörgö and Horváth (1997), under the null hypothesis the limiting distribution of $\sqrt{T_{1} (n)}$ satisfies

lim_{n \to \infty} P {a_{n} \sqrt{T_{1} (n)} - b_{n} \leq t} = exp (- 2 e^{- t}) for all t,

(4.3)

where $a_{n} = \sqrt{2 log log n}$ and $b_{n} = 2 log log n + \frac{1}{2} log log log n - \frac{1}{2} log π$ .
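The limit law (4.3) can be inverted to obtain an approximate critical value for √T₁: solving exp(−2e^{−t}) = 1 − α for t and undoing the normalization gives (t + b_n)/a_n. The sketch below does exactly this; the function name `lr_critical_value` is ours, and because extreme-value approximations of this kind typically converge slowly, the result should be read as a rough guide rather than an exact finite-sample threshold.

```python
import math

def lr_critical_value(n, alpha):
    """Approximate level-alpha critical value for sqrt(T_1) implied by (4.3)."""
    loglog = math.log(math.log(n))
    a_n = math.sqrt(2.0 * loglog)
    b_n = 2.0 * loglog + 0.5 * math.log(loglog) - 0.5 * math.log(math.pi)
    # Solve exp(-2 * exp(-t)) = 1 - alpha for t.
    t = -math.log(-0.5 * math.log(1.0 - alpha))
    return (t + b_n) / a_n

# Sequence length used in the simulation of Section 3.2.
c = lr_critical_value(n=30000, alpha=0.05)
```

For n = 30000 and α = 0.05 this gives a threshold of roughly 3.8 on the √T₁ scale.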

If the experiment is designed in a way such that, at the true change point position j(n) with the mean change of δ(n) = θ_j − θ_j₊₁, either of the following two conditions holds

0 < \frac{j (n)}{n} \to t < 1, δ (n) \to 0, with lim_{n \to \infty} \frac{n δ^{2}}{log log n} = \infty;

(4.4)

\frac{j (n)}{n} \to 0, δ (n) \to 0, with lim_{n \to \infty} \frac{j (n) δ^{2}}{log log n} = \infty,

(4.5)

it is shown in Csörgö and Horváth (1997) that

δ^{2} ∣ \hat{j} - j ∣ = O_{P} (1) .

(4.6)

In the less challenging case that δ(n) → c > 0, it can be shown that |ĵ − j| = O_P (1), implying $∣ \frac{\hat{j}}{n} - \frac{j}{n} ∣ = O_{P} (\frac{1}{n})$ . If all data points are collected from a fixed interval (e.g. [0, 1] or a chromosome), this implies a convergence rate of 1/n. From (4.3), it is easy to see that (4.4) and (4.5) cannot be relaxed further. Otherwise, the signal is too weak to be detectable.

The second simplest case for the change-point problem has two change points with an epidemic alternative (Yao (1993), Arias-Castro et al. (2005)). Specifically, test

\begin{array}{l} H_{0} : θ_{1} = \dots = θ_{n} = θ; vs \\ H_{1} : θ_{1} = \dots = θ_{l} = θ_{r + 1} = \dots = θ_{n} = θ, θ_{l + 1} = \dots = θ_{r} = θ + δ . \end{array}

(4.7)

Arias-Castro et al. (2005) showed that no method can detect the segment reliably if δ²(r − l) < 2 log n, which offers a benchmark for necessary conditions to solve the general multiple change-point problems. In the next two subsections, we derive the theoretical properties of the SaRa.

4.1 Normal case

Consider the normal mean change-point model (1.1) and define

L = min_{1 \leq j \leq J + 1} (τ_{j} - τ_{j - 1}), δ = min_{1 \leq j \leq J} ∣ θ_{τ_{j} + 1} - θ_{τ_{j}} ∣ / σ,

where J, τ, θ and σ depend on n. Here L is the minimal distance among all change points, and δ measures the ratio of the minimum jump size to the standard deviation at change points. The key quantity that reflects the strength of the signal is S² = δ²L. When the signal is too weak, it is not distinguishable from the noise and cannot be recovered by any method. Here we consider the setting

S^{2} = δ^{2} L > 32 log n .

(4.8)

In the conventional setting of multiple change-point analysis, it is usually assumed that J is constant and τ/n converges to a constant vector t as n → ∞, which implies S² = cn ≫ log n. However, in some applications, L is quite small compared to n. Hence it is useful to consider the more flexible condition in (4.8).

Theorem 2

Under (4.8), there exist h = h(n) and λ = λ(n) such that the SaRa estimator J_{h, λ} = {τ̂₁, · · ·, τ̂_Ĵ} satisfies

lim_{n \to \infty} P ({\hat{J} = J}) = 1;

Moreover, conditional on Ĵ = J,

δ^{2} ({\hat{τ}}_{i} - τ_{i}) = O_{P} (1) .

(4.9)

In particular, taking h = L/2 and λ = δσ/2, we have

P ({\hat{J} = J} \cap \underset{i}{\cap} {∣ {\hat{τ}}_{i} - τ_{i} ∣ < h}) > 1 - \frac{4 \sqrt{2}}{\sqrt{π}} S^{- 1} exp {log n - S^{2} / 32} .

Remark 1

In Niu and Zhang (2010), it was shown that

lim_{n \to \infty} P ({\hat{J} = J} \cap \underset{i}{\cap} {∣ {\hat{τ}}_{i} - τ_{i} ∣ < h}) = 1.

(4.10)

This is called the sure coverage property since it basically says that the true change-point locations τ_i are covered by the neighborhoods (τ̂_i − h, τ̂_i + h) with probability tending to one.

Remark 2

The main improvement from Theorem 2 is the convergence rate (4.9), the same as the convergence rate of the likelihood ratio estimator in the single change-point case.

Remark 3

The choices of h and λ in Theorem 2 may not be optimal in general, especially when S² ≫ log n. Basically, it is enough to use an h that is slightly greater than 16 log n/δ².
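To get a feel for condition (4.8), the explicit bound in Theorem 2, and the bandwidth suggested in Remark 3, one can evaluate them numerically. The helper names below are ours, and the parameter values are borrowed from the simulation setting of Section 3.2 purely as an illustration.

```python
import math

def signal_condition_holds(n, delta, L):
    """Check condition (4.8): S^2 = delta^2 * L > 32 log n."""
    return delta ** 2 * L > 32.0 * math.log(n)

def coverage_lower_bound(n, delta, L):
    """Lower bound of Theorem 2 for h = L / 2 and lambda = delta * sigma / 2."""
    s2 = delta ** 2 * L
    coef = 4.0 * math.sqrt(2.0) / math.sqrt(math.pi)
    return 1.0 - coef / math.sqrt(s2) * math.exp(math.log(n) - s2 / 32.0)

def min_bandwidth(n, delta):
    """Smallest integer h strictly greater than 16 log n / delta^2 (Remark 3)."""
    return math.floor(16.0 * math.log(n) / delta ** 2) + 1

n, delta = 30000, 3.0
strong = coverage_lower_bound(n, delta, L=100)   # S^2 = 900, (4.8) holds
weak = coverage_lower_bound(n, delta, L=15)      # S^2 = 135, (4.8) fails
h_min = min_bandwidth(n, delta)
```

With n = 30000 and δ = 3, a spacing of L = 100 gives a bound extremely close to one, while L = 15 violates (4.8) and the bound becomes negative, hence vacuous. Note that a vacuous bound does not by itself mean that detection fails in practice.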

Corollary 1

If there is only one change point at position τ ≤ n/2 with δ²τ > 32 log n, then there exists h such that τ̂ = argmax_j |D_h(j)| satisfies δ²(τ̂ − τ) = O_P (1).

Note that in this corollary, the convergence rate of the location estimator τ̂ is the same as for the likelihood ratio estimator. The assumption δ²τ > 32 log n is slightly stronger than (4.5), since the likelihood ratio test statistic −2 log Λ_j involves all the data points while the local statistic D_h(j) uses only 2h data points, where h is usually O(log n/δ²).

Remark 4

The assumption (4.8) cannot be relaxed further except for the constant 32. For example, in the epidemic case (4.7) even if we know that there exists, at most, one segment whose mean is shifted from zero, Arias-Castro et al. (2005) suggested that (4.8) is required except for the constant. In our setting, the sparsity of θ and number of change points are not restricted. Therefore, we do not expect that the 32 can be significantly improved. We conclude that the SaRa is a nearly optimal procedure.

4.2 Beyond normality

Suppose

y_{i} = θ_{i} + ε_{i}, i = 1, \dots, n,

(4.11)

where the mean vector θ = (θ₁, …, θ_n)^T is a piecewise constant with change points at 0 < τ₁ < · · · < τ_J < n, and the noise is not necessarily normal but satisfies the following.

(C0)
The noises are i.i.d. with E(ε_i) = 0 and Var(ε_i) = σ² < ∞. Moreover, the density function of ε_i is symmetric. With
$L = min_{1 \leq j \leq J + 1} (τ_{j} - τ_{j - 1}), δ = min_{1 \leq j \leq J} ∣ θ_{τ_{j} + 1} - θ_{τ_{j}} ∣ / σ,$

and, for easy presentation, σ² = 1, we consider three cases:
(C1)
ε_i ~ N(0, 1), S² = δ²L > 32 log n.
(C2)
$E (exp \frac{∣ ε_{i} ∣}{a}) \leq b, \frac{δ^{2} L}{8 {a b}^{2} + 2 a δ} ≫ log n$ .
(C3)
E(|ε_i|^t) = m_t < ∞ for some t > 2, δ^tL^t⁻¹ ≫ n and δ²L ≫ log n.

The case (C1) has been characterized in Theorem 2. We give results for the other two cases.

Theorem 3

If (C0) and one of (C2) and (C3) hold, there exist h = h(n) and λ = λ(n) such that the SaRa estimator J_{h, λ} = {τ̂₁, · · ·, τ̂_Ĵ} satisfies

lim_{n \to \infty} P ({\hat{J} = J} \cap \underset{i}{\cap} {∣ {\hat{τ}}_{i} - τ_{i} ∣ < h}) = 1;

In particular, taking h = L/2 and λ = δ/2, we have, under (C0) and (C2),

P ({\hat{J} = J} \cap \underset{i}{\cap} {∣ {\hat{τ}}_{i} - τ_{i} ∣ < h}) > 1 - 2 exp {log n - \frac{δ^{2} L}{64 a^{2} b + 8 a δ}},

and, under (C0) and (C3),

P ({\hat{J} = J} \cap \underset{i}{\cap} {∣ {\hat{τ}}_{i} - τ_{i} ∣ < h}) > 1 - C_{1} \frac{n}{δ^{t} L^{t - 1}} - exp {log n - C_{2} δ^{2} L},

where $C_{1} = \frac{1}{2} m_{t} {(4 + \frac{8}{t})}^{t}$ and $C_{2} = \frac{1}{8} {(t + 2)}^{- 2} e^{- t}$ .

5. Discussion

Change-point detection is a classic problem with emerging applications across a spectrum of fields from finance to engineering and bioscience. In this work, we focused on the detection of multiple change points, arising from the need to identify CNVs from genomic data. We invoked the recently developed algorithm SaRa of Niu and Zhang (2010), evaluating it as the only algorithm that is shown to reach optimal computational complexity and to possess the sure coverage property. We introduced a concept of false discovery proportion to address the specific setting in the detection of multiple change points, and established FDR theory for the dependent, sequential data pertinent to CNV. We proved a stronger sure coverage property and showed that the convergence rate is optimal, and the same as the existing one for the detection of a single change point. When we applied our procedure to a well-known data set, we confirmed a known CNV and detected a new one, highlighting the potential of the method to detect CNVs that are difficult to detect by existing methods.

Acknowledgments

Financial support from the University of Arizona Internal grant and National Institute on Drug Abuse grant R01–DA016570 is gratefully acknowledged.

Appendix. Proofs

In this appendix, we prove Theorems 2 and 3, and Corollary 1. Although Corollary 1 is implied directly by Theorem 2, providing its proof might help readers understand the technique.

A direct proof of Corollary 1

First note that the condition τ ≤ n/2 does not put any restrictions on the position of the change point, by symmetry. The only condition we need is that the position τ is not too close to the boundary, which is guaranteed by δ²τ > 32 log n.

Without loss of generality, we assume σ² = 1 and $θ_{τ + 1} - θ_{τ} > 0$. Fix an integer h such that 16 log n/δ² < h < τ; such an h exists because δ²τ > 32 log n. By definition,

D_{h} (τ) = \frac{1}{h} (\sum_{k = τ + 1}^{τ + h} Y_{k} - \sum_{k = τ - h + 1}^{τ} Y_{k}) = δ + \frac{1}{h} (\sum_{k = τ + 1}^{τ + h} ε_{k} - \sum_{k = τ - h + 1}^{τ} ε_{k}) \sim N (δ, \frac{2}{h}) .

We consider the local statistic D_h at points to the right of τ and study the behavior of the estimator $\hat{τ} = \arg\max_{j \geq τ} D_{h} (j)$. Let $W_{m} = ε_{τ + h + m} - 2 ε_{τ + m} + ε_{τ - h + m}$, m = 1, 2, …, h. Note the following.

1) For m ≥ 2h, $D_{h} (τ + m) \sim N (0, \frac{2}{h})$ is independent of D_h(τ).
2) For h < m < 2h, $D_{h} (τ + m) \sim N (0, \frac{2}{h})$ and $corr (D_{h} (τ + m), D_{h} (τ)) = - \frac{2 h - m}{2 h}$ .
3) For m ≤ h, $D_{h} (τ + m) \sim N (\frac{h - m}{h} δ, \frac{2}{h})$ and $D_{h} (τ + m) - D_{h} (τ) = - \frac{m}{h} δ + \frac{1}{h} \sum_{ℓ = 1}^{m} W_{ℓ}$ .
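The local statistic D_h can be computed for all positions in linear time with prefix sums. The following is a minimal sketch, not the authors' implementation; the function name `local_diff` and the toy noise-free sequence are assumptions made for illustration. With zero noise, the drift term −(m/h)δ in D_h(τ + m) − D_h(τ) is visible directly.

```python
from itertools import accumulate


def local_diff(y, h):
    """Return {j: D_h(j)} for 1-based positions j = h, ..., n - h.

    D_h(j) = (mean of y_{j+1}..y_{j+h}) - (mean of y_{j-h+1}..y_j).
    """
    n = len(y)
    prefix = [0.0] + list(accumulate(y))  # prefix[k] = y_1 + ... + y_k
    stats = {}
    for j in range(h, n - h + 1):
        right = prefix[j + h] - prefix[j]   # sum over j+1 .. j+h
        left = prefix[j] - prefix[j - h]    # sum over j-h+1 .. j
        stats[j] = (right - left) / h
    return stats


# Hypothetical noise-free step: mean 0 up to tau = 10, then mean delta = 2.
tau, delta, h = 10, 2.0, 4
y = [0.0] * tau + [delta] * 10
d = local_diff(y, h)

# Without noise, D_h(tau + m) - D_h(tau) = -(m / h) * delta for 0 <= m <= h.
for m in range(h + 1):
    print(m, d[tau + m] - d[tau])
```

In this toy sequence the statistic peaks exactly at τ; with noise, the sum of the W_ℓ terms perturbs the drift, which is what the remainder of the proof controls.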

To show $\lim_{M \to \infty} \lim_{n \to \infty} P (δ^{2} |\hat{τ} - τ| < M) = 1$, it suffices to verify that

\lim_{M \to \infty} \lim_{n \to \infty} P (\max_{m > M / δ^{2}} \{D_{h} (τ + m) - D_{h} (τ)\} < 0) = 1.

(6.1)

Note that in (6.1), h > 16 log n/δ² ≫ M/δ² as n → ∞. Therefore, we can bound the probabilities

\begin{matrix} P_{1} = P (\max_{m > h} \{D_{h} (τ + m) - D_{h} (τ)\} < 0), \\ P_{2} = P (\max_{h \geq m > M / δ^{2}} \{D_{h} (τ + m) - D_{h} (τ)\} < 0) . \end{matrix}

By 1) and 2), we have

\begin{matrix} D_{h} (τ + m) - D_{h} (τ) \sim N (- δ, \frac{4}{h}), m \geq 2 h, \\ D_{h} (τ + m) - D_{h} (τ) \sim N (- δ, \frac{8 - 2 m / h}{h}), h < m < 2 h . \end{matrix}

For h < m < 2h, the variance is 2/h + 2/h + 2(2h − m)/h² by the correlation in 2), which is smaller than 6/h. Therefore,

\begin{array}{l} 1 - P_{1} = P (\max_{m > h} \{D_{h} (τ + m) - D_{h} (τ)\} > 0) \\ < \sum_{m = h + 1}^{2 h - 1} P (D_{h} (τ + m) - D_{h} (τ) > 0) + \sum_{m \geq 2 h} P (D_{h} (τ + m) - D_{h} (τ) > 0) \\ < (h - 1) Φ (- δ \sqrt{h} / \sqrt{6}) + (n - τ - 3 h + 1) Φ (- δ \sqrt{h} / 2) \\ < n Φ (- δ \sqrt{h} / \sqrt{6}) \\ < \sqrt{\frac{3}{π}} \frac{1}{δ \sqrt{h}} \exp \{\log n - \frac{δ^{2} h}{12}\} . \end{array}

(6.2)

In the last step, we have used the bound on the Gaussian tail probability, valid for t > 0,

Φ (- t) = 1 - Φ (t) < \frac{1}{\sqrt{2 π}} t^{- 1} e^{- \frac{1}{2} t^{2}} .

Since h > 16 log n/δ², we have δ²h > 12 log n, so P₁ → 1 as n → ∞.
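The Gaussian tail bound used in the last step can be checked numerically. The sketch below compares the exact tail Φ(−t), computed from the complementary error function, with the bound for a few values of t; the helper names `normal_tail` and `tail_bound` are introduced here for illustration only.

```python
import math


def normal_tail(t):
    """Phi(-t) = 1 - Phi(t) for the standard normal distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def tail_bound(t):
    """Upper bound exp(-t^2 / 2) / (t * sqrt(2 * pi)), valid for t > 0."""
    return math.exp(-0.5 * t * t) / (t * math.sqrt(2.0 * math.pi))


# The bound always lies above the exact tail and becomes tight as t grows.
for t in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(t, normal_tail(t), tail_bound(t), normal_tail(t) / tail_bound(t))
```

The ratio of the two columns approaches 1 for large t, so little is lost by using the bound in the regime where δ²h is large.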

To bound the second part, the Bonferroni inequality is not sharp enough. First, by 3),

P_{2} = P (\max_{h \geq m > M / δ^{2}} \{- \frac{m}{h} δ + \frac{1}{h} \sum_{ℓ = 1}^{m} W_{ℓ}\} < 0),

where W₁, …, W_h are i.i.d. N(0, 6). If B(t) is a standard Brownian motion, observe that

\begin{array}{l} P_{2} = P (\max_{h \geq m > M / δ^{2}} \{\frac{1}{m} \sum_{ℓ = 1}^{m} \frac{W_{ℓ}}{\sqrt{6}} - \frac{δ}{\sqrt{6}}\} < 0) \\ \geq P (\max_{t \geq M / δ^{2}} \{\frac{B (t)}{t} - \frac{δ}{\sqrt{6}}\} < 0) \\ = P (\max_{t \geq M / δ^{2}} \{\frac{t B (1 / t)}{t} - \frac{δ}{\sqrt{6}}\} < 0) \\ = P (\max_{t \leq δ^{2} / M} \{B (t) - \frac{δ}{\sqrt{6}}\} < 0) \\ = P (\max_{t \leq δ^{2} / M} \{δ B (t / δ^{2}) - \frac{δ}{\sqrt{6}}\} < 0) \\ = P (\max_{t \leq 1 / M} \{B (t) - \frac{1}{\sqrt{6}}\} < 0) \\ = P (\max_{t \leq 1 / M} B (t) < \frac{1}{\sqrt{6}}) . \end{array}
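The last probability has a closed form, which makes its limit in M explicit. The step below relies on the reflection principle for standard Brownian motion, a standard fact that is not spelled out in the original argument; the symbols a and T are introduced here for illustration.

```latex
% Reflection principle: for a > 0 and T > 0,
P\Bigl(\max_{0 \le t \le T} B(t) < a\Bigr)
  = 1 - 2\,P\bigl(B(T) \ge a\bigr)
  = 2\,\Phi\Bigl(\frac{a}{\sqrt{T}}\Bigr) - 1 .
% With T = 1/M and a = 1/\sqrt{6}:
P\Bigl(\max_{t \le 1/M} B(t) < \frac{1}{\sqrt{6}}\Bigr)
  = 2\,\Phi\Bigl(\sqrt{M/6}\Bigr) - 1
  \longrightarrow 1 \quad (M \to \infty).
```

The lower bound does not depend on n or δ, so the inner limit in n is immediate.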

Therefore, $\lim_{M \to \infty} \lim_{n \to \infty} P_{2} = 1$. By symmetry, the behavior of $\hat{τ} = \arg\max_{j \leq τ} D_{h} (j)$ is exactly the same. This completes the proof.

Proofs of Theorems 2 and 3

The only difference between Theorems 2 and 3 is that they assume different conditions on the noise distribution. We give a unified proof of the nonasymptotic result for both. We first give notation and two lemmas.

For L an even number, fix h = L/2 and λ = δ/2. Recall that we set σ² = 1. We say that a point j is a flat point if there is no change point in the neighborhood [j − h + 1, j + h], so that H₀(j) in (2.3) is true. Let $\mathcal{F}_{h}$ = {j: H₀(j) is true} be the set of all flat points and $\mathcal{J}$ = {τ_i: i = 1, …, J} be the set of all change points. Consider the event $A_{τ}$ = {|D(τ, h)| > λ} for a change point τ ∈ $\mathcal{J}$ and the event $B_{j}$ = {|D(j, h)| < λ} for a flat point j ∈ $\mathcal{F}_{h}$. Let

E_{n} = (\bigcap_{τ \in \mathcal{J}} A_{τ}) \cap (\bigcap_{j \in \mathcal{F}_{h}} B_{j}) .

Lemma 1

On $E_{n}$, Ĵ = J and |τ_i − τ̂_i| < h for all i.

This lemma, shown in Niu and Zhang (2010), says that $E_{n}$ implies τ_i ∈ (τ̂_i − h, τ̂_i + h). The next lemma gives the nonasymptotic bounds behind Theorems 2 and 3.
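To make the roles of h and λ concrete, here is a minimal sketch of a screening-and-thresholding procedure with the choices h = L/2 and λ = δ/2 used in this appendix. It assumes that screening keeps the h-local maximizers of |D_h| and that a candidate is retained when its statistic exceeds λ; the function name `sara`, the simulated data, and the parameter values are all hypothetical and are not taken from the paper or from any released software.

```python
import random
from itertools import accumulate


def sara(y, h, lam):
    """Sketch: return the h-local maximizers j of |D_h(j)| with |D_h(j)| > lam."""
    n = len(y)
    prefix = [0.0] + list(accumulate(y))  # prefix[k] = y_1 + ... + y_k
    # |D_h(j)| for 1-based positions j = h, ..., n - h
    absd = {
        j: abs((prefix[j + h] - 2.0 * prefix[j] + prefix[j - h]) / h)
        for j in range(h, n - h + 1)
    }
    picked = []
    for j, value in absd.items():
        if value <= lam:
            continue  # below the threshold: treated as a flat point
        # Screening: j must maximize |D_h| over the open window (j - h, j + h).
        window = (absd[k] for k in range(j - h + 1, j + h) if k in absd)
        if value >= max(window):
            picked.append(j)
    return picked


# Simulated example (assumed values): two change points, jump size 3, unit noise.
random.seed(0)
means = [0.0] * 200 + [3.0] * 200 + [0.0] * 200
y = [m + random.gauss(0.0, 1.0) for m in means]
L, delta = 200, 3.0
estimates = sara(y, h=L // 2, lam=delta / 2)
print(estimates)
```

In this example δ²L is large relative to log n, so the event described in Lemma 1 is overwhelmingly likely: the sketch should return two positions, each within h of a true change point.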

Lemma 2

If (C0) and one of (C1), (C2), and (C3) hold, then P($E_{n}$) → 1 as n → ∞. In particular, under (C1),

P (E_{n}) > 1 - \frac{4 \sqrt{2}}{\sqrt{π}} S^{- 1} \exp \{\log n - S^{2} / 32\};

(6.3)

under (C2),

P (E_{n}) > 1 - 2 \exp \{\log n - \frac{δ^{2} L}{64 a^{2} b + 8 a δ}\};

(6.4)

under (C3),

P (E_{n}) > 1 - C_{1} \frac{n}{δ^{t} L^{t - 1}} - \exp \{\log n - C_{2} δ^{2} L\},

(6.5)

where $C_{1} = \frac{1}{2} m_{t} {(4 + \frac{8}{t})}^{t}$ and $C_{2} = \frac{1}{8} {(t + 2)}^{- 2} e^{- t}$ .

Proof of Lemma 2

Under (C1), we have

\begin{array}{l} D_{h} (j) \sim N (0, \frac{2}{h}) \text{ for } j \in \mathcal{F}_{h}, \\ D_{h} (τ) \sim N (δ_{τ}, \frac{2}{h}) \text{ for } τ \in \mathcal{J}, \end{array}

where $δ_{τ} = θ_{τ + 1} - θ_{τ}$.

For a fixed flat point j,

P (B_{j}^{c}) = P (|D_{h} (j)| > \frac{δ}{2}) = 2 (1 - Φ (\frac{δ \sqrt{h}}{2 \sqrt{2}})) = 2 (1 - Φ (\frac{δ \sqrt{L}}{4})) = 2 (1 - Φ (\frac{S}{4})),

where Φ is the standard normal distribution function. Similarly, for a change point τ, since |δ_τ| ≥ δ by definition, we have

P (A_{τ}^{c}) = P (|D_{h} (τ)| < \frac{δ}{2}) < 1 - Φ (\frac{S}{4}) .

By the Bonferroni inequality and the bound on the Gaussian tail probability, we have

P (E_{n}^{c}) < \sum_{τ \in \mathcal{J}} P (A_{τ}^{c}) + \sum_{j \in \mathcal{F}_{h}} P (B_{j}^{c}) < 2 n (1 - Φ (\frac{S}{4})) < \frac{4 \sqrt{2}}{\sqrt{π}} S^{- 1} \exp \{\log n - \frac{S^{2}}{32}\} .

Therefore, (6.3) holds.

Under (C0) and (C2), the probabilities $P (A_{τ}^{c})$ and $P (B_{j}^{c})$ can be bounded using Lemma 2.2.11 in van der Vaart and Wellner (1996).

Note that $E (\exp \frac{|ε_{i}|}{a}) \leq b$ implies $E |ε_{i}|^{m} < b \, m! \, a^{m} = m! \, a^{m - 2} \frac{2 a^{2} b}{2}$ .

For a flat point j,

h \cdot D_{h} (j) = \sum_{i = j - h + 1}^{j} Y_{i} - \sum_{i = j + 1}^{j + h} Y_{i} = \sum_{i = j - h + 1}^{j} ε_{i} - \sum_{i = j + 1}^{j + h} ε_{i} .

Because ε_i is symmetric, h · D_h(j) is a sum of 2h i.i.d. random variables. By Bernstein's inequality,

\begin{array}{l} P (B_{j}^{c}) = P (|D_{h} (j)| > \frac{δ}{2}) \\ = P (|h \cdot D_{h} (j)| > \frac{δ h}{2}) \\ \leq 2 \exp \{- \frac{1}{2} \frac{δ^{2} h^{2} / 4}{2 a^{2} b \cdot 2 h + a δ h / 2}\} \\ = 2 \exp \{- \frac{δ^{2} L}{64 a^{2} b + 8 a δ}\} . \end{array}
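The simplification in the last step uses only h = L/2. Written out, with x = δh/2 the deviation level and v = 2a²b · 2h the variance term in Bernstein's inequality (the names x and v are introduced here for illustration):

```latex
\frac{1}{2} \cdot \frac{x^{2}}{v + a x}
  = \frac{1}{2} \cdot \frac{\delta^{2} h^{2} / 4}{4 a^{2} b h + a \delta h / 2}
  = \frac{\delta^{2} h}{32 a^{2} b + 4 a \delta}
  = \frac{\delta^{2} L}{64 a^{2} b + 8 a \delta} .
```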

$P (A_{τ}^{c})$ can be bounded by the same probability. By the Bonferroni Inequality, we have

P (E_{n}^{c}) < 2 n \exp \{- \frac{δ^{2} L}{64 a^{2} b + 8 a δ}\} = 2 \exp \{\log n - \frac{δ^{2} L}{64 a^{2} b + 8 a δ}\} .

Under (C0) and (C3), the inequality (6.5) can be obtained via results on large deviations of sums of independent random variables. For example, applying Corollary 1.8 in Nagaev (1979),

\begin{array}{l} P (B_{j}^{c}) = P (|h \cdot D_{h} (j)| > \frac{δ h}{2}) \\ \leq C_{1} δ^{- t} L^{1 - t} + \exp \{- C_{2} δ^{2} L\}, \end{array}

where $C_{1} = \frac{1}{2} m_{t} {(4 + \frac{8}{t})}^{t}$ and $C_{2} = \frac{1}{8} {(t + 2)}^{- 2} e^{- t}$ . The same is true for $P (A_{τ}^{c})$ , τ ∈ $\mathcal{J}$. By the Bonferroni inequality, we have

P (E_{n}^{c}) < n (C_{1} δ^{- t} L^{1 - t} + \exp \{- C_{2} δ^{2} L\}) = C_{1} \frac{n}{δ^{t} L^{t - 1}} + \exp \{\log n - C_{2} δ^{2} L\} .
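To get a feel for when these bounds are informative, the three upper bounds on P(E_n^c) derived in this proof can be evaluated numerically. The sketch below is illustrative only: the function names and the parameter values in the usage lines are assumptions, and a bound is meaningful only when it is below 1.

```python
import math


def fail_bound_c1(n, S):
    """Bound on P(E_n^c) in the normal case (C1), with S = delta * sqrt(L)."""
    return 4.0 * math.sqrt(2.0 / math.pi) / S * math.exp(math.log(n) - S * S / 32.0)


def fail_bound_c2(n, delta, L, a, b):
    """Bound on P(E_n^c) under the exponential-moment condition (C2)."""
    rate = delta ** 2 * L / (64.0 * a * a * b + 8.0 * a * delta)
    return 2.0 * math.exp(math.log(n) - rate)


def fail_bound_c3(n, delta, L, t, m_t):
    """Bound on P(E_n^c) under the t-th moment condition (C3), t > 2."""
    c1 = 0.5 * m_t * (4.0 + 8.0 / t) ** t
    c2 = math.exp(-t) / (8.0 * (t + 2.0) ** 2)
    return c1 * n / (delta ** t * L ** (t - 1)) + math.exp(math.log(n) - c2 * delta ** 2 * L)


# Hypothetical settings: n = 1000 observations, minimal jump delta = 1.
print(fail_bound_c1(1000, S=20.0))
print(fail_bound_c2(1000, delta=1.0, L=2000, a=1.0, b=2.0))
print(fail_bound_c3(1000, delta=1.0, L=10 ** 6, t=4, m_t=3.0))
```

Because C₂ is small for moderate t, the bound under (C3) needs a much larger δ²L than the other two cases before it drops below 1.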

Theorem 3 and part of Theorem 2 are straightforward corollaries of Lemmas 1 and 2. The remaining part of Theorem 2, $δ^{2} (\hat{τ}_{i} - τ_{i}) = O_{P} (1)$, can be shown in the same way as in the proof of Corollary 1. In fact, from the discussion above, we know that |τ̂_i − τ_i| < h on the event $E_{n}$, which holds with probability close to 1. To sharpen this result to $δ^{2} (\hat{τ}_{i} - τ_{i}) = O_{P} (1)$, it suffices to repeat the argument used to bound P₂ in the proof of Corollary 1.

Contributor Information

Ning Hao, Email: nhao@math.arizona.edu, Department of Mathematics, The University of Arizona, Tucson AZ, 85721, USA.

Yue Selena Niu, Email: yueniu@math.arizona.edu, Department of Mathematics, The University of Arizona, Tucson AZ, 85721, USA.

Heping Zhang, Email: heping.zhang@yale.edu, Department of Epidemiology and Public Health, Yale University, New Haven CT, 06520, USA.

Bibliography

Arias-Castro E, Donoho DL, Huo X. Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans Inform Theory. 2005;51.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B. 1995;57:289–300.
Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics. 2001;29:1165–1188.
Chen J, Gupta A. Parametric Statistical Change Point Analysis. Birkhäuser; 2000.
Chernoff H, Zacks S. Estimating the current mean of a normal distribution which is subjected to changes in time. The Annals of Mathematical Statistics. 1964;35:999–1018.
Csörgö M, Horváth L. Limit Theorems in Change-Point Analysis. Wiley; New York: 1997.
Efron B, Zhang NR. False discovery rates and copy number variation. Biometrika. 2011;98:251–271.
Gardner LAJ. On detecting changes in the mean of normal variates. The Annals of Mathematical Statistics. 1969;40:116–126.
Huang T, Wu B, Lizardi P, Zhao H. Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics. 2005;21:3811–3817. doi: 10.1093/bioinformatics/bti646.
Jeng XJ, Cai TT, Li H. Optimal sparse segment identification with application in copy number variation analysis. Journal of the American Statistical Association. 2010;105:1056–1066. doi: 10.1198/jasa.2010.tm10083.
Lai TL, Xing H. A simple Bayesian approach to multiple change-points. Statistica Sinica. 2011;21:539–569.
Lai TL, Xing H, Zhang N. Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatistics. 2008:290–307. doi: 10.1093/biostatistics/kxm031.
Nagaev SV. Large deviations of sums of independent random variables. The Annals of Probability. 1979;7:745–789.
Niu SY, Zhang H. The screening and ranking algorithm to detect DNA copy number variations. Annals of Applied Statistics. 2010; to appear. doi: 10.1214/12-AOAS539SUPP.
Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–72. doi: 10.1093/biostatistics/kxh008.
Page ES. A test for a change in a parameter occurring at an unknown point. Biometrika. 1955;42:523–527.
Page ES. On problems in which a change in a parameter occurs at an unknown point. Biometrika. 1957;44:248–252.
Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, Haden K, Li J, Shaw CA, Belmont J, Cheung SWW, Shen RM, Barker DL, Gunderson KL. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Research. 2006;16:1136–1148. doi: 10.1101/gr.5402306.
Sen A, Srivastava MS. On tests for detecting change in mean. The Annals of Statistics. 1975;3:98–108.
Tibshirani R, Wang P. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics. 2008;9:18–29. doi: 10.1093/biostatistics/kxm013.
van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. Springer; 1996.
Vostrikova LY. Detecting “disorder” in multidimensional random processes. Soviet Mathematics Doklady. 1981:55–59.
Wang K, et al. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research. 2007;17:1665–1674. doi: 10.1101/gr.6861907.
Yao Q. Tests for change-points with epidemic alternatives. Biometrika. 1993;80:179–191.
Yao YC. Estimating the number of change-points via Schwarz’ criterion. Statistics and Probability Letters. 1988;6:181–189.
Yao YC, Au ST. Least-squares estimation of a step function. Sankhya A. 1989;51:370–381.
Zhang F, Gu W, Hurles ME, Lupski JR. Copy number variation in human health, disease, and evolution. Annual Review of Genomics and Human Genetics. 2009;10:451–481. doi: 10.1146/annurev.genom.9.081307.164217.
Zhang NR. DNA copy number profiling in normal and tumor genomes. In: Frontiers in Computational and Systems Biology. 2010. pp. 259–281.
Zhang NR, Siegmund DO. A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics. 2007;63:22–32. doi: 10.1111/j.1541-0420.2006.00662.x.
