Published in final edited form as: Can J Stat. 2022 Sep 16;51(2):596–629. doi: 10.1002/cjs.11721

Efficient multiple change point detection for high-dimensional generalized linear models

Xianru Wang 1, Bin Liu 1,*, Xinsheng Zhang 1, Yufeng Liu 2; Alzheimer’s Disease Neuroimaging Initiative

Abstract

Change point detection for high-dimensional data is an important yet challenging problem for many applications. In this paper, we consider multiple change point detection in the context of high-dimensional generalized linear models, allowing the covariate dimension p to grow exponentially with the sample size n. The model considered is general and flexible in the sense that it covers various specific models as special cases. It can automatically account for the underlying data generation mechanism without specifying any prior knowledge about the number of change points. Based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points, allowing the number of change points to grow with n. To further improve the computational efficiency, a more efficient algorithm designed for the case of a single change point is proposed. We present theoretical properties of our proposed algorithms, including estimation consistency for the number and locations of change points as well as consistency and asymptotic distributions for the underlying regression coefficients. Finally, extensive simulation studies and application to the Alzheimer’s Disease Neuroimaging Initiative data further demonstrate the competitive performance of our proposed methods.

Keywords: Binary segmentation, Dynamic programming, Generalized linear models, High dimensions

1. INTRODUCTION

With the advance of technology, complex large-scale data are prevalent in various scientific fields. Data heterogeneity creates great challenges for the analysis of complex data that may not be well approximated by a common distribution. Change point detection is a powerful tool to deal with data heterogeneity. Since the seminal work by Page (1955), there has been a growing literature on change point detection with a wide range of applications, including genomics (Braun et al., 2000), finance (Pesaran & Pick, 2007; Fan et al., 2011), and social networks (Raginsky et al., 2012).

In this paper, we consider multiple change point detection for a general framework of high-dimensional generalized linear models (GLMs). Suppose we have $n$ independent observations $\{Y_i, X_i\}_{i=1}^n$ with

\[
g(\mu_i) = X_i^T\beta^{(i)} \quad \text{for } i = 1,\ldots,n, \tag{1}
\]

where $Y_i \in \mathcal{Y} \subseteq \mathbb{R}$ is the real-valued response for the $i$-th observation, $X_i = (X_{i1},\ldots,X_{ip})$ is the corresponding covariate vector in $\mathcal{X} \subseteq \mathbb{R}^p$, $\mu_i = E(Y_i \mid X_i)$, $g(\cdot)$ is the link function, and $\beta^{(i)} = (\beta_1^{(i)},\ldots,\beta_p^{(i)})^T \in \mathbb{R}^p$ is the unknown regression coefficient vector for the $i$-th observation. Then we consider estimating multiple change points with piecewise constant coefficients for Model (1). More specifically, let $\tilde{k} \ge 0$ be the true number of unknown change points along with the true location vector $\tilde{\tau} = (\tilde{\tau}_0, \tilde{\tau}_1, \ldots, \tilde{\tau}_{\tilde{k}}, \tilde{\tau}_{\tilde{k}+1})^T$ with $0 = \tilde{\tau}_0 < \tilde{\tau}_1 < \tilde{\tau}_2 < \cdots < \tilde{\tau}_{\tilde{k}} < \tilde{\tau}_{\tilde{k}+1} = 1$. Then, the unknown $\tilde{k}$ change points divide the $n$ time-ordered observations into $\tilde{k}+1$ intervals and the underlying regression coefficients $\beta^{(i)}$ have the following form:

\[
\beta^{(i)} = \begin{cases} \beta_0, & \text{if } \tilde{k} = 0, \\ \beta_0^{(j)}, & \text{if } \tilde{\tau}_{j-1} < \tfrac{i}{n} \le \tilde{\tau}_j,\ j = 1,\ldots,\tilde{k}+1, \end{cases} \tag{2}
\]

where $\beta_0^{(j)} = (\beta_{10}^{(j)},\ldots,\beta_{p0}^{(j)})^T \in \mathbb{R}^p$ denotes the underlying true regression coefficients in the $j$-th interval. We focus on change point detection, which consists of estimating: (a) the number of change points $\tilde{k}$; (b) the locations of the change points $\tilde{\tau}$; (c) the regression coefficients $\beta_0^{(j)}$ in each segment, where $j = 1,\ldots,\tilde{k}+1$.

There is a growing literature on change point detection. Most existing papers focus on change point problems in the mean, variance, or covariance matrix either for a fixed p (Kirch et al., 2015; Zhang & Lavitas, 2018) or for a growing p (Frick et al., 2014; Jirak, 2015; Barigozzi et al., 2018; Wang & Samworth, 2018; Wang et al., 2021b). Progress has been made in the literature for detection of multiple change points as well (Lavielle & Teyssiére, 2006; Aue et al., 2009; Harchaoui & Lévy-Leduc, 2010; Cho & Fryzlewicz, 2015). Despite this progress, far fewer papers in the literature address regression change point problems, especially for high-dimensional models. The main difficulty comes from the complexity of both the computation and the theoretical analysis arising from the growing dimension.

For regression problems, penalized techniques such as Lasso (Tibshirani, 1996) are popular in dealing with high-dimensional data. Some theoretical properties of the Lasso and various extensions can be found in Fan & Peng (2004), Candes & Tao (2007), and van de Geer et al. (2014). For a general overview and recent developments, we refer to Fan & Lv (2010) and Tibshirani (2011). In terms of change point detection based on Lasso, some methods exist for solving regression change point problems both in low and high dimensions. For example, designed for a fixed p, Ciuperca (2014) considered multiple change point estimation based on the Lasso. Qian & Su (2016) and Li et al. (2016) proposed a systematic change point estimation framework based on the adaptive fused Lasso. When the data dimension p grows to infinity, Lee et al. (2016) considered high-dimensional linear models with a possible change point and proposed a method for estimating regression coefficients as well as the unknown threshold parameter. As an extension, Leonardi & Bühlmann (2016) proposed computationally efficient algorithms for the number and locations of multiple change points in the context of high-dimensional linear models. Recently, Liu et al. (2019) investigated simultaneous change point detection and identification based on a de-biased Lasso process. Wang et al. (2021a) developed variance projected wild binary segmentation (VPWBS) for multiple change point detection.

Note that the above-mentioned papers focused on change point detection based on linear models with a continuous response, and thus are not directly applicable to the analysis of categorical or count response variables in practice. GLM can be very useful in this situation since it covers the exponential family distributions for the response variable. Because of its generality, GLM is widely used in various applications such as genetics, economics, and epidemiology. Several papers studied low-dimensional, single change point problems in the context of GLM (Lee & Seo, 2008; Lee et al., 2011; Fong et al., 2015). To the best of our knowledge, change point detection for high-dimensional GLMs has not been studied in the literature. Hence, it is desirable to consider a flexible and general framework for analyzing high-dimensional data with heterogeneity. Motivated by this, in this paper, we consider computationally efficient multiple change point detection in the context of high-dimensional GLMs. Our main contributions are summarized as follows:

  • We consider change point problems in a more flexible and general framework of high-dimensional GLMs, allowing the data dimension p to grow exponentially with the sample size n. It covers various model settings including linear models, logistic, and probit models as special cases. As far as we know, change point detection for high-dimensional logistic and probit models has not been considered in the literature.

  • Under the above framework, we propose a three-step procedure to estimate the number and locations of change points based on the Lasso estimator of the regression coefficients. The basic idea is to choose a useful contrast function $J(\tau^{(k)})$ that satisfies $J(\hat{\tau}^{(\hat{k})}) \le J(\tau^{(k)})$ for any candidate partition $\tau^{(k)}$. To solve this optimization problem, we propose two algorithms based on dynamic programming and binary segmentation techniques, which have computational costs of $O(n^2\,\mathrm{GlmLasso}(n))$ and $O(n\log(n)\,\mathrm{GlmLasso}(n))$, respectively, where $\mathrm{GlmLasso}(n)$ is the cost of computing the Lasso estimator for the GLM with sample size $n$. We also propose a much more efficient approach for the single change point case, with a computational cost of $O(\log(n)\,\mathrm{GlmLasso}(n))$. To the best of our knowledge, this is the most computationally efficient algorithm available for detecting a single change point in GLMs.

  • We examine the theoretical properties of our proposed change point estimators computed by the three algorithms. To be specific, under some mild conditions, both the dynamic programming and binary segmentation techniques yield consistent estimators of the number and locations of the true change points at the rate $O_p\bigl(\sqrt{\log(p)/n}\bigr)$, which covers the case of an asymptotically growing number of change points. Moreover, the estimation error of the Lasso estimator of the underlying regression coefficients is shown to be $o_p(1)$. To enable further statistical prediction and inference, we introduce the de-biased Lasso estimator of the underlying regression coefficients in each segment, which is shown to be asymptotically normal. As for the third, efficient approach designed for single change point cases, we establish that it identifies the location of the change point with high estimation accuracy. Finally, the competitive performance of our proposed methods is demonstrated by extensive numerical results as well as an application to the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset.

For a better understanding of our work, we would like to point out its relationship with several related papers. Compared to Lee et al. (2011), which considered single change point detection for a binary response with low-dimensional covariates, we overcome the computational and theoretical challenges arising from the growing dimension and the growing number of change points. Meanwhile, to address the issue of unknown multiple change points, we construct accurate and effective algorithms based on two techniques, dynamic programming and binary segmentation. These techniques are popular for multiple change point detection and were previously studied by Lavielle & Teyssiére (2006), Boysen et al. (2009), Harchaoui & Lévy-Leduc (2010), Cho & Fryzlewicz (2012, 2015), and Leonardi & Bühlmann (2016). Our extension to GLMs involves several technical challenges. One substantial difficulty comes from the complex form of the contrast functions compared to the least squares loss for linear models considered in Leonardi & Bühlmann (2016).

The rest of this paper is organized as follows. In Section 2, we introduce our methodology and demonstrate how our three proposed algorithms detect change points. In Section 3, the corresponding theoretical results for the change point estimators computed by the different algorithms are established. We investigate the performance of our proposed methods by extensive simulation studies in Section 4 and a real data application in Section 5. We summarize the paper in Section 6. Detailed proofs of the main theorems and some useful lemmas are given in the Appendix.

2. METHODOLOGY

In this section, we introduce our new methodology for Model (1) with multiple unknown change points. In particular, in Section 2.1, some notation is introduced. In Section 2.2, we present a three-step change point estimator including the number and the locations of change points. Meanwhile, the regression coefficients in each segment are estimated based on the Lasso. In Sections 2.2.1 and 2.2.2, based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points. To further improve the computational efficiency, in Section 2.2.3, we present a much more efficient algorithm designed for the case of a single change point.

2.1. Notation

We first introduce some notation. For a vector $a = (a_1,\ldots,a_p)^T \in \mathbb{R}^p$, we denote $\|a\|_1 = \sum_{i=1}^p |a_i|$, $\|a\|_2 = \bigl(\sum_{i=1}^p a_i^2\bigr)^{1/2}$, and $\|a\|_\infty = \max_{1\le i\le p} |a_i|$. For two real-valued sequences $a_n$ and $b_n$, we write $a_n = O(b_n)$ if there exists a constant $C$ such that $|a_n| \le C|b_n|$ for all sufficiently large $n$, and $a_n = o(b_n)$ if $a_n/b_n \to 0$ as $n \to \infty$. For a sequence of random variables $\{\xi_1, \xi_2, \ldots\}$, we write $\xi_n \stackrel{P}{\to} \xi$ if $\xi_n$ converges to $\xi$ in probability as $n \to \infty$, and $\xi_n = o_p(1)$ if $\xi_n \stackrel{P}{\to} 0$. Given an interval $(u,v) \subset [0,1]$ such that $u, v \in L_1 = \{i/n : i = 1,\ldots,n\}$, we denote the vector $(Y_{un+1},\ldots,Y_{vn})$ by $Y_{(u,v)}$ and the vector $(\epsilon_{un+1},\ldots,\epsilon_{vn})$ by $\epsilon_{(u,v)}$. Analogously, we use $X_{(u,v)}$ to denote the $(v-u)n \times p$ dimensional matrix $(X_{(u,v)}^{(1)},\ldots,X_{(u,v)}^{(p)})$, where $X_{(u,v)}^{(j)} = (X_{un+1}^{(j)},\ldots,X_{vn}^{(j)})^T$ with $j = 1,\ldots,p$, and we use $\hat{\beta}_{(u,v)}$ to denote the Lasso estimator based on the observations $Y_{(u,v)}$ and $X_{(u,v)}$. For a set $A$, we use $\#A$ to denote its cardinality. For any $x \ge 0$, we use $[x]$ to denote the largest integer less than or equal to $x$. We use $C_1, C_2, \ldots$ to denote generic positive constants that may vary from place to place.

2.2. New Estimation and Algorithms

Let $Y = (Y_1,\ldots,Y_n)$ denote the $n \times 1$ response vector, and $X$ the $n\times p$ design matrix with $X_i = (X_{i1},\ldots,X_{ip})$ being its $i$-th row for $1 \le i \le n$. In this paper, we assume $\{X_i\}_{i=1}^n$ are independently and identically distributed (i.i.d.) $p$-dimensional random vectors with mean zero and covariance matrix $\Sigma = \mathrm{Cov}(X_1)$. Furthermore, for $j = 1,\ldots,\tilde{k}+1$, we denote by $S^{(j)}$ the set of non-zero elements of the regression coefficients $\beta_0^{(j)}$, i.e., $S^{(j)} = \{\ell : \beta_{\ell 0}^{(j)} \ne 0,\ \ell = 1,\ldots,p\}$. For any given partition $\tau = (\tau_0, \tau_1, \ldots, \tau_k, \tau_{k+1})$, we denote the $j$-th interval by $I_j(\tau) = (\tau_{j-1}, \tau_j)$, the length of the $j$-th interval by $r_j(\tau) = \tau_j - \tau_{j-1}$, the shortest interval length by $r(\tau) = \min_{1\le j\le k+1} r_j(\tau)$, and the number of change points by $l(\tau)$. Moreover, we denote the minimum interval length by $\delta$.

We are now ready to introduce our change point estimator in detail. We consider the Lasso-type $\ell_1$-penalized estimator for high-dimensional GLMs. Such estimators have some desirable properties. In particular, van de Geer (2008) derived theoretical properties, including consistency and an oracle inequality, on which our algorithms are mainly built. More specifically, let $\rho_\beta(x,y): \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ be a loss function associated with $g(\cdot)$. For instance, if $g(\cdot)$ is the logit function, $\rho_\beta(x, y)$ is the negative log-likelihood, $\log(1 + e^{x^T\beta}) - y\,x^T\beta$. For $\beta\in\mathbb{R}^p$, we define $\dot{\rho}_\beta \equiv \partial\rho_\beta/\partial\beta$ and $\ddot{\rho}_\beta \equiv \partial^2\rho_\beta/\partial\beta\,\partial\beta^T$. Note that such complex loss functions lead to substantial difficulty in the estimation of the change points as well as of the regression coefficients. Given the data observations $\{Y_i, X_i\}_{i=1}^n$, the Lasso-based GLM method solves the following $\ell_1$-penalized problem:

\[
\hat{\beta} = \arg\min_{\beta}\left\{ \frac{1}{n}\sum_{i=1}^n \rho_\beta(X_i, Y_i) + \lambda\|\beta\|_1 \right\}. \tag{3}
\]

Since we consider heterogeneous data with possibly multiple change points, we cannot use Equation (3) directly to obtain parameter estimates. The main challenge is that both the number $\tilde{k}$ and the locations $\tilde{\tau}$ of the change points are unknown. To address this issue, we consider three steps.

Before introducing the change point estimator, we first demonstrate how to estimate the regression coefficients for each segment. To be specific, for any given candidate partition $\tau = (\tau_0, \tau_1,\ldots,\tau_k, \tau_{k+1})$ with $\tau_j\in L_1$, $j = 1,\ldots,k+1$, we obtain the $\ell_1$-penalized estimator in each segment by, for $j = 1,\ldots,k+1$,

\[
\hat{\beta}(\tau,j) = \arg\min_{\beta}\left\{ \frac{1}{n}\sum_{i=n\tau_{j-1}+1}^{n\tau_j} \rho_\beta(X_i, Y_i) + \lambda_j(\tau_j - \tau_{j-1})\|\beta\|_1 \right\}, \tag{4}
\]

where λj is the non-negative regularization parameter.
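To make Equation (4) concrete for the logistic link used later, the following R sketch fits the segment-wise Lasso with the glmnet package and evaluates the (penalized) segment loss. The helper name `segment_loss` and the use of a single fixed value of $\lambda$ are our own illustration, not part of the authors' implementation.

```r
## Minimal sketch (not the authors' code): Lasso fit and penalized loss on one
## candidate segment under the logistic link, using the glmnet package.
library(glmnet)

segment_loss <- function(X, Y, lo, hi, lambda) {
  ## Observations lo+1, ..., hi form the candidate segment; n is the full
  ## sample size so that the 1/n scaling matches Equation (4).
  idx <- (lo + 1):hi
  n   <- nrow(X)
  Xs  <- X[idx, , drop = FALSE]
  Ys  <- Y[idx]
  fit <- glmnet(Xs, Ys, family = "binomial", lambda = lambda,
                standardize = FALSE, intercept = FALSE)
  beta <- as.numeric(coef(fit, s = lambda))[-1]        # drop the intercept slot
  eta  <- as.numeric(Xs %*% beta)
  ## rho_beta(x, y) = log(1 + exp(x'beta)) - y * x'beta, averaged with 1/n.
  nll  <- sum(log1p(exp(eta)) - Ys * eta) / n
  pen  <- lambda * (length(idx) / n) * sum(abs(beta))  # lambda_j (tau_j - tau_{j-1}) ||beta||_1
  list(beta = beta, loss = nll, penalized_loss = nll + pen)
}
```

Note that, up to glmnet's internal conventions, the objective in Equation (4) equals $(\tau_j - \tau_{j-1})$ times glmnet's per-segment objective at the same $\lambda$, so a single glmnet call on the segment yields the segment-wise minimizer.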

Based on Equation (4), our new algorithms for estimating both k~ and τ~ are summarized into the following three steps.

Step 1 (Search the "best" partition): Given the candidate number of change points $k$, we find the "best" partition $\hat{\tau}^{(k)} = (\hat{\tau}_1,\ldots,\hat{\tau}_k)^T$ that minimizes the total loss (contrast) function:

\[
\hat{\tau}^{(k)} = \arg\min_{\tau = (\tau_0,\ldots,\tau_{k+1})^T} J\bigl(\tau, \hat{\beta}(\tau), X, Y\bigr) + \gamma(k+1), \tag{5}
\]

where $\hat{\beta}(\tau) \equiv \bigl(\hat{\beta}(\tau,1)^T,\ldots,\hat{\beta}(\tau,k+1)^T\bigr)^T$, $J\bigl(\tau,\hat{\beta}(\tau),X,Y\bigr) \equiv \sum_{j=1}^{k+1} P_n\rho\bigl(I_j(\tau),\hat{\beta}(\tau,j)\bigr)$, and $P_n\rho\bigl(I_j(\tau),\beta\bigr) \equiv \frac{1}{n}\sum_{i=n\tau_{j-1}+1}^{n\tau_j} \rho_\beta(X_i, Y_i)$.

Step 2 (Estimate number of change points): We plug $\hat{\tau}^{(k)}$ into $J\bigl(\tau,\hat{\beta}(\tau),X,Y\bigr)$ and obtain the minimum penalized loss associated with $k$ as

\[
G(k) \equiv J\bigl(\hat{\tau}^{(k)}, \hat{\beta}(\hat{\tau}^{(k)}), X, Y\bigr) + \gamma(k+1). \tag{6}
\]

Then, we estimate the number of change points as the minimizer of the penalized loss $G(k)$:

\[
\hat{k} = \arg\min_{k} G(k). \tag{7}
\]

Step 3 (Estimate locations of change points): We plug $\hat{k}$ into Step 1 and obtain the final change point estimator $\hat{\tau} \equiv \hat{\tau}^{(\hat{k})} = (\hat{\tau}_1,\ldots,\hat{\tau}_{\hat{k}})^T$ by

\[
\hat{\tau} = \arg\min_{\tau = (\tau_0,\ldots,\tau_{\hat{k}+1})^T} J\bigl(\tau, \hat{\beta}(\tau), X, Y\bigr). \tag{8}
\]

Combining Steps 1-3, our final change point estimators $\hat{k}$ and $\hat{\tau}$ can equivalently be obtained in the following form:

\[
\hat{\tau} = \arg\min_{k}\ \min_{\tau:\, l(\tau) = k}\Bigl\{ J\bigl(\tau, \hat{\beta}(\tau), X, Y\bigr) + \gamma(k+1)\Bigr\}. \tag{9}
\]

After the above three steps, we obtain the change point estimator $\hat{\tau}$. As for $\beta_0$, we recommend two different Lasso-type estimators of the underlying regression coefficients, serving different purposes for practitioners. In particular, the Lasso estimator of $\beta_0$ can be used to select variables and make predictions; it is defined, for $j = 1,\ldots,\hat{k}+1$, as

\[
\hat{\beta}(\hat{\tau},j) = \arg\min_{\beta}\left\{ \frac{1}{n}\sum_{i=n\hat{\tau}_{j-1}+1}^{n\hat{\tau}_j} \rho_\beta(X_i, Y_i) + \lambda_j(\hat{\tau}_j - \hat{\tau}_{j-1})\|\beta\|_1 \right\}.
\]

For further statistical inference, including confidence intervals and hypothesis testing, van de Geer et al. (2014) proposed the de-biased Lasso estimator and analyzed its asymptotic properties for the homogeneous model under high-dimensional setups. Similarly, for the heterogeneous observations, we construct a de-biased Lasso estimator of the underlying regression coefficients in each segment, for $j = 1,\ldots,\hat{k}+1$, as

\[
\tilde{\beta}(\hat{\tau},j) = \hat{\beta}(\hat{\tau},j) - \hat{\Theta}\, P_n\dot{\rho}_{\hat{\beta}(\hat{\tau},j)},
\]

where the precision matrix estimator $\hat{\Theta} = \hat{\Theta}_{\mathrm{Lasso}}$ can be constructed using the nodewise Lasso with $\hat{\Sigma} \equiv P_n\ddot{\rho}_{\hat{\beta}(\hat{\tau},j)}$ as input (van de Geer et al., 2014).
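For the logistic link used in our examples, these ingredients take an explicit form. Writing $\pi_\beta(x) = e^{x^T\beta}/(1+e^{x^T\beta})$, the score and Hessian of the loss $\rho_\beta(x,y) = \log(1+e^{x^T\beta}) - y\,x^T\beta$ are

\[
\dot{\rho}_\beta(x,y) = x\bigl(\pi_\beta(x) - y\bigr), \qquad \ddot{\rho}_\beta(x,y) = x x^T\, \pi_\beta(x)\bigl(1-\pi_\beta(x)\bigr),
\]

so that $\hat{\Sigma} \equiv P_n\ddot{\rho}_{\hat{\beta}(\hat{\tau},j)}$ is (up to the $1/n$ scaling of Section 2.2) a weighted Gram matrix of the covariates in the $j$-th estimated segment, to which the nodewise Lasso is applied.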

In what follows, we introduce three specific algorithms for solving Equation (9). Note that $J(\tau,\beta,X,Y)$ and $P_n\rho(I_j(\tau),\beta)$ are the loss functions for all intervals and for the $j$-th interval, respectively. Meanwhile, $\lambda_j$ in Equation (4) and $\gamma$ in Equation (5) are positive tuning parameters that encourage coefficient sparsity and segment sparsity, respectively. We adopt a cross-validation approach to make a proper choice of these two tuning parameters $\lambda$ and $\gamma$. To compute $\hat{\beta}(\tau,j)$ in Equation (4), we can use, for example, the R package glmnet (https://glmnet.stanford.edu). It is worth mentioning two remarks about our proposed estimator: (1) If the number of change points $\tilde{k}$ is known, we simply use Step 1 with $k = \tilde{k}$ plugged in to directly obtain the locations of the change points as follows:

\[
\hat{\tau}^{(\tilde{k})} = \arg\min_{\tau = (\tau_0,\ldots,\tau_{\tilde{k}+1})^T} J\bigl(\tau, \hat{\beta}(\tau), X, Y\bigr). \tag{10}
\]

In this case, our method covers the setting considered by Lee et al. (2016), where at most one change point is assumed. (2) When no change point occurs ($\tilde{k}=0$), our proposed method still works. Hence, our proposed method can automatically account for the underlying data generation mechanism ($\tilde{k}=0$ or $\tilde{k}>0$) without specifying any prior knowledge about the number of change points $\tilde{k}$. Furthermore, as shown by our extensive numerical studies, our new algorithms can estimate $\tilde{k}$ with high accuracy.

Our main goal is to design efficient algorithms that solve the optimization problem in Equation (9), which takes the form $J(\tau,\hat{\beta}) + \mathrm{Pen}(\tau)$. Three such algorithms are proposed next.

2.2.1. Dynamic programming approach

We introduce a general approach based on the Dynamic Programming Algorithm (DPA), which works well for our change point problem (Eq. 9). It is well-known that DPA has excellent accuracy since it considers the global solution of Equation (9). It is widely used in multiple change point detection including the efficient, parallelized approaches introduced recently by Tickle et al. (2020). More details can be found in Boysen et al. (2009) and Leonardi & Bühlmann (2016). Next, we present how to use this technique to solve Equation (9) in detail.

For any given $v\in\{i/n : i = 1,\ldots,n\}$, consider the sample $(Y_{(0,v]}, X_{(0,v]})$. Given a candidate number of change points $k$, define $F_k(v)$ as the minimum value

\[
F_k(v) \equiv \min_{\tau:\, l(\tau) = k}\ \sum_{j=1}^{k+1}\Bigl( P_n\rho\bigl(I_j(v\tau), \hat{\beta}(\tau,j)\bigr) + \gamma \Bigr). \tag{11}
\]

One can see that the optimal $k+1$ segments $\{(\tau_{j-1},\tau_j)\}_{j=1}^{k+1}$ corresponding to the change point vector $\tau = (\tau_0,\tau_1,\ldots,\tau_{k+1})$ obtained from Equation (11) consist of the optimal first $k$ segments $\{(\tau_{j-1},\tau_j)\}_{j=1}^{k}$ and a single segment $(\tau_k, \tau_{k+1})$. Recall that $\tau_0 := 0$ and $\tau_{k+1} := 1$. Then $\tau_k$ is the rightmost change point estimator. Furthermore, by the definition of $F_k(v)$, $\{(\tau_{j-1},\tau_j)\}_{j=1}^{k}$ obtained from Equation (11) is also a minimizer of $F_{k-1}(\tau_k)$. Hence, the last change point $\tau_k$ is the minimizer of $F_{k-1}(u) + P_n\rho\bigl((u,v),\hat{\beta}_{(u,v)}\bigr) + \gamma$ over $u < v$.

The above observation motivates us to use the dynamic programming recursion to calculate Fk(v) with v ∈ {i/n : i = 1, …, n}. In particular, for any v ∈ {i/n : i = 1, …, n}, define

\[
F_0(v) = P_n\rho\bigl((0,v),\hat{\beta}_{(0,v)}\bigr) + \gamma. \tag{12}
\]

Then, the dynamic programming recursion proceeds as follows:

\[
F_k(v) = \min_{\substack{u\in\{i/n :\, i = 1,\ldots,n\} \\ u < v}}\Bigl\{ F_{k-1}(u) + P_n\rho\bigl((u,v),\hat{\beta}_{(u,v)}\bigr) + \gamma \Bigr\}, \qquad v\in\Bigl\{\tfrac{i}{n} : i = 1,\ldots,n\Bigr\}. \tag{13}
\]

Define $V_n = \{i/n : i = 1,\ldots,n\}$. Based on Equations (12) and (13), we can obtain $\{F_1(v), v\in V_n\}$, $\{F_2(v), v\in V_n\}$, ..., and $\{F_{k_{\max}}(v), v\in V_n\}$, where $k_{\max}$ (in our case $k_{\max}+1 = 1/\delta$) is an upper bound on the number of change points. See Section 3.2 for more details. By the definition of $G(k)$ in Equation (6), we have $F_k(1) = G(k)$ for $k = 1,\ldots,k_{\max}$. Hence, we estimate the number of change points by

\[
\hat{k} = \arg\min_{k = 1,\ldots,k_{\max}} F_k(1). \tag{14}
\]

The corresponding locations of the change points, $\hat{\tau} = (0,\hat{\tau}_1,\ldots,\hat{\tau}_{\hat{k}},1)^T$, can then be obtained by

\[
\hat{\tau}_j = \arg\min_{u\in V_n,\, u < \hat{\tau}_{j+1}}\Bigl\{ F_{j-1}(u) + P_n\rho\bigl((u,\hat{\tau}_{j+1}),\hat{\beta}_{(u,\hat{\tau}_{j+1})}\bigr) + \gamma \Bigr\}, \qquad \text{for } j = \hat{k},\ldots,1. \tag{15}
\]

The following Algorithm 1 describes our procedure for obtaining $\hat{k}$ and $\hat{\tau}$ based on the DPA. Note that the DPA solves Equation (9) with a globally optimal solution, which gives excellent estimation accuracy. Furthermore, as shown in Leonardi & Bühlmann (2016), it has a computational cost of $O(n^2\,\mathrm{GlmLasso}(n))$ operations. This can be computationally expensive, especially when $n$ is very large. Hence, it is desirable to consider a more efficient approach. Next, we introduce an efficient approach based on binary segmentation, which ensures almost the same estimation accuracy as that of the DPA.

Algorithm 1 : Dynamic programming procedure for change point detection in high-dimensional GLMs.
Input: Given the dataset $\{X, Y\}$, set the value of $k_{\max}$.
Step 1: Based on Equations (12)-(13), compute $F_k(1)$ for $k = 1,\ldots,k_{\max}$.
Step 2: Obtain the estimate $\hat{k}$ of the number of change points by Equation (14).
Step 3: Obtain the estimate $\hat{\tau}$ of the change point locations by Equation (15).
Output: Algorithm 1 provides the change point estimator $\hat{\tau}^{(\hat{k})} = (0,\hat{\tau}_1,\ldots,\hat{\tau}_{\hat{k}},1)^T$, including both the number and the locations.
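As an illustration of the recursion, the R sketch below implements Equations (12)-(15) over the grid of candidate break indices. Here `cost(u, v)` stands for the penalized segment loss $P_n\rho\bigl((u,v],\hat{\beta}_{(u,v]}\bigr) + \gamma$ (for example, the `penalized_loss` returned by the earlier `segment_loss` sketch plus $\gamma$); the memoisation and backtracking layout are our own illustration rather than the authors' released code, and the minimum segment length constraint implied by $\delta$ is omitted for brevity.

```r
## Sketch of the dynamic programming recursion in Equations (12)-(15).
## cost(u, v): penalized loss of the segment of observations u+1, ..., v.
dp_change_points <- function(n, kmax, cost) {
  Fmat <- matrix(Inf, nrow = kmax + 1, ncol = n)   # Fmat[k + 1, v] stores F_k(v)
  argm <- matrix(NA_integer_, nrow = kmax + 1, ncol = n)
  for (v in 1:n) Fmat[1, v] <- cost(0, v)          # Equation (12): F_0(v)
  for (k in 1:kmax) {                              # Equation (13)
    for (v in (k + 1):n) {
      cand <- k:(v - 1)                            # possible last break before v
      vals <- sapply(cand, function(u) Fmat[k, u] + cost(u, v))
      Fmat[k + 1, v] <- min(vals)
      argm[k + 1, v] <- cand[which.min(vals)]
    }
  }
  k_hat <- which.min(Fmat[2:(kmax + 1), n])        # Equation (14): argmin_k F_k(1)
  tau <- integer(k_hat); v <- n                    # backtracking, Equation (15)
  for (j in k_hat:1) { tau[j] <- argm[j + 1, v]; v <- tau[j] }
  list(k_hat = k_hat, tau_hat = tau / n)
}
```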

2.2.2. Binary segmentation approach

Next we introduce an approach based on the binary segmentation algorithm (BSA) examined in Cho & Fryzlewicz (2012, 2015) and Leonardi & Bühlmann (2016), which is much more efficient than the DPA. The main idea of BSA for solving the change point problem for GLMs (Eq. 9) is that, for each candidate search interval $(u,v)$, we use the penalized loss function to determine whether a new change point $s$ can be added. If such an $s$ is identified, the interval $(u,v)$ is split into two subintervals $(u,s)$ and $(s,v)$, and the above procedure is conducted on $(u,s)$ and $(s,v)$ separately. The algorithm continues until no new change point can be added. In particular, for any given $u, v\in V_n := \{i/n : i = 1,\ldots,n\}$, we define

\[
Z(u,v) = \begin{cases} P_n\rho\bigl((u,v],\hat{\beta}_{(u,v]}\bigr) + \gamma, & \text{if } (v-u)n \ge 1, \\ 0, & \text{otherwise,} \end{cases} \tag{16}
\]

and

\[
h(u,v) = \arg\min_{s\in\{u\}\cup[u+\delta,\, v-\delta]}\ \bigl\{ Z(u,s) + Z(s,v) \bigr\}. \tag{17}
\]

Then we present our BSA-based algorithm as follows.

Algorithm 2 : Binary segmentation procedure for change point detection in high-dimensional GLMs.
Input: Given the dataset $\{X, Y\}$, initialize the set of change point pairs $T = \{0, 1\}$.
Step 1: For each pair $\{u, v\}$ in $T$, compute $s = h(u,v)$ as defined in Equation (17). If $s > u$, add the new pairs of nodes $\{u, s\}$ and $\{s, v\}$ to $T$ and update $T$ as $T = T\cup\{u,s\}\cup\{s,v\}$.
Step 2: Repeat Step 1 until no new pair of nodes can be added. Denote the terminal set of change point pairs by $T_{\mathrm{final}} = \bigcup_{i=1}^{q}\{u_i, v_i\}$.
Output: Algorithm 2 provides the change point estimator $\hat{\tau}^b = (\hat{\tau}_0^b,\ldots,\hat{\tau}_{\hat{k}^b+1}^b)^T$, where $\hat{k}^b = \#T_{\mathrm{final}}$ and $0 = \hat{\tau}_0^b < \hat{\tau}_1^b < \cdots < \hat{\tau}_{\hat{k}^b}^b < \hat{\tau}_{\hat{k}^b+1}^b = 1$ are the points of $T_{\mathrm{final}}$, including both the number and the locations.

Note that this approach searches far fewer candidates when looking for a new change point than the DPA, which makes it more computationally efficient. More specifically, as shown by Leonardi & Bühlmann (2016), BSA has a computational cost of $O(n\log(n)\,\mathrm{GlmLasso}(n))$ operations. Furthermore, in Section 3.2, we prove that the change point estimator computed by Algorithm 2 enjoys almost the same estimation accuracy as that of Algorithm 1.
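A compact recursive version of this procedure is sketched below. Here `pen_loss(u, v)` plays the role of $Z(u,v)$ in Equation (16) (again, e.g., the penalized segment loss from the earlier sketch), and `delta_n` is the minimum segment size in observations. The driver follows Equations (16)-(17) — a candidate split $s$ is kept only if it strictly lowers the penalized loss, which is exactly the event $h(u,v) > u$ — but the function names are illustrative only.

```r
## Sketch of the binary segmentation recursion in Equations (16)-(17).
bsa_change_points <- function(n, pen_loss, delta_n) {
  splits <- integer(0)
  recurse <- function(u, v) {
    if (v - u < 2 * delta_n) return(invisible(NULL))    # too short to split
    cand <- (u + delta_n):(v - delta_n)
    vals <- sapply(cand, function(s) pen_loss(u, s) + pen_loss(s, v))
    s_best <- cand[which.min(vals)]
    if (min(vals) < pen_loss(u, v)) {    # splitting lowers the penalized loss
      splits <<- c(splits, s_best)
      recurse(u, s_best)                 # recurse on (u, s] and (s, v]
      recurse(s_best, v)
    }
    invisible(NULL)
  }
  recurse(0, n)
  sort(splits) / n                       # estimated change point fractions
}
```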

2.2.3. A fast screening approach for single change point models

So far, we have proposed two efficient algorithms in Sections 2.2.1 and 2.2.2 for solving Equation (9). In this section, we show that under single change point models, the computational cost can be further reduced. As far as we know, our fast screening approach (FSA) is novel for detecting a single change point in regression models. The main idea is that, for detecting a single change point, it is not necessary to search all the candidate subintervals used in the BSA-based algorithm if we have some information about its location. To see this, recall $Z(u,v)$ as defined in Equation (16). For $\tau_f\in(0,1)$, we define the statistic $W_{\tau_f}\bigl((u,v)\bigr) = Z\bigl(u, u+\tau_f(v-u)\bigr) + Z\bigl(u+\tau_f(v-u), v\bigr)$. Based on $W_{\tau_f}\bigl((u,v)\bigr)$, we have the following key observation. Consider any subinterval $(u,v)$ containing a single change point $\tilde{\tau}\in(u,v)$. If $\tilde{\tau}$ lies in the first half of $(u,v)$, i.e., $\tilde{\tau}\in\bigl(u, u+\tfrac{1}{2}(v-u)\bigr)$, we can prove that, with high probability, $W_{1/4}\bigl((u,v)\bigr) < W_{3/4}\bigl((u,v)\bigr)$. If $\tilde{\tau}$ lies in the second half of $(u,v)$, i.e., $\tilde{\tau}\in\bigl(u+\tfrac{1}{2}(v-u), v\bigr)$, then with high probability $W_{3/4}\bigl((u,v)\bigr) < W_{1/4}\bigl((u,v)\bigr)$. This observation motivates us to design Algorithm 3 for fast change point identification.

Note that Algorithm 3 does not need to search through all the data points; it can quickly identify the half-interval containing the change point by comparing the quarter, half, and three-quarter values of $W(T_i)$ in each iteration. As a result, it only takes $O(\log(n)\,\mathrm{GlmLasso}(n))$ computational operations to detect the change point. Hence, compared with Algorithms 1 and 2, the computational cost can be reduced dramatically. Its computational benefits will be validated by our numerical experiments in Section 4.

Algorithm 3 : A fast screening approach for single change point detection in high-dimensional GLMs.
Input: Input the dataset $\{X, Y\}$.
Step 0: Set $u_0 = 0$, $v_0 = 1$.
Step 1: For each iteration $i = 0, 1, 2, \ldots$, let $T_i = [u_i, v_i]$. Calculate the values of $W_{1/4}(T_i)$, $W_{1/2}(T_i)$, and $W_{3/4}(T_i)$, and set $M(T_i) = \min\bigl(W_{1/4}(T_i), W_{1/2}(T_i), W_{3/4}(T_i)\bigr)$. Consider the following three cases: if $M(T_i) = W_{1/4}(T_i)$, set $T_{i+1} = [u_i,\, u_i + \tfrac{1}{2}(v_i - u_i)]$; if $M(T_i) = W_{1/2}(T_i)$, set $T_{i+1} = [u_i + \tfrac{1}{4}(v_i - u_i),\, u_i + \tfrac{3}{4}(v_i - u_i)]$; if $M(T_i) = W_{3/4}(T_i)$, set $T_{i+1} = [u_i + \tfrac{1}{2}(v_i - u_i),\, v_i]$.
Step 2: Repeat Step 1 until $[n(v_i - u_i)] \le 4$ holds for some $i$. Denote $T_i$ by $\hat{T}$.
Step 3: Calculate $\hat{\tau}^f = \arg\min_{\tau\in\hat{T}} W_{\tau}(\hat{T})$.
Output: This algorithm provides a single change point estimator $\hat{\tau}^f$.
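The sketch below mirrors Algorithm 3 in R; `pen_loss(u, v)` again stands for $Z(u,v)$, and the final exhaustive step scores each surviving candidate by splitting the full sample, which is one natural reading of Step 3. Minimum-segment-length safeguards are omitted, and the helper names are our own.

```r
## Sketch of the fast screening approach (Algorithm 3) for a single change point.
fsa_single_change_point <- function(n, pen_loss) {
  W <- function(u, v, frac) {            # W_frac((u, v)) from Section 2.2.3
    s <- u + round(frac * (v - u))
    pen_loss(u, s) + pen_loss(s, v)
  }
  u <- 0; v <- n
  while (v - u > 4) {                    # Steps 1-2: shrink the search interval
    w14 <- W(u, v, 1/4); w12 <- W(u, v, 1/2); w34 <- W(u, v, 3/4)
    len <- v - u
    if (w14 <= w12 && w14 <= w34) {          # keep the first half
      v <- u + round(len / 2)
    } else if (w34 <= w12 && w34 <= w14) {   # keep the second half
      u <- u + round(len / 2)
    } else {                                 # keep the middle half
      u <- u + round(len / 4); v <- v - round(len / 4)
    }
  }
  cand <- (u + 1):(v - 1)                # Step 3: exhaustive search in T_hat
  vals <- sapply(cand, function(s) pen_loss(0, s) + pen_loss(s, n))
  cand[which.min(vals)] / n              # estimated change point fraction
}
```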

3. THEORETICAL PROPERTIES

We examine the theoretical properties of our proposed three approaches. In particular, we first show that our estimation of change points and regression coefficients is consistent and has the same rates of convergence as those of linear models. Secondly, for GLMs with change points, we reconstruct our assumptions and lemmas for analyzing high-dimensional data with heterogeneity based on the work of van de Geer (2008). Note that van de Geer (2008) considered the 1 penalized estimation of the regression coefficients under the setting of all observations from the same GLM. In particular, in Section 3.1, we introduce some assumptions. In Section 3.2, we present theoretical results of the change point estimator computed by the new algorithms.

Before presenting the theoretical results, we introduce some additional notation. For any $u, v\in V_n := \{i/n : i = 1,\ldots,n\}$ with $u < v$ and a function $\rho(x,y): \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, we denote the subinterval-based theoretical mean and empirical mean by $P\rho\bigl((u,v)\bigr) \equiv \frac{1}{n}\sum_{i=un+1}^{vn} E\rho(X_i, Y_i)$ and $P_n\rho\bigl((u,v)\bigr) \equiv \frac{1}{n}\sum_{i=un+1}^{vn}\rho(X_i, Y_i)$, respectively. For convenience, we denote $P\rho = P\rho\bigl((0,1)\bigr)$ and $P_n\rho = P_n\rho\bigl((0,1)\bigr)$. Consider the linear subspace $\mathcal{F} \equiv \{f_\beta(x) = x^T\beta : \beta\in\mathbb{R}^p\}$. For $f_\beta\in\mathcal{F}$, define $\rho_{f_\beta}(x,y) = \rho\bigl(f_\beta(x), y\bigr)$. Then the empirical risk and the theoretical risk at $f$ are defined as $P_n\rho_f$ and $P\rho_f$, respectively. Furthermore, we define the target as the minimizer of the theoretical risk, $f_0 \equiv \arg\min_{f\in\mathcal{F}} P\rho_f$ and $\beta_0 \equiv \arg\min_{\beta\in\mathbb{R}^p} P\rho_{f_\beta}$, where $\beta_0$ can be regarded as the "truth". By definition, we have $f_0(x) = x^T\beta_0$. For $f_\beta\in\mathcal{F}$, the excess risk is defined as $\mathcal{E}(f_\beta) \equiv P(\rho_{f_\beta} - \rho_{f_0})$. Lastly, for any subinterval $(u,v)$, we define the oracle $\beta^*_{(u,v)}$ as $\beta^*_{(u,v)} \equiv \arg\min_{\beta\in\mathbb{R}^p}\{\mathcal{E}(f_\beta)\}$. The corresponding estimation error is then denoted as $\epsilon^* := (P_n - P)\rho_{f_{\beta^*}}$.

3.1. Basic assumptions

We introduce some assumptions as follows.

Assumption A (loss function). The loss function $\rho_f(x,y) := \rho\bigl(f(x), y\bigr)$ is convex in $f$ for all $y\in\mathbb{R}$. Moreover, it satisfies the Lipschitz property:

\[
\bigl|\rho\bigl(f_\beta(x), y\bigr) - \rho\bigl(f_{\tilde{\beta}}(x), y\bigr)\bigr| \le \bigl|f_\beta(x) - f_{\tilde{\beta}}(x)\bigr|, \qquad \forall (x,y)\in\mathcal{X}\times\mathcal{Y},\ \forall \beta, \tilde{\beta}\in\mathbb{R}^p.
\]

Assumption B (design matrix). There exists $K_X < \infty$ such that $\|X_i\|_\infty \le K_X$ and $E(X_i) = 0$ hold for all $i = 1,\ldots,n$.

Assumption C (margin condition). There exist an $\eta > 0$ and a strictly convex increasing function $G(x)$ such that, for all $\beta\in\mathbb{R}^p$ with $\|f_\beta - f_0\|_\infty \le \eta$, one has

\[
\mathcal{E}(f_\beta) \ge G\bigl(\|f_\beta - f_0\|\bigr),
\]

where there exists a constant $C$ such that $G(x) \ge Cx^2$ for any positive $x$.

Assumption D (compatibility condition). The compatibility condition is met for the set $S^* = \bigcup_{j=1}^{\tilde{k}+1} S^{(j)}$ (with $S^{(j)}$ defined in Section 2.2) with constant $\phi_* > 0$ if, for all $\beta\in\mathbb{R}^p$ satisfying $\|\beta_{S^{*c}}\|_1 \le 3\|\beta_{S^*}\|_1$, it holds that

\[
\|\beta_{S^*}\|_1^2 \le \frac{(\beta^T X^T X \beta)\, s^*}{\phi_*^2},
\]

where $s^* := \#S^*$ is the cardinality of $S^*$.

Assumption E (parameter space). For $\tilde{k} \ge 1$, there exist constants $m^* > 0$ and $M^* > 0$ such that

\[
\min_{1\le i\le j< k\le \tilde{k}+1}\left\| \sum_{r=i}^{j}\gamma(i,r,j)\,\beta_0^{(r)} - \sum_{r=j+1}^{k}\gamma(j+1,r,k)\,\beta_0^{(r)} \right\|_1 \ge \sqrt{s^*}\, m^*,
\]

where $\gamma(i,j,k) = \dfrac{\tilde{\tau}_j - \tilde{\tau}_{j-1}}{\tilde{\tau}_k - \tilde{\tau}_{i-1}}$,

\[
s^* = o\!\left(\sqrt{\frac{n}{\log(p)}}\right), \qquad \max_{1\le j\le\tilde{k}+1}\bigl\|\beta_0^{(j)}\bigr\|_\infty \le M^*, \qquad \text{and} \qquad \max_{1<j\le\tilde{k}+1}\bigl\|\beta_0^{(j)} - \beta_0^{(j-1)}\bigr\|_\infty \le M^*.
\]

Note that in the case $\tilde{k} = 1$, the former condition reduces to $\|\beta_0^{(1)} - \beta_0^{(2)}\|_1 \ge m^*\sqrt{s^*}$.

We assume in Assumption A that the loss function $\rho$ is Lipschitz in $f$, which allows us to bound the loss by the difference between the estimated regression parameters and the corresponding true parameters. Many loss functions meet this condition, for example, the negative log-likelihood of the logistic regression model. Assumption B imposes relatively weak conditions on the covariates, which cover a wide range of distributional patterns. Assumption C (margin condition) is assumed for a "neighbourhood" of the target linear function $f_0 = X^T\beta_0$ and is a common condition for analyzing GLMs. See Section 6.4 in Bühlmann & van de Geer (2011) for more details. Assumption D (compatibility condition) for the design matrix $X$ allows us to establish oracle results for the Lasso estimation. Note that one can verify that Assumption D is a sufficient condition for Assumption C in van de Geer (2008) by choosing the function $D(K) = \sqrt{\#K}\,\|\beta\|_2$, where $\#K$ is the cardinality of the set $K\subseteq\{1,\ldots,p\}$ defined in Assumption C of van de Geer (2008). Assumption E specifies the minimum and maximum differences between the true regression parameters, which allow us to detect the change points. Furthermore, the sparsity of the regression coefficients is required to guarantee the consistency of our proposed estimators. Assumption F, introduced in the Appendix, imposes some technical conditions on the tuning parameter $\lambda$ for the Lasso estimation as well as the tuning parameter $\gamma$ for the change point estimation. Assumption G includes the condition required for the limiting property of the de-biased Lasso estimator.

3.2. Main results

We are now ready to present the theoretical results for our three proposed algorithms. Before that, we denote $c^* \equiv {m^*}^2\phi_*^2/M^*$ and let $d^*$ be a constant; see Lemma 5 for more details. We first present the properties of the estimators computed by DPA in Algorithm 1.

Theorem 1 Suppose Assumptions A-G hold with $\log(p) = o(n)$. Then, for a given $C_1 > 0$, with probability at least $1 - 7\exp\bigl(-C_1 n/(2\log(p))\bigr)$, we have that

  1. $l(\hat{\tau}) = \tilde{k}$;

  2. $\|\hat{\tau} - \tilde{\tau}\|_1 \le c^*\delta\lambda$;

  3. $\displaystyle\sum_{j=1}^{\tilde{k}+1}\Bigl\{\bigl| P_n\rho\bigl(I_j(\hat{\tau}),\hat{\beta}(\hat{\tau},j)\bigr) - P_n\rho\bigl(I_j(\hat{\tau}),\beta_0^{(j)}\bigr)\bigr| + \lambda\, r_j(\hat{\tau})\,\bigl\|\hat{\beta}(\hat{\tau},j) - \beta_0^{(j)}\bigr\|_1\Bigr\} \le (\tilde{k}+1)\,d^*s^*\lambda^2$;

  4. for each $j\in\{1,\ldots,\hat{k}\}$,
    $\displaystyle\frac{\sqrt{n}\,\bigl(\tilde{\beta}_s(\hat{\tau},j) - \beta_{s0}^{(j)}\bigr)}{\hat{\sigma}_{j,s}} = V_{j,s} + o_P(1) \quad\text{for } s\in\{1,\ldots,p\}$,

    where $\tilde{\beta}_s(\hat{\tau},j)$ is the $s$-th component of $\tilde{\beta}(\hat{\tau},j)$, $V_{j,s}\sim N(0,1)$, and $\hat{\sigma}_{j,s}^2 \equiv \bigl(\hat{\Theta}_j\, P_n\,\dot{\rho}_{\hat{\beta}}\dot{\rho}_{\hat{\beta}}^T\,\hat{\Theta}_j^T\bigr)_{s,s}$.

Theorem 1 demonstrates that Algorithm 1 can identify both the number and the locations of multiple change points with high estimation accuracy. In particular, the first result shows that we obtain a consistent estimator $l(\hat{\tau})$ of the true number of change points. As for the locations, the second result indicates that our multiple change point estimator $\hat{\tau}$ converges to the true change point vector $\tilde{\tau}$ at the rate $O_p\bigl(\sqrt{\log(p)/n}\bigr)$. Furthermore, the third result implies that the prediction error and the estimation error of the underlying regression parameters can be bounded at the rates $O_p(\tilde{k}s^*\lambda^2)$ and $O_p(\tilde{k}s^*\lambda)$, respectively. Result (4) implies the asymptotic normality of the de-biased Lasso estimator $\tilde{\beta}(\hat{\tau})$, which enables further statistical inference, including confidence intervals and hypothesis testing.

Based on Theorem 1, some other interesting conclusions can be drawn. To simplify the discussion, we require that all of the $\tilde{k}+1$ change point intervals are of the same order of magnitude. Recall that $\delta$ is the minimum length of the change point intervals, as defined in Section 2.2. Then we have $\tilde{k}\ (\le k_{\max}) = O(1/\delta)$. Furthermore, depending on $\delta$, the following two cases are considered: (1) $\delta = O(1)$ and (2) $\delta = o(1)$.

For the first case, we have $\tilde{k} = O(1)$, which means that the number of change points is fixed and does not increase with the sample size $n$. Furthermore, by Assumption (F1), we have $\lambda = O\bigl(\sqrt{\log(p)/n}\bigr)$. Hence, the three results in Theorem 1 reduce to:

\[
\|\hat{\tau} - \tilde{\tau}\|_1 = O_p\!\left(\sqrt{\frac{\log(p)}{n}}\right), \qquad
\sum_{j=1}^{\tilde{k}+1}\Bigl| P_n\rho\bigl(I_j(\hat{\tau}),\hat{\beta}(\hat{\tau},j)\bigr) - P_n\rho\bigl(I_j(\hat{\tau}),\beta_0^{(j)}\bigr)\Bigr| = O_p\!\left(\frac{s^*\log(p)}{n}\right), \qquad\text{and}\qquad
\sum_{j=1}^{\tilde{k}+1}\bigl\|\hat{\beta}(\hat{\tau},j) - \beta_0^{(j)}\bigr\|_1 = O_p\!\left(s^*\sqrt{\frac{\log(p)}{n}}\right). \tag{18}
\]

Considering Equation (18), our results are consistent with the Lasso estimation results derived in van de Geer (2008), and estimation consistency is guaranteed as long as $s^*\sqrt{\log(p)/n} = o(1)$ holds.

We next consider the second case with $\delta = o(1)$. In this case, we allow the number of change points to grow with $n$. Noting that $\lambda\delta = O\bigl(\sqrt{\log(p)/n}\bigr)$, the three results in Theorem 1 reduce to:

\[
\|\hat{\tau} - \tilde{\tau}\|_1 = O_p\!\left(\sqrt{\frac{\log(p)}{n}}\right), \qquad
\sum_{j=1}^{\tilde{k}+1}\Bigl| P_n\rho\bigl(I_j(\hat{\tau}),\hat{\beta}(\hat{\tau},j)\bigr) - P_n\rho\bigl(I_j(\hat{\tau}),\beta_0^{(j)}\bigr)\Bigr| = O_p\!\left(\frac{s^*\log(p)}{n\delta^2}\right), \qquad\text{and}\qquad
\sum_{j=1}^{\tilde{k}+1}\bigl\|\hat{\beta}(\hat{\tau},j) - \beta_0^{(j)}\bigr\|_1 = O_p\!\left(\frac{s^*}{\delta^{3/2}}\sqrt{\frac{\log(p)}{n}}\right). \tag{19}
\]

Hence, by Equation (19), estimation consistency can still be obtained as long as $\frac{s^*}{\delta^{3/2}}\sqrt{\frac{\log(p)}{n}} = o(1)$ holds. In other words, the number of change points $\tilde{k}$ cannot grow faster than the order of $\bigl(\frac{n}{\log(p)\,{s^*}^2}\bigr)^{1/3}$.
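To see where this growth restriction comes from, note that under the convention $\tilde{k} \asymp 1/\delta$ used above, the consistency requirement in Equation (19) can be rearranged as

\[
\frac{s^*}{\delta^{3/2}}\sqrt{\frac{\log(p)}{n}} = o(1)
\;\Longleftrightarrow\;
\delta^{-3/2} = o\!\left(\frac{1}{s^*}\sqrt{\frac{n}{\log(p)}}\right)
\;\Longleftrightarrow\;
\tilde{k} \asymp \delta^{-1} = o\!\left(\left(\frac{n}{\log(p)\,{s^*}^2}\right)^{1/3}\right),
\]

where the last step raises both sides to the power $2/3$.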

Next, we present theoretical results of change point estimators computed by BSA.

Theorem 2 Suppose Assumptions A-G hold with $\log(p) = o(n)$. For a given $C_2 > 0$, with probability at least $1 - 7\exp\bigl(-C_2 n/(2\log(p))\bigr)$, we have that

  1. $l(\hat{\tau}^b) = \tilde{k}$;

  2. $\|\hat{\tau}^b - \tilde{\tau}\|_1 \le c^*\delta\lambda$;

  3. $\displaystyle\sum_{j=1}^{\tilde{k}+1}\Bigl\{\bigl| P_n\rho\bigl(I_j(\hat{\tau}^b),\hat{\beta}(\hat{\tau}^b,j)\bigr) - P_n\rho\bigl(I_j(\hat{\tau}^b),\beta_0^{(j)}\bigr)\bigr| + \lambda\, r_j(\hat{\tau}^b)\,\bigl\|\hat{\beta}(\hat{\tau}^b,j) - \beta_0^{(j)}\bigr\|_1\Bigr\} \le (\tilde{k}+1)\,d^*s^*\lambda^2$;

  4. for each $j\in\{1,\ldots,\hat{k}\}$,
    $\displaystyle\frac{\sqrt{n}\,\bigl(\tilde{\beta}_s(\hat{\tau}^b,j) - \beta_{s0}^{(j)}\bigr)}{\hat{\sigma}_{j,s}} = V_{j,s} + o_P(1) \quad\text{for } s\in\{1,\ldots,p\}$,

    where $\tilde{\beta}_s(\hat{\tau}^b,j)$ is the $s$-th component of $\tilde{\beta}(\hat{\tau}^b,j)$, $V_{j,s}\sim N(0,1)$, and $\hat{\sigma}_{j,s}^2 \equiv \bigl(\hat{\Theta}_j\, P_n\,\dot{\rho}_{\hat{\beta}}\dot{\rho}_{\hat{\beta}}^T\,\hat{\Theta}_j^T\bigr)_{s,s}$.

Theorem 2 shows similar results as those of Theorem 1 in terms of consistency of both the number and locations of change points. Furthermore, Theorem 2 allows us to use a much more efficient algorithm to detect multiple change points for GLMs, which enjoys almost the same estimation accuracy as that of the global solutions. The efficiency will be further investigated in our numerical experiments.

Finally, we establish theoretical properties of FSA proposed in Algorithm 3 for single change point models.

Theorem 3 Suppose Assumptions A-F hold with $\log(p) = o(n)$. Assume that the true single change point satisfies $\tilde{\tau}\in(0,\tfrac{1}{2})$. Then, for a given $C_3 > 0$, with probability at least $1 - 7\exp\bigl(-C_3 n/(2\log(p))\bigr)$, we have that

\[
W_{1/4}\bigl((0,1)\bigr) < W_{3/4}\bigl((0,1)\bigr). \tag{20}
\]

Theorem 3 justifies the validity of Algorithm 3 and demonstrates that the cost of identifying a single change point in GLMs can be reduced to only $O(\log(n)\,\mathrm{GlmLasso}(n))$ computational operations.

4. SIMULATION STUDIES

In this section, we investigate the numerical performance of our three proposed change point detection procedures in various model settings. For the design matrix X, we generate Xi i.i.d. from N(0,Σ). We first consider two types of covariance matrix structures including independent and weakly dependent settings as follows:

Case 1: Σ = Ip×p;

Case 2: $\Sigma = \Sigma^*$ with $\Sigma^* = (\sigma_{i,j})_{i,j=1}^p$, where $\sigma_{i,j} = 0.8^{|i-j|}$ for $1\le i, j\le p$.

We consider logistic regression models. For $i = 1,\ldots,n$, we generate $Y_i\in\{0,1\}$ with $g\bigl(P(Y_i=1)\bigr) \equiv \log\frac{P(Y_i=1)}{1 - P(Y_i=1)} = X_i^T\beta^{(i)}$. The responses $\{Y_i\}_{i=1}^n$ are then generated from the Binomial distribution $Y_i\mid X_i \sim \mathrm{Bin}\Bigl(1, \frac{\exp(X_i^T\beta^{(i)})}{1+\exp(X_i^T\beta^{(i)})}\Bigr)$.
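For concreteness, a single simulated replication of this data-generating mechanism (Case 1 with one change point at the midpoint) can be produced as follows in R. The support size, signal strengths, and seed below are illustrative placeholders; the exact designs used in our experiments are spelled out in Sections 4.2-4.4.

```r
## Illustrative sketch of one replication from the logistic change point model
## (Case 1: Sigma = I, a single change at tau = 0.5); not the exact design of
## Sections 4.2-4.4.
set.seed(1)
n <- 300; p <- 200; tau <- 0.5
X <- matrix(rnorm(n * p), n, p)               # Case 1: independent N(0, 1) covariates
beta1 <- beta2 <- numeric(p)
supp <- 1:5                                   # illustrative support set
beta1[supp] <- runif(length(supp), 0, 2)
beta2[supp] <- beta1[supp] + runif(length(supp), 0, 10 * sqrt(log(p) / n))
pre <- seq_len(n) <= tau * n                  # observations before the change
eta <- numeric(n)
eta[pre]  <- X[pre, ]  %*% beta1
eta[!pre] <- X[!pre, ] %*% beta2
Y <- rbinom(n, size = 1, prob = 1 / (1 + exp(-eta)))   # Y_i | X_i ~ Bin(1, expit(X_i' beta^(i)))
```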

For this model setup, we investigate the performance of our approaches in terms of accuracy and efficiency. For efficiency, we compare our proposed algorithms in terms of the computational cost. Note that BSA and DPA are designed for multiple change point detection. In order to compare efficiency reasonably for the cases with no change point and a single change point, we set these two algorithms to stop after one screening by making kmax = 1. To show the accuracy, we record the mean, mean squared error (MSE), and error rate (proportion of false positives) of the change point estimators including the number and locations. We compare the corresponding results with the following existing methods:

  • Lee et al. (2011) (denoted by Lee2011), which is based on the maximum score estimation.

  • Qian and Su (2016) (denoted by SGL), which proposed a systematic estimation framework based on the adaptive fused Lasso in linear regression models. To be specific, they estimate $\{\beta_t\}_{t=1}^n$ by minimizing the $\ell_2$-loss with the fused Lasso penalty. In this paper, we modify SGL by replacing the $\ell_2$-loss with the loss $\rho$ defined in Section 2 for high-dimensional GLMs.

  • Wang et al. (2021a) (denoted by VPWBS), i.e., the variance projected wild binary segmentation based on the sparse group Lasso estimator for linear regression models. In particular, they project the high-dimensional time series $\{X_i, Y_i\}_{i=1}^n$ onto a univariate time series $\{z_i(u)\}_{i=1}^n$. The optimal projection direction $u$ is obtained by local group Lasso screening (LGS). Mean change point detection is then conducted by wild binary segmentation (WBS) on the univariate time series $\{z_i(u)\}_{i=1}^n$. Note that, for linear models, LGS performs a variant of the group Lasso on any subsample $\{X_i, Y_i\}_{i=s+1}^{e}$ and computes
    \[
    (\hat{\alpha}_1, \hat{\alpha}_2, \hat{\nu}) \equiv \arg\min_{\substack{\nu\in[s'+1,\, e'-1] \\ \alpha_1, \alpha_2\in\mathbb{R}^p}} \left\{ \sum_{i=s+1}^{\nu}\bigl(Y_i - X_i^T\alpha_1\bigr)^2 + \sum_{i=\nu+1}^{e}\bigl(Y_i - X_i^T\alpha_2\bigr)^2 \right\} + \lambda_G\sum_{j=1}^{p}\sqrt{(\nu - s)\,\alpha_{1,j}^2 + (e - \nu)\,\alpha_{2,j}^2}, \tag{21}
    \]
    where $s'$ and $e'$ serve as boundary trimming parameters with $s + 1 \le s' + 1 < e' \le e$, and $\lambda_G$ is the tuning parameter for the group penalty. For a better comparison in the context of high-dimensional GLMs, we modify this $\ell_2$-loss based method of Wang et al. (2021a) by replacing the $\ell_2$-loss in Equation (21) with the loss $\rho_\beta(X_i, Y_i)$ defined in Section 2.

For our proposed approaches, the regression coefficients are computed by the R package glmnet (https://glmnet.stanford.edu). All numerical results are based on 100 replications, except for the test by Lee2011 which is based on 500 replications.

4.1. Tuning parameter selection

It is essential to properly choose the values of the tuning parameters for accurate estimation results. We develop a cross-validation approach for GLMs to choose the parameters $\lambda$ and $\gamma$, which encourage regression coefficient sparsity and segment sparsity, respectively. To be specific, let the samples with odd indices $(X_1, X_3, \ldots, X_{n-3}, X_{n-1})$ be the training set and the samples with even indices $(X_2, X_4, \ldots, X_{n-2}, X_n)$ be the validation set. For each candidate pair of tuning parameters $(\lambda, \gamma)$, we conduct our procedure on the training set and obtain the estimated change points $\hat{\tau}^{(\hat{k})}$ and the underlying regression coefficients $\hat{\beta}(\hat{\tau},j)$, $j = 1,\ldots,\hat{k}+1$. Let $\hat{f}_i = X_i^T\hat{\beta}(\hat{\tau},j)$ for $i/n\in I_j(\hat{\tau})$ and $i = 1,\ldots,n$. We can then calculate the validation loss as

\[
\mathrm{CV}(\lambda,\gamma) = \frac{2}{n}\sum_{i:\, i\ \mathrm{even}} \rho(\hat{f}_i, Y_i),
\]

where $\rho$ is the loss function of the GLM and depends on the link function. For specific regression models, such as the linear model and the logistic regression model, the corresponding validation losses are

\[
\mathrm{CV}_{\mathrm{LM}}(\lambda,\gamma) = \frac{2}{n}\sum_{i:\, i\ \mathrm{even}} \bigl(\hat{f}_i - Y_i\bigr)^2,
\qquad \text{and} \qquad
\mathrm{CV}_{\mathrm{Logistic}}(\lambda,\gamma) = \frac{2}{n}\sum_{i:\, i\ \mathrm{even}} \Bigl\{\log\bigl(1 + e^{\hat{f}_i}\bigr) - Y_i\hat{f}_i\Bigr\}.
\]

Then we choose $(\lambda, \gamma)$ corresponding to the lowest validation loss. Note that it is time-consuming to use the cross-validation procedure to choose the tuning parameters for all of our model settings. Based on our extensive numerical simulations, we find that our methods are stable over a certain range of tuning parameters. Hence, we use an empirical choice of the parameters $\lambda$ and $\gamma$ to save computational cost. In particular, we set $\lambda = c\Bigl(\sqrt{\frac{\log(2p)}{n}} + \frac{\log(2p)}{n}\Bigr)$ with $c\in(0.15, 0.25)$. As for $\gamma$, we set $\gamma = \delta\lambda$. Recall that $\delta$ is the minimum interval length and $\delta n$ is the minimum interval size, which controls the maximum number of change points. Note that $\delta$ is of key importance for the theoretical guarantees discussed in Section 3.2 and needs to be chosen carefully in simulation studies.
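A minimal version of the odd/even validation split described above might look as follows in R; `fit_change_points()` is a placeholder for whichever of the three algorithms is being tuned (it is not a function defined elsewhere in this paper), and its assumed interface — returning a `predict(x, pos)` function that evaluates $x^T\hat{\beta}(\hat{\tau},j)$ for the segment containing relative position `pos` — is an assumption of this sketch.

```r
## Sketch of the odd/even cross-validation of this section for the logistic loss.
cv_tune <- function(X, Y, lambda_grid, gamma_grid, fit_change_points) {
  n <- nrow(X)
  odd <- seq(1, n, by = 2); even <- seq(2, n, by = 2)    # training / validation split
  best <- list(loss = Inf)
  for (lam in lambda_grid) {
    for (gam in gamma_grid) {
      fit <- fit_change_points(X[odd, ], Y[odd], lam, gam)   # assumed interface
      f_hat <- sapply(even, function(i) fit$predict(X[i, ], i / n))
      ## CV_Logistic(lambda, gamma) = (2/n) * sum over the validation points.
      loss <- mean(log1p(exp(f_hat)) - Y[even] * f_hat)
      if (loss < best$loss) best <- list(loss = loss, lambda = lam, gamma = gam)
    }
  }
  best
}
```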

In order to ensure an effective fit of the regression model, we need to guarantee a sufficient sample size in each interval. According to our numerical studies, setting $\delta\in(0.1, 0.25)$ works well. To investigate how sensitive our proposed methods are to the choice of these tuning parameters, we consider various values of $\lambda$ and $\gamma$ by setting the sample size $n\in\{200, 300, 1000\}$ and the data dimension $p\in\{200, 300, 400\}$. Note that our proposed methods can automatically account for the underlying data generation mechanism and do not need prior knowledge of the number of change points. To justify this, in what follows, we present our numerical results under three different cases: (1) $\tilde{k}=0$, (2) $\tilde{k}=1$, and (3) $\tilde{k}=3$, which correspond to data with no change point, one change point, and multiple change points, respectively.

4.2. No change point models

We first consider the scenario where no change point occurs. In this case, the underlying regression coefficients satisfy $\beta^{(i)} = \beta_0 \equiv (\beta_{10},\ldots,\beta_{p0})^T$ for $i = 1,\ldots,n$. We set the sample size $n = 200$ and the data dimension $p\in\{200, 300, 400\}$. For $s\in S_0$, we generate $\beta_{s0} \overset{\text{i.i.d.}}{\sim} U(0,2)$, where $S_0$ denotes the set of non-zero elements of $\beta_0$ with $\#S_0 = [\log(p)]$.

We implement the corresponding algorithms independently on a 2.50GHz CPU (Linux) with 6 cores and 4GB of RAM. As shown in Figure 1 (left), the computational cost of BSA grows moderately (12s to 737s) as the data dimension increases from 400 to 2000, while the computational cost of DPA grows exponentially (80s to 32000s). As for the accuracy, the error rates of VPWBS, DPA, and BSA are zero in almost all cases, which suggests these three approaches have almost the same accuracy when no change point occurs. SGL tends to overestimate the number of change points for the homogeneous observations. Note that Lee2011 has relatively large errors, which suggests that it may be unreliable in high-dimensional settings. Thus, we do not include it in our comparisons for the single change point models.

Figure 1:

Efficiency of change point estimation with p = 2n. The left panel shows computational costs of BSA and DPA per replication under the model with no change point. The right panel shows computational costs of BSA and DPA per replication under the model of three change points.

4.3. Single change point models

Next we consider the alternative scenario where $(\beta^{(i)})_{1\le i\le n}$ have a common change point located at $\tilde{\tau}_1$, with $\tilde{\tau}_1\in\{0.5, 0.7\}$. We set the sample size $n = 300$ and the data dimension $p\in\{200, 300, 400\}$. Furthermore, we assume the regression coefficients have support sets $\{S_1, S_2\}$. For $s\in S_1$, we set $\beta_s^{(1)} \overset{\text{i.i.d.}}{\sim} U(0,2)$. Then, for $s\in S_2$, we set $\beta_s^{(2)} = \beta_s^{(1)} + \delta_s$ with $\delta_s \overset{\text{i.i.d.}}{\sim} U\bigl(0, 10\sqrt{\log(p)/n}\bigr)$. For each replication, the support sets $S_1, S_2$ of the regression coefficients are randomly selected from the set $\{1, 2,\ldots, 0.3p\}$ with $\#S_1 = \#S_2 = [\log(p)]$.

Figure 2 (left) indicates that the computational cost of BSA grows gradually as the data dimension increases from 400 to 2000, compared with the exponential growth of DPA. Meanwhile, as shown in Figure 2 (right), the computational cost of FSA in Algorithm 3 increases slowly compared with the much faster growth of BSA. This suggests that FSA is preferable for single change point models. Furthermore, to investigate the computational efficiency in high-dimensional cases, we present the computational cost of our three proposed approaches in Figure 3. The results imply that our proposed approaches have stable and good performance as the data dimension p grows.

Figure 2:

Efficiency of change point estimation under the single change point model with n ∈ {200, 400, 600, 800, 1000} and p = 2n. The left panel shows the computational costs of BSA and DPA per replication. The right panel shows computational costs per replication of BSA and FSA. The change point is fixed at τ1 = 0.5.

Figure 3:

Efficiency of change point estimation under the single change point model with p ∈ {600, 1200, 1800, 2400, 3000} and n = 300. The left panel shows the computational costs of BSA and DPA per replication. The right panel shows computational costs per replication of BSA and FSA. The change point is fixed at τ1 = 0.5.

As for the accuracy, we record the percentage of replications (Rate (%)) in which DPA and BSA correctly identify a single change point. Note that in Tables 2 and 3, the MSEs for the change point location are expressed in units of 10−4. We can see that DPA, BSA, and VPWBS can identify a single change point with high rates of success. Furthermore, DPA generally has the best performance for estimating the single change point location. VPWBS performs better than BSA, especially when the change occurs near the edge. Both DPA and BSA perform slightly better than FSA. Note that all the proposed algorithms perform better the closer the change point location is to the middle of the data observations, e.g., $\tilde{\tau}_1 = 0.5$.

Table 2:

Single change point detection for Case 1 with Σ = Ip×p under various dimensions and change point locations, based on 100 replications.

measure ∣ τ~1 ∣ method ∣ p = 200 ∣ p = 300 ∣ p = 400
Number 0.5 SGL 2.49 ∣ 19 2.36 ∣ 17 2.48 ∣ 23
Mean∣Rate(%) VPWBS 1.02 ∣ 96 1.01 ∣ 98 1.02 ∣ 97
DPA 1.00 ∣ 100 1.01 ∣ 99 1.00 ∣ 100
BSA 1.00 ∣ 100 1.00 ∣ 100 1.00 ∣ 100
0.7 SGL 2.48 ∣ 23 2.48 ∣ 23 2.64 ∣ 14
VPWBS 1.01 ∣ 97 1.04 ∣ 96 0.99 ∣ 95
DPA 1.04 ∣ 96 1.02 ∣ 98 1.03 ∣ 97
BSA 1.00 ∣ 100 1.00 ∣ 100 1.00 ∣ 100
Location 0.5 SGL - - -
Mean∣MSE(10−4) VPWBS 0.497 ∣ 2.244 0.500 ∣ 3.590 0.496 ∣ 6.322
DPA 0.499 ∣ 1.836 0.503 ∣ 2.256 0.499 ∣ 3.254
BSA 0.498 ∣ 2.446 0.501 ∣ 3.779 0.499 ∣ 6.445
FSA 0.495 ∣ 19.21 0.496 ∣ 9.326 0.493 ∣ 11.18
0.7 SGL - - -
VPWBS 0.699 ∣ 2.442 0.701 ∣ 6.963 0.693 ∣ 7.710
DPA 0.694 ∣ 3.933 0.693 ∣ 4.630 0.691 ∣ 5.086
BSA 0.691 ∣ 6.884 0.688 ∣ 14.29 0.686 ∣ 11.28
FSA 0.684 ∣ 25.09 0.678 ∣ 43.94 0.666 ∣ 37.07

Table 3:

Single change point detection for Case 2 with Σ = Σ* under various dimensions and change point locations. The numerical results are based on 100 replications.

measure ∣ τ~1 ∣ method ∣ p = 200 ∣ p = 300 ∣ p = 400
Number 0.5 SGL 1.14 ∣ 32 1.30 ∣ 41 1.52 ∣ 40
Mean∣Rate(%) VPWBS 1.02 ∣ 98 1.04 ∣ 96 1.03 ∣ 98
DPA 1.01 ∣ 99 1.00 ∣ 100 1.00 ∣ 100
BSA 1.00 ∣ 100 1.00 ∣ 100 1.00 ∣ 100
0.7 SGL 0.72 ∣ 20 1.16 ∣ 26 1.14 ∣ 36
VPWBS 0.97 ∣ 97 1.01 ∣ 99 1.05 ∣ 96
DPA 1.00 ∣ 100 1.02 ∣ 98 1.01 ∣ 99
BSA 0.99 ∣ 99 1.00 ∣ 98 1.01 ∣ 99
Location 0.5 SGL - - -
Mean∣MSE(10−4) VPWBS 0.504 ∣ 5.056 0.495 ∣ 3.146 0.498 ∣ 3.572
DPA 0.498 ∣ 2.145 0.502 ∣ 1.423 0.499 ∣ 3.670
BSA 0.493 ∣ 5.897 0.498 ∣ 3.347 0.498 ∣ 4.326
FSA 0.495 ∣ 12.21 0.489 ∣ 20.10 0.492 ∣ 9.374
0.7 SGL - - -
VPWBS 0.694 ∣ 5.959 0.694 ∣ 7.297 0.701 ∣ 6.950
DPA 0.690 ∣ 5.748 0.690 ∣ 6.411 0.691 ∣ 12.75
BSA 0.689 ∣ 9.419 0.694 ∣ 3.973 0.688 ∣ 11.44
FSA 0.689 ∣ 22.41 0.683 ∣ 24.20 0.676 ∣ 27.66

4.4. Multiple change point models

Finally, we consider the scenario where $(\beta^{(i)})_{1\le i\le n}$ have multiple change points located at $\tilde{\tau}_2 = (0, 0.25, 0.5, 0.75, 1)^T$. We set the sample size $n = 1000$ and the data dimension $p\in\{200, 300, 400\}$. For $s_1\in S_1$, we set $\beta_{s_1}^{(1)} \overset{\text{i.i.d.}}{\sim} U(0,2)$. Then, for $s_j\in S_j$ ($j = 2, 3, 4$), we set $\beta_{s_j}^{(j)} = \beta_{s_j}^{(j-1)} + (j-1)\delta_{s_j}$ with $\delta_{s_j} \overset{\text{i.i.d.}}{\sim} U\bigl(0, 10\sqrt{\log(p)/n}\bigr)$. For each replication, the support sets $S_j$ of the regression coefficients are randomly selected from the set $\{1, 2,\ldots, 0.3p\}$ with $\#S_j = [\log(p)]$, $j = 1, 2, 3, 4$.

We first analyze the efficiency. The results are similar to the other cases. Figure 4 shows that the computational cost of BSA grows gradually with the number of change points. In contrast, the cost of the DPA grows substantially. This suggests that the efficiency of BSA is not sensitive to the number of change points. To compare accuracy, similarly to the previous analysis, we record the percentage of replications in which DPA and BSA can correctly identify the three change points. As shown in Tables 4 and 5, DPA generally has the best performance. VPWBS performs slightly better than BSA. However, there is not much difference in performance among DPA, BSA, and VPWBS in terms of identifying the number and locations of multiple change points.

Figure 4:

Efficiency of change point estimation under the model of multiple change points with n ∈ {200, 400, 600, 800, 1000} and p = 2n. The left panel shows the computational costs of BSA per replication in the settings with different numbers of change points. The right panel shows computational cost of DPA per replication in the settings with different numbers of change points.

Table 4:

Multiple change point detection for Case 1 with Σ = Ip×p under various dimensions. The numerical results are based on 100 replications.

τ2 = (0.25, 0.5, 0.75) p Accuracy for the following methods
SGL VPWBS DPA BSA
200 3.19 ∣ 55 3.02 ∣ 98 3.00 ∣ 100 2.95 ∣ 95
number Mean ∣ Rate(%) 300 3.47 ∣ 55 3.03 ∣ 95 3.00 ∣ 100 2.98 ∣ 98
400 3.50 ∣ 44 3.04 ∣ 93 3.00 ∣ 100 2.94 ∣ 94
200 - 0.248 ∣ 1.108 0.249 ∣ 0.665 0.249 ∣ 1.821
location 1 Mean ∣ MSE (10−5) 300 - 0.250 ∣ 2.100 0.250 ∣ 0.865 0.247 ∣ 4.168
400 - 0.249 ∣ 3.913 0.251 ∣ 1.847 0.256 ∣ 5.736
200 - 0.499 ∣ 0.894 0.500 ∣ 0.353 0.499 ∣ 2.099
location 2 Mean ∣ MSE (10−5) 300 - 0.499 ∣ 1.204 0.501 ∣ 0.508 0.499 ∣ 6.148
400 - 0.498 ∣ 2.490 0.500 ∣ 0.981 0.500 ∣ 6.736
200 - 0.751 ∣ 0.701 0.750 ∣ 1.002 0.749 ∣ 1.689
location 3 Mean ∣ MSE (10−5) 300 - 0.750 ∣ 1.370 0.751 ∣ 2.481 0.745 ∣ 7.968
400 - 0.748 ∣ 2.268 0.749 ∣ 0.688 0.748 ∣ 2.258

Table 5:

Multiple change point detection for Case 2 with Σ = Σ* under various dimensions. The numerical results are based on 100 replications.

τ2 = (0.25, 0.5, 0.75) p Accuracy for the following methods
SGL VPWBS DPA BSA
200 2.65 ∣ 30 3.00 ∣ 100 3.00 ∣ 100 2.99 ∣ 99
number Mean ∣ Rate(%) 300 2.08 ∣ 17 2.97 ∣ 99 3.00 ∣ 100 2.98 ∣ 98
400 2.47 ∣ 25 3.00 ∣ 100 3.00 ∣ 100 2.98 ∣ 98
200 - 0.247 ∣ 2.435 0.250 ∣ 1.961 0.251 ∣ 3.457
location 1 Mean ∣ MSE (10−5) 300 - 0.249 ∣ 3.164 0.251 ∣ 2.844 0.250 ∣ 6.585
400 - 0.251 ∣ 1.687 0.251 ∣ 2.497 0.249 ∣ 4.994
200 - 0.487 ∣ 2.040 0.500 ∣ 1.069 0.499 ∣ 2.667
location 2 Mean ∣ MSE (10−5) 300 - 0.502 ∣ 1.403 0.501 ∣ 0.894 0.499 ∣ 4.363
400 - 0.499 ∣ 2.212 0.501 ∣ 3.613 0.500 ∣ 4.192
200 - 0.748 ∣ 1.242 0.750 ∣ 0.711 0.748 ∣ 1.674
location 3 Mean ∣ MSE (10−5) 300 - 0.752 ∣ 1.672 0.750 ∣ 0.486 0.749 ∣ 1.398
400 - 0.747 ∣ 4.511 0.750 ∣ 0.580 0.747 ∣ 4.881

It is worth mentioning that our proposed methods have good performance in all three models with various data dimensions, sample sizes, and numbers of change points, which suggests that the methods are robust to the suggested choice of tuning parameters λ and γ.

5. REAL DATA ANALYSIS

To illustrate the usefulness of our proposed methods, we consider the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset (www.loni.ucla.edu/ADNI). The ADNI dataset contains disease state information on different subjects, including normal controls (NC), mild cognitive impairment (MCI), and Alzheimer's disease (AD), as well as biological markers including features derived from magnetic resonance imaging (MRI) and positron emission tomography (PET). Studying how to measure the progression of AD from these images is very useful for clinical diagnosis and prevention. For example, in the AD-related literature (see, for example, Reiss & Ogden, 2010), it is popular to use structural MRI or PET to predict the current disease status of a patient (a binary response variable), which can be regarded as a classification problem. Usually, such analyses treat the data as homogeneous and ignore the effect of other covariates, such as age and gender. Hence, an interesting question is whether the generalized linear relationship between the disease status and the biomarkers (MRI or PET) changes with such covariates. If it changes, how can one estimate (a) the number of change points, (b) the locations of the change points, and (c) the regression coefficients (selected variables) in each segment? In our study, we address these issues by detecting change points in the generalized linear relationship between the disease status and the MRI features together with a covariate. In this application, we choose age as the covariate, which is of particular interest in AD studies.

We use the MRI data of 405 subjects, including 220 normal controls and 185 AD patients, from the ADNI data. For each subject, we obtain the corresponding status (AD/NC), age, and 93 MRI features after applying the data processing method proposed in Zhang & Shen (2012). In our model, the predictive variables $X = (X_1,\ldots,X_{93})$ are the 93 MRI features, which are scaled to have mean 0 and variance 1, and the response variable is the binary status obtained by setting $Y_i = 1$ if subject $i$ is an AD patient, and 0 otherwise. For this dataset $\{X_i, Y_i\}_{i=1}^n$, consider the following logistic regression model: $\log\frac{P(Y_i=1)}{P(Y_i=0)} = X_i^T\beta^{(i)}$. Our goal is to detect potential change points of the regression coefficients $\{\beta^{(i)}\}_{i=1}^n$. To account for sampling variability, we divide the data into two parts: training and testing datasets. More specifically, we randomly select 40 subjects from the whole set of 405 subjects according to the empirical distribution of age in Figure 5 (left) as the testing sample and use the remaining 365 subjects as the training data. Then we sort those 365 observations in the training data by age and use BSA to estimate the number and locations of change points. We choose the tuning parameters $\lambda$ and $\gamma$ as suggested in Section 4. The above process is repeated 100 times. Since BSA is more computationally efficient than DPA, we use only BSA to analyze the data.

Figure 5:

The left panel shows distribution of age among 405 subjects. The right panel shows estimated numbers of change points based on 100 replications.

Figure 5 (right) demonstrates the estimated numbers of change points for 100 replications. To be specific, among the 100 replications, 80% of the estimated numbers of change points are 1 (k = 3), which suggests that there is a change point in the context of logistic regression between the disease status and the MRI features due to the change of age. Moreover, Table 6 summarizes the change point estimation results computed by BSA. For the change point estimation, 90% of change points were estimated by BSA to be located at 80 years old and the mean is 79.95 years old. This implies that there are significant differences between the regression models of the disease status and the MRI features under 80 and over 80 years old.

Table 6:

Change point estimation and prediction for ADNI data. The change point estimator is obtained by using BSA based on 100 replications.

training/testing sample methods location of change point Averaged MSE
365/40 BSA 79.75 0.380
Lasso-based - 0.408

In terms of prediction, Table 6 provides the prediction results computed by BSA and the Lasso-based method, where for each replication, we use the training sample to select models and use the testing sample to predict. Note that the Lasso-based method treats the data as homogeneous while we consider a heterogeneous model. For the prediction result, we calculate the predictive MSE on the testing set for these two methods. Our proposed method obtains better prediction performance, which is demonstrated by a 7% lower averaged predictive MSE than that of the Lasso-based method. This suggests that treating the data as heterogeneous and using our method to select models can predict better.

For variable selection, Figure 6 shows the selected features before and after the estimated change point of 80 years old. We see that the models both under 80 and over 80 select the 69th and 83rd features, which correspond to the hippocampal and amygdala regions, respectively. These regions are known to be related to AD from many previous studies (Zhang & Shen, 2012). Moreover, there are a few features selected by only one of the two models. We believe these features deserve more scientific attention, and more research is needed to study their associations with AD together with age.

Figure 6:

Frequency of features selected before the change point (left) and after the change point (right) for the ADNI dataset based on 100 replications. Features selected by models both before and after the change point are shown in red.

6. SUMMARY

In this paper, we provide a three-step procedure for change point detection in the context of high-dimensional GLMs, where the dimension p can be much larger than the sample size n. It is worth mentioning that our proposed method can automatically account for the underlying data generation mechanism (k~=0 or k~>0) without specifying any prior knowledge about the number of change points k~. Moreover, based on dynamic programming and binary segmentation techniques, two algorithms, DPA and BSA, are proposed to detect multiple change points. To further improve the computational efficiency, we present a much more efficient algorithm designed for the case of a single change point. Furthermore, we investigate the theoretical properties of our proposed change point estimators computed by the three algorithms. Estimation consistency for the number and locations of change points is established. Finally, we demonstrate the efficiency and accuracy of our proposed methods by extensive numerical results under various model settings. A real data application to the ADNI dataset also demonstrates the usefulness of our proposed methods.

Table 1:

Change point detection for Cases 1 and 2 under the model with no change point. The numerical results are based on averages of 100 replications; entries are the error rates of the SGL, VPWBS, Lee2011, DPA, and BSA methods.

Case      Measurement   p     SGL    VPWBS   Lee2011   DPA    BSA
Σ = I     error rate    200   0.65   0.01    0.73      0.00   0.00
                        300   0.85   0.01    0.71      0.00   0.00
                        400   0.86   0.01    0.74      0.00   0.00
Σ = Σ*    error rate    200   0.38   0.00    0.71      0.00   0.00
                        300   0.44   0.01    0.71      0.00   0.00
                        400   0.47   0.00    0.72      0.00   0.00

Acknowledgements

The authors thank the Editor, the Associate Editor, and the reviewers, whose helpful comments and suggestions led to a much improved presentation. This research is supported in part by National Natural Science Foundation of China Grants 11971116 (Zhang) and 12101132 (Bin Liu), US National Institutes of Health Grant R01GM126550, and US National Science Foundation Grants DMS-1821231 and DMS-2100729 (Yufeng Liu).

APPENDIX

Notation

We introduce some notation. The random (empirical process) part is defined as $\{v_n(\beta) := (P_n - P)\rho_{f_\beta} : \beta \in \mathbb{R}^p\}$. Recall that the Lasso estimator is, for $j = 1, \dots, k$, $\hat\beta^{(j)} = \arg\min_{\beta}\{P_n\rho_{f_\beta} + \lambda r_j(\tau)\|\beta\|_1\}$. For convenience we write $\hat f = f_{\hat\beta}$ and $\mathcal{E}(\beta) := P(\rho_{f_\beta} - \rho_{f_{\beta^0}})$. Recall that for any $(u, v)$, the oracle $\beta^*$ is defined as $\beta^*(u,v) := \arg\min_{\beta}\{\mathcal{E}(f_\beta)\}$, which is the best approximation of $\beta^0$ under the compatibility condition; in particular, $\beta^* = \beta^0$ if there is no change point between $u$ and $v$. The estimation error is denoted by $\epsilon^* := v_n(\beta^*) = (P_n - P)\rho_{f_{\beta^*}}$. For a positive constant $L > 0$, define $Z_L := \sup_{\|\beta - \beta^*\|_1 \le L}|v_n(\beta) - v_n(\beta^*)|$. We set $L^* := \epsilon^*/\lambda_0$ and require $L^*$ to be relatively small, which indicates that $\epsilon^* \le \lambda_0$. Based on this, for any $u, v \in V_n := \{i/n : i = 1, \dots, n\}$ with $u < v$, we define two important sets as follows:

$\mathcal{T}_0 := \{Z_{L^*} \vee \epsilon^* \le \lambda_0\},$  (1)
$\mathcal{T}_1 := \{\max_{(u,v)}\|\hat\Sigma(u,v) - (v-u)\Sigma\|_\infty \le \lambda_1\}.$  (2)

We introduce the following assumptions.

Assumption F. We require some technical conditions as follows:

(F1) $\lambda\delta \ge 8\lambda_0$, where $\lambda_0 = O\big(\sqrt{\log(p)/n}\big)$.

(F2) $\gamma > 3d^*s^*\lambda^2$, where $d^* = O(1)$, with the detailed definition introduced in the Appendix.

(F3) $\frac{\delta m^2\phi^2 s^*}{8} > \gamma + 22\epsilon^*$, where $\epsilon^* = O\big(s^*\log(p)/n\big)$.

(F4) $\frac{\lambda\delta m^2\phi^2 s^*}{8} > 22\epsilon^* + \lambda\delta M^*s^*$.

Assumption G. We require some conditions for achieving the limiting properties of the de-biased Lasso estimator:

(G1) The derivatives $\dot\rho(Y,a) := \frac{d}{da}\rho(Y,a)$ and $\ddot\rho(Y,a) := \frac{d^2}{da^2}\rho(Y,a)$ exist for all $y, a$, and, in some $\delta$-neighborhood ($\delta > 0$), $\ddot\rho(Y,a)$ is Lipschitz:

$\max_{a_0 \in \{X_i^T\beta^0\}}\ \sup_{|a - a_0| \vee |\hat a - a_0| \le \delta}\ \sup_{Y \in \mathcal{Y}}\ \frac{|\ddot\rho(Y,a) - \ddot\rho(Y,\hat a)|}{|a - \hat a|} \le 1.$

Moreover,

$\max_{a_0 \in \{X_i^T\beta^0\}}\ \sup_{Y \in \mathcal{Y}} |\dot\rho(Y,a_0)| = O(1), \quad \text{and} \quad \max_{a_0 \in \{X_i^T\beta^0\}}\ \sup_{|a - a_0| \le \delta}\ \sup_{Y \in \mathcal{Y}} |\ddot\rho(Y,a)| = O(1).$

(G2) It holds that $\|P_n\ddot\rho_{\hat\beta}\hat\Theta_j^T - e_j\|_\infty = O_P(\lambda)$.

(G3) It holds that $\|X\hat\Theta_j^T\|_\infty = O_P(K)$ and $\|\hat\Theta_j\|_1 = O_P(\sqrt{s^*})$.

(G4) It holds that $\|(P_n - P)\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\|_\infty = O_P(K^2\lambda)$ and, moreover, $\max_j \big|1/(\hat\Theta P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\hat\Theta^T)_{j,j}\big| = O(1)$.

(G5) For every $j$, the random variable $\sqrt{n}\,(\hat\Theta P_n\dot\rho_{\beta^0})_j\big/\sqrt{(\hat\Theta P\dot\rho_{\beta^0}\dot\rho_{\beta^0}^T\hat\Theta^T)_{j,j}}$ converges weakly to a $N(0,1)$ distribution.

Useful Lemmas

We introduce some useful lemmas that are essential for our main results. More specifically, Lemma 1 presents an upper bound on the difference of the subinterval penalized empirical average of the loss function based on the Lasso and oracle estimators. Corollary 1 shows that the inequality in Lemma 1 holds with high probability. Lemma 2 provides results for the subinterval based on the compatibility condition. The margin condition based on the oracle estimator is given in Lemma 3. Lemma 4 presents a lower bound on the difference of the loss function based on the oracle estimators of adjacent subintervals. Finally, Lemma 5 provides an upper bound on the difference of the subinterval penalized empirical average of the loss function based on the oracle estimator and the truth. Next, we introduce these lemmas in detail.

Lemma 1 (Oracle inequality for the Lasso). Suppose Assumptions A-F hold, that $\|\hat\beta - \beta^*\|_1 \le L^*$, and that $\|f_{\hat\beta} - f_{\beta^*}\|_\infty \le L^*K_X$. Suppose that $\lambda$ satisfies the inequality $\lambda\delta \ge 8\lambda_0$. Then on the set $\mathcal{T}_0 \cap \mathcal{T}_1$ given in (1)-(2), we have

$P_n\rho\big((u,v),\hat\beta(u,v)\big) - P_n\rho\big((u,v),\beta^*(u,v)\big) + \lambda(v-u)\|\hat\beta - \beta^*\|_1 \le 6\epsilon^*,$  (3)

where there exists a constant $C_3 > 0$ such that $\epsilon^* \le C_3 s^*\log(p)/n$.

Proof. Since the Assumptions hold, on $\mathcal{T}_0 \cap \mathcal{T}_1$, with $8\lambda_0 < \lambda\delta < \lambda(v-u)$, we have

$P\rho\big((u,v),\hat\beta(u,v)\big) - P\rho\big((u,v),\beta^*(u,v)\big) + \lambda(v-u)\|\hat\beta - \beta^*\|_1 \le 4\epsilon^*,$

by Theorem 6.4 in Bühlmann & van de Geer (2011). By the definitions of $P_n\rho$, $P\rho$, and $v_n(\beta)$ introduced in Section 3.1, we can obtain

$P_n\rho\big((u,v),\hat\beta(u,v)\big) - P_n\rho\big((u,v),\beta^*(u,v)\big) \le P\rho\big((u,v),\hat\beta(u,v)\big) - P\rho\big((u,v),\beta^*(u,v)\big) + |v_n(\beta^*) - v_n(\hat\beta)|.$

If the condition $\|\hat\beta - \beta^*\|_1 \le L^*$ holds, then on the set $\mathcal{T}_0 \cap \mathcal{T}_1$ we have

$P_n\rho\big((u,v),\hat\beta(u,v)\big) - P_n\rho\big((u,v),\beta^*(u,v)\big) + \lambda(v-u)\|\hat\beta - \beta^*\|_1 \le 4\epsilon^* + 2\epsilon^* = 6\epsilon^*,$

which completes the proof.

Corollary 1. Suppose Assumptions A-F hold. Let $a_n := 4\big(\sqrt{\tfrac{2\log(2p)}{n}} + \tfrac{\log(2p)}{n}K_X\big)$ and

$\lambda_0 := \lambda_0(t) := a_n\Big(1 + t\sqrt{2(1 + 2a_nK_X)} + \frac{2t^2a_nK_X}{3}\Big).$

Then, with probability at least $1 - 7\exp[-na_n^2t^2] = 1 - \exp(-C_1t^2\log(p))$, Equation (3) holds. We refer to Theorem 2.1 in van de Geer (2008).

Lemma 2. Suppose Assumption D and $s^*\lambda_1/\phi^2 \le 1/32$ hold. Then on the set $\mathcal{T}_0 \cap \mathcal{T}_1$, we have, for all $(u, v)$ with $u, v \in \{i/n : i = 1, \dots, n\}$ and all $\beta \in \mathbb{R}^p$ with $\|\beta_{S_*^c}\|_1 \le 3\|\beta_{S_*}\|_1$,

$\|\beta_{S_*}\|_1^2 \le \frac{2\big(\beta^T\hat\Sigma(u,v)\beta\big)s^*}{(v-u)\phi^2}.$

Proof. By Assumption D (the compatibility condition), for any $u, v \in \{i/n : i = 1, \dots, n\}$, we have

$\|\beta_{S_*}\|_1^2 \le \frac{\|f_\beta\|^2(v-u)s^*}{(v-u)\phi^2} = \frac{\big(\beta^T(v-u)\Sigma\beta\big)s^*}{(v-u)\phi^2},$

for all $\beta \in \mathbb{R}^p$ that satisfy $\|\beta_{S_*^c}\|_1 \le 3\|\beta_{S_*}\|_1$. Then the matrix $(v-u)\Sigma$ satisfies the compatibility condition for the set $S_*$ with constant $\phi^2_{(v-u)\Sigma} = (v-u)\phi^2$. By Corollary 6.8 in Bühlmann & van de Geer (2011), if $s^*\lambda_1/\phi^2 \le 1/32$, the compatibility condition also holds for the set $S_*$ and the matrix $\hat\Sigma(u,v)$, with $\phi^2_{\hat\Sigma(u,v)} \ge \frac{(v-u)\phi^2}{2}$. Then we can obtain, for all $\beta \in \mathbb{R}^p$ that satisfy $\|\beta_{S_*^c}\|_1 \le 3\|\beta_{S_*}\|_1$,

$\|\beta_{S_*}\|_1^2 \le \frac{\big(\beta^T\hat\Sigma(u,v)\beta\big)s^*}{\phi^2_{\hat\Sigma(u,v)}} \le \frac{2\big(\beta^T\hat\Sigma(u,v)\beta\big)s^*}{(v-u)\phi^2}.$

Lemma 3. By Assumption B, there exists an $\eta \ge 0$ and a strictly convex increasing function $G$ such that for all $\beta \in \mathbb{R}^p$ with $\|f_{\beta} - f^0\|_\infty \le \eta/2$, $\|f_{\beta^*} - f^0\|_\infty \le \eta$, and $\|\beta^* - \beta^0\| \le \eta$, we have

$P_n\rho(\beta) - P_n\rho(\beta^*) \ge \frac{C\|X(\beta - \beta^*)\|_2^2}{2} - 2\epsilon^*.$

Proof. According to Assumption C (margin condition), we directly have

$\mathcal{E}(f_\beta) \ge G\big(\|f_\beta - f^0\|\big) \ge C\|f_\beta - f^0\|^2.$

By the definitions of $\mathcal{E}(f_\beta)$ and $v_n(\beta)$ introduced in the Notation, together with the triangle inequality, we have

$P_n\rho(\beta) - P_n\rho(\beta^*) \ge C\|f_\beta - f^0\|^2 - 2\epsilon^*.$

By the definition of $\beta^*$, under the compatibility condition, $\beta^*$ is the best approximation of $\beta^0$, that is, $\beta^* = \beta^0$. Then we have

$P_n\rho(\beta) - P_n\rho(\beta^*) \ge \frac{C\|X(\beta - \beta^*)\|_2^2}{2} - 2\epsilon^*.$

Lemma 4. Suppose $\tilde{k} > 1$ and that Assumptions A-F and $s^*\lambda_1/\phi^2 \le 1/32$ hold. Then on $\mathcal{T}_0 \cap \mathcal{T}_1$, if $(u,v) \subset (\tilde\tau_{j-1} - c\delta\lambda, \tilde\tau_{j+1} + c\delta\lambda)$ and $u < \tilde\tau_j < v$ for some $j = 2, \dots, \tilde{k}-1$, we have

$P_n\rho\big((u,\tilde\tau_j),\beta^*(u,v)\big) - P_n\rho\big((u,\tilde\tau_j),\beta^*(u,\tilde\tau_j)\big) + P_n\rho\big((\tilde\tau_j,v),\beta^*(u,v)\big) - P_n\rho\big((\tilde\tau_j,v),\beta^*(\tilde\tau_j,v)\big) \ge \frac{\min(\tilde\tau_j - u,\, v - \tilde\tau_j)\,m^2\phi^2 s^*}{8} - 4\epsilon^*.$

Proof. By Lemmas 2 and 3, if $(u,v) \subset (\tilde\tau_{j-1} - c\delta\lambda, \tilde\tau_{j+1} + c\delta\lambda)$ and $u < \tilde\tau_j < v$ for some $j = 2, \dots, \tilde{k}-1$, we have

$P_n\rho\big((u,\tilde\tau_j),\beta^*(u,v)\big) - P_n\rho\big((u,\tilde\tau_j),\beta^*(u,\tilde\tau_j)\big) + P_n\rho\big((\tilde\tau_j,v),\beta^*(u,v)\big) - P_n\rho\big((\tilde\tau_j,v),\beta^*(\tilde\tau_j,v)\big)$
$\ge C\|X\big(\beta^*(u,v) - \beta^*(u,\tilde\tau_j)\big)\|_2^2 + C\|X\big(\beta^*(u,v) - \beta^*(\tilde\tau_j,v)\big)\|_2^2 - 4\epsilon^*$
$\ge (\tilde\tau_j - u)\|\beta^*(u,v) - \beta^*(u,\tilde\tau_j)\|_1^2\frac{\phi^2}{2s^*} + (v - \tilde\tau_j)\|\beta^*(u,v) - \beta^*(\tilde\tau_j,v)\|_1^2\frac{\phi^2}{2s^*} - 4\epsilon^*.$

Now observe that

$(v-u)\beta^*(u,v) = (\tilde\tau_j - u)\beta^*(u,\tilde\tau_j) + (v - \tilde\tau_j)\beta^*(\tilde\tau_j,v).$

Then

$\beta^*(u,v) - \beta^*(u,\tilde\tau_j) = \Big(\frac{v - \tilde\tau_j}{v-u}\Big)\big(\beta^*(\tilde\tau_j,v) - \beta^*(u,\tilde\tau_j)\big),$

by Assumption E. If $(u,v) \subset (\tilde\tau_{j-1} - c\delta\lambda, \tilde\tau_{j+1} + c\delta\lambda)$ and $u < \tilde\tau_j < v$ for some $j = 2, \dots, \tilde{k}-1$, we have

$\|\beta^*(u,v) - \beta^*(u,\tilde\tau_j)\|_1 \ge \frac{(v - \tilde\tau_j)ms^*}{v-u}, \qquad \|\beta^*(u,v) - \beta^*(\tilde\tau_j,v)\|_1 \ge \frac{(\tilde\tau_j - u)ms^*}{v-u}.$  (4)

Then, by the above Equation (4) and straightforward calculations, we can obtain

$(\tilde\tau_j - u)\|\beta^*(u,v) - \beta^*(u,\tilde\tau_j)\|_1^2\frac{\phi^2}{2s^*} + (v - \tilde\tau_j)\|\beta^*(u,v) - \beta^*(\tilde\tau_j,v)\|_1^2\frac{\phi^2}{2s^*} - 4\epsilon^* \ge \frac{(\tilde\tau_j - u)^2 + (v - \tilde\tau_j)^2}{(v-u)^2}\cdot\frac{m^2\phi^2 s^*}{2} - 4\epsilon^*.$

As $\frac{(\tilde\tau_j - u)^2 + (v - \tilde\tau_j)^2}{(v-u)^2} \ge \frac{1}{2}$, we can complete the proof. ■
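The last step of the proof uses only the elementary inequality $(a+b)^2 \le 2(a^2+b^2)$: with $a = \tilde\tau_j - u$ and $b = v - \tilde\tau_j$ (so that $a + b = v - u$),

$\frac{(\tilde\tau_j - u)^2 + (v - \tilde\tau_j)^2}{(v-u)^2} = \frac{a^2 + b^2}{(a+b)^2} \ge \frac{1}{2}.$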

Lemma 5. Suppose Assumptions A-F hold, let $(u,v) \subset (\tilde\tau_{j-1} - c\delta\lambda, \tilde\tau_j + c\delta\lambda) \cap (0,1]$ for some $j = 1, \dots, \tilde{k}+1$, and suppose $s^*\lambda_1/\phi^2 \le 1/32$ and $c\delta\lambda < r(\tilde\tau)$. Then on the set $\mathcal{T}_0 \cap \mathcal{T}_1$ we have

$P_n\rho\big((u,v),\hat\beta(u,v)\big) - P_n\rho\big((u,v),\beta^{0(j)}\big) + \lambda(v-u)\|\hat\beta(u,v) - \beta^{0(j)}\|_1 \le d^*s^*\lambda^2,$

where $b = 1\{u < \tilde\tau_{j-1}\} + 1\{\tilde\tau_j < v\}$ and $d^* = (b^2K_X^2cM^* + b)cM^* + 6C_4$.

Proof. First, by straightforward calculations, we can obtain

$P_n\rho\big((u,v),\hat\beta(u,v)\big) - P_n\rho\big((u,v),\beta^{0(j)}\big) + \lambda(v-u)\|\hat\beta - \beta^{0(j)}\|_1$
$\le P_n\rho\big((u,v),\hat\beta(u,v)\big) - P_n\rho\big((u,v),\beta^*(u,v)\big) + P_n\rho\big((u,v),\beta^*(u,v)\big) - P_n\rho\big((u,v),\beta^{0(j)}\big) + \lambda(v-u)\|\hat\beta - \beta^*(u,v)\|_1 + \lambda(v-u)\|\beta^*(u,v) - \beta^{0(j)}\|_1.$

According to Lemma 1, on the set $\mathcal{T}_0$ we can obtain

$P_n\rho\big((u,v),\hat\beta(u,v)\big) - P_n\rho\big((u,v),\beta^*(u,v)\big) + \lambda(v-u)\|\hat\beta - \beta^*(u,v)\|_1 \le 6\epsilon^* \le 6C_4s^*\lambda^2.$  (5)

Next, we bound the bias between $\beta^*(u,v)$ and $\beta^{0(j)}$, with $(u,v] \subset (\tilde\tau_{j-1} - c\delta\lambda, \tilde\tau_j + c\delta\lambda)$. Since $\lambda c\delta < r(\tilde\tau)$, by Assumption E we have

$\|\beta^*(u,v) - \beta^{0(j)}\|_\infty \le \frac{\max(\tilde\tau_{j-1} - u, 0)}{v-u}\|\beta^{0(j)} - \beta^{0(j-1)}\|_\infty + \frac{\max(v - \tilde\tau_j, 0)}{v-u}\|\beta^{0(j+1)} - \beta^{0(j)}\|_\infty \le \frac{bM^*c\lambda}{v-u}.$  (6)

Furthermore, combining Assumption A, Assumption C, Equation (6), and the Cauchy-Schwarz inequality, we have

$P_n\rho\big((u,v),\beta^*(u,v)\big) - P_n\rho\big((u,v),\beta^{0(j)}\big) + \lambda(v-u)\|\beta^*(u,v) - \beta^{0(j)}\|_1 \le (v-u)\|X\big(\beta^*(u,v) - \beta^{0(j)}\big)\|_2^2 + \lambda(v-u)\|\beta^*(u,v) - \beta^{0(j)}\|_1 \le (b^2K_X^2M^*s^* + b)cM^*s^*\lambda^2.$  (7)

Combining Equations (5) and (7) completes the proof. ■

Proof of Theorem 1. To simplify the notation, we denote the value of the penalized total loss function corresponding to a change point vector $\tau$ by

$H(\tau) = \sum_{j=1}^{l(\tau)+1}P_n\rho\big(I_j(\tau), \hat\beta(\tau,j)\big) + \gamma\, l(\tau).$  (8)
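For intuition only, the dynamic programming recursion that minimizes an objective of the form (8) can be sketched as below; segment_loss(i, j) stands in for the fitted penalized Lasso loss $P_n\rho(I, \hat\beta_I)$ on the segment of observations $i, \dots, j-1$, and the sketch is a generic illustration of the recursion behind DPA rather than the exact implementation used in the paper.

import numpy as np

def dp_change_points(n, segment_loss, gamma):
    """Minimize total segment loss + gamma * (number of change points) by DP.

    segment_loss(i, j) returns the fitted loss on observations i, ..., j-1
    (0-based, half-open); gamma is the per-change-point penalty.
    """
    best = np.full(n + 1, np.inf)   # best[j] = optimal penalized cost of the first j points
    best[0] = -gamma                # offsets the +gamma charged to the first segment
    prev = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + segment_loss(i, j) + gamma
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Backtrack the segment boundaries (interior boundaries are the change points).
    cps, j = [], n
    while j > 0:
        j = prev[j]
        if j > 0:
            cps.append(j)
    return sorted(cps), best[n]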

First, we will show that if the assumptions hold, we must have $l(\hat\tau) = \tilde{k}$ and $\|\hat\tau - \tilde\tau\|_\infty \le c\delta\lambda$. To the contrary, assume that $\hat\tau$ does not satisfy both of these results. We can distinguish three possible cases:

Case 1: The number of change points is overestimated, $l(\hat\tau) > \tilde{k}$. Then there exists some $i$, $1 \le i \le l(\hat\tau) - 1$, such that $\{\hat\tau_{i-1}, \hat\tau_i, \hat\tau_{i+1}\} \subset (\tilde\tau_{j-1} - c\delta\lambda, \tilde\tau_j + c\delta\lambda)$ for some $j$, $1 \le j \le \tilde{k}$.

Case 2: The number of change points is underestimated, $l(\hat\tau) < \tilde{k}$. Then for some $j = 1, \dots, \tilde{k}$, we have $\hat\tau \cap (\tilde\tau_j - c\delta\lambda, \tilde\tau_j + c\delta\lambda) = \emptyset$.

Case 3: The number of change points is correctly estimated, $l(\hat\tau) = \tilde{k}$. However, for some $j = 1, \dots, \tilde{k}$, we have $\hat\tau \cap (\tilde\tau_j - c\delta\lambda, \tilde\tau_j + c\delta\lambda) = \emptyset$, that is, $\|\hat\tau - \tilde\tau\|_\infty > c\delta\lambda$.

We first consider Case 1, where we have $l(\hat\tau) > \tilde{k}$ and there exists some $i$ such that $\{\hat\tau_{i-1}, \hat\tau_i, \hat\tau_{i+1}\} \subset (\tilde\tau_{j-1}, \tilde\tau_j)$ for some $j$, $1 \le j \le \tilde{k}$. We define

$\tau' = (\hat\tau_1, \dots, \hat\tau_{i-1}, \hat\tau_{i+1}, \dots, \hat\tau_{l(\hat\tau)}).$

Then we obtain a new change point vector $\tau'$ with $l(\tau') = l(\hat\tau) - 1$. Denote the intervals by $S_1 = (\hat\tau_{i-1}, \hat\tau_i]$, $S_2 = (\hat\tau_i, \hat\tau_{i+1}]$, and $S = (\hat\tau_{i-1}, \hat\tau_{i+1}]$; then we obtain

$H(\tau') - H(\hat\tau) = P_n\rho(S, \hat\beta_S) - P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(S_2, \hat\beta_{S_2}) - \gamma.$  (9)

By the definition of the Lasso estimator $\hat\beta$ and the triangle inequality, we directly have

$P_n\rho(S, \hat\beta_S) \le P_n\rho(S, \beta^{0(j)}) + \lambda|S|\,\|\beta^{0(j)} - \hat\beta_S\|_1.$  (10)

Then, combining Equations (9)-(10), we directly have

$H(\tau') - H(\hat\tau) \le P_n\rho(S, \beta^{0(j)}) - P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(S_2, \hat\beta_{S_2}) + \lambda|S|\,\|\beta^{0(j)} - \hat\beta_S\|_1 - \gamma.$  (11)

Using straightforward calculations and the triangle inequality, we have

$P_n\rho(S, \beta^{0(j)}) - P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(S_2, \hat\beta_{S_2}) = P_n\rho(S_1, \beta^{0(j)}) + P_n\rho(S_2, \beta^{0(j)}) - P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(S_2, \hat\beta_{S_2}) \le \big|P_n\rho(S_1, \beta^{0(j)}) - P_n\rho(S_1, \hat\beta_{S_1})\big| + \big|P_n\rho(S_2, \beta^{0(j)}) - P_n\rho(S_2, \hat\beta_{S_2})\big|.$  (12)

By Lemma 5 and the above Equation (12), applied with $(u, v) = S_1, S_2$, we have

$P_n\rho(S, \beta^{0(j)}) - P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(S_2, \hat\beta_{S_2}) \le 2d^*s^*\lambda^2.$  (13)

Also by Lemma 5, for the second term in Equation (10), we directly have

$\lambda|S|\,\|\beta^{0(j)} - \hat\beta_S\|_1 \le d^*s^*\lambda^2,$  (14)

and therefore, by combining Equation (11) and Equations (13)-(14), we easily obtain

$H(\tau') - H(\hat\tau) \le 3d^*s^*\lambda^2 - \gamma.$  (15)

According to Assumption F2, we obtain $H(\tau') < H(\hat\tau)$, which is a contradiction.

For Case 2, where we have $l(\hat\tau) < \tilde{k}$, we define a new change point vector $\tau' = \hat\tau \cup \{\tilde\tau_j\}$, that is,

$\tau' = (\hat\tau_1, \dots, \hat\tau_{r_i - 1}, \hat\tau_{r_i}, \hat\tau_{r_i + 1}, \dots, \hat\tau_{l(\hat\tau)+1}),$  (16)

where $\hat\tau_{r_i} = \tilde\tau_j$. We obtain a new change point vector $\tau'$ with $l(\tau') = l(\hat\tau) + 1$. We also denote the intervals by $S_1 = (\hat\tau_{r_i - 1}, \hat\tau_{r_i}]$, $S_2 = (\hat\tau_{r_i}, \hat\tau_{r_i + 1}]$, and $S = (\hat\tau_{r_i - 1}, \hat\tau_{r_i + 1}]$; then we have

$H(\hat\tau) - H(\tau') = P_n\rho(S, \hat\beta_S) - P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(S_2, \hat\beta_{S_2}) - \gamma.$  (17)

By Lemma 1, for $u, v \in V_n$, we can obtain $\big|P_n\rho\big((u,v),\hat\beta(u,v)\big) - P_n\rho\big((u,v),\beta^*(u,v)\big)\big| \le 6\epsilon^*$. Thus, by this inequality (with $(u,v) = S_1$, $S_2$, and $S$), the triangle inequality, and Lemma 4, we have

$P_n\rho(S, \hat\beta_S) - P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(S_2, \hat\beta_{S_2}) \ge P_n\rho(S, \beta^*_S) - P_n\rho(S_1, \beta^*_{S_1}) - P_n\rho(S_2, \beta^*_{S_2}) - 3(6\epsilon^*) \ge \frac{\delta m^2\phi^2 s^*}{8} - 4\epsilon^* - 3(6\epsilon^*).$  (18)

Then, by combining the above Equations (17)-(18), we directly have

$H(\hat\tau) - H(\tau') \ge \frac{\delta m^2\phi^2 s^*}{8} - 4\epsilon^* - 3(6\epsilon^*) - \gamma,$  (19)

and by Assumption F3, we have $H(\hat\tau) > H(\tau')$, which is a contradiction.

For Case 3, with $l(\hat\tau) = \tilde{k}$, we must add some points and remove others to obtain a good change point estimator. We define $\tau^{(1)} = \hat\tau \cup \{\tilde\tau_j\}$ with $\tau^{(1)}_{r_i} = \tilde\tau_j$. Denote the intervals by $S_1 = (\tau^{(1)}_{r_i - 1}, \tau^{(1)}_{r_i}]$, $S_2 = (\tau^{(1)}_{r_i}, \tau^{(1)}_{r_i + 1}]$, and $S = (\tau^{(1)}_{r_i - 1}, \tau^{(1)}_{r_i + 1}]$; then we can obtain

$H(\hat\tau) - H(\tau^{(1)}) = P_n\rho(S, \hat\beta_S) - P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(S_2, \hat\beta_{S_2}) - \gamma.$  (20)

Without loss of generality, we assume $|S_1| < \delta$ and define a new partition $\tau^{(2)} = \tau^{(1)} \setminus \{\tau^{(1)}_{r_i - 1}\}$. Denoting $K_1 = (\tau^{(2)}_{r_i - 1}, \tau^{(1)}_{r_i - 1}]$ and $I = K_1 \cup S_1$, we then have

$H(\tau^{(1)}) - H(\tau^{(2)}) = P_n\rho(K_1, \hat\beta_{K_1}) + P_n\rho(S_1, \hat\beta_{S_1}) - P_n\rho(I, \hat\beta_I) + \gamma.$  (21)

Thus, by combining Equations (20) and (21), with straightforward calculations, we have

$H(\hat\tau) - H(\tau^{(2)}) = H(\hat\tau) - H(\tau^{(1)}) + H(\tau^{(1)}) - H(\tau^{(2)}) = P_n\rho(S, \hat\beta_S) - P_n\rho(S_2, \hat\beta_{S_2}) + P_n\rho(K_1, \hat\beta_{K_1}) - P_n\rho(I, \hat\beta_I).$

By the definition of the Lasso estimator and the triangle inequality, we can obtain

$|I|\,\|\beta^*_{K_1} - \hat\beta_I\|_1 \le |I|\,\|\beta^*_I - \hat\beta_I\|_1 + |I|\,\|\beta^*_{K_1} - \beta^*_I\|_1 \le |I|\,\|\beta^*_I - \hat\beta_I\|_1 + |I|\frac{\delta}{|I|}M^*s^* \le |I|\,\|\beta^*_I - \hat\beta_I\|_1 + \delta M^*s^*.$  (22)

By Equation (22), Lemma 1, Lemma 4, and straightforward calculations, we have

$P_n\rho(S, \hat\beta_S) - P_n\rho(S_2, \hat\beta_{S_2}) + P_n\rho(K_1, \hat\beta_{K_1}) - P_n\rho(I, \hat\beta_I)$
$\ge P_n\rho(S, \beta^*_S) - P_n\rho(S_2, \beta^*_{S_2}) + P_n\rho(K_1, \beta^*_{K_1}) - P_n\rho(I, \beta^*_{K_1}) - 18\epsilon^* - \lambda|I|\,\|\beta^*_{K_1} - \hat\beta_I\|_1$
$\ge P_n\rho(S, \beta^*_S) - P_n\rho(S_2, \beta^*_{S_2}) - P_n\rho(S_1, \beta^*_{S_1}) - 18\epsilon^* - \lambda|I|\,\|\beta^*_I - \hat\beta_I\|_1 - \lambda\delta M^*s^*$
$\ge \frac{\lambda\delta m^2\phi^2 s^*}{8} - 22\epsilon^* - \lambda\delta M^*s^*.$

According to Assumption F4, we can obtain $H(\hat\tau) - H(\tau^{(2)}) > 0$, which is a contradiction. Altogether, the first two results (1) and (2) in Theorem 1 have been proved. Result (3) in Theorem 1 can be obtained directly by combining result (2) with Lemma 5.

Now we prove the fourth result in Theorem 1. Using the condition $\|x_i\hat\Theta_j^T\|_\infty = O_P(K)$, we have

$\hat\Theta P_n\dot\rho_{\hat\beta} = \hat\Theta P_n\dot\rho_{\beta^0} + \hat\Theta P_n\ddot\rho_{\hat\beta}(\hat\beta - \beta^0) + \mathrm{Rem}_1,$

where

$\mathrm{Rem}_1 = O_P(K)\frac{\sum_{i=1}^n|x_i(\hat\beta - \beta^0)|^2}{n} = O_P(K)\|X(\hat\beta - \beta^0)\|_n^2 = O_P(Ks^*\lambda^2) = o_P(n^{-1/2}).$

It follows that

$\tilde\beta(\tau,j) - \beta^{0(j)} = \hat\beta(\tau,j) - \beta^{0(j)} - \hat\Theta P_n\dot\rho_{\hat\beta} = \hat\beta(\tau,j) - \beta^{0(j)} - \hat\Theta P_n\dot\rho_{\beta^0} - \hat\Theta P_n\ddot\rho_{\hat\beta}\big(\hat\beta(\tau,j) - \beta^{0(j)}\big) - \mathrm{Rem}_1 = -\hat\Theta P_n\dot\rho_{\beta^0} - \big(\hat\Theta P_n\ddot\rho_{\hat\beta} - I\big)\big(\hat\beta(\tau,j) - \beta^{0(j)}\big) - \mathrm{Rem}_1 = -\hat\Theta P_n\dot\rho_{\beta^0} - \mathrm{Rem}_2.$

By the proof of Theorem 3.1 in van de Geer et al. (2014), we have $\|\hat\Theta P_n\ddot\rho_{\hat\beta} - I\|_\infty = O(\lambda)$. According to the third result in Theorem 1, we have $\|\hat\beta(\tau,j) - \beta^{0(j)}\|_1 = O_P(s^*\lambda)$. Then, it follows that

$\|\mathrm{Rem}_2\|_\infty \le \|\mathrm{Rem}_1\|_\infty + O(\lambda)\|\hat\beta(\tau,j) - \beta^{0(j)}\|_1 = o_P(n^{-1/2}) + O_P(s^*\lambda^2) = o_P(n^{-1/2}),$

since the condition $s^*\lambda^2 = O\big(s^*\log(p)/n\big) = o(n^{-1/2})$ holds. By straightforward calculations, we have

$\sqrt{n}\big(\tilde\beta(\hat\tau,j) - \beta^{0(j)}\big) = -\sqrt{n}\,\hat\Theta P_n\dot\rho_{\beta^0} - \sqrt{n}\,\mathrm{Rem}_2 = -\sqrt{n}\,\hat\Theta P_n\dot\rho_{\beta^0} + o_P(1).$

By the proof of Theorem 3.1 in van de Geer et al. (2014), we can then conclude that

$\frac{\sqrt{n}\big(\tilde\beta_s(\hat\tau,j) - \beta^{0(j)}_s\big)}{\hat\sigma_{j,s}} = V_{j,s} + o_P(1), \qquad s \in \{1, \dots, p\},$

where $V_{j,s} \rightsquigarrow N(0,1)$ and $\hat\sigma_{j,s}^2 := \big(\hat\Theta_j P_n\dot\rho_{\hat\beta}\dot\rho_{\hat\beta}^T\hat\Theta_j^T\big)_{s,s}$. ■
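As an illustration of how this limiting result is typically used, an asymptotic $(1-\alpha)$ confidence interval for the $s$-th coefficient in the $j$-th estimated segment takes the familiar form

$\tilde\beta_s(\hat\tau, j) \;\pm\; z_{1-\alpha/2}\,\frac{\hat\sigma_{j,s}}{\sqrt{n}},$

where $z_{1-\alpha/2}$ denotes the standard normal quantile; this is a standard consequence of the display above rather than a statement made elsewhere in the paper.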

Proof of Theorem 2. First we will show that, under the conditions of Theorem 1 and on $\mathcal{T}_0 \cap \mathcal{T}_1$, each of the following three cases leads to a contradiction:

Case A: The number of change points is overestimated, $l(\hat\tau) > \tilde{k}$. We have $\tilde{k} = 1$ and $l(\hat\tau_b) = 1$.

Case B: The number of change points is underestimated, $l(\hat\tau) < \tilde{k}$. For $\tilde{k} > 1$, we have $h(0, 1) = 0$ and $l(\hat\tau_b) = 0$.

Case C: The number of change points is correctly estimated, $l(\hat\tau) = \tilde{k}$. For $\tilde{k} > 1$, we have $h(0, 1) \in [\delta, 1 - \delta]$.

This fact can be derived straightforwardly from the proof of Theorem 1, as the objective functions coincide for 1 or 2 segments; that is, for all u ∈ [0, 1],

$H\big((0, u, 1)\big) = Z(0, u) + Z(u, 1).$

For Case A, suppose $\tilde{k} = 1$ and $\tau_0 = (0, 1)$. As in the proof of Case 1 in Theorem 1, we can obtain

$H(\tau_0) < \min_{u \in (\delta, 1-\delta)} H\big((0, u, 1)\big),$

and $h(0, 1) = 0$. Next suppose $\tilde{k} > 1$ and $h(0,1) \notin \bigcup_{j=1}^{\tilde{k}}(\tilde\tau_j - c\delta\lambda, \tilde\tau_j + c\delta\lambda)$. We define $\tau^{(0)} = (0, h(0, 1), 1)$, $\tau^{(1)} = \tau^{(0)} \cup \{\tilde\tau_j\}$, and $\tau^{(2)} = \tau^{(1)} \setminus \{h(0, 1)\}$.

For Case B, $h(0, 1) = 0$; as in the proof of Case 2 in Theorem 1, we can obtain $H(\tau^{(0)}) > H(\tau^{(1)})$. For Case C, $h(0, 1) \in [\delta, 1 - \delta]$; as in the proof of Case 3 in Theorem 1, we can obtain

$H(\tau^{(0)}) - H(\tau^{(2)}) = H(\tau^{(0)}) - H(\tau^{(1)}) + H(\tau^{(1)}) - H(\tau^{(2)}) > 0.$

Since $h(0, 1)$ minimizes Equation (17), all three cases result in a contradiction. We can then replace $(0, 1)$ by each sub-interval and obtain the same results, which completes the proof. ■

Proof of Theorem 3. First we recall our proposed statistics as follows:

$W_{1/4} - W_{3/4} = P_n\rho\big((0,\tfrac14),\hat\beta(0,\tfrac14)\big) + P_n\rho\big((\tfrac14,1),\hat\beta(\tfrac14,1)\big) - P_n\rho\big((0,\tfrac34),\hat\beta(0,\tfrac34)\big) - P_n\rho\big((\tfrac34,1),\hat\beta(\tfrac34,1)\big).$

Then, for convenience, we consider two cases for the change point location denoted by τ:

Case E: 0 < τ ≤ 1/4,

Case F: 1/4 < τ < 1/2.

Next, we discuss each case in detail. Case E: In this case, $0 < \tau \le 1/4$, so we have

$W_{1/4} - W_{3/4} = \Big(P_n\rho\big((0,\tfrac14),\hat\beta(0,\tfrac14)\big) - P_n\rho\big((0,\tfrac14),\hat\beta(0,\tfrac34)\big)\Big) + \Big(P_n\rho\big((\tfrac14,\tfrac34),\hat\beta(\tfrac14,1)\big) - P_n\rho\big((\tfrac14,\tfrac34),\hat\beta(0,\tfrac34)\big)\Big) + \Big(P_n\rho\big((\tfrac34,1),\hat\beta(\tfrac14,1)\big) - P_n\rho\big((\tfrac34,1),\hat\beta(\tfrac34,1)\big)\Big).$

We observe that there is no change point in $(1/4, 1)$. Hence the Lasso estimates on any subinterval $(u, v) \subset (1/4, 1)$ make no difference, and we can replace $\hat\beta(\tfrac14,1)$ by $\hat\beta(\tfrac14,\tfrac34)$ or $\hat\beta(\tfrac34,1)$; it follows that

$W_{1/4} - W_{3/4} = \Big(P_n\rho\big((0,\tfrac14),\hat\beta(0,\tfrac14)\big) - P_n\rho\big((0,\tfrac14),\hat\beta(0,\tfrac34)\big)\Big) + \Big(P_n\rho\big((\tfrac14,\tfrac34),\hat\beta(\tfrac14,\tfrac34)\big) - P_n\rho\big((\tfrac14,\tfrac34),\hat\beta(0,\tfrac34)\big)\Big).$  (23)

We denote the intervals by $J_1 = (0, \tfrac14)$, $J_2 = (\tfrac14, \tfrac34)$, and $J = (0, \tfrac34)$. Then (23) can be rearranged as follows:

$W_{1/4} - W_{3/4} = \big(P_n\rho(J_1, \hat\beta_{J_1}) - P_n\rho(J_1, \hat\beta_J)\big) + \big(P_n\rho(J_2, \hat\beta_{J_2}) - P_n\rho(J_2, \hat\beta_J)\big) = P_n\rho(J_1, \hat\beta_{J_1}) + P_n\rho(J_2, \hat\beta_{J_2}) - P_n\rho(J, \hat\beta_J).$

Then using the same argument as Case 1 in Theorem 1, we can obtain W1/4 < W3/4.

Case F: In this case, $1/4 < \tau < 1/2$, and we observe that there is no change point within the two intervals $(0, 1/4)$ and $(3/4, 1)$. Then, by straightforward calculations, we can obtain

$W_{1/4} - W_{3/4} = P_n\rho\big((\tfrac14,1),\hat\beta(\tfrac14,1)\big) - P_n\rho\big((0,\tfrac34),\hat\beta(0,\tfrac34)\big) = P_n\rho\big((\tfrac14,\tfrac34),\hat\beta(\tfrac14,1)\big) + P_n\rho\big((\tfrac34,1),\hat\beta(\tfrac14,1)\big) - P_n\rho\big((0,\tfrac14),\hat\beta(0,\tfrac34)\big) - P_n\rho\big((\tfrac14,\tfrac34),\hat\beta(0,\tfrac34)\big).$

We denote the intervals by $K_1 = (0, \tfrac14)$, $J_1 = (\tfrac14, \tfrac34)$, $J_2 = (\tfrac34, 1)$, $J = (\tfrac14, 1)$, and $I = (0, \tfrac34)$, and it follows that

$W_{1/4} - W_{3/4} = P_n\rho(J_1, \hat\beta_J) + P_n\rho(J_2, \hat\beta_J) - P_n\rho(K_1, \hat\beta_I) - P_n\rho(J_1, \hat\beta_I).$

Then using the same argument as Case 3 in Theorem 1, we can obtain W1/4 < W3/4. ■

Footnotes

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

BIBLIOGRAPHY

1. Aue A, Hörmann S, Horváth L, & Reimherr M (2009). Break detection in the covariance structure of multivariate time series models. The Annals of Statistics, 37(6B), 4046–4087.
2. Barigozzi M, Cho H, & Fryzlewicz P (2018). Simultaneous multiple change-point and factor analysis for high-dimensional time series. Journal of Econometrics, 206(1), 187–225.
3. Bühlmann P, & van de Geer S (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
4. Bickel PJ, Ritov Y, & Tsybakov AB (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4), 1705–1732.
5. Boysen L, Kempe A, Liebscher V, Munk A, & Wittich O (2009). Consistencies and rates of convergence of jump-penalized least squares estimators. The Annals of Statistics, 37(1), 157–183.
6. Braun JV, Braun RK, & Müller HG (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2), 301–314.
7. Candes E, & Tao T (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351.
8. Cho H, & Fryzlewicz P (2012). Multiscale and multilevel technique for consistent segmentation of nonstationary time series. Statistica Sinica, 22(1), 207–229.
9. Cho H, & Fryzlewicz P (2015). Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2), 475–507.
10. Ciuperca G (2014). Model selection by LASSO methods in a change-point model. Statistical Papers, 55(2), 349–374.
11. Fan J, & Lv J (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101–148.
12. Fan J, Lv J, & Qi L (2011). Sparse high-dimensional models in economics. Annual Review of Economics, 3(1), 291–317.
13. Fan J, & Peng H (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3), 928–961.
14. Fong Y, Di C, & Permar S (2015). Change point testing in logistic regression models with interaction term. Statistics in Medicine, 34(9), 1483–1494.
15. Frick K, Munk A, & Sieling H (2014). Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3), 495–580.
16. Fryzlewicz P (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6), 2243–2281.
17. Harchaoui Z, & Lévy-Leduc C (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492), 1480–1493.
18. Jirak M (2015). Uniform change point tests in high dimension. The Annals of Statistics, 43(6), 2451–2483.
19. Kirch C, Muhsal B, & Ombao H (2015). Detection of changes in multivariate time series with application to EEG data. Journal of the American Statistical Association, 110(511), 1197–1216.
20. Lavielle M, & Teyssière G (2006). Detection of multiple change-points in multivariate time series. Lithuanian Mathematical Journal, 46(3), 287–306.
21. Lee S, & Seo MH (2008). Semiparametric estimation of a binary response model with a change-point due to a covariate threshold. Journal of Econometrics, 144(2), 492–499.
22. Lee S, Seo MH, & Shin Y (2011). Testing for threshold effects in regression models. Journal of the American Statistical Association, 106(493), 220–231.
23. Lee S, Seo MH, & Shin Y (2016). The lasso for high dimensional regression with a possible change point. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1), 193–210.
24. Leonardi F, & Bühlmann P (2016). Computationally efficient change point detection for high-dimensional regression. arXiv preprint arXiv:1601.03704.
25. Li D, Qian J, & Su L (2016). Panel data models with interactive fixed effects and multiple structural breaks. Journal of the American Statistical Association, 111(516), 1804–1819.
26. Liu B, Zhang X, & Liu Y (2019). Simultaneous change point detection and identification for high dimensional linear models. Submitted.
27. Page ES (1955). A test for a change in a parameter occurring at an unknown point. Biometrika, 42(3/4), 523–527.
28. Pesaran MH, & Pick A (2007). Econometric issues in the analysis of contagion. Journal of Economic Dynamics and Control, 31(4), 1245–1277.
29. Qian J, & Su L (2016). Shrinkage estimation of regression models with multiple structural changes. Econometric Theory, 32(6), 1376–1433.
30. Raginsky M, Willett RM, Horn C, Silva J, & Marcia RF (2012). Sequential anomaly detection in the presence of noise and limited feedback. IEEE Transactions on Information Theory, 58(8), 5544–5562.
31. Reiss PT, & Ogden RT (2010). Functional generalized linear models with images as predictors. Biometrics, 66(1), 61–69.
32. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
33. Tibshirani R (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3), 273–282.
34. Tickle SO, Eckley IA, Fearnhead P, & Haynes K (2020). Parallelization of a common changepoint detection method. Journal of Computational and Graphical Statistics, 29(1), 149–161.
35. van de Geer SA (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645.
36. van de Geer S, Bühlmann P, Ritov Y, & Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.
37. Wang D, Lin K, & Willett R (2021a). Statistically and computationally efficient change point localization in regression settings. arXiv preprint arXiv:1906.11364.
38. Wang D, Yu Y, & Rinaldo A (2021b). Optimal covariance change point localization in high dimensions. Bernoulli, 27(1), 554–575.
39. Wang T, & Samworth RJ (2018). High dimensional change point estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1), 57–83.
40. Zhang D, & Shen D (2012). Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease. NeuroImage, 59(2), 895–907.
41. Zhang T, & Lavitas L (2018). Unsupervised self-normalized change-point testing for time series. Journal of the American Statistical Association, 113(522), 637–648.
