Abstract
Change point detection for high-dimensional data is an important yet challenging problem for many applications. In this paper, we consider multiple change point detection in the context of high-dimensional generalized linear models, allowing the covariate dimension p to grow exponentially with the sample size n. The model considered is general and flexible in the sense that it covers various specific models as special cases. Our method can automatically account for the underlying data generation mechanism without specifying any prior knowledge about the number of change points. Based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points, allowing the number of change points to grow with n. To further improve the computational efficiency, a more efficient algorithm designed for the case of a single change point is proposed. We present theoretical properties of our proposed algorithms, including estimation consistency for the number and locations of change points as well as consistency and asymptotic distributions for the underlying regression coefficients. Finally, extensive simulation studies and an application to the Alzheimer’s Disease Neuroimaging Initiative data further demonstrate the competitive performance of our proposed methods.
Keywords: Binary segmentation, Dynamic programming, Generalized linear models, High dimensions
1. INTRODUCTION
With the advance of technology, complex large-scale data are prevalent in various scientific fields. Data heterogeneity creates great challenges for the analysis of complex data that may not be well approximated by a common distribution. Change point detection is a powerful tool for dealing with data heterogeneity. Since the seminal work of Page (1955), there has been a growing literature on change point detection with a wide range of applications, including genomics (Braun et al., 2000), finance (Pesaran & Pick, 2007; Fan et al., 2011), and social networks (Raginsky et al., 2012).
In this paper, we consider multiple change point detection for a general framework of high-dimensional generalized linear models (GLMs). Suppose we have n independent observations with
| g(E(Yi ∣ Xi)) = Xi⊤β(i), i = 1, …, n, | (1) |
where Yi is the real-valued response for the i-th observation, Xi = (Xi1,…,Xip)⊤ is the corresponding covariate vector in ℝp, g(·) is the link function, and β(i) ∈ ℝp is the unknown regression coefficient vector for the i-th observation. Then we consider estimating multiple change points with piecewise constant coefficients for Model (1). More specifically, let k0 be the true number of unknown change points along with the true location vector τ0 = (τ10, …, τk00)⊤ with 0 < τ10 < ⋯ < τk00 < 1. Then, the unknown change points divide the n time-ordered observations into k0 + 1 intervals and the underlying regression coefficients β(i) have the following form:
| β(i) = β0(j), for i/n ∈ (τj−10, τj0], j = 1, …, k0 + 1, | (2) |
where β0(j) ∈ ℝp denotes the underlying true regression coefficients in the j-th interval, with the convention τ00 := 0 and τk0+10 := 1. We focus on change point detection, which consists of estimating: (a) the number of change points k0; (b) the locations of change points τ0; (c) the regression coefficients β0(j) in each segment, where j = 1, …, k0 + 1.
There is a growing literature on change point detection. Most existing papers focus on change point problems in the mean, variance, or covariance matrix, either for a fixed p (Kirch et al., 2015; Zhang & Lavitas, 2018) or for a growing p (Frick et al., 2014; Jirak, 2015; Barigozzi et al., 2018; Wang & Samworth, 2018; Wang et al., 2021b). Progress has been made in the literature on detection of multiple change points as well (Lavielle & Teyssiére, 2006; Aue et al., 2009; Harchaoui & Lévy-Leduc, 2010; Cho & Fryzlewicz, 2015). Despite this progress, far fewer papers address regression change point problems, especially for high-dimensional models. The main difficulty comes from the complexity of both computation and theoretical analysis arising from the growing dimension.
For regression problems, penalized techniques such as Lasso (Tibshirani, 1996) are popular in dealing with high-dimensional data. Some theoretical properties of the Lasso and various extensions can be found in Fan & Peng (2004), Candes & Tao (2007), and van de Geer et al. (2014). For a general overview and recent developments, we refer to Fan & Lv (2010) and Tibshirani (2011). In terms of change point detection based on Lasso, some methods exist for solving regression change point problems both in low and high dimensions. For example, designed for a fixed p, Ciuperca (2014) considered multiple change point estimation based on the Lasso. Qian & Su (2016) and Li et al. (2016) proposed a systematic change point estimation framework based on the adaptive fused Lasso. When the data dimension p grows to infinity, Lee et al. (2016) considered high-dimensional linear models with a possible change point and proposed a method for estimating regression coefficients as well as the unknown threshold parameter. As an extension, Leonardi & Bühlmann (2016) proposed computationally efficient algorithms for the number and locations of multiple change points in the context of high-dimensional linear models. Recently, Liu et al. (2019) investigated simultaneous change point detection and identification based on a de-biased Lasso process. Wang et al. (2021a) developed variance projected wild binary segmentation (VPWBS) for multiple change point detection.
Note that the above-mentioned papers focused on change point detection based on linear models with a continuous response, and thus are not directly applicable to the analysis of categorical or count response variables in practice. GLM can be very useful in this situation since it covers the exponential family distributions for the response variable. Because of its generality, GLM is widely used in various applications such as genetics, economics, and epidemiology. Several papers studied low-dimensional, single change point problems in the context of GLM (Lee & Seo, 2008; Lee et al., 2011; Fong et al., 2015). To the best of our knowledge, change point detection for high-dimensional GLMs has not been studied in the literature. Hence, it is desirable to consider a flexible and general framework for analyzing high-dimensional data with heterogeneity. Motivated by this, in this paper, we consider computationally efficient multiple change point detection in the context of high-dimensional GLMs. Our main contributions are summarized as follows:
We consider change point problems in a more flexible and general framework of high-dimensional GLMs, allowing the data dimension p to grow exponentially with the sample size n. It covers various model settings including linear models, logistic, and probit models as special cases. As far as we know, change point detection for high-dimensional logistic and probit models has not been considered in the literature.
Under the above framework, we propose a three-step procedure to estimate the number and locations of change points based on the Lasso estimator of the regression coefficients. The basic idea is to choose a suitable contrast function J(τ(k)) that is minimized (up to a penalty term) at the true change point vector. To solve this optimization problem, we propose two algorithms based on dynamic programming and binary segmentation techniques, which have computational costs of O(n2GLMLasso(n)) and O(n log(n)GLMLasso(n)), respectively, where GLMLasso(n) is the cost of computing the Lasso estimator for the GLM with sample size n. We also propose a much more efficient approach for the single change point case, with a computational cost of O(log(n)GLMLasso(n)). To the best of our knowledge, this is the most computationally efficient algorithm available for detecting a single change point in GLMs.
We examine some theoretical properties of our proposed change point estimators computed by the three algorithms. To be specific, under some mild conditions, both the dynamic programming and binary segmentation techniques yield consistent estimators for the number and locations of the true change points, covering the case with an asymptotically growing number of change points. Moreover, the estimation error of the Lasso estimator of the underlying regression coefficients is shown to be op(1). To enable further statistical prediction and inference, we introduce the de-biased Lasso estimator of the underlying regression coefficients in each segment, which is shown to be asymptotically normal. As for the third, more efficient approach designed for single change point cases, we establish that it can identify the location of the change point with high estimation accuracy. Finally, the competitive performance of our proposed methods is demonstrated by extensive numerical results as well as an application to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset.
For a better understanding of our work, we would like to point out its relationship with several related papers. Compared to Lee et al. (2011), which considered single change point detection for the binary response variable with low dimensional covariates, we overcome the challenges of the computational and theoretical complexity arising from the growing dimension and number of change points. Meanwhile, to address the issue of the unknown multiple change points, we construct accurate and effective algorithms based on two techniques, dynamic programming and binary segmentation. These techniques are popular for multiple change point detection and were previously studied by Lavielle & Teyssiére (2006), Boysen et al. (2009), Harchaoui & Lévy-Leduc (2010), Cho & Fryzlewicz (2012, 2015), and Leonardi & Bühlmann (2016). Our extension to GLMs involves several technical challenges to overcome. One substantial difficulty comes from the complex form of the contrast functions compared to the least squares for linear models considered in Leonardi & Bühlmann (2016).
The rest of this paper is organized as follows. In Section 2, we introduce our methodology and demonstrate how our three proposed algorithms detect change points. In Section 3, the corresponding theoretical results for the change point estimators computed by the different algorithms are established. We investigate the performance of our proposed methods through extensive simulation studies in Section 4 and a real data application in Section 5. We summarize the paper in Section 6. Detailed proofs of the main theorems and some useful lemmas are given in the Appendix.
2. METHODOLOGY
In this section, we introduce our new methodology for Model (1) with multiple unknown change points. In particular, in Section 2.1, some notation is introduced. In Section 2.2, we present a three-step change point estimator including the number and the locations of change points. Meanwhile, the regression coefficients in each segment are estimated based on the Lasso. In Sections 2.2.1 and 2.2.2, based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points. To further improve the computational efficiency, in Section 2.2.3, we present a much more efficient algorithm designed for the case of a single change point.
2.1. Notation
We first introduce some notation. For a vector a = (a1, …, ap)⊤ ∈ ℝp, we denote ∥a∥1 = Σi=1p ∣ai∣, ∥a∥2 = (Σi=1p ai2)1/2, and ∥a∥∞ = max1≤i≤p ∣ai∣. For two real-valued sequences an and bn, we set an = O(bn) if there exists a constant C such that ∣an∣ ≤ C∣bn∣ for all sufficiently large n. We set an = o(bn) if an/bn → 0 as n → ∞. For a sequence of random variables {ξ1, ξ2,…}, we write ξn →P ξ if ξn converges to ξ in probability as n → ∞. We also write ξn = op(1) if ξn →P 0. Given an interval (u, v) ⊂ [0, 1] such that un and vn are integers, we denote the vector (Yun+1,…,Yvn)⊤ by Y(u,v) and the vector (ϵun+1,…,ϵvn)⊤ by ϵ(u,v). Analogously, we use X(u,v) to denote the (v − u)n × p dimensional matrix with rows Xun+1⊤, …, Xvn⊤, and we use β̂((u, v)) to denote the Lasso estimator based on the observations Y(u,v) and X(u,v). For a set A, we use #A to denote its cardinality. For any x ≥ 0, we use [x] to denote the largest integer less than or equal to x. We use C1, C2, … to denote generic positive constants that may vary from place to place.
2.2. New Estimation and Algorithms
Let Y = (Y1,…, Yn)⊤ denote the n × 1 response vector, and X the n × p design matrix with Xi = (Xi1, …, Xip)⊤ being its i-th row for 1 ≤ i ≤ n. In this paper, we assume X1, …, Xn are independently and identically distributed (i.i.d.) p-dimensional random vectors with mean zero and covariance matrix Σ = Cov(X1). Furthermore, for j = 1, …, k0 + 1, we denote by S(j) the set of non-zero elements of the regression coefficients β0(j). For any given partition τ = (τ0, τ1, …, τk, τk+1)⊤, we denote the j-th interval by Ij(τ) = (τj−1, τj), the length of the j-th interval by rj(τ) = τj − τj−1, the shortest interval length by r(τ) = min1≤j≤k+1 rj(τ), and the number of change points by l(τ). Moreover, we denote the minimum interval length by δ.
We are now ready to introduce our change point estimator in detail. We consider the Lasso-type ℓ1-penalized estimator for high-dimensional GLMs. Such estimators have some desirable properties. In particular, van de Geer (2008) derived theoretical properties including consistency and an oracle inequality, on which our algorithms are mainly built. More specifically, let ρβ(x, y) be a loss function associated with the link g(·). For instance, if g(·) is the logit function, ρβ(x, y) is the negative log-likelihood function log(1 + exp(x⊤β)) − yx⊤β. For β ∈ ℝp, we define Pρ(β) = E{ρβ(X1, Y1)} and Pnρ(β) = n−1Σi=1n ρβ(Xi, Yi). Note that such complex loss functions lead to substantial difficulty for the estimation of change points as well as regression coefficients. Given data observations (Xi, Yi), i = 1, …, n, the Lasso-based GLM method solves the following ℓ1 penalized problem:
| β̂ = argminβ∈ℝp {Pnρ(β) + λ∥β∥1}. | (3) |
Since we consider heterogeneous data with possibly multiple change points, we cannot use Equation (3) directly to estimate the parameters. The main challenge is that both the number and the locations of the change points are unknown. To solve this issue, we consider three steps.
Before introducing the change point estimator, we first demonstrate how to estimate the regression coefficients for each segment. To be specific, for any given candidate partition τ = (τ0, τ1,…, τk, τk+1)⊤ with τj ∈ {i/n : i = 1, …, n}, j = 1,…, k + 1, we obtain the ℓ1-penalized estimator in each segment by solving, for j = 1,…, k + 1,
| β̂(Ij(τ)) = argminβ∈ℝp {Pnρ(Ij(τ), β) + λj∥β∥1}, | (4) |
where λj is the non-negative regularization parameter.
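For concreteness, the segment-wise estimator in Equation (4) and its penalized empirical loss can be computed with off-the-shelf Lasso software. The following R sketch does this for the logistic link using the glmnet package; the helper names seg_fit and seg_loss, the use of floor() to map fractions to observation indices, and the omission of an intercept are our own illustrative choices rather than part of the formal procedure.

```r
## Segment-wise l1-penalized logistic fit (Equation (4)) and its penalized
## empirical loss, normalized by the full sample size n.
## Assumptions: observations are already ordered, y is coded 0/1, and the
## segment (u, v) contains enough observations of both classes for glmnet.
library(glmnet)

seg_fit <- function(X, y, u, v, lambda) {
  idx <- seq(floor(u * nrow(X)) + 1, floor(v * nrow(X)))
  glmnet(X[idx, , drop = FALSE], y[idx], family = "binomial",
         alpha = 1, lambda = lambda, intercept = FALSE)
}

seg_loss <- function(X, y, u, v, lambda) {
  idx  <- seq(floor(u * nrow(X)) + 1, floor(v * nrow(X)))
  fit  <- seg_fit(X, y, u, v, lambda)
  beta <- as.numeric(as.matrix(coef(fit)))[-1]   # drop the (zero) intercept
  eta  <- as.numeric(X[idx, , drop = FALSE] %*% beta)
  # negative log-likelihood of the logistic model: log(1 + e^eta) - y * eta
  nll  <- sum(log1p(exp(eta)) - y[idx] * eta) / nrow(X)
  nll + lambda * sum(abs(beta))
}
```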
Based on Equation (4), our new algorithms for estimating both k0 and τ0 are summarized in the following three steps.
Step 1 (Search the “best” partition): Given the candidate number of change points k, we find the “best” partition τ̂(k) that minimizes the total loss function (contrast function):
| τ̂(k) = argminτ: l(τ)=k, r(τ)≥δ J(τ, β̂, X, Y), | (5) |
where J(τ, β̂, X, Y) = Σj=1k+1 {Pnρ(Ij(τ), β̂(Ij(τ))) + γ}, β̂(Ij(τ)) is the segment-wise Lasso estimator in Equation (4), and γ > 0 is a tuning parameter penalizing the number of segments.
Step 2 (Estimate number of change points): We plug τ̂(k) into J and obtain the minimum loss function associated with k as
| G(k) = J(τ̂(k), β̂, X, Y), k = 0, 1, …, kmax. | (6) |
Here kmax denotes a pre-specified upper bound on the number of change points (see Section 2.2.1). Then, we find the “best” estimate k̂ of the number of change points by minimizing the penalized criterion G(k):
| k̂ = argmin0≤k≤kmax G(k). | (7) |
Step 3 (Estimate locations of change points): We plug k̂ into Step 1 and obtain the final change point estimator by
| τ̂ = τ̂(k̂). | (8) |
Combining Steps 1–3, our final change point estimators k̂ and τ̂ can be obtained equivalently in the following form:
| τ̂ = argminτ: l(τ)≤kmax, r(τ)≥δ J(τ, β̂, X, Y), with k̂ = l(τ̂). | (9) |
After the above three steps, we obtain the change point estimators (k̂, τ̂). As for β0, we recommend two different Lasso-based estimators of the underlying regression coefficients, serving different purposes for practitioners. In particular, we can naturally use the Lasso estimator of β0 to select variables and make predictions, which is defined, for j = 1, …, k̂ + 1, as β̂(j) := β̂(Ij(τ̂)), i.e., the segment-wise estimator in Equation (4) evaluated on the estimated partition τ̂.
For further statistical inference including confidence intervals and hypothesis testing, van de Geer et al. (2014) proposed the de-biased Lasso estimator and analyzed its asymptotic properties for the homogeneous model under high-dimensional setups. Similarly, for the heterogeneous observations, we construct a de-biased Lasso estimator of the underlying regression coefficients for each segment, for j = 1, …, k̂ + 1, as b̂(j) := β̂(j) − Θ̂(j) Pnρ̇(Ij(τ̂), β̂(j)), where ρ̇β denotes the derivative of ρβ with respect to β,
and the precision matrix estimator Θ̂(j) can be constructed using the nodewise Lasso with the observations in the j-th estimated segment as input (van de Geer et al., 2014).
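As a rough illustration of the bias correction, the sketch below performs the one-step de-biasing for a single segment under the logistic link, assuming an estimate Theta_hat of the relevant precision matrix has already been obtained (e.g., via the nodewise Lasso of van de Geer et al., 2014); constructing Theta_hat itself is not shown, and the function name debias_segment is ours.

```r
## One-step de-biased (de-sparsified) Lasso update on one segment, logistic link.
## Xseg, yseg: observations in the segment; beta_hat: segment Lasso estimate;
## Theta_hat: assumed-available precision matrix estimate (p x p).
debias_segment <- function(Xseg, yseg, beta_hat, Theta_hat) {
  eta   <- as.numeric(Xseg %*% beta_hat)
  phat  <- 1 / (1 + exp(-eta))                        # fitted probabilities
  score <- crossprod(Xseg, yseg - phat) / nrow(Xseg)  # X^T (y - phat) / n, i.e. minus the loss gradient
  as.numeric(beta_hat + Theta_hat %*% score)          # bias-corrected coefficients
}
```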
In what follows, we introduce three specific algorithms for solving Equation (9). Note that J(τ, β, X, Y) and Pnρ(Ij(τ), β) are the loss functions for all intervals and for the j-th interval, respectively. Meanwhile, λj in Equation (4) and γ in Equation (5) are positive tuning parameters that encourage coefficient and segment sparsity, respectively. We adopt a cross-validation approach to choose these two tuning parameters λ and γ. To compute β̂(Ij(τ)) in Equation (4), we can use, for example, the R package glmnet (https://glmnet.stanford.edu). It is worth mentioning the following two remarks for our proposed estimator: (1) If the number of change points k0 is known, we just use Step 1 with k = k0 to directly obtain the locations of change points as follows:
| τ̂(k0) = argminτ: l(τ)=k0, r(τ)≥δ J(τ, β̂, X, Y). | (10) |
In this case, our method covers the setting considered by Lee et al. (2016), where at most one change point is assumed. (2) When no change point occurs (k0 = 0), our proposed method can still work. Hence, our proposed method can automatically account for the underlying data generation mechanism (k0 = 0 or k0 ≥ 1) without specifying any prior knowledge about the number of change points k0. Furthermore, as shown by our extensive numerical studies, our new algorithms can estimate k0 with high accuracy.
Our main goal is to design efficient algorithms that solve the optimization problem in Equation (9). To address this issue, three algorithms are proposed next.
2.2.1. Dynamic programming approach
We introduce a general approach based on the Dynamic Programming Algorithm (DPA), which works well for our change point problem (Eq. 9). It is well-known that DPA has excellent accuracy since it considers the global solution of Equation (9). It is widely used in multiple change point detection including the efficient, parallelized approaches introduced recently by Tickle et al. (2020). More details can be found in Boysen et al. (2009) and Leonardi & Bühlmann (2016). Next, we present how to use this technique to solve Equation (9) in detail.
For any given v ∈ {i/n : i = 1, …, n}, consider the sample (Y(0,v), X(0,v)). Given a candidate change point number k, denote by Fk(v) the minimum value of the penalized loss over all partitions of (0, v) with k change points:
| Fk(v) = minτ: 0=τ0<τ1<⋯<τk<τk+1=v, r(τ)≥δ Σj=1k+1 {Pnρ(Ij(τ), β̂(Ij(τ))) + γ}. | (11) |
One can see that the optimal k + 1 segments corresponding to the change point vector τ = (τ0, τ1, …, τk+1)⊤ obtained from Equation (11) consist of the optimal first k segments and a single last segment (τk, τk+1). Recall that τ0 := 0 and τk+1 := v. Then τk is the rightmost change point estimator. Furthermore, by the definition of Fk(v), the vector of the first k change points obtained from Equation (11) also attains Fk−1(τk). Hence, the last change point τk is the minimizer of Fk−1(u) + Pnρ((u, v), β̂((u, v))) + γ over u < v.
The above observation motivates us to use the dynamic programming recursion to calculate Fk(v) with v ∈ {i/n : i = 1, …, n}. In particular, for any v ∈ {i/n : i = 1, …, n}, define
| F0(v) = Pnρ((0, v), β̂((0, v))) + γ. | (12) |
Then, the dynamic programming recursion proceeds as follows:
| Fk(v) = minu∈{i/n: i=1,…,n}, u≤v−δ {Fk−1(u) + Pnρ((u, v), β̂((u, v))) + γ}. | (13) |
Define Vn = {i/n : i = 1, …, n}. Based on Equations (12) and (13), we can obtain {F1(v), v ∈ Vn}, {F2(v), v ∈ Vn}, …, and {Fkmax (v), v ∈ Vn}, where kmax (in our case kmax + 1 = 1/δ) is an “upper bound” of the number of change points. See Section 3.2 for more details. By considering G(k) in (6), we have Fk(1) = G(k) with k = 1, …, kmax. Hence, we are ready to estimate the change point number by
| k̂ = argmin0≤k≤kmax Fk(1). | (14) |
The corresponding locations of change points can be obtained by
| τ̂ = τ̂(k̂), recovered by backtracking the recursion in Equation (13). | (15) |
The following Algorithm 1 describes our procedure for obtaining k̂ and τ̂ based on the DPA. Note that DPA solves Equation (9) with a globally optimal solution, which has excellent estimation accuracy. Furthermore, as shown in Leonardi & Bühlmann (2016), it has a computational cost of O(n2GLMLasso(n)) operations. This can be computationally expensive, especially when n is very large. Hence, it is desirable to consider a more efficient approach. Next, we introduce an efficient approach based on binary segmentation, which can ensure almost the same estimation accuracy as that of DPA.
Algorithm 1: Dynamic programming procedure for change point detection in high-dimensional GLMs.
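To make the recursion concrete, the following R sketch implements Equations (12)–(14) over the grid Vn and recovers the locations by backtracking, assuming the segment-loss helper seg_loss() from the sketch following Equation (4) is available; the function name, the plain triple loop, and the backtracking bookkeeping are illustrative simplifications rather than the exact implementation used in our experiments.

```r
## Dynamic programming recursion (12)-(13) with backtracking (sketch).
## Assumes seg_loss(X, y, u, v, lambda) from the earlier sketch is in scope.
dpa_change_points <- function(X, y, lambda, gamma, delta, kmax) {
  n    <- nrow(X)
  grid <- seq_len(n) / n                         # candidate end points v in V_n
  m    <- length(grid)
  Fk   <- matrix(Inf, nrow = kmax + 1, ncol = m) # Fk[k + 1, j] stores F_k(grid[j])
  back <- matrix(NA_integer_, nrow = kmax + 1, ncol = m)
  for (j in seq_len(m))                          # F_0(v): one segment (0, v)
    if (grid[j] >= delta)
      Fk[1, j] <- seg_loss(X, y, 0, grid[j], lambda) + gamma
  for (k in seq_len(kmax)) {                     # recursion over k
    for (j in seq_len(m)) {
      for (i in seq_len(j - 1)) {                # last change point u = grid[i]
        if (grid[j] - grid[i] < delta || !is.finite(Fk[k, i])) next
        val <- Fk[k, i] + seg_loss(X, y, grid[i], grid[j], lambda) + gamma
        if (val < Fk[k + 1, j]) { Fk[k + 1, j] <- val; back[k + 1, j] <- i }
      }
    }
  }
  khat <- which.min(Fk[, m]) - 1                 # Equation (14): minimize F_k(1) over k
  tau  <- numeric(0); j <- m                     # backtrack the change point locations
  if (khat > 0) for (k in khat:1) { j <- back[k + 1, j]; tau <- c(grid[j], tau) }
  list(khat = khat, tau = tau)
}
```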
2.2.2. Binary segmentation approach
Next we introduce an approach based on the binary segmentation algorithm (BSA) examined in Cho & Fryzlewicz (2012, 2015) and Leonardi & Bühlmann (2016), which is much more efficient than DPA. The main idea of BSA for solving the change point problem for GLMs (Eq. 9) is that, for each candidate search interval (u, v), we use the penalized loss function to determine whether a new change point s can be added. If such an s is identified, the interval (u, v) is split into two subintervals (u, s) and (s, v), and we conduct the above procedure on (u, s) and (s, v) separately. The algorithm continues until no new change points can be added. In particular, for any given u, v ∈ Vn := {i/n : i = 1, …, n}, we define
| Z(u, v) := Pnρ((u, v), β̂((u, v))) + γ | (16) |
and
| ŝ(u, v) := argmins∈Vn: s−u≥δ, v−s≥δ {Z(u, s) + Z(s, v)}. | (17) |
Then we present our BSA-based algorithm as follows.
Algorithm 2: Binary segmentation procedure for change point detection in high-dimensional GLMs.
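The following R sketch illustrates the binary segmentation recursion, again assuming seg_loss() from the sketch following Equation (4); the concrete split rule used here (accept the best split of (u, v) whenever it lowers the penalized loss Z(u, v)) is one simple instantiation of Equations (16)–(17), and the function name bsa_change_points is ours.

```r
## Binary segmentation sketch: recursively split an interval whenever the best
## split lowers the penalized loss.  Assumes seg_loss() from the earlier sketch.
bsa_change_points <- function(X, y, lambda, gamma, delta) {
  n <- nrow(X)
  Z <- function(u, v) seg_loss(X, y, u, v, lambda) + gamma   # Equation (16)
  found <- numeric(0)
  recurse <- function(u, v) {
    cand <- seq_len(n) / n
    cand <- cand[cand - u >= delta & v - cand >= delta]      # respect minimum length
    if (length(cand) == 0) return(invisible(NULL))
    split_vals <- vapply(cand, function(s) Z(u, s) + Z(s, v), numeric(1))
    s <- cand[which.min(split_vals)]                         # Equation (17)
    if (min(split_vals) < Z(u, v)) {                         # splitting improves the fit
      found <<- c(found, s)
      recurse(u, s)
      recurse(s, v)
    }
    invisible(NULL)
  }
  recurse(0, 1)
  sort(found)
}
```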
Note that this approach searches far fewer candidates when looking for a new change point than DPA, which makes it more computationally efficient. More specifically, as shown by Leonardi & Bühlmann (2016), BSA has a computational cost of O(n log(n)GLMLasso(n)) operations. Furthermore, in Section 3.2, we prove that the change point estimator computed by Algorithm 2 enjoys almost the same estimation accuracy as that of Algorithm 1.
2.2.3. A fast screening approach for single change point models
So far, we have proposed two efficient algorithms in Sections 2.2.1 and 2.2.2 for solving Equation (9). In this section, we show that under single change point models, the computational cost can be further reduced. As far as we know, our fast screening approach (FSA) is novel for detecting a single change point in regression models. The main idea is that, for detecting a single change point, if we have some prior information about its location, it is not necessary to search all candidate subintervals as in the BSA-based algorithm. To see this, we recall Z(u, v) as defined in Equation (16). For τf ∈ (0, 1), we define the statistic Wτf((u, v)) = Z(u, u + τf(v − u)) + Z(u + τf(v − u), v). Based on Wτf((u, v)), we have the following key observation: consider any subinterval (u, v) containing the single change point τ10. If τ10 lies in the first half of (u, v), i.e., τ10 ∈ (u, (u + v)/2], then with high probability we can prove that W1/4((u, v)) < W3/4((u, v)). If τ10 lies in the second half of (u, v), i.e., τ10 ∈ ((u + v)/2, v), then with high probability we have W3/4((u, v)) < W1/4((u, v)). This observation motivates us to design Algorithm 3 for fast change point identification.
Note that Algorithm 3 does not need to search through all data points: it can quickly identify the half-interval where the change point is located by comparing the quarter, half, and three-quarter statistics Wτf on the current search interval in each iteration. As a result, it only takes O(log(n)GLMLasso(n)) computational operations to detect the change point. Hence, compared to Algorithms 1 and 2, the computational cost can be dramatically reduced. Its computational benefits are validated by our numerical experiments in Section 4.
Algorithm 3: A fast screening approach for single change point detection in high-dimensional GLMs.
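The fast screening idea can be sketched as follows: repeatedly compare W1/4 and W3/4 on the current interval, keep the half that is more likely to contain the change point, and finish with an exhaustive search on the remaining short stretch. The helper seg_loss() is again assumed from the sketch following Equation (4), and the stopping rule controlled by min_pts is an illustrative choice, not part of Algorithm 3.

```r
## Fast screening sketch for a single change point (Section 2.2.3).
## Assumes seg_loss() from the earlier sketch; min_pts controls when to stop halving.
fsa_single_change_point <- function(X, y, lambda, gamma, min_pts = 10) {
  n <- nrow(X)
  Z <- function(u, v) seg_loss(X, y, u, v, lambda) + gamma
  W <- function(u, v, tf) {                      # W_{tf}((u, v)) from the text
    s <- u + tf * (v - u)
    Z(u, s) + Z(s, v)
  }
  u <- 0; v <- 1
  while (floor(v * n) - floor(u * n) > 4 * min_pts) {
    # keep the half-interval that more likely contains the change point
    if (W(u, v, 1/4) < W(u, v, 3/4)) v <- (u + v) / 2 else u <- (u + v) / 2
  }
  cand <- seq_len(n) / n                         # exhaustive search on what is left
  cand <- cand[floor(cand * n) - floor(u * n) >= min_pts &
               floor(v * n) - floor(cand * n) >= min_pts]
  if (length(cand) == 0) return((u + v) / 2)
  cand[which.min(vapply(cand, function(s) Z(u, s) + Z(s, v), numeric(1)))]
}
```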
3. THEORETICAL PROPERTIES
We examine the theoretical properties of our proposed three approaches. In particular, we first show that our estimation of change points and regression coefficients is consistent and has the same rates of convergence as those of linear models. Secondly, for GLMs with change points, we reconstruct our assumptions and lemmas for analyzing high-dimensional data with heterogeneity based on the work of van de Geer (2008). Note that van de Geer (2008) considered the ℓ1 penalized estimation of the regression coefficients under the setting of all observations from the same GLM. In particular, in Section 3.1, we introduce some assumptions. In Section 3.2, we present theoretical results of the change point estimator computed by the new algorithms.
Before presenting the theoretical results, we introduce some additional notation. For any u, v ∈ Vn := {i/n : i = 1, …, n} with u < v, we denote, for a loss function ρβ, the subinterval-based theoretical mean and empirical mean by Pρ((u, v), β) := n−1Σi=un+1vn E{ρβ(Xi, Yi)} and Pnρ((u, v), β) := n−1Σi=un+1vn ρβ(Xi, Yi), respectively. For convenience, we denote Pρ = Pρ((0, 1)) and Pnρ = Pnρ((0, 1)). Consider the linear subspace F := {fβ(x) = x⊤β : β ∈ ℝp}. For an fβ ∈ F, define ρfβ(x, y) = ρ(fβ(x), y). Then the empirical risk and theoretical risk at f are defined as Pnρf and Pρf, respectively. Furthermore, we define the target f0 as the minimizer of the theoretical risk and β0 as the corresponding coefficient vector, where β0 can be regarded as the “truth”. By definition, we have f0(x) = x⊤β0. For f ∈ F, the excess risk is defined as ℰ(f) := Pρf − Pρf0. Lastly, for any subinterval (u, v), we define the oracle as β* := argminβ∈ℝp Pρ((u, v), β). The corresponding estimation error is then denoted as ϵ* := (Pn − P)ρfβ*.
3.1. Basic assumptions
We introduce some assumptions as follows.
Assumption A (loss function). The loss function ρf(x, y) := ρ(f(x), y) is convex in f(x) for all (x, y). Moreover, it satisfies the Lipschitz property: ∣ρ(fβ(x), y) − ρ(fβ̃(x), y)∣ ≤ ∣fβ(x) − fβ̃(x)∣ for all (x, y) and all β, β̃ ∈ ℝp.
Assumption B (design matrix). There exists KX < ∞ such that ∥Xi∥∞ ≤ KX and hold for all i = 1, …, n.
Assumption C (margin condition). There exist an η > 0 and a strictly convex increasing function G(x) such that, for all β ∈ ℝp with ∥fβ − f0∥∞ ≤ η, one has Pρfβ − Pρf0 ≥ G(∥fβ − f0∥),
where there exists a constant C such that G(x) ≥ Cx2 for any positive x.
Assumption D (compatibility condition). The compatibility condition is met for the set S* (with S(j) defined in Section 2.2) and constant ϕ* > 0 if, for all β ∈ ℝp satisfying ∥βS*c∥1 ≤ 3∥βS*∥1, it holds that ∥βS*∥12 ≤ s*(β⊤Σβ)/ϕ*2,
where s* := #S* is the cardinality of S*.
Assumption E (parameter space). For k0 > 1, there exist constants m* > 0 and M* > 0 such that
where ,
Note that in the case k0 = 1, the former condition reduces to ∥β0(1) − β0(2)∥1 ≥ m*s*.
We assume in Assumption A that the loss function ρ is Lipschitz in f, which allows us to bound the loss function by the difference between estimated regression parameters and the corresponding true parameters. Many functions can meet this condition: for example, the negative-likelihood function of the logistic regression model. Assumption B imposes relatively weak conditions on the covariates, which covers a wide range of distributional patterns. Assumption C (margin condition) is assumed for a “neighbourhood” of the target linear function f0 = XTβ0 and is a common condition for analyzing the GLM. See Section 6.4 in Bühlmann & van de Geer (2011) for more details. Assumption D (compatibility condition) for the design matrix X allows us to establish oracle results for Lasso estimation. Note that one can verify that Assumption D is a sufficient condition of Assumption C in van de Geer (2008) by choosing the function , where is the cardinality of the set defined in Assumption C of van de Geer (2008). Assumption E presents the minimum and maximum differences between the true regression parameters, which allow us to detect the change points. Furthermore, the sparsity of regression coefficients is required to guarantee the consistency of our proposed estimators. Assumption F introduced in the Appendix imposes some technical conditions on the tuning parameter λ for the Lasso estimation as well as the tuning parameter γ for the change point estimation. Assumption G includes the required condition for the limiting property of the de-biased Lasso estimator.
3.2. Main results
We are ready to present some theoretical results of our proposed three new algorithms. Before that, we denote and let d* be a constant. See more details in Lemma 5. We first present the property of the estimators computed by DPA in Algorithm 1.
Theorem 1 Suppose Assumptions A-G hold with log(p) = o(n). Then, for a given C1 > 0, with probability at least 1 − 7 exp , we have that
;
-
for each ,
where is the s-th component of and .
Theorem 1 demonstrates that Algorithm 1 can identify both the number and locations of multiple change points with high estimation accuracy. In particular, the first result shows that we can obtain a consistent estimator for the true number of change points. As for the locations, the second result indicates that our multiple change point estimator converges to the true change point vector. Furthermore, the third result implies that we can bound the prediction error and the estimation error of the underlying regression parameters. Result (4) implies the asymptotic normality of the de-biased Lasso estimator, which allows for wider statistical inference including confidence intervals and hypothesis testing.
Based on Theorem 1, some other interesting conclusions can be drawn. To simplify the discussion, we require that all the change point intervals be of the same order of magnitude. Recall that δ is the minimum length of the change point intervals as defined in Section 2.2. Then we have k0 = O(1/δ). Furthermore, depending on δ, the following two cases are considered: (1) δ = O(1) and (2) δ = o(1).
For the first case, we have k0 = O(1), which means that the number of change points is fixed and does not increase with the sample size n. Furthermore, considering Assumption (F1), we have . Hence, the three results in Theorem 1 reduce to:
| (18) |
Considering Equation (18), our results are consistent with the Lasso estimation results derived in van de Geer (2008) and estimation consistency is guaranteed as long as holds.
We next consider the second case with δ = o(1). In this case, we allow the number of change points to grow with n. Noting that k0 = O(1/δ), the three results in Theorem 1 reduce to:
| (19) |
Hence, by Equation (19), the estimation consistency can still be obtained as long as holds. In other words, the number of change points cannot grow faster than the order of .
Next, we present theoretical results of change point estimators computed by BSA.
Theorem 2 Suppose Assumptions A-G hold with log(p) = o(n). For a given C2 > 0, with probability at least 1 − 7 exp , we have that
;
;
;
-
for each ,
where is the s-th component of , and .
Theorem 2 shows results similar to those of Theorem 1 in terms of consistency of both the number and locations of change points. Furthermore, Theorem 2 allows us to use a much more efficient algorithm to detect multiple change points for GLMs, which enjoys almost the same estimation accuracy as that of the global solution. This efficiency is further investigated in our numerical experiments.
Finally, we establish theoretical properties of FSA proposed in Algorithm 3 for single change point models.
Theorem 3 Suppose Assumptions A-F hold with log(p) = o(n). Assume that the true single change point . For a given C3 > 0, with probability at least 1 − 7 exp , we have that
| (20) |
Theorem 3 justifies the validity of Algorithm 3 and demonstrates that the cost of identifying a single change point in GLMs can be reduced to only O(log(n)GLMLasso(n)) computational operations.
4. SIMULATION STUDIES
In this section, we investigate the numerical performance of our three proposed change point detection procedures in various model settings. For the design matrix X, we generate Xi i.i.d. from N(0, Σ). We first consider two types of covariance matrix structures, corresponding to independent and weakly dependent settings, as follows:
Case 1: Σ = Ip×p;
Case 2: Σ = Σ* with , where for 1 ≤ i, j ≤ p.
We consider logistic regression models. For i = 1, …, n, we generate Yi ∈ {0, 1} with success probability πi = exp(Xi⊤β(i))/{1 + exp(Xi⊤β(i))}. That is, the responses are generated from the Binomial distribution Yi ∼ Binomial(1, πi).
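To convey the flavor of this design, the sketch below generates covariates and logistic responses with piecewise-constant coefficients; the support size s0 and the unit coefficient magnitude are placeholders chosen for illustration and do not reproduce the exact signal strengths used in our experiments.

```r
## Simulate a logistic model whose coefficients switch at the fractions in tau0.
## Illustrative only: support size s0 and coefficient magnitude 1 are placeholders.
library(MASS)   # mvrnorm for N(0, Sigma) covariates

simulate_glm_cp <- function(n, p, Sigma, tau0, s0 = 5, seed = 1) {
  set.seed(seed)
  X      <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  breaks <- c(0, tau0, 1)
  betas  <- lapply(seq_len(length(tau0) + 1), function(j) {
    b <- rep(0, p)
    b[sample(seq_len(floor(0.3 * p)), s0)] <- 1   # random support, unit signal
    b
  })
  y <- numeric(n)
  for (i in seq_len(n)) {
    j    <- findInterval(i / n, breaks, left.open = TRUE, rightmost.closed = TRUE)
    eta  <- sum(X[i, ] * betas[[j]])
    y[i] <- rbinom(1, 1, plogis(eta))             # Y_i ~ Binomial(1, pi_i)
  }
  list(X = X, y = y, betas = betas)
}

## Example: Case 1 covariance with a single change point at 0.5
## dat <- simulate_glm_cp(n = 300, p = 200, Sigma = diag(200), tau0 = 0.5)
```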
For this model setup, we investigate the performance of our approaches in terms of accuracy and efficiency. For efficiency, we compare our proposed algorithms in terms of the computational cost. Note that BSA and DPA are designed for multiple change point detection. In order to compare efficiency reasonably for the cases with no change point and a single change point, we set these two algorithms to stop after one screening by making kmax = 1. To show the accuracy, we record the mean, mean squared error (MSE), and error rate (proportion of false positives) of the change point estimators including the number and locations. We compare the corresponding results with the following existing methods:
Lee et al. (2011) (denoted by Lee2011), which is based on the maximum score estimation.
Qian and Su (2016) (denoted by SGL), which proposed a systematic estimation framework based on the adaptive fused Lasso in linear regression models. To be specific, they estimate the change points and regression coefficients by minimizing the ℓ2-loss with a fused Lasso penalty. In this paper, we modify SGL by replacing the ℓ2-loss with the loss ρ defined in Section 2 for high-dimensional GLMs.
- Wang et al. (2021a) (denoted by VPWBS), i.e., the variance projected wild binary segmentation based on the sparse group Lasso estimator for linear regression models. In particular, they projected the high-dimensional time series onto a univariate time series. The optimal projection direction u is obtained by local group Lasso screening (LGS). Then they conducted mean change point detection by wild binary segmentation (WBS) on the projected univariate series. Note that, for linear models, LGS performs a variant of the group Lasso on a given subsample and computes the projection direction as the minimizer of a group-penalized least-squares criterion (Equation (21)), where s′ and e′ serve as boundary trimming parameters with s + 1 ≤ s′ + 1 < e′ ≤ e, and λG is the tuning parameter for the group penalty. For a better comparison in the context of high-dimensional GLMs, we modify this ℓ2-loss based method in Wang et al. (2021a) by replacing the ℓ2-loss in Equation (21) with the loss ρβ(Xi, Yi) defined in Section 2.
For our proposed approaches, the regression coefficients are computed by the R package glmnet (https://glmnet.stanford.edu). All numerical results are based on 100 replications, except for the test by Lee2011 which is based on 500 replications.
4.1. Tuning parameter selection
It is essential to properly choose the values of the tuning parameters for accurate estimation results. We develop a cross-validation approach for GLMs to choose the parameters λ and γ, which encourage regression coefficient sparsity and segmentation sparsity, respectively. To be specific, let the samples with odd indices be the training set (X1, X3, …, Xn−3, Xn−1) and the others be the validation set (X2, X4, …, Xn−2, Xn). For each candidate pair of the two tuning parameters (λ, γ), we run our procedure on the training set and obtain the estimated change points τ̂ and underlying regression coefficients β̂(j). Let β̂(i) = β̂(j) for i/n ∈ Ij(τ̂) and i = 1, …, n. We can then calculate the validation loss as the sum of ρβ̂(i)(Xi, Yi) over the validation set,
where ρ is the loss function of the GLM and depends on the link function. For specific regression models such as the linear model and the logistic regression model, the corresponding validation losses are Σi∈validation (Yi − Xi⊤β̂(i))2 and Σi∈validation {log(1 + exp(Xi⊤β̂(i))) − YiXi⊤β̂(i)}, respectively.
Then we choose the pair (λ, γ) corresponding to the lowest validation loss. Note that it is time-consuming to use the cross-validation procedure to choose the tuning parameters across our various model settings. Based on our extensive numerical simulations, we find that our methods are stable over a certain range of tuning parameters. Hence, we use an empirical choice of the parameters λ and γ to save computational cost. In particular, we set λ = c(log(p)/n)1/2 with c ∈ (0.15, 0.25). As for γ, we set γ = δλ. Recall that δ is the minimum interval length and δn is the minimum interval size, which controls the maximum number of change points. Note that δ is of key importance for our theoretical guarantees discussed in Section 3.2, and needs to be carefully chosen in simulation studies.
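A minimal version of this odd/even validation scheme for the logistic link is sketched below; cp_fit stands for whichever detection routine is plugged in, and its returned fields tau and beta are an assumed interface for this illustration rather than an interface defined in the paper.

```r
## Odd/even cross-validation loss for a candidate (lambda, gamma), logistic link.
## cp_fit(X, y, lambda, gamma) is an assumed interface returning a list with
##   $tau  : estimated change point fractions, and
##   $beta : a list of segment-wise coefficient vectors.
cv_loss <- function(X, y, lambda, gamma, cp_fit) {
  n    <- nrow(X)
  odd  <- seq(1, n, by = 2)
  even <- seq(2, n, by = 2)
  fit  <- cp_fit(X[odd, , drop = FALSE], y[odd], lambda, gamma)
  breaks <- c(0, fit$tau, 1)
  loss <- 0
  for (i in even) {
    j    <- findInterval(i / n, breaks, left.open = TRUE, rightmost.closed = TRUE)
    eta  <- sum(X[i, ] * fit$beta[[j]])
    loss <- loss + log1p(exp(eta)) - y[i] * eta   # logistic validation loss
  }
  loss
}

## Example grid search with illustrative values:
## grid   <- expand.grid(lambda = c(0.05, 0.10), gamma = c(0.01, 0.05))
## losses <- mapply(function(l, g) cv_loss(X, y, l, g, cp_fit), grid$lambda, grid$gamma)
## best   <- grid[which.min(losses), ]
```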
In order to ensure effective fitting of the regression model, we need to guarantee a sufficient sample size for each interval. According to our numerical studies, setting δ ∈ (0.1, 0.25) works well. To investigate how sensitive our proposed methods are to the choice of these tuning parameters, we consider various values of λ and γ by setting the sample size n ∈ {200, 300, 1000} and the data dimension p ∈ {200, 300, 400}. Note that our proposed methods can automatically account for the underlying data generation mechanism and do not need to know the number of change points. To justify this, in what follows, we present our numerical results under three different cases: (1) k0 = 0, (2) k0 = 1, and (3) k0 = 3, which correspond to data with no change point, one change point, and multiple change points, respectively.
4.2. No change point models
We consider the scenario where no change point occurs. In this case, the underlying regression coefficients satisfy β(i) = β0 for i = 1, …, n. We set the sample size n = 200 and the data dimension p ∈ {200, 300, 400}. For s ∈ S0, we generate the non-zero coefficients β0,s randomly, where S0 denotes the set of non-zero elements of β0.
We implement the corresponding algorithms independently on a 2.50GHz CPU (Linux) with 6 cores and 4GB of RAM. As shown in Figure 1 (left), the computational cost of BSA grows moderately (12s to 737s) as the data dimension increases from 400 to 2000, while the computational cost of DPA grows exponentially (80s to 32000s). As for the accuracy, the error rates of VPWBS, DPA, and BSA are zero in almost all cases, which suggests these three approaches have almost the same accuracy when no change point occurs. SGL tends to overestimate the number of change points for the homogeneous observations. Note that Lee2011 has relatively large errors, which suggests that it may be unreliable in high-dimensional settings. Thus, we do not include it in our comparisons for the single change point models.
Figure 1:
Efficiency of change point estimation with p = 2n. The left panel shows computational costs of BSA and DPA per replication under the model with no change point. The right panel shows computational costs of BSA and DPA per replication under the model of three change points.
4.3. Single change point models
Next we consider the scenario where (β(i))1≤i≤n have a single common change point at location τ10, with τ10 ∈ {0.5, 0.7}. We set the sample size n = 300 and the data dimension p ∈ {200, 300, 400}. Furthermore, we assume the regression coefficients before and after the change point have support sets S(1) and S(2). For s ∈ S(1), we set βs(1) to a fixed non-zero value. Then, for s ∈ S(2), we set βs(2) = βs(1) + δs, where δs denotes the signal jump. For each replication, the support sets S(1) and S(2) of the regression coefficients are randomly selected from the set {1, 2, …, 0.3p}.
Figure 2 (left) indicates that the computational cost of BSA grows gradually as the data dimension increases from 400 to 2000, compared to the exponential growth of DPA. Meanwhile, the computational cost of FSA in Algorithm 3 increases slowly as compared to the “exponential” growth of BSA, as shown in Figure 2 (right). This suggests that FSA is preferable for single change point models. Furthermore, to investigate the computational efficiency for high-dimensional cases, we present the computational cost of our three proposed approaches in Figure 3. It shows that our proposed approaches have stable and good performance as the data dimension p grows.
Figure 2:
Efficiency of change point estimation under the single change point model with n ∈ {200, 400, 600, 800, 1000} and p = 2n. The left panel shows the computational costs of BSA and DPA per replication. The right panel shows computational costs per replication of BSA and FSA. The change point is fixed at τ1 = 0.5.
Figure 3:
Efficiency of change point estimation under the single change point model with p ∈ {600, 1200, 1800, 2400, 3000} and n = 300. The left panel shows the computational costs of BSA and DPA per replication. The right panel shows computational costs per replication of BSA and FSA. The change point is fixed at τ1 = 0.5.
As for the accuracy, we record the percentage of replications (Rate (%)) in which DPA and BSA correctly identified a single change point. Note that in Tables 2 and 3, MSEs for the number and location are expressed as factors of 10−2 and 10−4, respectively. We can see that DPA, BSA, and VPWBS can identify a single change point with high rates of success. Furthermore, DPA generally has the best performance for estimating the single change point location. VPWBS performs better than BSA, especially when the change occurs near the edge. Both DPA and BSA perform slightly better than FSA. Note that all the proposed algorithms perform better the closer the change point location is to the middle of the data observations, e.g., τ10 = 0.5.
Table 2:
Single change point detection for Case 1 with Σ = Ip×p under various dimensions and change point locations, based on 100 replications.
| | τ10 | Method | p = 200 | p = 300 | p = 400 |
|---|---|---|---|---|---|
| Number | 0.5 | SGL | 2.49 ∣ 19 | 2.36 ∣ 17 | 2.48 ∣ 23 |
| Mean∣Rate(%) | VPWBS | 1.02 ∣ 96 | 1.01 ∣ 98 | 1.02 ∣ 97 | |
| DPA | 1.00 ∣ 100 | 1.01 ∣ 99 | 1.00 ∣ 100 | ||
| BSA | 1.00 ∣ 100 | 1.00 ∣ 100 | 1.00 ∣ 100 | ||
| 0.7 | SGL | 2.48 ∣23 | 2.48 ∣23 | 2.64 ∣14 | |
| VPWBS | 1.01 ∣ 97 | 1.04 ∣ 96 | 0.99 ∣ 95 | ||
| DPA | 1.04 ∣ 96 | 1.02 ∣ 98 | 1.03 ∣ 97 | ||
| BSA | 1.00 ∣ 100 | 1.00 ∣ 100 | 1.00 ∣ 100 | ||
| Location | 0.5 | SGL | - | - | - |
| Mean∣MSE(10−4) | VPWBS | 0.497 ∣ 2.244 | 0.500 ∣ 3.590 | 0.496 ∣ 6.322 | |
| DPA | 0.499 ∣ 1.836 | 0.503 ∣ 2.256 | 0.499 ∣ 3.254 | ||
| BSA | 0.498 ∣ 2.446 | 0.501 ∣ 3.779 | 0.499 ∣ 6.445 | ||
| FSA | 0.495 ∣ 19.21 | 0.496 ∣ 9.326 | 0.493 ∣ 11.18 | ||
| 0.7 | SGL | - | - | - | |
| VPWBS | 0.699 ∣ 2.442 | 0.701 ∣ 6.963 | 0.693 ∣ 7.710 | ||
| DPA | 0.694 ∣ 3.933 | 0.693 ∣ 4.630 | 0.691 ∣ 5.086 | ||
| BSA | 0.691 ∣ 6.884 | 0.688 ∣ 14.29 | 0.686 ∣ 11.28 | ||
| FSA | 0.684 ∣ 25.09 | 0.678 ∣ 43.94 | 0.666 ∣ 37.07 | ||
Table 3:
Single change point detection for Case 2 with Σ = Σ* under various dimensions and change point locations. The numerical results are based on 100 replications.
| | τ10 | Method | p = 200 | p = 300 | p = 400 |
|---|---|---|---|---|---|
| Number | 0.5 | SGL | 1.14 ∣ 32 | 1.30 ∣ 41 | 1.52 ∣ 40 |
| Mean∣Rate(%) | VPWBS | 1.02 ∣ 98 | 1.04 ∣ 96 | 1.03 ∣ 98 | |
| DPA | 1.01 ∣ 99 | 1.00 ∣ 100 | 1.00 ∣ 100 | ||
| BSA | 1.00 ∣ 100 | 1.00 ∣ 100 | 1.00 ∣ 100 | ||
| 0.7 | SGL | 0.72 ∣ 20 | 1.16 ∣ 26 | 1.14 ∣36 | |
| VPWBS | 0.97 ∣ 97 | 1.01 ∣ 99 | 1.05 ∣ 96 | ||
| DPA | 1.00 ∣ 100 | 1.02 ∣98 | 1.01 ∣ 99 | ||
| BSA | 0.99 ∣ 99 | 1.00 ∣98 | 1.01 ∣ 99 | ||
| Location | 0.5 | SGL | - | - | - |
| Mean∣MSE | VPWBS | 0.504 ∣ 5.056 | 0.495 ∣ 3.146 | 0.498 ∣ 3.572 | |
| DPA | 0.498 ∣ 2.145 | 0.502 ∣ 1.423 | 0.499 ∣ 3.670 | ||
| BSA | 0.493 ∣ 5.897 | 0.498 ∣ 3.347 | 0.498 ∣ 4.326 | ||
| FSA | 0.495 ∣ 12.21 | 0.489 ∣ 20.10 | 0.492 ∣ 9.374 | ||
| 0.7 | SGL | - | - | - | |
| VPWBS | 0.694 ∣ 5.959 | 0.694 ∣ 7.297 | 0.701 ∣ 6.950 | ||
| DPA | 0.690 ∣ 5.748 | 0.690 ∣ 6.411 | 0.691 ∣ 12.75 | ||
| BSA | 0.689 ∣ 9.419 | 0.694 ∣ 3.973 | 0.688 ∣ 11.44 | ||
| FSA | 0.689 ∣ 22.41 | 0.683 ∣ 24.20 | 0.676 ∣ 27.66 | ||
4.4. Multiple change point models
Finally we consider the scenario where (β(i))1≤i≤n has multiple change points at locations τ0 = (0.25, 0.5, 0.75)⊤, so that k0 = 3. We set the sample size n = 1000 and the data dimension p ∈ {200, 300, 400}. For s ∈ S(1), we set βs(1) to a fixed non-zero value. Then, for s ∈ S(j) with j = 2, 3, 4, we set βsj(j) = βsj(j − 1) + (j − 1)δsj, where δsj denotes the signal jump. For each replication, the support set S(j) of the regression coefficients is randomly selected from the set {1, 2, …, 0.3p}, j = 1, 2, 3, 4.
We first analyze the efficiency. The results are similar to the other cases. Figure 4 shows that the computational cost of BSA grows gradually with the number of change points. In contrast, the cost of the DPA grows substantially. This suggests that the efficiency of BSA is not sensitive to the number of change points. To compare accuracy, similarly to the previous analysis, we record the percentage of replications in which DPA and BSA can correctly identify the three change points. As shown in Tables 4 and 5, DPA generally has the best performance. VPWBS performs slightly better than BSA. However, there is not much difference in performance among DPA, BSA, and VPWBS in terms of identifying the number and locations of multiple change points.
Figure 4:
Efficiency of change point estimation under the model of multiple change points with n ∈ {200, 400, 600, 800, 1000} and p = 2n. The left panel shows the computational costs of BSA per replication in the settings with different numbers of change points. The right panel shows computational cost of DPA per replication in the settings with different numbers of change points.
Table 4:
Multiple change point detection for Case 1 with Σ = Ip×p under various dimensions. The numerical results are based on 100 replications.
| τ0 = (0.25, 0.5, 0.75)⊤ | | p | SGL | VPWBS | DPA | BSA |
|---|---|---|---|---|---|---|
| 200 | 3.19 ∣ 55 | 3.02 ∣ 98 | 3.00 ∣ 100 | 2.95 ∣ 95 | ||
| number | Mean ∣ Rate(%) | 300 | 3.47 ∣ 55 | 3.03 ∣ 95 | 3.00 ∣ 100 | 2.98 ∣ 98 |
| 400 | 3.50 ∣ 44 | 3.04 ∣ 93 | 3.00 ∣ 100 | 2.94 ∣ 94 | ||
| 200 | - | 0.248 ∣ 1.108 | 0.249 ∣ 0.665 | 0.249 ∣ 1.821 | ||
| location 1 | Mean ∣ MSE (10−5) | 300 | - | 0.250 ∣ 2.100 | 0.250 ∣ 0.865 | 0.247 ∣ 4.168 |
| 400 | - | 0.249 ∣ 3.913 | 0.251 ∣ 1.847 | 0.256 ∣ 5.736 | ||
| 200 | - | 0.499 ∣ 0.894 | 0.500 ∣ 0.353 | 0.499 ∣ 2.099 | ||
| location 2 | Mean ∣ MSE (10−5) | 300 | - | 0.499 ∣ 1.204 | 0.501 ∣ 0.508 | 0.499 ∣ 6.148 |
| 400 | - | 0.498 ∣ 2.490 | 0.500 ∣ 0.981 | 0.500 ∣ 6.736 | ||
| 200 | - | 0.751 ∣ 0.701 | 0.750 ∣ 1.002 | 0.749 ∣ 1.689 | ||
| location 3 | Mean ∣ MSE (10−5) | 300 | - | 0.750 ∣ 1.370 | 0.751 ∣ 2.481 | 0.745 ∣ 7.968 |
| 400 | - | 0.748 ∣ 2.268 | 0.749 ∣ 0.688 | 0.748 ∣ 2.258 | ||
Table 5:
Multiple change point detection for Case 2 with Σ = Σ* under various dimensions. The numerical results are based on 100 replications.
| τ0 = (0.25, 0.5, 0.75)⊤ | | p | SGL | VPWBS | DPA | BSA |
|---|---|---|---|---|---|---|
| 200 | 2.65 ∣ 30 | 3.00 ∣ 100 | 3.00 ∣ 100 | 2.99 ∣ 99 | ||
| number | Mean ∣ Rate(%) | 300 | 2.08 ∣ 17 | 2.97 ∣ 99 | 3.00 ∣ 100 | 2.98 ∣ 98 |
| 400 | 2.47 ∣ 25 | 3.00 ∣ 100 | 3.00 ∣ 100 | 2.98 ∣ 98 | ||
| 200 | - | 0.247 ∣ 2.435 | 0.250 ∣ 1.961 | 0.251 ∣ 3.457 | ||
| location 1 | Mean ∣ MSE (10−5) | 300 | - | 0.249 ∣ 3.164 | 0.251 ∣ 2.844 | 0.250 ∣ 6.585 |
| 400 | - | 0.251 ∣ 1.687 | 0.251 ∣ 2.497 | 0.249 ∣ 4.994 | ||
| 200 | - | 0.487 ∣ 2.040 | 0.500 ∣ 1.069 | 0.499 ∣ 2.667 | ||
| location 2 | Mean ∣ MSE (10−5) | 300 | - | 0.502 ∣ 1.403 | 0.501 ∣ 0.894 | 0.499 ∣ 4.363 |
| 400 | - | 0.499 ∣ 2.212 | 0.501 ∣ 3.613 | 0.500 ∣ 4.192 | ||
| 200 | - | 0.748 ∣ 1.242 | 0.750 ∣ 0.711 | 0.748 ∣ 1.674 | ||
| location 3 | Mean ∣ MSE (10−5) | 300 | - | 0.752 ∣ 1.672 | 0.750 ∣ 0.486 | 0.749 ∣ 1.398 |
| 400 | - | 0.747 ∣ 4.511 | 0.750 ∣ 0.580 | 0.747 ∣ 4.881 | ||
It is worth mentioning that our proposed methods have good performance in all three models with various data dimensions, sample sizes, and numbers of change points, which suggests that the methods are robust to the suggested choice of tuning parameters λ and γ.
5. REAL DATA ANALYSIS
To illustrate the usefulness of our proposed methods, we consider the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (www.loni.ucla.edu/ADNI). The ADNI dataset contains disease state information on different subjects, including normal controls (NC), mild cognitive impairment (MCI), and Alzheimer’s disease (AD), as well as some biological markers, including features derived from magnetic resonance imaging (MRI) and positron emission tomography (PET). Studying how to measure the progression of AD using these images is very useful for clinical diagnosis and prevention. For example, in the AD-related literature (see, for example, Reiss & Ogden, 2010), it is popular to use structural MRI or PET to predict the current disease status of a patient (a binary response variable), which can be regarded as a classification problem. Usually, the data are treated as homogeneous and the effect of other covariates, such as age and gender, is ignored. Hence, an interesting question is whether the generalized linear structure between the disease status and the biomarkers (MRI or PET) changes due to other covariates. If it changes, how can one estimate (a) the number of change points, (b) the locations of change points, and (c) the regression coefficients (selected variables) in each segment? In our study, we address these issues by detecting change points in the generalized linear structure between the disease status and the MRI features together with some covariates. In this application, we choose age as the covariate, which is of particular interest in AD studies.
We use the MRI data of 405 subjects, including 220 normal controls and 185 AD patients, from the ADNI data. For each subject, we obtain the corresponding status (AD/NC), age, and 93 MRI features after applying the data processing method proposed in Zhang & Shen (2012). In our model, the predictive variables X = (X1, …, X93) are the 93 MRI features, which are scaled to have mean 0 and variance 1, and the response variable is the binary status obtained by setting Yi = 1{subject i is an AD patient}, and 0 otherwise. For this dataset, with subjects ordered by age, we consider the logistic regression model P(Yi = 1 ∣ Xi) = exp(Xi⊤β(i))/{1 + exp(Xi⊤β(i))}. Our goal is to detect potential change points of the regression coefficients β(i) as age varies. Taking the effect of different samples into consideration, in our analysis we divide the data into two parts: training and testing datasets. More specifically, we randomly select 40 subjects from the whole set of 405 subjects according to the empirical distribution of age in Figure 5 (left) as the testing sample and use the remaining 365 subjects as the training data. Then we sort those 365 observations in the training data by age and use BSA to estimate the number and locations of change points. We choose the tuning parameters λ and γ as suggested in Section 4. The above process is repeated 100 times. Since BSA is more computationally efficient than DPA, we use only BSA to analyze the data.
Figure 5:
The left panel shows distribution of age among 405 subjects. The right panel shows estimated numbers of change points based on 100 replications.
Figure 5 (right) shows the estimated numbers of change points over the 100 replications. To be specific, among the 100 replications, 80% of the estimated numbers of change points equal 1, which suggests that there is a change point in the logistic regression relationship between the disease status and the MRI features due to the change of age. Moreover, Table 6 summarizes the change point estimation results computed by BSA. For the change point estimation, 90% of the change points estimated by BSA are located at 80 years old, and the mean is 79.95 years old. This implies that there are significant differences between the regression models relating the disease status to the MRI features for subjects under 80 and over 80 years old.
Table 6:
Change point estimation and prediction for ADNI data. The change point estimator is obtained by using BSA based on 100 replications.
| training/testing sample | methods | location of change point | Averaged MSE |
|---|---|---|---|
| 365/40 | BSA | 79.75 | 0.380 |
| | Lasso-based | - | 0.408 |
In terms of prediction, Table 6 provides the prediction results computed by BSA and the Lasso-based method, where for each replication, we use the training sample to select models and use the testing sample to predict. Note that the Lasso-based method treats the data as homogeneous while we consider a heterogeneous model. For the prediction result, we calculate the predictive MSE on the testing set for these two methods. Our proposed method obtains better prediction performance, which is demonstrated by a 7% lower averaged predictive MSE than that of the Lasso-based method. This suggests that treating the data as heterogeneous and using our method to select models can predict better.
For variable selection, Figure 6 shows the selected features before and after the estimated change point of 80 years old. We see that the models both under 80 and over 80 select the 69th and 83rd features, which correspond to the hippocampal and amygdala regions, respectively. These regions are known to be related to AD from many previous studies (Zhang & Shen, 2012). Moreover, a few different features are selected separately by the two models under 80 and over 80. We believe these features deserve more scientific attention, and more research is needed to study their associations with AD together with age.
Figure 6:
Frequency of features selected before the change point (left) and after the change point (right) for the ADNI dataset based on 100 replications. Features selected by models both before and after the change point are shown in red.
6. SUMMARY
In this paper, we provide a three-step procedure for change point detection in the context of high-dimensional GLMs, where the dimension p can be much larger than the sample size n. It is worth mentioning that our proposed method can automatically account for the underlying data generation mechanism (k0 = 0 or k0 ≥ 1) without specifying any prior knowledge about the number of change points k0. Moreover, based on dynamic programming and binary segmentation techniques, two algorithms, DPA and BSA, are proposed to detect multiple change points. To further improve the computational efficiency, we present a much more efficient algorithm designed for the case of a single change point. Furthermore, we investigate the theoretical properties of our proposed change point estimators computed by the three algorithms. Estimation consistency for the number and locations of change points is established. Finally, we demonstrate the efficiency and accuracy of our proposed methods by extensive numerical results under various model settings. A real data application to the ADNI dataset also demonstrates the usefulness of our proposed methods.
Table 1:
Change point detection for Cases 1 and 2 under the model with no change point. The numerical results are based on averages of 100 replications.
| Case | Measurement | p | SGL | VPWBS | Lee2011 | DPA | BSA |
|---|---|---|---|---|---|---|---|
| 200 | 0.65 | 0.01 | 0.73 | 0.00 | 0.00 | ||
| Σ = I | error rate | 300 | 0.85 | 0.01 | 0.71 | 0.00 | 0.00 |
| 400 | 0.86 | 0.01 | 0.74 | 0.00 | 0.00 | ||
| 200 | 0.38 | 0.00 | 0.71 | 0.00 | 0.00 | ||
| Σ = Σ* | error rate | 300 | 0.44 | 0.01 | 0.71 | 0.00 | 0.00 |
| 400 | 0.47 | 0.00 | 0.72 | 0.00 | 0.00 | ||
Acknowledgements
The authors thank the Editor, the Associate Editor and the reviewers, whose helpful comments and suggestions led to a much improved presentation. This research is supported in part by the National Natural Science Foundation of China Grant 11971116 (Zhang), 12101132 (Bin Liu), and US National Institute of Health Grant R01GM126550 and National Science Foundation Grants DMS1821231 and DMS2100729 (Yufeng Liu).
APPENDIX
Notation
We introduce some notation. The random part, empirical process, is defined as . We recall the Lasso estimator is, for j = 1, …, k, . We write and for convenience. Recall that for any (u, v), the oracle β* is defined as , which is the best approximation of β0 under the compatibility condition, β* = β0, if there is no change point between u and v. The estimation error is denoted as ϵ* := vn(β*) = (Pn − P)ρfβ*. We define, for some positive constant L > 0, ZL := sup∥β−β*∥1≤L ∣vn(β) − vn (β*)∣ . We set L* := ϵ*/λ0, and require a relatively small L*, so this indicates that ϵ* ≤ λ0. Based on this, for any u, v ∈ Vn := {i/n : i =1, …, n} with u < v, we define two important sets as follows:
| (1) |
| (2) |
We introduce the following assumptions.
Assumption F. We require some technical conditions as follows:
(F1) , where .
(F2) , where d* = O(1) with detailed definition introduced in the Appendix .
(F3) , where .
(F4) .
Assumption G. We require some conditions for achieving limit properties for the de-biased Lasso estimator:
(G1) The derivatives exist for all y, a, and for some δ-neighborhood (δ > 0), is Lipschitz:
Moreover,
(G2) It holds that .
(G3) It holds that and
(G4) It holds that and moreover .
(G5) For every j, the random variable converges weakly to a N(0,1)-distribution.
Useful Lemmas
We introduce some useful lemmas that are essential for our main results. More specifically, Lemma 1 presents an upper bound on the difference of the subinterval penalized empirical average of the loss function based on the Lasso and oracle estimators. Corollary 1 shows that the inequality in Lemma 1 holds with high probability. Lemma 2 provides the results for the subinterval based on the compatibility condition. The margin condition based on the oracle estimator is updated in Lemma 3. Lemma 4 presents a lower bound on the difference of the loss function based on the oracle estimators of adjacent subintervals. Finally, Lemma 5 provides an upper bound on the difference of the subinterval penalized empirical average of the loss function based on the oracle estimator and the truth. Next, we introduce these useful lemmas in detail.
Lemma 1 (Oracle inequality for the Lasso) Suppose Assumptions A-F hold for all , as well as . Suppose that λ satisfies the inequality . Then on the set given in (1)-(2), we have
| (3) |
where there exists a constant C3 > 0 such that ϵ* ≤ C3s* log(p)/n.
Proof. Since the Assumptions hold, on , , we have
by Theorem 6.4 in Bühlmann & van de Geer (2011). By the definition of Pnρ, Pρ, and v(β) introduced in Section 3.1, we can obtain
If the condition holds, on the set , we can have
which completes the proof.
Corollary 1 Suppose Assumptions A-F hold. Let and
Then, with probability at least , Equation (3) holds. For the proof, we refer to Theorem 2.1 in van de Geer (2008).
Lemma 2 Suppose Assumption D and hold. Then on the set , we have, for all u, v ∈ {i/n : i = 1, …, n} and all β with ∥β_{S*^c}∥1 ≤ 3∥β_{S*}∥1,
Proof. By Assumption D (the compatibility condition), for any u, v ∈ {i/n : i = 1, …, n}, we have
for all β that satisfy ∥β_{S*^c}∥1 ≤ 3∥β_{S*}∥1. Then the matrix (v − u)Σ satisfies the compatibility condition for the set S* with constant . By Corollary 6.8 in Bühlmann & van de Geer (2011), if , the compatibility condition also holds for the set S* and the matrix , with . Then we can obtain, for all β that satisfy ∥β_{S*^c}∥1 ≤ 3∥β_{S*}∥1,
Lemma 3 By Assumption B, there exists an η ≥ 0 and a strictly convex increasing G, such that, for all with ∥fβ1 − f0∥∞ ≤ η/2, ∥fβ − f0∥∞ ≤ η, and ∥β* − β0∥ ≤ η, we have
Proof. According to Assumption C (margin condition), we can directly have
By the definition of , v(β) introduced in Notation and the triangle inequality, we have
By the definition of β*, under the compatibility condition, β* is the best approximation of β0: β* = β0. Then we have
Lemma 4 Suppose k0 > 1 and that Assumptions A-F and hold. Then on , if and for some j = 2, …, , we have
Proof. By Lemmas 2 and 3, if and for some j = 2, …, , we have
Now observe that
Then
by Assumption E. If and for some j = 2, …, , we have
| (4) |
Then, by the above Equation (4) and straightforward calculations, we can obtain
As , we can complete the proof. ■
Lemma 5 Suppose Assumptions A-F hold and let for some j = 1, …, , and , . Then on the set we have
where , .
Proof. First, by straightforward calculations, we obtain
According to Lemma 1 and on the set , we can obtain
| (5) |
Now, we bound the bias between and , with . Since , by Assumption E, we have that
| (6) |
Furthermore, combining Assumption A, Assumption C, Equation (6), and the Cauchy-Schwarz inequality, we have
| (7) |
Combining Equations (5) and (7) completes the proof. ■
Proof of Theorem 1. To simplify the notation, we denote the value of the penalized total loss function corresponding to the change point vector by
| (8) |
First, we will show that, if the assumptions hold, we must have and . Suppose, on the contrary, that does not satisfy the above two results. We can distinguish three possible cases:
Case 1: The change point number is overestimated, . There exists some i, 1 ≤ i ≤ , such that for some j, 1 ≤ j ≤ .
Case 2: The change point number is underestimated, . For some , we have and .
Case 3: The change point number is correctly estimated, . However, for some , we have and .
We first consider Case 1, where we have and there exists some i, such that for some . We define
Then we get a new change point vector τ with . Denote the intervals by , and , and then we obtain
| (9) |
By the definition of the Lasso estimator and the triangle inequality, we can directly have
| (10) |
Then, combining Equations (9)-(10), we can directly have
| (11) |
Using some straightforward calculations and the triangle inequality, we have
| (12) |
By Lemma 5 and the above Equation (12), with (u, v) = S1, S2, we have
| (13) |
Also by Lemma 5, for the second term in Equation (10), we can directly have
| (14) |
and therefore, by combining Equation (11) and Equations (13)-(14), we can easily obtain
| (15) |
According to Assumption (F2), we obtain , which is a contradiction.
For Case 2, where we have , we define a new change point vector , that is,
| (16) |
where . We obtain a new change point vector τ with . We also denote the intervals by , and ; then we have
| (17) |
By Lemma 1, for u, v ∈ Vn, we can obtain . Thus, by this inequality (with (u, v) = S1, S2, and S), the triangle inequality, and Lemma 4, we have
| (18) |
Then, by combining the above Equations (17)-(18), we can directly have
| (19) |
by Assumption (F3), we have , which is a contradiction.
For Case 3 with , we must add some points and remove others to obtain a good change point estimator. Then we define with . We denote the intervals by , , and , and then we can obtain
| (20) |
Without loss of generality, we assume |S1| < δ and define a new partition . Denoting and I = K1 ∪ S1, we have that
| (21) |
Thus, by combining Equations (20) and (21) with straightforward calculations, we have
by the definition of the Lasso estimator and the triangle inequality, we can obtain
| (22) |
By Equation (22), Lemma 1, Lemma 4, and straightforward calculations, we have
According to Assumption (F4), we can obtain , which is a contradiction. Combining the three cases above, the first two results (1) and (2) in Theorem 1 have been proved. Result (3) in Theorem 1 can be obtained directly by combining result (2) with Lemma 5.
Now we prove the fourth result in Theorem 1. Using the fact that, by condition , we have
where
it follows that
By the proof of Theorem 3.1 in van de Geer et al. (2014), we have . According to the third result in Theorem 1, we have . Then, it follows that
since the condition holds. By straightforward calculations, we have
By the proof of Theorem 3.1 in van de Geer et al. (2014), we can easily conclude that
where and . ■
Proof of Theorem 2. First, we will show that, under the conditions of Theorem 1, on , we can distinguish three cases:
Case A: The change point number is overestimated, . We have and .
Case B: The change point number is underestimated, . For , we have h(0, 1) and .
Case C: The change point number is correctly estimated, . For , we have h(0, 1) ∈ [δ, 1 − δ].
This fact can be derived straightforwardly from the proof of Theorem 1, as the objective functions coincide for 1 or 2 segments; that is, for all u ∈ [0, 1],
For Case A, suppose and τ = (0, 1). As in the proof of Case 1 in Theorem 1, we can obtain
and h(0, 1) = 0. We suppose and . We define τ(0) = (0, h(0, 1), 1), , τ(2) = τ(1)\{h(0, 1)}.
For Case B, where h(0, 1) = 0, as in the proof of Case 2 in Theorem 1, we can obtain H(τ(0)) > H(τ(1)). For Case C, where h(0, 1) ∈ [δ, 1 − δ], as in the proof of Case 3 in Theorem 1, we can obtain
Since h(0, 1) minimizes Equation (17), all three cases result in a contradiction. We can then replace (0, 1) by each sub-interval and obtain the same results, which completes the proof. ■
Proof of Theorem 3. First, we recall our proposed statistics as follows:
Then, for convenience, we consider two cases for the change point location denoted by τ:
Case E: 0 < τ ≤ 1/4,
Case F: 1/4 < τ < 1/2.
Next, we will discuss each case in detail. Case E: In this case, 0 < τ ≤ 1/4, so we have
We observe that there is no change point in (1/4, 1). Hence the Lasso estimators on any subinterval (u, v) ⊂ (1/4, 1) target the same coefficient vector, so we can replace by or , and it follows that
| (23) |
We denote the intervals by , , and . Then Equation (23) can be rearranged as follows:
Then, using the same argument as in Case 1 of the proof of Theorem 1, we can obtain W1/4 < W3/4.
Case F: In this case, 1/4 < τ < 1/2, and we observe that there is no change point within the two intervals (0, 1/4) and (3/4, 1). Then, by straightforward calculations, we can obtain
We denote the intervals by (0, 1/4) = K1, (1/4, 3/4) = J1, (3/4, 1) = J2, and it follows that
Then, using the same argument as in Case 3 of the proof of Theorem 1, we can obtain W1/4 < W3/4. ■
Footnotes
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
BIBLIOGRAPHY
- Aue A, Hörmann S, Horváth L, & Reimherr M (2009). Break detection in the covariance structure of multivariate time series models. The Annals of Statistics, 37(6B), 4046–4087.
- Barigozzi M, Cho H, & Fryzlewicz P (2018). Simultaneous multiple change-point and factor analysis for high-dimensional time series. Journal of Econometrics, 206(1), 187–225.
- Bühlmann P, & van de Geer S (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
- Bickel PJ, Ritov Y, & Tsybakov AB (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4), 1705–1732.
- Boysen L, Kempe A, Liebscher V, Munk A, & Wittich O (2009). Consistencies and rates of convergence of jump-penalized least squares estimators. The Annals of Statistics, 37(1), 157–183.
- Braun JV, Braun RK, & Müller HG (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2), 301–314.
- Candes E, & Tao T (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351.
- Cho H, & Fryzlewicz P (2012). Multiscale and multilevel technique for consistent segmentation of nonstationary time series. Statistica Sinica, 22(1), 207–229.
- Cho H, & Fryzlewicz P (2015). Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2), 475–507.
- Ciuperca G (2014). Model selection by LASSO methods in a change-point model. Statistical Papers, 55(2), 349–374.
- Fan J, & Lv J (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101–148.
- Fan J, Lv J, & Qi L (2011). Sparse high-dimensional models in economics. Annual Review of Economics, 3(1), 291–317.
- Fan J, & Peng H (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3), 928–961.
- Fong Y, Di C, & Permar S (2015). Change point testing in logistic regression models with interaction term. Statistics in Medicine, 34(9), 1483–1494.
- Frick K, Munk A, & Sieling H (2014). Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3), 495–580.
- Fryzlewicz P (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6), 2243–2281.
- Harchaoui Z, & Lévy-Leduc C (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492), 1480–1493.
- Jirak M (2015). Uniform change point tests in high dimension. The Annals of Statistics, 43(6), 2451–2483.
- Kirch C, Muhsal B, & Ombao H (2015). Detection of changes in multivariate time series with application to EEG data. Journal of the American Statistical Association, 110(511), 1197–1216.
- Lavielle M, & Teyssiére G (2006). Detection of multiple change-points in multivariate time series. Lithuanian Mathematical Journal, 46(3), 287–306.
- Lee S, & Seo MH (2008). Semiparametric estimation of a binary response model with a change-point due to a covariate threshold. Journal of Econometrics, 144(2), 492–499.
- Lee S, Seo MH, & Shin Y (2011). Testing for threshold effects in regression models. Journal of the American Statistical Association, 106(493), 220–231.
- Lee S, Seo MH, & Shin Y (2016). The lasso for high dimensional regression with a possible change point. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1), 193–210.
- Leonardi F, & Bühlmann P (2016). Computationally efficient change point detection for high-dimensional regression. arXiv preprint arXiv:1601.03704.
- Li D, Qian J, & Su L (2016). Panel data models with interactive fixed effects and multiple structural breaks. Journal of the American Statistical Association, 111(516), 1804–1819.
- Liu B, Zhang X, & Liu Y (2019). Simultaneous change point detection and identification for high dimensional linear models, submitted.
- Page ES (1955). A test for a change in a parameter occurring at an unknown point. Biometrika, 42(3/4), 523–527.
- Pesaran MH, & Pick A (2007). Econometric issues in the analysis of contagion. Journal of Economic Dynamics and Control, 31(4), 1245–1277.
- Qian J, & Su L (2016). Shrinkage estimation of regression models with multiple structural changes. Econometric Theory, 32(6), 1376–1433.
- Raginsky M, Willett RM, Horn C, Silva J, & Marcia RF (2012). Sequential anomaly detection in the presence of noise and limited feedback. IEEE Transactions on Information Theory, 58(8), 5544–5562.
- Reiss PT, & Ogden RT (2010). Functional generalized linear models with images as predictors. Biometrics, 66(1), 61–69.
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
- Tibshirani R (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3), 273–282.
- Tickle SO, Eckley IA, Fearnhead P, & Haynes K (2020). Parallelization of a common changepoint detection method. Journal of Computational and Graphical Statistics, 29(1), 149–161.
- van de Geer SA (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645.
- van de Geer S, Bühlmann P, Ritov YA, & Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.
- Wang D, Lin K, & Willett R (2021a). Statistically and computationally efficient change point localization in regression settings. arXiv preprint arXiv:1906.11364.
- Wang D, Yu Y, & Rinaldo A (2021b). Optimal covariance change point localization in high dimensions. Bernoulli, 27(1), 554–575.
- Wang T, & Samworth RJ (2018). High dimensional change point estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1), 57–83.
- Zhang D, & Shen D (2012). Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage, 59(2), 895–907.
- Zhang T, & Lavitas L (2018). Unsupervised self-normalized change-point testing for time series. Journal of the American Statistical Association, 113(522), 637–648.