Abstract
Change point detection for high-dimensional data is an important yet challenging problem for many applications. In this paper, we consider multiple change point detection in the context of high-dimensional generalized linear models, allowing the covariate dimension p to grow exponentially with the sample size n. The model considered is general and flexible in the sense that it covers various specific models as special cases. Our method can automatically account for the underlying data generation mechanism without specifying any prior knowledge about the number of change points. Based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points, allowing the number of change points to grow with n. To further improve the computational efficiency, a more efficient algorithm designed for the case of a single change point is proposed. We present theoretical properties of our proposed algorithms, including estimation consistency for the number and locations of change points as well as consistency and asymptotic distributions for the underlying regression coefficients. Finally, extensive simulation studies and an application to the Alzheimer’s Disease Neuroimaging Initiative data further demonstrate the competitive performance of our proposed methods.
Keywords: Binary segmentation, Dynamic programming, Generalized linear models, High dimensions
1. INTRODUCTION
With the advance of technology, complex large-scale data are prevalent in various scientific fields. Data heterogeneity creates great challenges for the analysis of complex data that may not be well approximated by a common distribution. Change point detection is a powerful tool for dealing with data heterogeneity. Since the seminal work of Page (1955), there has been a growing literature on change point detection with a wide range of applications, including genomics (Braun et al., 2000), finance (Pesaran & Pick, 2007; Fan et al., 2011), and social networks (Raginsky et al., 2012).
In this paper, we consider multiple change point detection for a general framework of high-dimensional generalized linear models (GLMs). Suppose we have n independent observations with
| g(E(Yi ∣ Xi)) = Xi⊤β(i), i = 1, …, n, | (1) |
where Yi is the real-valued response for the i-th observation, Xi = (Xi1,…,Xip)⊤ is the corresponding covariate vector in ℝp, g(·) is the link function, and β(i) ∈ ℝp is the unknown regression coefficient vector for the i-th observation. Then we consider estimating multiple change points with piecewise constant coefficients for Model (1). More specifically, let k0 be the true number of unknown change points along with the true location vector τ0 = (τ10, …, τk00)⊤ with 0 < τ10 < ⋯ < τk00 < 1. Then, the unknown change points divide the n time-ordered observations into k0 + 1 intervals and the underlying regression coefficients β(i) have the following form:
| β(i) = β0(j), for i/n ∈ (τj−10, τj0], j = 1, …, k0 + 1, | (2) |
where β0(j) ∈ ℝp denotes the underlying true regression coefficients in the j-th interval, with the convention τ00 := 0 and τk0+10 := 1. We focus on change point detection, which consists of estimating: (a) the number of change points k0; (b) the locations of change points τ0; (c) the regression coefficients β0(j) in each segment, where j = 1, …, k0 + 1.
There is a growing literature on change point detection. Most existing papers focus on change point problems in the mean, variance, or covariance matrix, either for a fixed p (Kirch et al., 2015; Zhang & Lavitas, 2018) or for a growing p (Frick et al., 2014; Jirak, 2015; Barigozzi et al., 2018; Wang & Samworth, 2018; Wang et al., 2021b). Progress has been made in the literature on detection of multiple change points as well (Lavielle & Teyssiére, 2006; Aue et al., 2009; Harchaoui & Lévy-Leduc, 2010; Cho & Fryzlewicz, 2015). Despite this progress, far fewer papers address regression change point problems, especially for high-dimensional models. The main difficulty comes from the complexity of both computation and theoretical analysis arising from the growing dimension.
For regression problems, penalized techniques such as Lasso (Tibshirani, 1996) are popular in dealing with high-dimensional data. Some theoretical properties of the Lasso and various extensions can be found in Fan & Peng (2004), Candes & Tao (2007), and van de Geer et al. (2014). For a general overview and recent developments, we refer to Fan & Lv (2010) and Tibshirani (2011). In terms of change point detection based on Lasso, some methods exist for solving regression change point problems both in low and high dimensions. For example, designed for a fixed p, Ciuperca (2014) considered multiple change point estimation based on the Lasso. Qian & Su (2016) and Li et al. (2016) proposed a systematic change point estimation framework based on the adaptive fused Lasso. When the data dimension p grows to infinity, Lee et al. (2016) considered high-dimensional linear models with a possible change point and proposed a method for estimating regression coefficients as well as the unknown threshold parameter. As an extension, Leonardi & Bühlmann (2016) proposed computationally efficient algorithms for the number and locations of multiple change points in the context of high-dimensional linear models. Recently, Liu et al. (2019) investigated simultaneous change point detection and identification based on a de-biased Lasso process. Wang et al. (2021a) developed variance projected wild binary segmentation (VPWBS) for multiple change point detection.
Note that the above-mentioned papers focused on change point detection based on linear models with a continuous response, and thus are not directly applicable to the analysis of categorical or count response variables in practice. GLM can be very useful in this situation since it covers the exponential family distributions for the response variable. Because of its generality, GLM is widely used in various applications such as genetics, economics, and epidemiology. Several papers studied low-dimensional, single change point problems in the context of GLM (Lee & Seo, 2008; Lee et al., 2011; Fong et al., 2015). To the best of our knowledge, change point detection for high-dimensional GLMs has not been studied in the literature. Hence, it is desirable to consider a flexible and general framework for analyzing high-dimensional data with heterogeneity. Motivated by this, in this paper, we consider computationally efficient multiple change point detection in the context of high-dimensional GLMs. Our main contributions are summarized as follows:
We consider change point problems in a more flexible and general framework of high-dimensional GLMs, allowing the data dimension p to grow exponentially with the sample size n. It covers various model settings including linear models, logistic, and probit models as special cases. As far as we know, change point detection for high-dimensional logistic and probit models has not been considered in the literature.
Under the above framework, we propose a three-step procedure to estimate the number and locations of change points based on the Lasso estimator of the regression coefficients. The basic idea is to choose a suitable contrast function J(τ(k)) that is minimized (up to a penalty term) at the true change point vector. To solve this optimization problem, we propose two algorithms based on dynamic programming and binary segmentation techniques, which have computational costs of O(n2GLMLasso(n)) and O(n log(n)GLMLasso(n)), respectively, where GLMLasso(n) is the cost of computing the Lasso estimator for the GLM with sample size n. We also propose a much more efficient approach for the single change point case, with a computational cost of O(log(n)GLMLasso(n)). To the best of our knowledge, this is the most computationally efficient algorithm available for detecting a single change point in GLMs.
We examine some theoretical properties of our proposed change point estimators computed by the three algorithms. To be specific, under some mild conditions, both the dynamic programming and binary segmentation techniques yield consistent estimators for the number and locations of the true change points, covering the case with an asymptotically growing number of change points. Moreover, the estimation error of the Lasso estimator of the underlying regression coefficients is shown to be op(1). To enable further statistical prediction and inference, we introduce the de-biased Lasso estimator of the underlying regression coefficients in each segment, which is shown to be asymptotically normal. As for the third, more efficient approach designed for single change point cases, we establish that it can identify the location of the change point with high estimation accuracy. Finally, the competitive performance of our proposed methods is demonstrated by extensive numerical results as well as an application to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset.
For a better understanding of our work, we would like to point out its relationship with several related papers. Compared to Lee et al. (2011), which considered single change point detection for the binary response variable with low dimensional covariates, we overcome the challenges of the computational and theoretical complexity arising from the growing dimension and number of change points. Meanwhile, to address the issue of the unknown multiple change points, we construct accurate and effective algorithms based on two techniques, dynamic programming and binary segmentation. These techniques are popular for multiple change point detection and were previously studied by Lavielle & Teyssiére (2006), Boysen et al. (2009), Harchaoui & Lévy-Leduc (2010), Cho & Fryzlewicz (2012, 2015), and Leonardi & Bühlmann (2016). Our extension to GLMs involves several technical challenges to overcome. One substantial difficulty comes from the complex form of the contrast functions compared to the least squares for linear models considered in Leonardi & Bühlmann (2016).
The rest of this paper is organized as follows. In Section 2, we introduce our methodology and demonstrate how our three proposed algorithms detect change points. In Section 3, the corresponding theoretical results for the change point estimators computed by the different algorithms are established. We investigate the performance of our proposed methods through extensive simulation studies in Section 4 and a real data application in Section 5. We summarize the paper in Section 6. Detailed proofs of the main theorems and some useful lemmas are given in the Appendix.
2. METHODOLOGY
In this section, we introduce our new methodology for Model (1) with multiple unknown change points. In particular, in Section 2.1, some notation is introduced. In Section 2.2, we present a three-step change point estimator including the number and the locations of change points. Meanwhile, the regression coefficients in each segment are estimated based on the Lasso. In Sections 2.2.1 and 2.2.2, based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points. To further improve the computational efficiency, in Section 2.2.3, we present a much more efficient algorithm designed for the case of a single change point.
2.1. Notation
We first introduce some notation. For a vector a = (a1, …, ap)⊤ ∈ ℝp, we denote ∥a∥1 = Σi=1p ∣ai∣, ∥a∥2 = (Σi=1p ai2)1/2, and ∥a∥∞ = max1≤i≤p ∣ai∣. For two real-valued sequences an and bn, we set an = O(bn) if there exists a constant C such that ∣an∣ ≤ C∣bn∣ for all sufficiently large n. We set an = o(bn) if an/bn → 0 as n → ∞. For a sequence of random variables {ξ1, ξ2,…}, we write ξn →P ξ if ξn converges to ξ in probability as n → ∞. We also write ξn = op(1) if ξn →P 0. Given an interval (u, v) ⊂ [0, 1] such that un and vn are integers, we denote the vector (Yun+1,…,Yvn)⊤ by Y(u,v) and the vector (ϵun+1,…,ϵvn)⊤ by ϵ(u,v). Analogously, we use X(u,v) to denote the (v − u)n × p dimensional matrix with rows Xun+1⊤, …, Xvn⊤, and we use β̂((u, v)) to denote the Lasso estimator based on the observations Y(u,v) and X(u,v). For a set A, we use #A to denote its cardinality. For any x ≥ 0, we use [x] to denote the largest integer less than or equal to x. We use C1, C2, … to denote generic positive constants that may vary from place to place.
2.2. New Estimation and Algorithms
Let Y = (Y1,…, Yn)⊤ denote the n × 1 response vector, and X the n × p design matrix with Xi = (Xi1, …, Xip)⊤ being its i-th row for 1 ≤ i ≤ n. In this paper, we assume X1, …, Xn are independently and identically distributed (i.i.d.) p-dimensional random vectors with mean zero and covariance matrix Σ = Cov(X1). Furthermore, for j = 1, …, k0 + 1, we denote by S(j) the set of non-zero elements of the regression coefficients β0(j). For any given partition τ = (τ0, τ1, …, τk, τk+1)⊤, we denote the j-th interval by Ij(τ) = (τj−1, τj), the length of the j-th interval by rj(τ) = τj − τj−1, the shortest interval length by r(τ) = min1≤j≤k+1 rj(τ), and the number of change points by l(τ). Moreover, we denote the minimum interval length by δ.
We are now ready to introduce our change point estimator in detail. We consider the Lasso-type ℓ1-penalized estimator for high-dimensional GLMs. Such estimators have some desirable properties. In particular, van de Geer (2008) derived theoretical properties including consistency and an oracle inequality, on which our algorithms are mainly built. More specifically, let ρβ(x, y) be a loss function associated with the link g(·). For instance, if g(·) is the logit function, ρβ(x, y) is the negative log-likelihood function log(1 + exp(x⊤β)) − yx⊤β. For β ∈ ℝp, we define Pρ(β) = E{ρβ(X1, Y1)} and Pnρ(β) = n−1Σi=1n ρβ(Xi, Yi). Note that such complex loss functions lead to substantial difficulty for the estimation of change points as well as regression coefficients. Given data observations (Xi, Yi), i = 1, …, n, the Lasso-based GLM method solves the following ℓ1 penalized problem:
| β̂ = argminβ∈ℝp {Pnρ(β) + λ∥β∥1}. | (3) |
Since we consider heterogeneous data with possibly multiple change points, we cannot use Equation (3) directly to estimate the parameters. The main challenge is that both the number and the locations of the change points are unknown. To solve this issue, we consider three steps.
Before introducing the change point estimator, we first demonstrate how to estimate the regression coefficients for each segment. To be specific, for any given candidate partition τ = (τ0, τ1,…, τk, τk+1)⊤ with τj ∈ {i/n : i = 1, …, n}, j = 1,…, k + 1, we obtain the ℓ1-penalized estimator in each segment by solving, for j = 1,…, k + 1,
| β̂(Ij(τ)) = argminβ∈ℝp {Pnρ(Ij(τ), β) + λj∥β∥1}, | (4) |
where λj is the non-negative regularization parameter.
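For concreteness, the segment-wise estimator in Equation (4) and its penalized empirical loss can be computed with off-the-shelf Lasso software. The following R sketch does this for the logistic link using the glmnet package; the helper names seg_fit and seg_loss, the use of floor() to map fractions to observation indices, and the omission of an intercept are our own illustrative choices rather than part of the formal procedure.

```r
## Segment-wise l1-penalized logistic fit (Equation (4)) and its penalized
## empirical loss, normalized by the full sample size n.
## Assumptions: observations are already ordered, y is coded 0/1, and the
## segment (u, v) contains enough observations of both classes for glmnet.
library(glmnet)

seg_fit <- function(X, y, u, v, lambda) {
  idx <- seq(floor(u * nrow(X)) + 1, floor(v * nrow(X)))
  glmnet(X[idx, , drop = FALSE], y[idx], family = "binomial",
         alpha = 1, lambda = lambda, intercept = FALSE)
}

seg_loss <- function(X, y, u, v, lambda) {
  idx  <- seq(floor(u * nrow(X)) + 1, floor(v * nrow(X)))
  fit  <- seg_fit(X, y, u, v, lambda)
  beta <- as.numeric(as.matrix(coef(fit)))[-1]   # drop the (zero) intercept
  eta  <- as.numeric(X[idx, , drop = FALSE] %*% beta)
  # negative log-likelihood of the logistic model: log(1 + e^eta) - y * eta
  nll  <- sum(log1p(exp(eta)) - y[idx] * eta) / nrow(X)
  nll + lambda * sum(abs(beta))
}
```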
Based on Equation (4), our new algorithms for estimating both k0 and τ0 are summarized in the following three steps.
Step 1 (Search the “best” partition): Given the candidate number of change points k, we find the “best” partition τ̂(k) that minimizes the total loss function (contrast function):
| τ̂(k) = argminτ: l(τ)=k, r(τ)≥δ J(τ, β̂, X, Y), | (5) |
where J(τ, β̂, X, Y) = Σj=1k+1 {Pnρ(Ij(τ), β̂(Ij(τ))) + γ}, β̂(Ij(τ)) is the segment-wise Lasso estimator in Equation (4), and γ > 0 is a tuning parameter penalizing the number of segments.
Step 2 (Estimate number of change points): We plug τ̂(k) into J and obtain the minimum loss function associated with k as
| G(k) = J(τ̂(k), β̂, X, Y), k = 0, 1, …, kmax. | (6) |
Here kmax denotes a pre-specified upper bound on the number of change points (see Section 2.2.1). Then, we find the “best” estimate k̂ of the number of change points by minimizing the penalized criterion G(k):
| k̂ = argmin0≤k≤kmax G(k). | (7) |
Step 3 (Estimate locations of change points): We plug k̂ into Step 1 and obtain the final change point estimator by
| τ̂ = τ̂(k̂). | (8) |
Combining Steps 1–3, our final change point estimators k̂ and τ̂ can be obtained equivalently in the following form:
| τ̂ = argminτ: l(τ)≤kmax, r(τ)≥δ J(τ, β̂, X, Y), with k̂ = l(τ̂). | (9) |
After the above three steps, we obtain the change point estimators (k̂, τ̂). As for β0, we recommend two different Lasso-based estimators of the underlying regression coefficients, serving different purposes for practitioners. In particular, we can naturally use the Lasso estimator of β0 to select variables and make predictions, which is defined, for j = 1, …, k̂ + 1, as β̂(j) := β̂(Ij(τ̂)), i.e., the segment-wise estimator in Equation (4) evaluated on the estimated partition τ̂.
For further statistical inference including confidence intervals and hypothesis testing, van de Geer et al. (2014) proposed the de-biased Lasso estimator and analyzed its asymptotic properties for the homogeneous model under high-dimensional setups. Similarly, for the heterogeneous observations, we construct a de-biased Lasso estimator of the underlying regression coefficients for each segment, for j = 1, …, k̂ + 1, as b̂(j) := β̂(j) − Θ̂(j) Pnρ̇(Ij(τ̂), β̂(j)), where ρ̇β denotes the derivative of ρβ with respect to β,
and the precision matrix estimator Θ̂(j) can be constructed using the nodewise Lasso with the observations in the j-th estimated segment as input (van de Geer et al., 2014).
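As a rough illustration of the bias correction, the sketch below performs the one-step de-biasing for a single segment under the logistic link, assuming an estimate Theta_hat of the relevant precision matrix has already been obtained (e.g., via the nodewise Lasso of van de Geer et al., 2014); constructing Theta_hat itself is not shown, and the function name debias_segment is ours.

```r
## One-step de-biased (de-sparsified) Lasso update on one segment, logistic link.
## Xseg, yseg: observations in the segment; beta_hat: segment Lasso estimate;
## Theta_hat: assumed-available precision matrix estimate (p x p).
debias_segment <- function(Xseg, yseg, beta_hat, Theta_hat) {
  eta   <- as.numeric(Xseg %*% beta_hat)
  phat  <- 1 / (1 + exp(-eta))                        # fitted probabilities
  score <- crossprod(Xseg, yseg - phat) / nrow(Xseg)  # X^T (y - phat) / n, i.e. minus the loss gradient
  as.numeric(beta_hat + Theta_hat %*% score)          # bias-corrected coefficients
}
```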
In what follows, we introduce three specific algorithms for solving Equation (9). Note that J(τ, β, X, Y) and Pnρ(Ij(τ), β) are the loss functions for all intervals and for the j-th interval, respectively. Meanwhile, λj in Equation (4) and γ in Equation (5) are positive tuning parameters that encourage coefficient and segment sparsity, respectively. We adopt a cross-validation approach to choose these two tuning parameters λ and γ. To compute β̂(Ij(τ)) in Equation (4), we can use, for example, the R package glmnet (https://glmnet.stanford.edu). It is worth mentioning the following two remarks for our proposed estimator: (1) If the number of change points k0 is known, we just use Step 1 with k = k0 to directly obtain the locations of change points as follows:
| τ̂(k0) = argminτ: l(τ)=k0, r(τ)≥δ J(τ, β̂, X, Y). | (10) |
In this case, our method covers the setting considered by Lee et al. (2016), where at most one change point is assumed. (2) When no change point occurs (k0 = 0), our proposed method can still work. Hence, our proposed method can automatically account for the underlying data generation mechanism (k0 = 0 or k0 ≥ 1) without specifying any prior knowledge about the number of change points k0. Furthermore, as shown by our extensive numerical studies, our new algorithms can estimate k0 with high accuracy.
Our main goal is to design efficient algorithms that solve the optimization problem in Equation (9). To address this issue, three algorithms are proposed next.
2.2.1. Dynamic programming approach
We introduce a general approach based on the Dynamic Programming Algorithm (DPA), which works well for our change point problem (Eq. 9). It is well-known that DPA has excellent accuracy since it considers the global solution of Equation (9). It is widely used in multiple change point detection including the efficient, parallelized approaches introduced recently by Tickle et al. (2020). More details can be found in Boysen et al. (2009) and Leonardi & Bühlmann (2016). Next, we present how to use this technique to solve Equation (9) in detail.
For any given v ∈ {i/n : i = 1, …, n}, consider the sample (Y(0,v), X(0,v)). Given a candidate change point number k, denote by Fk(v) the minimum value of the penalized loss over all partitions of (0, v) with k change points:
| Fk(v) = minτ: 0=τ0<τ1<⋯<τk<τk+1=v, r(τ)≥δ Σj=1k+1 {Pnρ(Ij(τ), β̂(Ij(τ))) + γ}. | (11) |
One can see that the optimal k + 1 segments corresponding to the change point vector τ = (τ0, τ1, …, τk+1)⊤ obtained from Equation (11) consist of the optimal first k segments and a single last segment (τk, τk+1). Recall that τ0 := 0 and τk+1 := v. Then τk is the rightmost change point estimator. Furthermore, by the definition of Fk(v), the vector of the first k change points obtained from Equation (11) also attains Fk−1(τk). Hence, the last change point τk is the minimizer of Fk−1(u) + Pnρ((u, v), β̂((u, v))) + γ over u < v.
The above observation motivates us to use the dynamic programming recursion to calculate Fk(v) with v ∈ {i/n : i = 1, …, n}. In particular, for any v ∈ {i/n : i = 1, …, n}, define
| F0(v) = Pnρ((0, v), β̂((0, v))) + γ. | (12) |
Then, the dynamic programming recursion proceeds as follows:
| Fk(v) = minu∈{i/n: i=1,…,n}, u≤v−δ {Fk−1(u) + Pnρ((u, v), β̂((u, v))) + γ}. | (13) |
Define Vn = {i/n : i = 1, …, n}. Based on Equations (12) and (13), we can obtain {F1(v), v ∈ Vn}, {F2(v), v ∈ Vn}, …, and {Fkmax (v), v ∈ Vn}, where kmax (in our case kmax + 1 = 1/δ) is an “upper bound” of the number of change points. See Section 3.2 for more details. By considering G(k) in (6), we have Fk(1) = G(k) with k = 1, …, kmax. Hence, we are ready to estimate the change point number by
| k̂ = argmin0≤k≤kmax Fk(1). | (14) |
The corresponding locations of change points can be obtained by
| τ̂ = τ̂(k̂), recovered by backtracking the recursion in Equation (13). | (15) |
The following Algorithm 1 describes our procedure for obtaining k̂ and τ̂ based on the DPA. Note that DPA solves Equation (9) with a globally optimal solution, which has excellent estimation accuracy. Furthermore, as shown in Leonardi & Bühlmann (2016), it has a computational cost of O(n2GLMLasso(n)) operations. This can be computationally expensive, especially when n is very large. Hence, it is desirable to consider a more efficient approach. Next, we introduce an efficient approach based on binary segmentation, which can ensure almost the same estimation accuracy as that of DPA.
Algorithm 1: Dynamic programming procedure for change point detection in high-dimensional GLMs.
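To make the recursion concrete, the following R sketch implements Equations (12)–(14) over the grid Vn and recovers the locations by backtracking, assuming the segment-loss helper seg_loss() from the sketch following Equation (4) is available; the function name, the plain triple loop, and the backtracking bookkeeping are illustrative simplifications rather than the exact implementation used in our experiments.

```r
## Dynamic programming recursion (12)-(13) with backtracking (sketch).
## Assumes seg_loss(X, y, u, v, lambda) from the earlier sketch is in scope.
dpa_change_points <- function(X, y, lambda, gamma, delta, kmax) {
  n    <- nrow(X)
  grid <- seq_len(n) / n                         # candidate end points v in V_n
  m    <- length(grid)
  Fk   <- matrix(Inf, nrow = kmax + 1, ncol = m) # Fk[k + 1, j] stores F_k(grid[j])
  back <- matrix(NA_integer_, nrow = kmax + 1, ncol = m)
  for (j in seq_len(m))                          # F_0(v): one segment (0, v)
    if (grid[j] >= delta)
      Fk[1, j] <- seg_loss(X, y, 0, grid[j], lambda) + gamma
  for (k in seq_len(kmax)) {                     # recursion over k
    for (j in seq_len(m)) {
      for (i in seq_len(j - 1)) {                # last change point u = grid[i]
        if (grid[j] - grid[i] < delta || !is.finite(Fk[k, i])) next
        val <- Fk[k, i] + seg_loss(X, y, grid[i], grid[j], lambda) + gamma
        if (val < Fk[k + 1, j]) { Fk[k + 1, j] <- val; back[k + 1, j] <- i }
      }
    }
  }
  khat <- which.min(Fk[, m]) - 1                 # Equation (14): minimize F_k(1) over k
  tau  <- numeric(0); j <- m                     # backtrack the change point locations
  if (khat > 0) for (k in khat:1) { j <- back[k + 1, j]; tau <- c(grid[j], tau) }
  list(khat = khat, tau = tau)
}
```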
2.2.2. Binary segmentation approach
Next we introduce an approach based on the binary segmentation algorithm (BSA) examined in Cho & Fryzlewicz (2012, 2015) and Leonardi & Bühlmann (2016), which is much more efficient than DPA. The main idea of BSA for solving the change point problem for GLMs (Eq. 9) is that, for each candidate search interval (u, v), we use the penalized loss function to determine whether a new change point s can be added. If such an s is identified, the interval (u, v) is split into two subintervals (u, s) and (s, v), and we conduct the above procedure on (u, s) and (s, v) separately. The algorithm continues until no new change points can be added. In particular, for any given u, v ∈ Vn := {i/n : i = 1, …, n}, we define
| Z(u, v) := Pnρ((u, v), β̂((u, v))) + γ | (16) |
and
| ŝ(u, v) := argmins∈Vn: s−u≥δ, v−s≥δ {Z(u, s) + Z(s, v)}. | (17) |
Then we present our BSA-based algorithm as follows.
Algorithm 2: Binary segmentation procedure for change point detection in high-dimensional GLMs.
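The following R sketch illustrates the binary segmentation recursion, again assuming seg_loss() from the sketch following Equation (4); the concrete split rule used here (accept the best split of (u, v) whenever it lowers the penalized loss Z(u, v)) is one simple instantiation of Equations (16)–(17), and the function name bsa_change_points is ours.

```r
## Binary segmentation sketch: recursively split an interval whenever the best
## split lowers the penalized loss.  Assumes seg_loss() from the earlier sketch.
bsa_change_points <- function(X, y, lambda, gamma, delta) {
  n <- nrow(X)
  Z <- function(u, v) seg_loss(X, y, u, v, lambda) + gamma   # Equation (16)
  found <- numeric(0)
  recurse <- function(u, v) {
    cand <- seq_len(n) / n
    cand <- cand[cand - u >= delta & v - cand >= delta]      # respect minimum length
    if (length(cand) == 0) return(invisible(NULL))
    split_vals <- vapply(cand, function(s) Z(u, s) + Z(s, v), numeric(1))
    s <- cand[which.min(split_vals)]                         # Equation (17)
    if (min(split_vals) < Z(u, v)) {                         # splitting improves the fit
      found <<- c(found, s)
      recurse(u, s)
      recurse(s, v)
    }
    invisible(NULL)
  }
  recurse(0, 1)
  sort(found)
}
```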
Note that this approach searches far fewer candidates when looking for a new change point than DPA, which makes it more computationally efficient. More specifically, as shown by Leonardi & Bühlmann (2016), BSA has a computational cost of O(n log(n)GLMLasso(n)) operations. Furthermore, in Section 3.2, we prove that the change point estimator computed by Algorithm 2 enjoys almost the same estimation accuracy as that of Algorithm 1.
2.2.3. A fast screening approach for single change point models
So far, we have proposed two efficient algorithms in Sections 2.2.1 and 2.2.2 for solving Equation (9). In this section, we show that under single change point models, the computational cost can be further reduced. As far as we know, our fast screening approach (FSA) is novel for detecting a single change point in regression models. The main idea is that, for detecting a single change point, if we have some prior information about its location, it is not necessary to search all candidate subintervals as in the BSA-based algorithm. To see this, we recall Z(u, v) as defined in Equation (16). For τf ∈ (0, 1), we define the statistic Wτf((u, v)) = Z(u, u + τf(v − u)) + Z(u + τf(v − u), v). Based on Wτf((u, v)), we have the following key observation: consider any subinterval (u, v) containing the single change point τ10. If τ10 lies in the first half of (u, v), i.e., τ10 ∈ (u, (u + v)/2], then with high probability we can prove that W1/4((u, v)) < W3/4((u, v)). If τ10 lies in the second half of (u, v), i.e., τ10 ∈ ((u + v)/2, v), then with high probability we have W3/4((u, v)) < W1/4((u, v)). This observation motivates us to design Algorithm 3 for fast change point identification.
Note that Algorithm 3 does not need to search through all data points: it can quickly identify the half-interval where the change point is located by comparing the quarter, half, and three-quarter statistics Wτf on the current search interval in each iteration. As a result, it only takes O(log(n)GLMLasso(n)) computational operations to detect the change point. Hence, compared to Algorithms 1 and 2, the computational cost can be dramatically reduced. Its computational benefits are validated by our numerical experiments in Section 4.
Algorithm 3: A fast screening approach for single change point detection in high-dimensional GLMs.
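The fast screening idea can be sketched as follows: repeatedly compare W1/4 and W3/4 on the current interval, keep the half that is more likely to contain the change point, and finish with an exhaustive search on the remaining short stretch. The helper seg_loss() is again assumed from the sketch following Equation (4), and the stopping rule controlled by min_pts is an illustrative choice, not part of Algorithm 3.

```r
## Fast screening sketch for a single change point (Section 2.2.3).
## Assumes seg_loss() from the earlier sketch; min_pts controls when to stop halving.
fsa_single_change_point <- function(X, y, lambda, gamma, min_pts = 10) {
  n <- nrow(X)
  Z <- function(u, v) seg_loss(X, y, u, v, lambda) + gamma
  W <- function(u, v, tf) {                      # W_{tf}((u, v)) from the text
    s <- u + tf * (v - u)
    Z(u, s) + Z(s, v)
  }
  u <- 0; v <- 1
  while (floor(v * n) - floor(u * n) > 4 * min_pts) {
    # keep the half-interval that more likely contains the change point
    if (W(u, v, 1/4) < W(u, v, 3/4)) v <- (u + v) / 2 else u <- (u + v) / 2
  }
  cand <- seq_len(n) / n                         # exhaustive search on what is left
  cand <- cand[floor(cand * n) - floor(u * n) >= min_pts &
               floor(v * n) - floor(cand * n) >= min_pts]
  if (length(cand) == 0) return((u + v) / 2)
  cand[which.min(vapply(cand, function(s) Z(u, s) + Z(s, v), numeric(1)))]
}
```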
3. THEORETICAL PROPERTIES
We examine the theoretical properties of our proposed three approaches. In particular, we first show that our estimation of change points and regression coefficients is consistent and has the same rates of convergence as those of linear models. Secondly, for GLMs with change points, we reconstruct our assumptions and lemmas for analyzing high-dimensional data with heterogeneity based on the work of van de Geer (2008). Note that van de Geer (2008) considered the ℓ1 penalized estimation of the regression coefficients under the setting of all observations from the same GLM. In particular, in Section 3.1, we introduce some assumptions. In Section 3.2, we present theoretical results of the change point estimator computed by the new algorithms.
Before presenting the theoretical results, we introduce some additional notation. For any u, v ∈ Vn := {i/n : i = 1, …, n} with u < v, we denote, for a loss function ρβ, the subinterval-based theoretical mean and empirical mean by Pρ((u, v), β) := n−1Σi=un+1vn E{ρβ(Xi, Yi)} and Pnρ((u, v), β) := n−1Σi=un+1vn ρβ(Xi, Yi), respectively. For convenience, we denote Pρ = Pρ((0, 1)) and Pnρ = Pnρ((0, 1)). Consider the linear subspace F := {fβ(x) = x⊤β : β ∈ ℝp}. For an fβ ∈ F, define ρfβ(x, y) = ρ(fβ(x), y). Then the empirical risk and theoretical risk at f are defined as Pnρf and Pρf, respectively. Furthermore, we define the target f0 as the minimizer of the theoretical risk and β0 as the corresponding coefficient vector, where β0 can be regarded as the “truth”. By definition, we have f0(x) = x⊤β0. For f ∈ F, the excess risk is defined as ℰ(f) := Pρf − Pρf0. Lastly, for any subinterval (u, v), we define the oracle as β* := argminβ∈ℝp Pρ((u, v), β). The corresponding estimation error is then denoted as ϵ* := (Pn − P)ρfβ*.
3.1. Basic assumptions
We introduce some assumptions as follows.
Assumption A (loss function). The loss function ρf(x, y) := ρ(f(x), y) is convex in f(x) for all (x, y). Moreover, it satisfies the Lipschitz property: ∣ρ(fβ(x), y) − ρ(fβ̃(x), y)∣ ≤ ∣fβ(x) − fβ̃(x)∣ for all (x, y) and all β, β̃ ∈ ℝp.
Assumption B (design matrix). There exists KX < ∞ such that ∥Xi∥∞ ≤ KX and hold for all i = 1, …, n.
Assumption C (margin condition). There exist an η > 0 and a strictly convex increasing function G(x) such that, for all β ∈ ℝp with ∥fβ − f0∥∞ ≤ η, one has Pρfβ − Pρf0 ≥ G(∥fβ − f0∥),
where there exists a constant C such that G(x) ≥ Cx2 for any positive x.
Assumption D (compatibility condition). The compatibility condition is met for the set S* (with S(j) defined in Section 2.2) and constant ϕ* > 0 if, for all β ∈ ℝp satisfying ∥βS*c∥1 ≤ 3∥βS*∥1, it holds that ∥βS*∥12 ≤ s*(β⊤Σβ)/ϕ*2,
where s* := #S* is the cardinality of S*.
Assumption E (parameter space). For k0 > 1, there exist constants m* > 0 and M* > 0 such that
where ,
Note that in the case k0 = 1, the former condition reduces to ∥β0(1) − β0(2)∥1 ≥ m*s*.
We assume in Assumption A that the loss function ρ is Lipschitz in f, which allows us to bound the loss function by the difference between estimated regression parameters and the corresponding true parameters. Many functions can meet this condition: for example, the negative-likelihood function of the logistic regression model. Assumption B imposes relatively weak conditions on the covariates, which covers a wide range of distributional patterns. Assumption C (margin condition) is assumed for a “neighbourhood” of the target linear function f0 = XTβ0 and is a common condition for analyzing the GLM. See Section 6.4 in Bühlmann & van de Geer (2011) for more details. Assumption D (compatibility condition) for the design matrix X allows us to establish oracle results for Lasso estimation. Note that one can verify that Assumption D is a sufficient condition of Assumption C in van de Geer (2008) by choosing the function , where is the cardinality of the set defined in Assumption C of van de Geer (2008). Assumption E presents the minimum and maximum differences between the true regression parameters, which allow us to detect the change points. Furthermore, the sparsity of regression coefficients is required to guarantee the consistency of our proposed estimators. Assumption F introduced in the Appendix imposes some technical conditions on the tuning parameter λ for the Lasso estimation as well as the tuning parameter γ for the change point estimation. Assumption G includes the required condition for the limiting property of the de-biased Lasso estimator.
3.2. Main results
We are ready to present some theoretical results of our proposed three new algorithms. Before that, we denote and let d* be a constant. See more details in Lemma 5. We first present the property of the estimators computed by DPA in Algorithm 1.
Theorem 1 Suppose Assumptions A-G hold with log(p) = o(n). Then, for a given C1 > 0, with probability at least 1 − 7 exp , we have that
;
-
for each ,
where is the s-th component of and .
Theorem 1 demonstrates that Algorithm 1 can identify both the number and locations of multiple change points with high estimation accuracy. In particular, the first result shows that we can obtain a consistent estimator for the true number of change points. As for the locations, the second result indicates that our multiple change point estimator converges to the true change point vector. Furthermore, the third result implies that we can bound the prediction error and the estimation error of the underlying regression parameters. Result (4) implies the asymptotic normality of the de-biased Lasso estimator, which allows for wider statistical inference including confidence intervals and hypothesis testing.
Based on Theorem 1, some other interesting conclusions can be drawn. To simplify the discussion, we require that all the change point intervals be of the same order of magnitude. Recall that δ is the minimum length of the change point intervals as defined in Section 2.2. Then we have k0 = O(1/δ). Furthermore, depending on δ, the following two cases are considered: (1) δ = O(1) and (2) δ = o(1).
For the first case, we have k0 = O(1), which means that the number of change points is fixed and does not increase with the sample size n. Furthermore, considering Assumption (F1), we have . Hence, the three results in Theorem 1 reduce to:
| (18) |
Considering Equation (18), our results are consistent with the Lasso estimation results derived in van de Geer (2008) and estimation consistency is guaranteed as long as holds.
We next consider the second case with δ = o(1). In this case, we allow the number of change points to grow with n. Noting that k0 = O(1/δ), the three results in Theorem 1 reduce to:
| (19) |
Hence, by Equation (19), the estimation consistency can still be obtained as long as holds. In other words, the number of change points cannot grow faster than the order of .
Next, we present theoretical results of change point estimators computed by BSA.
Theorem 2 Suppose Assumptions A-G hold with log(p) = o(n). For a given C2 > 0, with probability at least 1 − 7 exp , we have that
;
;
;
-
for each ,
where is the s-th component of , and .
Theorem 2 shows results similar to those of Theorem 1 in terms of consistency of both the number and locations of change points. Furthermore, Theorem 2 allows us to use a much more efficient algorithm to detect multiple change points for GLMs, which enjoys almost the same estimation accuracy as that of the global solution. This efficiency is further investigated in our numerical experiments.
Finally, we establish theoretical properties of FSA proposed in Algorithm 3 for single change point models.
Theorem 3 Suppose Assumptions A-F hold with log(p) = o(n). Assume that the true single change point . For a given C3 > 0, with probability at least 1 − 7 exp , we have that
| (20) |
Theorem 3 justifies the validity of Algorithm 3 and demonstrates that the cost of identifying a single change point in GLMs can be reduced to only O(log(n)GLMLasso(n)) computational operations.
4. SIMULATION STUDIES
In this section, we investigate the numerical performance of our three proposed change point detection procedures in various model settings. For the design matrix X, we generate Xi i.i.d. from N(0, Σ). We first consider two types of covariance matrix structures, corresponding to independent and weakly dependent settings, as follows:
Case 1: Σ = Ip×p;
Case 2: Σ = Σ* with , where for 1 ≤ i, j ≤ p.
We consider logistic regression models. For i = 1, …, n, we generate Yi ∈ {0, 1} with success probability πi = exp(Xi⊤β(i))/{1 + exp(Xi⊤β(i))}. That is, the responses are generated from the Binomial distribution Yi ∼ Binomial(1, πi).
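To convey the flavor of this design, the sketch below generates covariates and logistic responses with piecewise-constant coefficients; the support size s0 and the unit coefficient magnitude are placeholders chosen for illustration and do not reproduce the exact signal strengths used in our experiments.

```r
## Simulate a logistic model whose coefficients switch at the fractions in tau0.
## Illustrative only: support size s0 and coefficient magnitude 1 are placeholders.
library(MASS)   # mvrnorm for N(0, Sigma) covariates

simulate_glm_cp <- function(n, p, Sigma, tau0, s0 = 5, seed = 1) {
  set.seed(seed)
  X      <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  breaks <- c(0, tau0, 1)
  betas  <- lapply(seq_len(length(tau0) + 1), function(j) {
    b <- rep(0, p)
    b[sample(seq_len(floor(0.3 * p)), s0)] <- 1   # random support, unit signal
    b
  })
  y <- numeric(n)
  for (i in seq_len(n)) {
    j    <- findInterval(i / n, breaks, left.open = TRUE, rightmost.closed = TRUE)
    eta  <- sum(X[i, ] * betas[[j]])
    y[i] <- rbinom(1, 1, plogis(eta))             # Y_i ~ Binomial(1, pi_i)
  }
  list(X = X, y = y, betas = betas)
}

## Example: Case 1 covariance with a single change point at 0.5
## dat <- simulate_glm_cp(n = 300, p = 200, Sigma = diag(200), tau0 = 0.5)
```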
For this model setup, we investigate the performance of our approaches in terms of accuracy and efficiency. For efficiency, we compare our proposed algorithms in terms of the computational cost. Note that BSA and DPA are designed for multiple change point detection. In order to compare efficiency reasonably for the cases with no change point and a single change point, we set these two algorithms to stop after one screening by making kmax = 1. To show the accuracy, we record the mean, mean squared error (MSE), and error rate (proportion of false positives) of the change point estimators including the number and locations. We compare the corresponding results with the following existing methods:
Lee et al. (2011) (denoted by Lee2011), which is based on the maximum score estimation.
Qian and Su (2016) (denoted by SGL), which proposed a systematic estimation framework based on the adaptive fused Lasso in linear regression models. To be specific, they estimate the change points and regression coefficients by minimizing the ℓ2-loss with a fused Lasso penalty. In this paper, we modify SGL by replacing the ℓ2-loss with the loss ρ defined in Section 2 for high-dimensional GLMs.
- Wang et al. (2021a) (denoted by VPWBS), i.e., the variance projected wild binary segmentation based on the sparse group Lasso estimator for linear regression models. In particular, they projected the high-dimensional time series onto a univariate time series. The optimal projection direction u is obtained by local group Lasso screening (LGS). Then they conducted mean change point detection by wild binary segmentation (WBS) on the projected univariate series. Note that, for linear models, LGS performs a variant of the group Lasso on a given subsample and computes the projection direction as the minimizer of a group-penalized least-squares criterion (Equation (21)), where s′ and e′ serve as boundary trimming parameters with s + 1 ≤ s′ + 1 < e′ ≤ e, and λG is the tuning parameter for the group penalty. For a better comparison in the context of high-dimensional GLMs, we modify this ℓ2-loss based method in Wang et al. (2021a) by replacing the ℓ2-loss in Equation (21) with the loss ρβ(Xi, Yi) defined in Section 2.
For our proposed approaches, the regression coefficients are computed by the R package glmnet (https://glmnet.stanford.edu). All numerical results are based on 100 replications, except for the test by Lee2011 which is based on 500 replications.
4.1. Tuning parameter selection
It is essential to properly choose the values of the tuning parameters for accurate estimation results. We develop a cross-validation approach for GLMs to choose the parameters λ and γ, which encourage regression coefficient sparsity and segmentation sparsity, respectively. To be specific, let the samples with odd indices be the training set (X1, X3, …, Xn−3, Xn−1) and the others be the validation set (X2, X4, …, Xn−2, Xn). For each candidate pair of the two tuning parameters (λ, γ), we run our procedure on the training set and obtain the estimated change points τ̂ and underlying regression coefficients β̂(j). Let β̂(i) = β̂(j) for i/n ∈ Ij(τ̂) and i = 1, …, n. We can then calculate the validation loss as the sum of ρβ̂(i)(Xi, Yi) over the validation set,
where ρ is the loss function of the GLM and depends on the link function. For specific regression models such as the linear model and the logistic regression model, the corresponding validation losses are Σi∈validation (Yi − Xi⊤β̂(i))2 and Σi∈validation {log(1 + exp(Xi⊤β̂(i))) − YiXi⊤β̂(i)}, respectively.
Then we choose the pair (λ, γ) corresponding to the lowest validation loss. Note that it is time-consuming to use the cross-validation procedure to choose the tuning parameters across our various model settings. Based on our extensive numerical simulations, we find that our methods are stable over a certain range of tuning parameters. Hence, we use an empirical choice of the parameters λ and γ to save computational cost. In particular, we set λ = c(log(p)/n)1/2 with c ∈ (0.15, 0.25). As for γ, we set γ = δλ. Recall that δ is the minimum interval length and δn is the minimum interval size, which controls the maximum number of change points. Note that δ is of key importance for our theoretical guarantees discussed in Section 3.2, and needs to be carefully chosen in simulation studies.
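A minimal version of this odd/even validation scheme for the logistic link is sketched below; cp_fit stands for whichever detection routine is plugged in, and its returned fields tau and beta are an assumed interface for this illustration rather than an interface defined in the paper.

```r
## Odd/even cross-validation loss for a candidate (lambda, gamma), logistic link.
## cp_fit(X, y, lambda, gamma) is an assumed interface returning a list with
##   $tau  : estimated change point fractions, and
##   $beta : a list of segment-wise coefficient vectors.
cv_loss <- function(X, y, lambda, gamma, cp_fit) {
  n    <- nrow(X)
  odd  <- seq(1, n, by = 2)
  even <- seq(2, n, by = 2)
  fit  <- cp_fit(X[odd, , drop = FALSE], y[odd], lambda, gamma)
  breaks <- c(0, fit$tau, 1)
  loss <- 0
  for (i in even) {
    j    <- findInterval(i / n, breaks, left.open = TRUE, rightmost.closed = TRUE)
    eta  <- sum(X[i, ] * fit$beta[[j]])
    loss <- loss + log1p(exp(eta)) - y[i] * eta   # logistic validation loss
  }
  loss
}

## Example grid search with illustrative values:
## grid   <- expand.grid(lambda = c(0.05, 0.10), gamma = c(0.01, 0.05))
## losses <- mapply(function(l, g) cv_loss(X, y, l, g, cp_fit), grid$lambda, grid$gamma)
## best   <- grid[which.min(losses), ]
```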
In order to ensure effective fitting of the regression model, we need to guarantee a sufficient sample size for each interval. According to our numerical studies, setting δ ∈ (0.1, 0.25) works well. To investigate how sensitive our proposed methods are to the choice of these tuning parameters, we consider various values of λ and γ by setting the sample size n ∈ {200, 300, 1000} and the data dimension p ∈ {200, 300, 400}. Note that our proposed methods can automatically account for the underlying data generation mechanism and do not need to know the number of change points. To justify this, in what follows, we present our numerical results under three different cases: (1) k0 = 0, (2) k0 = 1, and (3) k0 = 3, which correspond to data with no change point, one change point, and multiple change points, respectively.
4.2. No change point models
We consider the scenario where no change point occurs. In this case, the underlying regression coefficients satisfy β(i) = β0 for i = 1, …, n. We set the sample size n = 200 and the data dimension p ∈ {200, 300, 400}. For s ∈ S0, we generate the non-zero coefficients β0,s randomly, where S0 denotes the set of non-zero elements of β0.
We implement the corresponding algorithms independently on a 2.50GHz CPU (Linux) with 6 cores and 4GB of RAM. As shown in Figure 1 (left), the computational cost of BSA grows moderately (12s to 737s) as the data dimension increases from 400 to 2000, while the computational cost of DPA grows exponentially (80s to 32000s). As for the accuracy, the error rates of VPWBS, DPA, and BSA are zero in almost all cases, which suggests these three approaches have almost the same accuracy when no change point occurs. SGL tends to overestimate the number of change points for the homogeneous observations. Note that Lee2011 has relatively large errors, which suggests that it may be unreliable in high-dimensional settings. Thus, we do not include it in our comparisons for the single change point models.
Figure 1:
Efficiency of change point estimation with p = 2n. The left panel shows computational costs of BSA and DPA per replication under the model with no change point. The right panel shows computational costs of BSA and DPA per replication under the model of three change points.
4.3. Single change point models
Next we consider the scenario where (β(i))1≤i≤n have a single common change point at location τ10, with τ10 ∈ {0.5, 0.7}. We set the sample size n = 300 and the data dimension p ∈ {200, 300, 400}. Furthermore, we assume the regression coefficients before and after the change point have support sets S(1) and S(2). For s ∈ S(1), we set βs(1) to a fixed non-zero value. Then, for s ∈ S(2), we set βs(2) = βs(1) + δs, where δs denotes the signal jump. For each replication, the support sets S(1) and S(2) of the regression coefficients are randomly selected from the set {1, 2, …, 0.3p}.
Figure 2 (left) indicates that the computational cost of BSA grows gradually as the data dimension increases from 400 to 2000, compared to the exponential growth of DPA. Meanwhile, the computational cost of FSA in Algorithm 3 increases slowly as compared to the “exponential” growth of BSA, as shown in Figure 2 (right). This suggests that FSA is preferable for single change point models. Furthermore, to investigate the computational efficiency for high-dimensional cases, we present the computational cost of our three proposed approaches in Figure 3. It shows that our proposed approaches have stable and good performance as the data dimension p grows.
Figure 2:
Efficiency of change point estimation under the single change point model with n ∈ {200, 400, 600, 800, 1000} and p = 2n. The left panel shows the computational costs of BSA and DPA per replication. The right panel shows computational costs per replication of BSA and FSA. The change point is fixed at τ1 = 0.5.
Figure 3:
Efficiency of change point estimation under the single change point model with p ∈ {600, 1200, 1800, 2400, 3000} and n = 300. The left panel shows the computational costs of BSA and DPA per replication. The right panel shows computational costs per replication of BSA and FSA. The change point is fixed at τ1 = 0.5.
As for the accuracy, we record the percentage of replications (Rate (%)) in which DPA and BSA correctly identified a single change point. Note that in Tables 2 and 3, MSEs for the number and location are expressed as factors of 10−2 and 10−4, respectively. We can see that DPA, BSA, and VPWBS can identify a single change point with high rates of success. Furthermore, DPA generally has the best performance for estimating the single change point location. VPWBS performs better than BSA, especially when the change occurs near the edge. Both DPA and BSA perform slightly better than FSA. Note that all the proposed algorithms perform better the closer the change point location is to the middle of the data observations, e.g., τ10 = 0.5.
Table 2:
Single change point detection for Case 1 with Σ = Ip×p under various dimensions and change point locations, based on 100 replications.
| | τ10 | Method | p = 200 | p = 300 | p = 400 |
|---|---|---|---|---|---|
| Number | 0.5 | SGL | 2.49 ∣ 19 | 2.36 ∣ 17 | 2.48 ∣ 23 |
| Mean∣Rate(%) | VPWBS | 1.02 ∣ 96 | 1.01 ∣ 98 | 1.02 ∣ 97 | |
| DPA | 1.00 ∣ 100 | 1.01 ∣ 99 | 1.00 ∣ 100 | ||
| BSA | 1.00 ∣ 100 | 1.00 ∣ 100 | 1.00 ∣ 100 | ||
| 0.7 | SGL | 2.48 ∣23 | 2.48 ∣23 | 2.64 ∣14 | |
| VPWBS | 1.01 ∣ 97 | 1.04 ∣ 96 | 0.99 ∣ 95 | ||
| DPA | 1.04 ∣ 96 | 1.02 ∣ 98 | 1.03 ∣ 97 | ||
| BSA | 1.00 ∣ 100 | 1.00 ∣ 100 | 1.00 ∣ 100 | ||
| Location | 0.5 | SGL | - | - | - |
| Mean∣MSE(10−4) | VPWBS | 0.497 ∣ 2.244 | 0.500 ∣ 3.590 | 0.496 ∣ 6.322 | |
| DPA | 0.499 ∣ 1.836 | 0.503 ∣ 2.256 | 0.499 ∣ 3.254 | ||
| BSA | 0.498 ∣ 2.446 | 0.501 ∣ 3.779 | 0.499 ∣ 6.445 | ||
| FSA | 0.495 ∣ 19.21 | 0.496 ∣ 9.326 | 0.493 ∣ 11.18 | ||
| 0.7 | SGL | - | - | - | |
| VPWBS | 0.699 ∣ 2.442 | 0.701 ∣ 6.963 | 0.693 ∣ 7.710 | ||
| DPA | 0.694 ∣ 3.933 | 0.693 ∣ 4.630 | 0.691 ∣ 5.086 | ||
| BSA | 0.691 ∣ 6.884 | 0.688 ∣ 14.29 | 0.686 ∣ 11.28 | ||
| FSA | 0.684 ∣ 25.09 | 0.678 ∣ 43.94 | 0.666 ∣ 37.07 | ||
Table 3:
Single change point detection for Case 2 with Σ = Σ* under various dimensions and change point locations. The numerical results are based on 100 replications.
| | τ10 | Method | p = 200 | p = 300 | p = 400 |
|---|---|---|---|---|---|
| Number | 0.5 | SGL | 1.14 ∣ 32 | 1.30 ∣ 41 | 1.52 ∣ 40 |
| Mean∣Rate(%) | VPWBS | 1.02 ∣ 98 | 1.04 ∣ 96 | 1.03 ∣ 98 | |
| DPA | 1.01 ∣ 99 | 1.00 ∣ 100 | 1.00 ∣ 100 | ||
| BSA | 1.00 ∣ 100 | 1.00 ∣ 100 | 1.00 ∣ 100 | ||
| 0.7 | SGL | 0.72 ∣ 20 | 1.16 ∣ 26 | 1.14 ∣36 | |
| VPWBS | 0.97 ∣ 97 | 1.01 ∣ 99 | 1.05 ∣ 96 | ||
| DPA | 1.00 ∣ 100 | 1.02 ∣98 | 1.01 ∣ 99 | ||
| BSA | 0.99 ∣ 99 | 1.00 ∣98 | 1.01 ∣ 99 | ||
| Location | 0.5 | SGL | - | - | - |
| Mean∣MSE | VPWBS | 0.504 ∣ 5.056 | 0.495 ∣ 3.146 | 0.498 ∣ 3.572 | |
| DPA | 0.498 ∣ 2.145 | 0.502 ∣ 1.423 | 0.499 ∣ 3.670 | ||
| BSA | 0.493 ∣ 5.897 | 0.498 ∣ 3.347 | 0.498 ∣ 4.326 | ||
| FSA | 0.495 ∣ 12.21 | 0.489 ∣ 20.10 | 0.492 ∣ 9.374 | ||
| 0.7 | SGL | - | - | - | |
| VPWBS | 0.694 ∣ 5.959 | 0.694 ∣ 7.297 | 0.701 ∣ 6.950 | ||
| DPA | 0.690 ∣ 5.748 | 0.690 ∣ 6.411 | 0.691 ∣ 12.75 | ||
| BSA | 0.689 ∣ 9.419 | 0.694 ∣ 3.973 | 0.688 ∣ 11.44 | ||
| FSA | 0.689 ∣ 22.41 | 0.683 ∣ 24.20 | 0.676 ∣ 27.66 | ||
4.4. Multiple change point models
Finally we consider the scenario where (β(i))1≤i≤n has multiple change points at locations τ0 = (0.25, 0.5, 0.75)⊤, so that k0 = 3. We set the sample size n = 1000 and the data dimension p ∈ {200, 300, 400}. For s ∈ S(1), we set βs(1) to a fixed non-zero value. Then, for s ∈ S(j) with j = 2, 3, 4, we set βsj(j) = βsj(j − 1) + (j − 1)δsj, where δsj denotes the signal jump. For each replication, the support set S(j) of the regression coefficients is randomly selected from the set {1, 2, …, 0.3p}, j = 1, 2, 3, 4.
We first analyze the efficiency. The results are similar to the other cases. Figure 4 shows that the computational cost of BSA grows gradually with the number of change points. In contrast, the cost of the DPA grows substantially. This suggests that the efficiency of BSA is not sensitive to the number of change points. To compare accuracy, similarly to the previous analysis, we record the percentage of replications in which DPA and BSA can correctly identify the three change points. As shown in Tables 4 and 5, DPA generally has the best performance. VPWBS performs slightly better than BSA. However, there is not much difference in performance among DPA, BSA, and VPWBS in terms of identifying the number and locations of multiple change points.
Figure 4:
Efficiency of change point estimation under the model of multiple change points with n ∈ {200, 400, 600, 800, 1000} and p = 2n. The left panel shows the computational costs of BSA per replication in the settings with different numbers of change points. The right panel shows computational cost of DPA per replication in the settings with different numbers of change points.
Table 4:
Multiple change point detection for Case 1 with Σ = Ip×p under various dimensions. The numerical results are based on 100 replications.
| τ0 = (0.25, 0.5, 0.75)⊤ | | p | SGL | VPWBS | DPA | BSA |
|---|---|---|---|---|---|---|
| 200 | 3.19 ∣ 55 | 3.02 ∣ 98 | 3.00 ∣ 100 | 2.95 ∣ 95 | ||
| number | Mean ∣ Rate(%) | 300 | 3.47 ∣ 55 | 3.03 ∣ 95 | 3.00 ∣ 100 | 2.98 ∣ 98 |
| 400 | 3.50 ∣ 44 | 3.04 ∣ 93 | 3.00 ∣ 100 | 2.94 ∣ 94 | ||
| 200 | - | 0.248 ∣ 1.108 | 0.249 ∣ 0.665 | 0.249 ∣ 1.821 | ||
| location 1 | Mean ∣ MSE (10−5) | 300 | - | 0.250 ∣ 2.100 | 0.250 ∣ 0.865 | 0.247 ∣ 4.168 |
| 400 | - | 0.249 ∣ 3.913 | 0.251 ∣ 1.847 | 0.256 ∣ 5.736 | ||
| 200 | - | 0.499 ∣ 0.894 | 0.500 ∣ 0.353 | 0.499 ∣ 2.099 | ||
| location 2 | Mean ∣ MSE (10−5) | 300 | - | 0.499 ∣ 1.204 | 0.501 ∣ 0.508 | 0.499 ∣ 6.148 |
| 400 | - | 0.498 ∣ 2.490 | 0.500 ∣ 0.981 | 0.500 ∣ 6.736 | ||
| 200 | - | 0.751 ∣ 0.701 | 0.750 ∣ 1.002 | 0.749 ∣ 1.689 | ||
| location 3 | Mean ∣ MSE (10−5) | 300 | - | 0.750 ∣ 1.370 | 0.751 ∣ 2.481 | 0.745 ∣ 7.968 |
| 400 | - | 0.748 ∣ 2.268 | 0.749 ∣ 0.688 | 0.748 ∣ 2.258 | ||
Table 5:
Multiple change point detection for Case 2 with Σ = Σ* under various dimensions. The numerical results are based on 100 replications.
| τ0 = (0.25, 0.5, 0.75)⊤ | | p | SGL | VPWBS | DPA | BSA |
|---|---|---|---|---|---|---|
| 200 | 2.65 ∣ 30 | 3.00 ∣ 100 | 3.00 ∣ 100 | 2.99 ∣ 99 | ||
| number | Mean ∣ Rate(%) | 300 | 2.08 ∣ 17 | 2.97 ∣ 99 | 3.00 ∣ 100 | 2.98 ∣ 98 |
| 400 | 2.47 ∣ 25 | 3.00 ∣ 100 | 3.00 ∣ 100 | 2.98 ∣ 98 | ||
| 200 | - | 0.247 ∣ 2.435 | 0.250 ∣ 1.961 | 0.251 ∣ 3.457 | ||
| location 1 | Mean ∣ MSE (10−5) | 300 | - | 0.249 ∣ 3.164 | 0.251 ∣ 2.844 | 0.250 ∣ 6.585 |
| 400 | - | 0.251 ∣ 1.687 | 0.251 ∣ 2.497 | 0.249 ∣ 4.994 | ||
| 200 | - | 0.487 ∣ 2.040 | 0.500 ∣ 1.069 | 0.499 ∣ 2.667 | ||
| location 2 | Mean ∣ MSE (10−5) | 300 | - | 0.502 ∣ 1.403 | 0.501 ∣ 0.894 | 0.499 ∣ 4.363 |
| 400 | - | 0.499 ∣ 2.212 | 0.501 ∣ 3.613 | 0.500 ∣ 4.192 | ||
| 200 | - | 0.748 ∣ 1.242 | 0.750 ∣ 0.711 | 0.748 ∣ 1.674 | ||
| location 3 | Mean ∣ MSE (10−5) | 300 | - | 0.752 ∣ 1.672 | 0.750 ∣ 0.486 | 0.749 ∣ 1.398 |
| 400 | - | 0.747 ∣ 4.511 | 0.750 ∣ 0.580 | 0.747 ∣ 4.881 | ||
It is worth mentioning that our proposed methods have good performance in all three models with various data dimensions, sample sizes, and numbers of change points, which suggests that the methods are robust to the suggested choice of tuning parameters λ and γ.
5. REAL DATA ANALYSIS
To illustrate the usefulness of our proposed methods, we consider the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (www.loni.ucla.edu/ADNI). The ADNI dataset contains disease state information on different subjects, including normal controls (NC), mild cognitive impairment (MCI), and Alzheimer’s disease (AD), as well as some biological markers, including features derived from magnetic resonance imaging (MRI) and positron emission tomography (PET). Studying how to measure the progression of AD using these images is very useful for clinical diagnosis and prevention. For example, in the AD-related literature (see, for example, Reiss & Ogden, 2010), it is popular to use structural MRI or PET to predict the current disease status of a patient (a binary response variable), which can be regarded as a classification problem. Usually, the data are treated as homogeneous and the effect of other covariates, such as age and gender, is ignored. Hence, an interesting question is whether the generalized linear structure between the disease status and the biomarkers (MRI or PET) changes due to other covariates. If it changes, how can one estimate (a) the number of change points, (b) the locations of change points, and (c) the regression coefficients (selected variables) in each segment? In our study, we address these issues by detecting change points in the generalized linear structure between the disease status and the MRI features together with some covariates. In this application, we choose age as the covariate, which is of particular interest in AD studies.
We use the MRI data of 405 subjects, including 220 normal controls and 185 AD patients, from the ADNI data. For each subject, we obtain the corresponding status (AD/NC), age, and 93 MRI features after applying the data processing method proposed in Zhang & Shen (2012). In our model, the predictive variables X = (X1, …, X93) are the 93 MRI features, which are scaled to have mean 0 and variance 1, and the response variable is the binary status obtained by setting Yi = 1{subject i is an AD patient}, and 0 otherwise. For this dataset, with subjects ordered by age, we consider the logistic regression model P(Yi = 1 ∣ Xi) = exp(Xi⊤β(i))/{1 + exp(Xi⊤β(i))}. Our goal is to detect potential change points of the regression coefficients β(i) as age varies. Taking the effect of different samples into consideration, in our analysis we divide the data into two parts: training and testing datasets. More specifically, we randomly select 40 subjects from the whole set of 405 subjects according to the empirical distribution of age in Figure 5 (left) as the testing sample and use the remaining 365 subjects as the training data. Then we sort those 365 observations in the training data by age and use BSA to estimate the number and locations of change points. We choose the tuning parameters λ and γ as suggested in Section 4. The above process is repeated 100 times. Since BSA is more computationally efficient than DPA, we use only BSA to analyze the data.
Figure 5:
The left panel shows distribution of age among 405 subjects. The right panel shows estimated numbers of change points based on 100 replications.
Figure 5 (right) shows the estimated numbers of change points over the 100 replications. To be specific, among the 100 replications, 80% of the estimated numbers of change points equal 1, which suggests that there is a change point in the logistic regression relationship between the disease status and the MRI features due to the change of age. Moreover, Table 6 summarizes the change point estimation results computed by BSA. For the change point estimation, 90% of the change points estimated by BSA are located at 80 years old, and the mean is 79.95 years old. This implies that there are significant differences between the regression models relating the disease status to the MRI features for subjects under 80 and over 80 years old.
Table 6:
Change point estimation and prediction for ADNI data. The change point estimator is obtained by using BSA based on 100 replications.
| training/testing sample | methods | location of change point | Averaged MSE |
|---|---|---|---|
| 365/40 | BSA | 79.75 | 0.380 |
| | Lasso-based | - | 0.408 |
In terms of prediction, Table 6 provides the prediction results computed by BSA and the Lasso-based method, where for each replication, we use the training sample to select models and use the testing sample to predict. Note that the Lasso-based method treats the data as homogeneous while we consider a heterogeneous model. For the prediction result, we calculate the predictive MSE on the testing set for these two methods. Our proposed method obtains better prediction performance, which is demonstrated by a 7% lower averaged predictive MSE than that of the Lasso-based method. This suggests that treating the data as heterogeneous and using our method to select models can predict better.
For variable selection, Figure 6 shows the selected features before and after the estimated change point of 80 years old. We see that the models both under 80 and over 80 select the 69th and 83rd features, which correspond to the hippocampal and amygdala regions, respectively. These regions are known to be related to AD from many previous studies (Zhang & Shen, 2012). Moreover, a few different features are selected separately by the two models under 80 and over 80. We believe these features deserve more scientific attention, and more research is needed to study their associations with AD together with age.
Figure 6:
Frequency of features selected before the change point (left) and after the change point (right) for the ADNI dataset based on 100 replications. Features selected by models both before and after the change point are shown in red.
6. SUMMARY
In this paper, we provide a three-step procedure for change point detection in the context of high-dimensional GLMs, where the dimension p can be much larger than the sample size n. It is worth mentioning that our proposed method can automatically account for the underlying data generation mechanism (k0 = 0 or k0 ≥ 1) without specifying any prior knowledge about the number of change points k0. Moreover, based on dynamic programming and binary segmentation techniques, two algorithms, DPA and BSA, are proposed to detect multiple change points. To further improve the computational efficiency, we present a much more efficient algorithm designed for the case of a single change point. Furthermore, we investigate the theoretical properties of our proposed change point estimators computed by the three algorithms. Estimation consistency for the number and locations of change points is established. Finally, we demonstrate the efficiency and accuracy of our proposed methods by extensive numerical results under various model settings. A real data application to the ADNI dataset also demonstrates the usefulness of our proposed methods.
Table 1:
Change point detection for Cases 1 and 2 under the model with no change point. The numerical results are based on averages of 100 replications.
| Case | Measurement | p | SGL | VPWBS | Lee2011 | DPA | BSA |
|---|---|---|---|---|---|---|---|
| 200 | 0.65 | 0.01 | 0.73 | 0.00 | 0.00 | ||
| Σ = I | error rate | 300 | 0.85 | 0.01 | 0.71 | 0.00 | 0.00 |
| 400 | 0.86 | 0.01 | 0.74 | 0.00 | 0.00 | ||
| 200 | 0.38 | 0.00 | 0.71 | 0.00 | 0.00 | ||
| Σ = Σ* | error rate | 300 | 0.44 | 0.01 | 0.71 | 0.00 | 0.00 |
| 400 | 0.47 | 0.00 | 0.72 | 0.00 | 0.00 | ||
Acknowledgements
The authors thank the Editor, the Associate Editor and the reviewers, whose helpful comments and suggestions led to a much improved presentation. This research is supported in part by the National Natural Science Foundation of China Grant 11971116 (Zhang), 12101132 (Bin Liu), and US National Institute of Health Grant R01GM126550 and National Science Foundation Grants DMS1821231 and DMS2100729 (Yufeng Liu).
APPENDIX
Notation
We introduce some notation. The random part, empirical process, is defined as . We recall the Lasso estimator is, for j = 1, …, k, . We write and for convenience. Recall that for any (u, v), the oracle β* is defined as , which is the best approximation of β0 under the compatibility condition, β* = β0, if there is no change point between u and v. The estimation error is denoted as ϵ* := vn(β*) = (Pn − P)ρfβ*. We define, for some positive constant L > 0, ZL := sup∥β−β*∥1≤L ∣vn(β) − vn (β*)∣ . We set L* := ϵ*/λ0, and require a relatively small L*, so this indicates that ϵ* ≤ λ0. Based on this, for any u, v ∈ Vn := {i/n : i =1, …, n} with u < v, we define two important sets as follows:
| (1) |
| (2) |
We introduce the following assumptions.
Assumption F. We require some technical conditions as follows:
(F1) , where .
(F2) , where d* = O(1) with detailed definition introduced in the Appendix .
(F3) , where .
(F4) .
Assumption G. We require some conditions for achieving limit properties for the de-biased Lasso estimator:
(G1) The derivatives exist for all y, a, and for some δ-neighborhood (δ > 0), is Lipschitz:
Moreover,
(G2) It holds that .
(G3) It holds that and
(G4) It holds that and moreover .
(G5) For every j, the random variable converges weakly to a N(0,1)-distribution.
Useful Lemmas
We introduce some useful lemmas that are essential for our main results. More specifically, Lemma 1 presents an upper bound on the difference of the subinterval penalized empirical average of the loss function based on the Lasso and oracle estimators. Corollary 1 shows that the inequality in Lemma 1 holds with high probability. Lemma 2 provides the results for the subinterval based on the compatibility condition. The margin condition based on the oracle estimator is updated in Lemma 3. Lemma 4 presents a lower bound on the difference of the loss function based on the oracle estimators of adjacent subintervals. Finally, Lemma 5 provides an upper bound on the difference of the subinterval penalized empirical average of the loss function based on the oracle estimator and the truth. Next, we introduce these useful lemmas in detail.
Lemma 1 (Oracle inequality for the Lasso) Suppose Assumptions A-F hold for all , as well as . Suppose that λ satisfies the inequality . Then on the set given in (1)-(2), we have
| (3) |
where there exists a constant C3 > 0 such that ϵ* ≤ C3s* log(p)/n.
Proof. Since the Assumptions hold, on , , we have
by Theorem 6.4 in Bühlmann & van de Geer (2011). By the definition of Pnρ, Pρ, and v(β) introduced in Section 3.1, we can obtain
If the condition holds, on the set , we can have
which completes the proof.
Corollary 1 Suppose Assumptions A-F hold. Let and
Then, with probability at least , Equation (3) holds. For the proof, we refer to Theorem 2.1 in van de Geer (2008).
Lemma 2 Suppose Assumption D and hold. Then on the set , we have, for all u, v ∈ {i/n : i = 1, …, n} and all β with ∥β_{S*^c}∥1 ≤ 3∥β_{S*}∥1,
Proof. By Assumption D (the compatibility condition), for any u, v ∈ {i/n : i = 1, …, n}, we have
for all β that satisfy ∥β_{S*^c}∥1 ≤ 3∥β_{S*}∥1. Then the matrix (v − u)Σ satisfies the compatibility condition for the set S* with constant . By Corollary 6.8 in Bühlmann & van de Geer (2011), if , the compatibility condition also holds for the set S* and the matrix , with . Then we can obtain, for all β that satisfy ∥β_{S*^c}∥1 ≤ 3∥β_{S*}∥1,
Lemma 3 By Assumption B, there exists an η ≥ 0 and a strictly convex increasing G, such that, for all with ∥fβ1 − f0∥∞ ≤ η/2, ∥fβ − f0∥∞ ≤ η, and ∥β* − β0∥ ≤ η, we have
Proof. According to Assumption C (margin condition), we can directly have
By the definition of , v(β) introduced in Notation and the triangle inequality, we have
By the definition of β*, under the compatibility condition, β* is the best approximation of β0: β* = β0. Then we have
Lemma 4 Suppose k0 > 1 and that Assumptions A-F and hold. Then on , if and for some j = 2, …, , we have
Proof. By Lemmas 2 and 3, if and for some j = 2, …, , we have
Now observe that
Then
by Assumption E. If and for some j = 2, …, , we have
| (4) |
Then, by the above Equation (4) and straightforward calculations, we can obtain
As , we can complete the proof. ■
Lemma 5 Suppose Assumptions A-F hold and let for some j = 1, …, , and , . Then on the set we have
where , .
Proof. First, by straightforward calculations, we obtain
According to Lemma 1 and on the set , we can obtain
| (5) |
Now, we bound the bias between and , with . Since , by Assumption E, we have that
| (6) |
Furthermore, combining Assumption A, Assumption C, Equation (6), and the Cauchy-Schwarz inequality, we have
| (7) |
Combining Equations (5) and (7) completes the proof. ■
Proof of Theorem 1. To simplify the notation, we denote the value of the penalized total loss function corresponding to the change point vector by
| (8) |
First, we will show that, if the assumptions hold, we must have and . Suppose, on the contrary, that does not satisfy the above two results. We can distinguish three possible cases:
Case 1: The change point number is overestimated, . There exists some i, 1 ≤ i ≤ , such that for some j, 1 ≤ j ≤ .
Case 2: The change point number is underestimated, . For some , we have and .
Case 3: The change point number is correctly estimated, . However, for some , we have and .
We first consider Case 1, where we have and there exists some i, such that for some . We define
Then we get a new change point vector τ with . Denote the intervals by , and , and then we obtain
| (9) |
By the definition of the Lasso estimator and the triangle inequality, we can directly have
| (10) |
Then, combining Equations (9)-(10), we can directly have
| (11) |
Using some straightforward calculations and the triangle inequality, we have
| (12) |
By Lemma 5 and the above Equation (12), with (u, v) = S1, S2, we have
| (13) |
Also by Lemma 5, for the second term in Equation (10), we can directly have
| (14) |
and therefore, by combining Equation (11) and Equations (13)-(14), we can easily obtain
| (15) |
According to Assumption (F2), we obtain , which is a contradiction.
For Case 2, where we have , we define a new change point vector , that is,
| (16) |
where . We obtain a new change point vector τ with . We also denote the intervals by , and ; then we have
| (17) |
By Lemma 1, for u, v ∈ Vn, we can obtain . Thus, by this inequality (with (u, v) = S1, S2, and S), the triangle inequality, and Lemma 4, we have
| (18) |
Then, by combining the above Equations (17)-(18), we can directly have
| (19) |
by Assumption (F3), we have , which is a contradiction.
For Case 3 with , we must add some points and remove others to obtain a good change point estimator. Then we define with . We denote the intervals by , , and , and then we can obtain
| (20) |
Without loss of generality, we assume |S1| < δ and define a new partition . Denoting and I = K1 ∪ S1, we have that
| (21) |
Thus, by combining Equations (20) and (21) with straightforward calculations, we have
by the definition of the Lasso estimator and the triangle inequality, we can obtain
| (22) |
By Equation (22), Lemma 1, Lemma 4, and straightforward calculations, we have
According to Assumption (F4), we can obtain , which is a contradiction. Combining the three cases above, the first two results (1) and (2) in Theorem 1 have been proved. Result (3) in Theorem 1 can be obtained directly by combining result (2) with Lemma 5.
Now we prove the fourth result in Theorem 1. Using the fact that, by condition , we have
where
it follows that
By the proof of Theorem 3.1 in van de Geer et al. (2014), we have . According to the third result in Theorem 1, we have . Then, it follows that
since the condition holds. By straightforward calculations, we have
By the proof of Theorem 3.1 in van de Geer et al. (2014), we can easily conclude that
where and . ■
Proof of Theorem 2. First, we will show that, under the conditions of Theorem 1, on , we can distinguish three cases:
Case A: The change point number is overestimated, . We have and .
Case B: The change point number is underestimated, . For , we have h(0, 1) and .
Case C: The change point number is correctly estimated, . For , we have h(0, 1) ∈ [δ, 1 − δ].
This fact can be derived straightforwardly from the proof of Theorem 1, as the objective functions coincide for 1 or 2 segments; that is, for all u ∈ [0, 1],
For Case A, suppose and τ = (0, 1). As in the proof of Case 1 in Theorem 1, we can obtain
and h(0, 1) = 0. We suppose and . We define τ(0) = (0, h(0, 1), 1), , τ(2) = τ(1)\{h(0, 1)}.
For Case B, where h(0, 1) = 0, as in the proof of Case 2 in Theorem 1, we can obtain H(τ(0)) > H(τ(1)). For Case C, where h(0, 1) ∈ [δ, 1 − δ], as in the proof of Case 3 in Theorem 1, we can obtain
Since h(0, 1) minimizes Equation (17), all three cases result in a contradiction. We can then replace (0, 1) by each sub-interval and obtain the same results, which completes the proof. ■
Proof of Theorem 3. First, we recall our proposed statistics as follows:
Then, for convenience, we consider two cases for the change point location denoted by τ:
Case E: 0 < τ ≤ 1/4,
Case F: 1/4 < τ < 1/2.
Next, we will discuss each case in detail. Case E: In this case, 0 < τ ≤ 1/4, so we have
We observe that there is no change point in (1/4, 1). Hence the Lasso estimators on any subinterval (u, v) ⊂ (1/4, 1) target the same coefficient vector, so we can replace by or , and it follows that
| (23) |
We denote the intervals by , , and . Then Equation (23) can be rearranged as follows:
Then, using the same argument as in Case 1 of the proof of Theorem 1, we can obtain W1/4 < W3/4.
Case F: In this case, 1/4 < τ < 1/2, and we observe that there is no change point within the two intervals (0, 1/4) and (3/4, 1). Then, by straightforward calculations, we can obtain
We denote the intervals by (0, 1/4) = K1, (1/4, 3/4) = J1, (3/4, 1) = J2, and it follows that
Then, using the same argument as in Case 3 of the proof of Theorem 1, we can obtain W1/4 < W3/4. ■
Footnotes
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
BIBLIOGRAPHY
- Aue A, Hörmann S, Horváth L, & Reimherr M (2009). Break detection in the covariance structure of multivariate time series models. The Annals of Statistics, 37(6B), 4046–4087.
- Barigozzi M, Cho H, & Fryzlewicz P (2018). Simultaneous multiple change-point and factor analysis for high-dimensional time series. Journal of Econometrics, 206(1), 187–225.
- Bühlmann P, & van de Geer S (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
- Bickel PJ, Ritov Y, & Tsybakov AB (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4), 1705–1732.
- Boysen L, Kempe A, Liebscher V, Munk A, & Wittich O (2009). Consistencies and rates of convergence of jump-penalized least squares estimators. The Annals of Statistics, 37(1), 157–183.
- Braun JV, Braun RK, & Müller HG (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2), 301–314.
- Candes E, & Tao T (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351.
- Cho H, & Fryzlewicz P (2012). Multiscale and multilevel technique for consistent segmentation of nonstationary time series. Statistica Sinica, 22(1), 207–229.
- Cho H, & Fryzlewicz P (2015). Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2), 475–507.
- Ciuperca G (2014). Model selection by LASSO methods in a change-point model. Statistical Papers, 55(2), 349–374.
- Fan J, & Lv J (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101–148.
- Fan J, Lv J, & Qi L (2011). Sparse high-dimensional models in economics. Annual Review of Economics, 3(1), 291–317.
- Fan J, & Peng H (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3), 928–961.
- Fong Y, Di C, & Permar S (2015). Change point testing in logistic regression models with interaction term. Statistics in Medicine, 34(9), 1483–1494.
- Frick K, Munk A, & Sieling H (2014). Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3), 495–580.
- Fryzlewicz P (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6), 2243–2281.
- Harchaoui Z, & Lévy-Leduc C (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492), 1480–1493.
- Jirak M (2015). Uniform change point tests in high dimension. The Annals of Statistics, 43(6), 2451–2483.
- Kirch C, Muhsal B, & Ombao H (2015). Detection of changes in multivariate time series with application to EEG data. Journal of the American Statistical Association, 110(511), 1197–1216.
- Lavielle M, & Teyssiére G (2006). Detection of multiple change-points in multivariate time series. Lithuanian Mathematical Journal, 46(3), 287–306.
- Lee S, & Seo MH (2008). Semiparametric estimation of a binary response model with a change-point due to a covariate threshold. Journal of Econometrics, 144(2), 492–499.
- Lee S, Seo MH, & Shin Y (2011). Testing for threshold effects in regression models. Journal of the American Statistical Association, 106(493), 220–231.
- Lee S, Seo MH, & Shin Y (2016). The lasso for high dimensional regression with a possible change point. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1), 193–210.
- Leonardi F, & Bühlmann P (2016). Computationally efficient change point detection for high-dimensional regression. arXiv preprint arXiv:1601.03704.
- Li D, Qian J, & Su L (2016). Panel data models with interactive fixed effects and multiple structural breaks. Journal of the American Statistical Association, 111(516), 1804–1819.
- Liu B, Zhang X, & Liu Y (2019). Simultaneous change point detection and identification for high dimensional linear models, submitted.
- Page ES (1955). A test for a change in a parameter occurring at an unknown point. Biometrika, 42(3/4), 523–527.
- Pesaran MH, & Pick A (2007). Econometric issues in the analysis of contagion. Journal of Economic Dynamics and Control, 31(4), 1245–1277.
- Qian J, & Su L (2016). Shrinkage estimation of regression models with multiple structural changes. Econometric Theory, 32(6), 1376–1433.
- Raginsky M, Willett RM, Horn C, Silva J, & Marcia RF (2012). Sequential anomaly detection in the presence of noise and limited feedback. IEEE Transactions on Information Theory, 58(8), 5544–5562.
- Reiss PT, & Ogden RT (2010). Functional generalized linear models with images as predictors. Biometrics, 66(1), 61–69.
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
- Tibshirani R (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3), 273–282.
- Tickle SO, Eckley IA, Fearnhead P, & Haynes K (2020). Parallelization of a common changepoint detection method. Journal of Computational and Graphical Statistics, 29(1), 149–161.
- van de Geer SA (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645.
- van de Geer S, Bühlmann P, Ritov YA, & Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.
- Wang D, Lin K, & Willett R (2021a). Statistically and computationally efficient change point localization in regression settings. arXiv preprint arXiv:1906.11364.
- Wang D, Yu Y, & Rinaldo A (2021b). Optimal covariance change point localization in high dimensions. Bernoulli, 27(1), 554–575.
- Wang T, & Samworth RJ (2018). High dimensional change point estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1), 57–83.
- Zhang D, & Shen D (2012). Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage, 59(2), 895–907.
- Zhang T, & Lavitas L (2018). Unsupervised self-normalized change-point testing for time series. Journal of the American Statistical Association, 113(522), 637–648.