Consistent Model Selection in Segmented Line Regression

Jeankyung Kim; Hyune-Ju Kim

doi:10.1016/j.jspi.2015.09.008

Author manuscript; available in PMC: 2017 Mar 1.

Published in final edited form as: J Stat Plan Inference. 2016 Mar 1;170:106–116. doi: 10.1016/j.jspi.2015.09.008


PMCID: PMC4742379 NIHMSID: NIHMS729695 PMID: 26858507

Abstract

The Schwarz criterion or Bayes Information Criterion (BIC) is often used to select a model dimension, and some variations of the BIC have been proposed in the context of change-point problems. In this paper, we consider a segmented line regression model with an unknown number of change-points and study asymptotic properties of Schwarz type criteria in selecting the number of change-points. Noting the overestimating tendency of the traditional BIC observed in some empirical studies, and motivated by the asymptotic behavior of the modified BIC proposed by Zhang and Siegmund (2007), we consider a variation of the Schwarz type criterion that applies a harsher penalty, equivalent to that of a model with one additional unknown parameter per segment. For the segmented line regression model without the continuity constraint, we prove the consistency of the number of change-points selected by the criterion with this type of modification and summarize simulation results that support the consistency. Further simulations are conducted for the model with the continuity constraint, and we empirically observe that the asymptotic behavior of this modified version of BIC is comparable to that of the criterion proposed by Liu, Wu, and Zidek (1997).

Keywords: Bayes Information Criterion, Segmented line regression, Model selection

1 Introduction

A main concern in regression model selection is how to select the “best” set of independent variables, and the two major approaches to model selection are hypothesis testing and information criteria. In the context of change-point problems, both approaches have been applied to select the number of change-points, and their analytic and empirical properties have been investigated by many researchers. One of the most widely used information criteria is the Bayes Information Criterion (BIC) proposed by Schwarz (1978). This Schwarz criterion selects the model dimension by finding the Bayes solution that maximizes the posterior probability of the model, and Schwarz (1978) derived the following criterion by evaluating the leading terms of its asymptotic expansion:

SC (p) = \sup_{θ_{p}} \log (lik (θ_{p})) - \frac{p}{2} \log n = \log (lik ({\hat{θ}}_{p})) - \frac{p}{2} \log n,

where lik(θ_p) is the likelihood function of θ_p for the model with dimension p and ${\hat{θ}}_{p}$ is the maximum likelihood estimator of θ_p. As in general information criteria, the Schwarz criterion has two parts, the log of the maximized likelihood function and the penalty function that penalizes for the model dimension, and the method selects the model that maximizes SC(p). Note that its validity is established in Schwarz (1978) for “the case of independent, identically distributed observations, and linear models.”
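
For a linear model with Gaussian errors, the criterion can be written in terms of the residual sum of squares. The following is a brief sketch under that assumption; RSS denotes the residual sum of squares of the fitted model of dimension p, and the variance is replaced by its maximum likelihood estimator RSS/n. These symbols are introduced here for illustration only.

```latex
\begin{aligned}
\log(lik(\hat{\theta}_p))
  &= -\frac{n}{2}\log\bigl(2\pi\hat{\sigma}^2\bigr) - \frac{RSS}{2\hat{\sigma}^2}
   = -\frac{n}{2}\log\bigl(RSS/n\bigr) - \frac{n}{2}\bigl(1 + \log 2\pi\bigr),
   \qquad \hat{\sigma}^2 = RSS/n, \\
-\frac{2}{n}\,SC(p)
  &= \log\bigl(RSS/n\bigr) + p\,\frac{\log n}{n} + \bigl(1 + \log 2\pi\bigr).
\end{aligned}
```

The last term does not depend on p, so maximizing SC(p) is equivalent to minimizing log(RSS/n) + p log(n)/n. For the same reason, any part of p that is common to all candidate models shifts the criterion by a constant and does not affect the selected model.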

Yao (1988) studied the problem of selecting the number of change-points in the means of normally distributed random variables, where the total number of unknown parameters for the model with k change-points is p = 2(k + 1). For the number of change-points estimated by minimizing –SC(p) with p = 2k, Yao (1988) proved its consistency. Lee (1997) considered a similar type of criterion to select the number of change-points in a sequence of random variables from an exponential family distribution. Under some mild conditions on the spacings of successive change-points, Lee (1997) proved the consistency of the number of change-points estimated by the Schwarz type criterion whose penalty term is greater than 2k(1 + ε₀) log n for some ε₀ > 0. Zhang and Siegmund (2007) noted that the use of the Schwarz criterion “is not theoretically justified” in their situation due to irregularities in the likelihood function and proposed a modified BIC, derived as an asymptotic approximation of the Bayes factor, to determine the number of change-points in the means of normally distributed random variables. For other types of modifications and applications for detecting mean changes, see Ninomiya (2005), Pan and Chen (2006), and Hannart and Naveau (2012).

In the context of segmented line regression, similar approaches have been proposed to select the number of change-points. Kim et al. (2000) proposed the permutation test to select the number of change-points in the segmented line regression model where segments are assumed to be continuous at change-points, called the joinpoint regression model in their paper. Kim et al. (2009) considered the traditional BIC,

BIC (k) = - \frac{2}{n} SC (2 k) = \log ({RSS}_{k} ∕ n) + 2 k \frac{\log n}{n},

where RSS_k is the residual sum of squares for the model with k change-points, and compared its performance with those of the permutation test procedure of Kim et al. (2000) and the method based on generalized cross validation used in MARS of Friedman (1991). Note that the penalty term of $2 k \frac{\log n}{n}$ is chosen based on 2k + 3 unknown parameters for the joinpoint regression model with k change-points. Liu et al. (1997) considered a general segmented line regression model allowing a discontinuity at the change-point and non-Gaussian errors, proposed a penalty term with a bigger order than that of BIC(k), and proved the consistency of the dimension selected by minimizing their criterion:

MIC (k) = \log ({RSS}_{k} ∕ (n - p^{*})) + p^{*} \frac{c_{0} {(\log n)}^{2 + δ_{0}}}{n},

where p* = p*(k) = (k + 1)p + k for the model with k change-points and p covariates, and c₀ and δ₀ are positive constants. Two Bayesian model selection methods based on the Bayes factor and a Bayesian version of BIC were developed in Tiwari et al. (2005), who investigated their empirical properties via simulations and compared their performance with that of the permutation procedure of Kim et al. (2000). Martinez-Beneito et al. (2011) also proposed a Bayesian model selection method that provides posterior probabilities and is flexible enough to work with Poisson count data.

This paper is motivated by empirical results in which the traditional BIC showed a tendency to overestimate the number of change-points (see Table 1 of Kim et al. (2009) and Table 1 of Zhang and Siegmund (2007)). When the argument of Zhang and Siegmund (2007) is applied to segmented line regression, the penalty of the modified BIC is harsher than that of the traditional BIC, asymptotically corresponding to one additional unknown parameter per segment under some conditions. This motivated us to consider a BIC type criterion whose penalty is $4 k \frac{\log n}{n}$ for the segmented line regression model without the continuity constraint and $3 k \frac{\log n}{n}$ for the model with the continuity constraint. Note that for segmented line regression with k change-points, the number of unknown parameters is 3k + 3 for the model without the continuity constraint and 2k + 3 for the model with the continuity constraint. Let

{BIC}_{d} (k) = \log ({RSS}_{k} ∕ n) + P E_{d} (k) = \log ({RSS}_{k} ∕ n) + d k \frac{\log n}{n},

(1)

for penalty coefficients, d. Then the traditional BICs are BIC₃ for the unconstrained model and BIC₂ for the constrained model.
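
As a small illustration of criterion (1), the sketch below evaluates BIC_d from a list of residual sums of squares and returns the minimizing k. The function names and the RSS values are hypothetical and chosen only to show how the penalty coefficient d changes the selection; they are not taken from the paper.

```python
import math

def bic_d(rss, n, k, d):
    """BIC_d(k) = log(RSS_k / n) + d * k * log(n) / n, following equation (1)."""
    return math.log(rss / n) + d * k * math.log(n) / n

def select_k(rss_by_k, n, d):
    """Return the k in 0..K that minimizes BIC_d, where rss_by_k[k] = RSS_k."""
    scores = [bic_d(rss, n, k, d) for k, rss in enumerate(rss_by_k)]
    return min(range(len(scores)), key=scores.__getitem__)

# Made-up RSS values for k = 0, 1, 2, 3 with n = 100 (illustration only).
rss_values = [5.0, 1.0, 0.85, 0.8]
for d in (3, 4):
    print("d =", d, "selected k =", select_k(rss_values, n=100, d=d))
```

With these made-up values the smaller coefficient d = 3 selects k = 2, while d = 4 selects k = 1, since the modest drop in RSS from k = 1 to k = 2 no longer outweighs the larger penalty.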

Table 1.

Unconstrained model with k₀ = 1 and x_i = i/n for i = 1,..., n. E(Y|x) = (1 + x)I(x ≤ 0.5) + (1.35 + 0.5x)I(x > 0.5)

n	Prob	BIC₃ (σ = 0.1)	BIC₄ (σ = 0.1)	MIC (σ = 0.1)	BIC₃ (σ = 0.05)	BIC₄ (σ = 0.05)	MIC (σ = 0.05)
30	$\hat{P} (\hat{κ} < k_{0})$	$0.47 \overset{‒}{6}$	$0.76 \overset{‒}{3}$	$0.44 \overset{‒}{6}$	0.030	$0.12 \overset{‒}{3}$	$0.02 \overset{‒}{6}$
30	$\hat{P} (\hat{κ} = k_{0})$	0.360	0.210	$0.38 \overset{‒}{3}$	$0.72 \overset{‒}{3}$	0.830	0.730
30	$\hat{P} (\hat{κ} > k_{0})$	$0.16 \overset{‒}{3}$	$0.02 \overset{‒}{6}$	0.170	$0.24 \overset{‒}{6}$	$0.04 \overset{‒}{6}$	$0.24 \overset{‒}{3}$
50	$\hat{P} (\hat{κ} < k_{0})$	$0.38 \overset{‒}{6}$	0.670	0.450	0	0.010	$0.00 \overset{‒}{3}$
50	$\hat{P} (\hat{κ} = k_{0})$	$0.52 \overset{‒}{6}$	0.320	0.510	$0.85 \overset{‒}{3}$	$0.97 \overset{‒}{3}$	0.910
50	$\hat{P} (\hat{κ} > k_{0})$	$0.08 \overset{‒}{6}$	0.010	0.040	$0.14 \overset{‒}{6}$	$0.01 \overset{‒}{6}$	$0.08 \overset{‒}{6}$
75	$\hat{P} (\hat{κ} < k_{0})$	$0.19 \overset{‒}{3}$	$0.49 \overset{‒}{3}$	$0.34 \overset{‒}{6}$	0	$0.00 \overset{‒}{3}$	$0.00 \overset{‒}{3}$
75	$\hat{P} (\hat{κ} = k_{0})$	$0.74 \overset{‒}{6}$	$0.50 \overset{‒}{6}$	$0.62 \overset{‒}{3}$	$0.92 \overset{‒}{3}$	0.980	$0.97 \overset{‒}{6}$
75	$\hat{P} (\hat{κ} > k_{0})$	0.060	0	0.030	$0.07 \overset{‒}{6}$	$0.01 \overset{‒}{6}$	0.020
150	$\hat{P} (\hat{κ} < k_{0})$	$0.04 \overset{‒}{3}$	$0.14 \overset{‒}{6}$	0.130	0	0	0
150	$\hat{P} (\hat{κ} = k_{0})$	0.900	$0.84 \overset{‒}{6}$	$0.86 \overset{‒}{3}$	$0.95 \overset{‒}{3}$	0.990	0.990
150	$\hat{P} (\hat{κ} > k_{0})$	$0.05 \overset{‒}{6}$	$0.00 \overset{‒}{6}$	$0.00 \overset{‒}{6}$	$0.04 \overset{‒}{6}$	0.010	0.010
200	$\hat{P} (\hat{κ} < k_{0})$	$0.00 \overset{‒}{6}$	0.040	$0.04 \overset{‒}{3}$	0	0	0
200	$\hat{P} (\hat{κ} = k_{0})$	$0.96 \overset{‒}{3}$	$0.95 \overset{‒}{6}$	$0.95 \overset{‒}{3}$	$0.96 \overset{‒}{3}$	$0.99 \overset{‒}{6}$	$0.99 \overset{‒}{6}$
200	$\hat{P} (\hat{κ} > k_{0})$	0.030	$0.00 \overset{‒}{3}$	$0.00 \overset{‒}{3}$	$0.03 \overset{‒}{6}$	$0.00 \overset{‒}{3}$	$0.00 \overset{‒}{3}$


Our interest in this paper is in the asymptotic behavior of BIC_d, a simple model selection criterion whose penalty term has the same order as that of the traditional BIC, and we focus on the asymptotic properties of BIC₄ for the model without the continuity constraint and BIC₃ for the model with the continuity constraint. In Section 2, we formally introduce the unconstrained model and prove the consistency of the model dimension selected by BIC₄ for the unconstrained model with Gaussian errors. This result provides a consistent model selection criterion that imposes an asymptotically milder penalty than MIC does. In Section 3, we present the results of a simulation study where we compare the performance of BIC₄ with those of BIC₃ and MIC. Section 4 includes empirical results and discussion of the constrained case, where the segments are constrained to be continuous at the change-points. Further discussion is presented in Section 5.

2 Selection Methods and Consistency: Unconstrained Model

Suppose that we observe (x₁, y₁), . . ., (x_n, y_n) and consider a segmented line regression model such that

y_{i} = β_{j, 0} + β_{j, 1} x_{i} + ∊_{i}, if τ_{j - 1} < x_{i} \leq τ_{j} (j = 1, \dots, κ + 1),

(2)

where κ is the unknown number of change-points, the τ's are unknown change-points with $τ_{0} = \min_{i} x_{i} - \frac{1}{n}$ and $τ_{κ + 1} = \max_{i} x_{i}$, and the ε_i are independent N(0, σ²).

Let us consider the Schwarz type criterion (1) defined above and estimate κ as

\hat{κ} = {argmin}_{0 \leq k \leq K} {BIC}_{d} (k),

where K is a pre-determined maximum number of change-points. For the segmented line regression model without the continuity constraint, (2), we consider BIC₃ as the traditional BIC based on 3k + 3 unknown parameters, and study the large sample properties of BIC₄, which is a modified version motivated by Zhang and Siegmund (2007).

Liu et al. (1997) considered a segmented line regression model that allows a discontinuity in the mean function and non-Gaussian errors, and their main idea in proving the consistency of $\hat{κ}$ was that the difference between the variances estimated under the true model and a model with a larger number of change-points is of order $\frac{{(\log n)}^{2}}{n}$, which seems to have motivated their choice of the penalty term. Thus, the arguments used in Liu et al. (1997) cannot be applied in our situation, where the penalty term is a constant multiple of $\frac{\log n}{n}$. Yao (1988) proved the consistency of the number of change-points estimated by the traditional BIC, BIC₂, in the context of simple mean changes. He used properties of a normal distribution to show that the difference between the variances estimated under the true model and the model with a larger number of change-points than the true model can be bounded above by the difference between the penalty terms of the two models. However, his argument to bound the difference does not work in our situation, and the upper bound in our case is achieved with d ≥ 4 using properties of a chi-square distribution.

Rewrite the model (2) as

y_{i} = β_{j, 0} + β_{j, 1} x_{i} + ∊_{i} = \mathbf{x}_{i}^{'} β_{j} + ∊_{i} if x_{i} \in (τ_{j - 1}, τ_{j}], (j = 1, \dots, κ + 1),

(3)

where $\mathbf{x}_{i} = {(1, x_{i})}^{'}$, $β_{j} = {(β_{j, 0}, β_{j, 1})}^{'}$ and the ε_i are independent and normally distributed with mean zero and variance σ². Without loss of generality, we assume that 0 < x_i < 1 for i = 1, . . ., n. Note that this model can be expressed as

y = \sum_{j = 1}^{κ + 1} X (τ_{j - 1}, τ_{j}) β_{j} + ∊,

where y = (y₁, y₂, . . ., y_n)′, X = (1, x) with x = (x₁, x₂, . . ., x_n)′ and 1 = (1, 1, . . ., 1)′, and X(α, η) = I(α, η)X with I(α, η) = diag(1_{{x₁ ∈ (α,η]}}, . . ., 1_{{x_n ∈ (α,η]}}) for any α < η. Note that 1_A = 1 if A is true and 0 otherwise.

Let $D_{k} = \{ {(t_{1}, \dots, t_{k})}^{'} : t_{0} < t_{1} < \dots < t_{k} < t_{k + 1} \}$ with t₀ = 0 and $t_{k + 1} = 1$. For a change-point vector $t_{k} = {(t_{1}, \dots, t_{k})}^{'} \in D_{k}$, denote the design matrix and the corresponding hat matrix as

X (t_{k}) = (X (t_{0}, t_{1}), X (t_{1}, t_{2}), \dots, X (t_{k}, t_{k + 1}))

and

H (t_{k}) = X (t_{k}) {(X {(t_{k})}^{'} X (t_{k}))}^{- 1} X {(t_{k})}^{'} = \sum_{j = 1}^{k + 1} H (t_{j - 1}, t_{j}),

where H(α, η) is the projection matrix based on X(α, η). Then the sum of squared residuals at given t_k is

S_{n} (t_{k}) = y^{'} (I - H (t_{k})) y = \sum_{j = 1}^{k + 1} S_{n} (t_{j - 1}, t_{j}),

where S_n(α, η) = y′(I(α, η) – H(α, η))y, and thus the sum of squared residuals for the model with k change-points is obtained as

R S S_{k} = \inf_{t_{k}} S_{n} (t_{k}) = S_{n} ({\hat{τ}}_{k}) .
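
To make the matrix notation concrete, the following sketch builds X(α, η), the stacked design matrix X(t_k), and S_n(t_k) = y′(I – H(t_k))y for a given change-point vector. It is an illustration with hypothetical function names, not the authors' implementation; the pseudo-inverse is used only so that the projection stays defined when a segment contains too few points.

```python
import numpy as np

def segment_design(x, lo, hi):
    """X(lo, hi): rows (1, x_i) kept where lo < x_i <= hi and set to zero elsewhere."""
    x = np.asarray(x, dtype=float)
    inside = ((x > lo) & (x <= hi)).astype(float)
    return inside[:, None] * np.column_stack([np.ones_like(x), x])

def hat(X):
    """Orthogonal projection onto the column space of X."""
    return X @ np.linalg.pinv(X)

def s_n(x, y, t):
    """S_n(t_k) = y'(I - H(t_k))y with t_0 = 0 and t_{k+1} = 1."""
    knots = [0.0, *t, 1.0]
    X = np.hstack([segment_design(x, a, b) for a, b in zip(knots[:-1], knots[1:])])
    y = np.asarray(y, dtype=float)
    return float(y @ (np.eye(len(y)) - hat(X)) @ y)

# Noise-free data from the mean function of Table 1: a change-point at 0.5 fits exactly.
x = np.arange(1, 21) / 20
y = np.where(x <= 0.5, 1 + x, 1.35 + 0.5 * x)
print(s_n(x, y, []), s_n(x, y, [0.5]))
```

Because the column blocks of X(t_k) have disjoint supports, S_n(t_k) equals the sum of the separate least-squares residual sums of squares on each segment, which is the decomposition used in the text. Obtaining RSS_k additionally requires minimizing S_n(t_k) over t_k, which this sketch does not do.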

We also denote the true values of κ, σ² and τ as κ₀, $σ_{0}^{2}$ , and $τ^{0} = {(τ_{1}^{0}, τ_{2}^{0}, \dots, τ_{κ_{0}}^{0})}^{'}$ , respectively.

Assumption 1

The regressor variable x has a positive and continuous density function in any small neighborhoods of τ⁰. Also x is independent of the error, ε.

For the case of nonrandom design points, the following assumption can replace Assumption 1. Note that the data spacing should be of order O(1/n) to satisfy this assumption.

Assumption 1’

Let {δ_n} be a sequence of constants for which O(1/n) ≤ δ_n = o(1). The number of data points in any small neighborhoods of $τ_{r}^{0}$ , (r = 1, . . ., κ₀) with volume δ_n is of order at least δ_n.

Theorem 1

Suppose that the segmented line regression model without the continuity constraint (3) satisfies one of Assumption 1 or 1’. Let k < κ₀. Then, for any constant d > 0,

P ({BIC}_{d} (k) > {BIC}_{d} (κ_{0})) \to 1

as n → ∞.

Proof

Let d be an arbitrary positive constant. By Lemma 5.3 and Lemma 5.4(i) of Liu et al. (1997), it is true that for some C > 0,

{\hat{σ}}_{k}^{2} = \frac{R S S_{k}}{n} = \frac{S_{n} ({\hat{τ}}_{k})}{n} > σ_{0}^{2} + C

with probability going to 1 as n → ∞. Since ${\hat{σ}}_{κ_{0}}^{2} = \frac{R S S_{κ_{0}}}{n}$ converges to $σ_{0}^{2}$ in probability (that is, ${\hat{σ}}_{κ_{0}}^{2} \overset{p}{\to} σ_{0}^{2}$ ), it is true that

{BIC}_{d} (k) - {BIC}_{d} (κ_{0}) = \log (\frac{{\hat{σ}}_{k}^{2}}{{\hat{σ}}_{κ_{0}}^{2}}) + \frac{d (k - κ_{0})}{n} \log n > 0

with probability going to 1 as n → ∞.

Theorem 1 guarantees that for any d > 0, $\hat{κ}$ based on BIC_d is consistent from below; that is, the probability of underestimating κ₀ converges to zero. For the consistency of $\hat{κ}$ from above, we now obtain the result for d ≥ 4 under the assumption that the mean function of the regression model is discontinuous.

In order to prove that for d ≥ 4 and for k > κ₀, P(BIC_d(k) > BIC_d(κ₀)) → 1 as n → ∞, we first state the following lemmas, whose proofs are given in the Appendix. In the lemmas and Theorem 2, we let $δ_{n} = \frac{{(\log n)}^{2}}{n}$, and for 1 ≤ r ≤ κ₀, let

B_{r} (n) = \{ t_{k} \in D_{k} : ∣ t_{j} - τ_{r}^{0} ∣ > δ_{n} \text{ for } 1 \leq j \leq k \} .

Lemma 1

Suppose that the segmented line regression model without the continuity constraint (3) satisfies one of Assumption 1 or 1’. Then, for k > κ₀,

P ({\hat{τ}}_{k} \in \cup_{r = 1}^{κ_{0}} B_{r} (n)) \to 0

as n → ∞.

Let $ξ_{k + 3 κ_{0}}$ be the ordered set of $\{ t_{k}, τ^{0}, τ^{0} - δ_{n} \mathbf{1}, τ^{0} + δ_{n} \mathbf{1} \}$. Let $T_{r} = (τ_{r}^{0} - δ_{n}, τ_{r}^{0} + δ_{n}]$ for r = 1, . . ., κ₀. Let $D_{0} = (0, τ_{1}^{0} - δ_{n}]$, $D_{κ_{0}} = (τ_{κ_{0}}^{0} + δ_{n}, 1]$, and for r = 1, . . ., κ₀ – 1, let $D_{r} = (τ_{r}^{0} + δ_{n}, τ_{r + 1}^{0} - δ_{n}]$. Let

\begin{matrix} G_{n, 1} = & \sum_{(ξ_{l - 1}, ξ_{l}] \subset \cup_{r = 1}^{κ_{0}} T_{r}} ∊^{'} H (ξ_{l - 1}, ξ_{l}) ∊, \\ G_{n, 2} = & \sum_{(ξ_{l - 1}, ξ_{l}] \subset \cup_{r = 0}^{κ_{0}} D_{r}} ∊^{'} H (ξ_{l - 1}, ξ_{l}) ∊ . \end{matrix}

Lemma 2

Suppose that the segmented line regression model without the continuity constraint (3) satisfies one of Assumption 1 or 1’. Then, for large n,

G_{n, 1} = O_{p} (\log \log n) .

Lemma 3

Suppose that the segmented line regression model without the continuity constraint (3) satisfies one of Assumption 1 or 1’. Then, for k > κ₀ and any $t_{k} \in \cap_{r = 1}^{κ_{0}} B_{r} {(n)}^{c}$ , it is true that for any δ > 0,

P (G_{n, 2} \leq (k - κ_{0} - 1) 4 (1 + δ) σ_{0}^{2} \log n) \to 1

as n → ∞.

Theorem 2

Suppose that the segmented line regression model without the continuity constraint (3) satisfies one of Assumption 1 or 1’. Let k > κ₀. Then, for any constant d ≥ 4,

P ({BIC}_{d} (k) > {BIC}_{d} (κ_{0})) \to 1

as n → ∞.

Proof

Without loss of generality, we suppose k = κ₀ + 1. Since ${\hat{σ}}_{κ_{0}}^{2} \overset{p}{\to} σ_{0}^{2}$ and ${\hat{σ}}_{k}^{2} \overset{p}{\to} σ_{0}^{2}$ , it is true that for any small δ > 0,

\begin{matrix} {BIC}_{d} (k) - {BIC}_{d} (κ_{0}) \\ = & \log (\frac{S_{n} ({\hat{τ}}_{k})}{n}) - \log (\frac{S_{n} ({\hat{τ}}_{κ_{0}})}{n}) + \frac{d (k - κ_{0})}{n} \log n \\ = & \log (1 - \frac{{\hat{σ}}_{κ_{0}}^{2} - {\hat{σ}}_{k}^{2}}{{\hat{σ}}_{κ_{0}}^{2}}) + \frac{d (k - κ_{0})}{n} \log n \\ \geq & (1 + δ) (\frac{{\hat{σ}}_{k}^{2} - {\hat{σ}}_{κ_{0}}^{2}}{{\hat{σ}}_{κ_{0}}^{2}}) + \frac{d (k - κ_{0})}{n} \log n \end{matrix}

with probability going to 1 as n → ∞.

Since $P ({\hat{τ}}_{k} \in \cup_{r = 1}^{κ_{0}} B_{r} (n)) \to 0$ by Lemma 1, we may now focus on t_k's in $\cap_{r = 1}^{κ_{0}} B_{r} {(n)}^{c}$. Let $ξ_{k + 3 κ_{0}}$ be the ordered set of $\{ t_{k}, τ^{0}, τ^{0} - δ_{n} \mathbf{1}, τ^{0} + δ_{n} \mathbf{1} \}$. Then

\begin{matrix} S_{n} (t_{k}) - S_{n} ({\hat{τ}}_{κ_{0}}) \\ \geq & S_{n} (ξ_{k + 3 κ_{0}}) - S_{n} (τ^{0}) \\ \geq & - \sum_{l = 1}^{k + 3 κ_{0} + 1} ∊^{'} H (ξ_{l - 1}, ξ_{l}) ∊ \\ = & - \{ \sum_{(ξ_{l - 1}, ξ_{l}] \subset \cup_{r = 1}^{κ_{0}} T_{r}} ∊^{'} H (ξ_{l - 1}, ξ_{l}) ∊ + \sum_{(ξ_{l - 1}, ξ_{l}] \subset \cup_{r = 0}^{κ_{0}} D_{r}} ∊^{'} H (ξ_{l - 1}, ξ_{l}) ∊ \} \\ = & - (G_{n, 1} + G_{n, 2}), \end{matrix}

where ξ₀ = 0 and $ξ_{k + 3 κ_{0} + 1} = 1$.

By Lemma 2 and Lemma 3, the following holds for any $t_{k} \in \cap_{r = 1}^{κ_{0}} B_{r} {(n)}^{c}$ : it is true that for any δ > 0,

S_{n} (t_{k}) - S_{n} ({\hat{τ}}_{κ_{0}}) > - (δ + (k - κ_{0} - 1) 4 (1 + δ)) σ_{0}^{2} \log n

and ${\hat{σ}}_{κ_{0}}^{2} > σ_{0}^{2} (1 - δ)$, with probability going to 1 as n → ∞. Therefore, for any δ > 0, ${BIC}_{d} (k) - {BIC}_{d} (κ_{0}) > - \frac{(1 + δ) (δ + (k - κ_{0} - 1) 4 (1 + δ)) σ_{0}^{2} \log n}{n σ_{0}^{2} (1 - δ)} + \frac{d (k - κ_{0}) \log n}{n}$ with probability going to 1 as n → ∞. For d ≥ 4, we can choose δ small enough such that

(1 + δ) (δ + (k - κ_{0} - 1) 4 (1 + δ)) ∕ (1 - δ) < d (k - κ_{0}),

and thus

{BIC}_{d} (k) - {BIC}_{d} (κ_{0}) > 0

with probability going to 1 as n → ∞.

Remark

Theorem 1 and Theorem 2 imply that the consistency of $\hat{κ}$ is achieved by BIC_d for d ≥ 4. Because the penalty term of MIC for the case of one independent variable (p = 1), $(2 k + 1) \frac{c_{0} {(\log n)}^{2 + δ_{0}}}{n}$, is larger than $4 k \frac{\log n}{n}$ for large n, the result in this section indicates that consistent model selection can be performed with a criterion whose penalty is asymptotically milder than that of MIC under the assumption of Gaussian errors.

3 Simulations: Unconstrained Model

This section summarizes simulations conducted to study the asymptotic behavior of BIC₄, whose consistency is proved in Section 2, and to compare its performance with those of BIC₃, the traditional BIC, and MIC, whose consistency is established in Liu et al. (1997). For each choice of model parameters, 300 replicated data sets are simulated, and $P (\hat{κ} < κ_{0})$, $P (\hat{κ} = κ_{0})$, and $P (\hat{κ} > κ_{0})$ are estimated, where κ₀ is the true number of change-points and $\hat{κ} (0 \leq \hat{κ} \leq K)$ is the estimate obtained by each selection method. The value of K was set to 4, and $\overset{‒}{3}$ or $\overset{‒}{6}$ represents a repeating decimal.
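
The Monte Carlo protocol just described can be sketched as follows for the fixed-design mean function of Table 1, which has one true change-point. This is a minimal sketch under stated assumptions, not the code used for the reported tables: the fitting algorithm is not specified in this section, so the sketch computes RSS_k by dynamic programming over contiguous blocks of the sorted data, and the minimum segment size `min_size`, the seed, and all function names are hypothetical choices.

```python
import numpy as np

def rss_by_k(x, y, K, min_size=3):
    """RSS_k for k = 0..K: best split of the sorted data into k + 1 line segments."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    n = len(x)
    cs = lambda v: np.concatenate(([0.0], np.cumsum(v)))
    sx, sy, sxx, sxy, syy = cs(x), cs(y), cs(x * x), cs(x * y), cs(y * y)

    def seg(i, j):
        """RSS of an ordinary least-squares line fitted to sorted points i..j-1."""
        m = j - i
        mx, my = (sx[j] - sx[i]) / m, (sy[j] - sy[i]) / m
        vxx = sxx[j] - sxx[i] - m * mx * mx
        vxy = sxy[j] - sxy[i] - m * mx * my
        vyy = syy[j] - syy[i] - m * my * my
        return max(vyy - (vxy * vxy / vxx if vxx > 1e-12 else 0.0), 0.0)

    # best[s, j]: minimal RSS when the first j points are split into s + 1 segments
    best = np.full((K + 1, n + 1), np.inf)
    for j in range(min_size, n + 1):
        best[0, j] = seg(0, j)
    for s in range(1, K + 1):
        for j in range((s + 1) * min_size, n + 1):
            best[s, j] = min(best[s - 1, i] + seg(i, j)
                             for i in range(s * min_size, j - min_size + 1))
    return best[:, n]

def simulate(n, sigma, d, reps=300, K=4, seed=0):
    """Estimate P(k_hat < 1), P(k_hat = 1), P(k_hat > 1) for BIC_d, true k = 1."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, n + 1) / n
    mean = np.where(x <= 0.5, 1 + x, 1.35 + 0.5 * x)
    counts = np.zeros(3)
    for _ in range(reps):
        y = mean + rng.normal(0.0, sigma, n)
        bic = np.log(rss_by_k(x, y, K) / n) + d * np.arange(K + 1) * np.log(n) / n
        k_hat = int(np.argmin(bic))
        counts[0 if k_hat < 1 else (1 if k_hat == 1 else 2)] += 1
    return counts / reps

print(simulate(n=50, sigma=0.05, d=4, reps=100))
```

Because the search procedure and the minimum segment size are assumptions of this sketch, its estimates should not be expected to reproduce the tabulated values exactly.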

Tables 1, 2, and 3 summarize the results for the unconstrained model. Tables 1 and 2 consider the cases with one change-point for two different x configurations, fixed x in Table 1 and random x in Table 2. Table 3 is for the case with two change-points where x is fixed. Note that MIC of Liu et al. (1997) uses the penalty term $PE (k) = p^{*} c_{0} {(\ln n)}^{2 + δ_{0}} ∕ n = (2 k + 1) (0.299) {(\ln n)}^{2.1} ∕ n$ for the model with one independent variable, which is based on p* = (k + 1)p + k = 2k + 1 with p = 1 in their equation (2.3). According to Liu et al. (1997), δ₀ = 0.1 is arbitrarily chosen and then c₀ = 0.299 is chosen by forcing the penalty term of MIC to be equal to that of the Schwarz criterion. For the model with one independent variable, the penalty term of MIC is set to be equal to that of BIC₂ at n = 20, but MIC is not exactly the same as BIC₂ when n = 20 because MIC uses an unbiased estimator of σ² while BIC₂ uses the maximum likelihood estimator of σ² under the Gaussian error assumption. When the full version of MIC is compared to BIC_d, the penalty term of MIC is similar to that of BIC₃ when n = 30 or 50 and to that of BIC₄ when n = 150 or 200, which is empirically indicated in Tables 1, 2, and 3. For very large n, MIC imposes a penalty harsher than that of BIC₄.
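
One way to put MIC and BIC_d on a common scale is to ask which coefficient d would make d log(n)/n equal to the extra cost MIC charges for one more change-point. The sketch below does this for the one-covariate case with c₀ = 0.299 and δ₀ = 0.1, writing log(RSS_k/(n – p*)) as log(RSS_k/n) – log(1 – p*/n) so that the degrees-of-freedom adjustment is counted as part of the penalty. The function name and this notion of an "equivalent d" are our own illustrative constructions, not quantities defined in the paper.

```python
import math

def mic_equivalent_d(k, n, c0=0.299, delta0=0.1):
    """d such that d*log(n)/n matches MIC's added cost for going from k to k + 1."""
    p_lo, p_hi = 2 * k + 1, 2 * k + 3                # p* = 2k + 1 with one covariate
    dof_term = math.log((n - p_lo) / (n - p_hi))     # from log(RSS / (n - p*))
    pen_term = 2 * c0 * math.log(n) ** (2 + delta0) / n
    return (dof_term + pen_term) * n / math.log(n)

for n in (30, 50, 150, 200, 1000):
    print(n, round(mic_equivalent_d(0, n), 2))
```

Under these assumptions the equivalent coefficient is close to 3 for n = 30 and 50, close to 4 for n = 150 and 200, and above 4 for much larger n, which agrees with the comparison stated above.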

Table 2.

Unconstrained model with k₀ = 1 and x_i ~ N(0.5, 1) for i = 1,..., n. E(Y|x) = (1 + x)I(x ≤ 0.5) + (1.35 + 0.5x)I(x > 0.5)

n	Prob	BIC₃ (σ = 0.2)	BIC₄ (σ = 0.2)	MIC (σ = 0.2)	BIC₃ (σ = 0.1)	BIC₄ (σ = 0.1)	MIC (σ = 0.1)
30	$\hat{P} (\hat{κ} < k_{0})$	$0.08 \overset{‒}{3}$	0.300	$0.07 \overset{‒}{6}$	0	$0.00 \overset{‒}{6}$	0
30	$\hat{P} (\hat{κ} = k_{0})$	$0.70 \overset{‒}{3}$	$0.66 \overset{‒}{6}$	$0.70 \overset{‒}{6}$	0.760	$0.94 \overset{‒}{3}$	0.760
30	$\hat{P} (\hat{κ} > k_{0})$	$0.21 \overset{‒}{3}$	$0.03 \overset{‒}{3}$	$0.21 \overset{‒}{6}$	0.240	0.050	0.240
50	$\hat{P} (\hat{κ} < k_{0})$	$0.01 \overset{‒}{6}$	$0.07 \overset{‒}{6}$	$0.02 \overset{‒}{3}$	0	0	0
50	$\hat{P} (\hat{κ} = k_{0})$	0.830	$0.89 \overset{‒}{6}$	$0.89 \overset{‒}{3}$	$0.84 \overset{‒}{3}$	0.980	$0.91 \overset{‒}{3}$
50	$\hat{P} (\hat{κ} > k_{0})$	$0.15 \overset{‒}{3}$	$0.02 \overset{‒}{6}$	$0.08 \overset{‒}{3}$	$0.15 \overset{‒}{6}$	0.020	$0.08 \overset{‒}{6}$
150	$\hat{P} (\hat{κ} < k_{0})$	0	0	0	0	0	0
150	$\hat{P} (\hat{κ} = k_{0})$	$0.95 \overset{‒}{3}$	0.990	0.990	$0.95 \overset{‒}{6}$	$0.99 \overset{‒}{6}$	$0.99 \overset{‒}{6}$
150	$\hat{P} (\hat{κ} > k_{0})$	$0.04 \overset{‒}{6}$	0.010	0.010	$0.04 \overset{‒}{3}$	$0.00 \overset{‒}{3}$	$0.00 \overset{‒}{3}$


Table 3.

Unconstrained model with k₀ = 2 and x_i = i/n for i = 1,..., n. E(Y|x) = (1 + x)I(x ≤ 0.5) + (1.35 + 0.5x)I(0.5 < x ≤ 0.7) + (0.75 + 1.5x)I(x > 0.7)

n	Prob	BIC₃ (σ = 0.05)	BIC₄ (σ = 0.05)	MIC (σ = 0.05)	BIC₃ (σ = 0.02)	BIC₄ (σ = 0.02)	MIC (σ = 0.02)
30	$\hat{P} (\hat{κ} < k_{0})$	0.590	$0.87 \overset{‒}{3}$	0.590	$0.02 \overset{‒}{6}$	$0.11 \overset{‒}{3}$	$0.02 \overset{‒}{6}$
30	$\hat{P} (\hat{κ} = k_{0})$	$0.24 \overset{‒}{3}$	$0.10 \overset{‒}{3}$	$0.24 \overset{‒}{3}$	$0.67 \overset{‒}{3}$	$0.79 \overset{‒}{3}$	$0.68 \overset{‒}{6}$
30	$\hat{P} (\hat{κ} > k_{0})$	$0.16 \overset{‒}{6}$	$0.02 \overset{‒}{3}$	$0.16 \overset{‒}{6}$	0.300	$0.09 \overset{‒}{3}$	$0.28 \overset{‒}{6}$
50	$\hat{P} (\hat{κ} < k_{0})$	$0.63 \overset{‒}{6}$	0.890	$0.72 \overset{‒}{3}$	0	0	0
50	$\hat{P} (\hat{κ} = k_{0})$	$0.30 \overset{‒}{6}$	$0.10 \overset{‒}{6}$	0.250	$0.82 \overset{‒}{3}$	$0.96 \overset{‒}{6}$	$0.89 \overset{‒}{3}$
50	$\hat{P} (\hat{κ} > k_{0})$	$0.05 \overset{‒}{6}$	$0.00 \overset{‒}{3}$	$0.02 \overset{‒}{6}$	$0.17 \overset{‒}{6}$	$0.03 \overset{‒}{3}$	$0.10 \overset{‒}{6}$
150	$\hat{P} (\hat{κ} < k_{0})$	0.190	$0.38 \overset{‒}{6}$	$0.37 \overset{‒}{6}$	0	0	0
150	$\hat{P} (\hat{κ} = k_{0})$	0.760	$0.60 \overset{‒}{6}$	$0.61 \overset{‒}{6}$	$0.94 \overset{‒}{6}$	0.990	0.990
150	$\hat{P} (\hat{κ} > k_{0})$	0.050	$0.00 \overset{‒}{6}$	$0.00 \overset{‒}{6}$	$0.05 \overset{‒}{3}$	0.010	0.010


Results in Tables 1, 2, and 3 indicate that in the unconstrained case, the consistency of $\hat{κ}$ estimated by BIC₄ or MIC, which is theoretically proved, is empirically supported, with $P (\hat{κ} = κ_{0})$ approaching one as n increases or the effect size increases. Based on Theorem 1, the choice of d matters little for under-fitting when n is large, and our interest is in the over-fitting probabilities of these selection methods when n is large. The over-fitting probabilities of BIC₄ and MIC converge to zero as n increases, but the over-fitting probability of BIC₃, the traditional BIC, is considerably larger than those of BIC₄ and MIC even for large n, although it seems to decrease as n increases.

In small sample situations, however, the performance of these selection criteria depends more heavily on the effect size, that is, the change size relative to the variability. For example, when both the sample size and the effect size are small (see the cases with σ = 0.1 and n = 30 or 50 in Table 1), the traditional BIC, BIC₃, shows the highest probability of correct selection (PCP), while BIC₄ is very conservative. However, when the effect size is large enough for the PCP of BIC₄ to exceed 0.80, BIC₄ achieves the highest PCP even in small sample cases (see the cases with σ = 0.05 and n = 30 or 50 in Table 1). Regardless of the sample size, the over-fitting probability of the traditional BIC, BIC₃, is significantly larger than that of BIC₄. The performance of MIC is similar to that of BIC₃ when n = 30 or 50 and to that of BIC₄ when n = 150 or 200.

4 Selection Methods in Constrained Model

Recall the segmented line regression model introduced in (2):

y_{i} = β_{j, 0} + β_{j, 1} x_{i} + ∊_{i}, if τ_{j - 1} < x_{i} \leq τ_{j} (j = 1, \dots, κ + 1) .

When the segments are assumed to be continuous at the change-points,

β_{j, 0} + β_{j, 1} τ_{j} = β_{j + 1, 0} + β_{j + 1, 1} τ_{j},

for j = 1, . . ., κ, and this model can also be represented as

y_{i} = β_{0} + β_{1} x_{i} + δ_{1} {(x_{i} - τ_{1})}^{+} + \dots + δ_{κ} {(x_{i} - τ_{κ})}^{+} + ∊_{i},

(4)

where a⁺ = max(0, a).
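
For fixed change-points, representation (4) is an ordinary linear model in the columns 1, x, and the hinge terms (x – τ_j)⁺. The sketch below builds that design matrix and fits it by least squares; the function names are hypothetical, and the search over candidate change-points that a full fit would require is deliberately left out.

```python
import numpy as np

def joinpoint_design(x, taus):
    """Columns 1, x, (x - tau_1)^+, ..., (x - tau_k)^+ as in representation (4)."""
    x = np.asarray(x, dtype=float)
    hinges = [np.maximum(x - t, 0.0) for t in taus]
    return np.column_stack([np.ones_like(x), x, *hinges])

def joinpoint_rss(x, y, taus):
    """Least-squares RSS and coefficients of the continuous fit for fixed taus."""
    X = joinpoint_design(x, taus)
    y = np.asarray(y, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid), coef

# Noise-free data from the mean function of Table 4: E(Y|x) = 1 + x - 0.5 (x - 0.5)^+
x = np.arange(1, 31) / 30
y = 1 + x - 0.5 * np.maximum(x - 0.5, 0.0)
rss, coef = joinpoint_rss(x, y, [0.5])
print(rss, coef)
```

With the change-point placed at its true value, the fit recovers the coefficients (β₀, β₁, δ₁) = (1, 1, –0.5) up to rounding error, and each additional change-point adds one hinge column plus one location.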

For this segmented line regression model with the continuity constraint, called a joinpoint regression model in Kim et al. (2000), the total number of unknown parameters is 2k + 3 for the model with k change-points, and so the traditional BIC uses d = 2, as in the simple mean change case. We denote this traditional BIC by BIC₂. As in the unconstrained case, we consider BIC₃ as a modified version in the constrained case.

Because Lemma 5.3 of Liu et al. (1997) also holds for the segmented line regression model with the continuity constraint, Theorem 1 holds for the constrained case as well. However, the arguments used to prove Theorem 2, more specifically the arguments used in the proof of Lemma 1, cannot be used to prove that the over-fitting probability of BIC₃ converges to zero in the constrained case, and in this section, we investigate the asymptotic behavior of BIC₃ via simulations. Although detailed proofs are provided only for the unconstrained model case, Liu et al. (1997) indicated the consistency of MIC in the constrained model case as well, and the comparison of BIC₃ and MIC is expected to provide some insights into the consistency of BIC₃.

Tables 4, 5, and 6 include simulation results for the constrained model, and we used the same simulation settings as in Tables 1, 2, and 3, except that in Table 4, K = 3 is used due to excessive computation time. Tables 4, 5, and 6 include results for BIC₂, the traditional BIC for the constrained model; BIC₃, which we anticipate to be consistent; and MIC, whose properties are studied in Liu et al. (1997). Note that MIC of Liu et al. (1997) uses the same penalty term for both constrained and unconstrained models, namely $(2 k + 1) c_{0} {(\ln n)}^{2 + δ_{0}} ∕ n = (2 k + 1) (0.299) {(\ln n)}^{2.1} ∕ n$ for the model with one independent variable.

Table 4.

Constrained model with k₀ = 1 and x_i = i/n for i = 1,..., n. E(Y|x) = 1 + x – 0.5(x – 0.5)⁺

n	Prob	BIC₂ (σ = 0.1)	BIC₃ (σ = 0.1)	MIC (σ = 0.1)	BIC₂ (σ = 0.05)	BIC₃ (σ = 0.05)	MIC (σ = 0.05)
30	$\hat{P} (\hat{κ} < k_{0})$	$0.61 \overset{‒}{3}$	0.810	$0.80 \overset{‒}{3}$	$0.04 \overset{‒}{6}$	$0.17 \overset{‒}{6}$	$0.15 \overset{‒}{6}$
30	$\hat{P} (\hat{κ} = k_{0})$	0.340	$0.18 \overset{‒}{6}$	$0.19 \overset{‒}{3}$	0.880	$0.81 \overset{‒}{3}$	0.830
30	$\hat{P} (\hat{κ} > k_{0})$	$0.04 \overset{‒}{6}$	$0.00 \overset{‒}{3}$	$0.00 \overset{‒}{3}$	$0.07 \overset{‒}{3}$	0.010	$0.01 \overset{‒}{3}$
50	$\hat{P} (\hat{κ} < k_{0})$	$0.45 \overset{‒}{6}$	0.730	0.770	$0.00 \overset{‒}{6}$	0.040	0.060
50	$\hat{P} (\hat{κ} = k_{0})$	$0.51 \overset{‒}{3}$	0.270	0.230	0.920	$0.95 \overset{‒}{3}$	$0.93 \overset{‒}{6}$
50	$\hat{P} (\hat{κ} > k_{0})$	0.030	0	0	$0.07 \overset{‒}{3}$	$0.00 \overset{‒}{6}$	$0.00 \overset{‒}{3}$
75	$\hat{P} (\hat{κ} < k_{0})$	$0.30 \overset{‒}{6}$	$0.64 \overset{‒}{6}$	$0.75 \overset{‒}{6}$	0	$0.00 \overset{‒}{3}$	$0.01 \overset{‒}{3}$
75	$\hat{P} (\hat{κ} = k_{0})$	0.650	0.350	$0.24 \overset{‒}{3}$	$0.93 \overset{‒}{6}$	0.990	$0.98 \overset{‒}{3}$
75	$\hat{P} (\hat{κ} > k_{0})$	$0.04 \overset{‒}{3}$	$0.00 \overset{‒}{3}$	0	$0.06 \overset{‒}{3}$	$0.00 \overset{‒}{6}$	$0.00 \overset{‒}{3}$
150	$\hat{P} (\hat{κ} < k_{0})$	0.100	0.280	$0.49 \overset{‒}{3}$	0	0	0
150	$\hat{P} (\hat{κ} = k_{0})$	$0.87 \overset{‒}{3}$	$0.71 \overset{‒}{6}$	$0.50 \overset{‒}{6}$	$0.96 \overset{‒}{6}$	$0.99 \overset{‒}{6}$	1
150	$\hat{P} (\hat{κ} > k_{0})$	$0.02 \overset{‒}{6}$	$0.00 \overset{‒}{3}$	0	$0.03 \overset{‒}{3}$	$0.00 \overset{‒}{3}$	0
200	$\hat{P} (\hat{κ} < k_{0})$	$0.02 \overset{‒}{3}$	$0.11 \overset{‒}{6}$	$0.35 \overset{‒}{3}$	0	0	0
200	$\hat{P} (\hat{κ} = k_{0})$	$0.96 \overset{‒}{3}$	$0.88 \overset{‒}{3}$	$0.64 \overset{‒}{6}$	$0.98 \overset{‒}{3}$	1	1
200	$\hat{P} (\hat{κ} > k_{0})$	$0.01 \overset{‒}{3}$	0	0	$0.01 \overset{‒}{6}$	0	0


Table 5.

Constrained model with k₀ = 1 and x_i ~ N(0.5, 1) for i = 1,..., n. E(Y|x) = 1 + x – 0.5(x – 0.5)⁺

n	Prob	BIC₂ (σ = 0.2)	BIC₃ (σ = 0.2)	MIC (σ = 0.2)	BIC₂ (σ = 0.1)	BIC₃ (σ = 0.1)	MIC (σ = 0.1)
30	$\hat{P} (\hat{κ} < k_{0})$	0.100	0.290	$0.27 \overset{‒}{6}$	0	$0.00 \overset{‒}{3}$	$0.00 \overset{‒}{3}$
30	$\hat{P} (\hat{κ} = k_{0})$	$0.82 \overset{‒}{3}$	$0.70 \overset{‒}{6}$	0.720	0.910	$0.98 \overset{‒}{3}$	$0.98 \overset{‒}{3}$
30	$\hat{P} (\hat{κ} > k_{0})$	$0.07 \overset{‒}{6}$	$0.00 \overset{‒}{3}$	$0.00 \overset{‒}{3}$	0.090	$0.01 \overset{‒}{3}$	$0.01 \overset{‒}{3}$
50	$\hat{P} (\hat{κ} < k_{0})$	$0.01 \overset{‒}{6}$	0.080	$0.10 \overset{‒}{3}$	0	0	0
50	$\hat{P} (\hat{κ} = k_{0})$	$0.92 \overset{‒}{6}$	$0.91 \overset{‒}{6}$	$0.89 \overset{‒}{3}$	$0.93 \overset{‒}{6}$	$0.98 \overset{‒}{6}$	$0.99 \overset{‒}{3}$
50	$\hat{P} (\hat{κ} > k_{0})$	$0.05 \overset{‒}{6}$	$0.00 \overset{‒}{3}$	$0.06 \overset{‒}{3}$	$0.06 \overset{‒}{3}$	$0.01 \overset{‒}{3}$	$0.00 \overset{‒}{6}$
150	$\hat{P} (\hat{κ} < k_{0})$	0	0	0	0	0	0
150	$\hat{P} (\hat{κ} = k_{0})$	$0.96 \overset{‒}{6}$	1	1	$0.96 \overset{‒}{3}$	1	1
150	$\hat{P} (\hat{κ} > k_{0})$	$0.03 \overset{‒}{3}$	0	0	$0.03 \overset{‒}{6}$	0	0


Table 6.

Constrained model with k₀ = 2 and x_i = i/n for i = 1,..., n. E(Y|x) = 1 + x – 0.5(x – 0.5)⁺ + (x – 0.7)⁺

n	Prob	σ = 0.5			σ = 0.02
		BIC₂	BIC₃	MIC	BIC₂	BIC₃	MIC
30	$\hat{P} (\hat{κ} < k_{0})$	$0.58\overline{6}$	0.860	$0.85\overline{3}$	0	$0.02\overline{6}$	$0.02\overline{3}$
	$\hat{P} (\hat{κ} = k_{0})$	$0.36\overline{3}$	0.140	$0.14\overline{6}$	0.920	$0.95\overline{6}$	0.960
	$\hat{P} (\hat{κ} > k_{0})$	0.050	0	0	0.080	$0.01\overline{6}$	$0.01\overline{6}$
50	$\hat{P} (\hat{κ} < k_{0})$	$0.50\overline{3}$	$0.81\overline{6}$	0.890	0	$0.01\overline{3}$	0.020
	$\hat{P} (\hat{κ} = k_{0})$	$0.43\overline{3}$	$0.18\overline{3}$	0.110	$0.89\overline{3}$	$0.96\overline{3}$	$0.96\overline{6}$
	$\hat{P} (\hat{κ} > k_{0})$	$0.06\overline{3}$	0	0	$0.10\overline{6}$	$0.02\overline{3}$	$0.01\overline{3}$
150	$\hat{P} (\hat{κ} < k_{0})$	$0.09\overline{3}$	0.280	0.530	0	0	0
	$\hat{P} (\hat{κ} = k_{0})$	$0.86\overline{3}$	$0.71\overline{3}$	0.470	$0.93\overline{3}$	$0.99\overline{6}$	1
	$\hat{P} (\hat{κ} > k_{0})$	$0.04\overline{3}$	$0.00\overline{6}$	0	$0.06\overline{6}$	$0.00\overline{3}$	0


The results summarized in Tables 4, 5, and 6 empirically support that the under-fitting probabilities of BIC₂, BIC₃, and MIC approach zero as n increases. The over-fitting probabilities of BIC₃ and MIC seem to converge to zero as the sample size increases and/or the effect size gets larger. As in the unconstrained case, the traditional BIC for the constrained case, BIC₂, tends to overestimate the true number of change-points (κ), especially when the effect size is large, and the probability of overestimating κ is significantly different from zero even for n = 150 or 200 in most cases, although it seems to decrease as n increases. In a numerical study not reported here, we also observed that the residual sum of squares of the constrained model fit usually changes more slowly than that of the unconstrained model fit as the number of change-points increases, and thus we expect the consistency of BIC_d to be achieved in the constrained model case with a less severe penalty than that of BIC₄, whose consistency was proved in the unconstrained model case.
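To make the comparison above concrete, the following is a minimal sketch of one replication in the setting of Table 5. It is not the authors' simulation code. It assumes a criterion of the form n log(RSS_k/n) + d k log n (additive constants that do not depend on k are dropped), and it searches change-point locations over a coarse grid of sample quantiles. The function names, the grid, and the cap `max_k` are illustrative choices.

```python
import itertools
import numpy as np


def rss_continuous_fit(x, y, taus):
    """RSS of the least-squares fit on 1, x and one hinge (x - tau)^+ per change-point."""
    columns = [np.ones_like(x), x] + [np.clip(x - tau, 0.0, None) for tau in taus]
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)


def select_num_changepoints(x, y, d, max_k=2, n_grid=15):
    """Return (k_hat, scores), scoring each k by n*log(RSS_k/n) + d*k*log(n) (assumed form)."""
    n = len(x)
    # Candidate change-point locations: interior sample quantiles (illustrative grid).
    grid = np.quantile(x, np.linspace(0.1, 0.9, n_grid))
    scores = []
    for k in range(max_k + 1):
        # Smallest RSS over all ways of placing k change-points on the grid.
        best_rss = min(
            rss_continuous_fit(x, y, taus)
            for taus in itertools.combinations(grid, k)
        )
        scores.append(n * np.log(best_rss / n) + d * k * np.log(n))
    return int(np.argmin(scores)), scores


def simulate_table5_sample(n, sigma, rng):
    """One sample with E(Y|x) = 1 + x - 0.5 (x - 0.5)^+ and x_i ~ N(0.5, 1)."""
    x = rng.normal(0.5, 1.0, size=n)
    y = 1.0 + x - 0.5 * np.clip(x - 0.5, 0.0, None) + sigma * rng.normal(size=n)
    return x, y


rng = np.random.default_rng(0)
x, y = simulate_table5_sample(n=150, sigma=0.1, rng=rng)
k_bic2, _ = select_num_changepoints(x, y, d=2)
k_bic3, _ = select_num_changepoints(x, y, d=3)
print("BIC_2 selects", k_bic2, "and BIC_3 selects", k_bic3)
```

Repeating the last four lines over many seeds and tallying how often the selected value falls below, at, or above k₀ = 1 would give rough analogues of the table entries, although the grid search here is coarser than a full fit.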

5 Discussion

In this paper, we considered information-based selection criteria to estimate the number of change-points in segmented line regression, and proved the consistency of the number of change-points estimated by BIC₄ for the unconstrained model with normal errors. The simulation results summarized in Tables 1, 2, and 3 for the unconstrained model support the theoretical results that (i) the under-fitting probability of BIC_d converges to zero for any d > 0 (Theorem 1) and (ii) the over-fitting probability of BIC₄ converges to zero (Theorem 2), while the traditional BIC, BIC₃, shows a non-negligible over-fitting tendency compared with BIC₄ and MIC.

For the constrained case where the line segments are assumed to be continuous at the change-points, we examined empirical properties of BIC₃, which is expected to improve the large-sample performance of the traditional BIC, BIC₂. The fact that the under-fitting probability of BIC_d converges to zero for any d > 0 is empirically supported in Tables 4, 5, and 6. Significantly higher over-fitting probabilities of the traditional BIC, BIC₂, are also observed even for large n, compared to those of BIC₃ and MIC.

In general, a relatively large penalty, that is, a large d, will reduce the probability of over-estimating the number of change-points, but it will raise the risk of under-estimation. In that regard, our interest in this paper was in the smallest d for which BIC_d produces a consistent estimator $\hat{κ}$ . For a normally distributed sequence of random variables, the traditional BIC, BIC₂, produces a consistent estimator of κ according to Yao (1988), although its over-estimating tendency is observed even for n = 497 in Table 1 of Zhang and Siegmund (2007). Although the consistency of $\hat{κ}$ in the unconstrained case was proved only for d ≥ 4 in our paper, BIC_d with d < 4 may still provide consistent model selection, and we hope that future research will improve on the results in this paper.
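The trade-off described above can be illustrated with a small sketch. It assumes a criterion of the form n log(RSS_k/n) + d k log n, which may differ from the paper's exact definition by terms that do not depend on k. The residual sums of squares below are hypothetical numbers chosen for illustration, not values from the simulations.

```python
import math


def bic_d_scores(rss_by_k, n, d):
    """Score each k = 0, 1, ... by n*log(RSS_k/n) + d*k*log(n) (assumed form)."""
    return [
        n * math.log(rss / n) + d * k * math.log(n)
        for k, rss in enumerate(rss_by_k)
    ]


def select_k(rss_by_k, n, d):
    """Index of the smallest score, i.e. the selected number of change-points."""
    scores = bic_d_scores(rss_by_k, n, d)
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical RSS values for k = 0, 1, 2, 3 with n = 100 observations.
rss_by_k = [100.0, 60.0, 54.0, 52.0]
for d in (0.5, 2, 3, 12):
    print("d =", d, "selects k =", select_k(rss_by_k, 100, d))
```

In this example the selected k never increases as d grows, which is the general behavior of a penalty that is linear in k: a heavier penalty guards against over-estimation at the cost of possibly missing real change-points.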

Another related future research problem is to identify a selection procedure that selects the best model for various sample sizes and effect sizes. In our simulation study, we observed that the traditional BIC performs better when the effect size is small, and in our future research, we plan to investigate small sample properties of various selection criteria. No single type of criterion may be uniformly optimal, and a revised selection method such as a weighted BIC, $\sum_{d} w_{d} {BIC}_{d}$ , where the weights depend on effect sizes, sample size, etc., may achieve a relatively high probability of correct selection.

BIC₄ and BIC₃, studied in this paper for the unconstrained and constrained model cases, respectively, were motivated by the modified BIC (MBIC) of Zhang and Siegmund (2007), which is derived as an asymptotic approximation of the Bayes factor. The original form of the Bayes factor could be used as a selection measure, instead of its asymptotic approximation. However, we have observed in our preliminary simulation study that even the performance of MBIC derived for the segmented line regression models depends on the x-configurations, and thus it may not be straightforward to study analytic properties of the criterion using the original Bayes factor.

Highlights.

Segmented line regression models, constrained and unconstrained, are studied.
Consistency of the estimated number of change-points is investigated.
Simulation studies are conducted to assess the performance of the proposed criteria.

Acknowledgements

A part of H.-J. Kim's research was conducted during her visit to the National Cancer Institute. J. Kim's research was supported by a research grant from Inha University.

Appendix: Proofs of Lemmas

Lemma A.1

For any δ > 0,

P (\sup_{α < η} ∊^{'} H (α, η) ∊ > 4 σ_{0}^{2} (1 + δ) \log n) \to 0

as n → ∞.

Proof

Let X be a χ² distributed random variable with k degrees of freedom. For some constant c_k > 0,

P (X > x) = c_{k} x^{\frac{k}{2} - 1} e^{- \frac{x}{2}} + o (x^{\frac{k}{2} - 1} e^{- \frac{x}{2}})

as x → ∞. For given α < η, $\frac{∊^{'} H (α, η) ∊}{σ_{0}^{2}}$ has a χ² distribution with 2 degrees of freedom. So, it is true that for some constant c₀ > 0 and for δ > 0,

\begin{matrix} P (\sup_{α < η} ∊^{'} H (α, η) ∊ > 4 σ_{0}^{2} (1 + δ) \log n) \\ = & P (\max_{x_{i} < x_{j}} ∊^{'} H (x_{i}, x_{j}) ∊ > 4 σ_{0}^{2} (1 + δ) \log n) \\ \leq & c_{0} \frac{n (n - 1)}{2} n^{- 2 - 2 δ} \\ \to & 0 \end{matrix}

as n gets large.
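The middle step of this bound can be checked numerically. For a χ² variable with 2 degrees of freedom the tail is exactly P(X > x) = e^{-x/2}, so each pair (x_i, x_j) contributes n^{-2-2δ} and the union over n(n – 1)/2 pairs gives the stated bound with c₀ = 1. The sketch below only evaluates that expression; the function names are illustrative.

```python
import math


def chi2_2df_tail(x):
    """Exact upper tail P(X > x) of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def union_bound(n, delta):
    """Union bound over all pairs i < j at threshold 4*(1 + delta)*log(n), in units of sigma_0^2."""
    threshold = 4.0 * (1.0 + delta) * math.log(n)
    num_pairs = n * (n - 1) / 2.0
    return num_pairs * chi2_2df_tail(threshold)


for n in (10**2, 10**4, 10**6):
    print(n, union_bound(n, delta=0.1))
```

The printed values equal n(n – 1)/2 · n^{-2-2δ}, which is below n^{-2δ}/2 and therefore tends to zero for any fixed δ > 0, although slowly when δ is small.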

Lemma A.2

For 1 ≤ j ≤ n – 2,

\max_{2 \leq k \leq n - j} ∊^{'} H (x_{j}, x_{j + k}) ∊ = O_{p} (\log \log n),

and for 3 ≤ j ≤ n,

\max_{2 \leq k \leq j - 1} ∊^{'} H (x_{j - k}, x_{j}) ∊ = O_{p} (\log \log n) .

Proof

For 2 ≤ k ≤ n – j, the rank of $H (x_{j}, x_{j + k})$ is 2, and so there exists a 2 × n matrix Q(k) with full row rank 2, such that

∊^{'} H (x_{j}, x_{j + k}) ∊ = ∊^{'} Q {(k)}^{'} Q (k) ∊ = \sum_{l = 1}^{2} u {(k)}_{l}^{2},

where Q(k)′ = (q(k)₁, q(k)₂) and $u {(k)}_{l} = q {(k)}_{l}^{'} ∊$ , l = 1, 2. Then $\sum_{l = 1}^{2} q {(k)}_{l}^{'} q {(k)}_{l} = t r (H (x_{j}, x_{j + k})) = 2$ . For each k and integer i, let $a_{l k i} = \sqrt{k} q {(k)}_{l, i}$ , so that

∊^{'} H (x_{j}, x_{j + k}) ∊ = \frac{S_{1 k}^{2} + S_{2 k}^{2}}{k},

where $S_{l k} = \sum_{i = j + 1}^{j + k} a_{l k i} ∊_{i}$ . Then by Theorem 1 of Lai and Wei (1982),

\max_{2 \leq k \leq n - j} ∊^{'} H (x_{j}, x_{j + k}) ∊ = O_{p} (\log \log n) .

Similarly, we can show that

\max_{2 \leq k \leq j - 1} ∊^{'} H (x_{j - k}, x_{j}) ∊ = O_{p} (\log \log n) .

Proof of Lemma 1

It suffices to consider the case of k = κ₀ + 1. For any t_k ∈ B_r(n), let $ξ_{k + κ_{0} + 1}$ be the ordered set of $\{t_{k}, τ_{1}^{0}, \dots, τ_{r - 1}^{0}, τ_{r}^{0} - δ_{n}, τ_{r}^{0} + δ_{n}, τ_{r + 1}^{0}, \dots, τ_{κ_{0}}^{0}\}$ . Then

S_{n} (t_{k}) - S_{n} (τ^{0}, 1) \geq S_{n} (ξ_{k + κ_{0} + 1}) - S_{n} (τ^{0}) \geq S_{n} (ξ_{k + κ_{0} + 1}) - ∊^{'} ∊ .

For $(ξ_{l - 1}, ξ_{l}] ⊄ (τ_{r}^{0} - δ_{n}, τ_{r}^{0} + δ_{n}]$ ,

S_{n} (ξ_{l - 1}, ξ_{l}) = ∊^{'} (I (ξ_{l - 1}, ξ_{l}) - H (ξ_{l - 1}, ξ_{l})) ∊ = \sum_{i \in \{i : x_{i} \in (ξ_{l - 1}, ξ_{l}]\}} ∊_{i}^{2} - O_{p} (\log n)

by Lemma A.1.

For 1 ≤ r ≤ κ₀, let $A_{r} (n) = \{i : τ_{r}^{0} - δ_{n} < x_{i} \leq τ_{r}^{0} + δ_{n}\}$ . Then

\begin{matrix} S_{n} (τ_{r}^{0} - δ_{n}, τ_{r}^{0} + δ_{n}) = & \sum_{i \in A_{r} (n)} {(E (y_{i}) + ∊_{i} - {\hat{y}}_{i})}^{2} \\ = & \sum_{i \in A_{r} (n)} \{∊_{i}^{2} + {(E (y_{i}) - {\hat{y}}_{i})}^{2} + 2 (E (y_{i}) - {\hat{y}}_{i}) ∊_{i}\} . \end{matrix}

Since E(y) is discontinuous at $τ_{r}^{0}$ while ŷ is continuous on ( $τ_{r}^{0} - δ_{n}$ , $τ_{r}^{0} + δ_{n}$ ), for some constant c₀,

{(E (y_{i}) - {\hat{y}}_{i})}^{2} > c_{0}

for $i \in I_{r} \subset A_{r} (n)$ for which #(I_r) is of order at least nδ_n. So by the weak law of large numbers, there exists c₁ > 0 such that

S_{n} (τ_{r}^{0} - δ_{n}, τ_{r}^{0} + δ_{n}) - \sum_{i \in A_{r} (n)} ∊_{i}^{2} > c_{1} δ_{n} n = c_{1} {(\log n)}^{2} .

Then

S_{n} (ξ_{k + κ_{0} + 1}) - ∊^{'} ∊ > - O_{p} (\log n) + c_{1} {(\log n)}^{2} > 0

for large n.

Proof of Lemma 2

The number of observations in subintervals ( $τ_{r}^{0} - δ_{n}, τ_{r}^{0}$ ) and ( $τ_{r}^{0}, τ_{r}^{0} + δ_{n}$ ) is of order (log n)² by Assumption 1 or 1’. Then, by Lemma A.1,

G_{n, 1} = O_{p} (\log ({(\log n)}^{2})) = O_{p} (\log \log n) .

Proof of Lemma 3

For $t_{k} = (t_{1}, \dots, t_{k}) \in \cap_{r = 1}^{κ_{0}} B_{r} {(n)}^{c}$ ,

\# \{t_{j} : t_{j} \in \cup_{r = 0}^{κ_{0}} D_{r}\} \leq k - κ_{0} .

Suppose that k – κ₀ such points, $ξ_{1}^{'}, \dots, ξ_{k - κ_{0}}^{'}$ , are in D_r for some r. Then

G_{n, 2} = ∊^{'} H (0, τ_{1}^{0} - δ_{n}) ∊ + ∊^{'} H (τ_{κ_{0}}^{0} + δ_{n}, 1) ∊ + \sum_{j \neq r} ∊^{'} H (τ_{j}^{0} + δ_{n}, τ_{j + 1}^{0} - δ_{n}) ∊ + \sum_{(ξ_{l - 1}, ξ_{l}] \subset D_{r}} ∊^{'} H (ξ_{l - 1}, ξ_{l}) ∊ .

Since ε′H(α, η)ε for given (α, η) is O_p(1), the first three terms of the right-hand side are O_p(1). So,

G_{n, 2} = O_{p} (1) + ∊^{'} H (τ_{r}^{0} + δ_{n}, ξ_{1}^{'}) ∊ + ∊^{'} H (ξ_{k - κ_{0}}^{'}, τ_{r + 1}^{0} - δ_{n}) ∊ + \sum_{j = 1}^{k - κ_{0} - 1} ∊^{'} H (ξ_{j}^{'}, ξ_{j + 1}^{'}) ∊ .

By Lemma A.2, the second and the third terms of the right-hand side of the above equation are of order log log n. Then, by Lemma A.1, it is true that for any δ > 0,

\sum_{j = 1}^{k - κ_{0} - 1} ∊^{'} H (ξ_{j}^{'}, ξ_{j + 1}^{'}) ∊ \leq (k - κ_{0} - 1) 4 (1 + δ) σ_{0}^{2} \log n,

with probability going to 1 as n → ∞. If the points $ξ_{1}^{'}, \dots, ξ_{k - κ_{0}}^{'}$ are located in distinct D_r's, we can get an even smaller upper bound.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Contributor Information

Jeankyung Kim, Department of Statistics, Inha University, 253 Yonghyundong, Namgu, Incheon, 402-751, Republic of Korea.

Hyune-Ju Kim, Department of Mathematics, Syracuse University, 215 Carnegie Building, Syracuse, NY 13244-1150, U.S.A.

References

Friedman JH. Multivariate adaptive regression splines. Annals of Statistics. 1991;19:1–67.
Hannart A, Naveau P. An improved Bayesian information criterion for multiple change-point models. Journal of the American Statistical Association. 2012;54:256–268.
Kim H-J, Fay MP, Feuer EJ, Midthune DN. Permutation tests for joinpoint regression with applications to cancer rates. Statistics in Medicine. 2000;19:335–351. doi: 10.1002/(sici)1097-0258(20000215)19:3<335::aid-sim336>3.0.co;2-z.
Kim H-J, Yu B, Feuer EJ. Selecting the number of change-points in segmented line regression. Statistica Sinica. 2009;19(2):597–609.
Lee C-B. Estimating the number of change points in exponential families distributions. Scandinavian Journal of Statistics. 1997;24:201–210.
Liu J, Wu S, Zidek JV. On segmented multivariate regression. Statistica Sinica. 1997;7:497–525.
Martinez-Beneito MA, Garcia-Donato G, Salmeron D. A Bayesian join-point regression model with an unknown number of break-points. The Annals of Applied Statistics. 2011;5:2150–2168.
Ninomiya Y. Information criterion for Gaussian change-point model. Statistics & Probability Letters. 2005;72:237–247.
Pan J, Chen J. Application of modified information criterion to multiple change point problems. Journal of Multivariate Analysis. 2006;97:2221–2241.
Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
Tiwari RC, Cronin KA, Davis W, Feuer EJ, Yu B. Bayesian model selection for join point regression with application to age-adjusted cancer rates. Applied Statistics. 2005;54:919–939.
Yao YC. Estimating the number of change-points via Schwarz criterion. Statistics and Probability Letters. 1988;6:181–189.
Zhang NR, Siegmund DO. A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics. 2007;63:22–32. doi: 10.1111/j.1541-0420.2006.00662.x.
