Author manuscript; available in PMC: 2017 Mar 1.
Published in final edited form as: J Stat Plan Inference. 2016 Mar 1;170:106–116. doi: 10.1016/j.jspi.2015.09.008

Consistent Model Selection in Segmented Line Regression

Jeankyung Kim 1, Hyune-Ju Kim 2
PMCID: PMC4742379  NIHMSID: NIHMS729695  PMID: 26858507

Abstract

The Schwarz criterion or Bayes Information Criterion (BIC) is often used to select a model dimension, and some variations of the BIC have been proposed in the context of change-point problems. In this paper, we consider a segmented line regression model with an unknown number of change-points and study asymptotic properties of Schwarz type criteria in selecting the number of change-points. Noting the overestimating tendency of the traditional BIC observed in some empirical studies, and motivated by the asymptotic behavior of the modified BIC proposed by Zhang and Siegmund (2007), we consider a variation of the Schwarz type criterion that applies a harsher penalty, equivalent to the model with one additional unknown parameter per segment. For the segmented line regression model without the continuity constraint, we prove the consistency of the number of change-points selected by a criterion with this type of modification and summarize simulation results that support the consistency. Further simulations are conducted for the model with the continuity constraint, and we empirically observe that the asymptotic behavior of this modified version of the BIC is comparable to that of the criterion proposed by Liu, Wu, and Zidek (1997).

Keywords: Bayes Information Criterion, Segmented line regression, Model selection

1 Introduction

A main concern in regression model selection is how to select the “best” set of independent variables, and the two major approaches to model selection are hypothesis testing and information criteria. In the context of change-point problems, both approaches have been applied to select the number of change-points, and their analytic and empirical properties have been investigated by many researchers. One of the most widely used information criteria is the Bayes Information Criterion (BIC) proposed by Schwarz (1978). The Schwarz criterion selects the model dimension by finding the Bayes solution that maximizes the posterior probability of the model, and Schwarz (1978) derived the following criterion by evaluating the leading terms of its asymptotic expansion:

$$SC(p) = \sup_{\theta_p} \log(\mathrm{lik}(\theta_p)) - \frac{p}{2}\log n = \log(\mathrm{lik}(\hat{\theta}_p)) - \frac{p}{2}\log n,$$

where $\mathrm{lik}(\theta_p)$ is the likelihood function of $\theta_p$ for the model with dimension p and $\hat{\theta}_p$ is the maximum likelihood estimator of $\theta_p$. As with information criteria in general, the Schwarz criterion has two parts, the log of the maximized likelihood function and a penalty on the model dimension, and the method selects the model that maximizes SC(p). Note that its validity is established in Schwarz (1978) for “the case of independent, identically distributed observations, and linear models.”
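For Gaussian errors, the maximized log-likelihood is an explicit function of the residual sum of squares, which is what links SC(p) to the RSS-based criteria appearing below. A standard computation, sketched here for convenience (not quoted from Schwarz (1978)), gives

$$\sup_{\theta_p} \log(\mathrm{lik}(\theta_p)) = -\frac{n}{2}\log\left(\frac{RSS}{n}\right) - \frac{n}{2}\left(1 + \log 2\pi\right),$$

so that, up to a constant not depending on the model,

$$-\frac{2}{n}SC(p) = \log\left(\frac{RSS}{n}\right) + \frac{p\log n}{n} + \text{const}.$$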

Yao (1988) studied the problem of selecting the number of change-points in the means of normally distributed random variables, where the total number of unknown parameters for the model with k change-points is p = 2(k + 1). For the number of change-points estimated by minimizing –SC(p) with p = 2k, Yao (1988) proved consistency. Lee (1997) considered a similar type of criterion to select the number of change-points in a sequence of random variables from an exponential family distribution. Under some mild conditions on the spacings of successive change-points, Lee (1997) proved the consistency of the number of change-points estimated by a Schwarz type criterion whose penalty term is greater than 2k(1 + ε0) log n for some ε0 > 0. Zhang and Siegmund (2007) noted that the use of the Schwarz criterion “is not theoretically justified” in their situation due to irregularities in the likelihood function and proposed a modified BIC, derived as an asymptotic approximation of the Bayes factor, to determine the number of change-points in the means of normally distributed random variables. For other types of modifications and applications to detecting mean changes, see Ninomiya (2005), Pan and Chen (2006), and Hannart and Naveau (2012).

In the context of segmented line regression, similar approaches have been proposed to select the number of change-points. Kim et al. (2000) proposed the permutation test to select the number of change-points in the segmented line regression model where segments are assumed to be continuous at change-points, called the joinpoint regression model in their paper. Kim et al. (2009) considered the traditional BIC,

$$BIC(k) = -\frac{2}{n}SC(2k) = \log\left(\frac{RSS_k}{n}\right) + \frac{2k\log n}{n},$$

where RSS_k is the residual sum of squares for the model with k change-points, and compared its performance with those of the permutation test procedure of Kim et al. (2000) and the method based on the generalized cross validation used in MARS of Friedman (1991). Note that the penalty term of $2k\log n/n$ is chosen based on the 2k + 3 unknown parameters of the joinpoint regression model with k change-points. Liu et al. (1997) considered a general segmented line regression model allowing a discontinuity at the change-points and non-Gaussian errors, proposed a penalty term of larger order than that of BIC(k), and proved the consistency of the dimension selected by minimizing their criterion:

$$MIC(k) = \log\left(\frac{RSS_k}{n - p^*}\right) + \frac{p^* c_0 (\log n)^{2+\delta_0}}{n},$$

where p* = p*(k) = (k + 1)p + k for the model with k change-points and p covariates, and c0 and δ0 are positive constants. Two Bayesian model selection methods, based on the Bayes factor and on a Bayesian version of the BIC, were developed in Tiwari et al. (2005), who investigated their empirical properties via simulations and compared their performance with that of the permutation procedure of Kim et al. (2000). Martinez-Beneito et al. (2011) also proposed a Bayesian model selection method that provides posterior probabilities and is flexible enough to work with Poisson count data.

This paper is motivated by empirical results in which the traditional BIC showed a tendency to over-estimate the number of change-points (see Table 1 of Kim et al. (2009) and Table 1 of Zhang and Siegmund (2007)). When the argument of Zhang and Siegmund (2007) is applied to segmented line regression, the penalty of the modified BIC is harsher than that of the traditional BIC, asymptotically corresponding to one additional unknown parameter per segment under some conditions. This motivated us to consider a BIC type criterion whose penalty is $4k\log n/n$ for the segmented line regression model without the continuity constraint and $3k\log n/n$ for the model with the continuity constraint. Note that for segmented line regression with k change-points, the number of unknown parameters is 3k + 3 for the model without the continuity constraint and 2k + 3 for the model with the continuity constraint. Let

$$BIC_d(k) = \log\left(\frac{RSS_k}{n}\right) + PE_d(k) = \log\left(\frac{RSS_k}{n}\right) + \frac{dk\log n}{n}, \tag{1}$$

for a penalty coefficient d. Then the traditional BICs are BIC3 for the unconstrained model and BIC2 for the constrained model.
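As a concrete illustration, here is a minimal sketch of how BICd and MIC would be evaluated from the residual sum of squares of a fitted model; the function and argument names are ours, not from any of the cited papers:

```python
import numpy as np

def bic_d(rss, n, k, d):
    """BIC_d(k) = log(RSS_k / n) + d * k * log(n) / n, as in equation (1)."""
    return np.log(rss / n) + d * k * np.log(n) / n

def mic(rss, n, k, p=1, c0=0.299, delta0=0.1):
    """MIC of Liu et al. (1997): p* = (k + 1) * p + k mean parameters, with the
    variance estimated by the degrees-of-freedom-corrected RSS / (n - p*)."""
    p_star = (k + 1) * p + k
    return np.log(rss / (n - p_star)) + p_star * c0 * np.log(n) ** (2 + delta0) / n
```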

Table 1.

Unconstrained model with k0 = 1 and xi = i/n for i = 1, ..., n. E(Y|x) = (1 + x)I(x ≤ 0.5) + (1.35 + 0.5x)I(x > 0.5)

                           σ = 0.1                   σ = 0.05
  n     Prob          BIC3    BIC4    MIC       BIC3    BIC4    MIC
  30    P̂(κ̂ < k0)    0.476   0.763   0.446     0.030   0.123   0.026
        P̂(κ̂ = k0)    0.360   0.210   0.383     0.723   0.830   0.730
        P̂(κ̂ > k0)    0.163   0.026   0.170     0.246   0.046   0.243
  50    P̂(κ̂ < k0)    0.386   0.670   0.450     0       0.010   0.003
        P̂(κ̂ = k0)    0.526   0.320   0.510     0.853   0.973   0.910
        P̂(κ̂ > k0)    0.086   0.010   0.040     0.146   0.016   0.086
  75    P̂(κ̂ < k0)    0.193   0.493   0.346     0       0.003   0.003
        P̂(κ̂ = k0)    0.746   0.506   0.623     0.923   0.980   0.976
        P̂(κ̂ > k0)    0.060   0       0.030     0.076   0.016   0.020
  150   P̂(κ̂ < k0)    0.043   0.146   0.130     0       0       0
        P̂(κ̂ = k0)    0.900   0.846   0.863     0.953   0.990   0.990
        P̂(κ̂ > k0)    0.056   0.006   0.006     0.046   0.010   0.010
  200   P̂(κ̂ < k0)    0.006   0.040   0.043     0       0       0
        P̂(κ̂ = k0)    0.963   0.956   0.953     0.963   0.996   0.996
        P̂(κ̂ > k0)    0.030   0.003   0.003     0.036   0.003   0.003

Our interest in this paper is in the asymptotic behavior of BICd, a simple model selection criterion whose penalty term has the same order as that of the traditional BIC, and we focus on asymptotic properties of BIC4 for the model without the continuity constraint and BIC3 for the model with the continuity constraint. In Section 2, we formally introduce the unconstrained model and prove the consistency of the model dimension selected by BIC4 for the unconstrained model with Gaussian errors. This result provides a consistent model selection criterion that imposes an asymptotically milder penalty than MIC does. In Section 3, we present the results of a simulation study in which we compare the performance of BIC4 with those of BIC3 and MIC. Section 4 includes empirical results and discussion for the constrained case, where the segments are constrained to be continuous at the change-points. Further discussion is presented in Section 5.

2 Selection Methods and Consistency: Unconstrained Model

Suppose that we observe (x1, y1), . . ., (xn, yn) and consider a segmented line regression model such that

$$y_i = \beta_{j,0} + \beta_{j,1} x_i + \epsilon_i, \quad \text{if } \tau_{j-1} < x_i \le \tau_j \quad (j = 1, \ldots, \kappa+1), \tag{2}$$

where κ is the unknown number of change-points, the τ's are unknown change-points with $\tau_0 = \min_i x_i - \frac{1}{n}$ and $\tau_{\kappa+1} = \max_i x_i$, and the εi are independent N(0, σ²).

Let us consider the Schwarz type criterion (1) defined above and estimate κ as

$$\hat{\kappa} = \operatorname*{arg\,min}_{0 \le k \le K} BIC_d(k),$$

where K is a pre-determined maximum number of change-points. For the segmented line regression model without the continuity constraint, (2), we consider BIC3 as the traditional BIC based on 3k + 3 unknown parameters, and we study the large sample properties of BIC4, a modified version motivated by Zhang and Siegmund (2007).

Liu et al. (1997) considered a segmented line regression model that allows a discontinuity in the mean function and non-Gaussian errors, and the main idea of their proof of the consistency of κ̂ was that the difference between the variances estimated under the true model and under a model with a larger number of change-points is of order $(\log n)^2/n$, which appears to have motivated their choice of the penalty term. Thus, the arguments used in Liu et al. (1997) cannot be applied in our situation, where the penalty term is a constant multiple of $\log n/n$. Yao (1988) proved the consistency of the number of change-points estimated by the traditional BIC, BIC2, in the context of simple mean changes. He used properties of the normal distribution to show that the difference between the variances estimated under the true model and under a model with more change-points than the true model can be bounded above by the difference between the penalty terms of the two models. However, his argument to bound this difference does not work in our situation, and the upper bound in our case is achieved with d ≥ 4 using properties of a chi-square distribution.

Rewrite the model (2) as

$$y_i = \beta_{j,0} + \beta_{j,1} x_i + \epsilon_i = \mathbf{x}_i'\boldsymbol{\beta}_j + \epsilon_i \quad \text{if } x_i \in (\tau_{j-1}, \tau_j] \quad (j = 1, \ldots, \kappa+1), \tag{3}$$

where $\mathbf{x}_i = (1, x_i)'$, $\boldsymbol{\beta}_j = (\beta_{j,0}, \beta_{j,1})'$, and the εi are independent and normally distributed with mean zero and variance σ². Without loss of generality, we assume that 0 < xi < 1 for i = 1, ..., n. Note that this model can be expressed as

$$\mathbf{y} = \sum_{j=1}^{\kappa+1} X(\tau_{j-1}, \tau_j)\boldsymbol{\beta}_j + \boldsymbol{\epsilon},$$

where $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$, $X = (\mathbf{1}, \mathbf{x})$ with $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ and $\mathbf{1} = (1, 1, \ldots, 1)'$, and $X(\alpha, \eta) = I(\alpha, \eta)X$ with $I(\alpha, \eta) = \mathrm{diag}(1\{x_1 \in (\alpha,\eta]\}, \ldots, 1\{x_n \in (\alpha,\eta]\})$ for any α < η. Note that $1_A = 1$ if A is true and 0 otherwise.

Let $D_k = \{(t_1, \ldots, t_k) : t_0 < t_1 < \cdots < t_k < t_{k+1}\}$ with $t_0 = 0$ and $t_{k+1} = 1$. For a change-point vector $\mathbf{t}_k = (t_1, \ldots, t_k) \in D_k$, denote the design matrix and the corresponding hat matrix as

$$X(\mathbf{t}_k) = \left(X(t_0, t_1), X(t_1, t_2), \ldots, X(t_k, t_{k+1})\right)$$

and

$$H(\mathbf{t}_k) = X(\mathbf{t}_k)\left(X(\mathbf{t}_k)'X(\mathbf{t}_k)\right)^{-1}X(\mathbf{t}_k)' = \sum_{j=1}^{k+1} H(t_{j-1}, t_j),$$

where H(α, η) is the projection matrix based on X(α, η). Then the sum of squared residuals at given tk is

$$S_n(\mathbf{t}_k) = \mathbf{y}'\left(I - H(\mathbf{t}_k)\right)\mathbf{y} = \sum_{j=1}^{k+1} S_n(t_{j-1}, t_j),$$

where $S_n(\alpha, \eta) = \mathbf{y}'(I(\alpha, \eta) - H(\alpha, \eta))\mathbf{y}$, and thus the sum of squared residuals for the model with k change-points is obtained as

$$RSS_k = \inf_{\mathbf{t}_k} S_n(\mathbf{t}_k) = S_n(\hat{\boldsymbol{\tau}}_k).$$

We also denote the true values of κ, σ² and τ as $\kappa_0$, $\sigma_0^2$, and $\boldsymbol{\tau}^0 = (\tau_1^0, \tau_2^0, \ldots, \tau_{\kappa_0}^0)$, respectively.
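For intuition, RSS_k can be computed by enumerating candidate change-point vectors and fitting least squares within each segment. The following is a brute-force sketch (exponential in k, so usable only for small n and K); all names are ours, and the candidate grid is restricted to observed design points for simplicity:

```python
import numpy as np
from itertools import combinations

def S_n(x, y, t):
    """S_n(t_k): total least-squares RSS over the segments (t_{j-1}, t_j]."""
    edges = [x.min() - 1.0] + list(t) + [x.max()]   # tau_0 below min x, tau_{k+1} = max x
    rss = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = (x > lo) & (x <= hi)
        if idx.sum() < 3:                           # guard against degenerate segments
            return np.inf
        X = np.column_stack([np.ones(idx.sum()), x[idx]])
        r = y[idx] - X @ np.linalg.lstsq(X, y[idx], rcond=None)[0]
        rss += r @ r
    return rss

def kappa_hat(x, y, K=4, d=4):
    """argmin over 0 <= k <= K of BIC_d(k), with RSS_k minimized by brute force."""
    n, grid = len(x), np.sort(x)[2:-2]
    crit = []
    for k in range(K + 1):
        rss_k = min(S_n(x, y, tk) for tk in combinations(grid, k))
        crit.append(np.log(rss_k / n) + d * k * np.log(n) / n)
    return int(np.argmin(crit))
```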

Assumption 1

The regressor variable x has a positive and continuous density in small neighborhoods of the true change-points τr0. Also, x is independent of the error ε.

For the case of nonrandom design points, the following assumption can replace Assumption 1. Note that the data spacing should be of order O(1/n) to satisfy this assumption.

Assumption 1’

Let {δn} be a sequence of constants for which O(1/n) ≤ δn = o(1). The number of data points in any small neighborhood of τr0 (r = 1, ..., κ0) with length δn is at least of order nδn.

Theorem 1

Suppose that the segmented line regression model without the continuity constraint, (3), satisfies Assumption 1 or 1'. Let k < κ0. Then, for any constant d > 0,

$$P\left(BIC_d(k) > BIC_d(\kappa_0)\right) \to 1$$

as n → ∞.

Proof

Let d be an arbitrary positive constant. By Lemma 5.3 and Lemma 5.4(i) of Liu et al. (1997), it is true that for some C > 0,

$$\hat{\sigma}_k^2 = \frac{RSS_k}{n} = \frac{S_n(\hat{\boldsymbol{\tau}}_k)}{n} > \sigma_0^2 + C$$

with probability going to 1 as n → ∞. Since $\hat{\sigma}_{\kappa_0}^2 = RSS_{\kappa_0}/n$ converges to $\sigma_0^2$ in probability (that is, $\hat{\sigma}_{\kappa_0}^2 \stackrel{p}{\to} \sigma_0^2$), it is true that

$$BIC_d(k) - BIC_d(\kappa_0) = \log\left(\frac{\hat{\sigma}_k^2}{\hat{\sigma}_{\kappa_0}^2}\right) + \frac{d(k - \kappa_0)}{n}\log n > 0$$

with probability going to 1 as n → ∞.

Theorem 1 guarantees that for any d > 0, κ̂ based on BICd is consistent from below. For the consistency of κ̂ from above, we now obtain the result for d ≥ 4 under the assumption that the mean function of the regression model is discontinuous at the change-points.

In order to prove that for d ≥ 4 and for k > κ0, $P(BIC_d(k) > BIC_d(\kappa_0)) \to 1$ as n → ∞, we first state the following lemmas, whose proofs are given in the Appendix. In the lemmas and Theorem 2, we let $\delta_n = (\log n)^2/n$, and for 1 ≤ r ≤ κ0, let

$$B_r(n) = \left\{\mathbf{t}_k \in D_k : |t_j - \tau_r^0| > \delta_n \text{ for } 1 \le j \le k\right\}.$$

Lemma 1

Suppose that the segmented line regression model without the continuity constraint, (3), satisfies Assumption 1 or 1'. Then, for k > κ0,

$$P\left(\hat{\boldsymbol{\tau}}_k \in \bigcup_{r=1}^{\kappa_0} B_r(n)\right) \to 0$$

as n → ∞.

Let $\boldsymbol{\xi}_{k+3\kappa_0}$ be the ordered set of $\{\mathbf{t}_k, \boldsymbol{\tau}^0, \boldsymbol{\tau}^0 - \delta_n\mathbf{1}, \boldsymbol{\tau}^0 + \delta_n\mathbf{1}\}$. Let $T_r = (\tau_r^0 - \delta_n, \tau_r^0 + \delta_n]$ for r = 1, ..., κ0. Let $D_0 = (0, \tau_1^0 - \delta_n]$, $D_{\kappa_0} = (\tau_{\kappa_0}^0 + \delta_n, 1]$, and for r = 1, ..., κ0 – 1, let $D_r = (\tau_r^0 + \delta_n, \tau_{r+1}^0 - \delta_n]$. Let

$$G_{n,1} = \sum_{(\xi_{l-1}, \xi_l] \subset \cup_{r=1}^{\kappa_0} T_r} \boldsymbol{\epsilon}' H(\xi_{l-1}, \xi_l)\, \boldsymbol{\epsilon}, \qquad G_{n,2} = \sum_{(\xi_{l-1}, \xi_l] \subset \cup_{r=0}^{\kappa_0} D_r} \boldsymbol{\epsilon}' H(\xi_{l-1}, \xi_l)\, \boldsymbol{\epsilon}.$$

Lemma 2

Suppose that the segmented line regression model without the continuity constraint, (3), satisfies Assumption 1 or 1'. Then, for large n,

$$G_{n,1} = O_p(\log\log n).$$

Lemma 3

Suppose that the segmented line regression model without the continuity constraint, (3), satisfies Assumption 1 or 1'. Then, for k > κ0 and any $\mathbf{t}_k \in \cap_{r=1}^{\kappa_0} B_r(n)^c$, it is true that for any δ > 0,

$$P\left(G_{n,2} \le \left(\delta + (k - \kappa_0 - 1)\,4(1+\delta)\right)\sigma_0^2 \log n\right) \to 1$$

as n → ∞.

Theorem 2

Suppose that the segmented line regression model without the continuity constraint, (3), satisfies Assumption 1 or 1'. Let k > κ0. Then, for any constant d ≥ 4,

$$P\left(BIC_d(k) > BIC_d(\kappa_0)\right) \to 1$$

as n → ∞.

Proof

Without loss of generality, we suppose k = κ0 + 1. Since $\hat{\sigma}_{\kappa_0}^2 \stackrel{p}{\to} \sigma_0^2$ and $\hat{\sigma}_k^2 \stackrel{p}{\to} \sigma_0^2$, so that $(\hat{\sigma}_k^2 - \hat{\sigma}_{\kappa_0}^2)/\hat{\sigma}_{\kappa_0}^2 \stackrel{p}{\to} 0$, and since $\log(1+x) \ge (1+\delta)x$ for x ≤ 0 sufficiently close to zero, it is true that for any small δ > 0,

$$\begin{aligned} BIC_d(k) - BIC_d(\kappa_0) &= \log\left(\frac{S_n(\hat{\boldsymbol{\tau}}_k)}{n}\right) - \log\left(\frac{S_n(\hat{\boldsymbol{\tau}}_{\kappa_0})}{n}\right) + \frac{d(k-\kappa_0)}{n}\log n \\ &= \log\left(1 + \frac{\hat{\sigma}_k^2 - \hat{\sigma}_{\kappa_0}^2}{\hat{\sigma}_{\kappa_0}^2}\right) + \frac{d(k-\kappa_0)}{n}\log n \\ &\ge (1+\delta)\left(\frac{\hat{\sigma}_k^2 - \hat{\sigma}_{\kappa_0}^2}{\hat{\sigma}_{\kappa_0}^2}\right) + \frac{d(k-\kappa_0)}{n}\log n \end{aligned}$$

with probability going to 1 as n → ∞.

Since $P(\hat{\boldsymbol{\tau}}_k \in \cup_{r=1}^{\kappa_0} B_r(n)) \to 0$ by Lemma 1, we may now focus on $\mathbf{t}_k$'s in $\cap_{r=1}^{\kappa_0} B_r(n)^c$. Let $\boldsymbol{\xi}_{k+3\kappa_0}$ be the ordered set of $\{\mathbf{t}_k, \boldsymbol{\tau}^0, \boldsymbol{\tau}^0 - \delta_n\mathbf{1}, \boldsymbol{\tau}^0 + \delta_n\mathbf{1}\}$. Then

$$\begin{aligned} S_n(\mathbf{t}_k) - S_n(\hat{\boldsymbol{\tau}}_{\kappa_0}) &\ge S_n(\boldsymbol{\xi}_{k+3\kappa_0}) - S_n(\boldsymbol{\tau}^0) \ge -\sum_{l=1}^{k+3\kappa_0+1} \boldsymbol{\epsilon}' H(\xi_{l-1}, \xi_l)\, \boldsymbol{\epsilon} \\ &= -\left\{\sum_{(\xi_{l-1}, \xi_l] \subset \cup_{r=1}^{\kappa_0} T_r} \boldsymbol{\epsilon}' H(\xi_{l-1}, \xi_l)\, \boldsymbol{\epsilon} + \sum_{(\xi_{l-1}, \xi_l] \subset \cup_{r=0}^{\kappa_0} D_r} \boldsymbol{\epsilon}' H(\xi_{l-1}, \xi_l)\, \boldsymbol{\epsilon}\right\} = -(G_{n,1} + G_{n,2}), \end{aligned}$$

where $\xi_0 = 0$ and $\xi_{k+3\kappa_0+1} = 1$.

By Lemma 2 and Lemma 3, the following holds for any $\mathbf{t}_k \in \cap_{r=1}^{\kappa_0} B_r(n)^c$: for any δ > 0,

$$S_n(\mathbf{t}_k) - S_n(\hat{\boldsymbol{\tau}}_{\kappa_0}) > -\left(\delta + (k - \kappa_0 - 1)\,4(1+\delta)\right)\sigma_0^2 \log n$$

and $\hat{\sigma}_{\kappa_0}^2 > \sigma_0^2(1-\delta)$, with probability going to 1 as n → ∞. Therefore, for any δ > 0,

$$BIC_d(k) - BIC_d(\kappa_0) > -\frac{(1+\delta)\left(\delta + (k - \kappa_0 - 1)\,4(1+\delta)\right)\sigma_0^2 \log n}{n\,\sigma_0^2(1-\delta)} + \frac{d(k-\kappa_0)\log n}{n}$$

with probability going to 1 as n → ∞. For d ≥ 4, we can choose δ small enough such that

$$\frac{(1+\delta)\left(\delta + (k - \kappa_0 - 1)\,4(1+\delta)\right)}{1-\delta} < d(k - \kappa_0),$$

and thus

$$BIC_d(k) - BIC_d(\kappa_0) > 0$$

with probability going to 1 as n → ∞.

Remark

Theorem 1 and Theorem 2 imply that the consistency of κ̂ is achieved by BICd for d ≥ 4. Because the penalty term of MIC for the one independent variable case with p = 1, $(2k+1)c_0(\log n)^{2+\delta_0}/n$, is larger than $4k\log n/n$ for large n, the result in this section indicates that consistent model selection can be performed with a criterion whose penalty is asymptotically milder than that of MIC, under the assumption of Gaussian errors.

3 Simulations: Unconstrained Model

This section summarizes simulations conducted to study the asymptotic behavior of BIC4, whose consistency is proved in Section 2, and to compare its performance with those of BIC3, the traditional BIC, and MIC, whose consistency is established in Liu et al. (1997). For each choice of the model parameters, 300 replications of data sets are simulated, and P(κ̂ < κ0), P(κ̂ = κ0), and P(κ̂ > κ0) are estimated, where κ0 is the true number of change-points and κ̂ (0 ≤ κ̂ ≤ K) is the estimate obtained by each selection method. The value of K was set to 4, and a final digit of 3 or 6 in the tables represents a repeating decimal (e.g., 0.163 stands for 49/300 = 0.16333...).
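A sketch of how one such replication could be generated and scored under the design of Table 1 is given below; the selector `kappa_hat` is the brute-force sketch from Section 2, so this is illustrative rather than the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_replication(n, sigma, d):
    x = np.arange(1, n + 1) / n                        # fixed design of Table 1
    mean = np.where(x <= 0.5, 1 + x, 1.35 + 0.5 * x)   # one change-point at 0.5
    y = mean + rng.normal(0.0, sigma, size=n)
    return kappa_hat(x, y, K=4, d=d)                   # BIC_d selector from Section 2

# estimated P(khat < 1), P(khat = 1), P(khat > 1) for BIC4, n = 50, sigma = 0.05
ks = np.array([one_replication(50, 0.05, d=4) for _ in range(300)])
print((ks < 1).mean(), (ks == 1).mean(), (ks > 1).mean())
```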

Tables 1, 2, and 3 summarize the results for the unconstrained model cases. Tables 1 and 2 consider the cases with one change-point for two different x configurations, fixed x in Table 1 and random x in Table 2. Table 3 is for the case with two change-points where x is fixed.

Note that MIC of Liu et al. (1997) uses the penalty term $PE(k) = p^* c_0(\ln n)^{2+\delta_0}/n = (2k+1)(0.299)(\ln n)^{2.1}/n$ for the model with one independent variable, which is based on p* = (k+1)p + k = 2k+1 with p = 1 in their equation (2.3). According to Liu et al. (1997), δ0 = 0.1 is arbitrarily chosen and then c0 = 0.299 is chosen by forcing the penalty term of MIC to equal that of the Schwarz criterion. For the model with one independent variable, the penalty term of MIC is set equal to that of BIC2 at n = 20, but MIC is not exactly the same as BIC2 when n = 20 because MIC uses an unbiased estimator of σ² while BIC2 uses the maximum likelihood estimator of σ² under the Gaussian error assumption. When the full version of MIC is compared to BICd, the penalty term of MIC is similar to that of BIC3 when n = 30 or 50 and to that of BIC4 when n = 150 or 200, which is empirically indicated in Tables 1, 2, and 3. For very large n, MIC imposes a penalty harsher than that of BIC4.
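As a rough numerical check of this comparison, one can tabulate the per-change-point penalty increments; this is our own back-of-the-envelope calculation, in which the 2/n term approximates the effect of MIC's degrees-of-freedom correction, since $\log\frac{n-p^*}{n-p^*-2} \approx 2/n$:

```python
import numpy as np

# penalty added by one extra change-point (p = 1): BIC_d adds d*log(n)/n, while
# MIC adds 2*c0*(log n)^{2.1}/n from its penalty term plus roughly 2/n from the
# increase of p* = 2k + 1 in the degrees-of-freedom-corrected variance estimate.
for n in (30, 50, 150, 200):
    logn = np.log(n)
    print(f"n={n:3d}  BIC3={3 * logn / n:.3f}  BIC4={4 * logn / n:.3f}  "
          f"MIC={2 * 0.299 * logn ** 2.1 / n + 2 / n:.3f}")
```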

Table 2.

Unconstrained model with k0 = 1 and xi ~ N(0.5, 1) for i = 1, ..., n. E(Y|x) = (1 + x)I(x ≤ 0.5) + (1.35 + 0.5x)I(x > 0.5)

                           σ = 0.2                   σ = 0.1
  n     Prob          BIC3    BIC4    MIC       BIC3    BIC4    MIC
  30    P̂(κ̂ < k0)    0.083   0.300   0.076     0       0.006   0
        P̂(κ̂ = k0)    0.703   0.666   0.706     0.760   0.943   0.760
        P̂(κ̂ > k0)    0.213   0.033   0.216     0.240   0.050   0.240
  50    P̂(κ̂ < k0)    0.016   0.076   0.023     0       0       0
        P̂(κ̂ = k0)    0.830   0.896   0.893     0.843   0.980   0.913
        P̂(κ̂ > k0)    0.153   0.026   0.083     0.156   0.020   0.086
  150   P̂(κ̂ < k0)    0       0       0         0       0       0
        P̂(κ̂ = k0)    0.953   0.990   0.990     0.956   0.996   0.996
        P̂(κ̂ > k0)    0.046   0.010   0.010     0.043   0.003   0.003

Table 3.

Unconstrained model with k0 = 2 and xi = i/n for i = 1, ..., n. E(Y|x) = (1 + x)I(x ≤ 0.5) + (1.35 + 0.5x)I(0.5 < x ≤ 0.7) + (0.75 + 1.5x)I(x > 0.7)

                           σ = 0.05                  σ = 0.02
  n     Prob          BIC3    BIC4    MIC       BIC3    BIC4    MIC
  30    P̂(κ̂ < k0)    0.590   0.873   0.590     0.026   0.113   0.026
        P̂(κ̂ = k0)    0.243   0.103   0.243     0.673   0.793   0.686
        P̂(κ̂ > k0)    0.166   0.023   0.166     0.300   0.093   0.286
  50    P̂(κ̂ < k0)    0.636   0.890   0.723     0       0       0
        P̂(κ̂ = k0)    0.306   0.106   0.250     0.823   0.966   0.893
        P̂(κ̂ > k0)    0.056   0.003   0.026     0.176   0.033   0.106
  150   P̂(κ̂ < k0)    0.190   0.386   0.376     0       0       0
        P̂(κ̂ = k0)    0.760   0.606   0.616     0.946   0.990   0.990
        P̂(κ̂ > k0)    0.050   0.006   0.006     0.053   0.010   0.010

Results in Tables 1, 2, and 3 indicate that, in the unconstrained case, the theoretically proved consistency of κ̂ estimated by BIC4 or MIC is empirically supported, with P(κ̂ = κ0) approaching one as n increases or the effect size increases. Based on Theorem 1, the choice of d is not much of an issue for under-fitting when n is large, and our interest is in the over-fitting probabilities of these selection methods when n is large. The over-fitting probabilities of BIC4 and MIC converge to zero as n increases, but the over-fitting probability of BIC3, the traditional BIC, is considerably larger than those of BIC4 and MIC even for large n, although it appears to decrease as n increases.

In small sample situations, however, the performance of these selection criteria depends more heavily on the effect size, the change size relative to the variability. For example, when both the sample size and the effect size are small (see the cases with σ = 0.1 and n = 30 or 50 in Table 1), the traditional BIC, BIC3, shows the highest probability of correct selection (PCP), while BIC4 performs very conservatively. However, when the effect size is large enough for the PCP of BIC4 to exceed 0.80, BIC4 achieves the highest PCP even in small sample cases (see the cases with σ = 0.05 and n = 30 or 50 in Table 1). Regardless of the sample size, the over-fitting probabilities of the traditional BIC, BIC3, are significantly larger than those of BIC4. The performance of MIC is similar to that of BIC3 when n = 30 or 50 and to that of BIC4 when n = 150 or 200.

4 Selection Methods in Constrained Model

Recall the segmented line regression model introduced in (2):

$$y_i = \beta_{j,0} + \beta_{j,1} x_i + \epsilon_i, \quad \text{if } \tau_{j-1} < x_i \le \tau_j \quad (j = 1, \ldots, \kappa+1).$$

When the segments are assumed to be continuous at the change-points,

$$\beta_{j,0} + \beta_{j,1}\tau_j = \beta_{j+1,0} + \beta_{j+1,1}\tau_j,$$

for j = 1, . . ., κ, and this model can also be represented as

$$y_i = \beta_0 + \beta_1 x_i + \delta_1(x_i - \tau_1)_+ + \cdots + \delta_\kappa(x_i - \tau_\kappa)_+ + \epsilon_i, \tag{4}$$

where $a_+ = \max(0, a)$.
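Representation (4) makes each candidate fit an ordinary least-squares problem in the truncated-line basis, with only the change-points profiled numerically. A grid-search sketch for illustration (function names are ours):

```python
import numpy as np

def joinpoint_rss(x, y, taus):
    """RSS of the constrained (joinpoint) model (4): least squares on the basis
    1, x, (x - tau_1)_+, ..., (x - tau_k)_+."""
    cols = [np.ones_like(x), x] + [np.maximum(x - t, 0.0) for t in taus]
    X = np.column_stack(cols)
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return r @ r

def fit_one_joinpoint(x, y):
    """Profile a single change-point tau over a grid of interior design points."""
    grid = np.sort(x)[2:-2]
    rss = [joinpoint_rss(x, y, [t]) for t in grid]
    j = int(np.argmin(rss))
    return grid[j], rss[j]
```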

For this segmented line regression model with the continuity constraint, called a joinpoint regression model in Kim et al. (2000), the total number of unknown parameters is 2k + 3 for the model with k change-points, and so the traditional BIC uses d = 2, as in the simple mean change case. We refer to this traditional BIC as BIC2. As in the unconstrained case, we consider BIC3 as a modified version for the constrained case.

Because Lemma 5.3 of Liu et al. (1997) also holds for the segmented line regression model with the continuity constraint, Theorem 1 holds for the constrained case as well. However, the arguments used to prove Theorem 2, more specifically the arguments used in the proof of Lemma 1, cannot be used to prove that the over-fitting probability of BIC3 converges to zero in the constrained case, and in this section we investigate the asymptotic behavior of BIC3 via simulations. Although detailed proofs are provided only for the unconstrained model case, Liu et al. (1997) indicated the consistency of MIC in the constrained model case as well, and the comparison of BIC3 and MIC is expected to provide some insight into the consistency of BIC3.

Tables 4, 5, and 6 include simulation results for the constrained model cases, and we used the same simulation settings as in Tables 1, 2, and 3, except that in Table 4, K = 3 is used due to excessive computation time. Tables 4, 5, and 6 include results for BIC2, the traditional BIC for the constrained model, BIC3, which we anticipate to be consistent, and MIC, whose properties are studied in Liu et al. (1997). Note that MIC of Liu et al. (1997) uses the same penalty term for both the constrained and unconstrained models, namely $(2k+1)c_0(\ln n)^{2+\delta_0}/n = (2k+1)(0.299)(\ln n)^{2.1}/n$ for the model with one independent variable.

Table 4.

Constrained model with k0 = 1 and xi = i/n for i = 1, ..., n. E(Y|x) = 1 + x – 0.5(x – 0.5)+

                           σ = 0.1                   σ = 0.05
  n     Prob          BIC2    BIC3    MIC       BIC2    BIC3    MIC
  30    P̂(κ̂ < k0)    0.613   0.810   0.803     0.046   0.176   0.156
        P̂(κ̂ = k0)    0.340   0.186   0.193     0.880   0.813   0.830
        P̂(κ̂ > k0)    0.046   0.003   0.003     0.073   0.010   0.013
  50    P̂(κ̂ < k0)    0.456   0.730   0.770     0.006   0.040   0.060
        P̂(κ̂ = k0)    0.513   0.270   0.230     0.920   0.953   0.936
        P̂(κ̂ > k0)    0.030   0       0         0.073   0.006   0.003
  75    P̂(κ̂ < k0)    0.306   0.646   0.756     0       0.003   0.013
        P̂(κ̂ = k0)    0.650   0.350   0.243     0.936   0.990   0.983
        P̂(κ̂ > k0)    0.043   0.003   0         0.063   0.006   0.003
  150   P̂(κ̂ < k0)    0.100   0.280   0.493     0       0       0
        P̂(κ̂ = k0)    0.873   0.716   0.506     0.966   0.996   1
        P̂(κ̂ > k0)    0.026   0.003   0         0.033   0.003   0
  200   P̂(κ̂ < k0)    0.023   0.116   0.353     0       0       0
        P̂(κ̂ = k0)    0.963   0.883   0.646     0.983   1       1
        P̂(κ̂ > k0)    0.013   0       0         0.016   0       0

Table 5.

Constrained model with k0 = 1 and xi ~ N(0.5, 1) for i = 1, ..., n. E(Y|x) = 1 + x – 0.5(x – 0.5)+

                           σ = 0.2                   σ = 0.1
  n     Prob          BIC2    BIC3    MIC       BIC2    BIC3    MIC
  30    P̂(κ̂ < k0)    0.100   0.290   0.276     0       0.003   0.003
        P̂(κ̂ = k0)    0.823   0.706   0.720     0.910   0.983   0.983
        P̂(κ̂ > k0)    0.076   0.003   0.003     0.090   0.013   0.013
  50    P̂(κ̂ < k0)    0.016   0.080   0.103     0       0       0
        P̂(κ̂ = k0)    0.926   0.916   0.893     0.936   0.986   0.993
        P̂(κ̂ > k0)    0.056   0.003   0.063     0.063   0.013   0.006
  150   P̂(κ̂ < k0)    0       0       0         0       0       0
        P̂(κ̂ = k0)    0.966   1       1         0.963   1       1
        P̂(κ̂ > k0)    0.033   0       0         0.036   0       0

Table 6.

Constrained model with k0 = 2 and xi = i/n for i = 1, ..., n. E(Y|x) = 1 + x – 0.5(x – 0.5)+ + (x – 0.7)+

                           σ = 0.05                  σ = 0.02
  n     Prob          BIC2    BIC3    MIC       BIC2    BIC3    MIC
  30    P̂(κ̂ < k0)    0.586   0.860   0.853     0       0.026   0.023
        P̂(κ̂ = k0)    0.363   0.140   0.146     0.920   0.956   0.960
        P̂(κ̂ > k0)    0.050   0       0         0.080   0.016   0.016
  50    P̂(κ̂ < k0)    0.503   0.816   0.890     0       0.013   0.020
        P̂(κ̂ = k0)    0.433   0.183   0.110     0.893   0.963   0.966
        P̂(κ̂ > k0)    0.063   0       0         0.106   0.023   0.013
  150   P̂(κ̂ < k0)    0.093   0.280   0.530     0       0       0
        P̂(κ̂ = k0)    0.863   0.713   0.470     0.933   0.996   1
        P̂(κ̂ > k0)    0.043   0.006   0         0.066   0.003   0

The results summarized in Tables 4, 5, and 6 empirically support that the under-fitting probabilities of BIC2, BIC3, and MIC approach zero as n increases. The over-fitting probabilities of BIC3 and MIC seem to converge to zero as the sample size increases and/or the effect size gets larger. As in the unconstrained case, the traditional BIC for the constrained case, BIC2, tends to overestimate the true number of change-points (κ), especially when the effect size is large, and the probability of overestimating κ is significantly different from zero even for n = 150 or 200 in most cases, although it seems to decrease as n increases. Via a numerical study not reported here, we also observed that the residual sums of squares of the constrained model fits usually change at a slower rate than those of the unconstrained model fits as the number of change-points increases, and thus we expect the consistency of BICd to be achieved in the constrained model case with a less severe penalty than that of BIC4, whose consistency was proved in the unconstrained model case.

5 Discussion

In this paper, we considered information-based selection criteria to estimate the number of change-points in segmented line regression, and we proved the consistency of the number of change-points estimated by BIC4 for the unconstrained model with normal errors. The simulation results summarized in Tables 1, 2, and 3 for the unconstrained model support the theoretical results that (i) the under-fitting probability of BICd converges to zero for any d > 0 (Theorem 1) and (ii) the over-fitting probability of BIC4 converges to zero (Theorem 2), while the traditional BIC, BIC3, shows a non-negligible over-fitting tendency compared to BIC4 and MIC.

For the constrained case, where the line segments are assumed to be continuous at the change-points, we examined empirical properties of BIC3, which is expected to improve the large sample performance of the traditional BIC, BIC2. That the under-fitting probability of BICd converges to zero for any d > 0 is empirically supported in Tables 4, 5, and 6. Significantly higher over-fitting probabilities of the traditional BIC, BIC2, are also observed even for large n, compared to those of BIC3 and MIC.

In general, a relatively large penalty, that is, large d, will reduce the probability of over-estimating the number of change-points, but it will raise the risk of under-estimation. In that regard, our interest in this paper was in the smallest d for which BICd produces a consistent estimator κ̂. For a normally distributed sequence of random variables, the traditional BIC, BIC2, produces a consistent estimator of κ according to Yao (1988), although its over-estimating tendency is observed even for n = 497 in Table 1 of Zhang and Siegmund (2007). Although the consistency of κ̂ in the unconstrained case was proved only for d ≥ 4 in our paper, BICd with d < 4 may provide consistent model selection, and we hope future research will improve the results in this paper.

Another related future research problem is to identify a selection procedure that selects the best model across various sample sizes and effect sizes. In our simulation study, we observed that the traditional BIC performs better when the effect size is small, and in our future research we plan to investigate small sample properties of various selection criteria. No single type of criterion may provide uniform optimality, and a revised selection method such as a weighted BIC, $\sum_d w_d \, BIC_d$, where the weights depend on the effect size, the sample size, etc., may achieve a relatively high probability of correct selection.

BIC4 and BIC3, studied in this paper for the unconstrained and constrained model cases, respectively, were motivated by the modified BIC of Zhang and Siegmund (2007), which is derived as an asymptotic approximation of the Bayes factor. The original form of the Bayes factor could be used as a selection measure instead of its asymptotic approximation. However, we have observed in a preliminary simulation study that even the performance of the MBIC derived for segmented line regression models depends on the x-configurations, and thus it may not be straightforward to study analytic properties of a criterion using the original Bayes factor.

Highlights.

  • Segmented line regression models, constrained and unconstrained, are studied.

  • Consistency of the estimated number of change-points is investigated.

  • Simulation studies are conducted to assess the performance of the proposed criteria.

Acknowledgements

A part of H.-J. Kim's research was conducted during her visit to the National Cancer Institute. J. Kim's research was supported by a research grant from Inha University.

Appendix: Proofs of Lemmas

Lemma A.1

For any δ > 0,

$$P\left(\sup_{\alpha<\eta} \boldsymbol{\epsilon}' H(\alpha, \eta)\, \boldsymbol{\epsilon} > 4\sigma_0^2(1+\delta)\log n\right) \to 0$$

as n → ∞.

Proof

Let X be a $\chi^2$ distributed random variable with k degrees of freedom. For some constant $c_k > 0$,

$$P(X > x) = c_k x^{k/2-1} e^{-x/2} + o\left(x^{k/2-1} e^{-x/2}\right)$$

as x → ∞. For given α < η, $\boldsymbol{\epsilon}' H(\alpha,\eta)\, \boldsymbol{\epsilon}/\sigma_0^2$ has a $\chi^2$ distribution with 2 degrees of freedom. So, it is true that for some constant $c_0 > 0$ and for δ > 0,

$$P\left(\sup_{\alpha<\eta} \boldsymbol{\epsilon}' H(\alpha,\eta)\, \boldsymbol{\epsilon} > 4\sigma_0^2(1+\delta)\log n\right) = P\left(\max_{x_i < x_j} \boldsymbol{\epsilon}' H(x_i, x_j)\, \boldsymbol{\epsilon} > 4\sigma_0^2(1+\delta)\log n\right) \le c_0 \frac{n(n-1)}{2}\, n^{-2-2\delta} \to 0$$

as n gets large.
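For two degrees of freedom the chi-square tail is available in closed form, which makes the union bound above explicit; as a worked check,

$$P(\chi^2_2 > x) = e^{-x/2} \quad\Longrightarrow\quad P\left(\chi^2_2 > 4(1+\delta)\log n\right) = n^{-2(1+\delta)},$$

so summing over the at most $n(n-1)/2$ distinct intervals $(x_i, x_j]$ gives a bound of order $n^{-2\delta} \to 0$.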

Lemma A.2

For $1 \le j \le n-2$,

$$\max_{2 \le k \le n-j} \boldsymbol{\epsilon}' H(x_j, x_{j+k})\, \boldsymbol{\epsilon} = O_p(\log\log n),$$

and for $3 \le j \le n$,

$$\max_{2 \le k \le j-1} \boldsymbol{\epsilon}' H(x_{j-k}, x_j)\, \boldsymbol{\epsilon} = O_p(\log\log n).$$

Proof

For $2 \le k \le n-j$, the rank of $H(x_j, x_{j+k})$ is 2, and so there exists a $2 \times n$ matrix $Q(k)$ with full row rank 2 such that

$$\boldsymbol{\epsilon}' H(x_j, x_{j+k})\, \boldsymbol{\epsilon} = \boldsymbol{\epsilon}' Q(k)' Q(k)\, \boldsymbol{\epsilon} = \sum_{l=1}^{2} u(k)_l^2,$$

where $Q(k)' = (\mathbf{q}(k)_1, \mathbf{q}(k)_2)$ and $u(k)_l = \mathbf{q}(k)_l' \boldsymbol{\epsilon}$, l = 1, 2. Then $\sum_{l=1}^{2} \mathbf{q}(k)_l' \mathbf{q}(k)_l = \mathrm{tr}(H(x_j, x_{j+k})) = 2$. For each k and integer i, let $a_{lki} = \sqrt{k}\, q(k)_{l,i}$, so that

$$\boldsymbol{\epsilon}' H(x_j, x_{j+k})\, \boldsymbol{\epsilon} = \frac{S_{1k}^2 + S_{2k}^2}{k},$$

where $S_{lk} = \sum_{i=j+1}^{j+k} a_{lki}\epsilon_i$. Then by Theorem 1 of Lai and Wei (1982),

$$\max_{2 \le k \le n-j} \boldsymbol{\epsilon}' H(x_j, x_{j+k})\, \boldsymbol{\epsilon} = O_p(\log\log n).$$

Similarly, we can show that

$$\max_{2 \le k \le j-1} \boldsymbol{\epsilon}' H(x_{j-k}, x_j)\, \boldsymbol{\epsilon} = O_p(\log\log n).$$

Proof of Lemma 1

It suffices to consider the case of k = κ0 + 1. For any $\mathbf{t}_k \in B_r(n)$, let $\boldsymbol{\xi}_{k+\kappa_0+1}$ be the ordered set of $\{\mathbf{t}_k, \tau_1^0, \ldots, \tau_{r-1}^0, \tau_r^0 - \delta_n, \tau_r^0 + \delta_n, \tau_{r+1}^0, \ldots, \tau_{\kappa_0}^0\}$. Then

$$S_n(\mathbf{t}_k) - S_n(\hat{\boldsymbol{\tau}}_k) \ge S_n(\boldsymbol{\xi}_{k+\kappa_0+1}) - S_n(\boldsymbol{\tau}^0),$$

and it suffices to show that the right-hand side is positive with probability going to 1. For $(\xi_{l-1}, \xi_l] \not\subset (\tau_r^0 - \delta_n, \tau_r^0 + \delta_n]$,

$$S_n(\xi_{l-1}, \xi_l) = \boldsymbol{\epsilon}'\left(I(\xi_{l-1}, \xi_l) - H(\xi_{l-1}, \xi_l)\right)\boldsymbol{\epsilon} = \sum_{\{i\,:\,x_i \in (\xi_{l-1}, \xi_l]\}} \epsilon_i^2 - O_p(\log n)$$

by Lemma A.1.

For $1 \le r \le \kappa_0$, let $A_r(n) = \{i : \tau_r^0 - \delta_n < x_i \le \tau_r^0 + \delta_n\}$. Then

$$S_n(\tau_r^0 - \delta_n, \tau_r^0 + \delta_n) = \sum_{i \in A_r(n)} \left(E(y_i) + \epsilon_i - \hat{y}_i\right)^2 = \sum_{i \in A_r(n)} \left\{\epsilon_i^2 + \left(E(y_i) - \hat{y}_i\right)^2 + 2\left(E(y_i) - \hat{y}_i\right)\epsilon_i\right\}.$$

Since E(y) is discontinuous at $\tau_r^0$ while ŷ is continuous on $(\tau_r^0 - \delta_n, \tau_r^0 + \delta_n)$, for some constant $c_0$,

$$\left(E(y_i) - \hat{y}_i\right)^2 > c_0$$

for $i \in I_r \subset A_r(n)$, where $\#(I_r)$ is at least of order $n\delta_n$. So, by the weak law of large numbers, there exists $c_1 > 0$ such that

$$S_n(\tau_r^0 - \delta_n, \tau_r^0 + \delta_n) - \sum_{i \in A_r(n)} \epsilon_i^2 > c_1 \delta_n n = c_1(\log n)^2.$$

Then

$$S_n(\boldsymbol{\xi}_{k+\kappa_0+1}) - S_n(\boldsymbol{\tau}^0) > -O_p(\log n) + c_1(\log n)^2 > 0$$

for large n.

Proof of Lemma 2

The number of observations in the subintervals $(\tau_r^0 - \delta_n, \tau_r^0)$ and $(\tau_r^0, \tau_r^0 + \delta_n)$ is of order $(\log n)^2$ by Assumption 1 or 1'. Then, by Lemma A.1,

$$G_{n,1} = O_p\left(\log\left((\log n)^2\right)\right) = O_p(\log\log n).$$

Proof of Lemma 3

For $\mathbf{t}_k = (t_1, \ldots, t_k) \in \cap_{r=1}^{\kappa_0} B_r(n)^c$,

$$\#\left\{t_j : t_j \in \cup_{r=0}^{\kappa_0} D_r\right\} \le k - \kappa_0.$$

Suppose that $k - \kappa_0$ such points, $\xi_1^* < \cdots < \xi_{k-\kappa_0}^*$, are in $D_r$ for some r. Then

$$G_{n,2} = \boldsymbol{\epsilon}' H(0, \tau_1^0 - \delta_n)\, \boldsymbol{\epsilon} + \boldsymbol{\epsilon}' H(\tau_{\kappa_0}^0 + \delta_n, 1)\, \boldsymbol{\epsilon} + \sum_{j \ne r} \boldsymbol{\epsilon}' H(\tau_j^0 + \delta_n, \tau_{j+1}^0 - \delta_n)\, \boldsymbol{\epsilon} + \sum_{(\xi_{l-1}, \xi_l] \subset D_r} \boldsymbol{\epsilon}' H(\xi_{l-1}, \xi_l)\, \boldsymbol{\epsilon}.$$

Since $\boldsymbol{\epsilon}' H(\alpha, \eta)\, \boldsymbol{\epsilon}$ for given (α, η) is $O_p(1)$, the first three terms of the right-hand side are $O_p(1)$. So,

$$G_{n,2} = O_p(1) + \boldsymbol{\epsilon}' H(\tau_r^0 + \delta_n, \xi_1^*)\, \boldsymbol{\epsilon} + \boldsymbol{\epsilon}' H(\xi_{k-\kappa_0}^*, \tau_{r+1}^0 - \delta_n)\, \boldsymbol{\epsilon} + \sum_{j=1}^{k-\kappa_0-1} \boldsymbol{\epsilon}' H(\xi_j^*, \xi_{j+1}^*)\, \boldsymbol{\epsilon}.$$

By Lemma A.2, the second and third terms on the right-hand side of the above equation are of order log log n. Then, by Lemma A.1, it is true that for any δ > 0,

$$\sum_{j=1}^{k-\kappa_0-1} \boldsymbol{\epsilon}' H(\xi_j^*, \xi_{j+1}^*)\, \boldsymbol{\epsilon} \le (k - \kappa_0 - 1)\,4(1+\delta)\sigma_0^2 \log n,$$

with probability going to 1 as n → ∞. If the points $\xi_1^*, \ldots, \xi_{k-\kappa_0}^*$ are located in distinct $D_r$'s, we can obtain an even smaller upper bound.


Contributor Information

Jeankyung Kim, Department of Statistics, Inha University, 253 Yonghyundong, Namgu, Incheon, 402-751, Republic of Korea.

Hyune-Ju Kim, Department of Mathematics, Syracuse University, 215 Carnegie Building, Syracuse, NY 13244-1150, U.S.A..

References

  1. Friedman JH. Multivariate adaptive regression splines. Annals of Statistics. 1991;19:1–67.
  2. Hannart A, Naveau P. An improved Bayesian information criterion for multiple change-point models. Journal of the American Statistical Association. 2012;54:256–268.
  3. Kim H-J, Fay MP, Feuer EJ, Midthune DN. Permutation tests for joinpoint regression with applications to cancer rates. Statistics in Medicine. 2000;19:335–351. doi: 10.1002/(sici)1097-0258(20000215)19:3<335::aid-sim336>3.0.co;2-z.
  4. Kim H-J, Yu B, Feuer EJ. Selecting the number of change-points in segmented line regression. Statistica Sinica. 2009;19(2):597–609.
  5. Lee C-B. Estimating the number of change points in exponential families distributions. Scandinavian Journal of Statistics. 1997;24:201–210.
  6. Liu J, Wu S, Zidek JV. On segmented multivariate regression. Statistica Sinica. 1997;7:497–525.
  7. Martinez-Beneito MA, Garcia-Donato G, Salmeron D. A Bayesian join-point regression model with an unknown number of break-points. The Annals of Applied Statistics. 2011;5:2150–2168.
  8. Ninomiya Y. Information criterion for Gaussian change-point model. Statistics & Probability Letters. 2005;72:237–247.
  9. Pan J, Chen J. Application of modified information criterion to multiple change point problems. Journal of Multivariate Analysis. 2006;97:2221–2241.
  10. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
  11. Tiwari RC, Cronin KA, Davis W, Feuer EJ, Yu B. Bayesian model selection for join point regression with application to age-adjusted cancer rates. Applied Statistics. 2005;54:919–939.
  12. Yao YC. Estimating the number of change-points via Schwarz criterion. Statistics and Probability Letters. 1988;6:181–189.
  13. Zhang NR, Siegmund DO. A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics. 2007;63:22–32. doi: 10.1111/j.1541-0420.2006.00662.x.
