Published in final edited form as: J Am Stat Assoc. 2016 Oct 18;111(515):1132–1143. doi: 10.1080/01621459.2015.1073154. Author manuscript; available in PMC 2017 Oct 18.

Fast estimation of regression parameters in a broken-stick model for longitudinal data

Ritabrata Das 1, Moulinath Banerjee 2, Bin Nan 3, Huiyong Zheng 4
PMCID: PMC5353362  NIHMSID: NIHMS814884  PMID: 28316356

Abstract

Estimation of change-point locations in the broken-stick model has significant applications in modeling important biological phenomena. In this article we present a computationally economical likelihood-based approach for estimating change-point(s) efficiently in both cross-sectional and longitudinal settings. Our method, based on local smoothing in a shrinking neighborhood of each change-point, is shown via simulations to be computationally more viable than existing methods that rely on search procedures, with dramatic gains in the multiple change-point case. The proposed estimates are shown to have √n-consistency and asymptotic normality (in particular, they are asymptotically efficient in the cross-sectional setting), allowing us to provide meaningful statistical inference. As our primary and motivating (longitudinal) application, we study the Michigan Bone Health and Metabolism Study cohort data to describe patterns of change in log estradiol levels, before and after the final menstrual period, for which a two change-point broken-stick model appears to be a good fit. We also illustrate our method on a plant growth data set in the cross-sectional setting.

Keywords: Asymptotic efficiency, Change-point, Hormone profile, Piecewise linear model

1 Introduction

In regression models, it is often assumed that the regression function has the same parametric form throughout the domain of interest. But it is also important to consider situations where the regression function takes different functional forms in separate portions of the domain. A special case is the continuous piecewise linear model, popularly referred to as the “broken-stick model”. This model is frequently useful in environmental and biological settings where the locations of the change-points are of interest. The broken-stick model with r (known a priori) change-points can be canonically written as

E(Y \mid X, Z) = \beta_0 + \beta_1 X + \sum_{k=1}^{r} \beta_{k+1} f(X, \tau_k) + Z^T \lambda, \qquad (1)

where

f(x, \tau) = (x - \tau)_+ = \begin{cases} x - \tau, & x > \tau; \\ 0, & x \le \tau; \end{cases}

and the τk’s are the ordered change-points to be estimated together with the other regression parameters, the β’s and λ. Such a modelling strategy is of particular interest for the Michigan Bone Health and Metabolism Study (MBHMS).
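For concreteness, the mean function in (1) can be sketched in a few lines of Python; the function name and argument layout below are ours, purely for illustration:

```python
import numpy as np

def broken_stick_mean(x, z, beta, tau, lam):
    """Mean function of model (1): beta = (beta_0, ..., beta_{r+1}),
    tau = ordered change-points, lam = coefficients of the covariates Z."""
    x = np.asarray(x, dtype=float)
    mean = beta[0] + beta[1] * x
    for k, t in enumerate(tau):
        mean = mean + beta[k + 2] * np.maximum(x - t, 0.0)  # (x - tau_k)_+
    if z is not None:
        mean = mean + np.asarray(z) @ np.asarray(lam)  # Z^T lambda
    return mean
```

For a single change-point at τ = 0.6 with β = (0.2, 1, 1), the slope is β1 = 1 to the left of τ and β1 + β2 = 2 to the right.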

MBHMS is a population-based longitudinal natural history study of ovarian aging conducted in a cohort of 664 White women from Tecumseh, Michigan, during their young and mid-adulthood years (ages 24–44). Sowers et al. (2008) studied the serum estradiol (E2) hormone levels in 629 women enrolled in the MBHMS over a fifteen-year period starting from 1992. The goal of Sowers et al. (2008) was to describe the E2 profile changes before and after the Final Menstrual Period (FMP). A semiparametric mixed model approach was implemented in Sowers et al. (2008), and smoothing splines were used for estimation. Referring to Fig. 1 in Sowers et al. (2008), it is clear that the mean function can be fit nicely by a piecewise linear model with multiple change-points, whose identification is of considerable significance. However, existing methods of change-point estimation in a broken-stick model are fairly slow for large sample sizes. In the particular scenario of Sowers et al. (2008), the effective sample size is of order 10^4, and hence a fast method of estimating the change-points precisely is of considerable importance.

Figure 1. qn is the smoothed version of f.

In the early literature, it was generally assumed that either the exact location of the change-point τ is known (Poirier 1973), or, at worst, it is known which two observation points τ should lie between (Robinson 1964). Also, most of the early work focused on detecting whether a change-point existed at all. In this article, however, the existence of change-point(s) is assumed and the exact number of change-points, denoted by r, is assumed known. The focus here is to propose a quick estimation procedure that gets around the non-smoothness of the model without compromising asymptotic efficiency in the process.

The principal difficulty in the estimation problem arises when the locations of the change-points are unknown. In the independent and identically distributed error case, if the locations of these change-points were known, we would have a standard linear regression problem. Even if only a relatively small set of plausible values were known, one could perform least squares for the slope and intercept parameters at each of these plausible values to find the overall least squares estimate. However, in most scenarios such knowledge is unavailable, and the set of plausible values over which one needs to search typically consists of all r-tuples of ordered Xi’s, leading to a very large number of linear regressions, \binom{n}{r} in principle, n being the sample size; see Hudson (1966).

Bellman and Roth (1969) proposed an alternative method based on dynamic programming, but this method is even slower than the previous one. Feder (1975a) considered a general class of segmented regression problems and showed that the exact least squares estimate obtained by Hudson (1966) is asymptotically normal. In particular, for the broken-stick model, the estimate is √n-consistent.

Tishler and Zang (1981) were the first to suggest estimation of change-points using a maximum likelihood approach based on smoothing. They argued that the non-differentiability of the ‘maximum’ and ‘minimum’ operators in piecewise regression was the main obstacle to using maximum likelihood. However, if these operators were substituted by smoothed versions, maximum likelihood could readily be used for fast computation. Tishler and Zang (1981) suggested using a quadratic approximation, where the length of the interval on which f is smoothed was taken to be an arbitrarily small value. However, the behavior of these estimates as the interval length shrinks to 0 was not investigated. It is clear that unless the length of the interval is allowed to decrease with the sample size, their algorithm cannot yield a consistent estimate.

Recent articles on broken-stick models include Bhattacharya (1987), Huskova (1998) and Muggeo (2003). While the first two deal with theory, Muggeo (2003) develops an estimation strategy but does not provide any asymptotic results and thus fails to address the theoretical efficiency of the approach. For a detailed description of Bayesian methods of change-point estimation, refer to Chen et al. (2011).

In sum, the lack of a suitable method of estimation that is optimal in terms of both precision and computational economy has forced statisticians to fall back on the search-based algorithm of Hudson or related algorithms. Our paper fills this gap in the literature.

We should note here that alternative approaches to studying ‘kink-type’ phenomena have also been investigated. Chiu et al. (2006) suggested that in certain scenarios, instead of using the broken-stick as the true model, it might be better to use what they referred to as the “bent-cable model”, which is a quadratic smoothing of the broken-stick model in a γ-neighborhood around τ. Their change-point parameter was defined as the mid-point of the interval (τ − γ, τ + γ) on which the smoothing was done; here both τ and γ are unknown parameters. It was shown that τ̂n, the least squares estimate of τ, is √n-consistent. Also, for γ0 > 0, γ̂n is √n-consistent as well. In a previous article, Chiu et al. (2002) had shown that for γ0 = 0, i.e., when the smooth model reduces to the broken-stick model, the asymptotics are complex and γ̂n is at most n^{1/3}-consistent.

In this article, equation (1) is used as the true mean model. Ideally, one would want to minimize the residual sum of squares in this model by a Newton-Raphson type algorithm, but this is not viable owing to the non-differentiability of f at τ. To this end, we use a continuously differentiable perturbation of f, denoted by qn, as our working model, where qn coincides with f outside a shrinking neighborhood of τ, say (τ − γn, τ + γn), with γn, a user-specified tuning parameter, going to 0. Because qn is differentiable, the minimization can be done by Newton’s algorithm very quickly. For the iid error case, we show that our estimate of τ is indeed √n-consistent for τ, and furthermore has an asymptotic normal distribution with the same asymptotic variance as the exact least squares estimate of Hudson (1966). For the longitudinal model, the same method yields √n-consistent and asymptotically normal estimates for the change-points even under misspecified variance structures.

In Sections 2 and 3, we introduce the model in the cross-sectional and longitudinal set-ups, respectively, and outline the main steps of the estimation. The main theoretical results are presented along with the main ideas of the proofs. Section 4 contains simulation results indicating the efficiency of the proposed method, while the method is applied to two real data sets: plant growth data (cross-sectional study) in Section 5.1 and estradiol profile analysis (longitudinal study) in Section 5.2. Sketches of the proofs are provided in the appendices, while the details are presented in the Supplement.

2 Cross-sectional Study

We assume that the covariate X is contained in [M1, M2]. The regression parameter in the cross-sectional set-up (1) is denoted by θᵀ = (βᵀ, τᵀ, λᵀ). We assume θ belongs to a compact set Θ = 𝔹 × [M1+δ, M2−δ]^r × Λ with τk < τk+1, k = 1, 2, …, r−1, where |M1|, |M2| < ∞ and δ is a known small positive constant, indicating that the change-points need to be well separated from the boundaries of the X-space. Without loss of generality, take M1 = 0 and M2 = M. We write βᵀ = (β0, β1, …, βr+1) ∈ 𝔹, where β0 is the intercept and β1 + ⋯ + βk is the slope of the kth segment, k = 1, 2, …, r + 1. We assume 𝔹 is a compact set in ℝ^{r+2} and Λ is a compact set in ℝ^l; the restriction 0 < ζ ≤ |βk| ≤ B < ∞, 2 ≤ k ≤ r + 1, is imposed for the sake of identifiability. We also write τᵀ = (τ1, τ2, …, τr), with τk denoting the kth smallest change-point; identifiability again requires the conditions τk < τk+1, k = 1, 2, …, r − 1, τ1 ≥ δ and τr ≤ M − δ. The errors εi are assumed to be independent and identically distributed with mean 0 and variance σ². The true parameter vector θ0 = (β0⁰, β1⁰, …, βr+1⁰, τ1⁰, τ2⁰, …, τr⁰, λ⁰)ᵀ is assumed to be an interior point of the compact set Θ. Our data are independent and identically distributed observations {Yi, Xi, Zi}, i = 1, …, n, from (1), and henceforth ℙn denotes the empirical measure of the data. Note that Yi and Xi are scalars, while Zi is an l-dimensional vector of covariates with no change-points.

2.1 Estimation

Define,

M(\theta, x, y, z) = \{y - E_\theta(Y \mid X = x, Z = z)\}^2 = \Big[y - \Big\{\beta_0 + \beta_1 x + \sum_{k=1}^{r} \beta_{k+1} f(x, \tau_k) + z^T \lambda\Big\}\Big]^2.

The exact least squares procedure aims to obtain the minimizer of

\mathbb{P}_n(M(\theta, X, Y, Z)) = \frac{1}{n} \sum_{i=1}^{n} M(\theta, X_i, Y_i, Z_i).

As f(x, τ) is not differentiable at τ, one cannot obtain the minimizer of ℙn(M(θ, X, Y, Z)) ≡ ℙn(M(θ)) (for notational convenience) by Newton’s algorithm. So, we resort to minimizing ℙn(Mn(θ)), where

M_n(\theta) \equiv M_n(\theta, x, y, z) = \Big[y - \Big\{\beta_0 + \beta_1 x + \sum_{k=1}^{r} \beta_{k+1} q_n(x, \tau_k) + z^T \lambda\Big\}\Big]^2

is our working model, a smoothed approximation of M(θ): each of the f’s in M(θ) is replaced by its corresponding smoothed version qn to obtain Mn(θ).

As far as the functional form of qn is concerned, the motivation lies in the work of Tishler and Zang (1981) and Chiu et al. (2006). Tishler and Zang (1981) suggested using a quadratic approximation, where the length of the interval on which f is smoothed was taken to be an arbitrarily small value, while Chiu et al. (2006) treated the length of the corresponding interval as a parameter. We consider the same functional form for qn as in these papers, but in our model, the length of the interval on which we smooth f is a user-specified tuning parameter shrinking to 0 at an appropriate rate as n → ∞. More specifically,

q_n(x, \tau) = \begin{cases} 0, & x < \tau - \gamma_n; \\ \dfrac{(x - \tau + \gamma_n)^2}{4\gamma_n}, & \tau - \gamma_n \le x \le \tau + \gamma_n; \\ x - \tau, & x > \tau + \gamma_n; \end{cases} \qquad (2)

where γn is a deterministic sequence that approaches zero as n → ∞.
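The smoothing in (2) is straightforward to implement; the sketch below (function names ours) gives qn alongside f and lets one check that qn coincides with f outside the γn-neighborhood of τ:

```python
import numpy as np

def f_plus(x, tau):
    """The kink f(x, tau) = (x - tau)_+ of the broken-stick model."""
    return np.maximum(np.asarray(x, dtype=float) - tau, 0.0)

def q_n(x, tau, gamma_n):
    """Quadratic smoothing of f on [tau - gamma_n, tau + gamma_n], eq. (2)."""
    x = np.asarray(x, dtype=float)
    return np.where(
        x < tau - gamma_n, 0.0,
        np.where(x > tau + gamma_n, x - tau,
                 (x - tau + gamma_n) ** 2 / (4.0 * gamma_n)))
```

One can verify that qn is continuously differentiable: at x = τ − γn both the value and the slope of the quadratic piece are 0, and at x = τ + γn its value is γn and its slope is 1, matching x − τ.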

Define {θ̂n} as a sequence in Θ that solves the surrogate empirical estimating equation 𝕌n(θ) = ∂ℙn(Mn(θ))/∂θ = 0, with probability increasing to 1. The existence of such a {θ̂n} is shown in the Supplement. In our proposed method, we use the Newton-Raphson algorithm to solve 𝕌n(θ) = 0. It is not clear whether θ̂n is uniquely defined, but this is not an issue: for our asymptotics, how we pick the zero is immaterial, meaning that the results hold true for any choice of zero. Our numerical results, however, suggest that there is generally a unique zero.
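A minimal Gauss-Newton sketch of this scheme for a single change-point without Z is given below; this is our own illustrative code, not the authors' implementation, and the interior clipping of τ merely mimics the compactness assumption on Θ:

```python
import numpy as np

def fit_broken_stick(x, y, gamma_n, theta0, n_iter=100):
    """Gauss-Newton sketch for minimizing P_n(M_n(theta)), one change-point,
    no Z.  theta = (b0, b1, b2, tau); the kink f is replaced by its smoothed
    version q_n from (2).  Hypothetical helper for illustration only."""
    b0, b1, b2, tau = map(float, theta0)
    for _ in range(n_iter):
        q = np.where(x < tau - gamma_n, 0.0,
                     np.where(x > tau + gamma_n, x - tau,
                              (x - tau + gamma_n) ** 2 / (4.0 * gamma_n)))
        # derivative of q_n with respect to tau on the three pieces
        dq_dtau = np.where(x < tau - gamma_n, 0.0,
                           np.where(x > tau + gamma_n, -1.0,
                                    -(x - tau + gamma_n) / (2.0 * gamma_n)))
        resid = y - (b0 + b1 * x + b2 * q)
        # Jacobian of the smoothed mean with respect to (b0, b1, b2, tau)
        J = np.column_stack([np.ones_like(x), x, q, b2 * dq_dtau])
        step, *_ = np.linalg.lstsq(J, resid, rcond=None)
        b0, b1, b2, tau = np.array([b0, b1, b2, tau]) + step
        tau = float(np.clip(tau, 0.05, 0.95))  # keep tau in the interior, cf. [delta, M - delta]
    return np.array([b0, b1, b2, tau])
```

In practice one would iterate until the step size falls below a tolerance rather than for a fixed number of iterations.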

Remark 1

The assumed compactness of the parameter space facilitates the mathematical arguments leading to consistency of our estimator. This has little bearing on the implementation of the Newton-Raphson algorithm in practice: in applications, the compact set 𝔹 for β is typically not known a priori, so by thinking of the set as being arbitrarily large, the iterates of the algorithm, whenever it converges, can be viewed as living in the set under consideration.

2.2 Asymptotic Results

For model (1) we consider the following regularity conditions.

Condition 1.1

There are r distinct change-points τ1, …, τr in model (1) for a fixed integer r ≥ 1; r is known.

Condition 1.2

Covariate X ∈ [0, M], M < ∞, follows a continuous distribution FX such that pr(τk < X ≤ τk+1) > 0, k = 0, …, r with τ0 = 0 and τr+1 = M. The joint distribution of Z is denoted by FZ.

Condition 1.3

Changes of slope parameters satisfy 0 < ζ ≤ |βk| ≤ B < ∞, k = 2, 3, …, r + 1, for some constants ζ and B.

Theorem 1

Under Conditions 1.1–1.3, θ̂n is a consistent estimator for θ0 for any deterministic sequence γn → 0 as n → ∞.

The proof of Theorem 1 consists of two major steps. The first is to show that θ0 is the unique solution of U(θ) = ∂P(M(θ))/∂θ = 0, and the second is to show ‖𝕌n − U‖ = op(1); here Pf = ∫ f dP, P being the probability measure that generates the data, and ‖·‖ is the supremum norm. For the case with a single change-point and the covariate Z absent, a sketch of the proof is presented in Appendix 1, while the details are provided in the Supplement. The proof for the case with multiple change-points or with other covariates is an exercise involving extensive algebraic derivations along the same lines.

Theorem 2

For γn = n^{−α} with α > 1/2, under Conditions 1.1–1.3, we have that n^{1/2}(θ̂n − θ0) converges in distribution to a normal random variable with mean 0 and covariance matrix 2\sigma^2 \dot U_*^{-1}(\theta_0). The kl-th element of the matrix \dot U_*(\theta_0) is (\dot U_*(\theta_0))_{kl} = 2P(H^T(\theta_0)H(\theta_0))_{kl}, where

H(\theta) = \big(1,\ X,\ f(X, \tau_1), \dots, f(X, \tau_r),\ \beta_2 1(X > \tau_1), \dots, \beta_{r+1} 1(X > \tau_r),\ Z^T\big)_{1 \times (2 + 2r + l)}.

The proof of Theorem 2 also consists of two major steps. To this end, define θn as the root of Un(θ) = ∂P(Mn(θ))/∂θ = 0 in Θ closest to θ0, for sufficiently large n. That θn is well-defined follows from the arguments showing that θ̂n is well-defined (in the Supplement). The first step is to show that θn converges to θ0 at a rate faster than n^{−1/2}; in fact, the rate is γn. The second is to show the asymptotic normality of n^{1/2}(θ̂n − θn). Both steps rely on Taylor series expansions. For notational simplicity, we provide a sketch of the proof for the case with a single change-point and Z absent in Appendix 1. Please refer to the Supplement for the detailed proof. The case with multiple change-points and other covariates is again a straightforward extension.
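In practice, Theorem 2 yields plug-in standard errors: since the kl-th element of \dot U_*(\theta_0) is 2P(H^T H)_{kl}, the asymptotic covariance 2\sigma^2 \dot U_*^{-1}(\theta_0)/n reduces to \sigma^2 (\sum_i H_i^T H_i)^{-1} evaluated at the estimates, exactly as in linear regression. A sketch for the single change-point case without Z (helper name ours):

```python
import numpy as np

def broken_stick_se(x, resid, beta, tau):
    """Plug-in standard errors for (b0, b1, b2, tau) from Theorem 2,
    single change-point, no Z: Var(theta_hat) ~ sigma^2 (H^T H)^{-1}
    with H evaluated at the estimates.  Hypothetical helper."""
    x = np.asarray(x, dtype=float)
    H = np.column_stack([
        np.ones_like(x),                      # 1
        x,                                    # X
        np.maximum(x - tau, 0.0),             # f(X, tau)
        beta[2] * (x > tau).astype(float),    # beta_2 1(X > tau)
    ])
    sigma2 = np.mean(np.asarray(resid) ** 2)  # residual variance estimate
    cov = sigma2 * np.linalg.inv(H.T @ H)
    return np.sqrt(np.diag(cov))
```

The fourth entry is the standard error of τ̂n; by Corollary 1 below it agrees asymptotically with the exact least squares theory.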

The following Corollary shows that our proposed local smoothing method does not lose any efficiency. Its proof is provided in the Supplement.

Corollary 1

The asymptotic distribution of our estimate, as stated in Theorem 2, is exactly the same as that in Feder (1975b) for the exact least squares estimate in the broken stick model.

Remark 2

Note that, for the sake of notational convenience and to keep the results and proofs terse, only one variable with change-points has been included in model (1). However, the method works equally well for a model containing multiple variables, with multiple change-points in each variable, and the results are analogous.

3 Longitudinal study

The model for the longitudinal study with a broken-stick mean function is

E(Y_{ij} \mid X_{ij}, Z_{ij}) = \mu_{ij} = \beta_0 + \beta_1 X_{ij} + \sum_{k=1}^{r} \beta_{k+1} f(X_{ij}, \tau_k) + Z_{ij}^T \lambda, \qquad (3)

where Yij is the response of the ith subject at the jth time-point (tij), Xij denotes the corresponding covariate with r change-points, and Zij is an l-vector of other covariates with simple linear effects on Yij, j = 1, …, mi, i = 1, …, n.

For the regression parameters, βT = (β0, β1, ‥, βr+1), we have the same assumptions as in the cross-sectional model. We assume the effect sizes λ ∈ Λ (a compact set in ℝl) and τ is the vector of change-points, as before. Here, θT = (βT, τT, λT) is our parameter of interest and θ0 is the true value of θ.

As far as the variance function is concerned, we postulate the following form:

\mathrm{Cov}(Y_{ij}, Y_{ik}) = g(\eta, t_{ij}, t_{ik}), \qquad (4)
\mathrm{Cov}(Y_{ij}, Y_{lk}) = 0, \quad i \ne l;\ j, k = 1, \dots, m_i;\ i = 1, \dots, n,

where η is the vector of covariance parameters. We assume that the observations across individuals are independent and the correlation between different observations of the same individual can depend on the time-points but not on the mean parameters, θ.

Yi is used to denote the vector of mi observations for the i–th individual, i = 1, …, n. Y = (Y1, …, Yn) is the vector of all responses. We use similar definitions for Xi and X. Let Σ(i) denote the dispersion matrix of Yi and Σ the dispersion matrix of Y. The true dispersion matrix is denoted by Σ0, which can be written as Σ(η0), η0 being the true value of η.

To establish the asymptotic results rigorously, the problem needs to be cast in a proper mathematical framework. We assume that the number of repeated measures is a random variable D taking values in the set {1, 2, …, L} with probabilities p1, p2, …, pL, respectively. Note that L is assumed fixed and known. We also have a triangular array of X-values,

\begin{array}{cccc}
X_1^{(1)} & & & \\
X_1^{(2)} & X_2^{(2)} & & \\
X_1^{(3)} & X_2^{(3)} & X_3^{(3)} & \\
\vdots & & & \\
X_1^{(L)} & X_2^{(L)} & X_3^{(L)} & \cdots\ X_L^{(L)}.
\end{array}

When D = d, the d-th row of this array is selected as the set of X-covariates. The same is true for the Z-covariates and measurement errors {εij}. We assume {εij}’s are uncorrelated with {Xij}’s and {Zij}’s. Thus,

Y(D) = \beta_0 + \beta_1 X(D) + \sum_{k=1}^{r} \beta_{k+1} f(X(D), \tau_k) + Z(D)^T \lambda + \varepsilon(D) = \mu(D) + \varepsilon(D).

Note that here f : ℝ^{D+1} → ℝ^{D} is defined as f(x, τ) = ((x1 − τ)_+, …, (xD − τ)_+)ᵀ for any x ∈ ℝ^{D} and τ ∈ ℝ. This is the general definition of the f function, and (1) is a special case for D = 1. The observation for each individual consists of (D, Y(D), X(D), Z(D)), and our data comprise n iid copies of this array. As with most inference methods in longitudinal studies, we allow for ignorable dropouts (Rubin 1976).

3.1 Estimation

The estimation process is divided into three steps:

  • Step 1 : Assume working independence, i.e., take Σ(i) = I, i = 1, …, n. As in the cross-sectional study, replace each of the f’s by its smoothed version qn. Then find the corresponding estimate θ̂n^(I) as the solution to the estimating equation
    \frac{\partial}{\partial \theta} \mathbb{P}_n\big[(Y(D) - \mu_n(D))^T (Y(D) - \mu_n(D))\big] = 0,
    where μn(D) is the smoothed version of μ(D).
  • Step 2 : Then use θ^n(I) to estimate the nuisance parameter η. The specifics will depend on the nature of the covariance function g in (4).

  • Step 3 : Use η̂n obtained in Step 2 to estimate θ. The final θ̂n is the solution to the estimating equation
    \frac{\partial}{\partial \theta} \mathbb{P}_n\big[(Y(D) - \mu_n(D))^T \hat\Sigma_n^{-1} (Y(D) - \mu_n(D))\big] = 0,
    where \hat\Sigma_n is the block-diagonal dispersion matrix based on η̂n.
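The details of Steps 2 and 3 depend on the postulated covariance g in (4). As a hedged illustration, suppose g corresponds to an AR(1) correlation over equally spaced visits (our assumption, for concreteness); η = ρ could then be estimated from the working-independence residuals by a lag-one moment estimator, and the inverse blocks of Σ̂n assembled from it:

```python
import numpy as np

def estimate_rho(resid_by_subject):
    """Step 2 under an assumed AR(1) structure: lag-one regression
    estimate of rho from working-independence residuals (illustrative)."""
    num = sum(np.sum(r[:-1] * r[1:]) for r in resid_by_subject)
    den = sum(np.sum(r[:-1] ** 2) for r in resid_by_subject)
    return num / den

def ar1_inverse(rho, m):
    """Step 3 ingredient: inverse of the m x m AR(1) correlation block."""
    R = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    return np.linalg.inv(R)
```

Step 3 then re-solves the estimating equation of Step 1 with each subject's squared-error term weighted by its ar1_inverse block.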

3.2 Asymptotic Results

For the longitudinal model, we consider regularity conditions similar to those for the cross-sectional model.

Condition 2.1

There are r distinct change-points τ1, …, τr in model (3) for a fixed integer r ≥ 1; r is known.

Condition 2.2

Conditional on D = d, covariate X ∈ [0, M]d, M < ∞, follows a continuous distribution FX such that pr(τk < Xj ≤ τk+1) > 0, for some j = 1, …, d, for all k = 0, …, r with τ0 = 0 and τr+1 = M. Also we assume the covariates Z follow a joint distribution FZ.

Condition 2.3

Changes of slope parameters satisfy 0 < ζ ≤ |βk| ≤ B < ∞, k = 2, 3, …, r + 1, for some constants ζ and B.

Condition 2.4

There exists a positive definite matrix W such that the estimated covariance matrix Σ̂n satisfies \sqrt{n}(\hat\Sigma_n - W) = O_p(1).

Theorem 3

Under Conditions 2.1–2.4,

  1. The estimator θ̂n is consistent for θ0 given any deterministic sequence γn → 0 as n → ∞.

  2. For γn = n−α with α > 1/2, n1/2(θ̂n − θ0) converges in distribution to a normal random variable with mean 0 and covariance matrix

K(W^{-1}) = 2 \sum_{d=1}^{L} p_d\, P^{(d)}\Big[\big(H^T(\theta_0) W^{-1} H(\theta_0)\big)^{-1} \big(W^{-1} H(\theta_0)\big)^T \Sigma_0 \big(W^{-1} H(\theta_0)\big) \big(H^T(\theta_0) W^{-1} H(\theta_0)\big)^{-1}\Big],

where

H(\theta) = \big(1,\ X,\ f(X, \tau_1), \dots, f(X, \tau_r),\ \beta_2 1(X > \tau_1), \dots, \beta_{r+1} 1(X > \tau_r),\ Z\big)_{d \times (l + 2r + 2)};

here P(d)f = ∫ fdP(d), P(d) being the probability measure that generates the data given D = d.

Remark 3

If the matrix W in Condition 2.4 is indeed the true covariance matrix Σ0, i.e., Σ̂n is a √n-consistent estimate of Σ0, then for γn = n^{−α} with α > 1/2, n^{1/2}(θ̂n − θ0) converges in distribution to a normal random variable with mean 0 and covariance matrix

K(\Sigma_0^{-1}) = 2 \sum_{d=1}^{L} p_d\, P^{(d)}\Big[\big(H^T(\theta_0) \Sigma_0^{-1} H(\theta_0)\big)^{-1}\Big].

The proof of Theorem 3 is similar to the proofs of Theorems 1 and 2 in Section 2. The main proof is divided into three major parts. First, we show that \sqrt{n}(\hat\theta_n^{(I)} - \theta_0) converges to N(0, K(I)) in distribution. Next, we prove that \sqrt{n}(\hat\theta_n^{(W^{-1})} - \theta_0) converges to N(0, K(W^{-1})) in distribution; here, \hat\theta_n^{(W^{-1})} is defined as a zero of \mathbb{U}_n^{(W^{-1})}(\theta) = \frac{\partial}{\partial \theta} \mathbb{P}_n\big[(Y - \mu_n)^T W^{-1} (Y - \mu_n)\big], with probability increasing to 1. Finally, we show that \sqrt{n}(\hat\theta_n - \hat\theta_n^{(W^{-1})}) = o_p(1), which proves Theorem 3.

For the sake of notational convenience, the proof is presented in Appendix 2 for r = 1 and for fixed visit-times, i.e., D ≡ m, or equivalently mi = m for all i = 1, …, n. Also for brevity, we exclude the time-invariant covariates Z in the proof.

Though the proof provided in Appendix 2 is for a fixed number of visit-times, the result holds for a variable number of visit-times, as stated in Theorem 3. Notice that, conditional on D = d (an event of probability pd), it is shown that n^{1/2}(θ̂n − θ0) converges in distribution to a normal random variable with mean 0 and covariance matrix

K(W^{-1}) = 2 P^{(d)}\Big[\big(H^T(\theta_0) W^{-1} H(\theta_0)\big)^{-1} \big(W^{-1} H(\theta_0)\big)^T \Sigma_0 \big(W^{-1} H(\theta_0)\big) \big(H^T(\theta_0) W^{-1} H(\theta_0)\big)^{-1}\Big].

Now, the result of Theorem 3 easily follows.

Corollary 2

Denoting the mean function at X = x, Z = z by μ(x, z, θ), we have that \sqrt{n}(\mu(x, z, \hat\theta_n) - \mu(x, z, \theta_0)) converges in distribution to a normal random variable with mean zero and variance a^T K(W^{-1}) a, where a^T = \big(1,\ x,\ f(x, \tau_1^0), \dots, f(x, \tau_r^0),\ \beta_2^0 1(x > \tau_1^0), \dots, \beta_{r+1}^0 1(x > \tau_r^0),\ z^T\big).

This result is useful in providing pointwise confidence bands for the broken-stick mean function as illustrated in Fig 4. The proof is provided in the Supplement.

Figure 4. E2 profile analysis at baseline mean BMI for a non-smoker: the solid line represents the mean estimator using the two change-point broken-stick model, and the short-broken lines the corresponding pointwise 95% confidence bands; the long-broken lines represent the smooth estimator of the mean function from the semiparametric mixed effects model using the same method as in Sowers et al. (2008); the shaded regions represent the 95% confidence intervals for the two change-points.

Remark 4

Note that the estimated confidence band for the mean at τ̂n, as provided by Corollary 2, is discontinuous. The asymptotic distribution of \sqrt{n}(\mu(\hat\tau_n, z, \hat\theta_n) - \mu(\tau_0, z, \theta_0)) is not obtainable as a direct application of this result via the delta method; it requires separate calculations.

4 Simulations

4.1 Cross-sectional set-up

Simulations were conducted to compare the proposed method with the existing one. Models with one and two change-points were both considered, with sample sizes n = 50, 100, 500, 1000, 5000. For each of the two models, for a fixed n, 3 different sets of values of θ were considered within the domain of interest. For each value of θ, the proposed and existing (Hudson 1966) methods were both repeated N = 1000 times. The run-times, a measure of computational efficiency, for each method were then averaged over these 1000 repetitions and over the 3 different values of θ; this was done to average out discrepancies caused by individual θ’s. The error standard deviation σ was taken to be 0.1 in all cases, and M = 1. For all simulations, α was taken to be 1. The simulations were carried out on a 1.6 GHz Intel(R) Core(TM) i7 system with 8 GB RAM running a 64-bit OS.
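For reference, a data-generating sketch for the one change-point set-ups is given below; taking X ~ Uniform(0, M) is our assumption for illustration, and the tabulated θ values list (β0, β1, β2, τ):

```python
import numpy as np

def simulate_one_cp(n, theta, sigma=0.1, rng=None):
    """One cross-sectional data set: theta = (b0, b1, b2, tau),
    X ~ Uniform(0, 1) (assumed), iid N(0, sigma^2) errors.  Illustrative
    reconstruction of the simulation design, not the authors' code."""
    rng = np.random.default_rng(rng)
    b0, b1, b2, tau = theta
    x = rng.uniform(0.0, 1.0, size=n)
    y = b0 + b1 * x + b2 * np.maximum(x - tau, 0.0) + rng.normal(0.0, sigma, size=n)
    return x, y
```

For example, set-up A of Table 2 corresponds to theta = (0.2, 1, 1, 0.6) with sigma = 0.1.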

4.1.1 Results

From Table 1, it is clear that the proposed method is much faster than the exact least squares method, especially for two change-point problems. Tables 2 & 3 indicate that the change-point estimates of the proposed method are almost as accurate as the exact least squares estimates. The biases are close to zero for all sample sizes, especially for the large samples. The standard deviation estimates are very close to the sample standard deviations, indicating that our standard deviation estimates work well, especially for large samples. The estimates are also very close to the theoretical standard deviations, indicating the asymptotic efficiency of our estimates. Although the biases and variances for the β’s have not been tabulated for the sake of brevity, we observed that our β estimates also have Mean Squared Errors comparable to those of their respective exact least squares estimates.

Table 1.

Simulation results comparing the run-times of the existing (Hudson 1966) and proposed methods for the one and two change-point(s) models, along with the ratio of the time taken by the existing method to that of the proposed one.

Sample      One change-point (seconds)        Two change-points (seconds)
size n    Existing  Proposed  Ratio         Existing  Proposed   Ratio
50          0.18     0.006      30             2.36      0.02      118
100         0.30     0.008      38            13.86      0.03      462
500         0.97     0.02       49            64.87      0.06     1015
1000        1.89     0.03       63           947.03      0.08    11838
5000        4.98     0.06       83         22843         0.20   114215
Table 2.

Bias and variances for the change-point estimate τ̂n compared for one change-point problem in 3 setups: A: θT = (0.2, 1, 1, 0.6), B: θT = (0.3, 1.5, 1, 0.8) & C: θT = (0.3, 1.5,−1, 0.2) (S.D.: Average of estimated standard deviations over 1000 replications; Emp. S.D. : Sample standard deviation based on 1000 replications).

Sample size n    Existing: Bias (S.D., Emp. S.D.)    Proposed: Bias (S.D., Emp. S.D.)    Theoretical S.D.    [all entries × 10⁻³]
Set-up A
50 −8.2 (58.1, 60.3) −12.4 (58.3, 61.0) 57.8
100 −4.3 (41.0, 41.2) −5.2 (41.0, 41.3) 40.9
500 −2.1 (18.3, 18.4) −2.9 (18.3, 18.5) 18.3
1000 −0.6 (12.9, 12.9) −0.9 (12.9, 13.0) 12.9
5000 −0.0 (5.8, 5.8) −0.1 (5.8, 5.8) 5.8
Set-up B
50 −9.2 (71.6, 73.2) −14.1 (71.6, 73.6) 70.7
100 −4.7 (50.1, 50.3) −5.9 (50.1, 50.4) 50.0
500 −2.8 (22.4, 22.4) −3.3 (22.4, 22.5) 22.3
1000 −0.9 (15.8, 15.8) −1.2 (15.8, 15.8) 15.8
5000 −0.1 (7.1, 7.1) −0.1 (7.1, 7.1) 7.1
Set-up C
50 8.1 (71.6, 73.3) 12.8 (71.6, 73.5) 70.7
100 4.7 (50.1, 50.4) 6.2 (50.1, 50.5) 50.0
500 1.8 (22.4, 22.4) 2.0 (22.5, 22.6) 22.3
1000 0.2 (15.8, 15.8) 0.4 (15.8, 15.8) 15.8
5000 0.1 (7.1, 7.1) 0.1 (7.1, 7.1) 7.1
Table 3.

Bias and variances for the change-point estimate τ̂n compared for two change-points problem in 3 setups: D: θT = (0.3, 1, 1, 1, 0.2, 0.8), E: θT = (0.2, 1, 2, 1, 0.4, 0.6) & F: θT = (0.3, 1, −1, 1, 0.2, 0.8) (S.D.: Average of estimated standard deviations over 1000 replications; Emp. S.D. : Sample standard deviation based on 1000 replications).

Sample size (n); for each of τ̂1n and τ̂2n in turn: Existing Bias (S.D., Emp. S.D.), Proposed Bias (S.D., Emp. S.D.), Theoretical S.D.; all entries × 10⁻³
Set-up D
50 10.2 (63.2, 63.9) 20.3 (63.3, 64.1) 62.0 −11.1 (55.0, 55.5) −19.2 (55.1, 55.6) 54.0
100 5.1 (44.2, 44.6) 6.3 (44.3, 44.8) 43.8 −4.8 (38.7, 39.2) −5.9 (38.8, 39.3) 38.2
500 2.8 (19.7, 19.8) 3.7 (19.8, 19.9) 19.6 −3.0 (17.3, 17.5) −4.0 (17.3, 17.5) 17.1
1000 0.9 (13.9, 13.9) 1.1 (13.9, 14.0) 13.9 −1.0 (12.2, 12.3) −1.1 (12.2, 12.3) 12.1
5000 0.1 (6.2, 6.2) 0.2 (6.2, 6.2) 6.2 −0.1 (5.4, 5.5) −0.1 (5.4, 5.5) 5.4
Set-up E
50 −32.9 (14.1, 17.9) −41.1 (14.8, 19.0) 11.0 40.0 (22.4, 24.1) 51.2 (22.8, 24.9) 21.0
100 −7.2 (9.6, 10.2) −9.0 (10.0, 10.5) 7.7 8.1 (15.5, 16.4) 9.7 (15.8, 16.7) 14.8
500 −5.1 (3.7, 3.7) −6.2 (3.7, 3.8) 3.5 6.0 (6.8, 7.0) 6.8 (6.9, 7.1) 6.6
1000 −1.3 (2.5, 2.5) −1.4 (2.5, 2.5) 2.4 1.4 (4.8, 4.8) 1.5 (4.8, 5.0) 4.7
5000 −0.1 (1.1, 1.1) −0.1 (1.1, 1.1) 1.1 0.1 (2.1, 2.1) 0.2 (2.1, 2.2) 2.1
Set-up F
50 10.2 (63.2, 64.0) 19.8 (63.3, 64.7) 62.0 −10.4 (55.0, 0.155) −19.0 (55.3, 55.8) 54.0
100 4.9 (44.0, 44.6) 6.0 (44.3, 44.8) 43.8 −5.3 (38.7, 39.1) −6.1 (38.8, 39.3) 38.2
500 3.0 (19.7, 19.8) 4.0 (19.8, 19.9) 19.6 −3.8 (17.2, 17.5) −4.3 (17.3, 17.4) 17.1
1000 0.9 (13.9, 14.0) 1.2 (13.9, 14.0) 13.9 −0.8 (12.1, 12.2) −1.0 (12.2, 12.3) 12.1
5000 0.1 (6.2, 6.2) 0.1 (6.2, 6.2) 6.2 −0.1 (5.4, 5.5) −0.1 (5.4, 5.5) 5.4

4.1.2 Choice of α for finite samples

Although the asymptotic results were established for all α > 1/2, a proper choice of α for finite samples is a very pertinent question. We performed extensive simulations for different sample sizes to explore the robustness of the method to different choices of α.

We considered a sample situation with one change-point, β0⁰ = 0.3, β1⁰ = 1.5, β2⁰ = 1 and σ = 0.1, with covariate X-space [0, 1]. The τ-values were varied between 0 and 1, and the Mean Square Errors were plotted against log10 α for various sample sizes. We found that the M.S.E. vs log10 α graphs are almost invariant to changing sample sizes. To change the signal-to-noise ratio, the β-values were kept constant but σ was changed to 0.5 (Fig. 2) and 1. The patterns are essentially identical for all parameter values and signal-to-noise ratios. However, for a small n and a very large value of α, the algorithm occasionally breaks down because 𝕌̇n(θ) becomes (almost) singular for computational purposes. This is clearly indicated by the very large average M.S.E. for α = 50 or 100 when the sample size is small (n = 50). So, very large α’s (greater than 10) are not recommended for small samples (n ≤ 50). We would also like to point out the robustness of the M.S.E.’s to the choice of tuning parameter over the range of α’s for which the algorithm is numerically stable. This is reflected by the flat stretch of the M.S.E. curves for each n before numerical instability sets in. In other words, so long as the algorithm works, any choice of α > 1/2 is essentially as effective as any other, so searching for an optimal α is unlikely to yield any significant gains. Our recommendation is to use α = 1, which works very well in terms of M.S.E. for all sample sizes, even as low as 30. The same α value (1) is used for all data analyses in the subsequent sections. The simulations indicate that computational efficiency is insensitive to the choice of α. A more detailed version of Fig. 2 is provided in the Supplement.

Figure 2.

Figure 2

Mean Square Errors vs log10 α for varying sample sizes with different τ-values, where β0⁰ = 0.3, β1⁰ = 1.5, β2⁰ = 1 and σ = 0.5. From top to bottom, the solid line corresponds to n = 50, the dashed line to n = 100, the dotted line to n = 500, the dot-dash line to n = 1000 and the long-dash line to n = 5000.

4.2 Longitudinal set-up

Simulations were conducted for the longitudinal case as well to compare the efficiency of our proposed method to that of the search-based algorithm. We considered an AR(1) correlation structure with ρ = 0.6 to model the dependence among observations within a subject. For each subject, we considered 10 observations in scenarios G and H (Table 4). For set-up J, we considered a varying number of observations per individual, uniformly distributed over {1, 2, …, 20}. The error standard deviation σ was taken to be 0.1 in all cases, and M = 1. For all simulations, α was taken to be 1. The computational gain of our proposed method over the search-based algorithm is as dramatic as in the cross-sectional case (Table 1). So, in Table 4, we compare only the biases and standard errors to illustrate the validity and estimation efficiency of our method.

Table 4.

Bias and variances of the change-point estimates τ̂n compared for the two change-point problem in three longitudinal set-ups: G: θT = (0.3, 1, 1, 1, 0.2, 0.8), H: θT = (0.2, 1, 2, 1, 0.4, 0.6) and J: θT = (0.2, 1, 2, 1, 0.4, 0.6). There are 10 observations per individual in set-ups G and H; in set-up J, the number of observations per individual D ~ Discrete Uniform {1, 2, …, 20}. (S.D.: average of estimated standard deviations over 1000 replications; Emp. S.D.: sample standard deviation based on 100 replications.)

Sample Size (n) | τ̂1n: Existing Bias (S.D., Emp. S.D.); Proposed Bias (S.D., Emp. S.D.); Theo. S.D. | τ̂2n: Existing Bias (S.D., Emp. S.D.); Proposed Bias (S.D., Emp. S.D.); Theo. S.D.
(All entries × 10−3.)
Set-up G
10 5.4 (74.2, 74.6) 6.9 (74.3, 74.8) 62.8 −5.3 (56.7, 57.3) −6.6 (56.8, 58.3) 43.2
50 3.2 (27.7, 27.9) 4.0 (27.8, 27.9) 21.4 −3.3 (21.3, 21.5) −4.3 (21.3, 21.5) 19.8
100 1.1 (19.8, 20.0) 1.4 (19.9, 20.1) 16.0 −1.2 (15.3, 15.4) −1.5 (15.4, 15.5) 14.1
500 0.2 (7.7, 7.7) 0.3 (7.9, 8.2) 6.9 −0.2 (6.4, 6.5) −0.2 (6.4, 6.5) 5.9
Set-up H
10 −7.5 (13.6, 13.2) −9.7 (13.8, 13.5) 8.7 8.1 (25.5, 26.4) 9.8 (25.8, 27.7) 17.1
50 −5.3 (6.7, 6.7) −6.8 (6.7, 6.8) 4.6 6.3 (9.9, 10.2) 7.1 (9.8, 10.1) 8.2
100 −1.4 (4.5, 4.4) −1.6 (4.5, 4.5) 2.9 1.4 (6.8, 6.4) 1.7 (6.8, 6.1) 5.5
500 −0.1 (2.8, 2.8) −0.1 (2.8, 2.8) 1.9 0.1 (4.1, 4.1) 0.2 (4.1, 4.2) 2.8
Set-up J
10 5.8 (77.1, 77.6) 7.1 (77.2, 77.8) 64.4 −6.1 (57.7, 59.4) −6.8 (57.2, 59.1) 44.1
50 3.5 (28.5, 28.7) 4.1 (28.5, 28.7) 22.9 −3.9 (22.0, 22.2) −4.4 (21.8, 22.0) 20.6
100 1.2 (20.3, 20.6) 1.4 (20.4, 20.6) 16.6 −1.3 (15.7, 15.8) −1.5 (15.8, 15.9) 14.9
500 0.2 (7.9, 8.0) 0.3 (8.0, 8.2) 7.2 −0.2 (6.8, 6.9) −0.2 (6.8, 6.9) 6.4

Results from Table 4 clearly indicate that our method yields almost the same standard error estimates as the search-based algorithm. For both methods, the bias is comparatively high at small sample sizes and the standard deviation estimates exceed the theoretical values, but the differences become smaller for larger sample sizes. The M.S.E.'s for the slope and intercept parameters also behave similarly to those of the change-points.

5 Applications

5.1 Plant growth data analysis

Vernalization, the requirement that plants experience a period of cool conditions to accelerate flowering, is an important determinant of flowering date in winter wheat. In Brooking and Jameison (2002), controlled environment studies were carried out to quantify the response of vernalization rate to temperature for two near-isogenic lines of the wheat cultivar Batten: Spring Batten, vernalization insensitive, and Winter Batten, vernalization sensitive. Plants were sampled for dissection at intervals during the treatment and post-treatment period, until the flag leaf could be distinguished. The authors investigated the co-ordination of primordium initiation and leaf appearance, quantified by the Haun stage. They observed that Spring Batten plants grown under fully inductive conditions (25/20°C, 16 hr photoperiod) produced eight leaves on average, and the rate of primordium initiation per emerged leaf increased markedly with the transition from leaf initiation to spikelet initiation. This represents an important phase transition in the growth of the plant. From Fig. 3, it is quite clear that a broken-stick model with two change-points fits this scenario well. The authors had estimated the change-points by eye and then fitted three line segments for the three regions. We provide a fast as well as statistically rigorous analysis using the approach developed in this paper.

Figure 3.

Co-ordination of primordium initiation and leaf emergence from Spring Batten treatments resulting in a final leaf number of 8 (Brooking and Jameison 2002). The solid bold line represents the fit estimated by our approach, while the broken line represents the fit of Brooking and Jameison (2002). The dotted vertical lines give the confidence intervals for the change-points estimated by our approach (solid vertical lines), while the vertical broken lines indicate the eye-estimated change-points.

The change-point estimates of Brooking and Jameison (2002), obtained by eye, were 2.6 and 5 on the Haun stage scale, whereas ours are 2.931 (2.715, 3.147) and 4.764 (4.647, 4.881), with 95% confidence intervals in parentheses. From Fig. 3 we see that the estimates in Brooking and Jameison (2002) do not lie within our confidence intervals, emphasizing the importance of a principled analysis such as the one we have proposed. The main conclusion in Brooking and Jameison (2002) was that the rate of primordium initiation per emerged leaf, the slope parameter, jumped from 1.9 primordia per leaf to 7.11 primordia per leaf and then became constant. Our estimates of the slopes of the three segments are 2.67 (2.46, 2.88), 8.19 (7.84, 8.54) and −0.02 (−0.16, 0.12) primordia per leaf. Our estimates qualitatively corroborate their conclusion that there are two sharp phase transitions in the growth pattern, whereby the initial growth rate more than triples and then becomes more or less constant.

5.2 Estradiol hormone profile analysis

We applied our proposed method to analyze the longitudinal estradiol data discussed in Section 1. For our purpose, we considered only women whose final menstrual period (FMP) had already been observed, so as to avoid scenarios with censored FMPs (Lu et al. 2010). Among these women, eight were left out because their observed FMP was too early or too late, or because they had fewer than three data points. The remaining sample of n = 156 women with identified FMP contributed 1396 observations in total, with each woman contributing 3 to 10 observations over time, covering 11 years before to 10 years after FMP. This gave an average of about 8.95 observations per woman. There were 75 (48%) smokers at baseline, and the baseline BMI mean (SD) was 27.4 (6.56). Note that the data we use here have longer follow-up, and hence more subjects with identified FMP, than the data on which the analysis in Fig. 1 of Sowers et al. (2008) is based. A log transformation was applied to the estradiol hormone level to make the normality assumption more plausible.

Denote by Yij the jth log-transformed E2 value measured at day tij centered around FMP Ti, for the ith woman and by SMOKEi and BMIi baseline smoking habit (0 meaning smoker at baseline, 1 otherwise) and the baseline body mass index, centered at the grand mean, respectively. We consider the following model:

$$Y_{ij} = \beta_0 + \beta_1 X_{ij} + \beta_2 f(X_{ij}, \tau_1) + \beta_3 f(X_{ij}, \tau_2) + \lambda_1 \mathrm{SMOKE}_i + \lambda_2 \mathrm{BMI}_i + b_i + U_i(t_{ij}) + \varepsilon_{ij} \qquad (5)$$

where Xij = tij − Ti, the bi are random intercepts following a N(0, ϕ) distribution, the Ui(t) are mean-zero Gaussian processes modeling serial correlation, and the εij are independent measurement errors following a N(0, σ2) distribution. We assume Ui(t) is a nonhomogeneous Ornstein-Uhlenbeck process, which satisfies Var(Ui(t)) = ξ(t), where log ξ(t) = ξ0 + ξ1t + ξ2t², and corr(Ui(t), Ui(s)) = ρ^|t−s|. We also assume that for each i, εi, bi and Ui(t) are independent of one another. Further, we assume −11.9 ≤ τ1 < τ2 ≤ 9.9 (in general, we assume for all our theoretical results that the covariate X is contained in some compact interval [M1, M2]; here, from the nature of the study and previous work, we knew the scope of the study was between 12 years before and 10 years after the FMP). Further, to make (5) identifiable and Θ compact, we assume that 10−6 ≤ |β2| ≤ 106 and 10−6 ≤ |β3| ≤ 106. Also, the variance function does not involve any mean function parameters, so even in the presence of unknown change-points the model remains identifiable.
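For concreteness, the marginal covariance of one subject's observation vector implied by this variance model can be assembled as below; the parameter values are illustrative, not the fitted ones from Table 5.

```python
import numpy as np

def subject_cov(t, phi, sigma2, xi0, xi1, xi2, rho):
    """Cov(Y_i) under model (5): random intercept (variance phi)
    + nonhomogeneous OU serial process with Var U(t) = xi(t),
    log xi(t) = xi0 + xi1*t + xi2*t^2, corr(U(t), U(s)) = rho^|t-s|,
    + independent measurement error with variance sigma2."""
    t = np.asarray(t, dtype=float)
    sd = np.exp(0.5 * (xi0 + xi1 * t + xi2 * t ** 2))  # sqrt of xi(t)
    lag = np.abs(t[:, None] - t[None, :])
    serial = np.outer(sd, sd) * rho ** lag             # Cov(U(t_j), U(t_k))
    return phi + serial + sigma2 * np.eye(len(t))      # phi adds phi*1*1^T

t = np.array([-2.0, -1.0, 0.0, 1.5, 3.0])   # years relative to FMP
Sigma = subject_cov(t, phi=0.2, sigma2=0.1,
                    xi0=-1.0, xi1=0.05, xi2=-0.01, rho=0.6)
```

Each of the three variance components is positive semi-definite, so the resulting matrix is a valid (positive definite) covariance.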

As illustrated in Section 3, we estimate the regression parameters in a three-step procedure. In the first step, we assume working independence to estimate θ̂n(I). Then η = (ϕ, σ2, ξ0, ξ1, ξ2, ρ) is estimated by maximizing the conditional log-likelihood,

$$l(\eta) = -\frac{1}{2}\sum_{i=1}^{n}\Big[\big(Y_i - \mu_{n,i}^{(I)}\big)^T\,\Sigma_{(i)}(\eta)^{-1}\,\big(Y_i - \mu_{n,i}^{(I)}\big) + \log\big|\Sigma_{(i)}(\eta)\big|\Big]. \qquad (6)$$

Then Σ̂n = Σ(η̂n), which is subsequently used in Step 3 to obtain θ̂n. Condition 2.4 can be verified to hold for this model; in fact, W here turns out to be Σ0 = Σ(η0). The proof is provided in the Supplement.
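Step 3 of this procedure is a weighted nonlinear least-squares update. A generic sketch is below; the mean function `mu` is a hypothetical placeholder for the smoothed broken-stick mean of model (5), and the solver choice is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def gls_objective(theta, X_list, Y_list, Sinv_list, mu):
    """Weighted least-squares criterion
    sum_i (Y_i - mu(X_i, theta))^T Sigma_inv_i (Y_i - mu(X_i, theta)),
    with Sigma_inv_i the per-subject inverse covariances from Step 2."""
    total = 0.0
    for X, Y, Sinv in zip(X_list, Y_list, Sinv_list):
        r = Y - mu(X, theta)
        total += r @ Sinv @ r
    return total

def step3(theta_init, X_list, Y_list, Sinv_list, mu):
    """Re-estimate theta given the estimated covariances (Step 3)."""
    res = minimize(gls_objective, np.asarray(theta_init, float),
                   args=(X_list, Y_list, Sinv_list, mu),
                   method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 5000})
    return res.x

# Toy check with a linear mean mu(X, theta) = theta0 + theta1 * X
mu_lin = lambda X, th: th[0] + th[1] * X
X1 = np.array([0.0, 1.0, 2.0, 3.0])
est = step3([0.0, 0.0], [X1], [1.0 + 2.0 * X1], [np.eye(4)], mu_lin)
```

With noiseless linear data and identity weights, the update recovers the generating intercept and slope.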

Our results indicate the presence of change-points at −2.174 (−2.554,−1.794) and 1.733 (1.513, 1.953) years (Table 5).

Table 5.

Regression parameter estimates along with their respective standard errors

Parameter Estimate Standard Error
β0 4.116 0.139
β1 −0.006 0.002
β2 −0.259 0.009
β3 0.199 0.008
τ1 −2.192 0.197
τ2 1.738 0.11
λ1 0.047 0.072
λ2 0.005 0.005

In Sowers et al. (2008), the change-points had been roughly placed around 2 years before and after FMP. Although this was a good estimate, the 95% confidence interval for the second change-point does not contain 2 years after FMP, indicating that the change in estradiol levels actually happens slightly sooner than anticipated in Sowers et al. (2008). Also, BMI and smoking habits do not seem to alter this pattern significantly. But our contribution, above all, is providing statistically meaningful inference about the change-points. The form of the confidence bands also indicates that a two change-point model is indeed a good fit for the E2 hormone profile.

6 Discussion

We have proposed a method of estimating change-points in a broken-stick model which is computationally much more efficient than existing methods, and demonstrated that it is asymptotically just as efficient. The method of estimation is also numerically stable. An added advantage of this method is that, as shown in Section 3, it can be readily extended to generalized linear models with repeated measures, examples of which are abundant. The estimates in those frameworks exhibit similar desirable asymptotic properties.

It seems reasonable to expect that this idea should work equally well for estimating change-points in a time-series framework, at least under short-range dependence. For instance, estimation of change-points is of considerable significance in climatic series data (Lund et al. 1995; Lund and Reeves 2002), and such data sets tend to be very large. Hence our idea would likely prove even more economical in this setting. This is underscored by the fact that even for a sample size of 50, our method is more than a hundred times faster than the exact least-squares method with multiple change-points, and at large sample sizes, thousands of times faster. Also, in a linear spline model with unknown knot locations (but a known number of knots), the proposed method provides a faster alternative for locating these knots.

We cannot stress enough that this is a very generic idea which can be used for computational economy in several settings without giving up asymptotic efficiency. For example, the same idea should be applicable for estimating change-points in a multivariable set-up where change-points occur in more than one variable. While the computational time of a search-based algorithm will increase manyfold with the number of variables having change-points, it will scale much more favorably for our approach.

However, we would like to point out that if investigators feel the linearity of the broken-stick model is not well suited to their data, our method of estimation, or for that matter any method of estimation based on the broken-stick model, may not be reliable.

Also, if the coefficients of two consecutive regimes are very close, then trying to fit separate segments for the two regimes is strongly discouraged: in extensive simulations, both our approach and the search-based approach yielded poor estimates in this situation. Thus, before fitting a broken-stick model, we strongly suggest that investigators check that the assumptions of the model are valid.

In this article, we were interested in modeling the mean hormone profile of all subjects in the cohort discussed in Section 5.2. A possible way to model individual-specific hormone profiles is via multi-path change-point models. Major work on multi-path change-points includes Joseph and Wolfson (1993) and Asgharian and Wolfson (2001). Most of this literature has treated the change-point as the observation at which a transition has occurred, rather than as a point in the X-space. Broken-stick models with random change-points and random intercepts and slopes are a possible interesting avenue for future work in this field. The simplest possible model with one change-point is:

$$E(Y \mid X) = \beta_0 + \beta_1 X + \beta_2 (X - \tau)_+,$$

where θT = (β0, β1, β2, τ) follows, say, a multivariate N(θ0, ϒ) distribution. Estimation methods will rely on minimizing criterion functions involving several integrals and are beyond the scope of this work.

Supplementary Material


Acknowledgments

The work of Banerjee was supported in part by NSF grant DMS-1308890. The work of Nan was supported in part by NIH grant R01-AG036802 and NSF grants DMS-1007590 and DMS-1407142. The authors gratefully acknowledge the editor, associate editor and the referees for their helpful comments and suggestions.

Appendix 1

Sketch proofs of Asymptotic Properties for Cross-sectional Model

We consider the simplest case with a single change-point. The case with multiple change-points is an algebraic exercise. The proofs heavily rely on the results in van der Vaart and Wellner (1996). We provide sketch proofs here with details given in the online Supplement.

Sketch proof of Theorem 1

It is clearly seen that U(θ0) = 0, where U(θ) = ∂P(M(θ))/∂θ is the population score function. The proof of consistency is based on the following two facts:

  • Fact 1: θ0 is the unique solution of U(θ) = 0.

  • Fact 2: ‖𝕌nU‖ ≔ supθ∈Θ |𝕌n(θ) − U(θ)| = op(1).

In the online Supplementary Material, we show that there exists a zero of 𝕌n(θ), denoted θ̂n, in Θ on a set whose probability increases to 1. Denote this set by Ωn. On the set Ωnᶜ (whose probability shrinks to zero), θ̂n is assigned some pre-fixed value (say, θ1) that is an interior point of Θ. To prove consistency of θ̂n, observe that, for any ε > 0,

$$\mathrm{pr}\big(|\hat\theta_n - \theta_0| > \varepsilon\big) = \mathrm{pr}\big(\{|\hat\theta_n - \theta_0| > \varepsilon\} \cap \Omega_n\big) + \mathrm{pr}\big(\{|\hat\theta_n - \theta_0| > \varepsilon\} \cap \Omega_n^{\mathsf c}\big).$$

Notice that on the set Ωn, θ̂n is a zero of 𝕌n(θ), hence also a minimizer of ‖𝕌n(θ)‖2 in Θ, while θ0 is the unique zero of U(θ) and hence the unique minimizer of ‖U(θ)‖2, where ‖·‖2 is the Euclidean norm. From Fact 2 and the triangle inequality, we have supθ∈Θ |‖𝕌n‖2 − ‖U‖2| ≤ supθ∈Θ ‖𝕌n − U‖2 ≤ K‖𝕌n − U‖ = op(1) for a finite constant K. So, from the argmax (argmin) continuous mapping theorem (van der Vaart and Wellner (1996), Corollary 3.2.3), for any ε > 0, there exists a ν > 0 such that,

$$\mathrm{pr}\big(\{|\hat\theta_n - \theta_0| > \varepsilon\} \cap \Omega_n\big) \le \frac{\nu}{2}.$$

Also, for sufficiently large n, pr(Ωnᶜ) ≤ ν/2. This implies that pr(|θ̂n − θ0| > ε) ≤ ν for sufficiently large n, proving Theorem 1.

Now, proving Fact 1 is a tedious algebraic exercise. The idea is, by the method of elimination, to reduce the system of equations in the various parameters to a single equation involving only τ. We then show that this function is negative for all τ > τ0 and positive for all τ < τ0, which establishes Fact 1.

To establish Fact 2, observe that ‖𝕌n − U‖ ≤ ‖𝕌n − Un‖ + ‖Un − U‖. Direct calculation yields ‖Un − U‖ = O(γn) = o(1). On the other hand, (𝕌n − Un)(θ) = (ℙn − P)(∂Mn(θ)/∂θ).

Observe that

$$\frac{\partial M_n(\theta)}{\partial \theta} = \begin{pmatrix} -2\,(Y - \beta_0 - \beta_1 X - \beta_2 q_n(X,\tau)) \\ -2X\,(Y - \beta_0 - \beta_1 X - \beta_2 q_n(X,\tau)) \\ -2\,q_n(X,\tau)\,(Y - \beta_0 - \beta_1 X - \beta_2 q_n(X,\tau)) \\ -2\beta_2\,\dfrac{\partial q_n(X,\tau)}{\partial \tau}\,(Y - \beta_0 - \beta_1 X - \beta_2 q_n(X,\tau)) \end{pmatrix}.$$

Now, Θ being compact, it is clear that β0, β1X, β2qn(X, τ) and β2 ∂qn(X, τ)/∂τ are all bounded monotone functions of θ. Theorem 2.7.5 in van der Vaart and Wellner (1996) shows that bounded monotone functions have a bracketing number of order 1/ε with respect to the L1(P) norm, and hence a similar bound on the covering number with respect to the same norm. So they have the bounded uniform entropy integral (BUEI) property and hence belong to a Glivenko-Cantelli class. Theorem 3 in van der Vaart and Wellner (1999) shows that simple operations such as addition or multiplication preserve the Glivenko-Cantelli property, so each component function of ∂Mn(θ)/∂θ belongs to a Glivenko-Cantelli class. Thus ∂Mn(θ)/∂θ belongs to a Glivenko-Cantelli class and hence ‖𝕌n − Un‖ = op(1), which establishes Fact 2, and hence Theorem 1.

Sketch proof of Theorem 2

Since n1/2(θ̂n − θ0) = n1/2(θ̂n − θn) + n1/2n − θ0), the asymptotic distribution of θ̂n is a direct result of the following two facts when γn = n−α with α > 1/2.

  • Fact 3: ‖θn − θ0‖ = On).

  • Fact 4: n1/2(θ̂n − θn) converges in distribution to N(0, 2σ2U̇*−1(θ0)).

We first show Fact 3. Observe that a simple Taylor series expansion of Unn) around θ0 yields

$$U_n(\theta_n) - U_n(\theta_0) = \begin{pmatrix} \dot U_{1n}(\tilde\theta_n^{(1)}) \\ \dot U_{2n}(\tilde\theta_n^{(2)}) \\ \dot U_{3n}(\tilde\theta_n^{(3)}) \\ \dot U_{4n}(\tilde\theta_n^{(4)}) \end{pmatrix} (\theta_n - \theta_0) = A_n(\theta_n - \theta_0),$$

where each θ̃n(i), i = 1, 2, 3, 4, lies on the straight line joining θn and θ0, U̇in = ∂Un/∂βi−1 for i = 1, 2, 3, and U̇4n = ∂Un/∂τ. Now, we know from the proof of Fact 1 that supθ∈Θ |P(Mn(θ)) − P(M(θ))| = o(1), hence θn − θ0 = o(1). This in turn implies that for sufficiently large n, θn is an interior point of Θ and hence a zero of Un; that is, Un(θn) = 0.

Now, U0) = 0. Thus for sufficiently large n, the above equality becomes

$$U_n(\theta_0) - U(\theta_0) = -A_n(\theta_n - \theta_0).$$

It is clearly seen from

$$U_n(\theta) - U(\theta) = \begin{pmatrix} -2\beta_2 P\big(f(X,\tau) - q_n(X,\tau)\big) \\ -2\beta_2 P\big[X\big(f(X,\tau) - q_n(X,\tau)\big)\big] \\ 2P\big[\big(f(X,\tau) - q_n(X,\tau)\big)\big(Y - \beta_0 - \beta_1 X - \beta_2\{q_n(X,\tau) + f(X,\tau)\}\big)\big] \\ 2\beta_2 P\big[\big(-\mathbf 1(X>\tau) - \tfrac{\partial}{\partial\tau}q_n(X,\tau)\big)\big(Y - \beta_0 - \beta_1 X - \beta_2\{q_n(X,\tau) + f(X,\tau)\}\big)\big] \end{pmatrix},$$

that ‖Un − U‖ = O(γn). Thus Fact 3 is established if An is invertible for large n. This is shown by proving that U̇n(θ) converges uniformly, implying that An converges as well. Let U̇*(θ0) denote the limit of An. It can be shown that

$$\dot U_*(\theta_0) = 2\begin{pmatrix} 1 & P(X) & P(f(X,\tau_0)) & -\beta_2^0 P(\mathbf 1(X>\tau_0)) \\ P(X) & P(X^2) & P(Xf(X,\tau_0)) & -\beta_2^0 P(X\mathbf 1(X>\tau_0)) \\ P(f(X,\tau_0)) & P(Xf(X,\tau_0)) & P(f^2(X,\tau_0)) & -\beta_2^0 P(f(X,\tau_0)) \\ -\beta_2^0 P(\mathbf 1(X>\tau_0)) & -\beta_2^0 P(X\mathbf 1(X>\tau_0)) & -\beta_2^0 P(f(X,\tau_0)) & (\beta_2^0)^2 P(\mathbf 1(X>\tau_0)) \end{pmatrix}.$$

Now, for any vector a = (a1, …, a4)T ≠ 0, we have

$$a^T \dot U_*(\theta_0)\, a = 2P\big\{a_1 + a_2 X + a_3 f(X,\tau_0) - a_4 \beta_2^0 \mathbf 1(X>\tau_0)\big\}^2 > 0,$$

which implies that *0) is positive definite and hence nonsingular. Thus An is nonsingular for large enough n, and we have

$$|\theta_n - \theta_0| = \big|A_n^{-1}(U_n - U)(\theta_0)\big| = O_p(\gamma_n).$$

We next show Fact 4. Denote 𝔾n = n1/2(ℙn − P). Again, observe that a Taylor series expansion of Un(θ̂n) around θn yields

$$U_n(\hat\theta_n) - U_n(\theta_n) = \begin{pmatrix} \dot U_{1n}(\theta_n^{*(1)}) \\ \dot U_{2n}(\theta_n^{*(2)}) \\ \dot U_{3n}(\theta_n^{*(3)}) \\ \dot U_{4n}(\theta_n^{*(4)}) \end{pmatrix} (\hat\theta_n - \theta_n) = B_n(\hat\theta_n - \theta_n),$$

where θn*(i), i = 1, 2, 3, 4 lie on the straight line joining θn and θ̂n. Since, with probability increasing to 1, 𝕌n(θ̂n) = Unn) = 0, the following equality holds with probability increasing to 1:

$$\mathbb U_n(\hat\theta_n) - U_n(\hat\theta_n) = -B_n(\hat\theta_n - \theta_n).$$

Now,

$$n^{1/2}\big[\mathbb U_n(\hat\theta_n) - U_n(\hat\theta_n)\big] = -2\,\mathbb G_n \begin{pmatrix} Y - \beta_0 - \beta_1 X - \beta_2 q_n(X,\tau) \\ X\,(Y - \beta_0 - \beta_1 X - \beta_2 q_n(X,\tau)) \\ q_n(X,\tau)\,(Y - \beta_0 - \beta_1 X - \beta_2 q_n(X,\tau)) \\ \beta_2\,\tfrac{\partial}{\partial\tau}q_n(X,\tau)\,(Y - \beta_0 - \beta_1 X - \beta_2 q_n(X,\tau)) \end{pmatrix}\Bigg|_{\theta=\hat\theta_n} = -2\,\mathbb G_n \begin{pmatrix} g_n^{(1)}(\hat\theta_n) \\ g_n^{(2)}(\hat\theta_n) \\ g_n^{(3)}(\hat\theta_n) \\ g_n^{(4)}(\hat\theta_n) \end{pmatrix} = -2\,\mathbb G_n\big(g_n(\hat\theta_n)\big).$$

Next, we argue that gn, as well as its limit g, belongs to a Donsker class. Then we show that P(gn(i)(θ̂n) − g(i)(θ̂n))2 = op(1), i = 1, …, 4, where the gn(i) are the components of gn. By the asymptotic equicontinuity property, we then have 𝔾n(gn(θ̂n) − g(θ̂n)) = op(1) (van der Vaart and Wellner 1996). It can also be shown that P(g(i)(θ̂n) − g(i)(θ0))2 = op(1), i = 1, …, 4. Hence, again by asymptotic equicontinuity, 𝔾n(g(θ̂n) − g(θ0)) = op(1) (van der Vaart and Wellner 1996). Now,

$$n^{1/2}\big[\mathbb U_n(\hat\theta_n) - U_n(\hat\theta_n)\big] = -2\mathbb G_n\big(g_n(\hat\theta_n)\big) = -2\mathbb G_n\big(g_n(\hat\theta_n) - g(\hat\theta_n)\big) - 2\mathbb G_n\big(g(\hat\theta_n) - g(\theta_0)\big) - 2\mathbb G_n\big(g(\theta_0)\big) = -2\mathbb G_n\big(g(\theta_0)\big) + o_p(1).$$

Thus by the central limit theorem, the above expression converges in distribution to a normal random variable with mean zero and variance matrix 2σ2U̇*(θ0). By the same argument as in the proof of Theorem 1, Bn converges to U̇*(θ0) in probability. Thus n1/2(θ̂n − θn) converges to N(0, 2σ2U̇*−1(θ0)) in distribution, establishing Fact 4 and thus Theorem 2.
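For clarity, the final sandwich calculation can be written out in one line; it uses only the symmetry of U̇*(θ0):

```latex
n^{1/2}(\hat\theta_n-\theta_n)
  = -B_n^{-1}\,n^{1/2}\bigl[\mathbb U_n(\hat\theta_n)-U_n(\hat\theta_n)\bigr]
  \;\rightsquigarrow\;
  \dot U_*^{-1}(\theta_0)\,N\!\bigl(0,\;2\sigma^2\dot U_*(\theta_0)\bigr)
  = N\!\bigl(0,\;\dot U_*^{-1}(\theta_0)\,\{2\sigma^2\dot U_*(\theta_0)\}\,\dot U_*^{-1}(\theta_0)\bigr)
  = N\!\bigl(0,\;2\sigma^2\dot U_*^{-1}(\theta_0)\bigr).
```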

Appendix 2

Proof of Theorem 3

The three main steps of this proof have already been outlined in Section 3.2. The proof of Theorem 3 relies on the following lemma:

Lemma 1

Under Conditions 2.1–2.3 in Section 3.2, for any m × m positive definite matrix V and γn = n−α with α > 1/2, n1/2(θ̂n(V) − θ0) converges to N(0, K(V)) in distribution, where

$$K(V) = 2P\Big[\big(H^T(\theta_0)VH(\theta_0)\big)^{-1} H^T(\theta_0)\,V^T\Sigma_0 V\,H(\theta_0)\big(H^T(\theta_0)VH(\theta_0)\big)^{-1}\Big].$$

Here, θ̂n(V) is defined as a zero of $\mathbb{U}_n^{(V)}(\theta) = \frac{\partial}{\partial\theta}\,\mathbb{P}_n\big[(Y - \mu_n)^T V (Y - \mu_n)\big]$ in Θ, with probability increasing to 1.

The proof of Lemma 1 is similar to the proof of Theorem 2 in Appendix 1. The details are omitted here for brevity and are presented in the Supplement.

Now, to prove Theorem 3, taking V = I we obtain from Lemma 1 that n1/2(θ̂n(I) − θ0) converges to N(0, K(I)) in distribution, for γn = n−α with α > 1/2.

Now, by Condition 2.4, n1/2(Σ̂n−1 − W−1) = Op(1), implying

$$n^{1/2}\Big(\mathbb U_n\big(\hat\theta_n^{(W^{-1})}\big) - \mathbb U_n^{(W^{-1})}\big(\hat\theta_n^{(W^{-1})}\big)\Big) = \frac{\partial}{\partial\theta}\,\mathbb P_n\Big[(Y-\mu_n)^T\big\{n^{1/2}\big(\hat\Sigma_n^{-1} - W^{-1}\big)\big\}(Y-\mu_n)\Big]\Big|_{\theta=\hat\theta_n^{(W^{-1})}} = O_p(1)\,\frac{\partial}{\partial\theta}\,\mathbb P_n\big[(Y-\mu_n)^T(Y-\mu_n)\big]\Big|_{\theta=\hat\theta_n^{(W^{-1})}} = O_p(1)\,\mathbb U_n^{(I)}\big(\hat\theta_n^{(W^{-1})}\big).$$

Again, ‖𝕌n(I) − U(I)‖ = op(1) (proof provided in the Supplement). This implies that 𝕌n(I)(θ̂n(W⁻¹)) = U(I)(θ̂n(W⁻¹)) + op(1). Also, θ̂n(W⁻¹) converges in probability to θ0 and U(I) is continuous, which together imply that U(I)(θ̂n(W⁻¹)) = U(I)(θ0) + op(1) = op(1), since U(I)(θ0) = 0. Thus we obtain 𝕌n(I)(θ̂n(W⁻¹)) = op(1), which implies that

$$n^{1/2}\Big(\mathbb U_n\big(\hat\theta_n^{(W^{-1})}\big) - \mathbb U_n^{(W^{-1})}\big(\hat\theta_n^{(W^{-1})}\big)\Big) = o_p(1).$$

Also, $\mathbb U_n(\hat\theta_n) = \mathbb U_n^{(W^{-1})}\big(\hat\theta_n^{(W^{-1})}\big) = 0$ implies that

$$n^{1/2}\Big(\mathbb U_n(\hat\theta_n) - \mathbb U_n\big(\hat\theta_n^{(W^{-1})}\big)\Big) = -\,n^{1/2}\Big(\mathbb U_n\big(\hat\theta_n^{(W^{-1})}\big) - \mathbb U_n^{(W^{-1})}\big(\hat\theta_n^{(W^{-1})}\big)\Big) = o_p(1).$$

A Taylor series expansion of 𝕌n(θ̂n) around θ̂n(W⁻¹) gives

$$n^{1/2}\Big(\mathbb U_n(\hat\theta_n) - \mathbb U_n\big(\hat\theta_n^{(W^{-1})}\big)\Big) = \dot{\mathbb U}_n(\tilde\theta_n^{*})\; n^{1/2}\big(\hat\theta_n - \hat\theta_n^{(W^{-1})}\big),$$

for some θ̃n* lying on the straight line joining θ̂n and θ̂n(W⁻¹). As for An in the proof of Theorem 2, we can show that 𝕌̇n(θ̃n*) converges in probability to U̇*(θ0), which in turn implies that n1/2(θ̂n − θ̂n(W⁻¹)) = op(1).

Also, notice that with V = W−1, Lemma 1 gives that n1/2(θ̂n(W⁻¹) − θ0) converges to N(0, K(W−1)) in distribution. Combined with n1/2(θ̂n − θ̂n(W⁻¹)) = op(1), this proves Theorem 3.

Contributor Information

Ritabrata Das, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109 (ritob@umich.edu).

Moulinath Banerjee, Department of Statistics, University of Michigan, Ann Arbor, MI 48109 (moulib@umich.edu).

Bin Nan, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109 (bnan@umich.edu).

Huiyong Zheng, Department of Epidemiology, University of Michigan, Ann Arbor, MI 48109 (zhenghy@umich.edu).

References

  1. Asgharian M, Wolfson DB. Covariates in multipath change-point problems: modelling and consistency of the MLE. Canadian Journal of Statistics. 2001;29:515–528.
  2. Bellman R, Roth R. Curve fitting by segmented straight lines. Journal of the American Statistical Association. 1969;64:1079–1084.
  3. Bhattacharya PK. Maximum likelihood estimation of a change-point. J. Multivar. Analysis. 1987;23:183–208.
  4. Brooking IR, Jameison PD. Temperature and photoperiod response of vernalization in near-isogenic lines of wheat. Field Crops Research. 2002;79:21–38.
  5. Chen C, Chan J, Gerlach R, Hsieh W. A comparison of estimators for regression models with change points. Statistics and Computing. 2011;21:395–414.
  6. Chiu G, Lockhart R, Routledge R. Bent-cable asymptotics when the bend is missing. Statistics and Probability Letters. 2002;59:9–16.
  7. Chiu G, Lockhart R, Routledge R. Bent-cable regression theory and applications. Journal of the American Statistical Association. 2006;101:542–553.
  8. Feder PI. The log likelihood ratio in segmented regression. The Annals of Statistics. 1975a;3:84–97.
  9. Feder PI. On asymptotic distribution theory in segmented regression problems: identified case. The Annals of Statistics. 1975b;3:49–83.
  10. Hudson DJ. Fitting segmented curves whose join points have to be estimated. Journal of the American Statistical Association. 1966;61:1097–1129.
  11. Huskova M. Estimators in the location model with gradual changes. Commentationes Mathematicae Universitatis Carolinae. 1998;39:147–157.
  12. Joseph L, Wolfson D. Maximum likelihood estimation in the multi-path change-point problem. Annals of the Institute of Statistical Mathematics. 1993;45:511–530.
  13. Lu X, Nan B, Song P, Sowers MF. Longitudinal data analysis with event time as a covariate. Stat Biosci. 2010;2:65–80.
  14. Lund RB, Hurd HL, Bloomfield P, Smith RL. Climatological time series with periodic correlation. J. Climate. 1995;8:2787–2809.
  15. Lund RB, Reeves J. Detection of undocumented changepoints: a revision of the two-phase regression model. J. Climate. 2002;15:2547–2554.
  16. Muggeo VMR. Estimating regression models with unknown break-points. Statistics in Medicine. 2003;22:3055–3071. doi: 10.1002/sim.1545.
  17. Poirier DJ. Piecewise regression using cubic splines. Journal of the American Statistical Association. 1973;68:515–524.
  18. Robinson DE. Estimates for the points of intersection of two polynomial regressions. Journal of the American Statistical Association. 1964;59:214–224.
  19. Rubin D. Inference and missing data. Biometrika. 1976;63:581–592.
  20. Sowers MFR, McConnell D, Nan B, Harlow S, Randolph JF Jr. Estradiol rates of change in relation to the final menstrual period in a population-based cohort of women. J. Clin. Endocrinol. Metab. 2008;93(10):3847–3852. doi: 10.1210/jc.2008-1056.
  21. Tishler A, Zang I. A new maximum likelihood algorithm for piecewise regression. Journal of the American Statistical Association. 1981;76:980–987.
  22. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996.
  23. van der Vaart A, Wellner J. Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. Technical Report No. 361, Department of Statistics, University of Washington. 1999.
