Author manuscript; available in PMC: 2021 Apr 15.
Published in final edited form as: Stat Med. 2020 Jan 29;39(8):1167–1182. doi: 10.1002/sim.8469

Methods for Generalized Change-point Models: with Applications to HIV Surveillance and Diabetes Data

Jean de Dieu Tapsoba 1,*, Ching-Yun Wang 2, Sahar Zangeneh 1, Ying Qing Chen 1
PMCID: PMC7260994  NIHMSID: NIHMS1590210  PMID: 31997385

Abstract

In many epidemiological and biomedical studies, the association between a response variable and some covariates of interest may change at one or several thresholds of the covariates. Change-point models are suitable for investigating the relationship between the response and covariates in such situations. We present change-point models, with at least one unknown change-point occurring with respect to some covariates of a generalized linear model for independent or correlated data. We develop methods for the estimation of the model parameters and investigate their finite sample performances in simulations. We apply the proposed methods to examine the trends in the reported estimates of the annual percentage of new human immunodeficiency virus (HIV) diagnoses linked to HIV-related medical care within three months after diagnosis using HIV surveillance data from the HIV Prevention Trial Network (HPTN) 065 study. We also apply our methods to a data set from the Pima Indian diabetes study to examine the effects of age and body mass index on the risk of being diagnosed with type 2 diabetes.

Keywords: Change-point, correlated data, generalized linear model, HIV surveillance, type 2 diabetes

1. Introduction

In many epidemiological and biomedical studies, it is common that the association between a response variable and some covariates of interest is not linear, and the association may change at one or more unknown thresholds of the covariates. Change-point models (also known as segmented models) are well-suited for data analysis in these situations and allow the regression function to take different forms before and after the thresholds.1

An example is the trend in the reported estimates of the annual percentage of individuals (age 13 years or older) who were newly diagnosed with HIV infection in 2011 and linked to HIV-related medical care within three months of diagnosis (L2C) from the HIV-surveillance data for six jurisdictions (New York, NY; Washington, D.C.; Miami, FL; Chicago, IL; Philadelphia, PA and Houston, TX) participating in the HPTN 065 study, which was a feasibility study of an enhanced test, link to care, plus treatment strategy for HIV prevention in the United States.2 The data included jurisdiction-level aggregate estimates of the number of new HIV diagnoses, L2C and viral load suppression. These measures were routinely updated and submitted to the surveillance systems’ web sites by the local Departments of Health (DOH) on a quarterly basis from March 2012 through December 2014. Figure 1 shows the updated L2C estimate over the course of the data submission time for the six cities. A noticeable pattern emerging from this figure is that the updated L2C estimates generally increased from the first data-upload month up to a certain time point, after which they largely stabilized. The number of new HIV diagnoses, shown in Figure S1 of the Supplementary Materials, exhibited similar trends.

Figure 1:

Reported percentage of new HIV-diagnoses linked to medical care within 3 months following diagnosis with HIV infection by local Departments of Health, 2011 for the HPTN 065 study participating jurisdictions; τ is the estimated jurisdiction-specific change-point.

Another example arises from a data set from the Pima Indian diabetes study. This data set was obtained from the University of California, Irvine (UCI) Machine Learning Repository and was part of a comprehensive database maintained by the US National Institute of Diabetes and Digestive and Kidney Diseases. It included 764 Pima Indian women (age 21 years or older) from greater Phoenix, Arizona and was previously analyzed by Wang et al3 using generalized additive partial linear models. The variables included age, body mass index (BMI), plasma glucose concentration levels and a binary indicator variable for a positive test for type 2 diabetes (T2D) according to the World Health Organization criteria. Figure 2 shows the smooth functions of age, BMI and glucose in their associations with T2D diagnosis based on an additive logistic regression model4 that was fitted using the ‘mgcv’ R package.5 The effect of glucose on the log odds for a T2D diagnosis appeared to be linear. However, the effects of age and BMI appeared nonlinear, with possibly one change-point for age and two change-points for BMI. The estimation of the change-points themselves is often of primary interest in many applications.

Figure 2:

Smooth components of age, BMI and glucose (solid lines) with the associated two standard error bounds (dashed lines) in an additive logistic regression model for their associations with the logit transformation of a positive test for diabetes in the Pima Indians diabetes study. df is the effective degree of freedom from the estimation of the smooth components based on spline transformations.

The change-point estimation problem has been extensively studied in the literature and most works have focused on piecewise linear (broken-stick) regression models with one or multiple change-points.6,7 Also, the single change-point estimation in the context of logistic regression models for independent observations has received substantial attention, and relevant references include Pastor and Guallar,8 Pastor et al,9 and Fong et al.10 However, the change-point problem has not been well-studied in the general situation of generalized linear models for independent or correlated data. To our knowledge, only limited research pertaining to this issue exists in the literature. Among the very few references, Zhou and Liang11 proposed an estimation method for a single change-point in one covariate of a generalized linear model for independent observations. Their approach relies on an approximation by smoothing techniques, which ties the performance of the method to an appropriate choice of the smoothing parameter. Muggeo12,13 developed an algorithm for the estimation of one or multiple change-points in regression models including the generalized linear models for independent data. Muggeo’s algorithm uses a linear approximation to the regression function. In a longitudinal data setting, Das et al7 proposed a method using approximation by smoothing techniques for the estimation of multiple change-points in linear models. The approximate methods may not work well in many situations, especially when the regression model is not linear and the sample size is not large. Methods that are not based on approximation by smoothing or linearization for the estimation of the change-points in the general frameworks of generalized linear models for independent or correlated data have yet to be developed.

We address the general problem of estimating one or multiple unknown change-points in the covariates of generalized linear models for independent or correlated response data. We introduce change-point models for the situations where at least one change-point is present in some covariates of a generalized linear model for independent, correlated or longitudinal data. Furthermore, we develop methods using estimating equations for the estimation of the change-points along with the other model parameters. Our methods involve no approximation by smoothing or linearization techniques and yield $\sqrt{n}$-consistent estimators.

We present the change-point models in Section 2. In Section 3, we develop our methods for the estimation of the model parameters. The finite-sample performance of the proposed methods is numerically investigated in Section 4. In Section 5, we apply our methods to examine the trends of the reported L2C estimates over the data upload time and provide improvements of these estimates. We also illustrate our methods with the data set from the Pima Indian diabetes study. We conclude with a discussion in Section 6.

2. Generalized change-point models

We distinguish two settings corresponding to whether the response variable is measured once or repeatedly for each sampled unit.

2.1. Setting with one measurement per unit

Let $(Y_i, X_i, W_i)$, $i = 1, \ldots, n$, denote the observed data, where $Y_i$ is the response, $X_i$ is a scalar and continuous covariate, and $W_i$ is a vector of other covariates involving no change-points. In the HPTN 065 study, $Y$ represents the L2C measure and $X$ is the time since the first data-upload month for a given jurisdiction. In the Pima Indian diabetes study, $Y$ is the indicator of a T2D-positive test and $X$ is age or BMI. We model the relationship between the response and covariates through the following model, termed here the generalized change-point model with $R+1$ segments:

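$$E(Y_i \mid X_i, W_i) = h\{\eta(\theta, X_i, W_i)\} = h\Big\{\beta_0 + \beta_1 X_i + \sum_{r=1}^{R} \beta_{2r}(X_i - \tau_r)_+ + \gamma' W_i\Big\}, \qquad (1)$$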

where $u_+ = \max(u, 0)$ for any $u$ and $h(\cdot)$ is a known function. For example, $h(u)$ takes the form $u$, $\{1 + \exp(-u)\}^{-1}$ and $\exp(u)$ for linear, logistic and Poisson regression models, respectively. In addition, $\theta = (\vartheta', \tau')'$ is a vector of unknown parameters with $\vartheta = (\beta', \gamma')'$, $\beta = (\beta_0, \beta_1, \beta_2')'$, $\beta_2 = (\beta_{21}, \ldots, \beta_{2R})'$ and $\tau = (\tau_1, \ldots, \tau_R)'$. Here the $\tau_r$’s are ordered ($\tau_1 < \cdots < \tau_R$) change-points occurring in the conditional mean of the response given the covariates. Furthermore, $\beta_0$ is the intercept, $\beta_1$ is the coefficient for $X$ in the first segment and $\beta_{2r}$ is the difference in the coefficients for $X$ between the two consecutive segments joining at the change-point $\tau_r$. Also, the coefficient for $X$ in the segment that immediately follows $\tau_r$ is $\beta_1 + \sum_{l=1}^{r} \beta_{2l}$, $r = 1, \ldots, R$. Model (1) encompasses as special cases the change-point model of Zhou and Liang11 for a generalized linear model with a single change-point in one covariate and the broken stick model investigated by Das et al7 when $Y_1, \ldots, Y_n$ are independent. Under the current setting, the $Y_i$’s are allowed to be independent or correlated. For example, the data on diabetes could be assumed independent among the participants of the Pima Indian diabetes study. Meanwhile, the updates of the L2C estimate for each jurisdiction in the HPTN 065 study were reported over time and could be correlated.
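As a minimal sketch of how Model (1) is evaluated, the following Python snippet computes the linear predictor with hinge terms, applies the three inverse links named above, and lists the slope of X in each of the R+1 segments. The function names and the numeric values are illustrative assumptions, not part of the original analysis.

```python
import math

def hinge(u):
    """u_+ = max(u, 0)."""
    return max(u, 0.0)

def linear_predictor(x, w, beta0, beta1, beta2, tau, gamma):
    """eta = beta0 + beta1*x + sum_r beta2[r]*(x - tau[r])_+ + gamma'w."""
    eta = beta0 + beta1 * x
    eta += sum(b * hinge(x - t) for b, t in zip(beta2, tau))
    eta += sum(g * wk for g, wk in zip(gamma, w))
    return eta

# Inverse link h(.) for the three model families mentioned in the text.
INVERSE_LINKS = {
    "linear": lambda u: u,
    "logistic": lambda u: 1.0 / (1.0 + math.exp(-u)),
    "poisson": math.exp,
}

def segment_slopes(beta1, beta2):
    """Slope of x in each segment: beta1, beta1 + beta21, beta1 + beta21 + beta22, ..."""
    slopes = [beta1]
    for b in beta2:
        slopes.append(slopes[-1] + b)
    return slopes

# Illustrative values only: two change-points, no W covariates.
beta0, beta1, beta2, tau = 1.0, -1.0, [2.0, -3.0], [-0.5, 1.0]
slopes = segment_slopes(beta1, beta2)
eta = linear_predictor(2.0, [], beta0, beta1, beta2, tau, [])
print(slopes, eta, INVERSE_LINKS["logistic"](eta))
```

Because each hinge term only switches on to the right of its change-point, the regression function stays continuous while its slope jumps by $\beta_{2r}$ at $\tau_r$.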

2.2. Setting with repeated/clustered measurements

This setting includes situations where there are repeated, longitudinal or clustered data for the response variable. Let Yij denote the jth measurement of the response variable for the ith study subject, Xij be a continuous covariate experiencing at least one change-point and Wij denote a vector of other covariates without change-points, j = 1,...,mi, i = 1,...,n. Here mi is the number of available repeated data for the ith subject i = 1,...,n. An extension of Model (1) under this situation is the following population average model:

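$$E(Y_{ij} \mid X_{ij}, W_{ij}) = h\{\eta(\theta, X_{ij}, W_{ij})\} = h\Big\{\beta_0 + \beta_1 X_{ij} + \sum_{r=1}^{R} \beta_{2r}(X_{ij} - \tau_r)_+ + \gamma' W_{ij}\Big\}, \qquad (2)$$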

where $h(\cdot)$, $(\cdot)_+$ and $\theta$ are defined similarly as in (1). The observations $Y_{i1}, \ldots, Y_{im_i}$ are possibly correlated. However, $(Y_i, X_i, W_i)$, $i = 1, \ldots, n$, are assumed to be independent. Here $Y_i = (Y_{i1}, \ldots, Y_{im_i})'$, $X_i = (X_{i1}, \ldots, X_{im_i})'$ and $W_i = (W_{i1}, \ldots, W_{im_i})'$, $i = 1, \ldots, n$. For example, in the HPTN 065 study, $Y_{ij}$ can be thought of as the $j$th updated L2C estimate and $X_{ij}$ as the $j$th quarter since the first data-upload month for the $i$th jurisdiction.

For ease of discussion, R, which represents the number of change-points in (1) or (2) is assumed to be known. Its estimation lies beyond the scope of the current work. If the vector of change-points, τ were known, ϑ in (1) could be estimated using standard estimation methods for generalized linear models.14 Similarly, usual generalized estimating equation techniques15,16 could be applied to estimate ϑ in (2) under the same situation. However, this is not the case in our context, and we are concerned with the estimation of θ that involves both τ and ϑ in (1) and (2).

3. Estimation

For identifiability of the generalized change-point model specified in (1) or (2), we assume that the change-points exist (implying that $\beta_{2r} \neq 0$, $r = 1, \ldots, R$) and the $\tau_r$’s are well-separated when $R > 1$. We first present our methods for the settings with a single measurement in Subsection 3.1 and then extend these methods to settings with repeated/clustered measurements for the response variable in Subsection 3.2.

3.1. Setting with one measurement per unit

We first consider the case where $(Y_i, X_i, W_i)$, $i = 1, \ldots, n$, are independent. A direct application of the traditional maximum likelihood or least squares method for estimating $\theta$ would encounter the problem that $(\cdot)_+$ in (1) is not a smooth function. This complicates the use of a Newton-Raphson type algorithm for the optimization of the log-likelihood function or the least squares objective function. One way to overcome this challenge is to use an adaptation of Muggeo’s algorithm,12,13 which is based on a linear approximation of the nonlinear part of the regression function through a first order Taylor series expansion. This method may not work well in many situations, including when $h(u) \neq u$ and $n$ is not large.

An estimator for θ using an approximation by smoothing techniques can be obtained by replacing (.)+ in (1) with a smooth function and solving the following estimating equations:

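$$\sum_{i=1}^{n} \frac{\partial}{\partial\theta}\tilde\eta(\theta, X_i, W_i, \lambda_n)\,\big[Y_i - h\{\tilde\eta(\theta, X_i, W_i, \lambda_n)\}\big] = 0, \qquad (3)$$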

where $\{\lambda_n\}$ is a sequence of positive real numbers converging to zero as $n$ goes to infinity, $\tilde\eta(\theta, X_i, W_i, \lambda_n) = \beta_0 + \beta_1 X_i + \sum_{r=1}^{R} \beta_{2r} f_s(X_i - \tau_r, \lambda_n) + \gamma' W_i$, and $f_s(u, \lambda)$ is a smooth function that is differentiable with respect to $u$ and gets closer to $u_+$ as $\lambda$ approaches 0. We refer to $\lambda$ as the tuning parameter for $f_s(u, \lambda)$. Potential choices for this function include $f_s(u, \lambda) = (u + \lambda)^2 I(|u| \le \lambda)/(4\lambda) + u I(u > \lambda)$6,7 and $f_s(u, \lambda) = u K(u/\lambda)$, where $K(u)$ is the cumulative distribution function of the standard normal distribution11 or of the form $\{1 + \tanh(u)\}/2$.17,18 This can be viewed as an extension of the methods in Das et al7 and Zhou and Liang.11 Its performance is generally not affected by the choice of $f_s(u, \lambda_n)$ but could be sensitive to the choice of $\lambda_n$. Improperly selecting $\lambda_n$ could lead to biased inference on $\theta$ when $n$ is not large. This has motivated us to seek an alternative approach that is free of approximations through smoothing or linearization. When the $Y_i$’s are independent, the proposed estimator, denoted by $\hat\theta_I$ for $\theta$ in (1), solves
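To make the role of the tuning parameter concrete, the sketch below codes the three smooth surrogates for $u_+$ listed above and reports their largest deviation from $u_+$ on a grid as $\lambda$ shrinks. The function names, the grid and the $\lambda$ values are illustrative assumptions.

```python
import math

def hinge(u):
    """Target function u_+ = max(u, 0)."""
    return max(u, 0.0)

def fs_quadratic(u, lam):
    """(u + lam)^2 / (4 lam) for |u| <= lam, u for u > lam, 0 for u < -lam."""
    if abs(u) <= lam:
        return (u + lam) ** 2 / (4.0 * lam)
    return u if u > lam else 0.0

def fs_normal(u, lam):
    """u * K(u / lam) with K the standard normal CDF."""
    return u * 0.5 * (1.0 + math.erf(u / (lam * math.sqrt(2.0))))

def fs_tanh(u, lam):
    """u * K(u / lam) with K(v) = {1 + tanh(v)} / 2."""
    return u * 0.5 * (1.0 + math.tanh(u / lam))

def max_gap(fs, lam, grid):
    """Largest absolute difference between fs(., lam) and the hinge on the grid."""
    return max(abs(fs(u, lam) - hinge(u)) for u in grid)

grid = [k / 100.0 for k in range(-300, 301)]
for lam in (1.0, 0.1, 0.01):
    gaps = [max_gap(f, lam, grid) for f in (fs_quadratic, fs_normal, fs_tanh)]
    print(lam, [round(g, 5) for g in gaps])
```

For the quadratic surrogate the gap is largest at $u = 0$, where it equals $\lambda/4$, so the approximation error is tied directly to the tuning parameter.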

$$\sum_{i=1}^{n} \begin{bmatrix} 1 \\ X_i \\ W_i \\ X_i(\theta) \end{bmatrix} \big[Y_i - h\{\eta(\theta, X_i, W_i)\}\big] = 0, \qquad (4)$$

where $X_i(\theta) = \{(X_i - \tau_1)_+, \ldots, (X_i - \tau_R)_+, -\beta_{21} I(X_i > \tau_1), \ldots, -\beta_{2R} I(X_i > \tau_R)\}'$. The computation of $\hat\theta_I$ can be performed via iterative methods for solving a system of nonlinear equations without derivatives, such as the derivative-free spectral algorithm for nonlinear equations,19 which combines the Barzilai-Borwein spectral gradient method with non-monotone line search techniques as a strategy to achieve global convergence. It can be used for solving non-smooth estimating equations20 and is implemented via the ‘BBsolve’ function in the ‘BB’ R package. We rely on this algorithm for the computation of the proposed estimators in the simulations in Section 4 and the application to real examples in Section 5. Alternatively, solving (4) can also be formulated as a problem of minimizing $Q(\vartheta, \tau) = -\sum_{i=1}^{n} f(\theta, Y_i, X_i, W_i)/\phi$, where $f(\theta, y, x, w)$ is a log quasi-likelihood function under Model (1). For example, $f(\theta, y, x, w)$ takes the form $-[y - h\{\eta(\theta, x, w)\}]^2/2$ for a linear model, $y \log[h\{\eta(\theta, x, w)\}] + (1 - y)\log[1 - h\{\eta(\theta, x, w)\}]$ for a logistic model and $y \log[h\{\eta(\theta, x, w)\}] - h\{\eta(\theta, x, w)\}$ for a Poisson model. Furthermore, $\phi$ is the error variance for a linear model and takes the value 1 for logistic and Poisson models. $Q(\vartheta, \tau)$ may also be taken as the squared $L_2$ norm of the estimating function in (4). The minimization of $Q(\vartheta, \tau)$ can be carried out through derivative-free optimization techniques that can handle non-smooth objective functions. The simulated annealing algorithm21 and the genetic algorithm22 are among such optimization techniques. It is worth mentioning that there may exist multiple roots for (4) or local minima for $Q(\vartheta, \tau)$.9,12 Therefore, the performance of the derivative-free root finding algorithms or optimization algorithms may depend on the choice of initial values for $\tau$ and $\vartheta$. This issue is shared by Muggeo’s method as well as the change-point estimation methods using smoothing techniques.9,11 Reasonable starting values for the change-points may be obtained based on the following two-step approach. In the first step, an estimator $\tilde\vartheta(\tau)$ for $\vartheta$ is obtained as the minimizer of $Q(\vartheta, \tau)$ for a fixed $\tau$. Subsequently, in the second step, the initial value $\tilde\tau^{(0)}$ for $\tau$ is found by grid search near the potential change-point locations as the minimizer of $Q(\tilde\vartheta(\tau), \tau)$. Potential change-point locations may be obtained by means of graphical examination of the data. The starting value for $\theta$ is then taken as $\tilde\theta^{(0)} = (\tilde\vartheta^{(0)\prime}, \tilde\tau^{(0)\prime})'$, where $\tilde\vartheta^{(0)} = \tilde\vartheta(\tilde\tau^{(0)})$.
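The two-step search for starting values can be sketched as follows for the simplest case, a linear model with a single change-point and no additional covariates, where minimizing Q for fixed τ reduces to ordinary least squares. This is an illustrative pure-Python sketch under those assumptions, not the authors' implementation; the helper names, the grid and the simulated data are hypothetical.

```python
import random

def solve(a, b):
    """Solve the small linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def fit_given_tau(x, y, tau):
    """Step 1: least-squares fit of (beta0, beta1, beta2) with tau held fixed."""
    rows = [[1.0, xi, max(xi - tau, 0.0)] for xi in x]
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(3)] for j in range(3)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(3)]
    coef = solve(xtx, xty)
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, yi in zip(rows, y))
    return coef, rss

def starting_values(x, y, tau_grid):
    """Step 2: pick the grid value of tau with the smallest profiled criterion."""
    best = min(((tau,) + fit_given_tau(x, y, tau) for tau in tau_grid),
               key=lambda t: t[2])
    tau0, coef0, _ = best
    return coef0, tau0

# Simulated data in the spirit of the linear scenario: (1, 2, -3, 1), variance 0.25.
random.seed(1)
x = [random.uniform(-3, 3) for _ in range(250)]
y = [1 + 2 * xi - 3 * max(xi - 1, 0.0) + random.gauss(0, 0.5) for xi in x]
coef0, tau0 = starting_values(x, y, [k / 10 for k in range(-20, 21)])
print(coef0, tau0)
```

The resulting pair would then be passed as the initial value to a derivative-free root finder or optimizer; for logistic or Poisson models, step 1 would use a generalized linear model fit in place of least squares.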

Let $\theta_0$ be the true value of $\theta$. The large sample properties of $\hat\theta_I$ are summarized as follows. Under the regularity conditions (I1)–(I5) in Appendix A of the Supplementary Materials, $\hat\theta_I$ is a consistent estimator for $\theta$ and $\sqrt{n}(\hat\theta_I - \theta_0)$ is asymptotically normally distributed with mean 0. The covariance matrix of $\hat\theta_I$ can be consistently estimated by $\phi\{D(\hat\theta_I)' A(\hat\theta_I) D(\hat\theta_I)\}^{-1}$, where $D(\theta)$ is the matrix whose $i$th row is $\{1, X_i, W_i', X_i(\theta)'\}$, $i = 1, \ldots, n$, and $A(\theta) = \mathrm{Diag}[\nu_1(\theta), \ldots, \nu_n(\theta)]$. Here $\nu_i(\theta)$ is 1 for a linear model and takes the form $h\{\eta(\theta, X_i, W_i)\}$ and $h\{\eta(\theta, X_i, W_i)\}[1 - h\{\eta(\theta, X_i, W_i)\}]$ for Poisson and logistic models, respectively. More details regarding the asymptotic properties for $\hat\theta_I$ are provided in Appendix B of the Supplementary Materials.

We now turn our attention to the situation where the $Y_i$’s are correlated. This includes the case of a broken stick model with autocorrelated errors. Methods for the estimation of $\theta$ under this setting are scarce in the literature. Let $Y = (Y_1, \ldots, Y_n)'$, $X = (X_1, \ldots, X_n)'$ and $W$ be the matrix whose $i$th row is $W_i'$. Also, let $\Sigma$ denote the correlation matrix of $Y$ and $\Omega(\theta) = \phi A(\theta)^{1/2} \Sigma A(\theta)^{1/2}$, where $\phi$ and $A(\theta)$ are defined as above. The proposed estimator, denoted by $\hat\theta_C$ for $\theta$ in (1) under this situation, solves

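$$\tilde D(\theta)' \Omega^{-1}(\theta) \big[Y - h\{\eta(\theta, X, W)\}\big] = 0, \qquad (5)$$

where $\tilde D(\theta)' = D(\theta)' A(\theta)$ and $h\{\eta(\theta, X, W)\}$ is the vector whose $i$th component is given by $h\{\eta(\theta, X_i, W_i)\}$, $i = 1, \ldots, n$. In practice, $\Sigma$ is generally unknown and is replaced in (5) by an estimator $\hat\Sigma$ of a working correlation matrix based on an assumed correlation structure. Under suitable regularity conditions, $\hat\theta_C$ is a consistent estimator for $\theta$ and $\sqrt{n}(\hat\theta_C - \theta_0)$ is asymptotically normally distributed with mean 0. Furthermore, the covariance matrix of $\hat\theta_C$ can be consistently estimated by $\{\tilde D(\hat\theta_C)' \hat\Omega(\hat\theta_C)^{-1} \tilde D(\hat\theta_C)\}^{-1}$, where $\hat\Omega(\theta)$ is defined similarly as $\Omega(\theta)$ but with $\Sigma$ replaced by $\hat\Sigma$. It is worth noting that $\hat\theta_C$ coincides with $\hat\theta_I$ when $\hat\Sigma$ is the identity matrix.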
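The closing remark, that the correlated-data estimator coincides with the independence estimator under an identity working correlation, can be checked directly. The short derivation below is a sketch that assumes the weighted design matrix in (5) satisfies $\tilde D(\theta)' = D(\theta)' A(\theta)$, with $D(\theta)$ and $A(\theta)$ as defined for the independent case; the tilde is used here only to distinguish the two matrices.

```latex
\begin{aligned}
\hat\Sigma = I_n \;&\Rightarrow\; \hat\Omega(\theta) = \phi\, A(\theta)^{1/2} I_n A(\theta)^{1/2} = \phi\, A(\theta), \\
\tilde D(\theta)'\, \hat\Omega^{-1}(\theta)\, \big[Y - h\{\eta(\theta, X, W)\}\big]
  &= D(\theta)' A(\theta)\, \{\phi A(\theta)\}^{-1} \big[Y - h\{\eta(\theta, X, W)\}\big] \\
  &= \phi^{-1} D(\theta)' \big[Y - h\{\eta(\theta, X, W)\}\big] \\
  &= \phi^{-1} \sum_{i=1}^{n} \{1, X_i, W_i', X_i(\theta)'\}' \big[Y_i - h\{\eta(\theta, X_i, W_i)\}\big].
\end{aligned}
```

Setting the last line to zero gives the estimating equation (4) up to the constant factor $\phi^{-1}$, so the two equations share the same roots. Under the same assumption, the covariance estimator $\{\tilde D' \hat\Omega^{-1} \tilde D\}^{-1}$ reduces to $\phi\{D' A D\}^{-1}$, which matches the expression given for the independent case.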

3.2. Setting with repeated/clustered response data

For the particular case of repeated numeric response data corresponding to $h(u) = u$ (piecewise linear model), Das et al7 proposed an estimator using smoothing techniques, similar to the estimator that solves (3). Here we are mainly interested in an estimation method that is free of smoothing approximation for the general problem of the estimation of change-points in at least one of the covariates of a generalized estimating equation model for longitudinal continuous or binary data. Let $\Sigma_i$ denote the correlation matrix of $Y_i = (Y_{i1}, \ldots, Y_{im_i})'$, $i = 1, \ldots, n$. Also, let $X_{ij}(\theta) = \{(X_{ij} - \tau_1)_+, \ldots, (X_{ij} - \tau_R)_+, -\beta_{21} I(X_{ij} > \tau_1), \ldots, -\beta_{2R} I(X_{ij} > \tau_R)\}'$, and $D_i(\theta)$ be the matrix whose $j$th row is $\{1, X_{ij}, W_{ij}', X_{ij}(\theta)'\}$, $j = 1, \ldots, m_i$, $i = 1, \ldots, n$. Furthermore, denote by $A_i(\theta)$ the diagonal matrix $\mathrm{Diag}[\nu_{i1}(\theta), \ldots, \nu_{im_i}(\theta)]$, where $\nu_{ij}(\theta)$ is the error variance for the linear model, $h\{\eta(\theta, X_{ij}, W_{ij})\}[1 - h\{\eta(\theta, X_{ij}, W_{ij})\}]$ for the logistic model and $h\{\eta(\theta, X_{ij}, W_{ij})\}$ for the Poisson model with change-points. Our estimator, $\hat\theta_L$ for $\theta$ in (2), is obtained by solving

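$$\sum_{i=1}^{n} \tilde D_i(\theta)' \hat\Omega_i^{-1}(\theta) \big[Y_i - h\{\eta(\theta, X_i, W_i)\}\big] = 0, \qquad (6)$$

where $\tilde D_i(\theta)' = D_i(\theta)' A_i(\theta)$, $\hat\Omega_i(\theta) = A_i(\theta)^{1/2} \hat\Sigma_i A_i(\theta)^{1/2}$ and $\hat\Sigma_i$ is a working correlation matrix that can be obtained based on an assumed correlation structure for $Y_i$. Under suitable regularity conditions, $\hat\theta_L$ is consistent and $\sqrt{n}(\hat\theta_L - \theta_0)$ asymptotically follows a zero-mean normal distribution. The covariance matrix of $\hat\theta_L$ can be consistently estimated by $B(\hat\theta_L)^{-1} M(\hat\theta_L) B(\hat\theta_L)^{-1}$, where $M(\theta) = \sum_{i=1}^{n} \tilde D_i(\theta)' \hat\Omega_i^{-1}(\theta) \Psi_i(\theta) \Psi_i(\theta)' \hat\Omega_i^{-1}(\theta) \tilde D_i(\theta)$, $B(\theta) = \sum_{i=1}^{n} \tilde D_i(\theta)' \hat\Omega_i^{-1}(\theta) \tilde D_i(\theta)$ and $\Psi_i(\theta) = Y_i - h\{\eta(\theta, X_i, W_i)\}$, $i = 1, \ldots, n$.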

It is important to note that Models (1) and (2) can be extended to incorporate more than one variable experiencing the change-points. Our estimation methods can be easily applied to such a situation, which is considered in the simulations and applications. Moreover, key differences with the existing methods in Das et al7 and Zhou and Liang11 are highlighted as follows. First, the methods introduced in these references use approximations by smoothing techniques requiring the selection of a tuning parameter while our estimators, θ^I, θ^C and θ^L do not involve such smoothing techniques. Second, the methods in Das et al7 were built for the estimation of change-points in the covariate of a linear regression model. Meanwhile, our methods can accommodate linear and nonlinear models under the generalized linear modeling framework. Third, the method by Zhou and Liang11 was developed for the estimation of a single change-point in one covariate of a generalized linear model for independent observations. Our methods on the other hand can handle the estimation of multiple change-points in one or several covariates of a generalized linear model for independent or correlated data.

4. Simulation study

We performed extensive simulations to examine the finite-sample performances of the proposed estimators $\hat\theta_I$, $\hat\theta_C$ and $\hat\theta_L$ presented in Section 3. We also compared $\hat\theta_I$ with Muggeo’s estimator, denoted by $\hat\theta_M$, when the data are independent. $\hat\theta_M$ can be easily implemented via the ‘segmented’ R package. This represents a computational advantage for Muggeo’s algorithm over the proposed methods when the independence assumption for the data holds. The implementation of the proposed methods, however, requires solving the estimating equations (4), (5) and (6). In Table 1, we considered the situations where there is a single change-point in a covariate of a linear, logistic or Poisson model and the observations are independent. The covariate $X$ was drawn from $U(-3, 3)$ and $Y$ was generated according to $E(Y \mid X) = h\{\beta_0 + \beta_1 X + \beta_2 (X - \tau)_+\}$ with $(\beta_0, \beta_1, \beta_2, \tau) = (1, 2, -3, 1)$ and $n = 250$ or 1000. For the linear regression, $Y$ given $X$ was normal with mean $\beta_0 + \beta_1 X + \beta_2 (X - \tau)_+$ and variance 0.25. The results were from 1000 Monte Carlo samples and the performances of $\hat\theta_M$ and $\hat\theta_I$ were assessed with regard to bias (Bias), sample standard deviation of the estimates (SD), average of the estimated standard errors (ASE), mean square error (MSE) of the estimates and coverage probability (CP) of the 95% Wald-type confidence interval. Here bias is defined as the difference between the sample mean of the estimates from the 1000 replicates and the true parameter value, and MSE = Bias$^2$ + SD$^2$. Also, coverage probability denotes the proportion of the 1000 replicates in which the 95% Wald-type confidence interval includes the true parameter value. It can be noted from this table that $\hat\theta_I$ and $\hat\theta_M$ perform well and yield very close results for linear and Poisson models. For logistic regression, however, $\hat\theta_M$ exhibits larger bias and MSE compared with $\hat\theta_I$ when $n = 250$. Also, it shows smaller ASE and coverage probability (under 90%) than $\hat\theta_I$ for $\tau$ when $n = 250$. Meanwhile, $\hat\theta_I$ demonstrates good coverage probabilities and its SDs are halved when $n$ increases from 250 to 1000, suggesting that $\hat\theta_I$ is $\sqrt{n}$-consistent.
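The performance measures reported in the tables can be computed from the replicate-level output as in the following sketch, which assumes that each Monte Carlo replicate returns a point estimate and an estimated standard error for a single parameter. The function name and the toy inputs are hypothetical.

```python
import math

def summarize(estimates, std_errors, truth, z=1.959964):
    """Bias, SD, ASE, MSE and Wald-type coverage for one parameter."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - truth                       # mean of estimates minus true value
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n - 1))
    ase = sum(std_errors) / n                 # average estimated standard error
    mse = bias ** 2 + sd ** 2                 # MSE = Bias^2 + SD^2
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if abs(e - truth) <= z * s)  # truth inside estimate +/- z * SE
    return {"Bias": bias, "SD": sd, "ASE": ase, "MSE": mse, "CP": covered / n}

# Toy replicates for a parameter whose true value is 1.0 (not simulation output).
res = summarize([0.9, 1.1, 1.0, 1.3], [0.1, 0.1, 0.1, 0.1], truth=1.0)
print(res)
```

Comparing ASE with SD indicates whether the estimated standard errors track the actual sampling variability; a coverage probability well below 0.95 alongside ASE smaller than SD points to underestimated standard errors.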

Table 1:

Simulation results for the situation where there is a single change-point for linear, logistic and Poisson regression models for independent data; $\hat\theta_M$, estimator based on Muggeo’s method; $\hat\theta_I$, proposed estimator solving (4).

Model n Parameter | $\hat\theta_M$: Bias ASE SD MSE CP | $\hat\theta_I$: Bias ASE SD MSE CP
Linear 250 β0 0.004 0.052 0.053 0.003 0.939 0.004 0.052 0.052 0.003 0.944
β1 0.000 0.034 0.035 0.001 0.938 0.000 0.034 0.034 0.001 0.940
β2 −0.001 0.102 0.110 0.012 0.929 −0.001 0.102 0.109 0.012 0.930
τ −0.002 0.045 0.047 0.002 0.940 −0.002 0.045 0.046 0.002 0.946
1000 β0 0.000 0.026 0.026 0.001 0.942 0.000 0.026 0.026 0.001 0.946
β1 0.000 0.017 0.017 0.000 0.948 0.000 0.017 0.017 0.000 0.947
β2 0.001 0.050 0.052 0.003 0.942 0.001 0.050 0.052 0.003 0.943
τ 0.000 0.022 0.023 0.001 0.940 0.000 0.022 0.023 0.001 0.943

Logistic 250 β0 0.095 0.316 0.414 0.180 0.942 0.042 0.296 0.294 0.088 0.960
β1 0.122 0.333 0.407 0.180 0.950 0.073 0.314 0.305 0.098 0.968
β2 −0.236 0.783 0.798 0.692 0.965 −0.196 0.765 0.769 0.629 0.964
τ −0.040 0.324 0.415 0.174 0.870 0.003 0.330 0.300 0.090 0.952
1000 β0 0.015 0.144 0.153 0.024 0.939 0.013 0.143 0.148 0.022 0.944
β1 0.021 0.151 0.160 0.026 0.945 0.020 0.150 0.156 0.025 0.948
β2 −0.044 0.351 0.350 0.125 0.956 −0.038 0.349 0.344 0.120 0.959
τ −0.003 0.163 0.185 0.034 0.910 −0.003 0.163 0.169 0.029 0.932

Poisson 250 β0 −0.004 0.070 0.069 0.005 0.944 −0.003 0.070 0.068 0.005 0.946
β1 0.008 0.101 0.106 0.011 0.937 0.007 0.100 0.103 0.011 0.943
β2 −0.013 0.124 0.124 0.015 0.945 −0.011 0.124 0.120 0.014 0.940
τ 0.000 0.031 0.034 0.001 0.914 0.000 0.031 0.034 0.001 0.922
1000 β0 −0.001 0.035 0.035 0.001 0.936 −0.001 0.035 0.035 0.001 0.936
β1 0.002 0.050 0.050 0.003 0.948 0.002 0.050 0.049 0.002 0.951
β2 −0.002 0.061 0.060 0.004 0.956 −0.002 0.061 0.059 0.004 0.956
τ 0.000 0.016 0.016 0.000 0.946 0.000 0.016 0.015 0.000 0.947

Note: SD denotes the sample standard deviation of the estimates; ASE is the average of the estimated standard errors; MSE is mean square error; CP represents the coverage probabilities of the 95% confidence intervals.

Additionally, we conducted simulations to examine the performances of $\hat\theta_M$ and the proposed estimators $\hat\theta_I$ and $\hat\theta_C$ for the case of a single change-point in a covariate of a linear model with small to moderate sample sizes. The covariate $X$ was simulated from $U(-3, 3)$ and the response $Y$ given $X$ followed a multivariate normal distribution with mean $\beta_0 + \beta_1 X + \beta_2 (X - \tau)_+$ and variance $\sigma^2 \Sigma$ with $\Sigma_{i,j} = \rho^{|i-j|}$, assuming an AR(1) correlation structure. We set $\rho = 0$ or $\rho = 0.3$ (depicting independent or correlated observations), $\theta = (\beta_0, \beta_1, \beta_2, \tau) = (1, 2, -3, 1)$, $\sigma = 0.25$ and $n = 15$, 30, 60 and 120. The parameter $\theta$ was estimated by $\hat\theta_I$ and $\hat\theta_M$ when $\rho = 0$ and by $\hat\theta_C$ when $\rho \neq 0$. Moreover, the performance of $\hat\theta_C$ was further investigated when $n = 300$ and $\rho = 0.3$, 0.7. The results of these simulations are presented in Table S1 of the Supplementary Materials. For the situation where $\rho = 0$, it can be observed that $\hat\theta_I$ and $\hat\theta_M$ show similarly small biases and adequate coverage probabilities when $n \ge 30$. Furthermore, $\hat\theta_I$ works satisfactorily while $\hat\theta_M$ may suffer from low coverage probabilities when $n = 15$. For the situation where $\rho \neq 0$, $\hat\theta_C$ generally displays small biases and coverage probabilities that are close to the nominal 95% level except when the sample size is very small ($n = 15$). The performances of all the methods improve as $n$ gets larger. The ASEs for the change-point and slope parameters decrease as $\rho$ increases.
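For reference, the AR(1) correlation structure used in this scenario, with entries $\rho^{|i-j|}$, can be written out as in the sketch below; the function name is an illustrative assumption.

```python
def ar1_corr(n, rho):
    """n x n AR(1) correlation matrix with (i, j) entry rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

sigma = ar1_corr(4, 0.3)
for row in sigma:
    print([round(v, 4) for v in row])

# rho = 0 gives the identity matrix, i.e. independent observations.
identity = ar1_corr(3, 0.0)
```

The correlation between two observations decays geometrically with the distance between their indices, so only neighbouring observations are appreciably correlated when $\rho = 0.3$.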

Similarly, simulations were also performed to investigate the performances of the methods θ^M and θ^I for the estimation of a single change-point in a covariate of a logistic or Poisson regression model for independent data under small to moderate sample sizes. The data generation was as in Table 1 for logistic and Poisson models, except that n = 50 or 100. The results are presented in Table S2 of the Supplementary Materials. For logistic model, it can be noted that both θ^M and θ^I show large biases and inadequate coverage probabilities. Furthermore, θ^M shows smaller ASE’s and lower coverage probabilities for the change-point in comparison with θ^I. For the Poisson model, both estimators have reasonable biases and satisfactory coverage probabilities for the estimators of the intercept and slope parameters. Furthermore, θ^M exhibits low coverage probabilities for the change-point. Additional simulations examining the performance of the smoothing techniques-based estimator that solves (3) are provided in Appendix C of the Supplementary Materials.

In Table 2, there were two change-points in one covariate of a linear, logistic or Poisson regression model for independent observations. The covariate $X$ followed $U(-3, 3)$ and $Y$ was generated via the model $E(Y \mid X) = h\{\beta_0 + \beta_1 X + \beta_{21}(X - \tau_1)_+ + \beta_{22}(X - \tau_2)_+\}$ with $\theta = (\beta_0, \beta_1, \beta_{21}, \beta_{22}, \tau_1, \tau_2)$ set as $(1, -1, 2, -3, -0.5, 1)$ and $n = 500$ or 2000. For the linear model, $Y$ given $X$ was normal with variance 0.25. The parameter $\theta$ was estimated using $\hat\theta_I$ and $\hat\theta_M$. These estimators show similar performances as in Table 1 for the linear, logistic and Poisson regression models. $\hat\theta_M$ has inadequate coverage probabilities for the change-points $\tau_1$ and $\tau_2$ for the logistic regression model when $n = 500$. It also shows larger biases and MSEs in comparison with the $\hat\theta_I$ counterparts for the logistic regression model. Simulations assessing the performances of the methods when there are two covariates, each experiencing one change-point in a linear, logistic or Poisson regression model, are presented in Appendix D of the Supplementary Materials.

Table 2:

Simulation results for the situation where there are two change-points in a covariate of a linear, logistic or Poisson regression model; $\hat\theta_M$, estimator based on Muggeo’s method; $\hat\theta_I$, proposed estimator solving (4).

Model n Parameter | $\hat\theta_M$: Bias ASE SD MSE CP | $\hat\theta_I$: Bias ASE SD MSE CP
Linear 500 β0 −0.007 0.092 0.095 0.009 0.943 −0.006 0.092 0.093 0.009 0.948
β1 −0.002 0.049 0.049 0.002 0.944 −0.002 0.049 0.048 0.002 0.947
β21 −0.001 0.115 0.117 0.014 0.944 −0.003 0.115 0.107 0.012 0.960
β22 0.002 0.124 0.125 0.016 0.953 0.004 0.124 0.116 0.013 0.963
τ1 −0.006 0.057 0.061 0.004 0.924 −0.006 0.057 0.056 0.003 0.938
τ2 0.002 0.040 0.043 0.002 0.926 0.002 0.040 0.041 0.002 0.941
2000 β0 −0.001 0.046 0.047 0.002 0.941 0.001 0.046 0.047 0.002 0.944
β1 0.000 0.024 0.025 0.001 0.944 0.000 0.024 0.024 0.001 0.942
β21 0.003 0.057 0.057 0.003 0.944 0.002 0.057 0.055 0.003 0.948
β22 −0.004 0.062 0.064 0.004 0.941 −0.003 0.062 0.062 0.004 0.944
τ1 0.000 0.028 0.029 0.001 0.943 0.000 0.028 0.028 0.001 0.953
τ2 0.000 0.020 0.020 0.000 0.946 0.000 0.020 0.020 0.000 0.954

Logistic 500 β0 −0.162 0.462 0.625 0.417 0.910 −0.069 0.431 0.403 0.167 0.961
β1 −0.092 0.275 0.344 0.127 0.925 −0.046 0.262 0.250 0.064 0.967
β21 0.425 0.738 0.901 0.992 0.956 0.123 0.569 0.466 0.232 0.979
β22 −0.458 0.795 0.965 1.141 0.961 −0.163 0.633 0.526 0.303 0.985
τ1 0.002 0.230 0.351 0.123 0.791 −0.002 0.241 0.199 0.040 0.958
τ2 −0.006 0.179 0.244 0.060 0.840 0.008 0.184 0.168 0.028 0.953
2000 β0 −0.032 0.210 0.234 0.056 0.934 −0.023 0.209 0.204 0.042 0.963
β1 −0.018 0.127 0.135 0.019 0.938 −0.014 0.127 0.123 0.015 0.956
β21 0.081 0.279 0.337 0.120 0.944 0.039 0.269 0.243 0.060 0.967
β22 −0.077 0.307 0.371 0.144 0.930 −0.041 0.298 0.277 0.079 0.969
τ1 0.001 0.119 0.151 0.023 0.896 0.000 0.120 0.109 0.012 0.961
τ2 −0.005 0.092 0.110 0.012 0.904 0.001 0.093 0.092 0.008 0.947

Poisson 500 β0 −0.005 0.057 0.059 0.003 0.938 −0.005 0.057 0.058 0.003 0.946
β1 −0.002 0.025 0.026 0.001 0.940 −0.002 0.025 0.025 0.001 0.944
β21 0.008 0.073 0.076 0.006 0.937 0.007 0.073 0.073 0.005 0.942
β22 −0.010 0.109 0.112 0.013 0.942 −0.008 0.109 0.107 0.011 0.950
τ1 0.001 0.041 0.044 0.002 0.928 0.000 0.041 0.042 0.002 0.937
τ2 0.001 0.024 0.026 0.001 0.924 0.001 0.024 0.025 0.001 0.933
2000 β0 −0.001 0.029 0.030 0.001 0.942 −0.001 0.029 0.029 0.001 0.943
β1 0.000 0.012 0.013 0.000 0.945 0.000 0.012 0.013 0.000 0.947
β21 0.001 0.036 0.036 0.001 0.950 0.001 0.036 0.035 0.001 0.952
β22 0.001 0.054 0.053 0.003 0.962 0.001 0.054 0.051 0.003 0.966
τ1 0.000 0.021 0.021 0.000 0.945 0.000 0.021 0.020 0.000 0.948
τ2 0.000 0.012 0.012 0.000 0.950 0.000 0.012 0.012 0.000 0.949

Note: SD denotes the sample standard deviation of the estimates; ASE is the average of the estimated standard errors; MSE is mean square error; CP represents the coverage probabilities of the 95% confidence intervals.

In Table 3, we examined the situation with a single change-point in a covariate of a linear, logistic or Poisson model for longitudinal data. The sample size was n = 200 and the number of observations mi for subject i was generated as $m_i = \max(2, \kappa_i)$, where κi ∼ Poisson(10), i = 1,...,n. Also, the covariate Xij for subject i was simulated from U(−3,3), j = 1,...,mi. For the correlation among the observations within a subject, an AR(1) correlation structure was considered for the linear model, while exchangeable correlation structures were assumed for the logistic and Poisson models, with a correlation parameter ρ = 0.3 or 0.7. Moreover, Yij was generated according to the population mean model $E(Y_{ij} \mid X_{ij}) = h\{\beta_0 + \beta_1 X_{ij} + \beta_2 (X_{ij} - \tau)_+\}$. For the linear model, the error followed a zero-mean normal distribution with variance 0.25. The vector of parameters $\theta = (\beta_0, \beta_1, \beta_2, \tau)$ was set as (1,−1,2,2) and estimated using θ^L. This estimator shows small biases and adequate coverage probabilities (between 0.925 and 0.964 in Table 3) for the change-point τ and the other components of θ. Also, the ASE for τ decreases as ρ increases.
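The data-generating design described above can be sketched as follows for the linear model. This is a minimal illustration, not the authors' simulation code: the function names, the random seed, and the use of a multivariate normal draw to impose the AR(1) error structure are assumptions made for the example.

```python
import numpy as np

def hinge(x, tau):
    """Positive part (x - tau)_+ used in the change-point mean model."""
    return np.maximum(x - tau, 0.0)

def ar1_cov(m, rho, sigma2):
    """AR(1) covariance matrix: sigma2 * rho**|j - k| for j, k = 0..m-1."""
    idx = np.arange(m)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_linear(n=200, rho=0.3, theta=(1.0, -1.0, 2.0, 2.0),
                    sigma2=0.25, seed=0):
    """Simulate longitudinal data from the linear change-point model.

    theta = (beta0, beta1, beta2, tau) as in the text; the seed and the
    multivariate normal error draw are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    b0, b1, b2, tau = theta
    subjects = []
    for i in range(n):
        m_i = max(2, int(rng.poisson(10)))          # m_i = max(2, kappa_i)
        x = rng.uniform(-3.0, 3.0, size=m_i)        # X_ij ~ U(-3, 3)
        mean = b0 + b1 * x + b2 * hinge(x, tau)     # identity link h
        err = rng.multivariate_normal(np.zeros(m_i), ar1_cov(m_i, rho, sigma2))
        subjects.append((i, x, mean + err))
    return subjects

data = simulate_linear()
```

Each element of `data` holds a subject index, that subject's covariate values, and the corresponding responses. For the logistic and Poisson settings, correlated binary or count responses with an exchangeable structure would need a different generator, which is not attempted here.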

Table 3:

Simulation results for the situation where there is a single change-point in a covariate of a linear, logistic or Poisson regression model for longitudinal data; θ^L, proposed estimator solving (6).

θ^L (ρ = 0.3)
θ^L (ρ = 0.7)
Model Bias ASE SD MSE CP Bias ASE SD MSE CP
Linear β0 0.001 0.016 0.016 0.000 0.951 0.001 0.022 0.022 0.000 0.951
β1 0.000 0.008 0.008 0.000 0.952 0.000 0.005 0.005 0.000 0.955
β2 0.006 0.088 0.088 0.008 0.942 0.004 0.058 0.058 0.003 0.939
τ 0.001 0.027 0.028 0.001 0.941 0.000 0.018 0.018 0.000 0.944

Logistic β0 −0.003 0.096 0.095 0.009 0.955 0.002 0.128 0.128 0.016 0.950
β1 −0.005 0.059 0.060 0.004 0.945 −0.010 0.069 0.073 0.005 0.935
β2 0.079 0.400 0.383 0.153 0.961 0.070 0.361 0.335 0.117 0.964
τ 0.004 0.125 0.125 0.016 0.933 0.001 0.113 0.116 0.013 0.925

Poisson β0 −0.002 0.027 0.027 0.001 0.944 −0.002 0.032 0.032 0.001 0.944
β1 −0.001 0.009 0.009 0.000 0.943 −0.001 0.010 0.010 0.000 0.957
β2 0.023 0.224 0.220 0.049 0.939 0.019 0.173 0.168 0.029 0.954
τ 0.002 0.074 0.074 0.005 0.938 0.003 0.058 0.058 0.003 0.946

Note: SD denotes the sample standard deviation of the estimates; ASE is the average of the estimated standard errors; MSE is mean square error; CP represents the coverage probabilities of the 95% confidence intervals.
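As a reading aid for the table notes, the summary measures can be computed from simulation replicates roughly as follows. This is a sketch under stated assumptions: the function name `summarize` is hypothetical, and the 95% intervals are taken to be Wald-type with the normal quantile 1.96.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.96):
    """Bias, SD, ASE, MSE and CP for one parameter across replicates."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    # A replicate "covers" when the Wald interval contains the true value.
    covered = (est - z * se <= truth) & (truth <= est + z * se)
    return {
        "Bias": est.mean() - truth,          # mean estimate minus truth
        "SD": est.std(ddof=1),               # sample SD of the estimates
        "ASE": se.mean(),                    # average estimated standard error
        "MSE": np.mean((est - truth) ** 2),  # mean square error
        "CP": covered.mean(),                # empirical coverage probability
    }
```

Under these definitions MSE is approximately Bias² + SD². For example, for β2 in the logistic model of Table 3 with ρ = 0.3, 0.079² + 0.383² ≈ 0.153, which agrees with the tabulated MSE.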

5. Applications

We applied the proposed methods to i) examine the trends of the reported L2C estimates and to provide improvements of these estimates in the HPTN065 study and ii) investigate the associations of age, BMI and glucose with T2D in the Pima Indian diabetes study.

5.1. HIV surveillance data from HPTN 065 study

The HIV surveillance data were briefly described in the Introduction section. In our analysis, L2C is the measure of interest. It also represents one of the key measures that the National HIV/AIDS Strategy (NHAS) aims to enhance.23 However, its monitoring is difficult due to reporting delays and data incompleteness, among other potential issues that may compromise the quality of surveillance data in general.

In our analysis, we were primarily interested in investigating the L2C trends over the data upload time and providing improved-quality L2C estimates based on the trend examination. A graphical examination of the L2C trends in Figure 1 led us to assume the existence of a single change-point in the L2C trend for each jurisdiction. We first considered the jurisdiction-specific trends and used the following piecewise linear model for the L2C data separately for each city jurisdiction, corresponding to the setting with one measurement for the response variable:

$$E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i + \beta_2 (X_i - \tau)_+,$$

where Yi is L2C, Xi denotes the ith quarter since the first data-upload month, and $\theta = (\beta_0, \beta_1, \beta_2, \tau)'$ represents the jurisdiction-specific vector of parameters for the change-point model. We estimated θ by our estimator θ^C, solving (5) and assuming an AR(1) correlation structure for the Yi’s. The estimates of the change-point model parameters, together with their standard errors and 95% Wald-type confidence intervals, are reported in Table 4. The fitted piecewise linear regression lines are displayed as solid grey lines in Figure 1. The results suggest that the change-point for the updated estimate of the L2C measure varied across the jurisdictions, signaling that the time at which the L2C estimates became approximately stable differed among the six cities. The L2C estimates for Houston, TX and Philadelphia, PA appeared to stabilize earlier than those for the other cities (estimated change-points of 1.55 and 2.34 quarters, respectively), while Chicago, IL was the slowest to reach approximate stability (estimated change-point of 7.52 quarters). Also, the reported L2C estimates for Chicago seemed to be relatively stable before the fifth update quarter, after which they showed a sudden increase, probably due to reporting delays. This pattern was not evident in the trends of the L2C estimates for the other jurisdictions.
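To make the two-phase piecewise linear model concrete, the sketch below fits it by a profile least-squares grid search over candidate change-points. This is a simple stand-in and not the estimating-equation estimator solving (5): it ignores the AR(1) correlation and gives no standard errors. The function name, the grid, and the synthetic data (which are not HPTN 065 values) are assumptions made for illustration.

```python
import numpy as np

def fit_one_changepoint(x, y, grid):
    """Profile least squares for E(Y|X) = b0 + b1*X + b2*(X - tau)_+.

    For each candidate tau in `grid`, the betas are fitted by ordinary
    least squares; the tau with the smallest residual sum of squares wins.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for tau in grid:
        design = np.column_stack([np.ones_like(x), x, np.maximum(x - tau, 0.0)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ coef) ** 2))
        if best is None or rss < best[0]:
            best = (rss, float(tau), coef)
    rss, tau, coef = best
    return {"beta0": coef[0], "beta1": coef[1], "beta2": coef[2],
            "tau": tau, "rss": rss}

# Synthetic, noise-free example: the trend rises by 3 per quarter until
# quarter 3 and is flat afterwards (beta2 = -beta1).
quarters = np.arange(1, 13, dtype=float)
l2c = 70.0 + 3.0 * quarters - 3.0 * np.maximum(quarters - 3.0, 0.0)
fit = fit_one_changepoint(quarters, l2c, np.linspace(1.5, 10.5, 37))
```

In this parameterization the slope after the change-point is beta1 + beta2, so a fitted beta2 close to −beta1 corresponds to a trend that levels off, which is the pattern seen for most jurisdictions in Table 4.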

Table 4:

Results of the analysis of the HIV-surveillance data: fitting a two-phase piecewise linear model (with an unknown change-point) to the jurisdiction-specific L2C measures in the HPTN 065 study based on the proposed estimator solving (5).

City jurisdiction
DC Miami Chicago New York Philadelphia Houston
Estimation of the change-point model parameters
β0 Estimate 71.25 42.21 74.13 78.60 74.19 49.43
SE 0.76 0.70 1.15 1.22 0.06 0.86
95% CI (69.76, 72.74) (40.84, 43.59) (71.88, 76.38) (76.21, 81.00) (74.07, 74.31) (47.74, 51.12)
β1 Estimate 3.29 3.13 0.96 3.58 2.92 16.37
SE 0.61 0.39 0.27 0.67 0.05 1.26
95% CI (2.09, 4.48) (2.37, 3.90) (0.44, 1.49) (2.26, 4.90) (2.82, 3.02) (13.90, 18.83)
β2 Estimate −3.01 −2.93 −0.96 −3.87 −3.00 −16.04
SE 0.61 0.39 0.77 0.69 0.05 1.26
95% CI (−4.21, −1.80) (−3.70, −2.16) (−2.47, 0.55) (−5.22, −2.52) (−3.10, −2.90) (−18.50, −13.57)
τ Estimate 2.96 3.35 7.52 3.38 2.34 1.55
SE 0.46 0.39 1.96 0.47 0.03 0.10
95% CI (2.05, 3.87) (2.66, 4.04) (3.67, 11.36) (2.46, 4.31) (2.28, 2.40) (1.36, 1.74)

Annual percentage of linkage to care
% Estimate 82.13 53.54 81.44 89.68 80.62 76.42

Note: β0 is the intercept; β1 is the slope of the first segment; β2 is the difference in slopes between the two segments; τ is the jurisdiction change-point.

Knowing when the L2C estimates became approximately stable is also important, as it may give insight into how fast the surveillance system reaches maturity with regard to delays and completeness of reporting of the HIV surveillance data for the jurisdictions. Without such information, critical decisions on allocation of resources and planning for HIV treatment and prevention may be made based on unreliable estimates of the prevention measures. This could lead to a misunderstanding of the trend of the true linkage to care measure and potentially hamper the efforts to reduce HIV infections in the general population. Also, the knowledge of the change-point can be used to obtain improved estimates of the reported surveillance measures in the absence of accurate estimates. It is very likely that the data quality is higher for the post change-point estimates than for the earlier ones. We used the L2C data that were reported after the change-point to provide more reliable L2C estimates by taking the average of the post change-point estimates for each jurisdiction. The improved estimates of the annual percentage of linkage to care for the six cities are shown at the bottom of Table 4. They ranged from 53.54% (Miami, FL) to 89.68% (New York, NY) and were higher than 80%, except in Miami, FL and Houston, TX (76.42%). To further investigate the bias in the reported estimates over the data upload time, we defined the bias of the reported estimate as the reported estimate from the surveillance data minus the improved estimate that we calculated. It can be seen from Figure S2 of the Supplementary Materials that the reporting bias appeared to be generally large and negative before the change-point, probably due to data incompleteness or reporting delays. Also, it decreased in magnitude and got close to zero after the change-point for each jurisdiction. It is important to mention that we did not pursue a formal test for the post change-point data stabilization. This could be entertained by testing whether the slope after the change-point is equal to 0.
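The improved estimate and the reporting bias defined above amount to a short calculation, sketched here. The function name and the toy numbers are hypothetical, and treating "after the change-point" as quarters strictly greater than the estimated change-point is an assumption of this sketch.

```python
import numpy as np

def improved_estimate(quarters, reported, tau_hat):
    """Average the reported estimates after the change-point and
    return that average with the bias of every reported estimate."""
    quarters = np.asarray(quarters, dtype=float)
    reported = np.asarray(reported, dtype=float)
    post = quarters > tau_hat                 # assumed definition of "post"
    if not post.any():
        raise ValueError("no reported estimates after the change-point")
    improved = reported[post].mean()          # improved L2C estimate
    bias = reported - improved                # reported minus improved
    return improved, bias

# Toy values (not HPTN 065 data): early uploads under-report linkage.
improved, bias = improved_estimate(
    quarters=[1, 2, 3, 4, 5, 6],
    reported=[60.0, 70.0, 78.0, 80.0, 82.0, 81.0],
    tau_hat=2.5,
)
```

With these toy inputs the early biases are large and negative and the later ones are close to zero, mirroring the qualitative pattern described for Figure S2.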

Next, for illustrative purposes only, we considered the HIV surveillance data as clustered within the jurisdictions and modeled the L2C data as follows:

$$E(Y_{ij} \mid X_{ij}) = \beta_0 + \beta_1 X_{ij} + \beta_2 (X_{ij} - \tau)_+,$$

where Yij represents the jth update of the L2C estimate and Xij denotes the jth quarter since the first data-upload month for the ith jurisdiction, and τ is the average time when the measure approximately stabilizes across the jurisdictions. We estimated $\theta = (\beta_0, \beta_1, \beta_2, \tau)$ in this model by our estimator θ^L, solving (6) and assuming an AR(1) correlation structure for $(Y_{i1}, \ldots, Y_{im_i})$. We obtained θ^L = (64.68, 4.66, −4.43, 2.39) with the corresponding vector of estimated standard errors (4.88, 2.03, 2.33, 0.75). This indicates that, on average, the reported L2C estimate became approximately stable within a year following the first data upload month across the jurisdictions. We note in passing that the sample size in this analysis was small, which could cause challenges for statistical inference based on methods requiring large samples.

5.2. Pima Indians diabetes study

The data set for the Pima Indian diabetes study was described in the Introduction section. In this analysis, 16 participants with biologically implausible zero values for BMI and glucose concentration levels were excluded. We were interested in exploring the effect of age, BMI and glucose on the risk of being diagnosed with T2D. We used the following logistic regression model to capture the changing effects of age and BMI on the log-odds of T2D diagnosis (observed in Figure 2), assuming one change-point for age and two change-points for BMI and adjusting for glucose concentration level:

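$$P(Y_i = 1 \mid X_i, Z_i, W_i) = h\{\beta_0 + \beta_{1x} X_i + \beta_{1z} Z_i + \beta_w W_i + \beta_{2x} (X_i - \tau_x)_+ + \beta_{21z} (Z_i - \tau_{1z})_+ + \beta_{22z} (Z_i - \tau_{2z})_+\},$$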

where Yi is the indicator variable for being diagnosed with T2D, Xi is age experiencing a change-point at τx, Zi denotes BMI with possibly two ordered change-points, τ1z and τ2z, Wi represents glucose level for the ith participant, and $h(u) = (1 + e^{-u})^{-1}$. This corresponds to the setting with one measurement for the response. We estimated $\theta = (\beta_0, \beta_{1x}, \beta_{1z}, \beta_w, \beta_{2x}, \beta_{21z}, \beta_{22z}, \tau_x, \tau_{1z}, \tau_{2z})'$ by our estimator θ^I, solving (4), and by Muggeo’s estimator, θ^M, under the independence assumption for the observations. The parameter estimates, together with their estimated standard errors and 95% Wald-based confidence intervals, are shown in Table 5. The estimates of the two change-points in BMI obtained by both methods were very close. For age, the estimate (SE) of the change-point based on our method was 49.81 (13.88), while Muggeo’s counterpart was 52.06 (2.88). A plausible explanation for the differences in these results could be that Muggeo’s method uses linear approximation to estimate the change-points and the other model parameters. Also, the variance of the change-point by Muggeo’s method is obtained as an approximation to the variance of the ratio of two random variables based on the delta method and is sensitive to the difference in slopes and the sample size.12 This method may not adequately account for the uncertainty involved in the estimate of the change-point.24 Our method, on the other hand, uses (4) and the asymptotic variance formula for θ^I in Section 3 for the estimation of the model parameters and associated variances. Another reason could be that the change-point for age estimated by both the proposed and Muggeo’s methods is around 50, which lies far from the median (29) of the distribution of the age variable, as seen in Figure S3 of the Supplementary Materials. Change-points located near the edge of the range of the threshold variable are known to be harder to estimate and may lead to wide confidence intervals in finite samples.12 The estimated change-points for BMI were around 31 and 40, not far from the median (32) of the distribution of BMI. Nonetheless, these results indicated that Pima Indian women were most likely to test positive for T2D at about their menopausal age (50 years) after adjusting for BMI and glucose. Also, the risk of developing T2D appeared to be similar for women with BMI values between 31 and 40 kg/m2, but increased more rapidly with each unit increase in BMI for women with BMI greater than 40 kg/m2, after adjusting for age and glucose.
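The fitted model can be read segment by segment, as the following sketch illustrates using the θ^I point estimates from Table 5. The code is only an illustration: the function names and the example covariate values are hypothetical, and h is taken to be the standard logistic inverse link.

```python
import numpy as np

# Point estimates for the proposed estimator, copied from Table 5.
THETA_I = {
    "b0": -15.628, "b1x": 0.069, "b1z": 0.276, "bw": 0.037,
    "b2x": -0.204, "b21z": -0.347, "b22z": 0.218,
    "tau_x": 49.811, "tau_1z": 31.60, "tau_2z": 40.105,
}

def hinge(v, tau):
    """Positive part (v - tau)_+."""
    return np.maximum(v - tau, 0.0)

def prob_t2d(age, bmi, glucose, p=THETA_I):
    """Fitted probability of T2D diagnosis under the change-point model."""
    eta = (p["b0"] + p["b1x"] * age + p["b1z"] * bmi + p["bw"] * glucose
           + p["b2x"] * hinge(age, p["tau_x"])
           + p["b21z"] * hinge(bmi, p["tau_1z"])
           + p["b22z"] * hinge(bmi, p["tau_2z"]))
    return 1.0 / (1.0 + np.exp(-eta))   # logistic inverse link (assumed)

def bmi_slopes(p=THETA_I):
    """Log-odds slope for BMI on each of its three segments."""
    s1 = p["b1z"]              # BMI below tau_1z
    s2 = s1 + p["b21z"]        # between tau_1z and tau_2z
    s3 = s2 + p["b22z"]        # above tau_2z
    return s1, s2, s3

def age_slopes(p=THETA_I):
    """Log-odds slope for age before and after tau_x."""
    return p["b1x"], p["b1x"] + p["b2x"]
```

The BMI slopes work out to about 0.276, −0.071 and 0.147 per kg/m2 on the three segments, which matches the description of a roughly flat effect between the two change-points and a renewed increase above the second. The age slope changes from about 0.069 to −0.135 per year at the estimated change-point.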

Table 5:

Application to the Pima Indians diabetes data: fitting a logistic regression change-point model for T2D diagnosis including age, BMI and glucose as covariates with age experiencing one change-point, BMI having two change-points and glucose having a linear effect (no change-point). θ^I represents the proposed estimator; θ^M denotes Muggeo’s estimator.

θ^M
θ^I
Parameter Estimate SE 95% CI Estimate SE 95% CI
β0 −15.656 1.918 (−19.53, −11.98) −15.628 1.925 (−19.40, −11.86)
β1x 0.066 0.012 (0.04, 0.09) 0.069 0.013 (0.04, 0.10)
β1z 0.281 0.064 (0.16, 0.41) 0.276 0.065 (0.15, 0.40)
βw 0.037 0.004 (0.03, 0.04) 0.037 0.004 (0.03, 0.04)
β2x −0.237 0.072 (−0.36, −0.11) −0.204 0.050 (−0.30, −0.11)
β21z −0.355 0.086 (−0.52, −0.18) −0.347 0.086 (−0.52, −0.18)
β22z 0.221 0.089 (0.04, 0.39) 0.218 0.089 (0.04, 0.39)
τx 52.060 2.879 (46.91, 56.86) 49.811 13.88 (22.61, 77.02)
τ1z 31.602 1.053 (29.42, 33.54) 31.60 1.830 (28.01, 35.19)
τ2z 40.066 2.245 (35.67, 44.63) 40.105 1.433 (37.33, 42.91)

Note: β0 is the intercept; τx is the change-point for age; τ1z and τ2z are the two change-points for BMI; β1x is the slope for age before τx; β2x is the difference in slopes for age before and after τx; β1z is the slope for BMI before τ1z; β21z is the difference in slopes between the two segments of BMI joining at τ1z; β22z is the difference in slopes between the two segments of BMI joining at τ2z; βw is the slope for glucose concentration levels.

We note that polynomial regression models or regression spline models,4,25,26 which are other popular methods for modeling nonlinearity, could also be used to capture the nonlinear associations of age and BMI with the logit transformation of the probability of T2D diagnosis in the Pima Indian diabetes study. The results of an additional analysis of the diabetes data using a logistic regression model incorporating quadratic polynomial terms in age and natural cubic spline terms in BMI with 3 degrees of freedom are presented in Table S5 of the Supplementary Materials. The resulting coefficient estimates may not be directly interpretable. Our change-point models mainly differ from the polynomial or spline models in their simplicity and ease of interpretation. Additionally, polynomial models may involve no change-points. Furthermore, regression spline (e.g. cubic spline) models may require a certain degree of smoothness for the regression function at the knots (change-points), whose locations are often assumed to be known a priori or obtained as grid points from a quantile-based or uniform grid-spacing partition of the support of the change-point variable. In contrast, the regression functions in the change-point models (1) and (2) are not smooth at the change-points. Moreover, these models can be used to make inference about the change-points together with the other model parameters. Additional differences between spline and change-point models in general are discussed in Feder.27

It is worth recalling that the confidence intervals in Tables 4 and 5 are Wald-type, using asymptotic normal approximations for the proposed and Muggeo’s estimators. Although valid in large samples, Wald-type confidence intervals may perform poorly in finite samples due to the parameter-effects curvature for some nonlinear models including those with change-points.9,28 Alternative confidence intervals for the change-points could be based on likelihood ratio or smoothed score approaches.29,30
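For reference, a Wald-type interval is simply the estimate plus or minus a normal quantile times the standard error. The short sketch below reproduces one interval from Table 5; the function name is hypothetical and 1.96 is used as the 97.5% standard normal quantile.

```python
def wald_ci(estimate, se, z=1.96):
    """Wald-type confidence interval: estimate +/- z * se."""
    half_width = z * se
    return estimate - half_width, estimate + half_width

# Change-point for age under the proposed estimator (Table 5).
lower, upper = wald_ci(49.811, 13.88)
```

Rounded to two decimals this gives (22.61, 77.02), the interval reported for τx under θ^I. The interval is symmetric by construction, which is one reason it can behave poorly when the sampling distribution of a change-point estimate is skewed in finite samples.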

6. Discussion

We have introduced generalized change-point models with at least one change-point occurring in some covariates of a generalized linear model for independent or correlated data. Moreover, we have developed methods to estimate the change-points along with the other model parameters under these change-point modeling frameworks. The proposed methods are based on estimating equations and involve no smoothing or linear approximation. They lead to $\sqrt{n}$-consistent estimators with asymptotic normal distributions and can handle situations where the existing methods may not be straightforwardly applicable. The proposed methods have been applied to investigate the trends of the reported estimates of the annual percentage of new HIV diagnoses linked to care within three months of HIV diagnosis in aggregated HIV-surveillance data from the HPTN 065 study. They have also been used to examine the associations of age, BMI and glucose with diabetes in the Pima Indian diabetes study.

To ensure identifiability of the model parameters, we assumed the existence of the change-points (β2 ≠ 0). In practice, a graphical examination of the data could be useful for determining the presence of change-points in the covariates.4 Formally testing for the existence of a change-point is a nonstandard problem because the usual asymptotic distributions of the Wald and likelihood ratio type tests do not hold.27,31 Although undertaking such a task under Models (1) and (2) was beyond the scope of the present work, it is worth mentioning that methods for testing the existence of a single change-point in one covariate of a regression model were discussed by Pastor-Barriuso et al,9 Davies,32 and Muggeo.33 Also, permutation tests for the existence of multiple change-points in a covariate of a linear model were presented in Kim et al.34

The proposed estimation methods are presented for linear, logistic and Poisson models, which are the most commonly used generalized linear models. They can also be applied to any other generalized linear model when some covariates have piecewise relationships with the response and there is an interest in estimating the change-points. The methods can also be extended to estimate change-points in the covariates of a survival model (e.g. Cox proportional hazards model) for a time-to-event outcome. However, their finite sample performance under such models may differ from that under linear, logistic or Poisson models.

Our approach for repeated/clustered data was marginal in the sense that the focus was on modeling the nonlinear relationships between the population mean response and covariates with change-points, which were assumed to be the same for all individuals. A conditional approach accounting for a possible variability in the change-points between individuals represents an interesting extension that would be based on random-effect models incorporating random change-points. For the estimation of a linear mixed effects model with random change-points, a Bayesian approach was discussed by Dominicus et al35 and a linear approximation method was presented by Muggeo et al36 under the likelihood framework. Also, a polynomial model with random change-points was discussed in Van den Hout et al.37 An extension of Model (2) to include a random intercept as well as random change-points and random slopes is given by $E(Y_{ij} \mid X_{ij}, \theta_i) = h\{\beta_{i0} + \beta_{i1} X_{ij} + \sum_{r=1}^{R} \beta_{i2r} (X_{ij} - \tau_{ir})_+\}$, where $\theta_i = (\beta_{i0}, \beta_{i1}, \beta_{i2}, \tau_i)$, $i = 1, \ldots, n$, are independent and identically distributed random vectors with mean $\theta = (\beta_0, \beta_1, \beta_2, \tau)$ and covariance matrix Ξ. Here $\beta_{i2} = (\beta_{i21}, \ldots, \beta_{i2R})$ and $\tau_i = (\tau_{i1}, \ldots, \tau_{iR})$. The extension of our estimation methods to such conditional models may not be straightforward and is worth further investigation.
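To illustrate what the conditional extension describes, the sketch below draws subject-specific parameter vectors and evaluates the resulting mean trajectories for a single change-point (R = 1) and an identity link. It is purely illustrative: the text only specifies the mean and covariance of the subject-specific vectors, so the multivariate normal distribution, the function name, the covariate distribution and all numeric settings are assumptions.

```python
import numpy as np

def simulate_random_changepoints(n, m, theta, xi, seed=0):
    """Draw theta_i = (b_i0, b_i1, b_i2, tau_i) and subject-level means.

    theta is the population mean vector and xi the 4 x 4 covariance
    matrix; normality of theta_i is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    theta_i = rng.multivariate_normal(theta, xi, size=n)      # shape (n, 4)
    x = rng.uniform(-3.0, 3.0, size=(n, m))                   # covariates
    b0, b1, b2, tau = (theta_i[:, [k]] for k in range(4))     # each (n, 1)
    mean = b0 + b1 * x + b2 * np.maximum(x - tau, 0.0)        # identity link
    return theta_i, x, mean

theta = np.array([1.0, -1.0, 2.0, 0.5])
xi = np.diag([0.1, 0.05, 0.05, 0.2])
theta_i, x, mean = simulate_random_changepoints(n=5, m=8, theta=theta, xi=xi)
```

Each row of `theta_i` gives one subject's intercept, slopes and change-point, so the kink in the mean trajectory occurs at a different covariate value for each subject, which is the feature that distinguishes this conditional model from the marginal Model (2).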

Supplementary Material

Supplemental Material

Acknowledgments

This work was partially supported by the National Institutes of Health grants MH105857 (Chen, Tapsoba, Wang and Zangeneh), HL121347 (Tapsoba and Wang), CA235122 (Wang) and UM1AI068617 (Zangeneh).

References

  • 1. Hudson DJ. Fitting segmented curves whose join points have to be estimated. Journal of the American Statistical Association. 1966; 61: 1097–1129.
  • 2. Donnell DJ, Hall HI, Gamble T, Beauchamp G, Griffin AB, Torian LV, Branson B, ElSadr WM. Use of HIV case surveillance system to design and evaluate site-randomized interventions in an HIV prevention study: HPTN 065. The Open AIDS Journal. 2012; 6: 122–130.
  • 3. Wang L, Liu X, Liang H, Carroll RJ. Estimation and variable selection for generalized additive partial linear models. The Annals of Statistics. 2011; 39: 1827–1851.
  • 4. Hastie T, Tibshirani R. Generalized additive models: some applications. Journal of the American Statistical Association. 1987; 82: 371–386.
  • 5. Wood SN. mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation. 2017; R package version 1.8-22; https://cran.r-project.org/web/packages/mgcv/mgcv.pdf.
  • 6. Chiu G, Lockhart R, Routledge R. Bent-cable regression theory and applications. Journal of the American Statistical Association. 2006; 101: 542–553.
  • 7. Das R, Banerjee M, Nan B, Zheng H. Fast estimation of regression parameters in a broken-stick model for longitudinal data. Journal of the American Statistical Association. 2016; 111: 1132–1143.
  • 8. Pastor R, Guallar E. Use of two-segmented logistic regression to estimate changepoints in epidemiologic studies. American Journal of Epidemiology. 1998; 148: 631–642.
  • 9. Pastor-Barriuso R, Guallar E, Coresh J. Transition models for change-point estimation in logistic regression. Statistics in Medicine. 2003; 22: 1141–1162.
  • 10. Fong Y, Di C, Huang Y, Gilbert PB. Model-robust inference for continuous threshold regression models. Biometrics. 2017; 73: 452–462.
  • 11. Zhou H, Liang KY. On estimating the change point in generalized linear models. Institute of Mathematical Statistics. 2008; 1: 305–320.
  • 12. Muggeo VMR. Estimating regression models with unknown break-points. Statistics in Medicine. 2003; 22: 3055–3071.
  • 13. Muggeo VMR. Segmented: an R package to fit regression models with broken-line relationships. R News. 2008; 8: 20–25.
  • 14. McCullagh P, Nelder JA. Generalized linear models. First edition, Chapman & Hall, London; 1983.
  • 15. Wu L. Mixed effects models for complex data. Boca Raton, FL: Chapman & Hall/CRC; 2009.
  • 16. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986; 73: 13–22.
  • 17. Bacon DW, Watts DG. Estimating the transition between two intersecting straight lines. Biometrika. 1971; 58: 525–534.
  • 18. Tapsoba JD, Lee SM, Wang CY. Joint modeling of survival time and longitudinal data with subject-specific changepoints in the covariates. Statistics in Medicine. 2011; 30: 232–249.
  • 19. La Cruz W, Martinez JM, Raydan M. Spectral residual method without gradient information for solving large-scale non-linear systems of equations. Mathematics of Computation. 2006; 75: 1429–1448.
  • 20. Varadhan R, Gilbert PD. BB: An R package for solving a large system of nonlinear equations and for optimizing a high-dimensional nonlinear objective function. Journal of Statistical Software. 2009; 32: 1–26.
  • 21. Lin DY, Geyer CJ. Computational methods for semiparametric linear regression with censored data. Journal of Computational and Graphical Statistics. 1992; 1: 77–90.
  • 22. Dorsey RE, Mayer WJ. Genetic algorithms for estimation problems with multiple optima, non-differentiability and other irregular features. Journal of Business & Economic Statistics. 1995; 13: 53–66.
  • 23. Laffoon BT, Hall HI, Babu S, Benbow N, Hsu LC, Hu YW. HIV infection and linkage to HIV-related medical care in large urban areas in the United States, 2009. Journal of Acquired Immune Deficiency Syndromes. 2015; 69(4): 487–492.
  • 24. Fong Y, Huang Y, Gilbert PB, Permar SR. chngpt: threshold regression model estimation and inference. BMC Bioinformatics. 2017; 18: 454.
  • 25. Stasinopoulos DM, Rigby RA. Detecting break points in generalized linear models. Computational Statistics and Data Analysis. 1992; 13: 461–471.
  • 26. Molinari N, Daures JP, Durand JF. Regression splines for threshold selection in survival data analysis. Statistics in Medicine. 2001; 20: 237–247.
  • 27. Feder P. On asymptotic distribution theory in segmented regression problems-identified cases. The Annals of Statistics. 1975; 3: 49–83.
  • 28. Seber GAF, Wild CJ. Non-linear Regression. Wiley: New York; 1989.
  • 29. Muggeo VMR. Interval estimation for the breakpoint in segmented regression: a smoothed score-based approach. Australian & New Zealand Journal of Statistics. 2017; 59: 311–322.
  • 30. Lerman PM. Segmented regression models by grid search. Journal of the Royal Statistical Society. Series C (Applied Statistics). 1980; 29: 77–84.
  • 31. Ulm K. A statistical method for assessing a threshold in epidemiological studies. Statistics in Medicine. 1991; 10: 341–349.
  • 32. Davies R. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika. 1987; 74: 33–43.
  • 33. Muggeo VMR. Testing with a nuisance parameter present only under the alternative: a score-based approach with application to segmented modelling. Journal of Statistical Computation and Simulation. 2016; 86: 3059–3067.
  • 34. Kim HJ, Fay MP, Feuer EJ, Midthune DN. Permutation tests for joinpoint regression with applications to cancer rates. Statistics in Medicine. 2000; 19(3): 335–351.
  • 35. Dominicus A, Ripatti S, Pedersen NL, Palmgren J. A random change point model for assessing variability in repeated measures of cognitive function. Statistics in Medicine. 2008; 27: 5786–5798.
  • 36. Muggeo VMR, Atkins DC, Gallop RJ. Segmented mixed models with random changepoints: a maximum likelihood approach with application to treatment for depression study. Statistical Modelling. 2014; 14: 293–313.
  • 37. van den Hout A, Muniz-Terra G, Matthews FE. Smooth random change points models. Statistics in Medicine. 2011; 30: 599–610.
