Abstract
Longitudinal changes in a population of interest are often heterogeneous and may be influenced by a combination of baseline factors. In such cases, traditional linear mixed effects models (Laird and Ware, 1982) that assume a common parametric form for the mean structure may not be applicable. We show that regression tree methodology for longitudinal data can identify and characterize longitudinally homogeneous subgroups. Most of the currently available regression tree construction methods are either limited to repeated measures scenarios or conflate the heterogeneity among subgroups with the random inter-subject variability. We propose a longitudinal classification and regression tree (LongCART) algorithm under the conditional inference framework (Hothorn, Hornik and Zeileis, 2006) that overcomes these limitations using a two-step approach. The LongCART algorithm first selects the partitioning variable via a parameter instability test and then finds the optimal split for the selected partitioning variable. Thus, at each node, the decision to split further is type I error controlled, which guards against variable selection bias, over-fitting and spurious splitting. We obtain asymptotic results for the proposed instability test and examine its finite sample behavior through simulation studies. The comparative performance of the LongCART algorithm was also evaluated empirically via simulation studies. Finally, we applied LongCART to study longitudinal changes in choline levels among HIV-positive patients.
Keywords: LongCART, Regression tree, Instability test, Longitudinal data, Mixed models, Score process, Brownian Bridge
1. Introduction
In longitudinal studies, repeated measurements of the outcome variable are often collected at irregular and possibly subject-specific time points. Parametric regression methods for analyzing such data have been developed by Laird and Ware (1982) [1] and Liang and Zeger (1986) [2], among others, and have been summarized by Diggle (2002) [3]. If the population under consideration is diverse and there exist several distinct subgroups within it, the true parameter values for the longitudinal mixed effects model may vary between these subgroups. In such cases, the traditional mixed effects models, for example the linear mixed effects model, which assume a common parametric form for the mean structure, may not be appropriate. For example, Raudenbush (2001) [4] used a longitudinal depression study to argue that it is incorrect to assume that all the individuals in a given population will experience either increasing or decreasing levels of depression. As another example, in clinical research, the influence of biomarkers, for instance pharmacogenetic biomarkers, on patients' response to a treatment is often evaluated. Thus a diverse population may be a reality in both observational and experimental studies. In such instances, an assumption of a common parametric form for the mean structure will mask important subgroup differences and lead to erroneous conclusions. In our work, we are interested in the identification of meaningful and interpretable subgroups with differential longitudinal trajectories. We propose a regression tree construction technique for longitudinal data, the LongCART algorithm, using baseline characteristics as partitioning variables.
The LongCART algorithm provides an improvement over the existing methods in one or more of the following aspects: (1) the test for the decision about further splitting at each node is type I error controlled via formal hypothesis testing and hence offers guard against variable selection bias, over-fitting and spurious splitting, (2) it is applicable when the measurements are taken at the subject-specific time points, (3) it does not merge the group differences with the random individual difference (captured by random effect components), and (4) it reduces computational time.
When the longitudinal profile in a population depends on baseline covariates, the most common strategy is to include these covariates and their interactions with the time-varying factor in the model. However, this strategy has some inherent drawbacks, including over-fitting (due to inclusion of many interaction terms that are not required), estimation bias (due to possible misspecification of the functional form), and inability to capture nonlinear effects. Because of these drawbacks, a better strategy is to identify longitudinally homogeneous subgroups, possibly characterized by baseline covariates. One of the popular techniques to find homogeneous subgroups is latent class modeling (LCM) [5]. An alternative approach is to construct a regression tree with longitudinal data [6]. Advantages of the regression tree technique over LCM are: (1) it characterizes the subgroups in terms of partitioning variables, and (2) the number of subgroups does not need to be known a priori. In general, the thrust of any tree technique is the extraction of meaningful subgroups characterized by common covariate values and a homogeneous outcome. For longitudinal data, this homogeneity can pertain to the mean and/or covariance structure [6]. In our work, we focus on finding homogeneous groups for the mean structure.
Throughout this article, we refer to the regression tree with longitudinal data as ‘longitudinal tree’. Figure 1 displays a toy example of a longitudinal tree. This longitudinal tree represents a heterogeneous population with three distinct subgroups in terms of their longitudinal profiles. These subgroups can be characterized by gender and age. Here, gender and age are baseline attributes. In each of the three subgroups, the longitudinal trajectory depends on the covariates w = [w1, …, wq]⊤, but these subgroups are heterogeneous in terms of the true coefficients (θ1, θ2 and θ3 for subgroups 1, 2 and 3, respectively) associated with their longitudinal profiles. Consider the following form of a linear longitudinal mixed effects model
    y_it = β0^x + β1^x t + w_it⊤ β^x + z_it⊤ b_i + ϵ_it    (1)
where i is the subject index and y, t and w denote the outcome variable, time and the vector of measurements of scalar covariates w1, …, wq, respectively. Let X^(1), …, X^(S) include all potential baseline attributes (with possible cut-off points G1, …, GS, respectively) that might influence the longitudinal trajectory in (1). The superscript x is added to the coefficients β0, β1 and β to reflect their possible dependence on these baseline attributes. Let θx⊤ = (β0x, β1x, βx⊤). With such a model, ‘homogeneity’ refers to the situation when the true value of θx remains the same for all the individuals in the entire population, i.e. θx = θ. When the longitudinal changes in the population of interest are heterogeneous, there exist distinct subgroups differing in terms of the coefficients’ true values, i.e. θx ≠ θ. To model the influence of X^(1), …, X^(S) on the longitudinal trajectory of y non-parametrically, we use these baseline attributes as the partitioning variables for the construction of a longitudinal tree.
Figure 1:

Sample longitudinal tree. The population consists of 3 subgroups that differ in their longitudinal profiles. The model forms are the same, but they differ in terms of the coefficients θ1, θ2 and θ3 for subgroups 1, 2 and 3, respectively. These subgroups are defined by the partitioning variables gender and age.
In constructing a longitudinal tree through binary partitioning, one way to choose a partition is by maximizing the improvement in a goodness of fit criterion. For example, Abdolell (2002) [7] chose deviance as a goodness of fit criterion. They evaluated the deviance at each split of a given partitioning variable and selected the partition with the maximum reduction in deviance for the binary splitting. In general, any exhaustive search framework like this, without a formal test of statistical hypothesis, is prone to over-fitting and variable selection bias even in the presence of a pruning mechanism (see e.g., [9]), and is also prone to spurious findings (see e.g., [8]). Furthermore, such methods are computationally expensive, as these procedures require calculation of the goodness of fit criterion at each possible cut-off point over all available partitioning variables. For example, with S partitioning variables X^(1), …, X^(S), with cut-off points G1, …, GS, respectively, the total number of goodness of fit criterion calculations is G1 + ⋯ + GS.
To avoid the problems associated with exhaustive search strategies, we propose the LongCART algorithm for construction of a regression tree under the conditional inference framework of regression tree construction suggested by Hothorn et al. (2006) [9]. In this framework, in step 1, we first identify whether any partitioning variable is associated with the heterogeneity of the response trajectory through formal statistical testing via a global “test for parameter instability”. The parameter instability test is carried out for each partitioning variable separately with an adjustment for testing multiplicity. If one or more partitioning variables are found to be significantly associated with the heterogeneity of the response trajectory, the partitioning variable with the minimum p-value is selected as the splitting variable. Once the splitting variable is chosen, in step 2, the cut-off point with the maximum improvement in the goodness of fit criterion is used for binary splitting. If no partitioning variable is found to be significant in step 1, we stop the recursion. The key idea here is that we combine a multiple testing procedure (step 1) with model selection (step 2) in order to control the type I error when deciding on splitting at each node. Such a step minimizes the selection bias in choosing the partitioning variable compared to exhaustive search-based procedures, in which partitioning variables with many unique values tend to have an advantage over partitioning variables with fewer unique values [10, 11, 12, 13].
The idea of the parameter instability test was originally proposed in the time-series literature to test for structural change (see e.g., [14, 15, 16]). The purpose of the parameter instability test in the context of a regression tree is to detect any evidence of heterogeneity of model parameters across all the cut-off points of a partitioning variable, as has been done in previous tree construction algorithms (see e.g., [9, 17, 18]). The advantage of the parameter instability test is that, for each partitioning variable, the test statistic has to be obtained only once under homogeneity [17]. Various test statistics based on the likelihood score process have been proposed for parameter instability tests. For example, Zeileis et al. (2008) [17] used the supLM test of Andrews [19] and obtained approximate p-values according to Hansen [20]. On the other hand, Hothorn et al. (2006) [9] proposed a general form of test statistic and employed a permutation-based test strategy to obtain p-values. In our work, for the parameter instability test with continuous partitioning variables, we consider the test statistic of Hjort and Koning [16], which converges to the supremum of a Brownian bridge process under the null hypothesis of homogeneity. The distribution function of the supremum of a Brownian bridge process is well established and can be expressed in finite closed form for any given accuracy level, leading to relatively easy calculation of p-values. The advantage of this approach is that it is more principled than an approximate permutation-based test, and p-values can be obtained much more easily than with the supLM test of Andrews. Unlike the aforementioned works, we derive the asymptotic properties of the instability test for continuous partitioning variables and explore its size and power through an extensive simulation study.
For categorical partitioning variables with a small number of cut-off points, the test is derived in a straightforward way by employing asymptotic normality of the score functions.
Among the tree based methods, the classification and regression tree (CART) method [21] is the most popular. Zeileis et al. (2008) [17] extended the CART methodology to the context of fitting cross-sectional generalized linear models (GLM). Binary partitioning for longitudinal data was first proposed by Segal (1992) [6]. Segal’s approach, along with the two other regression tree construction methods proposed by De’Ath (2002) [22] and Larsen and Speckman (2004) [23], is restricted to longitudinal data with a regular structure, that is, all the subjects have an equal number of repeated observations at fixed time points [24]. Zhang (1997) [25] proposed multivariate adaptive splines to analyze longitudinal data, which can be used to generate regression trees for longitudinal data. Abdolell (2002) [7] used deviance as a goodness-of-fit criterion for binary partitioning; they controlled the level of type I error via a permutation test, taking testing multiplicity into account. Sela and Simonoff (2012) [26] as well as Galimberti and Montanari (2002) [27] merged the subgroup differences with the random individual differences. Sela and Simonoff (2012) constructed the RE-EM tree through an iterative two-step process. In the first step, they obtained the random effects’ estimates, and in the second step, they constructed the regression tree ignoring the longitudinal structure according to the CART algorithm implemented in the rpart package in R. They repeated these two steps until the estimates of the random effects converged. Later, Fu and Simonoff (2015) [28] proposed to construct an unbiased RE-EM tree, replacing the CART algorithm by the conditional inference tree [9]. On the other hand, Fokkema et al. [29] proposed to construct a (G)LMM tree, replacing the CART algorithm in step 1 by the GLM tree algorithm of Zeileis et al. (2008) [17].
The GUIDE [10, 13] and MELT [30] algorithms also construct regression trees in two steps, similar to the conditional inference framework [9]; however, these two algorithms employ a chi-square test based only on the residuals’ direction for the selection of partitioning variables in step 1 and, unlike permutation-based tests, do not use the information from the full joint distribution. Furthermore, these two algorithms were primarily developed for the fixed time-point scenario, and the extension to the random subject-specific time-point scenario has been proposed via an ad-hoc adjustment, whereas random subject-specific time points fit naturally into the likelihood-based score setting of the LongCART algorithm.
The remainder of this paper is organized as follows. In Section 2 the longitudinal mixed effects models of interest are summarized. Tests for parameter instability for the continuous and categorical partitioning variable cases are discussed separately in Section 3. The algorithm for constructing longitudinal regression trees, along with the measures of improvement and a pruning technique, is discussed in Section 4. Results from the simulation studies examining the performance of the instability test are provided in Section 5.1. Simulation results comparing the LongCART algorithm with other existing tree construction algorithms and linear mixed effects models are reported in Section 5.2. An application of LongCART is illustrated on the brain metabolite data collected from chronically HIV-infected patients in Section 6. The R code for the LongCART algorithm and the simulation code used in this article are available through the webpage http://il-balds.org/software/.
2. Notation and statistical model
Let {yit, wit} be a set of measurements recorded on the ith subject (i = 1, …, N) at times t = (t1, …, tni), where y is a continuous scalar outcome and w is the vector of measurements on scalar covariates w1, …, wq. We assume that these covariates are linearly associated with y. In addition, for each individual, we observe a vector of attributes X^(1), …, X^(S) measured at baseline. We assume that X^(1), …, X^(S) include all potential baseline attributes that can influence the longitudinal trajectory of y and its association with the covariates w1, …, wq. Further, we do not assume a strict functional form for these baseline attributes’ influence. We use the variables X^(1), …, X^(S) as the candidate partitioning variables to construct a longitudinal regression tree to discover meaningful and interpretable subgroups with differential changes in y characterized by the baseline attributes.
When the longitudinal profile is homogeneous in the entire population, we can fit the following traditional linear mixed effects model for all N individuals [1]
    y_it = β0 + β1 t + w_it⊤ β + z_it⊤ b_i + ϵ_it    (2)
where ϵit ~ N(0, σ2) and bi is the vector of random effects pertaining to subject i, distributed as N(0, D). By ‘homogeneity’ we mean that the true value of θ⊤ = (β0, β1, β⊤) remains the same for all the individuals in the population. In fact, (2) is the simplified version of model (1) under homogeneity.
We follow the common assumptions made in longitudinal modeling: zit is a subset of the design vector wit; ϵit and bi are independent; ϵit and ϵi′t′ are independent whenever i ≠ i′ or t ≠ t′ or both; and bi and bi′ are independent if i ≠ i′. Here, β0 + β1 t + wit⊤β is the fixed effects part and zit⊤bi is the standard random effects term. For the ith subject, we rewrite Eq. (2) as follows
    y_i = w_i θ + z_i b_i + ϵ_i    (3)
where y_i = (y_it1, …, y_itni)⊤, w_i is the design matrix consisting of the intercept, time (t) and covariates (w), and n_i is the number of observations obtained from the ith individual. The score function for estimating θ under (3) is (see e.g., [31])
    u(y_i, θ) = w_i⊤ V_i⁻¹ e_i,
where V_i = z_i D z_i⊤ + σ2 I_{n_i} is the marginal covariance matrix of y_i and e_i = y_i − w_iθ. Further, its variance is
    Var[u(y_i, θ)] = w_i⊤ V_i⁻¹ w_i,
which we denote by J(θ) when the design is common across subjects.
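To make the score computation concrete, the following sketch (in Python rather than the paper's R implementation, and purely for illustration) evaluates u(y_i, θ) = w_i⊤V_i⁻¹e_i and w_i⊤V_i⁻¹w_i for the special case of a scalar random intercept, where V_i = σ²I + τ²11⊤ has a closed-form inverse by the Sherman–Morrison formula. The function names and the random-intercept restriction are assumptions, not part of the paper.

```python
# Illustrative sketch (not the authors' implementation): per-subject score
# u(y_i, theta) = w_i' V_i^{-1} e_i and information w_i' V_i^{-1} w_i for a
# random-intercept model, where V_i = sigma2 * I + tau2 * 11'.

def v_inverse(n, sigma2, tau2):
    """Closed-form inverse of V = sigma2*I_n + tau2*J_n (Sherman-Morrison)."""
    c = tau2 / (sigma2 * (sigma2 + n * tau2))
    return [[(1.0 / sigma2 if r == s else 0.0) - c for s in range(n)]
            for r in range(n)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def score_and_info(W, y, theta, sigma2, tau2):
    """Score u = W' V^{-1} e and J = W' V^{-1} W for one subject."""
    n, p = len(W), len(W[0])
    # residual e_i = y_i - W_i theta
    e = [y[r] - sum(W[r][k] * theta[k] for k in range(p)) for r in range(n)]
    Vinv = v_inverse(n, sigma2, tau2)
    Ve = matvec(Vinv, e)
    u = [sum(W[r][k] * Ve[r] for r in range(n)) for k in range(p)]
    cols = [[W[r][k] for r in range(n)] for k in range(p)]
    J = [[sum(cols[k][r] * vc for r, vc in enumerate(matvec(Vinv, cols[l])))
          for l in range(p)] for k in range(p)]
    return u, J
```

As a sanity check, the score vanishes at the data-generating θ when the residuals are exactly zero, and the information matrix is symmetric.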
The maximum likelihood (ML) estimate of θ obtained using all the observations from the N subjects is valid only if the entire population under consideration is homogeneous. If the entire population is not homogeneous in terms of θ, then the likelihood estimates obtained considering all the subjects together are misleading; the extent and direction of the bias in the estimate will depend on the nature and proportion of heterogeneity in the sampled individuals. Therefore, under the assumption that X^(1), …, X^(S) are the only attributes that influence the longitudinal profiles of y, it is important to decide first whether the true value of θ remains the same for all the subgroups defined by X^(1), …, X^(S). In the next section, we describe statistical tests to assess whether the true value of θ remains the same across all the values of a given partitioning variable.
3. Test for parameter instability
The purpose of the parameter instability test is to test whether the true value of θ remains the same across all distinct values of a baseline attribute (i.e. partitioning variable). Let XG be any partitioning variable with G ordered cut-off points c(1) < … < c(G), and let θ(g) be the true value of θ when XG = c(g). Assume that there are mg subjects with XG = c(g). We denote the cumulative number of subjects with XG ≤ c(g) by Mg; that is, Mg = m1 + … + mg and MG = N. We want to conduct an omnibus test,
    H0: θ(1) = … = θ(G) = θ0   versus   H1: θ(g) ≠ θ(g′) for some g ≠ g′.
Here, H0 corresponds to the situation where the parameter θ remains constant (that is, homogeneity) and H1 corresponds to parameter instability (that is, heterogeneity). The parameter instability tests utilize the following properties of the score function under H0:
A1: EH0 [u(yi, θ0)] = 0;
A2: VarH0 [u(yi, θ0)] = J(θ0) = J;
A3: √N(θ̂ − θ0) →d N(0, J⁻¹),
where θ̂ is the maximum likelihood estimate of θ and Ĵ = J(θ̂) denotes the corresponding plug-in estimate of J. We discuss the instability test separately depending on whether the partitioning variable XG is categorical or continuous.
3.1. Instability test with a categorical partitioning variable
It is straightforward to obtain a test for parameter instability using the properties A1–A3 when the partitioning variable, XG, is categorical with a small number of categories (that is, G ≪ N). Since the score functions are independent, under H0 the quantity
    Σ_{g=1}^{G} [ Σ_{i=1}^{N} u(y_i, θ̂) I(XGi = c(g)) ]⊤ (mg Ĵ)⁻¹ [ Σ_{i=1}^{N} u(y_i, θ̂) I(XGi = c(g)) ]
is asymptotically distributed as χ2 with (G − 1)p degrees of freedom, where p is the dimension of θ. Here, I(·) is the indicator function. The reduction of p degrees of freedom is due to the estimation of the p-dimensional θ from the data.
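A minimal sketch of this statistic, assuming for clarity a scalar score (p = 1) so that Ĵ reduces to a scalar variance estimate; the function name and interface are illustrative, not from the paper.

```python
from collections import defaultdict

# Hedged sketch of the categorical instability statistic of Section 3.1:
# sum over categories g of s_g^2 / (m_g * J_hat), where s_g is the sum of
# estimated score contributions of subjects with X_G = c(g). For p = 1 the
# statistic is asymptotically chi-square with (G - 1) degrees of freedom.

def categorical_instability_stat(scores, categories, j_hat):
    """scores[i]: scalar score contribution u(y_i, theta_hat);
    categories[i]: category of X_G for subject i; j_hat: variance estimate."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for u_i, g in zip(scores, categories):
        sums[g] += u_i
        counts[g] += 1
    return sum(sums[g] ** 2 / (counts[g] * j_hat) for g in sums)
```

Note that because the scores are centered at the MLE, the per-category sums are compared against zero; under homogeneity they are jointly small, inflating the statistic only when some category's scores drift systematically.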
3.2. Instability test with continuous partitioning variable
For a continuous partitioning variable, the number of cut-off points is usually large, as almost all unique values (except one of the extreme values) may represent potential cut-off points. Our proposed instability test for a continuous partitioning variable is based on the score process. Assume the subjects are ordered according to their values of XG, and define the following score process
    W_N(t; θ0) = N^{−1/2} Σ_{i=1}^{⌊Nt⌋} u(y_i, θ0),   t ∈ [0, 1],
where ⌊Nt⌋ denotes the integer part of Nt. Under H0, using the multivariate version of Donsker’s theorem and the Cramér–Wold theorem (see e.g. [32]), it can be shown that
    W_N(·; θ0) →d Z(·),
where Z(t) is the zero-mean Gaussian process with cov[Z(t), Z(s)] = min(t, s)J(θ0). Since θ0 is unknown in practice, we replace θ0 by θ̂ in the score process.
It has been shown that the above estimated score process converges to a Brownian bridge process [16]. We present this result as the following theorem; the proof is outlined in the Appendix.
Theorem 1 Define the standardized estimated score process as
    Ŵ_N(t) = Ĵ^{−1/2} N^{−1/2} Σ_{i=1}^{⌊Nt⌋} u(y_i, θ̂).
Then under H0,
    Ŵ_N(·) →d W0(·),
where W0(·) is a vector with p independent standard Brownian bridges as component processes.
Since the limiting distribution is a vector of independent Brownian bridge processes, each individual component of Ŵ_N(·) is asymptotically distributed as a standard Brownian bridge, W0(t). That is, Ŵ_N^(k)(·) →d W0(·) for k = 1, …, p. This weak convergence continues to hold for any ‘reasonable’ functional (including the supremum) of Ŵ_N(·) (see e.g. p. 509, Theorem 1 in [34]). Therefore,
    D_k := sup_{0≤t≤1} ∣Ŵ_N^(k)(t)∣ →d sup_{0≤t≤1} ∣W0(t)∣ =: D.    (4)
D has the known distribution function [32]
    P(D ≤ d) = 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} exp(−2k²d²).
Although this expression involves an infinite series, the series converges very rapidly; usually a few terms suffice for very high accuracy. This result can be used to formulate a test for instability of parameters at the α level of significance as follows: (1) calculate the value of the process Dk for each parameter k = 1, …, p and obtain the raw p-values; (2) adjust the p-values according to a chosen multiple testing procedure; (3) reject H0 if the adjusted p-value for any of the processes Dk is less than α.
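The raw p-value P(D > d) can be computed by truncating the alternating series; a short sketch (function name is illustrative):

```python
import math

# Tail probability of D = sup |W0(t)| over [0,1] for a standard Brownian
# bridge: P(D > d) = 2 * sum_{k>=1} (-1)^{k-1} exp(-2 k^2 d^2).
# The alternating series converges very rapidly, so a handful of terms
# already give high accuracy.

def sup_brownian_bridge_pvalue(d, terms=100):
    if d <= 0:
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * d * d)
            for k in range(1, terms + 1))
    return min(1.0, 2.0 * s)
```

For example, the critical value 1.3581 quoted in Table 1 for α = 5% indeed yields a tail probability of about 0.05.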
3.3. Instability test for multiple partitioning variables
In practice, we expect to have multiple partitioning variables. Let there be S partitioning variables X^(1), …, X^(S). We compute the instability test p-value for each partitioning variable separately and adjust the p-values to control the type I error rate. Let the adjusted p-values be p1, …, pS, respectively, and pmin = min{p1, …, pS}. The partitioning variable with the smallest p-value will then be chosen as the splitting variable, provided pmin is smaller than the nominal significance level. For further discussion please see Section 4.
3.4. Power under the alternative hypothesis
We consider the following form of Pitman’s local alternatives [33] in the vicinity of H0
| (5) |
where δ = (δ1, …, δp)⊤ is the vector containing the degrees of departure from the null hypothesis and h = (h1, …, hp)⊤ is the vector containing the magnitudes of departure. The operation ○ denotes point-wise multiplication, i.e., δ ○ h = (δ1h1, …, δphp)⊤.
Theorem 2 Under (5), the limiting distribution for the is a non-central chi-square distribution
where
Theorem 3 Under (5), the limiting distribution for the canonical monitoring process is as follows
where, and
Proofs of these theorems are provided in the Appendix.
4. Longitudinal Regression Tree
We describe the proposed LongCART algorithm in Section 4.1 emphasizing the use of the parameter instability test. We provide a modified Akaike Information Criterion (AICT) in Section 4.2 to be used in model comparisons. Tree pruning is described in Section 4.3.
4.1. LongCART Algorithm
When more than one partitioning variable is found to be significant at level α based on the parameter instability test, LongCART selects the partitioning variable with the smallest p-value to split the node. Similar p-value methods have been used in other tree algorithms [13, 17, 9]. The advantage of the p-value approach is that it offers unbiased partitioning variable selection when the partitioning variables are measured on different scales [9]. We propose the following algorithm to construct a regression tree for longitudinal data.
Step 1. Obtain the instability test’s p-value for each partitioning variable separately. If there are multiple partitioning variables, adjust the α level that the p-values are compared to.
Step 2. Stop if no partitioning variable is significant at level α. Otherwise, choose the partitioning variable with the smallest p-value and proceed to Step 3.
Step 3. Consider all cut-off points of the chosen partitioning variable. At each cut-off point, calculate the improvement in the goodness of fit criterion (e.g., AIC) due to splitting.
Step 4. Choose the cut-off value that provides the maximum improvement in goodness of fit criterion and use this cut-off for binary splitting.
Step 5. Follow the Steps 1-4 for each non-terminal node.
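Steps 1–4, applied recursively per Step 5, can be sketched as follows. This Python skeleton is hypothetical (the authors' implementation is in R), and `instability_test`, `split_gain` and `cutoffs` are assumed callables standing in for the instability test of Section 3 and a mixed-model goodness-of-fit computation.

```python
# Hypothetical skeleton of the LongCART recursion (Steps 1-5), not the
# authors' code. `instability_test(data, var)` returns an adjusted p-value;
# `split_gain(data, var, cut)` returns the goodness-of-fit improvement for
# a binary split at `cut`; `cutoffs(data, var)` lists candidate cut-offs.

def longcart(data, part_vars, instability_test, split_gain, cutoffs, alpha=0.05):
    # Step 1: instability p-value for each candidate partitioning variable
    pvals = {v: instability_test(data, v) for v in part_vars}
    best_var = min(pvals, key=pvals.get)
    # Step 2: stop if no variable is significant at level alpha
    if pvals[best_var] >= alpha:
        return {"leaf": True, "data": data}
    # Steps 3-4: pick the cut-off maximizing the goodness-of-fit improvement
    best_cut = max(cutoffs(data, best_var),
                   key=lambda c: split_gain(data, best_var, c))
    left = [row for row in data if row[best_var] <= best_cut]
    right = [row for row in data if row[best_var] > best_cut]
    # Step 5: recurse on each child node
    return {"leaf": False, "var": best_var, "cut": best_cut,
            "left": longcart(left, part_vars, instability_test,
                             split_gain, cutoffs, alpha),
            "right": longcart(right, part_vars, instability_test,
                              split_gain, cutoffs, alpha)}
```

The key design point mirrors the text: the recursion terminates through the type I error controlled test in Step 2, not through an exhaustive search with post-hoc pruning.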
4.2. Improvement
A measure of improvement due to the regression tree can be provided in terms of a likelihood-based criterion. For example, an Akaike Information Criterion (AIC) for a tree T can be obtained, on the log-likelihood scale, as
    AIC_T = Σ_{k=1}^{∣T∣} l_k − p∣T∣,
where ∣T∣ denotes the number of terminal nodes in T, l_k is the log-likelihood in the kth terminal node and p is the number of estimated parameters in each node. If we denote by AIC0 the AIC obtained from the traditional linear mixed effects model at the root node without including partitioning variables as covariates (that is, a common parametric form for the mean structure for the entire population), the improvement due to the regression tree can be measured as AIC_T − AIC0.
Since the overall model fitted to all the data is nested within the regression tree based model, a likelihood ratio test or test for deviance can be constructed as well to evaluate the overall significance of a given regression tree.
4.3. Pruning
The improvement in a regression tree comes at the cost of adding complexity to the model. If we summarize the complexity of a tree by its number of terminal nodes, the cost-adjusted AIC of a regression tree T can be defined as
    AIC_T(γ) = AIC_T − γ∣T∣,
where γ is the cost for each terminal node. The tree T offers improvement in terms of cost-adjusted AIC as long as AIC_T(γ) > AIC0, where AIC0 is the AIC obtained over all data points at the root node (i.e., without any tree structure). This is the case when γ < (AIC_T − AIC0)/∣T∣. In other words, the tree T remains beneficial as long as the cost per terminal node does not exceed (AIC_T − AIC0)/∣T∣. With this measure, one can choose the tree that offers the maximum cost-adjusted AIC:
    T* = argmax_T AIC_T(γ).
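A small numeric sketch of the pruning rule, under one plausible reading of the criterion (an assumption here, since the exact convention is not fully recoverable from the text): AIC is on a larger-is-better log-likelihood scale, each terminal node incurs a cost γ, and the tree is retained while its cost-adjusted AIC exceeds the root-node AIC.

```python
# Hedged sketch of the cost-adjusted AIC pruning rule of Section 4.3,
# assuming a larger-is-better AIC convention (this convention is an
# assumption, not stated explicitly in the paper).

def cost_adjusted_aic(aic_tree, n_terminal, gamma):
    """AIC_T(gamma) = AIC_T - gamma * |T|."""
    return aic_tree - gamma * n_terminal

def max_beneficial_gamma(aic_tree, aic_root, n_terminal):
    """Largest per-node cost for which the tree still beats the root model,
    i.e. the gamma solving AIC_T - gamma * |T| = AIC_0."""
    return (aic_tree - aic_root) / n_terminal
```

For example, a 5-leaf tree whose AIC exceeds the root model's by 10 remains beneficial for any per-node cost below 2.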
5. Simulation Study
We explored the performance of the instability test for continuous partitioning variables, and of the proposed LongCART algorithm as a whole, through simulation studies.
5.1. Performance of instability test with continuous partitioning variable
Let XG be a continuous partitioning variable with ordered cut-off points c(1) ≤ … ≤ c(G). We first investigated the size of the test and then evaluated the power.
5.1.1. Size of the test
In order to examine the size of the test we have considered a longitudinal model with single mean parameter. The observations for N subjects at t = 0, 1, 2, 3 were generated from the following model
    y_it = β0 + b_i + ϵ_it    (6)
with β0 = 2, bi ~ N(0, 0.5²) and ϵit ~ N(0, 0.2²). The observations for XG were generated from a uniform(0, 300) distribution. This simulation study was carried out for different sample sizes (N). For each N, 10,000 Monte Carlo samples were generated, and in each sample, the parameter instability test considering XG as a partitioning variable was carried out as described in Section 3.2.
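The data-generating step of this simulation can be sketched as follows (a Python illustration rather than the original simulation code; the function name and record layout are assumptions). Note that under model (6) the outcome does not depend on t or on XG, which is exactly what makes any detected instability a type I error.

```python
import random

# Sketch of the size-simulation data generation: y_it = b0 + b_i + eps_it
# with b0 = 2, b_i ~ N(0, 0.5^2), eps_it ~ N(0, 0.2^2), t = 0,1,2,3, and a
# partitioning variable X_G ~ Uniform(0, 300) that is unrelated to y.

def generate_sample(n_subjects, b0=2.0, sd_b=0.5, sd_e=0.2, seed=0):
    rng = random.Random(seed)
    data = []
    for i in range(n_subjects):
        b_i = rng.gauss(0.0, sd_b)           # subject-level random intercept
        x_g = rng.uniform(0.0, 300.0)        # candidate partitioning variable
        for t in range(4):                   # t = 0, 1, 2, 3
            y = b0 + b_i + rng.gauss(0.0, sd_e)
            data.append({"id": i, "t": t, "y": y, "XG": x_g})
    return data
```

Each Monte Carlo replicate would then feed such a sample to the instability test of Section 3.2 and record whether homogeneity is (wrongly) rejected.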
The observed percentiles of the test statistic, Dk, and the size of the instability test are summarized in Table 1. In addition, the critical value Dα for the test statistic at the α level of significance (based on the standard Brownian bridge process; see Eq. 4) is also provided. We can make the following observations: 1) the size of the test does not exceed the nominal level, 2) the size of the test approaches the desired significance level α as the sample size N increases, and 3) the test is under-sized for smaller sample sizes. The under-sizing for smaller sample sizes can be explained as follows. Calculation of the test statistic, Dk, involves σ2 and Vi. In practice, the true values of σ2 and Vi are unknown and we replace them by their estimates. A consistent estimator (e.g. ML- or REML-based) approaches the true value as the sample size increases, but the estimates may be biased for smaller sample sizes. To be precise, for smaller sample sizes, σ2 and Vi may be underestimated, which leads to smaller values of Dk and in turn to a smaller size of the test. The bias in the estimation of σ2 and Vi fades away as N increases, and this brings the size of the test up toward the nominal level, a trend visible in Table 1. However, the size remains smaller than the nominal level even for reasonably large N. Reduced size has also been reported for other tests based on the Brownian bridge process; for example, the Kolmogorov-Smirnov test for normality (which also uses the Brownian bridge as its limiting distribution) is conservative [35, 36, 37]. As N exceeds 500, the size of the test is close to the nominal level of significance. As a remedy for smaller sample sizes, one might consider using a more liberal α level or a small sample distribution for Dk obtained through simulation.
Table 1:
Observed size of the proposed parameter instability test for a continuous partitioning variable based on 10,000 simulations, as discussed in Section 5.1.1. Data were generated with a constant intercept as per Eq. (6). α and Dα indicate the nominal type I error level and the critical value for the test statistic Dk (based on the standard Brownian bridge process; see Eq. 4), respectively. The simulation results are summarized as (a) the observed size of the test (to be compared with α) and (b) the observed (1 − α)100th percentile of Dk (to be compared with Dα). The proposed parameter instability test appears to be conservative; however, the size of the test approaches the nominal level as N increases.
| (a) Observed size of test (%) | |||||
|---|---|---|---|---|---|
| N | |||||
| α(%) | 50 | 100 | 200 | 500 | 1000 |
| 1.25 | 0.54 | 0.56 | 0.89 | 1.02 | 0.95 |
| 1.67 | 0.75 | 0.85 | 1.10 | 1.33 | 1.29 |
| 2.50 | 1.20 | 1.46 | 1.77 | 2.04 | 1.94 |
| 5.00 | 2.78 | 3.35 | 3.48 | 4.07 | 4.19 |
| 10.00 | 5.66 | 7.14 | 7.19 | 8.37 | 8.53 |
| 20.00 | 13.05 | 14.73 | 15.83 | 16.97 | 17.14 |
| (b) Observed (1 – α)100th percentile of test statistic (Dk) | ||||||
|---|---|---|---|---|---|---|
| N | ||||||
| α(%) | Dα | 50 | 100 | 200 | 500 | 1000 |
| 1.25 | 1.5930 | 1.4760 | 1.4938 | 1.5366 | 1.5643 | 1.5504 |
| 1.67 | 1.5472 | 1.4447 | 1.4532 | 1.4891 | 1.5147 | 1.4986 |
| 2.50 | 1.4802 | 1.3722 | 1.3998 | 1.4180 | 1.4392 | 1.4412 |
| 5.00 | 1.3581 | 1.2497 | 1.2924 | 1.2934 | 1.3154 | 1.3287 |
| 10.00 | 1.2238 | 1.1236 | 1.1585 | 1.1629 | 1.1901 | 1.1857 |
| 20.00 | 1.0728 | 0.9859 | 1.0045 | 1.0194 | 1.0350 | 1.0373 |
5.1.2. Power
For this simulation, we considered a constant intercept (β0), but the slope (β1) was dependent on the continuous variable XG. The data were generated for N subjects at t = 0, 1, 2, 3 from the following model
| (7) |
We set β0 = 1 and β1 = 2. bi, ϵit and XG were generated as before in Section 5.1.1. The parameter δ indicates the degree of heterogeneity in β1: β1 is not homogeneous unless δ = 0, and a positive (negative) value of δ indicates an increase (decrease) in β1 with increasing XG.
We have two parameters in the mean structure, β0 and β1; therefore, we can construct two instability tests, one for β0 and another for β1 (see Section 3.2). The p-values were adjusted according to Hochberg’s step-up procedure [38] to control the overall type I error rate at the 5% level. We chose Hochberg’s step-up procedure because it is less conservative than the Bonferroni procedure [39]. However, in principle, any multiple comparison procedure can be applied here.
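The Hochberg step-up adjustment used here can be sketched in a few lines: with p-values sorted ascending p_(1) ≤ … ≤ p_(m), the adjusted values are q_(m) = p_(m) and q_(i) = min(q_(i+1), (m − i + 1)·p_(i)), reported back in the original order. The function name is illustrative.

```python
# Sketch of Hochberg's step-up adjusted p-values: walk from the largest
# p-value down, multiplying p_(i) by (m - i + 1) and enforcing monotonicity.

def hochberg_adjust(pvalues):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank in range(m - 1, -1, -1):       # from the largest p-value down
        i = order[rank]
        prev = min(prev, (m - rank) * pvalues[i])
        adjusted[i] = prev
    return adjusted
```

With the two tests of this section (m = 2), the smaller raw p-value is doubled (capped by the larger one), which is exactly why Hochberg is never more conservative than Bonferroni here.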
The observed power based on 10,000 simulations is displayed in Table 2. Table 2 presents the observed power of the parameter instability test associated with β0 (i.e., H0 : β0 is homogeneous), with β1 (i.e., H0 : β1 is homogeneous), and of the overall instability test (H0 : both β0 and β1 are homogeneous). As the absolute value of δ moves away from zero, the power to reject the homogeneity of β1 and the power of the overall parameter instability test increase. The power of the test is close to 80% and approaches the 90% mark when ∣δ∣ > 1.
Table 2:
Observed power (%) of parameter instability test with continuous partitioning variable obtained based on 10,000 simulations as described in Section 5.1.2. Data were generated with constant intercept (β0) and slope (β1) dependent on continuous variable XG (see Eq. (7)). δ indicates the degree of departure for β1 from homogeneity. δ = 0 indicates β1 does not depend on XG. Positive (negative) values of δ indicate increase (decrease) in β1 with increase in XG. The table represents the observed power of parameter instability test associated with β0 (i.e., H0 : β0 is homogeneous), β1 (i.e., H0 : β1 is homogeneous) and overall instability test (H0 : both β0 and β1 are homogeneous).
| N | Parameter instability test | δ = 0 | δ = .25 (−.25) | δ = .50 (−.50) | δ = .75 (−.75) | δ = 1.0 (−1.0) | δ = 1.2 (−1.2) |
|---|---|---|---|---|---|---|---|
| 50 | for β0 | 1.4 | 1.4 (1.4) | 1.6 (1.6) | 1.9 (1.9) | 2.3 (2.3) | 2.4 (2.3) |
| 50 | for β1 | 1.6 | 4.4 (4.3) | 16.9 (16.6) | 41.9 (42.0) | 70.2 (70.6) | 86.9 (87.0) |
| 50 | overall | 2.9 | 5.6 (5.5) | 17.9 (17.6) | 42.6 (42.5) | 70.5 (70.8) | 87.0 (87.1) |
| 100 | for β0 | 1.5 | 1.6 (1.6) | 2.0 (2.1) | 2.5 (2.6) | 3.0 (3.0) | 3.2 (3.2) |
| 100 | for β1 | 1.7 | 5.2 (5.3) | 18.7 (19.7) | 44.4 (46.0) | 72.9 (73.9) | 88.9 (89.0) |
| 100 | overall | 3.1 | 6.6 (6.7) | 19.8 (20.8) | 45.0 (46.6) | 73.1 (74.2) | 89.0 (89.1) |
| 200 | for β0 | 1.8 | 1.9 (1.8) | 2.2 (2.2) | 2.7 (2.7) | 3.3 (3.3) | 3.5 (3.4) |
| 200 | for β1 | 1.9 | 5.6 (5.3) | 20.7 (19.8) | 47.5 (46.8) | 75.7 (75.2) | 90.1 (89.8) |
| 200 | overall | 3.6 | 7.4 (6.8) | 21.9 (21.0) | 48.2 (47.4) | 76.0 (75.4) | 90.6 (89.9) |
| 500 | for β0 | 2.1 | 2.1 (2.2) | 2.7 (2.5) | 3.2 (3.2) | 3.6 (3.7) | 3.9 (4.0) |
| 500 | for β1 | 1.8 | 6.1 (6.0) | 21.4 (20.1) | 48.1 (48.2) | 76.6 (76.6) | 91.1 (91.1) |
| 500 | overall | 3.7 | 7.8 (7.8) | 22.8 (22.2) | 48.8 (49.1) | 77.0 (77.0) | 91.3 (91.2) |
Note that the sign of δ does not influence the power of the test. Further, the size of the test is very much in agreement with the first simulation study. As observed previously, the test is mildly conservative in the current simulation scenario: the observed size (see the power corresponding to δ = 0 in Table 2) is consistently slightly below the nominal level of α = 0.05.
5.2. Performance of regression tree for longitudinal data
In this simulation, we evaluated the performance of the LongCART algorithm against existing tree algorithms and linear mixed effects models when the population under consideration is truly heterogeneous. The following existing tree construction algorithms were considered: the MVPART algorithm [22] (using mvpart() in the mvpart package [40]), the RE-EM tree method [26] (using REEMtree() in the REEMtree package), the unbiased RE-EM tree [28] (using REEMctree(), available at http://people.stern.nyu.edu/jsimonof/unbiasedREEM/) and the GLMM tree algorithm [29] (using lmertree() in the glmertree package).
We simulated observations for N = 300 subjects coming from one of four subgroups defined by the baseline characteristics X1, X2 (continuous) and X3, with group sizes f1 = 70, f2 = 50, f3 = 50 and f4 = 130, respectively. These subgroups are displayed in the form of a tree structure in Figure 2. In the rth subgroup (r = 1, …, 4), the values of the continuous response variable y were generated at t = 0, 1, 2, 3 according to the following model:
yit = β0 + β1t + bi + ϵit    (8)
where bi ~ N(0, 4) and ϵit ~ N(0, 1). As displayed in Figure 2, the true values of β1 were set at 2.5, 3.0, 3.5 and 4.0, and the true values of β0 at 6, 5, 4 and 3, for the four subgroups, respectively. The values of X1 were set to 0 in subgroups 1 to 3 and to 1 in subgroup 4. The observations for X2 were generated from Uniform(5, 15), Uniform(5, 10), Uniform(10, 15) and Uniform(5, 15) for subgroups 1 to 4, respectively. The baseline covariate X3 takes the value 0 for subgroup 1 and 1 for subgroups 2 and 3; for subgroup 4, X3 was generated from Bernoulli(0.5). In addition, we generated two additional baseline covariates, X4 and X5, from Bernoulli(0.5) and Uniform(0, 15), respectively, for the entire population. Regression trees were constructed using X1, X2, X3, X4 and X5 as partitioning variables. Further, LongCART was fitted with the following specifications: (1) the overall significance level of the instability test was set at 5%, (2) the minimum node size for further splits was set at 40, and (3) the minimum terminal node size was set at 20.
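The design above can be sketched as a data generator (distributions as stated in the text; the subject-level record layout is our own):

```python
import numpy as np

rng = np.random.default_rng(1)

# subgroup sizes and true (beta0, beta1) from Figure 2
sizes = [70, 50, 50, 130]
betas = [(6, 2.5), (5, 3.0), (4, 3.5), (3, 4.0)]

def simulate_population():
    t = np.arange(4.0)                    # t = 0, 1, 2, 3
    subjects = []
    for r, (n, (b0, b1)) in enumerate(zip(sizes, betas), start=1):
        for _ in range(n):
            if r <= 3:
                X1 = 0
                X3 = 0 if r == 1 else 1
                lo, hi = {1: (5, 15), 2: (5, 10), 3: (10, 15)}[r]
                X2 = rng.uniform(lo, hi)
            else:
                X1, X3 = 1, int(rng.integers(0, 2))
                X2 = rng.uniform(5, 15)
            X4, X5 = int(rng.integers(0, 2)), rng.uniform(0, 15)  # noise covariates
            bi = rng.normal(0, 2)                                 # Var(b_i) = 4
            y = b0 + b1 * t + bi + rng.normal(0, 1, t.size)       # Eq. (8)
            subjects.append(dict(group=r, X1=X1, X2=X2, X3=X3,
                                 X4=X4, X5=X5, y=y))
    return subjects

pop = simulate_population()
print(len(pop))  # 300 subjects
```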
Figure 2:

True tree structure for the simulation described in Section 5.2. In the rth subgroup, fr observations were generated according to Eq. (8) with the specified β0 and β1.
The simulation results comparing LongCART with the other existing algorithms, based on 1000 simulations, are summarized in Table 3 and Figure 3. Algorithms such as MVPART, RE-EM tree and unbiased RE-EM tree generate regression trees with an estimated mean at each time point for each terminal node, but they do not provide estimates of the regression coefficients; hence, comparisons were made using the mean absolute prediction error over the fixed time points (t = 0, 1, 2, 3). In terms of the number of nodes extracted, the LongCART algorithm performed best: it extracted exactly the assumed four subgroups in 92.7% of the cases, five subgroups in 5.8% of the cases, and three subgroups in only 1.4% of the cases. In general, the MVPART and GLMM tree algorithms underestimated the number of subgroups and consequently had larger mean absolute prediction errors. On the other hand, the RE-EM tree and unbiased RE-EM tree algorithms overestimated the number of subgroups, with medians equal to 9 and 12, respectively, leading to spurious splitting.
Table 3:
Comparison of LongCART algorithm with the other tree fitting algorithms as described in Section 5.2
| Algorithm | Median nodes | 3 subgroups | 4 subgroups* | 5 subgroups | MAPE |
|---|---|---|---|---|---|
| MVPART [22] | 2 | 183 (18.3%) | 54 (5.4%) | 20 (2.0%) | 0.338 |
| RE-EM tree [26] | 9 | 0 | 0 | 0 | 0.235 |
| unbiased RE-EM tree [28] | 12 | 0 | 0 | 0 | 0.192 |
| GLMM tree [29] | 3 | 167 (16.7%) | 155 (15.5%) | 102 (10.2%) | 0.400 |
| LongCART | 4 | 14 (1.4%) | 927 (92.7%) | 58 (5.8%) | 0.206 |
* True number of nodes was 4
Simulation results are based on 1000 simulated datasets
MAPE: Mean absolute prediction error
Figure 3:
Number of tree nodes estimated by LongCART algorithm and other tree fitting algorithms as described in Section 5.2. The dotted line indicates the true number of nodes equal to 4.
For the comparison with the standard linear mixed effects models, we considered seven linear mixed models (Model 1 to Model 7). These models, along with summary simulation results, are presented in Table 4. To study the comparative performance of the LongCART algorithm, we calculated the mean absolute deviation (MAD) in β0 and β1 in the rth subgroup for each simulation as
MAD(β̂0)r = (1/fr) Σj∈Sr ∣β̂0j − β0r∣ and MAD(β̂1)r = (1/fr) Σj∈Sr ∣β̂1j − β1r∣,
where β0r and β1r are the true values of β0 and β1 in the rth subgroup, and β̂0j and β̂1j are the corresponding estimates for the jth individual obtained by applying the longitudinal tree and then fitting a mixed model in each subgroup. Sr is the set of indices of all individuals in the rth subgroup, while fr denotes its size.
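The ϕ values reported in Table 4 are these subgroup-level MADs averaged over simulations; the subgroup-level computation itself is just an average absolute deviation (the helper below is our own illustration with made-up estimates):

```python
def mad(estimates, truth):
    """Average absolute deviation of subject-level coefficient estimates
    from the subgroup's true coefficient (the phi quantities in Table 4)."""
    return sum(abs(e - truth) for e in estimates) / len(estimates)

# hypothetical subject-level slope estimates in a subgroup with true beta1 = 2.5
print(round(mad([2.4, 2.7, 2.5, 2.3], 2.5), 3))  # -> 0.125
```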
Table 4:
Comparison of LongCART algorithm with linear mixed effect models (Models 1 —7) as described in section 5.2
| Model | Predictors |
|---|---|
| Model 1 | t |
| Model 2 | t, X1, X2, X3 |
| Model 3 | t, X1, X2, X3, X1X2, X1X3, X2X3 |
| Model 4 | t, X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3 |
| Model 5 | t, X1, X2, X3, tX1, tX2, tX3 |
| Model 6 | t, X1, X2, X3, X1X2, X1X3, X2X3, tX1, tX2, tX3, tX1X2, tX1X3, tX2X3 |
| Model 7 | t, X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3, tX1, tX2, tX3, tX1X2, tX1X3, tX2X3, tX1X2X3 |
| Method | p | ϕ0 (S1) | ϕ1 (S1) | ϕ0 (S2) | ϕ1 (S2) | ϕ0 (S3) | ϕ1 (S3) | ϕ0 (S4) | ϕ1 (S4) |
|---|---|---|---|---|---|---|---|---|---|
| LongCART | 8* | 0.236 | 0.303 | 0.293 | 0.153 | 0.056 | 0.090 | 0.072 | 0.031 |
| GLMM tree | 6 | 0.892 | 0.185 | 1.113 | 0.369 | 0.443 | 0.060 | 0.553 | 0.049 |
| Model 1 | 2 | 1.456 | 0.631 | 0.332 | 0.905 | 0.902 | 0.402 | 0.098 | 0.598 |
| Model 2 | 5 | 1.358 | 0.660 | 0.281 | 0.906 | 0.902 | 0.402 | 0.098 | 0.598 |
| Model 3 | 8 | 0.424 | 0.215 | 0.648 | 0.435 | 0.198 | 0.068 | 0.319 | 0.208 |
| Model 4 | 9 | 0.274 | 0.321 | 0.324 | 0.290 | 0.071 | 0.121 | 0.118 | 0.072 |
| Model 5 | 8 | 1.358 | 0.645 | 0.283 | 0.908 | 0.902 | 0.402 | 0.098 | 0.598 |
| Model 6 | 14 | 0.284 | 1.195 | 2.104 | 1.488 | 0.059 | 1.667 | 2.919 | 2.367 |
| Model 7 | 16 | 0.284 | 0.320 | 0.323 | 6.861 | 1.985 | 1.290 | 2.235 | 2.181 |
ϕ0 = average MAD(β̂0); ϕ1 = average MAD(β̂1); p: number of parameters in the mean structure; Sr: subgroup r
The application of the LongCART algorithm shows comparatively large improvements in the estimation of the coefficients in all four subgroups. Both MAD(β̂0) and MAD(β̂1) were considerably smaller for LongCART than for Models 1 to 7. The improvement in estimation from the regression tree is attributable to its ability to extract homogeneous subgroups and then fit a mixed model separately within each group. Models 1 to 7 include either additive (Models 1 and 2) or interaction (Models 3 to 7) effects; yet these models failed to capture the complexity of the heterogeneous population, especially in the presence of a continuous partitioning variable. Model 5, which includes interaction terms between t and the partitioning variables, is probably the most commonly used model in practice; nevertheless, the LongCART algorithm offers a considerable improvement in estimation over Model 5. Models 5 and 6 provide some improvement over the regression tree in some of the subgroups; however, these improvements are comparatively rare and depend largely on how the subgroups are defined. Apart from improving estimation, the LongCART algorithm also identifies meaningful subgroups defined by the partitioning variables, which would otherwise remain unidentified.
6. Application
We applied the LongCART algorithm to study the changes in concentration of the brain metabolite choline in gray matter among HIV patients enrolled in the HIV Neuroimaging Consortium (HIVNC) study [41]. Choline concentrations were obtained via magnetic resonance spectroscopy (MRS). Choline is considered a marker of brain inflammation, and previous studies found choline concentrations elevated in all three brain regions among HIV patients [42]. We considered observations from N = 239 subjects, all within 3 years from baseline; the number of observations per subject ranged from 2 to 6, with a median of 3. We estimated an overall significant decrease of 0.077 arbitrary units (AU) per year (p-value = 0.003) in choline concentration, suggesting an overall beneficial effect of antiretroviral therapy.
For the construction of the regression tree we used baseline measurements of several clinical and demographic variables, including sex, race, education, age, current CD4 count, nadir CD4 count, duration of HIV infection, duration of antiretroviral (ARV) treatment, duration of highly active antiretroviral therapy (HAART), plasma HIV RNA count, antiretroviral CNS penetration-effectiveness (CPE) score and AIDS dementia complex (ADC) stage as partitioning variables. In each node we fit the following model separately:
yit = β0 + β1t + bi + ϵit    (9)
where yit denotes the measurement of choline concentration for the ith individual at time t (in years) and bi is the subject-specific intercept. It was assumed that bi and ϵit are independent and normally distributed with mean zero. The LongCART algorithm was applied with the following specifications: (1) the significance level for each individual instability test was set to 5%, (2) the minimum node size for further splits was set to 50, and (3) the minimum terminal node size was set to 25.
Figure 4 displays the estimated longitudinal regression tree, with estimates of β0 and β1 for each terminal node (subgroup), together with the estimated linear trajectories within each subgroup. Duration of ARV treatment (p-value = 0.004) and duration of HAART (p-value = 0.004) appear to influence the change in choline concentration over time. The improvement in deviance due to the application of the LongCART algorithm was 519 (log-likelihoods of −1427 vs. −1687, with 4 degrees of freedom). ARV treatment for more than 7.5 years not only reduced the baseline concentration of choline, but also resulted in a significant decrease of 0.094 per year (p-value = 0.015). A higher baseline choline concentration was observed among those who received ARV treatment for at most 7.5 years; however, a longer period of HAART therapy among them led to a significant decrease of 0.196 per year (p-value = 0.041) in concentration over time. We did not observe any decrease among those who received ARV treatment for at most 7.5 years and HAART therapy for less than 2.64 years.
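The quoted deviance improvement is twice the difference in maximized log-likelihoods; with the rounded log-likelihoods above the arithmetic gives 520 (the reported 519 presumably reflects unrounded values):

```python
# likelihood-ratio style deviance improvement of the tree over the root model
ll_tree, ll_root = -1427.0, -1687.0   # rounded log-likelihoods quoted in the text
deviance_improvement = 2.0 * (ll_tree - ll_root)
print(deviance_improvement)  # 520.0
```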
Figure 4:


Top panel. Longitudinal regression tree obtained via the LongCART algorithm for longitudinal change in choline concentration, as discussed in Section 6. The p-value in each node corresponds to the estimate of the slope β1. Bottom panel. Estimated linear trajectory for longitudinal change within each subgroup, obtained by fitting a mixed effects model of the form of Eq. (9). This regression tree suggests that the durations of ARV treatment and HAART are significant determinants of the longitudinal change of choline.
In summary, longer durations of both ARV treatment and HAART resulted in a reduction of choline concentration. However, the rate of reduction almost doubled (4.14% vs. 2.06% per year) when patients were on HAART compared to ARV treatment alone (see Figure 4). This suggests that both ARV treatment and HAART are effective in controlling brain inflammation by reducing choline concentration. Finally, these interpretable subgroups, along with a significant improvement in overall model fit, suggest underlying heterogeneity in the population in terms of longitudinal change in choline concentration. Thus, a traditional linear mixed effects model for the entire population is not defensible.
7. Discussion
The longitudinal profile in a population may be influenced by several baseline characteristics. This may be true both in observational studies and in clinical trials. The most common strategy for incorporating the effect of baseline variables in a traditional linear mixed effects model is to include these baseline characteristics and their interactions with the time-varying variables as covariates in the model. However, this approach has its own limitations, as discussed in Section 1. Longitudinal trees, i.e., regression trees for longitudinal data, are extremely useful for identifying heterogeneity in longitudinal trajectories in a nonparametric way. We have proposed the LongCART algorithm for the construction of longitudinal trees under the conditional inference framework proposed by Hothorn et al. (2006) [9]. The LongCART algorithm identifies the splitting variable via formal hypothesis testing, controlling the type I error at each node; hence it offers protection against variable selection bias, over-fitting and spurious splitting. Additionally, the LongCART algorithm substantially reduces computation time, as it first chooses the partitioning variable and then evaluates the goodness-of-fit criterion only at the cut-off points of the selected partitioning variable. Furthermore, the statistical tests implemented in the LongCART algorithm are based on the score process. Therefore, the scope of the LongCART algorithm can be extended to other applications, including survival data with censoring, generalized linear mixed effects models (GLMM) and multiple-response settings, as long as an expression for the score function and the Hessian matrix can be obtained (or approximated) in tractable form.
A Proofs
A.1. Proof of Theorem 1
Proof. Under H0, by applying Taylor series expansion
where An ≐ Bn means that An − Bn tends to zero in probability. In the case of linear mixed effects models this relationship is exact, since the second derivative of the score function is identically zero. Consequently,
The limit process Z0(t) is a p-dimensional mean-zero Brownian bridge process with covariance function cov[Z0(t), Z0(s)] = s(1 − t)J(θ0) for s < t. Therefore, under H0,
where the limiting process is a vector with p independent standard Brownian bridges as component processes.
A.2. Proof of Theorem 2
Proof. Using a Taylor series expansion we can write
Consequently,
| (10) |
It can be shown that
| (11) |
Proof of Theorem 2 follows from the definition of non-central chi-square distribution.
A.3. Proof of Theorem 3
Proof. Using (10) and (11),
This time, using the FCLT along with the Cramér-Wold device, we can show that
Therefore, for t ∈ [tg, tg+1),
Thus under H1,
References
- [1].Laird N, Ware J. Random-effects models for longitudinal data. Biometrics 1982; 38: 963–974. [PubMed] [Google Scholar]
- [2].Liang KY, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73(1): 13–22. [Google Scholar]
- [3].Diggle P, Heagerty P, Liang K, and Zeger S. Analysis of longitudinal data. Oxford University Press, USA: 2002; volume 25 [Google Scholar]
- [4].Raudenbush S Comparing personal trajectories and drawing causal inferences from longitudinal data. Annual review of psychology 2001; 52(1): 501–525. [DOI] [PubMed] [Google Scholar]
- [5].Muthén B, Shedden K. Finite mixture modeling with mixture outcomes using the EM algorithm. Biometrics 1999; 55(2): 463–469. [DOI] [PubMed] [Google Scholar]
- [6].Segal M Tree-structured methods for longitudinal data. Journal of the American Statistical Association 1992; 87(418): 407–418. [Google Scholar]
- [7].Abdolell M, LeBlanc M, Stephens D, Harrison R. Binary partitioning for continuous longitudinal data: categorizing a prognostic variable. Statistics in medicine 2002; 21(22): 3395–3409. [DOI] [PubMed] [Google Scholar]
- [8].Negassa A, Ciampi A, Abrahamowicz M, Shapiro S, Boivin J. Tree-structured subgroup analysis for censored survival data: validation of computationally inexpensive model selection criteria. Statistics and computing 2005; 15(3): 231–239. [Google Scholar]
- [9].Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: A conditional inference frame-work. Journal of Computational and Graphical statistics 2006; 15(3): 651–674. [Google Scholar]
- [10].Loh W Regression trees with unbiased variable selection and interaction detection. Statistica Sinica 2002; 12(2): 361–386. [Google Scholar]
- [11].Shih Y A note on split selection bias in classification trees. Computational statistics & data analysis 2004; 45(3): 457–466. [Google Scholar]
- [12].Strobl C, Boulesteix A, Augustin T. Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis 2007; 52(1): 483–501. [Google Scholar]
- [13].Loh W, Zheng W. Regression trees for longitudinal and multiresponse data. The Annals of Applied Statistics 2013; 7(1): 495–522. [Google Scholar]
- [14].Brown RL, Durbin J, Evans JM. Techniques for testing the constancy of regression relationships over time. Journal of the Royal Statistical Society. Series B 1975. January 1: 149–192. [Google Scholar]
- [15].Nyblom J Testing for the constancy of parameters over time. Journal of the American Statistical Association 1989; 84(405): 223–230. [Google Scholar]
- [16].Hjort N, Koning A. Tests for constancy of model parameters over time. Journal of Nonparametric Statistics 2002; 14(1-2): 113–132. [Google Scholar]
- [17].Zeileis A, Hothorn T, Hornik K. Model-based recursive partitioning. Journal of Computational and Graphical Statistics 2008; 17(2): 492–514. [Google Scholar]
- [18].Zeileis A, Hothorn T, Hornik K. party with the mob: Model-based Recursive Partitioning in R. R package vignette, version 1.0-19 2010; Available at https://cran.r-project.org/web/packages/party/vignettes/MOB.pdf. [Google Scholar]
- [19].Andrews D Tests for parameter instability and structural change with unknown change point. Econometrica: Journal of the Econometric Society 1993; 821–856. [Google Scholar]
- [20].Hansen B Approximate asymptotic p values for structural-change tests. Journal of Business & Economic Statistics 1997; 15(1): 60–67. [Google Scholar]
- [21].Breiman L, Friedman J, Stone C, Olshen R. Classification and regression trees. Chapman & Hall/CRC; 1984 [Google Scholar]
- [22].De’Ath G Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 2002; 83(4): 1105–1117. [Google Scholar]
- [23].Larsen D, Speckman P. Multivariate regression trees for analysis of abundance data. Biometrics 2004; 60(2): 543–549. [DOI] [PubMed] [Google Scholar]
- [24].Zhang H, Singer B. Recursive partitioning in the health sciences, Springer Verlag; 1999 [Google Scholar]
- [25].Zhang H Multivariate adaptive splines for analysis of longitudinal data. Journal of Computational and Graphical Statistics 1997; 6(1): 74–91. [Google Scholar]
- [26].Sela R and Simonoff J. RE-EM trees: a data mining approach for longitudinal and clustered data. Machine learning 2012; 86(2): 169–207. [Google Scholar]
- [27].Galimberti G, Montanari A. Regression trees for longitudinal data with time-dependent covariates. Classification, clustering and data analysis 2002; January 1: 391–398. [Google Scholar]
- [28].Fu W, Simonoff JS. Unbiased regression trees for longitudinal and clustered data. Computational Statistics & Data Analysis 2015; 88: 53–74. [Google Scholar]
- [29].Fokkema M, Smits N, Zeileis A, Hothorn T, Kelderman H. Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees (No. 2015-10). Working Papers in Economics and Statistics University of Innsbruck, Innsbruck, Austria 2015. [DOI] [PubMed] [Google Scholar]
- [30].Eo SH, Cho H. Tree-structured mixed-effects regression modeling for longitudinal data. Journal of Computational and Graphical Statistics 2014; 23(3): 740–760. [Google Scholar]
- [31].Demidenko E Mixed models: theory and applications. Wiley-Interscience; 2004; volume 493 [Google Scholar]
- [32].Billingsley P Convergence of probability measures. Wiley-Interscience; 2009 [Google Scholar]
- [33].Pitman E Notes on Non-parametric Statistical Inference. Columbia University, New York, N.Y., 1949. [Google Scholar]
- [34].Csörgő M A glimpse of the impact of Pál Erdős on probability and statistics. Canadian Journal of Statistics 2002; 30(4): 493–556. [Google Scholar]
- [35].Lilliefors H On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association 1967; 62(318): 399–402. [Google Scholar]
- [36].Massey F Jr The Kolmogorov-Smirnov test for goodness of fit. Journal of the American statistical Association 1951; 46(253): 68–78. [Google Scholar]
- [37].Birnbaum Z Numerical tabulation of the distribution of Kolmogorov’s statistic for finite sample size. Journal of the American Statistical Association 1952; 47(259): 425–441. [Google Scholar]
- [38].Hochberg Y A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988; 75(4): 800–802. [Google Scholar]
- [39].Hochberg Y, Tamhane A. Multiple comparison procedures; John Wiley & Sons; 2009 [Google Scholar]
- [40].De’Ath G mvpart: Multivariate partitioning R package version 0.1-6. 2013 [Google Scholar]
- [41].Gongvatana A, Harezlak J, Buchthal S, Daar E, Schifitto G, Campbell T, Taylor M, Singer E, Algers J, Zhong J, others. Progressive cerebral injury in the setting of chronic HIV infection and antiretroviral therapy. Journal of neurovirology 2013; 19(3): 209–218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Chang L, Ernst T, Witt M, Ames N, Gaiefsky M, Miller E. Relationships among brain metabolites, cognitive function, and viral loads in antiretroviral-naïve HIV patients. Neuroimage 2002; 17(3): 1638–1648. [DOI] [PubMed] [Google Scholar]